Using Playwright to crawl data – Part 1

Tram Ho

To mark the beginning of spring, I wish all Viblo readers good health and success ^^

Hello everyone, today I would like to share how to use Playwright, a testing framework released not so long ago and a favorite child of Microsoft. To make the article more practical, I will use the example of crawling a driver's license practice-test website ( https://hoclaixehcm.vn/thi-bang-lai-xe-may-a1-online/ ).

Because the article would be quite long, I will split it into 2 parts and write one post per week, which also suits my laziness.

The end result should look like this: GitHub repo

Let’s go.

What is Playwright?

Playwright is a testing framework launched around 2020 and backed by Microsoft.

Playwright was developed by engineers who previously worked on Puppeteer. As a later arrival, it builds on the strengths of earlier tools (Cypress, Selenium) and adds strong features of its own, such as auto-waiting, cross-browser support, and cross-platform support (see Playwright's homepage for more).

Playwright is usually used for testing, but here I will use it for crawling.

Create a project with Playwright

Playwright supports multiple languages. Here I use Node.js for simplicity.

After you type the command, Playwright asks for some information, such as whether to use TypeScript or JavaScript, where to put the test code, and whether to add a GitHub Actions workflow.

If you don’t want to customize anything, just press enter.

It will look like the log below.

OK, nice! Now cd into the playwright-crawler folder and open it with your favorite IDE.

Analyzing the approach

Try the app

First, I tried the application to see how it works:

Go to https://hoclaixehcm.vn/thi-bang-lai-xe-may-a1-online/ to see the list of exams (numbered 1 to 10 in the upper group and 1 to 20 in the lower group). My goal is to get this data as JSON.

Click on exam 1, and the interface shows the following:

  • Overview: a list of the questions
  • An end-exam button to submit the test
  • The questions and answers. Some questions have pictures, some do not.

Try clicking submit, and a popup appears asking to confirm the submission (note this detail; I will say more about handling page events in the next post).

After submitting, an explanation box appears in the interface.

Try inspecting the DOM to see what is going on.

Based on these observations, I open an exam and submit it right away to get the explanations, without answering any of the questions. What I see is that the questions and answers are all already in the DOM. They are simply hidden or shown depending on the selected question, so the crawler can read the DOM to parse the questions and answers.
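Since each question, its answers, and its explanation are all available in the DOM, one exam could be collected into a JSON document along the lines below. This is only a sketch of a possible target shape: the function name toExamJson and the field names (exam, questions, question, image, answers, explanation) are placeholders chosen for illustration, and the values are dummy text, not data from the site.

```javascript
// Hypothetical shape for one crawled exam; all field names are illustrative.
function toExamJson(examNumber, questions) {
  return JSON.stringify({ exam: examNumber, questions }, null, 2);
}

// Dummy values only, to show the structure.
const sample = [
  {
    question: "<question text>",
    image: null, // some questions have no picture
    answers: ["<answer 1>", "<answer 2>"],
    explanation: "<explanation shown after submitting>",
  },
];

console.log(toExamJson(1, sample));
```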

Preparation

I have prepared the XPaths to use:

  • Submit button XPath: //input[@id='nopbai']
  • Question XPath: //*[@id="data${i}"]/div[1]/div[2]/strong , where i is the order number of the question
  • Image XPath: //*[@id="data${i}"]/div[1]/img
  • Answer XPath: //*[@id="data${i}"]//div[@class='cautraloi']//label
  • Explanation XPath: //*[@id="data${i}"]/div[2]/div/p
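The ${i} placeholder reads like a JavaScript template literal, so the per-question XPaths can be generated by a small helper. The sketch below is one way to do that. The names SUBMIT_XPATH and questionXPaths and the returned keys are made up for illustration, and whether the numbering starts at 1 is an assumption to check against the real page.

```javascript
// XPath of the submit button, copied from the list above.
const SUBMIT_XPATH = "//input[@id='nopbai']";

// Builds the XPaths for question number i.
// Hypothetical helper: the name and the returned keys are illustrative.
function questionXPaths(i) {
  const root = `//*[@id="data${i}"]`;
  return {
    question: `${root}/div[1]/div[2]/strong`,
    image: `${root}/div[1]/img`,
    answers: `${root}//div[@class='cautraloi']//label`,
    explanation: `${root}/div[2]/div/p`,
  };
}

// Assumption: the first question is number 1.
console.log(questionXPaths(1).question);
// prints //*[@id="data1"]/div[1]/div[2]/strong
```

Keeping the shared prefix in one place (root) means only one line changes if the site renames its data containers.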

Approach

I will write the flow as pseudocode, so that the next post can fill in the blanks and you can follow along easily.
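As a rough illustration of how such a flow could be laid out (a sketch under assumptions, not the author's actual code), the outline below opens an exam, submits it, and reads every question from the DOM. Only the XPaths come from the article. The function names, the record fields, and the exam.json file name are hypothetical. It also assumes that the confirmation popup is a native browser dialog, that questions are numbered from 1, and that the caller supplies the exam URL and the number of questions.

```javascript
const fs = require("node:fs/promises");

const SUBMIT_XPATH = "//input[@id='nopbai']";

// Assumption: question containers are numbered data1, data2, ... (1-based).
const xpaths = (i) => ({
  question: `//*[@id="data${i}"]/div[1]/div[2]/strong`,
  image: `//*[@id="data${i}"]/div[1]/img`,
  answers: `//*[@id="data${i}"]//div[@class='cautraloi']//label`,
  explanation: `//*[@id="data${i}"]/div[2]/div/p`,
});

// Reads one question. Selectors starting with // are treated as XPath by
// Playwright, and textContent() also returns the text of hidden elements.
async function readQuestion(page, i) {
  const x = xpaths(i);
  const image = page.locator(x.image);
  const text = await page.locator(x.question).first().textContent();
  return {
    index: i,
    question: (text ?? "").trim(),
    // Not every question has a picture, so check before reading src.
    image: (await image.count()) > 0 ? await image.first().getAttribute("src") : null,
    answers: (await page.locator(x.answers).allTextContents()).map((t) => t.trim()),
    explanation: (await page.locator(x.explanation).allTextContents()).join("\n").trim(),
  };
}

async function crawlExam(page, examUrl, questionCount) {
  await page.goto(examUrl);
  // Assumption: the popup is a native confirm() dialog. Playwright dismisses
  // dialogs when nothing listens for them, so accept it explicitly.
  page.once("dialog", (dialog) => dialog.accept());
  await page.locator(SUBMIT_XPATH).click();

  const questions = [];
  for (let i = 1; i <= questionCount; i++) {
    questions.push(await readQuestion(page, i));
  }
  return questions;
}

async function main(examUrl, questionCount) {
  // Loaded here so the helpers above can be used without a browser installed.
  const { chromium } = require("@playwright/test");
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    const questions = await crawlExam(page, examUrl, questionCount);
    await fs.writeFile("exam.json", JSON.stringify(questions, null, 2));
  } finally {
    await browser.close();
  }
}

// Runs only when a URL is given: node crawl.js <exam-url> <question-count>
if (process.argv[2]) {
  main(process.argv[2], Number(process.argv[3])).catch(console.error);
}
```

Passing the page into crawlExam, instead of creating it inside, keeps the browser setup separate from the scraping logic, so the same function could later be called from a Playwright test as well.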

Ok, this post ends here.

In the next post, I will implement the code and explain each part in detail so that you can understand how it works ^^

The article is excerpted from my blog: https://minhvu278.wordpress.com/2023/02/19/su-dung-playwright-de-crawl-du-lieu-phan-1/


Source : Viblo