To mark the start of spring, I wish everyone on Viblo good health and success ^^
Hello everyone, today I would like to share how to use Playwright, a testing framework that came out a while ago... well, a short while ago, and the darling of M$. To make the article more practical, I will use the example of crawling a driver's license practice test website ( https://hoclaixehcm.vn/thi-bang-lai-xe-may-a1-online/ )
Because the article would be quite long, I will split it into 2 parts and write one post per week, which also suits my laziness.
The end result should look like this: GitHub repo
Let’s go.
What is Playwright?
Playwright is a testing framework launched around 2020 and backed by Microsoft.
Playwright was built by engineers who previously worked on Puppeteer. As a later arrival, it picks up the strengths of the earlier tools (Cypress, Selenium) and adds its own, such as auto-waiting, cross-browser support, cross-platform support,… (see more at Playwright's homepage: Link )
Playwright is normally used for testing, but here I will use it for crawling.
Create a project with Playwright
Playwright supports multiple languages. Here I use Node.js for simplicity.
    npm init playwright@latest
After you run the command, Playwright will ask a few questions, such as whether to use TypeScript or JavaScript, where to put the test code, and whether to add a GitHub Actions workflow.
If you don't want to customize anything, just press Enter each time.
The output will look like the log below.
    $ npm init playwright@latest
    npm WARN exec The following package was not found and will be installed: create-playwright@latest
    Getting started with writing end-to-end tests with Playwright:
    Initializing project in '.'
    ? Do you want to use TypeScript or JavaScript? ...
    √ Do you want to use TypeScript or JavaScript? · TypeScript
    ? Where to put your end-to-end tests? » tests
    √ Where to put your end-to-end tests? · tests
    ? Add a GitHub Actions workflow? (y/N) » false
    √ Add a GitHub Actions workflow? (y/N) · false
    √ Install Playwright browsers (can be done manually via 'npx playwright install')? (Y/n) · true
    Initializing NPM project (npm init -y)…
    Wrote to D:codepetplaywright-crawlerpackage.json:

    {
      "name": "playwright-crawler",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
      },
      "keywords": [],
      "author": "",
      "license": "ISC"
    }

    Installing Playwright Test (npm install --save-dev @playwright/test)…
    ✔ Success! Created a Playwright Test project at D:codepetplaywright-crawler

    Inside that directory, you can run several commands:

      npx playwright test
        Runs the end-to-end tests.

      npx playwright test --project=chromium
        Runs the tests only on Desktop Chrome.

      npx playwright test example
        Runs the tests in a specific file.

      npx playwright test --debug
        Runs the tests in debug mode.

      npx playwright codegen
        Auto generate tests with Codegen.

    We suggest that you begin by typing:

        npx playwright test

    And check out the following files:
      - .\tests\example.spec.ts - Example end-to-end test
      - .\tests-examples\demo-todo-app.spec.ts - Demo Todo App end-to-end tests
      - .\playwright.config.ts - Playwright Test configuration

    Visit https://playwright.dev/docs/intro for more information. ✨

    Happy hacking! 🎭
OK, nice! Now cd into the playwright-crawler folder and open it with your favorite IDE.
Analyzing the approach
Trying out the site
First I tried the site to see how it works:
Go to https://hoclaixehcm.vn/thi-bang-lai-xe-may-a1-online/ to see the list of exam sets (1 to 10 in the upper group and 1 to 20 in the lower group). My goal is to get this data as JSON.
Clicking on exam set 1, I see the following interface:
- Overview panel: a list of the questions
- A button to finish the exam and submit your answers
- Questions and answers. Some questions have pictures, some don't.
Preparation
I have prepared the XPaths to use:
- Submit button XPath: //input[@id='nopbai']
- Question XPath: //*[@id="data${i}"]/div[1]/div[2]/strong , where i is the position of the question in the exam
- Image XPath: //*[@id="data${i}"]/div[1]/img
- Answer XPath: //*[@id="data${i}"]//div[@class='cautraloi']//label
- Explanation XPath: //*[@id="data${i}"]/div[2]/div/p
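To keep these selectors in one place, they could be wrapped in a small helper that takes the question index. This is only a sketch: the names `SUBMIT_BUTTON_XPATH`, `QuestionXPaths` and `questionXPaths` are made up for illustration, and whether `i` starts at 0 or 1 on the real page is something to verify.

```typescript
// XPath of the submit button, copied from the list above.
const SUBMIT_BUTTON_XPATH = "//input[@id='nopbai']";

// The XPaths that belong to a single question.
interface QuestionXPaths {
  question: string;
  image: string;
  answers: string;
  explanation: string;
}

// Build the XPaths for question number i.
// Hypothetical helper; the numbering scheme of `i` is assumed, not confirmed.
function questionXPaths(i: number): QuestionXPaths {
  const root = `//*[@id="data${i}"]`;
  return {
    question: `${root}/div[1]/div[2]/strong`,
    image: `${root}/div[1]/img`,
    answers: `${root}//div[@class='cautraloi']//label`,
    explanation: `${root}/div[2]/div/p`,
  };
}

console.log(SUBMIT_BUTTON_XPATH);
console.log(questionXPaths(1).question);
```

Playwright treats a selector that starts with `//` as XPath, so a string such as `questionXPaths(1).question` can be passed straight to `page.locator()`.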
The plan
I will write the steps as pseudocode, so that the next post only has to fill in the blanks and is easy to follow:
    // Go to the page

    // Click on an exam set

    // Click the submit button

    // Parse the data of each question:
    // - Get the question text
    // - Classify whether it is a critical question or not (the text contains a * mark)
    // - Get the question's image; if it is not on the computer yet, download it
    // - Get the answers
    // - Mark which answer is correct
    // - Get the explanation

    // Save the question data to a JSON file
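A few of these steps are plain data handling and can be sketched before any browser code exists. The snippet below is a rough illustration, not the final implementation: the `Question`, `Answer` and `RawQuestion` shapes and all function names are assumptions, and it presumes that the question text, the answers and the index of the correct answer have already been read from the page. It covers the check for a critical question ("câu liệt", marked with `*`), marking the correct answer, and producing the JSON text.

```typescript
// Hypothetical shape of one answer in the output JSON.
interface Answer {
  text: string;
  isCorrect: boolean;
}

// Hypothetical shape of one question in the output JSON.
interface Question {
  text: string;
  isCritical: boolean;  // true when the question text contains '*'
  image: string | null; // image file name, or null when there is no picture
  answers: Answer[];
  explanation: string;
}

// Raw values as they might come out of the page (assumed input format).
interface RawQuestion {
  text: string;
  image: string | null;
  answers: string[];
  correctIndex: number;
  explanation: string;
}

// A question counts as critical when its text contains the * mark.
function isCriticalQuestion(text: string): boolean {
  return text.includes("*");
}

// Turn the raw scraped values into the structured form.
function buildQuestion(raw: RawQuestion): Question {
  return {
    text: raw.text.trim(),
    isCritical: isCriticalQuestion(raw.text),
    image: raw.image,
    answers: raw.answers.map((text, index) => ({
      text: text.trim(),
      isCorrect: index === raw.correctIndex,
    })),
    explanation: raw.explanation.trim(),
  };
}

// Serialize the questions to the JSON text that would be written to a file.
function toJson(questions: Question[]): string {
  return JSON.stringify(questions, null, 2);
}

// Example with made-up data.
const sample = buildQuestion({
  text: "* Sample question text",
  image: null,
  answers: ["First option", "Second option"],
  correctIndex: 1,
  explanation: "Sample explanation",
});
console.log(toJson([sample]));
```

Keeping this logic separate from the browser code means it can be checked with made-up data, without opening the site.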
OK, this post ends here.
In the next post, I will implement the code and explain each part in detail so that you can understand how it is done ^^
The article is excerpted from my blog: https://minhvu278.wordpress.com/2023/02/19/su-dung-playwright-de-crawl-du-lieu-phan-1/