“Mining data” with Puppeteer

Tram Ho

Written by Vu Van Phong

1. Headless browser

Headless browser is a term for a browser that runs without a graphical user interface; instead, you drive it through a command-line interface or a programmatic API. A headless browser parses HTML just like a regular browser, so you can inspect page components such as layout, colors, and fonts, and even execute JavaScript. Thanks to these capabilities, headless browsers are well suited to website testing, in particular Automation Testing.

Beyond Automation Testing, a headless browser can also be used for tasks such as building a crawler to scrape data, taking screenshots of pages, and more. There are plenty of cool things we can do with a headless browser.

2. Puppeteer

Puppeteer is a Node library developed by Google that provides APIs to control Chrome or Chromium over the DevTools Protocol. By default, Puppeteer runs in headless mode, but it can also be configured to run with a full browser UI (“headful” mode). Most things you can do manually in the browser can also be done with Puppeteer.
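As a quick illustration (my own minimal sketch, not code from the post), the snippet below launches Chromium headlessly, opens a page, and saves a screenshot; passing headless: false instead would open a visible browser window:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch Chromium without a UI; set headless: false to watch it run.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();
```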

As a framework for implementing Automation Tests, Puppeteer still has many limitations compared to Selenium or WebdriverIO, since it focuses on the Chrome browser and does not support a wide range of platforms and browsers. However, for building small tools it is a very good fit: it is simple, easy to install, and can run in headless mode without a UI, which makes it quite fast.


3. Create a crawler with Puppeteer

Problem:

Recently, while playing around with a few things, I needed Japanese grammar data. The Mazzi Dictionary site has a lot of the data I want, so the job is to build a crawler that retrieves that data and saves it to my database.
Previously, when I needed to build a crawler, I used libraries like scrapy and beautifulsoup to crawl static pages. In my current case, however, the data is rendered via JavaScript, so a conventional crawler is no longer feasible. One solution is Scrapy + Splash, but Splash scripts are written in Lua, so I switched to another approach, Puppeteer, because of its simplicity.

Problem solving:

This example uses two libraries, mongoose and puppeteer, both of which are easily installed via npm:

  • npm install mongoose
  • npm install puppeteer

First we create a model.js file that defines how data is saved to the database:
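The post shows model.js as an image, so here is a minimal sketch of what such a model could look like, assuming a grammar entry stores a pattern, its meaning, and example sentences (the schema fields are my assumption, not from the source):

```javascript
// model.js - a minimal Mongoose model for one grammar entry (fields assumed)
const mongoose = require('mongoose');

const grammarSchema = new mongoose.Schema({
  pattern: String,    // the grammar pattern itself
  meaning: String,    // the explanation shown in the popup
  examples: [String], // example sentences, if the popup provides any
});

module.exports = mongoose.model('Grammar', grammarSchema);
```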

Next, create a crawler.js file that defines the crawler. The crawler must perform the following steps (a skeleton sketch follows the list):

  1. From the homepage, switch to the Ngữ pháp (Grammar) tab.
  2. The Grammar tab is paginated, with 12 grammar patterns per page. Click each pattern in turn and collect the data from the popup that appears.
  3. After collecting all the data on one page, move to the next page.
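The original post embeds the actual crawler.js as an image, so the skeleton below is my own sketch of those three steps. The URL and the selectors (#grammar-tab, .grammar-item, .pagination .next) are assumptions about the site's markup, not taken from the source:

```javascript
// crawler.js - skeleton of the three steps above (URL and selectors assumed)
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Step 1: open the homepage and switch to the Ngữ pháp (Grammar) tab.
  await page.goto('https://mazii.net', { waitUntil: 'networkidle2' }); // assumed URL
  await page.click('#grammar-tab'); // hypothetical selector

  let hasNextPage = true;
  while (hasNextPage) {
    // Step 2: click each of the 12 grammar patterns and read its popup.
    const items = await page.$$('.grammar-item'); // hypothetical selector
    for (const item of items) {
      const data = await getData(page, item); // defined below
      await saveGrammar(data);                // insert function, defined below
    }

    // Step 3: go to the next page if one exists.
    const next = await page.$('.pagination .next'); // hypothetical selector
    if (next) {
      await next.click();
      await page.waitForTimeout(1000); // crude wait for the page to re-render
    } else {
      hasNextPage = false;
    }
  }

  await browser.close();
})();
```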

Visit the Mazzi Dictionary page and collect sample grammar data:

(Screenshots in the original post: the paginated grammar list and a sample grammar pattern.)

Next we define the getData() function. The operation is: click a grammar pattern -> read the data from the popup -> click to close the popup (a sketch follows):
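The original getData() is also shown as an image; this is a sketch of that click -> read -> close flow, with assumed selectors for the popup and its close button:

```javascript
// getData - click one grammar pattern, read its popup, then close it
// (the .popup-* selectors are assumptions about the site's markup)
async function getData(page, item) {
  await item.click();
  await page.waitForSelector('.popup-content'); // wait until the popup renders

  // Pull the pattern and its explanation out of the popup.
  const data = await page.evaluate(() => ({
    pattern: document.querySelector('.popup-title').innerText,
    meaning: document.querySelector('.popup-content').innerText,
  }));

  await page.click('.popup-close'); // close the popup before the next click
  return data;
}
```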

Write a function that inserts data into the database:
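A sketch of that insert function, reusing the hypothetical Grammar model from model.js above (the connection string is also an assumption):

```javascript
// save one crawled entry to MongoDB through the model defined earlier
const mongoose = require('mongoose');
const Grammar = require('./model');

mongoose.connect('mongodb://localhost:27017/grammar_db'); // assumed URI

async function saveGrammar(data) {
  await Grammar.create(data); // data = { pattern, meaning, ... }
}

module.exports = saveGrammar;
```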

And here are the results (shown as a screenshot in the original post). Thanks to everyone who followed the post!

Link source code: Demo


Source: Viblo