Crawling Website Data Without an API

Tram Ho

General introduction

Today I will show you how to crawl the website of the university I studied at, the University of Science – Hue University. Along the way I will introduce Puppeteer, explain what a headless browser is, and show what Puppeteer is used for.

What is Headless Browser?

A headless browser is a web browser with no user interface. It can interact with web pages automatically, just like a regular browser, but it is driven through a command-line interface or over a network connection. You can refer to here. In short: instead of being used for browsing, a headless browser is used to scrape data, take screenshots of web pages, and so on.

What is Puppeteer?

Puppeteer is a NodeJS library that lets you control headless Chrome. You can learn more here.

Let’s Get Started

The Idea

I will crawl the credit training website of the University of Science. So what will I crawl from this page? All the subject names, together with the number of credits and the score for each subject; from this data we can see the current learning situation. Possible features to build on top of it: drawing charts, calculating the average score, or calculating how many points are still needed in the remaining subjects to qualify as an Excellent, Good, or Average student.

These features could be built into a website or a mobile app, and they would meet a real need for most students.
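As a sketch of one of those ideas, here is a credit-weighted average over the crawled subjects. The `subjects` array and its `credits`/`score` fields are hypothetical names, assuming the values are already numeric:

```javascript
// Hypothetical shape: [{ name, credits, score }, ...] with numeric fields.
function weightedAverage(subjects) {
  let totalCredits = 0;
  let weightedSum = 0;
  for (const { credits, score } of subjects) {
    totalCredits += credits;
    weightedSum += credits * score;
  }
  // Average score weighted by the number of credits per subject
  return totalCredits === 0 ? 0 : weightedSum / totalCredits;
}
```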

Installation And Setup

First, create a folder to contain the crawling project.

Initialize the application with the package.json file

Go to your application’s root directory and type npm init to initialize the application with a package.json file.

npm init

Then install the puppeteer module, which we will use to crawl the page. Before installing Puppeteer, you must first install NodeJS here.

npm install puppeteer

After installing the module, create an index.js file where we will write the crawling program.

Let’s Start Coding

In the index.js file, require the library:

const puppeteer = require('puppeteer');

Next, we will create a browser instance using the launch() method and navigate to the University of Science credit training page as follows:
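A minimal sketch of that step. The URL below is a placeholder, not the real address of the credit training page:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chrome instance
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Placeholder URL -- replace with the actual credit training page
  await page.goto('https://example.edu/credit-training', {
    waitUntil: 'networkidle2',
  });

  // ... crawling happens here ...

  await browser.close();
})();
```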

To crawl the website’s data you need to call the page.evaluate API. It is a very important API that lets us run a script inside the page and get the content it returns.
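For illustration, this sketch runs page.evaluate() against a public demo page. The callback executes in the browser context, and only serializable values can cross back to Node:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // The function passed to evaluate() runs inside the page, not in Node
  const heading = await page.evaluate(() => {
    return document.querySelector('h1').innerText;
  });

  console.log(heading); // "Example Domain" on example.com
  await browser.close();
})();
```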

Now I will open the website and look at its HTML structure, so the data is easy to analyze and extract.

The scores for each subject are laid out in a table with many tr rows and td columns. The key point is that we have to distinguish the header rows (e.g. “Semester: 1 – School year: 2019-2020”) from the data rows that contain the scores we need. The difference I found is the colspan attribute of the td elements:

  • A header row’s td elements carry a colspan attribute: <tr><td colspan="4"></td><td colspan="12"></td></tr>
  • The data rows containing the scores have td elements without the colspan attribute.

Now open the Console tab in Chrome DevTools and scrape the data by writing some JavaScript code.
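A sketch of that console script, assuming the subject name, credits, and score sit in the 2nd, 3rd, and 4th td of each data row (the real column order on the page may differ). Wrapping the logic in a function lets you call extractScores(document.querySelectorAll('table tr')) in DevTools:

```javascript
// Assumed column layout for data rows: [index, name, credits, score].
// Header rows are detected by the colspan attribute on their first td.
function extractScores(rows) {
  const results = [];
  let semester = '';
  for (const row of rows) {
    const cells = Array.from(row.querySelectorAll('td'));
    if (cells.length === 0) continue;
    if (cells[0].hasAttribute('colspan')) {
      // Header row, e.g. "Semester: 1 -- School year: 2019-2020"
      semester = cells.map((c) => c.innerText.trim()).join(' ');
    } else {
      results.push({
        semester,
        name: cells[1].innerText.trim(),
        credits: cells[2].innerText.trim(),
        score: cells[3].innerText.trim(),
      });
    }
  }
  return results;
}
```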

Run it in the console and check that the result looks correct.

Once the console test works, go back to the index.js file; here is the complete code:
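The original full listing is not reproduced here; the sketch below simply combines the steps above, again assuming a placeholder URL and the column layout described earlier:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Placeholder URL -- replace with the real credit training page
  await page.goto('https://example.edu/credit-training', {
    waitUntil: 'networkidle2',
  });

  // Runs in the browser context; only serializable data comes back
  const subjects = await page.evaluate(() => {
    const results = [];
    let semester = '';
    for (const row of document.querySelectorAll('table tr')) {
      const cells = Array.from(row.querySelectorAll('td'));
      if (cells.length === 0) continue;
      if (cells[0].hasAttribute('colspan')) {
        // Header row, e.g. "Semester: 1 -- School year: 2019-2020"
        semester = cells.map((c) => c.innerText.trim()).join(' ');
      } else {
        // Assumed column order: index, name, credits, score
        results.push({
          semester,
          name: cells[1].innerText.trim(),
          credits: cells[2].innerText.trim(),
          score: cells[3].innerText.trim(),
        });
      }
    }
    return results;
  });

  console.log(subjects);
  await browser.close();
})();
```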

Next, open a terminal in VS Code or a cmd window and run node index.js. Then watch it crawl.

Epilogue

That wraps up this tutorial on crawling data with Puppeteer and NodeJS. I hope that after this article you know and understand more about Puppeteer and can expand on it with new ideas. Your own projects do not need to be anything special; what matters is that you build them yourself as a result of your learning.


Source : Viblo