A small project about Web Scraping

Tram Ho

Hi everyone! In this article I will show how to build a small web scraping project with Scrapy and walk through the demo code.

You can check out my previous two posts to learn more about Scrapy.

Article 1

Article 2

Choose site

For this demo I will use the website https://9to5mac.com/ and collect all of the articles published on it.

Analyze the website

Before we begin, we need to know how the site delivers its content.

There are two common cases: the server returns ready-made HTML, or the page loads its data as JSON.

I opened the browser's Network tab and found …

Oh, this site returns JSON, so this will be easy …

Initialize project

I initialized the project with the commands

In the MacSpider class, change start_urls to start_urls = ['https://9to5mac.com/?infinity=scrolling']

We configure the properties of the MacSpider class as follows:

Write code for Spider

Import the Request class to make the HTTP requests that fetch the data.

The start_requests method always runs first when the Spider starts. Because this site is an infinite-scroll page, I limit the crawl to 10 pages; in my tests the site had more than 1000 pages.

I use the POST method with the page index as formdata. Each request names a callback function, which receives the response returned by the HTTP request.

After each request, I parse the response text into JSON with:
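A minimal sketch of that parsing step, using only the standard json module. The sample text and the "posts" key are made up for illustration; only the permalink field is mentioned in this article, and the real payload has a different and larger structure. Inside a Scrapy callback the input would be response.text.

```python
import json


def parse_posts(text):
    """Turn the raw response text into a list of post dictionaries."""
    data = json.loads(text)
    # "posts" is an assumed top-level key, not the site's confirmed schema.
    return data.get("posts", [])


# Hypothetical response text standing in for response.text.
sample_text = (
    '{"posts": [{"title": "Post 1", '
    '"permalink": "https://example.com/post-1/"}]}'
)

for post in parse_posts(sample_text):
    print(post["title"], post["permalink"])
```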

The response looks roughly like the following; I checked it with Postman:

This JSON response is missing some data for each Item. The missing data can be fetched from the URL in the permalink field of the response, so I send one more request per Item to get it.

The results are returned as shown:

I then merge the missing data into the corresponding Item.

Write code for pipelines

Each returned Item goes through the pipeline that I configured in the pipelines file. The pipeline is initialized when the Spider starts and saves the Items into a JSON file.
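A pipeline of that kind might look roughly like the sketch below. The class name JsonWriterPipeline and the default file name items.json are assumptions, not the names used in my repository; open_spider, process_item and close_spider are the hook methods Scrapy calls on a pipeline.

```python
import json


class JsonWriterPipeline:
    """Collect every Item and write them all to one JSON file at the end."""

    def __init__(self, path="items.json"):
        # The default file name is an assumption for this sketch.
        self.path = path
        self.items = []

    def open_spider(self, spider):
        # Called once when the Spider starts: begin with an empty list.
        self.items = []

    def process_item(self, item, spider):
        # Called for every Item the Spider yields.
        self.items.append(dict(item))
        return item  # pass the Item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the Spider finishes: dump everything to disk.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)
```

For Scrapy to use a pipeline, its class path has to be listed in the ITEM_PIPELINES setting in settings.py.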

Run code

Run the code with the command:

We run the code and … tada … get the following result.


That was my demo of a small data scraping project; I hope you like it. If you do, please give it an upvote, thanks ❤️

Source code: https://github.com/KaynAssassin/ninetofivemac

Source : Viblo