Practice crawling with the Scrapy framework

Introduction

Hi everyone, I have been learning a bit about Scrapy recently, so I want to write a few lines about what I have picked up and walk through a small demo.

Suppose you want to import some Macbooks to sell at a good margin, and you need to research which other shops sell them and at what prices. Normally you would have to visit each shop's website and click through every single product, which is both tiring on the eyes and time-consuming. I would much rather see this data as proper statistics and reports. So I figured Scrapy could crawl the Macbook data for me and export it as JSON, CSV, and so on, which is far easier to work with for further statistics and reporting.

Now let's get to the demo: I will crawl basic data about the Macbooks sold on thegioididong.

1. Create a Scrapy project

First, since Scrapy is a Python framework, we need to install Python and then Scrapy. Follow the steps on Scrapy's homepage to carry out the installation.
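With Python available, installing Scrapy is usually a single pip command:

```bash
pip install scrapy
```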

Next, I will create a Scrapy project called tutorial with the following command:
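```bash
scrapy startproject tutorial
```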

The tutorial project we just created has the following structure:
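```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where the spiders live
            __init__.py
```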

2. Create a spider

After creating the project, we now need to create a spider to do the crawling, using the following command:
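```bash
scrapy genspider macbook_tgdd thegioididong.com
```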

The above command created a spider class MacbookTgddSpider in the spiders directory, as follows:
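```python
import scrapy


class MacbookTgddSpider(scrapy.Spider):
    name = 'macbook_tgdd'
    allowed_domains = ['thegioididong.com']
    start_urls = ['http://thegioididong.com/']

    def parse(self, response):
        pass
```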

As you can see:

  • name : the spider is named macbook_tgdd , and this is the name you will pass to the crawl command later.
  • start_urls : the starting addresses for the spider; this can be a list of URLs belonging to the domains in allowed_domains . In this demo I will start at the Macbook listing link.
  • parse() : the function where I will write the code that steers the spider and extracts the data I want from the URL above.

3. Select the data you want to crawl

First, we need to decide in advance what data we want. Since I want a survey of the Macbooks sold on thegioididong, I will crawl each Macbook's product name, original price, sale price, and average rating. Once you know which items you want to crawl, the next step is to define them in the items.py file as follows:
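Here is the item definition, one scrapy.Field per value. Feel free to rename the first two fields; price_sale and rate_average are the names you will see again in the output later:

```python
import scrapy


class DemoScrapyItem(scrapy.Item):
    # One field per value we want to crawl for each Macbook
    name = scrapy.Field()            # product name
    price_original = scrapy.Field()  # original (list) price
    price_sale = scrapy.Field()      # discounted sale price
    rate_average = scrapy.Field()    # average rating
```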

Now let's get to the main part: to crawl exactly the items you need, you have to select the right elements in the DOM. You can use CSS selectors or XPath; each has its own pros and cons, so read up on both and pick whichever suits you.

In this demo I will use CSS selectors to pick out the items I want. Open the browser's inspector and copy the selector of each item, as follows:
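You can sanity-check a copied selector in the Scrapy shell before wiring it into the spider. The listing URL and the selector below are illustrative; substitute the ones from your own inspector:

```
$ scrapy shell 'https://www.thegioididong.com/laptop-apple-macbook'
>>> # paste the selector copied from the inspector (this one is a placeholder)
>>> response.css('ul.listproduct li.item a::attr(href)').getall()
```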

Repeating this for each remaining item, after a bit of coding, my spiders/macbook_tgdd.py file looks like this:
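Here is a sketch of the finished spider. The listing URL and every CSS selector are stand-ins for the ones copied from the inspector at the time of writing, so treat them as placeholders:

```python
import scrapy

from tutorial.items import DemoScrapyItem


class MacbookTgddSpider(scrapy.Spider):
    name = 'macbook_tgdd'
    allowed_domains = ['thegioididong.com']
    # Macbook listing page (URL as of the time of writing)
    start_urls = ['https://www.thegioididong.com/laptop-apple-macbook']

    def parse(self, response):
        # Each <li> in the product list holds one Macbook; follow its link
        for href in response.css('ul.listproduct li.item a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_macbook)

    def parse_macbook(self, response):
        # Fill in the item defined in items.py; selectors are placeholders
        item = DemoScrapyItem()
        item['name'] = response.css('h1::text').get()
        item['price_original'] = response.css('.box-price-old::text').get()
        item['price_sale'] = response.css('.box-price-present::text').get()
        item['rate_average'] = response.css('.point-average::text').get()
        yield item
```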

I will explain a little bit of the code above:

  • parse() : the data visible on the Macbook listing page may not be enough for what we want to crawl, so we have to visit each product's own page. Each li tag in the listing contains one product; based on each product's href , I send a scrapy.Request to that product's URL with parse_macbook() as the callback.
  • parse_macbook() : in this function I crawl the items defined in DemoScrapyItem() . Look at the page carefully and inspect it to select each item correctly in the DOM: for example, when the shop is out of stock the selector for the product price changes, and the selector for the sale price is different for online purchases. So think before you copy a selector; copying blindly may not get you the correct data.

4. Run the crawl

Now we can pull the Macbook data from thegioididong. Scrapy supports exporting data to different formats such as JSON, CSV and XML; since I am used to reading JSON, I will output JSON here:
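```bash
scrapy crawl macbook_tgdd -o macbook.json
```

The -o flag names the output feed file, and the file extension determines the export format.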

As a result, I got a 403 error, because websites often have mechanisms to block bots from crawling their data. We can handle this by adding a User-Agent to the DEFAULT_REQUEST_HEADERS block that Scrapy leaves commented out in the settings.py file, and by configuring UTF-8 export encoding to avoid garbled characters:
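The two relevant pieces of settings.py end up looking like this; any common browser User-Agent string will do, the one below is just an example:

```python
# settings.py

# Uncomment DEFAULT_REQUEST_HEADERS and add a User-Agent so requests
# look like they come from a regular browser (UA string is an example)
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
}

# Export feeds as UTF-8 so Vietnamese characters come through intact
FEED_EXPORT_ENCODING = 'utf-8'
```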

OK, now run the crawl command again and see what we get:
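If everything works, macbook.json contains one record per Macbook, shaped roughly like this (the values here are made up for illustration, not real scraped data):

```json
[
  {
    "name": "MacBook Air M1 2020",
    "price_original": "28.990.000₫",
    "price_sale": "25.490.000₫",
    "rate_average": "4.8"
  }
]
```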

Do a quick manual check against the thegioididong site to confirm the crawl is correct. After checking, a few records have a null price_sale (that Macbook is not currently discounted) or a null rate_average (that Macbook has no ratings yet); everything else matches 100%. Success!

Conclusion

So I have crawled some information that helps me research the market for my Macbook business. This is just a basic demo, though, and the number of Macbooks sold on thegioididong is still small, so manual research would honestly be faster. But if there were many more products, paginated across tens or hundreds of pages, your eyes would be seeing stars.

One small note: this post reflects the site only at the time of writing. Keep an eye on interface updates from thegioididong to see whether they have changed the HTML, and adjust the selectors accordingly; otherwise, outdated selectors will no longer crawl the data correctly.

Thank you for taking the time to read this article. If anything is missing, please leave a comment below so I can learn and improve.

Source: Viblo