Introduction
Hi everyone, I have been learning a bit about Scrapy recently, so I want to write a few lines about what I have learned and walk through a small demo.
Suppose you recently wanted to import MacBooks to resell at a bargain, and now you want to research which other shops sell them and at what prices. You would have to visit each shop's website and look at every product manually. No thanks. I want to see this data as proper statistics and reports instead of clicking through each product by hand, which is both eye-straining and time-consuming. That made me think of using Scrapy: it can crawl the MacBook data and export it as JSON, CSV, and so on, which is far more convenient to manipulate for statistics and reports.
Now let me learn it and perform a small demo. In this demo I will crawl the basic data of the MacBooks on thegioididong.
1. Create a Scrapy project
First, since Scrapy is a Python framework, we need to install Python and then Scrapy. Follow the steps on Scrapy's homepage to proceed with the installation.
Next, I will create a Scrapy project called demo_scrapy with the following command:
```shell
scrapy startproject demo_scrapy
```
The demo_scrapy project we just created has the following structure:
```
demo_scrapy/
    scrapy.cfg            # deploy configuration file
    demo_scrapy/          # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
```
2. Create a spider
After creating the project, we now need to create a spider to crawl with, using the following command:
```shell
scrapy genspider macbook_tgdd www.thegioididong.com/laptop-apple-macbook
```
The above command creates a spider called MacbookTgddSpider in the spiders directory, as follows:
```python
import scrapy


class MacbookTgddSpider(scrapy.Spider):
    name = 'macbook_tgdd'
    allowed_domains = ['www.thegioididong.com']
    start_urls = ['https://www.thegioididong.com/laptop-apple-macbook/']

    def parse(self, response):
        pass
```
As you can see:

- `name`: this spider is named `macbook_tgdd`, and that name is what you pass on the command line when running the crawl.
- `start_urls`: the starting addresses for the spider, which can be a list of URLs belonging to the domains in `allowed_domains`. In this demo I start at the MacBook listing link.
- `parse()`: the function where I will write the code that controls how the spider crawls the data I want from the URL above.
3. Select the data you want to crawl
First, we need to define in advance what data we want. Here I want to survey the MacBooks sold on thegioididong, so I will crawl each MacBook's product name, original price, sale price, and average rating. After deciding which items to crawl, the next step is to define those fields in the items.py file as follows:
```python
import scrapy


class DemoScrapyItem(scrapy.Item):
    product_name = scrapy.Field()
    price_sale = scrapy.Field()
    price = scrapy.Field()
    rate_average = scrapy.Field()
```
Now for the main part: to crawl exactly the items you need, you have to select the right elements in the DOM. You can use CSS selectors or XPath; each has its own advantages and disadvantages, so look them up and choose whichever suits you.
In this demo I will use CSS selectors to select the items I want. Turn on inspect and copy the selector of each item as follows:
Repeating this for each item in turn, after a while of coding, my spiders/macbook_tgdd.py file looks as follows:
```python
import scrapy

from demo_scrapy.items import DemoScrapyItem


class MacbookTgddSpider(scrapy.Spider):
    name = 'macbook_tgdd'
    allowed_domains = ['www.thegioididong.com']
    start_urls = ['https://www.thegioididong.com/laptop-apple-macbook/']

    def parse(self, response):
        # Request each product in the MacBook list based on its href;
        # for every product found, parse_macbook is called
        for item_url in response.css("li.item > a ::attr(href)").extract():
            yield scrapy.Request(response.urljoin(item_url),
                                 callback=self.parse_macbook)

        # If there is a next page, keep crawling
        next_page = response.css("li.next > a ::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse)

    def parse_macbook(self, response):
        item = DemoScrapyItem()
        item['product_name'] = response.css(
            'div.rowtop > h1 ::text').extract_first()  # MacBook name

        # Check whether the product is out of stock or not
        out_of_stock = response.css('span.productstatus ::text').extract_first()
        if out_of_stock:
            item['price'] = response.css(
                'strong.pricesell ::text').extract_first()
        else:
            item['price'] = response.css(
                'aside.price_sale > div.area_price.notapply > strong ::text').extract_first()

        # Check whether there is a discount when buying online or not
        discount_online = response.css('div.box-online.notapply').extract_first()
        if discount_online:
            item['price_sale'] = response.css(
                'aside.price_sale > div.box-online.notapply > div > strong ::text').extract_first()
        else:
            item['price_sale'] = response.css(
                'span.hisprice ::text').extract_first()

        item['rate_average'] = response.css(
            'div.toprt > div.crt > div::attr(data-gpa)').extract_first()
        yield item
```
Let me explain the code above a little:

- `parse()`: because the data shown on the MacBook listing page may not be enough for what I want to crawl, the spider needs to visit each product link inside the `li` tags of that list (each `li` holds one product). Based on each product's href, I send a `scrapy.Request` to that product's URL with `parse_macbook()` as the callback. If there is a next page, `parse()` requests it and keeps crawling.
- `parse_macbook()`: in this function I crawl the fields defined in `DemoScrapyItem()`. Look carefully at the website and inspect it to select each item correctly in the DOM: for example, when the shop is out of stock the selector for the product price is different, and the selector for the sale price is different when there is an online-purchase discount. So consider each selector carefully when copying it; don't copy blindly, or you won't get the correct data.
4. Run the crawl
Now we can pull the MacBook data from thegioididong. Scrapy supports exporting data to different formats such as JSON, CSV, and XML. Since I'm used to reading JSON, I will output JSON here:
```shell
scrapy crawl macbook_tgdd -o macbook_tgdd.json
As a result, I got a 403 error, because websites often have mechanisms to block bots from crawling their data. We can handle this error by configuring a `User-Agent` in the `DEFAULT_REQUEST_HEADERS` block that Scrapy leaves commented out in the `settings.py` file, and also configuring UTF-8 export to avoid garbled characters:
```python
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
}

FEED_EXPORT_ENCODING = 'utf-8'
```
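To see why the encoding setting matters: without it, JSON output typically escapes non-ASCII characters, so Vietnamese text becomes hard to read. The stdlib `json` module shows the same effect (the sample string below is made up; `ensure_ascii=False` plays the role that `FEED_EXPORT_ENCODING = 'utf-8'` plays in Scrapy's feed export):

```python
import json

s = 'Màu xám'  # a made-up Vietnamese product attribute ("gray color")

print(json.dumps(s))                      # non-ASCII escaped, hard to read
print(json.dumps(s, ensure_ascii=False))  # kept as readable UTF-8 text
```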
Ok, now run the crawl command above again and see what we get:
Check the output manually against the thegioididong site to see whether I crawled correctly. After checking, some `price_sale` values are null because those MacBooks have no sale price, and where `rate_average` is null the MacBook simply has no ratings yet; the rest match 100%. Success!
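With the JSON file in hand, the "statistics and reports" goal from the introduction can start with a few lines of plain Python. Here is a sketch using made-up records shaped like the spider's output; in practice you would `json.load` the exported `macbook_tgdd.json` instead of hard-coding the list:

```python
import re

# Made-up records shaped like the spider's output
records = [
    {"product_name": "MacBook Air M1", "price": "28.990.000₫",
     "price_sale": None, "rate_average": "4.8"},
    {"product_name": "MacBook Pro 14", "price": "52.990.000₫",
     "price_sale": "49.990.000₫", "rate_average": None},
]


def to_vnd(price):
    """Turn a price string like '28.990.000₫' into an integer number of dong."""
    if not price:
        return None
    digits = re.sub(r'\D', '', price)
    return int(digits) if digits else None


prices = [to_vnd(r['price']) for r in records]
avg = sum(prices) / len(prices)
print(avg)
```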
Conclusion
So I have crawled some information that helps research the market for my MacBook business. Of course, this is just a basic demo, and the number of MacBooks sold on thegioididong is still small, so manual research would actually be faster. But when there are more products, paginated into tens or hundreds of pages, doing it by hand would leave your eyes seeing stars.
A small note: my code could only crawl correctly at the time of writing. You need to keep track of interface updates from thegioididong to see whether they have changed the HTML, and adjust the selectors accordingly; otherwise, changed HTML will leave the selectors unable to crawl the data correctly.
Thank you for taking the time to read this article. If anything is missing, please leave a comment below so I can learn and improve. (bow)