Hello, Machine Learning is a hot trend in the era of Industry 4.0. One of the most important things for working with it is data: the larger and more authentic the dataset, the better the training. In this article, I would like to introduce how to collect data with Scrapy.
Setup
The first requirement for using Scrapy is to install Python 3 and Scrapy (of course ^^).
Python3
1. Open the terminal and enter the command:

```shell
sudo add-apt-repository ppa:jonathonf/python-3.6 && sudo apt-get update
```
2. Install Python 3.6 with the command:

```shell
sudo apt-get install python3.6
```
Scrapy
1. Install Scrapy with the command:

```shell
pip install scrapy
```
Writing the first program
In this article I will use the website https://9to5mac.com/ as a demo for scraping.
Initialize Project
1. Open a terminal and initialize the Scrapy project with the command:

```shell
scrapy startproject <project name>
```
2. Use cd to move into the project directory, then initialize a Spider with the command:

```shell
scrapy genspider <spider name> <domain>
```
Here I ran:

```shell
scrapy genspider macspider 9to5mac.com
```
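This command creates a spider file under the spiders folder (here, spiders/macspider.py). The generated skeleton looks roughly like the following; the exact contents vary by Scrapy version:

```python
import scrapy


class MacspiderSpider(scrapy.Spider):
    name = 'macspider'
    allowed_domains = ['9to5mac.com']
    start_urls = ['http://9to5mac.com/']

    def parse(self, response):
        pass
```

We will replace the empty parse method with our own request and callback logic below.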
Write code for the first Spider
1. Use the IDE to open the project, open the Spider file in the spiders folder.
2. To begin, I will import the Request object with the syntax:

```python
from scrapy import Request
```
3. Start the requests by defining the start_requests method:

```python
def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url, callback=self.parse_info)
```
The parse_info callback will receive the response object containing the HTML of the page.
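To see what yield buys us here, the following dependency-free sketch uses a stand-in Request class (the real one is scrapy.Request) to show that start_requests is a generator producing one request per start URL, which Scrapy then pulls from as it schedules downloads:

```python
class Request:
    """Stand-in for scrapy.Request, just enough to show the pattern."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def start_requests(start_urls):
    # Yield one Request per URL instead of building a list up front;
    # the scheduler consumes requests lazily from this generator.
    for url in start_urls:
        yield Request(url=url, callback=None)

requests = list(start_requests(["https://9to5mac.com/"]))
print(requests[0].url)  # https://9to5mac.com/
```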
Create the parse_info method:

```python
def parse_info(self, response):
    y = response.text
    print(y)
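Printing the raw HTML is rarely the end goal; in a real spider you would extract fields with Scrapy selectors such as response.css("h2.post-title a::text").getall(). As a dependency-free illustration of the same idea, here is a sketch using Python's standard html.parser that pulls headline text out of a small hypothetical HTML snippet (the post-title class is invented for this example, not 9to5mac's actual markup):

```python
from html.parser import HTMLParser

# Hypothetical fragment, standing in for what response.text might contain.
html = ('<h2 class="post-title"><a href="/a">Apple news</a></h2>'
        '<h2 class="post-title"><a href="/b">Mac rumor</a></h2>')

class TitleParser(HTMLParser):
    """Collect the text inside <h2 class="post-title"> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "post-title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

parser = TitleParser()
parser.feed(html)
print(parser.titles)  # ['Apple news', 'Mac rumor']
```

Scrapy's selectors do this far more conveniently, but the sketch shows what "parsing the response" means underneath.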
We then run the Spider using the command:

```shell
scrapy crawl macspider
```
As we can see in the image below, the data has been crawled.
Conclusion
Okay my friends, in this article I introduced Scrapy; in the next article I will introduce its underlying engine. Thank you for your interest.