Hey, hi everyone! In this article I'll walk through building a small web scraping project and testing the demo code.
You can check out my previous two posts to learn more about Scrapy.
Choose a site
For this demo I will use the website https://9to5mac.com/ and scrape all the articles on the site.
Analyze the site
Before we begin, we need to know how the site renders its content.
There are two common types of rendering: server-rendered HTML, and data returned as JSON.
I opened the Network tab in the browser's developer tools and found …
Oh, this site returns JSON, so easy …
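To double-check outside the browser, here's a minimal sketch using the requests library. It assumes the infinite-scroll endpoint and the page form field that we use in the spider below; the exact headers and body may differ from what you see in your own dev tools:

```python
import requests

# Hit the infinite-scroll endpoint found in the Network tab.
# The URL and the "page" form field are the same ones the spider uses below.
resp = requests.post("https://9to5mac.com/?infinity=scrolling", data={"page": "1"})
print(resp.status_code)
print(resp.text[:200])  # peek at the payload; it should look like JSON
```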
Initialize project
I initialized the project with the following commands:
```
scrapy startproject ninetofivemac
cd ninetofivemac
scrapy genspider mac 9to5mac.com
```
In the MacSpider class, change start_urls to the infinite-scroll endpoint. The class properties end up as follows:
```python
name = 'mac'
allowed_domains = ['9to5mac.com']
start_urls = ['https://9to5mac.com/?infinity=scrolling']
```
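For orientation, the spider file ninetofivemac/spiders/mac.py now looks roughly like this (a sketch based on the genspider template; the methods are filled in over the next sections):

```python
import scrapy


class MacSpider(scrapy.Spider):
    name = 'mac'
    allowed_domains = ['9to5mac.com']
    start_urls = ['https://9to5mac.com/?infinity=scrolling']

    # start_requests, parse_info and parse_final_info are added below
```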
Write code for the Spider
Import the Request class to make HTTP requests for the data:
```python
from scrapy import Request
```
The start_requests method always runs first when the Spider starts. Since this site is an infinite-scroll page, I cap the loop at page 10 for the demo; when I tested it, the site actually has more than 1000 pages.
I use the POST method with formdata carrying the page index, and register a callback that receives the response returned by the HTTP request.
```python
def start_requests(self):
    # Request the first pages of the infinite-scroll feed (pages 1 to 9 here).
    for i in range(1, 10):
        yield scrapy.http.FormRequest(
            url=self.start_urls[0],
            formdata={"page": str(i)},
            callback=self.parse_info,
        )
```
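A note on the design: FormRequest sends formdata as a standard form-encoded POST body (Content-Type: application/x-www-form-urlencoded), which is Scrapy's usual way to talk to form-style or AJAX endpoints like this one.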
After each request, I parse the response text into JSON:
```python
import json

res = json.loads(response.text)
```
The response from the scrape looks like the following (I tried it with Postman):
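For reference, the shape the spider relies on is roughly the following (a sketch reconstructed from the fields the code reads: success, data, posts, and permalink; all other values here are hypothetical):

```python
# Hypothetical reconstruction of the endpoint's JSON body:
{
    "success": True,
    "data": {
        "posts": [
            {
                "permalink": "https://9to5mac.com/2023/01/01/example-post/",  # hypothetical
                # ...more per-post fields
            },
            # ...more posts for this page
        ]
    }
}
```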
Looking closer, this JSON response is missing some data for each item. The missing data can be fetched via the permalink field contained in the response, so I make a follow-up request for each post to get the rest.
```python
def parse_info(self, response):
    res = json.loads(response.text)
    if res["success"]:
        for i in res["data"]["posts"]:
            # Follow each post's permalink to fetch the remaining data.
            yield Request(
                url=i["permalink"] + "json/content/",
                callback=self.parse_final_info,
                meta={"item": i},
            )
```
The returned results look like this:
I then merge the missing data for each item back into that item:
```python
def parse_final_info(self, response):
    # Retrieve the partially filled item and attach the article body to it.
    res = response.meta.get("item")
    sub_content = json.loads(response.text)
    res["sub_content"] = sub_content
    yield res
```
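Carrying the partially filled item between requests via meta is a common Scrapy pattern; at this point each yielded item is the original post dict plus a sub_content key holding the article body. In newer Scrapy versions (1.7+), cb_kwargs can be used for the same purpose.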
Write code for the pipelines
Each returned item goes through the pipeline configured in the pipelines file; the pipeline opens a JSON file when the spider starts and exports every item into it.
```python
from scrapy.exporters import JsonItemExporter


class JsonPipeline(object):
    def __init__(self):
        # Open the output file and start a JSON exporter when the pipeline is created.
        self.file = open("1.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # Close out the JSON array and the file when the spider finishes.
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        # Export each item as it comes through the pipeline.
        self.exporter.export_item(item)
        return item
```
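One step the pipeline needs in order to run at all: it has to be enabled in settings.py. The module path below assumes the default project layout created by scrapy startproject; 300 is just a conventional priority value:

```python
# ninetofivemac/settings.py
ITEM_PIPELINES = {
    'ninetofivemac.pipelines.JsonPipeline': 300,
}
```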
Run the code
Run the code with the following command, passing the spider's name (mac, from the name attribute we set earlier):

```
scrapy crawl mac
```
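As a side note, Scrapy also has built-in feed exports, so if you don't need a custom pipeline you can write a JSON file directly with scrapy crawl mac -o output.json; the JsonPipeline above just gives finer control over the output.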
We run the code and … tada … get the following result.
Conclusion
Above, I have demoed a small data-scraping project; I hope everyone likes it. If you did, please give it an upvote, thanks!
Source code: https://github.com/KaynAssassin/ninetofivemac