Architectural overview of Scrapy


Overview

Scrapy is an open-source Python framework (https://github.com/scrapy/scrapy) for crawling and scraping data from web pages: it downloads their HTML and extracts structured data from it.

Scrapy Architecture

Throughout the process, from the initial request until the information is successfully extracted, all data flows through the engine, which coordinates the other components.

Scrapy Components

1. Scrapy Engine

The engine controls the flow of data between all components in the system and triggers events when certain actions occur.

2. Scheduler

The scheduler receives requests from the engine and enqueues them, deciding the order in which URLs are downloaded; it hands requests back to the engine on demand.
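
The scheduler is internal, but a spider can hint at its ordering through request priorities. A minimal sketch, assuming the default scheduler; the spider name and example.com URLs are placeholders:

```python
import scrapy


class PrioritySpider(scrapy.Spider):
    # Illustrative spider: the name and URLs are made up.
    name = "priority_demo"

    def start_requests(self):
        # The default scheduler dequeues higher-priority requests first.
        yield scrapy.Request("https://example.com/sitemap.xml", priority=10)
        yield scrapy.Request("https://example.com/page", priority=0)

    def parse(self, response):
        self.logger.info("Got %s", response.url)
```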

3. Downloader

It is responsible for downloading the HTML source of web pages and handing the result back to the engine.
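
You normally do not call the downloader directly; its behaviour is tuned through project settings. A small illustrative snippet for settings.py (the values here are arbitrary examples, not recommendations):

```python
# settings.py -- knobs that shape how the downloader fetches pages.
DOWNLOAD_DELAY = 0.5                 # pause (seconds) between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # parallel downloads allowed per domain
DOWNLOAD_TIMEOUT = 30                # give up on a download after this many seconds
```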

4. Spider

A class written by the developer. It parses responses, extracts items, and yields new requests (follow-up URLs) that are sent back to the scheduler via the engine.
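
A minimal spider sketch illustrating this; quotes.toscrape.com is a public practice site used here purely as an example target:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract items from the downloaded HTML.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Yield a follow-up request; the engine hands it to the scheduler.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```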

5. Item Pipeline

It processes items after they have been extracted by the spider, typically cleaning and validating them and saving them to a database.
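
A minimal pipeline sketch, assuming the dict items produced by the spider above; the class name, sqlite3 storage, and table layout are only illustrative:

```python
import sqlite3


class CleanAndStorePipeline:
    # Hypothetical pipeline: cleans each item, then persists it.

    def open_spider(self, spider):
        self.conn = sqlite3.connect("quotes.db")
        self.conn.execute("CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)")

    def process_item(self, item, spider):
        # Normalize whitespace before saving.
        item["text"] = (item.get("text") or "").strip()
        self.conn.execute(
            "INSERT INTO quotes VALUES (?, ?)", (item["text"], item.get("author"))
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.conn.close()
```

A pipeline is switched on through the ITEM_PIPELINES setting, e.g. ITEM_PIPELINES = {"myproject.pipelines.CleanAndStorePipeline": 300}, where the number controls the order in which pipelines run ("myproject" being whatever your project module is called).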

6. Downloader middlewares

Sitting between the engine and the downloader, they process the requests pushed from the engine and the responses generated by the downloader.
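
A minimal downloader-middleware sketch; the class name and header value are invented for illustration:

```python
class CustomHeaderMiddleware:
    # Sees every request on its way to the downloader
    # and every response on its way back to the engine.

    def process_request(self, request, spider):
        request.headers.setdefault("User-Agent", "my-crawler/0.1")
        return None  # None means: continue handling this request normally

    def process_response(self, request, response, spider):
        spider.logger.debug("Fetched %s (%d)", response.url, response.status)
        return response
```

It is enabled through the DOWNLOADER_MIDDLEWARES setting, with a number that positions it among the built-in middlewares.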

7. Spider middlewares

Sitting between the engine and the spider, they process the spider's input (responses) and its output (items and requests).
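
A minimal spider-middleware sketch; the class name and the "drop items with empty text" rule are invented for illustration:

```python
class DropEmptyItemsMiddleware:
    # Filters the spider's output (items and requests)
    # before it reaches the engine.

    def process_spider_output(self, response, result, spider):
        for element in result:
            # Requests pass through untouched; dict items are checked.
            if isinstance(element, dict) and not element.get("text"):
                continue  # drop the item
            yield element
```

Like downloader middlewares, it is enabled with an order number, here via the SPIDER_MIDDLEWARES setting.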

Execution process of the system

  1. The engine gets the initial requests to crawl from the spider.
  2. The engine schedules those requests in the scheduler and asks for the next request to crawl.
  3. The scheduler returns the next request to the engine.
  4. The engine sends the request to the downloader, passing through the downloader middlewares.
  5. Once the HTML source has finished downloading, the downloader generates a Response object and returns it to the engine, again passing through the downloader middlewares.
  6. The engine receives the response from the downloader and sends it to the spider for processing, passing through the spider middleware.
  7. The spider processes the response and returns the scraped items and any new requests to the engine, via the spider middleware.
  8. The engine sends the processed items to the item pipelines, sends the new requests to the scheduler, and asks for the next request (if any) to crawl.
  9. The process repeats from step 3 until there are no more requests in the scheduler.
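
To watch this loop run end to end, a crawl can be driven from a plain Python script. A minimal sketch, reusing the QuotesSpider sketched in the Spider section above:

```python
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(QuotesSpider)  # the spider class defined earlier
process.start()  # blocks until the scheduler has no more requests
```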

Conclusion

Above, I have introduced the underlying architecture of Scrapy and how its components work together. Thank you for reading!

Reference source: https://docs.scrapy.org/en/latest/topics/architecture.html
