Crawl data using Laravel, a proxy, and Simple HTML DOM


Today I will show you how to get data from any website using Laravel, a proxy, and Simple HTML DOM. In this article, I will use crawling Amazon products as the example.

Setup

First, go to this site and download the simple_html_dom.php file into a Helpers directory in your Laravel project (this is just a directory you create yourself; you can put the file in any folder you want). Then open the composer.json file and add the path of the newly added file to the autoload section.
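The exact composer.json change is not reproduced here, so below is a minimal sketch of the autoload entry, assuming you placed the file in app/Helpers:

```json
{
    "autoload": {
        "files": [
            "app/Helpers/simple_html_dom.php"
        ]
    }
}
```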

Then run composer dump-autoload so the file is registered with Laravel's autoloader.

Code

To crawl the data, I will create a command file, and from that command dispatch Laravel jobs. With this approach you can schedule the whole crawl to run automatically and push the work onto a queue; we can then use Supervisor to start multiple worker processes that run at the same time. However, I recommend running at most 5 processes at a time, because Amazon will block any IP that sends many requests in a short period of time (you can work around this with a public or private proxy).
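For reference, here is a minimal sketch of a Supervisor program definition for the queue workers, following the standard pattern from the Laravel deployment documentation; the program name, project path, and log path are placeholders you would replace with your own:

```ini
[program:aws-crawler-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /path/to/project/artisan queue:work --sleep=3 --tries=3
autostart=true
autorestart=true
; at most 5 concurrent workers, as recommended above
numprocs=5
redirect_stderr=true
stdout_logfile=/path/to/project/storage/logs/worker.log
```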

  • First you need to create an AwsProductCrawler.php file in the app/Console/Commands folder with the following content:
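The original listing is not reproduced here, so the following is a minimal sketch of what such a command could look like. The Merchant model, its merchant_id column, and the crawler:aws-products signature are my assumptions; only the AwsProductCrawler and AwsCrawlerLink names come from the article:

```php
<?php

namespace App\Console\Commands;

use App\Jobs\AwsCrawlerLink;
use App\Models\Merchant; // assumed Eloquent model holding merchant_id
use Illuminate\Console\Command;

class AwsProductCrawler extends Command
{
    // Run the crawl with: php artisan crawler:aws-products
    protected $signature = 'crawler:aws-products';

    protected $description = 'Queue a link-crawling job for every Amazon merchant in the database';

    public function handle(): void
    {
        // Fetch every merchant that needs crawling; each record must have
        // a merchant_id so we can reach its storefront listing page.
        Merchant::query()->each(function (Merchant $merchant) {
            AwsCrawlerLink::dispatch($merchant->merchant_id);
        });
    }
}
```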

This file is quite simple: it just fetches the merchants that need to be crawled from the database. Each merchant record must have a merchant_id so we can reach that merchant's listing page and collect all of its products.

Next, you need to create an AwsCrawlerLink.php file in the app/Jobs folder with the following content:
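Again the original listing is missing, so below is a sketch of such a job. The storefront URL pattern, the data-asin and pagination selectors, and the FetchesHtml trait (a hypothetical helper providing the getContent method, sketched further down) are my assumptions:

```php
<?php

namespace App\Jobs;

use App\Jobs\Concerns\FetchesHtml; // hypothetical trait providing getContent()
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class AwsCrawlerLink implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels, FetchesHtml;

    public function __construct(public string $merchantId)
    {
    }

    public function handle(): void
    {
        // Storefront search filtered by seller id; this URL pattern is an assumption.
        $url = 'https://www.amazon.com/s?me=' . $this->merchantId;

        while ($url) {
            $raw = $this->getContent($url);
            $html = $raw ? str_get_html($raw) : false;
            if (!$html) {
                return; // proxy failed or Amazon blocked the request
            }

            // Each search result block carries the product ASIN in data-asin.
            foreach ($html->find('div[data-asin]') as $item) {
                $asin = trim($item->getAttribute('data-asin'));
                if ($asin !== '') {
                    AwsCrawlerDetail::dispatch('https://www.amazon.com/dp/' . $asin);
                }
            }

            // Follow pagination until there is no "next page" link.
            $next = $html->find('a.s-pagination-next', 0);
            $url  = $next ? 'https://www.amazon.com' . $next->href : null;

            $html->clear(); // release memory held by Simple HTML DOM
        }
    }
}
```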

This job is responsible for collecting all of the merchant's product URLs, then pushing each one to AwsCrawlerDetail to fetch the product's details.

The getContent function has the following content:
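The original function body is not shown, so here is a sketch of a proxied cURL fetch, written as a small trait that both jobs can share. The trait name, the config key holding the proxy list, and the user agent string are my assumptions:

```php
<?php

namespace App\Jobs\Concerns;

trait FetchesHtml
{
    /**
     * Fetch the raw HTML of a URL through a proxy using cURL.
     * Returns false when the request fails.
     */
    private function getContent(string $url): string|false
    {
        // Pick a random proxy so requests are spread over several IPs.
        // config('crawler.proxies') is a hypothetical key, e.g. ['1.2.3.4:8080', ...].
        $proxies = config('crawler.proxies', []);
        $proxy = $proxies ? $proxies[array_rand($proxies)] : null;

        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_CONNECTTIMEOUT => 10,
            CURLOPT_TIMEOUT        => 30,
            CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        ]);

        if ($proxy) {
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
        }

        $body = curl_exec($ch);
        curl_close($ch);

        return $body;
    }
}
```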

Finally, the part that extracts the product details, which is also the longest and hardest piece, lives in AwsCrawlerDetail.php:
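As with the other listings, the original code is missing, so the sketch below parses a few representative fields (title, price, main image) with Simple HTML DOM and stores them. The Product model and the CSS selectors are my assumptions; Amazon changes its markup frequently, so the selectors need regular maintenance:

```php
<?php

namespace App\Jobs;

use App\Jobs\Concerns\FetchesHtml; // hypothetical trait from the previous section
use App\Models\Product;            // assumed Eloquent model for storing results
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class AwsCrawlerDetail implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels, FetchesHtml;

    public function __construct(public string $productUrl)
    {
    }

    public function handle(): void
    {
        $raw = $this->getContent($this->productUrl);
        $html = $raw ? str_get_html($raw) : false;
        if (!$html) {
            return; // proxy failed or Amazon blocked the request
        }

        // Selectors matching Amazon's product page at the time of writing.
        $title = $html->find('#productTitle', 0);
        $price = $html->find('.a-price .a-offscreen', 0);
        $image = $html->find('#landingImage', 0);

        // Upsert by URL so re-crawling a product updates the existing row.
        Product::updateOrCreate(
            ['url' => $this->productUrl],
            [
                'title' => $title ? trim($title->plaintext) : null,
                'price' => $price ? trim($price->plaintext) : null,
                'image' => $image ? $image->src : null,
            ]
        );

        $html->clear(); // release memory held by Simple HTML DOM
    }
}
```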