Crawling a website using a Laravel Dusk spider


Introduction

Crawling data is not an unfamiliar term in the marketing and SEO industry, because crawling is the technique used by the robots of today's popular search engines such as Google, Yahoo, Bing, Yandex, and Baidu. A crawler's main job is to collect data from any website, or from a predetermined set of sites, by parsing the HTML source code and extracting the information requested by the user or by the search engine.

What is Laravel Dusk?

Laravel Dusk is a browser automation tool provided by Laravel. It can access your web application, or any other website, in a real browser, just like an actual user visiting your site. Although the main purpose of Laravel Dusk is automated testing, it can also be used for web crawling.

Install Laravel Dusk

Installing Laravel Dusk is pretty straightforward. We can use Composer to do just that:
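
```bash
composer require --dev laravel/dusk
```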

After the package is installed, we can use the artisan command to install and generate the default files:
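
```bash
php artisan dusk:install
```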

Prepare the migration file and the database table

To save the data after crawling, we create a pages table:
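
```bash
php artisan make:migration create_pages_table
```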

Our migration will look like this:
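
A minimal sketch, assuming a url column for each discovered link and a visited flag so the spider can track which pages it has already crawled:

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

class CreatePagesTable extends Migration
{
    public function up()
    {
        Schema::create('pages', function (Blueprint $table) {
            $table->id();
            $table->string('url')->unique();            // each discovered link, stored once
            $table->boolean('visited')->default(false); // has the spider crawled this page yet?
            $table->timestamps();
        });
    }

    public function down()
    {
        Schema::dropIfExists('pages');
    }
}
```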

Dusk Spider Test

Now we will try writing our Laravel Dusk spider. First, generate a test with the artisan command:
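
```bash
php artisan dusk:make DuskSpiderTest
```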

The contents of the DuskSpiderTest file will be as follows:
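
Below is a minimal sketch, assuming a Page Eloquent model backed by the pages table above; the target URL and the helper implementations are illustrative rather than definitive:

```php
<?php

namespace Tests\Browser;

use App\Models\Page;
use Illuminate\Foundation\Testing\DatabaseMigrations;
use Laravel\Dusk\Browser;
use Tests\DuskTestCase;

class DuskSpiderTest extends DuskTestCase
{
    // Refreshes the database in setUp() before every run.
    use DatabaseMigrations;

    protected $startUrl = 'https://example.com'; // hypothetical target site
    protected $domain = 'example.com';

    public function testSpider(): void
    {
        $this->browse(function (Browser $browser) {
            $this->urlSpider($browser);
        });
    }

    protected function urlSpider(Browser $browser): void
    {
        // Seed the queue with the start URL, then keep crawling
        // until every stored page has been visited.
        Page::firstOrCreate(['url' => $this->startUrl]);

        while ($page = Page::where('visited', false)->first()) {
            $this->getLinks($browser, $page->url);
            $page->update(['visited' => true]);
        }
    }

    protected function getLinks(Browser $browser, string $url): void
    {
        $browser->visit($url);

        // Collect the href of every anchor on the current page.
        $links = $browser->script(
            "return Array.from(document.querySelectorAll('a')).map(a => a.href);"
        )[0];

        foreach ($links as $link) {
            $link = $this->trimUrl($link);

            if ($this->isValidUrl($link)) {
                // Assumes 'url' and 'visited' are in the Page model's $fillable.
                Page::firstOrCreate(['url' => $link]);
            }
        }
    }

    protected function isValidUrl(?string $url): bool
    {
        // Keep only well-formed links that stay on the target domain.
        return $url !== null
            && filter_var($url, FILTER_VALIDATE_URL) !== false
            && parse_url($url, PHP_URL_HOST) === $this->domain;
    }

    protected function trimUrl(string $url): ?string
    {
        // Strip fragments and trailing slashes so duplicate URLs
        // collapse into one canonical form.
        $url = strtok($url, '#');

        return $url ? rtrim($url, '/') : null;
    }
}
```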

  • startUrl and domain define the website we will be crawling
  • The setUp method refreshes the database every time the test is run
  • Crawling starts in the urlSpider function, which calls the getLinks function
  • getLinks visits each URL, collects all the links on the current page, and saves them to the database
  • isValidUrl and trimUrl are helper functions that check whether a link is valid and normalize it

Finally, run Dusk with the artisan command:
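
```bash
php artisan dusk
```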

This article was translated from Crawling website using Laravel Dusk.


Source: Viblo