The Perfect Combination of Scrapy and Splash – The ultimate solution for websites that use JavaScript?

Friday, 26/03/2021

Tram Ho

1. Introduction

In the previous article about Scrapy, I covered the basics of Scrapy and built a small demo that crawls data from the thegioididong website. You might conclude that Scrapy can crawl any website. I used to think so too, and tried it on well-known e-commerce sites such as Shopee and Lazada, but got no data back. That is Scrapy’s limitation: it cannot handle pages that render their content with JavaScript. These days JavaScript is used everywhere, so what can we do?

Let’s find a solution =))

2. Check whether the site uses JavaScript to render

As mentioned above, Scrapy’s weakness is that it cannot crawl pages that are rendered with JS. So whenever you plan to crawl a website with Scrapy, check this first, before you dive into the code and see your effort go down the drain.

To find out whether a page renders its content with JS, press Ctrl + U and check whether the page source already contains the HTML content, or only a body tag with JS that inserts the content later. Alternatively, use the Chrome extension Quick Javascript Switcher, which lets you enable or disable JavaScript for a page with a single click.
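The Ctrl + U check can also be approximated in code. The sketch below is only a rough heuristic of my own, not part of Scrapy or Splash: it counts the text that sits outside script and style tags in the raw page source and flags the page when there are scripts but almost no visible text. The function name looks_js_rendered and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _VisibleTextCounter(HTMLParser):
    """Counts characters of text that sit outside <script>/<style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self.visible_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_js_rendered(raw_html, min_visible_chars=200):
    """Heuristic: scripts are present but the raw source has almost no text.

    The threshold is an arbitrary assumption; tune it for the site at hand.
    """
    parser = _VisibleTextCounter()
    parser.feed(raw_html)
    return parser.script_tags > 0 and parser.visible_chars < min_visible_chars
```

Feed it the raw source you see with Ctrl + U (for example, the body of a plain Scrapy response). A True result only hints that a renderer such as Splash may be needed; checking by eye is still the final word.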

After some research, I found that Scrapy can be combined with a headless browser such as Splash, which waits for the website to render its content and cookies and then sends the generated HTML back to the crawler to be parsed as usual. This costs a little extra waiting time, but it is still faster than the other options for Scrapy.

Here are some of the features that Splash gives us:

process multiple webpages in parallel;
get HTML results and / or take screenshots;
turn OFF images or use Adblock Plus rules to make rendering faster;
execute custom JavaScript in page context;
write Lua browsing scripts;
develop Splash Lua scripts in Splash-Jupyter Notebooks;
get detailed rendering info in HAR format.
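To make a couple of these features concrete, here is a minimal sketch of how a request to Splash’s HTTP API could be assembled. It assumes a Splash instance listening on localhost:8050 (the setup from section 3.2) and uses the render.html endpoint with its url, wait and images arguments. The helper name build_render_url is my own.

```python
from urllib.parse import urlencode

# Assumption: Splash runs locally on its default port.
SPLASH_URL = "http://localhost:8050"


def build_render_url(page_url, wait=5, load_images=False):
    """Build a render.html URL that asks Splash for the rendered HTML.

    wait   -- seconds Splash waits after the page loads
    images -- 0 turns image loading off to speed up rendering
    """
    params = {
        "url": page_url,
        "wait": wait,
        "images": 1 if load_images else 0,
    }
    return f"{SPLASH_URL}/render.html?{urlencode(params)}"
```

Opening the resulting URL in a browser (or fetching it with any HTTP client) should return the HTML after JavaScript has run, provided Splash is up.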

On top of that, there is a scrapy-splash library for Scrapy. With swords in hand, let’s go conquer the JS-rendered web =)))

3. Practice

3.1. Create project and spider

I will reuse the project from the previous article for this demo. If you don’t know how to create one, follow that article.

This time Shopee is my target. Quick Javascript Switcher confirms that it really does render with JS. Itching to get started, I created a shopee_crawl spider to crawl all products of one shop on Shopee. To keep the demo simple, I picked the Apple Flagship Store because it has only one page of products. Create the spider with the following command:

scrapy genspider shopee_crawl https://shopee.vn/shop/88201679/search

I have walked up to the shop’s door but can’t get inside yet. I just remembered that I left my sword at home, so let’s hurry back and get it :v

3.2. Splash settings and scrapy-splash

To install Splash, you first need Docker. Once Docker is available, run the following two commands:

$ sudo docker pull scrapinghub/splash

$ sudo docker run -p 8050:8050 scrapinghub/splash
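Before moving on, it can help to confirm that the container is actually answering. The sketch below queries Splash’s _ping endpoint, which reports a status of "ok" when the service is healthy. The function name splash_is_up and the default base URL are my own assumptions.

```python
import http.client
import json
import urllib.request


def splash_is_up(base_url="http://localhost:8050", timeout=3):
    """Return True if Splash answers its _ping endpoint with status 'ok'."""
    try:
        with urllib.request.urlopen(f"{base_url}/_ping", timeout=timeout) as resp:
            return json.load(resp).get("status") == "ok"
    except (OSError, ValueError, AttributeError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error or unexpected body:
        # treat all of them as "not reachable".
        return False
```

Calling splash_is_up() right after docker run gives a quick yes/no answer, which is handy at the start of a crawl script.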

That is only one sword. For the second one, install scrapy-splash with the following command:

$ pip install scrapy scrapy-splash

Add some configuration to the settings.py file as follows:

# ...
# Address of the Splash instance started with Docker
SPLASH_URL = 'http://localhost:8050'
# Make duplicate filtering and HTTP caching aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
COOKIES_ENABLED = True  # If cookies are needed
SPLASH_COOKIES_DEBUG = False
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
}
# Wait 10 seconds between requests to go easy on the site
DOWNLOAD_DELAY = 10
# ...
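The numbers in DOWNLOADER_MIDDLEWARES are priorities: Scrapy calls process_request on the middlewares in increasing order of these numbers. The small sketch below, which is only an illustration, sorts the entries above to show the order in which a request passes through them. It ignores the built-in middlewares that Scrapy merges in from its defaults, and the helper name request_order is my own.

```python
# Same entries as in settings.py above.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
}


def request_order(middlewares):
    """Return middleware class names in the order requests pass through them."""
    ordered = sorted(middlewares.items(), key=lambda entry: entry[1])
    return [path.rsplit(".", 1)[-1] for path, _priority in ordered]


for class_name in request_order(DOWNLOADER_MIDDLEWARES):
    print(class_name)
```

So the User-Agent is set first, the Splash middlewares run next, and compression handling sits closest to the downloader.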

Okay, Splash and scrapy-splash are now fully installed. You have your swords, let’s go =)))

3.3. Identify the target data you want to crawl

Open the Apple Flagship Store page and see what information is available. From what I can see, the important fields are the product name, original price, sale price, and number sold. Open the browser inspector and analyze the page structure so that you can select the right elements in the DOM =))

Next, create a new class ProductItem to define the data you want to get in the items.py file as follows:

import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()        # product name
    price = scrapy.Field()       # original price
    price_sale = scrapy.Field()  # sale price
    sold = scrapy.Field()        # number sold
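As a side note, recent Scrapy versions (2.2 and later, as far as I know) also accept plain dataclass objects as items. The sketch below shows what the same four fields could look like in that style. The class name ProductRecord is hypothetical and the rest of this article keeps using ProductItem.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProductRecord:
    """Same fields as ProductItem, expressed as a standard-library dataclass."""
    name: Optional[str] = None
    price: Optional[str] = None       # original price, as scraped text
    price_sale: Optional[str] = None  # sale price, as scraped text
    sold: Optional[str] = None        # "sold" label, as scraped text


# Fields that were not found on the page simply stay None.
record = ProductRecord(name="Example product")
print(asdict(record))
```

Every field defaults to None, which mirrors what a selector returns when it finds no match.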

Now go back to ShopeeCrawlSpider to crawl this shop. Here is the code, with an explanation below:

import scrapy
from scrapy_splash import SplashRequest
from demo_scrapy.items import ProductItem


class ShopeeCrawlSpider(scrapy.Spider):
    name = 'shopee_crawl'
    allowed_domains = ['shopee.vn']
    start_urls = ['https://shopee.vn/shop/88201679/search']

    # Lua script run by Splash: load the page, wait for the JS to render,
    # then return the rendered HTML and the final URL.
    render_script = """
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(splash.args.wait))

            return {
                html = splash:html(),
                url = splash:url(),
            }
        end
        """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                # Lua scripts only run on the 'execute' endpoint;
                # 'render.html' ignores lua_source.
                endpoint='execute',
                args={
                    'wait': 5,
                    'lua_source': self.render_script,
                }
            )

    def parse(self, response):
        for product in response.css("div.shop-search-result-view__item"):
            # Create a fresh item per product so yielded items share no state.
            item = ProductItem()
            item["name"] = product.css("div._36CEnF ::text").extract_first()
            item["price"] = product.css("div._3_-SiN ::text").extract_first()
            item["price_sale"] = product.css("span._29R_un ::text").extract_first()
            item["sold"] = product.css("div.go5yPW ::text").extract_first()

            yield item

Here is what the code above does:

The name, allowed_domains and start_urls attributes were explained in the previous post, so I won’t repeat them here.
Normally, Scrapy requests the URLs in start_urls and parses the responses directly. Because the page first has to be rendered with JS, that does not work here. Instead, we override start_requests and use SplashRequest, which waits 5 seconds and returns the rendered html and url, using the Lua script in render_script shown above. A tip: once Splash is listening on port 8050, you can open 0.0.0.0:8050 to try the examples built into Splash, or consult the Splash Lua scripting API docs if you need to write more complex scripts.
Once the html and url have been rendered, the response is passed to parse. At that point, the response received by parse is the rendered web page, with the URL matching the one given to SplashRequest.
Now that the response contains the fully rendered HTML, all that remains is choosing the right selectors for the data you want. If that proves difficult, open the inspector and copy the selector, but remember to examine the structure carefully before you choose :v
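To see what happens underneath SplashRequest, here is a rough sketch of the same exchange at the HTTP level, without Scrapy. Splash runs Lua scripts through its execute endpoint, which accepts a JSON body containing lua_source plus any arguments the script reads from splash.args. The helper names build_execute_request and fetch_rendered are my own, and the Splash address is assumed to be localhost:8050.

```python
import json
import urllib.request

# Lua script equivalent to the spider's render_script.
RENDER_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(splash.args.wait))
    return {html = splash:html(), url = splash:url()}
end
"""


def build_execute_request(page_url, wait=5, splash_url="http://localhost:8050"):
    """Prepare (but do not send) a POST request to Splash's execute endpoint."""
    payload = {"lua_source": RENDER_SCRIPT, "url": page_url, "wait": wait}
    return urllib.request.Request(
        f"{splash_url}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered(page_url):
    """Send the request; needs a running Splash. Returns the script's table as a dict."""
    with urllib.request.urlopen(build_execute_request(page_url), timeout=60) as resp:
        return json.load(resp)
```

With Splash running, fetch_rendered(url)["html"] should hold the rendered page, which makes it easy to try selectors outside of a spider.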

3.4. Proceed to crawl the data

Okay, after all the effort of carrying the swords to the shop, let’s see the results with the following command:

scrapy crawl shopee_crawl -o product.json

Remember to set a User-Agent in DEFAULT_REQUEST_HEADERS or in the USER_AGENT setting in settings.py so that the site does not reject the crawler as a bot. Now open the product.json file and check whether the crawled data really matches the shop.

After a quick check, the number of products and the crawled information match the shop. Success =))
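Checking by eye works for one page of products. For anything larger, a small script can flag items with missing fields. The sketch below assumes product.json is the JSON array written by the -o option, with the four fields defined in ProductItem. The helper names are my own, and note that a product without a discount may legitimately have no original price.

```python
import json

REQUIRED_FIELDS = ("name", "price", "price_sale", "sold")


def incomplete_items(items, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) for every item lacking a value."""
    problems = []
    for index, item in enumerate(items):
        missing = [field for field in required if not item.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


def check_file(path="product.json"):
    """Load the crawl output and report incomplete items."""
    with open(path, encoding="utf-8") as handle:
        return incomplete_items(json.load(handle))
```

If every item comes back with all fields missing, the selectors most likely no longer match the page.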

4. Conclusion

That completes a small demo introducing the basics of combining Scrapy and Splash. This combination compares well with the other solutions I have read about, such as Requests-HTML (which only supports Python 3.6 and above) or Selenium (which consumes more system resources when making many requests during a crawl and can overload the machine).

As I noted in the previous post, the code in this post is only guaranteed to work at the time of writing. Keep an eye on updates to the Shopee interface: if the HTML changes, adjust the selectors accordingly, otherwise the outdated selectors will no longer crawl the data correctly.

In the next post, I may crawl the detailed information of each product and also handle product pagination.

Thank you for taking the time to read my article. If anything is missing, please leave a comment below so that I can learn and improve. (bow)

5. References

https://medium.com/@doanhtu/scrapy-and-splash- marriage-couples-quite-viet-c7745dc9ab56

https://github.com/scrapy-plugins/scrapy-splash

Source : Viblo