“Scraping” business data with Beautiful Soup in an extremely simple way

Tram Ho

At the end of the year, with some free time on my hands, I was surfing Facebook when a friend texted me, asking me to pull some business data for her sales spam. I was free anyway, so I agreed to help; it had been a long time since I last wrote any code.
First, a quick introduction: https://vinabiz.org/ is a site that lets us look up information about businesses in Vietnam by province, city, industry, and so on. The information is basic, but more than enough for the sales folks to go spamming with.

To “scrape” data from a website, you first need to determine the following:

  • The data structure to retrieve (data fields, data types)
  • How the website loads data into the browser: usually it either calls an API or renders the data directly into the HTML page
  • What technology to use for “scraping”
  • Code, code, and code

You will also run into other problems along the way (e.g., the site's anti-scraping tricks), which I will clarify in the sections below. We follow the steps from top to bottom. I will use Python for this article. Why Python? Because it is fast, easy to code, easy to run, and a plain text editor is enough, so you can code without dragging an IDE around (here I use Microsoft's Visual Studio Code).

Step 1: Determine the data structure to retrieve. A business on https://vinabiz.org/ comes with a set of basic information fields.

Create a class whose attributes are the fields of interest.
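A minimal sketch; the exact field list is my assumption, based on the fields this article touches later (name, address, phone, email, director, and so on):

```python
class Company:
    """One business record; the attribute list here is an assumption."""

    def __init__(self, name='', tax_code='', address='', phone='',
                 email='', director=''):
        self.name = name
        self.tax_code = tax_code
        self.address = address
        self.phone = phone
        self.email = email
        self.director = director
```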

Okay, we're done with the first step. Simple, right?

Step 2: Determine how the website loads data
The simplest and most pleasant case is when the website loads its data by calling an API. So open the dev tools and check the Network tab. And of course, with my luck, this page does call an API, except it turns out to be an ads API. I gave up on that idea and went back to dealing with the raw HTML.

Continuing with the Network tab to find requests that return data, it turns out that, fortunately, a plain GET request to the company URL already gives us all the data.

That's step 2 done; a piece of cake.

Step 3: Identify the technology to “scrape” with
The language, as I said, is Python, and for parsing HTML in Python there is a famous and familiar library: Beautiful Soup. I will not go over the setup here; you can read more in its documentation.

Step 4: Put everything together and start scraping

First, configure logging so we can use Python's built-in logging module, which is very convenient for debugging (honestly, print() also gets the job done for me).
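A typical setup, nothing site-specific about it:

```python
import logging

# INFO level is enough to follow the crawl's progress.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)
logger = logging.getLogger(__name__)
```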

Looking at how the site paginates, we can see that it pages via the path at the end of the URL: the page number corresponds to the last path segment. So we need to specify a start page and an end page when retrieving data.

Start by declaring the necessary arguments (a sketch with argparse follows the list):

  • url: the link containing the business listing; it can be filtered by province, city, district, and so on
  • start: the first page to fetch
  • end: the last page to fetch
  • out: the file to save the scraped data to
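The flag names below simply mirror the list above:

```python
import argparse

parser = argparse.ArgumentParser(description='Scrape business data from vinabiz.org')
parser.add_argument('--url', required=True, help='listing URL (province, city, ...)')
parser.add_argument('--start', type=int, default=1, help='first page to fetch')
parser.add_argument('--end', type=int, default=1, help='last page to fetch')
parser.add_argument('--out', default='companies.xls', help='output Excel file')
args = parser.parse_args()
```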

If you want to be fancy, you can add validation for the arguments. Since this is just an extra, a quick if/else will do.

The main crawl flow behaves as follows:

  • Request the URL containing the business listing and collect the links to each business's detail page
  • Request each detail link obtained in the previous step and extract the returned HTML to get the required information
  • After collecting every business on the current page, move on to the next page and repeat the extraction

To make HTTP requests I use Python's Requests library. After installing it, just import it in the code file.
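For completeness:

```python
# pip install requests
import requests
```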

Next, a function that retrieves the list of detail URLs for one listing page. Business listings are rendered as a list of <div> elements with the class “row margin-right-15 margin-left-10”. Once the request returns and I have the HTML, I use Beautiful Soup to filter out all those divs and read the href attribute of the <a> tag inside each one, which is the link to that business's detail page.
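A sketch of that function; the div class comes straight from the page source quoted above, while the function name and the relative-URL handling are my assumptions:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def get_company_urls(page_url):
    """Fetch one listing page and return the business detail links on it."""
    resp = requests.get(page_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    urls = []
    # Each business row is a div with this class (taken from the page source).
    for div in soup.select('div.row.margin-right-15.margin-left-10'):
        a = div.find('a', href=True)
        if a:
            urls.append(urljoin(page_url, a['href']))
    return urls
```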

Once I have the list of detail URLs, a simple loop and one more function fetch the details of each business. At this point I discovered something quite interesting: if you are not logged in, the detail page hides some information, such as the email and the director's phone number. A fun little trick by the website, so I went looking for a way to bypass it.

Inspecting my requests, I found that the server checks the user's login session through the attached cookies.

To bypass this, simply log in to the site in a browser, then declare a cookie corresponding to that login session and send it along with the request.
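A sketch; the cookie name below is a placeholder, so copy the real name and value from your browser's dev tools after logging in:

```python
# Placeholder: replace with the session cookie your browser holds after login.
COOKIES = {'PHPSESSID': '<your-session-id>'}


def get_company_detail(url):
    """GET a business detail page, attaching the login-session cookie."""
    resp = requests.get(url, cookies=COOKIES, timeout=30)
    resp.raise_for_status()
    return resp.text
```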

The response is the returned HTML containing the information, so next we write a function to extract data from this pile of HTML. The data sits in a table with the class “table table-bordered”, one piece of information per row and column of that table.

Write a function that extracts the data from this table, taking as input the response from get_company_detail(url) and returning a Company object as declared at the beginning of the article.
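A sketch; the table class comes from the article, but the row labels used in the mapping are guesses to be checked against the real markup:

```python
def extract_company(html):
    """Parse a detail page into a Company object."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.select_one('table.table.table-bordered')
    if table is None:
        return Company()
    fields = {}
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) >= 2:
            fields[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    # The Vietnamese labels below are assumptions; adjust to the actual table.
    return Company(
        name=fields.get('Tên doanh nghiệp', ''),
        tax_code=fields.get('Mã số thuế', ''),
        address=fields.get('Địa chỉ', ''),
        phone=fields.get('Điện thoại', ''),
        email=fields.get('Email', ''),
        director=fields.get('Giám đốc', ''),
    )
```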

After finishing this part I thought I was done, but no such luck. When logging the scraped data, the email comes out as [email protected]. Looking back at the tag containing the email, the address turns out to be obfuscated in the markup.

Since the email still displays normally in the browser, the address must be decoded by JavaScript on the client after the page loads. A bit more digging turns up the file email-decode.js. This is exactly what we need.

The problem is that this is JavaScript code, so it takes one more step to convert it to Python. During the conversion I found that several of its functions are unrelated to decoding and merely fix up HTML elements after the decode. After stripping those out, converting email-decode.js to Python gives us the decode function.
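A minimal Python port, assuming email-decode.js implements the widely used Cloudflare email-protection scheme (the first hex byte is an XOR key for the rest):

```python
def decode_email(encoded):
    """Decode an obfuscated email string (assumes the Cloudflare scheme)."""
    key = int(encoded[:2], 16)
    return ''.join(
        chr(int(encoded[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded), 2)
    )
```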

Once we have the decode function, improve get_company_detail(url) so that the email field of the Company object is set from the decoded value instead of the obfuscated placeholder.
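A sketch of the improved function; the __cf_email__ class and data-cfemail attribute are what Cloudflare's protection normally uses, so verify them against the actual markup:

```python
def get_company_detail(url):
    """Fetch a detail page and return a Company with the email decoded."""
    resp = requests.get(url, cookies=COOKIES, timeout=30)
    resp.raise_for_status()
    company = extract_company(resp.text)
    soup = BeautifulSoup(resp.text, 'html.parser')
    protected = soup.find(class_='__cf_email__')
    if protected and protected.has_attr('data-cfemail'):
        company.email = decode_email(protected['data-cfemail'])
    return company
```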

That completes the crawl flow; the business data ends up in a list of Company objects held in the variable company_arr.
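A sketch of that loop; the URL pattern follows the "page number as the last path segment" observation from earlier:

```python
def crawl(base_url, start, end):
    """Walk the listing pages and collect one Company per business."""
    company_arr = []
    for page in range(start, end + 1):
        page_url = f'{base_url}/{page}'  # page number is the last path segment
        for detail_url in get_company_urls(page_url):
            try:
                company_arr.append(get_company_detail(detail_url))
            except requests.RequestException as exc:
                logger.warning('failed to fetch %s: %s', detail_url, exc)
    return company_arr
```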

The last thing is to write the data to an Excel file. For reading/writing Excel you can roll your own or use the xlwt library for Python; installation and usage are covered on its home page. We need a function that writes the header row for the file, a function that writes the data rows from company_arr, and then the complete export function that combines both.
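A combined sketch with xlwt; the header titles are my choice and should match whatever fields you kept in Company:

```python
import xlwt

HEADERS = ['Name', 'Tax code', 'Address', 'Phone', 'Email', 'Director']


def write_header(sheet):
    """Write the column titles into row 0."""
    for col, title in enumerate(HEADERS):
        sheet.write(0, col, title)


def save_to_excel(companies, path):
    """Dump the list of Company objects into an .xls file."""
    book = xlwt.Workbook()
    sheet = book.add_sheet('companies')
    write_header(sheet)
    for row, c in enumerate(companies, start=1):
        values = [c.name, c.tax_code, c.address, c.phone, c.email, c.director]
        for col, value in enumerate(values):
            sheet.write(row, col, value)
    book.save(path)
```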

Everything is 99% done. I write one more main function that runs when we invoke the file.
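A sketch tying the earlier snippets together (args, crawl, save_to_excel):

```python
def main():
    company_arr = crawl(args.url, args.start, args.end)
    save_to_excel(company_arr, args.out)
    logger.info('saved %d companies to %s', len(company_arr), args.out)


if __name__ == '__main__':
    main()
```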

And now, run a test and see the results.
Open a terminal and run the script.
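Something along these lines (the script name and listing URL are placeholders):

```
python crawler.py --url <listing-url> --start 1 --end 5 --out companies.xls
```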

Result

That's it =)) Now you can use the information for whatever you like.

The full source is here; I hope you find it useful, just don't spam too much.


Source: Viblo