At the end of the year I had some free time and was browsing Facebook when a friend messaged me, asking me to pull some business data for her sales spam. I was free anyway, so I agreed to help. It had also been a long time since I last wrote any code.
First, a quick introduction: https://vinabiz.org/ is a site that lets you look up businesses in Vietnam by province, city, industry, and so on. The information is basic, but it is enough for the sales team to go spamming.
To scrape data from a website, you first need to work out the following:
- The data structure to retrieve (data fields, data types)
- How the website loads data into the browser: usually it either calls an API or renders the data directly into the HTML page
- Which technology to use for scraping
- Code, code, and more code
Along the way we will run into other problems as well (for example, the site's anti-scraping tricks), which I will explain in the sections below. Let's follow the steps from top to bottom. In this article I will code in Python. Why Python? Because it is fast to write, easy to code, easy to run, and you can work in a plain text editor without hauling out an IDE (here I use Microsoft's Visual Studio Code).
Step 1: Determine the data structure to retrieve
The detail page of each business on https://vinabiz.org/ lists its names, business code, license date, status, address, contact details, and business lines. Create a class whose attributes are the fields of interest:
```python
class Company:
    official_name = ''
    trading_name = ''
    bussiness_code = ''
    date_of_license = ''
    start_working_date = ''
    status = ''
    address = ''
    phone = ''
    email = ''
    director = ''
    director_phone = ''
    accountant = ''
    accountant_phone = ''
    business_lines = ''

    def __repr__(self):
        # Shows only the attributes that were set on this instance.
        return str(self.__dict__)
```
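As a possible variation (not the approach used in this article), the same record could be written as a dataclass, so that every field appears in the printed output even when it was never set. The sketch below is shortened to four fields, and `CompanyRecord` is a hypothetical name chosen for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class CompanyRecord:
    # Hypothetical, shortened variant of Company: only four fields shown.
    official_name: str = ''
    bussiness_code: str = ''  # spelling kept to match the article's field name
    phone: str = ''
    email: str = ''


record = CompanyRecord(official_name='Example Co., Ltd', phone='0123456789')
print(asdict(record))  # every field is present; unset ones stay ''
```

`asdict` also gives a plain dictionary, which tends to be convenient when the records are later written to a file.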
Okay, the first step is done. Simple.
Step 2: Determine how the website loads data
The simplest and most desirable case is a website that loads its data by calling an API. Open the browser's developer tools and check the Network tab. As luck would have it, the page does call an API, but it is only an ads API, so I gave up on that route and decided to parse the HTML instead.
Keep using the Network tab to find the request that returns the data. Fortunately, a plain GET request to the company URL returns everything we need.
That's step 2 done, and it seems as easy as eating porridge.
Step 3: Choose the technology for scraping
As I said, the language is Python. For parsing HTML there is a famous and familiar library, Beautiful Soup. It is a Python library that can work with several different parsers (for example Python's built-in html.parser, or lxml). I won't cover setup here; see the Beautiful Soup documentation for more.
Step 4: Put everything together and start scraping
Configure logging so we can use Python's built-in logging module, which is very convenient for debugging (although, to be honest, plain print() is still my favorite).
```python
import logging

log_format = '[%(levelname)s] - %(message)s'
logging.basicConfig(level='INFO', format=log_format)
```
Looking at how the site paginates, we can see that the page number is the last segment of the URL path. So we need to specify a start page and an end page to retrieve data from.
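The paging rule can be sketched as a small helper. This is only an illustration under stated assumptions: the base URL ends with a slash, page 1 is the base URL itself, and later pages simply append the page number. `page_url` and the example.com address are placeholders, not real vinabiz.org paths.

```python
def page_url(base_url, page):
    """Return the listing URL for a 1-based page number."""
    page = int(page)
    if page < 1:
        raise ValueError('page must be >= 1')
    # Page 1 is the base URL itself; later pages append the number.
    return base_url if page == 1 else base_url + str(page)


base = 'https://example.com/listing/'  # placeholder, not a real vinabiz.org path
for page in range(2, 5):               # start=2, end=4, both inclusive
    print(page_url(base, page))
```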
Start by declaring the necessary arguments:
- url: the link containing the list of businesses; it can be taken per province, city, district, and so on
- start: the first page to retrieve
- end: the last page to retrieve
- out: the file to save the scraped data to
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--url", "-u", help="base url")
parser.add_argument("--start", "-s", help="start page")
parser.add_argument("--end", "-e", help="end page")
parser.add_argument("--out", "-o", help="output file")
args = parser.parse_args()
```
If you want to be fancy, you can add validation for the arguments. This part is optional, so I'll quickly do it with a few if checks.
```python
import sys


def check_input():
    # Exit with a non-zero status as soon as one argument is missing or invalid.
    if args.url is None:
        logging.error('Please enter base url')
        sys.exit(1)
    if args.start is None:
        logging.error('Please enter start page')
        sys.exit(1)
    if int(args.start) <= 0:
        logging.error('Please enter start page > 0')
        sys.exit(1)
    if args.end is None:
        logging.error('Please enter end page')
        sys.exit(1)
    if int(args.start) > int(args.end):
        logging.error('Please enter start page <= end page')
        sys.exit(1)
    if args.out is None:
        logging.error('Please enter output file')
        sys.exit(1)
```
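As an alternative to hand-written checks, argparse itself can enforce most of these rules. The following is a minimal sketch, not the code used in this article: `positive_int`, `build_parser`, and `parse_cli` are hypothetical helper names, and the example.com URL is a placeholder.

```python
import argparse


def positive_int(text):
    """argparse type: accept only integers greater than zero."""
    value = int(text)  # a ValueError here is reported by argparse as invalid
    if value <= 0:
        raise argparse.ArgumentTypeError('must be greater than 0')
    return value


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--url', '-u', required=True, help='base url')
    parser.add_argument('--start', '-s', required=True, type=positive_int, help='start page')
    parser.add_argument('--end', '-e', required=True, type=positive_int, help='end page')
    parser.add_argument('--out', '-o', required=True, help='output file')
    return parser


def parse_cli(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.start > args.end:
        # parser.error prints the message and exits with status 2
        parser.error('start page must not be greater than end page')
    return args


cli_args = parse_cli(['-u', 'https://example.com/listing/', '-s', '1', '-e', '3', '-o', 'out.txt'])
print(cli_args.start, cli_args.end)
```

With this variant the page numbers arrive as integers already, so no `int(...)` conversions are needed later on.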
The main crawl flow will behave as follows:
- Request the URL containing the business listing -> get the list of links to each business's detail page
- Request each link obtained in the previous step -> parse the returned HTML to extract the necessary information
- After getting all the businesses on this page, request the next page and repeat the extraction.
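The flow above can be sketched as two nested loops. In this illustration the two callables are hypothetical stand-ins for the real functions (something like `request_list_company` and `get_company_details`), so the sketch runs without touching the network.

```python
def crawl(start, end, list_page, fetch_detail):
    """Visit pages start..end (inclusive) and collect one result per business.

    list_page(page)   -> list of detail URLs found on that listing page
    fetch_detail(url) -> parsed record for one business
    """
    results = []
    for page in range(int(start), int(end) + 1):
        for detail_url in list_page(page):
            results.append(fetch_detail(detail_url))
    return results


# Stand-ins so the sketch runs offline; the paths are made up.
fake_listing = {1: ['/a', '/b'], 2: ['/c']}
records = crawl(1, 2, lambda page: fake_listing.get(page, []), lambda url: {'url': url})
print(records)
```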
To make HTTP requests, I use Python's Requests library. After installing it, just import it in the code file.
```python
import requests
```
This function retrieves the list of URLs of business detail pages for a given listing page. The businesses are laid out as a list of <div> elements with the class "row margin-right-15 margin-left-10". After sending the request and receiving the HTML, I use Beautiful Soup to filter out all of those divs and take the href attribute of the <a> tag inside each one, which is the link to the business's detail page.
```python
from bs4 import BeautifulSoup


def request_list_company(page):
    company_url_list = []
    logging.info("getting list of company in page " + str(page))
    url = args.url
    # Page 1 is the base URL itself; later pages append the page number.
    if int(page) > 1:
        url = url + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    list_of_company_div = soup.find_all("div", class_="row margin-right-15 margin-left-10")
    for company_div in list_of_company_div:
        link = company_div.find('a')
        # Skip divs that have no <a> tag or no href instead of raising.
        if link is not None and link.get('href'):
            company_url_list.append(link['href'])
    logging.info('Get total ' + str(len(company_url_list)) + ' company url')
    return company_url_list
```
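To see the extraction step in isolation, here is a small offline sketch. It deliberately swaps Beautiful Soup for the standard library's `html.parser`, so it runs with no extra packages. The sample HTML, the `/company/...` paths, and the class name `CompanyLinkParser` are made up for illustration and are not the site's real markup.

```python
from html.parser import HTMLParser

TARGET_CLASS = 'row margin-right-15 margin-left-10'


class CompanyLinkParser(HTMLParser):
    """Collect the first <a href> inside each <div> with TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.depth = 0           # <div> nesting depth inside a matching div
        self.link_taken = False  # True once the first <a> of that div was seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth:
                self.depth += 1
            elif attrs.get('class') == TARGET_CLASS:
                self.depth = 1
                self.link_taken = False
        elif tag == 'a' and self.depth and not self.link_taken:
            if attrs.get('href'):
                self.links.append(attrs['href'])
            self.link_taken = True

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


sample = '''
<div class="row margin-right-15 margin-left-10">
  <div><a href="/company/1">Company One</a></div>
</div>
<div class="other"><a href="/ignored">x</a></div>
<div class="row margin-right-15 margin-left-10"><a href="/company/2">Company Two</a></div>
'''
link_parser = CompanyLinkParser()
link_parser.feed(sample)
print(link_parser.links)
```

Links outside the matching divs are ignored, which is the same filtering idea as the `find_all` call above.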
Once I have the list of detail URLs, I use a simple loop and write a function to retrieve the details of each business. At this point I discovered something quite interesting: if you are not logged in, some information on the detail page, such as the email and the director's phone number, is not displayed. This is a fun little trick of the website, so I looked for a way around it.
Inspecting my requests, I found that the server checks the user's login session through the attached cookies.
To get around it, simply log in to the site, then declare a cookie matching that login session and send it with the request:
```python
cookie = '__cfduid=dba2b91eb8eca08fdd298...'

def get_company_details(url):
    url = 'https://vinabiz.org/' + url
    logging.info('Get company details in ' + url)
    response = requests.get(url, headers={'Cookie': cookie})
```
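The Cookie header copied from the browser is one long string. If it is more convenient to work with individual cookies, for example to pass them through the `cookies=` parameter of `requests.get`, a small helper could split the string. This is only a sketch: `cookie_header_to_dict` is a hypothetical name and the sample values are made up.

```python
def cookie_header_to_dict(header):
    """Split 'name=value; other=value' into a {name: value} dict."""
    cookies = {}
    for part in header.split(';'):
        # Only the first '=' separates name and value.
        name, sep, value = part.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


raw = 'sessionid=abc123; theme=dark'  # made-up values, not real vinabiz.org cookies
print(cookie_header_to_dict(raw))
```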
The response holds the returned HTML containing the information; next comes a function to extract the information from this pile of HTML. The data sits in a table with the class "table table-bordered", with each piece of information in a particular row and column of that table.
Write a function that takes the rows of this table, obtained from the response in get_company_details(url), and returns a Company object as declared at the beginning of the article.
```python
def parse_company_detail(rows):
    def cell(row, col):
        # Text of the <td> at the given row and column, whitespace-stripped.
        return rows[row].find_all('td')[col].get_text().strip()

    emailCode = None
    company = Company()
    company.official_name = cell(1, 1)
    company.trading_name = cell(1, 3)
    company.bussiness_code = cell(2, 1)
    company.date_of_license = cell(2, 3)
    company.start_working_date = cell(3, 3)
    company.status = rows[4].find_all('td')[1].find_all(
        'div', class_='alert alert-success fade in')[0].get_text().strip()
    company.address = cell(7, 1)
    company.phone = cell(8, 1)
    company.email = cell(9, 1)  # row 9 is assumed to be the email row
    company.director = cell(12, 1)
    company.director_phone = cell(12, 3)  # column 3 assumed, mirroring the accountant row
    company.accountant = cell(14, 1)
    company.accountant_phone = cell(14, 3)
    return company
```
After finishing this part I thought I was done, but no. When I logged the extracted data, the email showed up as [email protected]. Looking back at the tag containing the email, it turns out the address is obfuscated (this is Cloudflare's email protection) as follows:
```html
<a href="/cdn-cgi/l/email-protection#2e4d4140495a574a435e004d4100425a4a6e49434f4742004d4143">
  <span class="__cf_email__" data-cfemail="c1a2aeafa6b5b8a5acb1efa2aeefadb5a581a6aca0a8adefa2aeac">[email&#160;protected]</span>
</a>
```
Since the email still shows up normally in a browser, I guessed that it is decoded by JavaScript on the client after the page loads. A bit more searching turned up the file email-decode.js, which is exactly what we need.
The problem is that this is JavaScript, so it has to be converted to Python. During the conversion I found several functions that have nothing to do with decoding and only fix up the HTML elements afterwards. After converting email-decode.js to Python, we get:
```python
import logging
import urllib.parse


def r(e, t):
    # Read the two hex digits at position t as one byte value.
    return int(e[t:t + 2], base=16)


def decode(n, c):
    o = ''
    a = r(n, c)  # the first byte is the XOR key
    # Every following pair of hex digits is one character XORed with the key.
    for i in range(c + 2, len(n), 2):
        o += chr(r(n, i) ^ a)
    try:
        # Round trip through percent-encoding; ordinary text comes back unchanged.
        o = urllib.parse.unquote(urllib.parse.quote(o))
        return o
    except Exception as e:
        logging.error(str(e))
```
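To see what the decoding does, here is a small round trip. The encoded string is a key byte followed by each character XORed with that key, all written as two-digit hex. This is only a sketch: the `encode` helper and the sample address are made up for illustration, and `decode` is restated in compact form so the block runs on its own.

```python
def decode(encoded, start=0):
    """Decode a hex string whose first byte is the XOR key for the rest."""
    key = int(encoded[start:start + 2], 16)
    chars = []
    for pos in range(start + 2, len(encoded), 2):
        chars.append(chr(int(encoded[pos:pos + 2], 16) ^ key))
    return ''.join(chars)


def encode(address, key):
    """Hypothetical inverse of decode, used only to build sample input."""
    return '%02x' % key + ''.join('%02x' % (ord(ch) ^ key) for ch in address)


# Made-up address and key, not taken from the site.
sample = encode('sales@example.com', 0x5A)
print(sample)          # hex string: key byte first, then the XORed characters
print(decode(sample))  # the original address again
```

Because XOR with the same key undoes itself, decoding is just the encoding step applied a second time.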
With the email decoding function in place, change how the email field of the Company object is set:
```python
# The obfuscated address is stored in the span's data-cfemail attribute.
email_span = rows[9].find_all('td')[1].find('span', class_='__cf_email__')
if email_span is not None and email_span.get('data-cfemail'):
    company.email = decode(email_span['data-cfemail'], 0)
else:
    company.email = ''
```
Update the get_company_details(url) function:
```python
def get_company_details(url):
    url = 'https://vinabiz.org/' + url
    logging.info('Get company details in ' + url)

    response = requests.get(url, headers={'Cookie': cookie})
    soup = BeautifulSoup(response.content, 'html.parser')
    rows = soup.find_all("table", class_="table table-bordered")[0].find_all('tr')
    company = parse_company_detail(rows)
    company_arr.append(company)
```
Here is the complete crawl flow. The business data is stored as a list of Company objects in the variable company_arr:
```python
def craw():
    company_arr.clear()
    # Walk the listing pages from args.start to args.end, inclusive.
    for i in range(int(args.start), int(args.end) + 1):
        company_url_list = request_list_company(i)
        for company_url in company_url_list:
            get_company_details(company_url)
    logging.info('Get information of total ' + str(len(company_arr)) + ' companies')
```
The last step is writing the data to an Excel file. To read and write Excel files I use the Python xlwt library; you can read how to install and use it on its home page. First, we need a function that writes the header row of the file:
```python
def write_sheet_header(sheet):
    sheet_header = ['Tên chính thức', 'Tên giao dịch', 'Mã doanh nghiệp',
                    'Ngày cấp', 'Ngày bắt đầu hoạt động', 'Trạng thái',
                    'Địa chỉ', 'Điện thoại', 'Email', 'Giám đốc',
                    'SĐT giám đốc', 'Kế toán', 'SĐT kế toán', 'Nghành nghề']
    # Row 0 is the header row; one column per field.
    for col, header in enumerate(sheet_header):
        sheet.write(0, col, header)
```
Next, write the function that writes the data from the company_arr list to the file:
```python
# Column order matches the header row written by write_sheet_header.
COMPANY_FIELDS = [
    'official_name', 'trading_name', 'bussiness_code', 'date_of_license',
    'start_working_date', 'status', 'address', 'phone', 'email', 'director',
    'director_phone', 'accountant', 'accountant_phone', 'business_lines',
]


def write_sheet_data(sheet, data):
    # Row 0 holds the header, so company rows start at 1.
    for row, company in enumerate(data, start=1):
        for col, field in enumerate(COMPANY_FIELDS):
            sheet.write(row, col, str(getattr(company, field)))
```
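One thing to watch when turning a Company object into a spreadsheet row: an instance's `__dict__` only holds attributes that were assigned on that instance, not the defaults declared on the class. If a field is never set for some company, a row built from `__dict__` is shorter and its columns shift. The sketch below shows the difference with a shortened, two-field version of the class; the sample values are made up.

```python
class Company:
    # Class-level defaults, as in the Company class defined earlier (shortened).
    official_name = ''
    email = ''


company = Company()
company.official_name = 'Example Co'  # email is never assigned on the instance

instance_fields = list(company.__dict__.keys())
all_fields = ['official_name', 'email']  # explicit column order

row_from_dict = [getattr(company, f) for f in instance_fields]
row_from_list = [getattr(company, f) for f in all_fields]

print(row_from_dict)  # only the attribute that was assigned
print(row_from_list)  # one value per column, class defaults included
```

With an explicit field list, `getattr` falls back to the class default, so every row has the same number of columns as the header.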
The complete function for writing the results to the file:
```python
def write_result(data):
    file = args.out + '.xls'
    logging.info('Save result to file')
    wb = Workbook()
    sheet = wb.add_sheet('Data')
    write_sheet_header(sheet)
    write_sheet_data(sheet, data)
    wb.save(file)
    logging.info('Saved to ' + file)
```
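If a plain CSV file is enough (Excel opens those too), the same header-plus-rows layout can be sketched with only the standard library. This is a swapped-in alternative to xlwt, not what the script above does; `rows_to_csv` and the sample data are hypothetical.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Return CSV text with one header line followed by one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data with two of the columns used above.
header = ['Tên chính thức', 'Email']
rows = [['Example Co', 'sales@example.com']]
csv_text = rows_to_csv(header, rows)
print(csv_text)
```

To write straight to disk, the same writer can be pointed at a file opened with `newline=''`; using the `utf-8-sig` encoding usually helps Excel display the Vietnamese headers correctly.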
Everything is 99% done. I just need one more main function that runs when the file is executed:
```python
def main():
    check_input()
    craw()
    # Save what the crawl collected; without this call nothing is written.
    write_result(company_arr)


if __name__ == "__main__":
    main()
```
Now let's do a test run and see the results.
Open a terminal and type
Result
That’s it =)) Now you can use the data for whatever you need.
The full source is here. I hope you find it useful, and don’t spam too much.