Today I will show you how to pull data from any website using Laravel, proxies, and an HTML DOM parser.
As the example in this post, I will crawl product listings from Amazon.
Installation
First, download the file simple_html_dom.php (the PHP Simple HTML DOM Parser library) and put it in a folder of your Laravel project, for example app/Helpers. I created this folder myself; you can use any folder you like.
Then open composer.json and add the path of that file to the autoload section:
```json
"autoload": {
    "files": [
        "app/Helpers/simple_html_dom.php"
    ],
    "psr-4": {
        "App\\": "app/"
    },
    "classmap": [
        "database/seeds",
        "database/factories"
    ]
},
```
The only new entry is the path under "files"; the rest is what Laravel already ships with.
Then run composer dump-autoload so that Composer loads this file for the Laravel application.
Code
To crawl the data I create a command, and that command dispatches Laravel jobs. With this setup the whole crawl can run automatically and the work goes onto a queue, so we can use supervisor to start several worker processes at the same time. I recommend running at most 5 processes at once: Amazon blocks any IP that sends many requests in a short period (you can get around this with public or private proxies).
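One simple way to spread requests over several proxies is round-robin selection. The sketch below is only an illustration: the `ProxyPool` class and the proxy addresses are hypothetical and are not part of the original code.

```php
<?php

/**
 * Hypothetical helper (not from the original post): hands out proxies
 * from a fixed list in round-robin order.
 */
final class ProxyPool
{
    /** @var string[] */
    private $proxies;

    /** @var int Index of the next proxy to hand out. */
    private $cursor = 0;

    /**
     * @param string[] $proxies Proxy URLs, e.g. "http://10.0.0.1:8080" (placeholder values).
     */
    public function __construct(array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('At least one proxy is required.');
        }
        // Re-index so the cursor arithmetic works for any input keys.
        $this->proxies = array_values($proxies);
    }

    /** Returns the next proxy and wraps around at the end of the list. */
    public function next(): string
    {
        $proxy = $this->proxies[$this->cursor];
        $this->cursor = ($this->cursor + 1) % count($this->proxies);

        return $proxy;
    }
}

// Usage: each call yields the next proxy in the list.
$pool = new ProxyPool(['http://10.0.0.1:8080', 'http://10.0.0.2:8080']);
$firstProxy = $pool->next();
$secondProxy = $pool->next();
```

Each queue worker is a separate PHP process, so every worker would keep its own cursor; the rotation is per process, not global.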
- First, create a file AwsProductCrawler.php in the app/Console/Commands directory with the following content:
```php
<?php

namespace App\Console\Commands;

use App\Jobs\AwsCrawlerLink;
use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;

class AwsProductCrawler extends Command
{
    /** Number of merchants read from the database per chunk. */
    const LIMIT = 25;

    /**
     * The name and signature of the console command.
     *
     * @var string
     */
    protected $signature = 'aws:product';

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = 'aws product crawler, run one time a week';

    /**
     * Create a new command instance.
     *
     * @return void
     */
    public function __construct()
    {
        parent::__construct();
    }

    /**
     * Execute the console command.
     *
     * @return mixed
     */
    public function handle()
    {
        // Read merchants in chunks and queue one link-crawling job per merchant.
        DB::table('merchants')->orderBy('id')->chunk(self::LIMIT, function ($merchants) {
            foreach ($merchants as $merchant) {
                AwsCrawlerLink::dispatch($merchant);
            }
        });

        return;
    }
}
```
This file is quite simple: it only reads the merchants that need to be crawled from the database. The merchants table needs a merchant_id column, which is used to open that merchant's listing and fetch all of its products.
Next, create a file AwsCrawlerLink.php in Laravel's Jobs directory with the following content:
```php
<?php

namespace App\Jobs;

use App\Helpers\AwsClient;
use Exception;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Log;

class AwsCrawlerLink implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // CSS selector of the product links on a seller listing page.
    const CLASS_DETAIL_PRODUCT = '.a-text-normal';
    // Maximum run time of one job, in seconds.
    const TIME_OUT = 300;

    protected $seller;

    /**
     * Create a new job instance.
     *
     * @param object $seller Row from the merchants table
     * @return void
     */
    public function __construct($seller)
    {
        $this->seller = $seller;
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {
        try {
            set_time_limit(self::TIME_OUT);
            Log::debug('Start crawl product link, seller = ' . $this->seller->merchant_id);
            $awsBaseUrl = env('BASE_AWS_URL', config('common.default_aws_url'));
            $sellerBaseUrl = $awsBaseUrl . '/s?me=' . $this->seller->merchant_id;
            $endPage = $this->countPage($sellerBaseUrl);
            for ($i = 1; $i <= $endPage; $i++) {
                $urlWithPage = $sellerBaseUrl . '&page=' . $i;
                Log::debug('Start get list products, url = ' . $urlWithPage);
                $html = AwsClient::getContent($urlWithPage);
                // getContent returns an array on failure; skip this page.
                if (is_array($html)) {
                    continue;
                }
                $html = str_get_html($html);
                // str_get_html returns false when the markup cannot be parsed.
                if (!$html) {
                    continue;
                }
                foreach ($html->find(self::CLASS_DETAIL_PRODUCT) as $productDetailUrl) {
                    if (!empty($urlDetail = $productDetailUrl->href)) {
                        $urlDetail = env('BASE_AWS_URL', 'https://www.amazon.co.jp') . $urlDetail;
                        // Queue one job per product detail page.
                        AwsCrawlerDetail::dispatch($this->seller->id, $urlDetail);
                    }
                }
            }
        } catch (Exception $exception) {
            report($exception);
        }
        Log::debug('End crawl product, seller = ' . $this->seller->merchant_id);
    }

    /**
     * Read the number of result pages from the pagination of the first page.
     *
     * @param string $urlAwsSeller
     * @return int
     */
    private function countPage($urlAwsSeller)
    {
        Log::debug("Start get count page, url = {$urlAwsSeller}");
        $html = AwsClient::getContent($urlAwsSeller);
        if (is_array($html)) {
            return 0;
        }
        $html = str_get_html($html);
        if (!$html) {
            return 0;
        }

        // More than 9 pages: the last page number is the second '.a-disabled' element.
        $page = $html->find('.a-disabled', 1);
        if ($page && is_numeric(trim($page->plaintext))) {
            $pageCount = (int) trim($page->plaintext);
        }

        // 9 pages or fewer: fall back to the last '.a-normal' element.
        // (The original always ran this lookup, so it could overwrite the value found above.)
        if (!isset($pageCount)) {
            $pages = $html->find('.a-normal');
            $page = $pages ? end($pages) : null;
            if ($page && is_numeric(trim($page->plaintext))) {
                $pageCount = (int) trim($page->plaintext);
            }
        }

        AwsClient::cleanHtml($html);

        if (isset($pageCount)) {
            Log::debug("page count is {$pageCount}");
            return $pageCount;
        }

        Log::error("======= Cannot get countPage urlAwsSeller = {$urlAwsSeller} or maybe count Page = 1");
        return 1;
    }
}
```
This file collects the URLs of all of the seller's products and then pushes each one to AwsCrawlerDetail, which fetches the product details.
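The paging in the job follows the pattern `/s?me=<merchant_id>&page=<n>`. As a minimal sketch of that step in isolation, the function below builds the list of page URLs for one seller. The function name `sellerPageUrls` and the merchant ID are made up for illustration.

```php
<?php

/**
 * Builds the listing URL for every result page of one seller,
 * using the '/s?me=<merchant_id>&page=<n>' pattern from the job.
 * Hypothetical helper, not part of the original code.
 *
 * @return string[]
 */
function sellerPageUrls(string $baseUrl, string $merchantId, int $endPage): array
{
    $urls = [];
    for ($page = 1; $page <= $endPage; $page++) {
        // http_build_query also URL-encodes the values.
        $query = http_build_query(['me' => $merchantId, 'page' => $page], '', '&');
        $urls[] = rtrim($baseUrl, '/') . '/s?' . $query;
    }

    return $urls;
}

// Usage with a placeholder merchant ID.
foreach (sellerPageUrls('https://www.amazon.co.jp', 'EXAMPLE_MERCHANT_ID', 3) as $url) {
    echo $url, PHP_EOL;
}
```

Keeping URL construction in a pure function like this makes it easy to check without sending any request.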
The request helper behind getContent (the snippet names it getData) looks like this:
```php
public static function getData($url, $proxy = false)
{
    $client = new Client();
    try {
        // No proxy given: send a plain request.
        if (!$proxy) {
            $content = $client->get($url);

            return $content->getBody()->getContents();
        }

        // Proxy given: route the request through it, with timeouts and a browser-like User-Agent.
        $content = $client->get($url, [
            'proxy' => $proxy,
            'connect_timeout' => 20,
            'timeout' => 60,
            'allow_redirects' => false,
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13',
            ]
        ]);

        return $content->getBody()->getContents();
    } catch (\Exception $exception) {
        // Fully qualified \Exception so the catch also works inside a namespaced class.
        Log::error("(getData) Exception messages = {$exception->getMessage()}");
        Log::error("(getData) status code = {$exception->getCode()}");

        // On failure the caller gets an array instead of an HTML string.
        return [
            'error' => true,
            'code' => $exception->getCode()
        ];
    }
}
```
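Since Amazon blocks IPs that send too many requests, it can help to wrap a fetcher like `getData()` in a small retry loop that walks through a list of proxies. The sketch below is only an illustration: `fetchWithProxyRotation()`, the fake fetcher and the proxy addresses are made-up names, and the only thing it assumes about the fetcher is the contract shown above (an HTML string on success, an array with `error` and `code` on failure).

```php
<?php

/**
 * Try a URL through several proxies until one of them returns HTML.
 *
 * Hypothetical helper, not part of the original code. $fetch must follow the
 * same contract as getData(): it returns the response body as a string on
 * success, or ['error' => true, 'code' => ...] on failure.
 */
function fetchWithProxyRotation(callable $fetch, string $url, array $proxies, int $maxAttempts = 3)
{
    // Shuffle so that parallel workers do not all start with the same proxy.
    shuffle($proxies);

    $lastError = ['error' => true, 'code' => 0];
    foreach (array_slice($proxies, 0, $maxAttempts) as $proxy) {
        $result = $fetch($url, $proxy);
        if (!is_array($result)) {
            return $result; // success: raw HTML string
        }
        $lastError = $result; // remember the failure and try the next proxy
    }

    // Every attempt failed (or no proxy was given): return the last error array.
    return $lastError;
}

// Usage with a fake fetcher (no network involved): only one proxy "works".
$fakeFetch = function (string $url, $proxy) {
    return $proxy === 'http://10.0.0.2:8080'
        ? '<html><body>ok</body></html>'
        : ['error' => true, 'code' => 503];
};

$html = fetchWithProxyRotation($fakeFetch, 'https://example.com/item', [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
]);
```

In a real job you could pass the static method as the callable, for example `[AwsClient::class, 'getData']`, assuming `getData()` lives in that helper class. Keeping the fetcher as a parameter also makes the retry logic easy to test without touching the network.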
Finally, the part that fetches the product details, which is also the longest and hardest part. The contents of AwsCrawlerDetail.php are as follows:
```php
<?php

namespace App\Jobs;

use App\Helpers\AwsClient;
use App\Models\Category;
use App\Models\Product;
use App\Models\ProductDetail;
use App\Models\ProductStar;
use Carbon\Carbon;
use Exception;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Log;

class AwsCrawlerDetail implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    protected $seller_id;

    const TIME_OUT = 300; // seconds

    protected $detailProductUrl;

    protected $asin;

    /**
     * Create a new job instance.
     * @param $sellerId
     * @param $detailProductUrl
     * @param $asin
     * @return void
     */
    public function __construct($sellerId, $detailProductUrl, $asin = null)
    {
        $this->seller_id = $sellerId;
        $this->detailProductUrl = $detailProductUrl;
        $this->asin = $asin;
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {
        set_time_limit(self::TIME_OUT);
        Log::debug("Start crawl product detail, url = " . $this->detailProductUrl);
        // Skip products that were already crawled today.
        if ($this->checkAlreadyCrawl()) {
            Log::debug('product already crawl, end this product!');
            return;
        }
        try {
            $html = AwsClient::getContent($this->detailProductUrl);
            // An array means the request failed (error + status code), so skip this product.
            if (is_array($html)) {
                Log::debug('ignore this product content = ', $html);
                return;
            }
            $html = str_get_html($html);
            if (!$html) {
                Log::debug("content null");
                AwsClient::cleanHtml($html);
                return;
            }
            $arrProduct = $this->getProductData($html);
            $productDetail = $this->getProductDetail($html);
            $productStar = $this->getProductReviewStartDetail($html, $productDetail['review_count']);
            if (!empty($arrProduct['asin'])) {
                Log::debug('save product: ', $arrProduct);
                $product = Product::saveProduct($arrProduct, $this->asin);
                ProductDetail::saveProductDetail($product, $productDetail);
                ProductStar::saveProductStar($product, $productStar);
                Category::saveCategory($product, ['name' => $productDetail['category']]);
            }
            AwsClient::cleanHtml($html);
        } catch (Exception $exception) {
            Log::error($exception->getMessage());
        }
        return;
    }

    public function checkAlreadyCrawl()
    {
        if ($this->asin) {
            $product = Product::where('asin', $this->asin)->first();
        } else {
            $product = Product::where('detail_aws_url', $this->detailProductUrl)->first();
        }
        if (!$product) {
            return false;
        }
        // Already crawled if a detail row for this product was created today.
        return ProductDetail::where('product_id', $product->id)
            ->whereDate('created_at', Carbon::now()->format('Y-m-d'))
            ->first();
    }

    /**
     * get product data from html dom
     * @param $html
     * @return array
     */
    public function getProductData($html)
    {
        $asin = $html->find('#cerberus-data-metrics', 0);
        if ($asin) {
            $asin = $asin->getAllAttributes();
        }
        $img = $html->find('#imgTagWrapperId img', 0);
        if ($img) {
            $img = $img->getAttribute('data-old-hires');
        }
        $sellAt = $html->find('.date-first-available .value', 0)->plaintext ?? null;
        if (!$sellAt) {
            $sellAt = $html->find('#productDetailsTable ul li', 4)->plaintext ?? null;
            $sellAt = str_replace('Amazon.co.jp での取り扱い開始日:', '', $sellAt);
        }
        return [
            'name' => $html->find('#productTitle', 0)->plaintext ?? null,
            'url_img' => $img,
            'asin' => $asin['data-asin'] ?? null,
            'seller_id' => $this->seller_id,
            'sell_at' => trim($sellAt),
            'detail_aws_url' => $this->detailProductUrl,
        ];
    }

    /**
     * get product detail from html dom
     * @param $html
     * @return array
     */
    public function getProductDetail($html)
    {
        $asin = $html->find('#cerberus-data-metrics', 0);
        if ($asin) {
            $asin = $asin->getAllAttributes();
        }
        $avgReview = $html->find('#acrPopover', 0)->title ?? 0;
        $acrCustomerReviewText = $html->find('#acrCustomerReviewText', 0)->plaintext ?? 0;
        $ranking = $html->find('#SalesRank .value', 0)->innertext ?? 0;
        if (!$ranking) {
            $ranking = $html->find('#SalesRank', 0)->innertext ?? 0;
        }
        if (!$ranking) {
            $ranking = $html->find(
```
<span class="token 
single-quoted-string string">'.pdTab'</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token operator">-</span><span class="token operator">></span><span class="token property">innertext</span> <span class="token operator">?</span><span class="token operator">?</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token variable">$ranking</span> <span class="token operator">=</span> <span class="token variable">$this</span><span class="token operator">-</span><span class="token operator">></span><span class="token function">getCatAndRank</span><span class="token punctuation">(</span><span class="token variable">$ranking</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token punctuation">[</span> <span class="token single-quoted-string string">'price'</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token function">str_replace</span><span class="token punctuation">(</span><span class="token single-quoted-string string">','</span><span class="token punctuation">,</span> <span class="token single-quoted-string string">''</span><span class="token punctuation">,</span> <span class="token variable">$asin</span><span class="token punctuation">[</span><span class="token single-quoted-string string">'data-asin-price'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">?</span><span class="token operator">?</span> <span class="token number">0</span><span class="token punctuation">,</span> <span class="token single-quoted-string string">'currency_code'</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token variable">$asin</span><span class="token 
punctuation">[</span><span class="token single-quoted-string string">'data-asin-currency-code'</span><span class="token punctuation">]</span> <span class="token operator">?</span><span class="token operator">?</span> <span class="token single-quoted-string string">'JPY'</span><span class="token punctuation">,</span> <span class="token single-quoted-string string">'avg_review'</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token variable">$this</span><span class="token operator">-</span><span class="token operator">></span><span class="token function">getAvgReviewFromString</span><span class="token punctuation">(</span><span class="token variable">$avgReview</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token single-quoted-string string">'review_count'</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token variable">$this</span><span class="token operator">-</span><span class="token operator">></span><span class="token function">getNumberFromString</span><span class="token punctuation">(</span><span class="token variable">$acrCustomerReviewText</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token single-quoted-string string">'ranking'</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token variable">$this</span><span class="token operator">-</span><span class="token operator">></span><span class="token function">getRankingFromString</span><span class="token punctuation">(</span><span class="token variable">$ranking</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token single-quoted-string string">'category'</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token variable">$this</span><span class="token operator">-</span><span 
class="token operator">></span><span class="token function">getCatFromString</span><span class="token punctuation">(</span><span class="token variable">$ranking</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">/** * get start count for product review count * @param $html * @param $total * @return mixed */</span> <span class="token keyword">public</span> <span class="token keyword">function</span> <span class="token function">getProductReviewStartDetail</span><span class="token punctuation">(</span><span class="token variable">$html</span><span class="token punctuation">,</span> <span class="token variable">$total</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token variable">$arr</span><span class="token punctuation">[</span><span class="token single-quoted-string string">'total_star'</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token variable">$total</span><span class="token punctuation">;</span> <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token variable">$i</span> <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span> <span class="token variable">$i</span> <span class="token operator"><=</span> <span class="token number">5</span><span class="token punctuation">;</span> <span class="token variable">$i</span><span class="token operator">++</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token variable">$star</span> <span class="token operator">=</span> <span class="token variable">$this</span><span class="token operator">-</span><span class="token operator">></span><span class="token function">getAStart</span><span 
class="token punctuation">(</span><span class="token variable">$html</span><span class="token punctuation">,</span> <span class="token variable">$i</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// percent</span> <span class="token variable">$star</span> <span class="token operator">=</span> <span class="token punctuation">(</span><span class="token variable">$star</span> <span class="token operator">*</span> <span class="token variable">$total</span><span class="token punctuation">)</span> <span class="token operator">/</span> <span class="token number">100</span><span class="token punctuation">;</span> <span class="token variable">$arr</span><span class="token punctuation">[</span><span class="token double-quoted-string string">"star_<span class="token interpolation"><span class="token variable">$i</span></span>"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token punctuation">(</span>int<span class="token punctuation">)</span><span class="token function">round</span><span class="token punctuation">(</span><span class="token variable">$star</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">return</span> <span class="token variable">$arr</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">/** * get data a star * @param $html * @param $int * @return int */</span> <span class="token keyword">public</span> <span class="token keyword">function</span> <span class="token function">getAStart</span><span class="token punctuation">(</span><span class="token variable">$html</span><span class="token punctuation">,</span> <span class="token variable">$int</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token variable">$star</span> <span 
class="token operator">=</span> <span class="token variable">$html</span><span class="token operator">-</span><span class="token operator">></span><span class="token function">find</span><span class="token punctuation">(</span><span class="token double-quoted-string string">"#histogramTable .<span class="token interpolation"><span class="token punctuation">{</span><span class="token variable">$int</span><span class="token punctuation">}</span></span>star"</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token operator">!</span><span class="token variable">$star</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token variable">$star</span> <span class="token operator">=</span> <span class="token variable">$star</span><span class="token operator">-</span><span class="token operator">></span><span class="token function">getAttribute</span><span class="token punctuation">(</span><span class="token single-quoted-string string">'aria-label'</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token variable">$star</span> <span class="token operator">=</span> <span class="token variable">$this</span><span class="token operator">-</span><span class="token operator">></span><span class="token function">getNumberFromString</span><span class="token punctuation">(</span><span class="token variable">$star</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token variable">$star</span> <span class="token operator">=</span> <span class="token punctuation">(</span>int<span 
class="token punctuation">)</span><span class="token function">preg_replace</span><span class="token punctuation">(</span><span class="token double-quoted-string string">"/<span class="token interpolation"><span class="token variable">$int</span></span>/"</span><span class="token punctuation">,</span> <span class="token single-quoted-string string">''</span><span class="token punctuation">,</span> <span class="token variable">$star</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token variable">$star</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">/** * convert rate string to number * @param $str * @return mixed */</span> <span class="token keyword">public</span> <span class="token keyword">static</span> <span class="token keyword">function</span> <span class="token function">getAvgReviewFromString</span><span class="token punctuation">(</span><span class="token variable">$str</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">try</span> <span class="token punctuation">{</span> <span class="token variable">$str</span> <span class="token operator">=</span> <span class="token function">explode</span><span class="token punctuation">(</span><span class="token single-quoted-string string">'うち'</span><span class="token punctuation">,</span> <span class="token variable">$str</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token operator">!</span><span class="token variable">$str</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> Log<span class="token punctuation">:</span><span class="token 
punctuation">:</span><span class="token function">error</span><span class="token punctuation">(</span><span class="token double-quoted-string string">"cannot get rate 1"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token variable">$matches</span> <span class="token operator">=</span> <span class="token function">array_map</span><span class="token punctuation">(</span><span class="token single-quoted-string string">'floatval'</span><span class="token punctuation">,</span> <span class="token variable">$str</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token variable">$matches</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token variable">$matches</span> <span class="token operator">=</span> <span class="token function">array_filter</span><span class="token punctuation">(</span><span class="token variable">$matches</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token variable">$matches</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token number">0</span><span 
class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">return</span> <span class="token function">min</span><span class="token punctuation">(</span><span class="token variable">$matches</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token keyword">catch</span> <span class="token punctuation">(</span><span class="token class-name">Exception</span> <span class="token variable">$exception</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token function">report</span><span class="token punctuation">(</span><span class="token variable">$exception</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> Log<span class="token punctuation">:</span><span class="token punctuation">:</span><span class="token function">error</span><span class="token punctuation">(</span><span class="token double-quoted-string string">"cannot get rate 2"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">/** * get category and product ranking * @param $str * @return string */</span> <span class="token keyword">public</span> <span class="token keyword">function</span> <span class="token function">getCatAndRank</span><span class="token punctuation">(</span><span class="token variable">$str</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token operator">!</span><span class="token variable">$str</span><span class="token punctuation">)</span> <span class="token punctuation">{</span> <span 
class="token keyword">return</span> <span class="token variable">$str</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token variable">$str</span> <span class="token operator">=</span> <span class="token function">preg_replace</span><span class="token punctuation">(</span>'<span class="token shell-comment comment">#(<a.*?></span></span>).*?(<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>)#m', '$1$2', $str); $str = preg_replace('#(<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>ul.*?</span><span class="token punctuation">></span></span>).*?(<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>ul</span><span class="token punctuation">></span></span>)#m', '$1$2', $str); $str = preg_replace('#(<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>b.*?</span><span class="token punctuation">></span></span>).*?(<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>b</span><span class="token punctuation">></span></span>)#m', '$1$2', $str); $str = preg_replace('#(<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>tr.*?</span><span class="token punctuation">></span></span>).*?(<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>tr</span><span class="token punctuation">></span></span>)#m', '$1$2', $str); $str = preg_replace('#(<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>style.*?</span><span class="token punctuation">></span></span><span class="token style language-css"><span class="token punctuation">)</span>.*?<span class="token punctuation">(</span></span><span class="token tag"><span class="token tag"><span class="token punctuation"></</span>style</span><span 
class="token punctuation">></span></span>)#m', '$1$2', $str); $str = trim(strip_tags($str)); $str = str_replace('()', '', $str); return trim($str); } /** * get product ranking from string * @param $str * @return int */ public function getRankingFromString($str) { if (!$str) { return $str; } $ranking = explode('-', $str); if (!isset($ranking[1])) { return 0; } return $this->getNumberFromString($ranking[1]); } /** * get category from string * @param $str * @return int|string */ public function getCatFromString($str) { if (!$str) { return '未定'; } $ranking = explode('-', $str); if (!isset($ranking[0])) { return '未定'; } return trim($ranking[0]); } /** * get number from string * @param $str * @return int */ public function getNumberFromString($str) { if (!$str) { return 0; } return (int)filter_var($str, FILTER_SANITIZE_NUMBER_INT); } } |
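To make the string helpers above a bit more concrete, here is a small standalone sketch of the same parsing steps applied to sample strings. The sample inputs (`5つ星のうち4.3`, `家電＆カメラ - 1,234位`, a total of 200 reviews with 70% five-star) are assumptions for illustration, not values from a real crawl, and the functions are simplified copies without Laravel or logging.

```php
<?php
// Simplified, standalone copies of the helpers above (no Laravel, no logging).

// Keep only digits (and +/- signs), then cast to int.
function getNumberFromString($str)
{
    if (!$str) {
        return 0;
    }
    return (int)filter_var($str, FILTER_SANITIZE_NUMBER_INT);
}

// "5つ星のうち4.3" splits into ["5つ星の", "4.3"]; floatval turns that into
// [5.0, 4.3], and the smaller value is the rating itself.
function getAvgReviewFromString($str)
{
    $parts = array_filter(array_map('floatval', explode('うち', (string)$str)));
    return empty($parts) ? 0 : min($parts);
}

// Assumed sample inputs, for illustration only.
$rating = getAvgReviewFromString('5つ星のうち4.3');        // 4.3
[$category, $rank] = explode('-', '家電＆カメラ - 1,234位');
$category = trim($category);                               // 家電＆カメラ
$rank = getNumberFromString($rank);                        // 1234

// Star histogram: the percentage of one star level becomes a review count.
$total = 200;    // assumed total number of reviews
$percent = 70;   // assumed share of 5-star reviews
$fiveStar = (int)round(($percent * $total) / 100);         // 140

echo "$rating | $category | $rank | $fiveStar\n";
```

Taking the minimum in `getAvgReviewFromString` works because the other number in the string is always the scale (5), which can never be smaller than the rating.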
To run this crawler, use the following command:
php artisan aws:product
Proxy
Notice that the getContent function takes one more parameter, proxy. You can pass a proxy in the form http:192.162.1.15:8080 to hide your server's real IP address.
You can get proxies here: https://hidemy.name/en/. I bought a code from that site to use its proxies; you can buy one too, or get proxies from any other source.
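As a rough illustration of sending a request through a proxy, here is a minimal sketch using PHP's built-in HTTP stream context. It is not necessarily how getContent does it: the helper names `proxyContextOptions` and `fetchThroughProxy`, the 30-second timeout, and the way the proxy string is parsed are all assumptions made for this example.

```php
<?php
// Hypothetical helper (not from the article): turn a proxy string such as
// "http:192.162.1.15:8080" into options for stream_context_create().
function proxyContextOptions(?string $proxy): array
{
    $options = ['http' => ['timeout' => 30]]; // assumed timeout
    if ($proxy === null || $proxy === '') {
        return $options; // no proxy: the request goes out directly
    }
    // Drop an optional "http:" or "http://" prefix and keep "host:port".
    $hostPort = preg_replace('#^https?:(//)?#', '', $proxy);
    $options['http']['proxy'] = 'tcp://' . $hostPort;
    // Send the full URL in the request line, which HTTP proxies expect.
    $options['http']['request_fulluri'] = true;
    return $options;
}

// Hypothetical helper: fetch a page, optionally through the given proxy.
// Returns the response body, or false on failure.
function fetchThroughProxy(string $url, ?string $proxy = null)
{
    $context = stream_context_create(proxyContextOptions($proxy));
    return file_get_contents($url, false, $context);
}
```

The HTML returned by `fetchThroughProxy` could then be handed to simple_html_dom's `str_get_html()` for parsing. Keeping the option building in its own function makes it easy to check the proxy settings without making a network request.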