Introduction
This article shows readers who have just learned Python syntax how to use Python's requests library to download the photos of an Instagram account in a simple way.
Preparation
Install the requests library: pip install requests
Analysis
First, open the page of the user whose photos we want to crawl.
Open the developer tools (F12) and switch to the Network tab; the API request that Instagram uses to load the photos is easy to find there:
https://www.instagram.com/graphql/query/?query_hash=003056d32c2554def87228bc3fd9668a&variables=%7B%22id%22%3A%224499737748%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFEU0wtaE15VUNGLUd5dXNKR0FHbWx2UmlKS0ZlcDZBVXpFNkdTeXhycFN4SHVhVWJwZzNsTld0cU1xS1RLa1huT2w0X0dnS0tLWnVfUVlsNU5JOTJKRw%3D%3D%22%7D
With the variables URL-decoded, the API has the form:
https://www.instagram.com/graphql/query/?query_hash=003056d32c2554def87228bc3fd9668a&variables={"id":"4499737748","first":12,"after":"QVFEX0l4TElsblNiSklTSDJaXzZsLUE3ajlvTE44UktYR2lPNm1SOWtRWmR2d21VZWJNUEJKdHVXU3hIOGNDS2FKQWNhdVBaZk5wZGpmMGRkTG1rZTV6Tg=="}
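The decoded form above can also be produced programmatically instead of by string concatenation. The following is a minimal sketch, assuming the endpoint and query_hash seen in the captured request; the names GRAPHQL_URL, QUERY_HASH, and build_query_url are made up for this illustration and are not part of any library.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

GRAPHQL_URL = "https://www.instagram.com/graphql/query/"
QUERY_HASH = "003056d32c2554def87228bc3fd9668a"  # value seen in the captured request


def build_query_url(user_id, first=12, after=""):
    """Return the query URL with the variables JSON percent-encoded."""
    variables = json.dumps(
        {"id": user_id, "first": first, "after": after},
        separators=(",", ":"),  # compact JSON, like the captured request
    )
    query = urlencode({"query_hash": QUERY_HASH, "variables": variables})
    return GRAPHQL_URL + "?" + query


url = build_query_url("4499737748")
# Decode the URL again to check that the variables survive the round trip
decoded = json.loads(parse_qs(urlparse(url).query)["variables"][0])
print(url)
print(decoded)
```

Letting urlencode do the escaping avoids problems with characters such as = in the cursor, which is why the captured URL shows %3D%3D at the end of after.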
first: the number of posts to fetch, starting from the position given by after.
With after = "" we get the first 12 posts; for every later page, after is the end_cursor returned by the previous request.
So our crawl process is:
Send the first API request -> collect the images and the end_cursor, and check whether another page follows -> if so, call the API again with the end_cursor from the previous call.
Collecting the images: take the API response -> parse it as JSON -> check whether the post is an image or a video -> check whether it contains further images -> collect the image URLs.
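The request/cursor cycle described above can be sketched independently of HTTP. This is only an illustration under assumptions: iter_pages and FAKE_PAGES are hypothetical names, each "page" stands for the edge_owner_to_timeline_media object of one response, and the fetch function is replaced by a dictionary lookup so the example runs offline.

```python
def iter_pages(fetch_page):
    """Yield every page, following end_cursor until has_next_page is false."""
    cursor = ""  # an empty cursor asks for the first page
    while True:
        page = fetch_page(cursor)
        yield page
        info = page["page_info"]
        if not info["has_next_page"]:
            break
        cursor = info["end_cursor"]


# Stand-in data instead of real HTTP calls: cursor -> page
FAKE_PAGES = {
    "": {"edges": ["a", "b"],
         "page_info": {"has_next_page": True, "end_cursor": "c1"}},
    "c1": {"edges": ["c"],
           "page_info": {"has_next_page": False, "end_cursor": None}},
}

items = [edge for page in iter_pages(FAKE_PAGES.__getitem__)
         for edge in page["edges"]]
print(items)  # ['a', 'b', 'c']
```

Processing each page before looking at has_next_page means the first page and the last page are both included.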
    nextLink = 'https://www.instagram.com/graphql/query/?query_hash=003056d32c2554def87228bc3fd9668a&variables={"id":"' + id + '","first":12,"after":"' + end + '"}'
    res = r.get(nextLink).json()
    edges = res['data']['user']['edge_owner_to_timeline_media']['edges']
    for e in edges:
        # Keep the post's own URL only if it is not a video
        is_video = e['node']['is_video']
        if is_video is False:
            link.append(e['node']['display_url'])
        # A post with several items lists them under edge_sidecar_to_children
        if "edge_sidecar_to_children" in e['node']:
            ne = e['node']['edge_sidecar_to_children']['edges']
            for nee in ne:
                is_video = nee['node']['is_video']
                if is_video is False:
                    link.append(nee['node']['display_url'])
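The extraction step can also be wrapped in a function and tried on hand-written data, without calling Instagram. The sketch below assumes the response fields used in this article (node, is_video, display_url, edge_sidecar_to_children); extract_image_urls and the example.com URLs are invented for the illustration. It additionally skips URLs that were already collected, in case a post and one of its child items share the same display_url.

```python
def extract_image_urls(edges):
    """Collect display_url values of non-video items, without duplicates."""
    urls = []
    for edge in edges:
        node = edge["node"]
        children = node.get("edge_sidecar_to_children", {}).get("edges", [])
        # Look at the post itself, then at each item of a multi-item post
        for item in [node] + [child["node"] for child in children]:
            if not item["is_video"] and item["display_url"] not in urls:
                urls.append(item["display_url"])
    return urls


sample_edges = [
    {"node": {"is_video": False, "display_url": "https://example.com/a.jpg"}},
    {"node": {"is_video": True, "display_url": "https://example.com/b.jpg"}},
    {"node": {
        "is_video": False,
        "display_url": "https://example.com/c.jpg",
        "edge_sidecar_to_children": {"edges": [
            {"node": {"is_video": False, "display_url": "https://example.com/c.jpg"}},
            {"node": {"is_video": True, "display_url": "https://example.com/d.jpg"}},
            {"node": {"is_video": False, "display_url": "https://example.com/e.jpg"}},
        ]},
    }},
]

print(extract_image_urls(sample_edges))
```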
Check whether there is a next page and read the end_cursor:
    page_info = res['data']['user']['edge_owner_to_timeline_media']['page_info']
    end = page_info['end_cursor']
    check = page_info['has_next_page']
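Direct indexing as above raises an exception as soon as one key is missing. As a hedged sketch, assuming that an unsuccessful response simply lacks some of these keys or carries a null user (the article does not document the error format), a helper could fall back to "no next page". The name read_page_info is hypothetical.

```python
def read_page_info(res):
    """Return (end_cursor, has_next_page); ("", False) if keys are missing."""
    try:
        info = res["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]
    except (KeyError, TypeError):
        # KeyError: a key is absent; TypeError: an intermediate value is None
        return "", False
    return info.get("end_cursor") or "", bool(info.get("has_next_page"))


ok = {"data": {"user": {"edge_owner_to_timeline_media": {
    "page_info": {"end_cursor": "abc", "has_next_page": True}}}}}

print(read_page_info(ok))                        # ('abc', True)
print(read_page_info({"data": {"user": None}}))  # ('', False)
```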
Finally, create a new folder and download the images:
    folder = os.path.join(os.getcwd(), id)
    os.makedirs(folder, exist_ok=True)  # does nothing if the folder already exists

    for l in link:
        # Last path segment of the URL, without the query string
        file_name = str(l).split('/')[-1].split('?')[0]
        with open(os.path.join(folder, file_name), "wb") as file:
            response = r.get(l)
            file.write(response.content)
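Deriving the file name by splitting on '/' and '?' works for the URLs seen here; the standard library can do the same parsing. A small sketch, with target_path as a hypothetical helper and an invented example.com URL:

```python
from pathlib import Path
from urllib.parse import urlparse


def target_path(folder, url):
    """Build the local path for an image URL, ignoring its query string."""
    return Path(folder) / Path(urlparse(url).path).name


# Hypothetical URL, used only to show the result
path = target_path("4499737748", "https://example.com/v/t51/12345_n.jpg?_nc_ht=abc&oh=def")
print(path.name)  # 12345_n.jpg
```

pathlib joins the parts with the separator of the current operating system, so the same code runs on Windows and on Linux or macOS.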
Full code:
You can download the source code and replace id with the id found in the API request to start downloading images.
    import os
    import requests as r

    id = '3762891297'  # user id taken from the API request
    folder = os.path.join(os.getcwd(), id)
    os.makedirs(folder, exist_ok=True)  # create the target folder if it is missing

    linkStart = 'https://www.instagram.com/graphql/query/?query_hash=003056d32c2554def87228bc3fd9668a&variables={"id":"' + id + '","first":12,"after":"'
    end = ''      # an empty cursor requests the first page
    check = True  # has_next_page of the most recent response

    while check:
        nextLink = linkStart + end + '"}'
        print(nextLink)
        res = r.get(nextLink).json()
        media = res['data']['user']['edge_owner_to_timeline_media']

        # Collect the image URLs of this page
        link = []
        for e in media['edges']:
            if e['node']['is_video'] is False:
                link.append(e['node']['display_url'])
            # A post with several items lists them under edge_sidecar_to_children
            if "edge_sidecar_to_children" in e['node']:
                for nee in e['node']['edge_sidecar_to_children']['edges']:
                    if nee['node']['is_video'] is False:
                        link.append(nee['node']['display_url'])

        # Cursor and flag for the next iteration
        end = media['page_info']['end_cursor']
        check = media['page_info']['has_next_page']
        print(len(link))

        # Download the images of this page
        for l in link:
            file_name = str(l).split('/')[-1].split('?')[0]
            with open(os.path.join(folder, file_name), "wb") as file:
                response = r.get(l)
                file.write(response.content)
P.S.: the code is written in the most rudimentary way; you are encouraged to rework it to make it cleaner.