Scraping Instagram with BeautifulSoup
I am trying to extract a specific string from Instagram's "search by tag" page. I want to get the image URL from this element:
<img alt="#yeşil #manzara #doğa
#yayla #nature #naturelovers #adventuretime #adventures #mountainstaries
#picture #şehirdenuzak #tatil #holiday #cow #potography #view #kütükev
#naturelife #animal #amazing #kar #winter #winteriscomming #mapavr1 #artvin
#tulumile #insaatr #tulumci #rize"
class="_2di5p" sizes="171px" srcset="https://scontent-mxp11.cdninstagram.com/vp/c883e0c4267c003843fafeda255f1329/5A9D3C97/t51.2885-15/s150x150/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 150w,
https://scontent-mxp1-1.cdninstagram.com/vp/6a3480f8658b50c691bcc100a96cc6f0/5A9CC9DC/t51.2885-15/s240x240/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 240w,
https://scontent-mxp1-1.cdninstagram.com/vp/461c138e15f52420c3fbc075fab027eb/5A9DD808/t51.2885-15/s320x320/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 320w,
https://scontent-mxp1-1.cdninstagram.com/vp/ad5d67f1c9ea77d78d145501e73c2ea0/5A9CAF9D/t51.2885-15/s480x480/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 480w,
https://scontent-mxp1-1.cdninstagram.com/vp/e0636f79adc1ae53f7321d10fe60f275/5A9CD134/t51.2885-15/s640x640/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 640w"
src="https://scontent-mxp1-1.cdninstagram.com/vp/e0636f79adc1ae53f7321d10fe60f275/5A9CD134/t51.2885-15/s640x640/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg" style="">
So basically I want to get this string (the srcset entry marked 240w):
https://scontent-mxp1-1.cdninstagram.com/vp/6a3480f8658b50c691bcc100a96cc6f0/../n.jpg
I tried writing this code in Python, but it does not work:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://www.instagram.com/explore/tags/nature/")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("srcset")
print(element.text.strip())
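Perhaps the real problem is that the page contains 21 of these elements, but to start with I would like to understand how to get this one string.
(Also, if any of you know a good tutorial or book on bs4, could you point me to it?)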
Answer
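The reason you do not see any output is that the images are added to the page dynamically with JavaScript, so the HTML you quoted is not present in the page source that requests downloads. The simplest approach is to use Selenium.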
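As a side note, soup.find("srcset") would return None even if the markup were present, because find() takes a tag name as its first argument and srcset is an attribute of the img tag, not a tag. Assuming the rendered HTML has been obtained some other way (for example with Selenium) and the attribute value has been read with something like img.get("srcset"), the following is a minimal sketch of picking one URL by its width descriptor. The helper name parse_srcset and the shortened example.com URLs are hypothetical stand-ins, not values from the question.

```python
def parse_srcset(srcset):
    """Map each width descriptor (e.g. '240w') to its URL.

    Assumes the URLs themselves contain no commas or spaces,
    which holds for the srcset shown in the question.
    """
    candidates = {}
    for entry in srcset.split(","):
        parts = entry.split()
        if len(parts) == 2:  # expect "<url> <descriptor>"
            url, descriptor = parts
            candidates[descriptor] = url
    return candidates


# Hypothetical, shortened stand-in for the srcset value in the question.
example_srcset = (
    "https://example.com/s150x150/n.jpg 150w,\n"
    "https://example.com/s240x240/n.jpg 240w,\n"
    "https://example.com/s320x320/n.jpg 320w"
)

print(parse_srcset(example_srcset)["240w"])
```

Looking the URL up by descriptor rather than by position keeps the code working if the order of the srcset entries changes.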
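However, there is another way to approach this. If you look at the page source, the data you are after is available in JSON format inside a <script> tag. The relevant data has the form: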
"thumbnail_resources": [
{
"src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/a3ed0ee1af581f1c1fe6170b8c080e7c/5B2CA660/t51.2885-15/s150x150/e35/28433503_571483933190064_5347634166450094080_n.jpg",
"config_width": 150,
"config_height": 150
},
{
"src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/7a0bb4fb1b5d5e3b179c58a2b9472b9f/5B2C535F/t51.2885-15/s240x240/e35/28433503_571483933190064_5347634166450094080_n.jpg",
"config_width": 240,
"config_height": 240
},
To get the JSON, you can use this code (taken from this answer):
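# Find the <script> whose text starts with window._sharedData.
# The "t and" guard skips <script> tags that have no inline text (t is None).
script = soup.find('script', text=lambda t: t and t.startswith('window._sharedData'))
# Drop the "window._sharedData = " prefix and the trailing semicolon, then parse.
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)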
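To make the string handling in that snippet easier to follow, here is a minimal sketch that applies the same split and rstrip steps to a plain string, without any network access. The function name shared_data_from_script and the tiny sample payload are hypothetical; the real script text on Instagram is far larger and its layout may differ.

```python
import json


def shared_data_from_script(script_text):
    """Parse the JSON assigned to window._sharedData in a script body."""
    text = script_text.strip()
    if not text.startswith("window._sharedData"):
        raise ValueError("not a window._sharedData script")
    # maxsplit=1 splits only at the first " = ", so any " = " that
    # appears inside the JSON payload is left untouched.
    payload = text.split(" = ", 1)[1].rstrip(";")
    return json.loads(payload)


# Hypothetical, heavily shortened stand-in for the real script text.
sample = 'window._sharedData = {"entry_data": {"TagPage": []}};'
data = shared_data_from_script(sample)
print(data["entry_data"])
```

The explicit prefix check turns a silent mismatch into a clear error, which may help if the page no longer contains such a script.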
Code to get the image link for every image:
import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.instagram.com/explore/tags/nature/')
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser requires the lxml package

# Locate the <script> holding window._sharedData; "t and" skips scripts with no inline text.
script = soup.find('script', text=lambda t: t and t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

# Each edge is one post; index 1 of thumbnail_resources is the 240w thumbnail.
for post in data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
    image_src = post['node']['thumbnail_resources'][1]['src']
    print(image_src)
Partial output:
https://instagram.fpnq3-1.fna.fbcdn.net/vp/e8a78407fb61de834cad7f10eca830fc/5A9DC375/t51.2885-15/s240x240/e15/c0.80.640.640/28766397_174603559842180_1092148752455565312_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/3a20f36647c86c2196f259b5d14ebf82/5A9D5BC9/t51.2885-15/s240x240/e15/28433802_283862648812409_3322859933120069632_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/82216be4596dd9da862ba267cdeab517/5B144226/t51.2885-15/s240x240/e35/c0.135.1080.1080/28157436_941679549319762_5605299824451649536_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/e50eab90b2e0951d67922e49b495e1fc/5B3EC9B8/t51.2885-15/s240x240/e35/c135.0.810.810/28754107_179533402825352_1137703808411893760_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/d3a13e7b81a65421b4318b57fb8ee24e/5B4D9EFF/t51.2885-15/s240x240/e35/28433583_375555202918683_1951892035636035584_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/1b0aeea1b9be983498192d350e039aa0/5B43C583/t51.2885-15/s240x240/e35/28156427_154249191953160_9219472301039288320_n.jpg
...
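The long key chain in the loop raises KeyError or IndexError as soon as one level is missing, which may happen if Instagram changes the structure of the page data. A minimal sketch of a more tolerant lookup follows. The helper name dig, the key "NoSuchPage" and the small sample dictionary are hypothetical; the sample only mimics the nesting used in the answer.

```python
def dig(obj, *path, default=None):
    """Follow dict keys and list indexes in order; return default if a step is missing."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Hypothetical sample that mimics the nesting used in the answer.
sample = {
    "entry_data": {
        "TagPage": [{
            "graphql": {"hashtag": {"edge_hashtag_to_media": {"edges": [
                {"node": {"thumbnail_resources": [
                    {"src": "https://example.com/s150x150/n.jpg", "config_width": 150},
                    {"src": "https://example.com/s240x240/n.jpg", "config_width": 240},
                ]}},
            ]}}},
        }],
    },
}

edges = dig(sample, "entry_data", "TagPage", 0, "graphql", "hashtag",
            "edge_hashtag_to_media", "edges", default=[])
for post in edges:
    print(dig(post, "node", "thumbnail_resources", 1, "src"))

# A path that does not exist yields the default instead of raising.
missing = dig(sample, "entry_data", "NoSuchPage", 0, default=[])
```

Returning an empty list as the default lets the for loop run zero times instead of failing.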
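Note: the [1] in the line image_src = post['node']['thumbnail_resources'][1]['src'] selects the 240w image. You can use 0, 1, 2, 3 or 4 for 150w, 240w, 320w, 480w or 640w respectively. In addition, if you want any other data about an image, such as the number of likes, the comments or the caption, everything is available in this JSON (the data variable).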
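Since each entry of thumbnail_resources carries a config_width field (see the JSON fragment above), the thumbnail can also be selected by width instead of by position. This is only a sketch: the helper name thumbnail_for_width and the example.com URLs are hypothetical, and it assumes the entries keep the src and config_width keys shown in the fragment.

```python
def thumbnail_for_width(resources, width):
    """Return the src of the entry whose config_width equals width, or None."""
    for resource in resources:
        if resource.get("config_width") == width:
            return resource.get("src")
    return None


# Hypothetical entries shaped like the thumbnail_resources fragment above.
resources = [
    {"src": "https://example.com/s150x150/n.jpg", "config_width": 150, "config_height": 150},
    {"src": "https://example.com/s240x240/n.jpg", "config_width": 240, "config_height": 240},
]

print(thumbnail_for_width(resources, 240))
```

In the loop from the answer this would be called as thumbnail_for_width(post['node']['thumbnail_resources'], 240), which does not depend on the order of the list.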