How to web scrape followers from Instagram web browser?

[Posted]: 2016-09-11 01:20:03

[Question]:

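Can anyone tell me how to access the underlying URL to view a given user's Instagram followers? I am able to do this with the Instagram API, but given the pending changes to the approval process, I have decided to switch to scraping.

The Instagram web browser lets you see the follower list of any given public user. For example, to view Instagram's followers, visit "https://www.instagram.com/instagram" and then click the followers URL to open a window that paginates through the followers (note: you must be logged in to your account to view this).

I noticed that the URL changes to "https://www.instagram.com/instagram/followers" when this window pops up, but I can't seem to view the underlying page source for this URL.

Since it appears in my browser window, I assume I will be able to scrape it. But do I have to use a package like Selenium? Does anyone know what the underlying URL is, so that I don't have to use Selenium?

As an example, I can directly access the underlying feed data by visiting "instagram.com/instagram/media/", from which I can scrape and paginate through all iterations. I would like to do something similar with the follower list and access this data directly (rather than using Selenium).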

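[Comments]:

It is possible to scrape the follower list, but you need to be logged in to Instagram. You cannot fetch the followers of a private user.

[Answer 1]: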

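Update: March 2020

This is just Levi's answer with small updates in some parts, because as it stood it did not quit the driver successfully. By default it also fetches all followers and, as others have said, it is not suited to accounts with a large number of followers.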

import itertools

from explicit import waiter, XPATH
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep

def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")
    sleep(3)
    # Login
    driver.find_element(By.NAME, "username").send_keys(username)
    driver.find_element(By.NAME, "password").send_keys(password)
    submit = driver.find_element(By.TAG_NAME, 'form')
    submit.submit()

    # Wait for the user dashboard page to load
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    sleep(2)
    driver.find_element(By.PARTIAL_LINK_TEXT, "follower").click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)
    # Total follower count; strip thousands separators such as "1,234" before converting
    allfoll = int(driver.find_element(By.XPATH, "//li[2]/a/span").text.replace(",", ""))
    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # modal by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advantage of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            if follower_index > allfoll:
                # End the generator; raising StopIteration here becomes a RuntimeError on Python 3.7+
                return
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing instagram to load another 12
        # Even though we just found this elem in the previous for loop, there can
        # potentially be a large amount of time between that call and this one,
        # and the element might have gone stale. Let's just re-acquire it to avoid
        # that
        last_follower = waiter.find_element(driver, follower_css.format(group+11))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = ""  # <account to check>
    driver = webdriver.Firefox(executable_path="./geckodriver")
    try:
        login(driver)
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
    finally:
        driver.quit()
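
The follower total shown on a profile page is display text, so it may contain thousands separators or abbreviations rather than plain digits. Below is a minimal sketch of a more tolerant parser. The function name parse_count and the accepted formats ("1,234", "12.5k", "1.2m") are assumptions for illustration, not something Instagram documents.

```python
import re

# Multipliers for the abbreviated suffixes this sketch assumes
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(text):
    """Convert display text such as '1,234' or '12.5k' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([km]?)", cleaned)
    if match is None:
        raise ValueError("unrecognised follower count: {!r}".format(text))
    number, suffix = match.groups()
    # Abbreviated counts are rounded by the site, so the result is approximate
    return int(round(float(number) * _SUFFIXES.get(suffix, 1)))
```

Because abbreviated totals are only approximate, a scraper that stops at this number may finish a few entries early or wait for entries that never load. Treating it as an upper bound alongside another stop condition is safer.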

[Comments]:

I get this error: `raise exception_class(message, screen, stacktrace) selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: follower`

@Newskooler You did not update the account variable at the end of the script.

[Answer 2]:

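Edit: December 2018 update:

Things have changed in Insta-land since this was posted. Here is an updated script that is a bit more Pythonic and makes better use of XPATH/CSS paths.

Note that to use this updated script, you must install the explicit package (pip install explicit), or convert each line that uses waiter to a plain Selenium explicit wait.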

import itertools

from explicit import waiter, XPATH
from selenium import webdriver


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    waiter.find_write(driver, "//div/input[@name='username']", username, by=XPATH)
    waiter.find_write(driver, "//div/input[@name='password']", password, by=XPATH)
    waiter.find_element(driver, "//div/button[@type='submit']", by=XPATH).click()

    # Wait for the user dashboard page to load
    waiter.find_element(driver, "//a/span[@aria-label='Find People']", by=XPATH)


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link for the requested account
    waiter.find_element(driver, "//a[@href='/{}/followers/']".format(account), by=XPATH).click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)

    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # modal by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advantage of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing instagram to load another 12
        # Even though we just found this elem in the previous for loop, there can
        # potentially be a large amount of time between that call and this one,
        # and the element might have gone stale. Let's just re-acquire it to avoid
        # that
        last_follower = waiter.find_element(driver, follower_css.format(follower_index))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = 'instagram'
    driver = webdriver.Chrome()
    try:
        login(driver)
        # Print the first 75 followers for the "instagram" account
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
            if count >= 75:
                break
    finally:
        driver.quit()
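
For readers who would rather not install explicit, the sketch below shows the kind of polling that a call such as waiter.find_element performs. It is a dependency-free stand-in, not the explicit package's implementation: the name wait_for_element, the 30-second timeout and the 0.5-second poll interval are all assumptions. With Selenium itself, the usual equivalent is WebDriverWait(driver, timeout).until(EC.presence_of_element_located((By.XPATH, locator))), as the other scripts in this thread do.

```python
import time


def wait_for_element(driver, locator, by="css selector", timeout=30.0, poll=0.5):
    """Retry driver.find_element(by, locator) until it succeeds or times out.

    "css selector" and "xpath" are the string values behind Selenium's
    By.CSS_SELECTOR and By.XPATH constants.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return driver.find_element(by, locator)
        except Exception:  # Selenium raises NoSuchElementException here
            if time.monotonic() >= deadline:
                raise  # give up and surface the last lookup error
            time.sleep(poll)
```

A call such as wait_for_element(driver, "//div[@role='dialog']", by="xpath") would then take the place of the corresponding waiter.find_element line.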

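I did a quick benchmark to show how sharply performance degrades the more followers you try to scrape this way (each additional 100 followers takes longer than the last):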

$ python example.py
Followers of the "instagram" account
Found    100 followers in 11 seconds
Found    200 followers in 19 seconds
Found    300 followers in 29 seconds
Found    400 followers in 47 seconds
Found    500 followers in 71 seconds
Found    600 followers in 106 seconds
Found    700 followers in 157 seconds
Found    800 followers in 213 seconds
Found    900 followers in 284 seconds
Found   1000 followers in 375 seconds

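Original post: Your question is a little confusing. For instance, I'm not really sure what "from which I can scrape and paginate through all iterations" actually means. What are you currently using to scrape and paginate?

Regardless, instagram.com/instagram/media/ and instagram.com/instagram/followers are not the same type of endpoint. The media endpoint appears to be a REST API, configured to return an easily parsed JSON object.

As far as I can tell, the followers endpoint isn't really a RESTful endpoint. Instead, Instagram loads the information into the page via AJAX (using React?) after you click the Followers button. I don't think you will be able to get that information without something like Selenium, which can load and render the JavaScript that displays the followers to the user.

This example code worked when the answer was first posted: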

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    driver.find_element(By.XPATH, "//div/input[@name='username']").send_keys(username)
    driver.find_element(By.XPATH, "//div/input[@name='password']").send_keys(password)
    driver.find_element(By.XPATH, "//span/button").click()

    # Wait for the login page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    driver.find_element(By.PARTIAL_LINK_TEXT, "follower").click()

    # Wait for the followers modal to load
    xpath = "//div[@style='position: relative; z-index: 1;']/div/div[2]/div/div[1]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    # You'll need to figure out some scrolling magic here. Something that can
    # scroll to the bottom of the followers modal, and know when it has reached
    # the bottom. This is pretty impractical for people with a lot of followers

    # Finally, scrape the followers
    xpath = "//div[@style='position: relative; z-index: 1;']//ul/li/div/div/div/div/a"
    followers_elems = driver.find_elements(By.XPATH, xpath)

    return [e.text for e in followers_elems]


if __name__ == "__main__":
    driver = webdriver.Chrome()
    try:
        login(driver)
        followers = scrape_followers(driver, "instagram")
        print(followers)
    finally:
        driver.quit()

This approach is problematic for a number of reasons, chief among them being how slow it is compared with the API.

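[Comments]:

Awesome, thank you for taking the time to reply. My question was probably confusing because I am a bit confused myself! I am not a professional programmer and don't know much about the back end of things. But your reply was very helpful and just what I was looking for!

@user812783765 If this is the answer you were looking for, please accept it so that others can see that it has been answered.

This is a very clever script. Would love to see a PHP implementation that can click the load-more button on their explore tags page.

Made a small javascript function that can export followers to a csv file. gist.github.com/hardiksondagar/732c5efa1c4327f12d8a343c960853d6

News from October 2019: the XPATH changed: "//div/label/input[@name='username']"

[Answer 3]: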

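I noticed that the previous answer no longer works, so I made an updated version based on it that includes scrolling (to get all the users in the list, not just those loaded initially). In addition, this scrapes both the followers and the accounts being followed. (You also need to download chromedriver.)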

import time
from selenium import webdriver as wd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# The account you want to check
account = ""

# Chrome executable
chrome_binary = r"chrome.exe"   # Add your path here


def login(driver):
    username = ""   # Your username
    password = ""   # Your password

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    driver.find_element(By.XPATH, "//div/input[@name='username']").send_keys(username)
    driver.find_element(By.XPATH, "//div/input[@name='password']").send_keys(password)
    driver.find_element(By.XPATH, "//span/button").click()

    # Wait for the login page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    driver.find_element(By.PARTIAL_LINK_TEXT, "follower").click()

    # Wait for the followers modal to load
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    SCROLL_PAUSE = 0.5  # Pause to allow loading of content
    driver.execute_script("followersbox = document.getElementsByClassName('_gs38e')[0];")
    last_height = driver.execute_script("return followersbox.scrollHeight;")

    # We need to scroll the followers modal to ensure that all followers are loaded
    while True:
        driver.execute_script("followersbox.scrollTo(0, followersbox.scrollHeight);")

        # Wait for page to load
        time.sleep(SCROLL_PAUSE)

        # Calculate new scrollHeight and compare with the previous
        new_height = driver.execute_script("return followersbox.scrollHeight;")
        if new_height == last_height:
            break
        last_height = new_height

    # Finally, scrape the followers
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]/ul/li"
    followers_elems = driver.find_elements(By.XPATH, xpath)

    followers_temp = [e.text for e in followers_elems]  # List of followers (username, full name, follow text)
    followers = []  # List of followers (usernames only)

    # Go through each entry in the list, append the username to the followers list
    for i in followers_temp:
        username, sep, name = i.partition('\n')
        followers.append(username)

    print("______________________________________")
    print("FOLLOWERS")

    return followers

def scrape_following(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Following' link
    driver.find_element(By.PARTIAL_LINK_TEXT, "following").click()

    # Wait for the following modal to load
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    SCROLL_PAUSE = 0.5  # Pause to allow loading of content
    driver.execute_script("followingbox = document.getElementsByClassName('_gs38e')[0];")
    last_height = driver.execute_script("return followingbox.scrollHeight;")

    # We need to scroll the following modal to ensure that all following are loaded
    while True:
        driver.execute_script("followingbox.scrollTo(0, followingbox.scrollHeight);")

        # Wait for page to load
        time.sleep(SCROLL_PAUSE)

        # Calculate new scrollHeight and compare with the previous
        new_height = driver.execute_script("return followingbox.scrollHeight;")
        if new_height == last_height:
            break
        last_height = new_height

    # Finally, scrape the following
    xpath = "/html/body/div[4]/div/div/div[2]/div/div[2]/ul/li"
    following_elems = driver.find_elements(By.XPATH, xpath)

    following_temp = [e.text for e in following_elems]  # List of following (username, full name, follow text)
    following = []  # List of following (usernames only)

    # Go through each entry in the list, append the username to the following list
    for i in following_temp:
        username, sep, name = i.partition('\n')
        following.append(username)

    print("\n______________________________________")
    print("FOLLOWING")
    return following


if __name__ == "__main__":
    options = wd.ChromeOptions()
    options.binary_location = chrome_binary # chrome.exe
    driver_binary = r"chromedriver.exe"
    driver = wd.Chrome(driver_binary, options=options)  # chrome_options= is the deprecated spelling
    try:
        login(driver)
        followers = scrape_followers(driver, account)
        print(followers)
        following = scrape_following(driver, account)
        print(following)
    finally:
        driver.quit()
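
The scroll loop appears twice in the script above, once per modal. A minimal sketch of how it could be factored into one helper follows. The name scroll_until_stable and the max_rounds safety cap are additions for illustration; the helper relies only on driver.execute_script, which the script already uses.

```python
import time


def scroll_until_stable(driver, container_js, pause=0.5, max_rounds=1000):
    """Scroll a container to the bottom until its scrollHeight stops growing.

    container_js is a JavaScript expression that evaluates to the scrollable
    element. Returns the number of scroll rounds performed.
    """
    height_js = "return ({}).scrollHeight;".format(container_js)
    scroll_js = "var box = {}; box.scrollTo(0, box.scrollHeight);".format(container_js)

    last_height = driver.execute_script(height_js)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script(scroll_js)
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script(height_js)
        if new_height == last_height:
            return rounds  # nothing new was loaded, so the list is complete
        last_height = new_height
    return max_rounds  # safety cap reached; the list may be incomplete
```

Both functions could then call scroll_until_stable(driver, "document.getElementsByClassName('_gs38e')[0]") after waiting for the modal. A fixed pause can end the loop early on a slow connection, so a longer pause or a second confirmation pass may be needed in practice.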
