BeautifulSoup subpages of list with "load more" pagination


【Posted】2016-10-14 10:14:28

【Question】:

Very new here, so apologies in advance. I'm looking to get a list of all of the company descriptions from https://angel.co/companies to play with. The web-based parsing tools I've tried haven't cut it, so I'm looking to write a simple Python script. Should I first get an array of all the company URLs and then loop through them? Any resources or direction would be helpful; I've looked through BeautifulSoup's documentation and some posts/video tutorials, but I'm getting hung up on things like simulating the json request (see here: Get all links with BeautifulSoup from a single page website ('Load More' feature)).
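The naive approach I had in mind was something like the sketch below, though I suspect it won't work here since the listings seem to be loaded dynamically (the selector is a placeholder, not the real page structure):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://angel.co/companies").content
soup = BeautifulSoup(html, "lxml")
for link in soup.select("a.startup-link"):  # placeholder selector, not the real markup
    sub = BeautifulSoup(requests.get(link["href"]).content, "lxml")
    # ...pull the company description out of each subpage...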

I found one script that I think is making the call for the additional listings:

o.on("company_filter_fetch_page_complete", function(e) {
    return t.ajax({
        url: "/companies/startups",
        data: e,
        dataType: "json",
        success: function(t) {
            return t.html ?
                (E().find(".more").empty().replaceWith(t.html),
                 c()) : void 0
        }
    })
}),

Thanks!

【Comments】:

In case it helps, directly above this script is: filter: function(s, l) { var c, u, d, h, p, f, g, m, v, y, b, _, w, x, C, A, k, S, N, T, E, D, I, $, P, M; return u = new o(s(".currently-showing"), l.data("sort")), u.set_data(l.data("init_data")), u.render({ fetch: !l.data("new") }), ...

【Answer 1】:

The data you are trying to scrape is loaded dynamically with ajax; you need to do a bit of work to get the html you actually want:

import requests
from bs4 import BeautifulSoup

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

with requests.Session() as s:
    # the csrf token lives in a meta tag on the landing page
    r = s.get("https://angel.co/companies").content
    csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
    header["X-CSRF-Token"] = csrf
    # this post returns the company ids plus the rest of the query parameters
    ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
    # build the ids%5B%5D=... pairs, then append the remaining key=value params
    _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
    rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
    url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
    rsp = s.get(url, headers=header)
    print(rsp.json())

We first need to get a valid csrf-token, which is what the initial request does; then we need to post to https://angel.co/company_filters/search_data:

That gives us:

"ids":[296769,297064,60,63,112,119,130,160,167,179,194,236,281,287,312,390,433,469,496,516],"total":908164,"page":1,"sort":"signal","new":false,"hexdigest":"3f4980479bd6dca37e485c80d415e848a57c43ae"

Those are the parameters we need for our final request to https://angel.co/companies/startups:

That request in turn gives us more json containing the html with all the company info:

"html":"<div class=\" dc59 frs86 _a _jm\" data-_tn=\"companies/results ...........

It is far too much to post here, but that is what you need to parse.
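As a side note, instead of hand-formatting the query string the way the snippet above does, urllib.parse.urlencode can take care of escaping the ids[] brackets; a minimal sketch, assuming ids is the dict returned by the search_data post:

from urllib.parse import urlencode

# "ids[]" is percent-encoded to ids%5B%5D automatically
params = [("ids[]", i) for i in ids.pop("ids")] + list(ids.items())
url = "https://angel.co/companies/startups?" + urlencode(params)

Either way you end up with the same request.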

So putting it all together:

In [3]: header = {
   ...:     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
   ...:     "X-Requested-With": "XMLHttpRequest",
   ...: }

In [4]: with requests.Session() as s:
   ...:         r = s.get("https://angel.co/companies").content
   ...:         csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
   ...:         header["X-CSRF-Token"] = csrf
   ...:         ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal"}, headers=header).json()
   ...:         _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
   ...:         rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
   ...:         url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
   ...:         rsp = s.get(url, headers=header)
   ...:         soup = BeautifulSoup(rsp.json()["html"], "lxml")
   ...:         for comp in soup.select("div.base.startup"):
   ...:                 text = comp.select_one("div.text")
   ...:                 print(text.select_one("div.name").text.strip())
   ...:                 print(text.select_one("div.pitch").text.strip())
   ...:
Frontback
Me, now.
Outbound
Optimizely for messages
Adaptly
The Easiest Way to Advertise Across The Social Web.
Draft
Words with Friends for Fantasy (w/ real money)
Graphicly
an automated ebook publishing and distribution platform
Appstores
App Distribution Platform
eVenues
Online Marketplace & Booking Engine for Unique Meeting Spaces
WePow
Video & Mobile Recruitment
DoubleDutch
Event Marketing Automation Software
ecomom
It's all good
BackType
Acquired by Twitter
Stipple
Native advertising for the visual web
Pinterest
A Universal Social Catalog
Socialize
Identify and reward your most influential users with our drop-in social platform.
StyleSeat
Largest and fastest growing marketplace in the $400B beauty and wellness industry
LawPivot
99 Designs for legal
Ostrovok
Leading hotel booking platform for Russian-speakers
Thumb
Leading mobile social network that helps people get instant opinions
AppFog
Making developing applications on the cloud easier than ever before
Artsy
Making all the world’s art accessible to anyone with an Internet connection.

As far as paging goes, you are limited to 20 pages a day, but to get all 20 it is simply a matter of adding "page": page_no to our form data, i.e. data={"sort": "signal", "page": page}; that is what you can see being posted when you click load more.

So the final code:

import requests
from bs4 import BeautifulSoup

def parse(soup):
    for comp in soup.select("div.base.startup"):
        text = comp.select_one("div.text")
        yield text.select_one("div.name").text.strip(), text.select_one("div.pitch").text.strip()

def connect(page):
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    }

    with requests.Session() as s:
        r = s.get("https://angel.co/companies").content
        csrf = BeautifulSoup(r, "lxml").select_one("meta[name=csrf-token]")["content"]
        header["X-CSRF-Token"] = csrf
        # "page" is the only addition to the form data from before
        ids = s.post("https://angel.co/company_filters/search_data", data={"sort": "signal", "page": page}, headers=header).json()
        _ids = "".join(["ids%5B%5D={}&".format(i) for i in ids.pop("ids")])
        rest = "&".join(["{}={}".format(k, v) for k, v in ids.items()])
        url = "https://angel.co/companies/startups?{}{}".format(_ids, rest)
        rsp = s.get(url, headers=header)
        soup = BeautifulSoup(rsp.json()["html"], "lxml")
        for n, p in parse(soup):
            yield n, p

for i in range(1, 21):
    for name, pitch in connect(i):
        print(name, pitch)

Obviously what you parse from it is up to you, but everything you can see in your browser's results will be available.
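If you want to keep the results rather than just print them, here is a minimal sketch (reusing connect from above; the filename is just an example) that writes all 20 pages to CSV with a short pause between requests:

import csv
import time

with open("startups.csv", "w", newline="") as f:  # example filename
    writer = csv.writer(f)
    writer.writerow(["name", "pitch"])
    for page in range(1, 21):  # the site caps you at 20 pages per query
        for name, pitch in connect(page):
            writer.writerow([name, pitch])
        time.sleep(2)  # be polite between page fetches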

【Discussion】:

Padraic, thanks for the thoughtful answer. Considering how I was going about this, I'm sure I wouldn't have gotten far without you. Oddly, I keep getting 383 unique items as output. Any idea why that might be? I believe the page should output close to 900k results.

@TylerHudson-Crimi, if you click load more 19 times, you will see You've reached the maximum of 20 pages per query.

@TylerHudson-Crimi, what information are you actually after?

The idea is to train a NN on all these startup descriptions to build a random startup-description generator. Now I guess Kramer will be more fun, if angel list protects their listings too well. imsdb.com/transcripts/Seinfeld-Good-News,-Bad-News.html

Needed to do some web scraping involving a load more button, ran into this question, and this is by far the most descriptive answer I have read on this site. I never knew I could do this much, or that there was still so much to do, until now. Thank you for the answer!
