Web scraping with Python (BeautifulSoup): multiple pages and sub-pages


I created my soup:

import pandas as pd 
import requests
from bs4 import BeautifulSoup
import os
import string

for i in string.ascii_uppercase[:27]:
    url = "https://myanimelist.net/anime.php?letter={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

I'm trying to build a dataframe by scraping that listing site. As a first step, I want to collect each anime's title, eps, and type.

Second, from each anime's detail page (for example: https://myanimelist.net/anime/2928/hack__GU_Returner) I want to collect the scores assigned by users, together with the username that assigned them, e.g.:

<a href="https://myanimelist.net/profile/Tii__">Tii__</a>

Could you help me collect all of this information?

Let me know if my request is unclear.

Answer

This can be done directly with pandas, using the read_html() function. It returns a list of all the tables found at the given URL; in your case you need the third one (tables[2]). This gives you a dataframe to start with:

import pandas as pd 
import string

df = pd.DataFrame()

for i in string.ascii_uppercase[:1]:  # [:1] for testing; use the full string for A-Z
    url = "https://myanimelist.net/anime.php?letter={}".format(i)
    print(url)
    tables = pd.read_html(url, header=0)

    if df.empty:
        df = tables[2]
    else:
        df = pd.concat([df, tables[2]])

print(df)
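To see concretely what read_html() hands back, here is a small offline sketch run against a literal HTML string instead of the live site (the tables and values are made up for illustration; pandas delegates the actual parsing, so lxml or html5lib must be installed):

```python
import pandas as pd
from io import StringIO

# Two small tables in one document; read_html returns one DataFrame per <table>
html = """
<table>
  <tr><th>Title</th><th>Score</th></tr>
  <tr><td>A Kite</td><td>6.67</td></tr>
</table>
<table>
  <tr><th>Genre</th></tr>
  <tr><td>Drama</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))
print(len(tables))   # one DataFrame per table found
print(tables[0])
```

The <th> row is picked up as the header automatically, which is why the real code above can index the listing page's grid as tables[2] and concatenate it straight into the main dataframe.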

Running the loop above for the letter A gives a dataframe like:

    Unnamed: 0                                              Title     Type  Eps.  Score
0          NaN  A Kite add  Sawa is a school girl, an orphan, ...      OVA     2   6.67
1          NaN  A Piece of Phantasmagoria add  A collection of...      OVA    15   6.25
2          NaN  A Play add  Music Video for the group ALT, mad...    Music     1   4.62
3          NaN  A Smart Experiment add  Bonus short included o...  Special     1   4.95
4          NaN  A-Channel add  Tooru and Run have been best fr...       TV    12   7.04

To do this using BeautifulSoup instead, you could use the following approach:

from bs4 import BeautifulSoup
import pandas as pd
import string
import requests

columns = [u'Title', u'Type', u'Eps.', u'Score']
df = pd.DataFrame()

for i in string.ascii_uppercase:
    url = "https://myanimelist.net/anime.php?letter={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find_all('table')[2]

    for tr in table.find_all('tr')[1:]:
        row = [td.get_text(strip=True) for td in tr.find_all('td')[1:5]]
        url_sub = tr.find('a')['href']
        print(url_sub)
        r_sub = requests.get(url_sub)
        soup_sub = BeautifulSoup(r_sub.text, 'html.parser')
        all_scores = []    # each title has multiple lists of scores

        # Select all of the user-assigned score tables
        for div in soup_sub.select('div.spaceit.textReadability.word-break.pt8.mt8'):
            scores = []    # scores for one block
            for tr_sub in div.div.table.find_all('tr'):
                scores.append([td_sub.text for td_sub in tr_sub.find_all('td')])
            all_scores.append(scores)
        print(all_scores)

        # These probably need adding to the row. Not all titles have scores.
        df_row = pd.DataFrame([row], columns=columns)

        if df.empty:
            df = df_row
        else:
            df = pd.concat([df, df_row])

print(df)

For each title, a list of all the scores found on its page is built up in all_scores, although it is not clear how you want these added to the main dataframe. For example, the raw HTML for one user's score table looks like this:

<table border="0" width="105" cellpadding="0" cellspacing="0" class="borderClass" style="border-width: 1px;">
  <tbody>
    <tr>
      <td class="borderClass bgColor1"><strong>Overall</strong></td>
      <td class="borderClass bgColor1"><strong>10</strong></td>
    </tr>
    <tr>
      <td class="borderClass" align="left">Story</td>
      <td class="borderClass">10</td>
    </tr>
    <tr>
      <td class="borderClass" align="left">Animation</td>
      <td class="borderClass">9</td>
    </tr>
    <tr>
      <td class="borderClass" align="left">Sound</td>
      <td class="borderClass">9</td>
    </tr>
    <tr>
      <td class="borderClass" align="left">Character</td>
      <td class="borderClass">9</td>
    </tr>
    <tr>
      <td class="borderClass" style="border-width: 0;" align="left">Enjoyment</td>
      <td class="borderClass" style="border-width: 0;">10</td>
    </tr>
  </tbody>
</table>
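The answer leaves open how to attach the username and the per-category scores to each row. One way is to pair each review block's profile link with its score table and flatten them into a dict, which pd.DataFrame can then turn into columns. A minimal offline sketch, parsing a single hand-written review block; the div class and overall HTML shape here are assumptions based on the fragments above, not verified against the live site:

```python
from bs4 import BeautifulSoup

# A cut-down review block: a profile link plus that user's score table
review_html = """
<div class="spaceit textReadability word-break pt8 mt8">
  <a href="https://myanimelist.net/profile/Tii__">Tii__</a>
  <div><table><tbody>
    <tr><td><strong>Overall</strong></td><td><strong>10</strong></td></tr>
    <tr><td>Story</td><td>10</td></tr>
    <tr><td>Animation</td><td>9</td></tr>
  </tbody></table></div>
</div>
"""

soup = BeautifulSoup(review_html, "html.parser")

reviews = []
for div in soup.select("div.spaceit.textReadability.word-break.pt8.mt8"):
    # The username is the text of the profile link
    user = div.find("a", href=lambda h: h and "/profile/" in h).get_text(strip=True)
    # Turn the two-column score table into {'Overall': 10, 'Story': 10, ...}
    scores = {}
    for tr in div.find("table").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:
            scores[cells[0]] = int(cells[1])
    reviews.append({"user": user, **scores})

print(reviews)
```

Each dict in reviews could then be merged into the per-title row built in the main loop (e.g. via pd.DataFrame(reviews)); titles without scores simply produce an empty list, which keeps the concatenation step unchanged.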
