Filtering data in Python from scraping with bs4

Posted 2023-02-15

[Title]: filtering data in python from scraping bs4 [Posted]: 2021-12-24 08:13:42 [Question]:

I'm a bit confused about how to filter the data I get from scraping eBay. Here is my code:

from bs4 import BeautifulSoup
import requests

url ='https://www.ebay.fr/sch/267/i.html?_from=R40&_nkw=star+wars&_sop=10&_ipg=200'

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def parse(soup):
    results = soup.find_all('div', {'class' : 's-item__info clearfix'})
    for item in results:
        data = []
        try:
            Title = item.find('h3', {'class': 's-item__title'}).text.replace('Nouvelle annonce','')
            Price = item.find('span', {'class':'s-item__price'}).text
            Link = item.find('a', {'class' : 's-item__link'})['href']

            products = {'Title' : Title, 'Price' : Price, 'Link' : Link}
            data.append(products)
            print(data)

        except:
            continue
    return
soup = get_data(url)
parse(soup)

With this code I can get all the books from the eBay page, but I only want specific books, selected by keyword, out of the list that print(data) gives me:

[{'Title': 'Star Wars - Rebels T05', 'Price': '8,53 EUR', 'Link': 'https://www.ebay.fr/itm/265401372083?hash=item3dcb278db3:g:g00AAOSwTmBhjXjq'}]
[{'Title': 'Official Lego® Star Wars Annual 2016 (Lego Annuals), , Used; Good Book', 'Price': '8,42 EUR', 'Link': 'https://www.ebay.fr/itm/165178509530?hash=item26756808da:g:NU4AAOSwsldhjXi2'}]
[{'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon', 'Price': '10,95 EUR', 'Link': 'https://www.ebay.fr/itm/124998742900?hash=item1d1a817374:g:zBQAAOSwSGFhjXPt'}]
[{'Title': 'STARFIX 007 1983 STAR WARS La guerre des étoiles III Les PREDATEURS GWENDOLINE', 'Price': '12,90 EUR', 'Link': 'https://www.ebay.fr/itm/294540446774?hash=item4493fa8c36:g:EMUAAOSwWjxhjXNe'}]
[{'Title': 'Star Wars, Der Kristallstern de McIntyre, Vonda N.,... | Livre | état acceptable', 'Price': '3,53 EUR', 'Link': 'https://www.ebay.fr/itm/124998670341?hash=item1d1a805805:g:6xIAAOSwKmZhjWPn'}]

I'd like to use the keyword "Thrawn" so that I only get line 3:

[{'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon', 'Price': '10,95 EUR', 'Link': 'https://www.ebay.fr/itm/124998742900?hash=item1d1a817374:g:zBQAAOSwSGFhjXPt'}]

I'm stuck at this point. I've tried many things (if, string, attrs) but haven't gotten any result so far. How can I implement a "keyword" filter? :)

Thanks

[Question comments]:

[Answer 1]:

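There are several ways to find the book titles that contain the keyword "Thrawn".

First, each data element is a dictionary, so it has to be converted to a string with str(dict) before a plain substring test can be applied to it.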

book_titles = parse(soup)
book = [title for title in book_titles if 'Thrawn' in str(title)]
print(book)
# output
[{'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon', 'Price': '10,95 EUR', 'Link': 'https://www.ebay.fr/itm/124998742900?hash=item1d1a817374:g:zBQAAOSwSGFhjXPt'}, {'Title': 'Star Wars™ Thrawn de Zahn, Timothy | Livre | état très bon', 'Price': '10,77 EUR', 'Link': 'https://www.ebay.fr/itm/124997651763?hash=item1d1a70cd33:g:FhoAAOSwPF9hjIs-'}]
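One caveat with str(title) is that the substring test also looks inside the Price and Link values. The sketch below illustrates this with made-up sample data in the shape parse() returns; the example.com links and the 'itm' keyword are assumptions chosen only to make the difference visible.

```python
# Hypothetical sample data shaped like the list returned by parse().
book_titles = [
    {'Title': 'Star Wars - Rebels T05', 'Price': '8,53 EUR',
     'Link': 'https://example.com/itm/1'},
    {'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon',
     'Price': '10,95 EUR', 'Link': 'https://example.com/itm/2'},
]

keyword = 'itm'  # occurs in every Link but in no Title

# str(dict) includes the keys and every value, so Link text matches too.
whole_dict_matches = [b for b in book_titles if keyword in str(b)]

# Testing only the 'Title' value avoids those accidental hits.
title_matches = [b for b in book_titles if keyword in b['Title']]

print(len(whole_dict_matches))  # 2
print(len(title_matches))       # 0
```

For a keyword such as 'Thrawn' both tests give the same result on this data; they only diverge when the keyword can also appear in a price or URL.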

Here is another way, using a regular expression.

import re

book_titles = parse(soup)
book = [title for title in book_titles if re.search('Thrawn', str(title))]
print(book)
# output
[{'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon', 'Price': '10,95 EUR', 'Link': 'https://www.ebay.fr/itm/124998742900?hash=item1d1a817374:g:zBQAAOSwSGFhjXPt'}, {'Title': 'Star Wars™ Thrawn de Zahn, Timothy | Livre | état très bon', 'Price': '10,77 EUR', 'Link': 'https://www.ebay.fr/itm/124997651763?hash=item1d1a70cd33:g:FhoAAOSwPF9hjIs-'}]
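A regular expression becomes more useful than a plain substring test when the match should ignore case or respect word boundaries. The following is a minimal sketch of that idea, not part of the original answer; the sample list and the upper-case title variant are assumptions for illustration.

```python
import re

# Hypothetical sample data shaped like the list returned by parse().
book_titles = [
    {'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon'},
    {'Title': 'Star Wars™ THRAWN de Zahn (hypothetical upper-case variant)'},
    {'Title': 'Star Wars - Rebels T05'},
]

# \b anchors the match at word boundaries; IGNORECASE accepts any casing.
pattern = re.compile(r'\bthrawn\b', re.IGNORECASE)

# Search only the title text rather than the stringified dictionary.
matches = [b for b in book_titles if pattern.search(b['Title'])]

for b in matches:
    print(b['Title'])
```

Compiling the pattern once keeps the comprehension short and avoids repeating the flags for every item.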

Here is another way:

book_titles = parse(soup)
for title in book_titles:
    for key, value in title.items():
        if key == 'Title':
            if 'Thrawn' in value:
                print(title)
                # output
                # {'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon', 'Price': '10,95 EUR',
                #  'Link': 'https://www.ebay.fr/itm/124998742900?hash=item1d1a817374:g:zBQAAOSwSGFhjXPt'}
                #
                # {'Title': 'Star Wars™ Thrawn de Zahn, Timothy | Livre | état très bon', 'Price': '10,77 EUR',
                #  'Link': 'https://www.ebay.fr/itm/124997651763?hash=item1d1a70cd33:g:FhoAAOSwPF9hjIs-'}

The parse function also needs to return the data, so change it like this:

def parse(soup):
    data = []  # created once, outside the loop, so every item is kept
    results = soup.find_all('div', {'class': 's-item__info clearfix'})
    for item in results:
        try:
            Title = item.find('h3', {'class': 's-item__title'}).text.replace('Nouvelle annonce','')
            Price = item.find('span', {'class': 's-item__price'}).text
            Link = item.find('a', {'class': 's-item__link'})['href']

            products = {'Title': Title, 'Price': Price, 'Link': Link}
            data.append(products)
        except (AttributeError, TypeError, KeyError):
            # find() returned None or the href is missing: skip this item
            continue
    return data

Here is one way to search for several books:

book_titles = parse(soup)
for title in book_titles:
    for key, value in title.items():
        if key == 'Title':
            for book in ['INTEGRALE', 'Thrawn']:
                if book in value:
                    print(title)
                    # output
                    # {'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon', 'Price': '10,95 EUR',
                    #  'Link': 'https://www.ebay.fr/itm/124998742900?hash=item1d1a817374:g:zBQAAOSwSGFhjXPt'}
                    #
                    # {'Title': 'DARK MAUL INTEGRALE , STAR WARS LEGENDES, LIVRE NEUF (RARE) ', 'Price': '25,00 EUR',
                    #  'Link': 'https://www.ebay.fr/itm/175018282734?hash=item28bfe70eee:g:t1YAAOSwOsFhhW1Q'}
                    #
                    # {'Title': 'LES OMBRES DE L EMPIRE INTEGRALE , STAR WARS LEGENDES, LIVRE NEUF ',
                    #  'Price': '20,00 EUR',
                    #  'Link': 'https://www.ebay.fr/itm/175018277970?hash=item28bfe6fc52:g:ASgAAOSwyFphhWsW'}
                    #
                    # {'Title': 'Star Wars™ Thrawn de Zahn, Timothy | Livre | état très bon', 'Price': '10,77 EUR',
                    #  'Link': 'https://www.ebay.fr/itm/124997651763?hash=item1d1a70cd33:g:FhoAAOSwPF9hjIs-'}
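The nested loops above print an item once per matching keyword, so a title that contains both words would appear twice. One possible variant, sketched here under the assumption that each item should be reported once and that case should be ignored, uses any(). The helper name matches_any and the sample data (including the last, invented title) are hypothetical.

```python
# Hypothetical sample data shaped like the list returned by parse().
book_titles = [
    {'Title': 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon'},
    {'Title': 'DARK MAUL INTEGRALE , STAR WARS LEGENDES, LIVRE NEUF (RARE) '},
    {'Title': 'Star Wars - Rebels T05'},
    {'Title': 'THRAWN INTEGRALE (invented title matching both keywords)'},
]

keywords = ['INTEGRALE', 'Thrawn']


def matches_any(title, keywords):
    """Return True if at least one keyword occurs in title, ignoring case."""
    lowered = title.casefold()
    return any(keyword.casefold() in lowered for keyword in keywords)


# Each item is tested once, so it can appear in the result at most once.
selected = [b for b in book_titles if matches_any(b['Title'], keywords)]

for b in selected:
    print(b)
```

any() stops at the first keyword that matches, which is what prevents the duplicate output.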

[Comments]:

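Thank you very much :) I had actually put the if condition inside the try/except, which I guess is why it didn't work. I tried the third way you wrote and it works well. What if I want to search for several keywords? I tried the following:

keywords = {'INTEGRALE', 'Thrawn'}

for title in book_titles:
    for key, value in title.items():
        if key == 'Title':
            if keywords in value:
                print(title)

so that I would get every book containing one of these words (INTEGRALE, Thrawn, etc.), but it raises: TypeError: 'in <string>' requires string as left operand, not set

See the updated part of my answer. Is this what you wanted to do?

Yes, it's perfect :) Thank you. Does for .... in ... let you transform or declare the data you want in some way?

for .... in ... is a for loop. Reference: w3schools.com/python/python_for_loops.asp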
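The TypeError in the comment above comes from putting a set on the left of `in` while a string is on the right: a string can only be asked whether it contains another string. The sketch below reproduces the error and tests each keyword separately instead; the variable names error_message and matched_keywords are illustrative assumptions.

```python
keywords = {'INTEGRALE', 'Thrawn'}
value = 'Thrawn (Star Wars) de Zahn, Timothy | Livre | état très bon'

# A set cannot be the left operand of `in` when the right operand is a str.
try:
    _ = keywords in value
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Testing the keywords one by one works, and also shows which ones matched.
matched_keywords = sorted(k for k in keywords if k in value)
print(matched_keywords)  # ['Thrawn']
```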
