How can I web scrape certain words that don't have an attribute attached to them?

Posted 2023-02-16


【Title】: How can I web scrape certain words that don't have an attribute attached to them? 【Posted】: 2021-10-02 20:03:05 【Question】:

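First, I should say that I'm a beginner at web scraping. I've just started a project that scrapes data from https://coinmarketcap.com. Right now I'm focused on scraping the names of the cryptocurrencies (i.e. Bitcoin, Ethereum, Tether, etc.). However, the best I can get is the currency's name followed by a bunch of formatting such as color, font size, class, and so on. How can I write this so that I store only the currency's name, without the extra information? Here is my current code: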

import requests
from bs4 import BeautifulSoup

#array of just crypto names
names = []

#gets content from site
site = requests.get("https://coinmarketcap.com")

#opens content from site
info = site.content
soup = BeautifulSoup(info,"html.parser")

#class ID for name of crypto
type_name = 'sc-1eb5slv-0 iJjGCS'

#crypto names + other unnecessary info
names_raw = soup.find_all('p', attrs={'class': 'sc-1eb5slv-0 iJjGCS'})

for type_name in names_raw:
    print(type_name.text, type_name.next_sibling)


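As you can see, I only have about 20 lines, but I'm having a hard time figuring this out. I'd appreciate any help or advice you can give me.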


【Answer 1】:

To get the names and ticker symbols of the cryptocurrencies from this page, you can use the following example:

import requests
from bs4 import BeautifulSoup

url = "https://coinmarketcap.com"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for td in soup.select("td:nth-of-type(3)"):
    t = " ".join(tag.text for tag in td.select("p, span")).strip()
    print("{:<30} {:<10}".format(*t.rsplit(maxsplit=1)))

Prints:

Bitcoin                        BTC       
Ethereum                       ETH       
Tether                         USDT      
Binance Coin                   BNB       
Cardano                        ADA       
XRP                            XRP       
USD Coin                       USDC      
Dogecoin                       DOGE      
Polkadot                       DOT       
Binance USD                    BUSD      
Uniswap                        UNI       
Bitcoin Cash                   BCH       
Litecoin                       LTC       
Chainlink                      LINK      
Solana                         SOL       
Wrapped Bitcoin                WBTC      
Polygon                        MATIC     
Ethereum Classic               ETC       
Stellar                        XLM       
THETA                          THETA     

...and so on.
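
The two aligned columns come from splitting each cell's text once, from the right, so multi-word names such as "Binance Coin" stay intact. The sketch below isolates that step using only the standard library. The helper name split_name_symbol and the sample strings are made up for illustration, and it assumes every cell's text ends with the ticker symbol.

```python
def split_name_symbol(cell_text):
    """Split text like 'Binance Coin BNB' into ('Binance Coin', 'BNB').

    rsplit(maxsplit=1) splits once at the last run of whitespace, so only
    the final word (assumed to be the ticker symbol) is separated.
    """
    name, symbol = cell_text.rsplit(maxsplit=1)
    return name, symbol


# Sample strings shaped like the joined <p>/<span> text of one table cell.
samples = ["Bitcoin BTC", "Binance Coin BNB", "Wrapped Bitcoin WBTC"]
pairs = [split_name_symbol(s) for s in samples]

for name, symbol in pairs:
    # {:<30} left-aligns the name in a 30-character field, {:<10} the symbol.
    print("{:<30} {:<10}".format(name, symbol))
```

If a cell contained only a single word, the two-name unpacking would raise a ValueError, so real scraping code may want to check the length of the split result first.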

【Comments】:

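Wow, this definitely works! I'm a bit lost on the loop, though. If someone could help me understand what is going on inside it, that would be very helpful. Thanks a lot for your reply, Andrej.

@CW soup.select("td:nth-of-type(3)") selects the third column of the table. Then, in each cell, we find every <p> and <span> tag, join their text together, and split the result into the name and the abbreviation.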
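
The loop described in the comment above can be traced step by step on a small static table, without touching the network. This is only a sketch: the HTML is a hypothetical, simplified stand-in for two rows of the page (the real markup differs and may change), and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for two table rows (not the real page markup).
html = """
<table>
  <tr>
    <td></td><td><p>1</p></td>
    <td><p>Bitcoin</p><span>BTC</span></td>
  </tr>
  <tr>
    <td></td><td><p>2</p></td>
    <td><p>Binance Coin</p><span>BNB</span></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
# Step 1: pick the third <td> of every row, i.e. the third column.
for td in soup.select("td:nth-of-type(3)"):
    # Step 2: collect the text of each <p> and <span> inside the cell
    # and join the pieces with a space, e.g. "Binance Coin BNB".
    text = " ".join(tag.text for tag in td.select("p, span")).strip()
    # Step 3: split once from the right into name and abbreviation.
    name, symbol = text.rsplit(maxsplit=1)
    rows.append((name, symbol))

for name, symbol in rows:
    print("{:<30} {:<10}".format(name, symbol))
```

Because only the text of the matched tags is read, none of the class names or styling attributes end up in the result.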
