python3 webscraping - loop returns only one iteration
【Posted】2022-01-14 10:57:00 【Question】(python3 web scraping) I am trying to extract a table from HTML data and store it in a new DataFrame. I need all of the 'td' values, but when I iterate, the loop returns only the first row instead of every row. My code and output are below.
!pip install yfinance
!pip install pandas
!pip install requests
!pip install bs4
!pip install plotly
import yfinance as yf
import pandas as pd
import requests
from bs4 import BeautifulSoup
import plotly.graph_objects as go
from plotly.subplots import make_subplots
def make_graph(stock_data, revenue_data, stock):
    # Two stacked subplots sharing the x-axis: share price on top, revenue below
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=("Historical Share Price", "Historical Revenue"), vertical_spacing = .3)
    # Restrict both series to a fixed cutoff date
    stock_data_specific = stock_data[stock_data.Date <= '2021-06-14']
    revenue_data_specific = revenue_data[revenue_data.Date <= '2021-04-30']
    fig.add_trace(go.Scatter(x=pd.to_datetime(stock_data_specific.Date, infer_datetime_format=True), y=stock_data_specific.Close.astype("float"), name="Share Price"), row=1, col=1)
    fig.add_trace(go.Scatter(x=pd.to_datetime(revenue_data_specific.Date, infer_datetime_format=True), y=revenue_data_specific.Revenue.astype("float"), name="Revenue"), row=2, col=1)
    fig.update_xaxes(title_text="Date", row=1, col=1)
    fig.update_xaxes(title_text="Date", row=2, col=1)
    fig.update_yaxes(title_text="Price ($US)", row=1, col=1)
    fig.update_yaxes(title_text="Revenue ($US Millions)", row=2, col=1)
    fig.update_layout(showlegend=False,
                      height=900,
                      title=stock,
                      xaxis_rangeslider_visible=True)
    fig.show()
tsla = yf.Ticker("TSLA")
tsla
tesla_data = tsla.history(period="max")
tesla_data
tesla_data.reset_index(inplace=True)
tesla_data.head()
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html.parser')
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue
| | DATE | Revenue |
|---|---|---|
| 0 | 2008 | 15$ |
【Comments on the question】:
【Answer 1】: Find the main table using the appropriate tag and class
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue")
soup=BeautifulSoup(res.text,"html.parser")
table=soup.find("table",class_="historical_data_table table")
main_data=table.find_all("tr")
Now append each row's cell text to a list, building a list of lists that serves as the row data for the DataFrame
main_lst=[]
for i in main_data[1:]:
    lst=[data.get_text(strip=True) for data in i.find_all("td")]
    main_lst.append(lst)
Now use that data to build and display the DataFrame
import pandas as pd
df=pd.DataFrame(columns=["Date","Price"],data=main_lst)
df
Output:
Date Price
0 2020 $31,536
1 2019 $24,578
2 2018 $21,461
3 2017 $11,759
...
Using pandas in a one-liner
df=pd.read_html("https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue")
print(len(df))
print(df[0])
Output
6
Date Price
0 2020 $31,536
1 2019 $24,578
2 2018 $21,461
3 2017 $11,759
...
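The scraped values are strings such as "$31,536", whereas the `make_graph` function in the question calls `.astype("float")` on the revenue column. As a minimal sketch, assuming the values only ever contain a dollar sign and thousands separators, a small helper could clean them first. The name `parse_revenue` is hypothetical and not part of the original answers.

```python
def parse_revenue(text):
    """Convert a string such as '$31,536' to a float; return None for empty text."""
    # Drop the currency symbol and thousands separators, then trim whitespace.
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


# Sample values in the same format as the scraped column, plus an empty cell.
values = ["$31,536", "$24,578", ""]
parsed = [parse_revenue(v) for v in values]
print(parsed)
```

With a DataFrame, the same helper could be applied to the column via `df["Price"].map(parse_revenue)` before plotting.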
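【Comments】:
Wow, another way to solve the problem! Thank you very much :) I didn't know this approach, but thanks to you I'm starting to learn it now :)
【Answer 2】: What happens?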
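The loop itself works fine, but you append the data outside the loop, so the append runs only once and you always get just the result of the last iteration.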
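The effect can be reproduced without any scraping. The sketch below uses a hypothetical list of `(date, revenue)` tuples in place of the parsed table rows and plain lists in place of the DataFrame; the only difference between the two loops is the indentation of the append.

```python
# Hypothetical stand-in for the rows parsed from the page.
rows = [("2020", "$31,536"), ("2019", "$24,578"), ("2018", "$21,461")]

# Append placed after the loop body: it runs once, with the last row's values.
outside = []
for date, revenue in rows:
    current = {"Date": date, "Revenue": revenue}
outside.append(current)

# Append placed inside the loop body: it runs once per row.
inside = []
for date, revenue in rows:
    current = {"Date": date, "Revenue": revenue}
    inside.append(current)

print(len(outside), outside[0]["Date"])          # 1 2018
print(len(inside), [r["Date"] for r in inside])  # 3 ['2020', '2019', '2018']
```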
How to fix it?
Fix the indentation so that the append is inside the loop
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
    # Indented: runs once per row (DataFrame.append was removed in pandas 2.0, so this needs pandas < 2.0)
    tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html.parser')
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
    tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue
Output
| | Date | Revenue |
|---|---|---|
| 0 | 2020 | $31,536 |
| 1 | 2019 | $24,578 |
| 2 | 2018 | $21,461 |
| 3 | 2017 | $11,759 |
| 4 | 2016 | $7,000 |
| 5 | 2015 | $4,046 |
| 6 | 2014 | $3,198 |
| ... | ... | ... |
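`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above fails on current pandas versions. One possible alternative, sketched here rather than taken from the original answer, is to collect the rows in a plain list inside the loop and build the DataFrame once at the end. The inline `sample_html` is a hypothetical stand-in for the downloaded page so that the sketch runs offline, and the `len(col) < 2` guard is an added assumption about rows that may lack cells.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for requests.get(url).text, so the sketch runs offline.
sample_html = """
<table>
  <tbody>
    <tr><td>2020</td><td>$31,536</td></tr>
    <tr><td>2019</td><td>$24,578</td></tr>
    <tr><td>2018</td><td>$21,461</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    if len(col) < 2:
        # Skip rows without both a date cell and a revenue cell.
        continue
    # The append is inside the loop, so every row is collected.
    rows.append({"Date": col[0].text, "Revenue": col[1].text})

# Build the DataFrame once, after all rows have been gathered.
tesla_revenue = pd.DataFrame(rows, columns=["Date", "Revenue"])
print(tesla_revenue)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, which is what repeated appends did.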
【Comments】:
Thank you so much!! It really helped :) I struggled with this for hours, but thanks to you I now know the indentation was the problem :) Have a nice day!