python3 webscraping - loop returns only one iteration


【Title】: python3 webscraping - loop returns only one iteration 【Posted】: 2022-01-14 10:57:00 【Question】:

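(python3 web scraping) I am trying to extract a table from HTML data and store it in a new DataFrame. I need all of the 'td' values, but when I iterate, the loop returns only the first row instead of all rows. My code and output are below.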

!pip install yfinance
!pip install pandas
!pip install requests
!pip install bs4
!pip install plotly

import yfinance as yf
import pandas as pd
import requests
from bs4 import BeautifulSoup
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def make_graph(stock_data, revenue_data, stock):
 fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=("Historical Share Price", "Historical Revenue"), vertical_spacing = .3)
 stock_data_specific = stock_data[stock_data.Date <= '2021-06-14']
 revenue_data_specific = revenue_data[revenue_data.Date <= '2021-04-30']
 fig.add_trace(go.Scatter(x=pd.to_datetime(stock_data_specific.Date, infer_datetime_format=True), y=stock_data_specific.Close.astype("float"), name="Share Price"), row=1, col=1)
 fig.add_trace(go.Scatter(x=pd.to_datetime(revenue_data_specific.Date, infer_datetime_format=True), y=revenue_data_specific.Revenue.astype("float"), name="Revenue"), row=2, col=1)
 fig.update_xaxes(title_text="Date", row=1, col=1)
 fig.update_xaxes(title_text="Date", row=2, col=1)
 fig.update_yaxes(title_text="Price ($US)", row=1, col=1)
 fig.update_yaxes(title_text="Revenue ($US Millions)", row=2, col=1)
 fig.update_layout(showlegend=False,
 height=900,
 title=stock,
 xaxis_rangeslider_visible=True)
 fig.show()

tsla = yf.Ticker("TSLA")
tsla

tesla_data = tsla.history(period="max")
tesla_data


tesla_data.reset_index(inplace=True)
tesla_data.head()

url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data  = requests.get(url).text


soup = BeautifulSoup(html_data, 'html.parser')

tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'): 
 col = row.find_all("td")
 date = col[0].text
 revenue = col[1].text
tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue



DATE Revenue
0 2008 15$


【Answer 1】:

Find the main table using the appropriate tag and class.

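import requests
from bs4 import BeautifulSoup

res=requests.get("https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue")

soup=BeautifulSoup(res.text,"html.parser")
# locate the table by its tag and class attribute
table=soup.find("table",class_="historical_data_table table")
# collect every <tr> of that table; the first one is skipped in the next step
main_data=table.find_all("tr")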

Now append the cell text of each row to a list, building a list of lists that serves as the row data for the DataFrame.

main_lst=[]
for i in main_data[1:]:
    lst=[data.get_text(strip=True) for data in i.find_all("td")]
    main_lst.append(lst)

Now use that data to build the DataFrame df.

import pandas as pd
df=pd.DataFrame(columns=["Date","Price"],data=main_lst)
df

Output:

    Date    Price
0   2020    $31,536
1   2019    $24,578
2   2018    $21,461
3   2017    $11,759
...
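
The values in this output are strings such as `$31,536`. The `make_graph` function in the question calls `.astype("float")` on the revenue column, which would raise on strings like these. Below is a minimal sketch of one way to strip the symbols first. It assumes a frame shaped like the output above; the three sample rows are copied from that output, and the column name `Price` follows this answer.

```python
import pandas as pd

# Sample rows copied from the output above (not freshly scraped).
df = pd.DataFrame(
    {"Date": ["2020", "2019", "2018"],
     "Price": ["$31,536", "$24,578", "$21,461"]}
)

# Remove the dollar sign and thousands separators, then convert to float.
df["Price"] = df["Price"].str.replace(r"[$,]", "", regex=True).astype(float)

print(df["Price"].tolist())  # [31536.0, 24578.0, 21461.0]
```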

Using pandas in a one-liner (pd.read_html returns a list of DataFrames, one per table it finds on the page):

df=pd.read_html("https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue")
print(len(df))
print(df[0])

Output:

6

    Date    Price
0   2020    $31,536
1   2019    $24,578
2   2018    $21,461
3   2017    $11,759

...

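【Comments】:

Wow, another way to solve the problem! Thank you so much :) I didn't know about this approach, but thanks to you I have now started learning it :)

【Answer 2】: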

What happens?

The loop itself works fine, but you append the data outside the loop, so you only ever get the result of the last iteration.
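
The effect can be reproduced without any scraping. The following is a minimal sketch: `rows`, `outside`, and `inside` are hypothetical names, and the hard-coded values are taken from the output shown in this thread rather than from a live request.

```python
# Hypothetical stand-in for the parsed table rows.
rows = [("2020", "$31,536"), ("2019", "$24,578"), ("2018", "$21,461")]

# Append OUTSIDE the loop: each pass only reassigns date and revenue,
# so the single append afterwards sees the values from the final pass.
outside = []
for date, revenue in rows:
    pass  # nothing is stored during the loop
outside.append({"Date": date, "Revenue": revenue})

# Append INSIDE the loop: one record is stored per pass.
inside = []
for date, revenue in rows:
    inside.append({"Date": date, "Revenue": revenue})

print(len(outside), outside[0]["Date"])  # 1 2018
print(len(inside))                       # 3
```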

How to fix it?

Fix the indentation so that the append call sits inside the loop.

tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'): 
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
    # DataFrame.append requires pandas < 2.0 (it was removed in 2.0)
    tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue

Example

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data  = requests.get(url).text

soup = BeautifulSoup(html_data, 'html.parser')

tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'): 
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
    tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue

Output

Date Revenue
0 2020 $31,536
1 2019 $24,578
2 2018 $21,461
3 2017 $11,759
4 2016 $7,000
5 2015 $4,046
6 2014 $3,198
... ... ...
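
On current pandas releases `DataFrame.append` is no longer available (it was removed in pandas 2.0). One possible alternative, sketched here rather than taken from the answer, is to collect one dict per row in a plain list inside the loop and build the DataFrame once at the end. `SAMPLE_HTML` and `records` are hypothetical names; the inline HTML stands in for the downloaded page so the snippet runs offline, and its values are copied from the output above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for requests.get(url).text.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Year</th><th>Revenue</th></tr></thead>
  <tbody>
    <tr><td>2020</td><td>$31,536</td></tr>
    <tr><td>2019</td><td>$24,578</td></tr>
    <tr><td>2018</td><td>$21,461</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    # The append is indented under the loop, so every row is kept.
    records.append({"Date": col[0].text, "Revenue": col[1].text})

# Build the frame once from the collected rows.
tesla_revenue = pd.DataFrame(records, columns=["Date", "Revenue"])
print(tesla_revenue)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, which is what repeated `append` calls did.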

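【Comments】:

Thank you so much!! It really helped :) I struggled with this for hours, but thanks to you I now know the indentation was the problem :) Have a nice day!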
