Getting Empty DataFrame in pandas from table data

Title: Getting Empty DataFrame in pandas from table data
Posted: 2022-01-23 09:05:57

Question:

I can see the data when I print each row, but the pandas DataFrame comes out as: Empty DataFrame, Columns: [], Index: []

Script:

from bs4 import BeautifulSoup
import requests
import re
import json
import pandas as pd

url='http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1640132253903&t=XNAS:AAPL'

req=requests.get(url).text
#print(req)
data=re.search(r'jsonp1640132253903\((.*)\)',req).group(1)
json_data=json.loads(data)['componentData']
#print(json_data)
# with open('index.html','w') as f:
#     f.write(json_data)

soup=BeautifulSoup(json_data,'lxml')
for tr in soup.select('tr'):
    row_data=[td.get_text(strip=True) for td in tr.select('td,th') if td.text]
    if not row_data:
        continue

    if len(row_data) < 12:
        row_data = ['Particulars'] + row_data
    #print(row_data)
                         
df=pd.DataFrame(row_data)
print(df)

Printed output:

['Particulars', '2012-09', '2013-09', '2014-09', '2015-09', '2016-09', '2017-09', '2018-09', '2019-09', '2020-09', '2021-09', 'TTM']
['RevenueUSD Mil', '156,508', '170,910', '182,795', '233,715', '215,639', '229,234', '265,595', '260,174', '274,515', '365,817', '365,817']
['Gross Margin %', '43.9', '37.6', '38.6', '40.1', '39.1', '38.5', '38.3', '37.8', '38.2', '41.8', '41.8']
['Operating IncomeUSD Mil', '55,241', '48,999', '52,503', '71,230', '60,024', '61,344', '70,898', '63,930', '66,288', '108,949', '108,949']
['Operating Margin %', '35.3', '28.7', '28.7', '30.5', '27.8', '26.8', '26.7', '24.6', '24.1', '29.8', '29.8']
['Net IncomeUSD Mil', '41,733', '37,037', '39,510', '53,394', '45,687', '48,351', '59,531', '55,256', '57,411', '94,680', '94,680']
['Earnings Per ShareUSD', '1.58', '1.42', '1.61', '2.31', '2.08', '2.30', '2.98', '2.97', '3.28', '5.61', '5.61']

Expected output:

2012-09 2013-09 2014-09 2015-09 2016-09 2017-09 2018-09 2019-09 2020-09 2021-09 TTM

Revenue USD Mil 156,508 170,910 182,795 233,715 215,639 229,234 265,595 260,174 274,515 365,817 365,817
Gross Margin %  43.9    37.6    38.6    40.1    39.1    38.5    38.3    37.8    38.2    41.8    41.8
Operating Income USD Mil    55,241  48,999  52,503  71,230  60,024  61,344  70,898  63,930  66,288  108,949 108,949
Operating Margin %  35.3    28.7    28.7    30.5    27.8    26.8    26.7    24.6    24.1    29.8    29.8
Net Income USD Mil  41,733  37,037  39,510  53,394  45,687  48,351  59,531  55,256  57,411  94,680  94,680
Earnings Per Share USD  1.58    1.42    1.61    2.31    2.08    2.30    2.98    2.97    3.28    5.61    5.61
Dividends USD   0.09    0.41    0.45    0.49    0.55    0.60    0.68    0.75    0.80    0.85    0.85
Payout Ratio % *    —   27.4    28.5    22.3    24.8    26.5    23.7    25.1    23.7    16.3    15.2
Shares Mil  26,470  26,087  24,491  23,172  22,001  21,007  20,000  18,596  17,528  16,865  16,865
Book Value Per Share * USD  4.25    4.90    5.15    5.63    5.93    6.46    6.04    5.43    4.26    3.91    3.85
Operating Cash Flow USD Mil 50,856  53,666  59,713  81,266  65,824  63,598  77,434  69,391  80,674  104,038 104,038
Cap Spending USD Mil    -9,402  -9,076  -9,813  -11,488 -13,548 -12,795 -13,313 -10,495 -7,309  -11,085 -11,085
Free Cash Flow USD Mil  41,454  44,590  49,900  69,778  52,276  50,803  64,121  58,896  73,365  92,953  92,953
Free Cash Flow Per Share * USD  1.58    1.61    1.93    2.96    2.24    2.41    2.88    3.07    4.04    5.57    —
Working Capital USD Mil 19,111  29,628  5,083   8,768   27,863  27,831  14,473  57,101  38,321  9,355

Expected columns:

'Particulars', '2012-09', '2013-09', '2014-09', '2015-09', '2016-09', '2017-09', '2018-09', '2019-09', '2020-09', '2021-09', 'TTM'  

Answer 1:

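@QHarr's answer is by far the most straightforward, but if you want to know what went wrong in your code: you are reassigning the variable row_data on every iteration of the loop. After the loop it holds only the value from the last iteration, and that is all pd.DataFrame(row_data) receives. Since the result is an empty DataFrame, that last value was evidently an empty list.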
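
The effect can be reproduced without any scraping. The following is a minimal sketch, assuming the last table row yields no text (which would explain the empty result); parsed_rows is a hypothetical stand-in for the lists the loop produces.

```python
import pandas as pd

# Hypothetical parsed rows; the last one is empty, like a spacer <tr>.
parsed_rows = [
    ["Revenue", "260,174", "274,515"],
    ["Gross Margin %", "37.8", "38.2"],
    [],
]

for row_data in parsed_rows:
    pass  # each iteration rebinds row_data; earlier rows are not kept

# Only the value from the final iteration is left, and it is an empty list.
df_last_only = pd.DataFrame(row_data)
print(df_last_only.empty)  # True

# Collecting the non-empty rows in a list keeps all of them.
kept = [r for r in parsed_rows if r]
df_all = pd.DataFrame(kept)
print(df_all.shape)  # (2, 3)
```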

To make your code work, store each row as an element of a list. Then build the DataFrame by passing that list of rows, together with the column names, to pd.DataFrame.

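data = []       # one list per table row
columns = None  # set when the header row is found
soup=BeautifulSoup(json_data,'lxml')
for tr in soup.select('tr'):
    row_data=[td.get_text(strip=True) for td in tr.select('td,th') if td.text]
    if not row_data:
        continue
    elif len(row_data) < 12:
        # the header row has only 11 items (no label), so prepend one
        columns = ['Particulars'] + row_data
    else:
        data.append(row_data)

# build the DataFrame once, after all rows have been collected
df=pd.DataFrame(data, columns=columns)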

Result:

>>> df
                      Particulars  2012-09  2013-09  2014-09  2015-09  2016-09  2017-09  2018-09  2019-09  2020-09  2021-09      TTM
0                  RevenueUSD Mil  156,508  170,910  182,795  233,715  215,639  229,234  265,595  260,174  274,515  365,817  365,817
1                  Gross Margin %     43.9     37.6     38.6     40.1     39.1     38.5     38.3     37.8     38.2     41.8     41.8
2         Operating IncomeUSD Mil   55,241   48,999   52,503   71,230   60,024   61,344   70,898   63,930   66,288  108,949  108,949
3              Operating Margin %     35.3     28.7     28.7     30.5     27.8     26.8     26.7     24.6     24.1     29.8     29.8
4               Net IncomeUSD Mil   41,733   37,037   39,510   53,394   45,687   48,351   59,531   55,256   57,411   94,680   94,680
5           Earnings Per ShareUSD     1.58     1.42     1.61     2.31     2.08     2.30     2.98     2.97     3.28     5.61     5.61
6                    DividendsUSD     0.09     0.41     0.45     0.49     0.55     0.60     0.68     0.75     0.80     0.85     0.85
7                Payout Ratio % *        —     27.4     28.5     22.3     24.8     26.5     23.7     25.1     23.7     16.3     15.2
8                       SharesMil   26,470   26,087   24,491   23,172   22,001   21,007   20,000   18,596   17,528   16,865   16,865
9       Book Value Per Share *USD     4.25     4.90     5.15     5.63     5.93     6.46     6.04     5.43     4.26     3.91     3.85
10     Operating Cash FlowUSD Mil   50,856   53,666   59,713   81,266   65,824   63,598   77,434   69,391   80,674  104,038  104,038
11            Cap SpendingUSD Mil   -9,402   -9,076   -9,813  -11,488  -13,548  -12,795  -13,313  -10,495   -7,309  -11,085  -11,085
12          Free Cash FlowUSD Mil   41,454   44,590   49,900   69,778   52,276   50,803   64,121   58,896   73,365   92,953   92,953
13  Free Cash Flow Per Share *USD     1.58     1.61     1.93     2.96     2.24     2.41     2.88     3.07     4.04     5.57        —
14         Working CapitalUSD Mil   19,111   29,628    5,083    8,768   27,863   27,831   14,473   57,101   38,321    9,355        —
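
To try the same row-collecting approach without a network request, here is a self-contained sketch. The html string is a made-up miniature of the real table, N_COLS replaces the hard-coded 12 because this sample has only four columns, and the built-in 'html.parser' is used instead of 'lxml' purely to avoid an extra dependency; all three are assumptions for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for json_data: a tiny table shaped like the real one.
html = """
<table>
  <tr><th></th><th>2019-09</th><th>2020-09</th><th>TTM</th></tr>
  <tr><th>Revenue</th><td>260,174</td><td>274,515</td><td>365,817</td></tr>
  <tr><th>Gross Margin %</th><td>37.8</td><td>38.2</td><td>41.8</td></tr>
  <tr><td></td></tr>
</table>
"""

N_COLS = 4  # label column + three value columns in this sample

rows = []
columns = None
soup = BeautifulSoup(html, "html.parser")
for tr in soup.select("tr"):
    # keep only cells that contain text
    cells = [c.get_text(strip=True) for c in tr.select("td,th") if c.text]
    if not cells:
        continue  # spacer row with no text
    if len(cells) < N_COLS:
        columns = ["Particulars"] + cells  # header row lacks the label cell
    else:
        rows.append(cells)

df = pd.DataFrame(rows, columns=columns)
print(df)
```

The values stay as strings here (for example '365,817'), just as in the result shown above.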

Answer 2:

Use read_html to create the DataFrame, then drop the rows in which every value is NA.

json_data=json.loads(data)['componentData']
pd.read_html(json_data)[0].dropna(axis=0, how='all')
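
A self-contained variant of the same idea is sketched below. The html string is a hypothetical miniature of the real markup. Wrapping it in StringIO is a choice made here because recent pandas versions (2.1 and later) warn when read_html is given a literal HTML string; read_html also needs a parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for json_data, including one completely empty row.
html = """<table>
<tr><th>Particulars</th><th>2020-09</th><th>TTM</th></tr>
<tr><td>Revenue</td><td>274,515</td><td>365,817</td></tr>
<tr><td></td><td></td><td></td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> found.
tables = pd.read_html(StringIO(html))

# Drop rows in which every cell is missing.
df = tables[0].dropna(axis=0, how="all")
print(df)
```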
