pandas with JSON, HTML, and Excel

Posted 2021-04-30, Young's Programming Diary


JSON

As a quick refresher: for JSON, the loads function in Python's standard-library json module converts a JSON string into a dict, and dumps converts a dict back into a str.

In [280]: import json

In [278]: obj = """
     ...: {"name": "Wes",
     ...:  "places_lived": ["United States", "Spain", "Germany"],
     ...:  "pet": null,
     ...:  "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
     ...:
     ...:               {"name": "Katie", "age": 38,
     ...:                "pets": ["Sixes", "Stache", "Cisco"]}]
     ...: }
     ...: """

In [279]: type(obj)
Out[279]: str

In [281]: a = json.loads(obj)

In [283]: type(a)
Out[283]: dict

In [284]: b = json.dumps(a)

In [286]: type(b)
Out[286]: str
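
The session above only checks the resulting types. As a minimal sketch of what the conversion does to the values, the snippet below uses a shortened, hand-written JSON string (an assumption for illustration, not the full obj from the session) to show how JSON types map to Python types and that a dumps/loads round trip preserves the data.

```python
import json

# Shortened sample string, written for this illustration.
raw = '{"name": "Wes", "pet": null, "places_lived": ["United States", "Spain"]}'

parsed = json.loads(raw)
# JSON null becomes None, JSON arrays become lists, JSON objects become dicts.
assert parsed["pet"] is None
assert isinstance(parsed["places_lived"], list)

# indent pretty-prints; sort_keys makes the key order deterministic.
pretty = json.dumps(parsed, indent=2, sort_keys=True)
print(pretty)

# Round trip: parsing the dumped string gives back an equal dict.
round_tripped = json.loads(pretty)
```

The indent and sort_keys arguments are optional; they are used here only to make the printed output easier to read.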

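How you convert a JSON object (or a group of them) into a DataFrame or another structure convenient for analysis is up to you. The simplest approach is to pass a list of dicts (the former JSON objects) to the DataFrame constructor and select a subset of the data fields (pd here is pandas, imported with import pandas as pd):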

In [289]: df1 = pd.DataFrame(a['siblings'], columns=['name', 'age'])  # columns=['name', 'age'] keeps only these fields

In [290]: df1
Out[290]:
    name  age
0  Scott   30
1  Katie   38

So far we have built DataFrames from a dict or a two-dimensional array, but a list of dicts works just as well; a['siblings'] in the code above is such a list.

In [292]: li = [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
     ...:  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]
     ...:

In [293]: df2 = pd.DataFrame(li)    # without columns, the column names are inferred from the dict keys

In [294]: df2
Out[294]:
   age   name                    pets
0   30  Scott            [Zeus, Zuko]
1   38  Katie  [Sixes, Stache, Cisco]

In [295]: df2.columns
Out[295]: Index(['age', 'name', 'pets'], dtype='object')
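
The pets column above still holds Python lists, which are awkward to filter or group on. One possible way to flatten it (not something the original session does) is DataFrame.explode, which assumes pandas 0.25 or later. The sketch below rebuilds the same li list so that it runs on its own.

```python
import pandas as pd

li = [
    {"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
    {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]},
]
df2 = pd.DataFrame(li)

# explode repeats each row once per list element; the original index
# labels are repeated too, so reset_index gives a clean 0..n-1 index.
pets_long = df2.explode("pets").reset_index(drop=True)
print(pets_long)
```

Each pet now sits on its own row next to its owner's name and age, so ordinary filtering and groupby operations apply.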

For JSON data in certain regular layouts, read_json can load it directly into a DataFrame, and to_json exports data from pandas back to JSON.

In [299]: !type examples\example.json
[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]

In [300]: data = pd.read_json(r'examples\example.json')   # raw string, so the backslash is not read as an escape

In [301]: data
Out[301]:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

In [302]: type(data)
Out[302]: pandas.core.frame.DataFrame

In [303]: data.columns
Out[303]: Index(['a', 'b', 'c'], dtype='object')

In [305]: js = data.to_json()

In [306]: js
Out[306]: '{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}'

In [308]: type(js)
Out[308]: str

In [307]: pd.read_json(js)
Out[307]:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
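
Two details of the round trip above are worth a sketch. First, to_json defaults to a column-oriented layout, while orient="records" produces the row-per-object layout of example.json. Second, as far as I know, recent pandas releases (2.1 onward) warn when a literal JSON string is passed to read_json, so the string is wrapped in StringIO here. The DataFrame is rebuilt inline so the snippet does not depend on the example file.

```python
from io import StringIO

import pandas as pd

data = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# orient="records" writes one JSON object per row, matching the layout
# of examples/example.json shown earlier.
records_js = data.to_json(orient="records")
print(records_js)

# Wrapping the string in StringIO hands read_json a file-like object.
restored = pd.read_json(StringIO(records_js), orient="records")
print(restored)
```

Note that the records layout drops the index, so read_json assigns a fresh 0..n-1 index on the way back in.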

HTML

pandas has a built-in function, read_html, which uses lxml and Beautiful Soup to automatically parse tables in HTML files into DataFrame objects.

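pandas.read_html has a number of options. By default it searches for and tries to parse all tabular data contained within <table> tags. The result is a list of DataFrame objects: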

In [14]: tables = pd.read_html('examples/fdic_failed_bank_list.html')

In [20]: len(tables)
Out[20]: 1

In [18]: pd.options.display.max_rows=10

In [19]: tables
Out[19]:
[                             Bank Name        ...               Updated Date
 0                          Allied Bank        ...          November 17, 2016
 1         The Woodbury Banking Company        ...          November 17, 2016
 2               First CornerStone Bank        ...          September 6, 2016
 3                   Trust Company Bank        ...          September 6, 2016
 4           North Milwaukee State Bank        ...              June 16, 2016
 ..                                 ...        ...                        ...
 542                 Superior Bank, FSB        ...            August 19, 2014
 543                Malta National Bank        ...          November 18, 2002
 544    First Alliance Bank & Trust Co.        ...          February 18, 2003
 545  National State Bank of Metropolis        ...             March 17, 2005
 546                   Bank of Honolulu        ...             March 17, 2005

 [547 rows x 7 columns]]


In [21]: type(tables)   # a list whose elements are DataFrames
Out[21]: list

Binary data formats

to_pickle stores a DataFrame on disk in binary (pickle) form, and read_pickle reads that binary data back directly.

In [22]: frame = pd.read_csv(r'examples\ex1.csv')

In [24]: frame
Out[24]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [31]: frame.to_pickle(r'examples\test_pickle')

In [32]: pd.read_pickle('examples/test_pickle')
Out[32]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
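
As a minimal, self-contained sketch of the same round trip, the snippet below writes to a temporary directory instead of the examples folder. The DataFrame contents and the file name frame.pkl are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame(
    {"a": [1, 5, 9], "b": [2, 6, 10], "message": ["hello", "world", "foo"]}
)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "frame.pkl")  # hypothetical file name
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame keeps the original dtypes and index.
print(restored.dtypes)
```

Pickle files can execute code when loaded, so read_pickle is best reserved for files you created yourself or otherwise trust.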

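Excel files

There are two ways to read and write Excel files: through ExcelFile and ExcelWriter objects, or by passing a file path straight to read_excel and to_excel. Let's look at each in turn.

In [47]: xlsx = pd.ExcelFile('examples/ex1.xlsx')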

In [48]: frame = pd.read_excel(xlsx,'Sheet1')

In [49]: frame
Out[49]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [50]: type(frame)
Out[50]: pandas.core.frame.DataFrame

In [60]: writer = pd.ExcelWriter(r'C:\Users\admin\Desktop\test2.xlsx')

In [62]: frame.to_excel(writer, sheet_name='Sheet1')

In [63]: writer.close()   # writes the file; writer.save() was removed in pandas 2.0

In [66]: frame = pd.read_excel(r'examples\ex1.xlsx','Sheet1')

In [67]: frame
Out[67]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [68]: frame.to_excel(r'C:\Users\admin\Desktop\test3.xlsx')

Interacting with Web APIs

To be honest, I don't fully understand this part yet; it is still rather hazy to me.

In [113]: import requests

In [114]: url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [115]: resp = requests.get(url)

In [116]: resp
Out[116]: <Response [200]>

In [117]: data = resp.json()    # json() parses the response body into native Python objects (here, a list of dicts)

In [118]: data[0]['title']
Out[118]: 'Period does not round down for frequencies less that 1 hour'

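Each element in data is a dict containing all of the data found on a GitHub issue page (except for the comments). We can pass data directly to DataFrame and extract the fields of interest: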

In [119]: issues = pd.DataFrame(data, columns=['number', 'title',
   .....:                                      'labels', 'state'])

In [120]: issues
Out[120]:
    number                                              title  \
0    17666  Period does not round down for frequencies les...   
1    17665           DOC: improve docstring of function where   
2    17664               COMPAT: skip 32-bit test on int repr   
3    17662                          implement Delegator class
4    17654  BUG: Fix series rename called with str alterin...   
..     ...                                                ...   
25   17603  BUG: Correctly localize naive datetime strings...   
26   17599                     core.dtypes.generic --> cython   
27   17596   Merge cdate_range functionality into bdate_range   
28   17587  Time Grouper bug fix when applied for list gro...   
29   17583  BUG: fix tz-aware DatetimeIndex + TimedeltaInd...   
                                               labels state  
0                                                  []  open  
1   [{'id': 134699, 'url': 'https://api.github.com...  open  
2   [{'id': 563047854, 'url': 'https://api.github....  open  
3                                                  []  open  
4   [{'id': 76811, 'url': 'https://api.github.com/...  open  
..                                                ...   ...  
25  [{'id': 76811, 'url': 'https://api.github.com/...  open  
26  [{'id': 49094459, 'url': 'https://api.github.c...  open  
27  [{'id': 35818298, 'url': 'https://api.github.c...  open  
28  [{'id': 233160, 'url': 'https://api.github.com...  open  
29  [{'id': 76811, 'url': 'https://api.github.com/...  open  
[30 rows x 4 columns]
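
In the output above, the labels column holds a list of dicts for each issue, which is hard to read. One possible next step (not part of the original session) is to reduce each list to the label names. The sketch below uses a small hand-written sample instead of a live request so that it runs offline. The sample values are hypothetical; only the name key of each label is taken from the shape of the GitHub response.

```python
import pandas as pd

# Hypothetical, hand-written sample shaped like the GitHub issues payload;
# the real response carries many more fields per issue and per label.
data = [
    {"number": 1, "title": "First issue", "state": "open", "labels": []},
    {"number": 2, "title": "Second issue", "state": "open",
     "labels": [{"id": 10, "name": "Bug"}, {"id": 11, "name": "Docs"}]},
]

issues = pd.DataFrame(data, columns=["number", "title", "labels", "state"])

# Replace each list of label dicts with a list of plain label names.
issues["label_names"] = issues["labels"].apply(
    lambda labels: [label["name"] for label in labels]
)
print(issues[["number", "label_names"]])
```

With a live response, the same apply call should work on the DataFrame built from resp.json(), assuming every label dict carries a name key.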
