pandas with JSON, HTML, and Excel
Posted by Young的编程日记
JSON
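First, a quick review. For JSON, the json module in Python's standard library provides json.loads, which converts a JSON-formatted string into a Python object (a dict in this example), and json.dumps, which converts a Python object back into a str.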
In [280]: import json
In [278]: obj = """
...: {"name": "Wes",
...: "places_lived": ["United States", "Spain", "Germany"],
...: "pet": null,
...: "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
...:
...: {"name": "Katie", "age": 38,
...: "pets": ["Sixes", "Stache", "Cisco"]}]
...: }
...: """
In [279]: type(obj)
Out[279]: str
In [281]: a = json.loads(obj)
In [283]: type(a)
Out[283]: dict
In [284]: b = json.dumps(a)
In [286]: type(b)
Out[286]: str
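As a small complement to the session above, here is a minimal sketch of the same round trip using only the standard library. The sample string is a shortened, made-up variant of obj, used purely for illustration.

```python
import json

# A small JSON document; JSON null is written without quotes.
text = '{"name": "Wes", "pet": null, "places_lived": ["United States", "Spain"]}'

parsed = json.loads(text)          # str -> Python objects
print(type(parsed))                # a dict, because the top-level JSON value is an object
print(parsed["pet"] is None)       # JSON null becomes Python None

# A top-level JSON array becomes a list rather than a dict.
as_list = json.loads("[1, 2, 3]")
print(type(as_list))

# dumps goes the other way; loading the dumped text gives back an equal object.
round_tripped = json.loads(json.dumps(parsed))
print(round_tripped == parsed)
```

The type returned by json.loads depends on the top-level JSON value, so it is worth checking before indexing into the result.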
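How you convert a JSON object (or a list of them) into a DataFrame or some other structure convenient for analysis is up to you. The simplest approach is to pass a list of dicts (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields (pandas is assumed to be imported as pd):
In [289]: df1 = pd.DataFrame(a['siblings'], columns=['name', 'age'])  # columns=['name', 'age'] keeps only these fields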
In [290]: df1
Out[290]:
name age
0 Scott 30
1 Katie 38
So far we have built DataFrames from a dict or a two-dimensional array, but a list of dicts works as well; a['siblings'] in the code above is exactly such a list.
In [292]: li = [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
...: {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]
...:
In [293]: df2 = pd.DataFrame(li)  # without columns, the column names are taken from the dict keys
In [294]: df2
Out[294]:
age name pets
0 30 Scott [Zeus, Zuko]
1 38 Katie [Sixes, Stache, Cisco]
In [295]: df2.columns
Out[295]: Index(['age', 'name', 'pets'], dtype='object')
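The pets column above holds Python lists, which are awkward to filter or group on. One possible way to flatten them, shown here as a sketch rather than the approach the original post uses, is DataFrame.explode, which produces one row per list element. The variable names below are illustrative.

```python
import pandas as pd

siblings = [
    {"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
    {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]},
]

# Selecting a subset of fields: keys not listed in columns are dropped.
people = pd.DataFrame(siblings, columns=["name", "age"])
print(people)

# One row per pet; the name and age of the owner are repeated on each row.
pets = pd.DataFrame(siblings).explode("pets").reset_index(drop=True)
print(pets)
```

explode repeats the original index for every element, so reset_index(drop=True) is used here to get a clean 0..n-1 index.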
For JSON data in certain standard layouts, pd.read_json can read it directly into a DataFrame, and the to_json method writes pandas data back out as JSON.
In [299]: !type examples\example.json
[{"a": 1, "b": 2, "c": 3},
{"a": 4, "b": 5, "c": 6},
{"a": 7, "b": 8, "c": 9}]
In [300]: data = pd.read_json(r'examples\example.json')  # raw string so the backslash is not treated as an escape
In [301]: data
Out[301]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
In [302]: type(data)
Out[302]: pandas.core.frame.DataFrame
In [303]: data.columns
Out[303]: Index(['a', 'b', 'c'], dtype='object')
In [305]: js = data.to_json()
In [306]: js
Out[306]: '{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}'
In [308]: type(js)
Out[308]: str
In [307]: pd.read_json(js)
Out[307]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
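The session above relies on the default layout of to_json, which groups values by column. The sketch below shows the orient="records" layout as an alternative, which matches the shape of example.json. It wraps the JSON text in StringIO because newer pandas releases warn when a literal JSON string is passed to read_json. The inline raw string stands in for the file contents.

```python
import json
from io import StringIO

import pandas as pd

# Same content as examples\example.json, inlined so the sketch is self-contained.
raw = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}, {"a": 7, "b": 8, "c": 9}]'

data = pd.read_json(StringIO(raw))

by_column = data.to_json()                   # default for a DataFrame: {column -> {index -> value}}
by_record = data.to_json(orient="records")   # a JSON array with one object per row

print(by_column)
print(by_record)

# Reading the records layout back gives the same values.
restored = pd.read_json(StringIO(by_record), orient="records")
print(restored.values.tolist() == data.values.tolist())

# The records layout carries the same data as the original text.
print(json.loads(by_record) == json.loads(raw))
```

The records layout is usually easier for other programs to consume, at the cost of dropping the index.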
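HTML
pandas has a built-in function, read_html, which uses lxml and Beautiful Soup to automatically parse the tables in an HTML file into DataFrame objects.
pandas.read_html has a number of options; by default it searches for and attempts to parse all tabular data contained within <table> tags. The result is a list of DataFrame objects:
In [14]: tables = pd.read_html('examples/fdic_failed_bank_list.html')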
In [20]: len(tables)
Out[20]: 1
In [18]: pd.options.display.max_rows=10
In [19]: tables
Out[19]:
[ Bank Name ... Updated Date
0 Allied Bank ... November 17, 2016
1 The Woodbury Banking Company ... November 17, 2016
2 First CornerStone Bank ... September 6, 2016
3 Trust Company Bank ... September 6, 2016
4 North Milwaukee State Bank ... June 16, 2016
.. ... ... ...
542 Superior Bank, FSB ... August 19, 2014
543 Malta National Bank ... November 18, 2002
544 First Alliance Bank & Trust Co. ... February 18, 2003
545 National State Bank of Metropolis ... March 17, 2005
546 Bank of Honolulu ... March 17, 2005
[547 rows x 7 columns]]
In [21]: type(tables)  # it is a list whose elements are DataFrames
Out[21]: list
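Binary data formats
The to_pickle method stores a DataFrame on disk in Python's binary pickle format, and pd.read_pickle reads that binary data straight back:
In [22]: frame = pd.read_csv(r'examples\ex1.csv')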
In [24]: frame
Out[24]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
In [31]: frame.to_pickle(r'examples\test_pickle')
In [32]: pd.read_pickle('examples/test_pickle')
Out[32]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
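To check that a pickle round trip really preserves a DataFrame, a sketch like the following compares the restored object with the original. It writes to a temporary directory instead of the examples folder, and the small frame is a cut-down stand-in for ex1.csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "frame.pkl"   # hypothetical file name
    frame.to_pickle(path)            # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back before the directory is removed

print(restored.equals(frame))                 # same values and index
print(restored.dtypes.equals(frame.dtypes))   # dtypes survive, unlike with CSV
```

Pickle files are best kept as a short-term format: they are tied to Python, and a pickle from an untrusted source should never be loaded.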
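Excel files
There are two ways to read and write Excel files; let's look at each in turn. The first goes through ExcelFile and ExcelWriter objects:
In [47]: xlsx = pd.ExcelFile('examples/ex1.xlsx')
In [48]: frame = pd.read_excel(xlsx, 'Sheet1')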
In [49]: frame
Out[49]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
In [50]: type(frame)
Out[50]: pandas.core.frame.DataFrame
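In [60]: writer = pd.ExcelWriter(r'C:\Users\admin\Desktop\test2.xlsx')
In [62]: frame.to_excel(writer, sheet_name='Sheet1')
In [63]: writer.close()  # writer.save() was removed in pandas 2.0; close() writes the file
The second way skips the intermediate objects and passes the file path directly to read_excel and to_excel:
In [66]: frame = pd.read_excel(r'examples\ex1.xlsx', 'Sheet1')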
In [67]: frame
Out[67]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
In [68]: frame.to_excel(r'C:\Users\admin\Desktop\test3.xlsx')
Interacting with Web APIs
I honestly don't fully understand this part yet; it is still a bit hazy to me.
In [113]: import requests
In [114]: url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
In [115]: resp = requests.get(url)
In [116]: resp
Out[116]: <Response [200]>
In [117]: data = resp.json()  # parses the JSON response body into native Python objects (here, a list of dicts)
In [118]: data[0]['title']
Out[118]: 'Period does not round down for frequencies less that 1 hour'
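Each element of data is a dict containing all of the data found on a GitHub issue page (except for the comments). We can pass data directly to DataFrame and extract the fields of interest: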
In [119]: issues = pd.DataFrame(data, columns=['number', 'title',
.....: 'labels', 'state'])
In [120]: issues
Out[120]:
number title \
0 17666 Period does not round down for frequencies les...
1 17665 DOC: improve docstring of function where
2 17664 COMPAT: skip 32-bit test on int repr
3 17662 implement Delegator class
4 17654 BUG: Fix series rename called with str alterin...
.. ... ...
25 17603 BUG: Correctly localize naive datetime strings...
26 17599 core.dtypes.generic --> cython
27 17596 Merge cdate_range functionality into bdate_range
28 17587 Time Grouper bug fix when applied for list gro...
29 17583 BUG: fix tz-aware DatetimeIndex + TimedeltaInd...
labels state
0 [] open
1 [{'id': 134699, 'url': 'https://api.github.com... open
2 [{'id': 563047854, 'url': 'https://api.github.... open
3 [] open
4 [{'id': 76811, 'url': 'https://api.github.com/... open
.. ... ...
25 [{'id': 76811, 'url': 'https://api.github.com/... open
26 [{'id': 49094459, 'url': 'https://api.github.c... open
27 [{'id': 35818298, 'url': 'https://api.github.c... open
28 [{'id': 233160, 'url': 'https://api.github.com... open
29 [{'id': 76811, 'url': 'https://api.github.com/... open
[30 rows x 4 columns]
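To make the shape of the data easier to see without hitting the network, the sketch below uses a small hand-written payload in place of resp.json(). The records are made up and only mimic the keys used above. It also shows one possible follow-up step: reducing each labels cell, which is a list of dicts, to a list of label names.

```python
import pandas as pd

# Hypothetical stand-in for resp.json(): invented records with the same keys as above.
# With a live request, resp.raise_for_status() could be called first to fail on HTTP errors.
payload = [
    {"number": 1, "title": "Example bug", "labels": [], "state": "open", "body": "..."},
    {"number": 2, "title": "Example docs fix",
     "labels": [{"id": 10, "name": "Docs"}], "state": "open", "body": "..."},
]

# columns picks the fields of interest; other keys such as "body" are dropped.
issues = pd.DataFrame(payload, columns=["number", "title", "labels", "state"])

# Keep only the name of each label so the column is readable at a glance.
issues["label_names"] = issues["labels"].apply(
    lambda labels: [label["name"] for label in labels]
)
print(issues[["number", "title", "label_names", "state"]])
```

In other words, the API call only supplies a list of dicts; from there the DataFrame construction is the same as with a['siblings'] in the JSON section.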