Python for Data Analysis: Pandas Data Loading, Storage, and File Formats
Posted by 木东
Note: this series of articles is compiled from my own notes while working through the book *Python for Data Analysis*, so that I can review the material later.
1 pandas file-parsing functions
read_csv: reads delimited data; the default delimiter is a comma
read_table: reads delimited data; the default delimiter is a tab ("\t")
read_fwf: reads fixed-width column data (no delimiters)
read_clipboard: reads data from the clipboard (useful for converting web tables)
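As a quick sketch of the first two functions: read_csv and read_table differ only in their default separator, so they can read the same comma-separated text once the separator is given explicitly. The data below is a made-up example, read from an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# The same comma-separated text read two ways
text = "a,b,c\n1,2,3\n4,5,6\n"

df1 = pd.read_csv(io.StringIO(text))             # default sep=','
df2 = pd.read_table(io.StringIO(text), sep=",")  # default is '\t', so override it

print(df1.equals(df2))  # True: identical DataFrames
```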
1.1 Reading Excel data
import pandas as pd

file = 'D:\\example.xls'
df = pd.read_excel(file)  # name the result 'df', not 'pd', which would shadow the pandas module
df
Output:
1.1.1 Without a header row
df = pd.read_excel(file, header=None)
Output:
1.1.2 Setting column names
df = pd.read_excel(file, names=['Year', 'Name', 'Math', 'Chinese', 'English', 'Avg'])
Output:
1.1.3 Specifying an index column
df = pd.read_excel(file, index_col='姓名')
Output:
2 Reading CSV data
import pandas as pd

df = pd.read_csv('d:\\test.csv', engine='python')
df
Output:
import pandas as pd

df = pd.read_table('d:\\test.csv', engine='python', sep=',')  # read_table defaults to sep='\t'
df
Output:
import pandas as pd

df = pd.read_fwf('d:\\test.csv')  # read_fwf uses its own fixed-width parser, so engine= is not passed
df
Output:
3 Writing data out to text format
Writing data out to CSV format; the default delimiter is a comma.
import pandas as pd

df = pd.read_fwf('d:\\test.csv')
df.to_csv('d:\\test1.csv', encoding='gbk')
Output:
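A minimal round-trip sketch of to_csv, using a made-up frame and an in-memory buffer instead of a file on disk, to show what the written text actually looks like:

```python
import io
import pandas as pd

# Hypothetical small frame; io.StringIO stands in for a real file
df = pd.DataFrame({"name": ["A", "B"], "score": [90, 85]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False omits the row-number column
print(buf.getvalue())
# name,score
# A,90
# B,85
```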
4 Working with delimited formats manually
For files with a single-character delimiter, the csv module can be used directly.
import pandas as pd
import csv

file = 'D:\\test.csv'
df = pd.read_csv(file, engine='python')
df.to_csv('d:\\test1.csv', encoding='gbk', sep='/')

f = open('d:\\test1.csv', encoding='gbk')
reader = csv.reader(f, delimiter='/')  # match the '/' separator used when writing
for line in reader:
    print(line)
f.close()
Output:
4.1 Filling missing values
import pandas as pd
import csv

file = 'D:\\test.csv'
df = pd.read_csv(file, engine='python')
df.to_csv('d:\\test1.csv', encoding='gbk', sep='/', na_rep='NULL')  # na_rep replaces NaN on output

f = open('d:\\test1.csv', encoding='gbk')
reader = csv.reader(f, delimiter='/')
for line in reader:
    print(line)
f.close()
Output:
4.2 JSON
4.2.1 json.loads converts a JSON string into a Python object
import json

obj = """{
    "sucess": "1",
    "header": {"version": 0, "compress": false, "times": 0},
    "data": {
        "name": "BankForQuotaTerrace",
        "attributes": {"queryfound": "1", "numfound": "1", "reffound": "1"},
        "columnmeta": {
            "a0": "DATE", "a1": "DOUBLE", "a2": "DOUBLE", "a3": "DOUBLE",
            "a4": "DOUBLE", "a5": "DOUBLE", "a6": "DATE", "a7": "DOUBLE",
            "a8": "DOUBLE", "a9": "DOUBLE", "b0": "DOUBLE", "b1": "DOUBLE",
            "b2": "DOUBLE", "b3": "DOUBLE", "b4": "DOUBLE", "b5": "DOUBLE"
        },
        "rows": [[
            "2017-10-28", 109.8408691012081, 109.85566362201733,
            0.014794520809225841, 1.0, null, "", 5.636678251676443,
            5.580869556115291, 37.846934105222246, null, null, null,
            null, null, 0.061309012867495856
        ]]
    }
}"""
result = json.loads(obj)
result
Output:
4.2.2 json.dumps converts a Python object back into a JSON string
result = json.loads(obj)
asjson = json.dumps(result)
asjson
Output:
4.2.3 Converting JSON data into a DataFrame
from pandas import DataFrame
import json

result = json.loads(obj)  # obj is the JSON string from 4.2.1
jsondf = DataFrame(result['data'], columns=['name', 'attributes', 'columnmeta'],
                   index=[1, 2, 3])  # use a list, not a set: set ordering is not guaranteed
jsondf
Output:
Note: attributes and columnmeta are themselves nested objects; I will come back to flattening them later.
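One approach to the nesting problem, assuming a recent pandas version: pd.json_normalize flattens nested dicts into dotted column names. A small record shaped like the 'data' section above:

```python
import pandas as pd

# A hypothetical nested record, similar in shape to the 'data' section above
data = {
    "name": "BankForQuotaTerrace",
    "attributes": {"queryfound": "1", "numfound": "1", "reffound": "1"},
}

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(data)
print(list(flat.columns))
# ['name', 'attributes.queryfound', 'attributes.numfound', 'attributes.reffound']
```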
4.3 XML and HTML
Scraping the table data from a 同花顺 (10jqka) page and converting it into a DataFrame.
I did not scrape the paginated data here; feel free to try that yourself. My goal was just to convert scraped data into a DataFrame.
The code is as follows:
from pandas import DataFrame
import requests
from bs4 import BeautifulSoup

url = 'http://data.10jqka.com.cn/market/longhu/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
s = soup.find_all('div', 'yyb')

# Collect the column names for the DataFrame
def getcol():
    col = []
    for i in s:
        for thead in i.find_all('thead'):
            for th in thead.find_all('th'):
                col.append(th.text.strip('\n'))
    return col

# Collect the row values for the DataFrame
def getvalues():
    rows = []
    for j in s:
        for tbody in j.find_all('tbody'):
            for tr in tbody.find_all('tr'):
                rows.append([td.text for td in tr.find_all('td')])
    return rows

if __name__ == "__main__":
    cols = getcol()
    values = getvalues()
    data = DataFrame(values, columns=cols)
    print(data)
Output:
4.4 Binary data formats
Older pandas versions saved objects with a save method and read them back with pandas.load; in current versions these have been replaced by to_pickle and read_pickle.
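A minimal round-trip sketch with the current to_pickle/read_pickle pair, writing a made-up frame to a temporary directory:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    df.to_pickle(path)               # replaces the old DataFrame.save
    restored = pd.read_pickle(path)  # replaces the old pandas.load

print(restored.equals(df))  # True
```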
4.5 HDF5格式
HDF是层次型数据格式,HDF5文件含一个文件系统式的节点结构,支持多个数据集、元数据,可以高效的分块读写。Python中的HDF5库有2个接口:PyTables和h5py。
海量数据应该考虑用这个,现在我没用着,先不研究了。
4.6 Using HTML and Web APIs
import requests
import pandas as pd
from pandas import DataFrame
import json
url = \'http://t.weather.sojson.com/api/weather/city/101030100\'
resp = requests.get(url)
data = json.loads(resp.text)  # data is a dict
jsondf = DataFrame(data['cityInfo'], columns=['city', 'cityId', 'parent', 'updateTime'], index=[1])  # instantiate the DataFrame
jsondf
Output:
4.7 Using databases
4.7.1 sqlite3
import sqlite3
import pandas as pd

con = sqlite3.connect('test.db')  # connect() requires a database path ('test.db' here is a placeholder); con is a connection object
pd.read_sql('select * from test', con)  # read_frame was removed from pandas; read_sql is the current API
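A self-contained sketch of the same idea, using an in-memory SQLite database with made-up data so there is no dependency on an existing file:

```python
import sqlite3
import pandas as pd

# In-memory database so the sketch is self-contained; a file path works the same way
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (name TEXT, score REAL)")
con.executemany("INSERT INTO test VALUES (?, ?)", [("A", 90.0), ("B", 85.0)])
con.commit()

# pd.read_sql replaces the older pandas.io.sql.read_frame
df = pd.read_sql("SELECT * FROM test", con)
print(df)
con.close()
```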
4.7.2 MongoDB
Not installed yet; setting this aside for now.