python之pandas简介
Posted wztshine
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了python之pandas简介相关的知识,希望对你有一定的参考价值。
参考自:https://www.php.cn/python-tutorials-427622.html
- 安装第三方库
pip install numpy
pip install pandas
pip install xlrd # 操作excel时会用到
- 构造数据 DataFrame
第一个参数:
? 字典:作为每列数据,字典的键作为列名,字典的值作为列的数据
? 嵌套列表:[[1,2,3],[4,5,6]],则每个内嵌的列表作为每列值(需要搭配第二个参数来作为列名)
第二个参数:
? 列表:列表的每个元素作为列名
import numpy as np
import pandas as pd
def println(*args):
print(*args,end=‘
==================================结果分割线==================================
‘)
df = pd.DataFrame(
{
"id":[1001,1002,1003,1004,1005,1006],
"date":pd.date_range(‘20130102‘, periods=6),
"city":[‘Beijing ‘, ‘SH‘, ‘ guangzhou‘, ‘shanghai‘, ‘sh ‘, ‘BEIJING ‘],
"age":[23,44,54,32,34,32],
"category":[‘100-A‘,‘100-B‘,‘110-A‘,‘110-C‘,‘210-A‘,‘130-F‘],
"price":[1200,np.nan,2133,5433,np.nan,4432] # np.nan是空数据的意思
},
columns =[‘id‘,‘date‘,‘city‘,‘category‘,‘age‘,‘price‘]) # 此行的列表可以不写,因为上面的字典的键已经当做列名了,如果写了,则以这行的数据作为列名,字典的键就被无视了
- 查看数据类型,基础信息,是否为空
println(df,type(df)) # 打印数据
println(df.shape) # 打印数据的规模:(几行,几列)
println(df.info()) # 显示数据信息,类型等。
println(df.dtypes) # 显示数据类型
println(df[‘id‘].dtype) # 显示 id列 的类型
println(df.isnull()) # 实现是否为空
println(df[‘price‘].isnull()) # 显示 price列 是否有空数据
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shanghai 110-C 32 5433.0
4 1005 2013-01-06 sh 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 <class ‘pandas.core.frame.DataFrame‘>
==================================结果分割线==================================
(6, 6)
==================================结果分割线==================================
<class ‘pandas.core.frame.DataFrame‘>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 6 non-null int64
1 date 6 non-null datetime64[ns]
2 city 6 non-null object
3 category 6 non-null object
4 age 6 non-null int64
5 price 4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 304.0+ bytes
None
==================================结果分割线==================================
id int64
date datetime64[ns]
city object
category object
age int64
price float64
dtype: object
==================================结果分割线==================================
int64
==================================结果分割线==================================
id date city category age price
0 False False False False False False
1 False False False False False True
2 False False False False False False
3 False False False False False False
4 False False False False False True
5 False False False False False False
==================================结果分割线==================================
0 False
1 True
2 False
3 False
4 True
5 False
Name: price, dtype: bool
- 数据去重,数据,列名,前10行,后10行
println(df[‘id‘].unique()) # 获取去重后的id列的数据
println(df.values) # 全表的数据
println(df.columns) # 所有的列名
println(df.head()) #默认前10行数据
println(df.tail()) #默认后10行数据
D:Softwarepython37.32python.exe C:/Users/.wang/Desktop/pan.py
[1001 1002 1003 1004 1005 1006]
==================================结果分割线==================================
[[1001 Timestamp(‘2013-01-02 00:00:00‘) ‘Beijing ‘ ‘100-A‘ 23 1200.0]
[1002 Timestamp(‘2013-01-03 00:00:00‘) ‘SH‘ ‘100-B‘ 44 nan]
[1003 Timestamp(‘2013-01-04 00:00:00‘) ‘ guangzhou‘ ‘110-A‘ 54 2133.0]
[1004 Timestamp(‘2013-01-05 00:00:00‘) ‘shanghai‘ ‘110-C‘ 32 5433.0]
[1005 Timestamp(‘2013-01-06 00:00:00‘) ‘sh ‘ ‘210-A‘ 34 nan]
[1006 Timestamp(‘2013-01-07 00:00:00‘) ‘BEIJING ‘ ‘130-F‘ 32 4432.0]]
==================================结果分割线==================================
Index([‘id‘, ‘date‘, ‘city‘, ‘category‘, ‘age‘, ‘price‘], dtype=‘object‘)
==================================结果分割线==================================
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shanghai 110-C 32 5433.0
4 1005 2013-01-06 sh 210-A 34 NaN
==================================结果分割线==================================
id date city category age price
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shanghai 110-C 32 5433.0
4 1005 2013-01-06 sh 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
- 填充空值
df.fillna(value=0,inplace=True) # inplace=True,会直接改写原表,不写这个参数,则不会更改原表
println(df)
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 0.0
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shanghai 110-C 32 5433.0
4 1005 2013-01-06 sh 210-A 34 0.0
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
- 获取一列数据的平均值
println(df[‘price‘].mean())
3299.5
==================================结果分割线==================================
- 数据清洗:删除多余空格,大小写转换
df[‘city‘]=df[‘city‘].map(str.strip)
println(df)
df[‘city‘] = df[‘city‘].str.lower()
println(df[‘city‘])
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shanghai 110-C 32 5433.0
4 1005 2013-01-06 sh 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
==================================结果分割线==================================
0 beijing
1 sh
2 guangzhou
3 shanghai
4 sh
5 beijing
Name: city, dtype: object
==================================结果分割线==================================
Process finished with exit code 0
- 数据类型转换(需要先将此列所有的空值填充一个数才行)
df[‘price‘].fillna(value = 0,inplace=True)
println(df[‘price‘].astype(‘int‘))
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
0 1200
1 0
2 2133
3 5433
4 0
5 4432
Name: price, dtype: int32
==================================结果分割线==================================
- 重命名列名
df.rename(columns={‘category‘: ‘category-size‘},inplace=True)
println(df)
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id date city category-size age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shanghai 110-C 32 5433.0
4 1005 2013-01-06 sh 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
==================================结果分割线==================================
Process finished with exit code 0
- 删除一列数据的重复项(去重)
df[‘city‘].drop_duplicates()
df[‘city‘].drop_duplicates(keep=‘last‘) # keep=‘last‘,‘first‘,‘False‘; 针对重复项,分别为:保留最后一个,保留第一个,都不保留
- 替换数据
println(df[‘city‘].replace(‘sh ‘, ‘shanghai‘)) # 将 ‘sh ‘ 替换为 ‘shanghai‘
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
0 Beijing
1 SH
2 guangzhou
3 shanghai
4 shanghai
5 BEIJING
Name: city, dtype: object
==================================结果分割线==================================
Process finished with exit code 0
- 两表合并
df1=pd.DataFrame(
{
"id":[1001,1002,1003,1004,1005,1006,1007,1008],
"gender":[‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘],
"pay":[‘Y‘,‘N‘,‘Y‘,‘Y‘,‘N‘,‘Y‘,‘N‘,‘Y‘,],
"m-point":[10,12,20,40,40,40,30,20]
})
df_inner = pd.merge(df,df1,how=‘inner‘) # 匹配合并,交集,等同于sql的内连接:select * from df inner join df1 on df.id = df1.id
println(df_inner)
df_left=pd.merge(df,df1,how=‘left‘) # 等同于sql的左连接
println(df_left)
df_right=pd.merge(df,df1,how=‘right‘) # 右连接
println(df_right)
df_outer=pd.merge(df,df1,how=‘outer‘) #并集,等同于sql的 full outer join
println(df_outer)
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B 44 NaN female N 12
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20
3 1004 2013-01-05 shanghai 110-C 32 5433.0 female Y 40
4 1005 2013-01-06 sh 210-A 34 NaN male N 40
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40
==================================结果分割线==================================
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B 44 NaN female N 12
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20
3 1004 2013-01-05 shanghai 110-C 32 5433.0 female Y 40
4 1005 2013-01-06 sh 210-A 34 NaN male N 40
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40
==================================结果分割线==================================
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23.0 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B 44.0 NaN female N 12
2 1003 2013-01-04 guangzhou 110-A 54.0 2133.0 male Y 20
3 1004 2013-01-05 shanghai 110-C 32.0 5433.0 female Y 40
4 1005 2013-01-06 sh 210-A 34.0 NaN male N 40
5 1006 2013-01-07 BEIJING 130-F 32.0 4432.0 female Y 40
6 1007 NaT NaN NaN NaN NaN male N 30
7 1008 NaT NaN NaN NaN NaN female Y 20
==================================结果分割线==================================
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23.0 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B 44.0 NaN female N 12
2 1003 2013-01-04 guangzhou 110-A 54.0 2133.0 male Y 20
3 1004 2013-01-05 shanghai 110-C 32.0 5433.0 female Y 40
4 1005 2013-01-06 sh 210-A 34.0 NaN male N 40
5 1006 2013-01-07 BEIJING 130-F 32.0 4432.0 female Y 40
6 1007 NaT NaN NaN NaN NaN male N 30
7 1008 NaT NaN NaN NaN NaN female Y 20
==================================结果分割线==================================
Process finished with exit code 0
- 设置索引,并根据索引排序
df_inner.set_index(‘id‘) #设置id为索引
println(df_inner.sort_index(ascending=False)) # 降序排序
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id date city category age price gender pay m-point
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40
4 1005 2013-01-06 sh 210-A 34 NaN male N 40
3 1004 2013-01-05 shanghai 110-C 32 5433.0 female Y 40
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20
1 1002 2013-01-03 SH 100-B 44 NaN female N 12
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
==================================结果分割线==================================
Process finished with exit code 0
- 按照某列数据排序, case when 语句
println(df_inner.sort_values(by=[‘age‘],ascending=False)) # == select * from df_inner order by age desc;
df_inner[‘group‘] = np.where(df_inner[‘price‘] > 3000,‘high‘,‘low‘) # == select *,(case price when price>3000 then ‘high‘ else ‘low‘ end)group from df_inner
println(df_inner)
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id date city category age price gender pay m-point
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20
1 1002 2013-01-03 SH 100-B 44 NaN female N 12
4 1005 2013-01-06 sh 210-A 34 NaN male N 40
3 1004 2013-01-05 shanghai 110-C 32 5433.0 female Y 40
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
==================================结果分割线==================================
id date city category age price gender pay m-point group
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10 low
1 1002 2013-01-03 SH 100-B 44 NaN female N 12 low
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20 low
3 1004 2013-01-05 shanghai 110-C 32 5433.0 female Y 40 high
4 1005 2013-01-06 sh 210-A 34 NaN male N 40 low
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40 high
==================================结果分割线==================================
Process finished with exit code 0
- 新增一列
df_inner.loc[(df_inner[‘city‘] == ‘guangzhou‘) | (df_inner[‘price‘] >= 4000), ‘sign‘]=1 # loc
println(df_inner)
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id date city category age price gender pay m-point sign
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10 NaN
1 1002 2013-01-03 SH 100-B 44 NaN female N 12 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20 NaN
3 1004 2013-01-05 shanghai 110-C 32 5433.0 female Y 40 1.0
4 1005 2013-01-06 sh 210-A 34 NaN male N 40 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40 1.0
==================================结果分割线==================================
Process finished with exit code 0
- 数据分列
# 数据分列
new_split = pd.DataFrame((x.split(‘-‘) for x in df_inner[‘category‘]),index=df_inner.index,columns=[‘category‘,‘size‘])
print(new_split)
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
category size
0 100 A
1 100 B
2 110 A
3 110 C
4 210 A
5 130 F
Process finished with exit code 0
- 根据索引和区域取值
# 获取某一行的数据:根据索引
println(df_inner.loc[3])
# 获取索引区域的值
println(df_inner.iloc[0:5])
# 获取前三行,前两列的值,此处的数字不是索引哦,而是数据的位置
println(df_inner.iloc[:3,:2]) #冒号前后的数字不再是索引的标签名称,而是数据所在的位置,从0开始,前三行,前两列。
# 按照位置
println(df_inner.iloc[[0,2,5],[4,5]]) #提取第0、2、5行,4、5列)
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id 1004
date 2013-01-05 00:00:00
city shanghai
category 110-C
age 32
price 5433
gender female
pay Y
m-point 40
Name: 3, dtype: object
==================================结果分割线==================================
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B 44 NaN female N 12
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20
3 1004 2013-01-05 shanghai 110-C 32 5433.0 female Y 40
4 1005 2013-01-06 sh 210-A 34 NaN male N 40
==================================结果分割线==================================
id date
0 1001 2013-01-02
1 1002 2013-01-03
2 1003 2013-01-04
==================================结果分割线==================================
age price
0 23 1200.0
2 54 2133.0
5 32 4432.0
==================================结果分割线==================================
- 设置索引
df_inner=df_inner.set_index(‘date‘)
println(df_inner)
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id city category age price gender pay m-point
date
2013-01-02 1001 Beijing 100-A 23 1200.0 male Y 10
2013-01-03 1002 SH 100-B 44 NaN female N 12
2013-01-04 1003 guangzhou 110-A 54 2133.0 male Y 20
2013-01-05 1004 shanghai 110-C 32 5433.0 female Y 40
2013-01-06 1005 sh 210-A 34 NaN male N 40
2013-01-07 1006 BEIJING 130-F 32 4432.0 female Y 40
==================================结果分割线==================================
Process finished with exit code 0
- 判断是否包含某个数据
# 判断某列是否包含某个数据
println(df_inner[‘city‘].isin([‘Beijing ‘]))
# 判断city列里是否包含beijing和shanghai,然后将符合条件的数据提取出来
println(df_inner.loc[df_inner[‘city‘].isin([‘Beijing ‘,‘shanghai‘])])
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
0 True
1 False
2 False
3 False
4 False
5 False
Name: city, dtype: bool
==================================结果分割线==================================
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
3 1004 2013-01-05 shanghai 110-C 32 5433.0 female Y 40
==================================结果分割线==================================
Process finished with exit code 0
- 截取某列的前几个字符
println(pd.DataFrame(df_inner[‘category‘].str[:3]))
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
category
0 100
1 100
2 110
3 110
4 210
5 130
==================================结果分割线==================================
Process finished with exit code 0
- 表关联
# selet ‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘ from df_inner where age>10 and city=Beijing
println(df_inner.loc[(df_inner[‘age‘] > 10) & (df_inner[‘city‘] == ‘Beijing ‘), [‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘]])
# select ‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘ from df_inner where age>25 or city=beijing order by age desc;
println(df_inner.loc[(df_inner[‘age‘] > 25) | (df_inner[‘city‘] == ‘beijing‘), [‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘]].sort_values([‘age‘],ascending=False))
# select count(*) from df_inner where city != beijing ;
println(df_inner.loc[(df_inner[‘city‘] != ‘beijing‘)].city.count())
# select * from df_inner where city = Beijing or city = shanghai;
println(df_inner.query(‘city == ["Beijing ", "shanghai "]‘)) # 或的关系
# select sum(price) from df_inner where city = Beijing or city = shanghai;
println(df_inner.query(‘city == ["Beijing ", "shanghai "]‘).price.sum())
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id city age category gender
0 1001 Beijing 23 100-A male
==================================结果分割线==================================
id city age category gender
2 1003 guangzhou 54 110-A male
1 1002 SH 44 100-B female
4 1005 sh 34 210-A male
3 1004 shanghai 32 110-C female
5 1006 BEIJING 32 130-F female
==================================结果分割线==================================
6
==================================结果分割线==================================
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
==================================结果分割线==================================
1200.0
==================================结果分割线==================================
- 分组,统计
println(df_inner.groupby(‘city‘).count())
println(df_inner.groupby(‘city‘)[‘id‘].count())
println(df_inner.groupby([‘city‘,‘age‘])[‘id‘].count())
println(df_inner.groupby(‘city‘)[‘price‘].agg([len,np.sum, np.mean]))
D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
id date category age price gender pay m-point
city
guangzhou 1 1 1 1 1 1 1 1
BEIJING 1 1 1 1 1 1 1 1
Beijing 1 1 1 1 1 1 1 1
SH 1 1 1 1 0 1 1 1
sh 1 1 1 1 0 1 1 1
shanghai 1 1 1 1 1 1 1 1
==================================结果分割线==================================
city
guangzhou 1
BEIJING 1
Beijing 1
SH 1
sh 1
shanghai 1
Name: id, dtype: int64
==================================结果分割线==================================
city age
guangzhou 54 1
BEIJING 32 1
Beijing 23 1
SH 44 1
sh 34 1
shanghai 32 1
Name: id, dtype: int64
==================================结果分割线==================================
len sum mean
city
guangzhou 1.0 2133.0 2133.0
BEIJING 1.0 4432.0 4432.0
Beijing 1.0 1200.0 1200.0
SH 1.0 0.0 NaN
sh 1.0 0.0 NaN
shanghai 1.0 5433.0 5433.0
==================================结果分割线==================================
Process finished with exit code 0
- 写入excel,csv
df_inner.to_excel(‘excel_to_python.xlsx‘, sheet_name=‘bluewhale_cc‘)
df_inner.to_csv(‘csv_to_python.csv‘)
- 读取excel,csv
df = pd.read_csv(‘excel_to_python.csv‘)
df2 = pd.read_excel(‘excel_to_python.xlsx‘)
代码:
import numpy as np
import pandas as pd
def println(*args):
print(*args,end=‘
==================================结果分割线==================================
‘)
df = pd.DataFrame(
{
"id":[1001,1002,1003,1004,1005,1006],
"date":pd.date_range(‘20130102‘, periods=6),
"city":[‘Beijing ‘, ‘SH‘, ‘ guangzhou‘, ‘shanghai‘, ‘sh ‘, ‘BEIJING ‘],
"age":[23,44,54,32,34,32],
"category":[‘100-A‘,‘100-B‘,‘110-A‘,‘110-C‘,‘210-A‘,‘130-F‘],
"price":[1200,np.nan,2133,5433,np.nan,4432]
},
columns =[‘id‘,‘date‘,‘city‘,‘category‘,‘age‘,‘price‘])
println(df,type(df))
println(df.shape) # (col,row)
println(df.info())
println(df.dtypes)
println(df[‘id‘].dtype)
println(df.isnull())
println(df[‘price‘].isnull())
println(df[‘id‘].unique())
println(df.values)
println(df.columns)
println(df.head()) #默认前10行数据
println(df.tail()) #默认后10行数据
df.fillna(value=0,inplace=True)
println(df)
println(df[‘price‘].mean())
df[‘city‘]=df[‘city‘].map(str.strip)
println(df)
df[‘city‘] = df[‘city‘].str.lower()
println(df[‘city‘])
df[‘price‘].fillna(value = 0,inplace=True)
println(df[‘price‘].astype(‘int‘))
df.rename(columns={‘category‘: ‘category-size‘},inplace=True)
println(df)
df[‘city‘].drop_duplicates()
df[‘city‘].drop_duplicates(keep=‘last‘)
println(df[‘city‘].replace(‘sh ‘, ‘shanghai‘))
df1=pd.DataFrame(
{
"id":[1001,1002,1003,1004,1005,1006,1007,1008],
"gender":[‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘,‘male‘,‘female‘],
"pay":[‘Y‘,‘N‘,‘Y‘,‘Y‘,‘N‘,‘Y‘,‘N‘,‘Y‘,],
"m-point":[10,12,20,40,40,40,30,20]
})
df_inner = pd.merge(df,df1,how=‘inner‘) # 匹配合并,交集
println(df_inner)
df_left=pd.merge(df,df1,how=‘left‘)
println(df_left)
df_right=pd.merge(df,df1,how=‘right‘)
println(df_right)
df_outer=pd.merge(df,df1,how=‘outer‘) #并集
println(df_outer)
df_inner.set_index(‘id‘)
println(df_inner.sort_index(ascending=False))
println(df_inner.sort_values(by=[‘age‘],ascending=False))
df_inner[‘group‘] = np.where(df_inner[‘price‘] > 3000,‘high‘,‘low‘)
println(df_inner)
df_inner.loc[(df_inner[‘city‘] == ‘guangzhou‘) | (df_inner[‘price‘] >= 4000), ‘sign‘]=1
println(df_inner)
# 数据分列
new_split = pd.DataFrame((x.split(‘-‘) for x in df_inner[‘category‘]),index=df_inner.index,columns=[‘category‘,‘size‘])
print(new_split)
# 获取某一行的数据:根据索引
println(df_inner.loc[3])
# 获取索引区域的值
println(df_inner.iloc[0:5])
# 获取前三行,前两列的值,此处的数字不是索引哦,而是数据的位置
println(df_inner.iloc[:3,:2]) #冒号前后的数字不再是索引的标签名称,而是数据所在的位置,从0开始,前三行,前两列。
# 按照位置
println(df_inner.iloc[[0,2,5],[4,5]]) #提取第0、2、5行,4、5列)
df_inner=df_inner.set_index(‘date‘)
println(df_inner)
# 判断某列是否包含某个数据
println(df_inner[‘city‘].isin([‘Beijing ‘]))
# 判断city列里是否包含beijing和shanghai,然后将符合条件的数据提取出来
println(df_inner.loc[df_inner[‘city‘].isin([‘Beijing ‘,‘shanghai‘])])
println(pd.DataFrame(df_inner[‘category‘].str[:3]))
# selet ‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘ from df_inner where age>10 and city=Beijing
println(df_inner.loc[(df_inner[‘age‘] > 10) & (df_inner[‘city‘] == ‘Beijing ‘), [‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘]])
# select ‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘ from df_inner where age>25 or city=beijing order by age desc;
println(df_inner.loc[(df_inner[‘age‘] > 25) | (df_inner[‘city‘] == ‘beijing‘), [‘id‘,‘city‘,‘age‘,‘category‘,‘gender‘]].sort_values([‘age‘],ascending=False))
# select count(*) from df_inner where city != beijing ;
println(df_inner.loc[(df_inner[‘city‘] != ‘beijing‘)].city.count())
# select * from df_inner where city = Beijing or city = shanghai;
println(df_inner.query(‘city == ["Beijing ", "shanghai "]‘)) # 或的关系
# select sum(price) from df_inner where city = Beijing or city = shanghai;
println(df_inner.query(‘city == ["Beijing ", "shanghai "]‘).price.sum())
# 分组,统计
println(df_inner.groupby(‘city‘).count())
println(df_inner.groupby(‘city‘)[‘id‘].count())
println(df_inner.groupby([‘city‘,‘age‘])[‘id‘].count())
println(df_inner.groupby(‘city‘)[‘price‘].agg([len,np.sum, np.mean]))
df_inner.to_excel(‘excel_to_python.xlsx‘, sheet_name=‘bluewhale_cc‘)
df_inner.to_csv(‘excel_to_python.csv‘)
df = pd.read_csv(‘excel_to_python.csv‘)
df2 = pd.read_excel(‘excel_to_python.xlsx‘)
以上是关于python之pandas简介的主要内容,如果未能解决你的问题,请参考以下文章