A pandas Command Handbook for Beginners, with Exercises: a 30,000-Word Walkthrough

Posted by 五包辣条!


Hi everyone, I'm Latiao.

pandas is Python's core data-analysis library. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. pandas is commonly used for matrix-like data with row and column labels and for tabular data similar to SQL or Excel tables, and it is applied in finance, statistics, the social sciences, engineering, and other fields for data wrangling and cleaning, analysis and modeling, and visualization and reporting.

Exercise Index

Exercise 1 - Getting to know your data: explore the Chipotle fast-food data (chipotle.tsv)
Exercise 2 - Filtering and sorting: explore the Euro 2012 data (Euro2012_stats.csv)
Exercise 3 - Grouping: explore the alcohol consumption data (drinks.csv)
Exercise 4 - The Apply function: explore the US crime data, 1960-2014 (US_Crime_Rates_1960_2014.csv)
Exercise 5 - Merging: explore fictitious name data (built by hand within the exercise)
Exercise 6 - Statistics: explore the wind speed data (wind.data)
Exercise 7 - Visualization: explore the Titanic disaster data (train.csv)
Exercise 8 - Creating DataFrames: explore Pokemon data (built by hand within the exercise)
Exercise 9 - Time series: explore Apple's stock price data (Apple_stock.csv)
Exercise 10 - Deleting data: explore the Iris flower data (iris.csv)

Exercise 1 - Getting to know your data

Explore the Chipotle fast-food data

Step 1: Import the necessary libraries

In [7]:

# Run the following code
import pandas as pd
 

Step 2: Import the dataset from the following path

In [5]:

# Run the following code
path1 = "./exercise_data/chipotle.tsv"    # chipotle.tsv

Step 3: Load the dataset into a DataFrame named chipo

In [8]:

# Run the following code
chipo = pd.read_csv(path1, sep='\t')

Step 4: View the first 10 rows

In [9]:

# Run the following code
chipo.head(10)

Out[9]:

   order_id  quantity                              item_name                                  choice_description item_price
0         1         1           Chips and Fresh Tomato Salsa                                                 NaN      $2.39
1         1         1                                   Izze                                        [Clementine]      $3.39
2         1         1                       Nantucket Nectar                                             [Apple]      $3.39
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                 NaN      $2.39
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...     $16.98
5         3         1                           Chicken Bowl  [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...     $10.98
6         3         1                          Side of Chips                                                 NaN      $1.69
7         4         1                          Steak Burrito  [Tomatillo Red Chili Salsa, [Fajita Vegetables...     $11.75
8         4         1                       Steak Soft Tacos  [Tomatillo Green Chili Salsa, [Pinto Beans, Ch...      $9.25
9         5         1                          Steak Burrito  [Fresh Tomato Salsa, [Rice, Black Beans, Pinto...      $9.25

Step 6: How many columns are in the dataset?

In [236]:

# Run the following code
chipo.shape[1]

Out[236]:

5

Step 7: Print the names of all columns

In [237]:

# Run the following code
chipo.columns

Out[237]:

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

Step 8: How is the dataset indexed?

In [238]:

# Run the following code
chipo.index

Out[238]:

RangeIndex(start=0, stop=4622, step=1)

Step 9: Which item was ordered most often?

In [239]:

# Run the following code (this has been corrected)
c = chipo[['item_name','quantity']].groupby(['item_name'],as_index=False).agg({'quantity':sum})
c.sort_values(['quantity'],ascending=False,inplace=True)
c.head()

Out[239]:

              item_name  quantity
17         Chicken Bowl       761
18      Chicken Burrito       591
25  Chips and Guacamole       506
39        Steak Burrito       386
10    Canned Soft Drink       351
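
By the way, the same top-five ranking can be written more compactly by grouping straight into a Series; this sketch is equivalent to the corrected code above:

# Group by item, sum the quantities, and take the five largest
chipo.groupby('item_name')['quantity'].sum().nlargest(5)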

Step 10: How many different items appear in the item_name column?

In [240]:

# Run the following code
chipo['item_name'].nunique()

Out[240]:

50

Step 11: Which item in choice_description was ordered most often?

In [241]:

# Run the following code (this has a few minor issues)
chipo['choice_description'].value_counts().head()

Out[241]:

[Diet Coke]                                                                          134
[Coke]                                                                               123
[Sprite]                                                                              77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]]                42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]]     40
Name: choice_description, dtype: int64

Step 12: How many items were ordered in total?

In [242]:

# Run the following code
total_items_orders = chipo['quantity'].sum()
total_items_orders

Out[242]:

4972

Step 13: Convert item_price to a float

In [243]:

# Run the following code
dollarizer = lambda x: float(x[1:-1])  # slice off the leading '$' and the last character, then cast to float
chipo['item_price'] = chipo['item_price'].apply(dollarizer)
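
A vectorized alternative (just a sketch, and it assumes each raw price is still a string like "$2.39", so it would run instead of the apply above) uses the .str accessor rather than a Python-level lambda:

# Strip the leading '$' and cast the whole column to float in one vectorized pass
chipo['item_price'] = chipo['item_price'].str.replace('$', '', regex=False).astype(float)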

Step 14: How much revenue was generated during the period covered by the dataset?

In [244]:

# Run the following code (this has been corrected)
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2)
chipo['sub_total'].sum()

Out[244]:

39237.02

Step 15: How many orders were placed during the period?

In [245]:

# Run the following code
chipo['order_id'].nunique()

Out[245]:

1834

Step 16: What is the average total price per order?

In [246]:

# Run the following code (this has been corrected)
chipo[['order_id','sub_total']].groupby(by=['order_id']
).agg({'sub_total':'sum'})['sub_total'].mean()

Out[246]:

21.39423118865867
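
The same number also falls out of a slightly more direct spelling: group the line-item subtotals by order, sum within each order, then average across orders. A small equivalent sketch:

# Per-order totals, then the mean across all orders
chipo.groupby('order_id')['sub_total'].sum().mean()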

Step 17: How many different items were sold?

In [247]:

# Run the following code
chipo['item_name'].nunique()

Out[247]:

50

Exercise 2 - Filtering and sorting

Explore the Euro 2012 data

Step 1: Import the necessary libraries

In [248]:

# Run the following code
import pandas as pd

Step 2: Import the dataset from the following path

In [249]:

# Run the following code
path2 = "./exercise_data/Euro2012_stats.csv"      # Euro2012_stats.csv

Step 3: Name the dataset euro12

In [250]:

# Run the following code
euro12 = pd.read_csv(path2)
euro12

Out[250]:

Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
0  Croatia  4  13  12  51.9%  16.0%  32  0  0  0  ...  13  81.3%  41  62  2  9  0  9  9  16
1  Czech Republic  4  13  18  41.9%  12.9%  39  0  0  0  ...  9  60.1%  53  73  8  7  0  11  11  19
2  Denmark  4  10  10  50.0%  20.0%  27  1  0  0  ...  10  66.7%  25  38  8  4  0  7  7  15
3  England  5  11  18  50.0%  17.2%  40  0  0  0  ...  22  88.1%  43  45  6  5  0  11  11  16
4  France  3  22  24  37.9%  6.5%  65  1  0  0  ...  6  54.6%  36  51  5  6  0  11  11  19
5  Germany  10  32  32  47.8%  15.6%  80  2  1  0  ...  10  62.6%  63  49  12  4  0  15  15  17
6  Greece  5  8  18  30.7%  19.2%  32  1  1  1  ...  13  65.1%  67  48  12  9  1  12  12  20
7  Italy  6  34  45  43.0%  7.5%  110  2  0  0  ...  20  74.1%  101  89  16  16  0  18  18  19
8  Netherlands  2  12  36  25.0%  4.1%  60  2  0  0  ...  12  70.6%  35  30  3  5  0  7  7  15
9  Poland  2  15  23  39.4%  5.2%  48  0  0  0  ...  6  66.7%  48  56  3  7  1  7  7  17
10  Portugal  6  22  42  34.3%  9.3%  82  6  0  0  ...  10  71.5%  73  90  10  12  0  14  14  16
11  Republic of Ireland  1  7  12  36.8%  5.2%  28  0  0  0  ...  17  65.4%  43  51  11  6  1  10  10  17
12  Russia  5  9  31  22.5%  12.5%  59  2  0  0  ...  10  77.0%  34  43  4  6  0  7  7  16
13  Spain  12  42  33  55.9%  16.0%  100  0  1  0  ...  15  93.8%  102  83  19  11  0  17  17  18
14  Sweden  5  17  19  47.2%  13.8%  39  3  0  0  ...  8  61.6%  35  51  7  7  0  9  9  18
15  Ukraine  2  7  26  21.2%  6.0%  38  0  0  0  ...  13  76.5%  48  31  4  5  0  9  9  18

16 rows × 35 columns

Step 4: Select only the Goals column

In [251]:

# Run the following code
euro12.Goals

Out[251]:

0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64

Step 5: How many teams took part in Euro 2012?

In [252]:

# Run the following code
euro12.shape[0]

Out[252]:

16

Step 6: How many columns does the dataset have?

In [253]:

# Run the following code
euro12.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
Team                          16 non-null object
Goals                         16 non-null int64
Shots on target               16 non-null int64
Shots off target              16 non-null int64
Shooting Accuracy             16 non-null object
% Goals-to-shots              16 non-null object
Total shots (inc. Blocked)    16 non-null int64
Hit Woodwork                  16 non-null int64
Penalty goals                 16 non-null int64
Penalties not scored          16 non-null int64
Headed goals                  16 non-null int64
Passes                        16 non-null int64
Passes completed              16 non-null int64
Passing Accuracy              16 non-null object
Touches                       16 non-null int64
Crosses                       16 non-null int64
Dribbles                      16 non-null int64
Corners Taken                 16 non-null int64
Tackles                       16 non-null int64
Clearances                    16 non-null int64
Interceptions                 16 non-null int64
Clearances off line           15 non-null float64
Clean Sheets                  16 non-null int64
Blocks                        16 non-null int64
Goals conceded                16 non-null int64
Saves made                    16 non-null int64
Saves-to-shots ratio          16 non-null object
Fouls Won                     16 non-null int64
Fouls Conceded                16 non-null int64
Offsides                      16 non-null int64
Yellow Cards                  16 non-null int64
Red Cards                     16 non-null int64
Subs on                       16 non-null int64
Subs off                      16 non-null int64
Players Used                  16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB

Step 7: Store the columns Team, Yellow Cards, and Red Cards in a new DataFrame called discipline

In [254]:

# Run the following code
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline

Out[254]:

                   Team  Yellow Cards  Red Cards
0               Croatia             9          0
1        Czech Republic             7          0
2               Denmark             4          0
3               England             5          0
4                France             6          0
5               Germany             4          0
6                Greece             9          1
7                 Italy            16          0
8           Netherlands             5          0
9                Poland             7          1
10             Portugal            12          0
11  Republic of Ireland             6          1
12               Russia             6          0
13                Spain            11          0
14               Sweden             7          0
15              Ukraine             5          0

Step 8: Sort discipline by Red Cards first, then by Yellow Cards

In [255]:

# Run the following code
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending = False)

Out[255]:

                   Team  Yellow Cards  Red Cards
6                Greece             9          1
9                Poland             7          1
11  Republic of Ireland             6          1
7                 Italy            16          0
10             Portugal            12          0
13                Spain            11          0
0               Croatia             9          0
1        Czech Republic             7          0
14               Sweden             7          0
4                France             6          0
12               Russia             6          0
3               England             5          0
8           Netherlands             5          0
15              Ukraine             5          0
2               Denmark             4          0
5               Germany             4          0
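
Note that sort_values also accepts one ascending flag per sort key, so you could, for example, keep Red Cards descending while breaking ties by Yellow Cards in ascending order; a small sketch:

# One ascending flag per key: Red Cards descending, ties broken by Yellow Cards ascending
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=[False, True])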

Step 9: Compute the mean number of yellow cards per team

In [256]:

# Run the following code
round(discipline['Yellow Cards'].mean())

Out[256]:

7.0

Step 10: Find the teams that scored more than 6 goals

In [257]:

# Run the following code
euro12[euro12.Goals > 6]

Out[257]:

Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
5  Germany  10  32  32  47.8%  15.6%  80  2  1  0  ...  10  62.6%  63  49  12  4  0  15  15  17
13  Spain  12  42  33  55.9%  16.0%  100  0  1  0  ...  15  93.8%  102  83  19  11  0  17  17  18

2 rows × 35 columns

Step 11: Select the teams whose names start with G

In [258]:

# Run the following code
euro12[euro12.Team.str.startswith('G')]

Out[258]:

Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
5  Germany  10  32  32  47.8%  15.6%  80  2  1  0  ...  10  62.6%  63  49  12  4  0  15  15  17
6  Greece  5  8  18  30.7%  19.2%  32  1  1  1  ...  13  65.1%  67  48  12  9  1  12  12  20

2 rows × 35 columns

Step 12: Select the first 7 columns

In [259]:

# Run the following code
euro12.iloc[: , 0:7]

Out[259]:

                   Team  Goals  Shots on target  Shots off target Shooting Accuracy % Goals-to-shots  Total shots (inc. Blocked)
0               Croatia      4               13                12             51.9%            16.0%                          32
1        Czech Republic      4               13                18             41.9%            12.9%                          39
2               Denmark      4               10                10             50.0%            20.0%                          27
3               England      5               11                18             50.0%            17.2%                          40
4                France      3               22                24             37.9%             6.5%                          65
5               Germany     10               32                32             47.8%            15.6%                          80
6                Greece      5                8                18             30.7%            19.2%                          32
7                 Italy      6               34                45             43.0%             7.5%                         110
8           Netherlands      2               12                36             25.0%             4.1%                          60
9                Poland      2               15                23             39.4%             5.2%                          48
10             Portugal      6               22                42             34.3%             9.3%                          82
11  Republic of Ireland      1                7                12             36.8%             5.2%                          28
12               Russia      5                9                31             22.5%            12.5%                          59
13                Spain     12               42                33             55.9%            16.0%                         100
14               Sweden      5               17                19             47.2%            13.8%                          39
15              Ukraine      2                7                26             21.2%             6.0%                          38

Step 13: Select all columns except the last 3

In [260]:

# Run the following code
euro12.iloc[: , :-3]

Out[260]:

Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Clean Sheets  Blocks  Goals conceded  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards
0  Croatia  4  13  12  51.9%  16.0%  32  0  0  0  ...  0  10  3  13  81.3%  41  62  2  9  0
1  Czech Republic  4  13  18  41.9%  12.9%  39  0  0  0  ...  1  10  6  9  60.1%  53  73  8  7  0
2  Denmark  4  10  10  50.0%  20.0%  27  1  0  0  ...  1  10  5  10  66.7%  25  38  8  4  0
3  England  5  11  18  50.0%  17.2%  40  0  0  0  ...  2  29  3  22  88.1%  43  45  6  5  0
4  France  3  22  24  37.9%  6.5%  65  1  0  0  ...  1  7  5  6  54.6%  36  51  5  6  0
5  Germany  10  32  32  47.8%  15.6%  80  2  1  0  ...  1  11  6  10  62.6%  63  49  12  4  0
6  Greece  5  8  18  30.7%  19.2%  32  1  1  1  ...  1  23  7  13  65.1%  67  48  12  9  1
7  Italy  6  34  45  43.0%  7.5%  110  2  0  0  ...  2  18  7  20  74.1%  101  89  16  16  0
8  Netherlands  2  12  36  25.0%  4.1%  60  2  0  0  ...  0  9  5  12  70.6%  35  30  3  5  0
9  Poland  2  15  23  39.4%  5.2%  48  0  0  0  ...  0  8  3  6  66.7%  48  56  3  7  1
10  Portugal  6  22  42  34.3%  9.3%  82  6  0  0  ...  2  11  4  10  71.5%  73  90  10  12  0
11  Republic of Ireland  1  7  12  36.8%  5.2%  28  0  0  0  ...  0  23  9  17  65.4%  43  51  11  6  1
12  Russia  5  9  31  22.5%  12.5%  59  2  0  0  ...  0  8  3  10  77.0%  34  43  4  6  0
13  Spain  12  42  33  55.9%  16.0%  100  0  1  0  ...  5  8  1  15  93.8%  102  83  19  11  0
14  Sweden  5  17  19  47.2%  13.8%  39  3  0  0  ...  1  12  5  8  61.6%  35  51  7  7  0
15  Ukraine  2  7  26  21.2%  6.0%  38  0  0  0  ...  0  4  4  13  76.5%  48  31  4  5  0

16 rows × 32 columns

Step 14: Find the Shooting Accuracy of England, Italy, and Russia

In [261]:

# Run the following code
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]

Out[261]:

       Team Shooting Accuracy
3   England             50.0%
7     Italy             43.0%
12   Russia             22.5%
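
An equivalent selection, as a sketch using DataFrame.query (which reads a little closer to SQL), would be:

# Same filter expressed with query(); 'in' works like SQL's IN
euro12.query("Team in ['England', 'Italy', 'Russia']")[['Team', 'Shooting Accuracy']]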

Exercise 3 - Grouping

Explore the alcohol consumption data

Step 1: Import the necessary libraries

In [262]:

# Run the following code
import pandas as pd

Step 2: Import the data from the following path

In [10]:

# Run the following code
path3 = './exercise_data/drinks.csv'    # 'drinks.csv'

Step 3: Name the DataFrame drinks

In [11]:

# Run the following code
drinks = pd.read_csv(path3)
drinks.head()

Out[11]:

       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
0  Afghanistan              0                0              0                           0.0        AS
1      Albania             89              132             54                           4.9        EU
2      Algeria             25                0             14                           0.7        AF
3      Andorra            245              138            312                          12.4        EU
4       Angola            217               57             45                           5.9        AF

Step 4: Which continent drinks the most beer on average?

In [12]:

# Run the following code
drinks.groupby('continent').beer_servings.mean()

Out[12]:

continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64

Step 5: Print the descriptive statistics of wine consumption (wine_servings) for each continent

In [13]:

# Run the following code
drinks.groupby('continent').wine_servings.describe()

Out[13]:

           count        mean        std  min   25%    50%     75%    max
continent
AF          53.0   16.264151  38.846419  0.0   1.0    2.0   13.00  233.0
AS          44.0    9.068182  21.667034  0.0   0.0    1.0    8.00  123.0
EU          45.0  142.222222  97.421738  0.0  59.0  128.0  195.00  370.0
OC          16.0   35.625000  64.555790  0.0   1.0    8.5   23.25  212.0
SA          12.0   62.416667  88.620189  1.0   3.0   12.0   98.50  221.0

Step 6: Print the mean consumption of each drink type for each continent

In [15]:

# Run the following code
drinks.groupby('continent').mean()

Out[15]:

           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF             61.471698        16.339623      16.264151                      3.007547
AS             37.045455        60.840909       9.068182                      2.170455
EU            193.777778       132.555556     142.222222                      8.617778
OC             89.687500        58.437500      35.625000                      3.381250
SA            175.083333       114.750000      62.416667                      6.308333
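
One version caveat: on pandas 2.0 and later, calling mean() or median() on this groupby raises a TypeError, because the string column country can no longer be aggregated silently; passing numeric_only=True restores the behaviour shown above. A sketch, assuming a recent pandas:

# pandas >= 2.0: exclude non-numeric columns explicitly
drinks.groupby('continent').mean(numeric_only=True)
drinks.groupby('continent').median(numeric_only=True)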

Step 7: Print the median consumption of each drink type for each continent

In [268]:

# Run the following code
drinks.groupby('continent').median()

Out[268]:

           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF                  32.0              3.0            2.0                          2.30
AS                  17.5             16.0            1.0                          1.20
EU                 219.0            122.0          128.0                         10.00
OC                  52.5             37.0            8.5                          1.75
SA                 162.5            108.5           12.0                          6.85

Step 8: Print the mean, min, and max spirit consumption for each continent

In [269]:

# Run the following code
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])

Out[269]:

                 mean  min  max
continent
AF          16.339623    0  152
AS          60.840909    0  326
EU         132.555556    0  373
OC          58.437500    0  254
SA         114.750000   25  302

Exercise 4 - The Apply function

Explore the US crime data, 1960-2014

Step 1: Import the necessary libraries

In [16]:

# Run the following code
import numpy as np
import pandas as pd

Step 2: Import the dataset from the following path

In [27]:

# Run the following code
path4 = './exercise_data/US_Crime_Rates_1960_2014.csv'    # "US_Crime_Rates_1960_2014.csv"

Step 3: Name the DataFrame crime

In [28]:

# Run the following code
crime = pd.read_csv(path4)
crime.head()

Out[28]:

Year  Population  Total  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
0  1960  179323175  3384200  288460  3095700  9110  17190  107840  154320  912100  1855400  328200
1  1961  182992000  3488000  289390  3198600  8740  17220  106670  156760  949600  1913000  336000
2  1962  185771000  3752200  301510  3450700  8530  17550  110860  164570  994300  2089600  366800
3  1963  188483000  4109500  316970  3792500  8640  17650  116470  174210  1086400  2297800  408300
4  1964  191141000  4564600  364220  4200400  9360  21420  130390  203050  1213200  2514400  472800

Step 4: What is the data type of each column?

In [29]:

# Run the following code
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null int64
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: int64(12)
memory usage: 5.2 KB

Did you notice that Year's data type is int64? pandas has a dedicated data type for handling time series, though, so let's take a look at it now.

Step 5: Convert Year to datetime64

In [30]:

# Run the following code
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null datetime64[ns]
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.2 KB

Step 6: Set the Year column as the index of the DataFrame

In [31]:

# Run the following code
crime = crime.set_index('Year', drop = True)
crime.head()

Out[31]:

Population  Total  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01  179323175  3384200  288460  3095700  9110  17190  107840  154320  912100  1855400  328200
1961-01-01  182992000  3488000  289390  3198600  8740  17220  106670  156760  949600  1913000  336000
1962-01-01  185771000  3752200  301510  3450700  8530  17550  110860  164570  994300  2089600  366800
1963-01-01  188483000  4109500  316970  3792500  8640  17650  116470  174210  1086400  2297800  408300
1964-01-01  191141000  4564600  364220  4200400  9360  21420  130390  203050  1213200  2514400  472800

Step 7: Delete the Total column

In [32]:

# Run the following code
del crime['Total']
crime.head()

Out[32]:

Population  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01  179323175  288460  3095700  9110  17190  107840  154320  912100  1855400  328200
1961-01-01  182992000  289390  3198600  8740  17220  106670  156760  949600  1913000  336000
1962-01-01  185771000  301510  3450700  8530  17550  110860  164570  994300  2089600  366800
1963-01-01  188483000  316970  3792500  8640  17650  116470  174210  1086400  2297800  408300
1964-01-01  191141000  364220  4200400  9360  21420  130390  203050  1213200  2514400  472800

In [33]:

crime.resample('10AS').sum()

Out[33]:

Population  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01  1915053175  4134930  45160900  106180  236720  1633510  2158520  13321100  26547700  5292100
1970-01-01  2121193298  9607930  91383800  192230  554570  4159020  4702120  28486000  53157800  9739900
1980-01-01  2371370069  14074328  117048900  206439  865639  5383109  7619130  33073494  72040253  11935411
1990-01-01  2612825258  17527048  119053499  211664  998827  5748930  10568963  26750015  77679366  14624418
2000-01-01  2947969117  13968056  100944369  163068  922499  4230366  8652124  21565176  67970291  11412834
2010-01-01  1570146307  6072017  44095950  72867  421059  1749809  3764142  10125170  30401698  3569080

Step 8: Group the data by decade and sum the values

Note: if you sum the Population column directly, the result will not be correct.

In [34]:

# More about .resample:
# (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
# More about offset aliases:
# (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
# Run the following code
crimes = crime.resample('10AS').sum() # resample the time series by decade


# Use resample to get the max of the "Population" column
population = crime['Population'].resample('10AS').max()

# Update "Population"
crimes['Population'] = population

crimes

Out[34]:

Population  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01  201385000  4134930  45160900  106180  236720  1633510  2158520  13321100  26547700  5292100
1970-01-01  220099000  9607930  91383800  192230  554570  4159020  4702120  28486000  53157800  9739900
1980-01-01  248239000  14074328  117048900  206439  865639  5383109  7619130  33073494  72040253  11935411
1990-01-01  272690813  17527048  119053499  211664  998827  5748930  10568963  26750015  77679366  14624418
2000-01-01  307006550  13968056  100944369  163068  922499  4230366  8652124  21565176  67970291  11412834
2010-01-01  318857056  6072017  44095950  72867  421059  1749809  3764142  10125170  30401698  3569080
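
A version note: the 'AS' (year-start) offset alias is deprecated in recent pandas, so on pandas 2.2+ the same resample is spelled with 'YS'. A sketch of the equivalent code:

# pandas >= 2.2 spells "every 10 years, anchored at year start" as '10YS'
crimes = crime.resample('10YS').sum()
crimes['Population'] = crime['Population'].resample('10YS').max()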

Step 9: When was the most dangerous time to live in US history?

In [279]:

# Run the following code
crime.idxmax(0)

Out[279]:

Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]

Exercise 5 - Merging

Explore the fictitious names data

Step 1: Import the necessary libraries

In [280]:

# Run the following code
import numpy as np
import pandas as pd

Step 2: Create DataFrames from the raw data below

In [281]:

# Run the following code
raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

Step 3: Name the DataFrames data1, data2, and data3 respectively

In [282]:

# Run the following code
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])

Step 4: Concatenate data1 and data2 along the row axis and name the result all_data

In [283]:

# Run the following code
all_data = pd.concat([data1, data2])
all_data

Out[283]:

  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
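
Notice that the concatenated frame keeps both original 0-4 indices. If you would rather have a fresh 0-9 index, concat can renumber the rows for you; a small sketch:

# ignore_index=True renumbers the result 0..9 instead of repeating 0..4
all_data = pd.concat([data1, data2], ignore_index=True)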

Step 5: Concatenate data1 and data2 along the column axis and name the result all_data_col

In [284]:

# Run the following code
all_data_col = pd.concat([data1, data2], axis = 1)
all_data_col

Out[284]:

  subject_id first_name last_name subject_id first_name last_name
0          1       Alex  Anderson          4      Billy    Bonder
1          2        Amy  Ackerman          5      Brian     Black
2          3      Allen       Ali          6       Bran   Balwner
3          4      Alice      Aoni          7      Bryce     Brice
4          5     Ayoung   Atiches          8      Betty    Btisan

Step 6: Print data3

In [285]:

# Run the following code
data3

Out[285]:

  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16

Step 7: Merge all_data and data3 on the values of subject_id

In [286]:

# Run the following code
pd.merge(all_data, data3, on='subject_id')

Out[286]:

  subject_id first_name last_name  test_id
0          1       Alex  Anderson       51
1          2        Amy  Ackerman       15
2          3      Allen       Ali       15
3          4      Alice      Aoni       61
4          4      Billy    Bonder       61
5          5     Ayoung   Atiches       16
6          5      Brian     Black       16
7          7      Bryce     Brice       14
8          8      Betty    Btisan       15

Step 8: Join data1 and data2 on subject_id

In [287]:

# Run the following code
pd.merge(data1, data2, on='subject_id', how='inner')

Out[287]:

  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black

Step 9: Find all of the matches between data1 and data2, keeping unmatched rows as well

In [288]:

# Run the following code
pd.merge(data1, data2, on='subject_id', how='outer')

Out[288]:

  subject_id first_name_x last_name_x first_name_y last_name_y
0          1         Alex    Anderson          NaN         NaN
1          2          Amy    Ackerman          NaN         NaN
2          3        Allen         Ali          NaN         NaN
3          4        Alice        Aoni        Billy      Bonder
4          5       Ayoung     Atiches        Brian       Black
5          6          NaN         NaN         Bran     Balwner
6          7          NaN         NaN        Bryce       Brice
7          8          NaN         NaN        Betty      Btisan

Exercise 6 - Statistics

Explore the wind speed data

Step 1: Import the necessary libraries

In [289]:

# Run the following code
import pandas as pd
import datetime

Step 2: Import the data from the following path

In [290]:

import pandas as pd

In [35]:

# Run the following code
path6 = "./exercise_data/wind.data"  # wind.data

Step 3: Store the data and turn the first three columns into a proper datetime index

In [292]:

import datetime

In [293]:

# Run the following code
data = pd.read_table(path6, sep=r"\s+", parse_dates=[[0, 1, 2]])
data.head()

Out[293]:

     Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  2061-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  2061-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
2  2061-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
3  2061-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
4  2061-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83
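
A compatibility aside: pd.read_table still works, but pd.read_csv with an explicit separator is the more common spelling today, and the nested parse_dates=[[0, 1, 2]] form (which fuses the year, month, and day columns into one date) is deprecated on pandas 2.x, so treat this as a sketch that may warn on newer versions:

# Same load via read_csv; r"\s+" splits on runs of whitespace
data = pd.read_csv(path6, sep=r"\s+", parse_dates=[[0, 1, 2]])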

Step 4: The year 2061? Do we really have data from that year? Create a function and use it to fix this bug.

In [294]:

# Run the following code
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values to the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)

# data.info()
data.head()

Out[294]:

     Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
2  1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
3  1961-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
4  1961-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83
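
For what it's worth, the century fix can also be done without a row-by-row apply. This sketch assumes Yr_Mo_Dy is still the datetime64 column produced by the parse (i.e., run it instead of the apply above):

# Vectorized century fix: shift any year after 1989 back by 100 years
mask = data['Yr_Mo_Dy'].dt.year > 1989
data.loc[mask, 'Yr_Mo_Dy'] -= pd.DateOffset(years=100)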

Step 5: Set the date as the index. Pay attention to the data type; it should be datetime64[ns].

In [295]:

# Run the following code
# transform Yr_Mo_Dy it to date type datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])

# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')

data.head()
# data.info()

Out[295]:

              RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
1961-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
1961-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83

Step 6: How many values are missing for each location?

In [296]:

# Run the following code
data.isnull().sum()

Out[296]:

RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64

Step 7: How many complete (non-missing) values are there for each location?

In [297]:

# Run the following code
data.shape[0] - data.isnull().sum()

Out[297]:

RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64

Step 8: Compute the mean wind speed over the entire dataset

In [298]:

# Run the following code
data.mean().mean()

Out[298]:

10.227982360836924

Step 9: Create a DataFrame called loc_stats to compute and store the min, max, mean, and standard deviation of the wind speed at each location

In [299]:

# Run the following code
loc_stats = pd.DataFrame()

loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max 
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations

loc_stats

Out[299]:

      min    max       mean       std
RPT  0.67  35.80  12.362987  5.618413
VAL  0.21  33.37  10.644314  5.267356
ROS  1.50  33.84  11.660526  5.008450
KIL  0.00  28.46   6.306468  3.605811
SHA  0.13  37.54  10.455834  4.936125
BIR  0.00  26.16   7.092254  3.968683
DUB  0.00  30.37   9.797343  4.977555
CLA  0.00  31.08   8.495053  4.499449
MUL  0.00  25.88   8.493590  4.166872
CLO  0.04  28.21   8.707332  4.503954
BEL  0.13  42.38  13.121007  5.835037
MAL  0.67  42.54  15.599079  6.699794
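
The same summary can be produced in one shot with DataFrame.agg plus a transpose; a small equivalent sketch:

# min/max/mean/std for every column, transposed so each location becomes a row
loc_stats = data.agg(['min', 'max', 'mean', 'std']).T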

Step 10: Create a DataFrame called day_stats to compute and store the min, max, mean, and standard deviation of the wind speed across all locations for each day

In [300]:

# Run the following code
# create the dataframe
day_stats = pd.DataFrame()

# this time axis=1 aggregates across each row (i.e., across locations per day)
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max 
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations

day_stats.head()

Out[300]:

             min    max       mean       std
Yr_Mo_Dy
1961-01-01  9.29  18.50  13.018182  2.808875
1961-01-02  6.50  17.54  11.336364  3.188994
1961-01-03  6.17  18.50  11.641818  3.681912
1961-01-04  1.79  11.75   6.619167  3.198126
1961-01-05  6.17  13.33  10.630000  2.445356

Step 11: Compute the average January wind speed for each location

Note: January 1961 and January 1962 should both be counted simply as January here.

In [301]:

# Run the following code
# creates a new column 'date' and gets the values from the index
data['date'] = data.index

# creates a column for each value from date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)

# get all rows from month 1 and assign them to january_winds
january_winds = data.query('month == 1')

# gets the mean from january_winds, using .loc to not print the mean of month, year and day
january_winds.loc[:,'RPT':"MAL"].mean()

Out[301]:

RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
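
If you don't need the helper columns, the same January averages fall straight out of the DatetimeIndex; a minimal sketch:

# Select January rows off the index, keep only the wind columns, then average
data.loc[data.index.month == 1, 'RPT':'MAL'].mean()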

Step 12: Sample the records on a yearly frequency (take January 1 of each year)

In [302]:

# Run the following code
data.query('month == 1 and day == 1')

Out[302]:

              RPT    VAL    ROS    KIL    SHA    BIR    DUB    CLA    MUL    CLO    BEL    MAL        date  month  year  day
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN   9.87  13.67  10.25  10.83  12.58  18.50  15.04  1961-01-01      1  1961    1
1962-01-01   9.29   3.42  11.54   3.50   2.21   1.96  10.41   2.79   3.54   5.17   4.38   7.92  1962-01-01      1  1962    1
1963-01-01  15.59  13.62  19.79   8.38  12.25  10.00  23.45  15.71  13.59  14.37  17.58  34.13  1963-01-01      1  1963    1
1964-01-01  25.80  22.13  18.21  13.25  21.29  14.79  14.12  19.58  13.25  16.75  28.96  21.00  1964-01-01      1  1964    1
1965-01-01   9.54  11.92   9.00   4.38   6.08   5.21  10.25   6.08   5.71   8.63  12.04  17.41  1965-01-01      1  1965    1
1966-01-01  22.04  21.50  17.08  12.75  22.17  15.59  21.79  18.12  16.66  17.83  28.33  23.79  1966-01-01      1  1966    1
1967-01-01   6.46   4.46   6.50   3.21   6.67   3.79  11.38   3.83   7.71   9.08  10.67  20.91  1967-01-01      1  1967    1
1968-01-01  30.04  17.88  16.25  16.25  21.79  12.54  18.16  16.62  18.75  17.62  22.25  27.29  1968-01-01      1  1968    1
1969-01-01   6.13   1.63   5.41   1.08   2.54   1.00   8.50   2.42   4.58   6.34   9.17  16.71  1969-01-01      1  1969    1
1970-01-01   9.59   2.96  11.79   3.42   6.13   4.08   9.00   4.46   7.29   3.50   7.33  13.00  1970-01-01      1  1970    1
1971-01-01   3.71   0.79   4.71   0.17   1.42   1.04   4.63   0.75   1.54   1.
