A Beginner-Friendly Pandas Command Handbook + Exercises: a 30,000-Word Walkthrough
Posted by 五包辣条!
Hi everyone, I'm 辣条 (Latiao).
Pandas is Python's core data-analysis library. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Pandas is commonly used for matrix data with row and column labels and for tabular data similar to SQL or Excel tables, and is applied in finance, statistics, the social sciences, engineering, and other fields for data wrangling and cleaning, analysis and modeling, and visualization and reporting.
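As a quick taste before the exercises, here is a minimal sketch of pandas' two core data structures, the Series (a labeled 1-D array) and the DataFrame (a labeled 2-D table); the values below are made up for illustration:

import pandas as pd

# a Series: one-dimensional values with an index of labels
s = pd.Series([4, 7, 5], index=['a', 'b', 'c'])

# a DataFrame: a two-dimensional table with labeled rows and columns
df = pd.DataFrame({'item': ['salsa', 'bowl'], 'price': [2.39, 16.98]})

print(s)
print(df.head())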
Exercise Index
Exercise | Contents | Dataset |
---|---|---|
Exercise 1 - Getting to know your data | Explore the Chipotle fast-food data | chipotle.tsv |
Exercise 2 - Filtering and sorting | Explore the Euro 2012 data | Euro2012_stats.csv |
Exercise 3 - Grouping | Explore the alcohol consumption data | drinks.csv |
Exercise 4 - Apply functions | Explore the 1960-2014 US crime data | US_Crime_Rates_1960_2014.csv |
Exercise 5 - Merging | Explore fictional name data | data built by hand in the exercise |
Exercise 6 - Statistics | Explore the wind speed data | wind.data |
Exercise 7 - Visualization | Explore the Titanic disaster data | train.csv |
Exercise 8 - Creating DataFrames | Explore the Pokemon data | data built by hand in the exercise |
Exercise 9 - Time series | Explore Apple's stock price data | Apple_stock.csv |
Exercise 10 - Deleting data | Explore the Iris flower data | iris.csv |
Exercise 1 - Getting to know your data
Explore the Chipotle fast-food data
Step 1: Import the necessary libraries
In [7]:
# run the following code
import pandas as pd
Step 2: Import the dataset from the path below
In [5]:
# run the following code
path1 = "./exercise_data/chipotle.tsv"    # chipotle.tsv
Step 3: Store the dataset in a DataFrame named chipo
In [8]:
# run the following code
chipo = pd.read_csv(path1, sep='\t')
Step 4: View the first 10 rows
In [9]:
# run the following code
chipo.head(10)
Out[9]:
order_id | quantity | item_name | choice_description | item_price | |
---|---|---|---|---|---|
0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 |
6 | 3 | 1 | Side of Chips | NaN | $1.69 |
7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 |
8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 |
9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 |
Step 6: How many columns are in the dataset?
In [236]:
# run the following code
chipo.shape[1]
Out[236]:
5
Step 7: Print the names of all the columns
In [237]:
# run the following code
chipo.columns
Out[237]:
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
Step 8: What does the dataset's index look like?
In [238]:
# run the following code
chipo.index
Out[238]:
RangeIndex(start=0, stop=4622, step=1)
Step 9: Which item was ordered the most?
In [239]:
# run the following code (corrected)
c = chipo[['item_name', 'quantity']].groupby(['item_name'], as_index=False).agg({'quantity': 'sum'})
c.sort_values(['quantity'], ascending=False, inplace=True)
c.head()
Out[239]:
item_name | quantity | |
---|---|---|
17 | Chicken Bowl | 761 |
18 | Chicken Burrito | 591 |
25 | Chips and Guacamole | 506 |
39 | Steak Burrito | 386 |
10 | Canned Soft Drink | 351 |
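The sort-based answer above can also be written more compactly; a minimal equivalent sketch using groupby plus nlargest:

# sum the quantity per item, then take the 5 largest
chipo.groupby('item_name')['quantity'].sum().nlargest(5)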
Step 10: How many different items were ordered (count of unique values in item_name)?
In [240]:
# run the following code
chipo['item_name'].nunique()
Out[240]:
50
Step 11: Which item in choice_description was ordered most often?
In [241]:
# run the following code (this has some minor issues, noted below)
chipo['choice_description'].value_counts().head()
Out[241]:
[Diet Coke] 134
[Coke] 123
[Sprite] 77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]] 42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]] 40
Name: choice_description, dtype: int64
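The "minor issues" flagged in the comment: value_counts counts order lines, so it ignores the quantity column, and it silently drops rows where choice_description is NaN. A sketch that weights by quantity instead (note groupby also drops NaN keys by default, so missing descriptions are still excluded):

# weight each choice_description by the number of units ordered
chipo.groupby('choice_description')['quantity'].sum().nlargest(5)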
Step 12: How many items were ordered in total?
In [242]:
# run the following code
total_items_orders = chipo['quantity'].sum()
total_items_orders
Out[242]:
4972
Step 13: Convert item_price to a float
In [243]:
# run the following code
dollarizer = lambda x: float(x[1:-1])   # strip the leading '$' and trailing character, then cast
chipo['item_price'] = chipo['item_price'].apply(dollarizer)
Step 14: How much revenue was generated over the dataset's time period?
In [244]:
# run the following code (corrected)
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'], 2)
chipo['sub_total'].sum()
Out[244]:
39237.02
Step 15: How many orders were placed over the period?
In [245]:
# run the following code
chipo['order_id'].nunique()
Out[245]:
1834
Step 16: What is the average total per order?
In [246]:
# run the following code (corrected)
chipo[['order_id', 'sub_total']].groupby(by=['order_id']).agg({'sub_total': 'sum'})['sub_total'].mean()
Out[246]:
21.39423118865867
Step 17: How many different items were sold?
In [247]:
# run the following code
chipo['item_name'].nunique()
Out[247]:
50
Exercise 2 - Filtering and Sorting
Explore the Euro 2012 data
Step 1: Import the necessary libraries
In [248]:
# run the following code
import pandas as pd
Step 2: Import the dataset from the path below
In [249]:
# run the following code
path2 = "./exercise_data/Euro2012_stats.csv"    # Euro2012_stats.csv
Step 3: Name the dataset euro12
In [250]:
# run the following code
euro12 = pd.read_csv(path2)
euro12
Out[250]:
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 | 9 | 9 | 16 |
1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 | 11 | 11 | 19 |
2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 | 7 | 7 | 15 |
3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 | 11 | 11 | 16 |
4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 | 11 | 11 | 19 |
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 | 12 | 12 | 20 |
7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 | 2 | 0 | 0 | ... | 20 | 74.1% | 101 | 89 | 16 | 16 | 0 | 18 | 18 | 19 |
8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 | 2 | 0 | 0 | ... | 12 | 70.6% | 35 | 30 | 3 | 5 | 0 | 7 | 7 | 15 |
9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 | 0 | 0 | 0 | ... | 6 | 66.7% | 48 | 56 | 3 | 7 | 1 | 7 | 7 | 17 |
10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 | 6 | 0 | 0 | ... | 10 | 71.5% | 73 | 90 | 10 | 12 | 0 | 14 | 14 | 16 |
11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 | 0 | 0 | 0 | ... | 17 | 65.4% | 43 | 51 | 11 | 6 | 1 | 10 | 10 | 17 |
12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 | 2 | 0 | 0 | ... | 10 | 77.0% | 34 | 43 | 4 | 6 | 0 | 7 | 7 | 16 |
13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 | 17 | 17 | 18 |
14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 | 3 | 0 | 0 | ... | 8 | 61.6% | 35 | 51 | 7 | 7 | 0 | 9 | 9 | 18 |
15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 | 0 | 0 | 0 | ... | 13 | 76.5% | 48 | 31 | 4 | 5 | 0 | 9 | 9 | 18 |
16 rows × 35 columns
Step 4: Select only the Goals column
In [251]:
# run the following code
euro12.Goals
Out[251]:
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64
Step 5: How many teams took part in Euro 2012?
In [252]:
# run the following code
euro12.shape[0]
Out[252]:
16
Step 6: How many columns are in the dataset?
In [253]:
# run the following code
euro12.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
Team                          16 non-null object
Goals                         16 non-null int64
Shots on target               16 non-null int64
Shots off target              16 non-null int64
Shooting Accuracy             16 non-null object
% Goals-to-shots              16 non-null object
Total shots (inc. Blocked)    16 non-null int64
Hit Woodwork                  16 non-null int64
Penalty goals                 16 non-null int64
Penalties not scored          16 non-null int64
Headed goals                  16 non-null int64
Passes                        16 non-null int64
Passes completed              16 non-null int64
Passing Accuracy              16 non-null object
Touches                       16 non-null int64
Crosses                       16 non-null int64
Dribbles                      16 non-null int64
Corners Taken                 16 non-null int64
Tackles                       16 non-null int64
Clearances                    16 non-null int64
Interceptions                 16 non-null int64
Clearances off line           15 non-null float64
Clean Sheets                  16 non-null int64
Blocks                        16 non-null int64
Goals conceded                16 non-null int64
Saves made                    16 non-null int64
Saves-to-shots ratio          16 non-null object
Fouls Won                     16 non-null int64
Fouls Conceded                16 non-null int64
Offsides                      16 non-null int64
Yellow Cards                  16 non-null int64
Red Cards                     16 non-null int64
Subs on                       16 non-null int64
Subs off                      16 non-null int64
Players Used                  16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
Step 7: Store the columns Team, Yellow Cards, and Red Cards in a separate DataFrame named discipline
In [254]:
# run the following code
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline
Out[254]:
Team | Yellow Cards | Red Cards | |
---|---|---|---|
0 | Croatia | 9 | 0 |
1 | Czech Republic | 7 | 0 |
2 | Denmark | 4 | 0 |
3 | England | 5 | 0 |
4 | France | 6 | 0 |
5 | Germany | 4 | 0 |
6 | Greece | 9 | 1 |
7 | Italy | 16 | 0 |
8 | Netherlands | 5 | 0 |
9 | Poland | 7 | 1 |
10 | Portugal | 12 | 0 |
11 | Republic of Ireland | 6 | 1 |
12 | Russia | 6 | 0 |
13 | Spain | 11 | 0 |
14 | Sweden | 7 | 0 |
15 | Ukraine | 5 | 0 |
Step 8: Sort the discipline DataFrame by Red Cards first, then Yellow Cards
In [255]:
# run the following code
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
Out[255]:
Team | Yellow Cards | Red Cards | |
---|---|---|---|
6 | Greece | 9 | 1 |
9 | Poland | 7 | 1 |
11 | Republic of Ireland | 6 | 1 |
7 | Italy | 16 | 0 |
10 | Portugal | 12 | 0 |
13 | Spain | 11 | 0 |
0 | Croatia | 9 | 0 |
1 | Czech Republic | 7 | 0 |
14 | Sweden | 7 | 0 |
4 | France | 6 | 0 |
12 | Russia | 6 | 0 |
3 | England | 5 | 0 |
8 | Netherlands | 5 | 0 |
15 | Ukraine | 5 | 0 |
2 | Denmark | 4 | 0 |
5 | Germany | 4 | 0 |
Step 9: Calculate the mean number of yellow cards per team
In [256]:
# run the following code
round(discipline['Yellow Cards'].mean())
Out[256]:
7.0
Step 10: Find the teams that scored more than 6 goals
In [257]:
# run the following code
euro12[euro12.Goals > 6]
Out[257]:
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 | 17 | 17 | 18 |
2 rows × 35 columns
Step 11: Select the teams whose names start with the letter G
In [258]:
# run the following code
euro12[euro12.Team.str.startswith('G')]
Out[258]:
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 | 12 | 12 | 20 |
2 rows × 35 columns
Step 12: Select the first 7 columns
In [259]:
# run the following code
euro12.iloc[:, 0:7]
Out[259]:
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | |
---|---|---|---|---|---|---|---|
0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 |
1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 |
2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 |
3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 |
4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 |
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 |
6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 |
7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 |
8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 |
9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 |
10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 |
11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 |
12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 |
13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 |
14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 |
15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 |
Step 13: Select all columns except the last 3
In [260]:
# run the following code
euro12.iloc[:, :-3]
Out[260]:
Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Clean Sheets | Blocks | Goals conceded | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 0 | 10 | 3 | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 |
1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 1 | 10 | 6 | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 |
2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 1 | 10 | 5 | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 |
3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 2 | 29 | 3 | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 |
4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 1 | 7 | 5 | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 |
5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 1 | 11 | 6 | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 |
6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 1 | 23 | 7 | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 |
7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 | 2 | 0 | 0 | ... | 2 | 18 | 7 | 20 | 74.1% | 101 | 89 | 16 | 16 | 0 |
8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 | 2 | 0 | 0 | ... | 0 | 9 | 5 | 12 | 70.6% | 35 | 30 | 3 | 5 | 0 |
9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 | 0 | 0 | 0 | ... | 0 | 8 | 3 | 6 | 66.7% | 48 | 56 | 3 | 7 | 1 |
10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 | 6 | 0 | 0 | ... | 2 | 11 | 4 | 10 | 71.5% | 73 | 90 | 10 | 12 | 0 |
11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 | 0 | 0 | 0 | ... | 0 | 23 | 9 | 17 | 65.4% | 43 | 51 | 11 | 6 | 1 |
12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 | 2 | 0 | 0 | ... | 0 | 8 | 3 | 10 | 77.0% | 34 | 43 | 4 | 6 | 0 |
13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 5 | 8 | 1 | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 |
14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 | 3 | 0 | 0 | ... | 1 | 12 | 5 | 8 | 61.6% | 35 | 51 | 7 | 7 | 0 |
15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 | 0 | 0 | 0 | ... | 0 | 4 | 4 | 13 | 76.5% | 48 | 31 | 4 | 5 | 0 |
16 rows × 32 columns
Step 14: Find the Shooting Accuracy of England, Italy, and Russia
In [261]:
# run the following code
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]
Out[261]:
Team | Shooting Accuracy | |
---|---|---|
3 | England | 50.0% |
7 | Italy | 43.0% |
12 | Russia | 22.5% |
Exercise 3 - Grouping
Explore the alcohol consumption data
Step 1: Import the necessary libraries
In [262]:
# run the following code
import pandas as pd
Step 2: Import the data from the path below
In [10]:
# run the following code
path3 = './exercise_data/drinks.csv'    # 'drinks.csv'
Step 3: Name the DataFrame drinks
In [11]:
# run the following code
drinks = pd.read_csv(path3)
drinks.head()
Out[11]:
country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
1 | Albania | 89 | 132 | 54 | 4.9 | EU |
2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
4 | Angola | 217 | 57 | 45 | 5.9 | AF |
Step 4: Which continent drinks the most beer on average?
In [12]:
# run the following code
drinks.groupby('continent').beer_servings.mean()
Out[12]:
continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64
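Since the question asks for a single continent, idxmax can read the answer straight off the averaged Series:

# label of the row with the highest average beer consumption ('EU' here)
drinks.groupby('continent').beer_servings.mean().idxmax()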
Step 5: Print the descriptive statistics of wine consumption (wine_servings) for each continent
In [13]:
# run the following code
drinks.groupby('continent').wine_servings.describe()
Out[13]:
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
continent | ||||||||
AF | 53.0 | 16.264151 | 38.846419 | 0.0 | 1.0 | 2.0 | 13.00 | 233.0 |
AS | 44.0 | 9.068182 | 21.667034 | 0.0 | 0.0 | 1.0 | 8.00 | 123.0 |
EU | 45.0 | 142.222222 | 97.421738 | 0.0 | 59.0 | 128.0 | 195.00 | 370.0 |
OC | 16.0 | 35.625000 | 64.555790 | 0.0 | 1.0 | 8.5 | 23.25 | 212.0 |
SA | 12.0 | 62.416667 | 88.620189 | 1.0 | 3.0 | 12.0 | 98.50 | 221.0 |
Step 6: Print the mean consumption of each drink category for each continent
In [15]:
# run the following code
drinks.groupby('continent').mean()
Out[15]:
beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | |
---|---|---|---|---|
continent | ||||
AF | 61.471698 | 16.339623 | 16.264151 | 3.007547 |
AS | 37.045455 | 60.840909 | 9.068182 | 2.170455 |
EU | 193.777778 | 132.555556 | 142.222222 | 8.617778 |
OC | 89.687500 | 58.437500 | 35.625000 | 3.381250 |
SA | 175.083333 | 114.750000 | 62.416667 | 6.308333 |
Step 7: Print the median consumption of each drink category for each continent
In [268]:
# run the following code
drinks.groupby('continent').median()
Out[268]:
beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | |
---|---|---|---|---|
continent | ||||
AF | 32.0 | 3.0 | 2.0 | 2.30 |
AS | 17.5 | 16.0 | 1.0 | 1.20 |
EU | 219.0 | 122.0 | 128.0 | 10.00 |
OC | 52.5 | 37.0 | 8.5 | 1.75 |
SA | 162.5 | 108.5 | 12.0 | 6.85 |
Step 8: Print the mean, minimum, and maximum spirit consumption for each continent
In [269]:
# run the following code
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
Out[269]:
mean | min | max | |
---|---|---|---|
continent | |||
AF | 16.339623 | 0 | 152 |
AS | 60.840909 | 0 | 326 |
EU | 132.555556 | 0 | 373 |
OC | 58.437500 | 0 | 254 |
SA | 114.750000 | 25 | 302 |
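On pandas 0.25 and later, the same summary can also be written with named aggregation, which lets you choose the output column names; a sketch (the names spirit_mean etc. are arbitrary):

# named aggregation: output name on the left, (column, function) on the right
drinks.groupby('continent').agg(
    spirit_mean=('spirit_servings', 'mean'),
    spirit_min=('spirit_servings', 'min'),
    spirit_max=('spirit_servings', 'max'),
)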
Exercise 4 - Apply Functions
Explore the 1960-2014 US crime data
Step 1: Import the necessary libraries
In [16]:
# run the following code
import numpy as np
import pandas as pd
Step 2: Import the dataset from the path below
In [27]:
# run the following code
path4 = './exercise_data/US_Crime_Rates_1960_2014.csv'    # "US_Crime_Rates_1960_2014.csv"
Step 3: Name the DataFrame crime
In [28]:
# run the following code
crime = pd.read_csv(path4)
crime.head()
Out[28]:
Year | Population | Total | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1960 | 179323175 | 3384200 | 288460 | 3095700 | 9110 | 17190 | 107840 | 154320 | 912100 | 1855400 | 328200 |
1 | 1961 | 182992000 | 3488000 | 289390 | 3198600 | 8740 | 17220 | 106670 | 156760 | 949600 | 1913000 | 336000 |
2 | 1962 | 185771000 | 3752200 | 301510 | 3450700 | 8530 | 17550 | 110860 | 164570 | 994300 | 2089600 | 366800 |
3 | 1963 | 188483000 | 4109500 | 316970 | 3792500 | 8640 | 17650 | 116470 | 174210 | 1086400 | 2297800 | 408300 |
4 | 1964 | 191141000 | 4564600 | 364220 | 4200400 | 9360 | 21420 | 130390 | 203050 | 1213200 | 2514400 | 472800 |
Step 4: What is the data type of each column?
In [29]:
# run the following code
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null int64
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: int64(12)
memory usage: 5.2 KB
Did you notice that Year's data type is int64? Pandas has a different data type for handling time series, which we'll look at now.
Step 5: Convert the type of Year to datetime64
In [30]:
# run the following code
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null datetime64[ns]
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.2 KB
Step 6: Set the Year column as the index of the DataFrame
In [31]:
# run the following code
crime = crime.set_index('Year', drop=True)
crime.head()
Out[31]:
Population | Total | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft | |
---|---|---|---|---|---|---|---|---|---|---|---|
Year | |||||||||||
1960-01-01 | 179323175 | 3384200 | 288460 | 3095700 | 9110 | 17190 | 107840 | 154320 | 912100 | 1855400 | 328200 |
1961-01-01 | 182992000 | 3488000 | 289390 | 3198600 | 8740 | 17220 | 106670 | 156760 | 949600 | 1913000 | 336000 |
1962-01-01 | 185771000 | 3752200 | 301510 | 3450700 | 8530 | 17550 | 110860 | 164570 | 994300 | 2089600 | 366800 |
1963-01-01 | 188483000 | 4109500 | 316970 | 3792500 | 8640 | 17650 | 116470 | 174210 | 1086400 | 2297800 | 408300 |
1964-01-01 | 191141000 | 4564600 | 364220 | 4200400 | 9360 | 21420 | 130390 | 203050 | 1213200 | 2514400 | 472800 |
Step 7: Delete the column named Total
In [32]:
# run the following code
del crime['Total']
crime.head()
Out[32]:
Population | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft | |
---|---|---|---|---|---|---|---|---|---|---|
Year | ||||||||||
1960-01-01 | 179323175 | 288460 | 3095700 | 9110 | 17190 | 107840 | 154320 | 912100 | 1855400 | 328200 |
1961-01-01 | 182992000 | 289390 | 3198600 | 8740 | 17220 | 106670 | 156760 | 949600 | 1913000 | 336000 |
1962-01-01 | 185771000 | 301510 | 3450700 | 8530 | 17550 | 110860 | 164570 | 994300 | 2089600 | 366800 |
1963-01-01 | 188483000 | 316970 | 3792500 | 8640 | 17650 | 116470 | 174210 | 1086400 | 2297800 | 408300 |
1964-01-01 | 191141000 | 364220 | 4200400 | 9360 | 21420 | 130390 | 203050 | 1213200 | 2514400 | 472800 |
In [33]:
crime.resample('10AS').sum()
Out[33]:
Population | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft | |
---|---|---|---|---|---|---|---|---|---|---|
Year | ||||||||||
1960-01-01 | 1915053175 | 4134930 | 45160900 | 106180 | 236720 | 1633510 | 2158520 | 13321100 | 26547700 | 5292100 |
1970-01-01 | 2121193298 | 9607930 | 91383800 | 192230 | 554570 | 4159020 | 4702120 | 28486000 | 53157800 | 9739900 |
1980-01-01 | 2371370069 | 14074328 | 117048900 | 206439 | 865639 | 5383109 | 7619130 | 33073494 | 72040253 | 11935411 |
1990-01-01 | 2612825258 | 17527048 | 119053499 | 211664 | 998827 | 5748930 | 10568963 | 26750015 | 77679366 | 14624418 |
2000-01-01 | 2947969117 | 13968056 | 100944369 | 163068 | 922499 | 4230366 | 8652124 | 21565176 | 67970291 | 11412834 |
2010-01-01 | 1570146307 | 6072017 | 44095950 | 72867 | 421059 | 1749809 | 3764142 | 10125170 | 30401698 | 3569080 |
Step 8: Group the DataFrame by Year (per decade) and sum
**Note: directly summing the Population column, as in the preview above, would not be correct**
In [34]:
# more on .resample:
# (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
# more on Offset Aliases:
# (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
# run the following code
crimes = crime.resample('10AS').sum()    # resample the time series per decade
# use resample to get the max of the "Population" column
population = crime['Population'].resample('10AS').max()
# update "Population"
crimes['Population'] = population
crimes
Out[34]:
Population | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft | |
---|---|---|---|---|---|---|---|---|---|---|
Year | ||||||||||
1960-01-01 | 201385000 | 4134930 | 45160900 | 106180 | 236720 | 1633510 | 2158520 | 13321100 | 26547700 | 5292100 |
1970-01-01 | 220099000 | 9607930 | 91383800 | 192230 | 554570 | 4159020 | 4702120 | 28486000 | 53157800 | 9739900 |
1980-01-01 | 248239000 | 14074328 | 117048900 | 206439 | 865639 | 5383109 | 7619130 | 33073494 | 72040253 | 11935411 |
1990-01-01 | 272690813 | 17527048 | 119053499 | 211664 | 998827 | 5748930 | 10568963 | 26750015 | 77679366 | 14624418 |
2000-01-01 | 307006550 | 13968056 | 100944369 | 163068 | 922499 | 4230366 | 8652124 | 21565176 | 67970291 | 11412834 |
2010-01-01 | 318857056 | 6072017 | 44095950 | 72867 | 421059 | 1749809 | 3764142 | 10125170 | 30401698 | 3569080 |
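The two resample passes above can also be collapsed into a single call by passing a per-column aggregation dict; a sketch that should produce the same decade table (on recent pandas 2.x versions the '10AS' alias is spelled '10YS'):

# sum every column per decade, except Population, which takes its decade maximum
agg_map = {col: 'sum' for col in crime.columns}
agg_map['Population'] = 'max'
crimes = crime.resample('10AS').agg(agg_map)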
Step 9: What was the most dangerous era to live through in US history?
In [279]:
# run the following code
crime.idxmax(0)
Out[279]:
Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]
Exercise 5 - Merging
Explore fictional name data
Step 1: Import the necessary libraries
In [280]:
# run the following code
import numpy as np
import pandas as pd
Step 2: Create DataFrames from the raw data below
In [281]:
# run the following code
raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
Step 3: Name the DataFrames data1, data2, and data3
In [282]:
# run the following code
data1 = pd.DataFrame(raw_data_1, columns=['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns=['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns=['subject_id', 'test_id'])
Step 4: Concatenate data1 and data2 along the row axis and name the result all_data
In [283]:
# run the following code
all_data = pd.concat([data1, data2])
all_data
Out[283]:
subject_id | first_name | last_name | |
---|---|---|---|
0 | 1 | Alex | Anderson |
1 | 2 | Amy | Ackerman |
2 | 3 | Allen | Ali |
3 | 4 | Alice | Aoni |
4 | 5 | Ayoung | Atiches |
0 | 4 | Billy | Bonder |
1 | 5 | Brian | Black |
2 | 6 | Bran | Balwner |
3 | 7 | Bryce | Brice |
4 | 8 | Betty | Btisan |
Step 5: Concatenate data1 and data2 along the column axis and name the result all_data_col
In [284]:
# run the following code
all_data_col = pd.concat([data1, data2], axis=1)
all_data_col
Out[284]:
subject_id | first_name | last_name | subject_id | first_name | last_name | |
---|---|---|---|---|---|---|
0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
1 | 2 | Amy | Ackerman | 5 | Brian | Black |
2 | 3 | Allen | Ali | 6 | Bran | Balwner |
3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |
Step 6: Print data3
In [285]:
# run the following code
data3
Out[285]:
subject_id | test_id | |
---|---|---|
0 | 1 | 51 |
1 | 2 | 15 |
2 | 3 | 15 |
3 | 4 | 61 |
4 | 5 | 16 |
5 | 7 | 14 |
6 | 8 | 15 |
7 | 9 | 1 |
8 | 10 | 61 |
9 | 11 | 16 |
Step 7: Merge all_data and data3 on the value of subject_id
In [286]:
# run the following code
pd.merge(all_data, data3, on='subject_id')
Out[286]:
subject_id | first_name | last_name | test_id | |
---|---|---|---|---|
0 | 1 | Alex | Anderson | 51 |
1 | 2 | Amy | Ackerman | 15 |
2 | 3 | Allen | Ali | 15 |
3 | 4 | Alice | Aoni | 61 |
4 | 4 | Billy | Bonder | 61 |
5 | 5 | Ayoung | Atiches | 16 |
6 | 5 | Brian | Black | 16 |
7 | 7 | Bryce | Brice | 14 |
8 | 8 | Betty | Btisan | 15 |
Step 8: Perform an inner join of data1 and data2 on subject_id
In [287]:
# run the following code
pd.merge(data1, data2, on='subject_id', how='inner')
Out[287]:
subject_id | first_name_x | last_name_x | first_name_y | last_name_y | |
---|---|---|---|---|---|
0 | 4 | Alice | Aoni | Billy | Bonder |
1 | 5 | Ayoung | Atiches | Brian | Black |
Step 9: Find all matches between data1 and data2 (outer join)
In [288]:
# run the following code
pd.merge(data1, data2, on='subject_id', how='outer')
Out[288]:
subject_id | first_name_x | last_name_x | first_name_y | last_name_y | |
---|---|---|---|---|---|
0 | 1 | Alex | Anderson | NaN | NaN |
1 | 2 | Amy | Ackerman | NaN | NaN |
2 | 3 | Allen | Ali | NaN | NaN |
3 | 4 | Alice | Aoni | Billy | Bonder |
4 | 5 | Ayoung | Atiches | Brian | Black |
5 | 6 | NaN | NaN | Bran | Balwner |
6 | 7 | NaN | NaN | Bryce | Brice |
7 | 8 | NaN | NaN | Betty | Btisan |
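When checking an outer join like this one, merge's indicator=True option is handy: it appends a _merge column recording whether each row matched in the left frame, the right frame, or both; a sketch:

# adds a '_merge' column with values 'left_only', 'right_only', or 'both'
pd.merge(data1, data2, on='subject_id', how='outer', indicator=True)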
Exercise 6 - Statistics
Explore the wind speed data
Step 1: Import the necessary libraries
In [289]:
# run the following code
import pandas as pd
import datetime
Step 2: Import the data from the path below
In [35]:
# run the following code
path6 = "./exercise_data/wind.data"    # wind.data
Step 3: Store the data and set the first three columns as an appropriate index
In [293]:
# run the following code
data = pd.read_table(path6, sep=r"\s+", parse_dates=[[0, 1, 2]])
data.head()
Out[293]:
Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2061-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
1 | 2061-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
2 | 2061-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
3 | 2061-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
4 | 2061-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
Step 4: The year 2061? Do we really have data from that year? Write a function to fix this bug
In [294]:
# run the following code
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply fix_century to the column, replacing the values with the corrected ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
# data.info()
data.head()
Out[294]:
Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
1 | 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
2 | 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
3 | 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
4 | 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
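The row-by-row apply above works; as an alternative sketch, the same fix can be vectorized on the original datetime64 column (run this instead of, not after, fix_century):

# shift any date that was parsed a century into the future back by 100 years
mask = data['Yr_Mo_Dy'].dt.year > 1989
data.loc[mask, 'Yr_Mo_Dy'] -= pd.DateOffset(years=100)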
Step 5: Set the date as the index; note the data type, which should be datetime64[ns]
In [295]:
# run the following code
# transform Yr_Mo_Dy to the datetime64 type
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])
# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')
data.head()
# data.info()
Out[295]:
RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Yr_Mo_Dy | ||||||||||||
1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
Step 6: How many values are missing for each location?
In [296]:
# run the following code
data.isnull().sum()
Out[296]:
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64
Step 7: How many complete data values are there for each location?
In [297]:
# run the following code
data.shape[0] - data.isnull().sum()
Out[297]:
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
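Equivalently, the per-column count of non-missing values is available directly:

# count() returns the number of non-NaN values in each column
data.count()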
Step 8: Calculate the mean wind speed over all the data
In [298]:
# run the following code
data.mean().mean()
Out[298]:
10.227982360836924
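One subtlety: data.mean().mean() averages the twelve per-location means. Because each location has a different number of missing values, this can differ slightly from the mean over every individual observation; a sketch of the pooled version:

# stack() flattens the frame into one Series, dropping NaN, before averaging
data.stack().mean()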
Step 9: Create a DataFrame named loc_stats that computes and stores the min, max, mean, and standard deviation of the wind speed at each location
In [299]:
# run the following code
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min()      # min
loc_stats['max'] = data.max()      # max
loc_stats['mean'] = data.mean()    # mean
loc_stats['std'] = data.std()      # standard deviation
loc_stats
Out[299]:
min | max | mean | std | |
---|---|---|---|---|
RPT | 0.67 | 35.80 | 12.362987 | 5.618413 |
VAL | 0.21 | 33.37 | 10.644314 | 5.267356 |
ROS | 1.50 | 33.84 | 11.660526 | 5.008450 |
KIL | 0.00 | 28.46 | 6.306468 | 3.605811 |
SHA | 0.13 | 37.54 | 10.455834 | 4.936125 |
BIR | 0.00 | 26.16 | 7.092254 | 3.968683 |
DUB | 0.00 | 30.37 | 9.797343 | 4.977555 |
CLA | 0.00 | 31.08 | 8.495053 | 4.499449 |
MUL | 0.00 | 25.88 | 8.493590 | 4.166872 |
CLO | 0.04 | 28.21 | 8.707332 | 4.503954 |
BEL | 0.13 | 42.38 | 13.121007 | 5.835037 |
MAL | 0.67 | 42.54 | 15.599079 | 6.699794 |
Step 10: Create a DataFrame named day_stats that computes and stores the min, max, mean, and standard deviation of the wind speed across all locations for each day
In [300]:
# run the following code
# create the dataframe
day_stats = pd.DataFrame()
# this time we set axis=1 so each statistic is computed across a row
day_stats['min'] = data.min(axis=1)      # min
day_stats['max'] = data.max(axis=1)      # max
day_stats['mean'] = data.mean(axis=1)    # mean
day_stats['std'] = data.std(axis=1)      # standard deviation
day_stats.head()
Out[300]:
min | max | mean | std | |
---|---|---|---|---|
Yr_Mo_Dy | ||||
1961-01-01 | 9.29 | 18.50 | 13.018182 | 2.808875 |
1961-01-02 | 6.50 | 17.54 | 11.336364 | 3.188994 |
1961-01-03 | 6.17 | 18.50 | 11.641818 | 3.681912 |
1961-01-04 | 1.79 | 11.75 | 6.619167 | 3.198126 |
1961-01-05 | 6.17 | 13.33 | 10.630000 | 2.445356 |
Step 11: For each location, calculate the average wind speed in January
Note: January 1961 and January 1962 should both be counted as January
In [301]:
# run the following code
# create a new column 'date' from the index values
data['date'] = data.index
# create a separate column for each date component
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
# take all rows from month 1 and assign them to january_winds
january_winds = data.query('month == 1')
# take the mean of january_winds, using .loc so the month, year, and day columns are excluded
january_winds.loc[:, 'RPT':'MAL'].mean()
Out[301]:
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
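The helper columns aren't strictly needed: the same January means can be obtained by filtering on the DatetimeIndex directly; a sketch:

# pick January rows via the index, then average each location column
data[data.index.month == 1].loc[:, 'RPT':'MAL'].mean()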
Step 12: Downsample the records to an annual frequency (one record per year)
In [302]:
# run the following code
data.query('month == 1 and day == 1')
Out[302]:
RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | date | month | year | day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yr_Mo_Dy | ||||||||||||||||
1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 | 1961-01-01 | 1 | 1961 | 1 |
1962-01-01 | 9.29 | 3.42 | 11.54 | 3.50 | 2.21 | 1.96 | 10.41 | 2.79 | 3.54 | 5.17 | 4.38 | 7.92 | 1962-01-01 | 1 | 1962 | 1 |
1963-01-01 | 15.59 | 13.62 | 19.79 | 8.38 | 12.25 | 10.00 | 23.45 | 15.71 | 13.59 | 14.37 | 17.58 | 34.13 | 1963-01-01 | 1 | 1963 | 1 |
1964-01-01 | 25.80 | 22.13 | 18.21 | 13.25 | 21.29 | 14.79 | 14.12 | 19.58 | 13.25 | 16.75 | 28.96 | 21.00 | 1964-01-01 | 1 | 1964 | 1 |
1965-01-01 | 9.54 | 11.92 | 9.00 | 4.38 | 6.08 | 5.21 | 10.25 | 6.08 | 5.71 | 8.63 | 12.04 | 17.41 | 1965-01-01 | 1 | 1965 | 1 |
1966-01-01 | 22.04 | 21.50 | 17.08 | 12.75 | 22.17 | 15.59 | 21.79 | 18.12 | 16.66 | 17.83 | 28.33 | 23.79 | 1966-01-01 | 1 | 1966 | 1 |
1967-01-01 | 6.46 | 4.46 | 6.50 | 3.21 | 6.67 | 3.79 | 11.38 | 3.83 | 7.71 | 9.08 | 10.67 | 20.91 | 1967-01-01 | 1 | 1967 | 1 |
1968-01-01 | 30.04 | 17.88 | 16.25 | 16.25 | 21.79 | 12.54 | 18.16 | 16.62 | 18.75 | 17.62 | 22.25 | 27.29 | 1968-01-01 | 1 | 1968 | 1 |
1969-01-01 | 6.13 | 1.63 | 5.41 | 1.08 | 2.54 | 1.00 | 8.50 | 2.42 | 4.58 | 6.34 | 9.17 | 16.71 | 1969-01-01 | 1 | 1969 | 1 |
1970-01-01 | 9.59 | 2.96 | 11.79 | 3.42 | 6.13 | 4.08 | 9.00 | 4.46 | 7.29 | 3.50 | 7.33 | 13.00 | 1970-01-01 | 1 | 1970 | 1 |
1971-01-01 | 3.71 | 0.79 | 4.71 | 0.17 | 1.42 | 1.04 | 4.63 | 0.75 | 1.54 | 1.…