Python数据分析pandas入门练习题
Posted Geek_bao
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Python数据分析pandas入门练习题相关的知识,希望对你有一定的参考价值。
Python数据分析基础
- Preparation
- Exercise 1- US - Baby Names
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv).
- Step 3. Assign it to a variable called baby_names.
- Step 4. See the first 10 entries
- Step 5. Delete the column 'Unnamed: 0' and 'Id'
- Step 6. Is there more male or female names in the dataset?
- Step 7. Group the dataset by name and assign to names
- Step 8. How many different names exist in the dataset?
- Step 9. What is the name with most occurrences?
- Step 10. How many different names have the least occurrences?
- Step 11. What is the median name occurrence?
- Step 12. What is the standard deviation of names?
- Step 13. Get a summary with the mean, min, max, std and quartiles.
- Exercise 2- Wind Statistics
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data)
- Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.
- Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.
- Step 5. Set the right dates as the index. Pay attention at the data type, it should be datetime64[ns].
- Step 6. Compute how many values are missing for each location over the entire record.
- Step 7. Compute how many non-missing values there are in total.
- Step 8. Calculate the mean windspeeds of the windspeeds over all the locations and all the times.
- Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days
- Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.
- Step 11. Find the average windspeed in January for each location.
- Step 12. Downsample the record to a yearly frequency for each location.
- Step 13. Downsample the record to a monthly frequency for each location.
- Step 14. Downsample the record to a weekly frequency for each location.
- Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.
- Conclusion
Preparation
需要数据集可以自行网上寻找(都是公开的数据集)或私聊博主,传到csdn,你们下载要会员,就不传了。下面数据集链接下载不一定能成功。
Exercise 1- US - Baby Names
Introduction:
We are going to use a subset of US Baby Names from Kaggle.
In the file it will be names from 2004 until 2014
Step 1. Import the necessary libraries
代码如下:
import pandas as pd
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called baby_names.
代码如下:
baby_names = pd.read_csv("US_Baby_Names_right.csv")
baby_names.info()
输出结果如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0 1016395 non-null int64
Id 1016395 non-null int64
Name 1016395 non-null object
Year 1016395 non-null int64
Gender 1016395 non-null object
State 1016395 non-null object
Count 1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
Step 4. See the first 10 entries
代码如下:
baby_names.head(10)
输出结果如下:
Unnamed: 0 | Id | Name | Year | Gender | State | Count | |
---|---|---|---|---|---|---|---|
0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |
5 | 11354 | 11355 | Abigail | 2004 | F | AK | 37 |
6 | 11355 | 11356 | Olivia | 2004 | F | AK | 33 |
7 | 11356 | 11357 | Isabella | 2004 | F | AK | 30 |
8 | 11357 | 11358 | Alyssa | 2004 | F | AK | 29 |
9 | 11358 | 11359 | Sophia | 2004 | F | AK | 28 |
Step 5. Delete the column ‘Unnamed: 0’ and ‘Id’
代码如下:
del baby_names['Id']
# OR del baby_names['Unnamed: 0']
baby_names = baby_names.loc[:, ~baby_names.columns.str.contains('^Unnamed')]
baby_names.head()
输出结果如下:
Name | Year | Gender | State | Count | |
---|---|---|---|---|---|
0 | Emma | 2004 | F | AK | 62 |
1 | Madison | 2004 | F | AK | 48 |
2 | Hannah | 2004 | F | AK | 46 |
3 | Grace | 2004 | F | AK | 44 |
4 | Emily | 2004 | F | AK | 41 |
Step 6. Is there more male or female names in the dataset?
代码如下:
# baby_names['Gender'].value_counts()
baby_names.groupby('Gender').Count.sum()
输出结果如下:
Gender
F 16380293
M 19041199
Name: Count, dtype: int64
Step 7. Group the dataset by name and assign to names
代码如下:
del baby_names["Year"]
names = baby_names.groupby("Name").sum()
names.head()
print(names.shape)
names.sort_values("Count", ascending = 0).head()
# names= baby_names.groupby('Name')
# names.head(1)
输出结果如下:
(17632, 1)
Count | |
---|---|
Name | |
Jacob | 242874 |
Emma | 214852 |
Michael | 214405 |
Ethan | 209277 |
Isabella | 204798 |
Step 8. How many different names exist in the dataset?
代码如下:
len(names)
输出结果如下:
17632
Step 9. What is the name with most occurrences?
代码如下:
# names['Count'].sum().argmax()
names.Count.idxmax() # idxmax()获取pandas中series最大值对应的索引
输出结果如下:
'Jacob'
Step 10. How many different names have the least occurrences?
代码如下:
len(names[names.Count == names.Count.min()])
输出结果如下:
2578
Step 11. What is the median name occurrence?
代码如下:
names[names.Count == names.Count.median()]
输出结果如下:
Count | |
---|---|
Name | |
Aishani | 49 |
Alara | 49 |
Alysse | 49 |
Ameir | 49 |
Anely | 49 |
Antonina | 49 |
Aveline | 49 |
Aziah | 49 |
Baily | 49 |
Caleah | 49 |
Carlota | 49 |
Cristine | 49 |
Dahlila | 49 |
Darvin | 49 |
Deante | 49 |
Deserae | 49 |
Devean | 49 |
Elizah | 49 |
Emmaly | 49 |
Emmanuela | 49 |
Envy | 49 |
Esli | 49 |
Fay | 49 |
Gurshaan | 49 |
Hareem | 49 |
Iven | 49 |
Jaice | 49 |
Jaiyana | 49 |
Jamiracle | 49 |
Jelissa | 49 |
... | ... |
Kyndle | 49 |
Kynsley | 49 |
Leylanie | 49 |
Maisha | 49 |
Malillany | 49 |
Mariann | 49 |
Marquell | 49 |
Maurilio | 49 |
Mckynzie | 49 |
Mehdi | 49 |
Nabeel | 49 |
Nalleli | 49 |
Nassir | 49 |
Nazier | 49 |
Nishant | 49 |
Rebecka | 49 |
Reghan | 49 |
Ridwan | 49 |
Riot | 49 |
Rubin | 49 |
Ryatt | 49 |
Sameera | 49 |
Sanjuanita | 49 |
Shalyn | 49 |
Skylie | 49 |
Sriram | 49 |
Trinton | 49 |
Vita | 49 |
Yoni | 49 |
Zuleima | 49 |
66 rows × 1 columns
Step 12. What is the standard deviation of names?
代码如下:
names.Count.std()
输出结果如下:
11006.069467891111
Step 13. Get a summary with the mean, min, max, std and quartiles.
代码如下:
names.describe()
输出结果如下:
Count | |
---|---|
count | 17632.000000 |
mean | 2008.932169 |
std | 11006.069468 |
min | 5.000000 |
25% | 11.000000 |
50% | 49.000000 |
75% | 337.000000 |
max | 242874.000000 |
Exercise 2- Wind Statistics
Introduction:
The data have been modified to contain some missing values, identified by NaN.
Using pandas should make this exercise
easier, in particular for the bonus question.
You should be able to perform all of these operations without using
a for loop or other looping construct.
- The data in ‘wind.data’ has the following format:
"""
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
"""
'\\nYr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL\\n61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04\\n61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83\\n61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71\\n'
The first three columns are year, month and day. The
remaining 12 columns are average windspeeds in knots at 12
locations in Ireland on that day.
More information about the dataset go here.
Step 1. Import the necessary libraries
代码如下:
import pandas as pd
import datetime
Step 2. Import the dataset from this address
Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.
代码如下:
data = pd.read_table('wind.data', sep='\\s+', parse_dates = [[0, 1, 2]])
data.head()
输出结果如下:
Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2061-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
1 | 2061-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
2 | 2061-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
3 | 2061-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
4 | 2061-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.
代码如下:
def fix_century(x):
year = x.year - 100 if x.year > 1989 else x.year
return datetime.date(year, x.month, x.day)
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
data.head()
输出结果如下:
Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
1 | 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
2 | 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
3 | 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
4 | 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
Step 5. Set the right dates as the index. Pay attention at the data type, it should be datetime64[ns].
代码如下:
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"]) # 转换为datetime64
data = data.set_index('Yr_Mo_Dy')
data.head()
输出结果如下:
RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Yr_Mo_Dy | ||||||||||||
1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
Step 6. Compute how many values are missing for each location over the entire record.
They should be ignored in all calculations below.
代码如下:
data.isnull().sum()
输出结果如下:
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
Step 7. Compute how many non-missing values there are in total.
代码如下:
data.shape[0] - data.isnull().sum()
#OR data.notnull.sum()
输出结果如下:
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
Step 8. Calculate the mean windspeeds of the windspeeds over all the locations and all the times.
A single number for the entire dataset.
代码如下:
data.fillna(0).values.flatten().mean() # a.flatten()就是把data降到一维,默认是按行的方向降
输出结果如下:
10.223864592840483
Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days
A different set of numbers for each location.
代码如下:
# loc_stats = data.loc[:, 'RPT':'MAL'].describe(percentiles=[])
# loc_stats
data.describe(percentiles=[])
输出结果如下:
RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 6568.000000 | 6571.000000 | 6572.000000 | 6569.000000 | 6572.000000 | 6574.000000 | 6571.000000 | 6572.000000 | 6571.000000 | 6573.000000 | 6574.000000 | 6570.000000 |
mean | 12.362987 | 10.644314 | 11.660526 | 6.306468 | 10.455834 | 7.092254 | 9.797343 | 8.495053 | 8.493590 | 8.707332 | 13.121007 | 15.599079 |
std | 5.618413 | 5.267356 | 5.008450 | 3.605811 | 4.936125 | 3.968683 | 4.977555 | 4.499449 | 4.166872 | 4.503954 | 5.835037 | 6.699794 |
min | 0.670000 | 0.210000 | 1.500000 | 0.000000 | 0.130000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.040000 | 0.130000 | 0.670000 |
50% | 11.710000 | 10.170000 | 10.920000 | 5.750000 | 9.960000 | 6.830000 | 9.210000 | 8.080000 | 8.170000 | 8.290000 | 12.500000 | 15.000000 |
max | 35.800000 | 33.370000 | 33.840000 | 28.460000 | 37.540000 | 26.160000 | 30.370000 | 31.080000 | 25.880000 | 28.210000 | 42.380000 | 42.540000 |
Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.
A different set of numbers for each day.
代码如下:
day_stats = pd.DataFrame()
day_stats['min'] = data.min(axis = 1)
day_stats['max'] = data.max(axis = 1)
day_stats['mean'] = data.mean(axis = 1)
day_stats['std'] = data.std(axis = 1)
day_stats.head()
输出结果如下:
min | max | mean | std | |
---|---|---|---|---|
Yr_Mo_Dy | ||||
1961-01-01 | 9.29 | 18.50 | 13.018182 | 2.808875 |
1961-01-02 | 6.50 | 17.54 | 11.336364 | 3.188994 |
1961-01-03 | 6.17 | 18.50 | 11.641818 | 3.681912 |
1961-01-04 | 1.79 | 11.75 | 6.619167 | 3.198126 |
1961-01-05 | 6.17 | 13.33 | 10.630000 | 2.445356 |
Step 11. Find the average windspeed in January for each location.
Treat January 1961 and January 1962 both as January.
代码如下:
data.loc[data.index.month == 1].mean()
输出结果如下:
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
Step 12. Downsample the record to a yearly frequency for each location.
代码如下:
# pd.Period()创建时期数据
# pd.Period()参数:一个时间戳 + freq 参数 → freq 用于指明该 period 的长度,时间戳则说明该 period 在时间轴上的位置
# DatetimeIndex对象的数据转换为PeriodIndex
data.groupby(data.index.to_period('A')).mean()
输出结果如下:
RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Yr_Mo_Dy | ||||||||||||
1961 | 12.299583 | 10.351796 | 11.362369 | 6.958227 | 10.881763 | 7.729726 | 9.733923 | 8.858788 | 8.647652 | 9.835577 | 13.502795 | 13.680773 |
1962 | 12.246923 | 10.110438 | 11.732712 | 6.960440 | 10.657918 | 7.393068 | 11.020712 | 8.793753 | 8.316822 | 9.676247 | 12.930685 | 14.323956 |
1963 | 12.813452 | 10.836986 | 12.541151 | 7.330055 | 11.724110 | 8.434712 | 11.075699 | 10.336548 | 8.903589 | 10.224438 | 13.638877 | 14.999014 |
1964 | 12.363661 | 10.920164 | 12.104372 | 6.787787 | 11.454481 | 7.570874 | 10.259153 | 9.467350 | 7.789016 | 10.207951 | 13.740546 | 14.910301 |
1965 | 12.451370 | 11.075534 | 11.848767 | 6.858466 | 11.024795 | 7.478110 | 10.618712 | 8.879918 | 7.907425 | 9.918082 | 12.964247 | 15.591644 |
1966 | 13.461973 | 11.557205 | 12.020630 | 7.345726 | 11.805041 | 7.793671 | 10.579808 | 8.835096 | 8.514438 | 9.768959 | 14.265836 | 16.307260 |
1967 | 12.737151 | 10.990986 | 11.739397 | 7.143425 | 11.630740 | 7.368164 | 10.652027 | 9.325616 | 8.645014 | 9.547425 | 14.774548 | 17.135945 |
1968 | 11.835628 | 10.468197 | 11.409754 | 6.477678 | 10.760765 | 6.067322 | 8.859180 | 8.255519 | 7.224945 | 7.832978 | 12.808634 | 15.017486 |
1969 | 11.166356 | 9.723699 | 10.902000 | 5.767973 | 9.873918 | 6.189973 | 8.564493 | 7.711397 | 7.924521 | 7.754384 | 12.621233 | 15.762904 |
1970 | 12.600329 | 10.726932 | 11.730247 | 6.217178 | 10.567370 | 7.609452 | 9.609890 | 8.334630 | 9.297616 | 8.289808 | 13.183644 | 16.456027 |
1971 | 11.273123 | 9.095178 | 11.088329 | 5.241507 | 9.440329 | 6.097151 | 8.385890 | 6.757315 | 7.915370 | 7.229753 | 12.208932 | 15.025233 |
1972 | 12.463962 | 10.561311 | 12.058333 | 5.929699 | 9.430410 | 6.358825 | 9.704508 | 7.680792 | 8.357295 | 7.515273 | 12.727377 | 15.028716 |
1973 | 11.828466 | 10.680493 | 10.680493 | 5.547863 | 9.640877 | 6.548740 | 8.482110 | 7.614274 | 8.245534 | 7.812411 | 12.169699 | 15.441096 |
1974 | 13.643096 | 11.811781 | 12.336356 | 6.427041 | 11.110986 | 6.809781 | 10.084603 | 9.896986 | 9.331753 | 8.736356 | 13.252959 | 16.947671 |
1975 | 12.008575 | 10.293836 | 11.564712 | 5.269096 | 9.190082 | 5.668521 | 8.562603 | 7.843836 | 8.797945 | 7.382822 | 12.631671 | 15.307863 |
1976 | 11.737842 | 10.203115 | 10.761230 | 5.109426 | 8.846339 | 6.311038 | 9.149126 | 7.146202 | 8.883716 | 7.883087 | 12.332377 | 15.471448 |
1977 | 13.099616 | 11.144493 | 12.627836 | 6.073945 | 10.003836 | 8.586438 | 11.523205 | 8.378384 | 9.098192 | 8.821616 | 13.459068 | 16.590849 |
1978 | 12.504356 | 11.044274 | 11.380000 | 6.082356 | 10.167233 | 7.650658 | 9.489342 | 8.800466 | 9.089753 | 8.301699 | 12.967397 | 16.771370 |
Step 13. Downsample the record to a monthly frequency for each location.
代码如下:
data.groupby(data.index.to_period('M')).mean().head()
输出结果如下:
RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Yr_Mo_Dy | ||||||||||||
1961-01 | 14.841333 | 11.988333 | 13.431613 | 7.736774 | 11.072759 | 8.588065 | 11.184839 | 9.245333 | 9.085806 | 10.107419 | 13.880968 | 14.703226 |
1961-02 | 16.269286 | 14.975357 | 14.441481 | 9.230741 | 13.852143 | 10.937500 | 11.890714 | 11.846071 | 11.821429 | 12.714286 | 18.583214 | 15.411786 |
1961-03 | 10.890000 | 11.296452 | 10.752903 | 7.284000 | 10.509355 | 8.866774 | 9.644194 | 9.829677 | 10.294138 | 11.251935 | 16.410968 | 15.720000 |
1961-04 | 10.722667 | 9.427667 | 9.998000 | 5.830667 | 8.435000 | 6.495000 | 6.925333 | 7.094667 | 7.342333 | 7.237000 | 11.147333 | 10.278333 |
1961-05 | 9.860968 | 8.850000 | 10.818065 | 5.905333 | 9.490323 | 6.574839 | 7.604000 | 8.177097 | 8.039355 | 8.499355 | 11.900323 | 12.011613 |
Step 14. Downsample the record to a weekly frequency for each location.
代码如下:
data.groupby(data.index.to_period('W')).mean().head()
输出结果如下:
RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Yr_Mo_Dy | ||||||||||||
1960-12-26/1961-01-01 | 15.040000 | 14.960000 | 13.170000 | 9.290000 | NaN | 9.870000 | 13.670000 | 10.250000 | 10.830000 | 12.580000 | 18.500000 | 15.040000 |
1961-01-02/1961-01-08 | 13.541429 | 11.486667 | 10.487143 | 6.417143 | 9.474286 | 6.435714 | 11.061429 | 6.616667 | 8.434286 | 8.497143 | 12.481429 | 13.238571 |
1961-01-09/1961-01-15 | 12.468571 | 8.967143 | 11.958571 | 4.630000 | 7.351429 | 5.072857 | 7.535714 | 6.820000 | 5.712857 | 7.571429 | 11.125714 | 11.024286 |
1961-01-16/1961-01-22 | 13.204286 | 9.862857 | 12.982857 | 6.328571 | 8.966667 | 7.417143 | 9.257143 | 7.875714 | 7.145714 | 8.124286 | 9.821429 | 11.434286 |
1961-01-23/1961-01-29 | 19.880000 | 16.141429 | 18.225714 | 12.720000 | 17.432857 | 14.828571 | 15.528571 | 15.160000 | 14.480000 | 15.640000 | 20.930000 | 22.530000 |
Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.
代码如下:
# data.groupby(data.index.to_period('1961-01-02', 'W')).describe(percentiles=[]).head()
weekly = data.resample('W').agg(['min', 'max', 'mean', 'std']) # resample()重新设置频率采样,再sh
weekly.loc[weekly.index[1:53], "RPT":"MAL"].head(10)
输出结果如下:
RPT | VAL | ROS | ... | CLO | BEL | MAL | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | max | mean | std | min | max | mean | std | min | max | ... | mean | std | min | max | mean | std | min | max | mean | std | |
Yr_Mo_Dy | |||||||||||||||||||||
1961-01-08 | 10.58 | 18.50 | 13.541429 | 2.631321 | 6.63 | 16.88 | 11.486667 | 3.949525 | 7.62 | 12.33 | ... | 8.497143 | 1.704941 | 5.46 | 17.54 | 12.481429 | 4.349139 | 10.88 | 16.46 | 13.238571 | 1.773062 |
1961-01-15 | 9.04 | 19.75 | 12.468571 | 3.555392 | 3.54 | 12.08 | 8.967143 | 3.148945 | 7.08 | 19.50 | ... | 7.571429 | 4.084293 | 5.25 | 20.71 | 11.125714 | 5.552215 | 5.17 | 16.92 | 11.024286 | 4.692355 |
1961-01-22 | 4.92 | 19.83 | 13.204286 | 5.337402 | 3.42 | 14.37 | 9.862857 | 3.837785 | 7.29 | 20.79 | ... | 8.124286 | 4.783952 | 6.50 | 15.92 | 9.821429 | 3.626584 | 6.79 | 17.96 | 11.434286 | 4.237239 |
1961-01-29 | 13.62 | 25.04 | 19.880000 | 4.619061 | 9.96 | 23.91 | 16.141429 | 5.170224 | 12.67 | 25.84 | ... | 15.640000 | 3.713368 | 14.04 | 27.71 | 20.930000 | 5.210726 | 17.50 | 27.63 | 22.530000 | 3.874721 |
1961-02-05 | 10.58 | 24.21 | 16.827143 | 5.251408 | 9.46 | 24.21 | 15.460000 | 5.187395 | 9.04 | 19.70 | ... | 9.460000 | 2.839501 | 9.17 | 19.33 | 14.012857 | 4.210858 | 7.17 | 19.25 | 11.935714 | 4.336104 |
1961-02-12 | 16.00 | 24.54 | 19.684286 | 3.587677 | 11.54 | 21.42 | 16.417143 | 3.608373 | 13.67 | 21.34 | ... | 14.440000 | 1.746749 | 15.21 | 26.38 | 21.832857 | 4.063753 | 17.04 | 21.84 | 19.155714 | 1.828705 |
1961-02-19 | 6.04 | 22.50 | 15.130000 | 5.064609 | 11.63 | 20.17 | 15.091429 | 3.575012 | 6.13 | 19.41 | ... | 13.542857 | 2.531361 | 14.09 | 29.63 | 21.167143 | 5.910938 | 10.96 | 22.58 | 16.584286 | 4.685377 |
1961-02-26 | 7.79 | 25.80 | 15.221429 | 7.020716 | 7.08 | 21.50 | 13.625714 | 5.147348 | 6.08 | 22.42 | ... | 12.730000 | 4.920064 | 9.59 | 23.21 | 16.304286 | 5.091162 | 6.67 | 23.87 | 14.322857 | 6.182283 |
1961-03-05 | 10.96 | 13.33 | 12.101429 | 0.997721 | 8.83 | 17.00 | 12.951429 | 2.851955 | 8.17 | 13.67 | ... | 12.370000 | 1.593685 | 11.58 | 23.45 | 17.842857 | 4.332331 | 8.83 | 17.54 | 13.951667 | 3.021387 |
1961-03-12 | 4.88 | 14.79 | 9.376667 | 3.732263 | 8.08 | 16.96 | 11.578571 | 3.230167 | 7.54 | 16.38 | ... | 10.458571 | 3.655113 | 10.21 | 22.71 | 16.701429 | 4.358759 | 5.54 | 22.54 | 14.420000 | 5.769890 |
10 rows × 48 columns
Conclusion
今天的pandas题更新,继续刷题,加油!
以上是关于Python数据分析pandas入门练习题的主要内容,如果未能解决你的问题,请参考以下文章