Python Data Analysis: Introductory pandas Exercises
Posted by Geek_bao
Python Data Analysis Basics
- Preparation
- Exercise 1-GroupBy
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv).
- Step 3. Assign it to a variable called drinks.
- Step 4. Which continent drinks more beer on average?
- Step 5. For each continent print the statistics for wine consumption.
- Step 6. Print the mean alcohol consumption per continent for every column
- Step 7. Print the median alcohol consumption per continent for every column
- Step 8. Print the mean, min and max values for spirit consumption.
- Exercise 2-Occupation
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).
- Step 3. Assign it to a variable called users.
- Step 4. Discover what is the mean age per occupation
- Step 5. Discover the Male ratio per occupation and sort it from the most to the least
- Step 6. For each occupation, calculate the minimum and maximum ages
- Step 7. For each combination of occupation and gender, calculate the mean age
- Step 8. For each occupation present the percentage of women and men
- Exercise 3-Regiment
- Introduction:
- Step 1. Import the necessary libraries
- Step 2. Create the DataFrame with the following values:
- Step 3. Assign it to a variable called regiment.
- Step 4. What is the mean preTestScore from the regiment Nighthawks?
- Step 5. Present general statistics by company
- Step 6. What is the mean of each company's preTestScore?
- Step 7. Present the mean preTestScores grouped by regiment and company
- Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing
- Step 9. Group the entire dataframe by regiment and company
- Step 10. What is the number of observations in each regiment and company
- Step 11. Iterate over a group and print the name and the whole data from the regiment
- Conclusion
Preparation
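The datasets for the exercises are in the repository below. Download them and work from local copies if you can, because the links given in the exercises may not open.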
https://github.com/justmarkham/pandas-videos/tree/master/data
Exercise 1-GroupBy
Introduction:
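GroupBy can be summarized as Split-Apply-Combine: split the rows into groups by key, apply a function to each group, and combine the per-group results into one object.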
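As a rough illustration of the three phases, the sketch below performs them by hand on a tiny table and compares the result with `groupby`. The DataFrame `toy` and its values are invented for this example and are not part of the exercise data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'beer_servings': [100, 200, 10, 30],
})

# Split: one sub-frame per distinct key value
keys = toy['continent'].unique()
pieces = {key: toy[toy['continent'] == key] for key in keys}

# Apply: reduce each piece to a single number
means = {key: part['beer_servings'].mean() for key, part in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(means).sort_index()

# groupby performs all three phases in one expression
direct = toy.groupby('continent')['beer_servings'].mean()

print(manual.to_dict() == direct.to_dict())
```

The manual version is only meant to show what happens conceptually; in practice the one-line `groupby` form is the one to use.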
Step 1. Import the necessary libraries
Code:
import pandas as pd
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called drinks.
Code:
drinks = pd.read_csv('drinks.csv')  # sep defaults to ','; pandas 2.x rejects a positional separator
drinks
Output:
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
1 | Albania | 89 | 132 | 54 | 4.9 | EU |
2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
4 | Angola | 217 | 57 | 45 | 5.9 | AF |
5 | Antigua & Barbuda | 102 | 128 | 45 | 4.9 | NaN |
6 | Argentina | 193 | 25 | 221 | 8.3 | SA |
7 | Armenia | 21 | 179 | 11 | 3.8 | EU |
8 | Australia | 261 | 72 | 212 | 10.4 | OC |
9 | Austria | 279 | 75 | 191 | 9.7 | EU |
10 | Azerbaijan | 21 | 46 | 5 | 1.3 | EU |
11 | Bahamas | 122 | 176 | 51 | 6.3 | NaN |
12 | Bahrain | 42 | 63 | 7 | 2.0 | AS |
13 | Bangladesh | 0 | 0 | 0 | 0.0 | AS |
14 | Barbados | 143 | 173 | 36 | 6.3 | NaN |
15 | Belarus | 142 | 373 | 42 | 14.4 | EU |
16 | Belgium | 295 | 84 | 212 | 10.5 | EU |
17 | Belize | 263 | 114 | 8 | 6.8 | NaN |
18 | Benin | 34 | 4 | 13 | 1.1 | AF |
19 | Bhutan | 23 | 0 | 0 | 0.4 | AS |
20 | Bolivia | 167 | 41 | 8 | 3.8 | SA |
21 | Bosnia-Herzegovina | 76 | 173 | 8 | 4.6 | EU |
22 | Botswana | 173 | 35 | 35 | 5.4 | AF |
23 | Brazil | 245 | 145 | 16 | 7.2 | SA |
24 | Brunei | 31 | 2 | 1 | 0.6 | AS |
25 | Bulgaria | 231 | 252 | 94 | 10.3 | EU |
26 | Burkina Faso | 25 | 7 | 7 | 4.3 | AF |
27 | Burundi | 88 | 0 | 0 | 6.3 | AF |
28 | Cote d'Ivoire | 37 | 1 | 7 | 4.0 | AF |
29 | Cabo Verde | 144 | 56 | 16 | 4.0 | AF |
... | ... | ... | ... | ... | ... | ... |
163 | Suriname | 128 | 178 | 7 | 5.6 | SA |
164 | Swaziland | 90 | 2 | 2 | 4.7 | AF |
165 | Sweden | 152 | 60 | 186 | 7.2 | EU |
166 | Switzerland | 185 | 100 | 280 | 10.2 | EU |
167 | Syria | 5 | 35 | 16 | 1.0 | AS |
168 | Tajikistan | 2 | 15 | 0 | 0.3 | AS |
169 | Thailand | 99 | 258 | 1 | 6.4 | AS |
170 | Macedonia | 106 | 27 | 86 | 3.9 | EU |
171 | Timor-Leste | 1 | 1 | 4 | 0.1 | AS |
172 | Togo | 36 | 2 | 19 | 1.3 | AF |
173 | Tonga | 36 | 21 | 5 | 1.1 | OC |
174 | Trinidad & Tobago | 197 | 156 | 7 | 6.4 | NaN |
175 | Tunisia | 51 | 3 | 20 | 1.3 | AF |
176 | Turkey | 51 | 22 | 7 | 1.4 | AS |
177 | Turkmenistan | 19 | 71 | 32 | 2.2 | AS |
178 | Tuvalu | 6 | 41 | 9 | 1.0 | OC |
179 | Uganda | 45 | 9 | 0 | 8.3 | AF |
180 | Ukraine | 206 | 237 | 45 | 8.9 | EU |
181 | United Arab Emirates | 16 | 135 | 5 | 2.8 | AS |
182 | United Kingdom | 219 | 126 | 195 | 10.4 | EU |
183 | Tanzania | 36 | 6 | 1 | 5.7 | AF |
184 | USA | 249 | 158 | 84 | 8.7 | NaN |
185 | Uruguay | 115 | 35 | 220 | 6.6 | SA |
186 | Uzbekistan | 25 | 101 | 8 | 2.4 | AS |
187 | Vanuatu | 21 | 18 | 11 | 0.9 | OC |
188 | Venezuela | 333 | 100 | 3 | 7.7 | SA |
189 | Vietnam | 111 | 2 | 1 | 2.0 | AS |
190 | Yemen | 6 | 0 | 0 | 0.1 | AS |
191 | Zambia | 32 | 19 | 4 | 2.5 | AF |
192 | Zimbabwe | 64 | 18 | 4 | 4.7 | AF |
193 rows × 6 columns
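The NaN values in the continent column (USA, Bahamas and others) are most likely not missing data. Assuming the raw file stores North America as the code `NA`, `read_csv` treats that string as a missing value by default. The sketch below reproduces the effect on a three-row inline sample (a cut-down stand-in for the real file, with only three of its columns) and shows one possible workaround using `keep_default_na=False`.

```python
import io
import pandas as pd

# A few rows in the same layout as drinks.csv (only three columns, for brevity)
csv_text = """country,beer_servings,continent
USA,249,NA
Bahamas,122,NA
Austria,279,EU
"""

# Default parsing: the string 'NA' is on pandas' default missing-value list
default = pd.read_csv(io.StringIO(csv_text))
print(default['continent'].isna().sum())

# keep_default_na=False switches that list off, so 'NA' stays a real label
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(kept['continent'].tolist())
```

The outputs in the following steps were produced with the default parsing, which is why they contain no NA group; loading the file with `keep_default_na=False` would add a sixth continent to every grouped result.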
Step 4. Which continent drinks more beer on average?
Code:
drinks.groupby('continent').beer_servings.mean()
Output:
continent
AF 61.471698
AS 37.045455
EU 193.777778
OC 89.687500
SA 175.083333
Name: beer_servings, dtype: float64
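To answer the question directly instead of reading the table by eye, `idxmax` and `sort_values` can be applied to the grouped means. The sketch below rebuilds the Series from the numbers printed above so that it runs without the CSV file; the variable name `beer_mean` is an assumption made for this example.

```python
import pandas as pd

# Per-continent means copied from the output above
beer_mean = pd.Series(
    {'AF': 61.471698, 'AS': 37.045455, 'EU': 193.777778,
     'OC': 89.687500, 'SA': 175.083333},
    name='beer_servings',
)

# idxmax returns the index label of the largest value
top = beer_mean.idxmax()
print(top)  # EU

# Sorting in descending order gives the full ranking
ranking = beer_mean.sort_values(ascending=False).index.tolist()
print(ranking)
```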
Step 5. For each continent print the statistics for wine consumption.
Code:
drinks.groupby('continent').wine_servings.describe()
Output:
continent | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
AF | 53.0 | 16.264151 | 38.846419 | 0.0 | 1.0 | 2.0 | 13.00 | 233.0 |
AS | 44.0 | 9.068182 | 21.667034 | 0.0 | 0.0 | 1.0 | 8.00 | 123.0 |
EU | 45.0 | 142.222222 | 97.421738 | 0.0 | 59.0 | 128.0 | 195.00 | 370.0 |
OC | 16.0 | 35.625000 | 64.555790 | 0.0 | 1.0 | 8.5 | 23.25 | 212.0 |
SA | 12.0 | 62.416667 | 88.620189 | 1.0 | 3.0 | 12.0 | 98.50 | 221.0 |
Step 6. Print the mean alcohol consumption per continent for every column
Code:
drinks.groupby('continent').mean(numeric_only=True)  # numeric_only skips the text column 'country', which pandas 2.x would otherwise reject
Output:
continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
---|---|---|---|---|
AF | 61.471698 | 16.339623 | 16.264151 | 3.007547 |
AS | 37.045455 | 60.840909 | 9.068182 | 2.170455 |
EU | 193.777778 | 132.555556 | 142.222222 | 8.617778 |
OC | 89.687500 | 58.437500 | 35.625000 | 3.381250 |
SA | 175.083333 | 114.750000 | 62.416667 | 6.308333 |
Step 7. Print the median alcohol consumption per continent for every column
Code:
drinks.groupby('continent').median(numeric_only=True)  # numeric_only skips the text column 'country'
Output:
continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
---|---|---|---|---|
AF | 32.0 | 3.0 | 2.0 | 2.30 |
AS | 17.5 | 16.0 | 1.0 | 1.20 |
EU | 219.0 | 122.0 | 128.0 | 10.00 |
OC | 52.5 | 37.0 | 8.5 | 1.75 |
SA | 162.5 | 108.5 | 12.0 | 6.85 |
Step 8. Print the mean, min and max values for spirit consumption.
This time output a DataFrame
Code:
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
# agg aggregates the grouped data; called on the whole group object, it is applied to every remaining column.
# To aggregate only some columns, or to apply different aggregations to different columns, pass a dict:
# spirit_servings_info = {'spirit_servings': ['min', 'mean', 'max']}
# print(drinks.groupby('continent').agg(spirit_servings_info))
Output:
continent | mean | min | max |
---|---|---|---|
AF | 16.339623 | 0 | 152 |
AS | 60.840909 | 0 | 326 |
EU | 132.555556 | 0 | 373 |
OC | 58.437500 | 0 | 254 |
SA | 114.750000 | 25 | 302 |
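The dict form mentioned in the comments, and the related named-aggregation form, are easier to compare side by side. The sketch below uses a small invented table (`toy` and its numbers are not from the drinks dataset) and shows that the dict form yields two-level column labels, while named aggregation lets you choose flat output names.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'spirit_servings': [100, 300, 0, 50],
    'wine_servings': [10, 30, 1, 3],
})

# Dict form: column name -> list of aggregations (two-level column labels)
by_dict = toy.groupby('continent').agg({'spirit_servings': ['min', 'mean', 'max']})

# Named aggregation: output column name = (input column, aggregation)
by_name = toy.groupby('continent').agg(
    spirit_min=('spirit_servings', 'min'),
    spirit_mean=('spirit_servings', 'mean'),
    wine_max=('wine_servings', 'max'),
)

print(by_dict)
print(by_name)
```

Named aggregation is convenient when different columns need different statistics and the result should feed straight into further processing without flattening a column MultiIndex.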
Exercise 2-Occupation
Introduction:
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.
Step 1. Import the necessary libraries
Code:
import pandas as pd
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called users.
Code:
users = pd.read_table('u.user', sep='|', index_col='user_id')
users.head()
Output:
user_id | age | gender | occupation | zip_code |
---|---|---|---|---|
1 | 24 | M | technician | 85711 |
2 | 53 | F | other | 94043 |
3 | 23 | M | writer | 32067 |
4 | 24 | M | technician | 43537 |
5 | 33 | F | other | 15213 |
Step 4. Discover what is the mean age per occupation
Code:
users.groupby('occupation').age.mean()
Output:
occupation
administrator 38.746835
artist 31.392857
doctor 43.571429
educator 42.010526
engineer 36.388060
entertainment 29.222222
executive 38.718750
healthcare 41.562500
homemaker 32.571429
lawyer 36.750000
librarian 40.000000
marketing 37.615385
none 26.555556
other 34.523810
programmer 33.121212
retired 63.071429
salesman 35.666667
scientist 35.548387
student 22.081633
technician 33.148148
writer 36.311111
Name: age, dtype: float64
Step 5. Discover the Male ratio per occupation and sort it from the most to the least
Code:
# Map gender to a number: 1 for male, 0 for female
def gender_to_numeric(x):
    if x == 'M':
        return 1
    if x == 'F':
        return 0

users['gender_n'] = users['gender'].apply(gender_to_numeric)
# Men per occupation divided by people per occupation, as a percentage
a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
a.sort_values(ascending=False)
Output:
doctor 100.000000
engineer 97.014925
technician 96.296296
retired 92.857143
programmer 90.909091
executive 90.625000
scientist 90.322581
entertainment 88.888889
lawyer 83.333333
salesman 75.000000
educator 72.631579
student 69.387755
other 65.714286
marketing 61.538462
writer 57.777778
none 55.555556
administrator 54.430380
artist 53.571429
librarian 43.137255
healthcare 31.250000
homemaker 14.285714
dtype: float64
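The same ratio can be computed without a helper column, because the mean of a boolean Series is the share of True values. The sketch below is one possible shortcut, shown on invented data (`toy` is not the `u.user` dataset).

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist', 'artist'],
    'gender': ['M', 'M', 'F', 'M', 'F', 'F'],
})

# (gender == 'M') is a boolean Series; its per-group mean is the male share
male_pct = (toy['gender'] == 'M').groupby(toy['occupation']).mean() * 100
male_pct = male_pct.sort_values(ascending=False)
print(male_pct)
```

This leaves the original DataFrame unchanged, whereas the `gender_n` approach adds a column that then shows up in later whole-frame aggregations.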
Step 6. For each occupation, calculate the minimum and maximum ages
Code:
users.groupby('occupation').age.agg(['min', 'max'])
Output:
occupation | min | max |
---|---|---|
administrator | 21 | 70 |
artist | 19 | 48 |
doctor | 28 | 64 |
educator | 23 | 63 |
engineer | 22 | 70 |
entertainment | 15 | 50 |
executive | 22 | 69 |
healthcare | 22 | 62 |
homemaker | 20 | 50 |
lawyer | 21 | 53 |
librarian | 23 | 69 |
marketing | 24 | 55 |
none | 11 | 55 |
other | 13 | 64 |
programmer | 20 | 63 |
retired | 51 | 73 |
salesman | 18 | 66 |
scientist | 23 | 55 |
student | 7 | 42 |
technician | 21 | 55 |
writer | 18 | 60 |
Step 7. For each combination of occupation and gender, calculate the mean age
Code:
users.groupby(['occupation', 'gender']).mean(numeric_only=True)  # numeric_only skips the text column 'zip_code'
Output:
occupation | gender | age | gender_n |
---|---|---|---|
administrator | F | 40.638889 | 0.0 |
administrator | M | 37.162791 | 1.0 |
artist | F | 30.307692 | 0.0 |
artist | M | 32.333333 | 1.0 |
doctor | M | 43.571429 | 1.0 |
educator | F | 39.115385 | 0.0 |
educator | M | 43.101449 | 1.0 |
engineer | F | 29.500000 | 0.0 |
engineer | M | 36.600000 | 1.0 |
entertainment | F | 31.000000 | 0.0 |
entertainment | M | 29.000000 | 1.0 |
executive | F | 44.000000 | 0.0 |
executive | M | 38.172414 | 1.0 |
healthcare | F | 39.818182 | 0.0 |
healthcare | M | 45.400000 | 1.0 |
homemaker | F | 34.166667 | 0.0 |
homemaker | M | 23.000000 | 1.0 |
lawyer | F | 39.500000 | 0.0 |
lawyer | M | 36.200000 | 1.0 |
librarian | F | 40.000000 | 0.0 |
librarian | M | 40.000000 | 1.0 |
marketing | F | 37.200000 | 0.0 |
marketing | M | 37.875000 | 1.0 |
none | F | 36.500000 | 0.0 |
none | M | 18.600000 | 1.0 |
other | F | 35.472222 | 0.0 |
other | M | 34.028986 | 1.0 |
programmer | F | 32.166667 | 0.0 |
programmer | M | 33.216667 | 1.0 |
retired | F | 70.000000 | 0.0 |
retired | M | 62.538462 | 1.0 |
salesman | F | 27.000000 | 0.0 |
salesman | M | 38.555556 | 1.0 |
scientist | F | 28.333333 | 0.0 |
scientist | M | 36.321429 | 1.0 |
student | F | 20.750000 | 0.0 |
student | M | 22.669118 | 1.0 |
technician | F | 38.000000 | 0.0 |
technician | M | 32.961538 | 1.0 |
writer | F | 37.631579 | 0.0 |
writer | M | 35.346154 | 1.0 |
Step 8. For each occupation present the percentage of women and men
Code:
# a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
# print(a.sort_values(ascending = False))
# b = 100 - a
# print(b.sort_values(ascending=True))
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})  # number of women and men in each occupation
occup_count = users.groupby(['occupation']).agg('count')  # total number of people in each occupation
occup_gender = gender_ocup.div(occup_count, level="occupation") * 100  # share of each gender per occupation; returns a DataFrame
occup_gender.loc[:, 'gender']  # show the gender column
Output:
occupation gender
administrator F 45.569620
M 54.430380
artist F 46.428571
M 53.571429
doctor M 100.000000
educator F 27.368421
M 72.631579
engineer F 2.985075
M 97.014925
entertainment F 11.111111
M 88.888889
executive F 9.375000
M 90.625000
healthcare F 68.750000
M 31.250000
homemaker F 85.714286
M 14.285714
lawyer F 16.666667
M 83.333333
librarian F 56.862745
M 43.137255
marketing F 38.461538
M 61.538462
none F 44.444444
M 55.555556
other F 34.285714
M 65.714286
programmer F 9.090909
M 90.909091
retired F 7.142857
M 92.857143
salesman F 25.000000
M 75.000000
scientist F 9.677419
M 90.322581
student F 30.612245
M 69.387755
technician F 3.703704
M 96.296296
writer F 42.222222
M 57.777778
Name: gender, dtype: float64
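A more compact route to the same percentages is `pd.crosstab` with `normalize='index'`, which divides each row by its row total. The sketch below is an alternative to the code above rather than a reproduction of it, and it runs on invented data (`toy` is not the `u.user` dataset).

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist', 'artist'],
    'gender': ['M', 'M', 'F', 'M', 'F', 'F'],
})

# Rows are occupations, columns are genders; each row sums to 100
pct = pd.crosstab(toy['occupation'], toy['gender'], normalize='index') * 100
print(pct)
```

One difference in presentation: the crosstab is a wide table and shows an explicit 0 for a combination that never occurs (here, female doctors), whereas the grouped Series above simply omits such rows.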
Exercise 3-Regiment
Introduction:
Special thanks to: http://chrisalbon.com/ for sharing the dataset and materials.
Step 1. Import the necessary libraries
Code:
import pandas as pd
Step 2. Create the DataFrame with the following values:
Code:
raw_data = {
    'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
    'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
}
Step 3. Assign it to a variable called regiment.
Don’t forget to name each column
Code:
regiment = pd.DataFrame(raw_data, columns=raw_data.keys())
regiment
Output:
| | regiment | company | name | preTestScore | postTestScore |
---|---|---|---|---|---|
0 | Nighthawks | 1st | Miller | 4 | 25 |
1 | Nighthawks | 1st | Jacobson | 24 | 94 |
2 | Nighthawks | 2nd | Ali | 31 | 57 |
3 | Nighthawks | 2nd | Milner | 2 | 62 |
4 | Dragoons | 1st | Cooze | 3 | 70 |
5 | Dragoons | 1st | Jacon | 4 | 25 |
6 | Dragoons | 2nd | Ryaner | 24 | 94 |
7 | Dragoons | 2nd | Sone | 31 | 57 |
8 | Scouts | 1st | Sloan | 2 | 62 |
9 | Scouts | 1st | Piger | 3 | 70 |
10 | Scouts | 2nd | Riani | 2 | 62 |
11 | Scouts | 2nd | Ali | 3 | 70 |
Step 4. What is the mean preTestScore from the regiment Nighthawks?
Code:
regiment[regiment['regiment'] == 'Nighthawks'].groupby('regiment').mean(numeric_only=True)  # numeric_only skips the text columns
# regiment[regiment['regiment'] == 'Nighthawks'].mean(numeric_only=True)
Output:
regiment | preTestScore | postTestScore |
---|---|---|
Nighthawks | 15.25 | 59.5 |
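Since the question asks for a single number, a boolean mask with `.loc` is enough and avoids the `groupby` step. The sketch below rebuilds only the two columns it needs, with the values from Step 2; the variable names `is_nighthawk` and `nighthawk_mean` are assumptions made for this example.

```python
import pandas as pd

# Only the two columns needed here, with the values from Step 2
regiment = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Select the Nighthawks rows and the preTestScore column, then average
is_nighthawk = regiment['regiment'] == 'Nighthawks'
nighthawk_mean = regiment.loc[is_nighthawk, 'preTestScore'].mean()
print(nighthawk_mean)  # 15.25
```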
Step 5. Present general statistics by company
Code:
regiment.groupby('company').describe()
Output:
postTestScore:
company | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
1st | 6.0 | 57.666667 | 27.485754 | 25.0 | 34.25 | 66.0 | 70.0 | 94.0 |
2nd | 6.0 | 67.000000 | 14.057027 | 57.0 | 58.25 | 62.0 | 68.0 | 94.0 |
preTestScore:
company | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
1st | 6.0 | 6.666667 | 8.524475 | 2.0 | 3.00 | 3.5 | 4.00 | 24.0 |
2nd | 6.0 | 15.500000 | 14.652645 | 2.0 | 2.25 | 13.5 | 29.25 | 31.0 |
Step 6. What is the mean of each company's preTestScore?
Code:
regiment.groupby('company').preTestScore.mean()
Output:
company
1st 6.666667
2nd 15.500000
Name: preTestScore, dtype: float64
Step 7. Present the mean preTestScores grouped by regiment and company
Code:
regiment.groupby(['regiment', 'company']).preTestScore.mean()
Output:
regiment company
Dragoons 1st 3.5
2nd 27.5
Nighthawks 1st 14.0
2nd 16.5
Scouts 1st 2.5
2nd 2.5
Name: preTestScore, dtype: float64
Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing
Code:
'''
stack() and unstack()
stack: "rotates" columns into rows.
unstack: "rotates" rows into columns.
With a multi-level index, both functions act on the innermost level by default.
'''
regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack()
Output:
regiment / company | 1st | 2nd |
---|---|---|
Dragoons | 3.5 | 27.5 |
Nighthawks | 14.0 | 16.5 |
Scouts | 2.5 | 2.5 |
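`unstack` is one way to get rid of the two-level row index; another is `reset_index`, which keeps the long layout and turns both index levels back into ordinary columns. The sketch below shows that alternative on the three columns involved, using the values from Step 2; the names `grouped` and `flat` are assumptions made for this example.

```python
import pandas as pd

# The three columns involved, with the values from Step 2
regiment = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Series indexed by a (regiment, company) MultiIndex
grouped = regiment.groupby(['regiment', 'company'])['preTestScore'].mean()

# reset_index moves both index levels into regular columns
flat = grouped.reset_index()
print(flat)
```

The wide table from `unstack` is easier to read, while the flat long table from `reset_index` is usually easier to merge or plot.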
Step 9. Group the entire dataframe by regiment and company
Code:
regiment.groupby(['regiment', 'company']).mean(numeric_only=True)  # numeric_only skips the text column 'name'
Output:
regiment | company | preTestScore | postTestScore |
---|---|---|---|
Dragoons | 1st | 3.5 | 47.5 |
Dragoons | 2nd | 27.5 | 75.5 |
Nighthawks | 1st | 14.0 | 59.5 |
Nighthawks | 2nd | 16.5 | 59.5 |
Scouts | 1st | 2.5 | 66.0 |
Scouts | 2nd | 2.5 | 66.0 |
Step 10. What is the number of observations in each regiment and company
Code:
regiment.groupby(['regiment', 'company']).size()
Output:
regiment company
Dragoons 1st 2
2nd 2
Nighthawks 1st 2
2nd 2
Scouts 1st 2
2nd 2
dtype: int64
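`size()` counts rows, so it differs from `count()` whenever a column has missing values. The regiment data has none, so the sketch below uses a small invented table (`toy`, with one deliberately missing score) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy data with one missing score, invented for illustration only
toy = pd.DataFrame({
    'company': ['1st', '1st', '2nd'],
    'preTestScore': [4.0, np.nan, 31.0],
})

# size: number of rows in each group, missing values included
sizes = toy.groupby('company').size()
# count: number of non-missing values in the selected column
counts = toy.groupby('company')['preTestScore'].count()

print(sizes)
print(counts)
```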
Step 11. Iterate over a group and print the name and the whole data from the regiment
Code:
# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs
for name, group in regiment.groupby('regiment'):
    print(name)
    print(group)
Output:
Dragoons
regiment company name preTestScore postTestScore
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
Nighthawks
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
Scouts
regiment company name preTestScore postTestScore
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70
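When only one group is needed, `get_group` fetches it directly instead of looping over all of them. The sketch below rebuilds two columns with the values from Step 2; the variable names `groups` and `scouts` are assumptions made for this example.

```python
import pandas as pd

# Two columns with the values from Step 2
regiment = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
             'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
})

groups = regiment.groupby('regiment')

# get_group returns the sub-DataFrame for a single key
scouts = groups.get_group('Scouts')
print(scouts)

# .groups maps each key to the row labels that belong to it
print(list(groups.groups))
```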
Conclusion
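That is all for today's pandas exercises. Keep practicing! The problem statements were left in English this time, so get used to reading them that way. If you have questions, raise them in the comments; everyone is welcome to learn and improve together.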