Python Data Analysis: Introductory pandas Exercises

Posted 2023-02-27 Geek_bao

Python Data Analysis Basics

Preparation
Exercise 1-GroupBy
Exercise 2-Occupation
Exercise 3-Regiment
Conclusion

Preparation

The datasets for these exercises are at the link below; download them if you can. The links given inside the exercises may not open.
https://github.com/justmarkham/pandas-videos/tree/master/data

Exercise 1-GroupBy

Introduction:

GroupBy can be summarized as Split-Apply-Combine.
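
To make the three phases concrete, here is a minimal sketch that performs them by hand and then compares the result with the one-line groupby call. The toy DataFrame and its values are made up for illustration and are not part of the exercise dataset.

```python
import pandas as pd

# Hypothetical toy data, not the exercise dataset.
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AF', 'AF'],
    'beer_servings': [100, 200, 10, 30],
})

# Split: one sub-DataFrame per distinct key.
pieces = {key: part for key, part in toy.groupby('continent')}

# Apply: reduce each piece to a single number.
means = {key: part['beer_servings'].mean() for key, part in pieces.items()}

# Combine: collect the per-group results into one Series.
manual = pd.Series(means, name='beer_servings')

# The usual one-liner performs all three phases at once.
grouped = toy.groupby('continent')['beer_servings'].mean()

print(manual)
print(grouped)
```

Both Series hold the same per-continent means; the manual version only exists to show what groupby does internally in spirit.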

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called drinks.

Code:

# The separator defaults to ','; recent pandas versions no longer accept it positionally.
drinks = pd.read_csv('drinks.csv')
drinks

Output:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	AS
1	Albania	89	132	54	4.9	EU
2	Algeria	25	0	14	0.7	AF
3	Andorra	245	138	312	12.4	EU
4	Angola	217	57	45	5.9	AF
5	Antigua & Barbuda	102	128	45	4.9	NaN
6	Argentina	193	25	221	8.3	SA
7	Armenia	21	179	11	3.8	EU
8	Australia	261	72	212	10.4	OC
9	Austria	279	75	191	9.7	EU
10	Azerbaijan	21	46	5	1.3	EU
11	Bahamas	122	176	51	6.3	NaN
12	Bahrain	42	63	7	2.0	AS
13	Bangladesh	0	0	0	0.0	AS
14	Barbados	143	173	36	6.3	NaN
15	Belarus	142	373	42	14.4	EU
16	Belgium	295	84	212	10.5	EU
17	Belize	263	114	8	6.8	NaN
18	Benin	34	4	13	1.1	AF
19	Bhutan	23	0	0	0.4	AS
20	Bolivia	167	41	8	3.8	SA
21	Bosnia-Herzegovina	76	173	8	4.6	EU
22	Botswana	173	35	35	5.4	AF
23	Brazil	245	145	16	7.2	SA
24	Brunei	31	2	1	0.6	AS
25	Bulgaria	231	252	94	10.3	EU
26	Burkina Faso	25	7	7	4.3	AF
27	Burundi	88	0	0	6.3	AF
28	Cote d'Ivoire	37	1	7	4.0	AF
29	Cabo Verde	144	56	16	4.0	AF
...	...	...	...	...	...	...
163	Suriname	128	178	7	5.6	SA
164	Swaziland	90	2	2	4.7	AF
165	Sweden	152	60	186	7.2	EU
166	Switzerland	185	100	280	10.2	EU
167	Syria	5	35	16	1.0	AS
168	Tajikistan	2	15	0	0.3	AS
169	Thailand	99	258	1	6.4	AS
170	Macedonia	106	27	86	3.9	EU
171	Timor-Leste	1	1	4	0.1	AS
172	Togo	36	2	19	1.3	AF
173	Tonga	36	21	5	1.1	OC
174	Trinidad & Tobago	197	156	7	6.4	NaN
175	Tunisia	51	3	20	1.3	AF
176	Turkey	51	22	7	1.4	AS
177	Turkmenistan	19	71	32	2.2	AS
178	Tuvalu	6	41	9	1.0	OC
179	Uganda	45	9	0	8.3	AF
180	Ukraine	206	237	45	8.9	EU
181	United Arab Emirates	16	135	5	2.8	AS
182	United Kingdom	219	126	195	10.4	EU
183	Tanzania	36	6	1	5.7	AF
184	USA	249	158	84	8.7	NaN
185	Uruguay	115	35	220	6.6	SA
186	Uzbekistan	25	101	8	2.4	AS
187	Vanuatu	21	18	11	0.9	OC
188	Venezuela	333	100	3	7.7	SA
189	Vietnam	111	2	1	2.0	AS
190	Yemen	6	0	0	0.1	AS
191	Zambia	32	19	4	2.5	AF
192	Zimbabwe	64	18	4	4.7	AF

193 rows × 6 columns
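
The NaN entries in the continent column (Antigua & Barbuda, Bahamas, USA and so on) are most likely not missing data: the file appears to use the code NA for North America, and read_csv treats the string "NA" as a missing value by default. The sketch below reproduces the effect on a made-up three-row CSV and shows keep_default_na=False as one possible way to keep the literal text; whether you want that for the real file is a judgment call.

```python
import io
import pandas as pd

# Hypothetical three-row CSV in the same shape as drinks.csv.
csv_text = (
    "country,beer_servings,continent\n"
    "USA,249,NA\n"
    "Albania,89,EU\n"
    "Angola,217,AF\n"
)

# Default parsing: the string "NA" becomes NaN.
default = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the string is kept as-is.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default['continent'].isna().sum())
print(literal['continent'].tolist())
```

Because groupby drops rows whose key is missing by default, this also explains why the grouped results in the following steps list only five continents.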

Step 4. Which continent drinks more beer on average?

Code:

drinks.groupby('continent').beer_servings.mean()

Output:

continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64

Step 5. For each continent print the statistics for wine consumption.

Code:

drinks.groupby('continent').wine_servings.describe()

Output:

	count	mean	std	min	25%	50%	75%	max
continent
AF	53.0	16.264151	38.846419	0.0	1.0	2.0	13.00	233.0
AS	44.0	9.068182	21.667034	0.0	0.0	1.0	8.00	123.0
EU	45.0	142.222222	97.421738	0.0	59.0	128.0	195.00	370.0
OC	16.0	35.625000	64.555790	0.0	1.0	8.5	23.25	212.0
SA	12.0	62.416667	88.620189	1.0	3.0	12.0	98.50	221.0

Step 6. Print the mean alcohol consumption per continent for every column

Code:

# numeric_only=True skips the text column 'country', which cannot be averaged.
drinks.groupby('continent').mean(numeric_only=True)

Output:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
continent
AF	61.471698	16.339623	16.264151	3.007547
AS	37.045455	60.840909	9.068182	2.170455
EU	193.777778	132.555556	142.222222	8.617778
OC	89.687500	58.437500	35.625000	3.381250
SA	175.083333	114.750000	62.416667	6.308333

Step 7. Print the median alcohol consumption per continent for every column

Code:

# numeric_only=True skips the text column 'country'.
drinks.groupby('continent').median(numeric_only=True)

Output:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
continent
AF	32.0	3.0	2.0	2.30
AS	17.5	16.0	1.0	1.20
EU	219.0	122.0	128.0	10.00
OC	52.5	37.0	8.5	1.75
SA	162.5	108.5	12.0	6.85
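
Steps 6 and 7 aggregate every column of the grouped frame. On recent pandas versions (2.0 and later, to my knowledge) a grouped mean or median over a frame that still contains text columns raises a TypeError instead of silently dropping them. The sketch below shows two ways around that on a made-up toy frame; the column values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with one text column besides the group key.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'continent': ['EU', 'EU', 'AF'],
    'beer_servings': [100, 200, 10],
    'wine_servings': [50, 70, 2],
})

# Option 1: ask pandas to skip non-numeric columns.
by_flag = toy.groupby('continent').mean(numeric_only=True)

# Option 2: select the numeric columns explicitly before aggregating.
by_selection = toy.groupby('continent')[['beer_servings', 'wine_servings']].median()

print(by_flag)
print(by_selection)
```

Selecting columns explicitly is a little more typing but makes the intent visible and does not depend on how pandas classifies each column.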

Step 8. Print the mean, min and max values for spirit consumption.

This time output a DataFrame

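Code:

drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
# agg applies one or more aggregation functions to the grouped data; called on a
# grouped DataFrame, it aggregates every remaining column by default.
# To aggregate only some columns, or to use different functions per column,
# pass a dictionary that maps column names to aggregations:
# spirit_servings_info = {'spirit_servings': ['min', 'mean', 'max']}
# print(drinks.groupby('continent').agg(spirit_servings_info))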

Output:

	mean	min	max
continent
AF	16.339623	0	152
AS	60.840909	0	326
EU	132.555556	0	373
OC	58.437500	0	254
SA	114.750000	25	302
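
The dictionary form mentioned in the code comments can be sketched as follows, together with named aggregation, which produces flat, readable output column names. The toy frame and the output names spirit_mean and wine_total are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the exercise dataset.
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AF', 'AF'],
    'spirit_servings': [100, 300, 0, 20],
    'wine_servings': [50, 70, 2, 4],
})

# Dictionary form: column name -> list of aggregations.
# The result has two-level column labels such as ('spirit_servings', 'mean').
per_column = toy.groupby('continent').agg({
    'spirit_servings': ['mean', 'min', 'max'],
    'wine_servings': ['sum'],
})

# Named aggregation: output column = (input column, aggregation).
named = toy.groupby('continent').agg(
    spirit_mean=('spirit_servings', 'mean'),
    wine_total=('wine_servings', 'sum'),
)

print(per_column)
print(named)
```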

Exercise 2-Occupation

Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called users.

Code:

users = pd.read_table('u.user', sep='|', index_col = 'user_id')
users.head()

Output:

	age	gender	occupation	zip_code
user_id
1	24	M	technician	85711
2	53	F	other	94043
3	23	M	writer	32067
4	24	M	technician	43537
5	33	F	other	15213

Step 4. Discover what is the mean age per occupation

Code:

users.groupby('occupation').age.mean()

Output:

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

Step 5. Discover the Male ratio per occupation and sort it from the most to the least

Code:

# create function
def gender_to_numeric(x):
    if x == 'M':
        return 1
    if x == 'F':
        return 0
users['gender_n'] = users['gender'].apply(gender_to_numeric)

a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
a.sort_values(ascending = False)

Output:

doctor           100.000000
engineer          97.014925
technician        96.296296
retired           92.857143
programmer        90.909091
executive         90.625000
scientist         90.322581
entertainment     88.888889
lawyer            83.333333
salesman          75.000000
educator          72.631579
student           69.387755
other             65.714286
marketing         61.538462
writer            57.777778
none              55.555556
administrator     54.430380
artist            53.571429
librarian         43.137255
healthcare        31.250000
homemaker         14.285714
dtype: float64
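
The same ratio can be computed without a helper column, because the mean of a boolean Series is the share of True values. This is only an alternative sketch on made-up data, not the approach used in the step above.

```python
import pandas as pd

# Hypothetical toy data in the shape of the users table.
toy = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'writer', 'writer', 'writer', 'writer'],
    'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
})

# True where the user is male; the group mean of booleans is the male share.
is_male = toy['gender'].eq('M')
male_pct = (
    is_male.groupby(toy['occupation'])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)

print(male_pct)
```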

Step 6. For each occupation, calculate the minimum and maximum ages

Code:

users.groupby('occupation').age.agg(['min', 'max'])

Output:

	min	max
occupation
administrator	21	70
artist	19	48
doctor	28	64
educator	23	63
engineer	22	70
entertainment	15	50
executive	22	69
healthcare	22	62
homemaker	20	50
lawyer	21	53
librarian	23	69
marketing	24	55
none	11	55
other	13	64
programmer	20	63
retired	51	73
salesman	18	66
scientist	23	55
student	7	42
technician	21	55
writer	18	60

Step 7. For each combination of occupation and gender, calculate the mean age

Code:

# numeric_only=True skips the text column zip_code, leaving age and gender_n.
users.groupby(['occupation', 'gender']).mean(numeric_only=True)

Output:

		age	gender_n
occupation	gender
administrator	F	40.638889	0.0
administrator	M	37.162791	1.0
artist	F	30.307692	0.0
artist	M	32.333333	1.0
doctor	M	43.571429	1.0
educator	F	39.115385	0.0
educator	M	43.101449	1.0
engineer	F	29.500000	0.0
engineer	M	36.600000	1.0
entertainment	F	31.000000	0.0
entertainment	M	29.000000	1.0
executive	F	44.000000	0.0
executive	M	38.172414	1.0
healthcare	F	39.818182	0.0
healthcare	M	45.400000	1.0
homemaker	F	34.166667	0.0
homemaker	M	23.000000	1.0
lawyer	F	39.500000	0.0
lawyer	M	36.200000	1.0
librarian	F	40.000000	0.0
librarian	M	40.000000	1.0
marketing	F	37.200000	0.0
marketing	M	37.875000	1.0
none	F	36.500000	0.0
none	M	18.600000	1.0
other	F	35.472222	0.0
other	M	34.028986	1.0
programmer	F	32.166667	0.0
programmer	M	33.216667	1.0
retired	F	70.000000	0.0
retired	M	62.538462	1.0
salesman	F	27.000000	0.0
salesman	M	38.555556	1.0
scientist	F	28.333333	0.0
scientist	M	36.321429	1.0
student	F	20.750000	0.0
student	M	22.669118	1.0
technician	F	38.000000	0.0
technician	M	32.961538	1.0
writer	F	37.631579	0.0
writer	M	35.346154	1.0

Step 8. For each occupation present the percentage of women and men

Code:

# a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
# print(a.sort_values(ascending = False))
# b = 100 - a
# print(b.sort_values(ascending=True))
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})  # headcount per occupation and gender
occup_count = users.groupby(['occupation']).agg('count')                      # total headcount per occupation
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100     # share of each gender within an occupation, returned as a DataFrame
occup_gender.loc[:, 'gender']   # show the gender column

Output:

occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429
doctor         M         100.000000
educator       F          27.368421
               M          72.631579
engineer       F           2.985075
               M          97.014925
entertainment  F          11.111111
               M          88.888889
executive      F           9.375000
               M          90.625000
healthcare     F          68.750000
               M          31.250000
homemaker      F          85.714286
               M          14.285714
lawyer         F          16.666667
               M          83.333333
librarian      F          56.862745
               M          43.137255
marketing      F          38.461538
               M          61.538462
none           F          44.444444
               M          55.555556
other          F          34.285714
               M          65.714286
programmer     F           9.090909
               M          90.909091
retired        F           7.142857
               M          92.857143
salesman       F          25.000000
               M          75.000000
scientist      F           9.677419
               M          90.322581
student        F          30.612245
               M          69.387755
technician     F           3.703704
               M          96.296296
writer         F          42.222222
               M          57.777778
Name: gender, dtype: float64
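
A shorter route to the same kind of table is value_counts with normalize=True on the grouped gender column, which returns each gender's share within its occupation. The sketch below uses made-up data and is offered as an alternative, not as the method used above.

```python
import pandas as pd

# Hypothetical toy data in the shape of the users table.
toy = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'writer', 'writer', 'writer', 'writer'],
    'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
})

# Share of each gender within each occupation, as percentages.
pct = toy.groupby('occupation')['gender'].value_counts(normalize=True).mul(100)

# Optional: one row per occupation, one column per gender, 0 where a gender is absent.
table = pct.unstack(fill_value=0)

print(pct)
print(table)
```

Occupations with only one gender (doctor in this toy data) simply have no row for the other gender in pct; the unstacked table fills that gap with 0.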

Exercise 3-Regiment

Introduction:

Special thanks to: http://chrisalbon.com/ for sharing the dataset and materials.

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Create the DataFrame with the following values:

Code:

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

Step 3. Assign it to a variable called regiment.

Don’t forget to name each column

Code:

regiment = pd.DataFrame(raw_data, columns = raw_data.keys())
regiment

Output:

	regiment	company	name	preTestScore	postTestScore
0	Nighthawks	1st	Miller	4	25
1	Nighthawks	1st	Jacobson	24	94
2	Nighthawks	2nd	Ali	31	57
3	Nighthawks	2nd	Milner	2	62
4	Dragoons	1st	Cooze	3	70
5	Dragoons	1st	Jacon	4	25
6	Dragoons	2nd	Ryaner	24	94
7	Dragoons	2nd	Sone	31	57
8	Scouts	1st	Sloan	2	62
9	Scouts	1st	Piger	3	70
10	Scouts	2nd	Riani	2	62
11	Scouts	2nd	Ali	3	70

Step 4. What is the mean preTestScore from the regiment Nighthawks?

Code:

# numeric_only=True skips the text columns company and name.
regiment[regiment['regiment'] == 'Nighthawks'].groupby('regiment').mean(numeric_only=True)
# regiment[regiment['regiment'] == 'Nighthawks'].mean(numeric_only=True)

Output:

	preTestScore	postTestScore
regiment
Nighthawks	15.25	59.5
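
Since the question asks for a single number, a boolean mask plus a column selection is enough; no groupby is needed. The sketch below rebuilds only the two relevant columns, using the Nighthawks and the first two Dragoons rows from the table in Step 3; the name regiment_toy is hypothetical.

```python
import pandas as pd

# Subset of the Step 3 data: four Nighthawks rows and two Dragoons rows.
regiment_toy = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 2,
    'preTestScore': [4, 24, 31, 2, 3, 4],
})

# Select the rows of one regiment and a single column, then average it.
mask = regiment_toy['regiment'] == 'Nighthawks'
mean_pre = regiment_toy.loc[mask, 'preTestScore'].mean()

print(mean_pre)  # 15.25
```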

Step 5. Present general statistics by company

Code:

regiment.groupby('company').describe()

Output:

	postTestScore								preTestScore
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
company
1st	6.0	57.666667	27.485754	25.0	34.25	66.0	70.0	94.0	6.0	6.666667	8.524475	2.0	3.00	3.5	4.00	24.0
2nd	6.0	67.000000	14.057027	57.0	58.25	62.0	68.0	94.0	6.0	15.500000	14.652645	2.0	2.25	13.5	29.25	31.0

Step 6. What is the mean of each company’s preTestScore?

Code:

regiment.groupby('company').preTestScore.mean()

Output:

company
1st     6.666667
2nd    15.500000
Name: preTestScore, dtype: float64

Step 7. Present the mean preTestScores grouped by regiment and company

Code:

regiment.groupby(['regiment', 'company']).preTestScore.mean()

Output:

regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64

Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing

Code:

'''
stack() and unstack()
stack: pivots columns into rows.
unstack: pivots rows into columns.
With a multi-level index, both act on the innermost level by default.
'''
regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack()

Output:

company	1st	2nd
regiment
Dragoons	3.5	27.5
Nighthawks	14.0	16.5
Scouts	2.5	2.5
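
unstack is one way to get rid of the two-level row index; reset_index is another, and unstack can also pivot a level other than the innermost one. The sketch below shows all three on made-up data; which shape is preferable depends on what you do with the result next.

```python
import pandas as pd

# Hypothetical toy data in the shape of the regiment table.
toy = pd.DataFrame({
    'regiment': ['Dragoons', 'Dragoons', 'Scouts', 'Scouts'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'preTestScore': [3, 24, 2, 3],
})

means = toy.groupby(['regiment', 'company'])['preTestScore'].mean()

# Default: the innermost level (company) becomes the columns.
wide = means.unstack()

# Choose the level explicitly: regiment becomes the columns instead.
wide_by_company = means.unstack(level='regiment')

# Turn both index levels into ordinary columns (long format).
flat = means.reset_index()

print(wide)
print(wide_by_company)
print(flat)
```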

Step 9. Group the entire dataframe by regiment and company

Code:

# numeric_only=True skips the text column name.
regiment.groupby(['regiment', 'company']).mean(numeric_only=True)

Output:

		preTestScore	postTestScore
regiment	company
Dragoons	1st	3.5	47.5
Dragoons	2nd	27.5	75.5
Nighthawks	1st	14.0	59.5
Nighthawks	2nd	16.5	59.5
Scouts	1st	2.5	66.0
Scouts	2nd	2.5	66.0

Step 10. What is the number of observations in each regiment and company

Code:

regiment.groupby(['regiment', 'company']).size()

Output:

regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64
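
size() counts rows per group, including rows with missing values, while count() counts only the non-missing entries of each column. The exercise data has no gaps, so the two agree there; the made-up frame below has one missing score to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score.
toy = pd.DataFrame({
    'regiment': ['Scouts', 'Scouts', 'Dragoons'],
    'preTestScore': [2.0, np.nan, 3.0],
})

# Number of rows in each group, missing values included.
rows = toy.groupby('regiment').size()

# Number of non-missing preTestScore values in each group.
non_missing = toy.groupby('regiment')['preTestScore'].count()

print(rows)
print(non_missing)
```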

Step 11. Iterate over a group and print the name and the whole data from the regiment

Code:

for name, group in regiment.groupby('regiment'):
    print(name)
    print(group)

Output:

Dragoons
   regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70
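
When only one group is of interest, get_group fetches it directly instead of looping over all of them. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data in the shape of the regiment table.
toy = pd.DataFrame({
    'regiment': ['Dragoons', 'Scouts', 'Scouts'],
    'name': ['Cooze', 'Sloan', 'Piger'],
    'preTestScore': [3, 2, 3],
})

groups = toy.groupby('regiment')

# Names of the groups that exist.
group_names = sorted(groups.groups.keys())

# The sub-DataFrame for a single group.
scouts = groups.get_group('Scouts')

print(group_names)
print(scouts)
```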

Conclusion

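That is all for today's pandas exercises. Keep practicing! The exercise prompts were left in English this time rather than translated, so get used to reading English. Good luck with your studies! If you have questions, discuss them in the comments; everyone is welcome to learn and improve together.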
