Python数据分析pandas入门练习题

Posted 2023-02-21 Geek_bao

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了Python数据分析pandas入门练习题相关的知识，希望对你有一定的参考价值。

Python数据分析基础

Preparation
Exercise 1-Student Alcohol Consumption
Exercise 2-United States - Crime Rates - 1960 - 2014
Conclusion

Preparation

下面是练习题的数据集，尽量下载下来使用。下面习题的连接不一定能打开。
需要数据集可以私聊博主或者自行网上寻找，传到csdn，你们下载要会员，就不传了。

Exercise 1-Student Alcohol Consumption

Introduction:

This time you will download a dataset from the UCI.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called df.

代码如下：

df = pd.read_csv("student-mat.csv", sep=',')
df.head()

输出结果如下：

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	...	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
0	GP	F	18	U	GT3	A	4	4	at_home	teacher	...	4	3	4	1	1	3	6	5	6	6
1	GP	F	17	U	GT3	T	1	1	at_home	other	...	5	3	3	1	1	3	4	5	5	6
2	GP	F	15	U	LE3	T	1	1	at_home	other	...	4	3	2	2	3	3	10	7	8	10
3	GP	F	15	U	GT3	T	4	2	health	services	...	3	2	2	1	1	5	2	15	14	15
4	GP	F	16	U	GT3	T	3	3	other	other	...	4	3	2	1	2	5	4	6	10	10

5 rows × 33 columns

Step 4. For the purpose of this exercise slice the dataframe from ‘school’ until the ‘guardian’ column

代码如下：

stud_alcoh = df.loc[:, 'school':'guardian']   # loc切片一般用行列名，iloc一般用行列号
stud_alcoh.head()

输出结果如下：

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian
0	GP	F	18	U	GT3	A	4	4	at_home	teacher	course	mother
1	GP	F	17	U	GT3	T	1	1	at_home	other	course	father
2	GP	F	15	U	LE3	T	1	1	at_home	other	other	mother
3	GP	F	15	U	GT3	T	4	2	health	services	home	mother
4	GP	F	16	U	GT3	T	3	3	other	other	home	father

Step 5. Create a lambda function that capitalize strings.

代码如下：

capitalizer = lambda str: str.capitalize()  #capitalize()将字符串首字母转换为大写字母，upper()将整个字符串转化为大写
print(capitalizer('www'))

输出结果如下：

Www

Step 6. Capitalize both Mjob and Fjob

代码如下：

# for i in df['Mjob']:
#    print(capitalizer(i))
stud_alcoh.Mjob.apply(capitalizer)
stud_alcoh.Fjob.apply(capitalizer)

输出结果如下：

0       Teacher
1         Other
2         Other
3      Services
4         Other
5         Other
6         Other
7       Teacher
8         Other
9         Other
10       Health
11        Other
12     Services
13        Other
14        Other
15        Other
16     Services
17        Other
18     Services
19        Other
20        Other
21       Health
22        Other
23        Other
24       Health
25     Services
26        Other
27     Services
28        Other
29      Teacher
         ...   
365       Other
366    Services
367    Services
368    Services
369     Teacher
370    Services
371    Services
372     At_home
373       Other
374       Other
375       Other
376       Other
377    Services
378       Other
379       Other
380     Teacher
381       Other
382    Services
383    Services
384       Other
385       Other
386     At_home
387       Other
388    Services
389       Other
390    Services
391    Services
392       Other
393       Other
394     At_home
Name: Fjob, Length: 395, dtype: object

Step 7. Print the last elements of the data set.

代码如下：

# df.iloc[394, 32]
stud_alcoh.tail()

输出结果如下：

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian
390	MS	M	20	U	LE3	A	2	2	services	services	course	other
391	MS	M	17	U	LE3	T	3	1	services	services	course	mother
392	MS	M	21	R	GT3	T	1	1	other	other	course	other
393	MS	M	18	R	LE3	T	3	2	services	other	course	mother
394	MS	M	19	U	LE3	T	1	1	other	at_home	course	father

Step 8. Did you notice the original dataframe is still lowercase? Why is that? Fix it and capitalize Mjob and Fjob.

代码如下：

stud_alcoh.Mjob = stud_alcoh.Mjob.apply(capitalizer)
stud_alcoh.Fjob = stud_alcoh.Fjob.apply(capitalizer)
stud_alcoh

输出结果如下：

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian
0	GP	F	18	U	GT3	A	4	4	At_home	Teacher	course	mother
1	GP	F	17	U	GT3	T	1	1	At_home	Other	course	father
2	GP	F	15	U	LE3	T	1	1	At_home	Other	other	mother
3	GP	F	15	U	GT3	T	4	2	Health	Services	home	mother
4	GP	F	16	U	GT3	T	3	3	Other	Other	home	father
5	GP	M	16	U	LE3	T	4	3	Services	Other	reputation	mother
6	GP	M	16	U	LE3	T	2	2	Other	Other	home	mother
7	GP	F	17	U	GT3	A	4	4	Other	Teacher	home	mother
8	GP	M	15	U	LE3	A	3	2	Services	Other	home	mother
9	GP	M	15	U	GT3	T	3	4	Other	Other	home	mother
10	GP	F	15	U	GT3	T	4	4	Teacher	Health	reputation	mother
11	GP	F	15	U	GT3	T	2	1	Services	Other	reputation	father
12	GP	M	15	U	LE3	T	4	4	Health	Services	course	father
13	GP	M	15	U	GT3	T	4	3	Teacher	Other	course	mother
14	GP	M	15	U	GT3	A	2	2	Other	Other	home	other
15	GP	F	16	U	GT3	T	4	4	Health	Other	home	mother
16	GP	F	16	U	GT3	T	4	4	Services	Services	reputation	mother
17	GP	F	16	U	GT3	T	3	3	Other	Other	reputation	mother
18	GP	M	17	U	GT3	T	3	2	Services	Services	course	mother
19	GP	M	16	U	LE3	T	4	3	Health	Other	home	father
20	GP	M	15	U	GT3	T	4	3	Teacher	Other	reputation	mother
21	GP	M	15	U	GT3	T	4	4	Health	Health	other	father
22	GP	M	16	U	LE3	T	4	2	Teacher	Other	course	mother
23	GP	M	16	U	LE3	T	2	2	Other	Other	reputation	mother
24	GP	F	15	R	GT3	T	2	4	Services	Health	course	mother
25	GP	F	16	U	GT3	T	2	2	Services	Services	home	mother
26	GP	M	15	U	GT3	T	2	2	Other	Other	home	mother
27	GP	M	15	U	GT3	T	4	2	Health	Services	other	mother
28	GP	M	16	U	LE3	A	3	4	Services	Other	home	mother
29	GP	M	16	U	GT3	T	4	4	Teacher	Teacher	home	mother
...	...	...	...	...	...	...	...	...	...	...	...	...
365	MS	M	18	R	GT3	T	1	3	At_home	Other	course	mother
366	MS	M	18	U	LE3	T	4	4	Teacher	Services	other	mother
367	MS	F	17	R	GT3	T	1	1	Other	Services	reputation	mother
368	MS	F	18	U	GT3	T	2	3	At_home	Services	course	father
369	MS	F	18	R	GT3	T	4	4	Other	Teacher	other	father
370	MS	F	19	U	LE3	T	3	2	Services	Services	home	other
371	MS	M	18	R	LE3	T	1	2	At_home	Services	other	father
372	MS	F	17	U	GT3	T	2	2	Other	At_home	home	mother
373	MS	F	17	R	GT3	T	1	2	Other	Other	course	mother
374	MS	F	18	R	LE3	T	4	4	Other	Other	reputation	mother
375	MS	F	18	R	GT3	T	1	1	Other	Other	home	mother
376	MS	F	20	U	GT3	T	4	2	Health	Other	course	other
377	MS	F	18	R	LE3	T	4	4	Teacher	Services	course	mother
378	MS	F	18	U	GT3	T	3	3	Other	Other	home	mother
379	MS	F	17	R	GT3	T	3	1	At_home	Other	reputation	mother
380	MS	M	18	U	GT3	T	4	4	Teacher	Teacher	home	father
381	MS	M	18	R	GT3	T	2	1	Other	Other	other	mother
382	MS	M	17	U	GT3	T	2	3	Other	Services	home	father
383	MS	M	19	R	GT3	T	1	1	Other	Services	other	mother
384	MS	M	18	R	GT3	T	4	2	Other	Other	home	father
385	MS	F	18	R	GT3	T	2	2	At_home	Other	other	mother
386	MS	F	18	R	GT3	T	4	4	Teacher	At_home	reputation	mother
387	MS	F	19	R	GT3	T	2	3	Services	Other	course	mother
388	MS	F	18	U	LE3	T	3	1	Teacher	Services	course	mother
389	MS	F	18	U	GT3	T	1	1	Other	Other	course	mother
390	MS	M	20	U	LE3	A	2	2	Services	Services	course	other
391	MS	M	17	U	LE3	T	3	1	Services	Services	course	mother
392	MS	M	21	R	GT3	T	1	1	Other	Other	course	other
393	MS	M	18	R	LE3	T	3	2	Services	Other	course	mother
394	MS	M	19	U	LE3	T	1	1	Other	At_home	course	father

395 rows × 12 columns

Step 9. Create a function called majority that return a boolean value to a new column called legal_drinker (Consider majority as older than 17 years old)

代码如下：

def majority(age):
    if age > 17:
        return True
    else:
        return False

stud_alcoh['legal_drinker'] = stud_alcoh.age.apply(majority)
stud_alcoh

输出结果如下：

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian	legal_drinker
0	GP	F	18	U	GT3	A	4	4	At_home	Teacher	course	mother	True
1	GP	F	17	U	GT3	T	1	1	At_home	Other	course	father	False
2	GP	F	15	U	LE3	T	1	1	At_home	Other	other	mother	False
3	GP	F	15	U	GT3	T	4	2	Health	Services	home	mother	False
4	GP	F	16	U	GT3	T	3	3	Other	Other	home	father	False
5	GP	M	16	U	LE3	T	4	3	Services	Other	reputation	mother	False
6	GP	M	16	U	LE3	T	2	2	Other	Other	home	mother	False
7	GP	F	17	U	GT3	A	4	4	Other	Teacher	home	mother	False
8	GP	M	15	U	LE3	A	3	2	Services	Other	home	mother	False
9	GP	M	15	U	GT3	T	3	4	Other	Other	home	mother	False
10	GP	F	15	U	GT3	T	4	4	Teacher	Health	reputation	mother	False
11	GP	F	15	U	GT3	T	2	1	Services	Other	reputation	father	False
12	GP	M	15	U	LE3	T	4	4	Health	Services	course	father	False
13	GP	M	15	U	GT3	T	4	3	Teacher	Other	course	mother	False
14	GP	M	15	U	GT3	A	2	2	Other	Other	home	other	False
15	GP	F	16	U	GT3	T	4	4	Health	Other	home	mother	False
16	GP	F	16	U	GT3	T	4	4	Services	Services	reputation	mother	False
17	GP	F	16	U	GT3	T	3	3	Other	Other	reputation	mother	False
18	GP	M	17	U	GT3	T	3	2	Services	Services	course	mother	False
19	GP	M	16	U	LE3	T	4	3	Health	Other	home	father	False
20	GP	M	15	U	GT3	T	4	3	Teacher	Other	reputation	mother	False
21	GP	M	15	U	GT3	T	4	4	Health	Health	other	father	False
22	GP	M	16	U	LE3	T	4	2	Teacher	Other	course	mother	False
23	GP	M	16	U	LE3	T	2	2	Other	Other	reputation	mother	False
24	GP	F	15	R	GT3	T	2	4	Services	Health	course	mother	False
25	GP	F	16	U	GT3	T	2	2	Services	Services	home	mother	False
26	GP	M	15	U	GT3	T	2	2	Other	Other	home	mother	False
27	GP	M	15	U	GT3	T	4	2	Health	Services	other	mother	False
28	GP	M	16	U	LE3	A	3	4	Services	Other	home	mother	False
29	GP	M	16	U	GT3	T	4	4	Teacher	Teacher	home	mother	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...
365	MS	M	18	R	GT3	T	1	3	At_home	Other	course	mother	True
366	MS	M	18	U	LE3	T	4	4	Teacher	Services	other	mother	True
367	MS	F	17	R	GT3	T	1	1	Other	Services	reputation	mother	False
368	MS	F	18	U	GT3	T	2	3	At_home	Services	course	father	True
369	MS	F	18	R	GT3	T	4	4	Other	Teacher	other	father	True
370	MS	F	19	U	LE3	T	3	2	Services	Services	home	other	True
371	MS	M	18	R	LE3	T	1	2	At_home	Services	other	father	True
372	MS	F	17	U	GT3	T	2	2	Other	At_home	home	mother	False
373	MS	F	17	R	GT3	T	1	2	Other	Other	course	mother	False
374	MS	F	18	R	LE3	T	4	4	Other	Other	reputation	mother	True
375	MS	F	18	R	GT3	T	1	1	Other	Other	home	mother	True
376	MS	F	20	U	GT3	T	4	2	Health	Other	course	other	True
377	MS	F	18	R	LE3	T	4	4	Teacher	Services	course	mother	True
378	MS	F	18	U	GT3	T	3	3	Other	Other	home	mother	True
379	MS	F	17	R	GT3	T	3	1	At_home	Other	reputation	mother	False
380	MS	M	18	U	GT3	T	4	4	Teacher	Teacher	home	father	True
381	MS	M	18	R	GT3	T	2	1	Other	Other	other	mother	True
382	MS	M	17	U	GT3	T	2	3	Other	Services	home	father	False
383	MS	M	19	R	GT3	T	1	1	Other	Services	other	mother	True
384	MS	M	18	R	GT3	T	4	2	Other	Other	home	father	True
385	MS	F	18	R	GT3	T	2	2	At_home	Other	other	mother	True
386	MS	F	18	R	GT3	T	4	4	Teacher	At_home	reputation	mother	True
387	MS	F	19	R	GT3	T	2	3	Services	Other	course	mother	True
388	MS	F	18	U	LE3	T	3	1	Teacher	Services	course	mother	True
389	MS	F	18	U	GT3	T	1	1	Other	Other	course	mother	True
390	MS	M	20	U	LE3	A	2	2	Services	Services	course	other	True
391	MS	M	17	U	LE3	T	3	1	Services	Services	course	mother	False
392	MS	M	21	R	GT3	T	1	1	Other	Other	course	other	True
393	MS	M	18	R	LE3	T	3	2	Services	Other	course	mother	True
394	MS	M	19	U	LE3	T	1	1	Other	At_home	course	father	True

395 rows × 13 columns

Step 10. Multiply every number of the dataset by 10.

I know this makes no sense, don’t forget it is just an exercise

代码如下：

def times10(x):
    if type(x) is int:
        return 10 * x
    return x

# apply()将一个函数作用于DataFrame中的每个行或者列
# 将函数做用于DataFrame中的所有元素(elements)
stud_alcoh.applymap(times10).head()

输出结果如下：

	school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian	legal_drinker
0	GP	F	180	U	GT3	A	40	40	At_home	Teacher	course	mother	True
1	GP	F	170	U	GT3	T	10	10	At_home	Other	course	father	False
2	GP	F	150	U	LE3	T	10	10	At_home	Other	other	mother	False
3	GP	F	150	U	GT3	T	40	20	Health	Services	home	mother	False
4	GP	F	160	U	GT3	T	30	30	Other	Other	home	father	False

Exercise 2-United States - Crime Rates - 1960 - 2014

Introduction:

This time you will create a data

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called crime.

代码如下：

crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')
crime.head()

输出结果如下：

	Year	Population	Total	Violent	Property	Murder	Forcible_Rape	Robbery	Aggravated_assault	Burglary	Larceny_Theft	Vehicle_Theft
0	1960	179323175	3384200	288460	3095700	9110	17190	107840	154320	912100	1855400	328200
1	1961	182992000	3488000	289390	3198600	8740	17220	106670	156760	949600	1913000	336000
2	1962	185771000	3752200	301510	3450700	8530	17550	110860	164570	994300	2089600	366800
3	1963	188483000	4109500	316970	3792500	8640	17650	116470	174210	1086400	2297800	408300
4	1964	191141000	4564600	364220	4200400	9360	21420	130390	203050	1213200	2514400	472800

Step 4. What is the type of the columns?

代码如下：

# crime.columns
crime.info()

输出结果如下：

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null int64
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: int64(12)
memory usage: 5.2 KB

Have you noticed that the type of Year is int64. But pandas has a different type to work with Time Series. Let’s see it now.

Step 5. Convert the type of the column Year to datetime64

代码如下：

crime.Year = pd.to_datetime(crime.Year, format='%Y')   # 转化为日期格式%Y%m%d %H%M%S
crime.info()   # 输出列名以及列数据类型等相关信息

输出结果如下：

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null datetime64[ns]
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.2 KB

Step 6. Set the Year column as the index of the dataframe

代码如下：

crime = crime.set_index('Year', drop = True)# drop参数默认为False,想要删除原先的索引列要置为True
crime.head()

输出结果如下：

	Population	Total	Violent	Property	Murder	Forcible_Rape	Robbery	Aggravated_assault	Burglary	Larceny_Theft	Vehicle_Theft
Year
1960-01-01	179323175	3384200	288460	3095700	9110	17190	107840	154320	912100	1855400	328200
1961-01-01	182992000	3488000	289390	3198600	8740	17220	106670	156760	949600	1913000	336000
1962-01-01	185771000	3752200	301510	3450700	8530	17550	110860	164570	994300	2089600	366800
1963-01-01	188483000	4109500	316970	3792500	8640	17650	116470	174210	1086400	2297800	408300
1964-01-01	191141000	4564600	364220	4200400	9360	21420	130390	203050	1213200	2514400	472800

Step 7. Delete the Total column

代码如下：

del crime['Total']   # del直接删除
crime.head()

输出结果如下：

	Population	Violent	Property	Murder	Forcible_Rape	Robbery	Aggravated_assault	Burglary	Larceny_Theft	Vehicle_Theft
Year
1960-01-01	179323175	288460	3095700	9110	17190	107840	154320	912100	1855400	328200
1961-01-01	182992000	289390	3198600	8740	17220	106670	156760	949600	1913000	336000
1962-01-01	185771000	301510	3450700	8530	17550	110860	164570	994300	2089600	366800
1963-01-01	188483000	316970	3792500	8640	17650	116470	174210	1086400	2297800	408300
1964-01-01	191141000	364220	4200400	9360	21420	130390	203050	1213200	2514400	472800

Step 8. Group the year by decades and sum the values

Pay attention to the Population column number, summing this column is a mistake

代码如下：

# Uses resample to sum each decade
# resample聚合函数，将数据以W星期,M月,Q季度,QS季度的开始第一天开始,A年,10A十年,10AS十年聚合日期第一天开始.的形式进行聚合
crimes = crime.resample('10AS').sum()

# Uses resample to get the max value only for the "Population" column
population = crime['Population'].resample('10AS').max()

# Updating the "Population" column
crimes['Population'] = population
crimes

resample函数参数含义如下：

输出结果如下：

	Population	Violent	Property	Murder	Forcible_Rape	Robbery	Aggravated_assault	Burglary	Larceny_Theft	Vehicle_Theft
Year
1960-01-01	201385000.0	4134930	45160900	106180	236720	1633510	2158520	13321100	26547700	5292100
1970-01-01	220099000.0	9607930	91383800	192230	554570	4159020	4702120	28486000	53157800	9739900
1980-01-01	248239000.0	14074328	117048900	206439	865639	5383109	7619130	33073494	72040253	11935411
1990-01-01	272690813.0	17527048	119053499	211664	998827	5748930	10568963	26750015	77679366	14624418
2000-01-01	307006550.0	13968056	100944369	163068	922499	4230366	8652124	21565176	67970291	11412834
2010-01-01	318857056.0	6072017	44095950	72867	421059	1749809	3764142	10125170	30401698	3569080
2020-01-01	NaN	0	0	0	0	0	0	0	0	0

Step 9. What is the most dangerous decade to live in the US?

代码如下：

crime.idxmax(0) # 计算能够获得最大值的索引位置
# 从结果可以看出90s最危险，其实2020-2021更危险（滑稽+狗头保命）

输出结果如下：

Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]

Conclusion

今天主要练习了Apply()函数以及其他相关函数的使用。一定要理解每个参数含义，才能灵活运用函数得到你要的数。今天就到这里了，明天继续学习，各位加油！

以上是关于Python数据分析pandas入门练习题的主要内容，如果未能解决你的问题，请参考以下文章

Python数据分析pandas入门练习题