Python数据分析pandas入门练习题

Posted Geek_bao

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Python数据分析pandas入门练习题相关的知识,希望对你有一定的参考价值。

Python数据分析基础

Preparation

需要数据集可以自行网上寻找或私聊博主,传到csdn,你们下载要会员,就不传了。下面数据集链接下载不一定能成功。

Exercise 1- MPG Cars

Introduction:

The following exercise utilizes data from UC Irvine Machine Learning Repository

Step 1. Import the necessary libraries

代码如下:

import pandas as pd
import numpy as np

Step 2. Import the first dataset cars1 and cars2.

Step 3. Assign each to a variable called cars1 and cars2

代码如下:

cars1 = pd.read_csv('cars1.csv')
cars2 = pd.read_csv('cars2.csv')
print(cars1.head())
print(cars2.head())

输出结果如下:

    mpg  cylinders  displacement horsepower  weight  acceleration  model  \\
0  18.0          8           307        130    3504          12.0     70   
1  15.0          8           350        165    3693          11.5     70   
2  18.0          8           318        150    3436          11.0     70   
3  16.0          8           304        150    3433          12.0     70   
4  17.0          8           302        140    3449          10.5     70   

   origin                        car  Unnamed: 9  Unnamed: 10  Unnamed: 11  \\
0       1  chevrolet chevelle malibu         NaN          NaN          NaN   
1       1          buick skylark 320         NaN          NaN          NaN   
2       1         plymouth satellite         NaN          NaN          NaN   
3       1              amc rebel sst         NaN          NaN          NaN   
4       1                ford torino         NaN          NaN          NaN   

   Unnamed: 12  Unnamed: 13  
0          NaN          NaN  
1          NaN          NaN  
2          NaN          NaN  
3          NaN          NaN  
4          NaN          NaN  
    mpg  cylinders  displacement horsepower  weight  acceleration  model  \\
0  33.0          4            91         53    1795          17.4     76   
1  20.0          6           225        100    3651          17.7     76   
2  18.0          6           250         78    3574          21.0     76   
3  18.5          6           250        110    3645          16.2     76   
4  17.5          6           258         95    3193          17.8     76   

   origin                 car  
0       3         honda civic  
1       1      dodge aspen se  
2       1   ford granada ghia  
3       1  pontiac ventura sj  
4       1       amc pacer d/l  

Step 4. Ops it seems our first dataset has some unnamed blank columns, fix cars1

代码如下:

# cars1.dropna(axis=1)
cars1 = cars1.loc[:, "mpg" : "car"]   # 取mpg列到car列,赋值给cars1
cars1.head()

输出结果如下:

mpgcylindersdisplacementhorsepowerweightaccelerationmodelorigincar
018.08307130350412.0701chevrolet chevelle malibu
115.08350165369311.5701buick skylark 320
218.08318150343611.0701plymouth satellite
316.08304150343312.0701amc rebel sst
417.08302140344910.5701ford torino

Step 5. What is the number of observations in each dataset?

代码如下:

print(cars1.shape)
print(cars2.shape)

输出结果如下:

(198, 9)
(200, 9)

Step 6. Join cars1 and cars2 into a single DataFrame called cars

代码如下:

cars = cars1.append(cars2)   # cars1后追加cars2
# 或者cars = pd.concat([cars1, cars2], axis=0, ignore_index=True)
cars

输出结果如下:

mpgcylindersdisplacementhorsepowerweightaccelerationmodelorigincar
018.08307130350412.0701chevrolet chevelle malibu
115.08350165369311.5701buick skylark 320
218.08318150343611.0701plymouth satellite
316.08304150343312.0701amc rebel sst
417.08302140344910.5701ford torino
515.08429198434110.0701ford galaxie 500
614.0845422043549.0701chevrolet impala
714.0844021543128.5701plymouth fury iii
814.08455225442510.0701pontiac catalina
915.0839019038508.5701amc ambassador dpl
1015.08383170356310.0701dodge challenger se
1114.0834016036098.0701plymouth 'cuda 340
1215.0840015037619.5701chevrolet monte carlo
1314.08455225308610.0701buick estate wagon (sw)
1424.0411395237215.0703toyota corona mark ii
1522.0619895283315.5701plymouth duster
1618.0619997277415.5701amc hornet
1721.0620085258716.0701ford maverick
1827.049788213014.5703datsun pl510
1926.049746183520.5702volkswagen 1131 deluxe sedan
2025.0411087267217.5702peugeot 504
2124.0410790243014.5702audi 100 ls
2225.0410495237517.5702saab 99e
2326.04121113223412.5702bmw 2002
2421.0619990264815.0701amc gremlin
2510.08360215461514.0701ford f250
2610.08307200437615.0701chevy c20
2711.08318210438213.5701dodge d200
289.08304193473218.5701hi 1200d
2927.049788213014.5713datsun pl510
..............................
17027.0411288264018.6821chevrolet cavalier wagon
17134.0411288239518.0821chevrolet cavalier 2-door
17231.0411285257516.2821pontiac j2000 se hatchback
17329.0413584252516.0821dodge aries se
17427.0415190273518.0821pontiac phoenix
17524.0414092286516.4821ford fairmont futura
17623.04151?303520.5821amc concord dl
17736.0410574198015.3822volkswagen rabbit l
17837.049168202518.2823mazda glc custom l
17931.049168197017.6823mazda glc custom
18038.0410563212514.7821plymouth horizon miser
18136.049870212517.3821mercury lynx l
18236.0412088216014.5823nissan stanza xe
18336.0410775220514.5823honda accord
18434.0410870224516.9823toyota corolla
18538.049167196515.0823honda civic
18632.049167196515.7823honda civic (auto)
18738.049167199516.2823datsun 310 gx
18825.06181110294516.4821buick century limited
18938.0626285301517.0821oldsmobile cutlass ciera (diesel)
19026.0415692258514.5821chrysler lebaron medallion
19122.06232112283514.7821ford granada l
19232.0414496266513.9823toyota celica gt
19336.0413584237013.0821dodge charger 2.2
19427.0415190295017.3821chevrolet camaro
19527.0414086279015.6821ford mustang gl
19644.049752213024.6822vw pickup
19732.0413584229511.6821dodge rampage
19828.0412079262518.6821ford ranger
19931.0411982272019.4821chevy s-10

398 rows × 9 columns

Step 7. Ops there is a column missing, called owners. Create a random number Series from 15,000 to 73,000.

代码如下:

# 创建一个随机的Series,数的范围为15000到73000
my_owners = np.random.randint(15000, 73000, 398)
my_owners

输出结果如下:

array([30395, 42733, 44554, 34325, 50270, 60139, 24218, 25925, 42502,
       45041, 21449, 34472, 42783, 56380, 15707, 25707, 61160, 29297,
       42237, 72966, 71738, 56392, 69335, 17479, 30914, 29516, 36953,
       51000, 39315, 32876, 18305, 27092, 16590, 46419, 32564, 72843,
       46094, 50032, 22524, 16894, 54936, 18294, 44021, 42157, 61278,
       55678, 58345, 32391, 17736, 56275, 21903, 47867, 22928, 52829,
       67523, 55847, 25127, 40745, 23557, 54718, 18046, 35915, 65050,
       49568, 61822, 60210, 17202, 30865, 71921, 23434, 66579, 55818,
       56517, 33692, 55612, 32730, 22067, 65470, 50373, 58544, 38244,
       21356, 70010, 49500, 56970, 50040, 48606, 65609, 37288, 19547,
       32552, 71469, 69222, 36178, 44561, 40260, 44320, 28935, 57835,
       24374, 65163, 43465, 22097, 59672, 42933, 47359, 18186, 17173,
       66674, 55787, 29976, 40561, 36443, 68754, 48264, 31182, 70643,
       15752, 29759, 37604, 21019, 49529, 61506, 25802, 29858, 23015,
       27686, 65069, 62086, 33228, 16118, 71662, 70313, 24824, 24145,
       65096, 20493, 30484, 68996, 26227, 53008, 53758, 18948, 54496,
       64296, 50249, 35804, 44871, 39478, 63729, 19158, 21156, 64732,
       16314, 51031, 36171, 16193, 17655, 38842, 61288, 31747, 50995,
       63973, 52996, 26386, 57648, 26917, 59280, 30409, 27326, 48687,
       20302, 54604, 62031, 62863, 31196, 67807, 30862, 66646, 20763,
       65260, 66917, 67245, 26877, 24180, 70477, 46640, 36947, 16129,
       55475, 32569, 53886, 19898, 62866, 42115, 18904, 28941, 48321,
       28726, 19294, 17524, 30191, 29962, 64426, 60301, 71109, 70145,
       60671, 62912, 57491, 48347, 28355, 29315, 39817, 71448, 62550,
       59895, 17500, 21399, 52074, 32021, 54743, 67416, 27439, 18368,
       21339, 18891, 26910, 66961, 15866, 71688, 24802, 15530, 23647,
       44735, 72447, 64943, 67634, 67242, 61201, 36495, 42778, 66391,
       25980, 61012, 51792, 45485, 52052, 27935, 66677, 29556, 67718,
       63235, 66715, 39916, 54433, 63466, 61667, 21403, 53130, 45514,
       55541, 54951, 66835, 37705, 34943, 18583, 26945, 31816, 30104,
       52488, 46073, 39184, 26461, 64275, 60612, 27026, 37623, 22297,
       33671, 53580, 38553, 29536, 56143, 47368, 16612, 54661, 49403,
       70564, 30202, 56649, 26010, 65496, 63384, 17810, 64697, 48685,
       56686, 35658, 15539, 49614, 42165, 17433, 51415, 35637, 50719,
       47660, 15843, 45879, 41314, 39516, 61481, 68731, 29011, 51430,
       20347, 41176, 50809, 55824, 37399, 40692, 18155, 69199, 38232,
       32516, 57175, 38183, 21583, 66353, 18430, 16846, 61518, 70780,
       71784, 38712, 61313, 55800, 61001, 52706, 18203, 17225, 66550,
       34556, 25500, 65731, 15544, 69825, 68116, 34481, 60377, 29735,
       47846, 51439, 53054, 45308, 66654, 65698, 18421, 59846, 15493,
       53974, 41658, 30768, 23367, 15484, 28173, 18845, 15455, 42450,
       18834, 59814, 55643, 38475, 45623, 23382, 50896, 66593, 72178,
       29783, 39787, 46350, 42547, 65359, 62119, 53808, 45300, 48233,
       34077, 60663, 46497, 48174, 19764, 56893, 52080, 41104, 21126,
       56865, 39795])

Step 8. Add the column owners to cars

代码如下:

# 增加一列owners
cars['owners'] = my_owners
cars.tail()

输出结果如下:

mpgcylindersdisplacementhorsepowerweightaccelerationmodelorigincarowners
19527.0414086279015.6821ford mustang gl52080
19644.049752213024.6822vw pickup41104
19732.0413584229511.6821dodge rampage21126
19828.0412079262518.6821ford ranger56865
19931.0411982272019.4821chevy s-1039795

Exercise 2-Fictitious Names

Introduction:

This time you will create a data again

Special thanks to Chris Albon for sharing the dataset and materials.
All the credits to this exercise belongs to him.

In order to understand about it go here.

Step 1. Import the necessary libraries

代码如下:

import pandas as pd

Step 2. Create the 3 DataFrames based on the followin raw data

代码如下:

raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

Step 3. Assign each to a variable called data1, data2, data3

代码如下:

data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id', 'test_id'])
print(data1)
print(data2)
print(data3)

输出结果如下:

  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16

Step 4. Join the two dataframes along rows and assign all_data

代码如下:

# all_data = data1.append(data2)
# all_data = pd.merge(data1, data2, how='outer')
# 以上两种方法也都可
all_data = pd.concat([data1, data2])
all_data

输出结果如下:

subject_idfirst_namelast_name
01AlexAnderson
12AmyAckerman
23AllenAli
34AliceAoni
45AyoungAtiches
04BillyBonder
15BrianBlack
26BranBalwner
37BryceBrice
48BettyBtisan

Step 5. Join the two dataframes along columns and assing to all_data_col

代码如下:

# 按列合并并赋值给all_data_col
all_data_col = pd.concat([data1, data2], axis=1)
all_data_col

输出结果如下:

subject_idfirst_namelast_namesubject_idfirst_namelast_name
01AlexAnderson4BillyBonder
12AmyAckerman5BrianBlack
23AllenAli6BranBalwner
34AliceAoni7BryceBrice
45AyoungAtiches8BettyBtisan

Step 6. Print data3

代码如下:

data3

输出结果如下:

subject_idtest_id
0151
1215
2315
3461
4516
5714
6815
791
81061
91116

Step 7. Merge all_data and data3 along the subject_id value

代码如下:

all_data_3 = pd.merge(all_data, data3, on='subject_id') # 默认how='inner'
all_data_3

输出结果如下:

subject_idfirst_namelast_nametest_id
01AlexAnderson51
12AmyAckerman15
23AllenAli15
34AliceAoni61
44BillyBonder61
55AyoungAtiches16
65BrianBlack16
77BryceBrice14
88BettyBtisan15

Step 8. Merge only the data that has the same ‘subject_id’ on both data1 and data2

代码如下:

data = pd.merge(data1, data2, on='subject_id', how='inner')
data

输出结果如下:

subject_idfirst_name_xlast_name_xfirst_name_ylast_name_y
04AliceAoniBillyBonder
15AyoungAtichesBrianBlack

Step 9. Merge all values in data1 and data2, with matching records from both sides where available.

代码如下:

# 合并 data1 和 data2 中的所有值,并在可用的情况下使用双方的匹配记录。
pd.merge(data1, data2, on='subject_id', how='outer')  # 可通过suffixes=['_A', '_B']设置保证合并不重复

输出结果如下:

subject_idfirst_name_xlast_name_xfirst_name_ylast_name_y
01AlexAndersonNaNNaN
12AmyAckermanNaNNaN
23AllenAliNaNNaN
34AliceAoniBillyBonder
45AyoungAtichesBrianBlack
56NaNNaNBranBalwner
67NaNNaNBryceBrice
78NaNNaNBettyBtisan

Exercise 3-Housing Market

Introduction:

This time we will create our own dataset with fictional numbers to describe a house market. As we are going to create random data don’t try to reason of the numbers.

Step 1. Import the necessary libraries

代码如下:

import pandas as pd
import numpy as np

Step 2. Create 3 differents Series, each of length 100, as follows:

  1. The first a random number from 1 to 4
  2. The second a random number from 1 to 3
  3. The third a random number from 10,000 to 30,000

代码如下:

s1 = pd.Series(np.random.randint(1, 4, 100))
s2 = pd.Series(np.random.randint(1, 3, 100))
s3 = pd.Series(np.random.randint(10000, 30000, 100))
print(s1, s2, s3)

输出结果如下:

0     2
1     3
2     1
3     3
4     1
5     1
6     2
7     1
8     1
9     1
10    1
11    3
12    1
13    2
14    3
15    2
16    1
17    1
18    3
19    3
20    1
21    3
22    3
23    1
24    1
25    2
26    1
27    1
28    2
29    1
     ..
70    1
71    1
72    3
73    2
74    2
75    1
76    2
77    1
78    3
79    2
80    3
81    3
82    3
83    2
84    1
85    3
86    2
87    1
88    3
89    3
90    1
91    3
92    2
93    3
94    1
95    2
96    3
97    2
98    3
99    1
Length: 100, dtype: int32 0     1
1     2
2     1
3     2
4     1
5     2
6     1
7     1
8     1
9     2
10    2
11    1
12    1
13    2
14    2
15    2
16    1
17    2
18    1
19    1
20    2
21    2
22    1
23    1
24    1
25    1
26    1
27    2
28    1
29    1
     ..
70    2
71    2
72    1
73    1
74    1
75    1
76    2
77    2
78    2
79    2
80    1
81    2
82    1
83    2
84    1
85    1
86    2
87    2
88    1
89    2
90    1
91    2
92    1
93    1
94    1
95    2
96    1
97    1
98    2
99    2
Length: 100, dtype: int32 0     11973
1     10804
2     26866
3     25940
4     23147
5     14552
6     22151
7     19312
8     25373
9     29329
10    17069
11    19629
12    26174
13    20524
14    16489
15    22613
16    25266
17    11566
18    28599
19    27562
20    12922
21    29055
22    12709
23    21727
24    16735
25    20818
26    20199
27    21400
28    21602
29    16792
      ...  
70    10076
71    20091
72    28284
73    12185
74    15879
75    12907
76    24946
77    20168
78    24435
79    12175
80    18286
81    18001
82    10938
83    19116
84    12802
85    11623
86    15048
87    10624
88    18989
89    19797
90    17798
91    21317
92    27047
93    25692
94    27564
95    23411
96    18808
97    16854
98    21737
99    18968
Length: 100, dtype: int32

Step 3. Let’s create a DataFrame by joinning the Series by column

代码如下:

housemkt = pd.concat([s1, s2, s3], axis=1)
housemkt.head()

输出结果如下:

012
02111973
13210804
21126866
33225940
41123147

Step 4. Change the name of the columns to bedrs, bathrs, price_sqr_meter

代码如下:

'''
rename函数主要用到的参数有:
columns:列名
index:行名
axis:指定坐标轴
inplace:是否替换,默认为False。inplace为False时返回修改后结果,变量自身不修改。inplace为True时返回None,变量自身被修改。
'''
housemkt.rename(columns={0: 'bedrs', 1: 'bathrs', 2: 'price_sqr_meter'}, inplace = True)
housemkt.head()

输出结果如下:

bedrsbathrsprice_sqr_meter
02111973
13210804
21126866
33225940
41123147

Step 5. Create a one column DataFrame with the values of the 3 Series and assign it to ‘bigcolumn’

代码如下:

bigcolumn = pd.concat([s1, s2, s3], axis=0)
bigcolumn = bigcolumn.to_frame()   # 可以将数组转换为DataFrame格式
print(type(bigcolumn))
bigcolumn

输出结果如下:

<class 'pandas.core.frame.DataFrame'>
0
02
13
21
33
41
51
62
71
81
91
101
113
121
132
143
152
161
171
183
193
201
213
223
231
241
252
261
271
282
291
......
7010076
7120091
7228284
7312185
7415879
7512907
7624946
7720168
7824435
7912175
8018286
8118001
8210938
8319116
8412802
8511623
8615048
8710624
8818989
8919797
9017798
9121317
9227047
9325692
9427564
9523411
9618808
9716854
9821737
9918968

300 rows × 1 columns

Step 6. Ops it seems it is going only until index 99. Is it true?

代码如下:

len(bigcolumn)

输出结果如下:

300

Step 7. Reindex the DataFrame so it goes from 0 to 299

代码如下:

# reset_index()函数,重置索引后,drop参数默认为False,想要删除原先的索引列要置为True.想要在原数据上修改要inplace=True.特别是不赋值的情况必须要加,否则drop无效
'''
set_index():
函数原型:DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False) 
参数解释:
keys:列标签或列标签/数组列表,需要设置为索引的列
drop:默认为True,删除用作新索引的列
append:默认为False,是否将列附加到现有索引
inplace:默认为False,适当修改DataFrame(不要创建新对象)
verify_integrity:默认为false,检查新索引的副本。否则,请将检查推迟到必要时进行。将其设置为false将提高该方法的性能。

reset_index():
函数原型:DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
参数解释:
level:int、str、tuple或list,默认无,仅从索引中删除给定级别。默认情况下移除所有级别。控制了具体要还原的那个等级的索引 
drop:drop为False则索引列会被还原为普通列,否则会丢失
inplace:默认为false,适当修改DataFrame(不要创建新对象)
col_level:int或str,默认值为0,如果列有多个级别,则确定将标签插入到哪个级别。默认情况下,它将插入到第一级。
col_fill:对象,默认‘’,如果列有多个级别,则确定其他级别的命名方式。如果没有,则重复索引名
注:reset_index还原分为两种类型,第一种是对原DataFrame进行reset,第二种是对使用过set_index()函数的DataFrame进行reset
'''
bigcolumn.reset_index(drop=True, inplace=True)
bigcolumn

输出结果如下:

0
02
13
21
33
41
51
62
71
81
91
101
113
121
132
143
152
161
171
183
193
201
213
223
231
241
252
261
271
282
291
......
27010076
27120091
27228284
27312185
27415879
27512907
27624946
27720168
27824435
27912175
28018286
28118001
28210938
28319116
28412802
28511623
28615048
28710624
28818989
28919797
29017798
29121317
29227047
29325692
29427564
29523411
29618808
29716854
29821737
29918968

300 rows × 1 columns

Conclusion

今天主要练习了合并函数的操作以及其他相关函数的使用。再次提醒本专栏pandas使用了anaconda—jupyter notebook。推荐使用,好用极了!

以上是关于Python数据分析pandas入门练习题的主要内容,如果未能解决你的问题,请参考以下文章

Python数据分析pandas入门练习题

Python数据分析pandas入门练习题

Python数据分析pandas入门练习题

Python数据分析pandas入门练习题

Python数据分析pandas入门练习题

Python数据分析pandas入门练习题