Python数据分析pandas入门练习题
Posted Geek_bao
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Python数据分析pandas入门练习题相关的知识,希望对你有一定的参考价值。
Python数据分析基础
- Exercise 1
- Step 1. Go to https://www.kaggle.com/openfoodfacts/world-food-facts/data
- Step 2. Download the dataset to your computer and unzip it.
- Step 3. Use the tsv file and assign it to a dataframe called food
- Step 4. See the first 5 entries
- Step 5. What is the number of observations in the dataset?
- Step 6. What is the number of columns in the dataset?
- Step 7. Print the name of all the columns.
- Step 8. What is the name of 105th column?
- Step 9. What is the type of the observations of the 105th column?
- Step 10. How is the dataset indexed?
- Step 11. What is the product name of the 19th observation?
- Exercise 2 - Getting and Knowing your Data
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv).
- Step 3. Assign it to a variable called chipo.
- Step 4. See the first 10 entries
- Step 5. What is the number of observations in the dataset?
- Step 6. What is the number of columns in the dataset?
- Step 7. Print the name of all the columns.
- Step 8. How is the dataset indexed?
- Step 9. Which was the most-ordered item?
- Step 10. For the most-ordered item, how many items were ordered?
- Step 11. What was the most ordered item in the choice_description column?
- Step 12. How many items were orderd in total?
- Step 13. Turn the item price into a float
- Step 14. How much was the revenue for the period in the dataset?
- Step 15. How many orders were made in the period?
- Step 16. What is the average revenue amount per order?
- Step 17. How many different items are sold?
- Exercise 3 - Getting and Knowing your Data
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).
- Step 3. Assign it to a variable called users and use the 'user_id' as index
- Step 4. See the first 25 entries
- Step 5. See the last 10 entries
- Step 6. What is the number of observations in the dataset?
- Step 7. What is the number of columns in the dataset?
- Step 8. Print the name of all the columns.
- Step 9. How is the dataset indexed?
- Step 10. What is the data type of each column?
- Step 11. Print only the occupation column
- Step 12. How many different occupations there are in this dataset?
- Step 13. What is the most frequent occupation?
- Step 14. Summarize the DataFrame.
- Step 15. Summarize all the columns
- Step 16. Summarize only the occupation column
- Step 17. What is the mean age of users?
- Step 18. What is the age with least occurrence?
- 结语
Exercise 1
Step 1. Go to https://www.kaggle.com/openfoodfacts/world-food-facts/data
Step 2. Download the dataset to your computer and unzip it.
Step 3. Use the tsv file and assign it to a dataframe called food
代码如下:
# 从上面网址下载数据集并解压,然后将其读取并赋值给food
import pandas as pd
import numpy as np
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\\t')
输出结果如下:
D:\\Anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3020: DtypeWarning: Columns (0,3,5,19,20,24,25,26,27,28,36,37,38,39,48) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
Step 4. See the first 5 entries
代码如下:
# 取前五条数据
food.head()
输出结果如下:
code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | product_name | generic_name | quantity | ... | fruits-vegetables-nuts_100g | fruits-vegetables-nuts-estimate_100g | collagen-meat-protein-ratio_100g | cocoa_100g | chlorophyl_100g | carbon-footprint_100g | nutrition-score-fr_100g | nutrition-score-uk_100g | glycemic-index_100g | water-hardness_100g | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3087 | http://world-en.openfoodfacts.org/product/0000... | openfoodfacts-contributors | 1474103866 | 2016-09-17T09:17:46Z | 1474103893 | 2016-09-17T09:18:13Z | Farine de blé noir | NaN | 1kg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 4530 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | Banana Chips Sweetened (Whole) | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 14.0 | 14.0 | NaN | NaN |
2 | 4559 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | Peanuts | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN |
3 | 16087 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055731 | 2017-03-09T10:35:31Z | 1489055731 | 2017-03-09T10:35:31Z | Organic Salted Nut Mix | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | 12.0 | NaN | NaN |
4 | 16094 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055653 | 2017-03-09T10:34:13Z | 1489055653 | 2017-03-09T10:34:13Z | Organic Polenta | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 163 columns
Step 5. What is the number of observations in the dataset?
代码如下:
# 该数据集有多少条数据
food.shape[0]
输出结果如下:
356027
Step 6. What is the number of columns in the dataset?
代码如下:
# 该数据集有多少列
food.shape[1]
food.info()
输出结果如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356027 entries, 0 to 356026
Columns: 163 entries, code to water-hardness_100g
dtypes: float64(107), object(56)
memory usage: 442.8+ MB
Step 7. Print the name of all the columns.
代码如下:
# 所有的列名
food.columns
输出结果如下:
Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
'last_modified_t', 'last_modified_datetime', 'product_name',
'generic_name', 'quantity',
...
'fruits-vegetables-nuts_100g', 'fruits-vegetables-nuts-estimate_100g',
'collagen-meat-protein-ratio_100g', 'cocoa_100g', 'chlorophyl_100g',
'carbon-footprint_100g', 'nutrition-score-fr_100g',
'nutrition-score-uk_100g', 'glycemic-index_100g',
'water-hardness_100g'],
dtype='object', length=163)
Step 8. What is the name of 105th column?
代码如下:
# 第105列的列名
food.columns[104]
输出结果如下:
'-glucose_100g'
Step 9. What is the type of the observations of the 105th column?
代码如下:
# 第105列的数据类型
food.dtypes[food.columns[104]]
输出结果如下:
dtype('float64')
Step 10. How is the dataset indexed?
代码如下:
# 该数据集是怎样索引的
food.index
输出结果如下:
RangeIndex(start=0, stop=356027, step=1)
Step 11. What is the product name of the 19th observation?
代码如下:
# 第19条数据的product_name是什么
food.values[18][7]
输出结果如下:
'Lotus Organic Brown Jasmine Rice'
后面两行代码效果一样,用的函数不一样。
food.iloc[18]['product_name']
'Lotus Organic Brown Jasmine Rice'
food.loc[18]['product_name']
'Lotus Organic Brown Jasmine Rice'
Exercise 2 - Getting and Knowing your Data
This time we are going to pull data directly from the internet.
Step 1. Import the necessary libraries
代码如下:
# 导入要用的库
import pandas as pd
import numpy as np
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called chipo.
代码如下:
# 导入数据集,并赋给变量chipo
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep = '\\t')
Step 4. See the first 10 entries
代码如下:
# 取前十条数据
chipo.head(10)
输出结果如下:
order_id | quantity | item_name | choice_description | item_price | |
---|---|---|---|---|---|
0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 |
6 | 3 | 1 | Side of Chips | NaN | $1.69 |
7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 |
8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 |
9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 |
Step 5. What is the number of observations in the dataset?
代码如下:
# Solution 1
# 求数据集的观察数
chipo.shape[0] # entries <=4622 observations
输出结果如下:
下面代码效果和上面类似。
4622
# Solution 2
chipo.info() # entries <=4622 observations
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id 4622 non-null int64
quantity 4622 non-null int64
item_name 4622 non-null object
choice_description 3376 non-null object
item_price 4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.6+ KB
Step 6. What is the number of columns in the dataset?
代码如下:
# 求数据集的列数
chipo.shape[1]
输出结果如下:
5
Step 7. Print the name of all the columns.
代码如下:
# 打印数据集的所有列名
chipo.columns
输出结果如下:
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
Step 8. How is the dataset indexed?
代码如下:
# 数据集是如何索引的
chipo.index
输出结果如下:
RangeIndex(start=0, stop=4622, step=1)
Step 9. Which was the most-ordered item?
代码如下:
# 求最多订单的商品
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)
输出结果如下:
order_id | quantity | |
---|---|---|
item_name | ||
Chicken Bowl | 713926 | 761 |
Step 10. For the most-ordered item, how many items were ordered?
代码入下:
# 最多订购的商品被订购次数,同上题
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)
输出结果如下:
order_id | quantity | |
---|---|---|
item_name | ||
Chicken Bowl | 713926 | 761 |
Step 11. What was the most ordered item in the choice_description column?
代码如下:
# choice_description 列中订购最多的商品是什么?
c = chipo.groupby('choice_description').sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)
输出结果如下:
order_id | quantity | |
---|---|---|
choice_description | ||
[Diet Coke] | 123455 | 159 |
Step 12. How many items were orderd in total?
代码如下:
# 总共订购了多少件商品?
total_items_orders = chipo.quantity.sum()
total_items_orders
输出结果如下:
4972
Step 13. Turn the item price into a float
Step 13.a. Check the item price type
代码如下:
# 把tiem price转化为浮点型并检查
chipo.item_price.dtype
输出结果如下:
dtype('O')
Step 13.b. Create a lambda function and change the type of item price
代码如下:
dollarizer = lambda x: float(x[1:-1])
chipo.item_price = chipo.item_price.apply(dollarizer)
Step 13.c. Check the item price type
代码如下:
chipo.item_price.dtype
输出结果如下:
dtype('float64')
Step 14. How much was the revenue for the period in the dataset?
代码如下:
# 数据集中该时期的收入是多少?
revenue = (chipo['quantity'] * chipo['item_price']).sum()
print('Revenue was: $' + str(np.round(revenue, 2)))
# round() 方法返回浮点数x的四舍五入值。2表示保留小数点后两位数
输出结果如下:
Revenue was: $39237.02
Step 15. How many orders were made in the period?
代码如下:
# 期间有多少订单?
orders = chipo.order_id.value_counts().count()
# value_counts()查看chipo中有哪些不同的值,并计算每个值有多少个重复值,count()计算有多少不同的值
orders
输出结果如下:
1834
Step 16. What is the average revenue amount per order?
代码如下:
# Solution 1
# 每个订单的平均收入是多少?
chipo['revenue'] = (chipo['quantity'] * chipo['item_price'])
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['revenue']
输出结果如下:
21.394231188658654
下面代码和上面作用相同。
# Solution 2
chipo.groupby(by=['order_id']).sum().mean()['revenue']
21.394231188658654
Step 17. How many different items are sold?
代码如下:
# 售出多少种不同的商品?
sold_items = chipo.item_name.value_counts().count()
sold_items
输出结果如下:
50
Exercise 3 - Getting and Knowing your Data
This time we are going to pull data directly from the internet.
(后面的题和前面类似,这里不做解释,英文直接做吧)
Step 1. Import the necessary libraries
代码如下
import pandas as pd
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called users and use the ‘user_id’ as index
代码入戏:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
users = pd.read_table(url, sep='|', index_col = 'user_id')
# sep/delimiter 用于对行中各字段进行拆分的字符序列
Step 4. See the first 25 entries
代码如下:
users.head(25)
输出结果如下:
age | gender | occupation | zip_code | |
---|---|---|---|---|
user_id | ||||
1 | 24 | M | technician | 85711 |
2 | 53 | F | other | 94043 |
3 | 23 | M | writer | 32067 |
4 | 24 | M | technician | 43537 |
5 | 33 | F | other | 15213 |
6 | 42 | M | executive | 98101 |
7 | 57 | M | administrator | 91344 |
8 | 36 | M | administrator | 05201 |
9 | 29 | M | student | 01002 |
10 | 53 | M | lawyer | 90703 |
11 | 39 | F | other | 30329 |
12 | 28 | F | other | 06405 |
13 | 47 | M | educator | 29206 |
14 | 45 | M | scientist | 55106 |
15 | 49 | F | educator | 97301 |
16 | 21 | M | entertainment | 10309 |
17 | 30 | M | programmer | 06355 |
18 | 35 | F | other | 37212 |
19 | 40 | M | librarian | 02138 |
20 | 42 | F | homemaker | 95660 |
21 | 26 | M | writer | 30068 |
22 | 25 | M | writer | 40206 |
23 | 30 | F | artist | 48197 |
24 | 21 | F | artist | 94533 |
25 | 39 | M | engineer | 55107 |
Step 5. See the last 10 entries
代码如下:
users.tail(10)
输出结果如下:
age | gender | occupation | zip_code | |
---|---|---|---|---|
user_id | ||||
934 | 61 | M | engineer | 22902 |
935 | 42 | M | doctor | 66221 |
936 | 24 | M | other | 32789 |
937 | 48 | M | educator | 98072 |
938 | 38 | F | technician | 55038 |
939 | 26 | F | student | 33319 |
940 | 32 | M | administrator | 02215 |
941 | 20 | M | student | 97229 |
942 | 48 | F | librarian | 78209 |
943 | 22 | M | student | 77841 |
Step 6. What is the number of observations in the dataset?
代码入戏:
users.shape[0]
输出结果如下:
943
Step 7. What is the number of columns in the dataset?
代码如下:
users.shape[1]
输出结果如下:
4
Step 8. Print the name of all the columns.
代码如下:
users.columns
输出结果如下:
Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')
Step 9. How is the dataset indexed?
代码入戏:
users.index
输出结果如下:
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
...
934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
dtype='int64', name='user_id', length=943)
Step 10. What is the data type of each column?
代码入下:
users.dtypes
输出结果如下:
age int64
gender object
occupation object
zip_code object
dtype: object
Step 11. Print only the occupation column
代码入下:
users['occupation'] # 或者users.occupation
输出结果如下:
user_id
1 technician
2 other
3 writer
4 technician
5 other
6 executive
7 administrator
8 administrator
9 student
10 lawyer
11 other
12 other
13 educator
14 scientist
15 educator
16 entertainment
17 programmer
18 other
19 librarian
20 homemaker
21 writer
22 writer
23 artist
24 artist
25 engineer
26 engineer
27 librarian
28 writer
29 programmer
30 student
...
914 other
915 entertainment
916 engineer
917 student
918 scientist
919 other
920 artist
921 student
922 administrator
923 student
924 other
925 salesman
926 entertainment
927 programmer
928 student
929 scientist
930 scientist
931 educator
932 educator
933 student
934 engineer
935 doctor
936 other
937 educator
938 technician
939 student
940 administrator
941 student
942 librarian
943 student
Name: occupation, Length: 943, dtype: object
Step 12. How many different occupations there are in this dataset?
代码如下:
users.occupation.nunique()
输出结果如下:
21
Step 13. What is the most frequent occupation?
代码如下:
users.occupation.value_counts().head()
输出结果如下:
student 196
other 105
educator 95
administrator 79
engineer 67
Name: occupation, dtype: int64
Step 14. Summarize the DataFrame.
代码如下:
users.describe()
输出结果如下:
age | |
---|---|
count | 943.000000 |
mean | 34.051962 |
std | 12.192740 |
min | 7.000000 |
25% | 25.000000 |
50% | 31.000000 |
75% | 43.000000 |
max | 73.000000 |
Step 15. Summarize all the columns
代码如下:
users.describe(include='all') #Notice: By default, only the numeric columns are returned.
输出结果如下:
age | gender | occupation | zip_code | |
---|---|---|---|---|
count | 943.000000 | 943 | 943 | 943 |
unique | NaN | 2 | 21 | 795 |
top | NaN | M | student | 55414 |
freq | NaN | 670 | 196 | 9 |
mean | 34.051962 | NaN | NaN | NaN |
std | 12.192740 | NaN | NaN | NaN |
min | 7.000000 | NaN | NaN | NaN |
25% | 25.000000 | NaN | NaN | NaN |
50% | 31.000000 | NaN | NaN | NaN |
75% | 43.000000 | NaN | NaN | NaN |
max | 73.000000 | NaN | NaN | NaN |
Step 16. Summarize only the occupation column
代码如下:
users.occupation.describe()
输出结果如下:
count 943
unique 21
top student
freq 196
Name: occupation, dtype: object
Step 17. What is the mean age of users?
代码如下:
round(users.age.mean())
输出结果如下:
34
Step 18. What is the age with least occurrence?
代码如下:
users.age.value_counts().tail() #7, 10, 11, 66 and 73 years -> only 1 occurrence
输出结果如下:
11 1
10 1
73 1
66 1
7 1
Name: age, dtype: int64
结语
本文使用anaconda的jupyter notebook编写。后面文章也会继续使用,真的好用,推荐给大家,一起学习呀!!!
以上是关于Python数据分析pandas入门练习题的主要内容,如果未能解决你的问题,请参考以下文章