Hands-on data analysis, Chapter 1

Posted 2022-11-27 沧夜2021

1.1. Loading data

Import the required modules before running any of the steps that follow:

import numpy as np
import pandas as pd

Each file type is loaded with its own reader function.

For a CSV file, use:

pd.read_csv('train.csv')

For other file types, see the official pandas documentation:

IO tools (text, CSV, HDF5, …) — pandas 1.4.2 documentation (pydata.org)

Format Type	Data Description	Reader	Writer
text	CSV	read_csv	to_csv
text	Fixed-Width Text File	read_fwf
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	LaTeX		Styler.to_latex
text	XML	read_xml	to_xml
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	OpenDocument	read_excel
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	ORC Format	read_orc
binary	Stata	read_stata	to_stata
binary	SAS	read_sas
binary	SPSS	read_spss
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google BigQuery	read_gbq	to_gbq

Use the table above to find the reader and writer for a given file format.
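
As a small illustration of the reader/writer pairs in the table, the sketch below round-trips a tiny DataFrame through CSV and JSON text held in memory. The sample data and variable names are made up for this example; only read_csv/to_csv and read_json/to_json come from the table.

```python
import io

import pandas as pd

# Hypothetical sample data, not the Titanic file.
sample = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})

# Writer -> reader round trip through CSV text.
csv_text = sample.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# The same round trip through JSON text.
json_text = sample.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
```

Wrapping the text in io.StringIO lets the readers treat it like a file, so the example needs no files on disk.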

1.2. Renaming columns and redefining the index

df = pd.read_csv('train.csv', names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)

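names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'] replaces the column names with Chinese equivalents, in the same order as the original PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked.

index_col='乘客ID' uses the '乘客ID' column as the index.

header=0 tells pandas that the first row of the file holds the original column names, so that row is replaced by names instead of being read as data.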
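
To see how the three parameters interact, here is a minimal sketch on a made-up three-column CSV held in memory. The file contents and the new names id, survived and age are assumptions for illustration; the real train.csv has twelve columns.

```python
import io

import pandas as pd

# Hypothetical three-column file standing in for train.csv.
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n"

renamed = pd.read_csv(
    io.StringIO(csv_text),
    names=["id", "survived", "age"],  # new column names (made up here)
    index_col="id",                   # use the renamed column as the index
    header=0,                         # row 0 is the old header, so it is replaced
)
print(renamed)
```

Without header=0, pandas would treat the old header row as the first row of data.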

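1.3. Viewing basic information about the data

df.info() prints a summary of the data: the index range, the non-null count and dtype of each column, and the memory usage. The output below comes from the DataFrame loaded with plain pd.read_csv('train.csv'), so it shows the original English column names and a default RangeIndex. The trailing None is the return value of info(), which appears when the call is wrapped in print().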

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

1.4. Viewing only the first or last few rows

df.head(10) shows the first ten rows.

df.tail(15) shows the last fifteen rows.
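
A quick sketch of both calls on a made-up 20-row frame (the name numbers and its contents are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with 20 rows, indexed 0..19.
numbers = pd.DataFrame({"n": range(20)})

first = numbers.head(3)   # rows 0, 1, 2
last = numbers.tail(2)    # rows 18, 19
default = numbers.head()  # with no argument, head() returns 5 rows

print(first)
print(last)
```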

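1.5. Checking whether values are missing

df.isnull().head() shows where values are missing in the first five rows. Note that it returns True or False for each cell (True where the value is missing), not the values themselves.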

	是否幸存	仓位等级	姓名	性别	年龄	兄弟姐妹个数	父母子女个数	船票信息	票价	客舱	登船港口
乘客ID											
1	False	False	False	False	False	False	False	False	False	True	False
2	False	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	True	False
4	False	False	False	False	False	False	False	False	False	False	False
5	False	False	False	False	False	False	False	False	False	True	False
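
Because the result is boolean, it can be summed to count missing values per column. A minimal sketch on a made-up frame (toy and its values are assumptions, not rows from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

mask = toy.isnull()              # True where a value is missing
missing_per_column = mask.sum()  # True counts as 1, False as 0

print(mask)
print(missing_per_column)
```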

2.1. Sorting data

frame = pd.DataFrame(np.arange(8).reshape((2, 4)), 
                     index=['2', '1'], 
                     columns=['d', 'a', 'b', 'c'])
frame

The resulting frame is:

	d	a	b	c
2	0	1	2	3
1	4	5	6	7

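pd.DataFrame creates a DataFrame object.

np.arange(8).reshape((2, 4)) generates a two-dimensional array with 2 rows and 4 columns (2*4). The first row is 0, 1, 2, 3 and the second row is 4, 5, 6, 7.

index=['2', '1'] sets the row index labels of the DataFrame.
columns=['d', 'a', 'b', 'c'] sets the column names of the DataFrame.

Sorting code: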

frame.sort_values(by='c', ascending=True)

The output is:

	d	a	b	c
2	0	1	2	3
1	4	5	6	7

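The rows are ordered by the values in column c, in ascending order.

The by parameter names the column to sort on, and the ascending parameter sets the direction (ascending or descending).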
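
As a sketch of the other settings of these two parameters, the example below sorts the same frame in descending order and then by two columns at once. Passing lists to by and ascending is standard pandas usage, but the column choices here are only for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['2', '1'],
                     columns=['d', 'a', 'b', 'c'])

# Descending order on column c: the row with c == 7 comes first.
descending = frame.sort_values(by='c', ascending=False)

# Sort on several columns: a ascending first, then c descending for ties.
by_two = frame.sort_values(by=['a', 'c'], ascending=[True, False])

print(descending)
print(by_two)
```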

# Sort by row index
frame.sort_index()

	d	a	b	c
1	4	5	6	7
2	0	1	2	3

The rows are now ordered by ascending index label.

# Sort by column index
frame.sort_index(axis=1)

	a	b	c	d
2	1	2	3	0
1	5	6	7	4
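
One detail worth checking is that sort_index returns a new DataFrame and leaves the original untouched. A short sketch, reusing the frame defined above (ascending=False is used here only to show the option):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['2', '1'],
                     columns=['d', 'a', 'b', 'c'])

rows_sorted = frame.sort_index()                            # row labels in ascending order
cols_reversed = frame.sort_index(axis=1, ascending=False)   # column labels in descending order

# frame itself keeps its original row and column order.
print(list(frame.index), list(frame.columns))
print(cols_reversed)
```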

2.2. Adding DataFrames

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                     columns=['a', 'b', 'c'],
                     index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=['a', 'e', 'c'],
                     index=['first', 'one', 'two', 'second'])

# frame1_a
	a	b	c
one	0.0	1.0	2.0
two	3.0	4.0	5.0
three	6.0	7.0	8.0

# frame1_b
	a	e	c
first	0.0	1.0	2.0
one	3.0	4.0	5.0
two	6.0	7.0	8.0
second	9.0	10.0	11.0

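Running frame1_a + frame1_b gives the result below. Addition aligns on both row and column labels, so any cell whose label is missing from either frame becomes NaN: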

		a	b	c	e
first	NaN	NaN	NaN	NaN
one	3.0	NaN	7.0	NaN
second	NaN	NaN	NaN	NaN
three	NaN	NaN	NaN	NaN
two	9.0	NaN	13.0	NaN
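
If the NaN values are not wanted, one option is DataFrame.add with fill_value, which treats a value missing from one frame as the fill value before adding. This is a sketch of that alternative, not something the original exercise uses; cells missing from both frames still come out as NaN.

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])

plain = frame1_a + frame1_b                     # NaN wherever a label is missing on either side
filled = frame1_a.add(frame1_b, fill_value=0)   # a value missing on one side counts as 0

print(plain)
print(filled)
```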

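2.3. Summary statistics for frame2

describe() returns summary statistics for the data:

count : number of non-missing values
mean : mean of the values
std : standard deviation of the values
min : minimum value
25% : 25th percentile (first quartile)
50% : 50th percentile (median)
75% : 75th percentile (third quartile)
max : maximum value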

frame2 = pd.DataFrame([[1.4, np.nan], 
                       [7.1, -4.5],
                       [np.nan, np.nan], 
                       [0.75, -1.3]
                      ], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2.describe()

		one			two
count	3.000000	2.000000
mean	3.083333	-2.900000
std		3.493685	2.262742
min		0.750000	-4.500000
25%		1.075000	-3.700000
50%		1.400000	-2.900000
75%		4.250000	-2.100000
max		7.100000	-1.300000
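
The rows of describe() can be checked against the corresponding Series methods. The sketch below recomputes a few of them for column one; the variable names summary and one are only for this example.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1.4, np.nan],
                       [7.1, -4.5],
                       [np.nan, np.nan],
                       [0.75, -1.3]
                      ], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

summary = frame2.describe()
one = frame2['one']

# Each statistic skips NaN, so column one is summarised over 3 values.
print(one.count())          # matches summary.loc['count', 'one']
print(one.mean())           # matches summary.loc['mean', 'one']
print(one.std())            # matches summary.loc['std', 'one']
print(one.quantile(0.25))   # matches summary.loc['25%', 'one']
```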

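3.1. Basic pandas data structures

pandas has two core data structures: DataFrame and Series.

A Series is a one-dimensional structure made up of an index and values.
A DataFrame is a two-dimensional structure that has columns in addition to an index and values.

A DataFrame is made up of multiple Series, one per column.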
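
The relationship can be sketched by building a DataFrame from two Series and pulling one column back out. The values and labels below are made up for illustration.

```python
import pandas as pd

# Two hypothetical Series that share the same index.
ages = pd.Series([22, 38, 26], index=['a', 'b', 'c'], name='Age')
fares = pd.Series([7.25, 71.28, 7.92], index=['a', 'b', 'c'], name='Fare')

# Each Series becomes one column of the DataFrame.
table = pd.DataFrame({'Age': ages, 'Fare': fares})

# Selecting a single column gives back a Series.
column = table['Age']
print(type(column))
print(table)
```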

3.2. DataFrame column names

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

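This outputs the column names of the DataFrame.

3.3. Viewing the values of a column

# View the first 3 values of the "Cabin" column
df['Cabin'].head(3)
# Equivalent attribute access: df.Cabin.head(3)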

0    NaN
1    C85
2    NaN
Name: Cabin, dtype: object

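3.4. Deleting a column

# test_1 and df are assumed to be DataFrames that contain a column 'a'
del test_1['a']

df.drop(['a'], axis=1, inplace=True)
df.head(3)

inplace=True modifies the original DataFrame in place. drop() then returns None, so head(3) is called on df in a separate statement rather than chained onto drop().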
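
The difference between the two forms of drop can be sketched on a made-up frame (toy and its columns are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Without inplace, drop returns a new DataFrame and toy is unchanged.
without_a = toy.drop(columns=['a'])

# With inplace=True, toy itself is modified and the call returns None.
result = toy.drop(['b'], axis=1, inplace=True)

print(list(without_a.columns))
print(result)
print(list(toy.columns))
```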

3.5. Filtering data

df[df["Age"]<10].head(3)

midage = df[(df["Age"]>10)& (df["Age"]<50)]
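
The filter works by building a boolean mask and indexing with it. The sketch below uses a made-up Age column to show which rows survive; note that each comparison needs its own parentheses because & binds more tightly than > and <.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value and a boundary value.
people = pd.DataFrame({'Age': [4.0, 30.0, np.nan, 62.0, 10.0]})

# True only where both conditions hold; comparisons with NaN are False.
mask = (people['Age'] > 10) & (people['Age'] < 50)
midage = people[mask]

print(mask.tolist())
print(midage)
```

The filtered frame keeps the row labels of the original, which matters for label-based selection later on.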

3.6. The loc and iloc methods

Use loc to display the "Pclass", "Name" and "Sex" values for rows 100, 105 and 108 of midage:

midage.loc[[100,105,108],['Pclass','Name','Sex']]

Use iloc to display the "Pclass", "Name" and "Sex" values for rows 100, 105 and 108 of midage:

midage.iloc[[100,105,108],[2,3,4]]

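iloc selects by integer position, whereas loc selects by label (row index labels and column names). Because midage keeps the row labels of df after filtering, the two calls above return the same rows once the index of midage has been reset, for example with midage = midage.reset_index(drop=True).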
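
A small sketch of the label/position distinction on a made-up frame (people and its values are assumptions; the Titanic data is not needed to see the effect):

```python
import pandas as pd

people = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                       'Age': [5, 30, 70, 40]})

# Filtering keeps the original row labels, here 1 and 3.
midage = people[(people['Age'] > 10) & (people['Age'] < 50)]

by_label = midage.loc[[1, 3], ['Name']]    # row labels and column name
by_position = midage.iloc[[0, 1], [0]]     # row positions and column position

# After resetting the index, labels and positions coincide again.
reset = midage.reset_index(drop=True)

print(by_label)
print(by_position)
print(reset)
```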

References

hands-on-data-analysis, Unit 1 - 飞桨AI Studio (baidu.com)

DATAWHALE - a community that loves learning (linklearner.com)

pandas.read_csv — pandas 1.4.2 documentation (pydata.org)

IO tools (text, CSV, HDF5, …) — pandas 1.4.2 documentation (pydata.org)
