R Language: The dplyr Package

Posted 2021-04-18 大康的笔记

Ding~ Let me start by introducing a man who has been called godlike~

Here he is~ Hadley Wickham

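Hadley Wickham is a famous figure in the R data science world, known as "the man who changed R". He is an outstanding R package developer whose popular work includes ggplot2, dplyr, and reshape2, among many others. Today I'll introduce part of what dplyr can do. The material below is aimed at readers who already know some R! (I'll put together an introduction to R basics later; if you're interested, you're welcome to jump in!)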

The dplyr package is used for data manipulation in R and is very powerful. Below I introduce some of its most commonly used functions.

# Notation: lines starting with ">" are the code as typed; text after "#" explains the code; lines without a prompt are the output

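> library(dplyr) # load the dplyr package

> help(package = "dplyr") # open the package documentation

> iris # the built-in iris dataset

> dplyr::filter(iris, Sepal.Length > 7) # keep the rows whose sepal length is greater than 7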

Sepal.Length Sepal.Width Petal.Length

1 7.1 3.0 5.9

2 7.6 3.0 6.6

3 7.3 2.9 6.3

4 7.2 3.6 6.1

5 7.7 3.8 6.7

6 7.7 2.6 6.9

7 7.7 2.8 6.7

8 7.2 3.2 6.0

9 7.2 3.0 5.8

10 7.4 2.8 6.1

11 7.9 3.8 6.4

12 7.7 3.0 6.1
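
filter() also accepts several conditions at once. Here is a small sketch, assuming dplyr is installed; the variable names `big_wide` and `two_species` are my own and only there for illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with AND
big_wide <- filter(iris, Sepal.Length > 7, Sepal.Width > 3)

# %in% keeps rows whose value matches any element of a vector
two_species <- filter(iris, Species %in% c("setosa", "virginica"))

nrow(big_wide)
nrow(two_species)
```

For an OR between conditions, the usual `|` operator can be used inside a single argument, for example `Sepal.Length > 7 | Sepal.Width > 4`.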

> dplyr::distinct(rbind(iris[1:10, ], iris[1:18, ])) # stack the first 10 rows on the first 18 rows, then drop the duplicated rows

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

7 4.6 3.4 1.4 0.3 setosa

8 5.0 3.4 1.5 0.2 setosa

9 4.4 2.9 1.4 0.2 setosa

10 4.9 3.1 1.5 0.1 setosa

11 5.4 3.7 1.5 0.2 setosa

12 4.8 3.4 1.6 0.2 setosa

13 4.8 3.0 1.4 0.1 setosa

14 4.3 3.0 1.1 0.1 setosa

15 5.8 4.0 1.2 0.2 setosa

16 5.7 4.4 1.5 0.4 setosa

17 5.4 3.9 1.3 0.4 setosa

18 5.1 3.5 1.4 0.3 setosa
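
To make the effect of distinct() easier to see, the sketch below compares row counts before and after. It assumes dplyr is installed, and `stacked`, `deduped`, and `species_only` are illustrative names of my own.

```r
library(dplyr)

# Stack two overlapping slices: rows 1 to 10 now appear twice
stacked <- rbind(iris[1:10, ], iris[1:18, ])

# distinct() keeps only the first copy of each repeated row
deduped <- distinct(stacked)

# Naming a column restricts the comparison to that column
species_only <- distinct(iris, Species)

nrow(stacked)
nrow(deduped)
species_only
```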

> dplyr::slice(iris, 10:15) # pick rows by position, here rows 10 to 15

Sepal.Length Sepal.Width Petal.Length Petal.Width

1 4.9 3.1 1.5 0.1

2 5.4 3.7 1.5 0.2

3 4.8 3.4 1.6 0.2

4 4.8 3.0 1.4 0.1

5 4.3 3.0 1.1 0.1

6 5.8 4.0 1.2 0.2
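
slice() has a few close relatives worth knowing. This is a minimal sketch that assumes dplyr 1.0.0 or later (for `slice_head()` and `slice_tail()`); the variable names are made up for the example.

```r
library(dplyr)

first_three   <- slice(iris, 1:3)        # rows 1 to 3
without_first <- slice(iris, -(1:140))   # negative positions drop rows 1 to 140
head_rows     <- slice_head(iris, n = 3) # the first three rows
tail_rows     <- slice_tail(iris, n = 3) # the last three rows

nrow(without_first)
tail_rows
```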

> dplyr::sample_n(iris, 10) # draw 10 rows at random

Sepal.Length Sepal.Width Petal.Length Petal.Width

1 5.0 3.4 1.5 0.2

2 4.9 2.4 3.3 1.0

3 5.2 4.1 1.5 0.1

4 6.0 2.9 4.5 1.5

5 5.9 3.0 5.1 1.8

6 6.4 2.8 5.6 2.1

7 6.7 3.0 5.0 1.7

8 5.6 2.9 3.6 1.3

9 6.9 3.1 4.9 1.5

10 6.7 2.5 5.8 1.8

> dplyr::sample_frac(iris, 0.1) # draw a random one tenth of the rows

Sepal.Length Sepal.Width Petal.Length

1 6.2 2.8 4.8

2 6.1 3.0 4.9

3 7.2 3.0 5.8

4 4.4 2.9 1.4

5 6.0 2.7 5.1

6 5.7 2.6 3.5

7 4.5 2.3 1.3

8 5.6 2.5 3.9

9 5.5 3.5 1.3

10 5.5 2.3 4.0

11 6.5 3.2 5.1

12 7.7 2.6 6.9

13 6.2 2.9 4.3

14 5.9 3.0 5.1

15 5.9 3.2 4.8
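
Since dplyr 1.0.0, sample_n() and sample_frac() are marked as superseded in favour of slice_sample(). The sketch below shows the newer spelling, assuming that version or later; the seed value 42 is arbitrary and the variable names are my own.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so that the random draw can be repeated

ten_rows  <- slice_sample(iris, n = 10)      # counterpart of sample_n(iris, 10)
one_tenth <- slice_sample(iris, prop = 0.1)  # counterpart of sample_frac(iris, 0.1)

nrow(ten_rows)
nrow(one_tenth)
```

Without set.seed() the rows differ on every run, which is why the random output above will not match what you see on your own machine.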

> head(dplyr::arrange(iris, Sepal.Length)) # sort by sepal length in ascending order; head() prints the first 6 rows

Sepal.Length Sepal.Width Petal.Length

1 4.3 3.0 1.1

2 4.4 2.9 1.4

3 4.4 3.0 1.3

4 4.4 3.2 1.3

5 4.5 2.3 1.3

6 4.6 3.1 1.5

> head(dplyr::arrange(iris, desc(Sepal.Length))) # sort by sepal length in descending order

Sepal.Length Sepal.Width Petal.Length

1 7.9 3.8 6.4

2 7.7 3.8 6.7

3 7.7 2.6 6.9

4 7.7 2.8 6.7

5 7.7 3.0 6.1

6 7.6 3.0 6.6
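
arrange() can also take more than one sort key. A short sketch, assuming dplyr is installed; `sorted` is just an illustrative name.

```r
library(dplyr)

# Sort by species first, then by descending sepal length within each species
sorted <- arrange(iris, Species, desc(Sepal.Length))

head(sorted, 3)
```

Species is a factor, so it is ordered by its levels (setosa, versicolor, virginica) rather than by any numeric value.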

> head(dplyr::select(iris, Sepal.Length)) # select a column by name

Sepal.Length

1 5.1

2 4.9

3 4.7

4 4.6

5 5.0

6 5.4

> head(dplyr::select(iris, c(Sepal.Width, Sepal.Length))) # select several columns, in the order given

Sepal.Width Sepal.Length

1 3.5 5.1

2 3.0 4.9

3 3.2 4.7

4 3.1 4.6

5 3.6 5.0

6 3.9 5.4

> head(dplyr::select(iris, ends_with("Width"))) # select the columns whose names end with "Width"

Sepal.Width Petal.Width

1 3.5 0.2

2 3.0 0.2

3 3.2 0.2

4 3.1 0.2

5 3.6 0.2

6 3.9 0.4

> head(dplyr::select(iris, starts_with("Petal"))) # select the columns whose names start with "Petal"

Petal.Length Petal.Width

1 1.4 0.2

2 1.4 0.2

3 1.3 0.2

4 1.5 0.2

5 1.4 0.2

6 1.7 0.4

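> head(dplyr::select(iris, starts_with("Petal") & ends_with("width"))) # select the columns that both start with "Petal" and end with "width" (these helpers ignore case by default)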

Petal.Width

1 0.2

2 0.2

3 0.2

4 0.2

5 0.2

6 0.4
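
A few more ways to pick columns, as a rough sketch that assumes dplyr is installed. The names `no_species`, `lengths_only`, `renamed`, and `strict` are illustrative only.

```r
library(dplyr)

no_species   <- select(iris, -Species)              # drop one column, keep the rest
lengths_only <- select(iris, contains("Length"))    # names containing "Length"
renamed      <- select(iris, sepal_len = Sepal.Length)  # select and rename in one step

# The helpers ignore case by default; ignore.case = FALSE makes the match exact,
# so lowercase "width" no longer matches Sepal.Width or Petal.Width
strict <- select(iris, ends_with("width", ignore.case = FALSE))

names(no_species)
names(lengths_only)
ncol(strict)
```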

> summarise(iris, avg = mean(Sepal.Length)) # mean sepal length

avg

1 5.843333

> summarise(iris, sum = sum(Sepal.Length)) # total of all sepal lengths

sum

1 876.5
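
Several statistics can be computed in one summarise() call. A minimal sketch, assuming dplyr is installed; the column names `n`, `avg`, `total`, and `longest` are my own choices.

```r
library(dplyr)

stats <- summarise(
  iris,
  n       = n(),                # number of rows
  avg     = mean(Sepal.Length), # mean sepal length
  total   = sum(Sepal.Length),  # sum of sepal lengths
  longest = max(Sepal.Length)   # largest sepal length
)

stats
```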

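Another super handy symbol is "%>%", called the chaining operator, which works like a pipe. It passes the output of one function to the next function, where it becomes that function's first argument.

> head(mtcars, 20) # the first 20 rows of mtcars

> head(mtcars, 20) %>% tail(10) # %>% is the pipe; this returns rows 11 to 20 of mtcars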
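
To see that the pipe is only a different way of writing a nested call, here is a small sketch, assuming dplyr is loaded (it re-exports `%>%`); `nested` and `piped` are illustrative names.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- tail(head(mtcars, 20), 10)

# Piped form: read from left to right
piped <- mtcars %>% head(20) %>% tail(10)

identical(nested, piped)
rownames(piped)
```

The piped form tends to stay readable as more steps are added, which is why it shows up in most dplyr code.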

> dplyr::group_by(iris, species) # this fails: R is case-sensitive, and the column in iris is named Species, not species

Error: Must group by variables found in `.data`.

* Column `species` is not found.

> iris %>% group_by(Species) # group the data through the pipe

> iris %>% group_by(Species) %>% summarise() # collapse to one row per group

> iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width)) # mean sepal width per group

> iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width)) %>% arrange(avg) # mean sepal width per group, sorted in ascending order
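
Column names in R are case-sensitive, and the grouping column in iris is spelled `Species` with a capital S, so it helps to check names(iris) before grouping. The sketch below assumes dplyr is installed; `by_species`, `n`, and `avg_width` are names I made up for the example.

```r
library(dplyr)

names(iris)  # check the exact spelling of the column names first

by_species <- iris %>%
  group_by(Species) %>%                                  # one group per species
  summarise(n = n(), avg_width = mean(Sepal.Width)) %>%  # size and mean per group
  arrange(avg_width)                                     # smallest mean first

by_species
```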

> dplyr::mutate(iris, new = Sepal.Length + Petal.Length) # add a column holding the sum of sepal length and petal length
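
mutate() can add more than one column in a single call. A quick sketch, assuming dplyr is installed; `with_ratio`, `total_length`, and `sepal_ratio` are illustrative names, not anything built in.

```r
library(dplyr)

with_ratio <- iris %>%
  mutate(
    total_length = Sepal.Length + Petal.Length,  # sepal length plus petal length
    sepal_ratio  = Sepal.Length / Sepal.Width    # sepal length relative to width
  )

head(with_ratio, 3)
```

The original iris data frame is left unchanged; mutate() returns a new data frame with the extra columns appended on the right.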

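The functions above all work on a single table. Other functions work across multiple tables, and tomorrow I'll post an update on how to combine and join them!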
