Basic Hadoop Operations

Posted xingweikun


Running Your First MapReduce Job

Getting to Know Hadoop's Official Example Program Package

In the local directory $HADOOP_HOME/share/hadoop/mapreduce/ on the cluster nodes
you will find the example program package hadoop-mapreduce-examples-2.7.3.jar, which bundles a number of commonly used test modules.

Module name            Description
multifilewc            counts the words in multiple files
pi                     estimates the value of π using a quasi-Monte Carlo algorithm
randomtextwriter       generates a random 10 GB text file on each DataNode
wordcount              counts the frequency of each word in the input files
wordmean               computes the average word length in the input files
wordmedian             computes the median word length in the input files
wordstandarddeviation  computes the standard deviation of word lengths in the input files

The wordcount module is exactly what we need to count the number of logins recorded in email_log.txt.
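Since each line of email_log.txt is a single user name, what wordcount will compute here can be previewed locally with standard Unix tools: sorting and counting duplicate lines yields the same name-to-count pairs. The following is only a local sketch on a hypothetical three-line sample (the addresses are made up), not a substitute for the MapReduce job:

```shell
# Build a tiny sample in the same one-record-per-line shape as email_log.txt
printf 'A.A@example.com\nB.B@example.com\nA.A@example.com\n' > /tmp/sample.txt

# Count how many times each line occurs, then print "name<TAB>count",
# matching the layout of wordcount's part-r-00000 output
sort /tmp/sample.txt | uniq -c | awk '{print $2 "\t" $1}'
# → A.A@example.com	2
# → B.B@example.com	1
```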

Submitting a MapReduce Job to the Cluster

Hadoop refuses to run a job whose output directory already exists, so delete the previously created output directory first. Note that hdfs dfs -rmdir only removes an empty directory; if the output directory already contains files, use hdfs dfs -rm -r instead.

[root@master hadoop]# hdfs dfs -rmdir /user/root/output
[root@master sbin]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/root/email_log.txt /user/root/output

The output is:

21/10/22 19:42:03 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.20:8032
21/10/22 19:42:04 INFO input.FileInputFormat: Total input paths to process : 1
21/10/22 19:42:04 INFO mapreduce.JobSubmitter: number of splits:2
21/10/22 19:42:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1634902865621_0001
21/10/22 19:42:05 INFO impl.YarnClientImpl: Submitted application application_1634902865621_0001
21/10/22 19:42:05 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1634902865621_0001/
21/10/22 19:42:05 INFO mapreduce.Job: Running job: job_1634902865621_0001
21/10/22 19:42:22 INFO mapreduce.Job: Job job_1634902865621_0001 running in uber mode : false
21/10/22 19:42:22 INFO mapreduce.Job:  map 0% reduce 0%
21/10/22 19:42:38 INFO mapreduce.Job:  map 12% reduce 0%
21/10/22 19:42:40 INFO mapreduce.Job:  map 34% reduce 0%
21/10/22 19:42:41 INFO mapreduce.Job:  map 37% reduce 0%
21/10/22 19:42:42 INFO mapreduce.Job:  map 38% reduce 0%
21/10/22 19:42:48 INFO mapreduce.Job:  map 44% reduce 0%
21/10/22 19:42:50 INFO mapreduce.Job:  map 55% reduce 0%
21/10/22 19:42:51 INFO mapreduce.Job:  map 61% reduce 0%
21/10/22 19:42:58 INFO mapreduce.Job:  map 68% reduce 0%
21/10/22 19:43:00 INFO mapreduce.Job:  map 83% reduce 0%
21/10/22 19:43:02 INFO mapreduce.Job:  map 84% reduce 0%
21/10/22 19:43:05 INFO mapreduce.Job:  map 100% reduce 0%
21/10/22 19:43:15 INFO mapreduce.Job:  map 100% reduce 86%
21/10/22 19:43:17 INFO mapreduce.Job:  map 100% reduce 100%
21/10/22 19:43:22 INFO mapreduce.Job: Job job_1634902865621_0001 completed successfully
21/10/22 19:43:23 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=416431063
		FILE: Number of bytes written=584977498
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=226383999
		HDFS: Number of bytes written=114167885
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=77622
		Total time spent by all reduces in occupied slots (ms)=13224
		Total time spent by all map tasks (ms)=77622
		Total time spent by all reduce tasks (ms)=13224
		Total vcore-milliseconds taken by all map tasks=77622
		Total vcore-milliseconds taken by all reduce tasks=13224
		Total megabyte-milliseconds taken by all map tasks=79484928
		Total megabyte-milliseconds taken by all reduce tasks=13541376
	Map-Reduce Framework
		Map input records=8000000
		Map output records=8000000
		Map output bytes=250379675
		Map output materialized bytes=168189616
		Input split bytes=228
		Combine input records=12301355
		Combine output records=9352725
		Reduce input groups=3896706
		Reduce shuffle bytes=168189616
		Reduce input records=5051370
		Reduce output records=3896706
		Spilled Records=17558337
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=1311
		CPU time spent (ms)=31910
		Physical memory (bytes) snapshot=426545152
		Virtual memory (bytes) snapshot=6312058880
		Total committed heap usage (bytes)=270917632
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=226383771
	File Output Format Counters 
		Bytes Written=114167885

Key information

job_1634902865621_0001
is the ID of this job, usually referred to as the job ID.
21/10/22 19:42:22 INFO mapreduce.Job:  map 0% reduce 0%
indicates that the Map phase is about to start.
21/10/22 19:43:05 INFO mapreduce.Job:  map 100% reduce 0%
indicates that the Map phase has finished.
21/10/22 19:43:17 INFO mapreduce.Job:  map 100% reduce 100%
indicates that the Reduce phase has finished.
21/10/22 19:43:22 INFO mapreduce.Job: Job job_1634902865621_0001 completed successfully
indicates that the job completed successfully.
Map input records=8000000
shows that there were 8,000,000 input records in total.
Reduce output records=3896706
shows that 3,896,706 result records were produced.

The job ran from 19:42:03 to 19:43:23, about 1 minute 20 seconds in total. That is quite fast, considering that email_log.txt contains 8 million records.

In vim, :set nu displays line numbers, confirming that the file has 8,000,000 lines of data.
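Opening an 8-million-line file in vim just to check its length is slow; wc -l reports the line count directly. A sketch, using a small hypothetical stand-in file since the real email_log.txt lives on the cluster:

```shell
# A hypothetical three-line stand-in for email_log.txt
printf 'user1\nuser2\nuser3\n' > /tmp/log_sample.txt

# wc -l prints the number of lines; on the real file it prints 8000000
wc -l < /tmp/log_sample.txt
# → 3
```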


After the job finishes, two new files appear in the output directory; they hold the returned results.
_SUCCESS is a marker file whose presence indicates that the job completed.

[root@master /]# hdfs dfs -ls /user/root/output
Found 2 items
-rw-r--r--   3 root supergroup          0 2021-10-22 19:43 /user/root/output/_SUCCESS
-rw-r--r--   3 root supergroup  114167885 2021-10-22 19:43 /user/root/output/part-r-00000
[root@master /]# 

Let's look at part-r-00000, the result file produced when the job completed.

First 10 lines:
[root@master /]# hdfs dfs -cat /user/root/output/part-r-00000 | head -10
A.A@30gigs.comac.be	3
A.A@AFreeInternetngo.pl	4
A.A@AmexMailac.im	1
A.A@AmexMailpe.kr	4
A.A@AsianWiredBI.it	1
A.A@BigAssWebor.jp	3
A.A@Care2vossevangen.no	3
A.A@FasterMailaid.pl	3
A.A@FetchMaildyroy.no	1
A.A@FitMommiesparliament.uk	1
cat: Unable to write to output stream.
[root@master /]#

The "Unable to write to output stream" message is harmless here: head exits after printing 10 lines, which closes the pipe, and hdfs dfs -cat simply reports the resulting broken pipe.

Last 10 lines:
[root@master /]# hdfs dfs -cat /user/root/output/part-r-00000 | tail -10
zwynn@MyOwnEmailnet.sc	1
zy@AmexMailgob.pk	1
zy@PostMark.netsauherad.no	4
zy@e-tapaal.comnesoddtangen.no	4
zy@eo.yifan.netplc.co.im	3
zyat@doramail.comlk	3
zyates@Mail.comha.no	1
zyor@BigAssWebSondrio.it	4
zzim@Planet-Mailsebastopol.ua	3
zzimmer@TheMailmil.se	2
[root@master /]#

Total number of lines:
[root@master /]# hdfs dfs -cat /user/root/output/part-r-00000 | wc -l
3896706
[root@master /]#

The first column is the user name and the second is the number of times that user logged in.
With that, the task of counting user logins is essentially complete.
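With the name-count pairs in hand, a natural follow-up is to find the most frequent users. Sorting numerically and in reverse on the count column does this; the sketch below works on a hypothetical local copy of part-r-00000 (for example fetched with hdfs dfs -get), here faked with three made-up rows:

```shell
# Hypothetical local copy of part-r-00000: "name<TAB>count" per line
printf 'a@x\t3\nb@y\t1\nc@z\t5\n' > /tmp/results.txt

# Sort numerically and descending on the second (count) column,
# then keep the two most frequent users
sort -k2,2nr /tmp/results.txt | head -2
# → c@z	5
# → a@x	3
```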
