Basic Hadoop Operations
Posted by xingweikun
Running Your First MapReduce Job
Getting to Know the Official Hadoop Example Programs
In the local directory $HADOOP_HOME/share/hadoop/mapreduce/ on the cluster nodes you will find the example jar hadoop-mapreduce-examples-2.7.3.jar, which bundles a number of commonly used test programs.
Module | Description |
---|---|
multifilewc | Counts the words in multiple input files |
pi | Estimates the value of π using a quasi-Monte Carlo method |
randomtextwriter | Writes a 10 GB file of random text on each data node |
wordcount | Counts the frequency of each word in the input files |
wordmean | Computes the average length of the words in the input files |
wordmedian | Computes the median length of the words in the input files |
wordstandarddeviation | Computes the standard deviation of word lengths in the input files |
The wordcount module is exactly what we need to count login occurrences in email_log.txt.
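Conceptually, wordcount's map phase emits a (word, 1) pair for every token and the reduce phase sums the counts per word. Since each line of email_log.txt is a single login record, counting "words" counts logins per user. A minimal Python sketch of the idea (not actual Hadoop code, just the logic):

```python
from collections import Counter

def map_phase(lines):
    """Emit (word, 1) for every whitespace-separated token, as wordcount's mapper does."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Sum the counts for each key, as wordcount's reducer does."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical sample data standing in for lines of email_log.txt.
log_lines = ["a@example.com", "b@example.com", "a@example.com"]
print(reduce_phase(map_phase(log_lines)))  # {'a@example.com': 2, 'b@example.com': 1}
```

In the real job, the map and reduce phases run in parallel across the cluster and the pairs are shuffled by key between them, but the per-key arithmetic is exactly this.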
Submitting the MapReduce Job to the Cluster
Hadoop refuses to run a job whose output directory already exists, so delete the output directory created earlier (note that hdfs dfs -rmdir removes only an empty directory; use hdfs dfs -rm -r if it contains files):
[root@master hadoop]# hdfs dfs -rmdir /user/root/output
[root@master sbin]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/root/email_log.txt /user/root/output
The output:
21/10/22 19:42:03 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.20:8032
21/10/22 19:42:04 INFO input.FileInputFormat: Total input paths to process : 1
21/10/22 19:42:04 INFO mapreduce.JobSubmitter: number of splits:2
21/10/22 19:42:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1634902865621_0001
21/10/22 19:42:05 INFO impl.YarnClientImpl: Submitted application application_1634902865621_0001
21/10/22 19:42:05 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1634902865621_0001/
21/10/22 19:42:05 INFO mapreduce.Job: Running job: job_1634902865621_0001
21/10/22 19:42:22 INFO mapreduce.Job: Job job_1634902865621_0001 running in uber mode : false
21/10/22 19:42:22 INFO mapreduce.Job: map 0% reduce 0%
21/10/22 19:42:38 INFO mapreduce.Job: map 12% reduce 0%
21/10/22 19:42:40 INFO mapreduce.Job: map 34% reduce 0%
21/10/22 19:42:41 INFO mapreduce.Job: map 37% reduce 0%
21/10/22 19:42:42 INFO mapreduce.Job: map 38% reduce 0%
21/10/22 19:42:48 INFO mapreduce.Job: map 44% reduce 0%
21/10/22 19:42:50 INFO mapreduce.Job: map 55% reduce 0%
21/10/22 19:42:51 INFO mapreduce.Job: map 61% reduce 0%
21/10/22 19:42:58 INFO mapreduce.Job: map 68% reduce 0%
21/10/22 19:43:00 INFO mapreduce.Job: map 83% reduce 0%
21/10/22 19:43:02 INFO mapreduce.Job: map 84% reduce 0%
21/10/22 19:43:05 INFO mapreduce.Job: map 100% reduce 0%
21/10/22 19:43:15 INFO mapreduce.Job: map 100% reduce 86%
21/10/22 19:43:17 INFO mapreduce.Job: map 100% reduce 100%
21/10/22 19:43:22 INFO mapreduce.Job: Job job_1634902865621_0001 completed successfully
21/10/22 19:43:23 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=416431063
FILE: Number of bytes written=584977498
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=226383999
HDFS: Number of bytes written=114167885
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=77622
Total time spent by all reduces in occupied slots (ms)=13224
Total time spent by all map tasks (ms)=77622
Total time spent by all reduce tasks (ms)=13224
Total vcore-milliseconds taken by all map tasks=77622
Total vcore-milliseconds taken by all reduce tasks=13224
Total megabyte-milliseconds taken by all map tasks=79484928
Total megabyte-milliseconds taken by all reduce tasks=13541376
Map-Reduce Framework
Map input records=8000000
Map output records=8000000
Map output bytes=250379675
Map output materialized bytes=168189616
Input split bytes=228
Combine input records=12301355
Combine output records=9352725
Reduce input groups=3896706
Reduce shuffle bytes=168189616
Reduce input records=5051370
Reduce output records=3896706
Spilled Records=17558337
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1311
CPU time spent (ms)=31910
Physical memory (bytes) snapshot=426545152
Virtual memory (bytes) snapshot=6312058880
Total committed heap usage (bytes)=270917632
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=226383771
File Output Format Counters
Bytes Written=114167885
Key information in the output
job_1634902865621_0001
is the ID of this job, usually called the job ID.
21/10/22 19:42:22 INFO mapreduce.Job: map 0% reduce 0%
marks the start of the Map phase.
21/10/22 19:43:05 INFO mapreduce.Job: map 100% reduce 0%
marks the completion of the Map phase.
21/10/22 19:43:17 INFO mapreduce.Job: map 100% reduce 100%
marks the completion of the Reduce phase.
21/10/22 19:43:22 INFO mapreduce.Job: Job job_1634902865621_0001 completed successfully
means the job finished successfully.
Map input records=8000000
shows that the input contained 8,000,000 records.
Reduce output records=3896706
shows that the result contains 3,896,706 records.
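The counters also show the combiner at work: the 8,000,000 map output records were pre-aggregated locally (Combine input/output records) so that only about 5 million records reached the reducer. A small sketch of what a combiner does, using made-up data:

```python
from collections import Counter

def combine(mapper_output):
    """Pre-aggregate (key, count) pairs within one mapper's output,
    as the combiner does before records are shuffled to the reducer."""
    partial = Counter()
    for word, count in mapper_output:
        partial[word] += count
    return list(partial.items())

# Two hypothetical mappers each emit one pair per input line; the combiner
# shrinks each mapper's output locally, so fewer records cross the network.
mapper_1 = [("a@x.com", 1), ("a@x.com", 1), ("b@x.com", 1)]
mapper_2 = [("a@x.com", 1), ("c@x.com", 1)]
combined_1, combined_2 = combine(mapper_1), combine(mapper_2)
print(len(mapper_1) + len(mapper_2), "->", len(combined_1) + len(combined_2))  # 5 -> 4
```

For wordcount the reducer itself doubles as the combiner, since summing counts is associative.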
From 19:42:03 to 19:43:23, the whole job took about 1 minute 20 seconds. That is quite fast, considering email_log.txt holds 8,000,000 records (open it in vi and run :set nu to show line numbers: 8,000,000 lines in total).
After the job finishes, two new files appear in the output directory; these are the returned results. _SUCCESS is a marker file indicating that the job completed successfully.
[root@master /]# hdfs dfs -ls /user/root/output
Found 2 items
-rw-r--r-- 3 root supergroup 0 2021-10-22 19:43 /user/root/output/_SUCCESS
-rw-r--r-- 3 root supergroup 114167885 2021-10-22 19:43 /user/root/output/part-r-00000
[root@master /]#
Let's look at part-r-00000, the result file produced by the job.
First 10 lines:
[root@master /]# hdfs dfs -cat /user/root/output/part-r-00000 | head -10
A.A@30gigs.comac.be 3
A.A@AFreeInternetngo.pl 4
A.A@AmexMailac.im 1
A.A@AmexMailpe.kr 4
A.A@AsianWiredBI.it 1
A.A@BigAssWebor.jp 3
A.A@Care2vossevangen.no 3
A.A@FasterMailaid.pl 3
A.A@FetchMaildyroy.no 1
A.A@FitMommiesparliament.uk 1
cat: Unable to write to output stream.
[root@master /]#
Last 10 lines (the "Unable to write to output stream" message above is harmless: head closed the pipe after reading 10 lines):
[root@master /]# hdfs dfs -cat /user/root/output/part-r-00000 | tail -10
zwynn@MyOwnEmailnet.sc 1
zy@AmexMailgob.pk 1
zy@PostMark.netsauherad.no 4
zy@e-tapaal.comnesoddtangen.no 4
zy@eo.yifan.netplc.co.im 3
zyat@doramail.comlk 3
zyates@Mail.comha.no 1
zyor@BigAssWebSondrio.it 4
zzim@Planet-Mailsebastopol.ua 3
zzimmer@TheMailmil.se 2
[root@master /]#
Total number of lines:
[root@master /]# hdfs dfs -cat /user/root/output/part-r-00000 | wc -l
3896706
[root@master /]#
The first column is the username and the second is that user's login count.
With that, the task of counting user logins is essentially complete.
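If you want to post-process the results locally, you could first fetch the file with `hdfs dfs -get /user/root/output/part-r-00000 .` and then parse it; Hadoop's default TextOutputFormat separates key and value with a tab. A small sketch (the helper names are my own, and the sample lines are taken from the output above):

```python
def parse_result_line(line):
    """Split one wordcount output line into (username, login_count).
    TextOutputFormat, the default, uses a tab between key and value."""
    user, count = line.rstrip("\n").split("\t")
    return user, int(count)

def top_users(lines, n=3):
    """Return the n users with the most logins, most active first."""
    records = [parse_result_line(l) for l in lines]
    return sorted(records, key=lambda r: r[1], reverse=True)[:n]

sample = [
    "A.A@30gigs.comac.be\t3",
    "A.A@AFreeInternetngo.pl\t4",
    "A.A@AmexMailac.im\t1",
]
print(top_users(sample, n=2))
```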