尚硅谷 Big Data Hadoop Tutorial – Notes 01: Getting Started
Posted by 延锋L
Video: 尚硅谷大数据Hadoop教程 (Hadoop 3.x, from installation and cluster setup to cluster tuning)
- 尚硅谷大数据Hadoop教程-笔记01【入门】
- 尚硅谷大数据Hadoop教程-笔记02【HDFS】
- 尚硅谷大数据Hadoop教程-笔记03【MapReduce】
- 尚硅谷大数据Hadoop教程-笔记04【Yarn】
- 尚硅谷大数据Hadoop教程-笔记04【生产调优手册】
- 尚硅谷大数据Hadoop教程-笔记04【源码解析】
Contents
P001【001_尚硅谷_Hadoop_开篇_课程整体介绍】08:38
P002【002_尚硅谷_Hadoop_概论_大数据的概念】04:34
P003【003_尚硅谷_Hadoop_概论_大数据的特点】07:23
P004【004_尚硅谷_Hadoop_概论_大数据的应用场景】09:58
P005【005_尚硅谷_Hadoop_概论_大数据的发展场景】08:17
P006【006_尚硅谷_Hadoop_概论_未来工作内容】06:25
P007【007_尚硅谷_Hadoop_入门_课程介绍】07:29
P008【008_尚硅谷_Hadoop_入门_Hadoop是什么】03:00
P009【009_尚硅谷_Hadoop_入门_Hadoop发展历史】05:52
P010【010_尚硅谷_Hadoop_入门_Hadoop三大发行版本】05:59
P011【011_尚硅谷_Hadoop_入门_Hadoop优势】03:52
P012【012_尚硅谷_Hadoop_入门_Hadoop1.x2.x3.x区别】03:00
P013【013_尚硅谷_Hadoop_入门_HDFS概述】06:26
P014【014_尚硅谷_Hadoop_入门_YARN概述】06:35
P015【015_尚硅谷_Hadoop_入门_MapReduce概述】01:55
P016【016_尚硅谷_Hadoop_入门_HDFS&YARN&MR关系】03:22
P017【017_尚硅谷_Hadoop_入门_大数据技术生态体系】09:17
P018【018_尚硅谷_Hadoop_入门_VMware安装】04:41
P019【019_尚硅谷_Hadoop_入门_Centos7.5软硬件安装】15:56
P020【020_尚硅谷_Hadoop_入门_IP和主机名称配置】10:50
P021【021_尚硅谷_Hadoop_入门_Xshell远程访问工具】09:05
P022【022_尚硅谷_Hadoop_入门_模板虚拟机准备完成】12:25
P023【023_尚硅谷_Hadoop_入门_克隆三台虚拟机】15:01
P024【024_尚硅谷_Hadoop_入门_JDK安装】07:02
P025【025_尚硅谷_Hadoop_入门_Hadoop安装】07:20
P026【026_尚硅谷_Hadoop_入门_本地运行模式】11:56
P027【027_尚硅谷_Hadoop_入门_scp&rsync命令讲解】15:01
P028【028_尚硅谷_Hadoop_入门_xsync分发脚本】18:14
P029【029_尚硅谷_Hadoop_入门_ssh免密登录】11:25
P030【030_尚硅谷_Hadoop_入门_集群配置】13:24
P031【031_尚硅谷_Hadoop_入门_群起集群并测试】16:52
P032【032_尚硅谷_Hadoop_入门_集群崩溃处理办法】08:10
P033【033_尚硅谷_Hadoop_入门_历史服务器配置】05:26
P034【034_尚硅谷_Hadoop_入门_日志聚集功能配置】05:42
P035【035_尚硅谷_Hadoop_入门_两个常用脚本】09:18
P036【036_尚硅谷_Hadoop_入门_两道面试题】04:15
P037【037_尚硅谷_Hadoop_入门_集群时间同步】11:27
P038【038_尚硅谷_Hadoop_入门_常见问题总结】10:57
00_尚硅谷大数据Hadoop课程整体介绍
P001【001_尚硅谷_Hadoop_开篇_课程整体介绍】08:38
I. What this upgraded course focuses on
1. YARN
2. The production tuning guide
3. Source-code analysis
II. Course highlights
1. New: based on Hadoop 3.1.3
2. Detailed: starting from cluster setup, every configuration item and every line of code is annotated, at textbook quality
3. Practical: 20+ enterprise cases and 30+ enterprise tuning scenarios, plus source-code reading in a code base of over a million lines
4. Complete: a full set of course materials
III. How to get the materials
1. Follow the 尚硅谷教育 WeChat official account and reply "大数据"
2. 谷粒学院
3. Bilibili (b站)
IV. Prerequisites
Java SE, Maven + IDEA + common Linux commands
01_尚硅谷大数据技术之大数据概论
P002【002_尚硅谷_Hadoop_概论_大数据的概念】04:34
Chapter 1. The concept of big data. Big Data refers to data sets that cannot be captured, managed, and processed within a reasonable time frame using conventional software tools; they are massive, fast-growing, and diverse information assets that require new processing models to provide stronger decision-making power, insight discovery, and process-optimization capability.
Big data mainly addresses the collection, storage, and analysis/computation of massive data.
P003【003_尚硅谷_Hadoop_概论_大数据的特点】07:23
Chapter 2. Characteristics of big data (the 4 Vs)
- Volume (massive scale)
- Velocity (high speed)
- Variety (diverse types)
- Value (low value density)
P004【004_尚硅谷_Hadoop_概论_大数据的应用场景】09:58
Chapter 3. Big data application scenarios
- Douyin: the videos recommended to you are the ones you like.
- On-site e-commerce advertising: recommend products a user is likely to buy.
- Retail: analyze consumption habits to make purchasing more convenient and lift sales.
- Logistics and warehousing: JD Logistics — order in the morning and receive it in the afternoon; order in the afternoon and receive it the next morning.
- Insurance: mining massive data for risk prediction enables precision marketing and finer-grained pricing.
- Finance: multi-dimensional user profiles help financial institutions identify high-quality customers and guard against fraud.
- Real estate: big data drives investment and marketing decisions — choose better land, build more suitable buildings, and sell them to the right buyers.
- AI + 5G + IoT + virtual and augmented reality.
P005【005_尚硅谷_Hadoop_概论_大数据的发展场景】08:17
Chapter 4. Prospects for big data: promising!
P006【006_尚硅谷_Hadoop_概论_未来工作内容】06:25
Chapter 5. Business workflow between a big data department and other departments
Chapter 6. Organizational structure inside a big data department
02_尚硅谷大数据技术之Hadoop(入门)V3.3
P007【007_尚硅谷_Hadoop_入门_课程介绍】07:29
P008【008_尚硅谷_Hadoop_入门_Hadoop是什么】03:00
P009【009_尚硅谷_Hadoop_入门_Hadoop发展历史】05:52
P010【010_尚硅谷_Hadoop_入门_Hadoop三大发行版本】05:59
The three major Hadoop distributions: Apache, Cloudera, Hortonworks.
1) Apache Hadoop
Official site: http://hadoop.apache.org
Downloads: https://hadoop.apache.org/releases.html
2) Cloudera Hadoop
Official site: https://www.cloudera.com/downloads/cdh
Downloads: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_6_download.html
(1) Founded in 2008, Cloudera was the first company to commercialize Hadoop, offering partners commercial Hadoop solutions consisting mainly of support, consulting, and training.
(2) In 2009 Doug Cutting, the creator of Hadoop, joined Cloudera. Cloudera's main products are CDH, Cloudera Manager, and Cloudera Support.
(3) CDH is Cloudera's Hadoop distribution. It is fully open source and improves on Apache Hadoop in compatibility, security, and stability. Cloudera's list price is USD 10,000 per node per year.
(4) Cloudera Manager is a platform for software distribution, management, and monitoring of the cluster; it can deploy a Hadoop cluster within a few hours and monitors the nodes and services in real time.
3) Hortonworks Hadoop
Official site: https://hortonworks.com/products/data-center/hdp/
Downloads: https://hortonworks.com/downloads/#data-platform
(1) Founded in 2011, Hortonworks was a joint venture between Yahoo and the Silicon Valley venture-capital firm Benchmark Capital.
(2) At its founding it absorbed roughly 25 to 30 Yahoo engineers dedicated to Hadoop; these engineers had been helping Yahoo develop Hadoop since 2005 and had contributed about 80% of Hadoop's code.
(3) Hortonworks' flagship product is the Hortonworks Data Platform (HDP), likewise 100% open source; besides the usual projects, HDP also includes Ambari, an open-source installation and management system.
(4) In 2018 Hortonworks was acquired by Cloudera.
P011【011_尚硅谷_Hadoop_入门_Hadoop优势】03:52
Hadoop's strengths (the "4 highs")
- High reliability
- High scalability
- High efficiency
- High fault tolerance
P012【012_尚硅谷_Hadoop_入门_Hadoop1.x2.x3.x区别】03:00
P013【013_尚硅谷_Hadoop_入门_HDFS概述】06:26
Hadoop Distributed File System (HDFS) is a distributed file system.
- 1) NameNode (nn): stores file metadata — file names, directory structure, file attributes (creation time, replication factor, permissions) — plus the block list of every file and which DataNodes each block lives on.
- 2) DataNode (dn): stores the block data, and the checksums of those blocks, in its local file system.
- 3) Secondary NameNode (2nn): takes a backup of the NameNode metadata at regular intervals.
A quick way to see these daemons on a running cluster is sketched below.
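As a quick sanity check, once a cluster like the hadoop102/103/104 cluster built later in these notes is running (this assumes that setup), the daemons and the NameNode's view of the DataNodes can be inspected from the shell:
jps                      # on each node: NameNode / DataNode / SecondaryNameNode, depending on the node's role
hdfs dfsadmin -report    # the NameNode's report of every DataNode: capacity, usage, last heartbeat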
P014【014_尚硅谷_Hadoop_入门_YARN概述】06:35
YARN (Yet Another Resource Negotiator) is Hadoop's resource manager.
P015【015_尚硅谷_Hadoop_入门_MapReduce概述】01:55
MapReduce divides the computation into two phases: Map and Reduce.
- 1) The Map phase processes the input data in parallel.
- 2) The Reduce phase aggregates the Map results.
A minimal WordCount sketch of the two phases follows this list.
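To make the two phases concrete, here is a minimal WordCount sketch against the standard MapReduce API; the class names are illustrative and the Driver (Job submission) code is omitted, so treat it as an outline rather than the course's own example.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs in parallel over the input splits and emits (word, 1) for every token.
class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: receives (word, [1, 1, ...]) grouped by key and sums the counts.
class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}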
P016【016_尚硅谷_Hadoop_入门_HDFS&YARN&MR关系】03:22
- HDFS
  - NameNode: manages the file-system metadata, i.e. records which DataNodes hold each block of data.
  - DataNode: actually stores the data blocks on its own node.
  - SecondaryNameNode: the NameNode's "secretary"; backs up the NameNode metadata and can take over part of the NameNode's work during recovery.
- YARN: resource management for the whole cluster.
  - ResourceManager: manages the resources of the entire cluster.
  - NodeManager: manages the resources of a single node.
- MapReduce: the computation framework that runs on top of HDFS and YARN.
P017【017_尚硅谷_Hadoop_入门_大数据技术生态体系】09:17
The big data technology ecosystem (see the diagram in the course document)
Recommendation-system project architecture (see the diagram in the course document)
P018【018_尚硅谷_Hadoop_入门_VMware安装】04:41
P019【019_尚硅谷_Hadoop_入门_Centos7.5软硬件安装】15:56
P020【020_尚硅谷_Hadoop_入门_IP和主机名称配置】10:50
[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
[root@hadoop100 ~]# ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.88.133 netmask 255.255.255.0 broadcast 192.168.88.255
inet6 fe80::363b:8659:c323:345d prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:0f:0a:6d txqueuelen 1000 (Ethernet)
RX packets 684561 bytes 1003221355 (956.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 53538 bytes 3445292 (3.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 84 bytes 9492 (9.2 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 84 bytes 9492 (9.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
ether 52:54:00:1c:3c:a9 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@hadoop100 ~]# systemctl restart network
[root@hadoop100 ~]# cat /etc/host
cat: /etc/host: 没有那个文件或目录
[root@hadoop100 ~]# cat /etc/hostname
hadoop100
[root@hadoop100 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@hadoop100 ~]# vim /etc/hosts
[root@hadoop100 ~]# ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.88.100 netmask 255.255.255.0 broadcast 192.168.88.255
inet6 fe80::363b:8659:c323:345d prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:0f:0a:6d txqueuelen 1000 (Ethernet)
RX packets 684830 bytes 1003244575 (956.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 53597 bytes 3452600 (3.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 132 bytes 14436 (14.0 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 132 bytes 14436 (14.0 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
ether 52:54:00:1c:3c:a9 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@hadoop100 ~]# ll
总用量 40
-rw-------. 1 root root 1973 3月 14 10:19 anaconda-ks.cfg
-rw-r--r--. 1 root root 2021 3月 14 10:26 initial-setup-ks.cfg
drwxr-xr-x. 2 root root 4096 3月 14 10:27 公共
drwxr-xr-x. 2 root root 4096 3月 14 10:27 模板
drwxr-xr-x. 2 root root 4096 3月 14 10:27 视频
drwxr-xr-x. 2 root root 4096 3月 14 10:27 图片
drwxr-xr-x. 2 root root 4096 3月 14 10:27 文档
drwxr-xr-x. 2 root root 4096 3月 14 10:27 下载
drwxr-xr-x. 2 root root 4096 3月 14 10:27 音乐
drwxr-xr-x. 2 root root 4096 3月 14 10:27 桌面
[root@hadoop100 ~]#
vim /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="3241b48d-3234-4c23-8a03-b9b393a99a65"
DEVICE="ens33"
ONBOOT="yes"IPADDR=192.168.88.100
GATEWAY=192.168.88.2
DNS1=192.168.88.2vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6192.168.88.100 hadoop100
192.168.88.101 hadoop101
192.168.88.102 hadoop102
192.168.88.103 hadoop103
192.168.88.104 hadoop104
192.168.88.105 hadoop105
192.168.88.106 hadoop106
192.168.88.107 hadoop107
192.168.88.108 hadoop108
192.168.88.151 node1 node1.itcast.cn
192.168.88.152 node2 node2.itcast.cn
192.168.88.153 node3 node3.itcast.cn
P021【021_尚硅谷_Hadoop_入门_Xshell远程访问工具】09:05
P022【022_尚硅谷_Hadoop_入门_模板虚拟机准备完成】12:25
yum install -y epel-release
systemctl stop firewalld
systemctl disable firewalld.service
P023【023_尚硅谷_Hadoop_入门_克隆三台虚拟机】15:01
vim /etc/sysconfig/network-scripts/ifcfg-ens33
vim /etc/hostname
reboot
P024【024_尚硅谷_Hadoop_入门_JDK安装】07:02
Install the JDK on hadoop102, then copy the installation to hadoop103 and hadoop104.
P025【025_尚硅谷_Hadoop_入门_Hadoop安装】07:20
Hadoop is installed the same way as the JDK in P024 (same diagram); both installs are made visible through a shared environment file, sketched below.
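A minimal sketch of that environment file; the exact install directories (/opt/module/jdk1.8.0_212, /opt/module/hadoop-3.1.3) are assumptions based on the archives used in this course, so adjust them to your own layout. The file /etc/profile.d/my_env.sh is the one distributed to the other nodes later in these notes.
# /etc/profile.d/my_env.sh
# JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
# HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Reload the profile and verify:
source /etc/profile
java -version
hadoop version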
P026【026_尚硅谷_Hadoop_入门_本地运行模式】11:56
[root@node1 ~]# cd /export/server/hadoop-3.3.0/share/hadoop/mapreduce/
[root@node1 mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar wordcount /wordcount/input /wordcount/output
2023-03-20 14:43:07,516 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node1/192.168.88.151:8032
2023-03-20 14:43:09,291 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1679293699463_0001
2023-03-20 14:43:11,916 INFO input.FileInputFormat: Total input files to process : 1
2023-03-20 14:43:12,313 INFO mapreduce.JobSubmitter: number of splits:1
2023-03-20 14:43:13,173 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1679293699463_0001
2023-03-20 14:43:13,173 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-03-20 14:43:14,684 INFO conf.Configuration: resource-types.xml not found
2023-03-20 14:43:14,684 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-03-20 14:43:17,054 INFO impl.YarnClientImpl: Submitted application application_1679293699463_0001
2023-03-20 14:43:17,123 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1679293699463_0001/
2023-03-20 14:43:17,124 INFO mapreduce.Job: Running job: job_1679293699463_0001
2023-03-20 14:43:52,340 INFO mapreduce.Job: Job job_1679293699463_0001 running in uber mode : false
2023-03-20 14:43:52,360 INFO mapreduce.Job: map 0% reduce 0%
2023-03-20 14:44:08,011 INFO mapreduce.Job: map 100% reduce 0%
2023-03-20 14:44:16,986 INFO mapreduce.Job: map 100% reduce 100%
2023-03-20 14:44:18,020 INFO mapreduce.Job: Job job_1679293699463_0001 completed successfully
2023-03-20 14:44:18,579 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=31
FILE: Number of bytes written=529345
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=142
HDFS: Number of bytes written=17
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=11303
Total time spent by all reduces in occupied slots (ms)=6220
Total time spent by all map tasks (ms)=11303
Total time spent by all reduce tasks (ms)=6220
Total vcore-milliseconds taken by all map tasks=11303
Total vcore-milliseconds taken by all reduce tasks=6220
Total megabyte-milliseconds taken by all map tasks=11574272
Total megabyte-milliseconds taken by all reduce tasks=6369280
Map-Reduce Framework
Map input records=2
Map output records=5
Map output bytes=53
Map output materialized bytes=31
Input split bytes=108
Combine input records=5
Combine output records=2
Reduce input groups=2
Reduce shuffle bytes=31
Reduce input records=2
Reduce output records=2
Spilled Records=4
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=546
CPU time spent (ms)=3680
Physical memory (bytes) snapshot=499236864
Virtual memory (bytes) snapshot=5568684032
Total committed heap usage (bytes)=365953024
Peak Map Physical memory (bytes)=301096960
Peak Map Virtual memory (bytes)=2779201536
Peak Reduce Physical memory (bytes)=198139904
Peak Reduce Virtual memory (bytes)=2789482496
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=34
File Output Format Counters
Bytes Written=17
[root@node1 mapreduce]#
[root@node1 mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar wordcount /wc_input /wc_output
2023-03-20 15:01:48,007 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node1/192.168.88.151:8032
2023-03-20 15:01:49,475 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1679293699463_0002
2023-03-20 15:01:50,522 INFO input.FileInputFormat: Total input files to process : 1
2023-03-20 15:01:51,010 INFO mapreduce.JobSubmitter: number of splits:1
2023-03-20 15:01:51,894 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1679293699463_0002
2023-03-20 15:01:51,894 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-03-20 15:01:52,684 INFO conf.Configuration: resource-types.xml not found
2023-03-20 15:01:52,687 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-03-20 15:01:53,237 INFO impl.YarnClientImpl: Submitted application application_1679293699463_0002
2023-03-20 15:01:53,487 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1679293699463_0002/
2023-03-20 15:01:53,492 INFO mapreduce.Job: Running job: job_1679293699463_0002
2023-03-20 15:02:15,329 INFO mapreduce.Job: Job job_1679293699463_0002 running in uber mode : false
2023-03-20 15:02:15,342 INFO mapreduce.Job: map 0% reduce 0%
2023-03-20 15:02:26,652 INFO mapreduce.Job: map 100% reduce 0%
2023-03-20 15:02:40,297 INFO mapreduce.Job: map 100% reduce 100%
2023-03-20 15:02:41,350 INFO mapreduce.Job: Job job_1679293699463_0002 completed successfully
2023-03-20 15:02:41,557 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=60
FILE: Number of bytes written=529375
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=149
HDFS: Number of bytes written=38
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=8398
Total time spent by all reduces in occupied slots (ms)=9720
Total time spent by all map tasks (ms)=8398
Total time spent by all reduce tasks (ms)=9720
Total vcore-milliseconds taken by all map tasks=8398
Total vcore-milliseconds taken by all reduce tasks=9720
Total megabyte-milliseconds taken by all map tasks=8599552
Total megabyte-milliseconds taken by all reduce tasks=9953280
Map-Reduce Framework
Map input records=4
Map output records=6
Map output bytes=69
Map output materialized bytes=60
Input split bytes=100
Combine input records=6
Combine output records=4
Reduce input groups=4
Reduce shuffle bytes=60
Reduce input records=4
Reduce output records=4
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=1000
CPU time spent (ms)=3880
Physical memory (bytes) snapshot=503771136
Virtual memory (bytes) snapshot=5568987136
Total committed heap usage (bytes)=428343296
Peak Map Physical memory (bytes)=303013888
Peak Map Virtual memory (bytes)=2782048256
Peak Reduce Physical memory (bytes)=200757248
Peak Reduce Virtual memory (bytes)=2786938880
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=49
File Output Format Counters
Bytes Written=38
[root@node1 mapreduce]# pwd
/export/server/hadoop-3.3.0/share/hadoop/mapreduce
[root@node1 mapreduce]#
P027【027_尚硅谷_Hadoop_入门_scp&rsync命令讲解】15:01
Use scp for the first full copy; use rsync for subsequent synchronizations.
rsync is mainly used for backup and mirroring; it is fast, avoids copying identical content, and supports symbolic links.
Difference between rsync and scp: copying with rsync is faster than with scp because rsync only transfers the files that differ, whereas scp copies everything. Example invocations are sketched below.
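For reference, the basic syntax of both commands; the host names and paths follow this course's cluster layout and are otherwise assumptions, so adjust them to your environment.
# scp: full copy every time
# scp -r <source> <user>@<host>:<destination>
scp -r /opt/module/jdk1.8.0_212 atguigu@hadoop103:/opt/module

# rsync: -a archive mode (preserves permissions, times, symlinks), -v verbose; only differences are transferred
# rsync -av <source> <user>@<host>:<destination>
rsync -av /opt/module/hadoop-3.1.3/ atguigu@hadoop103:/opt/module/hadoop-3.1.3/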
P028【028_尚硅谷_Hadoop_入门_xsync分发脚本】18:14
Copy and synchronization commands
- scp (secure copy)
- rsync: remote synchronization tool
- xsync: cluster distribution script (written below)
The dirname command strips the non-directory part from a path and prints only the directory portion:
[root@node1 ~]# dirname /home/atguigu/a.txt
/home/atguigu
[root@node1 ~]#
The basename command prints the file name, i.e. the last component of a path:
[root@node1 atguigu]# basename /home/atguigu/a.txt
a.txt
[root@node1 atguigu]#
#!/bin/bash

# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi

# 2. Loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ==================== $host ====================
    # 3. Loop over all files/directories to send
    for file in $@
    do
        # 4. Check that the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory (resolving symlinks)
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the file name
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo "$file does not exist!"
        fi
    done
done
[root@node1 bin]# chmod 777 xsync
[root@node1 bin]# ll
总用量 4
-rwxrwxrwx 1 atguigu atguigu 727 3月 20 16:00 xsync
[root@node1 bin]# cd ..
[root@node1 atguigu]# xsync bin/
==================== node1 ====================
sending incremental file list
sent 94 bytes received 17 bytes 222.00 bytes/sec
total size is 727 speedup is 6.55
==================== node2 ====================
sending incremental file list
bin/
bin/xsync
sent 871 bytes received 39 bytes 606.67 bytes/sec
total size is 727 speedup is 0.80
==================== node3 ====================
sending incremental file list
bin/
bin/xsync
sent 871 bytes received 39 bytes 1,820.00 bytes/sec
total size is 727 speedup is 0.80
[root@node1 atguigu]# pwd
/home/atguigu
[root@node1 atguigu]# ls -al
总用量 20
drwx------ 6 atguigu atguigu 168 3月 20 15:56 .
drwxr-xr-x. 6 root root 56 3月 20 10:08 ..
-rw-r--r-- 1 root root 0 3月 20 15:44 a.txt
-rw------- 1 atguigu atguigu 21 3月 20 11:48 .bash_history
-rw-r--r-- 1 atguigu atguigu 18 8月 8 2019 .bash_logout
-rw-r--r-- 1 atguigu atguigu 193 8月 8 2019 .bash_profile
-rw-r--r-- 1 atguigu atguigu 231 8月 8 2019 .bashrc
drwxrwxr-x 2 atguigu atguigu 19 3月 20 15:56 bin
drwxrwxr-x 3 atguigu atguigu 18 3月 20 10:17 .cache
drwxrwxr-x 3 atguigu atguigu 18 3月 20 10:17 .config
drwxr-xr-x 4 atguigu atguigu 39 3月 10 20:04 .mozilla
-rw------- 1 atguigu atguigu 1261 3月 20 15:56 .viminfo
[root@node1 atguigu]#
连接成功
Last login: Mon Mar 20 16:01:40 2023
[root@node1 ~]# su atguigu
[atguigu@node1 root]$ cd /home/atguigu/
[atguigu@node1 ~]$ pwd
/home/atguigu
[atguigu@node1 ~]$ xsync bin/
==================== node1 ====================
The authenticity of host 'node1 (192.168.88.151)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node1,192.168.88.151' (ECDSA) to the list of known hosts.
atguigu@node1's password:
atguigu@node1's password:
sending incremental file list
sent 98 bytes received 17 bytes 17.69 bytes/sec
total size is 727 speedup is 6.32
==================== node2 ====================
The authenticity of host 'node2 (192.168.88.152)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node2,192.168.88.152' (ECDSA) to the list of known hosts.
atguigu@node2's password:
atguigu@node2's password:
sending incremental file list
sent 94 bytes received 17 bytes 44.40 bytes/sec
total size is 727 speedup is 6.55
==================== node3 ====================
The authenticity of host 'node3 (192.168.88.153)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node3,192.168.88.153' (ECDSA) to the list of known hosts.
atguigu@node3's password:
atguigu@node3's password:
sending incremental file list
sent 94 bytes received 17 bytes 44.40 bytes/sec
total size is 727 speedup is 6.55
[atguigu@node1 ~]$
----------------------------------------------------------------------------------------
连接成功
Last login: Mon Mar 20 17:22:20 2023 from 192.168.88.151
[root@node2 ~]# su atguigu
[atguigu@node2 root]$ vim /etc/sudoers
您在 /var/spool/mail/root 中有新邮件
[atguigu@node2 root]$ su root
密码:
[root@node2 ~]# vim /etc/sudoers
[root@node2 ~]# cd /opt/
[root@node2 opt]# ll
总用量 0
drwxr-xr-x 4 atguigu atguigu 46 3月 20 11:32 module
drwxr-xr-x. 2 root root 6 10月 31 2018 rh
drwxr-xr-x 2 atguigu atguigu 67 3月 20 10:47 software
[root@node2 opt]# su atguigu
[atguigu@node2 opt]$ cd /home/atguigu/
[atguigu@node2 ~]$ llk
bash: llk: 未找到命令
[atguigu@node2 ~]$ ll
总用量 0
drwxrwxr-x 2 atguigu atguigu 19 3月 20 15:56 bin
[atguigu@node2 ~]$ cd ~
您在 /var/spool/mail/root 中有新邮件
[atguigu@node2 ~]$ ll
总用量 0
drwxrwxr-x 2 atguigu atguigu 19 3月 20 15:56 bin
[atguigu@node2 ~]$ ll
总用量 0
drwxrwxr-x 2 atguigu atguigu 19 3月 20 15:56 bin
您在 /var/spool/mail/root 中有新邮件
[atguigu@node2 ~]$ cd bin
[atguigu@node2 bin]$ ll
总用量 4
-rwxrwxrwx 1 atguigu atguigu 727 3月 20 16:00 xsync
[atguigu@node2 bin]$
----------------------------------------------------------------------------------------
连接成功
Last login: Mon Mar 20 17:22:26 2023 from 192.168.88.152
[root@node3 ~]# vim /etc/sudoers
您在 /var/spool/mail/root 中有新邮件
[root@node3 ~]# cd /opt/
[root@node3 opt]# ll
总用量 0
drwxr-xr-x 4 atguigu atguigu 46 3月 20 11:32 module
drwxr-xr-x. 2 root root 6 10月 31 2018 rh
drwxr-xr-x 2 atguigu atguigu 67 3月 20 10:47 software
[root@node3 opt]# cd ~
您在 /var/spool/mail/root 中有新邮件
[root@node3 ~]# ll
总用量 4
-rw-------. 1 root root 1340 9月 11 2020 anaconda-ks.cfg
-rw------- 1 root root 0 2月 23 16:20 nohup.out
[root@node3 ~]# ll
总用量 4
-rw-------. 1 root root 1340 9月 11 2020 anaconda-ks.cfg
-rw------- 1 root root 0 2月 23 16:20 nohup.out
您在 /var/spool/mail/root 中有新邮件
[root@node3 ~]# cd ~
[root@node3 ~]# ll
总用量 4
-rw-------. 1 root root 1340 9月 11 2020 anaconda-ks.cfg
-rw------- 1 root root 0 2月 23 16:20 nohup.out
[root@node3 ~]# su atguigu
[atguigu@node3 root]$ cd ~
[atguigu@node3 ~]$ ls
bin
[atguigu@node3 ~]$ ll
总用量 0
drwxrwxr-x 2 atguigu atguigu 19 3月 20 15:56 bin
[atguigu@node3 ~]$ cd bin
[atguigu@node3 bin]$ ll
总用量 4
-rwxrwxrwx 1 atguigu atguigu 727 3月 20 16:00 xsync
[atguigu@node3 bin]$
----------------------------------------------------------------------------------------
连接成功
Last login: Mon Mar 20 16:01:40 2023
[root@node1 ~]# su atguigu
[atguigu@node1 root]$ cd /home/atguigu/
[atguigu@node1 ~]$ pwd
/home/atguigu
[atguigu@node1 ~]$ xsync bin/
==================== node1 ====================
The authenticity of host 'node1 (192.168.88.151)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node1,192.168.88.151' (ECDSA) to the list of known hosts.
atguigu@node1's password:
atguigu@node1's password:
sending incremental file list
sent 98 bytes received 17 bytes 17.69 bytes/sec
total size is 727 speedup is 6.32
==================== node2 ====================
The authenticity of host 'node2 (192.168.88.152)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node2,192.168.88.152' (ECDSA) to the list of known hosts.
atguigu@node2's password:
atguigu@node2's password:
sending incremental file list
sent 94 bytes received 17 bytes 44.40 bytes/sec
total size is 727 speedup is 6.55
==================== node3 ====================
The authenticity of host 'node3 (192.168.88.153)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node3,192.168.88.153' (ECDSA) to the list of known hosts.
atguigu@node3's password:
atguigu@node3's password:
sending incremental file list
sent 94 bytes received 17 bytes 44.40 bytes/sec
total size is 727 speedup is 6.55
[atguigu@node1 ~]$ xsync /etc/profile.d/my_env.sh
==================== node1 ====================
atguigu@node1's password:
atguigu@node1's password:
.sending incremental file list
sent 48 bytes received 12 bytes 13.33 bytes/sec
total size is 223 speedup is 3.72
==================== node2 ====================
atguigu@node2's password:
atguigu@node2's password:
sending incremental file list
my_env.sh
rsync: mkstemp "/etc/profile.d/.my_env.sh.guTzvB" failed: Permission denied (13)
sent 95 bytes received 126 bytes 88.40 bytes/sec
total size is 223 speedup is 1.01
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
==================== node3 =========

尚硅谷 Big Data Hadoop Tutorial – Notes 02: HDFS
Contents
P039【039_尚硅谷_Hadoop_HDFS_课程介绍】04:23
P040【040_尚硅谷_Hadoop_HDFS_产生背景和定义】04:11
P041【041_尚硅谷_Hadoop_HDFS_优缺点】05:28
P042【042_尚硅谷_Hadoop_HDFS_组成】09:09
P043【043_尚硅谷_Hadoop_HDFS_文件块大小】08:01
P044【044_尚硅谷_Hadoop_HDFS_Shell命令上传】09:48
P045【045_尚硅谷_Hadoop_HDFS_Shell命令下载&直接操作】16:41
P046【046_尚硅谷_Hadoop_HDFS_API环境准备】08:20
P047【047_尚硅谷_Hadoop_HDFS_API创建文件夹】10:54
P048【048_尚硅谷_Hadoop_HDFS_API上传】06:42
P049【049_尚硅谷_Hadoop_HDFS_API参数的优先级】05:08
P050【050_尚硅谷_Hadoop_HDFS_API文件下载】08:24
P051【051_尚硅谷_Hadoop_HDFS_API文件删除】04:12
P052【052_尚硅谷_Hadoop_HDFS_API文件更名和移动】05:03
P053【053_尚硅谷_Hadoop_HDFS_API文件详情查看】07:57
P054【054_尚硅谷_Hadoop_HDFS_API文件和文件夹判断】03:20
P055【055_尚硅谷_Hadoop_HDFS_写数据流程】11:38
P056【056_尚硅谷_Hadoop_HDFS_节点距离计算】04:31
P057【057_尚硅谷_Hadoop_HDFS_机架感知(副本存储节点选择)】06:07
P058【058_尚硅谷_Hadoop_HDFS_读数据流程】05:04
P059【059_尚硅谷_Hadoop_HDFS_NN和2NN工作机制】13:28
P060【060_尚硅谷_Hadoop_HDFS_FsImage镜像文件】09:33
P061【061_尚硅谷_Hadoop_HDFS_Edits编辑日志】04:49
P062【062_尚硅谷_Hadoop_HDFS_检查点时间设置】
P063【063_尚硅谷_Hadoop_HDFS_DN工作机制】07:36
P064【064_尚硅谷_Hadoop_HDFS_数据完整性】07:07
P065【065_尚硅谷_Hadoop_HDFS_掉线时限参数设置】04:44
P066【066_尚硅谷_Hadoop_HDFS_总结】03:44
03_尚硅谷大数据技术之Hadoop(HDFS)V3.3
P039【039_尚硅谷_Hadoop_HDFS_课程介绍】04:23
P040【040_尚硅谷_Hadoop_HDFS_产生背景和定义】04:11
HDFS definition
HDFS (Hadoop Distributed File System) is a file system for storing files, locating them through a directory tree; it is distributed, with many servers cooperating to provide its functionality, each server playing its own role in the cluster.
HDFS usage scenario: write once, read many times. After a file has been created, written, and closed, it does not need to change.
Data can be appended to a file, but existing data cannot be modified in place.
P041【041_尚硅谷_Hadoop_HDFS_优缺点】05:28
HDFS advantages
- High fault tolerance;
- Suited to processing big data at GB, TB, and even PB scale;
- Can be built from inexpensive commodity machines; reliability comes from the multi-replica mechanism.
HDFS disadvantages
- Not suited to low-latency data access; millisecond-level reads are not achievable;
- Cannot store large numbers of small files efficiently;
- No concurrent writers and no random modification of files; only append is supported.
P042【042_尚硅谷_Hadoop_HDFS_组成】09:09
Official Hadoop documentation: the "Index of /docs" page at https://hadoop.apache.org/docs/
P043【043_尚硅谷_Hadoop_HDFS_文件块大小】08:01
Question: why should the block size be neither too small nor too large?
(1) If HDFS blocks are too small, seek time increases; the program spends its time locating the start of blocks.
(2) If blocks are too large, the time to transfer a block from disk becomes much larger than the time needed to locate its start, and processing that block becomes very slow.
Conclusion: the HDFS block size should be chosen mainly according to the disk transfer rate. A configuration sketch follows.
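As a sketch of how the block size could be raised on fast disks: dfs.blocksize in hdfs-site.xml accepts either a byte count or a suffixed value; 128m is already the Hadoop 3.x default, so a change like this only makes sense when disk throughput justifies it (the value below is illustrative, not from the course).
<!-- hdfs-site.xml -->
<property>
    <name>dfs.blocksize</name>
    <value>256m</value>  <!-- e.g. 256 MB for high-throughput disks; the default is 128m -->
</property>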
P044【044_尚硅谷_Hadoop_HDFS_Shell命令上传】09:48
hadoop fs <command> OR hdfs dfs <command> — the two forms are completely equivalent.
连接成功
Last login: Wed Mar 22 11:45:28 2023 from 192.168.88.1
[atguigu@node1 ~]$ hadoop fs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-v] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] -n name | -d [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-head <file>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]]
[-setfattr -n name [-v value] | -x name <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] [-s <sleep interval>] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines
The general command line syntax is:
command [genericOptions] [commandOptions]
[atguigu@node1 ~]$ hdfs dfs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-v] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] -n name | -d [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-head <file>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]]
[-setfattr -n name [-v value] | -x name <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] [-s <sleep interval>] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines
The general command line syntax is:
command [genericOptions] [commandOptions]
[atguigu@node1 ~]$
1) -moveFromLocal: cut a file from the local file system and paste it into HDFS
[atguigu@hadoop102 hadoop-3.1.3]$ vim shuguo.txt
Content:
shuguo
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -moveFromLocal ./shuguo.txt /sanguo
2) -copyFromLocal: copy a file from the local file system to an HDFS path
[atguigu@hadoop102 hadoop-3.1.3]$ vim weiguo.txt
Content:
weiguo
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -copyFromLocal weiguo.txt /sanguo
3) -put: equivalent to copyFromLocal; put is the form more commonly used in production
[atguigu@hadoop102 hadoop-3.1.3]$ vim wuguo.txt
Content:
wuguo
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -put ./wuguo.txt /sanguo
4) -appendToFile: append a local file to the end of an existing HDFS file
[atguigu@hadoop102 hadoop-3.1.3]$ vim liubei.txt
Content:
liubei
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -appendToFile liubei.txt /sanguo/shuguo.txt
P045【045_尚硅谷_Hadoop_HDFS_Shell命令下载&直接操作】16:41
HDFS direct operations
1) -ls: list directory contents
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /sanguo
2) -cat: print the contents of a file
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cat /sanguo/shuguo.txt
3) -chgrp, -chmod, -chown: same usage as in the Linux file system; change group, permissions, and owner
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -chmod 666 /sanguo/shuguo.txt
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -chown atguigu:atguigu /sanguo/shuguo.txt
4) -mkdir: create a directory
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir /jinguo
5) -cp: copy from one HDFS path to another HDFS path
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp /sanguo/shuguo.txt /jinguo
6) -mv: move files within HDFS
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /sanguo/wuguo.txt /jinguo
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /sanguo/weiguo.txt /jinguo
7) -tail: show the last 1 KB of a file
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -tail /jinguo/shuguo.txt
8) -rm: delete a file or directory
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm /sanguo/shuguo.txt
9) -rm -r: recursively delete a directory and everything in it
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm -r /sanguo
10) -du: show size statistics for a directory
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -du -s -h /jinguo
27 81 /jinguo
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -du -h /jinguo
14 42 /jinguo/shuguo.txt
7 21 /jinguo/weiguo.txt
6 18 /jinguo/wuguo.txt
Explanation: 27 is the total file size; 81 is 27 × 3 replicas; /jinguo is the directory being examined.
11) -setrep: set the replication factor of a file in HDFS
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -setrep 10 /jinguo/shuguo.txt
The replication factor set here is only recorded in the NameNode metadata; whether that many replicas actually exist depends on the number of DataNodes. With only 3 machines there can be at most 3 replicas; only when the cluster grows to 10 nodes can the replication reach 10. A quick way to verify the actual placement is sketched below.
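Two standard HDFS commands can verify the point above (the paths are the ones used in this section):
hadoop fs -ls /jinguo                                     # the second column of each file line is the recorded replication factor
hdfs fsck /jinguo/shuguo.txt -files -blocks -locations    # lists the blocks and the DataNodes that actually hold the replicas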
P046【046_尚硅谷_Hadoop_HDFS_API环境准备】08:20
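The course sets up the client project with Maven. As a minimal sketch of the dependencies that the HdfsClient code below needs (the version numbers are assumptions; they should match your cluster, 3.1.3 in the course documents and 3.3.0 on the note-taker's own cluster):
<!-- pom.xml (inside <dependencies>) -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.3</version>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>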
P047【047_尚硅谷_Hadoop_HDFS_API创建文件夹】10:54
In IDEA, Ctrl+P shows the parameters of the method being called.
package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Common client-code pattern (the same pattern applies to HDFS, ZooKeeper, ...):
 * 1. Obtain a client object
 * 2. Execute the operations
 * 3. Close the resource
 */
public class HdfsClient {

    // Create a directory
    @Test
    public void testMkdir() throws URISyntaxException, IOException, InterruptedException {
        // NameNode address of the cluster to connect to
        URI uri = new URI("hdfs://node1:8020");
        // Create a configuration object
        Configuration configuration = new Configuration();
        // User to operate as
        String user = "atguigu";
        // 1. Obtain the client object
        FileSystem fileSystem = FileSystem.get(uri, configuration, user);
        // 2. Create a directory
        fileSystem.mkdirs(new Path("/xiyou/huaguoshan"));
        // 3. Close the resource
        fileSystem.close();
    }
}
package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Common client-code pattern (the same pattern applies to HDFS, ZooKeeper, ...):
 * 1. Obtain a client object
 * 2. Execute the operations
 * 3. Close the resource
 */
public class HdfsClient {

    private FileSystem fs;

    @Before
    public void init() throws URISyntaxException, IOException, InterruptedException {
        // NameNode address of the cluster to connect to
        URI uri = new URI("hdfs://node1:8020");
        // Create a configuration object
        Configuration configuration = new Configuration();
        configuration.set("dfs.replication", "2");
        // User to operate as
        String user = "atguigu";
        // 1. Obtain the client object
        fs = FileSystem.get(uri, configuration, user);
    }

    @After
    public void close() throws IOException {
        // 3. Close the resource
        fs.close();
    }

    /*
    // Previous standalone version, before init()/close() were factored out:
    @Test
    public void testMkdir() throws URISyntaxException, IOException, InterruptedException {
        URI uri = new URI("hdfs://node1:8020");
        Configuration configuration = new Configuration();
        String user = "atguigu";
        FileSystem fileSystem = FileSystem.get(uri, configuration, user);
        fileSystem.mkdirs(new Path("/xiyou/huaguoshan"));
        fileSystem.close();
    }
    */

    // Create a directory
    @Test
    public void testMkdir() throws URISyntaxException, IOException, InterruptedException {
        // 2. Create a directory (the client object now comes from init())
        fs.mkdirs(new Path("/xiyou/huaguoshan2"));
    }
}
P048【048_尚硅谷_Hadoop_HDFS_API上传】06:42
// The package, imports, and the @Before init() / @After close() methods are identical to the
// HdfsClient class shown above and are omitted here.

// Upload
@Test
public void testPut() throws IOException {
    // Parameters: 1. delete the source? 2. allow overwrite? 3. source (local) path 4. destination (HDFS) path
    fs.copyFromLocalFile(false, true,
            new Path("D:\\bigData\\file\\sunwukong.txt"),
            new Path("hdfs://node1/xiyou/huaguoshan"));
}
P049【049_尚硅谷_Hadoop_HDFS_API参数的优先级】05:08
Copy hdfs-site.xml into the project's resources directory:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value> <!-- replication factor -->
    </property>
</configuration>
Parameter priority, from lowest to highest: hdfs-default.xml => hdfs-site.xml on the server => the configuration file in the project's resources directory => values set in code. Put differently, from highest to lowest:
- Values set in the client code
- User-defined configuration files on the ClassPath
- The server's custom configuration (xxx-site.xml)
- The server's default configuration (xxx-default.xml)
A small check of which value actually took effect is sketched after this list.
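To see which source won, one can ask the NameNode what replication factor it actually recorded for an uploaded file. This sketch reuses the fs field of the HdfsClient class above and the path from the upload test; it needs an extra import of org.apache.hadoop.fs.FileStatus.
// Check the replication factor the NameNode recorded for a file uploaded by testPut()
@Test
public void testEffectiveReplication() throws IOException {
    FileStatus status = fs.getFileStatus(new Path("/xiyou/huaguoshan/sunwukong.txt"));
    // Prints 1, 2, or 3 depending on whether resources/hdfs-site.xml, the code, or the server config won
    System.out.println("replication = " + status.getReplication());
}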
P050【050_尚硅谷_Hadoop_HDFS_API文件下载】08:24
// Download
@Test
public void testGet() throws IOException {
    // Parameters: 1. delete the source? 2. source HDFS path 3. destination path on Windows 4. use the raw local file system (true = do not write .crc checksum files)
    // fs.copyToLocalFile(false, new Path("hdfs://node1/xiyou/huaguoshan2/sunwukong.txt"), new Path("D:\\bigData\\file\\download"), false);
    fs.copyToLocalFile(false, new Path("hdfs://node1/xiyou/huaguoshan2/"), new Path("D:\\bigData\\file\\download"), false);
    // fs.copyToLocalFile(false, new Path("hdfs://node1/a.txt"), new Path("D:\\"), false);
}
P051【051_尚硅谷_Hadoop_HDFS_API文件删除】04:12
// Delete
@Test
public void testRm() throws IOException {
    // Parameters: 1. the path to delete 2. recursive?
    // Delete a file
    // fs.delete(new Path("/jdk-8u212-linux-x64.tar.gz"), false);
    // Delete an empty directory
    // fs.delete(new Path("/xiyou"), false);
    // Delete a non-empty directory (recursive must be true)
    fs.delete(new Path("/jinguo"), true);
}
P052【052_尚硅谷_Hadoop_HDFS_API文件更名和移动】05:03
// Rename and move files
@Test
public void testmv() throws IOException {
    // Parameters: 1. source path 2. target path
    // Rename a file
    fs.rename(new Path("/input/word.txt"), new Path("/input/ss.txt"));
    // Move and rename a file
    fs.rename(new Path("/input/ss.txt"), new Path("/cls.txt"));
    // Rename a directory
    fs.rename(new Path("/input"), new Path("/output"));
}
P053【053_尚硅谷_Hadoop_HDFS_API文件详情查看】07:57
// Get detailed file information
// (needs imports: org.apache.hadoop.fs.RemoteIterator, LocatedFileStatus, BlockLocation and java.util.Arrays)
@Test
public void fileDetail() throws IOException {
    // List all files recursively, starting from the root
    RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);
    // Iterate over the files
    while (listFiles.hasNext()) {
        LocatedFileStatus fileStatus = listFiles.next();
        System.out.println("========== " + fileStatus.getPath() + " ==========");
        System.out.println(fileStatus.getPermission());
        System.out.println(fileStatus.getOwner());
        System.out.println(fileStatus.getGroup());
        System.out.println(fileStatus.getLen());
        System.out.println(fileStatus.getModificationTime());
        System.out.println(fileStatus.getReplication());
        System.out.println(fileStatus.getBlockSize());
        System.out.println(fileStatus.getPath().getName());
        // Block location information
        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        System.out.println(Arrays.toString(blockLocations));
    }
}
P054【054_尚硅谷_Hadoop_HDFS_API文件和文件夹判断】03:20
// Determine whether each entry under a path is a file or a directory
@Test
public void testFile() throws IOException {
    FileStatus[] listStatus = fs.listStatus(new Path("/"));
    for (FileStatus status : listStatus) {
        if (status.isFile()) {
            System.out.println("File: " + status.getPath().getName());
        } else {
            System.out.println("Directory: " + status.getPath().getName());
        }
    }
}
P055【055_尚硅谷_Hadoop_HDFS_写数据流程】11:38
The HDFS write path, step by step:
(1) The client asks the NameNode, via the DistributedFileSystem module, to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.
(2) The NameNode replies whether the upload may proceed.
(3) The client asks which DataNodes the first block should be uploaded to.
(4) The NameNode returns 3 DataNodes: dn1, dn2, dn3.
(5) Through FSDataOutputStream the client asks dn1 to receive the data; dn1 forwards the request to dn2, which forwards it to dn3, establishing the pipeline.
(6) dn1, dn2, and dn3 acknowledge back along the pipeline to the client.
(7) The client starts sending the first block to dn1 (reading from disk into a local in-memory buffer) packet by packet; dn1 passes each packet on to dn2, and dn2 to dn3; for every packet it sends, dn1 places the packet in an ack queue to wait for acknowledgement.
(8) When one block has been transferred, the client asks the NameNode again for the DataNodes of the next block (steps 3-7 repeat).
A minimal client-side write is sketched below.
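From the client's point of view the whole pipeline is hidden behind an output stream. A minimal sketch, reusing the fs field from the HdfsClient class above (the path and the extra imports org.apache.hadoop.fs.FSDataOutputStream and java.nio.charset.StandardCharsets are assumptions for illustration):
// create() asks the NameNode for the target DataNodes; writes are then pushed through the DataNode pipeline packet by packet
@Test
public void testWriteStream() throws IOException {
    try (FSDataOutputStream out = fs.create(new Path("/xiyou/stream.txt"))) {
        out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }
}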
P056【056_尚硅谷_Hadoop_HDFS_节点距离计算】04:31
P057【057_尚硅谷_Hadoop_HDFS_机架感知(副本存储节点选择)】06:07
Apache Hadoop 3.1.3 – HDFS Architecture
- The first replica is placed on the node where the client runs; if the client is outside the cluster, a node is chosen at random.
- The second replica is placed on a random node in a different rack.
- The third replica is placed on a random node in the same rack as the second replica.
P058【058_尚硅谷_Hadoop_HDFS_读数据流程】05:04
(1) The client asks the NameNode, via DistributedFileSystem, to download a file; the NameNode looks up its metadata and returns the addresses of the DataNodes holding the file's blocks.
(2) The client picks a DataNode (nearest first, then at random) and requests the data.
(3) The DataNode streams the data to the client (reading from disk as an input stream and checksumming in packet-sized units).
(4) The client receives the data packet by packet, buffers it locally, and then writes it into the target file.
A matching client-side read is sketched below.
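The client-side counterpart of the write sketch above, again reusing the fs field from HdfsClient (the path and the imports org.apache.hadoop.fs.FSDataInputStream and org.apache.hadoop.io.IOUtils are assumptions for illustration):
// open() fetches the block locations from the NameNode; the stream then reads from the nearest DataNode block by block
@Test
public void testReadStream() throws IOException {
    try (FSDataInputStream in = fs.open(new Path("/xiyou/stream.txt"))) {
        IOUtils.copyBytes(in, System.out, 1024, false);
    }
}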
P059【059_尚硅谷_Hadoop_HDFS_NN和2NN工作机制】13:28
Chapter 5: how the NameNode and SecondaryNameNode work (see the diagram in the course document).
P060【060_尚硅谷_Hadoop_HDFS_FsImage镜像文件】09:33
1) Inspecting an fsimage file with oiv
(1) The oiv and oev subcommands:
[atguigu@hadoop102 current]$ hdfs
oiv    apply the offline fsimage viewer to an fsimage
oev    apply the offline edits viewer to an edits file
(2) Syntax:
hdfs oiv -p <output type> -i <fsimage file> -o <output path for the converted file>
(3) Example:
[atguigu@hadoop102 current]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/name/current
[atguigu@hadoop102 current]$ hdfs oiv -p XML -i fsimage_0000000000000000025 -o /opt/module/hadoop-3.1.3/fsimage.xml
[atguigu@hadoop102 current]$ cat /opt/module/hadoop-3.1.3/fsimage.xml
Copy the displayed XML into a file created in IDEA and reformat it. Part of the result is shown below.
<inode>
<id>16386</id>
<type>DIRECTORY</type>
<name>user</name>
<mtime>1512722284477</mtime>
<permission>atguigu:supergroup:rwxr-xr-x</permission>
<nsquota>-1</nsquota>
<dsquota>-1</dsquota>
</inode>
<inode>
<id>16387</id>
<type>DIRECTORY</type>
<name>atguigu</name>
<mtime>1512790549080</mtime>
<permission>atguigu:supergroup:rwxr-xr-x</permission>
<nsquota>-1</nsquota>
<dsquota>-1</dsquota>
</inode>
<inode>
<id>16389</id>
<type>FILE</type>
<name>wc.input</name>
<replication>3</replication>
<mtime>1512722322219</mtime>
<atime>1512722321610</atime>
<preferredBlockSize>134217728</preferredBlockSize>
<permission>atguigu:supergroup:rw-r--r--</permission>
<blocks>
<block>
<id>1073741825</id>
<genstamp>1001</genstamp>
<numBytes>59</numBytes>
</block>
</blocks>
</inode>
Question: notice that the fsimage does not record which DataNodes hold each block. Why not?
Because when the cluster starts, the DataNodes are required to report their block lists to the NameNode, and to report them again at regular intervals afterwards.
P061【061_尚硅谷_Hadoop_HDFS_Edits编辑日志】04:49
2) Inspecting an edits file with oev
(1) Syntax:
hdfs oev -p <output type> -i <edits file> -o <output path for the converted file>
(2) Example:
[atguigu@hadoop102 current]$ hdfs oev -p XML -i edits_0000000000000000012-0000000000000000013 -o /opt/module/hadoop-3.1.3/edits.xml
[atguigu@hadoop102 current]$ cat /opt/module/hadoop-3.1.3/edits.xml
Copy the displayed XML into a file created in IDEA and reformat it. The result looks like this.
<?xml version="1.0" encoding="UTF-8"?>
<EDITS>
<EDITS_VERSION>-63</EDITS_VERSION>
<RECORD>
<OPCODE>OP_START_LOG_SEGMENT</OPCODE>
<DATA>
<TXID>129</TXID>
</DATA>
</RECORD>
<RECORD>