尚硅谷大数据Hadoop教程-笔记01入门

Posted 延锋L


Video: 尚硅谷大数据Hadoop教程(Hadoop 3.x安装搭建到集群调优)

  1. 尚硅谷大数据Hadoop教程-笔记01【入门】
  2. 尚硅谷大数据Hadoop教程-笔记02【HDFS】
  3. 尚硅谷大数据Hadoop教程-笔记03【MapReduce】
  4. 尚硅谷大数据Hadoop教程-笔记04【Yarn】
  5. 尚硅谷大数据Hadoop教程-笔记05【生产调优手册】
  6. 尚硅谷大数据Hadoop教程-笔记06【源码解析】

Contents

00_尚硅谷大数据Hadoop课程整体介绍

P001【001_尚硅谷_Hadoop_开篇_课程整体介绍】08:38

01_尚硅谷大数据技术之大数据概论

P002【002_尚硅谷_Hadoop_概论_大数据的概念】04:34

P003【003_尚硅谷_Hadoop_概论_大数据的特点】07:23

P004【004_尚硅谷_Hadoop_概论_大数据的应用场景】09:58

P005【005_尚硅谷_Hadoop_概论_大数据的发展场景】08:17

P006【006_尚硅谷_Hadoop_概论_未来工作内容】06:25

02_尚硅谷大数据技术之Hadoop(入门)V3.3

P007【007_尚硅谷_Hadoop_入门_课程介绍】07:29

P008【008_尚硅谷_Hadoop_入门_Hadoop是什么】03:00

P009【009_尚硅谷_Hadoop_入门_Hadoop发展历史】05:52

P010【010_尚硅谷_Hadoop_入门_Hadoop三大发行版本】05:59

P011【011_尚硅谷_Hadoop_入门_Hadoop优势】03:52

P012【012_尚硅谷_Hadoop_入门_Hadoop1.x2.x3.x区别】03:00

P013【013_尚硅谷_Hadoop_入门_HDFS概述】06:26

P014【014_尚硅谷_Hadoop_入门_YARN概述】06:35

P015【015_尚硅谷_Hadoop_入门_MapReduce概述】01:55

P016【016_尚硅谷_Hadoop_入门_HDFS&YARN&MR关系】03:22

P017【017_尚硅谷_Hadoop_入门_大数据技术生态体系】09:17

P018【018_尚硅谷_Hadoop_入门_VMware安装】04:41

P019【019_尚硅谷_Hadoop_入门_Centos7.5软硬件安装】15:56

P020【020_尚硅谷_Hadoop_入门_IP和主机名称配置】10:50

P021【021_尚硅谷_Hadoop_入门_Xshell远程访问工具】09:05

P022【022_尚硅谷_Hadoop_入门_模板虚拟机准备完成】12:25

P023【023_尚硅谷_Hadoop_入门_克隆三台虚拟机】15:01

P024【024_尚硅谷_Hadoop_入门_JDK安装】07:02

P025【025_尚硅谷_Hadoop_入门_Hadoop安装】07:20

P026【026_尚硅谷_Hadoop_入门_本地运行模式】11:56

P027【027_尚硅谷_Hadoop_入门_scp&rsync命令讲解】15:01

P028【028_尚硅谷_Hadoop_入门_xsync分发脚本】18:14

P029【029_尚硅谷_Hadoop_入门_ssh免密登录】11:25

P030【030_尚硅谷_Hadoop_入门_集群配置】13:24

P031【031_尚硅谷_Hadoop_入门_群起集群并测试】16:52

P032【032_尚硅谷_Hadoop_入门_集群崩溃处理办法】08:10

P033【033_尚硅谷_Hadoop_入门_历史服务器配置】05:26

P034【034_尚硅谷_Hadoop_入门_日志聚集功能配置】05:42

P035【035_尚硅谷_Hadoop_入门_两个常用脚本】09:18

P036【036_尚硅谷_Hadoop_入门_两道面试题】04:15

P037【037_尚硅谷_Hadoop_入门_集群时间同步】11:27

P038【038_尚硅谷_Hadoop_入门_常见问题总结】10:57


00_尚硅谷大数据Hadoop课程整体介绍

P001【001_尚硅谷_Hadoop_开篇_课程整体介绍】08:38

Hadoop 3.x, from beginner to expert

1. Key upgrades in this edition of the course
    1) YARN
    2) Production tuning handbook
    3) Source-code analysis
2. What makes the course different
    1) New: Hadoop 3.1.3
    2) Detailed: starts from building the cluster, with every configuration item and every line of code commented
    3) Real: 20+ enterprise cases, 30+ enterprise tuning topics, source code read from a code base of millions of lines
    4) Complete: a full set of materials
3. How to get the materials
    1) Follow the 尚硅谷教育 WeChat official account and reply "大数据"
    2) 谷粒学院
    3) Bilibili
4. Prerequisites
    Java SE, Maven + IDEA + common Linux commands

01_尚硅谷大数据技术之大数据概论

P002【002_尚硅谷_Hadoop_概论_大数据的概念】04:34

Chapter 1. The concept of big data. Big Data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame; they are massive, fast-growing, and diverse information assets that require new processing models to deliver stronger decision-making power, insight, and process-optimization capability.

Big data mainly addresses the collection, storage, and analysis/computation of massive amounts of data.

P003【003_尚硅谷_Hadoop_概论_大数据的特点】07:23

Chapter 2. Characteristics of big data (the 4 Vs)

  1. Volume (large amounts of data)
  2. Velocity (high speed)
  3. Variety (many data types)
  4. Value (low value density)

P004【004_尚硅谷_Hadoop_概论_大数据的应用场景】09:58

Chapter 3. Big data application scenarios

  1. Douyin: recommends videos you are likely to enjoy.
  2. On-site e-commerce advertising: recommends products a user may want to buy.
  3. Retail: analyzes purchasing habits to make shopping more convenient and lift sales.
  4. Logistics and warehousing: JD Logistics, order in the morning and receive in the afternoon; order in the afternoon and receive the next morning.
  5. Insurance: mining massive data and predicting risk supports precise marketing and finer-grained pricing.
  6. Finance: multi-dimensional user profiles help institutions find high-quality customers and guard against fraud.
  7. Real estate: big data drives precise investment and marketing, choosing better land, building more suitable properties, and selling them to the right buyers.
  8. AI + 5G + IoT + virtual/augmented reality.

P005【005_尚硅谷_Hadoop_概论_大数据的发展场景】08:17

Chapter 4. The development prospects of big data.

P006【006_尚硅谷_Hadoop_概论_未来工作内容】06:25

Chapter 5. How work flows between the big data department and the other departments.

Chapter 6. How a big data department is organized internally.

02_尚硅谷大数据技术之Hadoop(入门)V3.3

P007【007_尚硅谷_Hadoop_入门_课程介绍】07:29

P008【008_尚硅谷_Hadoop_入门_Hadoop是什么】03:00

P009【009_尚硅谷_Hadoop_入门_Hadoop发展历史】05:52

P010【010_尚硅谷_Hadoop_入门_Hadoop三大发行版本】05:59

The three major Hadoop distributions: Apache, Cloudera, and Hortonworks.

1) Apache Hadoop

Official site: http://hadoop.apache.org

Downloads: https://hadoop.apache.org/releases.html

2) Cloudera Hadoop

Official site: https://www.cloudera.com/downloads/cdh

Downloads: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_6_download.html

(1) Founded in 2008, Cloudera was the first company to commercialize Hadoop, providing partners with commercial Hadoop solutions, mainly support, consulting, and training.

(2) In 2009 Doug Cutting, the creator of Hadoop, joined Cloudera. Cloudera's main products are CDH, Cloudera Manager, and Cloudera Support.

(3) CDH is Cloudera's Hadoop distribution. It is fully open source and improves on Apache Hadoop in compatibility, security, and stability. Cloudera's list price is USD 10,000 per node per year.

(4) Cloudera Manager is a platform for distributing, managing, and monitoring the cluster software; it can deploy a Hadoop cluster within a few hours and monitor the nodes and services in real time.

3) Hortonworks Hadoop

Official site: https://hortonworks.com/products/data-center/hdp/

Downloads: https://hortonworks.com/downloads/#data-platform

(1) Founded in 2011, Hortonworks was a joint venture between Yahoo and the Silicon Valley venture firm Benchmark Capital.

(2) At its founding it took on roughly 25 to 30 Yahoo engineers dedicated to Hadoop; these engineers had been helping Yahoo develop Hadoop since 2005 and had contributed about 80% of its code.

(3) Hortonworks' flagship product is the Hortonworks Data Platform (HDP), also 100% open source; besides the usual projects, HDP includes Ambari, an open-source installation and management system.

(4) In 2018 Hortonworks was acquired by Cloudera.

P011【011_尚硅谷_Hadoop_入门_Hadoop优势】03:52

Hadoop's advantages (the four Highs)

  1. High reliability
  2. High scalability
  3. High efficiency
  4. High fault tolerance

P012【012_尚硅谷_Hadoop_入门_Hadoop1.x2.x3.x区别】03:00

P013【013_尚硅谷_Hadoop_入门_HDFS概述】06:26

Hadoop Distributed File System (HDFS) is a distributed file system.

  • 1) NameNode (nn): stores file metadata, such as file names, the directory structure, file attributes (creation time, replica count, permissions), and the block list of each file together with the DataNodes that hold those blocks.
  • 2) DataNode (dn): stores the block data, and the checksums of those blocks, on the local file system.
  • 3) Secondary NameNode (2nn): periodically backs up the NameNode metadata.

P014【014_尚硅谷_Hadoop_入门_YARN概述】06:35

YARN (Yet Another Resource Negotiator) is Hadoop's resource manager.

P015【015_尚硅谷_Hadoop_入门_MapReduce概述】01:55

MapReduce splits the computation into two phases, Map and Reduce (a small shell analogy follows the list below).

  • 1) The Map phase processes the input data in parallel.
  • 2) The Reduce phase aggregates the results of the Map phase.
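As a loose analogy only (ordinary shell tools, not Hadoop; words.txt is an assumed sample file), a word count can be split the same way: a "map" step that emits one word per line, a sort that groups equal keys, and a "reduce" step that counts each group:

# map: one word per line; shuffle: sort groups identical words; reduce: count each group
cat words.txt | tr -s ' ' '\n' | sort | uniq -c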

P016【016_尚硅谷_Hadoop_入门_HDFS&YARN&MR关系】03:22

  1. HDFS
    1. NameNode: manages the metadata, i.e. it records which DataNodes store each block.
    2. DataNode: actually stores the block data on its node.
    3. SecondaryNameNode: the NameNode's "secretary"; it periodically backs up NameNode metadata and can take over part of the NameNode's work during recovery (it is not a hot standby).
  2. YARN: resource management for the whole cluster.
    1. ResourceManager: manages and schedules the resources of the entire cluster.
    2. NodeManager: manages the resources of a single node.
  3. MapReduce: the computation framework; its Map and Reduce tasks run in containers allocated by YARN and read and write their data on HDFS.

P017【017_尚硅谷_Hadoop_入门_大数据技术生态体系】09:17

Big data technology ecosystem (figure in the course material).

Recommendation-system project framework (figure in the course material).

P018【018_尚硅谷_Hadoop_入门_VMware安装】04:41

 

P019【019_尚硅谷_Hadoop_入门_Centos7.5软硬件安装】15:56

P020【020_尚硅谷_Hadoop_入门_IP和主机名称配置】10:50

[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
[root@hadoop100 ~]# ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.88.133  netmask 255.255.255.0  broadcast 192.168.88.255
        inet6 fe80::363b:8659:c323:345d  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:0f:0a:6d  txqueuelen 1000  (Ethernet)
        RX packets 684561  bytes 1003221355 (956.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 53538  bytes 3445292 (3.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 84  bytes 9492 (9.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 84  bytes 9492 (9.2 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:1c:3c:a9  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@hadoop100 ~]# systemctl restart network
[root@hadoop100 ~]# cat /etc/host
cat: /etc/host: 没有那个文件或目录
[root@hadoop100 ~]# cat /etc/hostname
hadoop100
[root@hadoop100 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@hadoop100 ~]# vim /etc/hosts
[root@hadoop100 ~]# ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.88.100  netmask 255.255.255.0  broadcast 192.168.88.255
        inet6 fe80::363b:8659:c323:345d  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:0f:0a:6d  txqueuelen 1000  (Ethernet)
        RX packets 684830  bytes 1003244575 (956.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 53597  bytes 3452600 (3.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 132  bytes 14436 (14.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 132  bytes 14436 (14.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:1c:3c:a9  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@hadoop100 ~]# ll
总用量 40
-rw-------. 1 root root 1973 3月  14 10:19 anaconda-ks.cfg
-rw-r--r--. 1 root root 2021 3月  14 10:26 initial-setup-ks.cfg
drwxr-xr-x. 2 root root 4096 3月  14 10:27 公共
drwxr-xr-x. 2 root root 4096 3月  14 10:27 模板
drwxr-xr-x. 2 root root 4096 3月  14 10:27 视频
drwxr-xr-x. 2 root root 4096 3月  14 10:27 图片
drwxr-xr-x. 2 root root 4096 3月  14 10:27 文档
drwxr-xr-x. 2 root root 4096 3月  14 10:27 下载
drwxr-xr-x. 2 root root 4096 3月  14 10:27 音乐
drwxr-xr-x. 2 root root 4096 3月  14 10:27 桌面
[root@hadoop100 ~]# 

vim /etc/sysconfig/network-scripts/ifcfg-ens33

TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="3241b48d-3234-4c23-8a03-b9b393a99a65"
DEVICE="ens33"
ONBOOT="yes"

IPADDR=192.168.88.100
GATEWAY=192.168.88.2
DNS1=192.168.88.2

vim /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.88.100 hadoop100
192.168.88.101 hadoop101
192.168.88.102 hadoop102
192.168.88.103 hadoop103
192.168.88.104 hadoop104
192.168.88.105 hadoop105
192.168.88.106 hadoop106
192.168.88.107 hadoop107
192.168.88.108 hadoop108

192.168.88.151 node1 node1.itcast.cn
192.168.88.152 node2 node2.itcast.cn
192.168.88.153 node3 node3.itcast.cn

P021【021_尚硅谷_Hadoop_入门_Xshell远程访问工具】09:05

P022【022_尚硅谷_Hadoop_入门_模板虚拟机准备完成】12:25

yum install -y epel-release

systemctl stop firewalld

systemctl disable firewalld.service
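To double-check the firewall state after these commands (a verification step assumed here, not shown in the transcript):

systemctl status firewalld        # should report inactive (dead)
systemctl is-enabled firewalld    # should print disabled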

P023【023_尚硅谷_Hadoop_入门_克隆三台虚拟机】15:01

vim /etc/sysconfig/network-scripts/ifcfg-ens33

vim /etc/hostname

reboot
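For each clone, those two files get node-specific values before the reboot. A sketch for one clone (the 192.168.88.x addressing reuses the hosts file above, and the hadoop102 name follows the course's cluster plan, so treat the exact values as assumptions):

vim /etc/sysconfig/network-scripts/ifcfg-ens33   # change only: IPADDR=192.168.88.102
vim /etc/hostname                                # replace the contents with: hadoop102
reboot                                           # apply the new IP and hostname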

P024【024_尚硅谷_Hadoop_入门_JDK安装】07:02

Install the JDK on hadoop102 first, then copy the installation to hadoop103 and hadoop104.
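A minimal sketch of those steps, assuming the /opt/software and /opt/module layout and the jdk-8u212 tarball name that appear elsewhere in these notes (adjust to the actual files):

# on hadoop102: unpack the JDK and register JAVA_HOME
tar -zxvf /opt/software/jdk-8u212-linux-x64.tar.gz -C /opt/module/
sudo vim /etc/profile.d/my_env.sh    # add: export JAVA_HOME=/opt/module/jdk1.8.0_212
                                     #      export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
java -version
# copy the same installation to the other two nodes
scp -r /opt/module/jdk1.8.0_212 atguigu@hadoop103:/opt/module/
scp -r /opt/module/jdk1.8.0_212 atguigu@hadoop104:/opt/module/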

P025【025_尚硅谷_Hadoop_入门_Hadoop安装】07:20

Same figure as P024: Hadoop is installed on hadoop102 and then copied to hadoop103 and hadoop104.
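The Hadoop install follows the same pattern as the JDK; a sketch assuming the hadoop-3.1.3 paths used later in these notes:

tar -zxvf /opt/software/hadoop-3.1.3.tar.gz -C /opt/module/
sudo vim /etc/profile.d/my_env.sh    # add: export HADOOP_HOME=/opt/module/hadoop-3.1.3
                                     #      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
hadoop version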

P026【026_尚硅谷_Hadoop_入门_本地运行模式】11:56

Apache Hadoop

http://node1:9870/explorer.html#/

[root@node1 ~]# cd /export/server/hadoop-3.3.0/share/hadoop/mapreduce/
[root@node1 mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar wordcount /wordcount/input /wordcount/output
2023-03-20 14:43:07,516 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node1/192.168.88.151:8032
2023-03-20 14:43:09,291 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1679293699463_0001
2023-03-20 14:43:11,916 INFO input.FileInputFormat: Total input files to process : 1
2023-03-20 14:43:12,313 INFO mapreduce.JobSubmitter: number of splits:1
2023-03-20 14:43:13,173 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1679293699463_0001
2023-03-20 14:43:13,173 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-03-20 14:43:14,684 INFO conf.Configuration: resource-types.xml not found
2023-03-20 14:43:14,684 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-03-20 14:43:17,054 INFO impl.YarnClientImpl: Submitted application application_1679293699463_0001
2023-03-20 14:43:17,123 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1679293699463_0001/
2023-03-20 14:43:17,124 INFO mapreduce.Job: Running job: job_1679293699463_0001
2023-03-20 14:43:52,340 INFO mapreduce.Job: Job job_1679293699463_0001 running in uber mode : false
2023-03-20 14:43:52,360 INFO mapreduce.Job:  map 0% reduce 0%
2023-03-20 14:44:08,011 INFO mapreduce.Job:  map 100% reduce 0%
2023-03-20 14:44:16,986 INFO mapreduce.Job:  map 100% reduce 100%
2023-03-20 14:44:18,020 INFO mapreduce.Job: Job job_1679293699463_0001 completed successfully
2023-03-20 14:44:18,579 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=31
                FILE: Number of bytes written=529345
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=142
                HDFS: Number of bytes written=17
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=11303
                Total time spent by all reduces in occupied slots (ms)=6220
                Total time spent by all map tasks (ms)=11303
                Total time spent by all reduce tasks (ms)=6220
                Total vcore-milliseconds taken by all map tasks=11303
                Total vcore-milliseconds taken by all reduce tasks=6220
                Total megabyte-milliseconds taken by all map tasks=11574272
                Total megabyte-milliseconds taken by all reduce tasks=6369280
        Map-Reduce Framework
                Map input records=2
                Map output records=5
                Map output bytes=53
                Map output materialized bytes=31
                Input split bytes=108
                Combine input records=5
                Combine output records=2
                Reduce input groups=2
                Reduce shuffle bytes=31
                Reduce input records=2
                Reduce output records=2
                Spilled Records=4
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=546
                CPU time spent (ms)=3680
                Physical memory (bytes) snapshot=499236864
                Virtual memory (bytes) snapshot=5568684032
                Total committed heap usage (bytes)=365953024
                Peak Map Physical memory (bytes)=301096960
                Peak Map Virtual memory (bytes)=2779201536
                Peak Reduce Physical memory (bytes)=198139904
                Peak Reduce Virtual memory (bytes)=2789482496
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=34
        File Output Format Counters 
                Bytes Written=17
[root@node1 mapreduce]#

[root@node1 mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar wordcount /wc_input /wc_output
2023-03-20 15:01:48,007 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node1/192.168.88.151:8032
2023-03-20 15:01:49,475 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1679293699463_0002
2023-03-20 15:01:50,522 INFO input.FileInputFormat: Total input files to process : 1
2023-03-20 15:01:51,010 INFO mapreduce.JobSubmitter: number of splits:1
2023-03-20 15:01:51,894 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1679293699463_0002
2023-03-20 15:01:51,894 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-03-20 15:01:52,684 INFO conf.Configuration: resource-types.xml not found
2023-03-20 15:01:52,687 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-03-20 15:01:53,237 INFO impl.YarnClientImpl: Submitted application application_1679293699463_0002
2023-03-20 15:01:53,487 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1679293699463_0002/
2023-03-20 15:01:53,492 INFO mapreduce.Job: Running job: job_1679293699463_0002
2023-03-20 15:02:15,329 INFO mapreduce.Job: Job job_1679293699463_0002 running in uber mode : false
2023-03-20 15:02:15,342 INFO mapreduce.Job:  map 0% reduce 0%
2023-03-20 15:02:26,652 INFO mapreduce.Job:  map 100% reduce 0%
2023-03-20 15:02:40,297 INFO mapreduce.Job:  map 100% reduce 100%
2023-03-20 15:02:41,350 INFO mapreduce.Job: Job job_1679293699463_0002 completed successfully
2023-03-20 15:02:41,557 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=60
                FILE: Number of bytes written=529375
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=149
                HDFS: Number of bytes written=38
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=8398
                Total time spent by all reduces in occupied slots (ms)=9720
                Total time spent by all map tasks (ms)=8398
                Total time spent by all reduce tasks (ms)=9720
                Total vcore-milliseconds taken by all map tasks=8398
                Total vcore-milliseconds taken by all reduce tasks=9720
                Total megabyte-milliseconds taken by all map tasks=8599552
                Total megabyte-milliseconds taken by all reduce tasks=9953280
        Map-Reduce Framework
                Map input records=4
                Map output records=6
                Map output bytes=69
                Map output materialized bytes=60
                Input split bytes=100
                Combine input records=6
                Combine output records=4
                Reduce input groups=4
                Reduce shuffle bytes=60
                Reduce input records=4
                Reduce output records=4
                Spilled Records=8
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=1000
                CPU time spent (ms)=3880
                Physical memory (bytes) snapshot=503771136
                Virtual memory (bytes) snapshot=5568987136
                Total committed heap usage (bytes)=428343296
                Peak Map Physical memory (bytes)=303013888
                Peak Map Virtual memory (bytes)=2782048256
                Peak Reduce Physical memory (bytes)=200757248
                Peak Reduce Virtual memory (bytes)=2786938880
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=49
        File Output Format Counters 
                Bytes Written=38
[root@node1 mapreduce]# pwd
/export/server/hadoop-3.3.0/share/hadoop/mapreduce
[root@node1 mapreduce]# 

P027【027_尚硅谷_Hadoop_入门_scp&rsync命令讲解】15:01

Use scp for the first full copy; afterwards use rsync to keep the machines in sync.

rsync is mainly used for backup and mirroring; it is fast, avoids copying identical content, and supports symbolic links.

Difference between rsync and scp: rsync is faster than scp because it only updates the files that differ, whereas scp copies every file. Basic syntax of both is sketched below.
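A sketch of the two commands (the example paths and the atguigu@hadoop103 target simply reuse names from these notes):

# scp: secure copy, always transfers everything
scp -r /opt/module/jdk1.8.0_212 atguigu@hadoop103:/opt/module/
# rsync: -a archive mode, -v verbose; only files that differ are transferred
rsync -av /opt/module/hadoop-3.1.3/ atguigu@hadoop103:/opt/module/hadoop-3.1.3/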

P028【028_尚硅谷_Hadoop_入门_xsync分发脚本】18:14

Copy and sync commands

  1. scp (secure copy): secure copy
  2. rsync: remote synchronization tool
  3. xsync: cluster distribution script (built on rsync)

The dirname command strips the file-name part of a path and prints only the directory portion.

[root@node1 ~]# dirname /home/atguigu/a.txt
/home/atguigu
[root@node1 ~]#

The basename command prints just the file-name part of a path.

[root@node1 atguigu]# basename /home/atguigu/a.txt
a.txt
[root@node1 atguigu]#

#!/bin/bash

#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi

#2. Loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ====================  $host  ====================
    #3. Loop over all files/directories given on the command line and send them one by one

    for file in $@
    do
        #4. Check that the file exists
        if [ -e $file ]
            then
                #5. Get the parent directory (resolving symlinks with -P)
                pdir=$(cd -P $(dirname $file); pwd)

                #6. Get the name of the file itself
                fname=$(basename $file)
                ssh $host "mkdir -p $pdir"
                rsync -av $pdir/$fname $host:$pdir
            else
                echo $file does not exist!
        fi
    done
done
[root@node1 bin]# chmod 777 xsync 
[root@node1 bin]# ll
总用量 4
-rwxrwxrwx 1 atguigu atguigu 727 3月  20 16:00 xsync
[root@node1 bin]# cd ..
[root@node1 atguigu]# xsync bin/
==================== node1 ====================
sending incremental file list

sent 94 bytes  received 17 bytes  222.00 bytes/sec
total size is 727  speedup is 6.55
==================== node2 ====================
sending incremental file list
bin/
bin/xsync

sent 871 bytes  received 39 bytes  606.67 bytes/sec
total size is 727  speedup is 0.80
==================== node3 ====================
sending incremental file list
bin/
bin/xsync

sent 871 bytes  received 39 bytes  1,820.00 bytes/sec
total size is 727  speedup is 0.80
[root@node1 atguigu]# pwd
/home/atguigu
[root@node1 atguigu]# ls -al
总用量 20
drwx------  6 atguigu atguigu  168 3月  20 15:56 .
drwxr-xr-x. 6 root    root      56 3月  20 10:08 ..
-rw-r--r--  1 root    root       0 3月  20 15:44 a.txt
-rw-------  1 atguigu atguigu   21 3月  20 11:48 .bash_history
-rw-r--r--  1 atguigu atguigu   18 8月   8 2019 .bash_logout
-rw-r--r--  1 atguigu atguigu  193 8月   8 2019 .bash_profile
-rw-r--r--  1 atguigu atguigu  231 8月   8 2019 .bashrc
drwxrwxr-x  2 atguigu atguigu   19 3月  20 15:56 bin
drwxrwxr-x  3 atguigu atguigu   18 3月  20 10:17 .cache
drwxrwxr-x  3 atguigu atguigu   18 3月  20 10:17 .config
drwxr-xr-x  4 atguigu atguigu   39 3月  10 20:04 .mozilla
-rw-------  1 atguigu atguigu 1261 3月  20 15:56 .viminfo
[root@node1 atguigu]# 
连接成功
Last login: Mon Mar 20 16:01:40 2023
[root@node1 ~]# su atguigu
[atguigu@node1 root]$ cd /home/atguigu/
[atguigu@node1 ~]$ pwd
/home/atguigu
[atguigu@node1 ~]$ xsync bin/
==================== node1 ====================
The authenticity of host 'node1 (192.168.88.151)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node1,192.168.88.151' (ECDSA) to the list of known hosts.
atguigu@node1's password: 
atguigu@node1's password: 
sending incremental file list

sent 98 bytes  received 17 bytes  17.69 bytes/sec
total size is 727  speedup is 6.32
==================== node2 ====================
The authenticity of host 'node2 (192.168.88.152)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node2,192.168.88.152' (ECDSA) to the list of known hosts.
atguigu@node2's password: 
atguigu@node2's password: 
sending incremental file list

sent 94 bytes  received 17 bytes  44.40 bytes/sec
total size is 727  speedup is 6.55
==================== node3 ====================
The authenticity of host 'node3 (192.168.88.153)' can't be established.
ECDSA key fingerprint is SHA256:+eLT3FrOEuEsxBxjOd89raPi/ChJz26WGAfqBpz/KEk.
ECDSA key fingerprint is MD5:18:42:ad:0f:2b:97:d8:b5:68:14:6a:98:e9:72:db:bb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node3,192.168.88.153' (ECDSA) to the list of known hosts.
atguigu@node3's password: 
atguigu@node3's password: 
sending incremental file list

sent 94 bytes  received 17 bytes  44.40 bytes/sec
total size is 727  speedup is 6.55
[atguigu@node1 ~]$ 
----------------------------------------------------------------------------------------
连接成功
Last login: Mon Mar 20 17:22:20 2023 from 192.168.88.151
[root@node2 ~]# su atguigu
[atguigu@node2 root]$ vim /etc/sudoers
您在 /var/spool/mail/root 中有新邮件
[atguigu@node2 root]$ su root
密码:
[root@node2 ~]# vim /etc/sudoers
[root@node2 ~]# cd /opt/
[root@node2 opt]# ll
总用量 0
drwxr-xr-x  4 atguigu atguigu 46 3月  20 11:32 module
drwxr-xr-x. 2 root    root     6 10月 31 2018 rh
drwxr-xr-x  2 atguigu atguigu 67 3月  20 10:47 software
[root@node2 opt]# su atguigu
[atguigu@node2 opt]$ cd /home/atguigu/
[atguigu@node2 ~]$ llk
bash: llk: 未找到命令
[atguigu@node2 ~]$ ll
总用量 0
drwxrwxr-x 2 atguigu atguigu 19 3月  20 15:56 bin
[atguigu@node2 ~]$ cd ~
您在 /var/spool/mail/root 中有新邮件
[atguigu@node2 ~]$ ll
总用量 0
drwxrwxr-x 2 atguigu atguigu 19 3月  20 15:56 bin
[atguigu@node2 ~]$ ll
总用量 0
drwxrwxr-x 2 atguigu atguigu 19 3月  20 15:56 bin
您在 /var/spool/mail/root 中有新邮件
[atguigu@node2 ~]$ cd bin
[atguigu@node2 bin]$ ll
总用量 4
-rwxrwxrwx 1 atguigu atguigu 727 3月  20 16:00 xsync
[atguigu@node2 bin]$ 
----------------------------------------------------------------------------------------
连接成功
Last login: Mon Mar 20 17:22:26 2023 from 192.168.88.152
[root@node3 ~]# vim /etc/sudoers
您在 /var/spool/mail/root 中有新邮件
[root@node3 ~]# cd /opt/
[root@node3 opt]# ll
总用量 0
drwxr-xr-x  4 atguigu atguigu 46 3月  20 11:32 module
drwxr-xr-x. 2 root    root     6 10月 31 2018 rh
drwxr-xr-x  2 atguigu atguigu 67 3月  20 10:47 software
[root@node3 opt]# cd ~
您在 /var/spool/mail/root 中有新邮件
[root@node3 ~]# ll
总用量 4
-rw-------. 1 root root 1340 9月  11 2020 anaconda-ks.cfg
-rw-------  1 root root    0 2月  23 16:20 nohup.out
[root@node3 ~]# ll
总用量 4
-rw-------. 1 root root 1340 9月  11 2020 anaconda-ks.cfg
-rw-------  1 root root    0 2月  23 16:20 nohup.out
您在 /var/spool/mail/root 中有新邮件
[root@node3 ~]# cd ~
[root@node3 ~]# ll
总用量 4
-rw-------. 1 root root 1340 9月  11 2020 anaconda-ks.cfg
-rw-------  1 root root    0 2月  23 16:20 nohup.out
[root@node3 ~]# su atguigu
[atguigu@node3 root]$ cd ~
[atguigu@node3 ~]$ ls
bin
[atguigu@node3 ~]$ ll
总用量 0
drwxrwxr-x 2 atguigu atguigu 19 3月  20 15:56 bin
[atguigu@node3 ~]$ cd bin
[atguigu@node3 bin]$ ll
总用量 4
-rwxrwxrwx 1 atguigu atguigu 727 3月  20 16:00 xsync
[atguigu@node3 bin]$ 
----------------------------------------------------------------------------------------
[atguigu@node1 ~]$ xsync /etc/profile.d/my_env.sh
==================== node1 ====================
atguigu@node1's password: 
atguigu@node1's password: 
.sending incremental file list

sent 48 bytes  received 12 bytes  13.33 bytes/sec
total size is 223  speedup is 3.72
==================== node2 ====================
atguigu@node2's password: 
atguigu@node2's password: 
sending incremental file list
my_env.sh
rsync: mkstemp "/etc/profile.d/.my_env.sh.guTzvB" failed: Permission denied (13)

sent 95 bytes  received 126 bytes  88.40 bytes/sec
total size is 223  speedup is 1.01
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
==================== node3 ====================
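The rsync "Permission denied" on node2 above is expected: /etc/profile.d is owned by root, so the atguigu user cannot write there. A common fix (an assumption here, since the transcript is cut off before any fix is shown) is to run the distribution as root and give sudo the script's full path:

sudo /home/atguigu/bin/xsync /etc/profile.d/my_env.sh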

尚硅谷大数据Hadoop教程-笔记02【HDFS】

Video: 尚硅谷大数据Hadoop教程(Hadoop 3.x安装搭建到集群调优)

  1. 尚硅谷大数据Hadoop教程-笔记01【入门】
  2. 尚硅谷大数据Hadoop教程-笔记02【HDFS】
  3. 尚硅谷大数据Hadoop教程-笔记03【MapReduce】
  4. 尚硅谷大数据Hadoop教程-笔记04【Yarn】
  5. 尚硅谷大数据Hadoop教程-笔记05【生产调优手册】
  6. 尚硅谷大数据Hadoop教程-笔记06【源码解析】

Contents

03_尚硅谷大数据技术之Hadoop(HDFS)V3.3

P039【039_尚硅谷_Hadoop_HDFS_课程介绍】04:23

P040【040_尚硅谷_Hadoop_HDFS_产生背景和定义】04:11

P041【041_尚硅谷_Hadoop_HDFS_优缺点】05:28

P042【042_尚硅谷_Hadoop_HDFS_组成】09:09

P043【043_尚硅谷_Hadoop_HDFS_文件块大小】08:01

P044【044_尚硅谷_Hadoop_HDFS_Shell命令上传】09:48

P045【045_尚硅谷_Hadoop_HDFS_Shell命令下载&直接操作】16:41

P046【046_尚硅谷_Hadoop_HDFS_API环境准备】08:20

P047【047_尚硅谷_Hadoop_HDFS_API创建文件夹】10:54

P048【048_尚硅谷_Hadoop_HDFS_API上传】06:42

P049【049_尚硅谷_Hadoop_HDFS_API参数的优先级】05:08

P050【050_尚硅谷_Hadoop_HDFS_API文件下载】08:24

P051【051_尚硅谷_Hadoop_HDFS_API文件删除】04:12

P052【052_尚硅谷_Hadoop_HDFS_API文件更名和移动】05:03

P053【053_尚硅谷_Hadoop_HDFS_API文件详情查看】07:57

P054【054_尚硅谷_Hadoop_HDFS_API文件和文件夹判断】03:20

P055【055_尚硅谷_Hadoop_HDFS_写数据流程】11:38

P056【056_尚硅谷_Hadoop_HDFS_节点距离计算】04:31

P057【057_尚硅谷_Hadoop_HDFS_机架感知(副本存储节点选择)】06:07

P058【058_尚硅谷_Hadoop_HDFS_读数据流程】05:04

P059【059_尚硅谷_Hadoop_HDFS_NN和2NN工作机制】13:28

P060【060_尚硅谷_Hadoop_HDFS_FsImage镜像文件】09:33

P061【061_尚硅谷_Hadoop_HDFS_Edits编辑日志】04:49

P062【062_尚硅谷_Hadoop_HDFS_检查点时间设置】

P063【063_尚硅谷_Hadoop_HDFS_DN工作机制】07:36

P064【064_尚硅谷_Hadoop_HDFS_数据完整性】07:07

P065【065_尚硅谷_Hadoop_HDFS_掉线时限参数设置】04:44

P066【066_尚硅谷_Hadoop_HDFS_总结】03:44


03_尚硅谷大数据技术之Hadoop(HDFS)V3.3

P039【039_尚硅谷_Hadoop_HDFS_课程介绍】04:23

 

P040【040_尚硅谷_Hadoop_HDFS_产生背景和定义】04:11

HDFS definition

HDFS (Hadoop Distributed File System) is a file system for storing files, locating them through a directory tree. It is distributed: many servers work together to provide it, and each server in the cluster plays its own role.

HDFS use case: write once, read many. Once a file has been created, written, and closed, it does not need to change.

Data can be appended to a file, but existing data cannot be modified in place.

P041【041_尚硅谷_Hadoop_HDFS_优缺点】05:28

HDFS advantages

  1. High fault tolerance;
  2. Suited to processing big data sets, on the order of GB, TB, or even PB;
  3. Can be built on cheap machines, with reliability provided by keeping multiple replicas.

HDFS disadvantages

  1. Not suited to low-latency access; millisecond-level reads, for example, are not possible;
  2. Cannot store large numbers of small files efficiently;
  3. No concurrent writes and no random modification of files; only append is supported.

P042【042_尚硅谷_Hadoop_HDFS_组成】09:09

Official Hadoop documentation site: Index of /docs

P043【043_尚硅谷_Hadoop_HDFS_文件块大小】08:01

Question: why can the block size be set neither too small nor too large?

(1) If HDFS blocks are too small, seek time goes up: the program spends its time locating the start of blocks.

(2) If blocks are too large, the time to transfer a block from disk becomes much larger than the time needed to locate its start, and processing that block becomes very slow.

Summary: the HDFS block size is determined mainly by the disk transfer rate. (A rough worked example follows.)
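As a rough worked example, using the commonly cited rule of thumb (an assumption, not stated above) that seek time should be about 1% of transfer time: a 10 ms seek implies roughly 1 s of transfer; at a typical disk rate of about 100 MB/s that is about 100 MB per block, which is why the default block size is 128 MB, with 256 MB common on faster disks.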

P044【044_尚硅谷_Hadoop_HDFS_Shell命令上传】09:48

hadoop fs <command> and hdfs dfs <command> are completely equivalent; both forms work.

连接成功
Last login: Wed Mar 22 11:45:28 2023 from 192.168.88.1
[atguigu@node1 ~]$ hadoop fs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
        [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
        [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] [-v] [-x] <path> ...]
        [-expunge]
        [-find <path> ... <expression> ...]
        [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] -n name | -d [-e en] <path>]
        [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
        [-head <file>]
        [-help [cmd ...]]
        [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]]
        [-setfattr -n name [-v value] | -x name <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] [-s <sleep interval>] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
        [-touchz <path> ...]
        [-truncate [-w] <length> <path> ...]
        [-usage [cmd ...]]

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

[atguigu@node1 ~]$ hdfs dfs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
        [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
        [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] [-v] [-x] <path> ...]
        [-expunge]
        [-find <path> ... <expression> ...]
        [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] -n name | -d [-e en] <path>]
        [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
        [-head <file>]
        [-help [cmd ...]]
        [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]]
        [-setfattr -n name [-v value] | -x name <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] [-s <sleep interval>] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
        [-touchz <path> ...]
        [-truncate [-w] <length> <path> ...]
        [-usage [cmd ...]]

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

[atguigu@node1 ~]$ 

1) -moveFromLocal: cut a local file and paste it into HDFS

[atguigu@hadoop102 hadoop-3.1.3]$ vim shuguo.txt

Enter:

shuguo

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs  -moveFromLocal  ./shuguo.txt  /sanguo

2) -copyFromLocal: copy a file from the local file system to an HDFS path

[atguigu@hadoop102 hadoop-3.1.3]$ vim weiguo.txt

Enter:

weiguo

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -copyFromLocal weiguo.txt /sanguo

3) -put: same as copyFromLocal; put is the form usually used in production

[atguigu@hadoop102 hadoop-3.1.3]$ vim wuguo.txt

Enter:

wuguo

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -put ./wuguo.txt /sanguo

4) -appendToFile: append a file to the end of a file that already exists

[atguigu@hadoop102 hadoop-3.1.3]$ vim liubei.txt

Enter:

liubei

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -appendToFile liubei.txt /sanguo/shuguo.txt

P045【045_尚硅谷_Hadoop_HDFS_Shell命令下载&直接操作】16:41

Operations directly on HDFS

1) -ls: list directory contents

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /sanguo

2) -cat: print a file's contents

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cat /sanguo/shuguo.txt

3) -chgrp, -chmod, -chown: same usage as in the Linux file system; change a file's permissions and ownership

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs  -chmod 666  /sanguo/shuguo.txt

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs  -chown  atguigu:atguigu   /sanguo/shuguo.txt

4) -mkdir: create a directory

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir /jinguo

5) -cp: copy from one HDFS path to another HDFS path

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp /sanguo/shuguo.txt /jinguo

6) -mv: move files within HDFS

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /sanguo/wuguo.txt /jinguo

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /sanguo/weiguo.txt /jinguo

7) -tail: show the last 1 KB of a file

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -tail /jinguo/shuguo.txt

8) -rm: delete a file or directory

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm /sanguo/shuguo.txt

9) -rm -r: recursively delete a directory and everything in it

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm -r /sanguo

10) -du: report the size of a directory

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -du -s -h /jinguo

27  81  /jinguo

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -du  -h /jinguo

14  42  /jinguo/shuguo.txt

7   21   /jinguo/weiguo.txt

6   18   /jinguo/wuguo.txt

        Explanation: 27 is the total file size; 81 is 27 x 3 replicas; /jinguo is the directory being inspected.

11) -setrep: set the replication factor of a file in HDFS

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -setrep 10 /jinguo/shuguo.txt

The replication factor set here is only recorded in the NameNode metadata; whether that many replicas actually exist depends on the number of DataNodes. With only 3 machines there can be at most 3 replicas, and only when the cluster grows to 10 nodes can the replica count reach 10. (A way to check the real replica placement is sketched below.)
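To see how many replicas a file really has and where its blocks ended up, the standard fsck tool can be used (this particular check is an addition, not part of the notes above):

hdfs fsck /jinguo/shuguo.txt -files -blocks -locations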

P046【046_尚硅谷_Hadoop_HDFS_API环境准备】08:20

 

P047【047_尚硅谷_Hadoop_HDFS_API创建文件夹】10:54

IDEA shortcut: Ctrl+P to view a method's parameters.

package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * The usual client-code pattern:
 * 1. obtain a client object
 * 2. run the operations
 * 3. close the resources
 * (the same pattern applies to HDFS, ZooKeeper, ...)
 */
public class HdfsClient {
    // create a directory
    @Test
    public void testMkdir() throws URISyntaxException, IOException, InterruptedException {
        // NameNode address of the cluster to connect to
        URI uri = new URI("hdfs://node1:8020");
        // create a configuration object
        Configuration configuration = new Configuration();

        // user to act as
        String user = "atguigu";

        // 1. obtain the client object
        FileSystem fileSystem = FileSystem.get(uri, configuration, user);
        // 2. create a directory
        fileSystem.mkdirs(new Path("/xiyou/huaguoshan"));
        // 3. close the resource
        fileSystem.close();
    }
}

package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * The usual client-code pattern:
 * 1. obtain a client object
 * 2. run the operations
 * 3. close the resources
 * (the same pattern applies to HDFS, ZooKeeper, ...)
 */
public class HdfsClient {
    private FileSystem fs;

    @Before
    public void init() throws URISyntaxException, IOException, InterruptedException {
        // NameNode address of the cluster to connect to
        URI uri = new URI("hdfs://node1:8020");
        // create a configuration object
        Configuration configuration = new Configuration();

        configuration.set("dfs.replication", "2");
        // user to act as
        String user = "atguigu";

        // 1. obtain the client object
        fs = FileSystem.get(uri, configuration, user);
    }

    @After
    public void close() throws IOException {
        // 3. close the resource
        fs.close();
    }

    /*
    @Test
    public void testMkdir() throws URISyntaxException, IOException, InterruptedException {
        // NameNode address of the cluster to connect to
        URI uri = new URI("hdfs://node1:8020");
        // create a configuration object
        Configuration configuration = new Configuration();

        // user to act as
        String user = "atguigu";

        // 1. obtain the client object
        FileSystem fileSystem = FileSystem.get(uri, configuration, user);
        // 2. create a directory
        fileSystem.mkdirs(new Path("/xiyou/huaguoshan"));
        // 3. close the resource
        fileSystem.close();
    }
    */

    // create a directory
    @Test
    public void testMkdir() throws URISyntaxException, IOException, InterruptedException {
        // the client object is now created in init() and closed in close()
        // 2. create a directory
        fs.mkdirs(new Path("/xiyou/huaguoshan2"));
    }
}

P048【048_尚硅谷_Hadoop_HDFS_API上传】06:42

package com.atguigu.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * The usual client-code pattern:
 * 1. obtain a client object
 * 2. run the operations
 * 3. close the resources
 */
public class HdfsClient {
    private FileSystem fs;

    @Before
    public void init() throws URISyntaxException, IOException, InterruptedException {
        // NameNode address of the cluster to connect to
        URI uri = new URI("hdfs://node1:8020");
        // create a configuration object
        Configuration configuration = new Configuration();

        // user to act as
        String user = "atguigu";

        // 1. obtain the client object
        fs = FileSystem.get(uri, configuration, user);
    }

    @After
    public void close() throws IOException {
        // 3. close the resource
        fs.close();
    }

    // upload
    @Test
    public void testPut() throws IOException {
        // parameters: 1) delete the source?  2) overwrite if the target exists?  3) source path  4) destination path
        fs.copyFromLocalFile(false, true, new Path("D:\\bigData\\file\\sunwukong.txt"), new Path("hdfs://node1/xiyou/huaguoshan"));
    }
}

P049【049_尚硅谷_Hadoop_HDFS_API参数的优先级】05:08

Copy hdfs-site.xml into the project's resources directory:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value> <!-- replication factor -->
    </property>
</configuration>

Parameter priority, from lowest to highest: hdfs-default.xml => hdfs-site.xml on the server => the configuration file in the project's resources directory => settings made in code. Equivalently, from highest to lowest:

  1. Values set in the client code
  2. The user-defined configuration file on the ClassPath
  3. The server's custom configuration (xxx-site.xml)
  4. The server's default configuration (xxx-default.xml)

P050【050_尚硅谷_Hadoop_HDFS_API文件下载】08:24

    // download a file
    @Test
    public void testGet() throws IOException {
        // parameters: 1) delete the source?  2) source path on HDFS  3) target path on Windows  4) use the raw local file system (skips writing the local .crc checksum file)
//      fs.copyToLocalFile(false, new Path("hdfs://node1/xiyou/huaguoshan2/sunwukong.txt"), new Path("D:\\bigData\\file\\download"), false);
        fs.copyToLocalFile(false, new Path("hdfs://node1/xiyou/huaguoshan2/"), new Path("D:\\bigData\\file\\download"), false);
//      fs.copyToLocalFile(false, new Path("hdfs://node1/a.txt"), new Path("D:\\"), false);
    }

P051【051_尚硅谷_Hadoop_HDFS_API文件删除】04:12

    // delete
    @Test
    public void testRm() throws IOException {
        // parameters: 1) path to delete  2) delete recursively?
        // delete a file
        //fs.delete(new Path("/jdk-8u212-linux-x64.tar.gz"), false);

        // delete an empty directory
        //fs.delete(new Path("/xiyou"), false);

        // delete a non-empty directory
        fs.delete(new Path("/jinguo"), true);
    }

P052【052_尚硅谷_Hadoop_HDFS_API文件更名和移动】05:03

    // rename and move files
    @Test
    public void testmv() throws IOException {
        // parameters: 1) source path  2) target path
        // rename a file
        fs.rename(new Path("/input/word.txt"), new Path("/input/ss.txt"));

        // move a file and rename it at the same time
        fs.rename(new Path("/input/ss.txt"), new Path("/cls.txt"));

        // rename a directory
        fs.rename(new Path("/input"), new Path("/output"));
    }

P053【053_尚硅谷_Hadoop_HDFS_API文件详情查看】07:57

    // get detailed file information
    // extra imports needed: org.apache.hadoop.fs.BlockLocation, org.apache.hadoop.fs.LocatedFileStatus,
    //                       org.apache.hadoop.fs.RemoteIterator, java.util.Arrays
    @Test
    public void fileDetail() throws IOException {
        // list all files recursively
        RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);

        // iterate over the files
        while (listFiles.hasNext()) {
            LocatedFileStatus fileStatus = listFiles.next();

            System.out.println("========== " + fileStatus.getPath() + " =========");
            System.out.println(fileStatus.getPermission());
            System.out.println(fileStatus.getOwner());
            System.out.println(fileStatus.getGroup());
            System.out.println(fileStatus.getLen());
            System.out.println(fileStatus.getModificationTime());
            System.out.println(fileStatus.getReplication());
            System.out.println(fileStatus.getBlockSize());
            System.out.println(fileStatus.getPath().getName());

            // get the block locations
            BlockLocation[] blockLocations = fileStatus.getBlockLocations();

            System.out.println(Arrays.toString(blockLocations));
        }
    }

P054【054_尚硅谷_Hadoop_HDFS_API文件和文件夹判断】03:20

    // decide whether each entry is a file or a directory
    // extra import needed: org.apache.hadoop.fs.FileStatus
    @Test
    public void testFile() throws IOException {
        FileStatus[] listStatus = fs.listStatus(new Path("/"));
        for (FileStatus status : listStatus) {
            if (status.isFile()) {
                System.out.println("file: " + status.getPath().getName());
            } else {
                System.out.println("directory: " + status.getPath().getName());
            }
        }
    }

P055【055_尚硅谷_Hadoop_HDFS_写数据流程】11:38

HDFS write path: how a file is written.

(1) The client asks the NameNode, through the DistributedFileSystem module, to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.

(2) The NameNode replies whether the upload may proceed.

(3) The client asks which DataNodes the first block should be written to.

(4) The NameNode returns 3 DataNodes: dn1, dn2, and dn3.

(5) The client asks dn1, through FSDataOutputStream, to accept the data; dn1 forwards the request to dn2 and dn2 to dn3, establishing the pipeline.

(6) dn1, dn2 and dn3 acknowledge back to the client, stage by stage.

(7) The client starts writing the first block to dn1 (first reading the data from disk into a local in-memory buffer), packet by packet; dn1 passes each packet to dn2 and dn2 passes it to dn3; for every packet it sends, dn1 places it in an ack queue to wait for the acknowledgement.

(8) When one block has been transferred, the client asks the NameNode again for the DataNodes of the next block (steps 3-7 repeat).

P056【056_尚硅谷_Hadoop_HDFS_节点距离计算】04:31

P057【057_尚硅谷_Hadoop_HDFS_机架感知(副本存储节点选择)】06:07

Apache Hadoop 3.1.3 – HDFS Architecture

  1. The first replica is placed on the node where the client runs; if the client is outside the cluster, a random node is chosen.
  2. The second replica is placed on a random node in a different rack.
  3. The third replica is placed on a random node in the same rack as the second replica. (A command to inspect the topology is sketched below.)
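To see which rack the NameNode believes each DataNode belongs to, the standard admin command below can be used (running it usually requires HDFS superuser rights; the check itself is an addition, not from the notes):

hdfs dfsadmin -printTopology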

P058【058_尚硅谷_Hadoop_HDFS_读数据流程】05:04

(1) The client asks the NameNode, through DistributedFileSystem, to download a file; the NameNode looks up the metadata and returns the DataNode addresses that hold the file's blocks.

(2) The client picks a DataNode (nearest first, then at random) and requests the data.

(3) The DataNode streams the data to the client (reading it from disk as an input stream and verifying it packet by packet).

(4) The client receives the data packet by packet, buffers it locally, and then writes it into the target file.

P059【059_尚硅谷_Hadoop_HDFS_NN和2NN工作机制】13:28

Chapter 5: how the NameNode and SecondaryNameNode work (see the diagram in the course material).

P060【060_尚硅谷_Hadoop_HDFS_FsImage镜像文件】09:33

1) Viewing an fsimage file with oiv

(1) The oiv and oev subcommands

[atguigu@hadoop102 current]$ hdfs

oiv            apply the offline fsimage viewer to an fsimage

oev            apply the offline edits viewer to an edits file

(2) Basic syntax

hdfs oiv -p <output format> -i <fsimage file> -o <output file path>

(3) Example

[atguigu@hadoop102 current]$ pwd

/opt/module/hadoop-3.1.3/data/dfs/name/current

[atguigu@hadoop102 current]$ hdfs oiv -p XML -i fsimage_0000000000000000025 -o /opt/module/hadoop-3.1.3/fsimage.xml

[atguigu@hadoop102 current]$ cat /opt/module/hadoop-3.1.3/fsimage.xml

Copy the displayed XML into an XML file created in IDEA and reformat it; part of the result looks like this:

<inode>
    <id>16386</id>
    <type>DIRECTORY</type>
    <name>user</name>
    <mtime>1512722284477</mtime>
    <permission>atguigu:supergroup:rwxr-xr-x</permission>
    <nsquota>-1</nsquota>
    <dsquota>-1</dsquota>
</inode>
<inode>
    <id>16387</id>
    <type>DIRECTORY</type>
    <name>atguigu</name>
    <mtime>1512790549080</mtime>
    <permission>atguigu:supergroup:rwxr-xr-x</permission>
    <nsquota>-1</nsquota>
    <dsquota>-1</dsquota>
</inode>
<inode>
    <id>16389</id>
    <type>FILE</type>
    <name>wc.input</name>
    <replication>3</replication>
    <mtime>1512722322219</mtime>
    <atime>1512722321610</atime>
    <perferredBlockSize>134217728</perferredBlockSize>
    <permission>atguigu:supergroup:rw-r--r--</permission>
    <blocks>
        <block>
            <id>1073741825</id>
            <genstamp>1001</genstamp>
            <numBytes>59</numBytes>
        </block>
    </blocks>
</inode>

Question: notice that the fsimage does not record which DataNodes hold each block. Why not?

Because block locations are not persisted: after the cluster starts, the DataNodes report their block lists to the NameNode, and they keep reporting again at regular intervals.
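The reporting interval is governed by the dfs.blockreport.intervalMsec property (on the order of hours by default); its effective value can be read back with the standard getconf tool, an addition here rather than part of the notes:

hdfs getconf -confKey dfs.blockreport.intervalMsec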

P061【061_尚硅谷_Hadoop_HDFS_Edits编辑日志】04:49

2)oev查看Edits文件

(1)基本语法

hdfs oev -p 文件类型 -i编辑日志 -o 转换后文件输出路径

(2)案例实操

[atguigu@hadoop102 current]$ hdfs oev -p XML -i edits_0000000000000000012-0000000000000000013 -o /opt/module/hadoop-3.1.3/edits.xml

[atguigu@hadoop102 current]$ cat /opt/module/hadoop-3.1.3/edits.xml

将显示的xml文件内容拷贝到Idea中创建的xml文件中,并格式化。显示结果如下。

<?xml version="1.0" encoding="UTF-8"?>
<EDITS>
    <EDITS_VERSION>-63</EDITS_VERSION>
    <RECORD>
        <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
        <DATA>
            <TXID>129</TXID>
        </DATA>
    </RECORD>
    <RECORD>
