When to use the hadoop fs, hadoop dfs, and hdfs dfs commands

Posted 2023-04-17


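Answer A: hadoop fs has the broadest scope; it can operate on any file system that Hadoop supports.

hadoop dfs and hdfs dfs can operate only on HDFS (including transfers between HDFS and the local file system). The former is deprecated, so use the latter.

The following is quoted from Stack Overflow.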

The following three commands look the same but differ in small ways:

hadoop fs <args>
hadoop dfs <args>
hdfs dfs <args>

hadoop fs <args>
fs refers to a generic file system, which can point to any file system, such as the local file system or HDFS. Use it when you are dealing with different file systems such as Local FS, HFTP FS, S3 FS, and others.

hadoop dfs <args>
dfs is specific to HDFS and works for operations related to HDFS. It has been deprecated; use hdfs dfs instead.

hdfs dfs <args>
Same as the second command: it works for all operations related to HDFS and is the recommended replacement for hadoop dfs.

Below is the list of commands categorized as HDFS commands.

#hdfs commands
namenode|secondarynamenode|datanode|dfs|dfsadmin|fsck|balancer|fetchdt|oiv|dfsgroups
So even if you use hadoop dfs, it will locate hdfs and delegate the command to hdfs dfs.
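
The delegation described above can be modeled in a few lines of Python. This is only a sketch of the idea, not the logic of the real launcher script: the helper name preferred_command and the sample paths are hypothetical, and the function only rewrites command lines without running anything.

```python
import shlex

# Subcommands that the list quoted above classifies as HDFS commands.
HDFS_SUBCOMMANDS = {
    "namenode", "secondarynamenode", "datanode", "dfs", "dfsadmin",
    "fsck", "balancer", "fetchdt", "oiv", "dfsgroups",
}


def preferred_command(command_line):
    """Rewrite 'hadoop <hdfs-subcommand> ...' as 'hdfs <hdfs-subcommand> ...'.

    'hadoop fs ...' is returned unchanged, because fs is the generic
    entry point and is not in the HDFS-only list.
    """
    argv = shlex.split(command_line)
    if len(argv) >= 2 and argv[0] == "hadoop" and argv[1] in HDFS_SUBCOMMANDS:
        return ["hdfs"] + argv[1:]
    return argv


if __name__ == "__main__":
    # Hypothetical command lines, used only for illustration.
    for line in ("hadoop dfs -ls /user",
                 "hadoop fs -ls file:///tmp",
                 "hdfs dfs -put a.txt /data"):
        print(shlex.join(preferred_command(line)))
```

In practice this means scripts that still call hadoop dfs can be migrated mechanically, while hadoop fs calls that target non-HDFS URIs should be left as they are.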

Interview Series, Part 4: Project Technologies, Hadoop

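1.1. Common Hadoop port numbers

dfs.namenode.http-address: 50070 (NameNode web UI; 9870 in Hadoop 3.x)
dfs.datanode.http-address: 50075 (9864 in Hadoop 3.x)
SecondaryNameNode HTTP port: 50090 (9868 in Hadoop 3.x)
dfs.datanode.address: 50010 (9866 in Hadoop 3.x)
fs.defaultFS: 8020 or 9000
yarn.resourcemanager.webapp.address: 8088
JobHistory server web UI port: 19888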
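
As a memory aid, the web-facing ports above can be kept in a small lookup table. The sketch below is a minimal illustration: the helper web_ui_url and the host names are hypothetical, the port values are the ones listed above (Hadoop 2.x defaults), and the property names for the SecondaryNameNode and the JobHistory server are the standard ones rather than names taken from this article.

```python
# Ports from the list above; these are the Hadoop 2.x defaults.
DEFAULT_PORTS = {
    "dfs.namenode.http-address": 50070,
    "dfs.datanode.http-address": 50075,
    "dfs.namenode.secondary.http-address": 50090,
    "dfs.datanode.address": 50010,  # data transfer port, not a web UI
    "yarn.resourcemanager.webapp.address": 8088,
    "mapreduce.jobhistory.webapp.address": 19888,
}


def web_ui_url(host, prop, ports=DEFAULT_PORTS):
    """Build the web UI URL for an HTTP-facing property (hypothetical helper)."""
    if not (prop.endswith("http-address") or prop.endswith("webapp.address")):
        raise ValueError(f"{prop} is not a web UI address")
    return f"http://{host}:{ports[prop]}"


if __name__ == "__main__":
    # Hypothetical host names.
    print(web_ui_url("nn1.example.com", "dfs.namenode.http-address"))
    print(web_ui_url("rm1.example.com", "yarn.resourcemanager.webapp.address"))
```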

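1.2. Hadoop configuration files and a simple Hadoop cluster setup

(1) Configuration files:

core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
hadoop-env.sh, yarn-env.sh, mapred-env.sh, slaves (renamed workers in Hadoop 3.x)

(2) Steps for a simple cluster setup:

Install the JDK
Configure passwordless SSH login
Configure the core Hadoop files
Format the NameNode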
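
The "configure the core Hadoop files" step mostly comes down to writing name/value properties into the *-site.xml files. The following sketch generates a minimal core-site.xml body; the helper build_site_xml and the host name namenode.example.com are assumptions made for illustration, and a real cluster needs more properties than the single one shown.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical NameNode host; 8020 is one of the two ports listed in 1.1.
    print(build_site_xml({"fs.defaultFS": "hdfs://namenode.example.com:8020"}))
```

After the configuration files are in place on every node, the NameNode is formatted once (hdfs namenode -format) before the cluster is started for the first time.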

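1.3. HDFS read and write flows

This is important. Although Hadoop has reached 3.x and storage options keep diversifying, HDFS is still the mainstream storage layer, so we need to understand how HDFS reads and writes work.

1.3.1. HDFS read flow

1.3.2. HDFS write flow

1.3.3. MapReduce flow

1.3.3.1. Shuffle mechanism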

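1.4. Hadoop optimization

1.4.1. Impact of small files on HDFS

(1) They affect the NameNode's lifespan, because the metadata of every file is held in the NameNode's memory.
(2) They affect the number of tasks in the compute engine; for example, each small file generates its own Map task.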
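
Both effects can be put into rough numbers. The sketch below relies on a widely quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block); that figure is an assumption, not a measurement, and the function names are hypothetical. It compares the same 10,000,000 MB of data stored as 1 MB files and as 128 MB files.

```python
import math

# Rule-of-thumb assumption: roughly 150 bytes of heap per file or block object.
BYTES_PER_NAMESPACE_OBJECT = 150


def files_needed(total_mb, file_mb):
    """Number of files required to hold total_mb of data."""
    return math.ceil(total_mb / file_mb)


def estimate_namenode_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_NAMESPACE_OBJECT


def map_tasks_for_small_files(num_files):
    """With one split per small file, the map task count equals the file count."""
    return num_files


if __name__ == "__main__":
    total_mb = 10_000_000
    for file_mb in (1, 128):
        n = files_needed(total_mb, file_mb)
        print(file_mb, "MB files:", n, "files,",
              estimate_namenode_bytes(n), "bytes of metadata,",
              map_tasks_for_small_files(n), "map tasks")
```

The estimate ignores directories and replication bookkeeping, so it should be read as an order-of-magnitude comparison only.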

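1.4.2. Handling small input files

(1) Merge small files: archive them (HAR), or write a custom InputFormat that stores the small files as a SequenceFile.
(2) Use CombineFileInputFormat as the input format to handle large numbers of small files on the input side.
(3) For jobs with many small files, enable JVM reuse.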
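
The idea behind combining small files into fewer splits can be sketched as greedy packing. This is a simplified model, not CombineFileInputFormat's actual algorithm, which also takes node and rack locality into account and splits large files by block; the function name and the sizes are hypothetical.

```python
def pack_into_splits(file_sizes, max_split_bytes):
    """Greedily group consecutive files so each split stays within the limit.

    A file larger than the limit gets a split of its own in this model.
    """
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        # Close the current split if adding this file would exceed the limit.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    mb = 1024 * 1024
    small_files = [40 * mb] * 5  # five hypothetical 40 MB files
    splits = pack_into_splits(small_files, 128 * mb)
    print(len(small_files), "files ->", len(splits), "splits")
```

Without combining, the five files in the example would produce five map tasks; with packing they produce one task per split.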

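1.4.3. Map phase

(1) Increase the size of the circular buffer, for example from 100 MB to 200 MB.
(2) Raise the spill threshold of the circular buffer, for example from 80% to 90%.
(3) Reduce the number of merge passes over the spill files.
(4) Where it does not affect the business logic, use a Combiner to pre-aggregate and reduce I/O.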
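
A rough model shows why items (1) and (2) help: the buffer spills to disk each time it fills to the threshold, so a larger buffer and a higher threshold mean fewer spill files to merge afterwards. The sketch assumes the two settings correspond to mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent, ignores the space the buffer uses for record metadata, and uses a hypothetical 1000 MB of map output.

```python
import math


def spill_count(map_output_mb, sort_mb=100, spill_percent=0.80):
    """Estimate how many spill files one map task writes (simplified model)."""
    threshold_mb = sort_mb * spill_percent  # buffer fill level that triggers a spill
    return math.ceil(map_output_mb / threshold_mb)


if __name__ == "__main__":
    output_mb = 1000  # hypothetical map output size
    print("100 MB buffer, 80%:", spill_count(output_mb, 100, 0.80), "spills")
    print("200 MB buffer, 90%:", spill_count(output_mb, 200, 0.90), "spills")
```

Fewer spill files also reduce the merge work described in item (3), since there is less to merge in the first place.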

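1.4.4. Reduce phase

(1) Set sensible numbers of Map and Reduce tasks: neither too few nor too many. Too few leaves tasks waiting and lengthens processing time; too many makes Map and Reduce tasks compete for resources, causing errors such as processing timeouts.
(2) Let Map and Reduce overlap: tune the slowstart.completedmaps parameter so that Reduce starts once Map has progressed to a certain point, which shortens the time Reduce spends waiting.
(3) Avoid Reduce where possible, because Reduce generates heavy network traffic when it is used to join datasets.
(4) Increase the number of parallel copies each Reduce uses to fetch data from Map.
(5) If cluster capacity allows, increase the memory the Reduce side uses to buffer data.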
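
Items (2), (4), and (5) map onto job properties that can be passed as -D generic options. In the sketch below, only slowstart.completedmaps is named in the text; pairing items (4) and (5) with mapreduce.reduce.shuffle.parallelcopies and mapreduce.reduce.shuffle.input.buffer.percent is an assumed reading, and the values are illustrative rather than recommendations. The helper names are hypothetical.

```python
import math
from fractions import Fraction

# Illustrative values only; tune against the actual cluster.
REDUCE_TUNING = {
    "mapreduce.job.reduce.slowstart.completedmaps": "0.8",
    "mapreduce.reduce.shuffle.parallelcopies": "10",
    "mapreduce.reduce.shuffle.input.buffer.percent": "0.8",
}


def maps_before_reduce_starts(total_maps, slowstart):
    """Maps that must finish before reducers are scheduled (simplified model)."""
    # Fraction avoids floating-point rounding surprises at exact boundaries.
    return math.ceil(total_maps * Fraction(str(slowstart)))


def as_generic_options(settings):
    """Turn a settings dict into a flat list of '-D', 'key=value' arguments."""
    args = []
    for key, value in settings.items():
        args += ["-D", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(maps_before_reduce_starts(200, "0.05"), "maps at slowstart 0.05")
    print(maps_before_reduce_starts(200, "0.8"), "maps at slowstart 0.8")
    print(" ".join(as_generic_options(REDUCE_TUNING)))
```

A low slowstart value starts the shuffle early but holds reduce slots while maps are still running; a high value frees those slots at the cost of a later shuffle start.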