Big Data in Practice: Common Hadoop Commands and API Usage

Posted 2021-04-13 进击的鱼豆腐


1. Common Hadoop shell commands

1.1 Cluster startup

# Start HDFS
$HADOOP_HOME/sbin/start-dfs.sh
# Start YARN
$HADOOP_HOME/sbin/start-yarn.sh

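1.2 Working with HDFS

# HDFS can be driven through either of two commands:
$HADOOP_HOME/bin/hadoop fs
# or
$HADOOP_HOME/bin/hdfs dfs
# Against HDFS the two behave the same. hadoop fs is the generic front end for any file system Hadoop supports, while hdfs dfs is specific to HDFS.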
# -help prints the usage of a command
$HADOOP_HOME/bin/hadoop fs -help <command>
# Example:
$HADOOP_HOME/bin/hadoop fs -help rm
# -ls lists a directory; -R lists it recursively
$HADOOP_HOME/bin/hadoop fs -ls <path>
$HADOOP_HOME/bin/hadoop fs -ls -R <path>
# Example:
$HADOOP_HOME/bin/hadoop fs -ls -R /
# -mkdir creates a directory on HDFS; -p also creates missing parent directories
$HADOOP_HOME/bin/hadoop fs -mkdir <path>
$HADOOP_HOME/bin/hadoop fs -mkdir -p <path>
# Example:
$HADOOP_HOME/bin/hadoop fs -mkdir -p /dou/fu
# -rm deletes a file on HDFS; -R deletes recursively
$HADOOP_HOME/bin/hadoop fs -rm <path>
$HADOOP_HOME/bin/hadoop fs -rm -R <path>
# Example:
$HADOOP_HOME/bin/hadoop fs -rm -R /dou/fu
# -rmdir deletes a directory only if it is empty
$HADOOP_HOME/bin/hadoop fs -rmdir <path>
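# -cp copies from one HDFS path to another HDFS path
$HADOOP_HOME/bin/hadoop fs -cp <src> <dst>
# -mv moves a file within HDFS
$HADOOP_HOME/bin/hadoop fs -mv <src> <dst>
# -text and -cat both print the contents of an HDFS file. For plain text files the output is the same; -text additionally decodes compressed files and sequence files.
$HADOOP_HOME/bin/hadoop fs -text <path>
$HADOOP_HOME/bin/hadoop fs -cat <path>
# -tail prints the last kilobyte of a file; -f keeps following the file as it grows
$HADOOP_HOME/bin/hadoop fs -tail <path>
$HADOOP_HOME/bin/hadoop fs -tail -f <path>
# -put or -copyFromLocal uploads a local file to HDFS (copy)
$HADOOP_HOME/bin/hadoop fs -put <localsrc> <dst>
$HADOOP_HOME/bin/hadoop fs -copyFromLocal <localsrc> <dst>
# -moveFromLocal moves a local file to HDFS (cut)
$HADOOP_HOME/bin/hadoop fs -moveFromLocal <localsrc> <dst>
# -get or -copyToLocal downloads a file from HDFS to the local file system (copy)
$HADOOP_HOME/bin/hadoop fs -get <src> <localdst>
$HADOOP_HOME/bin/hadoop fs -copyToLocal <src> <localdst>
# -moveToLocal is meant to move a file from HDFS to the local file system (cut), but in Hadoop 3.2.1 it only prints a "not implemented yet" message
$HADOOP_HOME/bin/hadoop fs -moveToLocal <src> <localdst>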
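# -chgrp, -chmod, and -chown work as they do on a Linux file system: they change the group, the permission mode, and the owner of a file
# -R applies the change recursively
$HADOOP_HOME/bin/hadoop fs -chgrp [-R] <group> <path>
$HADOOP_HOME/bin/hadoop fs -chmod [-R] <mode> <path>
$HADOOP_HOME/bin/hadoop fs -chown [-R] <owner>[:<group>] <path>
# -setrep changes the replication factor (the number of replicas) of a file; -w waits until replication completes
$HADOOP_HOME/bin/hadoop fs -setrep [-R] [-w] <numReplicas> <path>
# Example:
$HADOOP_HOME/bin/hadoop fs -setrep -w 3 /dou/fu/test.txt
# -getmerge downloads several files merged into one local file
# -nl appends a newline after each file; -skip-empty-file skips empty files
$HADOOP_HOME/bin/hadoop fs -getmerge [-nl] [-skip-empty-file] <src> <localdst>
# Example: merge test1.txt and test2.txt on HDFS into the local file merge.txt
$HADOOP_HOME/bin/hadoop fs -getmerge -nl /test1.txt /test2.txt /home/hadoop/merge.txt
# -df reports the free space of the file system; -h prints human-readable sizes
$HADOOP_HOME/bin/hadoop fs -df -h /
# -du shows the size of each file and directory under the given path
$HADOOP_HOME/bin/hadoop fs -du <path>
# -test checks a path and reports the result through the exit status (inspect it with echo $?)
# -d returns 0 if the path is a directory
# -e returns 0 if the path exists
# -f returns 0 if the path is a file
# -s returns 0 if the path is not empty
# -r returns 0 if the path exists and read permission is granted
# -w returns 0 if the path exists and write permission is granted
# -z returns 0 if the file is zero length
$HADOOP_HOME/bin/hadoop fs -test -[defsrwz] <path>
# Example:
$HADOOP_HOME/bin/hadoop fs -test -e /user/input/test.txt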
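To make the effect of -getmerge -nl more concrete, here is a minimal local sketch in Java. It concatenates several files into one and appends a newline after each. It works on the local file system with java.nio only, not on HDFS, and the names GetMergeSketch and merge are made up for illustration; it is not how Hadoop implements the command.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: a local-file analogue of "hadoop fs -getmerge -nl".
final class GetMergeSketch {

    // Concatenates the sources into dst in the given order.
    // When addNewline is true, a '\n' is written after each source file.
    static void merge(List<Path> sources, Path dst, boolean addNewline) throws IOException {
        try (OutputStream out = Files.newOutputStream(dst)) {
            for (Path src : sources) {
                Files.copy(src, out);
                if (addNewline) {
                    out.write('\n');
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-ins for test1.txt, test2.txt, and merge.txt
        Path first = Files.createTempFile("test1", ".txt");
        Path second = Files.createTempFile("test2", ".txt");
        Path merged = Files.createTempFile("merge", ".txt");
        Files.write(first, "hello".getBytes(StandardCharsets.UTF_8));
        Files.write(second, "world".getBytes(StandardCharsets.UTF_8));

        merge(Arrays.asList(first, second), merged, true);
        // Prints "hello" and "world" on separate lines
        System.out.print(new String(Files.readAllBytes(merged), StandardCharsets.UTF_8));
    }
}
```

Without the newline option, two files that do not end in a line break would run together in the merged output, which is the situation -nl is meant to avoid.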

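2. Common Hadoop API usage

2.1 Create a Maven project in IDEA and add the dependencies to the pom file to prepare for testing

(Once the dependencies are imported, the tests can be run locally.)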

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>3.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs-client</artifactId>
        <version>3.2.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-api -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-yarn-api</artifactId>
        <version>3.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>3.2.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>3.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-api</artifactId>
        <version>2.13.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.13.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.4</version>
    </dependency>
</dependencies>

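2.2 Uploading a file

// Imports shared by the test methods in this section; they belong at the top of the test class
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.junit.Test;

@Test
public void testCopyFromLocalFile() throws URISyntaxException, IOException, InterruptedException {
    // Create the configuration
    Configuration conf = new Configuration();
    // Set the replication factor
    conf.set("dfs.replication", "1");
    // Get the HDFS file system, passing the file system URI, the configuration, and the user name
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // Upload the file: local source path first, then the HDFS destination path
    fs.copyFromLocalFile(new Path("src/main/resources/input/input.txt"), new Path("/user/test/input.txt"));
    // Release resources
    fs.close();
}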
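The first argument to FileSystem.get is an ordinary java.net.URI. The sketch below needs no Hadoop classes and simply shows how that address splits into scheme, host, and port. The host hadoop101 and port 9000 come from the example above; the class name HdfsUriSketch and the output format are made up for illustration. On a real cluster the address would normally match the fs.defaultFS value in core-site.xml.

```java
import java.net.URI;

// Hypothetical helper that takes apart an HDFS address.
final class HdfsUriSketch {

    // Returns "scheme | host | port" for the given address.
    // URI.create throws IllegalArgumentException if the address is malformed.
    static String describe(String address) {
        URI uri = URI.create(address);
        return uri.getScheme() + " | " + uri.getHost() + " | " + uri.getPort();
    }

    public static void main(String[] args) {
        // prints "hdfs | hadoop101 | 9000"
        System.out.println(describe("hdfs://hadoop101:9000"));
    }
}
```

The scheme tells Hadoop which file system implementation to load, and the host and port identify the NameNode the client connects to.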

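2.3 Downloading a file

@Test
public void testCopyToLocalFile() throws IOException, URISyntaxException, InterruptedException {
    // Create the configuration
    Configuration conf = new Configuration();
    // Get the HDFS file system, passing the file system URI, the configuration, and the user name
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // Two-argument form:
    // fs.copyToLocalFile(new Path("/user/input/hdfs-site.xml"), new Path("src/main/resources/input/xml/hdfs-site.xml"));
    // Four-argument form (the shorter overloads pass false for arguments 1 and 4):
    // 1. delSrc: whether to delete the source; true makes this a move (cut), false a copy
    // 2. source path on HDFS
    // 3. destination path on the local file system
    // 4. useRawLocalFileSystem: true writes through the raw local file system, so no checksum is produced;
    //    false writes a .crc (cyclic redundancy check) file next to the download
    fs.copyToLocalFile(false, new Path("/user/input/core-site.xml"), new Path("src/main/resources/input/xml/core-site.xml"), true);
    // Release resources
    fs.close();
}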
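The comment on the fourth argument mentions a cyclic redundancy check. As a general illustration of the idea, and not of Hadoop's on-disk checksum format, the sketch below computes a CRC-32 with the JDK's java.util.zip.CRC32 and shows that changing a single byte changes the checksum. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper illustrating a cyclic redundancy check.
final class CrcSketch {

    // Computes the CRC-32 checksum of the given bytes.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "123456789".getBytes(StandardCharsets.US_ASCII);
        byte[] corrupted = "123456780".getBytes(StandardCharsets.US_ASCII);

        // prints "cbf43926", the standard CRC-32 check value for "123456789"
        System.out.printf("%08x%n", crc32(original));
        // prints "true": the single changed byte yields a different checksum
        System.out.println(crc32(original) != crc32(corrupted));
    }
}
```

A reader that stores the checksum alongside the data can recompute it later and detect corruption when the two values differ.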

2.4 Deleting a file

@Test
public void testDelete() throws IOException, URISyntaxException, InterruptedException {
    // Create the configuration
    Configuration conf = new Configuration();
    // Get the HDFS file system, passing the file system URI, the configuration, and the user name
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // 1. path to delete  2. whether to delete recursively; this only needs to be true when the path is a directory
    fs.delete(new Path("/user/test/input.txt"), false);
    // Release resources
    fs.close();
}

2.5 Renaming a file

@Test
public void testRename() throws IOException, URISyntaxException, InterruptedException {
    // Create the configuration
    Configuration conf = new Configuration();
    // Get the HDFS file system, passing the file system URI, the configuration, and the user name
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // 1. current path  2. new path
    fs.rename(new Path("/user/test/input.txt"), new Path("/user/test/reinput.txt"));
    // Release resources
    fs.close();
}

2.6 Getting file information

@Test
public void testGetFileStatus() throws IOException, URISyntaxException, InterruptedException {
    // Create the configuration
    Configuration conf = new Configuration();
    // Get the HDFS file system, passing the file system URI, the configuration, and the user name
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // List the files under the path; true means recurse into subdirectories
    RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/user/input"), true);
    while (listFiles.hasNext()) {
        LocatedFileStatus fileStatus = listFiles.next();
        // File name
        System.out.println("File name: " + fileStatus.getPath().getName());
        // File permission
        System.out.println("File permission: " + fileStatus.getPermission());
        // File length in bytes
        System.out.println("File length: " + fileStatus.getLen());
        // Blocks of the file and the hosts that store them
        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        for (BlockLocation blockLocation : blockLocations) {
            String[] hosts = blockLocation.getHosts();
            for (String host : hosts) {
                System.out.println("Block host: " + host);
            }
        }
        System.out.println("-------------------separator----------------------");
    }
    // Release resources
    fs.close();
}
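The permission printed by the test above appears in symbolic form such as rw-r--r--, whereas -chmod is often given an octal mode such as 644. The following sketch shows how the two notations correspond. It is plain Java written for illustration, with a hypothetical class name, and does not use Hadoop's own permission classes.

```java
// Hypothetical helper: converts an octal permission mode to rwx notation.
final class PermissionSketch {

    // Converts the low nine bits of mode (for example 0644) into a string
    // such as "rw-r--r--": three flags each for owner, group, and others.
    static String toSymbolic(int mode) {
        char[] flags = {'r', 'w', 'x'};
        StringBuilder sb = new StringBuilder(9);
        for (int bit = 8; bit >= 0; bit--) {
            boolean set = (mode & (1 << bit)) != 0;
            sb.append(set ? flags[(8 - bit) % 3] : '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSymbolic(0644)); // prints "rw-r--r--"
        System.out.println(toSymbolic(0755)); // prints "rwxr-xr-x"
    }
}
```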

2.7 Checking whether an entry is a file or a directory

@Test
public void testListFileStatus() throws IOException, URISyntaxException, InterruptedException {
    // Create the configuration
    Configuration conf = new Configuration();
    // Get the HDFS file system, passing the file system URI, the configuration, and the user name
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // Get the status of each entry directly under the path
    FileStatus[] listStatus = fs.listStatus(new Path("/user"));
    for (FileStatus fileStatus : listStatus) {
        // Determine the entry type
        if (fileStatus.isFile()) {
            System.out.println("file: " + fileStatus.getPath().getName());
        } else {
            System.out.println("directory: " + fileStatus.getPath().getName());
        }
    }
    // Release resources
    fs.close();
}

2.8 Uploading to HDFS as a raw byte stream, without the copy helper methods

@Test
public void putFileToHDFS() throws URISyntaxException, IOException, InterruptedException {
    // Create the configuration and get the file system
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // Open an input stream on the local file
    FileInputStream fin = new FileInputStream(new File("D:/hadoop-3.2.1.tar.gz"));
    // Open an output stream on the HDFS destination
    FSDataOutputStream fout = fs.create(new Path("/user/test/hadoop-3.2.1.tar.gz"));
    // Copy from one stream to the other
    IOUtils.copyBytes(fin, fout, conf);
    // Release resources
    IOUtils.closeStream(fin);
    IOUtils.closeStream(fout);
    fs.close();
}

2.9 Downloading an HDFS file as a raw byte stream, without the copy helper methods

@Test
public void getFileFromHDFS() throws URISyntaxException, IOException, InterruptedException {
    // Create the configuration and get the file system
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // Open an input stream on the HDFS file
    FSDataInputStream fin = fs.open(new Path("/user/test/input.txt"));
    // Open an output stream on the local destination
    FileOutputStream fout = new FileOutputStream(new File("src/main/resources/input/input.txt"));
    // Copy from one stream to the other
    IOUtils.copyBytes(fin, fout, conf);
    // Release resources
    IOUtils.closeStream(fin);
    IOUtils.closeStream(fout);
    fs.close();
}
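IOUtils.copyBytes hides a buffered read and write loop. A minimal stand-alone sketch of such a loop is shown below, using only java.io and in-memory streams so that it runs without a cluster. It illustrates the technique rather than Hadoop's actual implementation; the class name, the method signature, and the 4096-byte buffer are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: a plain buffered stream-to-stream copy.
final class StreamCopySketch {

    // Copies everything from in to out and returns the number of bytes copied.
    // read may fill only part of the buffer, so only len bytes are written.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
            total += len;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10000];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out, 4096);
        System.out.println(copied); // prints "10000"
    }
}
```

Writing exactly the number of bytes that read returned is the important detail: writing the whole buffer on every pass would pad the output with stale bytes whenever a read comes back short.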

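2.10 Reading a specific block of a large multi-block file (first block)

@Test
public void readFileSeek1() throws URISyntaxException, IOException, InterruptedException {
    // Create the configuration and get the file system
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // Open an input stream on the HDFS file
    FSDataInputStream fin = fs.open(new Path("/user/test/hadoop-3.2.1.tar.gz"));
    // Open an output stream on the local destination
    FileOutputStream fout = new FileOutputStream(new File("src/main/resources/hadoop-3.2.1.tar.gz-part1"));
    // Copy the first 128 MB, which is the first block
    byte[] buff = new byte[1024];
    long remaining = 128L * 1024 * 1024;
    while (remaining > 0) {
        // read may return fewer bytes than requested, so write only what was actually read
        int len = fin.read(buff, 0, (int) Math.min(buff.length, remaining));
        if (len == -1) {
            break; // end of file
        }
        fout.write(buff, 0, len);
        remaining -= len;
    }
    // Release resources
    IOUtils.closeStream(fin);
    IOUtils.closeStream(fout);
    fs.close();
}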

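2.11 Reading a specific block of a large multi-block file (second block)

@Test
public void readFileSeek2() throws URISyntaxException, IOException, InterruptedException {
    // Create the configuration and get the file system
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://hadoop101:9000"), conf, "hadoop");
    // Open an input stream on the HDFS file
    FSDataInputStream fin = fs.open(new Path("/user/test/hadoop-3.2.1.tar.gz"));
    // Set the starting position: the first block is 128 MB, so skip 128 MB to reach the second block
    fin.seek(1024 * 1024 * 128);
    // Open an output stream on the local destination
    FileOutputStream fout = new FileOutputStream(new File("src/main/resources/hadoop-3.2.1.tar.gz-part2"));
    // Copy at most 128 MB, which is the second block
    byte[] buff = new byte[1024];
    long remaining = 128L * 1024 * 1024;
    while (remaining > 0) {
        // read may return fewer bytes than requested, so write only what was actually read
        int len = fin.read(buff, 0, (int) Math.min(buff.length, remaining));
        if (len == -1) {
            break; // end of file
        }
        fout.write(buff, 0, len);
        remaining -= len;
    }
    // Release resources
    IOUtils.closeStream(fin);
    IOUtils.closeStream(fout);
    fs.close();
}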
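The offsets used in 2.10 and 2.11 follow from simple arithmetic: block i of a file starts at i × blockSize and is at most blockSize bytes long, with the last block holding whatever remains. The sketch below computes these ranges, assuming the 128 MB block size used above (on a cluster the value is set by dfs.blocksize). The class name and the 300 MB file length in the example are made up for illustration.

```java
// Hypothetical helper: computes block boundaries for a file of a given length.
final class BlockRangeSketch {

    // Block size assumed throughout this article: 128 MB.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // Number of blocks needed to hold fileLength bytes (rounded up).
    static long blockCount(long fileLength, long blockSize) {
        return (fileLength + blockSize - 1) / blockSize;
    }

    // Byte offset at which block index (0-based) starts; this is the seek position.
    static long blockStart(int index, long blockSize) {
        return index * blockSize;
    }

    // Number of bytes in block index; 0 if the block lies past the end of the file.
    static long blockLength(long fileLength, int index, long blockSize) {
        long start = blockStart(index, blockSize);
        if (start >= fileLength) {
            return 0;
        }
        return Math.min(blockSize, fileLength - start);
    }

    public static void main(String[] args) {
        long fileLength = 300L * 1024 * 1024; // a made-up 300 MB file

        System.out.println(blockCount(fileLength, BLOCK_SIZE));     // prints "3"
        System.out.println(blockStart(1, BLOCK_SIZE));              // prints "134217728"
        System.out.println(blockLength(fileLength, 2, BLOCK_SIZE)); // prints "46137344"
    }
}
```

Passing blockStart to seek and copying blockLength bytes generalizes the two tests above to any block of the file, including a shorter final block.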

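That wraps up this introduction to common Hadoop commands and APIs!

The next installment covers the HDFS read and write flows and how the NameNode (NN) and Secondary NameNode (2NN) work.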
