Hadoop: The Definitive Guide (4th Edition), Key Points Translated: Chapter 3. The HDFS

Posted 2020-10-24 llguanli


5) The Java Interface
a) Reading Data from a Hadoop URL.
b) Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems.
c) One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}

There’s a little more work required to make Java recognize Hadoop’s hdfs URL scheme: call the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block.
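
The once-per-JVM restriction can be observed with the JDK alone. The following is a minimal sketch, not Hadoop code: the class name HandlerFactoryOnce and the helper trySetFactory() are hypothetical, and the placeholder factory stands in for FsUrlStreamHandlerFactory. It returns null for every protocol, which leaves the JDK's built-in handlers in charge.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

public class HandlerFactoryOnce {

    // Hypothetical helper: returns true if the factory was installed,
    // false if the JVM rejected the call because a factory is already set.
    public static boolean trySetFactory() {
        // Placeholder factory: returning null tells the JDK to fall back
        // to its built-in handler for the requested protocol.
        URLStreamHandlerFactory factory = protocol -> null;
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error alreadyDefined) {
            // The JDK signals a repeated call with java.lang.Error.
            // Catching Error this broadly is acceptable only in a demo.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("first call installed:  " + trySetFactory());
        System.out.println("second call installed: " + trySetFactory());
    }
}
```

A practical consequence is that if some other component in the same JVM has already installed a factory, a later registration attempt will fail in the same way.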
d) Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler.

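import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

    // Register the handler for hdfs:// URLs; allowed only once per JVM.
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }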

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

run:
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt

e) We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to the copyBytes() method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn’t need to be closed.
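
To make the buffer-size and close-flag arguments concrete, here is a rough JDK-only sketch of a buffered stream copy. It is not Hadoop's implementation of copyBytes(); the class CopySketch and its copy() method are hypothetical names, and the method always leaves both streams open, which corresponds to passing false as the last argument in Example 3-1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class CopySketch {

    // Copies everything from in to out using a buffer of the given size.
    // Neither stream is closed here; the caller remains responsible for that.
    public static long copy(InputStream in, OutputStream out, int bufferSize)
            throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello hdfs\n".getBytes(StandardCharsets.UTF_8);
        // The input stream is closed by try-with-resources;
        // System.out is deliberately left open.
        try (InputStream in = new ByteArrayInputStream(data)) {
            copy(in, System.out, 4);
        }
    }
}
```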
f) Reading Data Using the FileSystem API.
g) FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use (HDFS, in this case). There are several static factory methods for getting a FileSystem instance:

public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException

h) A Configuration object encapsulates a client or server’s configuration, which is set using configuration files read from the classpath, such as etc/hadoop/core-site.xml. The first method returns the default filesystem (as specified in core-site.xml, or the default local filesystem if not specified there). The second uses the given URI’s scheme and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI. The third retrieves the filesystem as the given user, which is important in the context of security.
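
The "scheme" and "authority" that the second factory method inspects can be seen with java.net.URI alone. This small sketch (the class name UriParts is hypothetical) only shows how a URI string is split; it does not show how Hadoop chooses a FileSystem implementation.

```java
import java.net.URI;

public class UriParts {

    // Returns the scheme and authority of the given URI string;
    // a component that is absent is reported as null.
    public static String describe(String s) {
        URI uri = URI.create(s);
        return "scheme=" + uri.getScheme() + " authority=" + uri.getAuthority();
    }

    public static void main(String[] args) {
        // Fully qualified: scheme "hdfs", authority "localhost".
        System.out.println(describe("hdfs://localhost/user/tom/quangle.txt"));
        // No scheme: the case in which the default filesystem would be used.
        System.out.println(describe("/user/tom/quangle.txt"));
    }
}
```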
i) Example 3-2. Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly.

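import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {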

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
run:
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt

j) The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:

package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable {
    // implementation elided
}

k) The Seekable interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file (getPos()):

public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
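
The same seek-and-query pattern exists for local files in the JDK. As a rough analogy only, the sketch below uses java.io.RandomAccessFile, whose getFilePointer() plays the role that getPos() plays in Seekable. The class LocalSeekSketch and its readTwice() method are hypothetical, and nothing here touches HDFS.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalSeekSketch {

    // Reads the first `count` bytes, rewinds with seek(0), and reads them again.
    // Returns both reads concatenated, then "@" and the offset after the first read.
    public static String readTwice(Path file, int count) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] first = new byte[count];
            raf.readFully(first);
            long afterFirst = raf.getFilePointer(); // current offset from the start
            raf.seek(0);                            // go back to the start of the file
            byte[] second = new byte[count];
            raf.readFully(second);
            return new String(first, StandardCharsets.US_ASCII)
                    + new String(second, StandardCharsets.US_ASCII)
                    + "@" + afterFirst;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("seek-sketch", ".txt");
        try {
            Files.write(tmp, "abcdef".getBytes(StandardCharsets.US_ASCII));
            System.out.println(readTwice(tmp, 3));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```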

l) Example 3-3. Displaying files from a Hadoop filesystem on standard output twice, by using seek():

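import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {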

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
run:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt

m) Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.

n) Writing Data
o) The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:

public FSDataOutputStream create(Path f) throws IOException
p) There’s also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:

package org.apache.hadoop.util;
public interface Progressable {
    public void progress();
}
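
The callback shape can be tried out without a cluster. In the sketch below, ProgressListener is a hypothetical local stand-in with the same single-method shape as Progressable, and the rule of reporting once per fixed-size chunk is an assumption made for illustration. The notes above do not say when Hadoop actually invokes progress().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ProgressCallbackSketch {

    // Hypothetical stand-in with the same single-method shape as Progressable.
    public interface ProgressListener {
        void progress();
    }

    // Writes data in chunks of at most chunkSize bytes and notifies the
    // listener after each chunk. Returns the number of notifications made.
    public static int writeWithProgress(byte[] data, OutputStream out,
                                        int chunkSize, ProgressListener listener)
            throws IOException {
        int calls = 0;
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int length = Math.min(chunkSize, data.length - offset);
            out.write(data, offset, length);
            listener.progress();
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        // Print one dot per chunk, in the spirit of Example 3-4.
        int calls = writeWithProgress(data, new ByteArrayOutputStream(), 4,
                () -> System.out.print("."));
        System.out.println(" " + calls + " callbacks");
    }
}
```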

q) As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):

public FSDataOutputStream append(Path f) throws IOException
r) Example 3-4. Copying a local file to a Hadoop filesystem

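import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {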
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });

        IOUtils.copyBytes(in, out, 4096, true);
    }
}

s) The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:

package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
    public long getPos() throws IOException {
        // implementation elided
    }
    // implementation elided
}

However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file or appends to an already written file. In other words, there is no support for writing to anywhere other than the end of the file, so there is no value in being able to seek while writing.

t) FileSystem provides a method to create a directory:

public boolean mkdirs(Path f) throws IOException

Often, you don’t need to explicitly create a directory, because writing a file by calling create() will automatically create any parent directories.
u) Querying the Filesystem
v) An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
w) The method getFileStatus() on FileSystem provides a way of getting a FileStatus object for a single file or directory.

x) Finding information on a single file or directory is useful, but you also often need to be able to list the contents of a directory. That’s what FileSystem’s listStatus() methods are for:

public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException

When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.
y) Example 3-6. Showing the file statuses for a collection of paths in a Hadoop filesystem.

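import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListStatus {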

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }

        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}

z) Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

Hadoop supports the same set of glob characters as the Unix bash shell.
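
To get a feel for shell-style glob matching without a cluster, the JDK's own glob matcher can be used as an approximation. This sketch does not call globStatus(), and the java.nio glob syntax is a separate implementation that is not identical to Hadoop's; it is used here only for the wildcards *, ? and [ab], which behave the same way in bash. The class GlobSketch and the sample paths are hypothetical, and a Unix-like default filesystem is assumed.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobSketch {

    // Returns true if the whole path matches the glob pattern,
    // using the JDK's matcher (not Hadoop's).
    public static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // [78] matches exactly one of the listed characters.
        System.out.println(matches("200[78]", "2007"));
        // ? matches exactly one character.
        System.out.println(matches("200?", "2008"));
        // * matches within a single path component only.
        System.out.println(matches("2007/*/*", "2007/12/30"));
        System.out.println(matches("2007/*", "2007/12/30"));
    }
}
```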