Accessing a File from Distributed Cache in Pig UDF Java class, Amazon EMR

Posted 2023-04-13


【Title】: Accessing a File from Distributed Cache in Pig UDF Java class, Amazon EMR 【Posted】: 2015-07-19 21:21:48 【Question】:

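I am trying to access a file (sample.txt) in a UDF. I want to put the file in the distributed cache and use it from there. I am running the Pig job on Amazon EMR. When creating the cluster, I use an EMR bootstrap action to copy the file (sample.txt) to HDFS.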

bootstrap.sh (copies the file from S3 to HDFS)

hadoop fs -copyToLocal s3n://s3_path/sample.txt /mnt/sample.txt

UsingSample.java (the UDF that uses sample.txt)

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UsingSample extends EvalFunc<String> {

    public String useSampleText(String str) throws Exception {
        // "sample" is the link name given after '#' in getCacheFiles()
        File sampleFile = new File("./sample");

        // do something with sampleFile

        return str; // placeholder so the method compiles
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;

        String str = (String) input.get(0);
        String result = "";
        try {
            result = useSampleText(str);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return result;
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/mnt/sample.txt#sample"); // not sure if the path I am passing is correct
        return list;
    }
}

create_cluster.sh (script that creates the cluster and runs the Pig script)

aws emr create-cluster \
    --auto-terminate \
    --name "sample cluster" \
    --ami-version 3.8.0 \
    --enable-debugging \
    --applications Name=Pig \
    --use-default-roles \
    --instance-type m1.large \
    --instance-count 3 \
    --steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,$S3_PIG_SCRIPT_URL,-p,INPUT=$INPUT,-p,OUTPUT=$OUTPUT] \
    --bootstrap-action Path=s3://s3_bootstrapscript_path/bootstrap.sh

The error I get is a FileNotFound exception when the job tries to access sample.txt in getCacheFiles().

I am using:

Hadoop 2.4, Pig 0.12

Please help.


【Answer 1】:

Try copying the file to HDFS with the following command:

hadoop distcp s3n://bucket/file /home/filelocation

Then check that the file exists on HDFS with:

hdfs dfs -ls /home/filelocation
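
The reason this probably matters: `hadoop fs -copyToLocal` in the bootstrap script writes to the node's local disk, whereas getCacheFiles() is meant to return HDFS paths. Once the file is on HDFS, each entry has the form hdfs_path#link_name, and the task opens the file as ./link_name in its working directory. The following is a minimal stdlib-only sketch of that naming convention, with the Pig classes left out so that it runs on its own. CacheEntry and its methods are hypothetical (not part of Pig or Hadoop), and the HDFS path is an assumed location based on the distcp target above.

```java
import java.io.File;

/** Hypothetical helper, not a Pig or Hadoop class: models one "path#link" cache entry. */
final class CacheEntry {
    final String hdfsPath;  // where the file lives on HDFS
    final String linkName;  // name the task sees in its working directory

    CacheEntry(String hdfsPath, String linkName) {
        this.hdfsPath = hdfsPath;
        this.linkName = linkName;
    }

    /** Splits "path#link" at the last '#'. This sketch requires an explicit link name. */
    static CacheEntry parse(String spec) {
        int hash = spec.lastIndexOf('#');
        if (hash <= 0 || hash == spec.length() - 1) {
            throw new IllegalArgumentException("expected path#link, got: " + spec);
        }
        return new CacheEntry(spec.substring(0, hash), spec.substring(hash + 1));
    }

    /** The file a UDF would open, relative to the task's working directory. */
    File localFile(File workDir) {
        return new File(workDir, linkName);
    }

    public static void main(String[] args) {
        // Assumed HDFS location; adjust to wherever the file actually ends up.
        CacheEntry entry = CacheEntry.parse("/home/filelocation/sample.txt#sample");
        System.out.println("HDFS path: " + entry.hdfsPath);
        System.out.println("Open in UDF as: " + entry.localFile(new File(".")).getPath());
    }
}
```

In the UDF from the question, this would correspond to returning the HDFS path plus "#sample" from getCacheFiles() and keeping new File("./sample") in useSampleText().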

