How to load files with a different delimiter each time in Pig Latin

【Title】: how to load files with different delimiter each time in piglatin 【Posted】: 2014-10-14 07:25:02 【Question】:

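The data from my input source uses different delimiters, such as , or ; . Sometimes it is , and sometimes it is ; . But the PigStorage function accepts only one argument as the delimiter at a time. How can I load this kind of data (delimited by , or ;)?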

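【Comments】:

How do you know what the delimiter is? Can two lines in one file have different delimiters?

We use , or ; as the delimiter. No, two lines in one file never differ: every record/line in a file uses the same delimiter.

You can pass the delimiter to the Pig script as a parameter and invoke the script with the specific one.

If it is , or ; how do I pass the delimiter? PigStorage accepts only a single delimiter.

How do you know whether it is , or ;? Separate folders? File names?

【Answer 1】: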

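Can you check whether this works for you?

    It works for all input files with different delimiters.
    It also works for a single file that mixes delimiters.

You can add any number of delimiters inside the character class [,:-] (keep - as the last character so that it is read literally rather than as a range).

Example: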

input1.txt
1,2,3,4

input2.txt
a-b-c-d

input3.txt
100:200:300:400

input4.txt
100,aaa-200:b

PigScript:
A = LOAD 'input*' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,:-](.*)[,:-](.*)[,:-](.*)'))  AS (f1,f2,f3,f4);
DUMP B;

Output:
(1,2,3,4)
(a,b,c,d)
(100,200,300,400)
(100,aaa,200,b)
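
To see why [,:-] appears three times: four fields are separated by exactly three delimiters, so the pattern alternates four capture groups (.*) with three single-character delimiter classes. The sketch below replays the same pattern with java.util.regex, the regex engine Pig's regex functions are built on. The class and method names are hypothetical, and it assumes the whole line must match (my understanding of REGEX_EXTRACT_ALL's default behaviour).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimiterRegexDemo {
    // Same pattern as in the Pig script: four capture groups separated by
    // three single-character delimiters, each drawn from the class [,:-].
    private static final Pattern FOUR_FIELDS =
            Pattern.compile("(.*)[,:-](.*)[,:-](.*)[,:-](.*)");

    /** Returns the four captured fields, or null if the whole line does not match. */
    static List<String> extract(String line) {
        Matcher m = FOUR_FIELDS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            fields.add(m.group(i));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(extract("1,2,3,4"));
        System.out.println(extract("a-b-c-d"));
        System.out.println(extract("100:200:300:400"));
        System.out.println(extract("100,aaa-200:b"));
        // Fewer than three delimiters: no match.
        System.out.println(extract("1,2,3"));
        // More than three delimiters: the greedy first group absorbs the extras.
        System.out.println(extract("1,2,3,4,5"));
    }
}
```

Because (.*) is greedy, a line with more than three delimiters still matches, but the surplus ends up inside the first field ("1,2" in the last call). For a different number of columns, the pattern needs one more (.*) and one more [,:-] per extra field.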

【Comments】:

Can you explain how the regex works? Why is [,:-] repeated three times?

【Answer 2】:
A = LOAD '/some/path/COMMA-DELIM-PREFIX*' USING PigStorage(',') AS (f1:chararray, ...);
B = LOAD '/some/path/SEMICOLON-DELIM-PREFIX*' USING PigStorage(';') AS (f1:chararray, ...);

C = UNION A,B;
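
If the delimiter can be derived from the file-name prefix, as the comments below describe (QWERTY -> , and POIUY -> ;), the lookup itself is small. The following is only a sketch of that lookup in plain Java; the class name, method name, and the rule "the prefix is everything before the first underscore" are assumptions, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PrefixDelimiters {
    // Hypothetical prefix -> delimiter table; the two entries come from the
    // comment thread (QWERTY -> ',' and POIUY -> ';').
    private static final Map<String, String> BY_PREFIX = new LinkedHashMap<>();
    static {
        BY_PREFIX.put("QWERTY", ",");
        BY_PREFIX.put("POIUY", ";");
    }

    /** Returns the delimiter for a file name such as "QWERTY_123", or null if the prefix is unknown. */
    static String delimiterFor(String fileName) {
        // Assumption: the prefix is the part of the name before the first underscore.
        int underscore = fileName.indexOf('_');
        String prefix = underscore < 0 ? fileName : fileName.substring(0, underscore);
        return BY_PREFIX.get(prefix);
    }

    public static void main(String[] args) {
        System.out.println(delimiterFor("QWERTY_123"));
        System.out.println(delimiterFor("POIUY_029"));
        System.out.println(delimiterFor("OTHER_001"));
    }
}
```

With only a handful of prefixes, hard-coding one LOAD statement per prefix as shown above is probably simpler; a table like this pays off mainly when the logic moves into a custom loader that knows which file it is reading.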

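【Comments】:

Thanks Fred. My file names have standard prefixes (QWERTY_123, POIUY_029, etc.), and each prefix always uses the same delimiter: QWERTY -> , and POIUY -> ; . I developed a UDF that takes the prefix and returns the delimiter. Now, how do I read the file name in Pig so that I can split off the prefix and get the delimiter?

Depending on how many prefixes there are and how often they change, I would either hard-code them as described above or, if that is too cumbersome, extend the PigStorage class and put the logic there. Does that help?

I think a custom loader is the only way.

Fred, A and B have different schemas, so UNION does not work here. How can I save A and B to separate files?

【Answer 3】: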
You need to write your own custom loader that takes the delimiter as an argument.

Steps for writing a custom loader:

As of 0.7.0, Pig loaders extend the LoadFunc abstract class. This means they need to override 4 methods:

    getInputFormat(): returns to the caller an instance of the InputFormat that the loader supports. The actual load process needs an instance to use at load time, and doesn't want to place any constraints on how that instance is created.
    prepareToRead(): called prior to reading a split. It passes in the reader used during the reads of the split, as well as the actual split. The implementation of the loader usually keeps the reader, and may want to access the actual split if needed.
    setLocation(): Pig calls this to communicate the load location to the loader, which is responsible for passing that information to the underlying InputFormat object. This method can be called multiple times, so there should be no state associated with the method (unless that state gets reset when the method is called).
    getNext(): Pig calls this to get the next tuple from the loader once all setup has been done. If this method returns null, Pig assumes that all information in the split passed via the prepareToRead() method has been processed.


Here is the code:


package Pig;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class CustomLoader extends LoadFunc {

    private static final int DEFAULT_LIMIT = 226;

    // Delimiter passed in from the Pig script, e.g. CustomLoader('::').
    private final String delimiter;
    // Highest field index accepted per line.
    private final int limit = DEFAULT_LIMIT;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    @SuppressWarnings("rawtypes")
    private RecordReader reader;

    public CustomLoader(String delimiter) {
        this.delimiter = delimiter;
    }

    @SuppressWarnings("rawtypes")
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            // A false return means the split has no more records.
            if (!reader.nextKeyValue()) {
                return null;
            }
            Text value = (Text) reader.getCurrentValue();
            if (value == null) {
                return null;
            }
            // String.split() takes a regex, so quote the delimiter to match
            // characters such as | or . literally; -1 keeps trailing empty fields.
            String[] parts = value.toString().split(Pattern.quote(delimiter), -1);
            List<Object> values = new ArrayList<Object>(parts.length);
            for (int index = 0; index < parts.length; index++) {
                if (index > limit) {
                    throw new IOException("index " + index + " is out of bounds: max index = " + limit);
                }
                values.add(parts[index]);
            }
            return tupleFactory.newTuple(values);
        } catch (InterruptedException e) {
            // Add more information to the runtime exception condition.
            int errCode = 6018;
            String errMsg = "Error while reading input";
            throw new ExecException(errMsg, errCode, PigException.REMOTE_ENVIRONMENT, e);
        }
    }

    @SuppressWarnings("rawtypes")
    @Override
    public void prepareToRead(RecordReader reader, PigSplit pigSplit) throws IOException {
        // This loader does not need the PigSplit itself.
        this.reader = reader;
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // The location is assumed to be a comma-separated list of paths.
        FileInputFormat.setInputPaths(job, location);
    }
}


Create a jar file and register it:

register '/home/impadmin/customloader.jar';

Then load the data with the custom loader:

A = LOAD '/pig/u.data' USING Pig.CustomLoader('::') AS (id, mov_id, rat, timestamp);

Sample data set:

196::242::3::881250949
186::302::3::891717742
22::377::1::878887116
244::51::2::880606923
166::346::1::886397596
298::474::4::884182806
115::265::2::881171488
253::465::5::891628467
305::451::::886324817
6::86::3::883603013


Now you can specify any delimiter you want.

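【Comments】:

Thanks. But my question is how to dynamically load files that have different delimiters. For example, instead of :: it could be ; or , or - .

You need to specify the delimiter when invoking the loader, passing a single delimiter. A file in which fields are separated by different delimiters is not a valid scenario. Please let me know your thoughts.

1. Each file has only one delimiter (either , or ; or tab). 2. I know how to pass a delimiter to PigStorage or to a custom loader. My situation is this: my input file is delimited by , or ; or tab, and I don't know which of the three it is. My question is how to pass the delimiter dynamically so that the data loads correctly.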
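
One possible way to approach the situation in the last comment (one delimiter per file, but unknown in advance) is to guess the delimiter from a sample line before loading. The sketch below is a simple heuristic, not something proposed in the thread: it counts the three candidate characters in a line and picks the most frequent. The class and method names are hypothetical, and the guess can be wrong when field values themselves contain a candidate character.

```java
final class DelimiterSniffer {
    // Candidate delimiters named in the thread: comma, semicolon, tab.
    private static final char[] CANDIDATES = {',', ';', '\t'};

    /**
     * Returns the candidate that occurs most often in the sample line.
     * Ties go to the earlier candidate; ',' is returned if none occur.
     */
    static char sniff(String sampleLine) {
        char best = CANDIDATES[0];
        int bestCount = -1;
        for (char candidate : CANDIDATES) {
            int count = 0;
            for (int i = 0; i < sampleLine.length(); i++) {
                if (sampleLine.charAt(i) == candidate) {
                    count++;
                }
            }
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(sniff("1,2,3,4") == ',');
        System.out.println(sniff("1;2;3;4") == ';');
        System.out.println(sniff("a\tb\tc") == '\t');
    }
}
```

The detected character could then be handed to the script as a parameter, as suggested in the comments on the question, or used inside a custom loader on the first line of each split.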
