java 读写Parquet格式的数据 Parquet example

Posted Nucky_yang

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了java 读写Parquet格式的数据 Parquet example相关的知识,希望对你有一定的参考价值。

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetReader.Builder;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ReadParquet {
    static Logger logger=Logger.getLogger(ReadParquet.class);
    public static void main(String[] args) throws Exception {
        
//        parquetWriter("test\\parquet-out2","input.txt");
        parquetReaderV2("test\\parquet-out2");
    }
    
    
    static void parquetReaderV2(String inPath) throws Exception{
        GroupReadSupport readSupport = new GroupReadSupport();
        Builder<Group> reader= ParquetReader.builder(readSupport, new Path(inPath));
        ParquetReader<Group> build=reader.build();
        Group line=null;
        while((line=build.read())!=null){
            System.out.println(line.toString());
        }
        System.out.println("读取结束");
        
    } 
    //新版本中new ParquetReader()所有构造方法好像都弃用了,用上面的builder去构造对象
    static void parquetReader(String inPath) throws Exception{
        GroupReadSupport readSupport = new GroupReadSupport();
        ParquetReader<Group> reader = new ParquetReader<Group>(new Path(inPath),readSupport);
        Group line=null;
        while((line=reader.read())!=null){
            System.out.println(line.toString());
        }
        System.out.println("读取结束");
        
    }
    /**
     * 
     * @param outPath  输出Parquet格式
     * @param inPath  输入普通文本文件
     * @throws IOException
     */
    static void parquetWriter(String outPath,String inPath) throws IOException{
        MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
                " required binary city (UTF8);\n" +
                " required binary ip (UTF8);\n" +
                " repeated group time {\n"+
                  " required int32 ttl;\n"+
                  " required binary ttl2;\n"+
                "}\n"+
              "}");
        GroupFactory factory = new SimpleGroupFactory(schema);
        Path path = new Path(outPath);
       Configuration configuration = new Configuration();
       GroupWriteSupport writeSupport = new GroupWriteSupport();
       writeSupport.setSchema(schema,configuration);
       ParquetWriter<Group> writer = new ParquetWriter<Group>(path,configuration,writeSupport);
    //把本地文件读取进去,用来生成parquet格式文件 BufferedReader br
=new BufferedReader(new FileReader(new File(inPath))); String line=""; Random r=new Random(); while((line=br.readLine())!=null){ String[] strs=line.split("\\s+"); if(strs.length==2) { Group group = factory.newGroup() .append("city",strs[0]) .append("ip",strs[1]); Group tmpG =group.addGroup("time"); tmpG.append("ttl", r.nextInt(9)+1); tmpG.append("ttl2", r.nextInt(9)+"_a"); writer.write(group); } } System.out.println("write end"); writer.close(); } }
说下schema(写Parquet格式数据需要schema,读取的话"自动识别"了schema)
/*
 * 每一个字段有三个属性:重复数、数据类型和字段名,重复数可以是以下三种:
 *         required(出现1次)
 *         repeated(出现0次或多次)
 *         optional(出现0次或1次)
 * 每一个字段的数据类型可以分成两种:
 *         group(复杂类型)
 *         primitive(基本类型)
* 数据类型有
* INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY
*/

maven依赖(我用的1.7)
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.7.0</version>
</dependency>

 







以上是关于java 读写Parquet格式的数据 Parquet example的主要内容,如果未能解决你的问题,请参考以下文章

读取大量 parquet 文件:read_parquet vs from_delayed

spark DataFrame 读写和保存数据

使用Spark读写Parquet文件验证Parquet自带表头的性质及NULL值来源Java

使用Spark读写Parquet文件验证Parquet自带表头的性质及NULL值来源Java

使用Spark读写Parquet文件验证Parquet自带表头的性质及NULL值来源Java

12.spark sql之读写数据