Installing the ORC library on Linux and reading ORC files with Go, as preparation for learning big-data processing

Posted by freewebsys

Preface


The original link for this article is:
https://blog.csdn.net/freewebsys/article/details/124121247

Do not repost without the author's permission.
Author's blog: http://blog.csdn.net/freewebsys

1, About the ORC library


Website: https://orc.apache.org/

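ORC stands for Optimized Row Columnar. The ORC file format is a columnar storage format in the Hadoop ecosystem. It first appeared in early 2013 and originated in Apache Hive, where it was introduced to reduce Hadoop storage space and speed up Hive queries.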
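To make "columnar" concrete, here is a minimal Go sketch that pivots row records into per-column slices, using the exam-score data that appears later in this article. The names Student, Columns, toColumns and sum are made up for illustration. A real ORC file also adds stripes, indexes and compression, none of which is modeled here.

```go
package main

import "fmt"

// Student is one row of the sample exam-score data (hypothetical type).
type Student struct {
	Name    string
	Age     int
	Math    int
	English int
}

// Columns stores the same data column by column. This is only an
// illustration of the idea, not the ORC on-disk layout.
type Columns struct {
	Name    []string
	Age     []int
	Math    []int
	English []int
}

// toColumns pivots row-oriented records into column-oriented slices.
func toColumns(rows []Student) Columns {
	var c Columns
	for _, r := range rows {
		c.Name = append(c.Name, r.Name)
		c.Age = append(c.Age, r.Age)
		c.Math = append(c.Math, r.Math)
		c.English = append(c.English, r.English)
	}
	return c
}

// sum adds up a single column without touching the other columns.
func sum(col []int) int {
	total := 0
	for _, v := range col {
		total += v
	}
	return total
}

func main() {
	rows := []Student{
		{"zhangsan", 13, 100, 98},
		{"lisi", 14, 89, 88},
		{"wangwu", 13, 60, 78},
		{"zhaoliu", 12, 56, 67},
	}
	cols := toColumns(rows)
	fmt.Println("math total:", sum(cols.Math)) // prints: math total: 305
}
```

A query that only needs the math scores reads one slice here. That is the intuition behind why columnar formats can save I/O on analytical queries.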

2, Installation and usage


Download link:
https://www.apache.org/dyn/closer.cgi/orc/orc-1.7.3/orc-1.7.3.tar.gz

apt-get -y install build-essential cmake
curl -sSLO  https://dlcdn.apache.org/orc/orc-1.7.3/orc-1.7.3.tar.gz
tar -zxf orc-1.7.3.tar.gz
cd orc-1.7.3
mkdir build
cd build
cmake .. -DBUILD_JAVA=OFF
make install
# The tools are installed directly into /usr/local/bin.

After that, CSV files can be converted to ORC format. The installed tools include:
orc-contents orc-memory orc-metadata orc-scan orc-statistics

# orc-contents 
Usage: orc-contents [options] <filename>...
Options:
	-h --help
	-c --columns		Comma separated list of top-level column fields
	-t --columnTypeIds	Comma separated list of column type ids
	-n --columnNames	Comma separated list of column names
	-b --batch		Batch size for reading
Print contents of ORC files.

# csv-import 
Usage: csv-import [-h] [--help]
                  [-d <character>] [--delimiter=<character>]
                  [-s <size>] [--stripe=<size>]
                  [-c <size>] [--block=<size>]
                  [-b <size>] [--batch=<size>]
                  [-t <string>] [--timezone=<string>]
                  <schema> <input> <output>
Import CSV file into an Orc file using the specified schema.
The timezone is writer timezone of timestamp types.
Compound types are not yet supported.

# orc-metadata 
Usage: orc-metadata [-h] [--help] [-r] [--raw] [-v] [--verbose] <filename>

# Create a CSV file containing exam scores.
vi student.csv
zhangsan,13,100,98
lisi,14,89,88
wangwu,13,60,78
zhaoliu,12,56,67

# Convert it to ORC using a struct schema
csv-import "struct<name:string,age:int,math:int,english:int>" student.csv student.orc
[2022-04-12 14:00:10] Start importing Orc file...
[2022-04-12 14:00:10] Finish importing Orc file.
[2022-04-12 14:00:10] Total writer elasped time: 0.005487s.
[2022-04-12 14:00:10] Total writer CPU time: 0.005466s.
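
The schema string passed to csv-import is a flat list of name:type pairs wrapped in struct<...>. As a rough illustration of that format, the following Go sketch splits such a string into its fields. Field and parseFlatStruct are hypothetical helpers written for this article and are not part of the ORC tools. The sketch handles flat structs only.

```go
package main

import (
	"fmt"
	"strings"
)

// Field is one name:type pair from a flat struct schema (hypothetical type).
type Field struct {
	Name string
	Type string
}

// parseFlatStruct splits a schema such as "struct<name:string,age:int>"
// into its fields. Nested types or types with parameters are not handled;
// they would need a real parser.
func parseFlatStruct(schema string) ([]Field, error) {
	if !strings.HasPrefix(schema, "struct<") || !strings.HasSuffix(schema, ">") {
		return nil, fmt.Errorf("not a struct schema: %q", schema)
	}
	body := strings.TrimSuffix(strings.TrimPrefix(schema, "struct<"), ">")
	var fields []Field
	for _, part := range strings.Split(body, ",") {
		kv := strings.SplitN(part, ":", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("bad field %q", part)
		}
		name, typ := strings.TrimSpace(kv[0]), strings.TrimSpace(kv[1])
		if name == "" || typ == "" {
			return nil, fmt.Errorf("bad field %q", part)
		}
		fields = append(fields, Field{Name: name, Type: typ})
	}
	return fields, nil
}

func main() {
	fields, err := parseFlatStruct("struct<name:string,age:int,math:int,english:int>")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The CSV columns are matched to these fields by position.
	for i, f := range fields {
		fmt.Printf("csv column %d -> %s (%s)\n", i, f.Name, f.Type)
	}
}
```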

# orc-contents  student.orc 
{"name": "zhangsan", "age": 13, "math": 100, "english": 98}
{"name": "lisi", "age": 14, "math": 89, "english": 88}
{"name": "wangwu", "age": 13, "math": 60, "english": 78}
{"name": "zhaoliu", "age": 12, "math": 56, "english": 67}

# Show a single column
# orc-contents -n name  student.orc 
{"name": "zhangsan"}
{"name": "lisi"}
{"name": "wangwu"}
{"name": "zhaoliu"}

# Statistics
# orc-statistics student.orc 
File student.orc has 5 columns
*** Column 0 ***
Column has 4 values and has null value: no

*** Column 1 ***
Data type: String
Values: 4
Has null: no
Minimum: lisi
Maximum: zhaoliu
Total length: 25

*** Column 2 ***
Data type: Integer
Values: 4
Has null: no
Minimum: 12
Maximum: 14
Sum: 52
...
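
The tool reports 5 columns because the root struct itself counts as column 0, followed by the four fields. The per-column numbers can be checked by hand. The Go sketch below recomputes them from the sample data. IntStats, StringStats and the two helper functions are names invented for this example, and string min/max use plain byte-wise comparison, which is enough for this ASCII data.

```go
package main

import "fmt"

// IntStats mirrors the integer statistics shown above (hypothetical type).
type IntStats struct {
	Count, Min, Max, Sum int
}

// StringStats mirrors the string statistics shown above (hypothetical type).
type StringStats struct {
	Count       int
	Min, Max    string
	TotalLength int
}

// intStats computes count, min, max and sum for one integer column.
func intStats(vals []int) IntStats {
	var s IntStats
	for i, v := range vals {
		if i == 0 || v < s.Min {
			s.Min = v
		}
		if i == 0 || v > s.Max {
			s.Max = v
		}
		s.Sum += v
		s.Count++
	}
	return s
}

// stringStats computes count, min, max and total byte length for one string column.
func stringStats(vals []string) StringStats {
	var s StringStats
	for i, v := range vals {
		if i == 0 || v < s.Min {
			s.Min = v
		}
		if i == 0 || v > s.Max {
			s.Max = v
		}
		s.TotalLength += len(v)
		s.Count++
	}
	return s
}

func main() {
	names := []string{"zhangsan", "lisi", "wangwu", "zhaoliu"}
	ages := []int{13, 14, 13, 12}
	fmt.Printf("%+v\n", stringStats(names)) // Min lisi, Max zhaoliu, TotalLength 25
	fmt.Printf("%+v\n", intStats(ages))     // Min 12, Max 14, Sum 52
}
```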

3, Go code


In Go, the github.com/scritchley/orc library can read ORC files directly and pull the data out.

package main

import (
	"fmt"
	"log"
	"testing"

	"github.com/scritchley/orc"
)

func TestReadNullAtEnd(t *testing.T) {
	// Open the ORC file produced by csv-import.
	r, err := orc.Open("student.orc")
	if err != nil {
		log.Fatal(err)
	}

	// Select every top-level column in the schema.
	selected := r.Schema().Columns()
	c := r.Select(selected...)
	defer c.Close()

	vals := make([]interface{}, len(selected))
	ptrVals := make([]interface{}, len(selected))
	strVals := make([]string, len(selected))
	for i := range vals {
		ptrVals[i] = &vals[i]
	}

	// Iterate over each stripe, then over each row within the stripe.
	for c.Stripes() {
		for c.Next() {
			err := c.Scan(ptrVals...)
			if err != nil {
				log.Fatal(err)
			}
			// Print each scanned value on its own line.
			for i := range ptrVals {
				strVals[i] = fmt.Sprint(ptrVals[i])
				log.Println(strVals[i])
			}
		}
	}

	if err := c.Err(); err != nil {
		log.Fatal(err)
	}
}

# go test -v orc_read_test.go 
=== RUN   TestReadNullAtEnd
2022/04/12 14:13:56 zhangsan
2022/04/12 14:13:56 13
2022/04/12 14:13:56 100
2022/04/12 14:13:56 98
2022/04/12 14:13:56 lisi
2022/04/12 14:13:56 14
2022/04/12 14:13:56 89
2022/04/12 14:13:56 88
2022/04/12 14:13:56 wangwu
2022/04/12 14:13:56 13
2022/04/12 14:13:56 60
2022/04/12 14:13:56 78
2022/04/12 14:13:56 zhaoliu
2022/04/12 14:13:56 12
2022/04/12 14:13:56 56
2022/04/12 14:13:56 67
--- PASS: TestReadNullAtEnd (0.00s)
PASS
ok  	command-line-arguments	0.008s
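
Printing values one per line is fine for a smoke test, but real code usually wants typed rows. Below is a small sketch, independent of the ORC library, that converts one scanned row ([]interface{}) into a struct. Student, toInt and rowToStudent are hypothetical names. The sketch assumes string columns arrive as string and integer columns as int64, and accepts a few other integer types in case that assumption does not hold for your library version.

```go
package main

import "fmt"

// Student is a typed view of one row of student.orc (hypothetical type).
type Student struct {
	Name    string
	Age     int
	Math    int
	English int
}

// toInt converts the integer types a reader might return into int.
// Which concrete type the ORC reader yields is an assumption here.
func toInt(v interface{}) (int, error) {
	switch n := v.(type) {
	case int64:
		return int(n), nil
	case int32:
		return int(n), nil
	case int:
		return n, nil
	default:
		return 0, fmt.Errorf("unexpected integer type %T", v)
	}
}

// rowToStudent maps a scanned row in schema order onto a Student.
func rowToStudent(row []interface{}) (Student, error) {
	if len(row) != 4 {
		return Student{}, fmt.Errorf("expected 4 values, got %d", len(row))
	}
	name, ok := row[0].(string)
	if !ok {
		return Student{}, fmt.Errorf("name has unexpected type %T", row[0])
	}
	nums := make([]int, 3)
	for i := range nums {
		n, err := toInt(row[i+1])
		if err != nil {
			return Student{}, fmt.Errorf("column %d: %w", i+1, err)
		}
		nums[i] = n
	}
	return Student{Name: name, Age: nums[0], Math: nums[1], English: nums[2]}, nil
}

func main() {
	// A row shaped like the first record of student.orc.
	row := []interface{}{"zhangsan", int64(13), int64(100), int64(98)}
	s, err := rowToStudent(row)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```

Inside the read loop, the scanned row would be passed to rowToStudent instead of being printed value by value.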

4, ORC summary


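Data files in ORC format are convenient to work with: the command-line tools cover import, inspection and statistics, and a few lines of Go are enough to read the rows back.
ORC is a very common format in big-data processing, and it is easy to get started with.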
