Installing the ORC library on Linux and reading ORC files with Go, as preparation for learning big data processing
Posted by freewebsys
Preface
The original link for this article is:
https://blog.csdn.net/freewebsys/article/details/124121247
Do not repost without the author's permission.
The author's blog: http://blog.csdn.net/freewebsys
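1, About the ORC library
ORC stands for Optimized Row Columnar. The ORC file format is a columnar storage format in the Hadoop ecosystem. It dates back to early 2013 and originated in Apache Hive, where it was introduced to reduce Hadoop storage space and speed up Hive queries.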
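To make the idea of columnar storage concrete, here is a minimal Go sketch that pivots row records into one slice per column. It is only an illustration of the row-versus-column idea and does not reflect ORC's actual on-disk layout (stripes, encodings, compression). The `Row` and `Columns` types, the `toColumns` function, and the sample values are all hypothetical.

```go
package main

import "fmt"

// Row is a hypothetical row-oriented record: all fields of one entry together.
type Row struct {
	Name string
	Age  int
}

// Columns is a hypothetical column-oriented layout: one slice per field.
type Columns struct {
	Names []string
	Ages  []int
}

// toColumns pivots row records into per-column slices.
func toColumns(rows []Row) Columns {
	cols := Columns{
		Names: make([]string, 0, len(rows)),
		Ages:  make([]int, 0, len(rows)),
	}
	for _, r := range rows {
		cols.Names = append(cols.Names, r.Name)
		cols.Ages = append(cols.Ages, r.Age)
	}
	return cols
}

func main() {
	// Made-up sample data for illustration.
	rows := []Row{
		{Name: "a", Age: 13},
		{Name: "b", Age: 14},
	}
	cols := toColumns(rows)
	// A query that needs only the ages reads one contiguous slice
	// and never touches the names.
	fmt.Println(cols.Ages)
}
```

Storing values of the same type next to each other is also what makes per-column compression and per-column statistics cheap, which is where the storage and query-speed benefits come from.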
2, Installation and usage
Download link:
https://www.apache.org/dyn/closer.cgi/orc/orc-1.7.3/orc-1.7.3.tar.gz
apt-get -y install build-essential cmake
curl -sSLO https://dlcdn.apache.org/orc/orc-1.7.3/orc-1.7.3.tar.gz
tar -zxf orc-1.7.3.tar.gz
cd orc-1.7.3
mkdir build
cd build
cmake .. -DBUILD_JAVA=OFF
make install
# The tools are installed directly into /usr/local/bin.
You can then convert CSV data into ORC format with the installed tools.
orc-contents orc-memory orc-metadata orc-scan orc-statistics
# orc-contents
Usage: orc-contents [options] <filename>...
Options:
-h --help
-c --columns Comma separated list of top-level column fields
-t --columnTypeIds Comma separated list of column type ids
-n --columnNames Comma separated list of column names
-b --batch Batch size for reading
Print contents of ORC files.
# csv-import
Usage: csv-import [-h] [--help]
[-d <character>] [--delimiter=<character>]
[-s <size>] [--stripe=<size>]
[-c <size>] [--block=<size>]
[-b <size>] [--batch=<size>]
[-t <string>] [--timezone=<string>]
<schema> <input> <output>
Import CSV file into an Orc file using the specified schema.
The timezone is writer timezone of timestamp types.
Compound types are not yet supported.
# orc-metadata
Usage: orc-metadata [-h] [--help] [-r] [--raw] [-v] [--verbose] <filename>
# Create a CSV file containing exam scores.
vi student.csv
zhangsan,13,100,98
lisi,14,89,88
wangwu,13,60,78
zhaoliu,12,56,67
# Convert it to ORC using a struct schema
csv-import "struct<name:string,age:int,math:int,english:int>" student.csv student.orc
[2022-04-12 14:00:10] Start importing Orc file...
[2022-04-12 14:00:10] Finish importing Orc file.
[2022-04-12 14:00:10] Total writer elasped time: 0.005487s.
[2022-04-12 14:00:10] Total writer CPU time: 0.005466s.
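The schema string passed to csv-import above maps CSV columns, in order, to named and typed fields. As a rough illustration of that mapping, the Go sketch below splits a flat schema string into name and type pairs. It is not the parser used by csv-import; `Field` and `parseFlatSchema` are hypothetical names, and the sketch assumes a flat struct of simple types with no commas inside a type.

```go
package main

import (
	"fmt"
	"strings"
)

// Field is one name:type pair from a flat struct schema.
type Field struct {
	Name string
	Type string
}

// parseFlatSchema splits a schema such as "struct<a:int,b:string>" into
// fields. Assumption for this sketch: only flat structs of simple types
// are handled, so nested types or types containing commas are rejected
// or misparsed.
func parseFlatSchema(schema string) ([]Field, error) {
	if !strings.HasPrefix(schema, "struct<") || !strings.HasSuffix(schema, ">") {
		return nil, fmt.Errorf("not a struct schema: %q", schema)
	}
	body := strings.TrimSuffix(strings.TrimPrefix(schema, "struct<"), ">")
	var fields []Field
	for _, part := range strings.Split(body, ",") {
		kv := strings.SplitN(part, ":", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("bad field %q", part)
		}
		name, typ := strings.TrimSpace(kv[0]), strings.TrimSpace(kv[1])
		if name == "" || typ == "" {
			return nil, fmt.Errorf("bad field %q", part)
		}
		fields = append(fields, Field{Name: name, Type: typ})
	}
	return fields, nil
}

func main() {
	fields, err := parseFlatSchema("struct<name:string,age:int,math:int,english:int>")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The i-th CSV column is written into the i-th schema field.
	for i, f := range fields {
		fmt.Printf("CSV column %d -> %s (%s)\n", i, f.Name, f.Type)
	}
}
```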
# orc-contents student.orc
{"name": "zhangsan", "age": 13, "math": 100, "english": 98}
{"name": "lisi", "age": 14, "math": 89, "english": 88}
{"name": "wangwu", "age": 13, "math": 60, "english": 78}
{"name": "zhaoliu", "age": 12, "math": 56, "english": 67}
# View a single column
# orc-contents -n name student.orc
{"name": "zhangsan"}
{"name": "lisi"}
{"name": "wangwu"}
{"name": "zhaoliu"}
# Statistics
# orc-statistics student.orc
File student.orc has 5 columns
*** Column 0 ***
Column has 4 values and has null value: no
*** Column 1 ***
Data type: String
Values: 4
Has null: no
Minimum: lisi
Maximum: zhaoliu
Total length: 25
*** Column 2 ***
Data type: Integer
Values: 4
Has null: no
Minimum: 12
Maximum: 14
Sum: 52
...
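The per-column figures printed by orc-statistics can be reproduced by hand, which helps when reading the output. The Go sketch below recomputes the same kinds of values for the name and age columns of student.csv. It only illustrates what the numbers mean and is not how the ORC library computes or stores them; `IntStats`, `StringStats`, `intStats`, and `stringStats` are hypothetical names.

```go
package main

import "fmt"

// IntStats holds the figures shown for an Integer column.
type IntStats struct {
	Values        int
	Min, Max, Sum int64
}

// intStats computes count, minimum, maximum, and sum of one column.
func intStats(col []int64) IntStats {
	s := IntStats{Values: len(col)}
	for i, v := range col {
		if i == 0 || v < s.Min {
			s.Min = v
		}
		if i == 0 || v > s.Max {
			s.Max = v
		}
		s.Sum += v
	}
	return s
}

// StringStats holds the figures shown for a String column.
type StringStats struct {
	Values      int
	Min, Max    string
	TotalLength int
}

// stringStats computes count, lexicographic minimum and maximum,
// and the total length in bytes of one column.
func stringStats(col []string) StringStats {
	s := StringStats{Values: len(col)}
	for i, v := range col {
		if i == 0 || v < s.Min {
			s.Min = v
		}
		if i == 0 || v > s.Max {
			s.Max = v
		}
		s.TotalLength += len(v)
	}
	return s
}

func main() {
	// The name and age columns from student.csv.
	names := []string{"zhangsan", "lisi", "wangwu", "zhaoliu"}
	ages := []int64{13, 14, 13, 12}
	fmt.Printf("%+v\n", stringStats(names))
	fmt.Printf("%+v\n", intStats(ages))
}
```

For this data the sketch yields the same minimum, maximum, total length, and sum as Column 1 and Column 2 in the output above.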
3, Go code
In Go, the github.com/scritchley/orc library can read ORC files directly and return the data row by row.
package main

import (
	"fmt"
	"log"
	"testing"

	"github.com/scritchley/orc"
)

func TestReadNullAtEnd(t *testing.T) {
	// Open the ORC file produced by csv-import.
	r, err := orc.Open("student.orc")
	if err != nil {
		log.Fatal(err)
	}
	// Select every top-level column in the schema.
	selected := r.Schema().Columns()
	c := r.Select(selected...)
	defer c.Close()
	// vals holds one row; ptrVals points at each slot so Scan can fill it.
	vals := make([]interface{}, len(selected))
	ptrVals := make([]interface{}, len(selected))
	strVals := make([]string, len(selected))
	for i := range vals {
		ptrVals[i] = &vals[i]
	}
	// Iterate stripe by stripe, then row by row.
	for c.Stripes() {
		for c.Next() {
			if err := c.Scan(ptrVals...); err != nil {
				log.Fatal(err)
			}
			for i := range vals {
				// Format the scanned value, not the pointer to it.
				strVals[i] = fmt.Sprint(vals[i])
				log.Println(strVals[i])
			}
		}
	}
	if err := c.Err(); err != nil {
		log.Fatal(err)
	}
}
# go test -v orc_read_test.go
=== RUN TestReadNullAtEnd
2022/04/12 14:13:56 zhangsan
2022/04/12 14:13:56 13
2022/04/12 14:13:56 100
2022/04/12 14:13:56 98
2022/04/12 14:13:56 lisi
2022/04/12 14:13:56 14
2022/04/12 14:13:56 89
2022/04/12 14:13:56 88
2022/04/12 14:13:56 wangwu
2022/04/12 14:13:56 13
2022/04/12 14:13:56 60
2022/04/12 14:13:56 78
2022/04/12 14:13:56 zhaoliu
2022/04/12 14:13:56 12
2022/04/12 14:13:56 56
2022/04/12 14:13:56 67
--- PASS: TestReadNullAtEnd (0.00s)
PASS
ok command-line-arguments 0.008s
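Printing every value as a string is fine for a first look, but for real processing a typed record is usually easier to work with. The following Go sketch converts one scanned row into a struct. `Student` and `toStudent` are hypothetical names, and the sketch assumes the reader yields string for string columns and int64 for int columns; verify the concrete types your reader returns (for example with `%T`) before relying on this.

```go
package main

import "fmt"

// Student is a hypothetical typed view of one row of student.orc.
type Student struct {
	Name               string
	Age, Math, English int64
}

// toStudent converts one scanned row into a Student.
// Assumption: string columns arrive as string and int columns as int64.
func toStudent(row []interface{}) (Student, error) {
	if len(row) != 4 {
		return Student{}, fmt.Errorf("want 4 columns, got %d", len(row))
	}
	name, ok := row[0].(string)
	if !ok {
		return Student{}, fmt.Errorf("column 0: want string, got %T", row[0])
	}
	var nums [3]int64
	for i := range nums {
		n, ok := row[i+1].(int64)
		if !ok {
			return Student{}, fmt.Errorf("column %d: want int64, got %T", i+1, row[i+1])
		}
		nums[i] = n
	}
	return Student{Name: name, Age: nums[0], Math: nums[1], English: nums[2]}, nil
}

func main() {
	// Stand-in for the values filled in by c.Scan in the test above.
	row := []interface{}{"zhangsan", int64(13), int64(100), int64(98)}
	s, err := toStudent(row)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: math=%d english=%d\n", s.Name, s.Math, s.English)
}
```

Returning an error instead of panicking on a failed type assertion keeps a single malformed row from stopping a long scan.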
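4, ORC summary
Data files in ORC format are convenient to process.
ORC is a very commonly used format in big data processing, and it is easy to get started with.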