Python ETL yaml

Posted onetoinf

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Python ETL yaml相关的知识,希望对你有一定的参考价值。

标记语言的比较

  • XML

最早的通用信息标记语言,可扩展性好,但繁琐,html可以看作XML的子集

  • JSON

信息有类型,适合程序处理(js),较XML简洁。移动应用云端和节点的信息通信,无注释

  • YAML

信息无类型,文本信息比例最高,可读性好。各类系统的配置文件,有注释易读。

YAML 是专门用来写配置文件的语言,非常简洁和强大,远比 JSON 格式方便。

    YAML在python语言中有PyYAML安装包,下载地址:https://pypi.python.org/pypi/PyYAML

YAML简介

YAML 语言(发音 /?j?m?l/ )的设计目标,就是方便人类读写。它实质上是一种通用的数据串行化格式。

它的基本语法规则如下:

  • 大小写敏感

  • 使用缩进表示层级关系

  • 缩进时不允许使用Tab键,只允许使用空格。

  • 缩进的空格数目不重要,只要相同层级的元素左侧对齐即可

  • # 表示注释,从这个字符一直到行尾,都会被解析器忽略,这个和python的注释一样

YAML 支持的数据结构有三种:

  • 对象:键值对的集合,又称为映射(mapping)/ 哈希(hashes) / 字典(dictionary)

  • 数组:一组按次序排列的值,又称为序列(sequence) / 列表(list)

  • 纯量(scalars):单个的、不可再分的值。字符串、布尔值、整数、浮点数、Null、时间、日期

数据结构可以用类似大纲的缩排方式呈现,结构通过缩进来表示,

数组

YAML流是一个零或多个数组的集合。空流不包含数组,数组使用---进行分离,数组可以选择以...结尾

数组可以选择以---进行标记。

  • 隐式数组示例
- Multimedia
- Internet
- Education
  • 一个显式数组的例子
---
- Afterstep
- CTWM
- Oroborus
...

同一流程中的几个数组示例

---
- Ada
- APL
- ASP

- Assembly
- Awk
---
- Basic
---
- C
- C#    # Note that comments are denoted with ' #' (space then #).
- C++
- Cold Fusion

序列

连续的项目通过减号“-”来表示,

# YAML
- The Dagger 'Narthanc'
- The Dagger 'Nimthanc'
- The Dagger 'Dethanc'

# Python
["The Dagger 'Narthanc'", "The Dagger 'Nimthanc'", "The Dagger 'Dethanc'"]

分段可以进行嵌套

# YAML
-
  - HTML
  - LaTeX
  - SGML
  - VRML
  - XML
  - YAML
-
  - BSD
  - GNU Hurd
  - Linux
# Python
[['HTML', 'LaTeX', 'SGML', 'VRML', 'XML', 'YAML'], ['BSD', 'GNU Hurd', 'Linux']]

不需要用新行来启动嵌套的序列

# YAML
- 1.1
- - 2.1
  - 2.2
- - - 3.1
    - 3.2
    - 3.3
    
# Python
[1.1, [2.1, 2.2], [[3.1, 3.2, 3.3]]]

块序列可以嵌套到块映射中。注意,在这种情况下,没有必要缩进序列。

# YAML
left hand:
- Ring of Teleportation
- Ring of Speed

right hand:
- Ring of Resist Fire
- Ring of Resist Cold
- Ring of Resist Poison
# Python
{'right hand': ['Ring of Resist Fire', 'Ring of Resist Cold', 'Ring of Resist Poison'],
'left hand': ['Ring of Teleportation', 'Ring of Speed']}

对象

map结构里面的key/value对用冒号“:”来分隔。

# YAML
base armor class: 0
base damage: [4,4]
plus to-hit: 12
plus to-dam: 16
plus to-ac: 0

# Python
{
    'plus to-hit': 12, 
    'base damage': [4, 4], 
    'base armor class': 0, 
    'plus to-ac': 0, 
    'plus to-dam': 16
    }

用复杂的键用?和类型表示

# YAML
? !!python/tuple [0,0]
: The Hero
? !!python/tuple [0,1]
: Treasure
? !!python/tuple [1,0]
: Treasure
? !!python/tuple [1,1]
: The Dragon
# Python
{(0, 1): 'Treasure', (1, 0): 'Treasure', (0, 0): 'The Hero', (1, 1): 'The Dragon'}

对象同样可以进行嵌套

# YAML
hero:
  hp: 34
  sp: 8
  level: 4
orc:
  hp: 12
  sp: 0
  level: 2
  
# Python
{'hero': {'hp': 34, 'sp': 8, 'level': 4}, 'orc': {'hp': 12, 'sp': 0, 'level': 2}}

对象可以嵌套在数组中

# YAML
- name: PyYAML
  status: 4
  license: MIT
  language: Python
- name: PySyck
  status: 5
  license: BSD
  language: Python
# Python
[{'status': 4, 'language': 'Python', 'name': 'PyYAML', 'license': 'MIT'},
{'status': 5, 'license': 'BSD', 'name': 'PySyck', 'language': 'Python'}]

流集合

YAML中的流集合的语法非常接近于Python中的list和dictionary构造函数的语法:

# YAML
{ str: [15, 17], con: [16, 16], dex: [17, 18], wis: [16, 16], int: [10, 13], chr: [5, 8] }

# Python
{
    'dex': [17, 18], 
    'int': [10, 13], 
    'chr': [5, 8], 
    'wis': [16, 16], 
    'str': [15, 17], 
    'con': [16, 16]
    }

纯量

纯量是最基本的、不可再分的值。以下数据类型都属于 JavaScript 的纯量。

  • 字符串
  • 布尔值
  • 整数
  • 浮点数
  • Null
  • 时间
  • 日期

PyYaml

python 主要解析YAML的库

Loading YAML

>>> yaml.load("""
... - Hesperiidae
... - Papilionidae
... - Apatelodidae
... - Epiplemidae
... """)

['Hesperiidae', 'Papilionidae', 'Apatelodidae', 'Epiplemidae']

pyyaml支持unicode

>>> yaml.load(u"""
... hello: Привет!
... """)    # In Python 3, do not use the 'u' prefix

{'hello': u'\u041f\u0440\u0438\u0432\u0435\u0442!'}

>>> stream = file('document.yaml', 'r')    # 'document.yaml' contains a single YAML document.
>>> yaml.load(stream)
[...]    # A Python object corresponding to the document.

复杂例子

>>> documents = """
... ---
... name: The Set of Gauntlets 'Pauraegen'
... description: >
...     A set of handgear with sparks that crackle
...     across its knuckleguards.
... ---
... name: The Set of Gauntlets 'Paurnen'
... description: >
...   A set of gauntlets that gives off a foul,
...   acrid odour yet remains untarnished.
... ---
... name: The Set of Gauntlets 'Paurnimmen'
... description: >
...   A set of handgear, freezing with unnatural cold.
... """

>>> for data in yaml.load_all(documents):
...     print data

{'description': 'A set of handgear with sparks that crackle across its knuckleguards.\n',
'name': "The Set of Gauntlets 'Pauraegen'"}
{'description': 'A set of gauntlets that gives off a foul, acrid odour yet remains untarnished.\n',
'name': "The Set of Gauntlets 'Paurnen'"}
{'description': 'A set of handgear, freezing with unnatural cold.\n',
'name': "The Set of Gauntlets 'Paurnimmen'"}

转化为Python类型

>>> yaml.load("""
... none: [~, null]
... bool: [true, false, on, off]
... int: 42
... float: 3.14159
... list: [LITE, RES_ACID, SUS_DEXT]
... dict: {hp: 13, sp: 5}
... """)

{'none': [None, None], 'int': 42, 'float': 3.1415899999999999,
'list': ['LITE', 'RES_ACID', 'SUS_DEXT'], 'dict': {'hp': 13, 'sp': 5},
'bool': [True, False, True, False]}

即使是Python类的实例也可以使用!!python/object对象标签。

>>> class Hero:
...     def __init__(self, name, hp, sp):
...         self.name = name
...         self.hp = hp
...         self.sp = sp
...     def __repr__(self):
...         return "%s(name=%r, hp=%r, sp=%r)" % (
...             self.__class__.__name__, self.name, self.hp, self.sp)

>>> yaml.load("""
... !!python/object:__main__.Hero
... name: Welthyr Syxgon
... hp: 1200
... sp: 0
... """)

Hero(name='Welthyr Syxgon', hp=1200, sp=0)

Dumping YAML

将Python对象序列化

>>> print yaml.dump({'name': 'Silenthand Olleander', 'race': 'Human',
... 'traits': ['ONE_HAND', 'ONE_EYE']})

name: Silenthand Olleander
race: Human
traits: [ONE_HAND, ONE_EYE]

转化数据流

>>> stream = file('document.yaml', 'w')
>>> yaml.dump(data, stream)    # Write a YAML representation of data to 'document.yaml'.
>>> print yaml.dump(data)      # Output the document to the screen.

深层序列化

>>> print yaml.dump([1,2,3], explicit_start=True)
--- [1, 2, 3]

>>> print yaml.dump_all([1,2,3], explicit_start=True)
--- 1
--- 2
--- 3

类的序列化

>>> class Hero:
...     def __init__(self, name, hp, sp):
...         self.name = name
...         self.hp = hp
...         self.sp = sp
...     def __repr__(self):
...         return "%s(name=%r, hp=%r, sp=%r)" % (
...             self.__class__.__name__, self.name, self.hp, self.sp)

>>> print yaml.dump(Hero("Galain Ysseleg", hp=-3, sp=2))

!!python/object:__main__.Hero {hp: -3, name: Galain Ysseleg, sp: 2}

序列化字符串的格式和数据类型

>>> print yaml.dump(range(50))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
  23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
  43, 44, 45, 46, 47, 48, 49]

>>> print yaml.dump(range(50), width=50, indent=4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
    28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
    40, 41, 42, 43, 44, 45, 46, 47, 48, 49]

>>> print yaml.dump(range(5), canonical=True)
---
!!seq [
  !!int "0",
  !!int "1",
  !!int "2",
  !!int "3",
  !!int "4",
]

>>> print yaml.dump(range(5), default_flow_style=False)
- 0
- 1
- 2
- 3
- 4

>>> print yaml.dump(range(5), default_flow_style=True, default_style='"')
[!!int "0", !!int "1", !!int "2", !!int "3", !!int "4"]

参考链接

以上是关于Python ETL yaml的主要内容,如果未能解决你的问题,请参考以下文章

在 AWS 中运行 Python ETL 代码的最佳选项

python ETL工具 pyetl

python ETL工具 pyetl

Python/Pyspark 迭代代码(用于 AWS Glue ETL 作业)

Y服务-你真的懂 Yaml 吗

无法使用 yaml-cpp 发出空值