Singer Study Notes 13: Discovery Mode

Posted 2021-02-02 rongfengliang


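Discovery mode

Discovery mode provides a way to describe the data streams a tap supports. It uses JSON Schema to describe the structure and types of the data in each stream. How discovery mode is implemented depends on the tap's data source: some taps hard-code the schema for each stream, while others connect to an API that provides descriptions of the available streams. When run in discovery mode, a tap should write a list of streams, called the catalog, to stdout. Each entry contains some basic information about the stream and a JSON schema describing it.
To run a tap in discovery mode, use the --discover flag: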

 

tap --config CONFIG --discover

The output can be redirected to a file when the tap runs:

tap --config CONFIG --discover > catalog.json
 

Some legacy taps use properties.json as the catalog.
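
As a rough illustration of the hard-coded variant described above, the following sketch shows a tap entry point that writes a catalog to stdout when --discover is passed. The STREAMS table, build_catalog, and main are hypothetical names invented for this example, and the --config value is accepted but never read here; a real tap would load it and might query its source for the stream list instead.

```python
import argparse
import json
import sys

# Hypothetical hard-coded stream definitions (an assumption for this sketch).
STREAMS = {
    "users": {
        "type": ["null", "object"],
        "properties": {
            "id": {"type": ["null", "string"]},
            "name": {"type": ["null", "string"]},
        },
    }
}


def build_catalog(streams):
    """Return a catalog dict with one entry per stream."""
    return {
        "streams": [
            {"tap_stream_id": name, "stream": name, "schema": schema}
            for name, schema in streams.items()
        ]
    }


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--discover", action="store_true")
    args = parser.parse_args(argv)

    if args.discover:
        # Discovery mode: write the catalog to stdout and stop.
        json.dump(build_catalog(STREAMS), sys.stdout, indent=2)
        sys.stdout.write("\n")
    # Sync mode is out of scope for this sketch.
    return 0
```

A script entry point would call main(sys.argv[1:]), so that `tap --config CONFIG --discover > catalog.json` captures the JSON shown in the catalog section.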

schema

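JSON is used to represent data because it is ubiquitous, readable, and especially well suited to the large number of sources that expose data as JSON, such as web APIs. However, JSON is far from perfect:

It has a limited type system, with no support for common types such as dates and no distinction between integers and floating-point numbers
While its flexibility makes it easy to use, it can also lead to compatibility problems

Schemas are used to solve these problems. Generally speaking, a schema is any way of describing the structure of data. Schemas are written by taps in SCHEMA messages, and the format follows the JSON Schema specification.
Schemas solve the limited data types problem by providing more information about how to interpret JSON's basic types. For example, the JSON Schema specification distinguishes between the integer and number types, where the latter is appropriately interpreted as floating point. It also defines a string format called date-time, which can be used to indicate that a data point should be a properly formatted timestamp string.
Schemas mitigate JSON's compatibility problem by providing an easy way to validate the structure of a set of data points. Taps apply this concept by encouraging the use of only a single schema for each stream and by validating each data point against its schema before persisting it. This forces the tap author to think about how to resolve schema evolution and compatibility questions, placing that responsibility as close to the original data source as possible and freeing downstream systems from having to guess how to resolve them.
Schemas are required, but they may be defined in the broadest terms: a JSON schema of "{}" validates all data points. However, tap authors are encouraged to define schemas as narrowly as possible.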
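
To make the integer/number distinction, the date-time format, and the "{}" case concrete, here is a deliberately simplified checker written with the standard library only. It is a sketch and not a JSON Schema validator: matches_type, is_date_time, and validate_record are hypothetical helpers, only a handful of keywords are handled, and the timestamp check relies on datetime.fromisoformat, which accepts a narrower set of strings than the date-time format allows.

```python
from datetime import datetime


def matches_type(value, json_type):
    """Check a value against one JSON Schema type name (simplified)."""
    if json_type == "null":
        return value is None
    if json_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if json_type == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if json_type == "string":
        return isinstance(value, str)
    if json_type == "object":
        return isinstance(value, dict)
    return False


def is_date_time(value):
    """Rough timestamp check, e.g. 2021-02-02T10:00:00+00:00."""
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def validate_record(record, schema):
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    for name, value in record.items():
        prop = schema.get("properties", {}).get(name)
        if prop is None:
            if schema.get("additionalProperties") is False:
                errors.append(f"{name}: unexpected property")
            continue
        types = prop.get("type", [])
        if isinstance(types, str):
            types = [types]
        if types and not any(matches_type(value, t) for t in types):
            errors.append(f"{name}: expected one of {types}")
        elif (prop.get("format") == "date-time" and isinstance(value, str)
              and not is_date_time(value)):
            errors.append(f"{name}: not a date-time string")
    return errors
```

With this sketch, a value of 1.5 is rejected for a property typed integer but accepted for one typed number, and validate_record(record, {}) returns an empty list for any record, which mirrors the point that "{}" validates everything.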

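Schemas in Stitch

The Stitch Target and the Stitch API use schemas as follows:

The Stitch Target fails when it encounters a data point that does not validate against the latest schema for its stream
Schemas must be an "object" at the top level
Stitch supports schemas with objects nested to any depth and arrays of objects nested to any depth; more information is available in the Stitch docs
Schemas that use the JSON Schema $ref feature to reference other schemas must be fully resolved and replaced before the SCHEMA message is constructed. The spec does not support a method of passing additional schemas for reference resolution
Properties of type string and format date-time are converted to the appropriate timestamp or datetime type in the destination database
Properties of type integer are converted to integer in the destination database
Properties of type number are converted to decimal or numeric in the destination database
(Soon) The maxLength parameter on properties of type string is used to define the width of the corresponding varchar column in the destination database
When Stitch encounters a schema for a stream that is incompatible with the table that stream is to be loaded into in the destination database, it adds the data to the reject pile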
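
The $ref requirement in the list above means a tap has to inline referenced schemas itself before emitting a SCHEMA message. The sketch below shows one way that might be done, under several assumptions: references are local and of the form "#/definitions/name", the definitions live under a top-level "definitions" key, keys that sit next to "$ref" are ignored, and there are no circular references (which would recurse forever). resolve_refs and resolve_schema are hypothetical helper names.

```python
def resolve_refs(node, root):
    """Return a copy of node with every local "$ref" replaced by its target."""
    if isinstance(node, dict):
        if "$ref" in node:
            target = root
            # Walk the path "#/definitions/name" one segment at a time.
            for part in node["$ref"].lstrip("#/").split("/"):
                target = target[part]
            return resolve_refs(target, root)
        return {key: resolve_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, root) for item in node]
    return node


def resolve_schema(schema):
    """Inline all references, then drop the now-unused top-level definitions."""
    resolved = resolve_refs(schema, schema)
    resolved.pop("definitions", None)
    return resolved
```

The input schema is left unmodified; the returned copy contains no "$ref" keys and can be placed directly in a SCHEMA message.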
Example:

{
  "type": [
    "null",
    "object"
  ],
  "additionalProperties": false,
  "properties": {
    "id": {
      "type": [
        "null",
        "string"
      ]
    },
    "name": {
      "type": [
        "null",
        "string"
      ]
    },
    "date_modified": {
      "type": [
        "null",
        "string"
      ],
      "format": "date-time"
    }
  }
}
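
The type conversion rules listed for Stitch can be applied to a schema like the one above with a small lookup. This is only an illustration of those rules, not Stitch's actual loader: column_type and columns_for are hypothetical names, the returned strings are generic labels rather than real column types of any particular database, and mapping a plain string to "varchar" is an assumption based on the maxLength note.

```python
def column_type(prop):
    """Pick a destination column label for one schema property."""
    types = prop.get("type", [])
    if isinstance(types, str):
        types = [types]
    if "string" in types and prop.get("format") == "date-time":
        return "timestamp"
    if "integer" in types:
        return "integer"
    if "number" in types:
        return "decimal"
    if "string" in types:
        return "varchar"  # assumption: plain strings land in a varchar column
    return "unknown"


def columns_for(schema):
    """Map every top-level property name to its column label."""
    properties = schema.get("properties", {})
    return {name: column_type(prop) for name, prop in properties.items()}
```

For the example schema, columns_for returns {"id": "varchar", "name": "varchar", "date_modified": "timestamp"}.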

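Catalog

The output of discovery mode should be a list of the data streams a tap supports. This JSON-formatted list is called the catalog. The top level is an object with a single key called "streams" that points to an array of objects, each having the following fields:

Property	Type	Required?	Description
`tap_stream_id`	string	required	The unique identifier for the stream. This is allowed to differ from the name of the stream in order to allow for sources that have duplicate stream names.
`schema`	object	required	The JSON schema for the stream.
`table_name`	string	optional	For a database source, the name of the table.
`metadata`	array of metadata	optional	See metadata below for an explanation.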
Example:

{
  "streams": [
    {
      "tap_stream_id": "users",
      "stream": "users",
      "schema": {
        "type": ["null", "object"],
        "additionalProperties": false,
        "properties": {
          "id": {
            "type": [
              "null",
              "string"
            ]
          },
          "name": {
            "type": [
              "null",
              "string"
            ]
          },
          "date_modified": {
            "type": [
              "null",
              "string"
            ],
            "format": "date-time"
          }
        }
      }
    }
  ]
}
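
The required fields of a catalog entry can be checked with a few lines of code before the catalog is handed to anything else. The following is a minimal sketch under the assumption that the catalog has already been parsed into a dict; check_catalog is a hypothetical helper, and it only covers the two required fields plus uniqueness of tap_stream_id.

```python
def check_catalog(catalog):
    """Return a list of problems found in a parsed catalog dict."""
    streams = catalog.get("streams")
    if not isinstance(streams, list):
        return ['top-level "streams" must be an array']

    problems = []
    seen = set()
    for index, entry in enumerate(streams):
        stream_id = entry.get("tap_stream_id")
        if not isinstance(stream_id, str):
            problems.append(f"streams[{index}]: tap_stream_id is required")
        elif stream_id in seen:
            problems.append(f"streams[{index}]: duplicate tap_stream_id")
        else:
            seen.add(stream_id)
        # The schema must be present and must be a JSON object.
        if not isinstance(entry.get("schema"), dict):
            problems.append(f"streams[{index}]: schema is required")
    return problems
```

An empty list means the catalog passed these checks; it says nothing about whether each schema is itself a valid JSON schema.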

metadata

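Metadata is the preferred mechanism for associating extra information with nodes in the schema.
Certain metadata should be written and read by a tap. This metadata is known as discoverable metadata. Other metadata is written by other systems, such as a UI, and should therefore only be read by the tap. This type of metadata is called non-discoverable metadata.
The supported keywords are: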

Keyword	Tap Type	Discoverable?	Description
`selected`	any	non-discoverable	Either `true` or `false`. Indicates that this node in the schema has been selected by the user for replication.
`replication-method`	any	non-discoverable	Either `FULL_TABLE`, `INCREMENTAL`, or `LOG_BASED`. The replication method to use for a stream.
`replication-key`	any	non-discoverable	The name of a property in the source to use as a "bookmark". For example, this will often be an "updated-at" field or an auto-incrementing primary key (requires `replication-method`).
`view-key-properties`	database	non-discoverable	List of key properties for a database view.
`inclusion`	any	discoverable	Either `available`, `automatic`, or `unsupported`. `available` means the field is available for selection, and the tap will only emit values for that field if it is marked with `"selected": true`. `automatic` means that the tap will emit values for the field. `unsupported` means that the field exists in the source data but the tap is unable to provide it.
`selected-by-default`	any	discoverable	Either `true` or `false`. Indicates if a node in the schema should be replicated if a user has not expressed any opinion on whether or not to replicate it.
`valid-replication-keys`	any	discoverable	List of the fields that could be used as replication keys.
`schema-name`	any	discoverable	The name of the stream.
`forced-replication-method`	any	discoverable	Used to force the replication method to either `FULL_TABLE` or `INCREMENTAL`.
`table-key-properties`	database	discoverable	List of key properties for a database table.
`is-view`	database	discoverable	Either `true` or `false`. Indicates whether a stream corresponds to a database view.
`row-count`	database	discoverable	Number of rows in a database table/view.
`database-name`	database	discoverable	Name of database.
`sql-datatype`	database	discoverable	Represents the datatype of a database column.

Example of the metadata format:

{
  "metadata" : {
    "selected" : true,
    "some-other-metadata" : "whatever"
  },
  "breadcrumb" : ["properties", "some-field-name"]
}

The breadcrumb above defines the path into the schema to the node that the metadata belongs to. Metadata for a stream has an empty breadcrumb.
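
Because a breadcrumb is just a list of keys, following it into the schema is a short loop. The sketch below assumes the schema has been parsed into nested dicts; node_at and dangling_breadcrumbs are hypothetical helpers, the second of which reports metadata entries whose breadcrumb does not lead to any node.

```python
def node_at(schema, breadcrumb):
    """Follow a breadcrumb into a schema and return the node it points to."""
    node = schema
    for key in breadcrumb:
        node = node[key]
    return node


def dangling_breadcrumbs(schema, metadata_entries):
    """Return the breadcrumbs that do not resolve to a node in the schema."""
    missing = []
    for entry in metadata_entries:
        try:
            node_at(schema, entry["breadcrumb"])
        except (KeyError, TypeError):
            missing.append(entry["breadcrumb"])
    return missing
```

An empty breadcrumb returns the schema itself, which matches the rule that stream-level metadata carries an empty breadcrumb.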
Complete example:

{
  "streams": [
    {
      "tap_stream_id": "users",
      "stream": "users",
      "schema": {
        "type": ["null", "object"],
        "additionalProperties": false,
        "properties": {
          "id": {
            "type": [
              "null",
              "string"
            ]
          },
          "name": {
            "type": [
              "null",
              "string"
            ]
          },
          "date_modified": {
            "type": [
              "null",
              "string"
            ],
            "format": "date-time"
          }
        }
      },
      "metadata": [
        {
          "metadata": {
            "inclusion": "available",
            "table-key-properties": ["id"],
            "selected-by-default": true,
            "valid-replication-keys": ["date_modified"],
            "schema-name": "users"
          },
          "breadcrumb": []
        },
        {
          "metadata": {
            "inclusion": "automatic"
          },
          "breadcrumb": ["properties", "id"]
        },
        {
          "metadata": {
            "inclusion": "available",
            "selected-by-default": true
          },
          "breadcrumb": ["properties", "name"]
        },
        {
          "metadata": {
            "inclusion": "automatic"
          },
          "breadcrumb": ["properties", "date_modified"]
        }
      ]
    }
  ]
}
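
Putting the `inclusion`, `selected`, and `selected-by-default` keywords together, a tap could decide which fields to emit roughly as follows. This is a sketch of one reading of the keyword descriptions in the table, not a reference implementation: to_map, is_selected, and selected_fields are hypothetical helpers, and treating a field that has no metadata at all as unselected is an assumption.

```python
def to_map(metadata_entries):
    """Index a metadata array by breadcrumb so lookups are direct."""
    return {tuple(e["breadcrumb"]): e["metadata"] for e in metadata_entries}


def is_selected(field_metadata):
    """Decide whether a field should be emitted, based on its metadata."""
    inclusion = field_metadata.get("inclusion")
    if inclusion == "unsupported":
        return False  # the tap cannot provide this field
    if inclusion == "automatic":
        return True  # always emitted, regardless of user selection
    if "selected" in field_metadata:
        return bool(field_metadata["selected"])  # explicit user choice
    # No user opinion: fall back to the discoverable default.
    return bool(field_metadata.get("selected-by-default", False))


def selected_fields(stream):
    """Return the property names of a catalog entry that should be emitted."""
    by_breadcrumb = to_map(stream.get("metadata", []))
    properties = stream["schema"].get("properties", {})
    return [
        name
        for name in properties
        if is_selected(by_breadcrumb.get(("properties", name), {}))
    ]
```

Applied to the "users" entry in the complete example, selected_fields returns ["id", "name", "date_modified"]: the two automatic fields are always included, and name is included through selected-by-default. Adding "selected": false to the metadata for name would drop it from the result.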