Elasticsearch: How to ignore malformed data at ingest time

Posted 2021-10-22, China Community Official Blog


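Sometimes you don't have much control over the data you receive. One user may send a login field that is a date, while another sends a login field that is an email address.

By default, trying to index a value of the wrong data type into a field throws an exception and rejects the whole document. Setting the ignore_malformed parameter to true allows the exception to be ignored: the malformed field is not indexed, but the other fields in the document are processed normally.

Let's walk through an example. In Kibana, enter the following: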

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer",
        "ignore_malformed": true
      },
      "number_two": {
        "type": "integer"
      }
    }
  }
}

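The command above creates an index with two fields:

number_one: an integer field with ignore_malformed set to true. At ingest time, the document is accepted even if the value of this field is not an integer, although the malformed value is not indexed, so this field cannot be searched for that document.
number_two: also an integer field. If the ingested value is not an integer, the whole document is rejected, because ignore_malformed is not set to true for this field.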
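
The per-field setting can also be applied as an index-wide default. The following is a minimal sketch, assuming a separate index whose name, my-index-000002, is made up for illustration. The index.mapping.ignore_malformed setting turns the behaviour on for every field that supports it, and an individual field can still opt out with ignore_malformed set to false.

```console
# Hypothetical index name; the setting applies to all fields that support ignore_malformed
PUT my-index-000002
{
  "settings": {
    "index.mapping.ignore_malformed": true
  },
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer"
      },
      "number_two": {
        "type": "integer",
        "ignore_malformed": false
      }
    }
  }
}
```

With this mapping, number_one inherits the index-level default, while number_two keeps the strict behaviour and still causes the document to be rejected when its value is malformed.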

We use the following two documents to test this:

PUT my-index-000001/_doc/1
{
  "text": "Some text value",
  "number_one": "foo"
}

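Although number_one is mapped as an integer, we supplied "foo" as its value. Because ignore_malformed is true for this field, the document as a whole is still indexed successfully. We can retrieve it with the following command: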

GET my-index-000001/_doc/1

The command above returns:

{
  "_index" : "my-index-000001",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "text" : "Some text value",
    "number_one" : "foo"
  }
}
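
Because the malformed value is kept in _source but is not indexed, it can be useful to find the documents that were affected. The following is a minimal sketch that relies on the _ignored metadata field, in which Elasticsearch records the names of fields that were ignored at index time. The first request matches any document with at least one ignored field, and the second matches only documents where number_one was ignored.

```console
# Any document in which at least one field was ignored
GET my-index-000001/_search
{
  "query": {
    "exists": {
      "field": "_ignored"
    }
  }
}

# Only documents in which number_one was ignored
GET my-index-000001/_search
{
  "query": {
    "term": {
      "_ignored": "number_one"
    }
  }
}
```

With only document 1 in the index, both requests would be expected to return that document.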

Next, we index another document with the following command:

PUT my-index-000001/_doc/2
{
  "text": "Some text value",
  "number_two": "foo"
}

Here number_two does not have ignore_malformed set to true, yet we gave it a string rather than an integer. The request fails with the following error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'",
    "caused_by" : {
      "type" : "number_format_exception",
      "reason" : "For input string: \"foo\""
    }
  },
  "status" : 400
}

The error shows that the whole document was rejected.
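
When many documents are imported in one request, the same rules apply to each document individually. The following is a sketch using the _bulk API, with document IDs 3 and 4 made up for illustration. The first document should be accepted because its bad value is in number_one, while the second should be rejected with the same mapper_parsing_exception. The bulk response reports "errors": true together with a separate status for each item, so one rejected document does not cause the other to fail.

```console
# Hypothetical IDs 3 and 4; each action line is followed by its document
POST my-index-000001/_bulk
{ "index": { "_id": "3" } }
{ "text": "Some text value", "number_one": "foo" }
{ "index": { "_id": "4" } }
{ "text": "Some text value", "number_two": "foo" }
```

Checking the per-item results in the response is a simple way to see which documents in an import were dropped.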
