Elasticsearch: Strip HTML tags before indexing docs with html_strip filter not working

Posted 2023-02-27

【Title】: Elasticsearch : Strip HTML tags before indexing docs with html_strip filter not working 【Posted】: 2016-09-18 00:38:18 【Question】:

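Given I have specified the html_strip char filter in my custom analyzer

When I index a document with HTML content

Then I expect the HTML to be stripped out of the indexed content

And the document returned on retrieval from the index should not contain HTML

ACTUAL: The indexed document contains HTML. The retrieved document contains HTML.

I have tried specifying the analyzer as index_analyzer, as one would expect, and out of desperation a few others: search_analyzer and analyzer. None of them seems to have any effect on the document being indexed or retrieved.

Test document indexed against the html_strip analyzed fields:

REQUEST: Example POST of a document with HTML content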

POST /html_poc_v2/html_poc_type/02
{
  "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
}

EXPECTED: the indexed data has been parsed through the HTML analyzer. ACTUAL: the data is indexed with the HTML intact.


{
   "_index": "html_poc_v2",   "_type": "html_poc_type",   "_id": "02", ...
   "_source": {
      "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
   }
}

Settings and document mapping

PUT /html_poc_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    },
    "mappings": {
      "html_poc_type": {
        "properties": {
          "body": {
            "type": "string",
            "analyzer": "my_html_analyzer"
          },
          "description": {
            "type": "string",
            "analyzer": "my_html_analyzer"
          },
          "title": {
            "type": "string",
            "search_analyser": "my_html_analyzer"
          },
          "urlTitle": {
            "type": "string"
          }
        }
      }
    }
  }
}

Test to prove that the custom analyzer works perfectly:

REQUEST

GET /html_poc_v2/_analyze?analyzer=my_html_analyzer
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>

RESPONSE


{
   "tokens": [
      {
         "token": "Some",… "position": 1
      },
      {
         "token": "déjà",… "position": 2
      },
      {
         "token": "vu",…  "position": 3
      },
      {
         "token": "website",… "position": 4
      }
   ]
}

Under the hood

Further proof, using an inline script, that my HTML analyzer must be getting skipped

REQUEST

GET /html_poc_v2/html_poc_type/_search?pretty=true
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
        "script": "doc[field].values",
        "params": {
            "field": "title"
        }
    }
  }
}
{ …
   "hits": { ..
      "hits": [
         {
            "_index": "html_poc_v2",
            "_type": "html_poc_type",
            …
            "fields": {
               "terms": [
                  [
                     "a",
                     "agrave",
                     "d",
                     "eacute",
                     "href",
                     "http",
                     "j",
                     "p",
                     "some",
                     "somedomain.com",
                     "title",
                     "vu",
                     "website"
                  ]
               ]
            }
         }
      ]
   }
}

This is similar to the question here: Why HTML tag is searchable even if it was filtered in elastic search

I have also read this amazing doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

ES version: 1.7.2

Please help.

【Comments on the question】:

【Answer 1】:

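You are confusing the "_source" field in the response with what is analyzed and indexed. It looks like you expect the _source field in the response to return the analyzed document. That is incorrect.

From the documentation:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.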
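The distinction in that quote can be sketched with a toy model. This is not how Elasticsearch is implemented; `analyze` and `ToyIndex` are hypothetical names, and the regex-based tag removal is only a rough stand-in for the html_strip char filter followed by a word tokenizer. The point is that analysis feeds the searchable terms, while the stored source is returned untouched.

```python
import html
import re


def analyze(text):
    """Rough stand-in for html_strip plus a word tokenizer (assumption, not ES code)."""
    without_tags = re.sub(r"<[^>]*>", " ", text)  # drop anything that looks like a tag
    decoded = html.unescape(without_tags)         # decode entities such as &eacute;
    return re.findall(r"\w+", decoded)            # split into word tokens


class ToyIndex:
    """Keeps the original source and the analyzed terms in separate places."""

    def __init__(self):
        self.sources = {}  # doc id -> original text, returned as-is on retrieval
        self.terms = {}    # doc id -> analyzed tokens, used only for searching

    def index(self, doc_id, text):
        self.sources[doc_id] = text
        self.terms[doc_id] = analyze(text)

    def get(self, doc_id):
        return self.sources[doc_id]

    def search(self, term):
        return [doc_id for doc_id, tokens in self.terms.items() if term in tokens]


idx = ToyIndex()
raw = "<p>Some d&eacute;j&agrave; vu <a href='http://somedomain.com'>website</a>"
idx.index("1", raw)
print(idx.search("website"))  # the analyzed text is searchable
print(idx.search("href"))     # tag content never reached the terms
print(idx.get("1"))           # the stored source still contains the HTML
```

In this model, as in the quoted description, changing the analyzer changes what `search` can match but never what `get` returns.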

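Ideally, in a case like the one above, where you want to format the source data for presentation, the formatting should be done on the client side.

That said, one way to achieve it for the above use case is to use script fields together with the keyword tokenizer, as follows: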

PUT test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_html_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": [
                  "html_strip"
               ]
            },
            "parsed_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "char_filter": [
                  "html_strip"
               ]
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "body": {
               "type": "string",
               "analyzer": "my_html_analyzer",
               "fields": {
                  "parsed": {
                     "type": "string",
                     "analyzer": "parsed_analyzer"
                  }
               }
            }
         }
      }
   }
}



PUT test/test/1
{
    "body" : "Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}


GET test/_search
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
        "script": "doc[field].values",
        "params": {
            "field": "body.parsed"
        }
    }
  }
}

Result:


{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_score": 1,
   "fields": {
        "terms": [
            "Title \n Some déjà vu  website   this is inline \n "
           ]
   }
}

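Note that I think the above is a bad idea: stripping the HTML tags is easy to do on the client side, and you have much more control over formatting there than you do when relying on a workaround like this one. More importantly, doing it on the client side may also perform better.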
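As a rough illustration of the client-side route, here is a minimal sketch using only Python's standard library `html.parser`. The names `strip_html` and `clean_source` are hypothetical helpers introduced for this example, and collapsing whitespace is an assumption about how the text should be presented, not something the html_strip char filter does.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flushes any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


def clean_source(source, fields):
    """Strip HTML from the named string fields of a document, leaving others alone."""
    return {
        key: strip_html(value) if key in fields and isinstance(value, str) else value
        for key, value in source.items()
    }


body = ("Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> "
        "website </a> <span> this is inline </span></p> ")
print(strip_html(body))  # Title Some déjà vu website this is inline
```

`clean_source` could be applied to the `_source` of each hit before display. The same helper could equally be run on a document before it is sent for indexing, in which case the stored `_source` would already be free of markup.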

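【Comments】:

It is also the _source field that I want to keep free of HTML markup. In other words, as well as not indexing the HTML tags, I also don't want HTML returned on GET/Search. Any ideas or suggestions on how to achieve this? Much appreciated, keety.

Good suggestion. Although it would work, you are right that it is not a good idea. It would also require scripting to be enabled on my Live cluster. Nor would it be compatible with the custom elasticsearch endpoint _plugins we have, which are tied to pulling fields such as title, body and description from the _source of the document.

So does that mean there is no way to achieve what you wanted without compromising security?

I actually found the solution - I'll post my answer below! :)

@mel They removed my post because it was a duplicate. See my answer on this post: ***.com/questions/37386354/…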
