Elasticsearch: How to deploy NLP: text embeddings and vector search

Posted by the Elastic China Community Official Blog


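As part of our natural language processing (NLP) blog series, we walk through an example that uses a text embedding model to generate vector representations of text content and demonstrates vector similarity search over the generated vectors. We deploy a publicly available model on Elasticsearch and use it in an ingest pipeline to generate embeddings from text documents. We then show how to use these embeddings in a vector similarity search to find documents that are semantically similar to a given query.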

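Vector similarity search, often called semantic search, goes beyond traditional keyword-based search: it lets users find semantically similar documents that may share no keywords with the query, and so returns a broader range of results. Vector similarity search operates on dense vectors and uses k-nearest neighbour (kNN) search to find similar vectors. For this, the text content first has to be converted into a numeric vector representation using a text embedding model.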
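To make the idea of k-nearest neighbour search over dense vectors concrete, here is a minimal sketch in plain Python. It performs an exact, brute-force search with cosine similarity, which is not how Elasticsearch does it internally (Elasticsearch uses an approximate index). The three-dimensional vectors and document names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Exact kNN: score every stored vector, then keep the k best."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
toy_vectors = {
    "jamaica-climate": [0.9, 0.1, 0.0],
    "hydrogen-boiling-point": [0.0, 0.2, 0.9],
    "caribbean-weather": [0.8, 0.3, 0.1],
}

top2 = knn([1.0, 0.2, 0.0], toy_vectors, k=2)
for doc_id, score in top2:
    print(doc_id, round(score, 3))
```

A real embedding model produces hundreds of dimensions, and scoring every document per query gets expensive at scale, which is why an approximate nearest neighbour index is used in practice.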

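We use a public dataset from the MS MARCO Passage Ranking Task for the demonstration. It consists of real questions from the Microsoft Bing search engine and human-generated answers. The dataset is well suited to testing vector similarity search: first, because question answering is one of the most common use cases for vector search, and second, because the top papers on the MS MARCO leaderboard use vector search in some form.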

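In our example, we take a sample of this dataset, use a model to generate text embeddings, and then run vector search on them. We also want to do a quick verification of the quality of the results that the vector search produces. In this walkthrough I use Elastic Stack 8.2.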

Installation

Elasticsearch and Kibana

If you have not yet installed Elasticsearch and Kibana, refer to the following articles to install them:

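Note the 8.x installation sections of those articles. Because uploading models with Eland is a Platinum or Enterprise feature, we need to start the Platinum trial for this demo: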

 

Eland

Eland can be installed from PyPI with pip:

python -m pip install eland

Eland can also be installed from Conda Forge with Conda:

conda install -c conda-forge eland

Users who want to use Eland without installing it, just to run the available scripts, can build the Docker container:

git clone https://github.com/elastic/eland
cd eland
docker build -t elastic/eland .

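Eland wraps the conversion of a Hugging Face transformer model to its TorchScript representation, together with the chunking process, in a single Python method; it is therefore the recommended way to import models.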

  1. Install the Eland Python client.
  2. Run the eland_import_hub_model script. For example:
eland_import_hub_model --url <clusterUrl> \
--hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
--task-type ner
  • Specify the URL to access your cluster. For example, https://<user>:<password>@<hostname>:<port>.
  • Specify the identifier of the model in the Hugging Face model hub.
  • Specify the type of NLP task. Supported values are fill_mask, ner, text_classification, text_embedding, and zero_shot_classification.
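The three options above can be assembled and checked programmatically before the script is invoked. The following is a small sketch: the helper name build_import_command and the example cluster URL are hypothetical, while the flags and the list of task types are the ones given above. It only builds the argument list and does not run Eland.

```python
import shlex

# Task types listed above as supported values for --task-type.
SUPPORTED_TASK_TYPES = {
    "fill_mask",
    "ner",
    "text_classification",
    "text_embedding",
    "zero_shot_classification",
}


def build_import_command(cluster_url, hub_model_id, task_type):
    """Return the eland_import_hub_model argument list (hypothetical helper)."""
    if task_type not in SUPPORTED_TASK_TYPES:
        raise ValueError(f"unsupported task type: {task_type!r}")
    return [
        "eland_import_hub_model",
        "--url", cluster_url,
        "--hub-model-id", hub_model_id,
        "--task-type", task_type,
    ]


command = build_import_command(
    "https://user:password@localhost:9200",  # placeholder URL, replace with your own
    "elastic/distilbert-base-cased-finetuned-conll03-english",
    "ner",
)
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run once Eland is installed; validating the task type first gives a clearer error than a failed import.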
     

Deploying the text embedding model

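The first step is to install a text embedding model. For our model we use msmarco-distilbert-base-tas-b from Hugging Face. It is a sentence-transformer model that maps a sentence or a paragraph to a 768-dimensional dense vector. The model is optimized for semantic search and was trained specifically on the MS MARCO Passage dataset, which makes it a good fit for our task. Besides this model, Elasticsearch supports many other text embedding models. The full list can be found here.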

We install the model using the Eland Docker client that we built in the NER example. Running the script below imports the model into our local cluster and deploys it:

docker run -it --rm elastic/eland \
    eland_import_hub_model \
        --url https://elastic:lOwgBZT3KowJrQWMwRWm@192.168.0.3:9200/ \
        --hub-model-id sentence-transformers/msmarco-distilbert-base-tas-b \
        --task-type text_embedding \
        --insecure \
        --start

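In the command above, replace the username and password with your own, and change the Elasticsearch address to match your setup. Because this cluster uses a self-signed certificate, I pass the --insecure option to skip the SSL certificate check. Here --task-type is set to text_embedding, and the --start option is passed to the Eland script so that the model is deployed automatically, without having to start it in the Model Management UI. To speed up inference, you can increase the number of inference threads with the inference_threads parameter.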

From the output above, we can see that the model was uploaded successfully.

We can test that the model was deployed successfully by running this example in the Kibana console:

POST /_ml/trained_models/sentence-transformers__msmarco-distilbert-base-tas-b/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}

We should see the predicted dense vector as the result:

After the steps above, we can view the imported model in Kibana:

Loading the initial data

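As mentioned in the introduction, we use the MS MARCO Passage Ranking dataset. The dataset is very large, with more than 8 million passages. For our example we use a subset of it that was used in the testing phase of the 2019 TREC Deep Learning Track. The dataset for the re-ranking task, msmarco-passagetest2019-top1000.tsv, contains 200 queries and, for each query, a list of relevant text passages retrieved by a simple IR system. From that dataset we extracted all unique passages with their ids and put them into a separate tsv file, 182469 passages in total. We use this file as our dataset.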
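The extraction step described above can be sketched as follows. This is not the script that was actually used; it assumes the re-ranking file has four tab-separated columns in the order query id, passage id, query text, passage text, and it runs on a tiny in-memory sample with made-up rows rather than on the real file.

```python
import csv
import io


def unique_passages(tsv_file):
    """Return {passage_id: passage_text}, keeping the first text seen per id."""
    passages = {}
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for _qid, pid, _query, passage in reader:  # assumed column order
        passages.setdefault(int(pid), passage)
    return passages


def write_collection(passages, out_file):
    """Write one 'id<TAB>text' line per unique passage."""
    for pid, text in sorted(passages.items()):
        out_file.write(f"{pid}\t{text}\n")


# Made-up sample: passage 10 is listed under two different queries.
sample = io.StringIO(
    "1\t10\tquery one\tpassage A\n"
    "1\t11\tquery one\tpassage B\n"
    "2\t10\tquery two\tpassage A\n"
)

passages = unique_passages(sample)
output = io.StringIO()
write_collection(passages, output)
print(output.getvalue())
```

With the real file, the StringIO objects would be replaced by files opened with open(path, newline="", encoding="utf-8").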

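We use Kibana's file upload feature to upload this dataset. Kibana file upload lets us provide custom names for the fields: we call them id, of type long, for the passage id, and text, of type text, for the passage content. The index name is collection. After the upload, we can see an index named collection that contains 182469 documents.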

 

From the above, we can see that 182469 documents were ingested.

Creating the pipeline

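We want to process the initial data with an inference processor that adds an embedding to each passage. To do that, we create a text embedding ingest pipeline and then reindex our initial data through that pipeline.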

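In the Kibana console, we create an ingest pipeline for text embedding and call it text-embeddings. The passages are in a field named text. As we did before, we define a field_map to map text to the field text_field that the model expects. Likewise, the on_failure handler is set to index failures into a different index: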

PUT _ingest/pipeline/text-embeddings
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "sentence-transformers__msmarco-distilbert-base-tas-b",
        "target_field": "text_embedding",
        "field_map": {
          "text": "text_field"
        }
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}
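To see what this pipeline does to a single document, here is a toy simulation in Python. It is not Elasticsearch code: fake_model is a hypothetical stand-in for the deployed model, and run_pipeline only mirrors the three ideas in the definition above, namely the field_map, the target_field, and the on_failure handlers that reroute a failed document and record the error.

```python
def fake_model(text_field):
    """Hypothetical stand-in for the deployed model: returns a tiny vector."""
    return [float(len(text_field)), float(text_field.count(" "))]


def run_pipeline(doc, index, model=fake_model):
    """Mimic the pipeline for one document; returns (target_index, document)."""
    result = dict(doc)
    try:
        # field_map: the document's "text" is passed to the model as "text_field".
        model_input = {"text_field": result["text"]}
        # target_field: the output is stored under text_embedding.predicted_value.
        result["text_embedding"] = {"predicted_value": model(**model_input)}
        return index, result
    except Exception as error:
        # on_failure: reroute to failed-<index> and keep the error message.
        result["ingest"] = {"failure": str(error)}
        return "failed-" + index, result


ok_index, ok_doc = run_pipeline(
    {"id": 1, "text": "how is the weather"}, "collection-with-embeddings"
)
bad_index, bad_doc = run_pipeline({"id": 2}, "collection-with-embeddings")
print(ok_index, ok_doc["text_embedding"]["predicted_value"])
print(bad_index, bad_doc["ingest"]["failure"])
```

The second document has no text field, so it ends up in the failed- index with the error recorded instead of aborting the whole run.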

Reindex

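We want to reindex the documents from the collection index into a new collection-with-embeddings index by pushing them through the text-embeddings pipeline, so that the documents in collection-with-embeddings have an additional field for the passage embedding. Before we do that, we need to create the destination index and define its mapping, in particular for the field text_embedding.predicted_value, where the ingest processor stores the embeddings. If we do not, the embeddings are indexed into regular float fields and cannot be used for vector similarity search. The model we use produces 768-dimensional embeddings, so we use an indexed dense_vector field type with 768 dimensions, as follows: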

PUT collection-with-embeddings
{
  "mappings": {
    "properties": {
      "text_embedding.predicted_value": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "text"
      }
    }
  }
}

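Finally, we are ready to reindex. Because reindexing takes some time to process all the documents and run inference on them, we run it in the background by calling the API with the wait_for_completion=false flag: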

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "collection"
  },
  "dest": {
    "index": "collection-with-embeddings",
    "pipeline": "text-embeddings"
  }
}

The call above returns a task ID. We can monitor the progress of the task with:

GET _tasks/<task_id>

Alternatively, track the progress by watching the inference count increase in the model stats API or in the model stats UI.

When it reaches our earlier document count, 182469, the reindex is complete.
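If you poll the task from a script, the progress can be derived from the task status. The sketch below assumes the response carries task.status.total, task.status.created and task.status.updated, which is how reindex task status is commonly reported; check the response of your own cluster. The sample response is hand-written, with only the total taken from this article.

```python
def reindex_progress(task_response):
    """Fraction of documents processed so far, between 0.0 and 1.0."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    if not total:
        return 0.0
    # Assumption: created + updated approximates the documents written so far.
    done = status.get("created", 0) + status.get("updated", 0)
    return done / total


# Hand-written sample shaped like a GET _tasks/<task_id> response.
sample_response = {
    "completed": False,
    "task": {"status": {"total": 182469, "created": 60823, "updated": 0}},
}

progress = reindex_progress(sample_response)
print(f"{progress:.1%} of {sample_response['task']['status']['total']} documents")
```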

The reindexed documents now contain the inference results, the vector embeddings. For example, one of the documents looks like this:

{
    "id": 7130104,
    "text": "This is the definition of RNA along with examples of types of RNA molecules. This is the definition of RNA along with examples of types of RNA molecules. RNA Definition",
    "text_embedding":
    {
        "predicted_value":
        [
            0.057356324046850204,
            0.1602816879749298,
            -0.18122544884681702,
            0.022277727723121643,
            ....
        ],
        "model_id": "sentence-transformers__msmarco-distilbert-base-tas-b"
    }
}

Vector Similarity Search

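Currently we do not support implicitly generating embeddings from query terms during a search request, so our semantic search is organized as a two-step process: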

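  • Get the text embedding for the text query. For this, we use the model's _infer API.
  • Use vector search to find documents that are semantically similar to the query text. In Elasticsearch v8.0 we introduced a new _knn_search endpoint that allows efficient approximate nearest neighbour search on indexed dense_vector fields. We use the _knn_search API to find the nearest documents.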
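The two steps can be chained in a small client-side function. The sketch below is transport-agnostic: send is a hypothetical callable that you would implement with your HTTP client of choice, and fake_send only returns canned responses so that the example runs without a cluster. It assumes the _infer response exposes the vector as a top-level predicted_value, as in the 8.2 setup used here.

```python
MODEL_ID = "sentence-transformers__msmarco-distilbert-base-tas-b"
INDEX = "collection-with-embeddings"


def semantic_search(query_text, send, k=10, num_candidates=100):
    """Two-step search; send(method, path, body) must return parsed JSON."""
    # Step 1: turn the query text into an embedding with the _infer API.
    inferred = send(
        "POST",
        f"/_ml/trained_models/{MODEL_ID}/deployment/_infer",
        {"docs": {"text_field": query_text}},
    )
    query_vector = inferred["predicted_value"]  # assumed response shape

    # Step 2: look up the nearest stored embeddings with _knn_search.
    found = send(
        "GET",
        f"/{INDEX}/_knn_search",
        {
            "knn": {
                "field": "text_embedding.predicted_value",
                "query_vector": query_vector,
                "k": k,
                "num_candidates": num_candidates,
            },
            "_source": ["id", "text"],
        },
    )
    return found["hits"]["hits"]


def fake_send(method, path, body):
    """Canned responses standing in for a real cluster (illustration only)."""
    if path.endswith("/_infer"):
        return {"predicted_value": [0.1, 0.2, 0.3]}
    return {"hits": {"hits": [{"_score": 0.9, "_source": {"id": 1, "text": "demo"}}]}}


hits = semantic_search("how is the weather in jamaica", fake_send)
print(hits[0]["_source"]["text"])
```

Passing the transport in as a parameter keeps the two-step logic separate from authentication and TLS settings, which differ between clusters.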

For example, given the text query “how is the weather in jamaica”, we first run the _infer API to get its embedding as a dense vector:

POST /_ml/trained_models/sentence-transformers__msmarco-distilbert-base-tas-b/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}

The command above returns the following result:

The predicted_value above is a 768-dimensional vector. Next, we plug the resulting dense vector into _knn_search, as follows:

GET collection-with-embeddings/_knn_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector": [
    -0.09194609522819519,
    -0.49406030774116516,
    0.03598763048648834,
       …
    ],
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}

As a result, we get the 10 documents closest to the query, sorted by their proximity to the query:

"hits" : [
      {
        "_index" : "collection-with-embeddings",
        "_id" : "6H_OsH8Bi5IvRzQ7g-Aa",
        "_score" : 0.9527166,
        "_source" : {
          "id" : 6140,
          "text" : "Ocho Rios Jamaica Weather - Winter ( December, January And February) The winters in this town are usually colder when compared to other parts of the island. The average temperature for December, January and February are 81  °F and 79  °F respectively. All three months usually have a high temperature of 84  °F."
        }
      },
      {
        "_index" : "collection-with-embeddings",
        "_id" : "6n_OsH8Bi5IvRzQ7g-Aa",
        "_score" : 0.95225316,
        "_source" : {
          "id" : 6142,
          "text" : "Jamaica Weather and When to Go. Jamaica weather essentials. For more details on the current temperature, wind, and stuff like that you can check any search engine weather feature. The rainy months, also called the rainy season, are generally from the end of April, or early May, until the end of September or early October."
        }
      },
      {
        "_index" : "collection-with-embeddings",
        "_id" : "5n_OsH8Bi5IvRzQ7g-Aa",
        "_score" : 0.9394933,
        "_source" : {
          "id" : 6138,
          "text" : "Quick Answer. Hurricane season in Jamaica starts on June 1 and ends on Nov. 30. Satellite weather forecasts work to allow tourists and island dwellers adequate time to take precautions when hurricanes approach during those months. Continue Reading."
        }
      },
…
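The _score values above are not raw cosine similarities. For a dense_vector field with cosine similarity, the Elasticsearch documentation gives the kNN score as (1 + cosine) / 2, so that scores stay non-negative. Assuming that formula applies to the version you run, the underlying cosine value can be recovered like this:

```python
def cosine_from_score(score):
    """Invert _score = (1 + cosine) / 2 to recover the cosine similarity."""
    return 2.0 * score - 1.0


# The three _score values shown in the response above.
scores = [0.9527166, 0.95225316, 0.9394933]
cosines = [cosine_from_score(s) for s in scores]
for score, cosine in zip(scores, cosines):
    print(f"_score={score} -> cosine={cosine:.4f}")
```

The ordering is unchanged by this conversion, since the mapping from cosine to score is strictly increasing.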

 

Quick verification

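Because we use only a subset of the MS MARCO dataset, we cannot do a full evaluation. What we can do instead is a simple verification on a few queries, to see that we really do get relevant results and not random ones. From the TREC 2019 Deep Learning Track judgments for the Passage Ranking Task, we take the last 3 queries, submit them to our vector similarity search, get the top 10 results, and consult the TREC judgments to see how relevant the results are. For the Passage Ranking Task, passages are rated as irrelevant (0), related (the passage is on topic but does not answer the question) (1), highly relevant (2), and perfectly relevant (3).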

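Note that our verification is not a rigorous evaluation; it serves only our quick demo. Because we index only passages that are known to be related to the queries, it is a much easier task than the original passage retrieval task. In the future we intend to run a rigorous evaluation on the MS MARCO dataset.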

Query #1124210, “tracheids are part of _____”, submitted to our vector search returns the following results:

Passage id | Relevance rating | Passage
2258591 | 2 - highly relevant | Tracheid of oak shows pits along the walls. It is longer than a vessel element and has no perforation plates. Tracheids are elongated cells in the xylem of vascular plants that serve in the transport of water and mineral salts.Tracheids are one of two types of tracheary elements, vessel elements being the other. Tracheids, unlike vessel elements, do not have perforation plates.racheids provide most of the structural support in softwoods, where they are the major cell type. Because tracheids have a much higher surface to volume ratio compared to vessel elements, they serve to hold water against gravity (by adhesion) when transpiration is not occurring.
2258592 | 3 - perfectly relevant | Tracheid. a dead lignified plant cell that functions in water conduction. Tracheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae.Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.racheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae. Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.
2728448 | 2 - highly relevant | The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated. Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants.
7443586 | 2 - highly relevant | 1 The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated. Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants.
8026737 | 2 - highly relevant | Its major components include xylem parenchyma, xylem fibers, tracheids, and xylem vessels. Tracheids are one of the two types of tracheary elements of vascular plants. (The other being the vessel elements). A tracheid cell loses its protoplast at maturity. Thus, at maturity, it becomes one of the non-living components of the xylem.
2258595 | 2 - highly relevant | Summary: Vessels have perforations at the end plates while tracheids do not have end plates. Tracheids are derived from single individual cells while vessels are derived from a pile of cells. Tracheids are present in all vascular plants whereas vessels are confined to angiosperms.Tracheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements.Vessels are broader than tracheids with which they are associated.Morphology of the perforation plate is different from that in tracheids.racheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements. Vessels are broader than tracheids with which they are associated. Morphology of the perforation plate is different from that in tracheids.
181177 | 3 - perfectly relevant | Xylem tracheids are pointed, elongated xylem cells, the simplest of which have continuous primary cell walls and lignified secondary wall thickenings in the form of rings, hoops, or reticulate networks.
2258597 | 2 - highly relevant | Thank you... In plants xylem and phloem are the complex tissues which are the components parts of conductive system. In higher plants xylem contains tracheids, vessels (tracheae), xylem fibres(wood fibres) and xylem parenchyma (wood parenchyma).Tracheids These are elongated narrow tube like cells with hard thick and lignified walls with large cell cavity.hank you... In plants xylem and phloem are the complex tissues which are the components parts of conductive system. In higher plants xylem contains tracheids, vessels (tracheae), xylem fibres(wood fibres) and xylem parenchyma (wood parenchyma).
6541866 | 2 - highly relevant | In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes.n most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes.

Query #1129237, “hydrogen is a liquid below what temperature”, returns the following results:

Passage id | Relevance rating | Passage
128984 | 3 - perfectly relevant | Hydrogen gas has the molecular formula H 2. At room temperature and under standard pressure conditions, hydrogen is a gas that is tasteless, odorless and colorless. Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel.
5906130 | 3 - perfectly relevant | Rating Newest Oldest. Best Answer: Hydrogen, like water, can exist in 3 states....Solid, Liquid and Gas Its temperature as a solid is −259.14 °C' Hydrogen melts to liquid at −252.87 °C. It boils and vaporises at -252.125 °C Just cooling or compressing Hydrogen won't liquefy or freeze it.
4254815 | 1 - related | Answer   The boiling point of liquid hydrogen is 20.268 K (-252.88 °C or -423.184 °F)    The freezing point of hydrogen is 14.025 K (-259.125 °C or -434.
8588222 | 3 - perfectly relevant | User: Hydrogen is a liquid below what temperature? a. 100 degrees C c. -183 degrees C b. -253 degrees C d. 0 degrees C Weegy: Hydrogen is a liquid below 253 degrees C. User: What is the boiling point of oxygen? a. 100 degrees C c. -57 degrees C b. 8 degrees C d. -183 degrees C Weegy: The boiling point of oxygen is -183 degrees C.
4254811 | 3 - perfectly relevant | Confidence votes 11.4K. At STP (standard temperature and pressure) hydrogen is a gas. It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero. Eleven degrees cooler, at -434 °F, it starts to solidify.
2697752 | 2 - highly relevant | Hydrogen's state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold... Hydrogen's state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold temperatures. Hydrogen's state of matter can change when the temperature changes, becoming a liquid at temperatures between minus 423.18 and minus 434.49 degrees Fahrenheit. It becomes a solid at temperatures below minus 434.49 F.Due to its high flammability, hydrogen gas is commonly used in combustion reactions, such as in rocket and automobile fuels.
6080460 | 3 - perfectly relevant | Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel.ydrogen is found in large amounts in giant gas planets and stars, it plays a key role in powering stars through fusion reactions. Hydrogen is one of two important elements found in water (H 2 O). Each molecule of water is made up of two hydrogen atoms bonded to one oxygen atom.
3905802 | 3 - perfectly relevant | Hydrogen is found naturally in the molecular H2 form. To exist as a liquid, H2 must be cooled below hydrogen's critical point of 33 K. However, for hydrogen to be in a fully liquid state without boiling at atmospheric pressure, it needs to be cooled to 20.28 K (−423.17 °F/−252.87 °C).

Query #1133167, “how is the weather in jamaica”, returns the following results:

Passage id | Relevance rating | Passage
3023123 | 2 - highly relevant | Climate - Jamaica. Temperature, rainfall, prevailing weather conditions, when to go, what to pack. In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C, and minimum temperatures around 20/23 °C.
434121 | 2 - highly relevant | Temperature, rainfall, prevailing weather conditions, when to go, what to pack. In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C (81/86 °F), and minimum temperatures around 20/23 °C (68/73 °F).
4922619 | 2 - highly relevant | Map from Google - Jamaica. 1  In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C (81/86 °F), and minimum temperatures around 20/23 °C (68/73 °F).
8255706 | 2 - highly relevant | And it's absolutely true. This is Jamaica weather! Most of our days are filled with warmth and sunshine, even during the rainy season. Jamaica has a tropical climate with hot and humid weather at sea level. The higher inland regions have a more temperate climate. (Bring a light jacket just in case you travel to the mountains where temperatures can be 10 degrees cooler or in case you go on a windy boat ride).
190806 | 2 - highly relevant | It is always important to know what the weather in Jamaica will be like before you plan and take your vacation. For the most part, the average temperature in Jamaica is between 80 °F and 90 °F (27 °FCelsius-29 °Celsius). Luckily, the weather in Jamaica is always vacation friendly. You will hardly experience long periods of rain fall, and you will become accustomed to weeks upon weeks of sunny weather.
1824486 | 2 - highly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably...
4498474 | 3 - perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.
1824480 | 3 - perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.

As we can see, for all 3 queries Elasticsearch returned mostly relevant results, and the top result for every query was either highly relevant or perfectly relevant.
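The same conclusion can be checked with a few lines of Python. The ratings below are transcribed, in result order, from the three tables above (1 = related, 2 = highly relevant, 3 = perfectly relevant). The threshold of 2 for counting a result as relevant is a choice made for this sketch, not part of the TREC judgments.

```python
# Relevance ratings per query, in the order the results were returned.
ratings = {
    "1124210": [2, 3, 2, 2, 2, 2, 3, 2, 2],
    "1129237": [3, 3, 1, 3, 3, 2, 3, 3],
    "1133167": [2, 2, 2, 2, 2, 2, 3, 3],
}


def summarize(grades, threshold=2):
    """Rating of the top result and share of results at or above threshold."""
    relevant = sum(1 for grade in grades if grade >= threshold)
    return {
        "top_rating": grades[0],
        "share_relevant": relevant / len(grades),
    }


summary = {query_id: summarize(grades) for query_id, grades in ratings.items()}
for query_id, stats in summary.items():
    print(query_id, stats)
```

This is a sanity check on three queries, not a ranking metric such as NDCG, which a rigorous evaluation would report.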

Try it out

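NLP is a powerful feature in the Elastic Stack, with an exciting roadmap. Discover new features and keep up with the latest developments by building a cluster in Elastic Cloud. Sign up for a free 14-day trial today and try the examples in this blog.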
