MongoDB Full and Partial Text Search

Posted 2023-02-14

【Title】: MongoDB Full and Partial Text Search 【Posted】: 2017-12-03 16:01:15 【Question】:

Environment:

MongoDB (3.2.0) with Mongoose

Collection:

users

Text index creation:

  BasicDBObject keys = new BasicDBObject();
  keys.put("name","text");

  BasicDBObject options = new BasicDBObject();
  options.put("name", "userTextSearch");
  options.put("unique", Boolean.FALSE);
  options.put("background", Boolean.TRUE);
  
  userCollection.createIndex(keys, options); // using MongoTemplate

Document:

{ "name": "LEONEL" }

Queries:

db.users.find( { "$text" : { "$search" : "LEONEL" } } ) => FOUND
db.users.find( { "$text" : { "$search" : "leonel" } } ) => FOUND (search caseSensitive is false)
db.users.find( { "$text" : { "$search" : "LEONÉL" } } ) => FOUND (search with diacriticSensitive is false)
db.users.find( { "$text" : { "$search" : "LEONE" } } ) => FOUND (partial search)
db.users.find( { "$text" : { "$search" : "LEO" } } ) => NOT FOUND (partial search)
db.users.find( { "$text" : { "$search" : "L" } } ) => NOT FOUND (partial search)

Any idea why I get 0 results when I query for "LEO" or "L"?

Regex is not allowed with a text index search.

db.getCollection('users')
     .find( { "$text" : { "$search" : "/LEO/i",
                          "$caseSensitive": false,
                          "$diacriticSensitive": false } } )
     .count() // 0 results

db.getCollection('users')
     .find( { "$text" : { "$search" : "LEO",
                          "$caseSensitive": false,
                          "$diacriticSensitive": false } } )
     .count() // 0 results

MongoDB documentation:

Text Search
$text
Text Indexes
Improve Text Indexes to support partial word match

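【Question comments】:

Possible duplicate of "MongoDB: Is it possible to make a case-insensitive query?"

This question is about partial search using a text index, not about case sensitivity. @LucasCosta, please don't mark this question as a duplicate.

It is only a possibility; it takes at least 5 votes. @Leonel, have you tried /LEO/i? You can use a regex in the search value in MongoDB.

@LucasCosta Regex is not allowed with a text index search.

【Answer 1】: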

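As of MongoDB 3.4, the text search feature is designed to support case-insensitive searches on text content, using language-specific rules for stop words and stemming. The stemming rules for the supported languages are based on standard algorithms that generally handle common verbs and nouns but are unaware of proper nouns.

There is no explicit support for partial or fuzzy matches, but terms that stem to the same result may appear to work that way. For example, "taste", "tastes", and "tasteful" all stem to "tast". Try the Snowball Stemming Demo page to experiment with more words and stemming algorithms.

The results that match for you are all variations of the same word, "LEONEL", and differ only by case and diacritics. Unless the rules of your selected language stem "LEONEL" to something shorter, these are the only kinds of variation that will match.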
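To see why "LEO" finds nothing while "leonel" and "LEONÉL" do, here is a toy model of whole-term matching. It is only a sketch under loud assumptions: `normalizeTerm` and `toyTextMatch` are hypothetical helpers, not MongoDB APIs, and stemming and stop words are left out entirely, so stem-based hits are not reproduced.

```javascript
// Toy model only: NOT MongoDB's implementation. Stemming and stop words are omitted.

// Lowercase a term and strip diacritics (assumption: roughly what a
// case-insensitive, diacritic-insensitive comparison amounts to).
function normalizeTerm(term) {
  return term
    .normalize('NFD')                 // split "É" into "E" + combining accent
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining accents
    .toLowerCase();
}

// A value matches when any query term equals a WHOLE indexed term.
function toyTextMatch(fieldValue, query) {
  const indexed = new Set(
    fieldValue.split(/\s+/).filter(Boolean).map(normalizeTerm)
  );
  return query
    .split(/\s+/)
    .filter(Boolean)
    .some((term) => indexed.has(normalizeTerm(term)));
}

console.log(toyTextMatch('LEONEL', 'leonel')); // true  (case folded)
console.log(toyTextMatch('LEONEL', 'LEONÉL')); // true  (diacritic stripped)
console.log(toyTextMatch('LEONEL', 'LEO'));    // false (a prefix is not a whole term)
```

The point of the sketch is the `Set` lookup: terms are compared for equality, so a prefix such as "LEO" never equals the stored term.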

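If you want efficient partial matches, you will need to take a different approach. For some helpful ideas, see:

Efficient Techniques for Fuzzy and Partial matching in MongoDB, by John Page
Efficient Partial Keyword Searches, by James Tan

There is a relevant improvement request that you can watch or upvote in the MongoDB issue tracker: SERVER-15090: Improve Text Indexes to support partial word match.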

【Comments】:

There is a better way now. Check out Atlas Search, available in the free tier, for better efficiency: docs.atlas.mongodb.com/atlas-search

【Answer 2】:

import re

db.collection.find({"$or": [{"your field name": re.compile(text, re.IGNORECASE)}, {"your field name": re.compile(text, re.IGNORECASE)}]})

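【Comments】:

Code-only answers are discouraged because they don't give future readers much information. Please add some explanation of what you have written.

【Answer 3】:

Since Mongo currently does not support partial search by default...

I created a simple static method.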

import mongoose from 'mongoose'

const PostSchema = new mongoose.Schema({
    title: { type: String, default: '', trim: true },
    body: { type: String, default: '', trim: true },
});

// Weighted text index over both fields
PostSchema.index({ title: "text", body: "text" },
    { weights: { title: 5, body: 3 } })

PostSchema.statics = {
    // Fallback: case-insensitive regex match on either field
    // (q is used as a regex pattern as-is)
    searchPartial: function(q, callback) {
        return this.find({
            $or: [
                { "title": new RegExp(q, "gi") },
                { "body": new RegExp(q, "gi") },
            ]
        }, callback);
    },

    // Whole-word match through the text index
    searchFull: function (q, callback) {
        return this.find({
            $text: { $search: q, $caseSensitive: false }
        }, callback)
    },

    // Try the text index first; fall back to the regex only when it finds nothing
    search: function(q, callback) {
        this.searchFull(q, (err, data) => {
            if (err) return callback(err, data);
            if (!err && data.length) return callback(err, data);
            if (!err && data.length === 0) return this.searchPartial(q, callback);
        });
    },
}

export default mongoose.models.Post || mongoose.model('Post', PostSchema)

Usage:

import Post from '../models/post'

Post.search('Firs', function(err, data) {
   console.log(data);
})
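A promise-based variant of the same fallback idea can be sketched as follows. This is a minimal sketch, not this answer's code: `escapeRegExp` and `searchWithFallback` are hypothetical names, the `fields` parameter is made up for illustration, and the only assumption about `model` is that `model.find(filter)` can be awaited and yields an array, as a Mongoose query can.

```javascript
// Escape regex metacharacters so the user's input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Try the text index first; run the regex scan only when it finds nothing.
// `fields` lists the string fields to scan (hypothetical parameter).
async function searchWithFallback(model, q, fields) {
  const full = await model.find({ $text: { $search: q } });
  if (full.length > 0) return full;

  const pattern = new RegExp(escapeRegExp(q), 'i');
  return model.find({ $or: fields.map((field) => ({ [field]: pattern })) });
}
```

Inside an async function, `const posts = await searchWithFallback(Post, 'Firs', ['title', 'body'])` would then play the role of `Post.search`. Escaping the input is a deliberate difference from the code above, which passes `q` to `RegExp` unchanged.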

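【Comments】:

How do I return the data from Post.search()?

@LeventeOrbán Promises! I'll give an answer below.

More info: docs.mongodb.com/manual/reference/operator/query/text/… - docs.mongodb.com/manual/text-search/index.html

@RicardoCanelas Is there a way to add the index on a sub-document field? What if a field is an array?

Great answer! I wonder whether this also works with async/await. I just tried it and it didn't work for me.

【Answer 4】:

Without creating an index, we can simply use:

db.users.find({ name: /<full_or_partial_text>/i }) (case insensitive)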

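【Comments】:

Upvoted, it works with aqp. Thanks!

new RegExp(string, 'i') for anyone who needs a dynamic string search.

How do I set a variable in it?

Note that because the search does not run on an indexed field, it is neither efficient nor scalable; on large collections it will be slow.

【Answer 5】:

I wrapped @Ricardo Canelas' answer in a mongoose plugin on npm.

Two changes were made:
- It uses promises
- It searches any field of type String

Here is the important source code: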

// mongoose-partial-full-search

module.exports = exports = function addPartialFullSearch(schema, options) {
  schema.statics = {
    ...schema.statics,
    // Build an $or of case-insensitive regex filters, one per String path
    makePartialSearchQueries: function (q) {
      if (!q) return {};
      const $or = Object.entries(this.schema.paths).reduce((queries, [path, val]) => {
        val.instance == "String" &&
          queries.push({
            [path]: new RegExp(q, "gi")
          });
        return queries;
      }, []);
      return { $or }
    },
    searchPartial: function (q, opts) {
      return this.find(this.makePartialSearchQueries(q), opts);
    },

    searchFull: function (q, opts) {
      return this.find({
        $text: {
          $search: q
        }
      }, opts);
    },

    // Text search first; regex fallback when it returns nothing
    search: function (q, opts) {
      return this.searchFull(q, opts).then(data => {
        return data.length ? data : this.searchPartial(q, opts);
      });
    }
  }
}

exports.version = require('../package').version;

Usage

// PostSchema.js
import addPartialFullSearch from 'mongoose-partial-full-search';
PostSchema.plugin(addPartialFullSearch);

// some other file.js
import Post from '../wherever/models/post'

Post.search('Firs').then(data => console.log(data));

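【Comments】:

【Answer 6】:

If you are using a variable to store the string or value to search for:

it will work with a regex, as follows:

 collection.find({ <name of MongoDB field>: new RegExp(variable_name, 'i') })

Here, i is the ignore-case option.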

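【Comments】:

I was using Monk; collection is the db.get() function, which connects to the database.

【Answer 7】:

A quick and dirty solution that works for me: use text search first and, if nothing is found, run a second query with a regex. If you don't want to make two queries, $or works too, but it requires all fields in the query to be indexed.

Also, you had better not use a case-insensitive regex, because it can't rely on indexes. In my case I made lowercase copies of the fields I search on.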
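The lowercase-copy idea can be sketched like this. It is an illustration under assumptions rather than this answer's actual code: the `nameLower` field and both helper names are made up here. A case-sensitive regex anchored with `^` is a prefix expression, which MongoDB can serve from an ordinary index on that field.

```javascript
// Store a lowercase copy next to the original value
// (hypothetical field name: nameLower).
function withLowercaseCopy(doc) {
  return { ...doc, nameLower: doc.name.toLowerCase() };
}

// Build a case-sensitive, anchored prefix filter against the lowercase copy.
// The term is lowercased and escaped so it is matched literally.
function prefixFilter(term) {
  const escaped = term.toLowerCase().replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return { nameLower: { $regex: '^' + escaped } };
}

const storedDoc = withLowercaseCopy({ name: 'LEONEL' });
const leoFilter = prefixFilter('Leo');
console.log(storedDoc.nameLower);       // leonel
console.log(leoFilter.nameLower.$regex); // ^leo
```

With an index created through `db.users.createIndex({ nameLower: 1 })`, a call such as `db.users.find(prefixFilter('Leo'))` would match names that start with "leo". It does not match text in the middle of a value; that case still needs one of the other approaches in this thread.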

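【Comments】:

【Answer 8】:

A good n-gram based approach to fuzzy matching is explained here (it also explains how to score results higher using prefix matching): https://medium.com/xeneta/fuzzy-search-with-mongodb-and-python-57103928ee5d

Note: n-gram based approaches can be storage-intensive, and the MongoDB collection size will increase.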
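To get a feel for that storage cost, the sketch below generates every character n-gram of a word. `allNGrams` is a hypothetical helper, and the range from 2 characters up to the full word length is an arbitrary assumption; the linked article may use different settings.

```javascript
// Collect every distinct substring of length minGram..word.length.
function allNGrams(word, minGram = 2) {
  const grams = new Set();
  for (let size = minGram; size <= word.length; size++) {
    for (let start = 0; start + size <= word.length; start++) {
      grams.add(word.slice(start, start + size));
    }
  }
  return [...grams];
}

console.log(allNGrams('leonel').length); // 15
```

A single six-letter word already produces 15 indexed terms under these settings, and the count grows roughly with the square of the word length, which is where the extra storage comes from.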

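【Comments】:

【Answer 9】:

If you want all the benefits of MongoDB's full-text search AND want partial matches (perhaps for auto-complete), the n-gram based approach mentioned by Shrikant Prabhu was the right solution for me. Obviously your mileage may vary, and this might not be practical when indexing huge documents.

In my case I mainly needed partial matches on just the title field (and a few other short fields) of my documents.

I used an edge n-gram approach. What does that mean? In short, you turn a string like "Mississippi River" into a string like "Mis Miss Missi Missis Mississ Mississi Mississip Mississipp Mississippi Riv Rive River".

Inspired by this code by Liu Gen, I came up with this method: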

function createEdgeNGrams(str) {
    if (str && str.length > 3) {
        const minGram = 3
        const maxGram = str.length

        // For each space-separated token, emit its prefixes of length 3 and up
        return str.split(" ").reduce((ngrams, token) => {
            if (token.length > minGram) {
                for (let i = minGram; i <= maxGram && i <= token.length; ++i) {
                    ngrams = [...ngrams, token.slice(0, i)]
                }
            } else {
                // Short tokens are kept whole
                ngrams = [...ngrams, token]
            }
            return ngrams
        }, []).join(" ")
    }

    return str
}

let res = createEdgeNGrams("Mississippi River")
console.log(res)

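To make use of this in Mongo, I add a searchTitle field to my documents and set its value by converting the actual title field into edge n-grams with the function above. I also create a "text" index on the searchTitle field.

I then exclude the searchTitle field from my search results by using a projection:

db.collection('my-collection')
  .find({ $text: { $search: mySearchTerm } }, { projection: { searchTitle: 0 } })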
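The write side of this setup can be sketched as below. It is a sketch under assumptions, not this answer's code: `edgeNGrams` is a compact restatement of the same edge n-gram idea, `withSearchTitle` is a hypothetical helper, and the answer does not say which index options were used. The `default_language: 'none'` option is my own assumption, chosen because it makes a text index skip stemming and stop-word removal, which could otherwise alter the generated prefixes.

```javascript
// Compact edge n-gram helper (assumed minimum prefix length: 3).
function edgeNGrams(text, minGram = 3) {
  return text
    .split(/\s+/)
    .filter(Boolean)
    .flatMap((token) => {
      if (token.length <= minGram) return [token]; // keep short tokens whole
      const grams = [];
      for (let i = minGram; i <= token.length; i++) grams.push(token.slice(0, i));
      return grams;
    })
    .join(' ');
}

// Derive the searchable field from the title before saving the document.
function withSearchTitle(doc) {
  return { ...doc, searchTitle: edgeNGrams(doc.title) };
}

// Index definition for the derived field (options are an assumption, see above).
const indexSpec = { searchTitle: 'text' };
const indexOptions = { default_language: 'none' };

console.log(withSearchTitle({ title: 'Mississippi River' }).searchTitle);
```

With the Node.js driver, `collection.createIndex(indexSpec, indexOptions)` followed by `collection.insertOne(withSearchTitle(doc))` would put the pieces together. The derived field has to be recomputed whenever the title changes.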

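【Comments】:

In my opinion this is by far the best solution. It's a pity Mongo doesn't offer n-grams out of the box.

【Answer 10】:

Full/partial search in MongoDB for a "pure" Meteor project

I modified flash's code to use it with Meteor collections and simpleSchema but without mongoose. That means I removed the use of the .plugin() method and of schema.path (it looks like a simpleSchema attribute in flash's code, but it did not resolve for me), and I return the result array instead of a cursor.

I thought this might help someone, so I am sharing it.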

export function partialFullTextSearch(meteorCollection, searchString) {

    // builds an "or"-mongoDB-query for all fields with type "String" with a regEx as search parameter
    const makePartialSearchQueries = () => {
        if (!searchString) return {};
        const $or = Object.entries(meteorCollection.simpleSchema().schema())
            .reduce((queries, [name, def]) => {
                def.type.definitions.some(t => t.type === String) &&
                queries.push({[name]: new RegExp(searchString, "gi")});
                return queries
            }, []);
        return {$or}
    };

    // returns a promise with result as array
    const searchPartial = () => meteorCollection.rawCollection()
        .find(makePartialSearchQueries(searchString)).toArray();

    // returns a promise with result as array
    const searchFull = () => meteorCollection.rawCollection()
        .find({$text: {$search: searchString}}).toArray();

    // an empty full-text result is turned into a rejection so that the catch runs the partial search
    return searchFull().then(result => {
        if (result.length === 0) throw null
        else return result
    }).catch(() => searchPartial());
}

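This returns a Promise, so call it like this (i.e. as the return value of an async Meteor method, searchContact, on the server side). It assumes that you attached a simpleSchema to your collection before calling this method.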

return partialFullTextSearch(Contacts, searchString).then(result => result);

【Comments】:

【Answer 11】:

I create an additional field that combines all the fields of the document that I want to search. Then I just use a regex:

const user = {
    firstName: 'Bob',
    lastName: 'Smith',
    address: {
        street: 'First Ave',
        city: 'New York City',
    },
    notes: 'Bob knows Mary'
}

// add combined search field with '+' separator to preserve spaces
user.searchString = `${user.firstName}+${user.lastName}+${user.address.street}+${user.address.city}+${user.notes}`

db.users.find({searchString: {$regex: 'mar', $options: 'i'}})
// returns Bob because 'mar' matches his notes field

// TODO write a client-side function to highlight the matching fragments

【Comments】:
