Elastic Search MoreLikeThis Query Never Returns Results

Published: 2019-07-20 13:55:33

Question:

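I must be doing something fundamentally wrong here. I'm trying to get a "More Like This" query working in a search engine project of ours that uses Elastic Search. The idea is that the CMS can write tags (like categories) to the page in a Meta tag or something similar, and we would read those into Elastic and use them to drive a "more like this" search based on an input document ID.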

So if the input document has the tags catfish, chicken, goat, I would expect Elastic Search to find other documents that share those tags, and not to return documents tagged racecar and airplane.

I've built a proof-of-concept console app by:

Following the instructions at https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html to get a local Elastic Search 6.6.1 instance running in Docker

Creating a new .NET Framework 4.6.1 console app

Adding the NuGet packages for NEST 6.5.0 and ElasticSearch.Net 6.5.0

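Then I created a new Elastic index containing objects (type "MyThing") that have a "Tags" property. This property is a random, comma-separated set of words drawn from a fixed set of possible values. In testing I've inserted anywhere from 100 to 5000 items into the index, and I've varied the number of possible words in the set.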

No matter what I try, the MoreLikeThis query never returns anything, and I don't understand why.

The query that returns no results:

    var result = EsClient.Search<MyThing>(s => s
        .Index(DEFAULT_INDEX)
        .Query(esQuery =>
        {
            var mainQuery = esQuery
                .MoreLikeThis(mlt => mlt
                    .Include(true)
                    .Fields(f => f.Field(ff => ff.Tags, 5))
                    .Like(l => l.Document(d => d.Id(id)))
                );

            return mainQuery;
        }
    ));

Full "program.cs" source:

    using Nest;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace Test_MoreLikeThis_ES6
    {
        class Program
        {
            public class MyThing
            {
                public string Tags { get; set; }
            }

            const string ELASTIC_SERVER = "http://localhost:9200";
            const string DEFAULT_INDEX = "my_index";
            const int NUM_RECORDS = 1000;

            private static Uri es_node = new Uri(ELASTIC_SERVER);
            private static ConnectionSettings settings = new ConnectionSettings(es_node).DefaultIndex(DEFAULT_INDEX);
            private static ElasticClient EsClient = new ElasticClient(settings);

            private static Random rnd = new Random();

            static void Main(string[] args)
            {
                Console.WriteLine("Rebuild index? (y):");
                var answer = Console.ReadLine().ToLower();
                if (answer == "y")
                {
                    RebuildIndex();
                    for (int i = 0; i < NUM_RECORDS; i++)
                    {
                        AddToIndex();
                    }
                }

                Console.WriteLine("");
                Console.WriteLine("Getting a Thing...");
                var aThingId = GetARandomThingId();


                Console.WriteLine("");
                Console.WriteLine("Looking for something similar to document with id " + aThingId);
                Console.WriteLine("");
                Console.WriteLine("");

                GetMoreLikeAThing(aThingId);
            }

            private static string GetARandomThingId()
            {
                var firstdocQuery = EsClient
                    .Search<MyThing>(s =>
                        s.Size(1)
                        .Query(q => {
                            return q.FunctionScore(fs => fs.Functions(fn => fn.RandomScore(rs => rs.Seed(DateTime.Now.Ticks).Field("_seq_no"))));
                        })
                    );

                if (!firstdocQuery.IsValid || firstdocQuery.Hits.Count == 0) return null;

                var hit = firstdocQuery.Hits.First();
                Console.WriteLine("Found a thing with id '" + hit.Id + "' and tags: " + hit.Source.Tags);
                return hit.Id;
            }

            private static void GetMoreLikeAThing(string id)
            {

                var result = EsClient.Search<MyThing>(s => s
                    .Index(DEFAULT_INDEX)
                    .Query(esQuery =>
                    {
                        var mainQuery = esQuery
                            .MoreLikeThis(mlt => mlt
                                .Include(true)
                                .Fields(f => f.Field(ff => ff.Tags, 5))
                                .Like(l => l.Document(d => d.Id(id)))
                            );

                        return mainQuery;
                    }

                ));

                if (result.IsValid)
                {
                    if (result.Hits.Count > 0)
                    {
                        Console.WriteLine("These things are similar:");
                        foreach (var hit in result.Hits)
                        {
                            Console.WriteLine("   " + hit.Id + " : " + hit.Source.Tags);
                        }
                    }
                    else
                    {
                        Console.WriteLine("No similar things found.");
                    }

                }
                else
                {
                    Console.WriteLine("There was an error running the ES query.");
                }

                Console.WriteLine("");
                Console.WriteLine("Enter (y) to get another thing, or anything else to exit");
                var y = Console.ReadLine().ToLower();

                if (y == "y")
                {
                    var aThingId = GetARandomThingId();
                    GetMoreLikeAThing(aThingId);
                }

                Console.WriteLine("");
                Console.WriteLine("Any key to exit...");
                Console.ReadKey();

            }

            private static void RebuildIndex()
            {
                var existsResponse = EsClient.IndexExists(DEFAULT_INDEX);
                if (existsResponse.Exists) //delete existing mapping (and data)
                {
                    EsClient.DeleteIndex(DEFAULT_INDEX);
                }

                var rebuildResponse = EsClient.CreateIndex(DEFAULT_INDEX, c => c.Settings(s => s.NumberOfReplicas(1).NumberOfShards(5)));
                var response2 = EsClient.Map<MyThing>(m => m.AutoMap());
            }

            private static void AddToIndex()
            {
                var myThing = new MyThing();
                var tags = new List<string> {
                        "catfish",
                        "tractor",
                        "racecar",
                        "airplane",
                        "chicken",
                        "goat",
                        "pig",
                        "horse",
                        "goose",
                        "duck"
                    };

                var randNum = rnd.Next(0, tags.Count);

                //get randNum random tags
                var rand = tags.OrderBy(o => Guid.NewGuid().ToString()).Take(randNum);
                myThing.Tags = string.Join(", ", rand);

                var ir = new IndexRequest<MyThing>(myThing);
                var indexResponse = EsClient.Index(ir);

                Console.WriteLine("Index response: " + indexResponse.Id + " : " + string.Join(" " , myThing.Tags));
            }
        }
    }

Answer 1:

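The issue here is that the default min_term_freq value of 2 is never satisfied by any term of the prototype document, because every document contains each tag (term) only once. If you lower min_term_freq to 1, you'll get results. You may also want to set min_doc_freq to 1, and combine the MoreLikeThis query with a query that excludes the prototype document.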
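To make the effect of that threshold concrete, here is a minimal sketch in plain C# that mimics only the term-frequency filter. It is not the Lucene or Elasticsearch implementation: the `SelectTerms` helper, the `MltTermSelectionSketch` class and the simple comma/space tokenization are assumptions made purely for illustration, and the other MoreLikeThis filters (such as min_doc_freq) are not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MltTermSelectionSketch
{
    // Hypothetical helper: keeps the terms of one document whose frequency
    // inside that document reaches minTermFreq. Tokenization is a simple
    // split on commas and spaces, which is an assumption of this sketch.
    public static List<string> SelectTerms(string tags, int minTermFreq)
    {
        return tags
            .Split(new[] { ',', ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => t.ToLowerInvariant())
            .GroupBy(t => t)
            .Where(g => g.Count() >= minTermFreq)
            .Select(g => g.Key)
            .ToList();
    }

    public static void Main()
    {
        const string tags = "catfish, chicken, goat";

        // Threshold of 2 (the default): each tag occurs once, so no term survives.
        Console.WriteLine(SelectTerms(tags, 2).Count);              // prints 0

        // Threshold of 1: all three tags are kept as query terms.
        Console.WriteLine(string.Join(", ", SelectTerms(tags, 1))); // prints catfish, chicken, goat
    }
}
```

With no terms selected from the prototype document, there is nothing to search for, which matches the empty result set described in the question.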

Here's an example to play with

    // Members of the Program class. Requires: using Nest; using System;
    // using System.Collections.Generic; using System.Linq; using System.Threading;
    const string ELASTIC_SERVER = "http://localhost:9200";
    const string DEFAULT_INDEX = "my_index";
    const int NUM_RECORDS = 1000;

    private static readonly Random _random = new Random();
    private static readonly IReadOnlyList<string> Tags =
        new List<string>
        {
            "catfish",
            "tractor",
            "racecar",
            "airplane",
            "chicken",
            "goat",
            "pig",
            "horse",
            "goose",
            "duck"
        };

    private static ElasticClient _client;

    private static void Main()
    {
        var pool = new SingleNodeConnectionPool(new Uri(ELASTIC_SERVER));

        var settings = new ConnectionSettings(pool)
            .DefaultIndex(DEFAULT_INDEX);

        _client = new ElasticClient(settings);

        Console.WriteLine("Rebuild index? (y):");
        var answer = Console.ReadLine().ToLower();
        if (answer == "y")
        {
            RebuildIndex();
            AddToIndex();
        }

        Console.WriteLine();
        Console.WriteLine("Getting a Thing...");
        var aThingId = GetARandomThingId();

        Console.WriteLine();
        Console.WriteLine("Looking for something similar to document with id " + aThingId);
        Console.WriteLine();
        Console.WriteLine();

        GetMoreLikeAThing(aThingId);
    }

    public class MyThing
    {
        public List<string> Tags { get; set; }
    }

    private static string GetARandomThingId()
    {
        var firstdocQuery = _client
            .Search<MyThing>(s =>
                s.Size(1)
                .Query(q => q
                    .FunctionScore(fs => fs
                        .Functions(fn => fn
                            .RandomScore(rs => rs
                                .Seed(DateTime.Now.Ticks)
                                .Field("_seq_no")
                            )
                        )
                    )
                )
            );

        if (!firstdocQuery.IsValid || firstdocQuery.Hits.Count == 0) return null;

        var hit = firstdocQuery.Hits.First();
        Console.WriteLine($"Found a thing with id '{hit.Id}' and tags: {string.Join(", ", hit.Source.Tags)}");
        return hit.Id;
    }

    private static void GetMoreLikeAThing(string id)
    {
        // Lower both frequency thresholds to 1 and exclude the prototype document itself.
        var result = _client.Search<MyThing>(s => s
            .Index(DEFAULT_INDEX)
            .Query(esQuery => esQuery
                .MoreLikeThis(mlt => mlt
                        .Include(true)
                        .Fields(f => f.Field(ff => ff.Tags))
                        .Like(l => l.Document(d => d.Id(id)))
                        .MinTermFrequency(1)
                        .MinDocumentFrequency(1)
                ) && !esQuery
                .Ids(ids => ids
                    .Values(id)
                )
            )
        );

        if (result.IsValid)
        {
            if (result.Hits.Count > 0)
            {
                Console.WriteLine("These things are similar:");
                foreach (var hit in result.Hits)
                {
                    Console.WriteLine($"   {hit.Id}: {string.Join(", ", hit.Source.Tags)}");
                }
            }
            else
            {
                Console.WriteLine("No similar things found.");
            }

        }
        else
        {
            Console.WriteLine("There was an error running the ES query.");
        }

        Console.WriteLine();
        Console.WriteLine("Enter (y) to get another thing, or anything else to exit");
        var y = Console.ReadLine().ToLower();

        if (y == "y")
        {
            var aThingId = GetARandomThingId();
            GetMoreLikeAThing(aThingId);
        }

        Console.WriteLine();
        Console.WriteLine("Any key to exit...");

    }

    private static void RebuildIndex()
    {
        var existsResponse = _client.IndexExists(DEFAULT_INDEX);
        if (existsResponse.Exists) //delete existing mapping (and data)
        {
            _client.DeleteIndex(DEFAULT_INDEX);
        }

        var rebuildResponse = _client.CreateIndex(DEFAULT_INDEX, c => c
            .Settings(s => s
                .NumberOfShards(1)
            )
            .Mappings(m => m
                .Map<MyThing>(mm => mm.AutoMap())
            )
        );
    }

    private static void AddToIndex()
    {
        var bulkAllObservable = _client.BulkAll(GetMyThings(), b => b
            .RefreshOnCompleted()
            .Size(1000));

        var waitHandle = new ManualResetEvent(false);
        Exception exception = null;

        var bulkAllObserver = new BulkAllObserver(
            onNext: r =>
            {
                Console.WriteLine($"Indexed page {r.Page}");
            },
            onError: e =>
            {
                exception = e;
                waitHandle.Set();
            },
            onCompleted: () => waitHandle.Set());

        bulkAllObservable.Subscribe(bulkAllObserver);

        // Block until the bulk indexing has completed or failed.
        waitHandle.WaitOne();

        if (exception != null)
        {
            throw exception;
        }
    }

    private static IEnumerable<MyThing> GetMyThings()
    {
        for (int i = 0; i < NUM_RECORDS; i++)
        {
            var randomTags = Tags.OrderBy(o => Guid.NewGuid().ToString())
                .Take(_random.Next(0, Tags.Count))
                .OrderBy(t => t)
                .ToList();

            yield return new MyThing { Tags = randomTags };
        }
    }

Here's an example output

Found a thing with id 'Ugg9LGkBPK3n91HQD1d5' and tags: airplane, goat
These things are similar:
   4wg9LGkBPK3n91HQD1l5: airplane, goat
   9Ag9LGkBPK3n91HQD1l5: airplane, goat
   Vgg9LGkBPK3n91HQD1d5: airplane, goat, goose
   sQg9LGkBPK3n91HQD1d5: airplane, duck, goat
   lQg9LGkBPK3n91HQD1h5: airplane, catfish, goat
   9gg9LGkBPK3n91HQD1l5: airplane, catfish, goat
   FQg9LGkBPK3n91HQD1p5: airplane, goat, goose
   Jwg9LGkBPK3n91HQD1p5: airplane, goat, goose
   Fwg9LGkBPK3n91HQD1d5: airplane, duck, goat, tractor
   Kwg9LGkBPK3n91HQD1d5: airplane, goat, goose, horse

Comments:

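That was it, thank you! I swear I tried that too. I'd been fiddling with all of those max/min options and reading that docs page, but evidently I never hit the right combination.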
