Jackrabbit Oak Lucene index and SQL2 query for full-text search in txt and pdf

Posted 2023-02-16

[Posted]: 2020-01-31 22:41:55 [Question]:

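I am trying to implement full-text search over file contents with Oak 1.16.0.

I tried to create an index that covers all properties, as described in the Oak documentation.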

/oak:index/assetType
  - jcr:primaryType = "oak:QueryIndexDefinition"
  - type = "lucene"
  - compatVersion = 2
  - async = "async"
  + indexRules
    - jcr:primaryType = "nt:unstructured"
    + nt:base
      + properties
        - jcr:primaryType = "nt:unstructured"
        + allProps
          - name = ".*"
          - isRegexp = true
          - nodeScopeIndex = true

This is how I create the index. I tried different combinations of node types; nothing worked.

    public static void createIndex(Repository repository) {
        Session session = null;
        try {
            session = repository.login();

            Node root = session.getRootNode();
            Node index = root.getNode("oak:index");
            Node luceneIndex = index.addNode("assetType", "oak:QueryIndexDefinition");
            luceneIndex.setProperty("compatVersion", "2");
            luceneIndex.setProperty("type", "lucene");
            luceneIndex.setProperty("async", "async");
            Node rules = luceneIndex.addNode("indexRules", "nt:unstructured");
            Node base = rules.addNode("nt:base");
            Node properties = base.addNode("properties", "nt:unstructured");
            Node allProps = properties.addNode("allProps");
            // Note: the definition above calls this property "name", not "jcr:content".
            allProps.setProperty("jcr:content", ".*");
            allProps.setProperty("isRegexp", true);
            allProps.setProperty("nodeScopeIndex", true);
            session.save();
        } catch (LoginException e) {
            e.printStackTrace();
        } catch (RepositoryException e) {
            e.printStackTrace();
        } finally {
            // Guard against a failed login, which would leave session null.
            if (session != null) {
                session.logout();
            }
        }
    }

Then I add some files:

    public static void saveFileIfNotExist(byte[] rawFile, String fileName, String folderName, String mimeType, Repository repository) {
        Session session = null;
        try {
            session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
            Node root = session.getRootNode();
            Binary binary = session.getValueFactory().createBinary(new ByteArrayInputStream(rawFile));
            if (!root.hasNode(folderName)) {
                System.out.println("NO FOLDER");
                // Standard JCR layout: nt:folder > nt:file > jcr:content (nt:resource)
                Node folder = root.addNode(folderName, "nt:folder");
                Node file = folder.addNode(fileName, "nt:file");
                Node content = file.addNode("jcr:content", "nt:resource");
                content.setProperty("jcr:mimeType", mimeType);
                content.setProperty("jcr:data", binary);
            } else {
                System.out.println("FOLDER EXIST");
            }
            session.save();
        } catch (RepositoryException e) {
            e.printStackTrace();
        } finally {
            // Guard against a failed login, which would leave session null.
            if (session != null) {
                session.logout();
            }
        }
    }

File content:

An implementation of the Value interface must override the inherited method
Object.equals(Object) so that, given Value instances V1 and V2,
V1.equals(V2) will return true if.

Then I try to search the file content:

    DocumentNodeStore rdb = new DocumentNodeStore(new RDBDocumentNodeStoreBuilder().setRDBConnection(dataSource));
    Repository repo = new Jcr(new Oak(rdb)).with(new OpenSecurityProvider()).createRepository();

    createIndex(repo);

    byte[] rawFile = readBytes("D:\\file.txt");
    saveFileIfNotExist(rawFile, "txt_folder", "text_file", "text/plain", repo);

    Session session = null;
    try {
        session = repo.login();
        Node root = session.getRootNode();
        Node index = root.getNode("oak:index");
        QueryManager queryManager = session.getWorkspace().getQueryManager();

        Query query = queryManager.createQuery("SELECT * FROM [nt:resource] AS s WHERE CONTAINS(s.*, '*so*') option(traversal warn)", Query.JCR_SQL2);

        QueryResult result = query.execute();
        RowIterator ri = result.getRows();
        while (ri.hasNext()) {
            Row row = ri.nextRow();
            System.out.println("Row: " + row.toString());
        }
    } catch (RepositoryException e) {
        e.printStackTrace();
    } finally {
        // Guard against a failed login, which would leave session null.
        if (session != null) {
            session.logout();
        }
        ((RepositoryImpl) repo).shutdown();
        rdb.dispose();
    }
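The `readBytes` helper called in the snippet above is not shown in the question. A minimal sketch, assuming it simply loads the whole file into a byte array; the class name `FileBytes` is hypothetical and only there to make the example compile on its own:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical holder for the readBytes helper that the question calls but does not show. */
final class FileBytes {

    private FileBytes() {
    }

    /** Reads the whole file at the given path into memory (assumed behavior). */
    static byte[] readBytes(String path) {
        try {
            return Files.readAllBytes(Paths.get(path));
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }
}
```

Loading everything into memory is fine for a small test file; for large PDFs it may be preferable to hand an `InputStream` straight to `createBinary` instead.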

But nothing is returned, and the log shows a warning:

2019-10-02 18:27:35,821 [main] WARN  QueryImpl - Traversal query (query without index): SELECT * FROM [nt:resource] AS s WHERE CONTAINS(s.*, '*so*') option(traversal warn); consider creating an index

So how do I set up a correct index and a correct query for searching file contents? And how do I search inside PDF documents?

[Answer 1]:

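I haven't checked all the snippets closely, but one thing that seems to be missing is setting up the async indexer (your index definition has async="async"). I'm typing this off the top of my head, but try something like: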

new Jcr(new Oak(rdb)).with(new OpenSecurityProvider()).withAsyncIndexing("async", 5) // 5 is the period, in seconds, at which the async indexer runs

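By the way, since this is an async index, you need to wait a little before results show up in queries. Even while results are not showing up yet, though, the query should still pick up your index.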
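Because the async indexer runs on a fixed period, a query issued right after `session.save()` can come back empty even when the index is correct. The sketch below shows one way to poll until rows appear or a timeout passes. It is plain Java with no Oak dependency: `AsyncIndexWait`, `awaitRows`, and the `rowCounter` callback are hypothetical names, and the callback is assumed to wrap whatever code runs the JCR query and counts its rows (catching `RepositoryException` itself).

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Hypothetical helper: polls a row-counting callback until it reports rows or a timeout elapses. */
final class AsyncIndexWait {

    private AsyncIndexWait() {
    }

    /**
     * Calls rowCounter repeatedly, sleeping for interval between attempts.
     * Returns the first positive count, or 0 if the timeout passes
     * (or the thread is interrupted) before any rows are seen.
     */
    static long awaitRows(LongSupplier rowCounter, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            long rows = rowCounter.getAsLong();
            if (rows > 0) {
                return rows;
            }
            if (System.nanoTime() - deadline >= 0) {
                return 0;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up early.
                Thread.currentThread().interrupt();
                return 0;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "run the SQL2 query and count rows": empty twice, then one hit.
        int[] calls = {0};
        long rows = awaitRows(() -> ++calls[0] < 3 ? 0L : 1L,
                Duration.ofSeconds(10), Duration.ofMillis(50));
        System.out.println("rows=" + rows + " after " + calls[0] + " attempts");
    }
}
```

With a 5-second indexing period as in the answer, a timeout of a few periods is a reasonable starting point. If the timeout still expires while the log keeps showing the traversal warning, the problem is more likely in the index definition or provider setup than in timing.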

[Comments]:

Thanks. I added the LuceneProvider:

    LuceneIndexProvider provider = new LuceneIndexProvider();
    repository = new Jcr(new Oak(rdb))
            .with(new OpenSecurityProvider())
            .with(new LuceneIndexEditorProvider())
            .with((QueryIndexProvider) provider)
            .withAsyncIndexing("async", 5)
            .createRepository();

Looking at the log, it does try to build the index. But the query result is still empty, and the warning message is still in the log:
