Jackrabbit Oak Lucene index and SQL2 query for full-text search in txt and pdf
Posted: 2020-01-31 22:41:55
Question: I am trying to implement full-text search over file content using Oak 1.16.0.
I tried to create an index that indexes all properties, as described in the Oak documentation:
/oak:index/assetType
  - jcr:primaryType = "oak:QueryIndexDefinition"
  - type = "lucene"
  - compatVersion = 2
  - async = "async"
  + indexRules
    - jcr:primaryType = "nt:unstructured"
    + nt:base
      + properties
        - jcr:primaryType = "nt:unstructured"
        + allProps
          - name = ".*"
          - isRegexp = true
          - nodeScopeIndex = true
Here is how I create the index. I tried different combinations of node types, and nothing worked.
public static void createIndex(Repository repository) {
    Session session = null;
    try {
        session = repository.login();
        Node root = session.getRootNode();
        Node index = root.getNode("oak:index");
        Node lucineIndex = index.addNode("assetType", "oak:QueryIndexDefinition");
        // Stored as the string "2" here; the definition above shows the number 2.
        lucineIndex.setProperty("compatVersion", "2");
        lucineIndex.setProperty("type", "lucene");
        lucineIndex.setProperty("async", "async");
        Node rules = lucineIndex.addNode("indexRules", "nt:unstructured");
        Node base = rules.addNode("nt:base");
        Node properties = base.addNode("properties", "nt:unstructured");
        Node allProps = properties.addNode("allProps");
        // The definition above calls this property "name", not "jcr:content".
        allProps.setProperty("jcr:content", ".*");
        allProps.setProperty("isRegexp", true);
        allProps.setProperty("nodeScopeIndex", true);
        session.save();
    } catch (LoginException e) {
        e.printStackTrace();
    } catch (RepositoryException e) {
        e.printStackTrace();
    } finally {
        // Guard against a failed login, which leaves session null.
        if (session != null) {
            session.logout();
        }
    }
}
Adding some files:
public static void saveFileIfNotExist(byte[] rawFile, String fileName, String folderName, String mimeType, Repository repository) {
    Session session = null;
    try {
        session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
        Node root = session.getRootNode();
        Binary binary = session.getValueFactory().createBinary(new ByteArrayInputStream(rawFile));
        if (!root.hasNode(folderName)) {
            System.out.println("NO FOLDER");
            // Standard nt:folder / nt:file / jcr:content (nt:resource) layout.
            Node folder = root.addNode(folderName, "nt:folder");
            Node file = folder.addNode(fileName, "nt:file");
            Node content = file.addNode("jcr:content", "nt:resource");
            content.setProperty("jcr:mimeType", mimeType);
            content.setProperty("jcr:data", binary);
        } else {
            System.out.println("FOLDER EXIST");
        }
        session.save();
    } catch (RepositoryException e) {
        e.printStackTrace();
    } finally {
        // Guard against a failed login, which leaves session null.
        if (session != null) {
            session.logout();
        }
    }
}
File content:
An implementation of the Value interface must override the inherited method
Object.equals(Object) so that, given Value instances V1 and V2,
V1.equals(V2) will return true if.
Trying to search the file content:
DocumentNodeStore rdb = new DocumentNodeStore(new RDBDocumentNodeStoreBuilder().setRDBConnection(dataSource));
Repository repo = new Jcr(new Oak(rdb)).with(new OpenSecurityProvider()).createRepository();
createIndex(repo);
byte[] rawFile = readBytes("D:\\file.txt");
saveFileIfNotExist(rawFile, "txt_folder", "text_file", "text/plain", repo);
Session session = null;
try {
    session = repo.login();
    Node root = session.getRootNode();
    Node index = root.getNode("oak:index");
    QueryManager queryManager = session.getWorkspace().getQueryManager();
    Query query = queryManager.createQuery("SELECT * FROM [nt:resource] AS s WHERE CONTAINS(s.*, '*so*') option(traversal warn)", Query.JCR_SQL2);
    QueryResult result = query.execute();
    RowIterator ri = result.getRows();
    while (ri.hasNext()) {
        Row row = ri.nextRow();
        System.out.println("Row: " + row.toString());
    }
} catch (RepositoryException e) {
    e.printStackTrace();
} finally {
    if (session != null) {
        session.logout();
    }
    ((RepositoryImpl) repo).shutdown();
    rdb.dispose();
}
But nothing is returned, and the log contains this warning:
2019-10-02 18:27:35,821 [main] WARN QueryImpl - Traversal query (query without index): SELECT * FROM [nt:resource] AS s WHERE CONTAINS(s.*, '*so*') option(traversal warn); consider creating an index
So, how do I define the right index and write the right query to search file content?
And how do I search inside PDF documents?
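Answer 1: I haven't gone through all the snippets carefully, but one thing that seems to be missing is the setup of the async indexer (your index definition has async="async"). I'm just typing this off the top of my head, but do something like:
new Jcr(new Oak(rdb)).with(new OpenSecurityProvider()).withAsyncIndexing("async", 5) // 5 is the period, in seconds, at which the async indexer runs
By the way, because this is an async index, you need to wait a little before results show up in queries. Even before the results show up, though, the query should still pick up your index.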
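Because the async indexer runs on a schedule, code that queries right after saving may see no rows yet. The following is a minimal sketch, in plain Java, of polling until a condition holds or a timeout elapses. `AsyncIndexWait` and `waitFor` are hypothetical names introduced for illustration and are not part of Oak; the condition passed in would be your own code that runs the query and checks whether any rows came back.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Oak): polls a condition until it holds or a timeout elapses. */
final class AsyncIndexWait {

    private AsyncIndexWait() {
    }

    /**
     * Re-evaluates the condition every interval until it returns true
     * or the timeout has elapsed.
     *
     * @return true if the condition held before the timeout, false otherwise
     */
    static boolean waitFor(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition holding
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

With the 5-second indexing period from the answer, a call such as `AsyncIndexWait.waitFor(() -> queryHasRows(session), Duration.ofSeconds(30), Duration.ofSeconds(1))` would give the indexer several cycles to catch up, where `queryHasRows` is a hypothetical method of your own that executes the query and returns `getRows().hasNext()`.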
Comments:
Thanks. I added the Lucene provider:
LuceneIndexProvider provider = new LuceneIndexProvider();
repository = new Jcr(new Oak(rdb))
    .with(new OpenSecurityProvider())
    .with(new LuceneIndexEditorProvider())
    .with((QueryIndexProvider) provider)
    .withAsyncIndexing("async", 5)
    .createRepository();
Judging by the log, it now tries to build the index. But the query result is still empty, and the warning message is still in the log.