Unable to index or extract documents (.pdf, .doc) remotely

I am using Solr 3.1, Apache Tika 0.9, and SolrNet 0.3.1 to index documents such as .doc and .pdf files.

I have successfully extracted and indexed a document locally with this code:

Startup.Init<Article>("http://k9server:8080/solr");
ISolrOperations<Article> solr = ServiceLocator.Current.GetInstance<ISolrOperations<Article>>();
string filecontent = null;
// Send the file to Solr's extracting handler and get the plain text back without indexing it
using (var file = File.OpenRead(@"D:\solr.doc")) {
    var response = solr.Extract(new ExtractParameters(file, "abcd1") {
        ExtractOnly = true,
        ExtractFormat = ExtractFormat.Text,
    });
    filecontent = response.Content;
}
// Index the extracted text together with the other fields
solr.Add(new Article() {
    ID = "36",
    EMAIL = "1234",
    COMMENTS = filecontent,
    PRO_ID = 256
});
// commit to the index
solr.Commit();

The problem is that when I use the same code to extract or index a document against the remote server, I get this error:

The remote server returned an error: (500) Internal Server Error. 
SolrNet.Exceptions.SolrConnectionException was unhandled

Message

Apache Tomcat/6.0.32 - Error report HTTP Status 500 - org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;

java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:65)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:57)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:164)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:196)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:238)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:864)
    at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
    at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1665)
    at java.lang.Thread.run(Unknown Source)

Description

The server encountered an internal error (org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;

java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:65)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:57)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:164)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:196)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:238)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:864)
    at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
    at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1665)
    at java.lang.Thread.run(Unknown Source)
) that prevented it from fulfilling this request.
  Source=SolrNet
  StackTrace:
       at SolrNet.Impl.SolrConnection.PostStream(String relativeUrl, String contentType, Stream content, IEnumerable`1 parameters)
       at SolrNet.Commands.ExtractCommand.Execute(ISolrConnection connection)
       at SolrNet.Impl.SolrBasicServer`1.Send(ISolrCommand cmd)
       at SolrNet.Impl.SolrBasicServer`1.SendAndParseExtract(ISolrCommand cmd)
       at SolrNet.Impl.SolrBasicServer`1.Extract(ExtractParameters parameters)
       at SolrNet.Impl.SolrServer`1.Extract(ExtractParameters parameters)
       at SolrNetSample.Program.Main(String[] args) in E:\TestProject\SolrNetSample\SolrNetSample\SolrNetSample\Program.cs:line 38
       at System.AppDomain._nExecuteAssembly(Assembly assembly, String[] args)
       at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException: System.Net.WebException
       Message=The remote server returned an error: (500) Internal Server Error.
       Source=System
       StackTrace:
            at System.Net.HttpWebRequest.GetResponse()
            at HttpWebAdapters.Adapters.HttpWebRequestAdapter.GetResponse()
            at SolrNet.Impl.SolrConnection.GetResponse(IHttpWebRequest request)
            at SolrNet.Impl.SolrConnection.PostStream(String relativeUrl, String contentType, Stream content, IEnumerable`1 parameters)
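Answer

Your remote server has two different versions of Apache POI on the classpath, which is why you are getting the exception you see. A java.lang.NoSuchMethodError at runtime means the POIFSFileSystem class that was actually loaded comes from a version of POI that lacks the getRoot() signature Tika was compiled against.

You should remove the older version of POI and keep the newer JAR that ships with Solr / Tika. If you cannot find it, see the POI FAQ for how to identify the extra JAR.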
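
One common way to identify the extra JAR is to ask the JVM where it loaded the offending class from. The following is a minimal sketch of that idea, not code from the original answer: the class name `ClassOrigin` and the method `locate` are hypothetical, and the check is only meaningful if it runs on the same classpath as the Solr web application (for example, from a small test page deployed in the same Tomcat webapp).

```java
import java.net.URL;
import java.security.CodeSource;

/** Reports where the JVM loaded a class from, to help spot a stale JAR on the classpath. */
final class ClassOrigin {

    /** Returns the location (JAR or directory) a class was loaded from, or a note if unknown. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source != null && source.getLocation() != null) {
            return source.getLocation().toString();
        }
        // JDK classes usually have no CodeSource, so fall back to a resource lookup.
        String resource = "/" + type.getName().replace('.', '/') + ".class";
        URL url = type.getResource(resource);
        return url != null ? url.toString() : "(unknown origin)";
    }

    /** Same lookup by class name, so it also reports classes that cannot be loaded at all. */
    static String locate(String className) {
        try {
            return locate(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // The class named in the stack trace above.
        System.out.println(locate("org.apache.poi.poifs.filesystem.POIFSFileSystem"));
    }
}
```

If the printed location is not the POI JAR that came with the Solr distribution, that JAR is the likely candidate for removal.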

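Another answer

If it works against your local Solr instance but not against the other one, the other instance is probably not configured correctly.

Judging from the stack trace, the POI library on that server is incorrect (probably the wrong version). Make sure you copy all the Tika JARs from the Solr 3.1.0 distribution.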
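
To check whether a server carries more than one copy of the same library, a short scan of its library directories can help. The sketch below is an illustration under stated assumptions, not part of the original answer: it assumes JAR files are named `<artifact>-<version>.jar` with the version starting with a digit, the class name `DuplicateJarFinder` is hypothetical, and the directory to scan (for example the Tomcat and Solr lib folders) has to be supplied by you.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Lists artifacts that appear in more than one JAR file under a directory tree. */
final class DuplicateJarFinder {

    // Assumed naming convention: <artifact>-<version>.jar, version starting with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    /** Maps each artifact name to its JAR paths, keeping only artifacts found two or more times. */
    static Map<String, List<String>> findDuplicates(Path root) throws IOException {
        Map<String, List<String>> byArtifact = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).sorted().forEach(p -> {
                Matcher m = JAR_NAME.matcher(p.getFileName().toString());
                if (m.matches()) {
                    byArtifact.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(p.toString());
                }
            });
        }
        // Drop artifacts that occur only once; those cannot conflict with themselves.
        byArtifact.values().removeIf(jars -> jars.size() < 2);
        return byArtifact;
    }

    public static void main(String[] args) throws IOException {
        // Pass the directory to scan as the first argument; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findDuplicates(root).forEach((artifact, jars) ->
                System.out.println(artifact + " -> " + jars));
    }
}
```

Any `poi` entry in the output that lists more than one file points at the conflict described above. The scan only compares file names, so a JAR that was renamed or that bundles POI classes inside it would not be detected this way.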
