Lucene Index Data Exception

2020-04-02 17:48:02

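I recently found that this site's full-text search had stopped working. Debugging locally showed that it threw a "TokenStream contract violation" exception. This was puzzling: the feature had worked before, I had not changed it, and my own code does not use TokenStream directly. The exception is as follows: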

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: 
    reset()/close() call missing, reset() called multiple times, or subclass does not call
    super.reset(). Please see Javadocs of TokenStream class for more information about the 
correct consuming workflow.
	at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
	at java.io.BufferedReader.fill(BufferedReader.java:154)
	at java.io.BufferedReader.read(BufferedReader.java:175)
	at java.io.FilterReader.read(FilterReader.java:65)
	at java.io.PushbackReader.read(PushbackReader.java:90)
	at com.chenlb.mmseg4j.MMSeg.readNext(MMSeg.java:42)
	at com.chenlb.mmseg4j.MMSeg.next(MMSeg.java:64)
	at com.chenlb.mmseg4j.analysis.MMSegTokenizer.incrementToken(MMSegTokenizer.java:64)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:102)
	at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1520)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1190)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1171)
	at blog.test.TxtFileIndexer.main(TxtFileIndexer.java:52)

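I checked the code carefully and confirmed that this part really had not changed recently. So why was search failing? Inspecting the Maven dependencies revealed the cause: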

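I had been using Lucene 4.6. After switching to Maven for dependency management and adding the mmseg4j-analysis-1.9.1 analyzer, the Lucene 4.3 packages were pulled in by default as transitive dependencies, and this version mix caused Lucene to throw the exception during tokenization. The fix looked simple: exclude the redundant Lucene packages: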

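<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-analysis</artifactId>
    <version>1.9.1</version>
    <exclusions>
        <exclusion>
            <artifactId>lucene-core</artifactId>
            <groupId>org.apache.lucene</groupId>
        </exclusion>
        <exclusion>
            <artifactId>lucene-queryparser</artifactId>
            <groupId>org.apache.lucene</groupId>
        </exclusion>
    </exclusions>
</dependency>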
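
To check which jar a class is actually loaded from when two Lucene versions may be on the classpath, a small diagnostic like the following can help. This is only a sketch: the class name WhichJar and the helper locationOf are made up for illustration, and the two class names queried in main are taken from the stack trace above. Run it with the application's classpath.

```java
import java.security.CodeSource;

class WhichJar {
    // Returns the jar (or directory) a class is loaded from, or a short
    // message when the class cannot be found or has no code source.
    static String locationOf(String className) {
        try {
            // initialize=false: look the class up without running static initializers
            Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            return src == null ? "no code source (JDK class)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found or not loadable";
        }
    }

    public static void main(String[] args) {
        // Class names as they appear in the stack trace
        String[] names = {
            "org.apache.lucene.index.IndexWriter",
            "com.chenlb.mmseg4j.analysis.MMSegTokenizer"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

If the printed location for IndexWriter is not the Lucene version you expect, the exclusion has not taken effect yet.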

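However, the error persisted after this change, so I had to dig deeper. It turned out that the MMSeg4j 1.9.1 tokenizer has a bug. Looking at the source: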

MMSegTokenizer.java

public void reset() throws IOException {
    // Lucene 4.6
    // org.apache.lucene.analysis.Tokenizer.setReader(Reader)
    // setReader is called automatically, and input is set automatically.
    super.reset();   // add this line
    mmSeg.reset(input);
}
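
Why the missing super.reset() produces this particular exception can be illustrated with a simplified, self-contained model. This is not the Lucene source: the class names here (ResetContractDemo, BaseTokenizer, BuggyTokenizer, FixedTokenizer) are hypothetical. It assumes the base tokenizer keeps a guard reader in `input` until reset() swaps in the real one, which is consistent with the `Tokenizer$1.read` frame at the top of the stack trace.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ResetContractDemo {
    // Guard reader: any read before the base reset() has run fails loudly.
    static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override public int read(char[] buf, int off, int len) {
            throw new IllegalStateException(
                "TokenStream contract violation: reset()/close() call missing");
        }
        @Override public void close() {}
    };

    // Simplified stand-in for the base tokenizer (an assumption, not Lucene code).
    abstract static class BaseTokenizer {
        protected Reader input = ILLEGAL_STATE_READER;
        private Reader inputPending = ILLEGAL_STATE_READER;
        protected Reader segmenterInput; // what the segmenter reads from

        final void setReader(Reader reader) { inputPending = reader; }

        // Only here does the real reader become visible as `input`.
        void reset() throws IOException {
            input = inputPending;
            inputPending = ILLEGAL_STATE_READER;
        }

        final int readChar() throws IOException { return segmenterInput.read(); }
    }

    // Mirrors the bug: reset() hands `input` on without calling super.reset().
    static class BuggyTokenizer extends BaseTokenizer {
        @Override void reset() throws IOException { segmenterInput = input; }
    }

    // Mirrors the fix: super.reset() first, then hand `input` on.
    static class FixedTokenizer extends BaseTokenizer {
        @Override void reset() throws IOException {
            super.reset();
            segmenterInput = input;
        }
    }

    static String consume(BaseTokenizer tokenizer, String text) {
        try {
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            return "read '" + (char) tokenizer.readChar() + "'";
        } catch (IllegalStateException | IOException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + consume(new BuggyTokenizer(), "lucene"));
        System.out.println("fixed: " + consume(new FixedTokenizer(), "lucene"));
    }
}
```

In this model the buggy variant still reads from the guard reader and reports the contract violation, while the fixed variant reads the first character of the text.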

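All that is needed is the super.reset() call. This is a bug in MMSeg4j that surfaces with newer Lucene versions, but MMSeg4j was no longer being updated, so no newer release was available. I therefore had to download the MMSeg4j source, rebuild it, package it as a local jar, and add it to the Maven dependencies, as follows: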

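<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-analysis</artifactId>
    <version>1.9.2</version>
    <scope>system</scope>
    <systemPath>D:MyJavaBlogV3srcmainwebappWEB-INFlibmmseg4j-analysis-1.9.2.jar</systemPath>
</dependency>
<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-core</artifactId>
    <version>1.9.1</version>
    <scope>system</scope>
    <systemPath>D:MyJavaBlogV3srcmainwebappWEB-INFlibmmseg4j-core-1.9.1.jar</systemPath>
</dependency>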

After that, rebuilding, repackaging, and redeploying the application resolved the problem.
