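Recently I noticed that full-text search on this site had stopped working. Debugging locally, I found it was throwing a "TokenStream contract violation" exception. That was odd: the feature had worked fine before, I had not changed it, and my own code never uses TokenStream directly. The exception was: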
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation:
reset()/close() call missing, reset() called multiple times, or subclass does not call
super.reset(). Please see Javadocs of TokenStream class for more information about the
correct consuming workflow.
at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.read(BufferedReader.java:175)
at java.io.FilterReader.read(FilterReader.java:65)
at java.io.PushbackReader.read(PushbackReader.java:90)
at com.chenlb.mmseg4j.MMSeg.readNext(MMSeg.java:42)
at com.chenlb.mmseg4j.MMSeg.next(MMSeg.java:64)
at com.chenlb.mmseg4j.analysis.MMSegTokenizer.incrementToken(MMSegTokenizer.java:64)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:102)
at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1520)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1190)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1171)
at blog.test.TxtFileIndexer.main(TxtFileIndexer.java:52)
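I went through the code carefully and confirmed that nothing in this area had changed recently. So why was search failing? Checking the Maven dependencies revealed a likely cause: I had previously been using Lucene 4.6, but after moving the project to Maven, adding the mmseg4j-analysis 1.9.1 analyzer also pulled in its default Lucene 4.3 dependency. That version mix seemed to explain why Lucene threw the exception during tokenization. The fix looked simple: exclude the extra Lucene artifacts: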
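<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-analysis</artifactId>
    <version>1.9.1</version>
    <exclusions>
        <exclusion>
            <artifactId>lucene-core</artifactId>
            <groupId>org.apache.lucene</groupId>
        </exclusion>
        <exclusion>
            <artifactId>lucene-queryparser</artifactId>
            <groupId>org.apache.lucene</groupId>
        </exclusion>
    </exclusions>
</dependency>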
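To check which jar a class is really loaded from once the exclusions are in place, a small probe like the one below can help. It is only a sketch: `ClasspathProbe` and its methods are hypothetical names of mine, and the default class name is simply the one that appears in the stack trace. It uses nothing beyond the standard `ClassLoader.getResources` call.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClasspathProbe {

    /** Converts a fully qualified class name into its classpath resource path. */
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Lists every classpath location that provides the given class. */
    static List<URL> locate(String className) {
        ClassLoader loader = ClasspathProbe.class.getClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            // Treat an unreadable classpath entry as "nothing found".
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace; pass another name as an argument.
        String name = args.length > 0 ? args[0] : "org.apache.lucene.analysis.Tokenizer";
        List<URL> hits = locate(name);
        System.out.println(name + " found in " + hits.size() + " location(s)");
        for (URL url : hits) {
            System.out.println("  " + url);
        }
    }
}
```

Run it with the same classpath as the application. More than one location, or a jar whose file name carries an unexpected version, suggests that a second copy of Lucene is still being pulled in.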
After this change, however, the error persisted, so I had to dig deeper. It turns out the tokenizer in MMSeg4j 1.9.1 has a bug. Looking at the source:
MMSegTokenizer.java
public void reset() throws IOException {
    // Lucene 4.6:
    // org.apache.lucene.analysis.Tokenizer.setReader(Reader)
    // setReader is called automatically, and input is set automatically.
    super.reset(); // add this line
    mmSeg.reset(input);
}
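The stack trace ends in `Tokenizer$1.read`, an anonymous reader inside `Tokenizer`. A plausible reading is that the base class keeps a guard reader in `input` and only swaps in the real one inside `reset()`. The sketch below models that idea with plain JDK classes. Every class in it is a hypothetical stand-in of mine, not Lucene's or MMSeg4j's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ResetContractDemo {

    /** Guard reader: any read fails until the real reader has been swapped in. */
    static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override
        public int read(char[] cbuf, int off, int len) {
            throw new IllegalStateException("contract violation: reset() call missing");
        }

        @Override
        public void close() {
        }
    };

    /** Hypothetical, heavily simplified stand-in for a Tokenizer base class. */
    abstract static class BaseTokenizer {
        protected Reader input = ILLEGAL_STATE_READER;
        private Reader inputPending = ILLEGAL_STATE_READER;

        final void setReader(Reader reader) {
            inputPending = reader; // not visible to subclasses until reset()
        }

        void reset() throws IOException {
            input = inputPending; // the swap a subclass skips if it omits super.reset()
            inputPending = ILLEGAL_STATE_READER;
        }

        abstract int readChar() throws IOException;
    }

    /** Mirrors the buggy pattern: uses input without calling super.reset(). */
    static class BrokenTokenizer extends BaseTokenizer {
        private Reader source = ILLEGAL_STATE_READER;

        @Override
        void reset() throws IOException {
            source = input; // still the guard reader
        }

        @Override
        int readChar() throws IOException {
            return source.read();
        }
    }

    /** Mirrors the fix: call super.reset() first, then use input. */
    static class FixedTokenizer extends BaseTokenizer {
        private Reader source = ILLEGAL_STATE_READER;

        @Override
        void reset() throws IOException {
            super.reset();
            source = input; // now the real reader
        }

        @Override
        int readChar() throws IOException {
            return source.read();
        }
    }

    /** Feeds text through a tokenizer and reports the first character or the failure. */
    static String firstChar(BaseTokenizer tokenizer, String text) {
        try {
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            return String.valueOf((char) tokenizer.readChar());
        } catch (IllegalStateException | IOException e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChar(new FixedTokenizer(), "abc"));
        System.out.println(firstChar(new BrokenTokenizer(), "abc"));
    }
}
```

In this model the fixed variant reads the first character normally, while the broken one fails on its first read, which matches the way the real exception only surfaces once `incrementToken` starts pulling characters.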
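The fix really is just to call super.reset(). This is a bug in MMSeg4j that surfaces with newer Lucene versions, but MMSeg4j has not been updated since, so no newer release is available. You therefore need to download the MMSeg4j source, rebuild it, package it as a local jar, and add that jar to the Maven dependencies, as follows: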
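<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-analysis</artifactId>
    <version>1.9.2</version>
    <scope>system</scope>
    <systemPath>D:\MyJavaBlogV3\src\main\webapp\WEB-INF\lib\mmseg4j-analysis-1.9.2.jar</systemPath>
</dependency>
<dependency>
    <groupId>com.chenlb.mmseg4j</groupId>
    <artifactId>mmseg4j-core</artifactId>
    <version>1.9.1</version>
    <scope>system</scope>
    <systemPath>D:\MyJavaBlogV3\src\main\webapp\WEB-INF\lib\mmseg4j-core-1.9.1.jar</systemPath>
</dependency>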
After that, rebuild, package, and redeploy, and the problem is solved.