Analysis in Lucene

Where analysis is used

Lucene provides the analysis module to convert text into index files and to prepare query strings for the IndexSearcher; for Lucene, both indexing and searching operate on plain-text input. With the right libraries we can read documents in many formats, such as HTML, XML, PDF, Word, and TXT; all that needs to be handed to Lucene is the plain text they contain.

Tokenization in Lucene

Indexing and searching in Lucene both rest on analyzing the text content and splitting it into terms. Suppose a document contains the sentence "Hello World, Welcome to China" and we want to find it, but the user's query is far less specific (perhaps just "hello"). For this to work, Lucene must split the document's content into terms at indexing time and map each term back to the text it came from. Sometimes a simple split is not enough and the string needs deeper analysis; Lucene can preprocess not only the content being indexed but also the query parameters.

Using an Analyzer

Basic use of a Lucene analyzer looks like this:

```java
package com.lucene.analysis;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.junit.Test;

public class AnalysisTest {
    @Test
    public void tokenTest() {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = null;
        try {
            ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
            // Attributes must be requested before the stream is consumed.
            OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
            ts.reset(); // mandatory before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println("token: " + ts.reflectAsString(true));
                System.out.println("token start offset: " + offsetAtt.startOffset());
                System.out.println("token end offset: " + offsetAtt.endOffset());
            }
            ts.end(); // records the final offset state
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (ts != null) { // tokenStream() may have thrown before ts was assigned
                try {
                    ts.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
```
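The claim above, that the same analysis applies to indexing and to querying, can be made concrete with a small sketch. This example is not from the original text; it assumes a Lucene 5.x/6.x classpath with the queryparser module (matching the code in this article), and the field name "content" is our own choice:

```java
package com.lucene.analysis;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class AnalyzerUsageSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        Directory dir = new RAMDirectory(); // in-memory index, fine for a demo

        // Indexing side: the analyzer tokenizes the field text before it is written.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("content", "Hello World, Welcome to China", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search side: the query string goes through the same analyzer.
        Query query = new QueryParser("content", analyzer).parse("hello");
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        System.out.println("hits: " + searcher.search(query, 10).totalHits);
    }
}
```

Because the query string passes through the same StandardAnalyzer as the document text, the lower-cased query term "hello" matches the term produced from "Hello" in the indexed sentence.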
Writing a custom Analyzer and your own analysis module

1. To implement your own analyzer, extend Analyzer and override its tokenization logic.
2. Maintain a stop-word dictionary.
3. Override the createComponents method (which returns a TokenStreamComponents), pick a suitable tokenizer, and filter the tokens it produces.

Sample code:

```java
package com.lucene.analysis.self;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseTokenizer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class MyAnalyzer extends Analyzer {
    private CharArraySet stopWordSet;

    public CharArraySet getStopWordSet() {
        return stopWordSet;
    }

    public void setStopWordSet(CharArraySet stopWordSet) {
        this.stopWordSet = stopWordSet;
    }

    public MyAnalyzer() {
        super();
        this.stopWordSet = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
    }

    public MyAnalyzer(String[] stops) {
        this();
        // ENGLISH_STOP_WORDS_SET is unmodifiable, so copy it before adding our own words.
        this.stopWordSet = new CharArraySet(stopWordSet, true);
        stopWordSet.addAll(StopFilter.makeStopSet(stops));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Lower-case while tokenizing, then drop any token found in the stop set.
        Tokenizer source = new LowerCaseTokenizer();
        return new TokenStreamComponents(source, new StopFilter(source, stopWordSet));
    }

    public static void main(String[] args) {
        Analyzer analyzer = new MyAnalyzer();
        String words = "A AN yuyu";
        TokenStream stream = null;
        try {
            stream = analyzer.tokenStream("myfield", words);
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(termAtt.toString());
            }
            stream.end();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (stream != null) {
                try {
                    stream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
```
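As a quick follow-up (our own usage snippet, not from the original text), the String[] constructor above lets a caller extend the default English stop set:

```java
// "yuyu" is just an illustrative extra stop word, not a standard one.
Analyzer custom = new MyAnalyzer(new String[] { "yuyu" });
// With "yuyu" also treated as a stop word, analyzing "A AN yuyu"
// would emit no tokens at all.
```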
Running it prints:

```
yuyu
```

This shows that the analyzer filtered out "a" and "an", both of which are in stopWordSet.

Adding a token-length filter

Sometimes we need to filter short tokens out of a string. Given "Welcome to BeiJing", for instance, we might drop every token shorter than three characters, so the expected result becomes "Welcome BeiJing". All we need to do is provide another createComponents implementation; the relevant code is as follows:

```java
package com.lucene.analysis.self;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LengthFilterAnalyzer extends Analyzer {
    private int len; // minimum token length to keep

    public int getLen() {
        return len;
    }

    public void setLen(int len) {
        this.len = len;
    }

    public LengthFilterAnalyzer() {
        super();
    }

    public LengthFilterAnalyzer(int len) {
        super();
        this.len = len;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split on whitespace, then keep only tokens of at least len characters.
        final Tokenizer source = new WhitespaceTokenizer();
        TokenStream result = new LengthFilter(source, len, Integer.MAX_VALUE);
        return new TokenStreamComponents(source, result);
    }

    public static void main(String[] args) {
        Analyzer analyzer = new LengthFilterAnalyzer(2);
        String words = "I am a java coder";
        TokenStream stream = null;
        try {
            stream = analyzer.tokenStream("myfield", words);
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(termAtt.toString());
            }
            stream.end();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (stream != null) {
                try {
                    stream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
```
The program prints:

```
am
java
coder
```

showing that tokens shorter than two characters ("I" and "a") have been filtered out.
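Token filters of this kind compose freely: each filter wraps the TokenStream produced by the previous one. As a closing sketch (our own example, not from the original text, reusing the imports shown in MyAnalyzer above), a createComponents that lower-cases, removes stop words, and then drops short tokens could look like this:

```java
@Override
protected TokenStreamComponents createComponents(String fieldName) {
    // Lower-case while tokenizing, then remove English stop words...
    Tokenizer source = new LowerCaseTokenizer();
    TokenStream result = new StopFilter(source, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    // ...then drop tokens shorter than 2 characters.
    result = new LengthFilter(result, 2, Integer.MAX_VALUE);
    return new TokenStreamComponents(source, result);
}
```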