Lucene Analyzers: Using a Chinese Analyzer, Extension Dictionaries, and Stop Words

m635674608

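Stop words: in Lucene, stop words are function words that carry no meaning for retrieval, such as "is", "a", "are", "的", "得", and "我". They occur many times in a sentence but contribute nothing to search, so they are filtered out during tokenization.

Extension dictionary: a list of terms that should not be split apart, so that each one is kept as a single token.

Synonyms: suppose an e-commerce system sells books and provides a search engine. One day the marketing department asks that when a customer searches for "电子" (electronics), the results show not only books related to "电子" but also books related to "机器" (machines). Treating the two terms as synonyms meets this requirement.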
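The synonym requirement above can be sketched as query-term expansion in plain Java. This is only an illustration of the idea, not Lucene's or IKAnalyzer's API: the class name `SynonymExpansionSketch`, the `SYNONYMS` table, and the `expand` method are all hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SynonymExpansionSketch {
    // Hypothetical synonym table: each term maps to extra terms to search for.
    static final Map<String, List<String>> SYNONYMS = Map.of(
            "电子", List.of("机器"));

    /** Returns the query term itself followed by any synonyms configured for it. */
    static Set<String> expand(String term) {
        Set<String> expanded = new LinkedHashSet<>();
        expanded.add(term);
        expanded.addAll(SYNONYMS.getOrDefault(term, List.of()));
        return expanded;
    }

    public static void main(String[] args) {
        System.out.println(expand("电子")); // [电子, 机器]
        System.out.println(expand("图书")); // [图书]
    }
}
```

A search for "电子" would then be run against both expanded terms, while a term with no entry in the table is searched unchanged.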

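1. Common Chinese analyzers include MMAnalyzer (极易分词), the Paoding analyzer (PaodingAnalyzer), IKAnalyzer, and others. Of these, MMAnalyzer and PaodingAnalyzer do not support Lucene 3.0 and later.

   They are all used in much the same way. When constructing the analyzer, write:

     Analyzer analyzer = new [My]Analyzer();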



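2. Only IKAnalyzer is shown here, because at the time of writing it is the only one of the three that supports Lucene 3.0 and later.

   First, add IKAnalyzer3.2.0Stable.jar to the classpath.

   Sample code: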

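import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class AnalyzerTest {
    @Test
    public void test() throws Exception {
        String text = "An IndexWriter creates and maintains an index.";
        /* StandardAnalyzer: tokenizes Chinese text one character at a time */
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        testAnalyzer(analyzer, text);

        String text2 = "测试中文环境下的信息检索";
        testAnalyzer(new IKAnalyzer(), text2); // IKAnalyzer: dictionary-based segmentation
    }

    /**
     * Tokenizes the given text with the given analyzer and prints each term.
     *
     * @param analyzer the analyzer to use
     * @param text     the text to tokenize
     * @throws Exception if tokenization fails
     */
    private void testAnalyzer(Analyzer analyzer, String text) throws Exception {
        System.out.println("Current analyzer: " + analyzer.getClass());

        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        // addAttribute returns the attribute instance, so fetch it once before the loop
        TermAttribute termAttribute = tokenStream.addAttribute(TermAttribute.class);

        while (tokenStream.incrementToken()) {
            System.out.println(termAttribute.term());
        }
        tokenStream.close();
    }
}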


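3. Extending the dictionary: we often need a custom dictionary. For example, we may want the analyzer to recognize "XXX 公司" (XXX company) and emit it as a single token.

   IKAnalyzer makes this easy.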
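To see why adding a term to the dictionary changes the output, here is a minimal sketch of dictionary-based segmentation using forward maximum matching. This is a simplified stand-in chosen for illustration, not IKAnalyzer's actual algorithm; the class `DictionarySegmenterSketch` and its `segment` method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DictionarySegmenterSketch {
    /**
     * Forward maximum matching: at each position take the longest dictionary
     * entry that starts there; fall back to a single character if none matches.
     */
    static List<String> segment(String text, Set<String> dictionary) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "北京XXXX科技有限公司";
        Set<String> dictionary = new HashSet<>(Set.of("北京", "科技", "有限公司"));
        System.out.println(segment(text, dictionary)); // [北京, X, X, X, X, 科技, 有限公司]

        // Adding the full company name, as an extension dictionary would, keeps it whole.
        dictionary.add("北京XXXX科技有限公司");
        System.out.println(segment(text, dictionary)); // [北京XXXX科技有限公司]
    }
}
```

Without the extra entry the company name is broken into pieces; with it, the longest match wins and the name comes out as one token.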

   Create IKAnalyzer.cfg.xml (IKAnalyzer looks for it at the root of the classpath):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
       
       
       <entry key="ext_dict">/mydict.dic</entry>
</properties>



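       Explanation:

               <entry key="ext_dict">/mydict.dic</entry> registers a custom extension dictionary named mydict.dic.

               So we need to create a text file named mydict.dic (the .dic extension is not required).

               Put the following in the file:

                    北京XXXX科技有限公司

               This adds one term.

               To add more terms, put each one on its own line:

                    词汇一

                    词汇二

                    词汇三



               Note that this file must be saved in UTF-8 encoding.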
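One way to avoid encoding mistakes is to generate the dictionary file from code with the charset stated explicitly. The following is a minimal sketch using only the Java standard library; the class `DictionaryFileSketch` and its methods are hypothetical helpers, and the temp file stands in for mydict.dic.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class DictionaryFileSketch {
    /** Writes one term per line, always encoded as UTF-8. */
    static void writeDictionary(Path file, List<String> terms) throws IOException {
        Files.write(file, terms, StandardCharsets.UTF_8);
    }

    /** Reads the terms back as UTF-8, skipping blank lines. */
    static List<String> readDictionary(Path file) throws IOException {
        List<String> terms = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String term = line.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // A temp file is used here; in a real project this would be mydict.dic.
        Path file = Files.createTempFile("mydict", ".dic");
        writeDictionary(file, List.of("词汇一", "词汇二", "词汇三"));
        System.out.println(readDictionary(file)); // [词汇一, 词汇二, 词汇三]
        Files.delete(file);
    }
}
```

Passing `StandardCharsets.UTF_8` explicitly means the result does not depend on the platform default encoding, which is a common cause of dictionary terms that silently fail to match.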

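4. Stop words:

    Some words occur very frequently in text yet contribute almost nothing to the information it carries, for example "a", "an", "the", and "of" in English, or "的", "了", and "着" in Chinese, as well as punctuation marks. Such words are called stop words.

    After tokenization, stop words are usually filtered out and are not indexed. At search time, any stop words in the user's query are filtered out as well, because the query string goes through the same tokenization.

    Excluding stop words speeds up indexing and reduces the size of the index files.

    Defining custom stop words in IKAnalyzer is just as easy and works much like configuring the extension dictionary. Just add the following entry to IKAnalyzer.cfg.xml:

       <entry key="ext_stopwords">/ext_stopword.dic</entry>

       This entry likewise points to a text file, /ext_stopword.dic (any file extension works), in the following format:

          也

          了

          仍

          从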
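The effect of such a stop-word file can be sketched in plain Java: parse one word per line into a set, then drop matching tokens after segmentation. This only illustrates the filtering step and is not IKAnalyzer's internal implementation; `StopWordFilterSketch`, `parseStopWords`, and `removeStopWords` are hypothetical names, and the token list is assumed to come from some earlier segmentation step.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilterSketch {
    /** Parses stop-word file contents: one word per line, blank lines ignored. */
    static Set<String> parseStopWords(List<String> lines) {
        Set<String> stopWords = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    /** Returns the tokens in order, without those that appear in the stop-word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Same entries as the sample stop-word file, including a blank line.
        Set<String> stopWords = parseStopWords(List.of("也", "", "了", "仍", "从"));
        List<String> tokens = List.of("他", "也", "来", "了");
        System.out.println(removeStopWords(tokens, stopWords)); // [他, 来]
    }
}
```

Because the same filter is applied to indexed text and to query strings, a stop word in a query simply disappears instead of failing to match.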