A Detailed Look at Full-Text Search Engines in Python

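Lately I have been exploring how to implement Baidu-style keyword search in Python. Keyword search naturally brings regular expressions to mind: they are the basic tool for matching text, and Python ships the re module for exactly that purpose. Regular expressions alone, however, do not make a good search feature, because they rescan the text on every query and provide neither an index nor relevance ranking.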
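To make that limitation concrete, here is a minimal sketch of keyword search done with re alone. The `docs` dictionary and the `regex_search` helper are hypothetical names made up for this illustration. Every query scans every document, and the matches come back in corpus order with no notion of relevance.

```python
import re

# Hypothetical in-memory corpus; a real one would be read from files.
docs = {
    "a.txt": "Python makes full-text search approachable.",
    "b.txt": "Regular expressions match patterns in text.",
    "c.txt": "Search engines rank results; a regex scan does not.",
}


def regex_search(pattern, corpus):
    """Return the names of documents matching pattern, in corpus order."""
    compiled = re.compile(pattern, re.IGNORECASE)
    # Each query walks the full text of every document.
    return [name for name, text in corpus.items() if compiled.search(text)]


matches = regex_search(r"search", docs)
print(matches)
```

This works for a handful of short texts, but the cost grows with the total size of the corpus on every single query.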

Python has a package called whoosh that is built specifically for full-text search.
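The core idea behind a full-text engine is the inverted index: a mapping from each term to the documents that contain it, built once and then consulted per query. The toy version below is only a conceptual sketch in plain Python, not how whoosh is implemented internally; `tokenize`, `build_index` and `lookup` are hypothetical helpers, and the whitespace tokenizer is only suitable for English.

```python
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(corpus):
    """Map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in corpus.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def lookup(index, query):
    """Return the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "a.txt": "python full text search",
    "b.txt": "python regular expressions",
    "c.txt": "full text indexing",
}
index = build_index(docs)
hits = lookup(index, "full text")
print(sorted(hits))
```

A query now touches only the entries for its own terms instead of rescanning every document.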

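whoosh is not widely used in China, and its performance is less mature than that of sphinx/coreseek. Unlike those two, however, it is a pure-Python library, which makes it more convenient for Python enthusiasts. The concrete code follows.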

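Installation

Run the following on the command line: pip install whoosh

The tokenizer in this article also calls jieba, which can be installed the same way: pip install jieba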

The required imports are:

import os

import jieba
from whoosh.index import create_in, open_dir
from whoosh.fields import *  # provides Schema, TEXT and ID
from whoosh.analysis import Tokenizer, Token

Chinese tokenizer


class ChineseTokenizer(Tokenizer):
    """Whoosh tokenizer that segments Chinese text with jieba."""

    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=True, removestops=True, start_pos=0, start_char=0,
                 mode='', **kwargs):
        assert isinstance(value, str), "%r is not unicode" % value
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        # jieba.tokenize yields (word, start, end) tuples, so a word that
        # occurs more than once gets the offsets of each occurrence.
        for w, start, end in jieba.tokenize(value, mode='search'):
            t.original = t.text = w
            t.boost = 0.5  # weight applied to every token (whoosh defaults to 1.0)
            if positions:
                t.pos = start_pos + start
            if chars:
                t.startchar = start_char + start
                t.endchar = start_char + end
            yield t


def chinese_analyzer():
    # A whoosh tokenizer can be passed directly as a field's analyzer.
    return ChineseTokenizer()
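A tokenizer used for indexing has to report where each token sits in the text, not only what the token is. The sketch below shows that contract with a dependency-free stand-in: an overlapping two-character (bigram) splitter, which is an assumption made for this illustration and not what jieba does. `bigram_tokens` is a hypothetical helper.

```python
def bigram_tokens(text):
    """Yield (token, start, end) for each overlapping two-character window."""
    for start in range(len(text) - 1):
        yield text[start:start + 2], start, start + 2


tokens = list(bigram_tokens("全文检索"))
print(tokens)

# A repeated word shows why offsets should come from the tokenizer itself:
# str.find always reports the first occurrence.
repeated = list(bigram_tokens("搜索搜索"))
first_hit = "搜索搜索".find("搜索")
print(repeated[0], repeated[2], first_hit)
```

The second and third prints show the same token text at two different offsets, while a lookup with str.find can only ever return the first one.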

Function that builds the index


def create_index(document_dir):
    analyzer = chinese_analyzer()
    # The field name is spelled "titel"; the search function must use the same name.
    schema = Schema(titel=TEXT(stored=True, analyzer=analyzer), path=ID(stored=True),
                    content=TEXT(stored=True, analyzer=analyzer))
    ix = create_in("./", schema)  # index files are written to the current directory
    writer = ix.writer()
    for parent, dirnames, filenames in os.walk(document_dir):
        for filename in filenames:
            title = filename.replace(".txt", "")
            print(title)
            # Join with the directory os.walk is visiting so files in subdirectories open too.
            file_path = os.path.join(parent, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            writer.add_document(titel=title, path=file_path, content=content)
    writer.commit()
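The file-gathering half of the indexing step can be tried on its own, without whoosh. The sketch below assumes UTF-8 encoded .txt files; `collect_documents` is a hypothetical helper, and the temporary directory exists only for the demonstration.

```python
import os
import tempfile


def collect_documents(document_dir):
    """Return (title, path, content) tuples for every .txt file under document_dir."""
    records = []
    for parent, _dirnames, filenames in os.walk(document_dir):
        for filename in sorted(filenames):
            if not filename.endswith(".txt"):
                continue  # skip anything that is not a text document
            path = os.path.join(parent, filename)
            with open(path, encoding="utf-8") as handle:
                content = handle.read()
            title = os.path.splitext(filename)[0]
            records.append((title, path, content))
    return records


def write_file(path, text):
    """Write text to path as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    write_file(os.path.join(tmp, "one.txt"), "全文检索")
    write_file(os.path.join(tmp, "sub", "two.txt"), "搜索引擎")
    write_file(os.path.join(tmp, "skip.md"), "not indexed")
    records = collect_documents(tmp)

titles = sorted(title for title, _path, _content in records)
contents = {title: content for title, _path, content in records}
print(titles)
```

Each tuple carries the three values that the index stores per document: a title, a path and the content.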

Search function


def search(search_str):
    title_list = []
    ix = open_dir("./")
    # The context manager closes the searcher once the results are consumed.
    with ix.searcher() as searcher:
        # Parse search_str against the "content" field and run the query.
        results = searcher.find("content", search_str)
        for hit in results:
            print(hit['titel'])
            print(hit.score)
            print(hit.highlights("content", top=10))
            title_list.append(hit['titel'])
    print('titles:', title_list)
    return title_list
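The search function prints a highlighted excerpt for each hit. As a rough idea of what highlighting involves, the sketch below cuts a window of context around the first match and marks the term. It is a simplified illustration, not whoosh's highlighter; `highlight` and the `[[ ]]` markers are made up for this example.

```python
def highlight(text, term, context=10):
    """Return the first occurrence of term with surrounding context, marked with [[ ]]."""
    pos = text.find(term)
    if pos == -1:
        return ""  # the term does not occur in the text
    start = max(0, pos - context)
    end = min(len(text), pos + len(term) + context)
    return text[start:pos] + "[[" + term + "]]" + text[pos + len(term):end]


snippet = highlight("whoosh is a pure Python full-text search library",
                    "full-text", context=7)
print(snippet)
```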

Thanks for reading; I hope this helps.
