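Installing and configuring Python + Lucene (PyLucene) + Paoding
PyLucene lets Python call the Lucene API to implement search. The project tracks Lucene releases closely, which is good news for anyone who prefers working in Python.
PyLucene is built with JCC. JCC reads the public class and method signatures from a jar file and generates C++ wrapper classes that call the Java classes and methods through JNI (Java Native Interface). The C++ code is compiled into a Python extension module, which embeds a JVM in the Python virtual machine. For details, see http://lucene.apache.org/pylucene/jcc/documentation/readme.html .
Paoding is written against the Lucene API as it was before version 2.9, so I picked the closest PyLucene release (PyLucene 2.4). The JCC bundled with that release is rather old, so I used the JCC from PyLucene 3.3 instead.
The rest of this post assumes that Python 2.7.2 is installed under /data/python-2.7.2 and that all source archives are kept under /data/src.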
1 Install Python
Download Python 2.7.2
Change to the directory where it was unpacked
./configure --prefix=/data/python-2.7.2 --enable-shared
make && make install
export LD_LIBRARY_PATH=/data/python-2.7.2/lib
Install the setuptools package
wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz#md5=7df2a529a074f613b509fb44feefe74e
tar zxvf setuptools-0.6c11.tar.gz
cd setuptools-0.6c11
/data/python-2.7.2/bin/python setup.py install
2 Install JCC 2.10
Download pylucene-3.3-3-src.tar.gz
Change to the directory where it was unpacked
cd jcc
Patch setuptools
mkdir tmp
cd tmp
unzip -q /data/python-2.7.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg
patch -Nup0 < /data/src/pylucene-3.3-3/jcc/jcc/patches/patch.43.0.6c11
sudo zip /data/python-2.7.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg -f
cd ..
rm -rf tmp
Make the installed OpenJDK available as /usr/lib/jvm/java-6-openjdk (adjust the source path to match your JDK)
ln -sf /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64 /usr/lib/jvm/java-6-openjdk
/data/python-2.7.2/bin/python setup.py build
/data/python-2.7.2/bin/python setup.py install
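Before moving on, it can be worth confirming that the freshly installed modules are importable by the interpreter under /data/python-2.7.2. The following is only a minimal sketch and not part of the original procedure; the helper name module_available is made up for illustration. Run it with /data/python-2.7.2/bin/python.

```python
from __future__ import print_function


def module_available(name):
    """Return True if `name` can be imported by the running interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "jcc" is the module the PyLucene Makefile later invokes as `python -m jcc`.
    for name in ("setuptools", "jcc"):
        print(name, "OK" if module_available(name) else "MISSING")
```

If jcc is reported as MISSING, the likely places to look are the setup.py install output and LD_LIBRARY_PATH, although other causes are possible.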
3 Install PyLucene + Paoding
Download pylucene-2.4.1-2-src.tar.gz and paoding-analysis-2.0.4-beta.zip
tar zxvf pylucene-2.4.1-2-src.tar.gz
mkdir paoding
cd paoding
unzip ../paoding-analysis-2.0.4-beta.zip
Change to the directory where pylucene-2.4.1-2 was unpacked
Edit the Makefile (vi Makefile) and change it as follows
...
# Linux (Ubuntu 8.10 64-bit, Python 2.5.2, OpenJDK 1.6, setuptools 0.6c9)
PREFIX_PYTHON=/data/python-2.7.2
ANT=ant
PYTHON=$(PREFIX_PYTHON)/bin/python
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=2
...
JARS=$(LUCENE_JAR) $(SNOWBALL_JAR) $(HIGHLIGHTER_JAR) $(ANALYZERS_JAR) \
$(REGEX_JAR) $(QUERIES_JAR) $(INSTANTIATED_JAR) $(EXTENSIONS_JAR) \
/data/src/paoding/paoding-analysis.jar
...
GENERATE=$(JCC) $(foreach jar,$(JARS),--jar $(jar)) \
--include /data/src/paoding/lib/commons-logging.jar \
--package java.lang java.lang.System \
...
Run
make
make install
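To make the Makefile change easier to follow: the foreach in GENERATE turns every entry of JARS into its own --jar argument, and commons-logging.jar is passed with --include, which as far as I understand the JCC options bundles the jar without generating wrappers for it. The sketch below only mimics that expansion in Python for illustration. The function name build_jcc_args is hypothetical, and lucene-core.jar is a placeholder for whatever the Makefile variables actually expand to.

```python
def build_jcc_args(jars, includes):
    """Mimic $(foreach jar,$(JARS),--jar $(jar)) followed by the --include entries."""
    args = []
    for jar in jars:
        # One "--jar <path>" pair per jar that should get Python wrappers
        args.extend(["--jar", jar])
    for inc in includes:
        # One "--include <path>" pair per dependency jar
        args.extend(["--include", inc])
    return args


if __name__ == "__main__":
    # lucene-core.jar is a placeholder name, not taken from the Makefile
    jars = ["lucene-core.jar", "/data/src/paoding/paoding-analysis.jar"]
    includes = ["/data/src/paoding/lib/commons-logging.jar"]
    print(" ".join(build_jcc_args(jars, includes)))
```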
4 Test
export LD_LIBRARY_PATH=/data/python-2.7.2/lib
export PAODING_DIC_HOME=/data/src/paoding/dic
/data/python-2.7.2/bin/python /data/src/testpylucene.py
The contents of testpylucene.py are as follows:
# -*- coding: utf-8 -*-
#
from lucene import *

# Sample documents to index
texts = ["Python是一个很有吸引力的语言",
         "C++语言也很有吸引力,长久不衰",
         "我们希望Python和C++高手加入",
         "我们的技术巨牛,人人都是高手"]


def search(searcher, qtext):
    # Run an exact-term query against the "content" field and print the hits
    tq = TermQuery(Term("content", qtext))
    hits = searcher.search(tq)
    print "----------------------------------------------"
    print "Query:'%s', %d Found" % (qtext, hits.length())
    for i in range(hits.length()):
        doc = hits.doc(i)
        print "\t", doc.get("content")


def dump(reader):
    # Print the terms the analyzer produced for each document
    for i in range(reader.maxDoc()):
        print "-----------------------------------------------"
        tv = reader.getTermFreqVector(i, "content")
        for tk in tv.getTerms():
            print tk


# Start the embedded JVM before touching any Lucene class
initVM()
directory = RAMDirectory()
analyzer = PaodingAnalyzer()
writer = IndexWriter(directory, analyzer, True)
for text in texts:
    doc = Document()
    doc.add(Field("content", text, Field.Store.YES, Field.Index.TOKENIZED,
                  Field.TermVector.YES))
    writer.addDocument(doc)
writer.optimize()
writer.close()

reader = IndexReader.open(directory)
dump(reader)

searcher = IndexSearcher(directory)
search(searcher, "python")
search(searcher, "C++")
search(searcher, "高手")
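Since the test depends on the two environment variables exported at the start of this section, a small pre-flight check can make a missing setting easier to spot. This is a minimal sketch under that assumption, not something the original setup includes; check_env and its python_lib default are names chosen here for illustration.

```python
from __future__ import print_function

import os


def check_env(environ, python_lib="/data/python-2.7.2/lib"):
    """Return a list of problems found in `environ`; an empty list means OK."""
    problems = []
    # LD_LIBRARY_PATH must list the directory holding the shared libpython
    raw = environ.get("LD_LIBRARY_PATH", "")
    lib_dirs = [d.rstrip("/") for d in raw.split(os.pathsep)]
    if python_lib not in lib_dirs:
        problems.append("LD_LIBRARY_PATH does not contain " + python_lib)
    # PAODING_DIC_HOME must point at the Paoding dictionary directory
    dic_home = environ.get("PAODING_DIC_HOME")
    if not dic_home:
        problems.append("PAODING_DIC_HOME is not set")
    elif not os.path.isdir(dic_home):
        problems.append("PAODING_DIC_HOME is not a directory: " + dic_home)
    return problems


if __name__ == "__main__":
    for problem in check_env(os.environ):
        print(problem)
```

The function takes the environment as a plain mapping rather than reading os.environ directly, so it can be tried with a hand-written dict without changing the shell.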