DocIndexer is a document indexer toolkit that uses the PyLucene search engine for indexing and searching document files. DocIndexer includes command-line utilities, Python index and search classes plus a Win32 COM server that can be used to integrate indexing and searching into application software. The current version has parser support for Microsoft Word, HTML, PDF and plain text documents.
1. Features
-
Command-line search and index utilities run under Linux and Win32.
-
Scriptable Win32 COM automation library for indexing and searching and text extraction.
-
Setup Wizard binary distribution installer for Microsoft Windows.
-
The index stores relative document path names in a platform neutral format so indexing and searching the same index from different mount points can be performed from a mix of UNIX and Windows clients.
-
Incremental indexing.
-
Python index and search classes.
-
Source code distribution is a complete example of how to build, deploy and use a COM server written in Python.
-
Uses the Lucene query language.
-
Indexes Microsoft Word (2000-2003, 2007), HTML, PDF, ODT, MP3 and plain text documents. Modular architecture allows other document types to be easily added.
-
Indexes entire directories and sub-directories.
-
The applications are freely distributable under the MIT license.
2. Downloads
2.1. From the DocIndexer Home Page
- Win32 binary distribution setup wizard
-
http://www.methods.co.nz/docindexer/docindexer-setup-0.9.4.3.exe
- Source code zip file
2.2. From SourceForge
- The DocIndexer project is hosted at SourceForge
-
http://sourceforge.net/projects/docindexer/. Previous versions of DocIndexer can be found there.
3. Mercurial Repository
The DocIndexer Mercurial repository is hosted by Google Code.
To browse the repo go to http://code.google.com/p/docindexer/source/browse/.
To create your own local DocIndexer repository:
$ hg clone https://docindexer.googlecode.com/hg/ docindexer
To pull the latest changes into your local repository:
$ cd docindexer
$ hg pull https://docindexer.googlecode.com/hg/
4. Acknowledgments
DocIndexer was only made possible by the generous contributions of the many individuals who created the freely available tools and libraries that went into building it.
A list of the primary resources used to build DocIndexer can be found in the Resources section.
5. How it Works
DocIndexer contains a collection of parsers that convert different document formats to plain text so they can be indexed by PyLucene.
5.1. Indexing
-
DocIndexer parsers convert supported document formats to plain text.
-
DocIndexer feeds the parsed text to the PyLucene indexing engine (along with file and status information).
-
MS Word documents are indexed using Adri van Os’s Antiword reader program. See the Resources section for more information about Antiword.
-
PDF files are indexed using Glyph & Cog’s pdftotext PDF text extraction executable which is part of the Xpdf project (see the Resources section).
-
DocIndexer has built-in parsers for extracting text from HTML, Microsoft Word 2007 (.docx), Open Office ODT and MP3 files.
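The conversion step described above can be pictured as a lookup from file extension to a text extractor. What follows is a minimal sketch only, not DocIndexer's actual code: the `PARSERS` table and the `parse_txt`, `parse_doc`, `parse_pdf`, `run_tool` and `extract_text` names are hypothetical, it is written for a current Python 3 rather than the Python 2.5 that DocIndexer targets, and it assumes the antiword and pdftotext executables are on the PATH for the Word and PDF cases.

```python
import os
import subprocess


def parse_txt(path):
    """Plain text needs no conversion; undecodable bytes are replaced."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()


def run_tool(args):
    """Run an external converter and return what it writes to stdout."""
    result = subprocess.run(args, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def parse_doc(path):
    # antiword writes the document text to stdout.
    return run_tool(["antiword", path])


def parse_pdf(path):
    # A "-" output argument makes pdftotext write to stdout.
    return run_tool(["pdftotext", path, "-"])


# Hypothetical registry: supporting a new document type means adding one entry.
PARSERS = {"txt": parse_txt, "doc": parse_doc, "pdf": parse_pdf}


def extract_text(path):
    """Return the plain text of a document, or None if the type is unsupported."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    parser = PARSERS.get(ext)
    return parser(path) if parser else None
```

Keeping the parsers in a table like this is one way to get the "modular architecture" mentioned in the feature list: the indexing loop never needs to know which formats exist.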
5.2. Searching
-
Lucene queries are processed and a hit list of document file names and parameters is returned.
-
Searches using the docsearch command-line utility are platform independent.
-
The COM server only works on Microsoft Windows platforms.
6. PyLucene
PyLucene provides a Python wrapper for the Lucene API. There are two binary versions of PyLucene: A GCJ version which executes native compiled Lucene and the JCC version which executes Lucene Java bytecode using the Java virtual machine.
The GCJ version is currently shipped with DocIndexer because it doesn’t require an installed Java VM and because the GCJ version was used in previous versions of DocIndexer. The JCC version seems to be the preferred version and a future version of DocIndexer may move to the JCC version. More information can be found here:
7. Performance
These figures are not formal benchmarks; they just give a rough guide to indexing and search performance running under Ubuntu 7.04 on a bog-standard 2.8GHz Pentium 4 PC indexing to a local hard disk:
$ docindex -ri '*.pdf|*.doc|*.txt|*.html' docs/
start: rebuild index: Wed Aug 29 16:07:09 2007
:
optimizing...
files indexed: 1486
files skipped: 77
bytes indexed: 48.00MB
elapsed time: 00:00:28
This equates to roughly 3,000 documents and 100MB per minute (the mix included 1129 Word files, 320 PDF files, 24 text files and 13 HTML files). Incrementally updating the index only took a second:
$ docindex -i '*.pdf|*.doc|*.txt|*.html' docs/
start: update index: Wed Aug 29 16:17:52 2007
searching for documents to update...
optimizing...
files indexed: 0
elapsed time: 00:00:01
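The per-minute figures quoted above can be checked directly from the totals printed by the rebuild run. This is plain arithmetic on the numbers in that output; the variable names are only illustrative.

```python
# Totals taken from the rebuild run shown above.
files_indexed = 1486
megabytes_indexed = 48.00
elapsed_seconds = 28

docs_per_minute = files_indexed / elapsed_seconds * 60
mb_per_minute = megabytes_indexed / elapsed_seconds * 60

print("%.0f documents/minute, %.0f MB/minute" % (docs_per_minute, mb_per_minute))
```

This works out at about 3,180 documents and 103MB per minute, which is the basis for the rounded figures in the text.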
8. Installing DocIndexer
8.1. Linux
-
Check Python version 2.5 is installed.
-
Install antiword and pdftotext on your system (on Ubuntu 7.10 this means installing the antiword and poppler-utils packages).
-
Install Markdown for Python.
-
Install PyLucene version 2 for Python 2.5 (see the PyLucene website for details).
I couldn’t find a pre-compiled version of PyLucene for Python 2.5 and Ubuntu 7.10 so I downloaded PyLucene 2.2.0 compiled for Ubuntu 7.04 and then copied the required files manually:
$ tar -xzf PyLucene-2.2.0-1.tar.gz
$ cd PyLucene-2.2.0-1/python
$ sudo cp -a _PyLucene.pyd PyLucene.py security /usr/lib/python2.5/site-packages
-
Unpack the DocIndexer source distribution and run the distutils setup script:
$ unzip docindexer-0.9.4.3.zip
$ cd docindexer-0.9.4.3
$ sudo python setup.py install
You should now be able to use the docindex and docsearch utilities.
8.2. Win32 binaries
Just download and run the DocIndexer setup wizard. This installs docindex.exe and docsearch.exe, and installs and registers the docindexer_win32com.exe COM server.
8.3. Win32 source distribution
Same as for the Linux install.
9. Building DocIndexer
DocIndexer is written in Python. Windows executables are generated using the py2exe compiler and packaged using the Inno Setup installer.
Note: You can only build Windows executables, the COM server and the DocIndexer setup wizard on the Win32 platform.
The current Windows version of DocIndexer was built and tested with Python 2.5.2; PythonWin32 build 212; py2exe 0.6.9; PyLucene 2.1.0-2; antiword 0.36; pdftotext 3.02; Inno Setup 5.1.12; Python Markdown 1.7. The documentation was generated using AsciiDoc.
The Resources section at the end of this document lists the URLs of the build tools.
The build process has been automated by the Rakefile script. You don’t need to use Rake to build and install DocIndexer but the Rakefile is easy to read and you may find it useful to understand how DocIndexer is put together.
9.1. Linux
Follow the Linux install instructions but, instead of installing, build the source distribution using:
$ python setup.py sdist --dist-dir . --formats=zip
9.2. Windows
To build the Windows executables and setup wizard you need, in addition to the install prerequisites, py2exe, Python Win32 Extensions and Inno Setup.
To build the executables and the setup wizard, follow these steps:
-
Unpack the DocIndexer source distribution.
-
In the source distribution directory create antiword and xpdf directories.
-
Copy the contents of the DOS antiword distribution to the antiword directory.
-
Copy pdftotext.exe to the xpdf directory.
-
Build the executables with the following command:
python setup_win32.py
-
Build the setup wizard using the docindexer-setup.iss Inno setup script.
-
If you’ve installed the distributed DocIndexer setup wizard you’ll find the antiword and pdftotext files needed for steps 3 and 4 in the install directory.
-
Read the distributed Rakefile to understand the nitty-gritty of the above summary.
10. Index Fields
All documents contain the following index fields:
| Field Name | Description |
| content | The tokenized document contents (not stored). |
| pathname | The document path name relative to the indexed directory (path separators are / irrespective of the platform). |
| dirname | The tokenized directory part of the pathname. |
| filename | The tokenized document file name (does not include file name extension or folder name). |
| ext | File name extension (does not include leading period character). |
| mtime | The date the document was last modified (formatted like yyyy-mm-dd hh:mm:ss). |
| size | The file size in bytes. |
| status | 0 ⇒ Document indexed OK. 1 ⇒ Document contains no text. 2 ⇒ Document not indexed because the file type is unsupported. 3 ⇒ An error occurred analyzing the document. |
-
If a document cannot be parsed its index entry won’t have a content field.
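To make the path-related fields concrete, here is a small sketch of how pathname, dirname, filename and ext could be derived from a document's full path. It is an illustration of the field descriptions above, not DocIndexer's implementation; `path_fields` and its `pathmod` parameter are hypothetical names, and the standard `posixpath` and `ntpath` modules stand in for the UNIX and Windows indexing platforms.

```python
import ntpath
import posixpath


def path_fields(docsdir, fullpath, pathmod=posixpath):
    """Derive the path-related index fields for one document.

    pathmod is the path module of the indexing platform (posixpath or ntpath).
    """
    relative = pathmod.relpath(fullpath, docsdir)
    # Store "/" separators regardless of the platform that did the indexing.
    pathname = relative.replace(pathmod.sep, "/")
    dirname, basename = posixpath.split(pathname)
    filename, ext = posixpath.splitext(basename)
    return {
        "pathname": pathname,
        "dirname": dirname,
        "filename": filename,
        "ext": ext[1:],  # drop the leading period
    }


print(path_fields("/public/docs", "/public/docs/reports/site.pdf"))
print(path_fields("P:\\docs", "P:\\docs\\reports\\site.pdf", ntpath))
```

Both calls yield the same field values, which is what lets UNIX and Windows clients share one index from different mount points.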
10.1. Document type specific fields
| title | Tokenized track title. |
| performer | Tokenized artist name. |
| album | Tokenized album name. |
| track | Track number. |
11. DocIndexer Utilities
There are two command-line utilities: docsearch for searching and docindex for indexing.
-
The Linux install installs the docindex and docsearch commands.
-
The Win32 source distribution install installs the docindex.py and docsearch.py scripts in the Python Scripts directory.
-
The Win32 binary install installs the docindex.exe and docsearch.exe executables in the install directory.
11.1. docindex
Usage: docindex [options] [docsdir] [files...]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-a, --analyze analyze file and write words to stdout
-c, --config print configuration information
-e WILDCARDS, --excludes=WILDCARDS
one or more | separated wildcards
-i WILDCARDS, --includes=WILDCARDS
one or more | separated wildcards
-F FORMAT, --format=FORMAT
list format
-l, --list print a list of indexed documents
-L FILE, --logfile=FILE
write log messages to file
-n, --dry-run do not update the index
-q, --quiet suppress stdout while indexing
-r, --rebuild rebuild the index
-s, --summary print a summary of index statistics
-t, --text parse file and write textual content to stdout
docindex -r -i '*.doc|*.pdf|*.txt|*.htm*' docs
    Rebuild the index from Word, PDF, text and HTML files in docs directory.
docindex docs
    Incrementally reindex new and modified files in docs directory.
docindex /public/docs *.txt
    Add .txt files from the current directory to the index (the files must reside in the index directory).
docindex -t test.doc
    Parse the test.doc file and write its textual contents to stdout.
docindex -a intro.doc
    Analyze file intro.doc and write indexable tokens to stdout.
docindex -a the quick brown fox
    Analyze the words the quick brown fox and write indexable tokens to stdout.
-
DocIndexer creates and writes the index to the .docindexer sub-directory inside docsdir. Keeping a separate index per docsdir allows multiple arbitrary branches of the file system to be indexed.
-
The --includes option defaults to * (i.e. all files).
-
The --includes and --excludes options match the last component of file system path names.
-
If an --excludes option matches a directory name the contents of the directory will be excluded. This example will exclude Mercurial and Subversion repository directories:
docindex --excludes '.hg|.svn' --rebuild ~/projects
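The include and exclude rules above can be sketched with Python's standard `fnmatch` and `os.walk`. This is an approximation under stated assumptions, not the docindex source: it assumes the wildcards behave like shell-style `fnmatch` patterns, and the `matches` and `select_files` names are hypothetical.

```python
import fnmatch
import os


def matches(name, wildcards):
    """True if name matches any of the '|' separated shell-style wildcards."""
    return any(fnmatch.fnmatchcase(name, w) for w in wildcards.split("|"))


def select_files(docsdir, includes="*", excludes=""):
    """Yield paths under docsdir whose last path component passes both filters."""
    for dirpath, dirnames, filenames in os.walk(docsdir):
        # Pruning in place stops os.walk descending into excluded directories,
        # so the whole contents of a matching directory are skipped.
        dirnames[:] = [d for d in dirnames if not matches(d, excludes)]
        for name in filenames:
            if matches(name, excludes):
                continue
            if matches(name, includes):
                yield os.path.join(dirpath, name)
```

For example, `select_files("projects", "*.txt", ".hg|.svn")` would yield the text files under a hypothetical `projects` directory while never entering its Mercurial or Subversion repository directories.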
11.2. docsearch
Usage: docsearch [options] lucene_query [searchdir]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-a, --and join terms with AND operator
-F FORMAT, --format=FORMAT
results list format
-l, --list use long index listing format
-q, --quiet suppress results list
-s, --summary print a summary of search results
docsearch council AND article /public/docs
    Search for documents containing the words council and article.
docsearch +\"java ant\"\~10 +ext:pdf /public/docs
    Search the index for PDF documents containing the words java and ant separated by no more than 10 words.
docsearch -qs 'ext:(pdf doc)' /public/docs
    Print a summary of indexed PDF and Word documents.
docsearch -a ext:pdf filename:article /public/docs
    List PDF documents that contain the word article in the file name.
docsearch mtime:2007-05* AND ext:doc /public/docs
    List Word files that were created or modified in May 2007.
docsearch -qs status:3 AND ext:doc docs
    Print a summary of all indexed Word documents that failed to parse.
docsearch -F "%(size)-11d %(filename)s" ext:txt
    List the size and names of .txt files indexed from the current and child directories.
-
By default query terms are ORed; the --and option joins the terms with the AND operator instead.
-
A full description of the Lucene query syntax can be found here.
-
See also index fields.
-
Found documents are ranked in search score order (highest score listed first, lowest listed last).
-
If a .docindexer index directory is not found in the searchdir directory then all ancestor directories are searched. This ensures that if searchdir has been indexed then the index is automatically located.
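The ancestor lookup described above amounts to walking up the directory tree until a .docindexer directory is found. The sketch below shows one way to do that; `find_index_dir` is a hypothetical name and the code is not taken from docsearch.

```python
import os


def find_index_dir(searchdir, index_name=".docindexer"):
    """Return the nearest index directory at or above searchdir, else None."""
    current = os.path.abspath(searchdir)
    while True:
        candidate = os.path.join(current, index_name)
        if os.path.isdir(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:  # reached the file system root
            return None
        current = parent
```

With this approach a search started in any sub-directory of an indexed tree resolves to the same index as a search started at the top of that tree.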
11.3. List formats
The -F,--format options in docindex and docsearch are used to set the listing format. The format is a valid Python string format containing any document index field names as mapping keys. docsearch also accepts the score mapping key which prints the floating point document match score.
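Since the format is an ordinary Python string format with mapping keys, its effect can be tried out in the interpreter. The dictionary below is a made-up index entry used only for illustration; the keys are field names from the Index Fields section, plus the score key that docsearch accepts.

```python
# Hypothetical index entry; the values are invented for this example.
doc = {"pathname": "reports/site.pdf", "filename": "site", "ext": "pdf", "size": 48213}

# Same format string as the docsearch -F example: size left-justified in 11 columns.
size_line = "%(size)-11d %(filename)s" % doc

# A search hit also carries a floating point match score.
hit = dict(doc, score=0.4213)
score_line = "%(score).3f %(pathname)s" % hit

print(size_line)
print(score_line)
```

Field widths and justification flags work exactly as they do in any other `%` formatting expression, which makes it easy to produce aligned listings.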
12. DocIndexer COM Server
Using the DocIndexer COM server you can add document indexing and searching to Windows applications written in any COM scriptable language (Visual Basic, VBA, Delphi).
The docindexer.exe executable implements two main COM object types, one for indexing and one for searching, plus a small utilities class:
12.1. docindexer.indexer
This COM class is used to create and update document indexes. It implements the following methods and properties (VBA declaration syntax):
Sub BuildIndex(docsdir As String, _
               Optional incremental As Boolean=False, _
               Optional includes As String="*", _
               Optional excludes As String="", _
               Optional dryrun As Boolean=False, _
               Optional logfile As Variant, _
               Optional optimize As Boolean=True)
    Index the docsdir in one operation. Because indexing can take a
    long time, using this method can result in an unresponsive client
    application -- use the OpenIndex/NextFile/AddFile/CloseIndex
    interface to implement a responsive client application.

Sub OpenIndex(docsdir As String, _
              Optional incremental As Boolean=False, _
              Optional includes As String="*", _
              Optional excludes As String="", _
              Optional dryrun As Boolean=False, _
              Optional logfile As Variant, _
              Optional optimize As Boolean=True)
    Open index for directory docsdir.
    docsdir     -- Directory to be recursively indexed.
    incremental -- If True only new and updated files are indexed,
                   missing files are removed from the index.
    includes    -- String containing '|' separated list of file
                   name wildcards specifying files to be indexed.
    excludes    -- String containing '|' separated list of file
                   name wildcards specifying files to be excluded.
    dryrun      -- If True the indexer does not update the index.
    logfile     -- Log file path.
    optimize    -- If True the index is optimised.

Function NextFile() As String
    Return full path name of next file to be indexed.
    Return empty string when there are no indexable files remaining.

Function AddFile(filename As String) As Boolean
    Add a file to the open index. The filename is a string containing
    the full pathname of the document file. Returns True if the file
    is successfully indexed. An error is raised if an unexpected
    error occurs.

Sub CloseIndex(Optional optimize As Boolean=False)
    Close the index that was opened with the OpenIndex method.

Property IndexedCount
    Return the number of files indexed. Readonly property.

Property SkippedCount
    Return the number of files skipped. Readonly property.

Property BytesIndexed
    Return the number of bytes indexed. Readonly property.
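The incremental option described for OpenIndex (index new and updated files, remove missing ones) can be pictured as a comparison between what the index holds and what is on disk. The sketch below is an assumption about how such a comparison might look, not a description of DocIndexer's internals: it assumes changes are detected by comparing stored mtime values, and `plan_incremental_update` is a hypothetical name.

```python
def plan_incremental_update(indexed, on_disk):
    """Compare two {pathname: mtime} mappings and return (to_index, to_remove).

    indexed -- what the index currently holds
    on_disk -- what the directory scan found
    """
    # New files, plus files whose modification time differs from the stored one.
    to_index = sorted(
        path for path, mtime in on_disk.items()
        if path not in indexed or indexed[path] != mtime
    )
    # Files that are in the index but no longer exist on disk.
    to_remove = sorted(path for path in indexed if path not in on_disk)
    return to_index, to_remove


indexed = {
    "a.txt": "2007-05-01 10:00:00",
    "b.doc": "2007-05-02 09:30:00",
    "old.pdf": "2007-04-11 17:45:00",
}
on_disk = {
    "a.txt": "2007-05-01 10:00:00",
    "b.doc": "2007-06-15 12:00:00",
    "new.txt": "2007-06-20 08:15:00",
}
print(plan_incremental_update(indexed, on_disk))
```

Here the unchanged a.txt is left alone, b.doc and new.txt are queued for indexing, and old.pdf is queued for removal.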
12.2. docindexer.searcher
The index searcher class runs search queries and implements the following methods (VBA declaration syntax):
Sub OpenIndex(searchdir As String)
    Open search index to search for documents in and below
    directory searchdir.

Sub CloseIndex()
    Close searcher.

Sub AndSearch(words As String)
    Search for files containing all words in the words string.

Sub OrSearch(words As String)
    Search for files containing one or more words in the words string.

Sub PhraseSearch(phrase As String)
    Search for files containing the phrase string.

Sub QuerySearch(query As String)
    Search for files satisfying the Lucene query.

Function NextFile() As String
    Return a string containing the full path name of the next file
    satisfying the current search.
    Return empty string when there are no matching files remaining.

Function ParsedQuery() As String
    Return the most recent parsed document search query.

Function TotalHits() As Integer
    Return the total number of documents found.
12.3. docindexer.utils
This class exposes some useful routines:
Function TextContent(filename As String) As String
    Return a string containing the textual content of a file.

Function Markdown(text As String) As String
    Convert Markdown text to HTML string.
The Markdown method converts Markdown formatted text to HTML using the Markdown in Python package.
12.4. COM Exceptions
Unexpected errors throw exception number -2147467259 (E_FAIL).
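The error number above is the standard E_FAIL HRESULT (hexadecimal 80004005) shown as a signed 32-bit integer, which is how VBA reports it in Err.Number. The conversion is simple two's-complement arithmetic; the helper names below are illustrative only.

```python
E_FAIL = 0x80004005  # HRESULT for an unspecified failure


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return value - 0x100000000 if value & 0x80000000 else value


def to_unsigned32(value):
    """Reinterpret a signed 32-bit integer as unsigned, e.g. for hex display."""
    return value & 0xFFFFFFFF


print(to_signed32(E_FAIL))
print(hex(to_unsigned32(-2147467259)))
```

Converting in the other direction is handy when looking up an error number from a client application, since most references list HRESULT values in hexadecimal.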
12.5. Visual Basic for Applications Example
Here’s a simple VBA example taken from the docindexer-example.mdb Access 2000 example in the source distribution examples directory:
Note: You’ll need to modify the hardwired file paths in this example to match your test data.
Attribute VB_Name = "Module1"
Option Compare Database
Option Explicit
Public Const DOCS_DIR = "P:\test-docs-small"
Public Const INCLUDES = "*.doc|*.rtf|*.txt"
Public Sub TestAll()
    TestTextContent
    TestDocIndexer
    TestDocIndexer , True
    TestBuildIndex
    TestBuildIndex , True
    TestDocSearcher
End Sub
Public Sub TestDocSearcher(Optional docsdir As String = DOCS_DIR)
    Static query As String
    query = InputBox("Enter search query:", , query)
    If query = "" Then
        Exit Sub
    End If
    Dim searcher As Object, startTime As Double
    startTime = Time()
    Set searcher = CreateObject("docindexer.searcher")
    searcher.OpenIndex docsdir
    'Pass CStr(query) else query is set to return string -- weird!
    searcher.QuerySearch CStr(query)
    Dim n As Long, filename As String
    filename = searcher.NextFile()
    Do While filename <> ""
        Debug.Print filename
        filename = searcher.NextFile()
    Loop
    Debug.Print searcher.TotalHits & " file(s) found with query: " & searcher.ParsedQuery
    searcher.CloseIndex
    Set searcher = Nothing
    Debug.Print Format((Time() - startTime) * 24# * 3600#, "0.00") & " seconds"
End Sub
Public Sub TestDocIndexer(Optional docsdir As String = DOCS_DIR, Optional incremental As Boolean = False)
    Dim indexer As Object, startTime As Double
    startTime = Time()
    Set indexer = CreateObject("docindexer.indexer")
    indexer.OpenIndex docsdir, incremental, INCLUDES
    Dim filename As String
    filename = indexer.NextFile()
    Do While filename <> ""
        Debug.Print "Indexing " & filename;
        If Not indexer.AddFile(filename) Then
            Debug.Print " *** SKIPPED ***"
        Else
            Debug.Print
        End If
        filename = indexer.NextFile()
    Loop
    indexer.CloseIndex
    Debug.Print indexer.IndexedCount & " file(s) indexed"
    Debug.Print indexer.SkippedCount & " file(s) skipped"
    Debug.Print indexer.BytesIndexed & " bytes indexed"
    Debug.Print Format((Time() - startTime) * 24# * 3600#, "0.00") & " seconds"
    Set indexer = Nothing
End Sub
Public Sub TestTextContent()
    Dim utils As Object, s As String
    Set utils = CreateObject("docindexer.utils")
    s = utils.TextContent(DOCS_DIR & "\sitereport.pdf")
    Debug.Print s
End Sub
Public Sub TestBuildIndex(Optional docsdir As String = DOCS_DIR, Optional incremental As Boolean = False)
    Dim indexer As Object, startTime As Double
    startTime = Time()
    Set indexer = CreateObject("docindexer.indexer")
    indexer.BuildIndex docsdir, incremental, INCLUDES
    Debug.Print indexer.IndexedCount & " file(s) indexed"
    Debug.Print indexer.SkippedCount & " file(s) skipped"
    Debug.Print indexer.BytesIndexed & " bytes indexed"
    Debug.Print Format((Time() - startTime) * 24# * 3600#, "0.00") & " seconds"
    Set indexer = Nothing
End Sub
13. Resources
- Adri van Os’s Antiword reader
- AsciiDoc
- Glyph & Cog’s Xpdf reader
- Inno Setup
- ISTool
- Jakarta Lucene
- Markdown in Python
- py2exe
- pychecker
- PyLucene
- Python
- Python Win32 Extensions
- Ned Batchelder’s ID3 reader
14. Bugs
-
Running the PyLucene library from outside the standard Python site-packages directory results in harmless warnings starting with:
WARNING: could not properly read security provider files ...
-
If you get an error containing the phrase maxClauseCount is set to 1024, it is probably caused by a Lucene wildcard query limitation. Here’s an explanation.
-
On Windows 98 during indexing antiword occasionally throws a GPF (observed in 3 documents out of 1600).
-
VBA arguments passed by reference to COM server methods are sometimes modified. This was noted in the optional files and logfile arguments of the docindexer.OpenIndex method but there may be others. The workaround is to pass in copies of the arguments using the Cxxx() conversion functions, for example: CStr(logfile).
-
I’ve experienced COM server problems with missing VBA positional arguments, i.e. if the supplied argument list has a missing intermediate argument then it has to be explicitly included (not investigated).