DocIndexer rewritten to use PyLucene

Version 0.9 is the long overdue rewrite of 0.7: the Lupy search library has been replaced with PyLucene, and there are lots of new features along with significant performance increases.

News
2011-07-20: Version 0.9.4.3 released

Updated the Windows version to Python Markdown 2.0.3, which fixes abbreviated HTTPS links (<https:...>) not being recognized.

2009-03-11: Version 0.9.4.2 released

Bug fix release (fixed skipped files in Win32 COM library). See the CHANGELOG for a full list of changes.

2009-02-12: Version 0.9.4.1 released

Bug fix release (fixed broken Win32 COM library). See the CHANGELOG for a full list of changes.

2008-12-20: Version 0.9.4.0 released

This release adds a parser for indexing MP3 (*.mp3) files.

2008-12-08: Version 0.9.3.0 released

This release adds a parser for indexing Microsoft Office 2007 .docx files.

2007-12-28: Version 0.9.2.0 released

This release adds a parser for indexing Open Office ODT files plus a Markdown method for the COM server.

2007-11-20: Opened public repository

The DocIndexer’s Mercurial repository is now hosted by ShareSource.

2007-11-04: Version 0.9.1.0 released

The latest 0.9.1.0 release handles Unicode (the previous release was only comfortable with ASCII content and file names). I’ve tested it with Western European character sets; send me any failing examples.

DocIndexer is a document indexer toolkit that uses the PyLucene search engine for indexing and searching document files. DocIndexer includes command-line utilities, Python index and search classes plus a Win32 COM server that can be used to integrate indexing and searching into application software. The current version has parser support for Microsoft Word, HTML, PDF, ODT, MP3 and plain text documents.

1. Features

  • Command-line search and index utilities run under Linux and Win32.

  • Scriptable Win32 COM automation library for indexing, searching and text extraction.

  • Setup Wizard binary distribution installer for Microsoft Windows.

  • The index stores relative document path names in a platform neutral format so indexing and searching the same index from different mount points can be performed from a mix of UNIX and Windows clients.

  • Incremental indexing.

  • Python index and search classes.

  • Source code distribution is a complete example of how to build, deploy and use a COM server written in Python.

  • Uses the Lucene query language.

  • Indexes Microsoft Word (2000-2003, 2007), HTML, PDF, ODT, MP3 and plain text documents. Modular architecture allows other document types to be easily added.

  • Indexes entire directories and sub-directories.

  • The applications are freely distributable under the MIT license.

2. Downloads

2.1. From the DocIndexer Home Page

2.2. From SourceForge

The DocIndexer project is hosted at SourceForge: http://sourceforge.net/projects/docindexer/. Previous versions of DocIndexer can also be found there.

3. Mercurial Repository

The DocIndexer Mercurial repository is hosted by Google Code.

To create your own local DocIndexer repository:

$ hg clone https://docindexer.googlecode.com/hg/ docindexer

To pull the latest changes into your local repository:

$ cd docindexer
$ hg pull https://docindexer.googlecode.com/hg/

4. Acknowledgments

DocIndexer was only made possible by the generous contributions of many individuals who created the freely available tools and libraries that went into building and creating DocIndexer.

A list of the primary resources used to build DocIndexer can be found in the Resources section.

5. How it Works

DocIndexer contains a collection of parsers that convert different document formats to plain text so they can be indexed by PyLucene.

5.1. Indexing

  • DocIndexer parsers convert supported document formats to plain text.

  • DocIndexer feeds the parsed text to the PyLucene indexing engine (along with file and status information).

  • MS Word documents are indexed using Adri van Os’s Antiword reader program. See the Resources section for more information about Antiword.

  • PDF files are indexed using Glyph & Cog’s pdftotext PDF text extraction executable which is part of the Xpdf project (see the Resources section).

  • DocIndexer has built-in parsers for extracting text from HTML, Microsoft Word 2007 (.docx), Open Office ODT and MP3 files.
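The parser step described above might be sketched roughly as follows. This is only an illustration, not DocIndexer's actual code: the PARSERS table, extract_text function and helper names are hypothetical, the status values are the ones listed under Index Fields, the code assumes Python 3, and the Word and PDF entries assume antiword and pdftotext are on the PATH.

```python
import os
import subprocess
from html.parser import HTMLParser

# Status values as listed for the "status" index field.
OK, NO_TEXT, UNSUPPORTED, ERROR = 0, 1, 2, 3


class _TextCollector(HTMLParser):
    """Collect HTML character data (script/style text is not filtered here)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_text(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


def parse_html(path):
    collector = _TextCollector()
    collector.feed(parse_text(path))
    collector.close()
    return " ".join(collector.chunks)


def run_tool(*argv):
    """Run an external converter and return whatever it wrote to stdout."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical registry: supporting a new document type means adding an entry.
PARSERS = {
    "txt": parse_text,
    "htm": parse_html,
    "html": parse_html,
    "doc": lambda path: run_tool("antiword", path),
    "pdf": lambda path: run_tool("pdftotext", path, "-"),
}


def extract_text(path):
    """Return a (status, text) pair for one document file."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    parser = PARSERS.get(ext)
    if parser is None:
        return UNSUPPORTED, ""
    try:
        text = parser(path)
    except Exception:
        return ERROR, ""
    return (OK, text) if text.strip() else (NO_TEXT, "")
```

Keeping the parsers in a table keyed by extension is one simple way to get the kind of modularity the feature list mentions; the text returned with status OK is what would be handed to the indexing engine.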

5.2. Searching

  • Lucene queries are processed and a hit list of document file names and parameters is returned.

  • Searches using the docsearch command-line utility are platform independent.

  • The COM server only works on Microsoft Windows platforms.

6. PyLucene

PyLucene provides a Python wrapper for the Lucene API. There are two binary versions of PyLucene: a GCJ version, which executes natively compiled Lucene, and a JCC version, which executes Lucene Java bytecode using the Java virtual machine.

The GCJ version is currently shipped with DocIndexer because it doesn’t require an installed Java VM and because the GCJ version was used in previous versions of DocIndexer. The JCC version seems to be the preferred version and a future version of DocIndexer may move to the JCC version. More information can be found on the PyLucene website.

7. Performance

These figures are not formal benchmarks; they just give a rough guide to indexing and search performance running under Ubuntu 7.04 on a bog standard 2.8GHz Pentium 4 PC indexing to local hard disk:

$ docindex -ri '*.pdf|*.doc|*.txt|*.html' docs/
start: rebuild index: Wed Aug 29 16:07:09 2007
 :
optimizing...
files indexed: 1486
files skipped: 77
bytes indexed: 48.00MB
elapsed time:  00:00:28

This equates to roughly 3,200 documents and just over 100MB per minute (the mix included 1129 Word files, 320 PDF files, 24 text files and 13 HTML files). Incrementally updating the index only took a second:

$ docindex -i '*.pdf|*.doc|*.txt|*.html' docs/
start: update index: Wed Aug 29 16:17:52 2007
searching for documents to update...
optimizing...
files indexed: 0
elapsed time:  00:00:01
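The per-minute rates for the full rebuild can be checked directly from the figures in its console output. The variable names below are just for illustration:

```python
# Figures copied from the full rebuild run shown above.
files_indexed = 1486
megabytes_indexed = 48.00
elapsed_seconds = 28

docs_per_minute = files_indexed * 60.0 / elapsed_seconds
mb_per_minute = megabytes_indexed * 60.0 / elapsed_seconds

print("%.0f documents/minute" % docs_per_minute)   # 3184 documents/minute
print("%.1f MB/minute" % mb_per_minute)            # 102.9 MB/minute

# The file mix quoted in the text accounts for every indexed file.
assert 1129 + 320 + 24 + 13 == files_indexed
```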

8. Installing DocIndexer

8.1. Linux

  1. Check Python version 2.5 is installed.

  2. Install antiword and pdftotext on your system (on Ubuntu 7.10 this means installing the antiword and poppler-utils packages).

  3. Install Markdown for Python.

  4. Install PyLucene version 2 for Python 2.5 (see the PyLucene website for details).

    I couldn’t find a pre-compiled version of PyLucene for Python 2.5 and Ubuntu 7.10 so I downloaded PyLucene 2.2.0 compiled for Ubuntu 7.04 and then copied the required files manually:

      $ tar -xzf PyLucene-2.2.0-1.tar.gz
      $ cd PyLucene-2.2.0-1/python
      $ sudo cp -a _PyLucene.pyd PyLucene.py security /usr/lib/python2.5/site-packages

  5. Unpack the DocIndexer source distribution and run the distutils setup script:

    $ unzip docindexer-0.9.4.3.zip
    $ cd docindexer-0.9.4.3
    $ sudo python setup.py install

You should now be able to use the docindex and docsearch utilities.
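If Word or PDF files fail to index after installing, it may be worth confirming that the external converters from step 2 can be found on the PATH. The following is a small sketch of such a check, assuming Python 3.3 or later for shutil.which; the missing_tools helper is a hypothetical name and is not part of DocIndexer:

```python
import shutil


def missing_tools(names=("antiword", "pdftotext")):
    """Return the executable names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("antiword and pdftotext are both available")
```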

8.2. Win32 binaries

Just download and install the DocIndexer setup wizard. This installs the docindex.exe and docsearch.exe as well as installing and registering the docindexer_win32com.exe COM server.

8.3. Win32 source distribution

Same as for the Linux install.

9. Building DocIndexer

DocIndexer is written in Python. Windows executables are generated using the py2exe compiler and packaged using the Inno Setup installer.

Note
You can only build Windows executables, the COM server and the DocIndexer setup wizard on the Win32 platform.

The current Windows version of DocIndexer was built and tested with Python 2.5.2; PythonWin32 build 212; py2exe 0.6.9; PyLucene 2.1.0-2; antiword 0.36; pdftotext 3.02; Inno Setup 5.1.12; Python Markdown 1.7. The documentation was generated using AsciiDoc.

The Resources section at the end of this document lists the URLs of the build tools.

The build process has been automated by the Rakefile script. You don’t need to use Rake to build and install DocIndexer but the Rakefile is easy to read and you may find it useful to understand how DocIndexer is put together.

9.1. Linux

Follow the Linux install instructions but, instead of installing, build the source distribution using:

$ python setup.py sdist --dist-dir . --formats=zip

9.2. Windows

To build the Windows executables and setup wizard you need, in addition to the install prerequisites, py2exe, the Python Win32 Extensions and Inno Setup.

To build the executables and the setup wizard:

  1. Unpack the DocIndexer source distribution.

  2. In the source distribution directory create antiword and xpdf directories.

  3. Copy the contents of the DOS antiword distribution to the antiword directory.

  4. Copy pdftotext.exe to the xpdf directory.

  5. Build the executables with the following command:

    python setup_win32.py

  6. Build the setup wizard using the docindexer-setup.iss Inno Setup script.

Tips
  • If you’ve installed the distributed DocIndexer setup wizard you’ll find the files for steps 3 and 4 in the install directory.

  • Read the distributed Rakefile to understand the nitty-gritty of the above summary.

10. Index Fields

All documents contain the following index fields:

Field Name    Description
content       The tokenized document contents (not stored).
pathname      The document path name relative to the indexed directory
              (path separators are / irrespective of the platform).
dirname       The tokenized directory part of the pathname.
filename      The tokenized document file name (does not include file
              name extension or folder name).
ext           File name extension (does not include leading period
              character).
mtime         The date the document was last modified (formatted like
              yyyy-mm-dd hh:mm:ss).
size          The file size in bytes.
status        0 ⇒ Document indexed OK.
              1 ⇒ Document contains no text.
              2 ⇒ Document not indexed because the file type is
                  unsupported.
              3 ⇒ An error occurred analyzing the document.

Notes
  • If a document cannot be parsed its index entry won’t have a content field.
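To make the file-related fields concrete, here is a rough sketch of how values shaped like those above could be derived for a single file. It is an illustration under assumptions, not DocIndexer's code: common_fields is a hypothetical name, and formatting mtime in local time is a guess.

```python
import os
import time


def common_fields(docsdir, path):
    """Build a dict shaped like the file-related index fields for one file."""
    relative = os.path.relpath(path, docsdir)
    # Store '/' separators whatever the platform's own separator is.
    pathname = relative.replace(os.sep, "/")
    dirname, _, basename = pathname.rpartition("/")
    filename, ext = os.path.splitext(basename)
    info = os.stat(path)
    return {
        "pathname": pathname,
        "dirname": dirname,
        "filename": filename,          # no extension, no folder name
        "ext": ext.lstrip("."),        # no leading period
        "mtime": time.strftime("%Y-%m-%d %H:%M:%S",
                               time.localtime(info.st_mtime)),
        "size": info.st_size,
    }
```

Because the yyyy-mm-dd hh:mm:ss layout puts the most significant parts first, a prefix query such as mtime:2007-05* selects a whole month.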

10.1. Document type specific fields

MP3 files

title         Tokenized track title.
performer     Tokenized artist name.
album         Tokenized album name.
track         Track number.

11. DocIndexer Utilities

There are two command-line utilities: docsearch for searching and docindex for indexing.

  • The Linux install installs the docindex and docsearch commands.

  • The Win32 source distribution install installs the docindex.py and docsearch.py scripts in the Python Scripts directory.

  • The Win32 binary install installs the docindex.exe and docsearch.exe executables in the install directory.

11.1. docindex

Usage: docindex [options] [docsdir] [files...]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -a, --analyze         analyze file and write words to stdout
  -c, --config          print configuration information
  -e WILDCARDS, --excludes=WILDCARDS
                        one or more | separated wildcards
  -i WILDCARDS, --includes=WILDCARDS
                        one or more | separated wildcards
  -F FORMAT, --format=FORMAT
                        list format
  -l, --list            print a list of indexed documents
  -L FILE, --logfile=FILE
                        write log messages to file
  -n, --dry-run         do not update the index
  -q, --quiet           suppress stdout while indexing
  -r, --rebuild         rebuild the index
  -s, --summary         print a summary of index statistics
  -t, --text            parse file and write textual content to stdout
Examples
docindex -r -i '*.doc|*.pdf|*.txt|*.htm*' docs

Rebuild the index from Word, PDF, text and HTML files in the docs directory.

docindex docs

Incrementally reindex new and modified files in the docs directory.

docindex /public/docs *.txt

Add .txt files from the current directory to the index (the files must reside inside the indexed /public/docs directory).

docindex -t test.doc

Parse the test.doc file and write its textual contents to stdout.

docindex -a intro.doc

Analyze file intro.doc and write indexable tokens to stdout.

docindex -a the quick brown fox

Analyze the words the quick brown fox and write indexable tokens to stdout.

Notes
  • DocIndexer creates and writes the index to the .docindexer sub-directory inside docsdir. Keeping a separate index per docsdir allows multiple arbitrary branches of the file system to be indexed.

  • The --includes option defaults to *, i.e. all files.

  • The --includes and --excludes options match the last component of file system path names.

  • If an --excludes option matches a directory name the contents of the directory will be excluded. This example will exclude Mercurial and Subversion repository directories:

    docindex --excludes '.hg|.svn' --rebuild ~/projects
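The wildcard rules in these notes can be approximated with the standard fnmatch module. The sketch below is an assumption about the behaviour described, not the code docindex actually uses; matches and select_files are hypothetical names.

```python
import fnmatch
import os


def matches(name, wildcards):
    """True if name matches any of the '|' separated wildcards."""
    return any(fnmatch.fnmatch(name, w) for w in wildcards.split("|") if w)


def select_files(docsdir, includes="*", excludes=""):
    """Yield the files under docsdir chosen by the includes/excludes rules."""
    for dirpath, dirnames, filenames in os.walk(docsdir):
        # Pruning dirnames in place stops os.walk descending into an
        # excluded directory, so its whole contents are skipped.
        dirnames[:] = [d for d in dirnames if not matches(d, excludes)]
        for name in sorted(filenames):
            # Only the last path component is tested against the wildcards.
            if matches(name, includes) and not matches(name, excludes):
                yield os.path.join(dirpath, name)
```

For example, list(select_files("projects", "*.txt", ".hg|.svn")) would return the .txt files under projects while skipping anything inside .hg or .svn directories.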

11.2. docsearch

Usage: docsearch [options] lucene_query [searchdir]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -a, --and             join terms with AND operator
  -F FORMAT, --format=FORMAT
                        results list format
  -l, --list            use long index listing format
  -q, --quiet           suppress results list
  -s, --summary         print a summary of search results
Examples
docsearch council AND article /public/docs

Search for documents containing the words council and article.

docsearch +\"java ant\"\~10 +ext:pdf /public/docs

Search the index for PDF documents containing the words java and ant separated by no more than 10 words.

docsearch -qs 'ext:(pdf doc)' /public/docs

Print a summary of indexed PDF and Word documents.

docsearch -a ext:pdf filename:article /public/docs

List PDF documents that contain the word article in the file name.

docsearch mtime:2007-05* AND ext:doc /public/docs

List Word files that were created or modified in May 2007.

docsearch -qs status:3 AND ext:doc docs

Print a summary of all indexed Word documents that failed to parse.

docsearch -F "%(size)-11d %(filename)s" ext:txt

List the size and names of .txt files indexed from the current and child directories.

Notes
  • By default query terms are ORed; the --and option joins the terms with the AND operator instead.

  • A full description of the Lucene query syntax can be found in the Lucene documentation.

  • See also the Index Fields section.

  • Found documents are ranked in search score order (highest score listed first, lowest listed last).

  • If a .docindexer index directory is not found in the searchdir directory then all ancestor directories are searched. This ensures that if searchdir has been indexed then the index is automatically located.
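The index lookup described in the last note amounts to walking up the directory tree until a .docindexer directory turns up. Here is a minimal sketch of that idea; find_index_dir is a hypothetical name and docsearch's real implementation may differ:

```python
import os

INDEX_DIR_NAME = ".docindexer"


def find_index_dir(searchdir):
    """Return the nearest .docindexer directory at or above searchdir, or None."""
    current = os.path.abspath(searchdir)
    while True:
        candidate = os.path.join(current, INDEX_DIR_NAME)
        if os.path.isdir(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the file system root without finding an index.
            return None
        current = parent
```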

11.3. List formats

The -F, --format option in docindex and docsearch sets the listing format. The format is a Python format string (% operator style) that uses document index field names as mapping keys. docsearch also accepts the score mapping key, which prints the floating point document match score.
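Mapping keys are standard Python % formatting, so a format string can be tried out against a plain dictionary. The field values below are made up for illustration:

```python
# A made-up search hit; real values would come from the index.
hit = {
    "pathname": "reports/2007/summary.txt",
    "filename": "summary",
    "ext": "txt",
    "size": 2048,
    "score": 0.73,
}

# The format from the docsearch example: size left-justified in 11 columns.
line = "%(size)-11d %(filename)s" % hit
print(line)

# The score key is only meaningful for docsearch.
scored = "%(score).2f %(pathname)s" % hit
print(scored)
```

A key that is not present in the mapping raises KeyError in plain Python, so it is worth sticking to the field names listed under Index Fields.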

12. DocIndexer COM Server

Using the DocIndexer COM server you can add document indexing and searching to Windows applications written in any COM scriptable language (Visual Basic, VBA, Delphi).

The docindexer.exe executable implements three COM classes: one for indexing, one for searching and one for utility routines.

12.1. docindexer.indexer

This COM class is used to create and update document indexes. It implements the following methods (VBA declaration syntax):

Sub BuildIndex(docsdir As String, _
              Optional incremental as Boolean=False, _
              Optional includes As String="*", _
              Optional excludes As String="", _
              Optional dryrun as Boolean=False, _
              Optional logfile as Variant, _
              Optional optimize as Boolean=True)

    Index the docsdir in one operation. Because indexing can take a
    long time using this method can result in an unresponsive client
    application -- use the OpenIndex/NextFile/AddFile/CloseIndex
    interface to implement a responsive client application.

Sub OpenIndex(docsdir As String, _
              Optional incremental as Boolean=False, _
              Optional includes As String="*", _
              Optional excludes As String="", _
              Optional dryrun as Boolean=False, _
              Optional logfile as Variant, _
              Optional optimize as Boolean=True)

    Open index for directory docsdir.
    docsdir     -- Directory to be recursively indexed.
    incremental -- If True only new and updated files are indexed,
                   missing files are removed from the index.
    includes    -- String containing '|' separated list of file
                   name wildcards specifying files to be indexed.
    excludes    -- String containing '|' separated list of file
                   name wildcards specifying files to be excluded.
    dryrun      -- If True the indexer does not update the index.
    logfile     -- Log file path.
    optimize    -- If True the index is optimised.

Function NextFile() as String
    Return full path name of next file to be indexed.
    Return empty string when there are no indexable files remaining.

Function AddFile(filename As String) As Boolean
     Add a file to the open index. The filename is a string containing
     the full pathname of the document file.  Returns True if the file
     is successfully indexed, False if it is skipped. An error is
     raised if an unexpected error occurs.

Sub CloseIndex(Optional optimize As Boolean=False)
    Close the index that was opened with the OpenIndex method.

Property IndexedCount
    Return the number of files indexed. Readonly property.

Property SkippedCount
    Return the number of files skipped. Readonly property.

Property BytesIndexed
    Return the number of bytes indexed. Readonly property.
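The OpenIndex/NextFile/AddFile/CloseIndex sequence recommended above for responsive clients can be sketched in Python as follows. The method and property names are the ones documented in this section; index_file_by_file and FakeIndexer are hypothetical, and the fake exists only so the sketch runs on any platform.

```python
def index_file_by_file(indexer, docsdir, includes="*", on_file=None):
    """Index docsdir one file at a time, reporting progress via on_file.

    Returns (IndexedCount, SkippedCount) read after the index is closed.
    """
    indexer.OpenIndex(docsdir, True, includes)   # True => incremental
    try:
        filename = indexer.NextFile()
        while filename:                          # empty string => finished
            added = indexer.AddFile(filename)
            if on_file is not None:
                on_file(filename, added)         # e.g. update a progress bar
            filename = indexer.NextFile()
    finally:
        indexer.CloseIndex()
    return indexer.IndexedCount, indexer.SkippedCount


class FakeIndexer(object):
    """Stand-in with the documented names; treats *.bad files as skipped."""

    def __init__(self, files):
        self._pending = list(files)
        self.IndexedCount = 0
        self.SkippedCount = 0

    def OpenIndex(self, docsdir, incremental=False, includes="*"):
        pass

    def NextFile(self):
        return self._pending.pop(0) if self._pending else ""

    def AddFile(self, filename):
        ok = not filename.endswith(".bad")
        if ok:
            self.IndexedCount += 1
        else:
            self.SkippedCount += 1
        return ok

    def CloseIndex(self):
        pass


print(index_file_by_file(FakeIndexer(["a.doc", "b.bad"]), "docs"))
```

On Windows, an object created with pywin32's win32com.client.Dispatch("docindexer.indexer") could presumably be passed in place of the fake. That has not been verified here, and late-bound no-argument methods such as NextFile may need flagging with _FlagAsMethod.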

12.2. docindexer.searcher

The index searcher class runs search queries and implements the following methods (VBA declaration syntax):

Sub OpenIndex(searchdir As String)
    Open search index to search for documents in and below
    directory searchdir.

Sub CloseIndex()
    Close searcher.

Sub AndSearch(words As String)
    Search for files containing all words in the words string.

Sub OrSearch(words As String)
    Search for files containing one or more words in the words string.

Sub PhraseSearch(phrase As String)
    Search for files containing the phrase string.

Sub QuerySearch(query As String)
    Search for files satisfying the Lucene query.

Function NextFile() As String
    Return a string containing the full path name of the next file
    satisfying the current search.
    Return empty string when there are no matching files remaining.

Function ParsedQuery() As String
    Return the most recent parsed document search query.

Function TotalHits() As Integer
    Return the total number of documents found.
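A search client follows the same pull-style pattern: run a query, then call NextFile until it returns an empty string. The sketch below uses only the method names documented in this section; collect_hits and FakeSearcher are hypothetical, and the fake stands in for the COM object so the code runs anywhere.

```python
def collect_hits(searcher, searchdir, query):
    """Run a Lucene query and return (list of file names, total hits)."""
    searcher.OpenIndex(searchdir)
    try:
        searcher.QuerySearch(query)
        hits = []
        filename = searcher.NextFile()
        while filename:                  # empty string => no more hits
            hits.append(filename)
            filename = searcher.NextFile()
        return hits, searcher.TotalHits()
    finally:
        searcher.CloseIndex()


class FakeSearcher(object):
    """Stand-in that returns a fixed hit list for any query."""

    def __init__(self, hits):
        self._hits = list(hits)
        self._queue = []

    def OpenIndex(self, searchdir):
        pass

    def QuerySearch(self, query):
        self._queue = list(self._hits)

    def NextFile(self):
        return self._queue.pop(0) if self._queue else ""

    def TotalHits(self):
        return len(self._hits)

    def CloseIndex(self):
        pass


print(collect_hits(FakeSearcher(["a.pdf", "b.pdf"]), "docs", "ext:pdf"))
```

The same caveat applies as for any late-bound COM client: a real docindexer.searcher object created through pywin32 has not been tested with this function.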

12.3. docindexer.utils

This class exposes some useful routines:

Function TextContent(filename as String) As String
    Return a string containing the textual content of a file.

Function Markdown(text as String) As String
    Convert Markdown text to HTML string.

The Markdown method converts Markdown formatted text to HTML using the Python Markdown package.

12.4. COM Exceptions

Unexpected errors throw exception number -2147467259 (E_FAIL).
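The error number is simply the HRESULT 0x80004005 (E_FAIL) read as a signed 32-bit integer, which is how VBA reports it. The conversion in both directions:

```python
import ctypes

E_FAIL_UNSIGNED = 0x80004005

# Reinterpret the 32-bit pattern as a signed integer.
E_FAIL_SIGNED = ctypes.c_int32(E_FAIL_UNSIGNED).value
print(E_FAIL_SIGNED)                      # -2147467259


def hresult_hex(code):
    """Format a signed error number in the usual 0x........ HRESULT form."""
    return "0x%08X" % (code & 0xFFFFFFFF)


print(hresult_hex(-2147467259))           # 0x80004005
```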

12.5. Visual Basic for Applications Example

Here’s a simple VBA example taken from the docindexer-example.mdb Access 2000 example in the source distribution examples directory:

Note
You’ll need to modify the hardwired file paths in this example to match your test data.

Attribute VB_Name = "Module1"
Option Compare Database
Option Explicit

Public Const DOCS_DIR = "P:\test-docs-small"
Public Const INCLUDES = "*.doc|*.rtf|*.txt"

Public Sub TestAll()
    TestTextContent
    TestDocIndexer
    TestDocIndexer , True
    TestBuildIndex
    TestBuildIndex , True
    TestDocSearcher
End Sub

Public Sub TestDocSearcher(Optional docsdir As String = DOCS_DIR)
    Static query As String
    query = InputBox("Enter search query:", , query)
    If query = "" Then
        Exit Sub
    End If
    Dim searcher As Object, startTime As Double
    startTime = Time()
    Set searcher = CreateObject("docindexer.searcher")
    searcher.OpenIndex docsdir
    'Pass CStr(query) rather than query itself, otherwise query is
    'overwritten with the returned string (see the by-reference
    'argument problem in the Bugs section).
    searcher.QuerySearch CStr(query)
    Dim filename As String
    filename = searcher.NextFile()
    Do While filename <> ""
        Debug.Print filename
        filename = searcher.NextFile()
    Loop
    Debug.Print searcher.TotalHits & " file(s) found with query: " & searcher.ParsedQuery
    searcher.CloseIndex
    Set searcher = Nothing
    Debug.Print Format((Time() - startTime) * 24# * 3600#, "0.00") & " seconds"
End Sub

Public Sub TestDocIndexer(Optional docsdir As String = DOCS_DIR, Optional incremental As Boolean = False)
    Dim indexer As Object, startTime As Double
    startTime = Time()
    Set indexer = CreateObject("docindexer.indexer")
    indexer.OpenIndex docsdir, incremental, INCLUDES
    Dim filename As String
    filename = indexer.NextFile()
    Do While filename <> ""
        Debug.Print "Indexing " & filename;
        If Not indexer.AddFile(filename) Then
            Debug.Print " *** SKIPPED ***"
        Else
            Debug.Print
        End If
        filename = indexer.NextFile()
    Loop
    indexer.CloseIndex
    Debug.Print indexer.IndexedCount & " file(s) indexed"
    Debug.Print indexer.SkippedCount & " file(s) skipped"
    Debug.Print indexer.BytesIndexed & " bytes indexed"
    Debug.Print Format((Time() - startTime) * 24# * 3600#, "0.00") & " seconds"
    Set indexer = Nothing
End Sub

Public Sub TestTextContent()
    Dim utils As Object, s As String
    Set utils = CreateObject("docindexer.utils")
    s = utils.TextContent(DOCS_DIR & "\sitereport.pdf")
    Debug.Print s
End Sub

Public Sub TestBuildIndex(Optional docsdir As String = DOCS_DIR, Optional incremental As Boolean = False)
    Dim indexer As Object, startTime As Double
    startTime = Time()
    Set indexer = CreateObject("docindexer.indexer")
    indexer.BuildIndex docsdir, incremental, INCLUDES
    Debug.Print indexer.IndexedCount & " file(s) indexed"
    Debug.Print indexer.SkippedCount & " file(s) skipped"
    Debug.Print indexer.BytesIndexed & " bytes indexed"
    Debug.Print Format((Time() - startTime) * 24# * 3600#, "0.00") & " seconds"
    Set indexer = Nothing
End Sub

13. Resources

14. Bugs

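  • Running the PyLucene library from outside the standard Python site-packages directory results in harmless warnings starting with:

    WARNING: could not properly read security provider files ...

  • If you get an error containing the phrase maxClauseCount is set to 1024 it is probably caused by a Lucene wildcard query limitation: a wildcard query is expanded into one clause per matching index term, and by default Lucene rejects queries that expand to more than 1024 clauses. Making the wildcard more specific usually avoids the error.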

  • On Windows 98 during indexing antiword occasionally throws a GPF (observed in 3 documents out of 1600).

  • VBA arguments passed by reference in COM server methods are sometimes modified. Noted in the optional files and logfile arguments of the docindexer.OpenIndex method but there may be others. The workaround is to pass in copies of the arguments using the Cxxx() conversion functions, for example: CStr(logfile).

  • I’ve experienced COM server problems with missing VBA positional arguments, i.e. if the supplied argument list has a missing intermediate argument then it had to be explicitly included (not investigated).
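The maxClauseCount error in the list above is easier to picture with a small simulation. The code below is not Lucene code; it only mimics how a wildcard is expanded into one clause per matching index term and rejected once a limit (1024 is Lucene's default) is exceeded. All names are hypothetical.

```python
import fnmatch

MAX_CLAUSE_COUNT = 1024   # Lucene's default limit on boolean clauses


class TooManyClauses(Exception):
    pass


def expand_wildcard(pattern, indexed_terms, limit=MAX_CLAUSE_COUNT):
    """Return the indexed terms a wildcard would expand to, or raise."""
    matching = [t for t in indexed_terms if fnmatch.fnmatchcase(t, pattern)]
    if len(matching) > limit:
        raise TooManyClauses("maxClauseCount is set to %d" % limit)
    return matching


# 2000 distinct made-up terms stand in for a field with many unique values.
terms = ["term%04d" % i for i in range(2000)]

print(len(expand_wildcard("term00*", terms)))   # 100 terms: accepted
try:
    expand_wildcard("term*", terms)             # 2000 terms: rejected
except TooManyClauses as exc:
    print("error: %s" % exc)
```

Fields with many distinct values, such as timestamps, are the most likely to hit the limit, so narrowing the wildcard prefix is usually the simplest workaround.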

15. CHANGELOG