The ff.rb (Ferret Finder) utility indexes and searches document files using Ferret, a Ruby port of the Apache Lucene Java project.

Here are some examples:

$ ff                                          # Display help text
$ ff -i ~/doc ~/projects                      # Create new index of doc and projects directories
$ ff -iI '*.txt' -x '*OLD*' ~/doc ~/projects  # Only index text files without OLD in path name
$ ff instantiation ruby                       # Find docs with both words
$ ff "array ruby -python"                     # Find docs with array and ruby but not python
$ ff file:*ruby*.txt                          # Find docs with file names like *ruby*.txt
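
The include and exclude wildcards in the third example are matched against the whole file path. Here is a minimal sketch of that filtering, assuming the same File.fnmatch call that ff.rb uses; the indexable? helper and the sample paths are hypothetical and are not part of ff.rb:

```ruby
# Hypothetical helper (not part of ff.rb): would this path be indexed given
# the --include and --exclude wildcards?
def indexable?(path, includes, excludes)
  flags = File::FNM_DOTMATCH
  # An empty include list means "include everything".
  included = includes.empty? || includes.any? { |w| File.fnmatch(w, path, flags) }
  # An exclude match always wins over an include match.
  included && excludes.none? { |w| File.fnmatch(w, path, flags) }
end

# Made-up sample paths, for illustration only.
paths = [
  '/home/me/doc/notes.txt',
  '/home/me/doc/OLD/notes.txt',
  '/home/me/doc/report.pdf',
]
paths.each do |path|
  puts "#{path}: #{indexable?(path, ['*.txt'], ['*OLD*'])}"
end
```

Because File::FNM_PATHNAME is not set, a * in the wildcard also matches directory separators, which is why '*OLD*' excludes files anywhere below an OLD directory. Matching is case sensitive.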

The ferret_helper.rb module contains a set of MIME type detection and text conversion methods for generating indexable text for Ferret.
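
As a rough illustration of the idea (a standalone sketch, not the module itself), MIME type detection by file extension and the choice of external converter can be pictured as two lookup tables. The constant and method names below are hypothetical; the MIME types and converter program names are the ones listed in ferret_helper.rb:

```ruby
# Illustrative tables only; the real ones live in ferret_helper.rb.
EXTENSION_MIME_TYPES = {
  '.doc'  => 'application/msword',
  '.htm'  => 'text/html',
  '.html' => 'text/html',
  '.odt'  => 'application/vnd.oasis.opendocument.text',
  '.pdf'  => 'application/pdf',
  '.txt'  => 'text/plain',
}

# External program used for each type (nil: plain text needs no conversion).
MIME_TYPE_CONVERTERS = {
  'application/msword' => 'antiword',
  'application/pdf' => 'pdftotext',
  'application/vnd.oasis.opendocument.text' => 'odt2txt',
  'text/html' => 'html2text',
  'text/plain' => nil,
}

# Guess the MIME type from the (case-insensitive) file name extension.
def guess_mime_type(filename)
  EXTENSION_MIME_TYPES.fetch(File.extname(filename).downcase,
                             'application/octet-stream')
end

['Report.PDF', 'notes.txt', 'photo.jpg'].each do |name|
  type = guess_mime_type(name)
  puts "#{name}: #{type} (converter: #{MIME_TYPE_CONVERTERS[type].inspect})"
end
```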

Here is a tarball containing source and documentation for the latest version: ff-1.1.1.tar.gz

Here is an older version which is compatible with Ferret 0.9.x: ff-1.0.5.tar.gz

The program prerequisites are documented in the ff.rb and ferret_helper.rb files.

Installation
Note
The Ferret index is created in a directory called ff_index in the directory containing the ff.rb script. To put it elsewhere, change the INDEX_DIR constant in ff.rb.
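
If editing the constant is inconvenient, one possible variation is to let an environment variable override the default. This is only a sketch: FF_INDEX_DIR and the index_dir method are hypothetical and ff.rb does not read any such variable.

```ruby
# Return the index directory for the script at script_path.
# ff.rb itself uses Pathname#realpath, which also resolves symlinks;
# File.expand_path is used here so the sketch works for paths that do not exist.
def index_dir(script_path, env = ENV)
  default = File.join(File.dirname(File.expand_path(script_path)), 'ff_index')
  # FF_INDEX_DIR is a hypothetical override, not something ff.rb supports.
  env.fetch('FF_INDEX_DIR', default)
end

puts index_dir('/opt/ff/ff.rb', {})
puts index_dir('/opt/ff/ff.rb', { 'FF_INDEX_DIR' => '/var/tmp/ff_index' })
```

In ff.rb the equivalent change would be to assign the result to INDEX_DIR, for example INDEX_DIR = index_dir(__FILE__).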

Email bugs, comments or patches to srackham@methods.co.nz.

ff.rb
#!/usr/bin/env ruby
#
# ff - Search and index document files using Ferret
#
# Author:  Stuart Rackham <srackham@methods.co.nz>
# License: This source code is released under the MIT license.
# Home page: http://www.methods.co.nz/ff/
#
# Requisites:
# - Ferret 0.10.4 or better installed as a Ruby Gem.
#   See http://ferret.davebalmain.com/trac for Ferret installation.
# - The accompanying ferret_helper.rb file.
# - External text file converters documented in ferret_helper.rb file.
#
# Installation:
# Drop this file and the accompanying ferret_helper.rb file into your search
# $PATH. Indexes are stored in a sub-directory called ff_index in this file's
# directory.
# Check the shebang line is right for your system.
#
HELP = <<EOF

NAME
  ff - Search and index document files using Ferret

SYNOPSIS
  ff -i [OPTIONS] DIRECTORY...
  ff WORD...
  ff QUERY
  ff --version

OPTIONS
  --version
    Print program version number

  -h, --help
    Print this message

  -i, --index
    Create a new index of DIRECTORYs (replaces any existing index)

  -I, --include WILDCARD
    Only include file paths matching WILDCARD in index (may be repeated)

  -x, --exclude WILDCARD
    Exclude file paths matching WILDCARD from index (may be repeated)

DESCRIPTION
  The first form recursively indexes all indexable document types in DIRECTORYs
  (defaults to current directory). Currently accepted document types are PDF,
  HTML, plain text, Open Document text and Microsoft Word. A document's format
  is determined by checking its MIME type (determined by file(1)) and file
  extension -- both must be acceptable for the document to be indexed.

  The Ferret index will be created in a directory called `ff_index` in the
  directory containing the `ff.rb` script.

  The second form lists files containing all WORDs (highest score first).

  The third form lists files satisfying the Ferret Query Language query QUERY
  (highest score first).

  The last form prints the program version number to stdout and exits.

  This help message is printed if there are no arguments or there is a -h or
  --help option.

PREREQUISITES
  Requires file(1) to calculate MIME types.  Requires pdftotext(1),
  html2text(1), antiword(1) and odt2txt(1) to index PDF, HTML, Microsoft Word
  and Open Document documents respectively.
EOF

# HISTORY
#   1.1.1: 2007-01-27:
#     - Added --include and --exclude options.
#   1.1.0: 2006-09-09:
#     - Rewrite for Ferret 0.10.x compatibility.
#   1.0.4: 2006-05-22:
#     - Fixed bug in FerretHelper.file_mime_type
#     - Fixed documentation errors in ferret_helper.rb
#   1.0.3: 2006-05-15:
#     - Don't assume ASCII input stream, fall back to ASCII encoder if
#       input stream does not conform to the default encoding (determined
#       by the locale).
#   1.0.2: 2006-04-17:
#     - Strip non-ascii characters before indexing.
#   1.0.1: 2006-04-12:
#     - Summary by file type.
#     - Store absolute path name.
#   1.0.0: 2006-04-06: First release.
#
VERS = '1.1.1'

require 'pathname'
require 'rubygems'
require 'ferret'
include Ferret
begin
  require 'ferret_helper'
rescue LoadError
  # Try this file's directory.
  require File.join(File.dirname(Pathname.new(__FILE__).realpath),'ferret_helper')
end
include FerretHelper

# Limit the number of documents found to this number.
NUM_DOCS = 1000

# The Ferret index directory is created in the same directory as this file.
INDEX_DIR = File.join(File.dirname(Pathname.new(__FILE__).realpath),'ff_index')

# Processed mime types and their usual file extensions.
MIME_TYPE_FILE_EXTENSIONS = {
  'application/msword' => ['.doc'],
  'application/pdf' => ['.pdf'],
  'application/vnd.oasis.opendocument.text' => ['.odt'],
  'text/html' => ['.html','.htm'],
  'text/plain' => ['.txt'],
}

# Add file +filename+ to the +index+.
def index_file(index, filename, mime_type)
  text = convert_to_text_string(filename, mime_type)
  raise "empty document #{filename}" if text.strip.empty?
  fields = {}
  fields[:file] = File.expand_path(filename)
  fields[:content] = text
  index << fields
end

# Recursively add all qualifying files in directory +dir+ to +index+.
def index_directory(index, dir, excludes, includes, counters)
  # Only visit files with allowed extensions.
  pat = "**/*{#{MIME_TYPE_FILE_EXTENSIONS.values.flatten.join(',')}}"
  Dir.glob(File.join(dir, pat), File::FNM_CASEFOLD) do |filename|
    add = (includes.empty? or includes.any? { |m| File.fnmatch(m, filename, File::FNM_DOTMATCH) })
    if add
      add = (not excludes.any? { |m| File.fnmatch(m, filename, File::FNM_DOTMATCH) })
    end
    # Skip files in Darcs repositories or hidden directories.
    if add and File.file?(filename) and not filename =~ /.*\/(_darcs|\..+?)\/.*/
      begin
        $stderr.puts "indexing: #{filename}"
        # Trying to guess MIME type from file contents is not reliable for text
        # files.  The strategy used here is to infer from file name extension
        # and rely on the convertor routine to fail if type is incorrect.
        mime_type = filename_mime_type(filename)
        index_file(index, filename, mime_type)
      rescue => e
        $stderr.puts "skipped: #{e.message}"
        counters[mime_type].skipped += 1
      else
        counters[mime_type].size += File.size(filename)
        counters[mime_type].count += 1
      end
    end
  end
end

def create_index(dirs, excludes, includes)
  Dir.mkdir INDEX_DIR unless File.directory?(INDEX_DIR)
  index = Index::IndexWriter.new(:create => true, :path => INDEX_DIR)
  # Although not intuitively obvious, until I tokenized the file name, wildcard
  # file name searches did not return all matching documents.
  index.field_infos.add_field(:file, :store => :yes, :index => :yes)
  index.field_infos.add_field(:content, :store => :no, :index => :yes)
  Struct.new('Counter', :size, :count, :skipped)
  counters = {}
  MIME_TYPE_FILE_EXTENSIONS.each_key do |key|
    counters[key] = Struct::Counter.new(0,0,0)
  end
  begin
    dirs.each { |dir| index_directory(index, dir, excludes, includes, counters) }
    index.optimize
  ensure
    index.close
  end
  counters.each_pair do |key,value|
    $stderr.puts "\n#{key}:"
    $stderr.puts "files indexed: #{value.count} (#{value.size} bytes)"
    $stderr.puts "files skipped: #{value.skipped}" unless value.skipped.zero?
  end
  total_count = counters.values.inject(0) {|sum,count| sum + count.count}
  total_size = counters.values.inject(0) {|sum,count| sum + count.size}
  total_skipped = counters.values.inject(0) {|sum,count| sum + count.skipped}
  $stderr.puts "\ntotal files indexed: #{total_count} (#{total_size} bytes)"
  $stderr.puts "total files skipped: #{total_skipped}" unless total_skipped.zero?
end

def search_index(args)
  query_parser = QueryParser.new(:default_field => :content,
                                 :or_default => false)
  query = query_parser.parse(args.join(' ').downcase)
  index = Index::Index.new(:path => INDEX_DIR)
  count = 0
  begin
    index.search_each(query, :limit => NUM_DOCS) do |doc, score|
      puts index[doc][:file]
      #puts "#{score}: #{index[doc][:file]}"
=begin Prints highlighted excerpts (but need to store content in index to work).
      index.highlight(query, doc,
        :field => :content, :excerpt_length => 60,
        :pre_tag => "\033[7m", :post_tag => "\033[m"
      ).each { |s| puts s; puts }
=end
      count += 1
    end
  ensure
    index.close
  end
  $stderr.puts
  $stderr.puts "query: #{query}"
  $stderr.puts "files: #{count}"
end

def main
  require 'optparse'
  index = false
  excludes = []
  includes = []
  opts = OptionParser.new do |opts|
    opts.on '-h', '--help' do
      puts HELP
      exit
    end
    opts.on '--version' do
      puts "ff #{VERS}"
      exit
    end
    opts.on '-i', '--index' do
      index = true
    end
    opts.on '-x', '--exclude WILDCARD' do |wildcard|
      excludes << wildcard
    end
    opts.on '-I', '--include WILDCARD' do |wildcard|
      includes << wildcard
    end
  end
  opts.parse! ARGV
  start = Time.now
  if index
    # Default to the current directory, as documented in the help text.
    create_index(ARGV.empty? ? ['.'] : ARGV, excludes, includes)
  else
    if ARGV.empty?
      $stderr.puts 'missing search query arguments'
      exit 1
    end
    search_index ARGV
  end
  $stderr.puts "time: #{Time.now - start} seconds"
end

if __FILE__ == $0
  main
end

ferret_helper.rb
# Wrapper methods for converting PDF, HTML, Open Document and Microsoft Word
# files to text for the Ferret index analysers.
#
# Author:  Stuart Rackham <srackham@methods.co.nz>
# License: This source code is released under the MIT license.
#
# Include as instance methods of the client class:
#
#   include FerretHelper
#
# or add them as class methods to a client class:
#
#   extend FerretHelper
#
# File conversion to indexable text and MIME type detection rely on the
# following external applications (the following list was tested on Debian
# based Kubuntu Linux):
#
# MIME type detection:
#   Program: file(1)
#   Version tested: 4.12
#   Installation: Pre-installed, but see file_mime_type notes below
#
# PDF to text conversion:
#   Program: pdftotext
#   Version tested: 3.00
#   Installation: Debian xpdf-utils package
#   Home page: http://www.foolabs.com/xpdf/
#
# HTML to text conversion:
#   Program: html2text
#   Version tested: 1.3.2a
#   Installation: Debian html2text package
#   Home page: http://userpage.fu-berlin.de/~mbayer/tools/html2text.html
#
# Open Document to text conversion:
#   Program: odt2txt
#   Version tested: 0.1
#   Home page: http://www.freewisdom.org/projects/python-markdown/odt2txt.php
#
# Microsoft Word to text conversion:
#   Program: antiword
#   Version tested: 0.35
#   Installation: Debian antiword package
#   Home page: http://www.winfield.demon.nl/
#

require 'fileutils'
require 'tempfile'

module FerretHelper

  # Infer MIME type from file contents using file(1).
  #
  # file(1) sometimes has problems detecting text files (the hardest type to get
  # correct):
  #
  # - Occasional false positive: HTML seen as 'text/plain'.
  # - A more serious problem is a high number of false negatives with text
  #   files: often getting the subtype wrong and sometimes as 'message/rfc822'
  #   or 'application/x-not-regular-file'.
  #
  # Fortunately users seldom write documents in plain text.
  #
  # file(1) 4.12 on Kubuntu 5.04 did not come with Open Document detection so I
  # added the following to /etc/magic:
  #
  # 0       string  PK\003\004
  # >30     string  mimetype
  # >>50    string  vnd.oasis.opendocument.text     Open Document text
  #
  # And the following to /etc/magic.mime:
  #
  # 0       string  PK\003\004
  # >30     string  mimetype
  # >>50    string  vnd.oasis.opendocument.text     application/%s
  #
  def file_mime_type(filename)
    result = %x{file -i '#{filename}'}.chomp
    raise 'missing file(1) command' if $?.exitstatus == 127
    raise "unable to determine mime type: #{filename}" unless $?.exitstatus == 0
    raise "unknown mime type: #{filename}" if result =~ /:\s*$/
    raise "malformed mime type: #{filename}" unless result =~ /:\s+(\S+?)(;|,|$)/
    $1
  end

  FILE_EXTENSION_MIME_TYPES = {
    '.doc'  => 'application/msword',
    '.html' => 'text/html',
    '.htm'  => 'text/html',
    '.odt'  => 'application/vnd.oasis.opendocument.text',
    '.pdf'  => 'application/pdf',
    '.txt'  => 'text/plain',
  }

  # Infer MIME type from file name (not a safe way to do things).
  def filename_mime_type(filename)
    FILE_EXTENSION_MIME_TYPES[File.extname(filename).downcase] ||
      'application/octet-stream'
  end

  def odt_to_text(src, dst)
    %x{odt2txt.py '#{src}' > '#{dst}' 2>/dev/null}
    raise 'missing odt2txt.py(1) command' if $?.exitstatus == 127
    raise "failed to convert Open Document text file: #{src}" unless $?.exitstatus == 0
  end

  def pdf_to_text(src, dst)
    %x{pdftotext '#{src}' '#{dst}' 2>/dev/null}
    raise 'missing pdftotext(1) command' if $?.exitstatus == 127
    raise "failed to convert pdf file: #{src}" unless $?.exitstatus == 0
  end

  def msword_to_text(src, dst)
    %x{antiword '#{src}' > '#{dst}' 2>/dev/null}
    raise 'missing antiword(1) command' if $?.exitstatus == 127
    raise "failed to convert Word file: #{src}" unless $?.exitstatus == 0
  end

  def html_to_text(src, dst)
    %x{html2text -ascii -nobs -o '#{dst}' '#{src}' 2>/dev/null}
    raise 'missing html2text(1) command' if $?.exitstatus == 127
    raise "failed to convert HTML file: #{src}" unless $?.exitstatus == 0
  end

  # Convert file to text file.
  def convert_to_text_file(src, dst, mime_type=nil)
    mime_type = file_mime_type(src) unless mime_type
    FileUtils.rm dst, :force => true
    case mime_type
    when 'text/plain'
      FileUtils.cp src, dst
    when 'text/html'
      html_to_text src, dst
    when 'application/pdf'
      pdf_to_text src, dst
    when 'application/msword'
      msword_to_text src, dst
    when 'application/vnd.oasis.opendocument.text'
      odt_to_text src, dst
    else
      raise ArgumentError, "no convertor for #{src} (#{mime_type})"
    end
  end

  # Convert file to text string.
  def convert_to_text_string(filename, mime_type=nil)
    mime_type = file_mime_type(filename) unless mime_type
    if mime_type == 'text/plain'
      result = File.open(filename) do |file|
        file.read
      end
    else
      temp_file = Tempfile.new('ferret_helper')
      begin
        temp_file.close   # So it can be written by external program.
        convert_to_text_file(filename, temp_file.path, mime_type)
        result = File.open(temp_file.path) do |file|
          file.read
        end
      ensure
        temp_file.unlink
      end
    end
    result
  end

  # Keep only white space and printable ASCII characters.
  def strip_non_ascii(s)
    s.gsub(/[^ \t\r\n\x21-\x7e]/, '')
  end

end
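
To see how file_mime_type picks the MIME type out of the file -i output, here is a standalone sketch that applies the same two regular expressions to sample output lines. The parse_mime_type name and the sample lines are hypothetical; real file(1) output may differ between versions.

```ruby
# Extract the MIME type from one line of `file -i` style output.
# Returns nil when no type was reported or the line is malformed.
def parse_mime_type(output)
  # Nothing after the "filename:" prefix means the type is unknown.
  return nil if output =~ /:\s*$/
  # The type is the first run of non-space characters after ": ",
  # ending at ';', ',' or the end of the line.
  output =~ /:\s+(\S+?)(;|,|$)/ ? $1 : nil
end

# Made-up sample lines, for illustration only.
[
  'notes.txt: text/plain; charset=us-ascii',
  'paper.pdf: application/pdf',
  'mystery.bin: ',
].each do |line|
  puts "#{line.inspect} => #{parse_mime_type(line).inspect}"
end
```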