Wordindex: Full text indexer and search engine
What is it?
Wordindex is a full text indexing suite with a Perl indexing backend and a PHP web-based search utility. Any language can be used to search, as long as it has access to MySQL databases.

Wordindex is capable of indexing huge amounts of data. One production system has indexed over 14GB of text, PDF, and compressed text files, and searches on that system still take less than a second on a modest server.

Wordindex is clusterable. The indexing process, which can take a very long time to complete on a huge dataset (roughly 10GB or more), can be run over a couple of nodes to spread out the load.
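
As a rough illustration of how indexing work might be divided between nodes, the sketch below assigns each document path to a node using a stable hash. This is an assumption made for illustration only: the text above does not describe how Wordindex actually distributes work, and the names node_for_path and paths_for_node are hypothetical.

```python
import hashlib


def node_for_path(path: str, node_count: int) -> int:
    """Map a document path to a node number in [0, node_count)."""
    # md5 is used only as a stable hash here (Python's built-in hash() of a
    # str changes between runs); it is not used for security.
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def paths_for_node(paths, node_id: int, node_count: int):
    """Return the subset of paths that the given node should index."""
    return [p for p in paths if node_for_path(p, node_count) == node_id]


if __name__ == "__main__":
    # Hypothetical document paths, split across two nodes.
    docs = ["/data/a.txt", "/data/b.pdf", "/data/c.ps.gz", "/data/d.doc"]
    for node in range(2):
        print(node, paths_for_node(docs, node, 2))
```

With this kind of scheme every node can compute its own share of the file list independently, and each document is indexed by exactly one node.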

Features
  • Indexing can be spread across nodes (clustering)
  • Fast access: searching through even huge datasets is quite fast
  • No limits: the internal design of the software allows you to break up the tables to increase speed and avoid the 2GB table size limit under Linux. The main "heap" tables can be split into an arbitrary number of tables; 100 or 1000 tables are common, but any number is possible.
  • Indexer only re-indexes documents that have changed since last run
  • Open access: all software is Free and Open, so you can integrate it as you wish
  • Indexes many file types: text and compressed text, PostScript and compressed PostScript, MS Word, Corel WordPerfect, tar and compressed tar files (indexes the file listing), PDF and compressed PDF, and ZIP (indexes the file listing). It can be extended to any other file type.
  • Skips "bad words", words of little indexing value (e.g. a, the, it, is; currently about 360 words are "bad")
  • Skips non-text files using a filename filter (e.g. gif, jpg)
  • On unknown files, runs the equivalent of "strings"
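
A few of the features above can be sketched in a handful of lines. The example below is only an illustration under stated assumptions: the bad-word set and extension list are tiny hypothetical samples, the heap_NNN table naming and the CRC32-based mapping are invented for the example, and none of it is taken from the Wordindex source.

```python
import os
import re
import zlib

# Hypothetical samples; the real lists are longer (about 360 bad words).
BAD_WORDS = {"a", "the", "it", "is"}
SKIP_EXTENSIONS = {".gif", ".jpg"}

WORD_RE = re.compile(r"[a-z0-9]+")
# Runs of at least 4 printable ASCII bytes, similar to the "strings" default.
PRINTABLE_RE = re.compile(rb"[\x20-\x7e]{4,}")


def should_skip(filename: str) -> bool:
    """Filename filter: skip files whose extension marks them as non-text."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in SKIP_EXTENSIONS


def index_words(text: str):
    """Lowercase the text, split it into words, and drop the bad words."""
    return [w for w in WORD_RE.findall(text.lower()) if w not in BAD_WORDS]


def heap_table_for(word: str, table_count: int = 100) -> str:
    """Pick one of table_count heap tables for a word (assumed naming)."""
    bucket = zlib.crc32(word.encode("utf-8")) % table_count
    return f"heap_{bucket:03d}"


def printable_strings(data: bytes):
    """Rough equivalent of "strings" for files of unknown type."""
    return [m.decode("ascii") for m in PRINTABLE_RE.findall(data)]
```

For example, index_words("The kernel is a device driver") returns ["kernel", "device", "driver"], and each of those words always maps to the same heap table, so a search only has to query one table per word.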

Wish List
  • I wish the search interface was much, much better. (It really needs a complete rewrite.)
  • I wish to make the "heap" able to be spread over many database servers for reliability and performance.
  • I wish to alter the indexing system to get documents by traversing URLs instead of disk-based files (rather trivial with Perl's LWP module), and then make the whole project more of a web-type search engine (not so trivial).
  • I wish I had enough time to do it all myself.
  • I wish to move all the file type filter selector stuff into a separate config file.
  • I wish to clean up the regular expressions a bit and correct some known problems.
  • Install scripts, documentation, etc. All that good stuff.


Try it!
A demo system is now available online, here.

Take a look at the documents it has indexed (a very small set, because of SourceForge's quotas) here.
The test data is a mirror of some Linux-related newsgroups.

Take a look at the log from the indexing process that indexed those documents here.

Good test searches would be anything related to Linux, like: kernel, Linux, device, driver

Download it!
A semi-current release can be found in http://wordindex.sourceforge.net/releases/

This cheesy page was designed by matt@automagically.net