Read me that file so I can index it, please

One of those easy-to-overlook but important details of a search engine: will it actually read your files? You may be interested in Lucene, but you'll have to find a way to feed it Office documents and PDFs.

Search engines don't index the Word document or PDF directly; they index text. This is where document filters come into play. These do their best to get the text out of the file (and usually some metadata, such as an "author" field). If you've ever tried to open an exotic document format in a plain text editor (e.g., Notepad or vi), you'll understand this can be far from trivial: many of these formats aren't straightforward at all.
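To make the idea of a filter concrete, here is a minimal sketch in Python. It is not how any vendor's filters work; the names `ExtractedDocument`, `FILTERS`, and `extract` are made up for illustration, and it only handles plain text and HTML using the standard library. The point is the shape: bytes go in, text plus a little metadata comes out, and the indexer never sees the original format.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from pathlib import Path


@dataclass
class ExtractedDocument:
    """What the indexer actually receives: text plus optional metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


class _HtmlTextCollector(HTMLParser):
    """Collects visible text and the <title>, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title.append(data)
        elif data.strip():
            self.chunks.append(data.strip())


def filter_plain_text(raw: bytes) -> ExtractedDocument:
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return ExtractedDocument(text=raw.decode("utf-8", errors="replace"))


def filter_html(raw: bytes) -> ExtractedDocument:
    parser = _HtmlTextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    title = "".join(parser.title).strip()
    metadata = {"title": title} if title else {}
    return ExtractedDocument(text=" ".join(parser.chunks), metadata=metadata)


# Hypothetical registry: one filter per file extension.
FILTERS = {".txt": filter_plain_text, ".html": filter_html, ".htm": filter_html}


def extract(filename: str, raw: bytes) -> ExtractedDocument:
    """Pick a filter by extension and return the extracted text and metadata."""
    suffix = Path(filename).suffix.lower()
    try:
        document_filter = FILTERS[suffix]
    except KeyError:
        raise ValueError(f"no filter registered for {suffix!r}") from None
    return document_filter(raw)
```

Anything without a registered filter raises an error here, which is roughly the situation a search engine is in when it meets a format its converters don't know.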

The problem isn't just finding the text; there are quite a few complications: reading across two- or three-column layouts, what to do with footnotes, or what to index, period. Spreadsheets are troublesome, but what do you make of images, audio, video? And for many scenarios (like indexing a file share) there will be exotic file types to deal with. (I recall the comments at a municipality once: "But we don't have any exotic file types." Three months later, a full crawl unearthed a stack of CAD/CAM files that were vital for planning.) To make matters worse, file formats change with the software versions that come out (will the converter read Office 2007 or just Office 95?).
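The version problem shows up in the very first bytes of a file. As a rough sketch (the function name `sniff_format` and its labels are my own, not part of any product), a crawler might peek at the header before choosing a filter: older binary Office files are OLE2 compound files, Office 2007 files are ZIP containers, and PDFs start with `%PDF-`. A real filter library checks far more than this; a ZIP signature alone doesn't prove the file is an Office document.

```python
# Well-known leading byte signatures; the labels are illustrative only.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy binary .doc/.xls/.ppt
    (b"PK\x03\x04", "zip"),  # may be Office 2007 OOXML, or any other ZIP file
]


def sniff_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

A file share full of `.doc` files can therefore hide several different formats behind one extension, which is one reason the extension alone is a poor guide.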

Since it's complicated to build and maintain good filters, most vendors buy them off-the-shelf. As I've talked about before, the market has been cornered by Oracle (with the INSO filters) and Autonomy (with the KeyView filters). Almost all the search engines out there use either Oracle's or Autonomy's converters. A notable exception is Microsoft, which has its own standard for this, IFilters. But IFilters are of varying quality, they don't always work with every Microsoft software product, and you may very well have to build a custom filter yourself for some ancient or rare software.

And there's ISYS -- probably the only vendor we cover in our Search & Information Access Report that has developed converters for over 200 document types entirely by themselves. (Even Oracle and Autonomy didn't really build filters themselves -- they bought the companies that produced them).

It makes sense, then, that ISYS now tries to bank on that hidden capital. The vendor announced last week it's releasing its File Readers as a separately available product. It'll be interesting to see these show up in Lucene implementations (and in content management systems embedding search). More options means more choice. Black may be the fastest drying paint, but maybe you can now have that Model T in purple again.

