Etymon Systems, Inc.
           
    Text retrieval — Amberfish and Isearch
 

Amberfish®

Amberfish is general-purpose text retrieval software, developed at Etymon by Nassib Nassar and distributed as open source software under the terms of version 2 of the GNU General Public License (GPL). Its distinguishing features are indexing and search of semi-structured text (i.e. both free text and multiply nested fields), built-in support for XML documents using the Xerces library, structured queries allowing generalized field/tag paths, hierarchical result sets (XML only), automatic searching across multiple databases (allowing modular indexing), TREC format results, efficient indexing, and relatively low memory requirements during indexing, including the ability to index documents larger than available memory. Z39.50 support is available. Other features include Boolean queries, right truncation, phrase searching, relevance ranking, support for multiple documents per file, incremental indexing, and easy integration with other UNIX tools. The architecture is also designed to permit proximity queries; however, these are not fully implemented at present.
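To make a few of these features concrete, the following is a minimal Python sketch of indexing words under nested XML tag paths, then searching with an optional field path, right truncation, and a Boolean AND. It is an illustration only and does not reproduce Amberfish's actual index structures or query syntax. The names `index_xml`, `search_term`, and `path_contains`, the trailing `*` for truncation, and the sample documents are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_xml(doc_id, xml_text, index):
    """Record every word together with the tag path of the element containing it."""
    def walk(elem, path):
        path = path + (elem.tag,)
        for word in (elem.text or "").lower().split():
            index[word].add((doc_id, path))
        for child in elem:
            walk(child, path)
            # Text following a child element still belongs to the parent element.
            for word in (child.tail or "").lower().split():
                index[word].add((doc_id, path))
    walk(ET.fromstring(xml_text), ())


def path_contains(path, sub):
    """True if `sub` occurs as a contiguous run of tags inside `path`."""
    n = len(sub)
    return any(path[i:i + n] == sub for i in range(len(path) - n + 1))


def search_term(index, term, field_path=None):
    """Return the ids of documents matching one term.

    A trailing '*' (hypothetical syntax) requests right truncation.
    `field_path`, if given, restricts matches to words inside that tag path.
    """
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        words = [w for w in index if w.startswith(prefix)]
    else:
        words = [term] if term in index else []
    hits = set()
    for word in words:
        for doc_id, path in index[word]:
            if field_path is None or path_contains(path, tuple(field_path)):
                hits.add(doc_id)
    return hits


# Two small hypothetical documents.
index = defaultdict(set)
index_xml("d1", "<doc><title>Text retrieval</title><body>Indexing nested fields</body></doc>", index)
index_xml("d2", "<doc><title>Cooking</title><body>Retrieving recipes</body></doc>", index)

anywhere = search_term(index, "retriev*")               # right truncation, any field
in_title = search_term(index, "retriev*", ("title",))   # restricted to <title>
both = search_term(index, "indexing") & search_term(index, "retriev*")  # Boolean AND
```

Because each term yields a set of document ids, Boolean AND and OR reduce to set intersection and union. A production system would use on-disk postings lists rather than in-memory sets.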

Amberfish documentation includes a tutorial: [PDF]. The software is available.

Amberfish began life in 1998 as a series of prototypes, initially built to experiment with the idea of a transactional software architecture for text retrieval. The following year, a commercial client became interested in indexing huge amounts of XML data. We adapted the algorithms for this purpose and added data structures for querying nested XML tags, making Amberfish among the earliest general-purpose text retrieval systems to support most of the wide range of queries needed for searching semi-structured XML-encoded text.

If you are using Amberfish in an interesting way, please let us know and send us your feedback!

Isearch

Isearch is open source text retrieval software developed in 1994 by Nassib Nassar at the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR), which was funded by the National Science Foundation. Isearch was designed as a proof-of-concept software architecture for use in distributed information retrieval, known at the time as wide-area information systems, or WAIS. Isearch formed the text retrieval component of the Isite software, which was a complete prototype implementation of ANSI/NISO Z39.50 (ISO 23950). Prior to that time, most available software, such as freeWAIS, did not treat the retrieval protocol as separate from the indexing algorithms; Isite/Isearch decoupled these components.

One of the useful things that came out of the Isearch project was the "document type" model, which is simply a method of associating each document with a class of functions that provides a standard interface for accessing the document. Amberfish uses a variation of this model. The model was also used to provide public search access to heterogeneous (multiple-format) legacy patent documents from the U.S., Europe, and Japan.
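The idea behind the document type model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Isearch's actual class hierarchy: the names `DocType`, `PlainText`, `MailMessage`, `TaggedText`, `REGISTRY`, and `extract_fields` are hypothetical, and the parsing is deliberately simplistic.

```python
import re
from abc import ABC, abstractmethod


class DocType(ABC):
    """Standard interface the indexer uses, whatever the document format."""

    @abstractmethod
    def fields(self, raw):
        """Return a dict mapping field names to the text found in `raw`."""


class PlainText(DocType):
    def fields(self, raw):
        # Plain text has no structure, so everything is one field.
        return {"body": raw}


class MailMessage(DocType):
    def fields(self, raw):
        # Headers and body are separated by the first blank line.
        header, _, body = raw.partition("\n\n")
        out = {"body": body}
        for line in header.splitlines():
            name, sep, value = line.partition(":")
            if sep:
                out[name.strip().lower()] = value.strip()
        return out


class TaggedText(DocType):
    # Matches simple, non-nested <TAG>text</TAG> pairs only.
    TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.S)

    def fields(self, raw):
        return {name.lower(): text.strip() for name, text in self.TAG.findall(raw)}


# Each document is associated with a document type by name (hypothetical names).
REGISTRY = {"text": PlainText(), "mail": MailMessage(), "sgml": TaggedText()}


def extract_fields(doctype_name, raw):
    """Format-independent entry point: the caller never parses the document itself."""
    return REGISTRY[doctype_name].fields(raw)


mail = extract_fields("mail", "From: a@example.org\nSubject: Hello\n\nBody text")
patent = extract_fields("sgml", "<TITLE>Patent</TITLE><ABSTRACT>A widget</ABSTRACT>")
```

Because the indexer only calls the shared interface, supporting a new format means adding one class and one registry entry. This is how a single search service can cover heterogeneous collections.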

The main features of Isearch include full text and field searching, relevance ranking, Boolean queries, and support for many document types such as HTML, mail folders, list digests, and text with SGML-style tags.

Isearch was not designed to handle the extremely large data sets that became common in the mid-to-late 1990s; however, Isearch was widely adopted and used in hundreds of public search sites, including many high-profile projects such as the U.S. Patent and Trademark Office patent search site, the Federal Geographic Data Clearinghouse (FGDC), the NASA Global Change Master Directory, the NASA EOS Guide System, the NASA Catalog Interoperability Project, the Astronomical pre-print service based at the Space Telescope Science Institute, the PCT Electronic Gazette at the World Intellectual Property Organization (WIPO), Linsearch (a search engine for Open Source Software designed by Miles Efron), the SAGE Project of the Special Collections Department at Emory University, and Eco Companion Australasia (an environmental geospatial resources catalog).

In many cases, Isearch was adapted or modified to use different algorithms, but it usually retained the document type model and the architectural relationship with Isite. Edward Zimmermann of Basis Systeme netzwerk made the most comprehensive enhancements to Isearch, turning it into a commercial product, adding support for many international character sets and dozens of new document type classes, and substantially reworking the algorithms to handle more data. Archibald Warnock developed a version called Isearch2, which uses a completely different index data structure. There are several other versions still in circulation.

An introduction to Isearch that appeared in the May 1997 issue of Web Techniques magazine serves as tutorial documentation for the software. An entire chapter was devoted to Isearch in The UNIX Web Server Book, Second Edition, by R. Douglas Matthews et al. (Ventana Press, 1997).

Isearch is available for download; however, this original version is no longer actively maintained.

 

  Copyright © 1998-2005 Etymon Systems, Inc. Legal notice.