Greenstone: Open-Source DL Software

Greenstone is a comprehensive system for constructing and presenting collections of thousands or millions of documents, including text, images, audio, and video. Greenstone libraries contain many collections, individually organized, though they bear a strong family resemblance. Easily maintained, collections can be augmented and rebuilt automatically.

Greenstone constructs full-text indexes from the document text and from metadata elements such as title and author. Indexes can be searched for particular words, Boolean combinations, or phrases, and results are ranked by relevance or sorted by a metadata element.
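
The idea of separate indexes over document text and metadata, queried with Boolean combinations, can be illustrated with a minimal sketch. This is not Greenstone's indexing code: the function names (`build_index`, `search_all`, `search_any`) and the sample documents are hypothetical, and relevance ranking and phrase search are left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, field):
    """Map each word of the chosen field to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in tokenize(doc.get(field, "")):
            index[word].add(doc_id)
    return index


def search_all(index, words):
    """Boolean AND: ids of documents that contain every query word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()


def search_any(index, words):
    """Boolean OR: ids of documents that contain at least one query word."""
    return set().union(*(index.get(w.lower(), set()) for w in words))


# Hypothetical sample collection: one index per searchable field.
docs = {
    "d1": {"title": "Digital Libraries", "text": "Building open source digital library software"},
    "d2": {"title": "Text Compression", "text": "Compression of text and indexes for retrieval"},
    "d3": {"title": "Open Source Retrieval", "text": "Full text retrieval software"},
}
text_index = build_index(docs, "text")
title_index = build_index(docs, "title")

print(search_all(text_index, ["software", "retrieval"]))
print(search_any(title_index, ["digital", "compression"]))
```

Keeping one index per field is what lets a query be restricted to, say, titles only, at the cost of building and storing each index separately.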

Browsing involves hierarchical lists the user can examine interactively. Metadata is the raw material for browsing, and must be provided explicitly or be derivable automatically from the source documents. Different collections offer different searching and browsing facilities. Indexes for both are constructed during a building process, according to information in a collection configuration file.

Greenstone creates all searching and browsing structures automatically from the documents themselves: nothing is done manually. If new documents in the same format become available, they can be merged into the collection automatically. Indeed, for many collections this is done by processes that awake regularly, scout for new material, and rebuild the indexes—all without manual intervention.
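
A process that wakes regularly, looks for new material, and triggers a rebuild could be sketched roughly as follows. The directory layout, the `seen` manifest, and the `rebuild` callback are all assumptions made for illustration; Greenstone's own build scripts are not shown here.

```python
import os


def find_new_documents(import_dir, seen):
    """Return paths under import_dir that are new or modified since the last scan.

    `seen` is a hypothetical manifest mapping each path to the modification
    time recorded on the previous scan; it is updated in place.
    """
    new_paths = []
    for root, _dirs, files in os.walk(import_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if seen.get(path) != mtime:
                new_paths.append(path)
                seen[path] = mtime
    return new_paths


def rebuild_if_needed(import_dir, seen, rebuild):
    """Call `rebuild` with the new paths only when something has changed.

    Returns the number of new or modified documents found.
    """
    new_paths = find_new_documents(import_dir, seen)
    if new_paths:
        rebuild(new_paths)
    return len(new_paths)
```

A scheduler such as cron could call `rebuild_if_needed` at fixed intervals, persisting `seen` between runs so that unchanged documents are not reprocessed.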

Source documents come in a variety of formats, and are converted by plugins into a standard form for indexing. Plugins distributed with Greenstone process HTML, Word, and PDF documents, as well as Usenet and email messages; new ones can be written for different document types. To build browsing structures from metadata, an analogous scheme of classifiers is used. These create browsing indexes of various kinds: scrollable lists, alphabetic selectors, dates, and arbitrary hierarchies.
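
The division of labor between plugins and classifiers can be sketched in a few lines. Everything named here is hypothetical (`html_plugin`, `text_plugin`, `PLUGINS`, `convert`, `az_classifier`), and the "standard form" is assumed to be a plain dictionary with `title` and `text` keys; Greenstone's actual plugin interface and internal document format differ.

```python
import os
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the <title> content and the remaining text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.parts.append(data)


def html_plugin(raw):
    """Convert HTML source to the assumed standard form."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return {"title": parser.title.strip(), "text": text}


def text_plugin(raw):
    """Convert plain text, taking the first line as the title."""
    lines = raw.strip().splitlines()
    return {"title": lines[0].strip() if lines else "", "text": " ".join(raw.split())}


# Plugins are selected by file extension; new formats register a new entry.
PLUGINS = {".html": html_plugin, ".htm": html_plugin, ".txt": text_plugin}


def convert(filename, raw):
    """Dispatch a source document to the plugin registered for its extension."""
    ext = os.path.splitext(filename)[1].lower()
    plugin = PLUGINS.get(ext)
    if plugin is None:
        raise ValueError(f"no plugin registered for {ext!r}")
    return plugin(raw)


def az_classifier(docs, field="title"):
    """Build an alphabetic selector: first letter -> sorted metadata values."""
    buckets = defaultdict(list)
    for doc in docs:
        value = doc.get(field, "")
        key = value[:1].upper() if value[:1].isalpha() else "#"
        buckets[key].append(value)
    return {key: sorted(values) for key, values in sorted(buckets.items())}
```

Because plugins only produce the standard form and classifiers only consume metadata from it, a new document type needs a new plugin but no change to any classifier.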

Collections can contain text, pictures, audio, and video. Nontextual material is currently either linked into the textual documents or accompanied by textual descriptions (such as figure captions) to allow full-text searching and browsing. The architecture, however, permits implementation of plugins and classifiers for nontextual data.

Unicode is used throughout, allowing any language to be processed and displayed in a consistent manner. Collections have been built containing Arabic, Chinese, English, French, Māori, and Spanish. Multilingual collections embody automatic language recognition, and the interface is available in all these languages, among others.

Collections are accessed over the Internet or published, in precisely the same form, on a self-installing Windows CD-ROM. Compression is used to compact the text and indexes. A CORBA protocol supports distributed collections and graphical query interfaces.
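
The article does not say which compression methods are used. As a generic illustration of how an index can be compacted, the sketch below stores each list of document numbers as gaps between successive entries, encoded with a variable number of bytes. This is a common textbook technique, chosen here as an assumption, and not a description of Greenstone's actual scheme; `encode_postings` and `decode_postings` are hypothetical names.

```python
def encode_postings(doc_ids):
    """Encode document numbers as variable-byte gaps.

    Each gap is written 7 bits at a time, low-order bits first; the high bit
    of a byte is set when more bytes of the same gap follow.
    """
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        while gap >= 128:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings, returning the sorted document numbers."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation bit set: more bytes belong to this gap
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids
```

Small gaps fit in a single byte, so words that occur in many documents, whose entries lie close together, compress best.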

The New Zealand Digital Library (nzdl.org) provides many example collections, including historical documents, humanitarian and development information, technical reports and bibliographies, literary works, and magazines. Other examples appear in Apperley et al. and Witten, Loots et al. in this special section.

Being open source, Greenstone is readily extensible, and benefits from the inclusion of GNU-licensed modules for full-text retrieval, database management, text extraction from proprietary document formats, and Z39.50 protocol support. Only through international cooperative efforts will digital library software become sufficiently comprehensive to meet the world’s needs with the richness and flexibility that users deserve.
