Metadata Lessons from the iLumina Digital Library

The lessons follow from the five-year effort to implement metadata standards for learning objects in the iLumina digital library of undergraduate teaching resources in science, technology, engineering, and mathematics education.

Try finding high-quality educational resources on the Web. The difficulty often stems from the lack of good metadata attached to the learning objects of interest. Content creators in any teaching repository must distinguish between the descriptions, or metadata, of a learning object and the learning object itself; doing so would help search engines quickly and accurately find the objects that meet the criteria specified by the user. For example, a learning object (such as a video clip depicting the chemical reaction of sodium and water) could be described by its format (MPEG), its running time (10 seconds), and the discipline for which it was created (chemistry). The terms MPEG, 10 seconds, and chemistry are metadata values for the video clip, while the terms format, running time, and discipline are metadata elements.
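The distinction between metadata elements and metadata values can be made concrete with a small sketch. The record below is a hypothetical in-memory representation (the class, function, and file names are illustrative assumptions, not part of iLumina); it holds the three element/value pairs from the video-clip example and shows how a search could match on the description without opening the object itself.

```python
# Hypothetical sketch: metadata elements (keys) versus metadata values.
from dataclasses import dataclass, field


@dataclass
class LearningObject:
    """A learning object kept separate from the metadata that describes it."""
    location: str                                  # where the object itself lives
    metadata: dict = field(default_factory=dict)   # element -> value


def matches(obj: LearningObject, criteria: dict) -> bool:
    """True when every requested element holds the requested value."""
    return all(obj.metadata.get(element) == value
               for element, value in criteria.items())


clip = LearningObject(
    location="sodium_water.mpg",  # illustrative file name
    metadata={
        "format": "MPEG",
        "running time": "10 seconds",
        "discipline": "chemistry",
    },
)

print(matches(clip, {"discipline": "chemistry", "format": "MPEG"}))
print(matches(clip, {"discipline": "physics"}))
```

The search function only ever reads the metadata dictionary, which is the point of keeping the description apart from the object.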

For users, metadata is critical to finding resources, especially as the collections housing them are increasingly distributed across the Web. Content creators must federate information about resources, making them accessible through centralized sites or portals. Standardized metadata—essential for enabling distributed access—is less critical when Web-based resources are full text, since they can be examined directly by search engines. But fewer and fewer resources on the Web, especially those with educational value, are text alone; many are composites of data types and formats. Even for full-text resources, metadata enables users to browse resources, pursuing smart federated searches.

The two main metadata schemas and standards used today by digital libraries of educational resources are Dublin Core (DC) and IEEE Learning Object Metadata (LOM) (see Table 1). LOM was released as IEEE 1484.12.1 in June 2002 [2]. DC was approved by the American National Standards Institute in September 2001 as ANSI/NISO Z39.85 and ratified by the International Organization for Standardization in January 2003 as ISO 15836 [5]. DC and LOM take two different approaches to providing standardized metadata. DC takes a minimal approach, keeping elements simple, perhaps at the cost of limited expressive power. LOM is structural, offering rich description, perhaps at the cost of larger records and greater cataloging effort.

Here, we share our five years of experience with an implementation of LOM (with imports/exports of metadata in DC), drawing general lessons useful to anyone who wants to understand the practical challenges of using metadata to describe digital materials on the Web. We are also aware that metadata standards are evolving and that our practical implementation results, coupled with everyday use, will continue to shape the evolution of the standards.

iLumina Digital Library

The iLumina digital library (www.ilumina-dlib.org), funded by the National Science Foundation, contains undergraduate teaching materials in science, technology, engineering, and mathematics. Maintained at the University of North Carolina Wilmington, it covers resources at a range of granularities, from individual data items (such as pictures and audio clips) to complex data items (such as complete books and online courses; see the figure). Our experience developing and running iLumina and adding metadata to items in its collections suggests that educators in these disciplines throughout the U.S. have created a wealth of digital resources for teaching and are happy to share them. iLumina and other such digital libraries (see Table 1) provide repositories where educators submit metadata for their materials, find related resources (without having to reinvent them), create new content (either individually or through collaboration), and collectively improve the quantity and quality of digital teaching resources.

We derive iLumina metadata from the IEEE LOM standards. Table 2 outlines the subset of LOM metadata elements we’ve identified through the iLumina browse, search, and search-results pages, along with their abstract structural organization. LOM consists of complex types and placeholder elements; complex types are active elements that hold metadata values, collectively forming a hierarchical metadata structure. iLumina resources populate 44 different metadata elements (only a subset of the 78 possible unique elements in LOM). Table 2 also includes the mapping of LOM to DC (as suggested in [4]) when importing or exporting data to services based on DC. Simple DC is based on a 15-element set.
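A LOM-to-DC export of the kind described here can be sketched as a simple lookup. The pairs below are illustrative assumptions loosely following commonly published LOM-to-DC mappings; they are not a reproduction of Table 2, and the function name and sample record are hypothetical.

```python
# Illustrative LOM-to-DC crosswalk; the pairs are assumptions, not Table 2.
LOM_TO_DC = {
    "general.title": "title",
    "general.description": "description",
    "general.language": "language",
    "technical.format": "format",
    "educational.learningresourcetype": "type",
    "rights.description": "rights",
}


def export_to_dc(lom_record: dict) -> dict:
    """Keep only populated LOM elements that have a DC counterpart."""
    dc_record = {}
    for lom_element, value in lom_record.items():
        dc_element = LOM_TO_DC.get(lom_element)
        if dc_element is not None and value not in (None, ""):
            dc_record[dc_element] = value
    return dc_record


lom_record = {
    "general.title": "Sodium and water reaction",
    "technical.format": "video/mpeg",
    "educational.interactivitylevel": "low",   # no DC counterpart in this sketch
    "general.description": "",                 # unpopulated, so skipped
}
print(export_to_dc(lom_record))
```

Elements without a counterpart are dropped on export, which is one way the richer LOM description loses detail when shared through a DC-based service.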

Many theoretical discussions focus on individual metadata standards, but few empirical studies focus on the patterns of use of the standards that would be useful for guiding implementation decisions. Although many elements require vocabularies for their values, few standards are available for using those vocabularies, and few established policies are available for selecting them. LOM offers some best-practice vocabularies [4], though they are often provisional, and end users frequently find it necessary to establish their own terms and taxonomies.

The few empirical studies available today do not provide encouraging evidence that users adopt standards easily or systematically. For example, [9] examined the DC metadata and element use of more than 100 collections registered as data providers with the Open Archives Initiative (OAI), which develops and promotes interoperability standards for content dissemination (www.openarchives.org). It found that of the 15 DC elements, two (identifier and creator) accounted for almost 50% of element use. Overall, the top seven DC elements (creator, identifier, title, date, type, subject, and description) accounted for over 70% of the elements used in the records; 50% of the data providers never populated any other elements; and the least-used elements were used only 6% of the time. In this large data set, the use of DC metadata elements by data providers registered with the OAI was selective and sparse.
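The kind of element-use measurement reported in [9] can be illustrated with a short sketch. This is not the method or data of that study; the function name and the toy records are assumptions chosen only to show how each element's share of all populated elements might be computed.

```python
# Toy illustration of measuring element use; data and names are hypothetical.
from collections import Counter


def element_usage_share(records):
    """Return each element's share of all populated elements, most used first."""
    counts = Counter(
        element
        for record in records
        for element, value in record.items()
        if value  # count an element only when it actually holds a value
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {element: count / total for element, count in counts.most_common()}


toy_records = [
    {"identifier": "a", "creator": "x"},
    {"identifier": "b", "title": "t", "subject": ""},  # empty subject ignored
]
print(element_usage_share(toy_records))
```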

The main mechanisms for interoperability, explored in [1], are controlled vocabularies for metadata, taxonomies for classification, and thesauri and crosswalks between vocabularies and taxonomies. Prior research with the Open Archives repositories [7] indicates that most taxonomies use a controlled vocabulary rather than freeform data input for an element; most also use different controlled vocabularies [6]. Conducting effective federated searches across multiple repositories involves identifying the source for the vocabularies and developing a discipline-specific thesaurus. How to best implement or extend standardized vocabularies and taxonomies is an open question. Meanwhile, recognizing that LOM is still in the early stages of its development as a standard, the IEEE Learning Technology Standards Committee Metadata Working Group is investigating the experience of implementers, as well as users, in order to further refine the standard [3].

iLumina Experience

The iLumina implementation continues to be driven by a joint application development team of users, digital library aficionados/academics, and IT specialists, including faculty, staff, and students at the University of North Carolina Wilmington. (The team meets monthly to review submitted resources and integration issues with the University’s Randall Library computer system.) The team initially used a rapid application development process model to develop a mockup, then implemented a prototype application for review. Team members provided feedback and revisions that were then incorporated into the next version. The team decided early to use a relational database with a schema to support all LOM elements, though only a subset would be populated. Supporting every LOM element in the schema, even those left unpopulated, gives us the flexibility to add metadata elements to, or remove them from, library services.
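One way a relational schema can support every LOM element while populating only a subset is sketched below. The table and column names are hypothetical and do not describe the actual iLumina database; the sketch only shows that an element can be switched on for library services without changing the schema.

```python
# Hypothetical schema sketch; table and column names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE element (
    name    TEXT PRIMARY KEY,            -- e.g. 'general.title'
    enabled INTEGER NOT NULL DEFAULT 0   -- 1 when library services expose it
);
CREATE TABLE resource (
    id       INTEGER PRIMARY KEY,
    location TEXT NOT NULL
);
CREATE TABLE resource_metadata (
    resource_id INTEGER NOT NULL REFERENCES resource(id),
    element     TEXT NOT NULL REFERENCES element(name),
    value       TEXT NOT NULL
);
""")

# Every element is registered, but only some are enabled and populated.
conn.executemany(
    "INSERT INTO element (name, enabled) VALUES (?, ?)",
    [("general.title", 1), ("technical.format", 1),
     ("educational.typicallearningtime", 0)],
)
conn.execute("INSERT INTO resource (id, location) VALUES (1, 'sodium_water.mpg')")
conn.executemany(
    "INSERT INTO resource_metadata (resource_id, element, value) VALUES (?, ?, ?)",
    [(1, "general.title", "Sodium and water reaction"),
     (1, "technical.format", "video/mpeg")],
)


def visible_metadata(connection, resource_id):
    """Return the enabled, populated elements of one resource."""
    rows = connection.execute(
        """SELECT m.element, m.value
           FROM resource_metadata AS m
           JOIN element AS e ON e.name = m.element
           WHERE m.resource_id = ? AND e.enabled = 1
           ORDER BY m.element""",
        (resource_id,),
    )
    return dict(rows.fetchall())


print(visible_metadata(conn, 1))
```

Enabling a previously unused element is then an `UPDATE` on the `element` table plus new rows in `resource_metadata`, with no change to the table definitions.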

Many team members from the various disciplines, including chemistry, biology, mathematics, physics, and computer science, have contributed digital resources to the library. It was during this population process that we learned about the importance of LOM metadata elements and how to use a standard, controlled vocabulary.

In order to catalog resources for iLumina, we developed sets of vocabularies and taxonomies based on the LOM specification. After an initial cataloging of a set of resources, it was evident that the educational learning resources would need more descriptive information than we originally anticipated. We thus created a modified LOM specification table to capture this information (www.ilumina-dlib.org/documents/vocabulary_comparison_chart.htm).

Decisions regarding implementation of the metadata elements and associated vocabularies were influenced by the fact that responsibility for cataloging would eventually shift from trained catalogers to less-experienced submitters of resources. With this in mind, the development team insisted that the controlled vocabulary use standard language. The result was two major changes: modification of the original LOM vocabulary and the addition of a metadata element.

Most iLumina vocabularies are modified versions of the recommended LOM vocabularies (see Table 3). For example, educational.learningresourcetype has a suggested vocabulary that proved to be of only limited use for the scientific resources cataloged in iLumina. Because not all learning resources being submitted fit the suggested list, we included other resource types. Adding resource types was an iterative process; as new items were cataloged, we added new vocabulary terms until we were confident we had addressed the majority of resource-submission options (see www.ilumina-dlib.org/documents/vocabulary.htm for a complete table).

We also found it desirable to add a metadata element to the original IMS Global Learning Consortium (LOM-based) [4] specification. Calling it technical.mediatype, we used it to assist with the presentation of information about the technical.format of a resource. Technical.format is the IMS element that describes the MIME type of the resource, though in keeping with the focus on standard language, we found it useful to categorize the MIME type list by media type (www.ilumina-dlib.org/documents/datacategories.htm). We referred to these categories as the technical.mediatype. They are often quite helpful to library users who may want an image but don’t care whether it is GIF, JPEG, or some other format. The media type is presented in the advanced search as a simple means of searching for specific file types in iLumina. It is also included in the resource-contribution form, where it functions as a filter to limit MIME type choices.
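The relationship between technical.format and technical.mediatype can be sketched as a lookup from MIME type to category. The category names and the MIME types listed are assumptions for illustration; the actual iLumina categories are in the data-categories page cited above.

```python
# Hypothetical MIME-type-to-media-type categories; not the iLumina list.
MEDIA_TYPE_BY_MIME = {
    "image/gif": "image",
    "image/jpeg": "image",
    "video/mpeg": "video",
    "video/quicktime": "video",
    "audio/mpeg": "audio",
    "text/html": "text",
}


def media_type(mime_type: str) -> str:
    """Derive the media type category from a technical.format value."""
    return MEDIA_TYPE_BY_MIME.get(mime_type.lower(), "other")


def mime_choices(selected_media_type: str) -> list:
    """Limit the MIME types offered on a contribution form to one category."""
    return sorted(mime for mime, category in MEDIA_TYPE_BY_MIME.items()
                  if category == selected_media_type)


print(media_type("image/GIF"))
print(mime_choices("image"))
```

The first function serves a search by media type; the second plays the role of the filter on the contribution form.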

iLumina Taxonomies

To assist with the placement of resources within iLumina, we created three levels of taxonomies: discipline, subject, and topic (www.ilumina-dlib.org/documents/ims_classifications.htm). Discipline representatives found or created specific taxonomies for their respective disciplines. For example, computer scientists defined their taxonomies based on the ACM/IEEE Computing Curricula 2001 Classification Scheme. Chemists used a modified version of the Library of Congress taxonomy. Biologists developed their own taxonomy. And mathematicians created a common taxonomy for all educational levels (people.uncw.edu/hermanr/MathTax/index.htm).
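The three-level arrangement of discipline, subject, and topic can be sketched as a nested structure with a validation step. The taxonomy entries below are invented placeholders, not terms from the iLumina taxonomies, and the function name is hypothetical.

```python
# Hypothetical three-level taxonomy: discipline -> subject -> topics.
TAXONOMY = {
    "chemistry": {
        "inorganic chemistry": {"alkali metals", "halogens"},
    },
    "mathematics": {
        "calculus": {"limits", "integration"},
    },
}


def taxon_path(discipline, subject=None, topic=None):
    """Validate a classification and return it as a tuple of levels."""
    if discipline not in TAXONOMY:
        raise ValueError(f"unknown discipline: {discipline!r}")
    path = [discipline]
    if subject is not None:
        if subject not in TAXONOMY[discipline]:
            raise ValueError(f"unknown subject: {subject!r}")
        path.append(subject)
        if topic is not None:
            if topic not in TAXONOMY[discipline][subject]:
                raise ValueError(f"unknown topic: {topic!r}")
            path.append(topic)
    elif topic is not None:
        raise ValueError("a topic requires a subject")
    return tuple(path)


print(taxon_path("chemistry", "inorganic chemistry", "alkali metals"))
```

Rejecting a subject that belongs to another discipline is what keeps each discipline's taxonomy independent of the others.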

Our experience with LOM also revealed an Achilles heel in the way the standard handles directory information. As its preferred format for personal data, LOM uses the vCard (Internet Mail Consortium, RFC 2426), a standard electronic business card that carries vital directory information (such as name, street address, phone number, and email address), rather than an XML format. Although XPath expressions can be written to parse the vCard for desired data, we chose an xCard, or a vCard expressed in XML. Using an xCard turned out to be a good way to standardize the internal representation of directory information; it was also easier to integrate, parse, and maintain and generally simplified the software code used to find information in the directory (such as author last name).
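The difference between reading a last name out of a vCard and out of an XML representation can be sketched as follows. The vCard uses the RFC 2426 `N` property (family name first, components separated by semicolons); the XML element names are assumptions for illustration and are not the xCard layout used in iLumina. The vCard parser ignores line folding and escaping, so it is a sketch rather than a complete reader.

```python
# Sketch only: the XML element names below are hypothetical.
import xml.etree.ElementTree as ET

VCARD = (
    "BEGIN:VCARD\r\n"
    "VERSION:3.0\r\n"
    "FN:Ada Lovelace\r\n"
    "N:Lovelace;Ada;;;\r\n"
    "EMAIL:ada@example.org\r\n"
    "END:VCARD\r\n"
)

XCARD = (
    "<vcard>"
    "<fn>Ada Lovelace</fn>"
    "<n><family>Lovelace</family><given>Ada</given></n>"
    "<email>ada@example.org</email>"
    "</vcard>"
)


def last_name_from_vcard(text):
    """Scan the text line by line for the N property and split its value."""
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        # Property names may carry parameters after a semicolon.
        if separator and name.split(";")[0].upper() == "N":
            return value.split(";")[0]
    return None


def last_name_from_xcard(xml_text):
    """Address the family name directly by its path in the XML tree."""
    return ET.fromstring(xml_text).findtext("n/family")


print(last_name_from_vcard(VCARD))
print(last_name_from_xcard(XCARD))
```

With the XML form, the field is located by a path through a standard parser rather than by hand-written text handling, which is the maintenance advantage described above.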

The iLumina project hired students to catalog the library’s digital resources. Beginning early in the development effort (2000–2001), the cataloging process drew on digital resources submitted by some of our members. The process informed us about the arrangement of the input form, difficulties with vocabularies, errors in programming, and standardization of metadata appearance. We used this information to create the final versions of the metadata specification, input form, metadata page, and organization of iLumina’s resources. This work, completed in 2002, included an initial set of 200 learning resources contributed by the development team. Limiting the number of resources at this stage made it easier to make global changes to the metadata.

The metadata review process is simple and efficient. Resources submitted to iLumina are initially categorized as submitted and are not available to the public. From a pending-items list, the iLumina librarian views the resource, metadata, and date submitted, then forwards the resource to the appropriate discipline editor for review. The discipline editor receives an email message with a link to the review materials. The review itself includes 22 questions in three categories: metadata, content, and technical. The reviewer then emails completed review forms to the discipline editor. The editor checks the status of the review and determines whether the resource is accepted, accepted with revisions, or rejected. Once the resource is reviewed, it is tagged with its status. For a review flowchart, sample review checklist, and review summary, see www.ilumina-dlib.org/documents/.
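The review statuses can be sketched as a small state machine. The status names come from the text; the allowed transitions and the rule that only accepted resources are public are assumptions made for illustration, not a description of the iLumina software.

```python
# Sketch of the review workflow; transitions and visibility rule are assumed.
from enum import Enum


class Status(Enum):
    SUBMITTED = "submitted"
    IN_REVIEW = "in-review"
    ACCEPTED = "accepted"
    ACCEPTED_WITH_REVISIONS = "accepted with revisions"
    REJECTED = "rejected"


ALLOWED = {
    # The librarian forwards a submitted resource to a discipline editor.
    Status.SUBMITTED: {Status.IN_REVIEW},
    # The editor decides the outcome once review forms are returned.
    Status.IN_REVIEW: {Status.ACCEPTED,
                       Status.ACCEPTED_WITH_REVISIONS,
                       Status.REJECTED},
}


def advance(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new


def is_public(status: Status) -> bool:
    """Assumption: only accepted resources are shown to the public."""
    return status is Status.ACCEPTED


status = advance(Status.SUBMITTED, Status.IN_REVIEW)
status = advance(status, Status.ACCEPTED)
print(status.value, is_public(status))
```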

Usability is a major consideration in presenting information on the iLumina Web site. In addition to adhering to best practices for usability and accessibility and addressing feedback from end users, we had the iLumina Web site scrutinized in 2002 in two independent, outside usability studies, one by a group at Virginia Tech, the other by a group at the University of North Carolina Chapel Hill.

Conclusion

International efforts continue on the specification and use of the LOM standard, including its various data models; implementation efforts are also under way. We find certain parts of LOM useful for describing resources to be added to the iLumina library, both as an individual collection and as part of a distributed digital library, the National Science Digital Library (NSDL), funded by the National Science Foundation.

Do LOM benefits outweigh LOM costs, especially when compared to the DC minimal-metadata approach? The following paragraphs cover eight propositions based on the iLumina implementation. Several follow directly from the experience discussed earlier. Others are generalizations suggested, though not fully established, by our effort defining metadata language for incorporating library resources. We include them here because they represent broad claims that still need to be refined and tested by future implementations of metadata standards, including LOM and DC.

LOM elements. Many LOM elements are useful in describing learning resources; however, the most useful ones are also in (and mappable to) DC. The exception is the classification.taxon element, which is more expressive than the DC subject element.

No evidence. A few of the LOM education elements (not shared by DC) may be valuable, but we found no compelling evidence for these additional fields. Though the NSDL Core Integration team has suggested adding three LOM educational elements to DC, iLumina has found little use for them. In particular, NSDL suggests using educational.interactivitytype (not populated in iLumina), educational.typicallearningtime (not populated in iLumina), and educational.interactivitylevel (populated as low, high, or unspecified in iLumina). Some users like being able to distinguish the high interactivity of a particular resource, meaning that more user interaction is required than just pressing a button or clicking a mouse. iLumina has also cataloged educational.difficulty and educational.intendedenduserrole, though it is unclear how useful this information is to the users of the library.

Other categories. Elements in other categories (not education-specific) appear to be equally important in describing resources in educational digital libraries, including NSDL. Although we cataloged technical.size, finding it useful as a caution alert for time-consuming downloads, it could be automated by programmatically looking at the file properties, then notifying users of the expected download time.
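Automating technical.size as suggested here could look like the sketch below, which reads the size from the file system and estimates download time for an assumed connection speed. The function names, the 56,000 bit/s default, and the 30-second threshold are illustrative assumptions.

```python
# Sketch: derive size from file properties and estimate download time.
import os
import tempfile


def estimated_download_seconds(path, bits_per_second):
    """Estimate transfer time from the file size on disk."""
    size_bytes = os.path.getsize(path)
    return size_bytes * 8 / bits_per_second


def download_caution(path, bits_per_second=56_000, threshold_seconds=30.0):
    """Return a caution message for slow downloads, otherwise None."""
    seconds = estimated_download_seconds(path, bits_per_second)
    if seconds > threshold_seconds:
        return (f"Caution: about {seconds:.0f} s to download "
                f"at {bits_per_second} bit/s")
    return None


# Demonstration with a temporary 700,000-byte file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\0" * 700_000)
    demo_path = handle.name

demo_seconds = estimated_download_seconds(demo_path, 56_000)
demo_message = download_caution(demo_path)
os.remove(demo_path)
print(demo_seconds, demo_message)
```

Because the value is computed from the file itself, a cataloger would not need to enter it by hand.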

Non-LOM and non-DC elements. At least a few elements not in LOM or DC may be important. One example is mediatype: because it could be derived from technical.format, it may not have to be added as a stored element, but mapping format into mediatype is useful when designing the user interface. However, iLumina still needs the format element for interoperability. For example, a user must use .swf video, not .mov video, for Macromedia Flash videos.

Useful in the future. Although several LOM elements, including many in the educational category, are of limited value, due to their semantic ambiguity, they may be valuable in the future—but only if they are first well defined and attract a user community that applies them systematically. For example, semantic density and difficulty are often highly dependent on context. Furthermore, iLumina narrowed its vocabulary for interactivity to low, high, or not specified; determining whether or not a resource is “medium” or “very high” (as allowed in LOM) is subjective. Moreover, technical.requirements are most useful for resources that require some sort of runtime environment (such as source code, executables, and plug-ins). Resource types do not all benefit equally from the technical.requirements element.

Vocabularies. For interoperability in distributed libraries, vocabularies represent a much greater challenge than specifying metadata elements, especially when trying to share metadata. As we discovered, dealing with mismatched vocabularies is much more difficult than dealing with mismatched elements. In order to achieve aggregation across repositories, we need common controlled vocabularies and taxonomies. An NSDL working group researching vocabularies and establishing standards for elements (such as LearningResourceType and EducationalLevel) recently held a workshop and set up an online community to debate the issues (metamanagement.comm.nsdl.org/cgi-bin/wiki.pl?VocabDevel).
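The vocabulary mismatch can be illustrated with a crosswalk that translates each repository's local terms into a shared vocabulary before a federated search. The repository names, local terms, and shared terms below are invented for illustration; no real repository's vocabulary is reproduced.

```python
# Hypothetical crosswalks from local vocabularies to a shared vocabulary.
CROSSWALKS = {
    "repo_a": {"movie": "video", "picture": "image"},
    "repo_b": {"video clip": "video", "photograph": "image",
               "applet": "simulation"},
}


def to_shared(repository, local_term):
    """Translate a local term; None means no crosswalk entry exists."""
    return CROSSWALKS.get(repository, {}).get(local_term.strip().lower())


def federated_filter(records, shared_term):
    """Select records from any repository whose type maps to shared_term."""
    return [record for record in records
            if to_shared(record["repository"],
                         record["learningresourcetype"]) == shared_term]


records = [
    {"repository": "repo_a", "id": 1, "learningresourcetype": "Movie"},
    {"repository": "repo_b", "id": 2, "learningresourcetype": "video clip"},
    {"repository": "repo_b", "id": 3, "learningresourcetype": "photograph"},
    {"repository": "repo_a", "id": 4, "learningresourcetype": "lab manual"},
]
print([record["id"] for record in federated_filter(records, "video")])
```

Record 4 shows the cost of a missing crosswalk entry: a term with no mapping silently drops out of every federated search, which is why a common controlled vocabulary matters.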

Less is more. Descriptive information should not all be encapsulated in a single metadata record or schema; for example, administrative records should be kept separate from object metadata; so should vCards and xCards. Any distinct schema information should be held separately, including review data (submitted, accepted, in-review) and annotations.

Cost/benefit balance. The automation of metadata record creation could change the cost/benefit balance between LOM and DC metadata. In particular, automation could reduce the high cost of creating LOM records, making LOM more attractive for digital libraries relative to DC. However, such an efficiency would still not solve some of the more fundamental semantic problems of defining and using metadata.


Figures

Figure. The iLumina home page user interface for accessing science and mathematics educational resources.

Tables

Table 1. Examples of learning technology initiatives and their metadata schema.

Table 2. iLumina element subset of LOM and corresponding DC elements.

Table 3. Changes to the LOM-controlled vocabulary.

References

    1. Duval, E., Forte, E., Cardinaels, K., Verhoeven, B., Van Durm, R., Hendriks, K., Forte, M., Ebel, N., Macowicz, M., Warkentyne, K., and Haenni, F. The ARIADNE Knowledge Pool System. Commun. ACM 44, 5 (May 2001), 73–78.

    2. IEEE. Standard for Information Technology—Education and Training Systems—Learning Objects and Metadata. IEEE Standard 1484.12.1; ltsc.ieee.org/wg12/.

    3. IEEE Learning Technology Standards Committee. Position Statement on 1484.12.1—2002 Learning Object Metadata (LOM) Standard Maintenance Revision; ltsc.ieee.org/wg12/index.html.

    4. IMS Global Learning Consortium, Inc. IMS Learning Resource Metadata Best Practices and Implementation Guide, Version 1.0—Final Specification, 2001; www.imsproject.org/metadata/mdbest01.html.

    5. ISO. Information and Documentation—The Dublin Core Metadata Element Set, ISO 15836: 2003; www.niso.org/international/SC4/n515.pdf.

    6. Liu, X., Maly, K., Zubair, M., Hong, Q., Nelson, M., Knudson, F., and Holtkamp, I. Federated searching interface techniques for heterogeneous OAI repositories. J. Digital Inform. 2, 4 (May 2002); jodi.ecs.soton.ac.uk/Articles/v02/i04/Liu/.

    7. Open Archives Initiative. The Open Archives Initiative Protocol for Metadata Harvesting, 2002; www.openarchives.org/OAI/openarchivesprotocol.htm.

    8. Soergel, D. A framework for digital library research. D-Lib Magazine 8, 12 (Dec. 2002); www.dlib.org/dlib/december02/soergel/12soergel.html.

    9. Ward, J. A quantitative analysis of unqualified Dublin Core metadata element set usage within data providers registered with the Open Archive Initiative. In Proceedings of the 2003 Joint Conference on Digital Libraries (Houston, May 27–31). IEEE Computer Society, Washington, D.C., 2003, 315–317.

    This material is based on work supported by the National Science Foundation under Grant No. 0002935. Any opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the National Science Foundation.
