Open Metadata Handbook/Recommendations
The purpose of this section is to help GLAM institutions decide which standard is best suited to describing their works. Recommendations for metadata formats should reflect practical issues rather than abstract ideals of self-describing schemas.
We caution against choosing a single metadata schema and advise instead selecting interoperable schemas. No single schema will ever work for all the varieties of resources that need to be described. The advantage of graph-based design is that a wide range of metadata users can share a core of data, while each specialized community can easily add the terms it needs without disrupting the whole.
As of today, the battle seems to be between:
- RDF-based metadata models + a list of ontologies that would integrate best with bibliographic works (e.g. foaf, bibo, etc.)
- Ad-hoc metadata formats that have been designed for one or more specific purposes. Because they are simpler, they might lower the barrier to participation in open bibliographic data provision. The point is to capture enough bibliographic metadata structure for the purposes of open bibliography, allowing for a simple, low-cost and low-tech exchange of basic bibliographic metadata.
Both have their respective qualities and drawbacks, which must be taken into account to determine which standard should be used for a particular field of application. The decision will depend on:
- the tools available for data exchange and display
- the limited resources available for conversion and creation of metadata
Although it can be costly and time-consuming to produce, metadata adds value to bibliographic records. Detailed description of materials is often limited by the amount of information known about each item, which may require significant research to complete.
Detailed, Flexible & Extensible
RDF and SPARQL provide advanced tools for the description, identification, and management of Open Bibliographic Data. A proper RDF database cannot, however, be built without a significant investment of time and money.
Simple, Rapid & Low-cost Implementation
Although digital resources need more precise description so that they can be searched and identified, this is not realistic for many large-scale digitization projects. Lightweight ad-hoc metadata formats are designed for the rapid proliferation of Open Bibliographic Data. For example, BibTeX provides a simple metadata scheme, probably nearly adequate for most present purposes, with a few conventions for handling identifiers. This scheme and suitable conventions are being incorporated into BibJSON, which offers a lightweight RDF/LD-compatible format. Full mapping of BibJSON to RDF/LD can be done by other interested parties, not necessarily the initial data provider. BibTeX itself is not recommended as a metadata format, for many reasons.
Will the metadata need to interact or be exchanged with other systems? If so, look first at the data-exchange format, and then choose the metadata format accordingly.
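As a rough illustration of what such a lightweight record might look like, the sketch below turns an already-parsed BibTeX-style entry into a BibJSON-style JSON object. The key names follow BibJSON conventions as we understand them and should be checked against the current BibJSON documentation; the sample entry and the helper name `bibtex_fields_to_bibjson` are hypothetical.

```python
import json


def bibtex_fields_to_bibjson(entry_type, cite_key, fields):
    """Map already-parsed BibTeX fields to a BibJSON-style dict (illustrative only)."""
    record = {"type": entry_type, "id": cite_key}
    for key, value in fields.items():
        if key == "author":
            # BibTeX separates authors with " and "; keep each one as its own object.
            record["author"] = [{"name": name.strip()} for name in value.split(" and ")]
        elif key in ("doi", "isbn"):
            # Assumed convention: identifiers are collected in a typed list.
            record.setdefault("identifier", []).append({"type": key, "id": value})
        else:
            record[key] = value
    return record


# Hypothetical sample entry, not taken from any real catalogue.
sample = bibtex_fields_to_bibjson(
    "book",
    "doe2010",
    {
        "title": "An Example Book",
        "author": "Doe, Jane and Roe, Richard",
        "year": "2010",
        "isbn": "0000000000",
    },
)
print(json.dumps(sample, indent=2))
```

Because the result is plain JSON, it can be dumped and aggregated with very little tooling, and a fuller mapping to RDF can be layered on later by another party.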
Interoperability requires standardized ways of recording metadata. Metadata also has to be stored in or assembled into a document format, such as XML, that promotes easy exchange of data. [METS-compliant digital objects, for example, promote interoperability by virtue of their standardized, “packaged” format.]
The best way to ensure maximum interoperability and high levels of consistency is for everyone to agree on the same schema, such as the MARC (Machine-Readable Cataloging) format or the Dublin Core (DC). However, a uniform standard approach is not always feasible or practical, particularly in heterogeneous environments where different resources are described by a variety of specialized schemas.
Interoperability at the Schema Level
The aim at this level is to ensure that data processed according to a given schema will be interoperable with other digital collections. At the schema level, interoperability actions usually take place before operational-level metadata records are created (ex-ante interoperability). Methods used to achieve interoperability at this stage mainly include derivation, application profiles, crosswalks, switching-across, frameworks, and registries.
Derivation: A new schema is derived from an existing one. For example, an existing complex schema such as the MARC format may be used as the "source" or "model" from which new and simpler individual schemas are derived. Derivation methods include adaptation, modification, expansion, partial adaptation, translation, etc. Examples:
- TEI Lite derived from the full Text Encoding Initiative (TEI).
- MODS (Metadata Object Description Schema) & MARC Lite derived from the full MARC 21 standard.
- Translation of the DC element set into different languages.
- Additional elements proposed for the Dublin Core Metadata Element Set [DCMI Metadata Terms].
- Electronic Theses and Dissertations Metadata Standard (ETD-MS). This standard uses 13 of the 15 Dublin Core elements and an additional element: thesis.degree [ETD-MS].
- Gateway to Educational Materials (GEM) expansion of Dublin Core. Additional elements include: cataloging, essential resources, pedagogy, standards, and duration [GEM Element Set].
- Rare Materials Descriptive Metadata developed by the Peking University Library. It uses 12 DC elements, plus edition and physical description as two local core elements and a collection history element for the third level extension [Yao et al., 2004].
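The derivation pattern in the examples above (keep part of a base element set, then add local elements) can be sketched as follows. The fifteen Dublin Core element names are the standard ones, but the choice of which elements to drop and the local element names are arbitrary assumptions for illustration; they do not reproduce ETD-MS, GEM, or the Peking University schema.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def derive_schema(base, drop=(), add=()):
    """Derive a new element list from `base` by dropping and adding elements."""
    unknown = set(drop) - set(base)
    if unknown:
        raise ValueError(f"cannot drop elements not in the base schema: {sorted(unknown)}")
    kept = [element for element in base if element not in drop]
    # Local additions are appended once, after the inherited elements.
    return kept + [element for element in add if element not in kept]


# Hypothetical derived schema: the dropped and added elements are arbitrary choices.
local_schema = derive_schema(
    DC_ELEMENTS,
    drop=("source", "relation", "coverage"),
    add=("edition", "physicalDescription"),
)
print(len(local_schema), local_schema)
```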
Application profiles consist of metadata elements drawn from one or more metadata schemas and combined into a compound schema. They ensure a similar basic structure and common elements, while allowing for varying degrees of depth and detail and for different user communities. Examples:
- Australasian Virtual Engineering Library's AVEL Metadata Set consists of nineteen elements. In addition to supporting 14 DC elements (excluding dc.source element), it also supports one AGLS (Australian Government Locator Service) metadata element (AGLS.Availability), one EDNA (Education Network Australia) element (EdNA.Review), and three Administrative elements (AC.Creator, AC.DateCreated, and AVEL.Comments).
Application profiles may also be based on one single schema but tailored to different user communities. Examples:
- DC-Library Application Profile (DC-Lib) clarifies the use of the DC metadata element set in libraries and library-related applications and projects.
- DC Government Application Profile clarifies the use of DC in a government context.
- Biological Data Profile of the National Biological Information Infrastructure (NBII) is based on the Content Standard for Digital Geospatial Metadata (CSDGM) of the Federal Geographic Data Committee (FGDC).
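An application profile can be thought of as a list of elements drawn from several namespaces, each with its own usage rules, against which records are checked. The sketch below is a minimal illustration under that assumption: the element labels are written in the style of the AVEL example above, but the required/repeatable rules and the sample record are invented for the example.

```python
# Hypothetical profile mixing elements from several schemas.
# The obligation rules are assumptions, not the actual AVEL rules.
PROFILE = {
    "DC.Title": {"required": True, "repeatable": False},
    "DC.Creator": {"required": True, "repeatable": True},
    "DC.Subject": {"required": False, "repeatable": True},
    "AGLS.Availability": {"required": False, "repeatable": False},
    "EdNA.Review": {"required": False, "repeatable": True},
    "AVEL.Comments": {"required": False, "repeatable": True},
}


def validate(record, profile):
    """Return a list of problems found when checking `record` against `profile`."""
    problems = []
    for element, rule in profile.items():
        values = record.get(element, [])
        if rule["required"] and not values:
            problems.append(f"missing required element {element}")
        if not rule["repeatable"] and len(values) > 1:
            problems.append(f"{element} is not repeatable")
    for element in record:
        if element not in profile:
            problems.append(f"{element} is not part of the profile")
    return problems


# Each element holds a list of values, so repeatability can be checked.
record = {
    "DC.Title": ["Bridge design notes"],
    "DC.Creator": ["Doe, Jane"],
    "LOM.Difficulty": ["easy"],  # element from a schema outside the profile
}
print(validate(record, PROFILE))
```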
Crosswalks: A crosswalk is a mapping of the elements, semantics, and syntax from one metadata scheme to another. It is usually presented as a chart or table that represents the semantic mapping of data elements in one data standard (source) to those in another standard (target), based on the similarity of function or meaning of the elements. Crosswalks enable heterogeneous collections to be searched simultaneously with a single query as if they were a single database (semantic interoperability). Mapping from an element in one scheme to an analogous element in another scheme requires that the meaning and structure of the data be shareable between the two schemes, in order to ensure the usability of the converted metadata. Most existing formats can easily be mapped to ad-hoc formats, although the conversion is often lossy. Examples:
- Almost all schemas have created crosswalks to popular schemas such as DC, MARC, LOM, etc.
- VRA Core 3.0, which lists mapped elements in target schemas VRA 2.0 (an earlier version), CDWA, and DC.
One of the problems of crosswalking is the different degrees of equivalency: one-to-one, one-to-many, many-to-one, and one-to-none. This means that when mapping individual elements, often there are no exact equivalents. Meanwhile, many elements are found to overlap in meaning and scope. For this reason, data conversion based on crosswalks could create quality problems.
- MARC, Z39.50, SRU/SRW, BibJSON?
- Conversion between ad-hoc formats requires defining an extensibility mechanism and methods for vocabulary alignment (e.g. stating that your "title" is the same as dc:title or some other schema's title).
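The degrees of equivalency described above can be made concrete with a small crosswalk table. This is a minimal sketch: the source field names belong to a hypothetical local format, the targets are Dublin Core element names, and the mapping choices are assumptions made only to show one-to-one, many-to-one, one-to-many, and one-to-none cases.

```python
# Hypothetical crosswalk from a local ad-hoc format to Dublin Core elements.
CROSSWALK = {
    "main_title": ["title"],          # one-to-one
    "illustrator": ["contributor"],   # many-to-one (together with "editor")
    "editor": ["contributor"],
    "imprint": ["publisher", "date"], # one-to-many: the whole string is copied twice
    "shelf_mark": [],                 # one-to-none: no equivalent in the target
}


def convert(record, crosswalk):
    """Convert `record` and report the fields that could not be carried over."""
    converted, lost = {}, {}
    for field, value in record.items():
        targets = crosswalk.get(field, [])
        if not targets:
            lost[field] = value
        for target in targets:
            converted.setdefault(target, []).append(value)
    return converted, lost


source_record = {
    "main_title": "An Example Book",
    "illustrator": "Roe, Richard",
    "editor": "Doe, Jane",
    "imprint": "Example Press, 2010",
    "shelf_mark": "X 12",
}
converted, lost = convert(source_record, CROSSWALK)
print(converted)
print("lost in conversion:", lost)
```

The output shows both kinds of quality problem: the roles of illustrator and editor are merged into a single element, and the shelf mark disappears unless the dropped fields are kept somewhere alongside the converted record.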
Switching-across: A switching schema (new or existing) is used to channel crosswalking among multiple schemas. Instead of mapping between every pair in the group, each of the individual metadata schemas is mapped to the switching schema only. Examples:
- Getty's crosswalk, in which seven schemas all crosswalk to CDWA.
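The saving offered by a switching schema can be shown with a little arithmetic and a toy conversion. The sketch below assumes directed crosswalks (one per ordered pair of schemas) and one-to-one mappings that can simply be inverted; the schema names and fields are hypothetical and are not taken from the Getty crosswalk.

```python
def pairwise_crosswalks(n):
    """Directed crosswalks needed when every schema maps to every other one."""
    return n * (n - 1)


def switched_crosswalks(n):
    """Directed crosswalks needed when each schema maps to and from one switching schema."""
    return 2 * n


# Hypothetical mappings from two local schemas to a switching schema.
TO_HUB = {
    "schema_a": {"name": "title", "maker": "creator"},
    "schema_b": {"heading": "title", "artist": "creator"},
}


def via_hub(record, source, target):
    """Convert a record from `source` to `target` through the switching schema."""
    to_hub = TO_HUB[source]
    # Inverting the target mapping only works because it is one-to-one here.
    from_hub = {hub: local for local, hub in TO_HUB[target].items()}
    out = {}
    for field, value in record.items():
        hub_field = to_hub.get(field)
        if hub_field in from_hub:
            out[from_hub[hub_field]] = value
    return out


print(pairwise_crosswalks(7), switched_crosswalks(7))
print(via_hub({"name": "Vase", "maker": "Unknown"}, "schema_a", "schema_b"))
```

For seven schemas this gives 42 directed pairwise crosswalks against 14 through the switching schema, and adding an eighth schema requires only two new mappings.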
Framework: A framework can be considered as a skeleton upon which various objects are integrated for a given solution. Two approaches are possible for building a metadata framework: 1) establishing a framework before the development of individual schemas and applications, and 2) building a framework based on existing schemas. Examples:
- Open Archival Information System (OAIS) reference model, issued as a recommendation by the Consultative Committee for Space Data Systems (CCSDS) and adopted by ISO as ISO 14721. It establishes a common framework of terms and concepts that comprise an Open Archival Information System, providing a basis for further standardization within an archival context.
- Metadata framework currently used in the DLESE (Digital Library for Earth System Education) Discovery System. After a few years' exploration of establishing a framework for DLESE metadata based on the IMS (Instructional Management Systems) Learning Resource Meta-data Specification, the Alexandria Digital Earth Prototype (ADEPT) project, DLESE, and NASA's Joined Digital Library (JDL) decided in June 2001 to create an ADN metadata framework that all three organizations can use. The purpose of the ADN framework, as stated on its web page, is to "describe resources typically used in learning environments (e.g., classroom activities, lesson plans, modules, visualizations, some datasets) for discovery by the Earth system education community."
Registry: The purpose of a metadata registry is fairly straightforward: to collect data regarding metadata schemas. Metadata registries are expected to "provide the means to identify and refer to established schemas and application profiles, potentially including the means for machine mapping among different schemas."
- Cross-domain and cross-schema registry. For example, UKOLN (UK Office for Library Networking)'s SCHEMAS Registry, now used in the new CORES project, contains several metadata element sets and related documents. Through a web interface, one can search or browse according to agencies, element sets, elements, encoding schemes, application profiles, and element usages that are included in this registry. Currently the registry consists of 12 element sets from 10 institutions [CORES].
- Domain-specific, cross-schema registry. For example, UKOLN's MEG (Metadata for Education Group) Registry facilitates schema registration within the educational domain [MEG Registry].
- Project-specific registry. The European Library (TEL) metadata registry [TEL] was established for the purpose of recording all metadata activities associated with TEL. The registry contains translations of element names in different languages and declares whether the element is repeatable, searchable, and mandatory [Van Veen and Oldroyd, 2004].
- Schema-specific registry, such as Dublin Core Metadata Initiative's (DCMI) Registry or Open Data Registry [Dublin Core Metadata Registry], for recording the valid elements within the DC schema. Currently the registry provides details regarding the elements, element refinements, controlled vocabulary terms (DCMI-Type Voc.), and vocabulary and encoding schemas.
Interoperability at the Record Level
Often, a particular metadata schema has been adopted and metadata records have already been created before the issue of interoperability is carefully considered. Converting metadata records then becomes one of the few options for integrating established metadata databases. It is sometimes desirable to use domain-specific metadata standards in combination with each other. Data providers should be able to assemble the components required for some particular set of functions, even if that means drawing on components specified within different metadata standards. Data providers should also be assured that the result can be interpreted by independently designed applications.
Conversion of Metadata Records
Record-centric approach: This is the traditional top-down approach of library data (i.e. producing MARC records as stand-alone descriptions of library material), with lower costs and easier implementation. Ad-hoc formats give data providers a simple exchange format in which to dump their records, which can then easily be extracted and aggregated. [The focus is on the records, which is what we want to get hold of and make openly accessible.] Conversion is done ex post, according to the lowest common denominator, with a risk of lossy conversion: data gets lost rather than enriched. The major challenge is how to minimize loss or distortion of data. Examples:
- Library of Congress provides tools (available at <http://www.loc.gov/standards/mods/>) to convert between the MARC record and the MODS record, and between the DC record and the MODS record.
- The Picture Australia project serves as a good example of data conversion. It is a digital library project encompassing a variety of institutions, including libraries, the National Archives, and the Australian War Memorial, many of which came with legacy metadata records prepared under different standards. Records from participants are collected in a central location (the National Library of Australia) and then translated into a "common record format," with fields based on the Dublin Core.
- National Science Digital Library (NSDL) Metadata Repository, into which metadata records from various collections were harvested. For instance, ADL (Alexandria Digital Library) metadata records had to be converted into Dublin Core records when they were harvested by the NSDL Metadata Repository. When an ADL record is converted into a DC-based record for display, value strings in the ADL elements are displayed in the equivalent DC elements. For example, coordinates recorded in ADL Bounding Coordinates now appear in DC Coverage, and Producer becomes Source.
The problem is that data values may be lost when converting from a rich structure to a simpler structure. Other complicated situations include converting value strings associated with certain elements that require the use of controlled vocabularies.
Data Reuse and Integration
Linked Data: There is no conversion, but rather a mechanism that makes it possible to identify common concepts in different databases (artists, events, etc.) without being limited to the lowest common denominator.
In a modular metadata environment, different types of metadata elements (descriptive, administrative, technical, use, and preservation) from different schemas, vocabularies, applications, and other building blocks can be combined in an interoperable way. The Lego metaphor describes this process well: an application designer should be able to "snap together" selected "building blocks" drawn from the "kits" provided by different metadata standards to build the construction that meets their requirements, even if the kits that provide those blocks were created independently. The components of a metadata record can also be regarded as pieces of a puzzle. They can be put together by combining pieces of metadata coming from different processes (human or machine), and they can be used and reused piece by piece when new records need to be generated by human or machine. Examples:
- Metadata Encoding and Transmission Standard (METS) provides a framework for incorporating various components from various sources under one structure and also makes it possible to "glue" the pieces together in a record. METS is a standard for packaging descriptive, administrative, and structural metadata into one XML document for interactions with digital repositories. The descriptive metadata section in a METS record may point to descriptive metadata external to the METS document such as a MARC record in an Online Public Access Catalog (OPAC) or an Encoded Archival Description (EAD) finding aid maintained on a WWW server. Or, it may contain internally embedded descriptive metadata. It can therefore provide a useful standard for the exchange of digital library objects between collections or repositories.
- Resource Description Framework (RDF) of the World Wide Web Consortium (W3C) is another model that "provides a mechanism for integrating multiple metadata schemes." It is a data model that provides a framework within which independent communities can develop vocabularies that suit their specific needs and share vocabularies with other communities. Multiple namespaces may be defined to allow elements from different schemas to be combined into a single resource description. An RDF record links multiple descriptions, which may have been created at different times for different purposes, to one another. RDF, together with useful principles that come out of the semantic web community, helps with interoperability and expansion of metadata. Proper vocabulary alignment requires an accurate mapping through RDF ontologies.
RDF can be combined with specific protocols for linking data together, in order to allow for better:
- Standardisation: Linked Data methods support the retrieval and re-mixing of data in a way that is consistent across all metadata providers.
- Interoperability: Linked Data favors interdisciplinarity by enriching knowledge through linking among multiple domain-specific knowledge bases: i.e. the totality of datasets using RDF and URIs presents itself as a global information graph that users and applications can seamlessly browse.
- Decentralization: With Linked Data, different kinds of data about the same asset can be produced in a decentralized way by different actors, then aggregated into a single graph. Resources can be described in collaboration with other GLAM institutions and linked to data contributed by other communities or even individuals.
- Efficiency: GLAM institutions can create an open, global pool of shared data that can be used and re-used to describe resources. Linked Open Data enables institutions to concentrate their effort on their domain of local expertise, rather than having to re-create existing descriptions that have been already elaborated by others.
- Resiliency: Linked Data is more durable and robust than metadata formats that depend on a particular data structure, because it describes the meaning of data ("semantics") separately from specific data structures ("syntax" or "formats").
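The decentralization point can be illustrated without any RDF library by treating a graph as a set of (subject, predicate, object) triples. In this minimal sketch the Dublin Core Terms and FOAF property URIs are real, while the example.org resource URIs, the two providers, and the convention of marking literals with surrounding double quotes are assumptions made for illustration; a real project would more likely use an RDF toolkit.

```python
DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical resource URIs shared by both providers.
book = "http://example.org/book/1"
person = "http://example.org/person/7"

# Each provider publishes its own statements about the same resources.
library_graph = {
    (book, DCT + "title", '"An Example Book"'),
    (book, DCT + "creator", person),
}
archive_graph = {
    (book, DCT + "creator", person),
    (person, FOAF + "name", '"Jane Doe"'),
}


def merge(*graphs):
    """Merging graphs is a set union: identical triples collapse into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def to_ntriples(graph):
    """Serialize triples as N-Triples-style lines (literals start with a quote)."""
    lines = []
    for subject, predicate, obj in sorted(graph):
        rendered = obj if obj.startswith('"') else f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


merged = merge(library_graph, archive_graph)
print(to_ntriples(merged))
```

Because both providers use the same URIs for the book and the person, no field-by-field conversion is needed: the shared statement appears once and the archive's extra statement simply enriches the description.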
Interoperability at the Repository Level
When multiple sources are searched through a single search engine, one of the major problems is that the retrieved results are rarely presented in a consistent, systematic, or reliable format. A metadata repository provides a viable solution to such interoperability problems by maintaining a consistent and reliable means of accessing data. One question a repository faces is whether to allow each original metadata source to keep its own format. If not, how would it convert / integrate all metadata records into a standardized format? If so, how would it support cross-collection search?
Open Archives Initiative (OAI) Protocol
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol whose goal is to supply and promote an application-independent interoperability framework that can be used by a variety of communities engaged in publishing content on the Web.
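In practice, an OAI-PMH harvester issues HTTP requests with a `verb` parameter and parses the XML that comes back. The sketch below builds a `ListRecords` request for the `oai_dc` metadata format and extracts titles from a response. The verb, the parameter names, and the namespaces are those of the protocol; the base URL, the set name, and the heavily trimmed sample response are assumptions made for illustration, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL of a ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Trimmed, hand-written response: a real one also carries responseDate,
# request, datestamps, and possibly a resumptionToken.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Book</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def titles_by_identifier(xml_text):
    """Return {record identifier: [titles]} from a ListRecords response."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        result[identifier] = [t.text for t in record.iterfind(".//dc:title", NS)]
    return result


print(list_records_url(BASE_URL))
print(titles_by_identifier(SAMPLE_RESPONSE))
```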
Multiple Formats without Record Conversion
A different approach that circumvents the need to convert metadata records in an integrated service is taken by the Digital Library for Earth System Education (DLESE). The mechanism that resulted from this effort – the DLESE Collection System (DCS) – is a tool that allows participants to build their own collections of Earth system item-level metadata records, and to develop, manage, search, and share these collections, all without converting every metadata record into a uniform format. The metadata records for each collection are structured according to an XML schema that specifies required and optional metadata (and in some cases legal values) for a particular metadata field. The DLESE Collection System currently supports the DLESE metadata frameworks of ADN (ADEPT/DLESE/NASA) for resources typically used in learning environments. Other XML schema-based metadata frameworks can be supported by configuring the DLESE Collection System to point to the XML schema file.
The NSDL Metadata Repository employs an automated "ingestion" system based on OAI-PMH, whereby metadata flows into the Metadata Repository with a minimum of ongoing human intervention. The NSDL, from this perspective, functions essentially as a metadata aggregator. The notion behind this process is that each metadata record contains a series of statements about a particular resource, and therefore metadata from different sources can be aggregated to build a more complete profile of that resource. As a result, several providers might contribute to an augmented metadata record. These enhancements are exposed via OAI-PMH, and the Metadata Repository can then harvest them.
Element-based and Value-based Crosswalking services
While crosswalks have paved the way to a relatively effective exchange and sharing of schemas and data, there is a further need for effective crosswalks to solve the everyday problem of ensuring consistency in large databases that are built from records from multiple sources. OCLC researchers have developed a model that associates three pieces of information: the crosswalk, the source metadata standard, and the target metadata standard. The work proceeded from the hypothesis that "usable crosswalks must have the following characteristics: (1) A set of mappings between metadata standards that is endorsed by a stakeholder community. (2) A machine-processable encoding. (3) A well-defined relationship to source and target metadata standards, which must make reference to particular versions and syntactic encodings."
Value-based Mapping for Cross-database searching
The Multilingual Access to Subjects (MACS) project illustrates another value-based mapping approach to achieving interoperability among existing metadata databases. MACS is a European project designed to allow users to search across library cataloging databases of partner libraries in different languages, which currently include English, French, and German. Specifically, the project aims to provide multilingual subject access to library catalogs by establishing equivalence links among three lists of subject headings: SWD/RSWK (Schlagwortnormdatei / Regeln für den Schlagwortkatalog) for German, Rameau (Répertoire d'autorité-matière encyclopédique et alphabétique unifié) for French, and LCSH (Library of Congress Subject Headings) for English.
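The idea of equivalence links can be sketched as a small lookup table used to expand a query entered in one language into the corresponding headings of the other lists. The headings below are illustrative values chosen for the example and are not taken from the actual MACS link data; the function name `expand_query` is likewise hypothetical.

```python
# Each entry links headings considered equivalent across the three lists.
# Illustrative values only, not actual MACS links.
LINKS = [
    {"LCSH": "Libraries", "RAMEAU": "Bibliothèques", "SWD": "Bibliothek"},
    {"LCSH": "Museums", "RAMEAU": "Musées", "SWD": "Museum"},
]


def expand_query(heading, vocabulary):
    """Return the equivalent headings in every linked vocabulary."""
    for link in LINKS:
        if link.get(vocabulary) == heading:
            return dict(link)
    # No link known: the query can only be run against its own vocabulary.
    return {vocabulary: heading}


print(expand_query("Musées", "RAMEAU"))
print(expand_query("Archives", "LCSH"))
```

A catalogue search front end could then send each expanded heading to the partner catalogue indexed with the matching subject heading list.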
Value-based Co-occurrence Mapping
With regard to searching, co-occurrence mapping is similar to what is done in the MACS project discussed above. However, this approach uses the values already present in the subject fields and regards the subject terms in different languages registered in the subject fields of the same record as equivalent. When a metadata record includes terms from multiple controlled vocabularies, the co-occurrence of subject terms enables an automatic, loose mapping between vocabularies. As a group, these loosely-mapped terms can answer a particular search query or a group of questions. Existing metadata standards and best practice guides have provided an opportunity to use the co-occurrence mapping method.
- The art- and image-related metadata standard VRA Core Categories version 3.0 requires the use of the Art and Architecture Thesaurus (AAT) for the Type, Material, and Style/Period elements; and, for the Culture and Subject elements, recommends the use of AAT, LCSH, Thesaurus of Graphic Materials (TGM), ICONCLASS (an international classification system for iconographic research and the documentation of images), and Sears Subject Headings.
- Another example of co-occurrence mapping source is the Gazetteer Standard Report of the Alexandria Digital Library. Under Feature Class, terms from two controlled vocabularies are recorded.
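Co-occurrence mapping can be sketched as counting how often a term from one vocabulary appears in the same record as a term from another, then keeping the pairs seen often enough. The records, the terms, and the threshold below are hypothetical; only the vocabulary names come from the examples above.

```python
from collections import Counter
from itertools import product

# Hypothetical records whose subject fields carry terms from several vocabularies.
records = [
    {"AAT": ["vases"], "LCSH": ["Vases"]},
    {"AAT": ["vases"], "LCSH": ["Vases", "Pottery"]},
    {"AAT": ["portraits"], "LCSH": ["Portraits"]},
    {"AAT": ["portraits"], "TGM": ["Portraits"]},
]


def cooccurrence(records, vocab_a, vocab_b):
    """Count how often each (term_a, term_b) pair occurs in the same record."""
    counts = Counter()
    for record in records:
        for pair in product(record.get(vocab_a, []), record.get(vocab_b, [])):
            counts[pair] += 1
    return counts


def loose_mapping(counts, min_count=2):
    """Keep only the pairs seen at least `min_count` times."""
    mapping = {}
    for (term_a, term_b), n in counts.items():
        if n >= min_count:
            mapping.setdefault(term_a, []).append(term_b)
    return mapping


counts = cooccurrence(records, "AAT", "LCSH")
print(counts)
print(loose_mapping(counts))
```

The threshold is the tuning knob: a low value yields more candidate equivalences but also more noise, which is why the resulting mapping is described as loose.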
RDF provides a robust and flexible architecture for supporting metadata. The goal is to support interoperability by providing a common framework to describe any item that can have a Uniform Resource Identifier (URI). A large number of ontologies expressed in RDF are available to support the exchange of knowledge on the web. [TODO: include a series of OWL ontologies for different types of bibliographic works.]
The Dublin Core (DC) is a popular and widely accepted metadata standard for describing physical resources such as books, digital materials such as video, sound, image, or text files, and composite media like web pages. The DC is a flexible standard characterized by simplicity, extensibility, and interoperability. The main advantage of Dublin Core is that it can be incorporated in various metadata models, including, but not limited to, RDF/OWL. This means that it can be used by any GLAM institution, whether or not it is willing to invest the money and time to set up an RDF-based database. The DC standards support cross-resource discovery by acting as intermediaries between a large number of community-specific formats.
Google, Microsoft, and Yahoo have launched the "schema.org" initiative, intended to allow the annotation of Web pages in very simple ways recognized by major search providers. The particularity of Schema.org is that it was not designed to describe things well, but rather to provide better search results. The advantage of Schema.org is that it is based on a very simple set of microdata that is easy to include in any page that comes from an OPAC.
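To give an idea of how little is needed, the sketch below generates a microdata fragment for a catalogue record, as an OPAC page template might. `Book`, `name`, `author`, and `isbn` are Schema.org terms; the function name, the choice of `span` elements, and the sample values are assumptions made for illustration.

```python
from html import escape


def book_microdata(title, author, isbn=None):
    """Return an HTML fragment describing a book with Schema.org microdata."""
    parts = [
        '<div itemscope itemtype="https://schema.org/Book">',
        f'  <span itemprop="name">{escape(title)}</span>',
        f'  <span itemprop="author">{escape(author)}</span>',
    ]
    if isbn:
        parts.append(f'  <span itemprop="isbn">{escape(isbn)}</span>')
    parts.append("</div>")
    return "\n".join(parts)


# Hypothetical record values.
fragment = book_microdata("Maps & Atlases", "Doe, Jane", isbn="0000000000")
print(fragment)
```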
Ontology for Media Resources
The Ontology for Media Resources 1.0 (http://www.w3.org/TR/mediaont-10/) is presently a “W3C Candidate Recommendation” (W3C = World Wide Web Consortium). It will evolve into a full “W3C Recommendation” as soon as the work on the corresponding API (see API for Media Resources 1.0, http://www.w3.org/TR/mediaont-api-1.0/), which provides uniform access to all of its elements, is completed. This Media Ontology is both i) a core vocabulary, i.e., a set of properties describing media resources, selected taking into account the metadata formats currently in use, and ii) a mapping between its set of properties and the elements of some metadata formats presently published on the Web, e.g., Dublin Core, EXIF 2.2, IPTC, Media RSS, MPEG-7, QuickTime, XMP, YouTube, etc. The purpose of the mapping is to provide an interoperable set of metadata, to be shared and reused among different applications. Ideally, the mapping should preserve the semantics of a metadata item across metadata formats. In reality, this cannot be done in general because of differences in the definition of the associated values: see, e.g., the property “dc:creator” from the Dublin Core and the property “exif:Artist” defined in the Exchangeable Image File Format (EXIF), both mapped to the property “creator” in the Media Ontology. “Types” of mapping are therefore defined in the Ontology: "exact", "more specific", "more generic" and "related". Mechanisms for correcting the possible loss of semantics when mapping back and forth between properties from different schemata using only the Media Ontology are beyond the scope of the Media Ontology work. A Semantic Web compatible implementation of the Ontology in terms of the Semantic Web languages RDF and OWL is also available, and is presented in Section 7 of the http://www.w3.org/TR/mediaont-10/ document.
If we want to endorse the use of BibJSON, we should perhaps conclude the guide with two alternative approaches:
- the simple approach, which is to adopt the BibJSON model;
- the more powerful, albeit more complex, approach, which is to adopt RDF in a manner that can be easily mapped to BibJSON and managed with BibServer. (This is the current approach used in Bibliographica, if I am not mistaken?)
Could anyone from the list edit the wiki in order to provide some additional details on that approach? Mark? ;)