Open Metadata Handbook/Technical Overview
Anatomy of Metadata
To understand a piece of metadata you need to know the special 'language' (structure) it uses. This 'language' is like a secret code the Army uses to scramble messages, but for metadata we are 'scrambling' it so computers can understand it and the 'codebook' to decipher it is called a 'metadata standard' (which is written down using things called schemas, data models, elements, etc.).
The point of metadata standards is to make it easier to find similar items described by similar terms and constructs - this is harder when the descriptions are free form or free-text.
To do this, a metadata standard aims to specify three layers:
- Vocabulary (what the pieces of information are): a particular set of metadata elements (fields) that can be used to describe objects
- Format (how the above pieces of information are arranged): a data model (structure)
- Syntax (how the above format is expressed, i.e. written down): a specific serialization and data format.
So each metadata record consists of the various metadata elements describing an object organized into a particular data format and expressed according to a particular serialization (generally XML or other machine-readable formats).
Metadata elements define the vocabulary (field names) used to express the content of a metadata schema.
A metadata data model describes the structure of the metadata schema, independently of the vocabulary that is being used. It determines how the data is organized, i.e. the rules created to structure the fields or elements of metadata (as opposed to their content).
A serialization turns the elements and data model into actual bits and bytes. Every metadata format must be expressed in a particular markup language or serialization (JSON, XML, etc.). However, not every metadata model depends on one specific serialization: the same metadata model may be expressed in a number of different markup or programming languages, each of which requires a different syntax.
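As a rough illustration of the three layers, the following Python sketch keeps one vocabulary and one data model and writes the same record in two serializations. The element names and the record itself are hypothetical examples, not taken from any particular standard.

```python
import json
import xml.etree.ElementTree as ET

# Vocabulary: the element names (hypothetical, loosely Dublin Core-like).
# Data model: a flat record mapping each element to a list of values.
record = {
    "title": ["Hamlet"],
    "creator": ["William Shakespeare"],
    "date": ["1603"],
}

def to_json(rec):
    """Serialize the record as JSON text."""
    return json.dumps(rec, sort_keys=True)

def to_xml(rec):
    """Serialize the same record as XML text."""
    root = ET.Element("record")
    for element in sorted(rec):
        for value in rec[element]:
            child = ET.SubElement(root, element)
            child.text = value
    return ET.tostring(root, encoding="unicode")

def from_xml(text):
    """Parse the XML form back into the same data model."""
    rec = {}
    for child in ET.fromstring(text):
        rec.setdefault(child.tag, []).append(child.text)
    return rec

print(to_json(record))
print(to_xml(record))
```

Both outputs carry the same elements arranged according to the same model; only the syntax differs, which is why either form can be read back into an identical record.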
Keeping formats as simple as possible lowers the barrier to compliance. Ad-hoc metadata formats are much easier to deal with: the documents are easy to parse, there are no hierarchical dependencies of any sort, and they can be extremely handy for database insertion and extraction (e.g. Google's Bigtable, CouchDB and other non-relational or NoSQL databases). However, most of these formats are inherently incompatible with each other and cannot be processed unless appropriate documentation is provided. The meaning of the markup is implemented in the logic of the parser: each format defines its own specification, with a specific series of tags that are considered valid (e.g. the Facebook, Twitter and Google APIs).
A good reference for understanding metadata can be found at the following address: http://www.niso.org/publications/press/UnderstandingMetadata.pdf
Why are they different
In the context of metadata, one size does not fit all. Different communities have different needs for metadata and use it for very diverse applications. Even amongst communities with common metadata needs, different metadata formats are used to represent different things.
Library-specific standards suffer from a lack of standardisation. Many library standards, such as MARC or Z39.50, have been or are being developed in a library-specific context. Standardization in libraries is often undertaken by bodies dedicated only to the domain, such as IFLA or the JSC for the development of RDA.
Common Metadata elements
Metadata elements can be subdivided into three basic categories:
- descriptive metadata elements (the insides): provide information about the content and context of an object
- technical or structural metadata elements (the container): provide information about the format, the process, and relationship between objects
- administrative metadata elements (the outside): provide information needed to manage or use an object
Objects generally also have a unique identifier metadata element.
- Dublin Core Metadata Element Set, Version 1.1: http://dublincore.org/documents/dces/
- Table of Core Metadata Elements for Library of Congress Digital Repository Development: http://www.loc.gov/standards/metable.html
Most standards can be expanded with additional metadata elements that better fit the needs of specialized communities. e.g. the Dublin Core Metadata Initiative provides a framework for designing a Dublin Core Application Profile (DCAP). Different communities can define specialized metadata records that better fit their needs while maintaining semantic interoperability with other applications on the basis of globally defined vocabularies and models.
In this section, we will provide a general overview of key metadata elements used for the discovery, identification and description of different types of works, such as books, articles, phonograms, photographs, films, artworks, etc.
For a more detailed overview, see e.g. http://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset/#Metadata_Element_Sets
- creator(s) (taking into account that some books may be anonymous)
- date of publication
- place of publication
- identifiers (e.g. ISBN, ISSN, DOI, etc)
- links (e.g. URL if online)
- type (e.g. bibliography, encyclopedia, novel, etc)
- topic (tags)
- description (abstract)
- no. of pages
- volume (if appropriate)
- start page / end page (e.g. for book sections or articles)
- format (e.g. hard-cover, paper-back, digital format, pdf, html etc)
- date of creation
- date of first publication
- date of birth/death of creator(s)
- copyright status
- date of last access / last update (for online works)
- subject (tags)
- description (abstract).
- file format
- set (if in a series)
- creation date
- copyright permissions
- provenance (history)
Common Metadata Formats
This section is intended to provide a general overview of common metadata formats used in the bibliographic field. We want to focus only on a few examples, giving a detailed description of the most commonly used formats rather than a comprehensive list of available formats. For each metadata format, we will highlight the historical context in which they have been introduced, the objectives that they are meant to achieve, their corresponding pros & cons, and, when possible, a personal note or quote from a key individual involved either in the development or in the usage of that metadata format.
Unstructured data does not have a predefined data model, or has one that does not properly fit into relational tables. Unstructured information is generally presented in the form of text containing relevant data such as dates, numbers, or other facts. As opposed to data stored as records in databases or semantically annotated in a document, unstructured data contains ambiguities and irregularities that make it difficult for a machine to process or understand. Data that comes with some form of structure might still be regarded as unstructured if the selected structure is not properly documented or is impractical for the desired processing task.
Unstructured data can be converted into structured data through a variety of methods. Common techniques for structuring text usually involve manual tagging with metadata, data mining or text analytics techniques. For instance, although most of Wikipedia content is unstructured data, by processing this information, it is possible to extract meaning and create structured data about the information. DBpedia is an effort to publish structured data extracted from Wikipedia: the data is published in RDF and made available on the Web for use under the GNU Free Documentation License, thus allowing Semantic Web agents to provide inferencing and advanced querying over the Wikipedia-derived dataset and facilitating interlinking, re-use and extension in other data-sources.
MARC is an international descriptive metadata format. The MARC standard defines the following components:
- Markup: the metadata element set
- Semantics: the meaning of elements (although their content is defined by other standards)
- Structure: the syntax for communication
There are many different MARC versions: national agencies (in France, the US, the UK, etc.) originally developed their own national MARCs, which were then unified in an international UNIMARC. However, in recent years, the US MARC formats have prevailed over UNIMARC because they are used in US catalogues, whose data are also imported outside the US. In practice, library catalogues in different countries may still use different MARC versions.
MARC fields are connected with the International Standard Bibliographic Description (ISBD), developed by the international library community over decades, in which elements are marked by punctuation. Although ISBD may look complex, very simple uses are also allowed, such as: Title / Author. - City : Publisher, year.
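Because ISBD marks elements by punctuation, the very simple form quoted above can be split back into fields mechanically. The sketch below is only an illustration for that one simple pattern: the regular expression, the field names and the sample line are assumptions made for this example, and real ISBD descriptions are far more varied.

```python
import re

# Matches only the simple pattern "Title / Author. - City : Publisher, year."
ISBD_SIMPLE = re.compile(
    r"^(?P<title>.+?) / (?P<author>.+?)\. - "
    r"(?P<place>.+?) : (?P<publisher>.+?), (?P<year>\d{4})\.?$"
)

def parse_simple_isbd(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = ISBD_SIMPLE.match(line.strip())
    return match.groupdict() if match else None

# Hypothetical sample description, written for this example.
example = "Hamlet / William Shakespeare. - London : Penguin, 1980."
print(parse_simple_isbd(example))
```

A line that does not follow the punctuation pattern yields None, which is a small-scale view of why free-text descriptions are harder to process than structured ones.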
- XML serialization
- MARC 21 is expressed in an XML structure.
- Due to its long-standing popularity, a coordinated set of tools has been developed to improve the interoperability of MARC 21 with other metadata formats: e.g. transformations to and from other standard formats (DC, ONIX, …)
- Widespread use of bibliographic utilities and ILS implementations based on MARC for standard communication format with predictable content and for the sharing of records: e.g. standardize MARC 21 for OAI harvesting;
- The MARC cataloguing standard is slowly becoming obsolete. MARC is great for describing books but less so for other types of media. The problem lies in the fact that MARC was conceived to describe single published items (i.e. monographs). With the internet, the use of MARC might become more problematic, as multimedia formats require different elements and categories.
Like MARC 21, MAB 2 is part of the ISO 2709 format family. MAB stands for "Maschinelles Austauschformat für Bibliotheken" (automated exchange format for libraries) and is in part very similar to MARC, but with a less static structure, i.e. there are links to semantically related concepts. In general, MAB is a bit more diversified.
Just like MARC, there are many different MAB versions.
- XML serialization
- MAB 2 is expressible in an XML structure.
- not very interoperable
- only a handful of library unions use MAB
- not as many utilities available as for MARC
- it's a fossil:
- like MARC, MAB is long overdue to be replaced by something better suited to present-day information technology. Remember that MARC/MAB is 40 years old: in the seventies, magnetic tape was the common data storage medium.
Personal Note: MAB is a legacy format and is due to be replaced by MARC soon (German Wikipedia says 2012). In my opinion that does not make much sense, since MARC itself is obsolete and the data transformation to MARC will at best result in the loss of some information.
BibJSON (http://bibjson.org/) is a simple description of how to represent bibliographic metadata in JSON. It is also based on the BibTeX model.
A JSON object is an unordered list of key-value pairs. A BibJSON object is a bibliographic record as a JSON object.
BibJSON is just JSON with some agreement on what particular keys are expected to mean. Various parsers are being written to convert from other formats into BibJSON, so as to make it easier for people to share bibliographic records and collections. See http://bibserver.okfn.org/roadmap/open-bibliography-for-stm/ and http://www.bibkn.org/bibjson/index.html
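To give a feel for "JSON with agreed keys", here is a small Python sketch of a bibliographic record as a JSON object. The key names used (type, title, author, year, identifier) are assumptions based on common BibTeX-style usage; the BibJSON site should be consulted for the keys it actually specifies, and the identifier value is a placeholder.

```python
import json

# Key names are assumptions for illustration; check them against bibjson.org.
record = {
    "type": "book",
    "title": "Hamlet",
    "author": [{"name": "William Shakespeare"}],
    "year": "1603",
    "identifier": [{"type": "isbn", "id": "0000000000"}],  # placeholder, not a real ISBN
}

# Round-trip through JSON text, as would happen when sharing records.
text = json.dumps(record, indent=2)
parsed = json.loads(text)

def author_names(rec):
    """List author names, tolerating records without an author key."""
    return [author["name"] for author in rec.get("author", [])]

print(text)
print(author_names(parsed))
```

Since the record is plain JSON, any JSON library can read it; the shared agreement on what the keys mean is what makes the record bibliographic.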
Semantic & Linked Data
The W3C standard Resource Description Framework (RDF) provides a conceptual framework for defining and using metadata. It can be subdivided into different components:
RDF - Resource Description Framework: This is the basic standard for the semantic web, describing the data model that all of the other semantic web standards are built upon. The RDF data model imposes structural constraints on the expression of application data models for consistent encoding, exchange and processing of metadata. It defines the concept of the triple and basic rules that allow this data to function in web space. RDF is formulated as a stratification of concepts and ontologies that can be extended indefinitely. The description of resources is based on objects and properties which are themselves described in RDF.
RDFS - RDF Schema: While RDF is a set of rules that do not have an actual encoding, RDFS provides a basic vocabulary (classes, properties, sub-class and sub-property relations, labels) for declaring the terms used in RDF data, so that RDF can be "made real" through applications.
RDFa - Resource Description Framework in Attributes: RDFa allows you to include Semantic Web data in an XHTML page, consistent with Tim Berners-Lee's original vision. (Much linked data today is not found in web pages but has been exported from traditional data stores such as DBMSs, and lives on the web without being related to specific web documents.)
OWL - Web Ontology Language: OWL is built on top of RDF (just as RDF/XML is built on top of XML). It enables anyone to create new vocabularies for the description of different resources. These vocabularies provide the semantic linkage needed to extract information from the raw data defined by RDF triples. A variety of ontologies have already been developed, each with a specific purpose in mind. If none of the existing ontologies is adequate for a particular application, a new ontology can be created. An "ontology" is the description of the knowledge space that your metadata will address. Using OWL you define your entities and all of your elements and relationships. You can include rules governing your data and some relationships between elements that will facilitate understanding your data in a heterogeneous, mixed-data environment like the Web.
In RDF, everything is based on the concept of "semantic triples": Subject, Property, Object
- the subject is the resource, identified by a URI / URL
- the property is another resource identified by a URI; it must be defined elsewhere (e.g. in a dictionary, namespace, schema, or ontology)
- the object can be a URI, or a "value": a string, a number, etc.
Both the subject and the object can also be a blank node (http://en.wikipedia.org/wiki/Blank_node).
RDF does not have a specific application domain. It defines a limited number of basic concepts for other ontologies to build upon them. These basic elements are:
- Classes: Resource, Class, Property, List, Literal, Numbers, etc.
- Properties: 'to be' => 'type', subClassOf, subPropertyOf, label, etc.
Everything else can be derived from that. These components are similar to the components of spoken language - e.g. "Judy owns Spot (an animal)" is "Subject Property Object (Class)", and an object can be the subject in another triple, e.g. "Spot is a Dog" (so we can work out that Judy owns a dog) - this means RDF is powerful as it can be used to describe just about anything!
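The Judy and Spot example can be sketched in a few lines of Python, with each triple stored as a (subject, property, object) tuple. This is only a toy model: plain strings stand in for URIs, and the statement that a Dog is a kind of Animal is an extra assumption added here to show how class membership can be followed upwards.

```python
# Each triple is (subject, property, object); plain strings stand in for URIs.
triples = {
    ("Judy", "owns", "Spot"),
    ("Spot", "type", "Dog"),
    ("Dog", "subClassOf", "Animal"),  # assumption added for this example
}

def objects(subject, prop):
    """All objects of triples with the given subject and property."""
    return {o for s, p, o in triples if s == subject and p == prop}

def types_of(resource):
    """Direct types plus every superclass reached through subClassOf."""
    found = set()
    pending = list(objects(resource, "type"))
    while pending:
        cls = pending.pop()
        if cls not in found:
            found.add(cls)
            pending.extend(objects(cls, "subClassOf"))
    return found

def owns_a(owner, cls):
    """True if the owner owns something that belongs to the given class."""
    return any(cls in types_of(thing) for thing in objects(owner, "owns"))

print(owns_a("Judy", "Dog"))
```

The object of one triple ("Spot") is the subject of another, and chaining the two is what lets a program work out that Judy owns a dog.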
Anyone can create an RDF document that defines a new class or property that does not exist yet. Once it has been defined, it can be used like any other class or property. Just as in object-oriented programming, where one can create new classes by extending other classes, RDF allows new concepts to be created by extending other concepts. The only difference is that RDF is property-oriented as opposed to being object-oriented.
For instance, the FOAF ontology defines foaf:Person as an RDF class described as follows:
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/> // the entity is of type OWL Class
rdfs:label="Person" // the name of the entity is "Person"
rdfs:comment="A person."
<rdfs:subClassOf><owl:Class rdf:about="http://xmlns.com/foaf/0.1/Agent"/></rdfs:subClassOf> // the entity is a subclass of the Class Agent
<owl:disjointWith rdf:resource="http://xmlns.com/foaf/0.1/Org"/> // the entity has the property of being disjoint with the entity Organisation
See e.g. http://www.w3.org/People/Berners-Lee/card.rdf that describes Berners-Lee using various vocabularies (OWL's)
- Extensibility & Adaptability
- RDF can be serialized in several different ways (e.g. Turtle, N3, RDF/XML).
- RDF allows different communities to define their own semantics: anyone can create new ontologies based upon pre-existing ontologies to describe new resources.
- RDF permits the integration of an indefinite number of Ontologies (as dictionaries of terms/properties/resources) within the same RDF file.
- RDF is endorsed by W3C and used in many academic projects. It is easy to find well maintained and well documented RDF ontologies online.
- Open Bibliographic Data
- Many ontologies (OWL) can be appropriated by open bibliographic efforts, to the extent that they have been made available under an open license.
- Working with RDF, all the data can be shared using open standards and Linked Data (http://en.wikipedia.org/wiki/Linked_Data)
- SPARQL is a powerful query system that can be used to query any database in which RDF metadata has been inserted.
- This is the equivalent of SQL designed for the semantic web. It allows the construction of queries for linked data.
- External dependency:
- Before it can be used to describe anything, RDF must necessarily rely on one or more external sources.
- Resource intensive:
- RDF might require big triple stores (with hundreds of millions of triples) and SPARQL systems which might turn out to be too heavy. Many institutions currently do not have the infrastructure to handle that well.
- Excessive burden and lack of scalability for what should be simple bibliographic tasks like managing a few million bibliographic records.
- Open Bibliographic Data
- RDF may be fine as an abstract model, but its practical implementation for open bibliographic purposes remains to be provided and supported. Only very big players can manage the infrastructure necessary to deal with RDF.
- With SPARQL, if a query is not entirely predictable, its evaluation can be computationally very expensive (NP-hard in the worst case), i.e. it might not return within any predictable amount of time.
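To make the idea of querying triples more concrete, the sketch below imitates in plain Python the matching of triple patterns, which is the basic building block of a SPARQL query. It is not SPARQL and uses no RDF library; the data, the prefixed names and the "?variable" convention are illustrative assumptions.

```python
# Hypothetical data: prefixed names are kept as plain strings.
triples = [
    ("ex:hamlet", "rdf:type", "bibo:Book"),
    ("ex:hamlet", "dc:title", "Hamlet"),
    ("ex:hamlet", "dc:creator", "ex:shakespeare"),
    ("ex:shakespeare", "foaf:name", "William Shakespeare"),
    ("ex:macbeth", "rdf:type", "bibo:Book"),
    ("ex:macbeth", "dc:title", "Macbeth"),
]

def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple, or return None."""
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable is bound now or must agree with its earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended

def query(patterns, data):
    """Return every set of variable bindings satisfying all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in data:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions

# "Which books have a creator, and what is the creator's name?"
results = query(
    [
        ("?book", "rdf:type", "bibo:Book"),
        ("?book", "dc:creator", "?who"),
        ("?who", "foaf:name", "?name"),
    ],
    triples,
)
print(results)
```

Every additional pattern is checked against every triple for every partial solution found so far, which hints at why unconstrained queries over hundreds of millions of triples can become very costly.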
Schema.org is an initiative launched on 2 June 2011 by Bing, Google and Yahoo! to introduce the concept of the Semantic Web to websites. On 1 November Yandex (the largest search engine in Russia) joined the initiative. The operators of the world's largest search engines propose to mark up website content as metadata about itself, using microdata, according to their schemas. Those schemas can be recognized by search engine spiders and other parsers, which thus gain access to the meaning of the sites. The initiative started with a small number of formats, but the long-term goal is to support a wider range of schemas.
Schema.org provides a collection of schemas (i.e. HTML tags) which can be used for simple bibliographic data and is currently being pushed by major search engine companies (e.g. Google, Bing, Yahoo!).
Many sites are generated from structured data, which is often stored in databases. When this data is formatted into HTML, it becomes very difficult to recover the original structured data. Many applications, especially search engines, can benefit greatly from direct access to this structured data. On-page markup enables search engines to understand the information on web pages and provide richer search results, making it easier for users to find relevant information on the web. Markup can also enable new tools and applications that make use of the structure.
An OPAC that publishes unstructured data produces HTML that looks something like this:
<div>
  <h1>Avatar (Mysteries of Septagram, #2)</h1>
  <span>Author: Paul Bryers (born 1945)</span>
  <span>Science fiction</span>
  <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>
The following example, which describes a film rather than the book above, shows how data looks once metadata has been embedded in the schema.org format:
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
  </div>
  <span itemprop="genre">Science fiction</span>
  <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" itemprop="image">
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
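Once properties are marked up this way, a parser can recover them without guessing at the page layout. The following Python sketch uses only the standard library's html.parser to collect the itemprop values from the schema.org example. It is a deliberately simplified reading of microdata (nesting of items is ignored and only plain-text, src and href values are handled), and the class name ItempropCollector is made up for this example.

```python
from html.parser import HTMLParser

class ItempropCollector(HTMLParser):
    """Collect (itemprop, value) pairs, ignoring item nesting for brevity."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # (tag, property name, text chunks)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None or "itemscope" in attrs:
            return  # not a property, or a nested item whose own properties follow
        if tag in ("img", "a"):
            # URL-valued properties take their value from src or href.
            self.pairs.append((prop, attrs.get("src") or attrs.get("href")))
        else:
            self._pending = (tag, prop, [])

    def handle_data(self, data):
        if self._pending:
            self._pending[2].append(data)

    def handle_endtag(self, tag):
        if self._pending and self._pending[0] == tag:
            _, prop, chunks = self._pending
            self.pairs.append((prop, "".join(chunks).strip()))
            self._pending = None

page = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span>
    (born <span itemprop="birthDate">August 16, 1954</span>)
  </div>
  <span itemprop="genre">Science fiction</span>
  <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" itemprop="image">
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
"""

collector = ItempropCollector()
collector.feed(page)
for name, value in collector.pairs:
    print(name, "=", value)
```

The same approach applied to the unannotated OPAC markup would find nothing, since that page gives a parser no hint about which text is a title, an author or a genre.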
For a general overview, see http://labs.mondeca.com/dataset/lov/index.html
Dublin Core (DC)
Dublin Core is a vocabulary that can potentially be incorporated in any metadata standard. The Dublin Core Metadata Initiative is a consortium that releases different specifications for every typology of metadata so that DC can be used anywhere.
Dublin Core can be used in two ways:
- A set of predefined metadata elements - ready to re-use in other metadata standards (e.g. FOAF)
- A standalone metadata schema with its own data format and serialization.
Standalone implementations of Dublin Core typically make use of XML and are Resource Description Framework (RDF) based, but Dublin Core can also be implemented in pure XML (http://dublincore.org/documents/dc-xml-guidelines/), HTML or text. (This is true of any RDF-defined metadata, whose properties are defined in RDF, which is serialization neutral.)
Dublin Core can be used to describe physical resources such as books, digital materials such as video, sound, image, or text files, and composite media like web pages. Metadata records based on Dublin Core are intended to be used for cross-domain information resource description and have become standard in the fields of library science and computer science.
The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights. But there are many other terms also available. For more information, see: http://dublincore.org/documents/dcmi-terms/ http://dublincore.org/2010/10/11/dcterms.rdf
Dublin Core allows for the implementation of "Application Profiles" for extending the standard vocabulary. The domain model used in an application is often based on a domain model in wider use; for example, the generic model Functional Requirements for Bibliographic Records (FRBR) is an important point of reference for resource description in the library world.
- Dublin Core is a stable and well defined standard.
- it provides a core of semantically interoperable properties.
- It is made of a variety of fields which have been specifically and accurately defined.
- It is a good standard to be imposed as a working rule for a database over which there is full control.
- Problems arise if it is necessary to deal with third-party data that may or may not have all the required elements.
- Standalone implementations cannot benefit from additional metadata that is outside the scope of Dublin Core: e.g. a photograph may contain metadata such as: type of camera they were shot on, settings (F-number, zoom level, ISO..), location, etc. Even if it is useful metadata, this kind of information is outside the scope of Dublin Core and cannot be accounted for. [Note, however, that any freeform or extensible metadata system (e.g. key-value pairs) will suffice to resolve that drawback]
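The limitation described for photographs can be illustrated with a short Python sketch that separates the fields of a record into those covered by the 15 Simple Dublin Core elements and those that fall outside them. The sample record and its extra field names (f_number, iso) are hypothetical.

```python
# The 15 elements of the Simple Dublin Core Metadata Element Set, in lower case.
DCMES = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

def split_record(record):
    """Split a flat record into Dublin Core fields and everything else."""
    core = {k: v for k, v in record.items() if k.lower() in DCMES}
    extra = {k: v for k, v in record.items() if k.lower() not in DCMES}
    return core, extra

# Hypothetical photograph record with camera settings.
photo = {
    "title": "Harbour at dawn",
    "creator": "A. Photographer",
    "format": "image/jpeg",
    "f_number": "2.8",
    "iso": "400",
}

core, extra = split_record(photo)
print("Dublin Core:", core)
print("Outside Dublin Core:", extra)
```

A standalone Dublin Core record would keep only the first group; an extensible key-value store, as the note above suggests, could keep both.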
Friend of a Friend (FOAF) is an RDF vocabulary, described using W3C RDF Schema and the Web Ontology Language. Conceived for the description of groups and persons, it provides basic properties and resources to express concepts such as: friend of, son of, lives in, works in, knows someone, is mine, etc.
The BIBO ontology is an extension of Dublin Core for the description of bibliographic data. The Bibliographic Ontology Specification provides main concepts and properties for describing citations and bibliographic references (i.e. quotes, books, articles, etc).
The Functional Requirements for Bibliographic Records (FRBR) standardises a set of terms and relationships that are essential to any cataloguer. FRBR is both a general model and a set of properties. For more information, see: http://metadataregistry.org/schema/show/id/5.html
Resource Description and Access (RDA) is an implementation of the FRBR model. It has about 1400 properties and over 60 term lists. It covers text, sound, film, cartographic materials, and objects, as well as archival materials. http://metadataregistry.org/rdabrowse.htm/
The Simple Knowledge Organization System (SKOS) is a language for encoding term lists and thesauri. It provides an RDF model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary.
SKOS can be used on its own, or in combination with more formal languages such as the Web Ontology Language (OWL). The aim of SKOS is not to replace original conceptual vocabularies in their initial context of use, but to allow them to be ported to a shared space, based on a simplified model, enabling wider re-use and better interoperability.
SKOS introduces the class skos:Concept, which allows implementers to assert that a given resource is a concept. It also has built-in relationships like "broader than" and "narrower than." In basic SKOS, conceptual resources (concepts) are identified with URIs, labeled with strings in one or more natural languages, documented with various types of note, semantically related to each other in informal hierarchies and association networks, and aggregated into concept schemes. It also provides for preferred and alternate display forms.
More info at http://www.w3.org/TR/skos-primer/
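The structure that SKOS describes (concepts identified by URIs, labelled in one or more languages, and linked by broader and narrower relations) can be mimicked with a small Python dictionary. This is a toy model rather than SKOS itself: the concept URIs and labels are invented, and only the key names prefLabel, altLabel and broader echo SKOS terms.

```python
# Hypothetical concept scheme; "ex:" URIs and labels are invented for illustration.
concepts = {
    "ex:literature": {"prefLabel": {"en": "Literature"}, "broader": []},
    "ex:drama": {
        "prefLabel": {"en": "Drama", "fr": "Théâtre"},
        "altLabel": {"en": ["Plays"]},
        "broader": ["ex:literature"],
    },
    "ex:tragedy": {"prefLabel": {"en": "Tragedy"}, "broader": ["ex:drama"]},
}

def broader_transitive(uri):
    """All concepts above the given one, nearest first."""
    seen = []
    pending = list(concepts[uri]["broader"])
    while pending:
        current = pending.pop(0)
        if current not in seen:
            seen.append(current)
            pending.extend(concepts[current]["broader"])
    return seen

def narrower(uri):
    """Concepts that name the given one as directly broader."""
    return sorted(c for c, d in concepts.items() if uri in d["broader"])

print(broader_transitive("ex:tragedy"))
print(narrower("ex:literature"))
```

Storing only the broader links and deriving the narrower ones keeps the two directions consistent, which is one simple way to model such an informal hierarchy.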
Common Serialization Schema
Turtle is an RDF serialization that is simple to understand, easy for humans to read and edit in raw form, and relatively compact as far as RDF goes. The main advantages of Turtle are the following:
Writing full URIs all the time takes up a lot of space. Turtle enables the declaration of namespaces to prefix them. All prefixes should come at the beginning of the Turtle file. The prefix declaration
@prefix bibo: <http://purl.org/ontology/bibo/> .
is such that bibo:Book will be interpreted as <http://purl.org/ontology/bibo/Book>
The RDF spec defines the property rdf:type (note the use of a prefix) that is used to type a particular resource. A shortcut for rdf:type in Turtle is a. Writing
bibo:Document a bibo:Book
will be interpreted as
bibo:Document rdf:type bibo:Book
Blank nodes are a great shortcut when writing a query. They are denoted with square brackets [ ]. Blank nodes can be used to refer to the subject: e.g. there exists a book with the title "Hamlet"
[] a bibo:Book ; dc:title "Hamlet"^^xsd:string .
or to the object of an RDF statement: e.g. the book has been written by someone whose first name is "William"
bibo:Book dc:creator [ a foaf:Agent ; foaf:name "William"^^xsd:string ] .
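The two Turtle shortcuts described above, prefixes and the keyword a, amount to simple string expansion. The Python sketch below shows that expansion for individual terms only; it is not a Turtle parser, and the function name expand is made up for this example.

```python
# Prefix table, as would be declared with @prefix at the top of a Turtle file.
PREFIXES = {
    "bibo": "http://purl.org/ontology/bibo/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def expand(term):
    """Expand a prefixed name (or the shortcut 'a') into a full URI."""
    if term == "a":
        return PREFIXES["rdf"] + "type"
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term  # unknown prefix or plain value: leave untouched

statement = ("bibo:Document", "a", "bibo:Book")
print(tuple(expand(term) for term in statement))
```

After expansion, the compact statement and its long form with rdf:type written out are the same triple of full URIs.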
XML is useful for data that can be marked up in a flat record
JSON is a record-based serialization. JSON Schemas can themselves be described using JSON Schemas. A self-describing JSON Schema for the core JSON Schema can be found at http://json-schema.org/schema for the latest version or http://json-schema.org/draft-03/schema for the draft-03 version. The hyper schema self-description can be found at http://json-schema.org/hyper-schema or http://json-schema.org/draft-03/hyper-schema.
MARC is another serialization schema that can carry a variety of data types (as ISO 2709)
MAB is used in some German speaking countries (Germany, Austria). Based on ISO 2709, MAB is similar to MARC. ISO 2709 mostly corresponds to the American Standard Z39.2 from the year 1971. The ISO standard originated in 1973 and was originally intended to be used to exchange bibliographic data on magnetic tape.
Examples (who uses what)
Major libraries in the UK and USA use MARC21, as do many European libraries. In Germany, MAB2 and Pica are widely used. Those formats are used for record creation, data exchange and internal storage.
More and more established institutions are committing resources to linked data projects, from the national libraries of Sweden, Hungary, Germany, France, the Library of Congress and the British Library, to the Food and Agriculture Organization of the United Nations, not to mention OCLC. These institutions can provide a stable base on which library linked data will build over time. See http://ckan.net/group/lld for a comprehensive list of library datasets.
Library of Congress
- Digital library projects (Library of Congress)
AV-Prototype: digital preservation for audio and video uses METS and MODS with focus on metadata Cataloging report to use as intermediate level of description
UNESCO's CDS/ISIS library software
Common Communications Format (CCF)
British National Library
RDF with British Library Terms ontology See http://www.bl.uk/bibliographic/pdfs/british_library_data_model_v1-00.pdf http://www.bl.uk/bibliographic/pdfs/britishlibrarytermsv1-00.pdf
Creative Commons was founded in 2001, at which time debate concerning Digital Restrictions Management (DRM) was red-hot, as was the development of Semantic Web (RDF) technologies. Creative Commons realized that metadata could be used to make free works more useful (e.g., by facilitating discovery and provenance), flipping the DRM paradigm of crippling the usefulness of non-free works. Aaron Swartz led development of the Creative Commons RDF schema, which remains the basis of most subsequent Creative Commons metadata work. Creative Commons has also benefited over the years from interaction with the microformats community, and most recently has led the Learning Resources Metadata Initiative, prompted by longstanding additional metadata needs of the open education community and by renewed attention to web data due to schema.org.
The risks and downsides of Creative Commons metadata are no different from those of metadata generally: unless metadata production and publishing is closely aligned with other objectives and processes, it will often be expensive and wrong. Creative Commons has attempted to mitigate this risk by providing metadata as a side effect of its license chooser, encouraging other services and software to do likewise, and by not pushing metadata as a requirement for proper use of Creative Commons licenses -- rather a best practice.
Creative Commons metadata has two major parts: a work description, and a license description. The work description uses properties from Dublin Core, SIOC, and some developed by Creative Commons to provide information about the work -- including identifying the license (or where appropriate, public domain dedication or mark) the work is released under, and information needed to comply with the license, e.g., the names of attribution parties and the link with copyright information regarding the work to be used for attribution purposes. Creative Commons licenses self-describe their permissions, requirements, and prohibitions.
CC REL is a set of recommendations for implementation and use of Creative Commons metadata, focused on web annotations (RDFa), and a facility for embedding metadata in files (XMP) that refers to web annotations. Use of RDFa allows works to be annotated in a granular fashion (e.g., a web page or specific objects linked to or included), colocated with descriptions intended for human consumption, and mixed in with annotations and descriptions more broadly concerning the work or works in question.
For more information, see:
- http://creativecommons.org/ns# - where all properties developed by Creative Commons are defined
- http://labs.creativecommons.org/2011/ccrel-guide/ - examples of basic and advanced use of CC REL
- http://wiki.creativecommons.org/CC_REL - references to papers, presentations, and other resources concerning CC REL
Europeana Data Model (EDM)
Europeana started out harvesting metadata from hundreds of cultural institutions using a flat common metadata format based on simple Dublin Core. This simple solution, similar in nature to traditional record approaches, allows Europeana to handle a highly heterogeneous metadata input with minimum efforts. But it loses some of the richness from fine-grained metadata curated by Europeana's partners. It is also quite poor at providing a framework for producing and exchanging rich data where cultural objects are connected to the people, places and other objects that have a natural connection with them. This result in turn in much poorer services (search, display) for the user.
Between 2008 and 2011, Europeana researched a new framework for collecting, connecting and enriching metadata, inspired by Semantic Web and Linked Data technology: the Europeana Data Model (EDM). This model re-uses existing vocabularies, such as Dublin Core, SKOS and OAI-ORE, adapting them to the Europeana context: technically, it is an "application profile" of these vocabularies (http://dublincore.org/documents/profile-guidelines/). It is also inspired by CIDOC-CRM.
EDM makes it possible to represent complex, especially hierarchically structured, objects such as those found in the archive or library domains. For a book, for example, the individual chapters, illustrations and index can be presented as one whole. In addition, EDM can show multiple views on an object (painting, book), including information on the physical object and on its digitised representation, distinctly yet together. It makes a distinction between the object and the metadata about it, which helps to represent different perspectives on a given cultural object, an important requirement related to enrichment.
Finally it allows Europeana to represent contextual information, in the form of entities (places, agents, time periods) explicitly represented in the data and connected to a cultural object. This is a crucial feature for the cultural heritage field, where knowledge organization resources such as thesauri, gazetteers and name authority files are widely used and could be made available to Europeana and the wider Linked Open Data space.
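The separation between the object, its digital representation and its context can be sketched with a handful of triples. The Python example below is only an illustration: the item, image and place URIs are hypothetical, and the triples are held in a plain list rather than an RDF library. The class and property names (`edm:ProvidedCHO`, `edm:WebResource`, `ore:Aggregation`, `edm:aggregatedCHO`, `edm:isShownBy`, `edm:Place`) come from the EDM and OAI-ORE vocabularies.

```python
# Namespaces written out as plain strings; no RDF library is required.
EDM = "http://www.europeana.eu/schemas/edm/"
ORE = "http://www.openarchives.org/ore/terms/"
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical identifiers, for illustration only.
CHO = "http://example.org/item/1"            # the cultural object itself
AGG = "http://example.org/aggregation/1"     # the provider's package about it
IMG = "http://example.org/images/1.jpg"      # its digitised representation
PLACE = "http://example.org/place/florence"  # a contextual entity

triples = [
    (CHO, RDF_TYPE, EDM + "ProvidedCHO"),
    (CHO, DC + "title", "View of Florence"),
    (CHO, DCTERMS + "spatial", PLACE),
    (IMG, RDF_TYPE, EDM + "WebResource"),
    (AGG, RDF_TYPE, ORE + "Aggregation"),
    (AGG, EDM + "aggregatedCHO", CHO),
    (AGG, EDM + "isShownBy", IMG),
    (PLACE, RDF_TYPE, EDM + "Place"),
    (PLACE, SKOS + "prefLabel", "Florence"),
]


def objects(subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def place_labels(cho):
    """Follow dcterms:spatial links from an object to the labels of its places."""
    labels = []
    for place in objects(cho, DCTERMS + "spatial"):
        labels.extend(objects(place, SKOS + "prefLabel"))
    return labels


if __name__ == "__main__":
    print(objects(AGG, EDM + "isShownBy"))
    print(place_labels(CHO))
```

Because the place is an entity with its own URI rather than a string inside the record, the same place can be shared by many objects and linked to gazetteers or authority files.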
EDM has been developed together with technical experts from the library, museum, archive and audio-visual collection domains. While its implementation at Europeana is still work in progress, it has been tested against domain-specific metadata such as LIDO for museums, EAD for archives or METS for digital libraries. As an advanced feature, EDM is aimed at allowing several "grains" of metadata to co-exist seamlessly: it should be possible to express metadata as close as possible to original models, while still allowing for interoperability using mappings between the specialized levels and more generic ones like Dublin Core. Several case studies (http://pro.europeana.eu/case-studies-edm) illustrate the challenges and benefits of applying EDM to cultural heritage collections.
Open Images platform
Open Images is an open media platform that offers online access to audiovisual archive material to stimulate creative reuse. Footage from audiovisual collections can be downloaded and remixed into new works. Users of Open Images also have the opportunity to add their own material to the platform and thus expand the collection. Open Images also provides an API, making it easy to develop mashups.
All Open Images media items and their metadata are accessible via an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) API. This allows third parties to access Open Images in a structured way. OAI-PMH is a powerful tool for data and metadata sharing between institutions and platforms. For example, OAI-PMH can be used to harvest all data available on the server, or to request specific records and periodic updates.
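The kinds of request mentioned above map onto OAI-PMH "verbs" passed as URL parameters. The Python sketch below only builds request URLs and does not contact any server; `BASE_URL` is a placeholder, not the actual Open Images endpoint, and `build_request` is a hypothetical helper. The verbs and parameter names (`ListRecords`, `GetRecord`, `metadataPrefix`, `from`, `identifier`) are defined by the OAI-PMH specification.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the base URL published by the repository.
BASE_URL = "https://example.org/oai"


def build_request(verb, params=None):
    """Build an OAI-PMH request URL for the given verb and parameters.

    Parameters are passed as a dict because 'from' is a reserved word in
    Python and cannot be used as a keyword argument.
    """
    query = {"verb": verb}
    query.update(params or {})
    return BASE_URL + "?" + urlencode(query)


if __name__ == "__main__":
    # Harvest everything in the baseline Dublin Core format.
    print(build_request("ListRecords", {"metadataPrefix": "oai_dc"}))
    # Periodic update: only records changed since a given date.
    print(build_request("ListRecords",
                        {"metadataPrefix": "oai_dc", "from": "2012-01-01"}))
    # One specific record (the identifier shown is hypothetical).
    print(build_request("GetRecord",
                        {"metadataPrefix": "oai_dc",
                         "identifier": "oai:example.org:123"}))
```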
The Open Images OAI implementation uses two different metadata formats. The first, 'oai_dc' (OAI Dublin Core), is the required minimum data set for OAI-PMH records. Dublin Core is a set of elements that can describe physical objects; oai_dc contains the 15 elements specified by Dublin Core. The second, more comprehensive set of metadata elements is a refinement of these core elements: 'oai_oi' (OAI Open Images) is an Open Images-specific implementation consisting of a mixture of DC Terms and an XML interpretation of ccREL.
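To make the oai_dc format more concrete, the Python sketch below reads the Dublin Core elements out of a small record. The sample record is invented for illustration and is not taken from Open Images; the two namespace URIs are the standard ones for oai_dc and Dublin Core. `dc_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample record, for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example newsreel</dc:title>
  <dc:creator>Example Archive</dc:creator>
  <dc:subject>harbour</dc:subject>
  <dc:subject>ships</dc:subject>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Collect Dublin Core elements into a dict of element name -> values.

    Every value is kept in a list because Dublin Core elements are
    repeatable (a record may carry several subjects, creators, etc.).
    """
    root = ET.fromstring(xml_text)
    fields = {}
    prefix = "{" + DC_NS + "}"
    for child in root:
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


if __name__ == "__main__":
    for name, values in dc_fields(SAMPLE).items():
        print(name, values)
```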
Bibliothèque Nationale de France (BnF)
The Bibliothèque nationale de France has designed a new project in order to make its data more useful on the Web. "data.bnf.fr" gathers data from different databases, so as to create Web pages about Works and Authors, together with an RDF view on the extracted data. This involves transforming existing data, enriching and interlinking the dataset with internal and external resources, and publishing HTML pages. The raw data is accessible in RDF following the principles of linked data, with an open licence (attribution). data.bnf.fr builds HTML pages from this data, about major authors and Works, so that the benefits can immediately be seen. Example: http://data.bnf.fr/11913795/machiavel/ http://data.bnf.fr/11913795/machiavel/rdf.xml
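The pair of example URLs above suggests that the RDF view of a page sits at the page URL followed by `rdf.xml`. The short Python sketch below encodes that pattern; it is inferred from this single example only and is an assumption, not a documented rule of data.bnf.fr. `rdf_url` is a hypothetical helper name.

```python
def rdf_url(page_url):
    """Derive the RDF/XML URL from an HTML page URL.

    Assumption: the RDF view lives at '<page URL>/rdf.xml', as in the
    Machiavel example; other pages may not follow the same pattern.
    """
    return page_url.rstrip("/") + "/rdf.xml"


if __name__ == "__main__":
    print(rdf_url("http://data.bnf.fr/11913795/machiavel/"))
```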
The purpose is to take data from silos and to put them on the Web. All processes have to be automated: we rely on the use of persistent identifiers (ARK) in all our applications. The application is built with the open source software CubicWeb (http://www.cubicweb.org/). More information: http://data.bnf.fr/about-en
We need to gather data from several formats: MARC (bibliographic database and authority file: 14 million books), EAD (archives and manuscripts), and OAI-DC (Gallica digital library: 1.5 million items). This structured data has to be gathered with Web standards. We want to make something that is both efficient internally and possible to reuse. The main vocabularies we use are:
- SKOS: for concepts
- FOAF: for persons
- DC/RDA: for resources
more information: http://data.bnf.fr/semanticweb-en bulk download: http://echanges.bnf.fr/PIVOT/databnf_all_rdf_xml.tar.gz?user=databnf&password=databnf
Pros & Cons
Pros:
- makes the "library data" fully available on the Web with an open licence
- links between resources make them easier to use for the general public
- algorithms help us improve the original data
- Web techniques allow us to know what people look for and to adapt our service accordingly
Cons:
- mistakes in the original data appear
- scale is always an issue with millions of resources
more information: http://data.bnf.fr/docs/databnf-presentation-en.pdf
"Sed querelae, ne tum quidem gratae futurae cum forsitan necessariae erunt, ab initio certe tantae ordiendae rei absint". Titus Livius, Ab Urbe condita, Praefatio 12.
to contact the team: firstname.lastname@example.org
Centre Pompidou Virtuel
Archives de France
Contact: Claire Sibille, head of the office for archival processing and computerisation at the Service interministériel des Archives de France, Ministère de la Culture et de la Communication
Thesaurus W for the indexing of local archives, published by the Archives de France
- EAD (Encoded Archival Description)
- EAC-CPF (Encoded Archival Context - Corporate Bodies, Persons, and Families)
History: 1. XML, 2. Excel sheets, 3. XML/SKOS (with ThManager)
Today:
- URI identification for each term, with relationships between terms defined by SKOS
- relationships between these terms expressed as RDF triples
- thesaurus has been aligned with RAMEAU & DBpedia
- consultation in HTML or RDF/XML, download of the whole database in RDF, queries via SPARQL, and a web API to the thesaurus
- URIs can be dereferenced in different manners according to the context
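The thesaurus structure described in this list (one URI per term, SKOS relationships stored as triples, alignment with an external dataset) can be sketched in a few lines of Python. The term URIs, labels and the DBpedia alignment below are invented for illustration and are not taken from Thesaurus W; only the SKOS property names (`skos:broader`, `skos:prefLabel`, `skos:exactMatch`) are real.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical term URIs, for illustration only.
ARCHIVES = "http://example.org/thesaurus/archives"
NOTARIAL = "http://example.org/thesaurus/notarial-archives"
DEEDS = "http://example.org/thesaurus/deeds"

triples = [
    (ARCHIVES, SKOS + "prefLabel", "archives"),
    (NOTARIAL, SKOS + "prefLabel", "notarial archives"),
    (DEEDS, SKOS + "prefLabel", "deeds"),
    (NOTARIAL, SKOS + "broader", ARCHIVES),
    (DEEDS, SKOS + "broader", NOTARIAL),
    # Illustrative alignment with an external dataset.
    (ARCHIVES, SKOS + "exactMatch", "http://dbpedia.org/resource/Archive"),
]


def broader_chain(term):
    """Walk skos:broader links upwards and return the ancestors in order.

    Only the first broader term is followed at each step, and a 'seen'
    set guards against cycles in the data.
    """
    chain = []
    seen = {term}
    current = term
    while True:
        parents = [o for s, p, o in triples
                   if s == current and p == SKOS + "broader"]
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


if __name__ == "__main__":
    print(broader_chain(DEEDS))
```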
University of California Press
Using METS with MODS for freely available ebooks
MODS as exchange format between the National Library of Australia and ScreenSound Australia. Allows for consistency with MARC data.
Biblioteca Nazionale Centrale di Firenze
Maintains the national bibliography of Italian books and develops the Nuovo Soggettario, a national general thesaurus, also available as SKOS under a Creative Commons 2.5 license. It describes itself as "defining ways of online publication as Linked Data of produced metadata" at a "first prototypical experimental stage" (contact: Giovanni Bergamin): http://thes.bncf.firenze.sbn.it/thes-dati.htm
Archive Hub, COPAC with Linked Data: creation of links with other databases (e.g. BBC, OCLC, LCSH).