Open Metadata Handbook/Open Metadata

From Wikibooks, open books for an open world

What does open mean?

“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.”

Metadata is open if it satisfies the following conditions:

  1. access: it shall be publicly accessible, preferably via the Internet and free of charge (or at a reasonable reproduction cost).
  2. redistribution: it shall be possible for anyone to freely redistribute it, either as such or as part of a broader dataset derived from many different sources.
  3. reuse: it shall be possible to modify or incorporate it into derivative datasets, which can be distributed under the same terms as the original.
  4. no technological restrictions: it shall be provided in such a form that there are no technological obstacles to the performance of the above activities.
  5. attribution: it may be necessary, as a condition for its redistribution and re-use, to provide attribution of the relevant contributors and creators.
  6. integrity: it may be necessary, as a condition for metadata being distributed in modified form, that the resulting dataset carry a different name or version.
  7. no discrimination against persons or groups: it shall not be possible to discriminate against any person or group of persons.
  8. no discrimination against fields of endeavor: it shall not be possible to prevent anyone from making use of the metadata in a specific field of endeavor.
  9. distribution of license: attached rights shall apply to all to whom it is redistributed without the need for execution of an additional license by those parties.
  10. license must not be specific to a package: attached rights shall not depend on the work being part of a particular package.
  11. license must not restrict the distribution of other works: there shall be no restrictions on other works that are distributed along with the licensed dataset.

For more details, see http://opendefinition.org


Why open up metadata?

Producers of bibliographic data such as libraries, publishers, universities, scholars or social reference management communities have an important role in supporting the advance of humanity’s knowledge. For society to reap the full benefits of bibliographic endeavors, it is imperative that bibliographic data be made open — that is, available for anyone to use and re-use freely for any purpose.

Large collections of data (or metadata) are protected by the law of many jurisdictions and cannot therefore be freely used or reused. It is therefore critical that they be published with a clear and explicit statement of the wishes and expectations of the publishers with respect to the use and re-use of the whole data collection, subsets of the collection or individual bibliographic descriptions.

Restrictions on commercial re-use, or limits on the production of derivative datasets, make it impossible to effectively integrate and re-use a particular dataset. They also prevent the deployment of commercial services that might add value to bibliographic data, and of commercial activities that could be used to support data preservation.

For metadata to be effectively used and added to by others, it should be open as defined by the Open Definition (http://opendefinition.org); in particular, non-commercial and other restrictive clauses should not be used. The use of the Public Domain Dedication and Licence or the Creative Commons Zero waiver is recommended to promote the maximum reuse of metadata, in line with the general ethos of sharing within the publicly funded cultural heritage sector.

For more details, see the Open Bibliographic principles at http://openbiblio.net/principles

Legal Issues

Default position of the law

The laws of many countries prevent third parties from using, reusing and redistributing data without explicit permission.

In Europe, sui generis database rights have been implemented by virtue of the 1996 EC Council Directive on the legal protection of databases, which defines a database as "a collection of independent works, data or other materials which are arranged in a systematic or methodical way and are individually accessible by electronic or other means." Provided a set of data comes within this definition, it will qualify for protection (irrespective of whether it also benefits from protection under copyright) if there has been a "substantial investment" in obtaining, verifying or presenting the contents of the database. A person infringes a database right if they extract or re-utilize all or a substantial part of the contents of a protected database without the consent of the owner.

Like copyright, a database right is an automatic right which exists as soon as the database exists in a recorded form. Database rights last for 15 years from the end of the year in which the making of the database was completed (or, if it was published during that period, 15 years from the end of the year in which the database was first made available to the public). If there is a substantial change to the contents of the database, the 15-year protection period recommences.
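The duration rule above can be illustrated with a short calculation. This is only a sketch of the basic rule (15 years from the end of the year of completion, restarted by first publication within that period); the statutory details vary, and a substantial change to the contents restarts the clock again.

```python
def db_right_expiry(completion_year, publication_year=None):
    """Return the year at whose end sui generis protection lapses.

    The right runs for 15 years from the end of the year of completion,
    or, if the database was first published within that period, 15 years
    from the end of the year of first publication.
    """
    expiry = completion_year + 15
    if publication_year is not None and publication_year <= expiry:
        expiry = publication_year + 15
    return expiry

# A database completed in 2000 and first published in 2005
# is protected until the end of 2020.
print(db_right_expiry(2000, 2005))  # → 2020
```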

No database right exists in the United States. Although databases may be protected as compilations under U.S. copyright law, the underlying data is not automatically granted protection. While database owners have been lobbying for the introduction of such a right, it has so far been prevented by the successful lobbying of research libraries, consumer groups and firms who benefit from the free use of factual information.

In the absence of a statutory database right, protection of non-copyright data collections can nevertheless be implemented by contractual means or by relying on other bodies of law. In the U.S., the doctrines of “unfair competition” and “misappropriation” have been used to protect database manufacturers from losing business to competitors who free ride on their investments by republishing work that has taken them so long to acquire or to create.

Hence, even in places where the existence of database rights is uncertain, it is important to apply a license simply for the sake of clarity.

Open licences

We recommend using one of the licenses conformant with the Open Definition and marked as suitable for data. These include:

  • Open Data Commons Public Domain Dedication and Licence (PDDL): Dedicate to the Public Domain (all rights waived)
  • Open Data Commons Attribution License: Attribution for data(bases)
  • Open Data Commons Open Database License (ODbL): Attribution-ShareAlike for data(bases)
  • Creative Commons CCZero: Dedicate to the Public Domain (all rights waived)

A more comprehensive list (along with instructions for usage) can be found at: <http://opendefinition.org/licenses/>

A short 1-page instruction guide to applying an open data license can be found on the Open Data Commons site: <http://opendatacommons.org/guide/>
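In practice, applying one of these licences means shipping a clear, machine-readable rights statement alongside the data. A minimal sketch of such a statement follows; the field names here are illustrative rather than a formal standard, and the PDDL URI is the one published on the Open Data Commons site.

```python
import json

# Illustrative dataset-level metadata carrying an explicit licence URI,
# so that both humans and harvesters can determine the terms of re-use.
dataset = {
    "title": "Example bibliographic dataset",
    "publisher": "Example Library",
    "license": "https://opendatacommons.org/licenses/pddl/1-0/",
    "rights_statement": "To the extent possible under law, the publisher "
                        "has waived all rights to this dataset (PDDL).",
}

print(json.dumps(dataset, indent=2))
```

The essential point is that the licence travels with the data itself, rather than being buried in a website's terms of use.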

Technical Issues

Accessibility

Opening up metadata does not guarantee that the data will be used (or seen). Making metadata publicly available under an Open license is only the first step. The next step is to make it accessible in a technical sense. Otherwise there is a risk that data will be underutilized.

Open bibliographic metadata must be available to anyone, without any discrimination against persons or groups. It should be provided at no more than a reasonable reproduction cost, so as to preclude financial discrimination. It should be downloadable over the internet as a whole and not exclusively upon request.

Several mechanisms restrict access to data. They include:

  • compilation in databases or websites to which only registered members or customers have access
  • the provision of single data points as opposed to tabular queries or bulk downloads of data sets
  • time-limited access to resources, as opposed to indefinite access to them
  • restriction of robot access to websites, with preference given to certain search engines

Interoperability

Interoperability refers to the ability of diverse systems and organizations to work together (inter-operate). To the extent that it allows different standards to work together, interoperability means the ability to combine different datasets to develop more and better products and services. With regard to bibliographic data, it is important that a record can be freely intermixed with other records containing complementary information. The ability to ‘plug together’ different datasets from different sources is essential to building large, comprehensive databases. There is no point in having lots of datasets but little or no ability to combine them into larger systems, which is where the real value lies.

Interoperability implies the use of "open standards": standards made available to the general public and developed and maintained via a collaborative, consensus-driven process. These standards are intended for widespread adoption; they facilitate interoperability and data exchange among different datasets.

Several mechanisms reduce the interoperability of data. They include:

  • use of a proprietary or closed technology or encryption which creates a barrier for access.
  • license restricting the reuse of data to specific conditions that might render it incompatible with other datasets (such as many share-alike licenses)
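The value of ‘plugging together’ records comes from being able to join them on a shared identifier. A minimal sketch, assuming two sources that both key their records by ISBN (the field names and records are illustrative):

```python
# Two partial records for the same work, from different open datasets.
library_record = {"isbn": "978-0-14-044913-6", "title": "The Odyssey",
                  "author": "Homer"}
enrichment_record = {"isbn": "978-0-14-044913-6",
                     "subjects": ["Epic poetry, Greek"],
                     "translator": "Robert Fagles"}

def merge_records(a, b, key="isbn"):
    """Combine two records that describe the same resource.

    Fields from `a` win on conflict; `b` contributes anything missing.
    """
    if a[key] != b[key]:
        raise ValueError("records describe different resources")
    merged = dict(b)
    merged.update(a)
    return merged

combined = merge_records(library_record, enrichment_record)
print(sorted(combined))  # all fields from both sources, joined on the ISBN
```

A merge this simple is only possible when both datasets use open formats and a common identifier; restrictive licences or proprietary encodings on either side would make it legally or technically impossible.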

Reusability

Everyone should be able to use, reuse and redistribute Open bibliographic metadata. There should be no discrimination against fields of endeavor or against persons or groups - such as ‘non-commercial’ restrictions that would prevent ‘commercial’ use or restrictions of use for certain purposes (e.g. only in education).

The data must also be available in a convenient and modifiable form. Bibliographic information is too often made available to the public in a format that does not allow for further modification (e.g. locked up in PDF files). Open bibliographic metadata should be encoded in a non-proprietary format that can be understood by a machine, is easily modifiable, and is structured so as to allow for the automatic processing of data.

Several mechanisms restrict the reuse of data. They include:

  • encoding data in a format that cannot be automatically understood by a computer
  • license forbidding (or obfuscating) re-use of the data (such as education or non-commercial licenses)
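The practical difference between a locked-up record and a reusable one is mostly a matter of encoding. Once a description is structured rather than flattened into a print-oriented file, automatic processing becomes trivial. A sketch (the records are illustrative):

```python
import json

# The same descriptions, structured so a program can query and transform
# them - something a record locked inside a PDF does not allow.
records = [
    {"title": "On the Origin of Species", "author": "Charles Darwin",
     "year": 1859},
    {"title": "The Voyage of the Beagle", "author": "Charles Darwin",
     "year": 1839},
]

# Filtering, sorting and re-serialization are one-liners.
titles = [r["title"] for r in sorted(records, key=lambda r: r["year"])]
print(titles)
print(json.dumps(records[0]))  # non-proprietary, easily modifiable
```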

Case studies

For a collection of examples, see http://obd.jisc.ac.uk/examples

Europeana

Europeana aims to provide the widest access possible to cultural heritage, and empower others to build services that contribute to this mission. Making data openly available to the public and private sectors alike is thus central to its business strategy. Europeana is also trying to provide a better service through making available richer data, where millions of texts, images, videos and sounds are linked to other relevant resources.

Europeana has therefore been interested in Linked Open Data as a technology that facilitates these objectives, as the W3C Library Linked Data report has emphasized for the cultural sector. Last year, it released a first Linked Data pilot at data.europeana.eu. This was a first opportunity to experiment with Linked Data from a technical perspective. A first prototype was deployed quite easily (see this technical paper). Metadata was published using the Europeana Data Model (EDM), a crucial evolution of Europeana's approach to metadata. data.europeana.eu provides enriched metadata from Europeana, distinct from the original metadata. It is also connected to other Linked Data sources, such as Geonames. While such data publication is also possible via other channels, Semantic Web and Linked Data technology provides a much finer, native way of doing it: the links are just part of the data model.
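What "the links are just part of the data model" means can be sketched with a few RDF triples in N-Triples syntax, generated here with nothing but string formatting. The item URI below is hypothetical, not an actual data.europeana.eu identifier; the point is that the connection to an external source such as Geonames is expressed as ordinary data.

```python
# Hypothetical cultural object linked to an external Linked Data
# source (Geonames) via Dublin Core vocabulary terms.
OBJ = "http://data.example.org/item/123"  # illustrative item URI
triples = [
    (OBJ, "http://purl.org/dc/elements/1.1/title", '"View of Delft"'),
    (OBJ, "http://purl.org/dc/terms/spatial",
     "<http://sws.geonames.org/0000000/>"),  # placeholder Geonames URI
]

def to_ntriples(s, p, o):
    """Serialize one triple; literals arrive pre-quoted, bare URIs get <>."""
    obj = o if o.startswith(('"', "<")) else f"<{o}>"
    return f"<{s}> <{p}> {obj} ."

for t in triples:
    print(to_ntriples(*t))
```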

data.europeana.eu was, however, still not part of the production system behind the main europeana.eu portal. More importantly, the metadata was not explicitly open, an obvious obstacle to re-use.

After several months, a second version has thus been released. Though it is still a pilot, it now contains fully open metadata (CC0). This, however, covers only a subset of the objects Europeana provides access to: as of February 2012, data.europeana.eu contains metadata on 2.4 million objects. These objects come from data providers who reacted early to Europeana's efforts to promote more open data. It is hoped that this subset will be used by third parties to develop innovative applications and services, which would of course help to convince more partners to contribute metadata openly in the future.


COMET

Background

The Cambridge Open METadata project (COMET) was a collaboration between Cambridge University Library and CARET, University of Cambridge, with assistance from OCLC. It ran from February to July 2011 and was funded by the JISC Infrastructure for Resource Discovery programme. It followed on from the Library's successful contribution to the Open Bibliography project. The initial aim was to identify and release a substantial record set to an external platform under an open license (the Public Domain Dedication and Licence, PDDL), initially as MARC 21. The project also aimed to deploy and test a number of technologies and methodologies for releasing open bibliographic data, including XML, RDF, SPARQL and JSON, and to test integration with authority control services.

Key outputs

  • Open source software to analyze MARC 21 record ownership codes (to assist in license assignment) and convert records to RDF. An open source RDF publishing toolset was also constructed.
  • Over 2 million bibliographic records as RDF triples, many searchable via a SPARQL endpoint
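The idea of analyzing record ownership codes to assist license assignment can be sketched as follows. The logic here is entirely hypothetical, not COMET's actual rules; it merely echoes the project's split between PDDL for the Library's own records and ODC-By for OCLC-derived ones.

```python
# Hypothetical mapping from a record's provenance to the license under
# which it may be released. Real ownership codes live in MARC fields and
# require more careful analysis than this illustration suggests.
LICENCE_BY_SOURCE = {
    "local": "PDDL",    # records created in-house: public domain dedication
    "oclc": "ODC-By",   # OCLC-derived records: attribution required
}

def assign_licence(record):
    """Pick a license from the record's (hypothetical) 'source' field."""
    return LICENCE_BY_SOURCE.get(record.get("source"), "review-needed")

print(assign_licence({"source": "local"}))  # → PDDL
print(assign_licence({"source": "oclc"}))   # → ODC-By
print(assign_licence({}))                   # → review-needed
```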

Impact

Cambridge University Library released its data partly in response to the increasing expectation among its readership that 'everything is open'. The Library is following through as a partner in the Open Bibliography project, and through key involvement in two other JISC-funded projects it is helping other libraries to release open data and develop services on top of it.

COMET was successful in releasing large amounts of data relatively quickly in a re-usable form. Some of the tools and methodologies constructed have since been employed on the Open Education project, in particular COMET's approach to IPR disambiguation.

COMET's preference for PDDL licensing of data was also noted by the JISC in its summary and synthesis work. Some data from OCLC was released under an ODC-By license in accordance with OCLC's requirements. The project provides a valuable use case for examining the advantages and disadvantages of both options.

More information

  • Project blog: http://cul-comet.blogspot.com/
  • Datasets: http://data.lib.cam.ac.uk
  • Project code: https://github.com/edchamberlain/COMET