Open Metadata Handbook/Data Integration

This chapter covers fetching data, integrating it into a project, and then presenting it.

Data sources

Retrieving data

Data stores

Data and metadata can be found in a variety of sources. A "data store" is a generic term for an online resource that stores and provides access to data. In a broad sense, this could be just about any server online, for example a web server that provides data represented as web pages. In the context of this handbook, we'll focus on data stores whose purpose is to allow free and open access, in reusable forms, to bibliographic metadata.

Data Hub

The Data Hub (http://thedatahub.org/) acts as a hub for all kinds of data. The datasets can be filtered, e.g. to retrieve only bibliographic open data. As of 7 February 2012, about 3,130 datasets are listed in the Data Hub.
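
The Data Hub runs on the CKAN software, so the same filtering can be done programmatically. The following is a minimal sketch in Python; the search endpoint path (/api/3/action/package_search) is the standard CKAN action API and is an assumption here, as the version deployed at thedatahub.org may expose a different path.

  import json
  import urllib.request

  # Assumed standard CKAN search endpoint; the API version and path
  # actually deployed at thedatahub.org may differ.
  url = "http://thedatahub.org/api/3/action/package_search?q=bibliographic"

  with urllib.request.urlopen(url) as response:
      data = json.loads(response.read().decode("utf-8"))

  # CKAN action responses wrap their payload in a "result" object.
  result = data["result"]
  print("Matching datasets:", result["count"])
  for dataset in result["results"][:10]:
      print("-", dataset["name"])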

Library Catalogues
  • The British Library (www.bl.uk/bibliographic/datafree.html): Linked Open Data and MARC21 records released under CC0.
  • German National Library (https://wiki.dnb.de/display/LDS/): Linked Open Data released under CC0.
  • Bibliothèque nationale de France (http://data.bnf.fr): Linked Open Data released under the Licence Ouverte.

Crowd contributed data
  • Wikimedia
    • One major source of crowd-contributed knowledge is the Wikimedia galaxy of sites: Wikipedia, Wiktionary, Wikimedia Commons. This data is usually presented in a fairly unstructured form (we'll explain more about that later), but efforts have been made to turn this information into structured data and to provide corresponding data stores.
  • Content sharing websites
    • These websites not only host user-generated content such as pictures, videos or music, they also provide metadata, along with APIs to access this metadata and search for content; a sketch of such an API call follows this list.
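
As an illustration, here is a hedged sketch of searching one such site's metadata API. It uses the Flickr REST endpoint and its flickr.photos.search method; YOUR_API_KEY is a placeholder for a key you would obtain by registering as a Flickr developer.

  import json
  import urllib.parse
  import urllib.request

  # Illustration only: "YOUR_API_KEY" is a placeholder, not a working key.
  params = urllib.parse.urlencode({
      "method": "flickr.photos.search",
      "api_key": "YOUR_API_KEY",
      "text": "Mexico City",
      "format": "json",
      "nojsoncallback": 1,  # plain JSON instead of a JSONP wrapper
  })
  url = "https://api.flickr.com/services/rest/?" + params

  with urllib.request.urlopen(url) as response:
      data = json.loads(response.read().decode("utf-8"))

  # Each photo entry carries metadata such as its id, owner and title.
  for photo in data["photos"]["photo"][:5]:
      print(photo["id"], photo["title"])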


Downloading data

A large part of the available open datasets are provided as downloadable files. This is the easiest way to retrieve data, as it only involves finding the right dataset and then clicking to download it. But such downloads usually don't integrate well into automated processes, and there is often no way to make sure the data is up to date other than manually checking for updates.
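
One way to soften the update problem is to script the download and let the server report whether the file has changed. Below is a minimal sketch using the HTTP Last-Modified / If-Modified-Since mechanism; the dataset URL is a placeholder.

  import os
  import urllib.request
  from urllib.error import HTTPError

  DATASET_URL = "http://example.org/open-dataset.csv"  # placeholder URL
  LOCAL_FILE = "open-dataset.csv"
  STAMP_FILE = LOCAL_FILE + ".last-modified"

  request = urllib.request.Request(DATASET_URL)

  # If a previous run saved the server's Last-Modified header, send it
  # back so the server can answer "304 Not Modified".
  if os.path.exists(STAMP_FILE):
      with open(STAMP_FILE) as f:
          request.add_header("If-Modified-Since", f.read().strip())

  try:
      with urllib.request.urlopen(request) as response:
          with open(LOCAL_FILE, "wb") as out:
              out.write(response.read())
          last_modified = response.headers.get("Last-Modified")
      if last_modified:
          with open(STAMP_FILE, "w") as f:
              f.write(last_modified)
      print("Downloaded a fresh copy.")
  except HTTPError as e:
      if e.code == 304:
          print("Local copy is already up to date.")
      else:
          raise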

Accessing data through APIs

"API" stands for "Application Programmation Interface". As this name suggests, APIs allow for more complex interactions than downloads.

In most open knowledge APIs, the interface to access data is based on the HTTP protocol, the same protocol browsers use to access web pages, which guarantees easy access from almost any internet connection.

Just like when you open a web page, to request data from a web-based API you'll need to call a URL (Uniform Resource Locator): the address of that web page (or, in the case of an API, of that API endpoint; hence the use of the neutral term "resource" to designate both).

Most APIs follow the REST (REpresentational State Transfer) architecture, in which parameters (e.g. the name of a dataset, or a specific range within a dataset) are passed within the URL. This makes APIs easy to test, as you can try them in your browser and see the results.

The world of APIs ranges from little more than parameterizable downloads to full replications of an online service's functions (from user authentication to content creation), which makes it possible to build custom clients on top of these services.

An example

The endpoint for the English Wikipedia API is http://en.wikipedia.org/w/api.php, which means that any URL starting this way will be handled by the API.

If you open Wikipedia's endpoint URL without any further parameters, you'll see a web page containing detailed information about the API syntax, i.e. how to build URLs to access data inside Wikipedia. Most APIs don't provide documentation through their endpoint, but will offer developer resources instead, such as the MediaWiki API page.

Adding parameters gives access to specific actions. For example, http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Mexico_City&prop=revisions&rvprop=content will return the contents of the latest revision of the Mexico City Wikipedia article, encapsulated in an XML file.

You can easily fiddle around and change the parameters: replace the titles=Mexico_City part with titles=London|Paris and you'll get the articles for both London and Paris; replace format=xml with format=json and you'll get a different encapsulation.
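
The same request can of course be scripted. Here is a minimal sketch in Python that fetches the JSON variant of the query above and digs the article text out of the nested response; the User-Agent string is an arbitrary example, added because Wikipedia asks API clients to identify themselves.

  import json
  import urllib.parse
  import urllib.request

  # Same query as above, but asking for JSON instead of XML.
  params = urllib.parse.urlencode({
      "format": "json",
      "action": "query",
      "titles": "Mexico City",
      "prop": "revisions",
      "rvprop": "content",
  })
  url = "http://en.wikipedia.org/w/api.php?" + params
  request = urllib.request.Request(
      url, headers={"User-Agent": "OpenMetadataHandbook-example/0.1"})

  with urllib.request.urlopen(request) as response:
      data = json.loads(response.read().decode("utf-8"))

  # The response nests each page under its numeric page id.
  for page in data["query"]["pages"].values():
      wikitext = page["revisions"][0]["*"]  # "*" holds the raw wikitext
      print(page["title"], "->", wikitext[:200], "...")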

API Tools
  • API consoles such as Apigee provide tools to build and test API requests.
  • JSON being one of the most popular formats for API data, we recommend the Firefox extension JSONView, which helps explore the nested structure of such data; the short sketch after this list does the same from a script.
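
If you would rather explore the structure outside the browser, json.dumps with an indent argument pretty-prints any parsed response. A minimal sketch reusing the query from the previous section:

  import json
  import urllib.request

  url = ("http://en.wikipedia.org/w/api.php?format=json&action=query"
         "&titles=London|Paris&prop=revisions&rvprop=content")
  request = urllib.request.Request(
      url, headers={"User-Agent": "OpenMetadataHandbook-example/0.1"})

  with urllib.request.urlopen(request) as response:
      data = json.loads(response.read().decode("utf-8"))

  # Pretty-print the nested structure, much as the JSONView extension does.
  print(json.dumps(data, indent=2, ensure_ascii=False))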

Automating data retrieval

Source formats

Unstructured data

Proprietary formats

CSV

XML and JSON

Semantic and linked data

Transforming data

Structuring data: An introduction to data models

Tables: The relational model

Hierarchical data

Graph data