SI521 "Open Educational Resources at the University of Michigan" Open Textbook/OpenData

From Wikibooks, open books for an open world

Open Data is a term used to describe processes by which scientific data may be published and re-used without price or permission barriers. Scientists often regard published data as public goods, but many entities claim a copyright or license over data that prevents its re-use without permission. Such claims are increasingly seen as a major roadblock to the progress of scholarship and scientific inquiry.

Definitions of Data

Data refers to pieces of information or facts collected as the result of experience, observation, experiment, or a set of premises. They may be numbers, words, images, genomes, scientific formulas, or geographic information. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived. These data types can have commercial value, which provides an incentive for organizations (public or private) to apply intellectual property restrictions to data they have discovered or aggregated. James Boyle refers to these restrictions as enclosures [1], which can take the form of patents, copyrights, licenses, fees for use or access, and other mechanisms to retain control over the data.

Open Data is a term and philosophy that asserts that raw data should be treated as a public good and made freely accessible, without intellectual property restrictions or ownership. Various models for providing shared access and easy mechanisms for putting data into the public domain have been developed and will be explored in this chapter.

The sciences depend on access to and use of factual data. Powered by developments in electronic storage and computational capability, scientific inquiry is becoming more data-intensive in almost every discipline. Whether the field is meteorology, genomics, medicine or high-energy physics, research depends on the availability of multiple databases, from multiple public and private sources, and their openness to easy recombination, search and processing. [2]

Data Resources

Data sets are different from other types of resources because they are non-rivalrous resources that cannot be depleted. Usage of data by one entity does not reduce the total availability for others to use. Data sets are also conditionally renewable, in that most data becomes less useful over time and can become obsolete. Garrett Hardin's article "The Tragedy of the Commons" illustrates the argument that unrestricted access to and unrestricted demand for a finite resource ultimately dooms the resource through over-use. This occurs because the benefits of exploitation accrue to individuals or groups, each of whom is motivated to maximize use of the resource to the point at which they become reliant on it, while the costs of the exploitation are borne by all those to whom the resource is available. The nature of data sets generally precludes this "tragedy": because they cannot be depleted, over-use scenarios do not arise.

Community Behavior

The practice of science has a history of treating scientific discoveries as de facto public goods. The tradition of peer review requires that scientific discoveries and claims be subject to the scrutiny of other scientists in the same field. Peer review requires experts who have access to the data used to generate the claims to be reviewed. Moreover, the overarching normative behavior of scientists has tended towards a more open environment. The Mertonian Norms of Science are a set of ideals that Robert K. Merton developed to explain the manner in which scientists should behave and approach scientific practice.

  • Communalism - the common ownership of scientific discoveries, according to which scientists give up intellectual property rights in exchange for recognition and esteem
  • Universalism - according to which claims to truth are evaluated in terms of universal or impersonal criteria, and not on the basis of race, class, gender, religion, or nationality
  • Disinterestedness - according to which scientists are rewarded for acting in ways that outwardly appear to be selfless
  • Organized Skepticism - all ideas must be tested and are subject to rigorous, structured community scrutiny [3]

These norms inform the view that allowing as many scientists as possible to have access to raw data is in the best interest of the progress of science. Allowing multiple perspectives on data sets helps find mistakes and prevent repetition of existing work, which can have significant costs. Providing new information to the scientific and scholarly community brings with it prestige, as well as serving the practical needs of the entity providing the data.

However, community norms for re-using data can vary between scientific disciplines. Bioscience, for example, has a long tradition of publicly provided databases in which scientists contribute and aggregate raw data. Disciplines that use instruments such as telescopes and satellites often rely on community-provisioned equipment to collect data across a range of facilities, and thus have well-respected data re-use norms and policies. As instruments grow more sophisticated and data sets become larger, there are practical technological challenges to sharing these large amounts of data. While discoveries funded by public sources have traditionally been put into the public domain, privately funded ventures are increasingly doing original research and providing data with a number of restrictions that reflect their commercial roots. Further, the difficulty of distinguishing between information created for the specific display of data (databases) and collections of raw data captured as a result of research (data sets) creates confusion and controversy when navigating access rights and responsibilities.

Challenges

Historically, intellectual property law in the US did not protect raw data or facts, but rather inventions and original created works that may be based on the raw data. For example, "one could patent the mousetrap, not the data on the behavior of mice, or the tensile strength of steel. A scientific article could be copyrighted; the data on which it rested could not."

US law also mandated that federal government works were released into the public domain immediately. This was applicable even if the works were copyrighted because of governmental involvement in scientific research. Federally funded scientific research was to encourage the widespread dissemination of data at or below cost, in the belief that, like the interstate system, this provision of a public good would yield economic benefits.

The challenge of privately funded data collection was exemplified during the Human Genome Project, when Celera Genomics, a private entity, announced its intention to patent gene sequences. The 1996 Bermuda Accords emerged from a gathering of stakeholders working on sequencing the human genome, who produced a set of principles to encourage the immediate release and publication of sequences and to put the entire sequence in the public domain. This community agreement was intended to further the larger goal of science to serve the greater public good. Celera used a "shotgun strategy" of genome sequencing and operated at a lower cost than the publicly funded Human Genome Project (HGP). This competition led the HGP to work faster and more efficiently, but Celera's position on the rights to the data it was uncovering would prove controversial. Celera incorporated public data into its genome, but would not deposit its findings into public databases and would not allow any public usage of its data. Celera, although publicly agreeing to the Bermuda Principles, proceeded to file 6,500 patent applications. Further, it did not initially release its research under a license that permitted distribution or re-use of its data. In 2000, President Bill Clinton announced that human genes were not patentable and must be made freely available.

Licensing

James Boyle describes intellectual property enclosures as "a conversion into private property of something that had formerly been common property or, perhaps, had been outside of the property system altogether." [4] The genome projects exemplified the private/public struggle over how data and research should be treated in the realm of both science and intellectual property. As commercial interests collide with traditional scientific behavior, navigating the "enclosures" has become a more difficult task. Commercial investors have an interest both in controlling access to their discoveries and in asserting intellectual property rights over the data sets or databases they possess. Further, as research becomes more globalized and data sets get larger, there are legitimate technical hurdles that are often addressed by entities with a commercial interest in providing infrastructure. The collection and storage of large amounts of raw data is a non-trivial problem, and private investors expect a return on those risks. Protecting these interests often comes in the form of licensing agreements for access or restrictions on how the data can be re-used.

Click Wrap Agreement

A click wrap agreement involves end users first viewing the conditions for getting access and using the data they are trying to access. The user then clicks a link or button to accept the conditions of agreement and is granted access. The agreement may contain conditions that prevent data from being used in certain contexts or combined with data with conflicting licenses. Click wrap agreements have been affirmed in US courts.

A click-wrap license presents the user with a message on his or her computer screen, requiring that the user manifest his or her assent to the terms of the license agreement by clicking on an icon. The product cannot be obtained or used unless and until the icon is clicked. For example, when a user attempts to obtain Netscape's Communicator or Navigator, a web page appears containing the full text of the Communicator / Navigator license agreement. Plainly visible on the screen is the query, "Do you accept all the terms of the preceding license agreement? If so, click on the Yes button. If you select No, Setup will close." Below this text are three buttons or icons: one labeled "Back" and used to return to an earlier step of the download preparation; one labeled "No," which if clicked, terminates the download; and one labeled "Yes," which if clicked, allows the download to proceed. Unless the user clicks "Yes," indicating his or her assent to the license agreement, the user cannot obtain the software.

[5]

Towards a Science Commons

"A Large, Leaky Market"

“A large, leaky market may actually provide more revenue than a small one over which one’s control is much stronger.” - James Boyle [6]

Researchers who need to draw from many databases must deal with differing and overlapping data sharing policies, agreements, and laws, which may introduce conflicting obligations, limitations, and restrictions. These agreements can not only impede research but also enable data providers to exercise control over data users, dictating what research can be done and by whom, what data can be published or disclosed, what data can be combined and how, and what data can be re-used and for what purposes. Scientists increasingly see these hurdles as a threat to serious scientific inquiry and practice.

Michael Heller described the "tragedy of the anticommons", a condition in which the conflicting interests of rights holders work against scientific and social progress. This results in the underuse of scarce resources because too many rights owners can block access by other potential users. [7] Carol M. Rose extended this framework to "the comedy of the commons", a situation that arises when freely available resources provide more utility for society as a result of many people using these resources to their fullest extent.

Protocol for Implementing Open Access Data

In response to growing concerns over large data sets, interoperability, and open access, Science Commons engaged stakeholders in the scientific community to draft a protocol that would allow data sets to be interoperable. The result was the Protocol for Implementing Open Access Data, which provides information for people interested in distributing data for open access under the Open Knowledge Definition. The Protocol is intended to provide a framework for public domain data that is compatible internationally, as different countries have different approaches to the intellectual property status of scientific discoveries. [8]

Open Access Tools


Open Data Commons

Public Domain Dedication and License

The PDDL is a method for waiving all rights to data and putting it into the public domain. The provider forfeits all rights, including attribution rights. Providers may choose to attach a set of Community Norms to suggest user behavior.

Open Database License

This license is similar to the Creative Commons Attribution/Share-Alike license, but is designed for data.

Norms

The Community Norms document describes a set of suggested behaviors, not legally binding, by which users of the data should abide. Users are free to ignore them, but doing so may result in others not wishing to share data back with the person who violated the norms. Providers may choose their own set of Norms to attach to their data rather than the default set of suggestions. [9]

Creative Commons: CC0 Universal Waiver

[Figure: Creative Commons CC0 logo mark (for illustrative purposes only)]

The CC Zero universal waiver (CC0) is a tool developed to enable researchers to put their data into the public domain in a simple manner. It is provided by Creative Commons as a result of Science Commons’ “Protocol for Implementing Open Access Data”. It is designed to be compatible with international precedents for intellectual property law, to help ensure that it is a reliable, portable, and legally sound way to waive rights in a work and/or affirm its public domain status. This waiver benefits providers putting large data sets into systems like ProteomeCommons by simplifying the process by which they can grant open access, and it provides users of this data with a clear signal as to its intended use. The CC0 waiver essentially steps outside the realm of copyright by granting public domain-like rights for use and defers to community normative behavior to self-regulate. Systems like ProteomeCommons have their own internal incentive systems to encourage positive behavior on the part of data providers, such as tying the unique hash used for identification in scholarly references to the original license designation.

Similarly, the CC0 waiver depends on users to act in accordance with community-based standards. Since applying CC0 to a data set is a one-way street that does not require attribution, it is possible for others to use, remix or adapt the material in any manner they wish without attribution to the original data provider. However, Creative Commons does point out that providers can request attribution in accordance with community norms and standards. Because CC0 removes the attribution requirement present in other CC licenses, downstream use of the data is not complicated by the need to ensure that every use is properly attributed or referenced pursuant to a non-public domain license.

CC0 1.0 Universal / No Copyright The person who associated a work with this document has dedicated this work to the Commons by waiving all of his or her rights to the work under copyright law and all related or neighboring legal rights he or she had in the work, to the extent allowable by law.

CC Zero is appropriate for data, but can also be used for any type of content protected by copyright. CC Zero is expressed in three ways:

  • human-readable summary
  • legal code
  • machine-readable, digital markup code

Case Studies

ProteomeCommons.org

Proteomics is the large-scale study of proteins, particularly their structures and functions. Proteins are vital parts of living organisms, as they are the main components of the physiological metabolic pathways of cells. Proteomics is often considered the next step in the study of biological systems after genomics, but with its own set of challenges as the proteome differs from cell to cell and from time to time. DNA is chemically and physically simpler than proteins. There are also a larger number of analytical technologies and instruments used in the collection of proteomic data, which leads to larger, more complex databases. Identification of these separated proteins or peptide fragments is typically achieved by mass spectrometry measurements. Aggregating and integrating the data collected from disparate platforms and instruments is a large challenge for the field of proteomics, which has led to the development of systems and community standards to address these needs.

ProteomeCommons.org was created by Dr. Jayson Falkner and Dr. Pete Ulintz in the laboratory of Dr. Philip Andrews at the University of Michigan to help address some of these challenges. The site uses the Tranche distributed platform for permanent storage of scientific data in a manner that is suitable for publication. The service provides the ability to annotate data using common standards, manage projects, and easily apply licensing terms or waivers to data uploads. ProteomeCommons has adopted the CC0 waiver as the default option to promote data sharing in the scientific community.

“The ProteomeCommons.org Tranche network is one such early adopter. Our goal is to remove as many barriers to scientific data sharing as possible in order to promote new discoveries. The Creative Commons CC0 waiver was incorporated into our uploading options as the default in order to help achieve this goal. By giving a simple option to release data into the public domain, CC0 removes the complex barriers of licensing and restrictions. This lets researchers focus on what's most important, their research and new discoveries.” - Dr. Philip Andrews [10]

By selecting CC0 as the default terms of use, ProteomeCommons removes much of the uncertainty for scientists who use and cite data housed in its Tranche network. The CC0 waiver asserts that data can be used downstream without fear of navigating complex licensing and usage agreements. Further, the storage system is not designed to be a database, but rather a data storage system that is agnostic to file format and structure; it can accept any type of raw data as data sets. This mitigates the possibility of creative expression inherent in a database presentation.

Tranche Project

[Figure: Tranche fractal]

ProteomeCommons.org is built on the Tranche software platform and primarily stores tandem mass spectrometry proteomics data. The Tranche project is primarily developed and supported by the University of Michigan, but it is provided as free and open-source software. Anyone may participate in the development of the code and use it any way they please. A Subversion repository is located at SourceForge.

Secure Storage

Tranche addresses the data sharing problem through the use of a secure distributed file system, with data sliced into small chunks (1 MB) and shared across many servers around the world. Each file is replicated at least 3 times across the servers, which allows for faster, distributed downloads and more redundancy against failure, as server instances can fail with little effect on the network as a whole. The risk of file corruption or loss is greatly limited and repairable by the "self-healing" properties of the distributed system.
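The chunking and replication described above can be illustrated with a small sketch. This is not Tranche's actual code or placement algorithm, which the text does not describe; the function names, the server names, and the ring-style placement rule are all assumptions made for illustration. Only the 1 MB chunk size and the minimum of 3 copies come from the text.

```python
import hashlib
from typing import List

CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, as stated in the text
REPLICAS = 3              # at least 3 copies, as stated in the text


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Slice the data into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunk(chunk: bytes, servers: List[str], replicas: int = REPLICAS) -> List[str]:
    """Pick `replicas` distinct servers for one chunk.

    Hypothetical placement rule: hash the chunk to choose a starting
    server, then take the next servers around the list.
    """
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replication")
    digest = hashlib.sha256(chunk).digest()
    start = int.from_bytes(digest[:8], "big") % len(servers)
    return [servers[(start + k) % len(servers)] for k in range(replicas)]


def plan_storage(data: bytes, servers: List[str]) -> List[List[str]]:
    """Return, for each chunk in order, the servers that would hold a copy."""
    return [place_chunk(chunk, servers) for chunk in split_into_chunks(data)]
```

Because every chunk lives on three different servers in this sketch, losing any single server still leaves at least two copies of each chunk, which is the kind of redundancy the paragraph describes.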

Security is provided by a 256-bit Advanced Encryption Standard (AES-256) encrypted hash code to ensure data integrity. This encryption allows users to know who published data to the system and prevents illicit data from getting published and shared. Data providers have the ability to share data securely and privately with others if they are not ready to release it publicly.

Citation

The importance of standardized, reliable citation for scholarly publishing is a critical concern when utilizing a system like Tranche. Scientists must be confident that any references to data stored on ProteomeCommons.org are unique and persistent so that other scientists may reference and review their data. To solve this, the Tranche system uses a checksum to generate what they call a "Tranche Hash". The hash provides a static, meaningful, and persistent reference for the data that can be used as a permanent reference and to verify the integrity of the data itself.

  • The hash is based on the data itself. It is not an arbitrary URL.
    • Anyone with the data can use the hash to verify that the data is identical to what was published.
    • Software results are more reproducible because you'll know whether the software or data has changed since publication.
    • Any server on the network can look up the data based on its hash.
  • Hashes don't change. You will never have a 'broken link', which often occurs with URLs.
  • The hash is based on a standard algorithm -- no new scheme for referencing is being made up.
  • You have many choices for downloading the data as the network isn't restricted to HTTP and web browsers.
  • You can also download data from the network using many other tools, including customized programs. [11]
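The properties listed above are those of content-addressed storage in general. The following sketch shows the idea under stated assumptions: SHA-256 stands in for whatever standard algorithm Tranche uses (the text does not name it), and `HashAddressedStore` is a hypothetical in-memory stand-in for a server on the network.

```python
import hashlib
from typing import Dict


def content_hash(data: bytes) -> str:
    """Derive the identifier from the data itself (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, published_hash: str) -> bool:
    """Check that the data is byte-for-byte what was published."""
    return content_hash(data) == published_hash


class HashAddressedStore:
    """Hypothetical store in which data is looked up by its hash."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def publish(self, data: bytes) -> str:
        """Store the data and return the hash to cite in a publication."""
        key = content_hash(data)
        self._blobs[key] = data
        return key

    def fetch(self, key: str) -> bytes:
        """Return the data for a hash, refusing to serve corrupted content."""
        data = self._blobs[key]
        if not verify(data, key):
            raise ValueError("stored data does not match its hash")
        return data
```

A reader who obtains the data from any source can call `verify` with the cited hash; a single changed byte produces a different hash, so the check fails.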

The hash is also used to convey the license under which the data set is provided. Contributors can choose the licensing terms for their data sets, which are included when generating the hash. If these terms change at any point, the hash itself will be regenerated to incorporate these changes. [12] This is done to reinforce community norms for open and consistent data sharing licensing and behavior. Although the Tranche system is license-agnostic, it encourages use of the CC0 waiver for data sharing. If the provider decides to change this license at any time, the hash will change, breaking links to the data they stored in the system. Since data providers are often the first people to cite the data they provided in publication, this behavior is an incentive for providers to be honest and consistent with their data licensing terms.
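One way to see why a license change breaks existing references is to fold the license text into the hash input. The sketch below is an illustration only: how Tranche actually combines the license with the data is not described here, and `dataset_hash`, the use of SHA-256, and the length-prefixed layout are assumptions.

```python
import hashlib


def dataset_hash(data: bytes, license_terms: str) -> str:
    """Hash the license terms together with the data (hypothetical layout).

    The license is length-prefixed so that the boundary between license
    and data is unambiguous in the hashed byte stream.
    """
    license_bytes = license_terms.encode("utf-8")
    hasher = hashlib.sha256()
    hasher.update(len(license_bytes).to_bytes(8, "big"))
    hasher.update(license_bytes)
    hasher.update(data)
    return hasher.hexdigest()
```

With this construction, the same data published under two different license strings yields two different hashes, so a citation made under the original terms no longer resolves if the provider changes them.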


References

  1. The Second Enclosure Movement and the Construction of the Public Domain. Boyle, James.
  2. ScienceCommons.org: Towards a Science Commons
  3. Mertonian Norms
  4. The Second Enclosure Movement and the Construction of the Public Domain. Boyle, James.
  5. Specht v. Netscape Communications Corp.
  6. The Second Enclosure Movement and the Construction of the Public Domain. Boyle, James.
  7. Heller et al. Can Patents Deter Innovation? The Anticommons in Biomedical Research. Science, New Series (1998) vol. 280 (5364) pp. 698-701
  8. Protocol for Implementing Open Access Data
  9. [1]
  10. Creative Commons Blog:Expanding the Public Domain: Part Zero
  11. Tranche Project: About
  12. Tranche and the Open Access Database Protocol: Tranche Hashes