Big data

From Wikibooks, open books for an open world



Specification coverage
  • 3.11 Big data

Big data

Big data - a generic term for large or complex datasets that are difficult to store and analyse.

Big data is a generic term given to datasets that are so large or complicated that they are difficult to store, manipulate and analyse. The three main features of big data are:

  • volume: the sheer amount of data is on a very large scale.
  • variety: the types of data being collected are wide-ranging and varied, and may be difficult to classify.
  • velocity: the data changes quickly and may come from constantly changing data sources.

The lack of structure in big data is often considered to be the aspect that creates the most difficulty. For this reason, traditional methods of organising and analysing data, such as relational databases queried with SQL, are often poorly suited to big data. However, when the correct techniques are applied, a vast amount of useful information can be revealed. Processing big data allows professionals such as data scientists to spot and analyse hidden patterns and relationships that would not have been easy to find before.

Big data is used for different purposes. In some cases, it is used to record factual data such as banking transactions. However, it is increasingly being used to analyse trends and try to make predictions based on relationships and correlations within the data. Big data is being created all the time in many different areas of life. Examples include:

  • scientific research
  • retail
  • banking
  • government
  • mobile networks
  • security
  • real-time applications
  • the Internet.

Latency - the time delay that occurs when transmitting data between devices.

Latency is critical here. In the context of big data, it can be described as the amount of time it takes to turn the raw data into meaningful information. With big data there may be a large degree of latency because of the time taken to access and manipulate the sheer number of records.
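
The idea of latency as the time taken to turn raw data into information can be illustrated with a minimal Python sketch. The functions summarise and timed_summary, and the record sizes used, are made up for this illustration and are not part of the specification. The sketch only times a simple in-memory calculation, so it shows the principle rather than a real big data workload.

```python
import time


def summarise(records):
    """Turn raw records into one piece of information: the mean value."""
    return sum(records) / len(records)


def timed_summary(records):
    """Return the summary and the time taken to produce it, in seconds."""
    start = time.perf_counter()
    result = summarise(records)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Hypothetical dataset sizes, chosen only to show the effect of scale.
    for size in (1_000, 1_000_000):
        records = list(range(size))
        mean, seconds = timed_summary(records)
        print(f"{size} records: mean={mean}, took {seconds:.6f} s")
```

The exact timings depend on the machine, but the larger dataset would normally take noticeably longer to process, which is the kind of delay the paragraph above describes.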

Structured and unstructured data

Structured data - data that fits into a standard database structure of columns and rows (fields and records).

Unstructured data - data that doesn't fit into a standard database structure of columns and rows (fields and records).

Most databases work on the model that the data will fall into columns and rows, otherwise referred to as fields and records. This makes the data easy to organise and store, as each item can be entered into the appropriate field. When the data is analysed, it is relatively easy to carry out searches and sorts to query it. Some data doesn't fit into this model. Data can therefore be defined as either structured or unstructured.

  • Structured data: data that can be defined with traditional database techniques, using fields and records.
  • Unstructured data: data that cannot be defined in columns and rows. Examples include multimedia data, web pages and the contents of emails, documents and presentations. This type of data is much harder to analyse.
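
The difference can be sketched in a few lines of Python. The transaction records, the email text and the function names below are all invented for illustration. The structured records can be queried directly by field name, whereas the free text has to be broken up and processed before anything can be learned from it.

```python
import re
from collections import Counter

# Structured data: every record has the same named fields.
# These records are made-up examples.
transactions = [
    {"account": "A100", "amount": 250.00, "type": "deposit"},
    {"account": "B200", "amount": 40.50, "type": "withdrawal"},
    {"account": "A100", "amount": 75.25, "type": "withdrawal"},
]


def total_for_account(records, account):
    """Query by field: add up the amounts recorded for one account."""
    return sum(r["amount"] for r in records if r["account"] == account)


# Unstructured data: free text with no fields to search or sort on.
email_body = "Hi, my card was declined twice today. Please check my card and account."


def word_counts(text):
    """Split free text into lower-case words and count each one."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    print(total_for_account(transactions, "A100"))
    print(word_counts(email_body).most_common(3))
```

Counting words is only one very simple way of extracting information from text. It is used here to show that extra processing is needed, not as a recommended analysis technique.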