Practical DevOps for Big Data/Introduction

From Wikibooks, open books for an open world

Big Data Matters

Big data is a major trend in information and communication technologies (ICT). With the constant proliferation of mobile devices and sensors around the world, the ever-growing amount of user-generated content on the social Web, and the imminent advent of the Internet of Things, through which everyday equipment (smartphones, electrical appliances, vehicles, homes, etc.) will communicate and exchange data, the data flood that characterises modern society opens new opportunities and poses interesting challenges. Governments, too, increasingly automate public services. The ability to process massive and ever-increasing volumes of data efficiently is therefore becoming crucial. Last but not least, big data is a job-creating paradigm shift. In the United Kingdom alone, from 2008 to 2012, demand for big data architects, analysts, administrators, project managers, and designers rose by 157%, 64%, 186%, 178%, and 329% respectively[1].

Characteristics of Big Data Systems

An organisation that wants to benefit from big data has to reconsider how its information system is architected. To be big data ready, the system must be designed with three essential features consistently kept in mind. The first is scalability: infrastructure scalability, data platform scalability, and processing scalability. At the infrastructure level, scalability allows the organisation to add as many storage and computing resources to the system as it needs. Small and medium-sized enterprises (SMEs) normally do not have this capacity on their own, but they can use the infrastructure of a cloud provider, which offers them the possibility to create more virtual machines or containers at will (in return for payment, of course). At the data platform level, scalability refers to the adoption of inherently distributed data management software, i.e. software that disperses data over a network and can exploit the new storage space that becomes available when computers are added to the network. At the processing level, scalability is about several operating system processes, generally running on different machines, cooperating to parallelise a task. An arbitrary number of processes can join the effort, particularly when the workload is high.
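Processing-level scalability can be sketched in miniature. In the example below (the word-count task and the chunks are purely illustrative assumptions, not part of any particular big data platform), independent operating system processes each handle one chunk of a larger dataset, and adding processes lets the system absorb a higher workload:

```python
from multiprocessing import Pool


def word_count(chunk):
    """Count the words in one chunk of a larger dataset."""
    return len(chunk.split())


if __name__ == "__main__":
    # A large text split into chunks; each chunk can be processed by a
    # separate operating system process, possibly on a different machine.
    chunks = [
        "big data is a major trend",
        "scalability matters",
        "so does robustness",
    ]
    with Pool(processes=3) as pool:
        counts = pool.map(word_count, chunks)  # chunks processed in parallel
    print(sum(counts))  # aggregate the partial results
```

Real distributed processing frameworks follow the same split-process-aggregate shape, but spread the worker processes across a cluster rather than a single machine.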

In sum, the mission of a big data architect is to coordinate the activities of various scalable data platform and processing technologies, deployed on a scalable infrastructure, in order to meet precise business needs. The way the technologies are organised and the relationships between them are what we call an architecture. Some architectural patterns have been discovered and documented, and can be found in the literature; Nathan Marz's Lambda Architecture[2] is one example. Although the architecture of a big data system is opaque to end users, there is one thing they inevitably expect: low overall latency. So the second vital property of a big data system is quality of service (QoS). QoS is a set of requirements on the values of specific performance metrics such as CPU time, input/output wait time, and response time.
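To make the idea of a QoS requirement concrete, the following sketch checks measured response times against a target. The 95th-percentile rule and the 0.5-second threshold are our own illustrative assumptions, not values prescribed by any standard or by this book:

```python
import time


def measure_response_time(operation):
    """Return the wall-clock latency of a single operation, in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def check_qos(latencies, threshold_s=0.5):
    """Illustrative QoS requirement: the 95th-percentile response time
    must stay under the threshold (both figures are assumptions)."""
    ordered = sorted(latencies)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 <= threshold_s
```

For example, `check_qos([0.1] * 20)` holds, while `check_qos([1.0] * 20)` does not. A percentile-based requirement is usually preferred over a mean, because a low average can hide a long tail of slow responses that end users notice.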

The third fundamental characteristic of a big data system is robustness. When some machines or operating system processes of the big data architecture go down, the system must continue to function, and the failure must not be noticeable from an external perspective.
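A common way to achieve this kind of robustness is replication with failover: the same data lives on several nodes, and a request silently moves on to the next replica when one node is down. The sketch below uses a deliberately simplified, hypothetical replica interface (each replica is just a callable that may raise `ConnectionError`); real data platforms implement the same idea internally:

```python
def read_with_failover(key, replicas):
    """Try each replica in turn; the loss of individual nodes stays
    invisible to the caller as long as one replica still answers."""
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError:
            continue  # node down: fall through to the next replica
    raise RuntimeError(f"all replicas failed for key {key!r}")
```

From the outside, a read served by the third replica is indistinguishable from one served by the first, which is exactly the property described above.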

Big Data for Small and Medium-Sized Enterprises

The features described in the previous section show that architecting a big data system is out of reach for many SMEs wishing to compete in the big data market. Firstly, they have to rent the infrastructure of big players. Secondly, it is difficult for them to master big data technologies quickly (sometimes designed by those same big players) because of their steep learning curves. Thirdly, it is even more complex to coordinate these technologies and reach an acceptable quality of service.

We believe this situation is not irremediable. To change it, we identified and tried to remove the following hindrances:

  1. Lack of tools to reason about possible architectures for a big data system. An architect needs a tool to clarify their ideas about how to organise big data technologies. Modelling languages are perfect for this purpose.
  2. Lack of tools able to connect architectural big data models with production. Models are not solely abstract representations; they are grounded in a concrete reality. In the case of big data, this reality is the production environment, that is, the cloud infrastructure upon which the big data system will be built. An integrated development environment (IDE) should allow architects to move in a user-friendly manner from models (ideas) to production (results). The more formal a modelling language is (i.e. the more its concepts have a precise, even mathematical, meaning), the easier it is for developers to write programs that exploit, analyse, and transform models, and that generate, package, and deploy software automatically.
  3. Lack of tools to refactor a deployed big data architecture according to measured runtime performance and pre-planned objectives. When the quality of service obtained is not as good as the models seemed to promise, refactoring tools should be there to correct architectural mistakes.
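The second point above, moving mechanically from a formal model to production, can be sketched in miniature. Here the "model" is just a Python dictionary, and the generated plan is a list of provisioning steps; the node names, technologies, and replica counts are entirely hypothetical and stand in for a real modelling language and a real cloud API:

```python
# A hypothetical, minimal architecture model: because its structure is
# precise, a program can process it automatically.
model = {
    "nodes": [
        {"name": "storage", "technology": "cassandra", "replicas": 3},
        {"name": "compute", "technology": "spark", "replicas": 5},
    ]
}


def generate_deployment_plan(model):
    """Transform the abstract model into concrete provisioning steps."""
    plan = []
    for node in model["nodes"]:
        for i in range(node["replicas"]):
            plan.append(
                f"provision vm {node['name']}-{i} running {node['technology']}"
            )
    return plan
```

A real model-driven toolchain would target an actual infrastructure API instead of producing strings, but the principle is the same: the formality of the model is what makes the automation possible.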

This book explains our work to overcome these obstacles.

References

  1. e-skills UK (2013). Big Data Analytics: An Assessment of Demand for Labour and Skills, 2012–2017. https://ec.europa.eu/digital-single-market/en/news/big-data-analytics-assessment-demand-labour-and-skills-2012-2017
  2. Marz, Nathan; Warren, James (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning. ISBN 9781617290343.