Practical DevOps for Big Data/Methodology

From Wikibooks, open books for an open world

In this chapter, we introduce a way of designing big data applications. The underlying idea is to incorporate techniques from model-driven engineering into a DevOps development life cycle. Why is such an approach suitable and fruitful for data-intensive software? The question is fair, and we shall answer it first. We will start by advocating the use of models in DevOps. We will then look at some benefits of applying model-driven DevOps to big data application construction. Finally, we will introduce our methodology proposal and explain how the DICE IDE supports it.

Model-Driven DevOps

In a typical organisation, developers build and test software in an isolated, provisional development environment, using an integrated development environment (IDE) such as Eclipse or Visual Studio, while operators are in charge of the targeted, final production environment. The latter conceptually comprises all entities with which the application is planned to interact: operating systems, servers, software agents, persons, and so forth. Operators are responsible for, among other things, preparing the production environment, controlling it, and monitoring its behaviour, especially once the application is deployed into it.

Before DevOps, developers and operators usually formed two separate teams. As a consequence, the former were not fully aware of how their development environment differed from the production environment, and problems they did not see while building and testing the application in the development environment suddenly emerged once operators installed it in the production environment. DevOps recommends placing developers (the “Dev” prefix of “DevOps”) and operators (the “Ops” suffix of “DevOps”) in the same team. This way, a continuous collaboration between the two greatly increases the confidence that every software component created is really ready for production. Indeed, since operators are the ones who understand the production environment best, their active and constant support to developers results in a program that is designed with a production-aware mindset. The continuity of the cooperation shows its full value when the build, unit testing, integration, integration testing, delivery, system testing, deployment, and acceptance testing phases are all completely automated.

DevOps: from integration to deployment

In DevOps, software components built by distinct DevOps teams are tested in isolation (unit testing). They are then assembled to create a single piece of software (integration). Afterwards, the software is tested (integration testing) and put into a form, often called a package, that facilitates its distribution (delivery). Next, the package is tested (system testing) and installed into the production environment (deployment) to be tested one last time (acceptance testing). DevOps prescribes that operators should assist developers in order to fully automate these stages. Once this is done, continuous integration, continuous delivery, continuous deployment, and continuous testing are achieved.
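As a rough illustration of the stage ordering described above, the following Python sketch runs the stages one after another and stops at the first failure. The stage names come from the text (with the build stage placed in front); the `run_pipeline` function and the choice of representing each stage as a callable that returns a boolean are assumptions made for this example and do not correspond to any particular DevOps or DICE tool.

```python
from typing import Callable, Dict, List

# Stage names taken from the text, in the order in which they occur.
STAGES: List[str] = [
    "build",
    "unit testing",
    "integration",
    "integration testing",
    "delivery",
    "system testing",
    "deployment",
    "acceptance testing",
]


def run_pipeline(steps: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run the stages in order and return the ones that completed.

    `steps` maps a stage name to a hypothetical automation callable that
    returns True on success. The pipeline stops at the first stage that is
    missing (not automated) or that reports a failure.
    """
    completed: List[str] = []
    for stage in STAGES:
        step = steps.get(stage)
        if step is None or not step():
            break
        completed.append(stage)
    return completed


if __name__ == "__main__":
    # Every stage is automated and succeeds, except system testing.
    steps = {stage: (lambda: True) for stage in STAGES}
    steps["system testing"] = lambda: False
    print(run_pipeline(steps))
```

In this sketch, a package that fails system testing never reaches deployment, which is the guarantee a fully automated chain is meant to give.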

Model-driven practices have proven their usefulness for designing programs and generating their source code. Thus, model-driven engineering already covers the build and unit testing of software components quite well. Our idea is to use models to automate integration, delivery, and deployment as well. The DICE Consortium has invented a modelling language for that purpose. (The language does not handle integration though.) Tools have been developed to let developers and operators jointly model the application, together with its production environment, at different levels of abstraction. These tools interpret models to automatically prepare and configure the production environment, and also to deploy the application into it. Here are some advantages of this approach:

  1. Models not only specify the application but also the production environment fit for it.
  2. Even more, since properties of the production environment are described, one can sometimes exploit models to simulate or verify formally (i.e. abstractly and mathematically) the behaviour of the application in its production environment before any deployment.
  3. A common modelling language for developers and operators contributes to better communication between them. Ambiguities and misunderstandings are diminished because concepts employed in models are clearly defined.
  4. It is possible to refactor the application and its production environment directly through their models.
  5. The time to market of the application is reduced because some development and operation tasks are now performed by computers.

The DICE Consortium has experimented with these five points and has found the method pertinent. Big data applications generally rely on big data technologies mostly reputed to be difficult to learn. From our point of view, model-driven DevOps may help significantly.

Model-Driven DevOps for Big Data

The term big data names the inability of traditional data systems to efficiently handle new datasets, which are massive (volume), arrive rapidly (velocity) from sundry sources and in diverse formats (variety), and whose flow rate may change abruptly (variability)[1]. Volume, velocity, variety, and variability are known as the four Vs of big data. The traditional way to deal with them is to scale vertically, that is, to speed up algorithms or hardware (e.g. processors or disks). Such an approach is inherently limited and, unsurprisingly, fails. Big data is a shift towards horizontally scalable parallel processing systems whose capabilities grow simply by adding more machines to the existing collection[2].

The National Institute of Standards and Technology (NIST) published a valuable big data taxonomy[3]. A slightly revised version is depicted in the figure below. The service provision chain is pictured from top to bottom: system orchestrators expect some functionalities from the big data application, which is implemented by big data application providers, with the help of big data frameworks designed by big data framework providers. The information flow chain is shown from left to right: pieces of information are presented by data owners to data providers, which digitize them and transfer them to a big data application for processing. Its outputs are received by data consumers, which may convey them to data viewers in a user-friendly manner. The different activities of these nine functional roles are encompassed by security and privacy issues.

Revised NIST big data taxonomy

There is always a social aspect to big data. System orchestrators, big data application providers, big data framework providers, data owners, and data viewers are normally people or organisations. That is why a dedicated symbol represents them. Data ownership, in particular, is a legal notion that is not easy to define comprehensively. A basic definition could be:

Data owner
A data owner is a person or an organisation to which a piece of data belongs and which exercises its rights over it. It has the legal power to prevent others from exploiting its data, and it can take court action to do so, customarily under certain conditions.

People, scientists, researchers, enterprises, and public agencies obviously are data owners[3]. Data ownership is shareable. Indeed, before using a big data system (e.g. Facebook), a data owner commonly must accept an agreement by which they and the system orchestrators will be bound. Sometimes, using the system implies acceptance of the agreement. To some extent, this agreement may concede data ownership to system orchestrators. For example, every Internet user who has filled in online forms asking for personal data has probably seen an “I agree to terms and conditions” checkbox at least once. Frequently, these terms and conditions ask the user for permission to do specific things with their data. By ticking the checkbox, the user may, for instance, grant system orchestrators the liberty to disclose it to business partners. The role of system orchestrator can be defined as follows:

System orchestrator
A system orchestrator is a person or an organisation whose position allows it to participate in the decision-making process regarding requirements of a big data system, as well as in monitoring or auditing activities to ensure that the system built or being built complies with those requirements[3].

System orchestrators settle, for example, data policies, data import requirements, data export requirements, software requirements, hardware requirements, and scalability requirements. We find among system orchestrators: business owners, business leaders, business stakeholders, clients, project managers, consultants, requirements engineers, software architects, security architects, privacy architects, and network architects[3].

Data viewer
A data viewer is a person or an organisation to which the big data system communicates some information.

In the figure, the boxes denote hardware or software. Data providers, data consumers, big data applications, and big data frameworks are all digital entities. Their job is to cope with data. When we say data, we always mean digital data. We use the term information to designate non-digital data. And we call knowledge every piece of information that is true. Veracity—yet another V—is an important concern in big data.

Data provider
A data provider is a piece of hardware or software that makes data available to itself or to others[3]. It takes care of access rights, and determines who can read or modify its data, by which means, and which uses are allowed or forbidden.

Database management systems match this definition. Data providers have many methods at their disposal to transmit data: event subscription and event notification, application programming interface, file compression, Internet download, and so on. They can also decide to create a query language or another mechanism to let users import processing without fetching data. This practice is described as moving the computation to the data rather than the data to the computation[3], and is represented in the figure above by directed arcs. A data provider may turn information entered by or taken from data owners into digital data that can be processed by computers. In that case, it is a boundary between the big data system and the non-digital world. This is, for instance, the function of a dialog box or an online form. All capture devices such as video cameras and sensors are data providers too.

Big data application and big data application provider
A big data application is a piece of software that derives new data from data supplied by data providers. It is designed and developed by big data application providers.

A big data application is a special kind of data provider, namely one that depends on other data providers. But not all data providers are big data applications. For example, a database management system is not a big data application because it does not generate new data by itself. Big data applications usually perform the following tasks: data collection from data providers, data preparation, and analytics. Data preparation occurs before and after the analytical process. Indeed, as the saying goes, garbage in, garbage out. Likewise, bad data, bad analytics. Hence the necessity to prepare data. For example, data coming from untrustworthy data providers and incorrectly formatted data should be discarded. After the analysis, a big data application may arrange the results so that a data consumer can display them more easily and meaningfully on a screen. All these tasks exist in traditional data processing systems. Nonetheless, the volume, velocity, variety, and variability dimensions of big data radically change their implementation[4]. In DevOps, developers and operators are big data application providers.

Big data framework and big data framework provider
A big data framework is an infrastructural or technological resource or service that provides a big data application with the efficiency and horizontal scalability required to handle ever-growing big data collections. It is made available by big data framework providers.

NIST classifies big data frameworks as infrastructure frameworks, data platform frameworks, or processing frameworks[3]. An infrastructure framework provider furnishes a big data system with the pieces of hardware it needs: storage devices, computers, networking equipment, cooling facilities, etc. The infrastructure may be hidden ("clouded") behind a service delivered through an application programming interface. This service may, for example, allow users to remotely create, start, stop, and remove virtual machines or containers on demand. Data centres and cloud providers belong to this category of big data framework providers. Data platform frameworks are data storage programs that capitalise on a network to spread data over several machines in a scalable way. Thus, the more machines we connect together, the more storage space we get. The Hadoop distributed file system (HDFS) and Apache Cassandra are both data platform frameworks. Processing frameworks, like Hadoop MapReduce and Apache Spark, distribute processing rather than data. The addition of machines must not be detrimental to the efficiency of data retrieval and processing. Instead, better performance should be observed. Open source communities innovate a lot to improve data platform and processing frameworks.

Data consumer
A data consumer is a piece of software or hardware that uses or receives outputs originating from a big data application.

The nine functional roles are not mutually exclusive. It is theoretically possible to be a system orchestrator, a data owner, a data viewer, a big data application provider, and a big data framework provider at the same time. And a program can simultaneously be a big data application, a data provider, a data consumer, and a big data framework. As an example, let us consider the fictitious scenario shown in the following figure:

Example of a NIST big data system

Two users User 1 and User 2 are connected to a big data system. They both interact with it by means of a graphical user interface. Let us imagine GUI 1 is a website browsed by User 1, and GUI 2 is a desktop application run by User 2. Application 1 and Application 2 are two big data applications. Storage and Analyser are respectively a data platform framework and a processing framework. Information entered by User 1 is transferred to Application 1 by GUI 1. Application 1 saves it in Storage and relays it to Application 2. Analyser uninterruptedly analyses data contained in Storage and sends its analysis also to Application 2. With all these inputs on hand, Application 2 executes a certain algorithm and transmits the results to GUI 1 and GUI 2, which, lastly, reveal them to their users. In this scenario, many actors play multiple roles. For instance, User 1 is a data owner and a data viewer, and Storage is a data provider, a data consumer, and a big data framework.

We can see NIST's taxonomy as a technology-agnostic modelling language, and the figure above as a model designed with it. Technological choices are not indicated: Storage may be implemented by Apache Cassandra or HDFS, and Analyser may be a MapReduce or a Spark job. Some of the rules of this modelling language could informally be:

  1. Every node has a name (e.g. “Analyser”), an icon, and at least one label.
  2. Node names and node icons are freely chosen.
  3. Allowed node labels are: data owner (DO), data viewer (DV), system orchestrator (SO), big data application provider (BDAP), big data framework provider (BDFP), data provider (DP), big data application (BDA), data consumer (DC), big data framework (BDF), infrastructure framework (IF), data platform framework (DPF), and processing framework (PF).
  4. Since big data applications are data providers, a node labelled with BDA must also be labelled with DP. Similarly, a node labelled with IF, DPF, or PF must also be labelled with BDF.
  5. An information flow is allowed only: (a) from a data owner to a data provider; and (b) from a data consumer to a data viewer.
  6. A data flow is allowed only: (a) from a data provider to a big data application; (b) from a big data application to a data consumer or a big data framework; and (c) from a big data framework to a big data application or another big data framework.
  7. A “provides” association is allowed only: (a) from a big data application provider to a big data application; and (b) from a big data framework provider to a big data framework.
  8. A “uses” association is allowed only: (a) from a system orchestrator to a big data application; (b) from a data owner to a data provider; (c) from a big data application to a data provider or a big data framework; (d) from a big data framework to another big data framework; (e) etc.
  9. Etc.
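To make these informal rules more concrete, here is a minimal Python sketch that encodes rules 1, 3, 4, 5, and 6 and checks them against the fictitious scenario described earlier. The function names (`check_nodes`, `flow_allowed`), the decision to reject a flow from a node to itself, and the labels assigned to GUI 1, GUI 2, User 2, and the two applications are assumptions made for illustration; the text only states the labels of User 1 and Storage explicitly.

```python
from typing import Dict, List, Set, Tuple

# Rule 3: the allowed node labels.
ALLOWED_LABELS: Set[str] = {
    "DO", "DV", "SO", "BDAP", "BDFP", "DP", "BDA", "DC", "BDF", "IF", "DPF", "PF",
}
# Rule 4: a label on the left requires the label on the right.
IMPLIED: Dict[str, str] = {"BDA": "DP", "IF": "BDF", "DPF": "BDF", "PF": "BDF"}
# Rule 5: allowed (source label, target label) pairs for information flows.
INFORMATION_FLOWS: Set[Tuple[str, str]] = {("DO", "DP"), ("DC", "DV")}
# Rule 6: allowed (source label, target label) pairs for data flows.
DATA_FLOWS: Set[Tuple[str, str]] = {
    ("DP", "BDA"), ("BDA", "DC"), ("BDA", "BDF"), ("BDF", "BDA"), ("BDF", "BDF"),
}


def check_nodes(nodes: Dict[str, Set[str]]) -> List[str]:
    """Return a list of violations of rules 1, 3, and 4 (empty if none)."""
    errors: List[str] = []
    for name, labels in nodes.items():
        if not labels:
            errors.append(f"{name}: at least one label is required")
        for unknown in sorted(labels - ALLOWED_LABELS):
            errors.append(f"{name}: unknown label {unknown}")
        for label, required in IMPLIED.items():
            if label in labels and required not in labels:
                errors.append(f"{name}: label {label} also requires {required}")
    return errors


def flow_allowed(nodes: Dict[str, Set[str]], source: str, target: str,
                 allowed: Set[Tuple[str, str]]) -> bool:
    """Tell whether a flow is permitted by at least one pair of labels.

    Assumption: a flow from a node to itself is rejected.
    """
    if source == target:
        return False
    return any((s, t) in allowed for s in nodes[source] for t in nodes[target])


# The fictitious scenario; labels other than those of User 1 and Storage
# are one possible reading of the text.
SCENARIO_NODES: Dict[str, Set[str]] = {
    "User 1": {"DO", "DV"},
    "User 2": {"DV"},
    "GUI 1": {"DP", "DC"},
    "GUI 2": {"DC"},
    "Application 1": {"BDA", "DP"},
    "Application 2": {"BDA", "DP"},
    "Storage": {"DP", "DC", "BDF", "DPF"},
    "Analyser": {"BDF", "PF"},
}
SCENARIO_DATA_FLOWS: List[Tuple[str, str]] = [
    ("GUI 1", "Application 1"),
    ("Application 1", "Storage"),
    ("Application 1", "Application 2"),
    ("Storage", "Analyser"),
    ("Analyser", "Application 2"),
    ("Application 2", "GUI 1"),
    ("Application 2", "GUI 2"),
]

if __name__ == "__main__":
    print(check_nodes(SCENARIO_NODES))
    for src, dst in SCENARIO_DATA_FLOWS:
        print(src, "->", dst, flow_allowed(SCENARIO_NODES, src, dst, DATA_FLOWS))
```

Because a node may carry several labels, a flow is accepted as soon as one pair of source and target labels is allowed. This is why Application 1, which is both a big data application and a data provider, may send data to Application 2.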

The Meta-Object Facility (MOF) is a standard of the Object Management Group (OMG) suited to defining modelling languages rigorously. The well-known Unified Modeling Language (UML) itself is specified with MOF. The case of UML is remarkable because UML has a profile mechanism that makes it possible for designers to specialise the meaning of UML diagrams for a particular domain. In practice, language inventors have two options: either they begin from scratch and work directly with MOF, or they adapt UML to a subject of interest. Eclipse supports both methods. The Eclipse Modeling Framework (EMF) is a set of plugins for the Eclipse IDE that incorporates an implementation of MOF called Ecore. And the Eclipse project Papyrus supplies an implementation of UML based on Ecore.

In the previous section, we gave five advantages of model-driven DevOps. In the context of big data, there is more to say. Big data frameworks (infrastructure frameworks, data platform frameworks, and processing frameworks) are generally difficult to learn, configure, and manage, mainly because they involve scalable clusters with a potentially unbounded number of computational resources. With model-driven DevOps, things become easier. Operators just declare in their models which technologies they want in the production environment, along with performance requirements. They let model-to-production tools install and configure the clusters automatically in a manner that satisfies the quality of service (QoS) requirements. The burden is partially borne by the tools.

Methodology Proposal

Now that we have clarified what model-driven DevOps is and how convenient it is for big data, it is time to detail our software development methodology. The actors concerned by this methodology are big data application providers (developers and operators) and some of the system orchestrators (architects and project managers), since they are the only ones that actually construct, supervise, or monitor the construction of big data applications. In a nutshell, by following our approach, these actors can easily experiment with big data systems' architectures thanks to a modelling language with which they give shape to their ideas. An architecture model, with all its big data technologies, can be automatically and concretely reproduced on a cloud infrastructure by model-to-production tools. The actors can thus compare what they envisioned with what they got. Every model change made to improve performance or quality can be automatically propagated onto the cloud infrastructure too (continuous deployment). We divide the methodology into three scenarios that illustrate alternative ways to take advantage of model-driven DevOps for big data: architecture modelling, architecture analysis, and architecture experimentation.

Architecture Modelling

Nowadays, modelling has become standard practice in software engineering. Whether it be for architectural decisions, documentation, code generation, or reverse engineering, today it is common for software specialists to draw models, and many industrial notations are at their disposal. Modelling is the cornerstone of our methodology. The modelling language of the DICE Consortium follows the philosophy of OMG’s model-driven architecture (MDA) and includes three layers of abstraction: a platform-independent layer, a technology-specific layer, and a deployment-specific layer. With platform-independent concepts, architects describe a big data system in terms of computation node, source node, storage node, etc., without explicitly stating the underlying technologies. A DICE platform-independent model (DPIM) resembles a model done with NIST’s taxonomy, except that the naming of concepts differs. Here is a partial conceptual correspondence:

DPIM concept                      | Corresponding concepts in NIST's taxonomy
----------------------------------+----------------------------------------------
Data-intensive application (DIA)  | Big data system
Computation node                  | Processing framework and big data application
Source node                       | Data provider
Storage node                      | Data platform framework

Contrary to NIST’s taxonomy, DPIM concepts are mutually exclusive. (DPIM source nodes do not provide persistence features; therefore, a storage node cannot be a source node.) Moreover, there is no DPIM concept for infrastructure frameworks because infrastructure is a low-level issue tackled at the deployment-specific layer. The full list of DPIM concepts will be explained in a subsequent chapter.

The DPIM model corresponds to the OMG MDA PIM layer and describes the behaviour of the application as a directed acyclic graph that expresses the dependencies between computations and data[5].
—DICE Consortium
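The directed acyclic graph mentioned in this quotation can be illustrated with a short Python sketch. It uses the standard-library `graphlib` module (Python 3.9 or later) to check that a set of dependencies is acyclic and to derive an order in which computations could run. The node names and the helper functions `execution_order` and `is_acyclic` are hypothetical and serve only as an example; they are not part of the DICE tooling.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the nodes in an order that respects the dependencies.

    `dependencies` maps each node to the set of nodes it depends on.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


def is_acyclic(dependencies: Dict[str, Set[str]]) -> bool:
    """Tell whether the dependency graph is a directed acyclic graph."""
    try:
        execution_order(dependencies)
    except CycleError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical platform-independent description: a source feeds a
    # cleaning computation, whose output is aggregated and then stored.
    pipeline = {
        "clean": {"sensor_feed"},
        "aggregate": {"clean"},
        "store": {"aggregate"},
    }
    print(is_acyclic(pipeline))
    print(execution_order(pipeline))
```

A cyclic description, such as two computations that each depend on the other's output, is rejected by `is_acyclic`, which is the property the DPIM layer relies on according to the quotation.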

A DICE technology-specific model (DTSM) refines a DPIM by adopting a technology for each computation node, source node, storage node, etc. For example, an architect may choose Apache Cassandra or HDFS for a storage node. Again, DTSMs say nothing about the infrastructures in which these technologies will be deployed. It is the role of DICE deployment-specific models (DDSMs) to specify them. DDSMs refine DTSMs with infrastructural choices. Here, the word infrastructure should be read as infrastructure as a (programmable) service (IaaS). It actually refers to the computing power of an infrastructure accessed over the Internet. Although users do not see the underlying hardware, an application programming interface enables them to programmatically create virtual machines or containers, select operating systems, and run software. A graphical user interface may allow them to carry out the same job interactively. Model-to-production tools rely on DDSMs and on the infrastructures' APIs to function.
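The successive refinements from DPIM to DTSM to DDSM can be pictured with a small Python sketch in which each layer wraps the previous one and adds only the decisions that belong to it. The class names, fields, and the provider name `example-cloud` are invented for illustration; the actual DICE models are UML diagrams annotated with profiles, not Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformIndependentNode:
    """DPIM level: what the node is, with no technology named."""
    name: str
    kind: str  # e.g. "computation", "source", or "storage"


@dataclass(frozen=True)
class TechnologySpecificNode:
    """DTSM level: the same node plus a technology choice."""
    base: PlatformIndependentNode
    technology: str


@dataclass(frozen=True)
class DeploymentSpecificNode:
    """DDSM level: the same node plus infrastructural choices."""
    base: TechnologySpecificNode
    provider: str   # hypothetical IaaS provider name
    instances: int  # number of virtual machines or containers


def choose_technology(node: PlatformIndependentNode,
                      technology: str) -> TechnologySpecificNode:
    """Refine a DPIM-level node into a DTSM-level node."""
    return TechnologySpecificNode(base=node, technology=technology)


def choose_infrastructure(node: TechnologySpecificNode, provider: str,
                          instances: int) -> DeploymentSpecificNode:
    """Refine a DTSM-level node into a DDSM-level node."""
    if instances < 1:
        raise ValueError("at least one instance is required")
    return DeploymentSpecificNode(base=node, provider=provider, instances=instances)


if __name__ == "__main__":
    storage = PlatformIndependentNode(name="Storage", kind="storage")
    cassandra = choose_technology(storage, "Apache Cassandra")
    deployed = choose_infrastructure(cassandra, provider="example-cloud", instances=3)
    print(deployed)
```

Keeping each layer immutable and nested in the next one mirrors the idea that a refinement adds decisions without altering those made at a more abstract level.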

Here are the steps of this scenario:

  1. Draw a UML object diagram profiled with DPIM concepts to describe the components of a big data system.
  2. Draw a UML activity diagram profiled with DPIM concepts to describe the actions of these components.
  3. Refine the two previous diagrams with DTSM concepts.
  4. Draw a UML deployment diagram profiled with DDSM concepts to describe how the technologies will be deployed into infrastructures.
  5. Generate scripts that use the infrastructures' APIs, and run these scripts to obtain the production environment.

Architecture Analysis

Here are the steps of this scenario:

  1. Draw a UML object diagram profiled with DPIM concepts to describe the components of a big data system.
  2. Draw a UML activity diagram profiled with DPIM concepts to describe the actions of these components.
  3. Draw a deployment diagram profiled with DPIM concepts to describe the deployment of the components into a simulated environment.
  4. Run a simulation.
  5. Refine the three previous diagrams with DTSM concepts.
  6. Run a simulation or a verification.
  7. Run an optimisation to generate an optimised UML deployment diagram profiled with DDSM concepts.
  8. Generate scripts that use the infrastructures' APIs, and run these scripts to obtain the production environment.
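To give a feel for the optimisation step in this scenario, here is a deliberately simplified Python sketch that computes the smallest number of machines able to meet a throughput requirement. It assumes that throughput grows linearly with the number of machines, scaled by a fixed efficiency factor. This model, the function name `min_machines`, and its parameters are assumptions made for the example; an actual optimisation tool would rely on much richer performance models.

```python
import math


def min_machines(required_throughput: float,
                 per_machine_throughput: float,
                 efficiency: float = 1.0) -> int:
    """Smallest cluster size that meets the required throughput.

    Assumes idealised horizontal scaling: n machines deliver
    n * per_machine_throughput * efficiency, with 0 < efficiency <= 1.
    """
    if required_throughput < 0:
        raise ValueError("required_throughput must not be negative")
    if per_machine_throughput <= 0:
        raise ValueError("per_machine_throughput must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the interval (0, 1]")
    return math.ceil(required_throughput / (per_machine_throughput * efficiency))


if __name__ == "__main__":
    # Hypothetical requirement: 1000 records per second, with machines
    # that each process 120 records per second.
    print(min_machines(1000, 120))
    print(min_machines(1000, 120, efficiency=0.8))
```

The result of such a computation is the kind of value that would end up in a deployment-specific model, for instance as the number of virtual machines allocated to a processing framework.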

Architecture Experimentation

Monitor and test the quality of the big data system in production, and refactor the models accordingly.

The DICE Methodology in the IDE

The DICE IDE is based on Eclipse, which is the de facto standard for the creation of software engineering models based on the MDE approach. DICE customizes the Eclipse IDE with suitable plugins that integrate the execution of the different DICE tools in order to minimize learning curves and simplify adoption. Not all tools are integrated in the same way. Several integration patterns, focusing on the Eclipse plugin architecture, have been defined. They allow application features to be implemented and incorporated very quickly. DICE tools are accessible through the DICE Tools menu.

The DICE IDE offers the ability to specify DIAs through UML models. From these models, the toolchain guides the developer through the different phases of quality analysis, formal verification being one of them.

The IDE acts as the front-end of the methodology and plays a pivotal role in integrating the DICE tools of the framework. The DICE IDE can be used for any of the scenarios described in the methodology. The IDE is an integrated development environment for model-driven engineering (MDE) in which a designer can create models at different levels (DPIM, DTSM and DDSM) to describe data-intensive applications and their underpinning technology stack.

The DICE IDE initially offers the ability to specify the data-intensive application through UML models stereotyped with DICE profiles. From these models, the toolchain guides the developer through the different phases of quality analysis (e.g., simulation and/or formal verification), deployment, testing, and acquisition of feedback data through monitoring data collection and subsequent data warehousing. Based on runtime data, an iterative quality enhancement toolchain detects quality incidents and design anti-patterns. This feedback is then used to guide the developer through cycles of iterative quality enhancement.


  1. ISO/IEC (2015). "Big Data. Preliminary Report 2014". ISO. 
  2. NIST Big Data Public Working Group (2015-09). "NIST Big Data Interoperability Framework: Volume 1, Definitions. Final Version 1". NIST. doi:10.6028/NIST.SP.1500-1. 
  3. NIST Big Data Public Working Group (2015-09). "NIST Big Data Interoperability Framework: Volume 2, Big Data Taxonomies. Final Version 1". NIST. doi:10.6028/NIST.SP.1500-2.
  4. NIST Big Data Public Working Group (2015-09). "NIST Big Data Interoperability Framework: Volume 6, Reference Architecture. Final Version 1". NIST. doi:10.6028/NIST.SP.1500-6. 
  5. DICE Consortium (2015). "DICE: Quality-Driven Development of Data-Intensive Cloud Applications".