Practical DevOps for Big Data/Delivery

In this book so far, we have been focusing on design work, which most of the time happens in our IDE. All the analysis took place in the development space and without running a single line of our code. But the purpose of the design and development is to create an application, which can run in some data centre or on an infrastructure, so that we can use the application's functionality. In this chapter, we will therefore look at the methods and tools that will install, start and test our applications.

In a typical model-driven fashion, a Deployment Model provides an exact representation of our application. Using the Deployment-Specific Modelling approach is a good start in this direction. A Deployment Model contains all the pieces that make up an application, so we need tools that can read this model and take the steps to set up the application. The expected outcome is that the resources described in the model are instantiated and host the services that the model associates with them.

Before we get to a tool capable of deploying the application, there is usually one more step: transforming the model into a format that the target tool understands. Instead of the UML representation, tools usually prefer a more compact and targeted language, commonly called a Domain-Specific Language (DSL),[1] which is used to prepare blueprints of Data-Intensive Applications (DIAs). We will use OASIS TOSCA,[2] a recently emerged industry standard for describing portable and operationally manageable cloud applications. One of its benefits is its flexibility: the available nomenclature can be extended to fit a specific domain such as Big Data.

The tool we will use for deploying our DIA needs to offer simple access to deployment management, and it needs to provide repeatable and reliable DIA deployment operations.


Motivations

A DIA is composed of a number of support services and technologies, on top of which we normally run some custom user code. The complexity of a DIA's setup depends on the needs of the application itself. But regardless of whether we have a simple application that only stores data in a NoSQL store, or a multi-tiered stack of building blocks, someone or something needs to install and start all of the components. In the DevOps setting, we use a cloud orchestration tool[3] to automate this process.

Individual open source services and tools come with extensive instructions on how to set up and configure them. Doing this manually requires time and skill, which are often missing in teams that are just starting out in Data-Intensive Application development. The need to learn to deploy a service is mostly orthogonal to the need to learn to make the most of the service itself. Therefore any help that speeds up or entirely replaces the deployment steps can save considerable time and cost.

The need to automate the deployment process in DevOps has brought about the concept of Infrastructure as Code,[4] which lets us store blueprints for whole platforms in a version control repository such as Git. A great benefit of this is that we have a single source of truth for what the DIA should be like when deployed.

Blueprints written in TOSCA YAML can contain any number of details of the individual components that comprise the DIA. However, as in traditional application source code, blueprints should not repeat concepts that can be represented at a lower level of abstraction. Such concepts can be packaged into a TOSCA library, which each blueprint imports as a plug-in.

Existing solutions

Automation of configuration and deployment is not new: various forms of batch scripts have existed for decades. This is the traditional, imperative approach to configuration automation, where a script describes a series of instructions that need to run in order. A more recent but already widespread alternative is the declarative approach, where we focus on describing the desired state of the system.
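The difference between the two approaches can be illustrated with a minimal sketch. It does not reproduce any particular tool: the service names and the reconcile function are hypothetical. The declarative style only states the desired state, and a generic routine works out which actions are needed, so running it a second time does nothing.

```python
def reconcile(current, desired):
    """Return the actions that bring `current` to `desired`, and the new state."""
    actions = []
    for service in sorted(desired):
        if current.get(service) != desired[service]:
            actions.append(f"ensure {service} is {desired[service]}")
    for service in sorted(set(current) - set(desired)):
        actions.append(f"remove {service}")
    return actions, dict(desired)


# Desired state of the system (hypothetical service names).
DESIRED = {"mongodb": "running", "webserver": "running"}

# First run: the system has drifted from the desired state.
first_actions, state = reconcile(
    {"webserver": "stopped", "old_cache": "running"}, DESIRED
)
# Second run: nothing left to do, because the state already matches.
second_actions, state = reconcile(state, DESIRED)

print(first_actions)
print(second_actions)
```

An imperative script would instead list the start and remove commands explicitly, and it would have to guard each of them itself to be safe to run repeatedly.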

The best approach to deploying complex systems is to break the process down such that different tools focus on different levels of the system. Starting from the level of the infrastructure, i.e., the computation, network and storage resources, we can pick from a number of configuration management[5] tools: Chef,[6] Puppet,[7] Ansible,[8] and others. These tools use definitions with names such as cookbooks, playbooks and similar as the source code for installation and configuration processes. A user or an orchestrator would then execute units of these definitions (e.g., recipes of Chef's cookbooks) to transition between states of the system.

To manage cloud applications such as Big Data applications, we need a higher-level approach in the form of a cloud orchestrator,[3] although some configuration managers such as Ansible or Chef already have some orchestration capabilities. An orchestrator ensures that interdependent components are set up in the proper order and configured so that they can discover each other. For instance, a web application needs a web server to be configured first, and a database set up and running on another node. Representative orchestration tools include Ubuntu Juju,[9] Apache Brooklyn[10] and Cloudify.[11]
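The ordering problem that an orchestrator solves can be sketched as a topological sort of the dependency graph. The node names below are hypothetical and follow the web application example. Real orchestrators do considerably more, such as passing addresses between components, but the ordering principle is the same.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its value set (hypothetical node names).
DEPENDS_ON = {
    "web_application": {"web_server", "database"},
    "web_server": {"web_vm"},
    "database": {"db_vm"},
    "web_vm": set(),
    "db_vm": set(),
}


def deployment_order(dependencies):
    """Return the nodes in an order where every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


order = deployment_order(DEPENDS_ON)
print(order)
```

Several valid orders exist, since the two virtual machines do not depend on each other. The only guarantee is that each node appears after everything it depends on.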

The tools are only one part of delivery automation. The other crucial part consists of the recipes, playbooks, plug-ins and blueprint type definitions. Each tool has a large repository or marketplace of configuration and orchestration code, contributed by the vendor or by the community. The quality of the material available in this way varies, and the compatibility of different recipes is not always guaranteed. Still, this is a valuable source for initial experiments with technologies and for prototyping, as long as we have the time to study each cookbook or playbook.

How the tool works

TOSCA technology library used in blueprints

The DICE delivery process aims to make the interaction with the orchestration as simple as possible. We start off with an application blueprint in the TOSCA YAML format. We can write this document ourselves, but it is even better to follow the DICE methodology and use DICER to transform a Deployment-Specific Model into an equivalent YAML blueprint. In either case, the blueprint will look similar to this example:

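tosca_definitions_version: cloudify_dsl_1_3

# The imports section, which loads the DICE TOSCA technology library,
# is not shown here.

node_templates:

  mongo_fw:
    type: dice.firewall_rules.mongo.Common

  standalone_vm:
    type: dice.hosts.ubuntu.Medium
    relationships:
      - type: dice.relationships.ProtectedBy
        target: mongo_fw

  standalone_mongo:
    type: dice.components.mongo.Server
    properties:
      monitoring:  # assumed key name; not preserved in the source
        enabled: true
    relationships:
      - type: dice.relationships.ContainedIn
        target: standalone_vm

  accounts_db:  # assumed node name; not preserved in the source
    type: dice.components.mongo.DB
    properties:
      name: accounts
    relationships:
      - type: dice.relationships.ContainedIn
        target: standalone_mongo

outputs:
  mongo_access:  # assumed output name; not preserved in the source
    description: Mongo client connection details
    value:
      concat:
        - "mongo --host "
        - get_attribute: [ standalone_mongo, fqdn ]
        - :27017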
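The output at the end of the blueprint is assembled at deployment time from three parts. A minimal sketch of how such a value could be evaluated follows, assuming the parts are joined by a Cloudify-style concat function and that get_attribute looks up a runtime attribute of a node. The evaluate function and the host name are hypothetical and only illustrate the idea; they are not the orchestrator's implementation.

```python
def evaluate(value, runtime_attributes):
    """Resolve a small subset of Cloudify-style intrinsic functions."""
    if isinstance(value, dict) and len(value) == 1:
        (name, args), = value.items()
        if name == "get_attribute":
            node, attribute = args
            return runtime_attributes[node][attribute]
        if name == "concat":
            return "".join(str(evaluate(part, runtime_attributes)) for part in args)
    # Plain values (strings, numbers) are returned unchanged.
    return value


output_value = {
    "concat": [
        "mongo --host ",
        {"get_attribute": ["standalone_mongo", "fqdn"]},
        ":27017",
    ]
}
# Attributes become known only once the node is deployed (hypothetical value).
attributes = {"standalone_mongo": {"fqdn": "mongo.example.com"}}

command = evaluate(output_value, attributes)
print(command)  # mongo --host mongo.example.com:27017
```

The blueprint author therefore never needs to know the address in advance: the orchestrator fills it in after the virtual machine has been created.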

This blueprint is composed of node templates, the types of which are based on the definitions in the DICE Technology Library, for example:

  • dice.hosts.ubuntu.Medium: represents a compute host of a medium size.
  • dice.firewall_rules.mongo.Common: defines a network security group or firewall that exposes only the ports MongoDB needs to communicate with peer engine services and with any clients.
  • dice.components.mongo.Server: a component containing all the MongoDB-related modules that comprise a stand-alone instance of the MongoDB engine.
  • dice.components.mongo.DB: represents a database in a MongoDB engine.
  • dice.components.mongo.User: a user in a MongoDB cluster.

Many of the node templates need to be in a relationship with another node template. We do this using the following relationship types:

  • dice.relationships.ProtectedBy: the source of this relationship is a compute host, and the target is a dice.firewall_rules node template defining the security group or firewall.
  • dice.relationships.ContainedIn: may connect a service-related node template to a compute host, or a database to a database engine such as MongoDB.
  • dice.relationships.mongo.HasRightsToUse: grants the source user node template permission to use the target database node template.

More blueprint examples are available.
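Because every relationship names its target node template, a blueprint can be checked for dangling references before it is submitted. The following is a minimal sketch of such a check, with the node templates held in a plain dictionary. The dangling_targets helper is hypothetical and not part of the DICE tools.

```python
NODE_TEMPLATES = {
    "mongo_fw": {"type": "dice.firewall_rules.mongo.Common"},
    "standalone_vm": {
        "type": "dice.hosts.ubuntu.Medium",
        "relationships": [
            {"type": "dice.relationships.ProtectedBy", "target": "mongo_fw"},
        ],
    },
    "standalone_mongo": {
        "type": "dice.components.mongo.Server",
        "relationships": [
            {"type": "dice.relationships.ContainedIn", "target": "standalone_vm"},
        ],
    },
}


def dangling_targets(templates):
    """Return (source, target) pairs whose target is not a defined node template."""
    return [
        (name, relationship["target"])
        for name, template in templates.items()
        for relationship in template.get("relationships", [])
        if relationship["target"] not in templates
    ]


print(dangling_targets(NODE_TEMPLATES))  # []

# A typo in a target name is caught before deployment.
broken = dict(NODE_TEMPLATES)
broken["standalone_mongo"] = {
    "type": "dice.components.mongo.Server",
    "relationships": [
        {"type": "dice.relationships.ContainedIn", "target": "standalone_wm"},
    ],
}
print(dangling_targets(broken))  # [('standalone_mongo', 'standalone_wm')]
```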

Deployment service concepts

A deployment is an instantiation of a blueprint in the target infrastructure. The cloud orchestrator can create one or more deployments, either concurrent or serial (i.e., the previous one is destroyed when a new one is created). To make the management of deployments simpler, the DICE Deployment Service operates with the concept of virtual deployment containers. A blueprint submitted to a particular virtual deployment container results in a deployment associated with that container. A new blueprint submitted to the same container results in a new deployment that replaces the previous one.

Virtual Deployment Container concepts of the DICE Deployment Service

This is illustrated in the figure above, where Blueprint B.1 has previously been deployed in Virtual Deployment Container 2. The users then improved the application, resulting in Blueprint B.2. When this blueprint is submitted to Virtual Deployment Container 2, the previous deployment is removed and a new one is installed. This change leaves the deployment in Virtual Deployment Container 1 intact. Users can create as many containers as needed, for instance for personal experimentation with a new technology, for specific branches in Continuous Integration, or for manual acceptance testing of new releases.
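The replacement semantics of a virtual deployment container can be sketched in a few lines. This is an illustrative model only: the Container class and its history log are hypothetical, "Blueprint A" is a made-up occupant of the first container, and the real service performs the uninstall and install through the orchestrator.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Container:
    """A virtual deployment container holding at most one deployment."""
    name: str
    deployment: Optional[str] = None  # blueprint currently deployed, if any
    history: list = field(default_factory=list)

    def submit(self, blueprint: str) -> None:
        # A new blueprint replaces whatever was deployed here before.
        if self.deployment is not None:
            self.history.append(f"uninstall {self.deployment}")
        self.history.append(f"install {blueprint}")
        self.deployment = blueprint


container_1 = Container("Virtual Deployment Container 1")
container_2 = Container("Virtual Deployment Container 2")

container_1.submit("Blueprint A")
container_2.submit("Blueprint B.1")
container_2.submit("Blueprint B.2")  # replaces B.1; container 1 is untouched

print(container_2.history)
print(container_1.deployment)
```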

Another feature of the DICE Deployment Service and the DICE TOSCA technology library is that together they enable deployment to any of the supported cloud infrastructures. In the blueprint, the users can specify the type of platform to deploy into (e.g., OpenStack or Amazon EC2), and this can differ between components of the same blueprint. For the nodes that do not explicitly specify the target platform, the DICE Deployment Service uses its default target platform. In this case, the same blueprint can be submitted to different cloud providers without any change. In the figure above, Blueprint B2 is submitted to two different cloud providers. As shown, any platform-specific input parameters are supplied by the DICE Deployment Service. It is up to the administrator of the data centre to set these to the proper values. The designers and developers, on the other hand, do not need to handle such specifics.
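A minimal sketch of this resolution step follows. It assumes a hypothetical "platform" key on a node template and hypothetical input names and values; the actual property names in the technology library and the actual inputs kept by the service may differ.

```python
# Administrator-maintained inputs, one set per platform
# (all keys and values are hypothetical).
PLATFORM_INPUTS = {
    "openstack": {"openstack_medium_flavor": "m1.medium"},
    "aws": {"aws_medium_instance_type": "t2.medium"},
}


def prepare(node_templates, default_platform):
    """Resolve each node's platform and collect the inputs those platforms need."""
    platforms = {
        name: template.get("platform", default_platform)
        for name, template in node_templates.items()
    }
    inputs = {}
    for platform in sorted(set(platforms.values())):
        inputs.update(PLATFORM_INPUTS[platform])
    return platforms, inputs


blueprint_nodes = {
    # No platform given: the service default applies.
    "standalone_vm": {"type": "dice.hosts.ubuntu.Medium"},
    # Explicit platform (hypothetical node) overrides the default.
    "archive_vm": {"type": "dice.hosts.ubuntu.Medium", "platform": "aws"},
}

platforms, inputs = prepare(blueprint_nodes, default_platform="openstack")
print(platforms)
print(inputs)
```

Submitting the same nodes to a service whose default is a different platform changes only the resolved values, not the blueprint.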

Deployment diagram depicting components of the DICE Deployment Service

Behind the scenes, the actual blueprint deployment gets submitted to a RESTful service, which augments the blueprint with any platform-specific details. The actual blueprint orchestration is then performed by Cloudify,[11] which takes care of pulling the relevant elements of the technology library and Chef cookbooks before executing them in the deployment workflow. The figure above illustrates the architecture of the service that enables this workflow.

Open Challenges

The combination of the presented deployment service and the TOSCA technology library considerably speeds up the preparation and setup of applications that are composed of the supported technologies. However, many data-intensive applications will also include first-party and third-party components other than those supported in the library.

The TOSCA technology library provides a shortcut for certain cases of custom elements by providing a node type for custom (bash or Python) scripts. But these are suitable only for really simple customizations. Because they are so simple, using them does not provide the full functionality of TOSCA blueprints. As a result, a larger part of such blueprints becomes imperative than would otherwise be necessary. A cleaner way to address such issues is to apply extensions and changes at the library level, but this requires a higher degree of skill and knowledge on the user's part.

The main strength of the DICE delivery toolset is that it provides deployment of open source solutions as a collection of easy-to-use building blocks. But at the same time this represents a weakness, because the definitions in the library are only as up to date as their providers keep them. Since these providers are not the original vendors of the services, the adopters face a risk of lagging behind the development of the mainstream Big Data services.

Application domain: known uses

The DICE Deployment Service in combination with the DICE TOSCA Technology Library may be used in a wide variety of workflows. It also readily caters for a wide variety of Big Data applications. Here are a few of the suggested uses.

Spark applications

The DICE TOSCA technology library contains building blocks that support Apache Spark,[12] Apache Storm,[13] Hadoop[14] File System, Hadoop Yarn, Kafka,[15] Apache Zookeeper,[16] Apache Cassandra[17] and MongoDB.[18] This means that any application that requires a subset of these technologies can be expressed in a TOSCA blueprint and deployed automatically.

As an illustrative example, a Storm application blueprint shows how all the necessary ingredients can be represented together:

  • The goal is to deploy and run a .jar containing the compiled Java code of a WordCount topology. This is encapsulated in the wordcount node template.
  • The platform to execute this program consists of the Storm nodes: a Storm Nimbus node (represented by the nimbus node template) and the Storm Workers (the storm node templates).
  • For Storm to work, we also need Zookeeper in the cluster. The blueprint defines it as the zookeeper node template.
  • All of these services need dedicated compute resources to run: zookeeper_vm, nimbus_vm and storm_vm.
  • In terms of the networking resources, the blueprint defines what ports need to be open for the services to work: zookeeper_security_group, nimbus_security_group, storm_security_group. For convenience, we also publish Zookeeper and Storm Nimbus on a public network address: zookeeper_floating_ip and nimbus_floating_ip.

Deployments with Continuous Integration

The DICE Deployment Service is a web service, but it also comes with a command line tool. This makes it simple to use from Continuous Integration engines. Here we show an example of a Jenkins[19] pipeline,[20] which builds and deploys a Storm application.

pipeline {
    agent any

    stages {

        stage('build-test') {
            steps {
                sh 'mvn clean assembly:assembly test'
            }
        }

        stage('deploy') {
            steps {
                sh '''
                    dice-deploy-cli deploy --config $DDS_CONFIG \
                        $STORM_APP_CONTAINER \
                        ...

                    dice-deploy-cli wait-deploy --config $DDS_CONFIG \
                        ...
                '''
            }
        }
    }
}


Conclusion

The main promise of automated deployment is to make the lives and jobs of operators easier. The approach is powerful enough that most tasks run on their own, without much intervention from system administrators. This has a tremendous positive impact on the speed of delivery of applications. We could almost talk about a NoOps approach, where the system operator only sets up the services and configures the necessary quotas in the resource pool of the infrastructure. From this point on, the developers or testers can use the delivery tools in a self-service manner.

Moreover, the ability to deploy the same application consistently and repeatably is an important factor in being able to integrate (re)deployment of the application into a larger workflow. In particular, it enables Continuous Integration, where an application gets redeployed to the testbed whenever the developers push a new update to the repository branch. The application can also be tested periodically, with installation and clean-up both being parts of the workflow. This opens up many possibilities for performing quality testing (as presented later in the book) and integration testing.
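A periodic test workflow of this kind has to clean up even when the tests fail, or stale deployments will pile up in the testbed. The sketch below shows one way to structure it. The three callables are hypothetical stand-ins for calls to the deployment service and the test suite.

```python
def run_test_cycle(deploy, run_tests, teardown):
    """Deploy, test, and always clean up; return the test result."""
    deploy()
    try:
        return run_tests()
    finally:
        # Runs whether the tests pass, fail, or raise.
        teardown()


log = []


def fake_deploy():
    log.append("deploy")


def failing_tests():
    log.append("test")
    raise RuntimeError("integration test failed")


def fake_teardown():
    log.append("teardown")


try:
    run_test_cycle(fake_deploy, failing_tests, fake_teardown)
except RuntimeError:
    log.append("reported failure")

print(log)  # ['deploy', 'test', 'teardown', 'reported failure']
```

If the deployment step itself fails, this sketch does not call the clean-up; whether a partial deployment needs tearing down depends on the orchestrator in use.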


References

  1. Domain Specific Language
  2. OASIS TOSCA
  3. Orchestration (computing)
  4. Infrastructure as a service manifest
  5. Configuration Management
  6. Chef
  7. Puppet
  8. Ansible
  9. Ubuntu Juju
  10. Apache Brooklyn
  11. Cloudify
  12. Apache Spark
  13. Apache Storm
  14. Hadoop
  15. Kafka
  16. Apache Zookeeper
  17. Apache Cassandra
  18. MongoDB
  19. Jenkins
  20. Jenkins Pipeline