Practical DevOps for Big Data/Monitoring

Introduction

Big Data technologies have become an ever more present topic in both academia and industrial world. These technologies enable businesses to extract valuable insight from their available data, hence more and more SMEs are showing increasing interest in using these types of technologies. Distributed frameworks for processing large amounts of data, such as Apache Hadoop^[1], Spark^[2], or Storm^[3] gained in popularity and applications developed on top of them are more and more prevalent. However, developing soft- ware that meets these high-quality standards expected for business-critical Cloud applications remains a challenge for SMEs. In this case model-driven development (MDD) paradigm and popular standards such as UML, MARTE, TOSCA^[4] hold strong promises to tackle this challenge. During development of Big Data applications it is important to monitor performance for each version of the application. Information obtained can be used by software architects and developers to track the evolution in time of the developed application. Monitoring is also useful in determining main factors that impact the quality of different application versions. Throughout the development stage, running applications tend to be more verbose in terms of logged information so that developers can get insights about the developed application. Due to verbosity of logs, data- intensive applications produce large amounts of monitoring data, which in turn need to be collected, pre-processed, stored and made available for high-level queries and visualization. It is clear that there is a need for a scalable, highly available and easy deployable platform for monitoring multiple Big Data frameworks. Which can collect resource-level metrics, such as CPU, memory, disk or network, together with framework level metrics collected from Apache HDFS, YARN, Spark and Storm.

Motivations

In all development environments and in particular in a DevOps one infrastructure and application performance monitoring is very important. It allows developers to check the current performance and to evaluate potential performance bottlenecks. Most monitoring tools and services focus on providing a discrete view of a particular application and/or platform. Meaning that if a DIA is based on more than one platform it is very likely that one monitoring solution will not support all of the platforms. DMon is designed in such a way that it is able to monitor the most common Big Data platforms as well as system/infrastructure metrics. This way during development only one monitoring solution is able to handle all monitoring needs.

Of course platform monitoring is not enough, most of the time developers require information about their application performance. DMon can be easily extended to collect metrics via JMX or even custom log formats. All of the resulting metrics can then be queried and visualize using DMon.

Existing solutions

Currently on the market there are a lot of tools and services geared towards monitoring different cloud deployed resources and services. Big Data services are no different, here are just a few solutions which are used:

Hadoop Performance Monitoring UI^[5] provides an Hadoop inbuilt solution for quickly finding performance bottlenecks and provide a visual representation of the configuration parameters which might be tuned for better performance. Basically it is a lightweight monitoring UI for Hadoop server. One of its main advantage is the availability in the Hadoop distribution and the ease of usage. On the other hand it proves to be fairly limited with regard to performance. For example, the time spent in gc by each of the tasks is fairly high.

SequenceIQ^[6] provides a solution for monitoring Hadoop clusters. The architecture proposed by SequenceIQ and used in order to do monitoring is based on Elasticsearch^[7], Kibana and Logstash.The architecture has as its main objective obtaining a clear separation between monitoring tools and the existing Hadoop deployment. For achieving this three Docker containers are used. In a nutshell the monitoring solution consist of client and server containers. The server container takes care of actual monitoring tools. In this particular deployment Kibana is used for visualization and Elasticsearch for consolidation of the monitoring metrics. Through the capabilities of Elasticsearch one can horizontally scale and cluster multiple monitoring components. The client container contains the actual deployment of the tools that have to be monitored. This instance contains Logstash, Hadoop and the collectd modules. Logstash connects to the Elasticsearch cluster as a client and stores the processed and transformed metrics data there. Basically this solution consists of a collection of tools that are used in order to monitor different metrics from different layers. Among the main advantages of this solution one notices the ease of adding and removing different compo- nents from the system. Another interesting aspect of this architecture is the ease with which one can extract different informations from tools.

Datastax^[8] provide a solution, OpsCenter , that can be integrated in order to monitor Cassandra^[9] installation. Using OpsCenter one can monitor different parameters of a Cassandra instance. Also it can handle many other parameters provided by the actual machines on which it runs. OpsCenter exposes an interactive web UI that allow administrators to add or remove nodes from the deployment. An interesting feature provided by the OpsCenter is the automatic load balancing. For integration of OpsCenter with other tools and services a developer API is provided.

How the tool works

The DICE Monitoring platforms (DMon)^[10] architecture is designed as a web service that enables the deployment and management of several subcomponents which in turn enable the monitoring of big data applications and frameworks. In contrast to other monitoring solutions, DMon aims to provide as much data as possible about the current status of the big data framework subcomponents. This intent brings a wide array of technical challenges, not present in more traditional monitoring solutions, as serving near real-time fine grained metrics entails a system that should exhibit a high availability, as well as easy scalability. Traditionally, web services have been built using a monolithic architecture where almost all components of a system ran in a single process (traditionally JVM). This type of architecture has some key advantages such as: deployment and networking are trivial while scaling such a system requires running several instances of the service behind a load-balancer instance. On the other hand, there are some severe limitations to this type of monolithic architecture, which would directly impacts the development of DMon. Firstly, changes to one component can have an unforeseen impact on seemingly unrelated areas of the application, thus adding new features or any new development can be potentially costly both in time and resources. Secondly, individual components cannot be deployed independently. This means that if one only needs a particular functionality of the service this cannot be decoupled and deployed separately thus hindering reusability. Lastly, even if components are designed to be reusable these tend to focus more on readability than on performance.

Considering these limitations of a monolithic architecture, we decided to use the so called microservice architecture for DMon, which is widely used in large Internet companies. This architecture replaces the monolithic service with a distributed system of lightweight services, which are by design independent and narrowly focused. These can be deployed, upgraded and scaled individually. As these microservices are loosely coupled, it better enables code reusability, while changes made to a single service should not require changes in a different service. Integration and communication should be done using HTTP (REST API) or RPC requests. We also want to group related behaviours into separate services. This will yield a high cohesion, which enables us to modify the overall system behaviour by only modifying or updating one service instead of many. DMon uses REST APIs for communication between different services with request payload encoded as JSON message. This makes the creation of synchronous or asynchronous messages much easier.

The figure shows the overall architecture of the DMon platform, which will be part of a lambda architecture together with the anomaly detection and trace checking tools. In order to create a viable lambda architecture we need to create 3 layer: speed, batch and serving layers. Elasticsearch will represent the serving layer responsible for loading batch views of the collected monitoring data and enabling other tools/layers to do random reads on it. The speed layer will be used to look at recent data and represent it in a query function. In the case of anomaly detection this will mean using unsupervised learning techniques or using pre-trained models from the batch layer. The batch layer needs to compute arbitrary functions on large sections of the dataset stored in Elasticsearch. This means running long running jobs to train predictive models that than can be instantiated on the speed layer. All trained models will then be stored inside the serving layer and accessed via DMon queries. The core components of the platform are Elasticsearch, for storing and indexing of collected data, and Logstash for gathering and processing logfile data. Kibana server provides a user-friendly graphical user interface. The main services composing DMon are the following: dmon-controller, dmon-agent, dmon-shipper, dmon-indexer, dmon-wui and dmon-mas. These services will be used to control both the core and node-level components.

Open Challenges

Big Data technologies are continuously evolving. Because of this any monitoring solution has to keep up with the ever changing metrics and metric exposing systems. DMon for example is capable of monitoring the most common platforms currently in use however new platforms such as Apache Flink are not yet supported.

Furthermore some basic data preprocessing such as averaging, windowing and filtering are already supported by DMon additional operations should could be required in future versions. This is also true for data exporting formats.

Application domains

DMon has been tested on all Big Data platforms currently supported in DICE. These include:

Apache Yarn (including HDFS)
Apache Spark (versions 1.6.x and 2.x.x)
Apache Storm
Cassandra
MongoDB

In addition to these platforms DMon is also able to collectd via collectd a wide array of system metrics. In short it supports all collectd supported metrics. During development application metrics are in some ways more important than the platform metrics, because of this DMon can be easily extended to support any log or metrics format required via Logstash filter plugins.

An important point to mention is that the metrics collected for a DIA that uses more than one of the supported platform can be easily aggregated and exported based on the timestamp they were received. For example if we have a DIA that uses Spark which in turn runs on top of Yarn and HDFS, DMon is able to show all the metrics for any given timeframe for the aforementioned platforms.

Conclusion

The DICE Monitoring Platform is a distributed, highly available system for monitoring Big Data technologies, as well as system metrics. Aligning DMon objectives to DICE visions, that is bringing together Model-Driven Development and DevOps to enable fast development of high-quality data intensive applications.

DMon features automation at numerous levels: deployment of software components the nodes of monitored cluster, easy management of the monitoring platform itself, or automatic creation of visualizations based on collected data. Thanks to close integration with DICE Deployment Service (based on Cloudify and Chef cookbooks), software engineers/architects only need to annotate appropriate nodes in the DDSM model or TOSCA blueprint as monitorable and the Deployment service will install and configure the agents on selected nodes, so that the moment the DIA is deployed on the cluster the runtime data will be flowing into the DMon platform, with absolutely no manual intervention from end-users.

Engineered using a micro-services architecture, the platform is easy to deploy, and operate, on heterogeneous distributed Cloud environments. We reported successful deployment on Flexiant Cloud Orchestrator and OpenStack using Vagrant scripts.

The work done on this tool has highlighted the need for a specialized monitoring solution in a DevOps environment. The usage of a lightweight yet high throughput distributed monitoring solution capable of collecting thousands of metrics across a wide array of Big Data services is ok paramount importance. This importance is even more evident when dealing with preliminary versions of applications that are in development. Monitoring solutions like DMon can provide an excellent overview of the current performance of an application version.

References

[1] ttp://hadoop.apache.org/

[2] ttps://spark.apache.org/

[3] ttp://storm.apache.org/

[4] ttps://www.oasis-open.org/committees/tosca/

[5] ttps://www.datadoghq.com/blog/monitor-hadoop-metrics/

[6] ttp://sequenceiq.com/

[7] ttps://www.elastic.co/

[8] ttps://www.datastax.com/

[9] ttp://cassandra.apache.org/

[10] ttps://github.com/dice-project/DICE-Monitoring

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]