Practical DevOps for Big Data/Technology-Specific Modelling

From Wikibooks, open books for an open world
< Practical DevOps for Big Data
Jump to: navigation, search

Introduction

DPIM-to-DTSM Transformation Example: Apache Hadoop MapReduce

When all essential architecture elements are in place, by means of architectural reasoning in the DPIM layer, it could be made available ad-hoc model transformations that parse DPIM models and produce equipollent DTSM models where the specified data processing needs are exploded (if possible) into a possible configuration using appropriate technologies (e.g., Spark for streaming or Hadoop for batch). At this layer it should be provided for architects and developers several key technological framework packages that can evaluate possible alternatives for Technological Mapping and Logical Implementation, that is, selecting the technological frameworks that map well with the problem at hand and implementing the needed processing logic for that framework. Once designers choose the appropriate technological alternative, DICE will provide model transformations that instantiate the alternative (if available) e.g., by instantiating pre-made, ad-hoc packages that contain: (a) framework elements needed to ``link" the Data-Intensive application logic (e.g., through inheritance); (b) framework elements that contain (optional) configuration details (c) framework elements that represent deployable entities and nodes (e.g., Master Nodes and Resource Managers for Hadoop Map-Reduce). Software Architects proceed by filling out any wanted configuration details to run the chosen frameworks, probably in collaboration with Infrastructure Engineers.

DTSM Modelling Explained: The Apache Storm Example

The Data-Intensive Technology Specific profile (DTSM) includes several technology-specific sub-profiles, including the DTSM::Storm Profile, the DTSM::Hadoop Profile, and the refinement of the DICE::DICE_Library. The Hadoop and Storm Profiles fully incorporate quality and reliability aspects. Other DTSM profiles, such as the Spark profile, are currently under further experimentation and will be finalized in the near future. In summary, the Apache Storm profile core elements, stereotypes, and tagged values are defined in the following Table.

Storm Concepts

Storm concepts that impact in performance
# Concept Meaning
1. Spout (task) Source of information
2. Emission rate Number of tuples per unit of time produced by a spout
3. Bolt (task) Data elaboration and production of results
4. Execution time Time required for producing an output by a bolt
5. Ratio Number of tuples required or produced
6. Asynchronous policy The bolt waits until it receives tuples from any of the incoming streams
7. Synchronous policy The bolt waits until it receives tuples from all the incoming streams
8. Parallelism Number of concurrent threads per task
9. Grouping Tuple propagation policy (shuffle/all/field/global)

Storm is a distributed real-time computation system for processing large volumes of high-velocity data. A Storm application is usually designed as a directed acyclic graph (DAG) whose nodes are the points where the information is generated or processed, and the edges define the connections for the transmission of data from one node to another. Two classes of nodes are considered in the topology. On the one hand, spouts are sources of information that inject streams of data into the topology at a certain emission rate. On the other hand, bolts consume input data and produce results which, in turn, are emitted towards other bolts of the topology.

A bolt represents a generic processing component that requires n tuples for producing m results. This asymmetry is captured by the ratio m/n. Spouts and bolts take a certain amount of time for processing a single tuple.

Besides, different synchronization policies shall be considered. A bolt receiving messages from two or more sources can select to either 1) progress if at least a tuple from any of the sources is available (asynchronously), or 2) wait for a message from all the sources (synchronously).

A Storm application is also configurable by the parallelism of the nodes and the stream grouping. The parallelism specifies the number of concurrent tasks executing the same type of node (spout or bolt). Usually, each task corresponds to one thread of execution. The stream grouping determines the way a message is propagated to and handled by the receiving nodes. By default, a message is broadcast to every successor of the current node. Once the message arrives to a bolt, it is redirected randomly to any of the multiple internal threads (shuffle), copied to all of them (all) or to a specific subset of threads according to some criteria (i.e., field, global, etc.).

As per its own definition, the DTSM Profile includes a list of stereotypes that addresses the main concepts of the Apache Storm technology. In particular, we stress those concepts that directly impact on the performance of the system. Consequently, these parameters are essential for the performance analysis of the Storm applications and are useful for the DICE Simulation, Verification and Optimization tools.

In addition, a Storm framework is highly configurable by various parameters that will influence the final performance of the application. These concepts are converted into stereotypes and tags, which are the extension mechanisms offered by UML. Therefore, we devised a new UML profile for Storm. In our case, we are extending UML with the Storm concepts. The stereotypes and annotations for Storm are based on MARTE, DAM, the DICE::DPIM and Core profiles. The DICE::DTSM::Storm profile provides genuine stereotypes.

Storm Profile[edit]

Spouts and bolts have independent stereotypes because they are conceptually different, but <<StormSpout>> and <<StormBolt>> inherit from MARTE::GQAM::GaStep stereotype via the DTSM::-Core::-Core-DAG-Node and CoreDAGSourceNode since they are computational steps. Moreover, they share the parallelism, or number of concurrent threads executing the task, which is specified by the tag parallelism.

On the other hand, the spouts add the tag avgEmitRate, which represents the emission rate at which the spout produces tuples. Finally, the bolts use the hostDemand tag from GaStep for defining the task execution time. The time needed to process a single tuple can also be expressed through the alpha tag in case that the temporal units are not specified. The minimum and the maximum time to recover from a failure is denoted by the minRebootTime and maxRebootTime tags.

The Storm concept of stream is captured by the stereotype <<StormStreamStep>>. This stereotype also inherits from MARTE::GQAM::GaStep stereotype, which enables to apply it to the control flow arcs of the UML activity diagram. This stereotype has two tags, grouping and numTuples that match the grouping and ratio concepts in Storm, respectively. The type of grouping is StreamPolicy, an enumeration type of the package Basic_DICE_Types that has the values all, shuffle, field, global for indicating the message passing policy. The ratio of m/n of a bolt can be expressed either through the attribute sigma in the bolt stereotype, or by specifying the incoming and outcoming numTuples of a bolt via the stream stereotype.

Finally, the <<StormScenarioTopology>> stereotype is introduced for defining contextual execution information of a Storm application. That is, the reliability of the application or the buffer size of each task for exchanging messages. All these stereotypes are summarized in the following picture, which describes the DTSM::Storm profile for UML-based environments.

Description of DTSM::Storm profile

As previously explained in Introduction to Modelling, the DTSM profiles, and particularly the DTSM::Storm profile, rely on the standard MARTE and DAM profiles. This is because DAM is a profile specialized in dependability and reliability analysis, and MARTE offers the GQAM sub-profile, a complete framework for quantitative analysis. Therefore, they matches perfectly to our purposes: the quality assessment of data intensive applications. Moreover, MARTE offers the NFPs and VSL sub-profiles. The NFP sub-profile aims to describe the non-functional properties of a system, performance in our case. The latter, the VSL sub-profile, provides a concrete textual language for specifying the values of metrics, constraints, properties, and parameters related to performance. VSL expressions are used in Storm-profiled models with two main goals: (i) to specify the input parameters of the model and (ii) to specify the performance metric(s) that will be computed for the model (i.e., the output results). An example of VSL expression for a host demand tagged value of type NFP_Duration is:

expr=$b_1 (1), unit=ms (2), statQ=mean (3), source=est (4)

This expression specifies that bolt_1 demands $b_1 (1) milliseconds (2) of processing time, whose mean value (3) will be obtained from an estimation in the real system (4). $b_1 is a variable that can be set with concrete values during the analysis of the model. Another VSL interesting expression is the definition of the performance metric to be calculated, the utilization in the image of next example:

expr=$use (1), statQ=mean (2), source=calc (3)

This expression specifies that we want to calculate (3) the utilization, as a percentage of time, of the whole system or a specific resource, whose mean value (2) will be assigned to variable $use (1). Such value is obtained by the evaluation of a performance model associated to this Storm-profiled UML diagram. The rest of picture corresponds to a UML activity diagram that represents the Storm DAG. Every UML element is stereotyped with a DTSM::Storm stereotype matching one of the concepts presented in the previous table. All the configuration parameters of the profiled elements ($_) are variables.

Activity Diagram of Storm example

Nodes in the UML activity diagram are grouped by partitions (e.g., Partition1 and Partition2). Each partition is mapped to a computational resource in the UML deployment diagram following the scheduling policy defined for the topology. Next picture shows the deployment diagram, which complements the previous activity diagram. Each computational resource is stereotyped as GaExecHost and defines its resource multiplicity, i.e., number of cores. The deployment also allows one to know which messages exchanged by tasks can introduce network delays, i.e., tuples exchanged between cores in different physical machines, which is of importance for the eventual performance model. Therefore, we use the GaCommHost stereotype. Both stereotypes are inherited from MARTE GQAM.

Deployment Diagram of Storm example