Practical DevOps for Big Data/News and Media

From Wikibooks, open books for an open world
< Practical DevOps for Big Data
Jump to: navigation, search

Use Case Description[edit]

The news and media domain is a highly demanding sector in terms of handling large data streams coming from the social media. ATC SA, as one of the leading brands around the world in ICT application for the news and media domain, has developed NewsAssset – a commercial product positioned in the news and media domain. NewsAsset is a commercial product positioned in the news and media domain, branded by Athens Technology Center, an SME located in Greece. The NewsAsset suite constitutes an innovative management solution for handling large volumes of information offering a complete and secure electronic environment for storage, management and delivery of sensitive information in the news production environment. The platform proposes a distributed multi-tier architecture engine for managing data storage composed by media items such as text, images, reports, articles, videos, etc. Innovative software engineering practices, like Big Data technologies, Model-Driven Engineering (MDE) techniques, Cloud Computing processes and Service-Oriented methods have penetrated in the media domain. News agencies are already feeling the impact of the capabilities that these technologies offer (e.g. processing power, transparent distribution of information, sophisticated analytics, quick responses, etc.) facilitating the development of the next generation of products, applications and services. Especially considering interesting media and burst events which is out there in the digital world, these technologies can offer efficient processing and can provide an added value to journalists. At the same time, heterogeneous sources like social networks, sensor networks and several other initiatives connected to the Internet are continuously feeding the world of Internet with a variety of real data at a tremendous pace: media items describing burst events, traffic speed on roads; slipperiness index for roads receiving rain or snowfall; air pollution levels by location; etc. As more of those sources are entering the digital world, journalists will be able to access data from more and more of them, aiding not only in disaster coverage, but being used in all manner of news stories. As that trend plays out, when a disaster is happening somewhere in the world, it is the social networks like Twitter, Facebook, Instagram, etc. that people are using to watch the news ecosystem and try to learn what damage is where, and what conditions exist in real-time. Many eyewitnesses will snap a disaster photo and post it, explaining what’s going on. Subsequently, news agencies have realized that social-media content are becoming increasingly useful for disaster news coverage and can benefit from this future trend only if they adopt the aforementioned innovative technologies. Thus, the challenge for NewsAsset is to catch up with this evolution and provide services that can handle the developing new situation in the media industry. In addition, during DICE project we have identified a great business opportunity on the “Fake News” sector. More specific, we have used DICE tools in order to develop specific modules (part of the News Orchestrator application) for a new and innovative product which is being connected via an API to our NewsAsset suite or can be sold as a standalone solution. This new product in being called TruthNest (www.truthnest.com). TruthNest is a service ATC has implemented for assessing the trustworthiness of information found in Social Media. TruthNest users are able to capture streams from social networks, from which they are then able to analyse a single post and gain insights according to several dimensions of a verification process.

Use Case Scenarios[edit]

TruthNest is an online comprehensive tool that can promptly and accurately discover, analyse, and verify the credibility and truthfulness of reported events, news and multimedia content that emerge in social media in near real time. The end user has the ability to verify the credibility of a single post within seconds by activating, with a single click, a series of analysis events for achieving the desired result. More specific, TruthNest users will be able to bring in streams from social networks which will then be able to analyse and gain insights as to several dimensions of the verification process. In addition, they will also be able to create and monitor new “smart” streams from within TruthNest.

TruthNest Screenshot

An important module for TruthNest, which has been developed from scratch, is the “Trend Topic Detector”. The “Trend Topic Detector” provides to the end user a visualisation environment showing prominent trends that derive from social media, and, more specifically, from Twitter. What is critical to mention at this stage, is that only the “Trend Topic Detector” module has been developed by using DICE tools while the rest of TruthNest’s components have been developed by using conventional tools and methodologies as these have been used by ATC’s engineering and development team.

“Trend Topic Detector” Detailed Description[edit]

The Trend Topic Detector is centered around a clustering module. This creates clusters of tweets that relate to the search criteria submitted. The clusters are formulated by grouping the tweets found based on their common terms. The module tracks a percentage of the tweets posted onwards on Twitter, as the Twitter streaming API limitations impose. While it is restricted currently on Twitter stream API, it can take input from multiple social media (YouTube, Flickr and others) however it has not been implemented yet.

Topic Detector Screenshot

Architecture[edit]

The main pipeline of the clustering module is implemented as a Storm topology, where sequential bolts perform specific operations on the crawling procedure. These bolts include entity extraction (by using Stanford NER classifier) and minHash computation to estimate how similar the formulated sets are. The tweets terms are extracted and indexed in a running Solr instance. The most frequent terms are computed in a rolling window of 5 minutes and 20 clusters are formulated by default. A label (typically the text of a tweet) is assigned to each cluster. The results are stored in a Mongo database. The module is highly configurable and offers nearly real time computation of clusters.

Topic Detector Topology

How to use the clustering module:[edit]

  1. The user sets search terms through the user interface. The default language is English, other languages are not officially supported. Some trending topic cluster computation settings are also available (e.g. window computation time).
  2. The process is initialized and the user is informed on the number of tweets currently analysed. A diagram (line chart) is shown that is renewed every few seconds, showing the stream progress. After 10 minutes if tweets were found, the first trending topic clusters are presented. A set of 20 trending topic clusters is shown. The trending topic clusters are clickable and the user can view the items that consist them.
  3. The trending topic clusters are re-computed every five minutes and their content is updated. The user can view details (e.g. clusters evolution through time).
  4. The user can search the trending topic clusters and tag the most important ones. A favourite filter is also available.
  5. The user can start/stop a trending topic cluster, edit and delete it. He can save a trending topic cluster and restart it at a later time.
  6. There are limits on the number of trending topic clusters created and their activity period. Typically, the clusters are stopped after 24h.

DICE Tools[edit]

DICE Tools Benefit for our Use Case
DICE IDE The fact that most DICE tools is planned to be progressively integrated in a common place, the DICE IDE, makes the interaction between the various DICE tools really effective since most tools depend on the DICE Profile models. For example, the DICE Verification tool expects a DTSM model, properly annotated with some quality characteristics regarding the level of parallelization of the NewsAsset DIA, as input and the verification process can be triggered from within the DICE IDE. In this way, the DICE IDE not only covers the design time modelling requirements for our DIA but it effectively links the outcome artefacts to the tools related to other phases of a DevOps approach, for example deployment and monitoring.
DICER The DICER tool allowed us to express the infrastructure needs and constraints for the NewsAsset application and also to automatically generate deployment blueprints to be used on a cloud environment. We evaluated the usefulness of the DICER tool in terms of time saving, with regard to the time needed to setup the infrastructure manually from scratch, and the degree of automation that DICER offers. Note that in the total time that we computed for the DICER execution time we included the time needed by the DICE Deployment service to deploy the generated blueprint.
Deployment Service The whole process was really fast and we achieved a time saving almost 80% compared to the time we needed previously when we were installing manually the Storm cluster and all of its dependencies (Zookeeper etc) as well as the Storm application and all the dependencies for the persistence layer. The fact that the TOSCA blueprints allow the refinement of the Storm-specific configuration parameters in advance is really convenient since we can experiment with different Storm cluster setups by applying another reconfiguration, resulting in a new testbed, until we reach the most efficient in terms of performance and throughput for our topologies.
Monitoring Platform The ability to monitor NewsAsset’s deployed topologies is necessary in order to identify any possible bottleneck or even to optimize the performance and throughput by adding more parallelization for example. We have installed the Monitoring platform core modules (ELK stack, DMon core), and we then use one of the features of latest Deployment service release to automatically register the Storm cluster nodes on the core services of Monitoring platform during the infrastructure/application deployment phase. The monitoring can be performed not only on a per (Stormcluster) node but also on a per application level which means that we can distinguish between issues related to hardware specifications of the nodes and issues related to the application’s internal mechanics like the size of internal message buffer of Storm.
Fault Injection Since the NewsAsset’s topologies deal with a cloud deployment it is quite critical to be able to test the consequences of faults early in the development phase in a controlled environment rather than once the application is in production. It is important for the NewsAsset DIA to be comprised of reliable topologies due to the nature of the processing it performs: if for example there is a burst event at some time then losing some of the social networks messages due to network failures/repartitions could affect the quality of the trending topics identified at that period. Thus, having a tool like Fault Injection can help us to test how the application behaves in terms of reliability and fault-tolerance. There is also the need for the NewsAsset application to eliminate any single point of failure as possible. We used the Fault Injection tool to randomly stop/kill not only the various types of Storm processes (nimbus, supervisor and worker processes) but also a whole node of the Storm cluster, and we checked how those actions affect the proper execution of the topologies. Finally, we used the Fault Injection tool to generate high memory and CPU usage for the Storm cluster VMs to simulate high memory/CPU load. The various processing bolts of the topology were not affected too much since none of them is high CPU-bound or memory greedy. We plan to continue experimenting with the Fault Injection tool specially by simulating high bandwidth load since the trending topic detector topology makes heavy use of the Twitter Stream API.
Quality Testing The validation KPI for the Quality Testing Tool refers to more than 30% reduction in the manual time required in a test cycle. For that purpose, a chrono assessment has been performed comparing the manual time that ATC engineers would need to perform manually the stress testing of the topology to the time that is required by the Quality Testing tool to perform a similar task. In our use case scenario only 5 iterations/experiments the goal has been achieved and the reduction per test cycle has reached the 34%. In case that more test cycles are required the reduction can get even higher. Another important finding is that the load testing process can be fully automated by using the Quality Testing Tool.
Configuration Optimisation In general, a typical Storm deployment comprises of a variety of configuration properties that affect the behaviour of nimbus, supervisors as well as the topologies. It is a non-trivial task for a developer to select optimal values for the configuration properties in order to fine tune the Storm topology execution. This is usually a complicated task accomplished by experts in this area. Also, each Storm based application has its own characteristics (either I/O bound or CPU bound or even both) that should be taken carefully into account while experimenting with candidate optimal values, so following a generic template for setup is not an option. Also, the overhead and the time required to update the configuration and reload it on the cluster each time one tries to optimize some properties is significant.

In order to evaluate the improvement by applying the Configuration Optimization tool on the Storm configuration that the News Orchestrator relies on, the ATC engineers have monitored the topology throughput as well as the latency. The impact on performance is more than twice compared to the default configuration which is a significant improvement. This achievement has been achieved after only 100 iterations which resulted in a total of 16 hours (100 * 10 minutes) of execution. The corresponding validation KPI has been fully addressed, resulting in a more than 30% improvement in both the throughput and the latency metrics.

Simulation Tool Scalability, bottleneck detection and simulation/predictive analysis are some of the core requirements for the News Orchestrator DIA. The DICE Simulation tool promises that it can perform a performance assessment of a Storm based DIA that would allow the prediction of the behaviour of the system prior to the deployment on a production cloud environment. The News Orchestrator engineers are often spending much time and effort in order to configure and adapt the topology configuration according to the target runtime execution context. Introducing a tool that can perform such a demanding task efficiently would clearly increase the developer’s productivity and also facilitate their testing needs. The corresponding validation KPI of the Simulation tool refers to a prediction error of the utilization of bolts less than 30%. The comparison should be performed between the results that the Simulation tool has generated and those monitored by the metrics interface that Storm framework exposes. In this way, the predicted/simulated results were validated against the actual results that were monitored by deploying the News Orchestrator topology on a Storm cluster of 4 nodes running on the cloud.

The results show that for some of the bolts (approximately more than half) the prediction error is indeed very small, less than 10%, predicting quite accurately the capacity of the bolts. What is interesting is what are the characteristics of the bolts that the Simulation tool fails to predict their capacity, or at least with an acceptable precision. There is a tricky part in the workflow of the bolts of the News Orchestrator topology: some bolts are not constantly consuming and analysing social items but instead their execution is triggered periodically whenever a predefined time window expires (that time window reflects how frequently the News Orchestrator engineers want to detect news topics, which is configured at 5 minutes). So, it was proved that for the topic detection and minhash clustering bolts the average execution time was hard to be computed, affecting thus the precision of the Simulation tool results. Additionally, some of those bolts are by nature of low utilization mostly due to the limited time they are executed compared to the total experiment time. In those cases, is also observed a deviation between the Simulation tool results and the actual monitored values that surpasses the threshold of 30% regarding the predictions. For the ATC engineers, having a tool like the DICE Simulation tool in the stack of the tools that help them to optimize and evaluate the News Orchestrator DIA is a significant advantage. Every new feature that is added in the News Orchestrator DIA may result in an undesired imbalance with regards to the performance of the system. Being able to validate the performance impact on the DIA prior to the actual deployment on a cloud infrastructure gives the flexibility to fine tune the topology configuration in advance and take corrective actions (i.e. scaling by increasing bolts parallelism) without wasting resources (costs and efforts) that would be otherwise required by an actual deployment on the cloud.