Practical DevOps for Big Data

Preface

Around 2010, the head of the Research & Development unit of Alcatel-Lucent used the expression “Big Data” at a meeting of the ICT scientific council of the French Research Agency (a.k.a. ANR). He told us that networks increasingly convey data on an exponential scale and that, as a consequence, ICT scientific issues must be totally rethought and re-addressed in light of the durable “Big Data” phenomenon. Indeed, network players were disturbed in their daily business by IT customers asking for new means of creating value from data (early, from the networks themselves, if possible!). Like gold nuggets, data carries “sense”, but immense volumes do not, by definition, facilitate any inherent comprehension, or allow for smart manipulation within reasonable time scales. Worse still, well-known storage, formatting, search, mining and other data technologies are no longer suitable for the requirements of “Big Data”!

In charge of defining relevant research directions of funded research programmes for the French public and private sectors, I was at first skeptical about the “big” adjective, which I tended to confuse with “raw”, “coarse-grained”, “unstructured”, etc. For example, while everybody uses “Big Data” as a buzzword, few can state more precisely what qualifies as “Big Data”: are “Big Data” sets tera-, peta-, exa-, zetta-… data sets? In other words, which average data quantity qualifies as “Big Data”? Are, for instance, exa (10^18) or zetta (10^21) data sets “Big Data++” sets or common “Big Data”?

Moving forward, and thus interrogating the High-Performance Computing community, which uses massively parallel computing infrastructures, my feeling was that this community assumes that all data is available, close at hand, and fairly structured… To exaggerate, all data is in high-performance memory at all times! This cloud computing anti-vision disappointed me…

In contrast, when interrogating the Massive Data Sets community, I found that they proposed “data methods”; in this case, of course, data volumes are “massive” (“huge”? “big”? what else?). Ambiguity arose from the difference/equivalence between “massive” (their chosen name) and “big” in “Big Data”. I quickly understood that their methods could not be appropriate, since they do not approach problems with High-Performance Computing: proof that their data quantities remain “not so big”. In relation to this shortcoming, the assumptions behind “data methods” systematically push SQL, XML… as formatting frameworks, something often far from the “Big Data” reality!

Other disciplines (e-commerce, energy, health, sustainability…) helped us out of this confusion. They challenged ICT people with the unstoppable digitalization of their own sectors, which requires, once again, rethinking and re-addressing how systems have to be designed in each sector.

Typically, in energy, a smart grid is not an energy distribution system equipped with hardware and software for intelligent monitoring and control. Instead, a smart grid is a “Big Data” processing infrastructure whose intimate relationship with the real world relies on sensors and actuators. The real world deals with energy through a surrounding physical infrastructure. How, then, is a smart grid, as an innovative system in the energy sector, to be designed? The digital component of the overall system is the core and, as such, imposes constraints on how the remaining part (the surrounding physical infrastructure) has to be composed with it! This vision displeases energy engineers, but it perfectly illustrates the way “Big Data” problems may be solved: the price of progress!

In fact, “Big Data” as a new scientific discipline is the digitalization of business systems at a scale never encountered before. Its scientific problems are neither computer science problems on one side nor extant scientific problems of other disciplines on the other. Solutions depend upon the way scalability issues are properly addressed: scalability is the core of the “Big Data” problematics. In other words, data methods or technologies for “lower” scales (i.e., “controllable” data volumes) become obsolete when scales increase. New solutions have to be invented, including software development methods that effectively foster collaborative multidisciplinary behaviors from both the software (development engineers, operations engineers) and sectoral (IT customers, end-users…) sides.

This book, “Practical DevOps for Big Data”, intends to tackle this objective. Building on proven and relevant software engineering paradigms, namely model-driven engineering and DevOps, the authors present results from the European ICT DICE project (http://www.dice-h2020.eu/). DICE focuses on quality assurance for data-intensive applications. Because “Big Data” applications call for a more interdisciplinary approach in tight collaboration with IT customers and end-users, DICE puts forward key criteria such as performance and trust. These criteria drive the way software is first designed and then released into production; they act as early input constraints or quality contracts. Treating them as inescapable constraints, DICE asks how running applications process data and deliver “sense” while satisfying these constraints from end to end. The DICE methodology is loopback-based: it includes monitoring and control of applications in production so that performance anomalies, for instance, may be fed back into maintenance cycles for quality preservation. Of course, the book also explains how dedicated integrated tools support the methodology for the successful “Big Data” world!

Authors

  • University of Pau — Franck Barbier
  • Netfective Technology — Youssef Ridene, Joas Yannick Kinouani, Laurie-Anne Parant
  • Imperial College London — Giuliano Casale, Chen Li, Lulai Zhu, Tatiana Ustinova, Pooyan Jamshidi
  • Politecnico di Milano — Danilo Ardagna, Marcello Bersani, Elisabetta Di Nitto, Eugenio Gianniti, Michele Guerriero, Matteo Rossi, Damian Andrew Tamburri, Safia Kalwar, Francesco Marconi
  • IeAT — Gabriel Iuhasz, Dana Petcu, Ioan Dragan
  • XLAB d.o.o. — Matej Artač, Tadej Borovšak
  • flexiOPS — Craig Sheridan, David McGowran, Grant Olsson
  • ATC SA — Vasilis Papanikolaou, George Giotis
  • Prodevelop — Christophe Joubert, Ismael Torres, Marc Gil
  • Universidad Zaragoza — Simona Bernardi, Abel Gómez, José Merseguer, Diego Pérez, José-Ignacio Requeno

How to Read This Book

This book is about a methodology for constructing big data applications. A methodology exists for the purpose of solving software development problems. It is made of development processes—workflows, ways of doing things—and tools to help concretise them. The ideal and guiding principle of a methodology is to facilitate the job and guarantee the satisfaction of stakeholders involved in a software project—end-users and maintainers included. Our methodology addresses the problem of reusing complex and not easily learned big data technologies to effectively and efficiently build big data systems of good quality. To do so, it takes inspiration from two other successful methodologies: DevOps and model-driven engineering. Regarding prerequisites, we assume the reader has a general understanding of software engineering, and, from a tool point of view, a familiarity with the Unified Modeling Language (UML) and the Eclipse IDE.

The book is composed of eight parts. Part I is an introduction (Chapter 1) followed by a state of the art (Chapter 2). Part II sets our methodology forth (Chapter 3) and reviews some UML diagrams convenient for modelling big data systems (Chapter 4). Part III shows how to adjust UML in order to make it support a stepwise refinement approach, where models become increasingly detailed and precise. Chapter 5 introduces the subject, and each of the following chapters (Chapters 6, 7, and 8) is dedicated to one of our three refinement steps. Part IV focuses on model analysis. Indeed, models enable designers to study the system carefully without needing an implementation thereof: a model checker (Chapter 9) may verify whether the system, as it is modelled, satisfies some quality-of-service requirements; a simulator (Chapter 10) may explore its possible behaviours; and an optimiser (Chapter 11) may find the best one. Part V explains how models serve to automatically install (Chapter 12), configure (Chapter 13), and test (Chapters 14 and 15) the modelled big data technologies. Part VI describes the collection of runtime performance data (Chapter 16) in order to detect anomalies (Chapter 17) and violations of quality requirements (Chapter 18), and to rethink models accordingly (Chapter 19). Part VII presents three case studies of this methodology (Chapters 20, 21, and 22). Finally, Part VIII concludes (Chapter 23) and mentions future research directions (Chapter 24).

Contents

  1. Introduction
    1. Introduction
    2. Related Work
  2. DevOps and Big Data Modelling
    1. Methodology
    2. Review of UML Diagrams Useful for Big Data
  3. Modelling Abstractions
    1. Introduction to Modelling
    2. Platform-Independent Modelling
    3. Technology-Specific Modelling
    4. Deployment-Specific Modelling
  4. Formal Quality Analysis
    1. Quality Verification
    2. Quality Simulation
    3. Quality Optimisation
  5. From Models to Production
    1. Delivery
    2. Configuration Optimisation
    3. Quality Testing
    4. Fault Injection
  6. From Production Back to Models
    1. Monitoring
    2. Anomaly Detection
    3. Trace Checking
    4. Iterative Enhancement
  7. Case Studies
    1. Fraud Detection
    2. Maritime Operations
    3. News and Media
  8. Conclusion
    1. Future Challenges
    2. Closing Remarks
  9. Appendices
    1. Glossary
    2. Index