Design of Main Memory Database System/Introduction
Part I: Introduction to Database and Database Management Systems
Chapter 1: Overview
1.1 Introduction[edit | edit source]
Database systems have become an essential component of every software applications. Database systems emerged in 1960s and took 10 years to gain widespread usage. More and more organizations began to adopt database technology to manage their corporate data during the mid 1970s
Generalized Update Access Method (GUAM) was a hierarchical database system developed in early 1960s by Rockwell International. Rockwell developed this software to manage the data usually associated with manufacturing operations. IBM introduced Information Management System (IMS) as a hierarchical database management system soon after that. The 1970s were the dawn of relational database technology. Dr. Edgar F. Codd's paper on the [[w:Relational model|relational model\\ revolutionized the thinking on data systems. The industry quickly responded to the superiority of the relational model by adapting their products to that model. During the 1980s, database systems gained a lot of ground and a large percentage of businesses made the transition from file-oriented data systems to database systems. Some of the leading products like ORACLE, DB2, SQL Server, Informix and Sybase started ruling the database world with their flagship relational database management systems (RDBMS).
The relational model matured and became the leading data model in the 1990s. Towards the end of the '90s, object-oriented databases gained popularity; however, older applications that were already developed using the relational model were reluctant to move to the object-oriented model.
The late 2000s saw the rise of the NoSQL movement, which moved towards more application-specific data models than the relational model. The object-oriented model is one of the alternative data models being used.
Most leading database management systems support the object-oriented model. Many of them offer object-to-relational mapping to achieve object model support. For example, DB2 is Relational, Hierarchical (XML), Object oriented Database Management system.
Open Source DBMS
1.2 Database[edit | edit source]
The word database is in common use. We use the terms database and database management system interchangeably, which is quite wrong most of the time. So we must start with defining what ‘database’ means.
A database is collection of related data designed, built and populated with data for a specific purpose.
A database can be of any size and of varying complexity. For example, a list of names and addresses of employees may consist of few hundred to thousand records varying upon the organization size. Indeed, there are databases that store much more data, for example, the database maintained by the Income Tax Department to keep track of all paid taxes. In India, lets say there are around 1 billion tax payers, and if each taxpayer files with approximately 500 characters of information per form, we would get a database of 10^9 * 500 = 500 Giga bytes (GB) of data. For keeping at least past three years of returns, we need 1.5 Terabytes(TB) of space. This huge amount of information must be organized and managed so that users can search and update the data as needed.
Databases can also be very complex due to the complex relationships that exist between the different records. A good example of such a database is Indian Railways' reservation system.
1.3 Database Management Systems[edit | edit source]
A database management system (DBMS) is a general-purpose software system that enables users to define, construct, manipulate and share information or data. Additionally it offers data "persistence" over long periods of time. The information being stored is called the database.
The database management system includes the fundamental operations that can be applied to data. Every database management system provides for the following basic operations, collectively referred to by the acronym CRUD:
- C—Create data by inserting it into the database
- R—Read data from the database
- U—Update data already stored in the database
- D—Delete data from the database
1.3.1 Benefits of DBMS[edit | edit source]
To simply run the CRUD operations does not require a DBMS. In fact, it is possible to use a spreadsheet or even write our own set of programs to simply create and maintain data. However, DBMS software has already solved multiple other problems which are bound to arise during this process, such as:
- Sharing of data with access control for multiple users
- Multi-user transaction processing
- Concurrent processing
- Redundancy control
- Query processing and optimization
- Backups and recovery
- Integrity constraints enforcement
- Standards enforcement
- Data abstraction for applications
1.4 Database System Types[edit | edit source]
Database software has evolved to support different type of data models. As we try to represent real-world data requirements in a data model, we come up with different data model over a period of time. It turns out that we can look at data requirements and create data models in a few different ways. They are listed below.
1.4.1 Hierarchical DBMS[edit | edit source]
A hierarchical database is similar in nature to a file system, with a root node and one or more children referencing the parent. This gives very fast data-access path, but high application maintenance.
Data is organized into a tree like structure, which allows repeating information using parent/child as one to many relationships. Parent can have many children but child has only one parent. This model was widely used in the first main frame database management systems. The most common form of hierarchical model used currently is the LDAP model. This model gained popularity again with the recent XML databases.
An XML database is a data persistence software system that allows data to be imported, accessed and exported in the XML format. Two major classes of XML database exist:
- XML-enabled: These map all XML to a traditional database (such as a relational database), accepting XML as input and rendering XML as output.
- Native XML (NXD): The internal model of such databases depends on XML and uses XML documents as the fundamental unit of storage.
1.4.2 Network DBMS[edit | edit source]
This model is an extension to hierarchical data model in which each record can have multiple parents and multiple child records. In effect, it supports many to many relationships. It provides flexible way to represent objects and their relationships. But before it gains popularity, new model ‘relational model’ was proposed and that replaced network database model as soon as it was proposed.
1.4.3 Relational DBMS[edit | edit source]
The relational model is the basis for any relational database management system(RDBMS). It defines how to create, store and retrieve the data and to make the data logically consistent. A relational model has three core components: a collection of objects or relations, operators that act on the objects or relations, and data integrity methods.
Dr. E. F. Codd, the father of the relational model, stipulated the rules and proposed this model. Data is represented as mathematical n-ary relations, an n-ary relation being a subset of the Cartesian product of n domains. "Relation" is a mathematical term for "table", and thus "relational" roughly means, "based on tables".
The basic principle of the relational model is the Information Principle: all information is represented by data values in relations. In accordance with this Principle, a relational database is a set of relations and the result of every query is presented as a relation.
The basic relational building block is the domain or data type. A tuple is an unordered set of attribute values. An attribute is an ordered pair of attribute name and type name. An attribute value is a specific valid value for the type of the attribute. This can be either a scalar value or a more complex type. A relation is defined as a set of n-tuples. A table in a relational database, alternatively known as a relation, is a two-dimensional structure used to hold related information. A database consists of one or more related tables. Don’t confuse a relation with relationships. A relation is essentially a table, and a relationship is a way to correlate, join, or associate the two tables.
A row in a table is a collection or instance of one thing, such as one employee or one line item on an invoice. A column contains all the information of a single type, and the piece of data at the intersection of a row and a column, a field, is the smallest piece of information that can be retrieved with the database’s query language.
The consistency of a relational database is enforced, not by rules built into the applications that use it, but rather by constraints, declared as part of the logical schema and enforced by the DBMS for all applications. The relational model establishes the connections between related data occurrences by means of logical links implemented through foreign keys.
The relational model defines operations such as select, project and join. Although these operations may not be explicit in a particular query language, they provide the foundation on which a query language is built.
SQL, which stands for Structured Query Language, supports the database components in virtually every modern relational database system. SQL has been refined and improved by the American National Standards Institute (ANSI) for more than 20 years.
ANSI introduced standard querying language to access relational databases, SQL (Structured Query Language). All database vendors developed SQL Engines on top of their relational engines to interpret and execute these SQL Statements. This lead to standard interfaces to emerge in programming languages as well. ODBC for C and JDBC for JAVA became the de-facto standard to access the SQL engine.
This book is mainly focused on relational database management systems (RDBMS) as other database systems are built on top of this RDBMS, and they have nothing to gain from the main memory nature of the MMDBs.
1.4.4 Main Memory DBMS[edit | edit source]
In-Memory Database system (IMDB) is a memory-resident relational database that eliminates disk access by storing and manipulating data in main memory. It is also known as main memory database (MMDB) or real-time database (RTDB). Predictability is more than performance in case of real-time databases.
Disk and memory capacities continue to grow much faster than latency and bandwidth improvements. Now, multi-terabyte RAM scans take minutes and terabyte-disk scans take many hours. We shall now keep the whole database in memory and design our data structures and algorithms intelligently and use multi-processors sharing a massive main memory, and intelligently use precious disk bandwidth. The database engines need to overhaul their algorithms to deal with the fact that main memories are huge (billions of pages trillions of bytes). Main memory database implementation has proved that they can execute queries ten to twenty times faster than traditional approach. The era of main memory databases has finally arrived. In this book we will discuss about this type of database management systems.
1.4.5 Column or Vertical DBMS[edit | edit source]
Storing data in column-wise as ternary relations (key, attribute, value) allows extraordinary compression, often as a bitmap. Querying such bitmaps can reduce query times by orders of magnitude and enable whole new optimization strategies. This is due to the fact that the total disk I/Os it performs to execute as OLAP(Online Analytical Processing) query will come done drastically because of the compaction in data. These types of databases started gaining popularity in OLAP applications because of multi-fold reduction in query execution time.
1.4.6 Stream Processing DBMS[edit | edit source]
Data is increasingly generated by instruments that monitor the environment – telescopes looking at the heavens, patient monitors watching the life-signs of a person in the emergency room, cell-phone and credit-card systems looking for fraud and RFID scanners watching products flow through the supply chain. In each of these cases, one wants to compare the incoming data with the history of an object. The data structures, query operators, and execution environments for such stream processing systems are qualitatively different from classic DBMS architectures. In essence, the arriving data items each represent a fairly complex query against the existing database. Researchers have been building stream-processing systems, and their stream-processing ideas have started appearing in mainstream products.
1.4.7 Object Relational DBMS[edit | edit source]
An object-relational database management system (ORDBMS) provides a relational database management system that allows developers to integrate a database with their own custom data-types and methods. Object Views allow the developer to define an object-oriented structure over an existing relational database table. In this way, existing applications do not need to change immediately, and any new development can use the object-oriented definitions of the table. This makes the transition from a relational to an object-relational database relatively easy, because object definitions can reference existing relational components. The term object-relational database sometimes also refers to external software products running over traditional DBMSs to provide similar features.
1.4.8 Distributed DBMS[edit | edit source]
Distributed databases bring the advantages of distributed computing to the database management domain. It is a collection of multiple logically interrelated databases distributed over the computer network, and distributed manager software that manages a distributed database. It comprises data replication, data fragmentation, distributed query processing, distributed transaction processing, distributed database recovery, etc.
In this book, focus is given on relational, main memory database system.
1.5 Database Related Roles and Responsibilities[edit | edit source]
For small personal database, such as list of contacts, one person typically defines, constructs and manipulates the database and there is no sharing of data. However huge and complex database requires many persons to define, construct, manipulate, maintain database. There are many roles involved with a database and they are listed below
- Database Administrator takes care of database itself, DBMS and related software. He is responsible for authorizing access to the database, monitor database usage and for acquiring software and hardware resources as needed.
- Database Designers are responsible for identifying the data to be stored in the database and for choosing appropriate structures to represent and store this data. They interact with specialized users and develop ‘views’ of the database that meets their application requirements.
- System Analysts / Software Engineers who thoroughly understands the functionalities of DBMS so as to implement their applications to meet their complex application requirements.
- DBMS Kernel Developers are persons who design and implement the DBMS interfaces and modules as a software package. .
- DBMS Tool Developers include persons who develop tools to access and use DBMS software. Typical packages include database design, performance monitoring, GUI, etc.
This book is mainly focused on persons who perform “DBMS Kernel Developers” role.
1.6 Programming Interfaces[edit | edit source]
Most database systems provide interactive interface where SQL commands can be typed and given as input to database system, which will retrieve and display the resultant records. For example, in a computer system where MYSQL RDBMS is installed, the command mysql will start the interactive interface. This tool is convenient for schema creation and for occasional adhoc queries. However majority of database interactions in practice are executed though programs. These programs are generally called as database applications. As more than 90% of applications involve database, we shall say all applications are database applications.
There are three ways to access database programmatically
- Embedded SQL – Embedding SQL Commands in a general purpose programming language. Database statements are embedded into the programming language and are identified by preprocessor with prefix “EXEC SQL”. These pre processors convert these statements into DBMS generated code. ESQL for C/C++ and SQLJ for Java.
- Native language drivers – standard interface on top of SQL commands. These provide functions to connect to database execute statement, retrieve resultant records, etc. ODBC, JDBC are examples
- Proprietary language/drivers – PL/SQL, PHP drivers. Mysql Provides PHP driver to access the database.