Structured Query Language/Managing Indexes
Indexes are a key feature of all SQL databases. They provide quick access to the data. Therefore almost all implementations support a CREATE INDEX statement.
Nevertheless the CREATE INDEX statement is not part of the SQL standard! The reason for this is unknown. Possibly it is a deliberate decision against all implemenation issues. Or it results from the wide range of different syntaxes realized by vendors and the lack of finding a compromise.
On this page we offer some basic ideas concerning indexes and the syntax which is common to a great number of implementations.
CREATE [UNIQUE] INDEX <index_name> ON <table_name> (<column_name> [, <column_name>]);
The Concept of Indexes
DBMSs offer quick access to data stored in their tables. One might think that such high-speed access is due to fast hardware of modern computers: millions of CPU cycles per second, I/O rates in the range of milliseconds, access to RAM within micro- or nanoseconds, etc. That is true, but only partly so. Instead, the use of intelligent software algorithms, especially in the case of handling large amounts of data, is the dominant factor.
Consider a request to the DBMS to determine, whether or not a person with a certain name can be found in a table with 1 million entries. With a primitive, linear algorithm the system has to read 500,000 rows (on average) to decide the question. The binary search algorithm implements a more sophisticated strategy which answers the question after reading 20 rows or less. In this case this choice of algorithm leads to a factor of 25,000 in performance. In order to really grasp the magnitude of this improvement you may want to multiply your salary by 25,000.
Admittedly this comparision between the linear access and the binary search algorithm is a little bit simple. First, DBMS usually read blocks containing multiple rows and not single rows. But this didn't change the situation. If a block contains 100 rows, modify the above example from 1 million to 100 million rows. Second, the binary seach algorithm assumes that the data is ordered. This means that during data entry there is an additional step for sorting the actual input into the existing data. This applies only once and is independent from the number of read accesses. In summary there is additional work during data entry and much less work during data access. It depends on the typical use of the data whether the additional work is worthwhile.
The index is an additional storage holding data which is copied or deducted from the original data in the table. He consists only of redundant data. What parts make up the index? In the case of the binary search stategy the index holds the original values of the tables column plus a backreference to the original row. In most cases he is organized as a balanced tree with the columns value as the trees key and the backreference as additional information for each key.
The binary search algorithm is one of a lot of methods for building indexes. The common characteristics of indexes are that they consists only of redundant information and use additional resouces in sense of CPU cycles, RAM or disc space and offer better performance for queries on large data amounts. If they are used on small tables or there are too much indexes for the same table it is possible that the disadvantages outweighs the benefits.
If an application use to retrieve data by a certain criterion - e.g. a person name for a phone book application - and this criterion consists of a tables column, this column should have an index.
CREATE INDEX person_lastname_idx ON person(lastname);
The index has its own freely selectable name - person_lastname_idx in this example - and is build on a certain column of a certain table. The index may be defined and created directly after the CREATE TABLE statement (when there is no data in the table) or after some or a huge number of INSERT commands. After it is created the DBMS should be in the state to answer questions like the following quicker than before.
SELECT count(*) FROM person WHERE lastname = 'Miller';
The index is used during the evaluation of the WHERE clause. But it is not sure that the index is used. The DBMS has the choice between on the one hand reading all person rows and counting such where the lastname is 'Miller' or on the other hand reading the index (possibly with binary search) and counting all nodes with value 'Miller'. Which strategy is used depends on a lot of decisions. If, for example, the DBMS knows that about 30% of all rows contains 'Miller' it may choose a different strategy than if it knows that only 0.3% contains 'Miller'.
A table may have more than one index.
CREATE INDEX person_firstname_idx ON person(firstname);
What will happen in such a situation to a query like the following one?
SELECT count(*) FROM person WHERE lastname = 'Miller' AND firstname = 'Henry';
Again, the DBMS has more than one choice to retrieve the expected result. It may use only one of the two indexes, read the resulting rows and look for the missing other value. Or it reads both indexes and count the common backreferences. Or it ignores both indexes, reads the data and counts such rows where both criterias apply. As mentioned it depends on a lot of decisions.
If an application typically searches in two columns within one query, e.g. for first- and lastname, it can be useful to build one index for both columns. This strategy is very different from the above example where we build two independent indexes, one per column.
CREATE INDEX person_fullname_idx ON person(lastname, firstname);
In this case the key of the balanced tree is the concatenation of last- and firstname. The DBMS can use this index for queries which ask for last- and firstname. It can also use the index for queries for lastname only. But it cannot use the index for queries for firstname only. The firstname can occur at different places within the balanced tree. Therefore it is worthless for such queries.
In some cases an existing index cannot be used for queries on the underlying column. Suppose the query to person names should be case-insensitive. To do so the application converts all user-input to upper-case and use the UPPER() function to the column in scope.
-- Original user input was: 'miller' SELECT count(*) FROM person WHERE UPPER(lastname) = 'MILLER';
As the criterion in the WHERE clause looks only for uppercase characters and the index is build in a case-sensitive way, the key in the balanced tree is worthless: 'miller' is sorted at a very different place than 'Miller'. To overcome the problem one can define an index, which uses exactly the same strategy as the WHERE criterion.
CREATE INDEX person_uppername_idx ON person(UPPER(lastname)); -- not supported by MySQL
Now the 'UPPER()' query can use this so-called functional index.
The Primary Key of every table is unique, which means that no two columns can contain the same value. Sometimes one column or the concatenation of some columns is also unique. To ensure this criterion you can define a UNIQUE CONSTRAINT or you can define an index with the additional UNIQUE criterion. (Often UNIQUE CONSTRAINTS silently use UNIQUE INDEX in the background.)
CREATE UNIQUE INDEX person_lastname_unique_idx ON person(lastname);
Unique indexes can only be created on existing data, if the column in scope really has nothing but unique values (which is not the case in our database example).
Drop an Index
Indexes can be dropped by the command:
DROP INDEX <index_name>;