User:Steve11235/sandbox

From Wikibooks, open books for an open world
Jump to navigation Jump to search

This book takes a practical, non-mathematical look at relational database normalization using the normal forms defined by Codd and others.

Introduction[edit | edit source]

Edgar Codd invented the concept of the relational database. Starting in 1970, he defined three "normal forms", to which three more forms were added, which govern the organization of columns into tables. These definitions use precise but abstract mathematical terms, which makes them difficult to understand, much less apply. However, practical application of the principles behind these forms leads to solid table designs that avoid pitfalls. Once the design is fully "normalized", the design may then be "de-normalized" for performance reasons, if necessary.

Example Tables[edit | edit source]

The following tables are used in examples. The first group illustrates basic relationships. Material_Vendor is an example of a "relationship table"; these are used to describe the relationship between the rows of two or more other tables. The second group is used to illustrate the possibility of creating matrices of attributes, which should be avoided.

Table Primary Key Foreign Keys Other
Material material_number
Vendor vendor_number vendor_name
vendor_contact_name
Material_Vendor material_number
vendor_number
vendor_material_number
Order order_number vendor_number
Order_Detail order_number
line_number
material_number required_date

Notes about Order_Detail. Business requirements mandate the following:

  • The use of a "line number" to identify order lines
  • On an order, the combination of material and required date must be unique
Table Primary Key
Shoe_Style style
Shoe_Size size
Shoe_Color color
Shoe_Width width

Keys[edit | edit source]

A "key" consists of a set of one or more columns that uniquely identify a row. For example, order_number is a key for the Order table, while order_number, line_number is a key for the Order_Detail table. Note that a table could have no keys, but then its rows could then only be accessed sequentially.

A "candidate key" is a key in which all the columns are needed to uniquely identify the row. So, the key order_number, line_number, material_number, release_date uniquely identify an Order_Detail row, but it is not a candidate key because neither material_number nor release_date are needed when line_number is present.

Some tables have more than one candidate key. The Order_Detail table has two:

  • order_number, line_number
  • order_number, material_number, release_date

The "primary key" is the candidate key chosen to identify rows in the table.

Note that all the examples of keys up to this point use columns that are actual attributes of the tables that contain them. A "surrogate key" is a single column with unique, non-null values added to the table for the sole purpose of acting as the primary key. Note that the use of surrogate keys is frowned upon by purists, and the following discussion of normalization assumes that they are not used. Having said that, the use of surrogate keys is a common practice and is discussed later.

The "natural key" is a candidate key, not the surrogate key, that otherwise would have been chosen as the primary key.

Normal Forms[edit | edit source]

Codd defined the first three normal forms, while the second three were added later. Note that each form requires that the requirements of the previous form be met as a prerequisite. Therefore, second normal form requires the first, etc.

The rules of each normal form must hold for every candidate key. This becomes an issue when surrogate keys are used, as they tend to mask the columns of the natural key. For example, if a surrogate key is added to Order_Detail, then order_number is redundant and should be removed. However, it cannot be ignored.

First, Second, Third Normal Forms[edit | edit source]

These forms can be summed up by, "Each non-key column must state a single fact that depends on the key, the whole key, and nothing but the key."

First Normal Form[edit | edit source]

Each column must have one meaning and contain one piece of information. Note that the definition of "one piece of information", or "atomicity", is vague. Here are examples of column values that violate this principle.

  • contact =
    • '1-415-555-1212'
    • 'Jane Bloggs'
  • phone_numbers = '1-415-555-1212, 1-415-555-1213'
  • address = '123 Main St., Anytown, CA, 94501'

In the first example, the column contains contains a phone number in one row and a name in another. This violates "same meaning."

In the second example, one column contains two phone numbers. This violates "one piece of information."

In the third example, the column holds one address, but the address is (typically) composed of several pieces of information: street, city, state, postal code. This also violates "one piece of information."

Again, the application of this rule is somewhat subjective. For example, U.S. phone numbers could be decomposed into country code (1), area code, and the prefix and suffix. Whether that is appropriate depends on how phone numbers are validated and queried by applications.

Practically, having one piece of information per column greatly simplifies maintenance, query, and data integrity.

Second Normal Form[edit | edit source]

Every non-key column must describe the full key and not a subset.

For example, adding vendor_number to the Order_Detail table violates this rule, since it describes order_number.

Practically, following this rule avoids repeating information found in parent tables. This simplifies maintenance and maintains data integrity.

Third Normal Form[edit | edit source]

Every non-key column must directly describe the primary key. Effectively, this forbids B describes A describes the primary key.

For example, adding vendor_name to the Order table violates this rule. While vendor_name indirectly describes order_number, it directly describes vendor_number that describes order_number.

Practically, this rule simplifies maintenance and protects data integrity. If a buyer wants to change a vendor name, updating a row in Vendor is simple. Finding all the tables that contain the vendor's name, given that there is no guarantee that the column names are all vendor_name, and then updating all their rows can become very complicated.

Additional Rules[edit | edit source]

Since Codd laid down his rules, the definitions of the first three normal forms have been refined.

Two columns may not share the same meaning[edit | edit source]

This forbids repeating groups or arrays of columns.

For example, updating the Vendor table to have columns vendor_contact_name_1, vendor_contact_name_2 violates this rule.

Practically, this avoids data integrity problems where two columns contain the same value and simplifies queries. It also avoids a situation where another column or two must be added to hold all applicable values for a row.

No optional columns[edit | edit source]

This forbids creating one or more columns that are populated only if another column contains specific values. The goal is to avoid situations where the columns have meaning only for a subset of rows.

For example, adding columns to the Order table to hold electronic data transmission information only for participating vendors violates this rule. The EDT columns have no meaning for non-EDT vendors.

Practically, this doesn't have huge consequences in itself. It avoids the temptation to assign multiple meanings to the same column, which violates first normal form and can lead to data integrity issues and confusion among developers.

No dependencies between rows[edit | edit source]

This rule forbids a value in a column in one row from depending on a value in another row.

For example, in the Material_Vendor table, adding a column primary_vendor violates this rule. The reason is that, for any material_number, the column may be true for only one vendor_number. Setting the value to true for another vendor_number causes data corruption, or at a minimum, an unique constraint exception.

Practically, this rule avoids complex updates and potential data corruption. In the example, the information could be stored by adding a primary_vendor_number column to the Material table.

Boyce/Codd, Fourth, Fifth Normal Forms[edit | edit source]

These normal forms deal with specialized situations.

Boyce/Codd Normal Form[edit | edit source]

For each candidate key, it should not be possible to determine any of its columns using another set of columns.

For example, adding vendor_material_number to the Order_Detail table violates this rule.

* order_number, vendor_material_number, required_date is a candidate key
* vendor_number, material_number can be used to determine the vendor_material_number using the Material_Vendor table  (vendor_number is obtained from the Order table)

Practically, this sort of thing can occur when a column's value is derived from a non-key column of a "relationship table", e.g. vendor_material_number in Material_Vendor. Generally, data integrity and design simplicity are better preserved by looking up such values with a query.

Fourth Normal Form[edit | edit source]

Do not create a matrix of unrelated facts.

For example, assume each shoe style comes in a subset of sizes, a subset of colors, and a subset of widths, but all combinations of these subsets are available. A table with columns style, size, color, width violates this rule.

Practically, matrices of unrelated values create large numbers of rows that are difficult to maintain. For example, 20 sizes by 3 colors by 3 widths requires 180 rows, and the addition of one color requires 60 additional rows. Three tables require 26 rows total, and adding a color requires one row.

Fifth Normal Form[edit | edit source]

Do not create a matrix of independent keys related to one fact.

For example, assume each shoe style comes in a subset of sizes, a subset of colors, and a subset of widths, but all combinations of these subsets are available. In addition, multiple vendors supply each style, but some vendors supply only certain sizes, colors, and widths, independent of style. A table with style, vendor_number, size violates this rule, since size is a fact about both style and vendor_number.

Practically, matrices of independent keys related to a fact also create large numbers of rows that are difficult to maintain.

Etcetera[edit | edit source]

This section discusses a few points of database table design.

Common Normalization Violations[edit | edit source]

Most normalization violations result from misapplying the "Keep It Simple" principle. Avoiding another table or another join in a query seem to reduce complexity. In practice, the hacks needed to maintain data integrity or to query unnormalized tables end up adding far more complexity, typically after the design is in use and correcting it is no longer possible.

The most frequent fix for an improperly normalized table is to remove the problem columns, and then to add them to a new table if they don't already exist elsewhere.

Many designers insist that 3rd normal form is sufficient, and the other normal forms are unnecessary. That's true, except when it isn't. Most of the time, the higher normal forms don't come into play because they deal with unusual situations. When those unusual situations come up, handling them appropriately can avoid problems. If a table's columns look unusual, look for problems even if they meet 3rd normal form criteria.

Here are a few thought patterns that lead to bad designs.

  • This table can have at most five values for this column; I don't need child table.
  • These values never change, I can save joins in my queries by adding columns to the main table.
  • I have three sets of rules that all apply to the same value; I'll put all their inputs into one table and set the unused columns to null.
  • These rows are grouped by a partial key; I'll set a flag to indicate which row is preferred in each group.
  • This key has multiple sets of lookup values; it will be simpler to put them all in one table.

Surrogate Keys[edit | edit source]