Relational Database Design/Retrieving data
The relational algebra is a formal language used to express database or relational operations. The most important operations are:
- σ(lower-cased Sigma)-select values with a condition)
- Π(Uppercase pie)-project, only show the named attributes
- X OR Cross Product (each row of one relation shown combined with every row of another relation),
- (join on a condition, equijoin, natural join (an equijoin on the common fields of two relations) ).
Also, the set operations of union, intersection, difference: , , (where x is not an element of A intersect B).
There is also a division operation, which is like a natural join where the attributes of a second relation is a subset of the attributes of the first relation, and only the rows of the first relation which have the values in the corresponding attributes in the second relation, are selected, and the non-common attributes projected.
A divided by B, might be better said, A given B, or all the rows in A that will show all the rows in B.
SQL statements are divided into two types, data declaration statements (DDL), and data manipulation statements (DML), (the L is for language ).
Data Manipulation SQL
This is by far the majority use of SQL, as once a database is designed , SQL data manipulation is usually flexible enough to get around poor design, and null values are allowed, so poorly designed normalization can be gotten around by leaving fields empty. Therefore, it pays to know about the data manipulation statements over the data declaration statements, in order to know how to cope with bad design, which is likely to be a more likely reason for hiring , than an expert in Extract, Transform and Load, who is also a whiz at optimally normalized database design.
Retrieving data from a database using SQL is achieved using the SELECT keyword, and can range from using a very simple statement to retrieve a single column from a table, to an incredibly complex statement using conditional joining, grouping, ordering, limiting etc.
The most basic SELECT query takes the form:
SELECT <columns> FROM <table>;
Multiple column names are separated by commas, and the asterisk character can be used as a wild card to retrieve all columns from a table.
An alternative method of retrieving a given column or columns from a table would be to use the syntax:
SELECT table.column ...
It is of course important to be able to only return certain rows from a table, based on given conditions. The basic syntax for this is:
SELECT <columns> FROM <table> WHERE <condition>;
So far statements based on these basic summations of the relevant SQL syntax would allow you to retrieve data from an SQL compliant database, but not do so in a relational way. The key to retrieving data in such a way is the JOIN keyword. The most basic JOIN syntax is:
SELECT <columns> FROM <table1> JOIN <table2> ON <join_condition> WHERE <condition>;
JOIN keyword in this manner would allow you to retrieve columns from multiple tables into a single result set, based on a condition. This allows you to 're-attach' data that has been normalized into multiple tables back into coherent data at the time of retrieval.
However, one of the reasons for such normalization is that there may not always be a result in the second table that corresponds to one in the first, or vice versa. In such a situation it is necessary to decide what is to be done with data that isn't 'attached' together.
Generally the decision is whether to ignore data from the second table that doesn't match with any data from the first, or ignore data from the first table that doesn't match with any from the second; in SQL this is achieved by using
LEFT JOIN or
RIGHT JOIN respectively.
Data Declaration SQL
These statements construct a database for you. The heirarchy is Database have one or more Schema which have one or more Tables, which have one or more Columns.
Data declaration also contains statements for creating Users or Roles, which that allow for access control using privilege assignment: the database will define certain types of operation such as SELECT, REFERENCE, UPDATE, DELETE, USAGE, CREATION and allow users or roles to be GRANTed these operations at various granularities e.g. to a database, schema, table or part of a table.
Below is an example of an organic usage of DML to hack up a database without any planning. One way of database design might to put such a hack frontended with a web hack, and invite people to play the virtual simulation of a business for free, and see what kind of problems turn up.
CREATE ROLE CUSTDB_ADMIN; GRANT CUSTDB_ADMIN CREATE DATABASE; CREATE DATABASE CUSTDB OWNER CUSTDB_ADMIN; CREATE SCHEMA CUST; CREATE TABLE CUST.CUSTOMER ( CUST_ID INTEGER PRIMARY KEY, FIRSTNAME TEXT, SURNAME TEXT, ADDRESS TEXT, PHONE TEXT ); CREATE SCHEMA STOCK; CREATE TABLE STOCK.PRODUCT ( PROD_ID INTEGER PRIMARY KEY, REFERENCE_NO NUMERIC(15), NAME TEXT, DESCRIPTION TEXT ); CREATE TABLE STOCK.INVENTORY ( INVENT_ID INTEGER PRIMARY KEY, PROD_ID INTEGER REFERENCES STOCK.PRODUCT, QTY INTEGER ); CREATE TABLE STOCK.INVENTORY_CHANGE ( INV_CHANGE_ID INTEGER PRIMARY KEY, INVENT_ID INTEGER REFERENCES STOCK.INVENTORY, QTY INTEGER, WHEN TIMESTAMP ); CREATE TABLE STOCK.PRICING ( PRICING_ID INTEGER PRIMARY KEY, PRODUCT_ID INTEGER REFERENCES PRODUCT, PRICE NUMERIC(10,2) ); CREATE TABLE STOCK.COST ( COST_ID INTEGER PRIMARY KEY, PRODUCT_ID INTEGER REFERENCES PRODUCT, COST NUMERIC(10,2) ); CREATE SCHEMA SALES; CREATE TABLE SALES.PRICE ( PRICE_ID INTEGER PRIMARY KEY, PROD_ID INTEGER REFERENCES STOCK.PRODUCT, PRICING_ID REFERENCES STOCK.PRICING, ALTERATION NUMERIC(10,2) ); CREATE TABLE CUST.ORDER ( ORDER_ID INTEGER PRIMARY KEY, CUST_ID INTEGER REFERENCES CUST.CUSTOMER, WHEN TIMESTAMP ); CREATE TABLE CUST.ORDER_ITEM ( ORDER_ITEM_ID INTEGER PRIMARY KEY, ORDER_ID INTEGER REFERENCES CUST.ORDER, PROD_ID INTEGER REFERENCES STOCK.PRODUCT, QTY INTEGER, PRICE_ID INTEGER REFERENCES SALES.PRICE );
Note that every table has a primary key constraint, and many tables a foreign key constraint with the keyword REFERENCES either applied after a column declaration where the type of the column must match the type of the primary key, or at the end of the table , where FOREIGN KEY CONSTRAINT field REFERENCES other_table is the preamble.
Relating Data Declaration SQL , to relational design using normalization
In these tables, there are no composite keys, and if these tables are in 3NF , then they are also in 5NF, according to Date/Fagin, as above. Do the tables meet the definition of 3NF - for a functional dependency X->Y, then X is a superkey or Y is part of a candidate key ?
For customer : FDs are CUST_ID -> FIRSTNAME, LASTNAME, ADDRESS, PHONE - 3NF (also 5NF)
For product : FDs are PROD_ID -> REFERENCE_NO -1, REFERENCE_NO -> NAME, DESCRIPTION -2
FD1 X is a superkey, FD2 Y is NAME, DESCRIPTION , and could probably be a candidate key , so 3NF holds. However, it might be argued NAME is a candidate key, and TEXT is a dependendant on NAME i.e. NAME -> DESCRIPTION. Here NAME may be unique for each product row, so it too is a candidate key, and therefore X is a superkey and 3NF still holds for NAME-> DESCRIPTION.
There seems to be an apparent problem with PROD_ID and REFERENCE_NO , and perhaps product should be decomposed to (PROD_ID , REFERENCE_NO) , (PROD_ID, NAME, DESCRIPTION).
But 2NF holds ( no non-prime attribute dependent on a part and not a whole of a minimal superkey, or candidate key, which are PROD_ID, or REFERENCE_NO ), 3NF holds ( X->A, X is a superkey, or A is part of a candidate key ), and even BCNF holds ( REFERENCE_NO->NAME, DESCRIPTION, and REFERENCE_NO is an alternate candidate key ). So the decomposition seems unnecessary.
For inventory , INVENT_ID -> PROD_ID , QTY , is the keying FD , but as it is , PROD_ID -> QTY might be valid, so qty might as well go in Product table. Inventory table can be dropped, and Inventory_change should reference product instead.
For inventory_change, what was invent_id , is now prod_id , and prod_id is not a candidate key, as there may be multiple inventory_changes for one prod_id. The FDs are then inv_change_id -> prod_id, qty, when. 3NF ( also BCNF)
For pricing, there may be different prices for one product, there probably should be something like a pricing period, and reason for pricing, probably in separate tables. FDs are pricing_id -> price, prod_id ( 3NF, BCNF)
For Order, Order_id -> customer, when .
For Order Item, order_item_id -> Order_id, prod_id, qty, pricing_id, but pricing_id -> prod_id, price, so during construction of a minimal covering set of FD , the order_item_id FD might become order_item_id -> order_id, pricing_id, qty , eliminating prod_id.
Since pricing_id -> prod_id, this FD has a X which is not a superkey of the order_item relationship, violating BCNF, but not 3NF ,as pricing_id is part of the alternate candidate key order_id, pricing_id, qty.
It is probably better to follow the same process of trying to achieve BCNF first, so prod_id should be eliminated from the order_item table.
Thus, normalization as been used to check an organically grown database thrown together by a incremental design process. ER diagramming would probably have shown a redundant relationship between order_item , pricing , and order as well, something like
order_item - about - product is a redundant relationship.
Also, after checking, all the tables were found to have primary keys of single attributes, and to be in 3NF, so they must be in 5NF as well, although there seemed to be only multivalued depedencies already properly decomposed ; e.g. order ->-> order_item, product ->-> pricing,