Data Management in Bioinformatics/Normalization
| Chapter Navigation | |
| Top | E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages |
Contents |
[edit] Functional Dependency (FD)
[edit] Introduction of FD
| gene_id | name | annotation |
|---|---|---|
| ... | ... | ... |
| g23 | hsp39 | XX protein |
| g24 | hsp39 | XX protein |
Functional Dependency: X->Y if two tuples agree on X, then they agree on Y. e.g. "gene_id -> name" and "gid-> annotation" are FDs, but "name->gene_id" and "name->annotation" are NOT FDs.
Relationship between key and FD:
A set of attribute{A1, A2, ...At} is a key for relation R if
(1) {A1, A2, ...At} functionally determines all other attributes;
(2) when no proper subset of {A1, A2, ...At} functionally determin all other attributes, it is called superkey (or Candidate key?)
rules of FDs:
Observation:
Correct:
X->Y, A->B: XA->BY;
X->AY: X->A, X->Y;
X->Y, A->Y: XA->Y;
Incorrect:
XY->A: X->A, Y->A;
Formal inference rules for functional dependencies:
Armstrong’s axioms:
1. Reflexivity: If Y ⊆ X then X -> Y (called trivial FD);
2. Transitivity: If X -> Y and Y -> Z then X -> Z;
3. Augmentation: If X -> Y then XZ -> YZ where Z is a set of attributes.
Where do FDs come from?
looking either at
(1) Domain Knowledge (the meaning of the attributes);
(2) the data;
| a | b | c |
|---|---|---|
| 1 | 2 | 3 |
| 4 | 2 | 6 |
b->a is not FD, but a->b is a valid FD.
[edit] BCNF
Formal way to break tables:
| gene | name | annotation | experiment id | experiment description | experiment level |
|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... |
⇒
R1 (gid, name, annotation); gid->name annotation;
R2 (exp id, exp desc); exp id -> exp desc;
R3 (gid, expid, exp level); gid, expid -> exp level;


A relation R is in BCNF if for all dependencies X -> Y , at least one of the following holds:
• X -> Y is a trivial FD (Y ⊆ X)
• X is a superkey for R
[edit] 3NF
Formal Definition: a relation R is in 3NF if for all dependencies X® Y in F+, at least one of the following holds:
• X -> Y is a trivial FD (Y ⊆ X)
• X is a superkey for R
• Y ⊂ a candidate key for R
[edit] Comparison of BCNF and 3NF
| Removes Redundancy |
Preserves FDs |
Lossless Decomposition |
|
|---|---|---|---|
| BCNF | Yes | No | Yes |
| 3NF | No | Yes | Yes |
[edit] Lossy Decomposition
Suppose we have a table such as the one below.
| a | b | c |
|---|---|---|
| 1 | 2 | 3 |
| 4 | 2 | 6 |
into the following "normalized" tables
|
|
This means we followed
- b → a or
- b → c
as our FD for the decomposition of the original table. (Remember, BCNF decomposes X → Y to the sets {X, Y} and {everything except Y}). However, notice that column b has non-unique values (two of digit 2). When we recombine these two tables in a join, we wind up with a table in which the original relationships are lost:
| a | b | c |
|---|---|---|
| 1 | 2 | 3 |
| 1 | 2 | 6 |
| 4 | 2 | 3 |
| 4 | 2 | 6 |
We declare this as a lossy decomposition, because the FDs assumed by the decomposed tables do not hold true (both b → a and b → c are false). Thus, BCNF can not remove all forms of redundancy. We will need to use another model, Multivalued Dependency (MD), to aid us in decomposing tables properly for normalization.
[edit] Multivalued Dependency (MD)
Let's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of
- SID →→ car
read as, "A given student ID can have many cars."
Compare this to an FD like
- SID → name
which we read as, "A given SID can have at most one name."
Suppose we also had the following MD:
- SID →→ clothes
We can get the following table:
| SID | car | clothes |
|---|---|---|
| 1 | Honda | jeans |
| 1 | Toyota | khakis |
| 1 | Toyota | jeans |
| 1 | Honda | khakis |
| … | … | … |
Perhaps we want to decompose this table further. We decompose along the MD of
- SID →→ car
This leaves us with {SID, clothes}, leaving us with the following tables
| SID →→ car | {SID, clothes} | ||||||||||||||||
|
|
We can't actually do this, because these are not FDs! Neither SID →→ car or SID →→ clothes have a superkey. The key must be all three attributes. We need new rules for decomposing MDs.
[edit] Rules for MDs
[edit] Spotting trivial MDs
We say that an MD X →→ Y is trivial in R if X ∪ Y = {all attributes} .
For example, consider the following relation R1 where we have the MD a →→ b:
| R1 | |
| a | b |
|---|---|
| 1 | 2 |
| 1 | 3 |
| 2 | 2 |
| … | … |
Because we can have many bs to each a, we can not generate data to violate the MD.
Now consider the following relation R2:
| R3 | ||
| a | b | c |
|---|---|---|
| 1 | 2 | 6 |
| 1 | 3 | 7 |
| 2 | 2 | 6 |
| … | … | … |
In this case, a →→ b is non-trivial, because the relation is three-way.
[edit] Non-trivial MDs come in pairs
In fact, a →→ b has a partner MD which is also non-trivial: a →→ c
Here are the listings of all non-trivial pairs of MDs in R2
|
||
|
||
|
Compare this to examples of trivial MDs for R2:
| a →→ bc |
| ab →→ c |
| b →→ ac |
| bc →→ a |
| c →→ ab |
| ca →→ b |
| … |
[edit] Implications of FDs do not hold for MDs
In FDs, we know 
However, this same implication does not exist for MDs. That is, 
[edit] MDs are actually about independence
Given the MDs
- a →→ b
- and
- a →→ c
We know that, for each a
- there can be many a
- there can be many a
- and, bs are independent of cs with respect to a, which we represent symbolically as
[NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]
[edit] Every FD is an MD
By definition, X → Y ⇒ X →→ Y. If X has at most one Y, then it satisfies the condition of having many Y.
For example, given the following relation
| a | b |
|---|---|
| … | … |
and an FD of a → b, then we also have a trivial MD of a →→ b.
Now, consider the FD a → b with following relation:
| a | b | c | d |
|---|---|---|---|
| … | … | … | … |
In this case, we actually have two non-trivial MDs:
| a →→ b |
| a →→ cd |
(Remember, non-trivial MDs come in pairs.)
This page may need to be
[NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]