Data Management in Bioinformatics/Normalization
Chapter Navigation  
Top  E/R Theory  Normalization  Data Querying  Integrating SQL and Programming Languages 
Functional Dependency (FD)[edit]
Introduction of FD[edit]
gene_id  name  annotation 

...  ...  ... 
g23  hsp39  XX protein 
g24  hsp39  XX protein 
Functional Dependency: X>Y if two tuples agree on X, then they agree on Y. e.g. "gene_id > name" and "gid> annotation" are FDs, but "name>gene_id" and "name>annotation" are NOT FDs.
Relationship between key and FD:
A set of attribute{A_{1}, A_{2}, ...A_{t}} is a key for relation R if
(1) {A_{1}, A_{2}, ...A_{t}} functionally determines all other attributes;
(2) when no proper subset of {A_{1}, A_{2}, ...A_{t}} functionally determin all other attributes, it is called superkey (or Candidate key?)
rules of FDs:
Observation:
Correct: X>Y, A>B: XA>BY; X>AY: X>A, X>Y; X>Y, A>Y: XA>Y; Incorrect: XY>A: X>A, Y>A;
Formal inference rules for functional dependencies:
Armstrong’s axioms: 1. Reflexivity: If Y ⊆ X then X > Y (called trivial FD); 2. Transitivity: If X > Y and Y > Z then X > Z; 3. Augmentation: If X > Y then XZ > YZ where Z is a set of attributes.
Where do FDs come from?
looking either at
(1) Domain Knowledge (the meaning of the attributes); (2) the data;
a  b  c 

1  2  3 
4  2  6 
b>a is not FD, but a>b is a valid FD.
BCNF[edit]
Formal way to break tables:
gene  name  annotation  experiment id  experiment description  experiment level 

...  ...  ...  ...  ...  ... 
⇒
R1 (gid, name, annotation); gid>name annotation; R2 (exp id, exp desc); exp id > exp desc; R3 (gid, expid, exp level); gid, expid > exp level;
A relation R is in BCNF if for all dependencies X > Y , at least one of the following holds:
• X > Y is a trivial FD (Y ⊆ X)
• X is a superkey for R
3NF[edit]
Formal Definition: a relation R is in 3NF if for all dependencies X® Y in F+, at least one of the following holds:
• X > Y is a trivial FD (Y ⊆ X)
• X is a superkey for R
• Y ⊂ a candidate key for R
Comparison of BCNF and 3NF[edit]
Removes Redundancy 
Preserves FDs 
Lossless Decomposition 


BCNF  Yes  No  Yes 
3NF  No  Yes  Yes 
Lossy Decomposition[edit]
Suppose we have a table such as the one below.
a  b  c 

1  2  3 
4  2  6 
into the following "normalized" tables


This means we followed
 b → a or
 b → c
as our FD for the decomposition of the original table. (Remember, BCNF decomposes X → Y to the sets {X, Y} and {everything except Y}). However, notice that column b has nonunique values (two of digit 2). When we recombine these two tables in a join, we wind up with a table in which the original relationships are lost:
a  b  c 

1  2  3 
1  2  6 
4  2  3 
4  2  6 
We declare this as a lossy decomposition, because the FDs assumed by the decomposed tables do not hold true (both b → a and b → c are false). Thus, BCNF can not remove all forms of redundancy. We will need to use another model, Multivalued Dependency (MD), to aid us in decomposing tables properly for normalization.
Multivalued Dependency (MD)[edit]
Let's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of
 SID →→ car
read as, "A given student ID can have many cars."
Compare this to an FD like
 SID → name
which we read as, "A given SID can have at most one name."
Suppose we also had the following MD:
 SID →→ clothes
We can get the following table:
SID  car  clothes 

1  Honda  jeans 
1  Toyota  khakis 
1  Toyota  jeans 
1  Honda  khakis 
…  …  … 
Perhaps we want to decompose this table further. We decompose along the MD of
 SID →→ car
This leaves us with {SID, clothes}, leaving us with the following tables
SID →→ car  {SID, clothes}  


We can't actually do this, because these are not FDs! Neither SID →→ car or SID →→ clothes have a superkey. The key must be all three attributes. We need new rules for decomposing MDs.
Rules for MDs[edit]
Spotting trivial MDs[edit]
We say that an MD X →→ Y is trivial in R if X ∪ Y = {all attributes} .
For example, consider the following relation R_{1} where we have the MD a →→ b:
R_{1}  
a  b 

1  2 
1  3 
2  2 
…  … 
Because we can have many bs to each a, we can not generate data to violate the MD.
Now consider the following relation R_{2}:
R_{3}  
a  b  c 

1  2  6 
1  3  7 
2  2  6 
…  …  … 
In this case, a →→ b is nontrivial, because the relation is threeway.
Nontrivial MDs come in pairs[edit]
In fact, a →→ b has a partner MD which is also nontrivial: a →→ c
Here are the listings of all nontrivial pairs of MDs in R_{2}





Compare this to examples of trivial MDs for R_{2}:
a →→ bc 
ab →→ c 
b →→ ac 
bc →→ a 
c →→ ab 
ca →→ b 
… 
Implications of FDs do not hold for MDs[edit]
In FDs, we know
However, this same implication does not exist for MDs. That is,
MDs are actually about independence[edit]
Given the MDs
 a →→ b
 and
 a →→ c
We know that, for each a
 there can be many a
 there can be many a
 and, bs are independent of cs with respect to a, which we represent symbolically as [NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]
Every FD is an MD[edit]
By definition, X → Y ⇒ X →→ Y. If X has at most one Y, then it satisfies the condition of having many Y.
For example, given the following relation
a  b 

…  … 
and an FD of a → b, then we also have a trivial MD of a →→ b.
Now, consider the FD a → b with following relation:
a  b  c  d 

…  …  …  … 
In this case, we actually have two nontrivial MDs:
a →→ b 
a →→ cd 
(Remember, nontrivial MDs come in pairs.)