Data Management in Bioinformatics/Normalization

From Wikibooks, open books for an open world
< Data Management in Bioinformatics
Jump to: navigation, search
Chapter Navigation
Top E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages

Functional Dependency (FD)[edit]

Introduction of FD[edit]

gene_id name annotation
... ... ...
g23 hsp39 XX protein
g24 hsp39 XX protein

Functional Dependency: X->Y if two tuples agree on X, then they agree on Y. e.g. "gene_id -> name" and "gid-> annotation" are FDs, but "name->gene_id" and "name->annotation" are NOT FDs.

Relationship between key and FD:

A set of attribute{A1, A2, ...At} is a key for relation R if

(1) {A1, A2, ...At} functionally determines all other attributes;

(2) when no proper subset of {A1, A2, ...At} functionally determin all other attributes, it is called superkey (or Candidate key?)

rules of FDs:

Observation:

        Correct: 
               X->Y, A->B:   XA->BY;   
               X->AY:   X->A, X->Y;
               X->Y, A->Y: XA->Y;
        Incorrect:
               XY->A: X->A, Y->A;

Formal inference rules for functional dependencies:

        Armstrong’s axioms:
               1. Reflexivity: If Y ⊆ X then X -> Y (called trivial FD);
               2. Transitivity: If X -> Y and Y -> Z then X -> Z;
               3. Augmentation: If X -> Y then XZ -> YZ where Z is a set of attributes.

Where do FDs come from?

looking either at

              (1) Domain Knowledge (the meaning of the attributes); 
              (2) the data; 
a b c
1 2 3
4 2 6

b->a is not FD, but a->b is a valid FD.

BCNF[edit]

Formal way to break tables:

gene name annotation experiment id experiment description experiment level
... ... ... ... ... ...

         R1 (gid, name, annotation);    gid->name annotation;
         R2 (exp id, exp desc);         exp id -> exp desc;
         R3 (gid, expid, exp level);   gid, expid -> exp level;



Original Relation -> find\ \ a \  FD \  X->Y \left\{
  \begin{array}{ll}
    {X, Y}, & \hbox{gid, name, annotation;} \\
    {everything\ \  except Y}, & \hbox{gid, exp id, exp desc, exp level;} 
  \end{array}
\right.



gid, exp id, exp desc, exp level \left\{
  \begin{array}{ll}
    {exp id, exp desc}, & \hbox{;} \\
    {gid, \  exp id, \ exp level }, & \hbox{;} 
  \end{array}
\right.

A relation R is in BCNF if for all dependencies X -> Y , at least one of the following holds:

• X -> Y is a trivial FD (Y ⊆ X)

• X is a superkey for R

3NF[edit]

Formal Definition: a relation R is in 3NF if for all dependencies X® Y in F+, at least one of the following holds:

• X -> Y is a trivial FD (Y ⊆ X)

• X is a superkey for R

• Y ⊂ a candidate key for R


Comparison of BCNF and 3NF[edit]

Removes
Redundancy
Preserves
FDs
Lossless
Decomposition
BCNF Yes No Yes
3NF No Yes Yes


Lossy Decomposition[edit]

Suppose we have a table such as the one below.

a b c
1 2 3
4 2 6

into the following "normalized" tables

a b
1 2
4 2
b c
2 3
2 6

This means we followed

b → a or
b → c

as our FD for the decomposition of the original table. (Remember, BCNF decomposes X → Y to the sets {X, Y} and {everything except Y}). However, notice that column b has non-unique values (two of digit 2). When we recombine these two tables in a join, we wind up with a table in which the original relationships are lost:

a b c
1 2 3
1 2 6
4 2 3
4 2 6

We declare this as a lossy decomposition, because the FDs assumed by the decomposed tables do not hold true (both b → a and b → c are false). Thus, BCNF can not remove all forms of redundancy. We will need to use another model, Multivalued Dependency (MD), to aid us in decomposing tables properly for normalization.

Multivalued Dependency (MD)[edit]

Let's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of

SID →→ car

read as, "A given student ID can have many cars."

Compare this to an FD like

SID → name

which we read as, "A given SID can have at most one name."

Suppose we also had the following MD:

SID →→ clothes

We can get the following table:

SID car clothes
1 Honda jeans
1 Toyota khakis
1 Toyota jeans
1 Honda khakis

Perhaps we want to decompose this table further. We decompose along the MD of

SID →→ car

This leaves us with {SID, clothes}, leaving us with the following tables

SID →→ car {SID, clothes}
SID car
1 Honda
1 Toyota
sid clothes
1 jeans
1 khakis

We can't actually do this, because these are not FDs! Neither SID →→ car or SID →→ clothes have a superkey. The key must be all three attributes. We need new rules for decomposing MDs.

Rules for MDs[edit]

Spotting trivial MDs[edit]

We say that an MD X →→ Y is trivial in R if XY = {all attributes} .

For example, consider the following relation R1 where we have the MD a →→ b:

R1
a b
1 2
1 3
2 2

Because we can have many bs to each a, we can not generate data to violate the MD.

Now consider the following relation R2:

R3
a b c
1 2 6
1 3 7
2 2 6

In this case, a →→ b is non-trivial, because the relation is three-way.

Non-trivial MDs come in pairs[edit]

In fact, a →→ b has a partner MD which is also non-trivial: a →→ c

Here are the listings of all non-trivial pairs of MDs in R2

a →→ b
a →→ c
b →→ a
b →→ c
c →→ a
c →→ b

Compare this to examples of trivial MDs for R2:

a →→ bc
ab →→ c
b →→ ac
bc →→ a
c →→ ab
ca →→ b

Implications of FDs do not hold for MDs[edit]

In FDs, we know X \rightarrow YZ \iff \begin{array}{c}
  X \rightarrow Y \\
  \mbox{and} \\
  X \rightarrow Z
\end{array}

However, this same implication does not exist for MDs. That is, X \rightarrow\rightarrow YZ \nLeftrightarrow \begin{array}{c}
  X \rightarrow\rightarrow Y \\
  \mbox{and} \\
  X \rightarrow\rightarrow Z
\end{array}

MDs are actually about independence[edit]

Given the MDs

a →→ b
and
a →→ c

We know that, for each a

  • there can be many a
  • there can be many a
  • and, bs are independent of cs with respect to a, which we represent symbolically as b \perp c | a [NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]

Every FD is an MD[edit]

By definition, XYX →→ Y. If X has at most one Y, then it satisfies the condition of having many Y.

For example, given the following relation

a b

and an FD of ab, then we also have a trivial MD of a →→ b.

Now, consider the FD ab with following relation:

a b c d

In this case, we actually have two non-trivial MDs:

a →→ b
a →→ cd

(Remember, non-trivial MDs come in pairs.)