Data Management in Bioinformatics/Normalization

From Wikibooks, open books for an open world
< Data Management in Bioinformatics
Jump to: navigation, search
Chapter Navigation
Top E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages

Contents

[edit] Functional Dependency (FD)

[edit] Introduction of FD

gene_id name annotation
... ... ...
g23 hsp39 XX protein
g24 hsp39 XX protein

Functional Dependency: X->Y if two tuples agree on X, then they agree on Y. e.g. "gene_id -> name" and "gid-> annotation" are FDs, but "name->gene_id" and "name->annotation" are NOT FDs.

Relationship between key and FD:

A set of attribute{A1, A2, ...At} is a key for relation R if

(1) {A1, A2, ...At} functionally determines all other attributes;

(2) when no proper subset of {A1, A2, ...At} functionally determin all other attributes, it is called superkey (or Candidate key?)

rules of FDs:

Observation:

        Correct: 
               X->Y, A->B:   XA->BY;   
               X->AY:   X->A, X->Y;
               X->Y, A->Y: XA->Y;
        Incorrect:
               XY->A: X->A, Y->A;

Formal inference rules for functional dependencies:

        Armstrong’s axioms:
               1. Reflexivity: If Y ⊆ X then X -> Y (called trivial FD);
               2. Transitivity: If X -> Y and Y -> Z then X -> Z;
               3. Augmentation: If X -> Y then XZ -> YZ where Z is a set of attributes.

Where do FDs come from?

looking either at

              (1) Domain Knowledge (the meaning of the attributes); 
              (2) the data; 
a b c
1 2 3
4 2 6

b->a is not FD, but a->b is a valid FD.

[edit] BCNF

Formal way to break tables:

gene name annotation experiment id experiment description experiment level
... ... ... ... ... ...

         R1 (gid, name, annotation);    gid->name annotation;
         R2 (exp id, exp desc);         exp id -> exp desc;
         R3 (gid, expid, exp level);   gid, expid -> exp level;



Original Relation -> find\ \ a \  FD \  X->Y \left\{
  \begin{array}{ll}
    {X, Y}, & \hbox{gid, name, annotation;} \\
    {everything\ \  except Y}, & \hbox{gid, exp id, exp desc, exp level;} 
  \end{array}
\right.



gid, exp id, exp desc, exp level \left\{
  \begin{array}{ll}
    {exp id, exp desc}, & \hbox{;} \\
    {gid, \  exp id, \ exp level }, & \hbox{;} 
  \end{array}
\right.

A relation R is in BCNF if for all dependencies X -> Y , at least one of the following holds:

• X -> Y is a trivial FD (Y ⊆ X)

• X is a superkey for R

[edit] 3NF

Formal Definition: a relation R is in 3NF if for all dependencies X® Y in F+, at least one of the following holds:

• X -> Y is a trivial FD (Y ⊆ X)

• X is a superkey for R

• Y ⊂ a candidate key for R


[edit] Comparison of BCNF and 3NF

Removes
Redundancy
Preserves
FDs
Lossless
Decomposition
BCNF Yes No Yes
3NF No Yes Yes


[edit] Lossy Decomposition

Suppose we have a table such as the one below.

a b c
1 2 3
4 2 6

into the following "normalized" tables

a b
1 2
4 2
b c
2 3
2 6

This means we followed

b → a or
b → c

as our FD for the decomposition of the original table. (Remember, BCNF decomposes X → Y to the sets {X, Y} and {everything except Y}). However, notice that column b has non-unique values (two of digit 2). When we recombine these two tables in a join, we wind up with a table in which the original relationships are lost:

a b c
1 2 3
1 2 6
4 2 3
4 2 6

We declare this as a lossy decomposition, because the FDs assumed by the decomposed tables do not hold true (both b → a and b → c are false). Thus, BCNF can not remove all forms of redundancy. We will need to use another model, Multivalued Dependency (MD), to aid us in decomposing tables properly for normalization.

[edit] Multivalued Dependency (MD)

Let's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of

SID →→ car

read as, "A given student ID can have many cars."

Compare this to an FD like

SID → name

which we read as, "A given SID can have at most one name."

Suppose we also had the following MD:

SID →→ clothes

We can get the following table:

SID car clothes
1 Honda jeans
1 Toyota khakis
1 Toyota jeans
1 Honda khakis

Perhaps we want to decompose this table further. We decompose along the MD of

SID →→ car

This leaves us with {SID, clothes}, leaving us with the following tables

SID →→ car {SID, clothes}
SID car
1 Honda
1 Toyota
sid clothes
1 jeans
1 khakis

We can't actually do this, because these are not FDs! Neither SID →→ car or SID →→ clothes have a superkey. The key must be all three attributes. We need new rules for decomposing MDs.

[edit] Rules for MDs

[edit] Spotting trivial MDs

We say that an MD X →→ Y is trivial in R if XY = {all attributes} .

For example, consider the following relation R1 where we have the MD a →→ b:

R1
a b
1 2
1 3
2 2

Because we can have many bs to each a, we can not generate data to violate the MD.

Now consider the following relation R2:

R3
a b c
1 2 6
1 3 7
2 2 6

In this case, a →→ b is non-trivial, because the relation is three-way.

[edit] Non-trivial MDs come in pairs

In fact, a →→ b has a partner MD which is also non-trivial: a →→ c

Here are the listings of all non-trivial pairs of MDs in R2

a →→ b
a →→ c
b →→ a
b →→ c
c →→ a
c →→ b

Compare this to examples of trivial MDs for R2:

a →→ bc
ab →→ c
b →→ ac
bc →→ a
c →→ ab
ca →→ b

[edit] Implications of FDs do not hold for MDs

In FDs, we know X \rightarrow YZ \iff \begin{array}{c}
  X \rightarrow Y \\
  \mbox{and} \\
  X \rightarrow Z
\end{array}

However, this same implication does not exist for MDs. That is, X \rightarrow\rightarrow YZ \nLeftrightarrow \begin{array}{c}
  X \rightarrow\rightarrow Y \\
  \mbox{and} \\
  X \rightarrow\rightarrow Z
\end{array}

[edit] MDs are actually about independence

Given the MDs

a →→ b
and
a →→ c

We know that, for each a

  • there can be many a
  • there can be many a
  • and, bs are independent of cs with respect to a, which we represent symbolically as b \perp c | a [NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]

[edit] Every FD is an MD

By definition, XYX →→ Y. If X has at most one Y, then it satisfies the condition of having many Y.

For example, given the following relation

a b

and an FD of ab, then we also have a trivial MD of a →→ b.

Now, consider the FD ab with following relation:

a b c d

In this case, we actually have two non-trivial MDs:

a →→ b
a →→ cd

(Remember, non-trivial MDs come in pairs.)

Personal tools
Namespaces
Variants
Actions
Navigation
Community
Toolbox
Sister projects
Print/export