Data Management in Bioinformatics/Data Querying

From Wikibooks, open books for an open world
Jump to: navigation, search
Chapter Navigation
Top E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages

Data Querying[edit]

Query Languages[edit]

  • SQL (Structured Query Language ). SQL is a Data Definition and Manipulation Language (both DDL and DML)
  • RA (Relational Algebra)
  • Datalog

Creating Tables[edit]

  • Command : CREATE TABLE

For the relations:

Gene(gid, name, annotation)

Experiment(eid, name,date)

Microarray(gid, eid, exprlevel)

Tables can be created using the following set of commands in SQL:

CREATE TABLE Gene( gid INTEGER, name VARCHAR(20), annotation  
VARCHAR(50), PRIMARY KEY (gid));
 
CREATE TABLE Experiment(eid INTEGER, name VARCHAR(10), DATE  
DATE, PRIMARY KEY (eid));
 
CREATE TABLE Microarray (a INTEGER, b INTEGER, exprlevel REAL,  
PRIMARY KEY (gid, eid));

A primary key is used to uniquely identify each row in a table. It can either be part of the actual record itself, or it can be an artificial field. SQL will automatically index the table based on the primary key. Indexing with other keys (other than the primary key) is also possible; it can be done by using the CREATE INDEX command.

Note that the primary key for the "Microarray" table is composed of keys from the other two tables i.e. Gene and Experiment. These fields are called Foreign Keys. It is important to reference where the table gets its primary key from. The syntax is as follows.

CREATE TABLE Microarray (a INTEGER, b INTEGER, exprlevel REAL,  
PRIMARY KEY (gid, eid), FOREIGN KEY (a) REFERENCES Gene 
(gid), FOREIGN KEY (b) Experiment (eid) );

These types of constraints are often called Referential Integrity Constraints.

Storing Data[edit]

  • Command : INSERT INTO

Example:

INSERT INTO Gene VALUES(1, "1433E", "enzyme binding");

Most database management systems have the capability to load bulk data at once.

Querying[edit]

Projection[edit]

  1. To display all the data stored in the Gene table
    • SQL
      SELECT * 
      FROM Gene;
      
    • RA Gene
    • Datalog Gene(x,y,z)
  2. To display the names of all genes in the Gene table
    • SQL
      SELECT name
      FROM Gene;
      
    • RA \pi_{name}(Gene)
    • Datalog Answer(x) \leftarrow Gene(y,x,z)
    • Note: to list out only distinct values use SELECT DISTINCT in SQL. RA and Datalog return distinct values by default.

  3. To display gene names that meet certain criterion
    • SQL
      SELECT * 
      FROM Gene
      WHERE annotation LIKE '%phosphates%';
      
    • RA  \sigma_{annotation='%phosphate%'}(Gene)
    • Datalog Gene(x,y,z), z='%phosphates%'

Cartesian Product[edit]

Definition : The Cartesian product of two tables of size nxa and mxb is a table having n*m rows and a+b columns

  • SQL
    SELECT * 
    FROM Gene, Experiment;
    
  • RA Gene \times Experiment
  • Datalog Answer(x,y,z, a, b,c) \leftarrow Gene (x,y,z), Experiment (a,b,c)

Join[edit]

Example 1[edit]

  • Table
Genes
Expression
Experiments
gid name anno
... ... ...
... ... ...
... ... ...
gid eid explevel
... ... ...
... ... ...
... ... ...
eid name date
... ... ...
... ... ...
... ... ...
  • E-R Diagram

File:Cs5634-ER-pic1.gif

  • SQL
    SELECT Genes.*, Expts.*, exp_level
      FROM Genes, Expression, Expts
     WHERE Genes.gid = Expression.gid
       AND Expression.eid = Expts.gid;
    
  • RA
     Genes ?? Expression ?? Expts
    
     (A ?? B) ?? C = A ?? (B ?? C)
    
  • Datalog
     Answer(x, y, t, a, b, w) ← Genes(x, y, z), Expression(x, t, w), Expts(t, a, b)
    
     Genes' x match Expression's x
     Expts' t match Expression's t
    

Example 2[edit]

Question: ...

Expression
gid eid explevel
... ... ...
... ... ...
... ... ...
  • RA CannotbeMax = ... Allpairs = ... Allpairs - CannotbeMax
  • SQL
    CREATE VIEW CannotbeMax(gid, eid) AS
        SELECT E1.gid, E1.eid
          FROM Expression AS E1, Expression AS E2
         WHERE E1.explevel < E2.explevel;
     
    CREATE VIEW Allpairs(gid, eid) AS
        SELECT gid, eid
          FROM expression;
     
    SELECT * FROM Allpairs
    EXCEPT
    SELECT * FROM CannotbeMax;
    
    - EXCEPT
    ∪ UNION
    ∩ INTERSECT
    
  • Datalog
     CannotbeMax(x, y) <- Expression(x, y, t), Expression(a, b, w), t < w.
    
     Answer(x, y) <- Expression(x, y, a), NOT CannotbeMax(x, y).
    
  • Interesting Query[edit]

    Without MAX() operator

    • max
    - ref. Join section Example 2
    
    • min
    • second largest
    • median

    Relational Division[edit]

    A X B = C
    C / B = ?
    
    Student & Courses
    StudentId CourseID
    one CS2204
    one CS4104
    one CS6354
    two CS2204
    two CS4104
    Required For Graduation(R4G)
    CourseID
    CS2204
    CS4104
    CS6354

    Through Relational Algebra
    1. sid =\pisidStudent&Courses
    2. Ideal = sid X R4G
    3. Reality = Student&Courses
    4. Defaulters = Ideal - Reality
    5. Answer = sid - \pisidDefaulters
    Through SQL

    1.
    CREATE VIEW SID(sid)<BR>
              SELECT sid<BR>
              FROM   Student&Courses<BR>
    
    2.
    CREATE VIEW IDEAL(sid,courseId)<BR>
               SELECT sid,courseID<BR>
               FROM  SID,R4G<BR>
    
    3.
    CREATE VIEW REALITY(sid,courseId)<BR>
               SELECT *<BR>
               FROM Student&Courses<BR>
    
    4.
    CREATE VIEW Defaulters(sid,courseId)<BR>
               SELECT * <BR>
               FROM IDEAL<BR>
               EXCEPT<BR>
               SELECT *<BR>
               FROM REALITY<BR>
    
    5.
    SELECT * <BR>
      FROM SID<BR>
      EXCEPT <BR>
      SELECT sid<BR>
      FROM Defaulters <BR>
    


    Exercise:

    • Find students who satisfied graduate requirements except for 1.
    • Find students who satisfied most of the graduate requirements.


    A View is not really computed, an row most of times cannot be inserted because of ambiguities.
    SQL Injection, is a hacking technique where the input is formatted such that it executes a SQL query which is not expected.


    Bells and Whistles
    1. ORDER BY
    2. MAX,AVG,SUM,MIN
    3. GROUP BY, HAVING
    4. SELECT has a functionality of Printing