Data Management in Bioinformatics/SQL Exercises

From Wikibooks, open books for an open world
Jump to: navigation, search

For these questions, we will consider the following tables about microarray gene expression data. Your task is to express each of the given queries in SQL. Example data are provided in these tables for your convenience but note that they are only example data. Your queries must work for all potential data residing in the given tables, not just those illustrated here.

Genes
 gid |  name   | organism |         annotation         
-----+---------+----------+----------------------------
 g1  | YLR180C | yeast    | hypothetical protein
 g2  | YLR181D | yeast    | response to desiccation
 g3  | sp15    | yeast    | drought stress responsive
 g4  | pdp77   | pine     | putative stress responsive
 g5  | hsp70   | pine     | heat shock protein
 g6  | hsp90   | pine     | heat shock protein
Expression
 gid | experimentid | level | significance 
-----+--------------+-------+--------------
 g1  | exp12        |   3.5 |            1
 g2  | exp23        |    -3 |            1
 g3  | exp12        |     1 |            2
 g3  | exp13        |  -1.5 |            2
 g3  | exp23        |   1.7 |            4
 g4  | exp12        |   1.5 |            2
 g4  | exp13        |   1.5 |            2
 g4  | exp23        |   1.5 |            2
 g4  | exp6         |   1.5 |            2
 g5  | exp6         |     2 |            1
 g5  | exp13        |   2.5 |            2
 g6  | exp6         | -3.86 |            3
Experiments
 experimentid |        name        | whoperformed  |    date    
--------------+--------------------+---------------+------------
 exp12        | Systematic Torture | Prof. Pain    | 2004-06-02
 exp23        | Heaped Abuse       | Tommy Student | 2004-06-03
 exp13        | Salt Stress        | Gasch         | 1998-07-04
 exp6         | Sorbitol Exposure  | Gasch         | 1999-07-05
Membership
 gid |      category      
-----+--------------------
 g1  | glutathione
 g2  | antioxidant
 g3  | glycine binding
 g1  | amino acid binding
 g4  | amino acid binding
 g5  | amino acid binding
 g6  | binding
GOTree
      category      |  parent_category   
--------------------+--------------------
 antioxidant        | molecular function
 binding            | molecular function
 glutathione        | antioxidant
 glycine binding    | amino acid binding
 amino acid binding | binding

Question 1[edit]

Find the names of experiments performed by Prof. Pain after Jan 1, 2004.

Answer[edit]

SELECT name
FROM Experiments
WHERE whoperformed = 'Prof. Pain'
AND DATE > '2004-01-01';


Question 2[edit]

Find the names of genes that were either positively expressed twofold or more with a significance of at least 1.0, in some experiment, or negatively expressed twofold or less with a significance of at least 1.0, in some experiment. List them alongside their organisms in a two-column format.

Answer[edit]

SELECT Genes.gid, name, level, significance
FROM Expression, Genes
WHERE Expression.gid = Genes.gid
AND significance >= 1.0
AND (level >= 2.0 OR level <= 2.0);

Question 3[edit]

What is the grandparent category of 'glycine binding'?

Answer[edit]

SELECT Parents.parent_category
FROM GOTree AS Children, GOTree AS Parents
WHERE Children.category = 'glycine binding'
AND Children.parent_category = Parents.category;

Question 4[edit]

Find the names of experiments that were performed before some Gasch experiment.

Straightforward Answer[edit]

SELECT E1.name
FROM Experiments AS E1, Experiments AS E2
WHERE E1.date < E2.date
AND E2.whoperformed = 'Gasch';

Subquery Answer[edit]

SELECT name
FROM Experiments
WHERE Experiments.date < (
    SELECT MAX(DATE)
    FROM Experiments
    WHERE whoperformed = 'Gasch'
);

Question 5[edit]

Find the names of pine genes that were positively expressed more than 0.5-fold (with a significance of 1.0 or more) in at least two experiments.

Straightforward Answer[edit]

First, we must find the experiments where genes are upreglated and significant.

CREATE VIEW Upregulated AS
SELECT gid, experimentid
FROM Expression
WHERE significance >= 1.0
AND level >= 0.5;

Next, we must determine the genes which were upregulated in at least two experiments. We do this by taking the product of the upregulated genes and selecting rows where the gene ID is the same but the experiment ID is different.

CREATE VIEW UpInTwoOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid;

Finally, we determine which of these genes come from pine, and project their names.

SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';

We can alternatively do these steps all in one single query:

SELECT DISTINCT name
FROM Genes, Expression AS E1, Expression AS E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine';

Subquery Answer[edit]

As another approach, we can make use of subqueries to find the answer. The key to this is to make a correlated subquery where the subquery depends on some property (in this case the gene ID) of the parent query. Note that we'll still need to make use of the Upregulated view created above to reduce code redundancy.

SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
    SELECT DISTINCT U1.gid
    FROM Upregulated AS U2
    WHERE U1.gid = U2.gid
    AND U1.experimentid <> U2.experimentid
);

GROUP BY Answer[edit]

We can make use of advanced features for certain database systems such as PostgreSQL and MySQL to make these queries in slightly more convenient ways via the GROUP BY and COUNT.

CREATE VIEW UpInTwoOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 1;
 
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';

Question 6[edit]

Find the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at least three experiments.

Straightforward Answer[edit]

Similar to the answer for question 5. The caveat here is that while the equality evaluations are transitive, while inequality evaluations are not, and so every case must be covered.

CREATE VIEW UpInThreeOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2, Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid;
 
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';

Alternatively:

SELECT DISTINCT name
FROM Genes, Expression AS E1, Expression AS E2, Expression AS E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';

Subquery Answer[edit]

We need to build in another correlated subquery for our original correlated subquery to make this work.

SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
    SELECT DISTINCT U1.gid
    FROM Upregulated AS U2
    WHERE U1.gid = U2.gid
    AND U1.gid = (
        SELECT DISTINCT U1.gid
        FROM Upregulated AS U3
        WHERE U1.gid = U2.gid
        AND U1.gid = U3.gid
        AND U1.experimentid <> U2.experimentid
        AND U1.experimentid <> U3.experimentid
        AND U2.experimentid <> U3.experimentid
    )
);

GROUP BY Answer[edit]

Simply adjust the count evaluation.

CREATE VIEW UpInThreeOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 2;
 
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';

Question 7[edit]

Find the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at exactly two experiments.

Straightforward Answer[edit]

The key here is identifying that taking the set of genes upregulated in two or more experiments and subtracting the set of genes upregulated in three or mor experiments gives the set of genes upregulated in precisely two experiments. Thus, our answer is the answer to question 5 subtracted by the answer to question 6.

CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
 
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';

Or alternatively:

SELECT DISTINCT name
FROM Genes, Expression AS E1, Expression AS E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine'
EXCEPT
SELECT DISTINCT name
FROM Genes, Expression AS E1, Expression AS E2, Expression AS E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';

Subquery Answer[edit]

SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
    SELECT DISTINCT U1.gid
    FROM Upregulated AS U2
    WHERE U1.gid = U2.gid
    AND U1.experimentid <> U2.experimentid
)
EXCEPT
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
    SELECT DISTINCT U1.gid
    FROM Upregulated AS U2
    WHERE U1.gid = U2.gid
    AND U1.gid = (
        SELECT DISTINCT U1.gid
        FROM Upregulated AS U3
        WHERE U1.gid = U2.gid
        AND U1.gid = U3.gid
        AND U1.experimentid <> U2.experimentid
        AND U1.experimentid <> U3.experimentid
        AND U2.experimentid <> U3.experimentid
    )
);

GROUP BY Answer[edit]

CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
 
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';

Question 8[edit]

Find the genes that showed positive expression in every experiment recorded for it.

 SELECT Expression.gid, level from Expression WHERE level>1.0;

Question 9[edit]

What is the name of the gene that was most positively expressed in experiment exp23? Assume a minimum significance of 1.0.

Question 10[edit]

This question builds on question 8; what is the name of the gene that was "second most positively expressed"? Assume again a minimum significance of 1.0.

Question 11[edit]

Which gene(s) were positively expressed in ALL the experiments listed in the Experiments table? No constraints on significance level.

Question 12[edit]

Prepare a table of genes, their annotation, and any experiment in which they were either the highest or lowest expressed (of any significance level). Include a fourth column to say if they were the highest or lowest.