Data Management in Bioinformatics/SQL Exercises
For these questions, we will consider the following tables about microarray gene expression data. Your task is to express each of the given queries in SQL. Example data are provided in these tables for your convenience but note that they are only example data. Your queries must work for all potential data residing in the given tables, not just those illustrated here.
Genes gid | name | organism | annotation -----+---------+----------+---------------------------- g1 | YLR180C | yeast | hypothetical protein g2 | YLR181D | yeast | response to desiccation g3 | sp15 | yeast | drought stress responsive g4 | pdp77 | pine | putative stress responsive g5 | hsp70 | pine | heat shock protein g6 | hsp90 | pine | heat shock protein
Expression gid | experimentid | level | significance -----+--------------+-------+-------------- g1 | exp12 | 3.5 | 1 g2 | exp23 | -3 | 1 g3 | exp12 | 1 | 2 g3 | exp13 | -1.5 | 2 g3 | exp23 | 1.7 | 4 g4 | exp12 | 1.5 | 2 g4 | exp13 | 1.5 | 2 g4 | exp23 | 1.5 | 2 g4 | exp6 | 1.5 | 2 g5 | exp6 | 2 | 1 g5 | exp13 | 2.5 | 2 g6 | exp6 | -3.86 | 3
Experiments experimentid | name | whoperformed | date --------------+--------------------+---------------+------------ exp12 | Systematic Torture | Prof. Pain | 2004-06-02 exp23 | Heaped Abuse | Tommy Student | 2004-06-03 exp13 | Salt Stress | Gasch | 1998-07-04 exp6 | Sorbitol Exposure | Gasch | 1999-07-05
Membership gid | category -----+-------------------- g1 | glutathione g2 | antioxidant g3 | glycine binding g1 | amino acid binding g4 | amino acid binding g5 | amino acid binding g6 | binding
GOTree
category | parent_category
--------------------+--------------------
antioxidant | molecular function
binding | molecular function
glutathione | antioxidant
glycine binding | amino acid binding
amino acid binding | binding
Contents |
[edit] Question 1
Find the names of experiments performed by Prof. Pain after Jan 1, 2004.
[edit] Answer
SELECT name FROM Experiments WHERE whoperformed = 'Prof. Pain' AND DATE > '2004-01-01';
[edit] Question 2
Find the names of genes that were either positively expressed twofold or more with a significance of at least 1.0, in some experiment, or negatively expressed twofold or less with a significance of at least 1.0, in some experiment. List them alongside their organisms in a two-column format.
[edit] Answer
SELECT Genes.gid, name, level, significance FROM Expression, Genes WHERE Expression.gid = Genes.gid AND significance >= 1.0 AND (level >= 2.0 OR level <= 2.0);
[edit] Question 3
What is the grandparent category of 'glycine binding'?
[edit] Answer
SELECT Parents.parent_category FROM GOTree AS Children, GOTree AS Parents WHERE Children.category = 'glycine binding' AND Children.parent_category = Parents.category;
[edit] Question 4
Find the names of experiments that were performed before some Gasch experiment.
[edit] Straightforward Answer
SELECT E1.name FROM Experiments AS E1, Experiments AS E2 WHERE E1.DATE < E2.DATE AND E2.whoperformed = 'Gasch';
[edit] Subquery Answer
SELECT name FROM Experiments WHERE Experiments.DATE < ( SELECT MAX(DATE) FROM Experiments WHERE whoperformed = 'Gasch' );
[edit] Question 5
Find the names of pine genes that were positively expressed more than 0.5-fold (with a significance of 1.0 or more) in at least two experiments.
[edit] Straightforward Answer
First, we must find the experiments where genes are upreglated and significant.
CREATE VIEW Upregulated AS SELECT gid, experimentid FROM Expression WHERE significance >= 1.0 AND level >= 0.5;
Next, we must determine the genes which were upregulated in at least two experiments. We do this by taking the product of the upregulated genes and selecting rows where the gene ID is the same but the experiment ID is different.
CREATE VIEW UpInTwoOrMore AS SELECT DISTINCT U1.gid AS gid FROM Upregulated AS U1, Upregulated AS U2 WHERE U1.gid = U2.gid AND U1.experimentid <> U2.experimentid;
Finally, we determine which of these genes come from pine, and project their names.
SELECT name FROM Genes, UpInTwoOrMore WHERE Genes.gid = UpInTwoOrMore.gid AND organism = 'pine';
We can alternatively do these steps all in one single query:
SELECT DISTINCT name FROM Genes, Expression AS E1, Expression AS E2 WHERE Genes.gid = E1.gid AND E1.gid = E2.gid AND E1.level >= 0.5 AND E2.level >= 0.5 AND E1.significance >= 1.0 AND E2.significance >= 1.0 AND E1.experimentid <> E2.experimentid AND organism = 'pine';
[edit] Subquery Answer
As another approach, we can make use of subqueries to find the answer. The key to this is to make a correlated subquery where the subquery depends on some property (in this case the gene ID) of the parent query. Note that we'll still need to make use of the Upregulated view created above to reduce code redundancy.
SELECT DISTINCT name FROM Genes, Upregulated AS U1 WHERE Genes.gid = U1.gid AND organism = 'pine' AND U1.gid = ( SELECT DISTINCT U1.gid FROM Upregulated AS U2 WHERE U1.gid = U2.gid AND U1.experimentid <> U2.experimentid );
[edit] GROUP BY Answer
We can make use of advanced features for certain database systems such as PostgreSQL and MySQL to make these queries in slightly more convenient ways via the GROUP BY and COUNT.
CREATE VIEW UpInTwoOrMore AS SELECT gid FROM Expression WHERE level >= 0.5 AND significance >= 1.0 GROUP BY gid HAVING COUNT(*) > 1; SELECT name FROM Genes, UpInTwoOrMore WHERE Genes.gid = UpInTwoOrMore.gid AND organism = 'pine';
[edit] Question 6
Find the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at least three experiments.
[edit] Straightforward Answer
Similar to the answer for question 5. The caveat here is that while the equality evaluations are transitive, while inequality evaluations are not, and so every case must be covered.
CREATE VIEW UpInThreeOrMore AS SELECT DISTINCT U1.gid AS gid FROM Upregulated AS U1, Upregulated AS U2, Upregulated AS U3 WHERE U1.gid = U2.gid AND U1.gid = U3.gid AND U1.experimentid <> U2.experimentid AND U1.experimentid <> U3.experimentid AND U2.experimentid <> U3.experimentid; SELECT name FROM Genes, UpInThreeOrMore WHERE Genes.gid = UpInThreeOrMore.gid AND organism = 'pine';
Alternatively:
SELECT DISTINCT name FROM Genes, Expression AS E1, Expression AS E2, Expression AS E3 WHERE Genes.gid = E1.gid AND E1.gid = E2.gid AND E1.gid = E3.gid AND E1.level >= 0.5 AND E2.level >= 0.5 AND E3.level >= 0.5 AND E1.significance >= 1.0 AND E2.significance >= 1.0 AND E3.significance >= 1.0 AND E1.experimentid <> E2.experimentid AND E1.experimentid <> E3.experimentid AND E2.experimentid <> E3.experimentid AND organism = 'pine';
[edit] Subquery Answer
We need to build in another correlated subquery for our original correlated subquery to make this work.
SELECT DISTINCT name FROM Genes, Upregulated AS U1 WHERE Genes.gid = U1.gid AND organism = 'pine' AND U1.gid = ( SELECT DISTINCT U1.gid FROM Upregulated AS U2 WHERE U1.gid = U2.gid AND U1.gid = ( SELECT DISTINCT U1.gid FROM Upregulated AS U3 WHERE U1.gid = U2.gid AND U1.gid = U3.gid AND U1.experimentid <> U2.experimentid AND U1.experimentid <> U3.experimentid AND U2.experimentid <> U3.experimentid ) );
[edit] GROUP BY Answer
Simply adjust the count evaluation.
CREATE VIEW UpInThreeOrMore AS SELECT gid FROM Expression WHERE level >= 0.5 AND significance >= 1.0 GROUP BY gid HAVING COUNT(*) > 2; SELECT name FROM Genes, UpInThreeOrMore WHERE Genes.gid = UpInThreeOrMore.gid AND organism = 'pine';
[edit] Question 7
Find the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at exactly two experiments.
[edit] Straightforward Answer
The key here is identifying that taking the set of genes upregulated in two or more experiments and subtracting the set of genes upregulated in three or mor experiments gives the set of genes upregulated in precisely two experiments. Thus, our answer is the answer to question 5 subtracted by the answer to question 6.
CREATE VIEW UpInTwo AS SELECT * FROM UpInTwoOrMore EXCEPT SELECT * FROM UpInThreeOrMore; SELECT name FROM Genes, UpInTwo WHERE Genes.gid = UpInTwo.gid AND organism = 'pine';
Or alternatively:
SELECT DISTINCT name FROM Genes, Expression AS E1, Expression AS E2 WHERE Genes.gid = E1.gid AND E1.gid = E2.gid AND E1.level >= 0.5 AND E2.level >= 0.5 AND E1.significance >= 1.0 AND E2.significance >= 1.0 AND E1.experimentid <> E2.experimentid AND organism = 'pine' EXCEPT SELECT DISTINCT name FROM Genes, Expression AS E1, Expression AS E2, Expression AS E3 WHERE Genes.gid = E1.gid AND E1.gid = E2.gid AND E1.gid = E3.gid AND E1.level >= 0.5 AND E2.level >= 0.5 AND E3.level >= 0.5 AND E1.significance >= 1.0 AND E2.significance >= 1.0 AND E3.significance >= 1.0 AND E1.experimentid <> E2.experimentid AND E1.experimentid <> E3.experimentid AND E2.experimentid <> E3.experimentid AND organism = 'pine';
[edit] Subquery Answer
SELECT DISTINCT name FROM Genes, Upregulated AS U1 WHERE Genes.gid = U1.gid AND organism = 'pine' AND U1.gid = ( SELECT DISTINCT U1.gid FROM Upregulated AS U2 WHERE U1.gid = U2.gid AND U1.experimentid <> U2.experimentid ) EXCEPT SELECT DISTINCT name FROM Genes, Upregulated AS U1 WHERE Genes.gid = U1.gid AND organism = 'pine' AND U1.gid = ( SELECT DISTINCT U1.gid FROM Upregulated AS U2 WHERE U1.gid = U2.gid AND U1.gid = ( SELECT DISTINCT U1.gid FROM Upregulated AS U3 WHERE U1.gid = U2.gid AND U1.gid = U3.gid AND U1.experimentid <> U2.experimentid AND U1.experimentid <> U3.experimentid AND U2.experimentid <> U3.experimentid ) );
[edit] GROUP BY Answer
CREATE VIEW UpInTwo AS SELECT * FROM UpInTwoOrMore EXCEPT SELECT * FROM UpInThreeOrMore; SELECT name FROM Genes, UpInTwo WHERE Genes.gid = UpInTwo.gid AND organism = 'pine';
[edit] Question 8
Find the genes that showed positive expression in every experiment recorded for it.
[edit] Question 9
What is the name of the gene that was most positively expressed in experiment exp23? Assume a minimum significance of 1.0.
[edit] Question 10
This question builds on question 8; what is the name of the gene that was "second most positively expressed"? Assume again a minimum significance of 1.0.
[edit] Question 11
Which gene(s) were positively expressed in ALL the experiments listed in the Experiments table? No constraints on significance level.
[edit] Question 12
Prepare a table of genes, their annotation, and any experiment in which they were either the highest or lowest expressed (of any significance level). Include a fourth column to say if they were the highest or lowest.
This page may need to be