Structured Query Language/SELECT: Set Operations
Tables, views and results of SELECT commands are in somewhat similar to sets of set theory. In this comparision the elements of sets correspond to rows of tables, views and SELECT results. The differences between set theory and the itemized SQL constructs are:
- Sets of set theory do not allow duplicates whereas SQL allows duplicates. (Even different rows of one table may be identical as there is no duty to use the concept of primary keys.) In the following we use the term multiset when we speak about sets in SQL where duplicates are possible.
- Sets of set theory and multisets are not ordered. But for the result of a SELECT command we can enforce an ordering by means of the optional ORDER BY clause.
The comparision between set theory and SQL goes even further. In SQL we have operations which acts on multisets in the sense of set theory: The SQL operations UNION, INTERSECT and EXCEPT (some name it MINUS) process intermediate multisets generated by differents SELECT commands. The operations expect the multisets are of the same type. This means mainly that they must have the same number of columns. Also their data type should correlate, but this is not mandatory. If they do not, the DBMS will cast them to a common data type - if possible.
The UNION operation pushes the result of several SELECT commands together. The result of the UNION contains those values, which are in the first or in the second intermediate result.
-- Please consider that this is only one command (only ONE semicolon at the very end) SELECT firstname -- first SELECT command FROM person UNION -- push both intermediate results together to one result SELECT lastname -- second SELECT command FROM person;
This is a single SQL command. It consists of two SELECTs and one UNION operation. The SELECTs are evaluated first. Afterwards their results are pushed together to one single result. In our example the result contains all lastnames and firstnames in a single column (our example may be of limited help in praxis, it's only a demonstration for the UNION).
DISTINCT / ALL
If we examine the result closely, we will notice that it consists only of 17 values. The table person contains ten rows so that we probably expect twenty values in the result. If we perform the 'SELECT firstname ...' and 'SELECT lastname ...' as separate commands without the UNION, we receive for both commands 10 values. The explanation for the 3 missing values is the UNION command. By default UNION removes duplicates. Therefore some of the intermediate values are skipped. If we want to obtain the duplicate values we have to modify the UNION operation. Its behavior can be changed with one of the two key words DISTINCT or ALL. DISTINCT is the default and it removes duplicate values as we have seen before. ALL will retain all values including duplicates.
-- remove (that's the default) or keep duplicates SELECT ... UNION [DISTINCT | ALL] SELECT ... [ -- it is possible to 'UNION' more than 2 intermediate results UNION [DISTINCT | ALL] SELECT ... ];
A hint for Oracle users: The use of the key word DISTINCT, which is the default, is not accepted by Oracle. Omit it.
In most cases the UNION combines SELECT commands on different tables or on different columns of the same table. SELECT commands on the same column of a single table usually use the WHERE clause in combination with boolean logic.
-- A very unusual example. People apply such querys on the same table only in combination with very complex WHERE conditions. -- This example would normally expressed with a phrasing similar to: WHERE lastname IN ('de Winter', 'Goldstein'); SELECT * FROM person WHERE lastname = 'de Winter' UNION ALL SELECT * FROM person WHERE lastname = 'Goldstein';
The INTERSECT operation evaluates to those values, which are in both intermediate results, in the first as well as in the second.
-- As in our example database there is no example for the INTERSECT we insert a new person. -- This person has the same lastname 'Victor' as the first name of another person. INSERT INTO person VALUES (21, 'Paul', 'Victor', DATE'1966-04-02', 'Washington', '078-05-1121', 66); COMMIT; -- All firstnames which are used as lastname. SELECT firstname -- first SELECT command FROM person INTERSECT -- looking for common values SELECT lastname -- second SELECT command FROM person;
A hint to MySQL users: MySQL (5.5) does not support INTERSECT operation. But as it is not an elementary operation, there are workarrounds.
The EXCEPT operation evaluates to those values, which are in the first intermediate result but not in the second.
-- All firstname with the exception of 'Victor' because there is a lastname with this value. SELECT firstname -- first SELECT command FROM person EXCEPT -- are there values in the result of first SELECT but not of second? SELECT lastname -- second SELECT command FROM person;
A hint to MySQL users: MySQL (5.5) does not support the EXCEPT operation. But as it is not an elementary operation, there are workarrounds.
A hint to Oracle users: Oracle use the key word MINUS instead of EXCEPT.
-- Clean up the example database DELETE FROM person WHERE id > 10; COMMIT;
We can combine set operations with all other elements of SELECT command, in particular with ORDER BY and GROUP BY. But this may lead to some uncertainties. Therefore, we would like to explain some of the details below.
SELECT firstname -- first SELECT command FROM person UNION -- push both intermediate results together to one result SELECT lastname -- second SELECT command FROM person ORDER BY firstname;
To which part of the command belongs the ORDER BY? To the first SELECT, to the second SELECT or to the result of the UNION? The SQL rules determine, that set operations are evaluated before ORDER BY clauses (as always parenthesis can change the order of evaluation). Therefore the ORDER BY sorts the final result and not any of the intermediate results.
We rearrange the example in the hope that things get clear.
-- Equivalent semantic SELECT * FROM (SELECT firstname -- first SELECT command FROM person UNION -- push both intermediate (unnamed) results together to the next intermediate result 't' SELECT lastname -- second SELECT command FROM person ) t -- 't' is the name for the intermediate result generated by UNION ORDER BY t.firstname;
First the two SELECTS are evaluated, afterwareds the UNION. This intermediate result gets the name 't'. 't' is ordered.
Often one would like the rows from the first SELECT to be ordered independent from the rows of the second SELECT. We can do this by adding a virtual column to the result of each SELECT statement and using the virtual columns in the ORDER BY.
SELECT '1' as dummy, firstname FROM person UNION SELECT '2', lastname FROM person ORDER BY dummy, firstname;
With the GROUP BY clause things are little more complicated than with ORDER BY. The GROUP BY refers to the last SELECT or - to say it the other way round - to the SELECT of its direct level.
-- Will not work because the GROUP BY belongs to the second SELECT and not to the UNION! SELECT firstname FROM person UNION SELECT lastname FROM person GROUP BY firstname; -- -- Works, but possibly not what you want to do. -- The alias name for the (only) column of the UNION is 'firstname'. SELECT firstname FROM person UNION -- We group over the (only) column of the second SELECT, which is 'lastname' and results in 7 values SELECT lastname FROM person GROUP BY lastname; -- -- Make things clear: rearrange the query to group over the final result SELECT * FROM (SELECT firstname -- columnnames of the first SELECT determins the columnnames of the UNION FROM person UNION SELECT lastname FROM person ) t GROUP BY t.firstname; -- now we can group over the complete result
Show the lowest, highest and mean weight as a) 3 values of 1 row and b) 1 value in 3 rows.
-- 1 row SELECT min(weight), max(weight), avg(weight) FROM person; -- 3 rows SELECT min(weight) FROM person UNION SELECT max(weight) FROM person UNION SELECT avg(weight) FROM person;
Extend the previous 3-rows-solution to meet two additional criteria: a) consider only persons born in San Francisco and
b) add a virtual column to show 'Min', 'Max' and 'Avg' according to the correlating numeric values.
SELECT 'Min', min(weight) FROM person WHERE place_of_birth = 'San Francisco' UNION SELECT 'Max', max(weight) FROM person WHERE place_of_birth = 'San Francisco' UNION SELECT 'Avg', avg(weight) FROM person WHERE place_of_birth = 'San Francisco';
Extend the previous solution to order the result: the minimum value first, followed by the average and then the highest value.
-- 'ugly' solution SELECT '1 Min' AS note, min(weight) FROM person WHERE place_of_birth = 'San Francisco' UNION SELECT '3 Max' AS note, max(weight) FROM person WHERE place_of_birth = 'San Francisco' UNION SELECT '2 Avg' AS note, avg(weight) FROM person WHERE place_of_birth = 'San Francisco' ORDER BY note; -- 'clean' solution SELECT 1 AS note, 'Min', min(weight) FROM person WHERE place_of_birth = 'San Francisco' UNION SELECT 3 AS note, 'Max', max(weight) FROM person WHERE place_of_birth = 'San Francisco' UNION SELECT 2 AS note, 'Avg', avg(weight) FROM person WHERE place_of_birth = 'San Francisco' ORDER BY note;
Create a list of lastnames for persons with a weight greater than 70 kg together with
all e-mail values (one value per row). There is no concordance between lastnames and e-mails.
(This example is not very helpfull for praxis, but instructive.)
SELECT lastname FROM person WHERE weight > 70 UNION SELECT contact_value FROM contact WHERE contact_type = 'email';
In the previous example the lastname 'de Winter' is shown only once. But there are more than one persons of the family with a weight greater than 70 kg.
Extend the previous solution to show as much resulting rows as hits to the criteria.
-- Extend 'UNION' to 'UNION ALL'. The default is 'UNION DISTINCT' SELECT lastname FROM person WHERE weight > 70 UNION ALL SELECT contact_value FROM contact WHERE contact_type = 'email';