Structured Query Language/SELECT: Join Operation
Data should be stored in a way, that no redundant information exists in the database. If more than one person indulge a hobby, for example, we avoid storing information about hobbies (designation, remarks, ...) within each affected persons row as this would double the hobbies information. And we avoid storing information about persons (name, ssn, ...) within each affected hobby row for the same reason. Instead we create independent person and hobby tables and point from one to the other. This technic of grouping data items redundancy-free into separate tables is called database normalization. As a result, information about complex objects is scattered across multiple tables. And we need a opportunity to reassemble the whole, original situation. This reverse technic is called a 'JOIN operation'.
In our example database there are the two tables person and contact. The contact table contains the column person_id, which correlates with the Primary Key column id of the person table. By evaluating the column values we can join contacts and persons together.
person table P
contact table C
Joined (virtual) table, created out of person and contact
As we see, Larry Goldstein, who exists only once in the real table person now occurs four times in the joined, virtual table, every time in combination with one of his four contact items. The same applies for Kim Goldstein with his two contact items.
But what is going on with Tom Burton and Lisa Hamilton, whose contact information is not available? We may have some trouble with the attempt to join their person with - not existing - contact information. For the moment we have flagged the situation with question marks. Later on this page we will explain in detail how to transform the problem into a solution.
The Basic Syntax
Obviously it's necessary to specify two things with the JOIN operation
- the names of involved tables
- the names of involved columns
The basic syntax extends the SELECT command with this two elements
SELECT <things_to_be_displayed> -- as usual FROM <tablename_1> <table_1 alias> -- a table alias JOIN <tablename_2> <table_2 alias> ON <JOIN condition> -- the join criterion ... -- optionally all the other elements of SELECT command ;
Let's take a first attempt.
SELECT * FROM person p JOIN contact c ON p.id = c.person_id;
The table names are mentioned behind the FROM (as usual) and behind the new keyword JOIN, which instructs the DBMS to perform a join operation. Next there is the ON keyword, which introduces the columnnames together with a comparision operator (or a general condition as you will see later). The columnnames are prefixed with the aliases of the tablenames p and c respectively. This is necessary, because in different tables there may exist columns with identical names like id.
When the DBMS executes the command it delivers 'something' that contains all columns of table person and all columns of table contact - amongst others the two columns id of person and id of contact. The result contains 9 rows, one per existing combination of person and contact ('existing' in the sense of the formulation behind the 'ON' keyword). Please notice that persons without any stored contact are not part of the result.
The delivered 'something' looks like a new table. And in fact it has the same structure, behaviour and data like a table. If it is created during a subselect or from a view, we can perform new SELECTs on it. But there is an important difference to tables: The data is not stored in the DBMS at its own place, the data is computed at runtime out of the values of real tables.
The key feature of assembling complex information out of simple tables is possible by using the two simple keywords JOIN and ON. At the same time - as you will see - the syntax is extendable to manage very complex querys, so that there may be a great number of additional details in specifying your join criterias. People sometimes gets confused about the gap between their intention and some unexpected results. If you run into a problem of that kind, try to reduce your query to the shown simple form. Confusion often results from the fact, that the JOIN syntax itself may be complex due to a lot of details. Additionally joining can be combined with all other syntactical elements of the SELECT command, which also may lead to lack of clarity.
The combination of the join syntax with other language elements is shown in the following examples.
-- -- show only important columns SELECT p.firstname, p.lastname, c.contact_type AS "Kind of Contact", c.contact_value AS "Call Number" FROM person p JOIN contact c ON p.id = c.person_id; -- show only desired rows SELECT p.firstname, p.lastname, c.contact_type AS "Kind of Contact", c.contact_value AS "Call Number" FROM person p JOIN contact c ON p.id = c.person_id WHERE c.contact_type IN ('fixed line', 'mobile'); -- apply any sort order SELECT p.firstname, p.lastname, c.contact_type AS "Kind of Contact", c.contact_value AS "Call Number" FROM person p JOIN contact c ON p.id = c.person_id WHERE c.contact_type IN ('fixed line', 'mobile') ORDER BY p.lastname, p.firstname, c.contact_type DESC; -- use functions: min() / max() / count() SELECT COUNT(*) FROM person p JOIN contact c ON p.id = c.person_id WHERE c.contact_type IN ('fixed line', 'mobile'); -- JOIN a table with itself. Example: Search different persons with same lastname SELECT p1.id, p1.firstname, p1.lastname, p2.id, p2.firstname, p2.lastname FROM person p1 JOIN person p2 ON p1.lastname = p2.lastname -- for second incarnation of person we must use a different alias WHERE p1.id != p2.id -- sorting of p2.lastname is not neccessary as it is identical to the already sorted p1.lastname ORDER BY p1.lastname, p1.firstname, p2.firstname; -- JOIN more than two tables. Example: contact information of different persons with same lastname SELECT p1.id, p1.firstname, p1.lastname, p2.id, p2.firstname, p2.lastname, c.contact_type, c.contact_value FROM person p1 JOIN person p2 ON p1.lastname = p2.lastname JOIN contact c ON p2.id = c.person_id -- contact info from person2. p1.id would lead to person1 WHERE p1.id != p2.id ORDER BY p1.lastname, p1.firstname, p2.lastname;
Four Join Types
At the beginning of this page we have seen a strange situation where no corresponding rows exist. We have inserted a question mark into this cells. With the basic syntax of the JOIN operation all rows, where those question marks would appear, are not part of the result. This kind of JOIN is called an INNER join. Addionally there are three kinds of so called OUTER joins. An OUTER join likewise contains all the corresponding rows as an INNER join, but is supplemented with rows with no counterpart in one or both of the two tables: LEFT OUTER, RIGHT OUTER and FULL OUTER.
So we can widen the basic JOIN syntax to the four possibilities:
- [INNER] JOIN
- LEFT [OUTER] JOIN
- RIGHT [OUTER] JOIN
- FULL [OUTER] JOIN
Keywords surrounded by [ ] are optional. Omitting the INNER keyword leads to the basic syntax we have seen in the previous paragraph.
In most cases, people use the inner join. As we have seen previously the result in this version contains exactly those rows, which achievs exactly the criterion specified behind the ON. As a first example we create a list of persons and their contacts.
-- A list of persons and their contacts SELECT p.firstname, p.lastname, c.contact_type, c.contact_value FROM person p JOIN contact c ON p.id = c.person_id -- identical meaning: INNER JOIN ... ORDER BY p.lastname, p.firstname, c.contact_type DESC, c.contact_value;
The important thing is that persons without any contact information are not part of the result.
Left (outer) Join
Sometimes we need a little more, for example a list of all persons, which contains - as an additional information about the person - their contact information. Please consider the difference to the above example. The list should contain all persons, also such without contact information.
-- A list of ALL persons plus their contacts SELECT p.firstname, p.lastname, c.contact_type, c.contact_value FROM person p LEFT JOIN contact c ON p.id = c.person_id -- identical meaning: LEFT OUTER JOIN ... ORDER BY p.lastname, p.firstname, c.contact_type DESC, c.contact_value;
Obviously the result contains all rows just like in the above inner join. Additionally there are rows for persons which have no contact information. As they do not have those information the values for the contact columns are not known. The DBMS fills them with the so called 'null value' (which is different from the string 'null value' and different from the binary 0 - and for SQL users it's waste to know how it is implemented). You will find more informations about null values and how to retrieve them in a later chaper.
In summary the left (outer) join is an inner join plus one row for each row without a counterpart.
Consider the word 'left'. It indicates that the evaluation of the formulation "FROM <table_1> LEFT JOIN <table_2>" is done from left to right. All rows of the left listed table (here: table_1) are included in the result, they may have a counterpart in table_2 or not.
SELECT p.firstname, p.lastname, c.contact_type, c.contact_value FROM contact c LEFT JOIN person p ON p.id = c.person_id -- identical meaning: LEFT OUTER JOIN ... ORDER BY p.lastname, p.firstname, c.contact_type DESC, c.contact_value;
What makes the difference? We have changed the order of tablenames. As contact is now mentioned on the left side, it is evaluated first (all the contact rows go to the result - plus their according person information). As in our example database there is no contact without a person, the result is the same as with an inner join. But it's different from the first example of the left join paragraph.
Right (outer) Join
The right join obeys the same rules as the left join. First all joining rows goes to the result, supplemented by all rows with no corresponding rows in the other table. The DBMS fills all missing column values with the null value. The only difference is that the evaluation sequence of tables takes place from right to left. In a formulation like "FROM <table_1> RIGHT JOIN <table_2>" all rows of table_2 are included in the result, they may have a counterpart in table_1 or not.
-- A list of ALL contacts plus according persons SELECT p.firstname, p.lastname, c.contact_type, c.contact_value FROM person p RIGHT JOIN contact c ON p.id = c.person_id -- identical meaning: RIGHT OUTER JOIN ... ORDER BY p.lastname, p.firstname, c.contact_type DESC, c.contact_value;
In our example database their is no contact without a person, so the result of this query is identical to an inner join. In general a right join is identical to a left join with reverse order of table names, see our 'another example' above.
Full (outer) Join
A full join combines the proceedings of left and right join such that the result contains all corresponding rows plus rows of left table without match in right table plus rows of right table without match in left table.
SELECT p.firstname, p.lastname, c.contact_type, c.contact_value FROM person p FULL JOIN contact c ON p.id = c.person_id -- identical meaning: FULL OUTER JOIN ... ORDER BY p.lastname, p.firstname, c.contact_type DESC, c.contact_value;
Suppose there are two tables table_1 and table_2.
The full join
SELECT * FROM table_1 t1 FULL JOIN table_2 t2 ON t1.id = t2.table_1_id;
will result in
The result contains the (single) matching row plus all other rows, which are extended by null values in the missing cells.
Hint: The full join is not supported by all DBMS. But as it isn't an atomic operation it is always possible to create the desired result by a combination of multiple SELECTs with SET operations.
In the case of the inner join version of the command we can omit the ON part. SQL interprets this situation as a request to combine every row of the left table with every row of the right table. The result will contain a large number of rows: number of rows within left table multiplied by number of rows within right table.
This special kind of an inner join is called a carthesian product, see also: carthesian product in german Wiki. The carthesian product is an elementary operation of the relationa algebra, which is the foundation for all rDBMS implementations.
-- all persons combined with all contacts (some implementations insists in replacing the -- keyword 'JOIN' by an comma) SELECT p.firstname, p.lastname, c.contact_type, c.contact_value FROM person p JOIN contact c -- missing ON keyword: p X c will be created ORDER BY p.lastname, p.firstname, c.contact_type DESC, c.contact_value; -- count the arising rows SELECT COUNT(*) FROM person p JOIN contact c;
Be carefull: If we (falsely) forget the ON term we will receive a large number of resulting rows. If, for example, the first table contains 10.000 rows and the second one 20.000 rows, we will receive 200 million rows.
The n:m Situation
How can we create a list of persons and their hobbies? Remember: one person may run many hobbies and several persons may run the same hobby. So there is no direct connection from persons to hobbies. Between the two tables we have created a third one person_hobby. It holds the id of persons as well as the id of hobbies.
We have to 'walk' from person to person_hobby and from there to hobby.
-- persons combined with their hobbies SELECT p.id p_id, p.firstname, p.lastname, h.hobbyname, h.id h_id FROM person p JOIN person_hobby ph ON p.id = ph.person_id JOIN hobby h ON ph.hobby_id = h.id ORDER BY p.lastname, p.firstname, h.hobbyname;
Please notice that no column of the table person_hobby goes to the result. This table acts only during intermediate execution steps. Even its column id is not of interest.
Some people do not perform a hobby. As we performed an INNER JOIN they are not part of the above list. If we want to see in the list also persons without hobbies, we must do what we have done before: use LEFT OUTER JOINs instead of INNER JOINs.
-- ALL persons plus their hobbies (if present) SELECT p.id p_id, p.firstname, p.lastname, h.hobbyname, h.id h_id FROM person p LEFT JOIN person_hobby ph ON p.id = ph.person_id LEFT JOIN hobby h ON ph.hobby_id = h.id ORDER BY p.lastname, p.firstname, h.hobbyname;
Hint: If neccessary we can combine every kind of join with every other kind of join in every desired sequence, eg: LEFT OUTER with FULL OUTER with INNER ... .
Criterias for join operations are not restricted to the usual formulation:
SELECT ... FROM table_1 t1 JOIN table_2 t2 ON t1.id = t2.fk ...
First, we can use any column, not only primary key and foreign key columns. In one of the above examples we used the lastname for a join. Lastname is of type character and has no meaning of any key. To avoid poor peformance some DBMS restrict the use of columns to those having an index.
Second, the comparator is not restricted to the equal sign. We can use any sensfull operator, for example the 'greater than' for numeric values.
-- Which person has the greater body weight - restricted to 'de Winter' for clarity SELECT p1.id, p1.firstname AS "is heavier", p1.weight, p2.id, p2.firstname AS "than", p2.weight FROM person p1 JOIN person p2 ON p1.weight > p2.weight WHERE p1.lastname = 'de Winter' AND p2.lastname = 'de Winter' ORDER BY p1.weight DESC, p2.weight DESC;
Third, we can use an arbitrary function.
-- short lastnames vs. long lastnames SELECT p1.firstname, p1.lastname AS "shorter lastname", p2.firstname, p2.lastname FROM person p1 JOIN person p2 ON LENGTH(p1.lastname) < LENGTH(p2.lastname) -- likewise ORDER BY can use functions ORDER BY LENGTH(p1.lastname), LENGTH(p2.lastname);
Show first- and lastname plus icq number for persons having an icq number
SELECT p.id, p.firstname, p.lastname, c.contact_value FROM person p JOIN contact c ON p.id = c.person_id WHERE c.contact_type = 'icq';
Show first- and lastname plus icq number plus fixed line number for persons having an icq number AND a fixed line. You need to join the contact table twice.
SELECT p.id, p.firstname, p.lastname, c1.contact_value AS icq, c2.contact_value AS "fixed line" -- looks like previous, but is different FROM person p JOIN contact c1 ON p.id = c1.person_id JOIN contact c2 ON p.id = c2.person_id -- it's a second (virtual) incarnation of contact table WHERE c1.contact_type = 'icq' -- from first incarnation AND c2.contact_type = 'fixed line'; -- from second incarnation -- In this example of an INNER JOIN we can convert the WHERE part to an additional JOIN criterion. -- This may clarify the intention of the command. But be carefull: This shifting in combination with -- one of the OUTER JOINs may lead to different results. SELECT p.id, p.firstname, p.lastname, c1.contact_value AS icq, c2.contact_value AS "fixed line" FROM person p JOIN contact c1 ON p.id = c1.person_id AND c1.contact_type = 'icq' JOIN contact c2 ON p.id = c2.person_id AND c2.contact_type = 'fixed line';
Show first- and lastname plus (if present) the icq number for ALL persons
-- To retrieve ALL persons it's neccessary to use a LEFT join. -- But the first approach is not what we expect! In this example the LEFT JOIN is evaluated first -- and creates an intermediate table with null-values in contact_type (eliminate the -- WHERE clause to see this intermediate result). These rows and all other except the -- one with 'icq' are then thrown away by evaluating the WHERE clause. SELECT p.id, p.firstname, p.lastname, c.contact_value FROM person p LEFT JOIN contact c ON p.id = c.person_id WHERE c.contact_type = 'icq'; -- It's neccessary to formulate the search criterion as part of the JOIN. Unlike with -- the INNER JOIN in the previous example with (LEFT/FULL/RIGHT) OUTER JOINs it is not possible -- to shift it to the WHERE clause. SELECT p.id, p.firstname, p.lastname, c.contact_value FROM person p LEFT JOIN contact c ON p.id = c.person_id AND c.contact_type = 'icq';
Create a list which contains ALL hobbies plus according persons (if present)
SELECT p.id p_id, p.firstname, p.lastname, h.hobbyname, h.id h_id FROM person p RIGHT JOIN person_hobby ph ON p.id = ph.person_id RIGHT JOIN hobby h ON ph.hobby_id = h.id ORDER BY h.hobbyname, p.lastname, p.firstname;
Is it possible that one of the three outer joins contains fewer rows than the corresponding inner join?
All four join types contain the same rows with column-matching-values. In addition
outer joins contain rows where column values do not match - if such a situation exists.