Structured Query Language/SELECT: Grouping
In this chapter we will leave the level of individual rows. We strive to find informations and statements that refer to groups of rows - at the expense of information about individual rows. In the context of SQL such 'row-groups' (or sets of rows) are built by the GROUP BY clause and further processed by the HAVING clause.
First we must establish criteria according to which the rows are assigned to groups. To do so we use the content of one or more columns of the involved table(s). If the values are identical, the rows belong to the same group. Consider the lastname in table person. In our small example we can insinuate that persons with same lastname form a family. So if we strive for informations about families we should use this column as the grouping criterion. This grouping allows us to ask questions concerning whole families, such as 'Which families are there?', 'How many families exists?', 'How many persons are in each family?'. Please note that all of them are questions about the whole group (which means the family), not about single rows (which means the person).
In the SQL syntax the criterion is specified after the key word GROUP BY and consists of one or more columnnames.
SELECT ... -- as usual FROM ... -- as usual (optionally plus JOINs) GROUP BY <columnname> -- optionally more columnnames ... -- optionally other elements of SELECT command ;
Our concrete example about families looks like this:
SELECT lastname FROM person GROUP BY lastname;
The query retrieves seven 'family names' out of the 10 rows. There are several persons with lastname 'Goldstein' or 'de Winter'.
We can retrieve the same seven 'family names' by applying the key word DISTINCT in a SELECT without GROUP BY.
SELECT DISTINCT lastname FROM person; -- no GROUP BY clause
What makes the difference? The DISTINCT key word is limited to remove duplicate values. It can not initiate computations on other rows and columns of the result set. In contrast, the GROUP BY additionally arranges the intermediate received rows as a number of groups and offers the possibility to get informations about each of these groups. It is even the case that within these groups all columns are available, not only the 'criterion'-column. To confirm this statement about 'all' columns we use weight which is not the 'criterion'-column.
SELECT lastname, avg(weight) -- avg() is a function to compute the arithmetic mean of numerical values FROM person GROUP BY lastname;
The result shows the seven family names - as seen before - plus the average weight of every family. The weight of individual persons is not shown. (In groups with exactly one person the average weight of the group is of course identical to the single persons weight.)
Grouping over multiple columns
If necessary we can define the grouping over more than one column. In this case we can imagine the concatenation of the colums as the grouping rule.
-- Group over one column: place_of_birth leads to 6 resulting rows SELECT place_of_birth, count(*) FROM person GROUP BY place_of_birth; -- Group over two columns: place_of_birth plus lastname leads to 8 resulting rows with Richland and SF shown twice SELECT place_of_birth, lastname, count(*) FROM person GROUP BY place_of_birth, lastname;
After we have defined groups with the GROUP BY key word, we can select more informations about each of them, e.g.: how much persons (rows) exist within each family (group of rows)?
SELECT lastname, count(*) -- count() is a function which counts values or rows FROM person GROUP BY lastname;
We see that in our small example database there is one family with 3 members, another with 2 members and all others consist of exactly 1 member.
What is going on behind the scene during the execution of the command?
- All ten rows of table person are retrieved (in the above command there is no WHERE clause).
- The rows are arranged into seven groups according to the value of column lastname.
- Every group with all of its rows is passed to the SELECT clause.
- The SELECT builds one resulting row for every received group (in 'real world' databases each of the groups may contain thousands of rows).
In step 4 exactly one resulting row is generated per group. Because the SELECT creates only one resulting row per group, it is not possible to show values of such columns which may differ from row to row, e.g. the firstname. The SELECT can only show such values of which it is ensured that they are identical within all rows of the group: the 'criterion'-column.
-- It is not possible to show the 'firstname' of a group! 'firstname' is an attribute of single person. -- Within a group 'firstname' varies from row to row. -- The DBMS should recognise this problem and should issue an error message. SELECT lastname, firstname FROM person GROUP BY lastname; -- A hint to users of MySQL: -- To receive correct results (the error message) you must deactivate a special performance feature by issuing the command -- set sql_mode = 'ONLY_FULL_GROUP_BY'; or set it in the workbench or in the ini-file.
Nevertheless we can get information about the non-criterion-columns. But this information is more generalized. The DBMS offers a special group of functions which builds one value out of a set of rows. Consider the avg() function, which computes the arithmetic mean of numerical values. This function receives a column name and operates on a set of rows. If our command in question contains a GROUP BY clause, the avg() function does compute one value per group - not one value per all rows as usual. So it is possible to show the result of such functions together with the values of the 'criterion'-column.
Here is an - incomplete - list of such functions: count(), max(), min(), sum(), avg(). Not all functions are of that kind, e.g. the function concat(), which concatenates two strings, operates on single rows and creates one value per row.
-- compute avg() by your own formula SELECT lastname, sum(weight) / count(weight) as "Mean weight 1", avg(weight) as "Mean weight 2" FROM person GROUP BY lastname;
Focus on Desired Groups
You know the WHERE clause. It defines which rows of a table will be part of the result set. The HAVING clause has the same meaning at the group-level. It defines which groups will be part of the result set.
-- The HAVING complements the GROUP BY SELECT ... FROM ... GROUP BY <columnname> HAVING <having clause>; -- specify a criterion which can be applied to groups
We retrieve exclusively families with more than 1 members:
SELECT lastname FROM person GROUP BY lastname -- grouping over lastname HAVING count(*) > 1; -- more than one person within the group
All families with one member are no longer part of the result.
In a second example we focus on such groups which satifies a criterion on column firstname. Consider that firstname is not the grouping-column.
-- Groups containing a person whose firstname has more than 4 characters: 5 resulting rows SELECT lastname FROM person GROUP BY lastname HAVING max(length(firstname)) > 4; -- max() returns ONE value (the highest one) for all rows of each 'lastname'-group
The result shows the 5 families Baker, de Winter, Goldstein, Rich and Stefanos (, but not the row(s) with the long firstname).
Please note that this result is very different from the similar question to persons whose firstname has more than 4 characters:
-- Persons whose firstname has more than 4 characters: 6 resulting rows!! SELECT lastname, firstname FROM person WHERE length(firstname) > 4; -- no GROUP BY and no HAVING. The WHERE isn't an equivalent replacement for the HAVING!!
Where is the additional row coming from? In the family de Winter there are two persons with a firstname longer than 4 characters: James and Victor. Because in the command without GROUP BY we select for persons and not for families, both rows are displayed individually.
In summary we can say that the HAVING clause decides, which groups are part of the result set and which are not.
The Overall Picture
The GROUP BY and HAVING clauses are part of the SELECT command and we can combine them with any other clauses of the SELECT as desired. Only the order of the clauses is obligatory.
-- This is the obligatory order of clauses SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... ;
As mentioned the WHERE clause works on the row-level whereas the HAVING clause works on the group-level. First the WHERE is evaluated, next the GROUP BY, next the HAVING, next the ORDER BY and at the end the SELECT. Every step is based on the results of the previous one.
Finally we offer two additional examples:
-- Are there persons born on the same day? SELECT date_of_birth -- In a later chapter you will learn how to select the name of this persons. FROM person GROUP BY date_of_birth HAVING count(date_of_birth) > 1 -- more than one on the same day? ORDER BY date_of_birth; -- Families with long first- and lastname. Comment out some lines to see differences to the original query. SELECT lastname, count(*) as cnt FROM person WHERE length(firstname) > 4 GROUP BY lastname HAVING length(lastname) > 4 ORDER BY cnt desc, lastname ;
Are there persons born on the same day in the same city? Hint: group over both criteria
SELECT date_of_birth, place_of_birth FROM person GROUP BY date_of_birth, place_of_birth HAVING count(*) > 1;
Categorise persons according to the formula: 'round (weight / 10)': 10 to 19 kg --> 1, 20 to 29 kg --> 2, ...
How much persons exist in each category?
SELECT round (weight / 10), count(*) FROM person GROUP BY round (weight / 10) -- ORDER BY round (weight / 10) -- order by category ORDER BY count(*) -- order by frequency ;
Which contact type is used in which frequency in table contact?
SELECT contact_type, count(*) FROM contact GROUP BY contact_type -- ORDER BY contact_type -- order by contact_type ORDER BY count(*) -- order by frequency ;
Restrict previous result to contact types which occurs more than once.
SELECT contact_type, count(*) FROM contact GROUP BY contact_type HAVING count(*) > 1 -- order by contact_type -- order by contact_type ORDER BY count(*) -- order by frequency ;
Are there persons performing more than 2 hobies? Hint: check table person_hobby.
SELECT person_id, count(*) FROM person_hobby GROUP BY person_id HAVING count(*) > 2 ;
Are there persons performing only one hobby?
SELECT person_id, count(*) FROM person_hobby GROUP BY person_id HAVING count(*) = 1 ;
Are there persons performing no hobby?
There are persons, which do not perform a hobby. But the nearby formulation 'count(*) = 0'
will not lead to the expected result because for such persons there are no rows
in table person_contact, so the DBMS cannot create any group and hence cannot display anything.
Looking for something that does NOT exist is often more difficult than looking for the
existence of something. In such cases you usually have to use one of: NOT EXISTS, NOT IN,
a combination of OUTER JOIN and IS NULL, a combination of OUTER JOIN and MINUS together
with INNER JOIN.