Structured Query Language/SELECT: Grouping

From Wikibooks, open books for an open world
Jump to: navigation, search
SELECT: Join Operation Structured Query Language
SELECT: Grouping
SELECT: IS NULL Predicate


In this chapter we will leave the level of individual rows. We strive to find informations and statements that refer to groups of rows - at the expense of information about individual rows. In the context of SQL such 'row-groups' (or sets of rows) are build by the GROUP BY clause and further processed by the HAVING clause.


Constitute Groups[edit]

First we must establish criteria according to which the rows are assigned to groups. To do so we use the content of one or more columns of the involved table(s). If the values are identical, the rows belong to the same group. Consider the lastname in table person. In our small example we can insinuate that persons with same lastname form a family. So if we strive for informations about families we should use this column as the grouping criterion. This grouping allows us to ask questions concerning whole families, such as 'Which families are there?', 'How many families exists?', 'How many persons are in each family?'. Please note that all of them are questions about the whole group (which means the family), not about single rows (which means the person).

In the SQL syntax the criterion is specified after the key word GROUP BY and consists of one or more columnnames.

SELECT ...             -- as usual
FROM   ...             -- as usual (optionally plus JOINs)
GROUP BY <columnname>  -- optionally more columnnames
...                    -- optionally other elements of SELECT command
;


Our concrete example about families looks like this:

SELECT lastname
FROM   person
GROUP BY lastname;

The query retrieves seven 'family names' out of the 10 rows. There are several persons with lastname 'Goldstein' or 'de Winter'.

We can retrieve the same seven 'family names' by applying the key word DISTINCT in a SELECT without GROUP BY.

SELECT DISTINCT lastname
FROM   person;
-- no GROUP BY clause

What makes the difference? The DISTINCT key word is limited to remove duplicate values. It can not initiate computations on other rows and columns of the result set. In contrast, the GROUP BY additionally arranges the intermediate received rows as a number of groups and offers the possibility to get informations about each of this groups. It is even the case that within this groups all columns are available, not only the 'criterion'-column. To confirm this statement about 'all' columns we use weight which is not the 'criterion'-column.

SELECT lastname, avg(weight)  -- avg() is a function to compute the arithmetic mean of numerical values
FROM   person
GROUP BY lastname;

The result shows the seven family names - as seen before - plus the average weight of every family. The weight of individual persons is not shown. (In groups with exactly one person the average weight of the group is of course identical to the single persons weight.)

Grouping over multiple columns[edit]

If necessary we can define the grouping over more than one column. In this case we can imagine the concatenation of the colums as the grouping rule.

-- Group over one column: place_of_birth leads to 6 resulting rows
SELECT place_of_birth, COUNT(*)
FROM   person
GROUP BY place_of_birth;
-- Group over two columns: place_of_birth plus lastname leads to 8 resulting rows with Richland and SF shown twice
SELECT place_of_birth, lastname, COUNT(*)
FROM   person
GROUP BY place_of_birth, lastname;

Inspect Groups[edit]

After we have defined groups with the GROUP BY key word, we can select more informations about each of them, e.g.: how much persons (rows) exist within each family (group of rows)?

SELECT lastname, COUNT(*)  -- count() is a function which counts values or rows
FROM   person
GROUP BY lastname;

We see that in our small example database there is one family with 3 members, another with 2 members and all others consist of exactly 1 member.

What is going on behind the scene during the execution of the command?

  1. All ten rows of table person are retrieved (in the above command there is no WHERE clause).
  2. The rows are arranged into seven groups according to the value of column lastname.
  3. Every group with all of its rows is passed to the SELECT clause.
  4. The SELECT builds one resulting row for every received group (in 'real world' databases each of the groups may contain thousands of rows).

In step 4 exactly one resulting row is generated per group. Because the SELECT creates only one resulting row per group, it is not possible to show values of such columns which may differ from row to row, e.g. the firstname. The SELECT can only show such values of which it is ensured that they are identical within all rows of the group: the 'criterion'-column.

-- It is not possible to show the 'firstname' of a group! 'Firstname' is an attribute of single persons.
-- Within a group 'firstname' varies from row to row.
-- The DBMS should recognise this problem and should issue an error message.
SELECT lastname, firstname
FROM   person
GROUP BY lastname;
-- A hint to users of MySQL:
-- To receice correct results (the error message) you must deactivate a special performance feature by issue the command
-- set sql_mode = 'ONLY_FULL_GROUP_BY'; or set it in the workbench or in the ini-file.

Nevertheless we can get information about the non-criterion-columns. But this information is more generalised. The DBMS offers a special group of functions which builds one value out of a set of rows. Consider the avg() function, which computes the arithmetic mean of numerical values. This function receives a columnname and operates on a set of rows. If our command in question contains a GROUP BY clause, the avg() function does compute one value per group - not one value per all rows as usual. So it is possible to show the result of such functions together with the values of the 'criterion'-column.

Here is an - incomplete - list of such functions: count(), max(), min(), sum(), avg(). Not all functions are of that kind, e.g. the function concat(), which concatenates two strings, operates on single rows and creates one value per row.

-- compute avg() by your own formula
SELECT lastname, SUM(weight) / COUNT(weight) AS "Mean weight 1", avg(weight) AS "Mean weight 2"
FROM   person
GROUP BY lastname;

Focus on Desired Groups[edit]

You know the WHERE clause. It defines which rows of a table will be part of the result set. The HAVING clause has the same meaning at the group-level. It defines which groups will be part of the result set.

-- The HAVING complements the GROUP BY
SELECT ...
FROM   ...
GROUP BY <columnname>
HAVING <HAVING clause>; -- specify a criterion which can be applied to groups

We retrieve exclusively families with more than 1 members:

SELECT lastname
FROM   person
GROUP BY lastname    -- grouping over lastname
HAVING COUNT(*) > 1; -- more than one person within the group

All families with one member are no longer part of the result.

In a second example we focus on such groups which satifies a criterion on column firstname. Consider that firstname is not the grouping-column.

-- Groups containing a person whose firstname has more than 4 characters: 5 resulting rows
SELECT lastname
FROM   person
GROUP BY lastname
HAVING MAX(LENGTH(firstname)) > 4; -- max() returns ONE value (the highest one) for all rows of each 'lastname'-group

The result shows the 5 families Baker, de Winter, Goldstein, Rich and Stefanos (, but not the row(s) with the long firstname).

Please note that this result is very different from the similar question to persons whose firstname has more than 4 characters:

-- Persons whose firstname has more than 4 characters: 6 resulting rows!!
SELECT lastname, firstname
FROM   person
WHERE  LENGTH(firstname) > 4;
-- no GROUP BY and no HAVING. The WHERE isn't an equivalent replacement for the HAVING!!

Where is the additional row coming from? In the family de Winter there are two persons with a firstname longer than 4 characters: James and Victor. Because in the command without GROUP BY we select for persons and not for families, both rows are displayed individually.

In summary we can say that the HAVING clause decides, which groups are part of the result set and which are not.

The Overall Picture[edit]

The GROUP BY and HAVING clauses are part of the SELECT comand and we can combine them with any other clauses of the SELECT as desired. Only the order of the clauses is obligatory.

-- This is the obligatory order of clauses
SELECT ...
FROM   ...
WHERE  ...
GROUP BY ...
HAVING   ...
ORDER BY ...
;

As mentioned the WHERE clause works on the row-level whereas the HAVING clause works on the group-level. First the WHERE is evaluated, next the GROUP BY, next the HAVING, next the ORDER BY and at the end the SELECT. Every step is based on the results of the previous one.

Finally we offer two additional examples:

-- Are there persons born on the same day?
SELECT date_of_birth  -- In a later chapter you will learn how to select the name of this persons.
FROM   person
GROUP BY date_of_birth
HAVING COUNT(date_of_birth) > 1 -- more than one on the same day?
ORDER BY date_of_birth;
 
-- Families with long first- and lastname. Comment out some lines to see differences to the original query.
SELECT lastname, COUNT(*) AS cnt 
FROM   person
WHERE  LENGTH(firstname) > 4  
GROUP BY lastname
HAVING LENGTH(lastname) > 4
ORDER BY cnt DESC, lastname
;

Exercises[edit]

Are there persons born on the same day in the same city? Hint: group over both criteria

Click to see solution
SELECT date_of_birth, place_of_birth
FROM   person
GROUP BY date_of_birth, place_of_birth
HAVING   COUNT(*) > 1;

Categorise persons according to the formula: 'round (weight / 10)': 10 to 19 kg --> 1, 20 to 29 kg --> 2, ...
How much persons exist in each category?

Click to see solution
SELECT round (weight / 10), COUNT(*)
FROM   person
GROUP BY round (weight / 10)
-- ORDER BY round (weight / 10)  -- order by category
ORDER BY COUNT(*)                -- order by frequency
;

Which contact type is used in which frequency in table contact?

Click to see solution
SELECT contact_type, COUNT(*)
FROM   contact
GROUP BY contact_type
-- ORDER BY contact_type  -- order by contact_type
ORDER BY COUNT(*)         -- order by frequency
;

Restrict previous result to contact types which occurs more than once.

Click to see solution
SELECT contact_type, COUNT(*)
FROM   contact
GROUP BY contact_type
HAVING   COUNT(*) > 1
-- order by contact_type  -- order by contact_type
ORDER BY COUNT(*)         -- order by frequency
;

Are there persons performing more than 2 hobies? Hint: check table person_hobby.

Click to see solution
SELECT person_id, COUNT(*)
FROM   person_hobby
GROUP BY person_id
HAVING   COUNT(*) > 2
;

Are there persons performing only one hobby?

Click to see solution
SELECT person_id, COUNT(*)
FROM   person_hobby
GROUP BY person_id
HAVING   COUNT(*) = 1
;

Are there persons performing no hobby?

Click to see solution
There are persons, which do not perform a hobby. But the nearby formulation 'count(*) = 0' 
will not lead to the expected result because for such persons there are no rows
in table person_contact, so the DBMS cannot create any group and hence cannot display anything.

Looking for something that does NOT exist is often more difficult than looking for the
existence of something. In such cases you usually have to use one of: NOT EXISTS, NOT IN,
a combination of OUTER JOIN and IS NULL, a combination of OUTER JOIN and MINUS together
with INNER JOIN.


SELECT: Join Operation Structured Query Language
SELECT: Grouping
SELECT: IS NULL Predicate