Structured Query Language/NULLs and the Three Valued Logic
As mentioned in a previous chapter of this wikibook and in wikipedia sometimes there is no value in a column of a row, or - to say it the other way round - the column stores the NULL marker (a flag to indicate the absence of any data), or - to use the notion of the SQL standard - the column stores the NULL value. This NULL marker is very different from the numeric value zero or a string with a length of zero characters! Typically it occurs when an application yet hasn't stored anything in the column of this row.
(A hint to Oracle users: For Oracle the NULL marker is identical to a string of zero characters.)
The existence of the NULL marker introduces a new fundamental problem. In the usual boolean logic there are the two logical values TRUE and FALSE. Every comparison evaluates to one of the two - and the comparisons negation evaluates to the opposite one. If a comparison evaluates to TRUE, its negation evaluates to FALSE and vice versa. As an example, in the usual boolean logic one of the following two comparisons is TRUE and the other one is FALSE: 'x < 5', 'x >= 5'.
Imagine now the new situation that x holds the NULL marker. It is not feasible that 'NULL < 5' is true (1). But if we say, 'NULL < 5' is false (2), its negation 'NULL >= 5' is true (3)! Is (3) more feasible than (1)? Of course not. (1) and (3) have the same 'degree of truth', so they shall evaluate to the same value. And this value must be different from TRUE and FALSE.
Therefore the usual boolean logic is extended by a third logic value. It is named UNKNOWN. All comparisons to the NULL marker results per definition in this new value. And the well known statement 'if a statement is true, its negation is false' gets lost because there is a third option.
SQL's logic is an implementation of this so called trivalent, ternary or three-valued logic (3VL). The existence of the NULL marker in SQL is not without controversy. But if NULLs are accepted, the 3VL is a necessity.
This page proceeds in two stages: First it explains the handling of NULLs concerning comparisons, grouping, etc. . Second it explains the boolean logic for the cases where the new value UNKNOWN interacts with any other boolean value - including itself.
Step 1: Evaluation of NULLs
Comparison Predicates, IS NULL Predicate
SQL knows the six comparison predicates <, <=, =, >=, > and <> (unequal). Their main purpose is the arithmetic comparison of numeric values. Each of them needs two variables or constants (infix notation). This implies that it is possible that one or even both operants hold the NULL marker. As stated before the common and very simple rule is: "All comparisons to the NULL marker results per definition in this new value (unknown).". Here are some examples:
- NULL = 5 evaluates to UNKNOWN.
- 5 = NULL evaluates to UNKNOWN.
- NULL <= 5 evaluates to UNKNOWN.
- col_1 = 5 evaluates to UNKNOWN for rows where col_1 holds the NULL marker.
- col_1 = col_2 evaluates to UNKNOWN for rows where col_1 or col_2 holds the NULL marker.
- NULL = NULL evaluates to UNKNOWN.
- col_1 = col_2 evaluates to UNKNOWN for rows where col_1 and col_2 holds the NULL marker.
The WHERE clause returns such rows where it evaluates to TRUE. It does not return rows where it evaluates to FALSE or to UNKNOWN. In consequence it is not guaranteed that the following SELECT will return the complete table t1:
-- This SELECT will not return such rows where col_1 holds the NULL marker. SELECT * FROM t1 WHERE col_1 > 5 OR col_1 = 5 OR col_1 < 5;
Of course there are use cases where rows with the NULL marker must be retrieved. Because the arithmetic comparisons are not able to do so, another language construct must do the job. It is the IS NULL predicate.
-- This SELECT will return exactly these rows where col_1 holds the NULL marker. SELECT * FROM t1 WHERE col_1 IS NULL;
For the other predicates there is no simple rule of thumb. They must be explained one after the other.
The IN predicate is a shortcut for a sequence of OR operations:
-- Shortcut for: col_1 = 3 OR col_1 = 18 OR col_1 = NULL SELECT * FROM t1 WHERE col_1 IN (3, 18, NULL); -- the NULL case will never hit with the IN predicate! -- a second example which is a little more complex -- WHERE col_1 IN (SELECT col_x FROM t2 WHERE id < 10);
Only the two comparisons 'col_1 = 3' and 'col_1 = 18' are able to retrieve rows (possibly many rows). The comparison 'col_1 = NULL' will never evaluate to TRUE. It's always UNKNOWN, even if col_1 holds the NULL marker. To retrieve those rows it's necessary - as shown above - to use the 'IS NULL' predicate.
The subselect of an EXISTS predicate evaluates to TRUE if the cardinality of the retrieved rows is greater than 0, and to FALSE if the cardinality is 0. It is not possible that the UNKNOWN value occurs.
-- The subselect to t2 can hit some rows - or not. If there are hits in the subselect, ALL rows of t1 -- are returned, else no rows of t1 are returned. SELECT * -- The select to table t1 FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE id < 10); -- The subselect to table t2
The LIKE predicate compares a column with a regular expression. If the column contains the NULL marker, the LIKE predicate returns the UNKNOWN value, what means that the row is not retrieved.
-- The LIKE retrieves NO rows if col_2 contains the NULL marker. SELECT * FROM t1 WHERE col_2 LIKE 'Hello %';
The aggregate functions
COUNT(<column_name>), MIN(<column_name>), MAX(<column_name>), SUM(<column_name>) and
AVG(<column_name>) ignores such rows where <column_name> contains the NULL marker. On the other hand
COUNT(*) includes all rows.
If a parameter of one of the scalar functions like
UPPER(), TRIM(), CONCAT(), ABS(), SQRT(), ... contains the NULL marker the resulting value is - in the most cases - the NULL marker.
There are some situations where column values are compared to each other to answer the question, whether they are distinct. For usual numbers and strings the result of such decisions is obvious. But how shall the DBMS handle NULL markers? Are they distinct from each other, are they equal to each other or is there no answer to this question at all? To get results, which are expected by (nearly) every end user, the standard defines "Two null values are not distinct.", they build a single group.
SELECT DISTINCT col_1 FROM t1; retrieves one and only row for all rows where col_1 holds the NULL marker.
... GROUP BY col_1 ...; builds one and only one group for all rows where col_1 holds the NULL marker.
Step 2: Boolean Operations within 3VL
After we have seen how various comparisons and predicates on the NULL marker produces TRUE, FALSE and UNKNOWN it's necessary to explain the rules for the new logic value UNKNOWN.
A first elementary operation is the inspection of a truth value: is it TRUE, FALSE or UNKNOWN? Analogous to the IS NULL predicate there are three additional predicates:
- IS [NOT] TRUE
- IS [NOT] FALSE
- IS [NOT] UNKNOWN
-- Check for 'UNKNOWN' SELECT * FROM t1 WHERE (col_1 = col_2) IS UNKNOWN; -- parenthesis are not necessary -- ... is semantically equivalent to SELECT * FROM t1 WHERE col_1 IS NULL OR col_2 IS NULL;
In the abstract syntax of logical systems p shall represent any of its truth values. Herein the new predicates evaluate according to the following table:
|p||IS TRUE||IS FALSE||IS UNKNOWN||IS NOT TRUE||IS NOT FALSE||IS NOT UNKNOWN|
Please notice that all predicates leads to TRUE or FALSE and never to UNKNOWN.
The next operation is the negation of the new value. To which value evaluates 'NOT UNKNOWN'? The UNKNOWN value represents the impossibility to decide between TRUE and FALSE. It is not feasible that the negation of this impossibility leads to TRUE or FALSE. Likewise it is UNKNOWN.
-- Which rows will match? (1) SELECT * FROM t1 WHERE NOT col_2 = NULL; -- 'col_2 = NULL' evaluates to UNKNOWN in all cases, see above. -- Is this SELECT equivalent to the first one? (2) SELECT * FROM t1 EXCEPT SELECT * FROM t1 WHERE col_2 = NULL; -- No, it's different!! Independent from NULL markers in col_2, (1) retrieves -- absolutely NO row and (2) retrieves ALL rows.
The above SELECT (1) will retrieve no rows as 'NOT col_2 = NULL' evaluates to the same as 'col_2 = NULL', namely UNKNOWN. And the SELECT (2) will retrieve all rows, as the part after EXCEPT will retrieve no rows, hence only the part before EXCEPT is relevant.
In the abstract syntax of logical systems p shall represent any of its truth values and NOT p its negation. Herein the following table applies:
There are the two binary operations AND and OR. They evaluate as follows:
|p||q||p AND q||p OR q|
The precedence of the operations is defined as usual: IS predicate, NOT, AND, OR.
-- -- Add a new row to the test data base INSERT INTO person (id, firstname, lastname) -- Omit some columns to generate NULL markers VALUES (99, 'Tommy', 'Test'); COMMIT; SELECT * FROM person -- focus all tests to the new row WHERE id = 99 -- (1): TRUE AND -- (3): (1) AND (2) ==> TRUE AND UNKNOWN ==> UNKNOWN ( date_of_birth = NULL -- (2): UNKNOWN ); -- no hit SELECT * FROM person WHERE id = 99 -- (1): TRUE AND -- (3): (1) AND (2) ==> TRUE AND TRUE ==> TRUE ( date_of_birth IS NULL -- (2): TRUE ); -- hit SELECT * FROM person WHERE id = 99 -- (1): TRUE OR -- (3): (1) OR (2) ==> TRUE OR UNKNOWN ==> TRUE ( date_of_birth = NULL -- (2): UNKNOWN ); -- hit SELECT * FROM person WHERE id = 99 -- (1): TRUE AND -- (4): (1) AND (3) ==> TRUE AND FALSE ==> FALSE ( NOT -- (3): NOT (2) ==> NOT TRUE ==> FALSE date_of_birth IS NULL -- (2): TRUE ); -- no hit (same as AND date_of_birth IS NOT NULL) -- Clean up the test database DELETE FROM person WHERE id = 99; COMMIT;