|
|
|
Title
Treatment of NULLs adds a fair amount of complexity to SQL. Here I’d
like to point out a specific issue that has to do with NULL treatment and
point out a standard SQL language element that addresses this issue and
was not yet implemented in SQL Server. At the end of this article I’ll
provide a URL where you can vote for adding it to SQL Server.
When you compare columns/expressions, say in the ON or WHERE
clauses of a query, predicates follow what’s called the 3-valued-logic. For
example:
SELECT ...
FROM dbo.T1 JOIN dbo.T2
ON T1.col1 = T2.col1;
The predicate can evaluate to TRUE, FALSE or UNKNOWN. The
predicate evaluates to
UNKNOWN whenever NULLs are involved. For
example (T1.col1 = 5, T2.col1 = NULL),
(T1.col1 = NULL, T1.col2 =
10) and even (T1.col1 = NULL, T2.col1 = NULL). Some
T-SQL language
elements will handle UNKNOWN differently than others. For
example,
ON, WHERE, HAVING ,IF and CASE “accept TRUE”, and will treat
UNKNOWN
somewhat similar to FALSE, but not exactly like FALSE.
For example, in the above
example, when T1.col1 = T2.col1 evaluates to
UNKNOWN, like with FALSE, this will
be considered a nonmatch.
However, the reason that I said that the treatment
here is not exactly like
with FALSE is that if T1.col1 = T2.col1 evaluates to
UNKNOWN, NOT
T1.col1 = T2.col1 also evaluates to UNKNOWN. Other language
elements
like a CHECK constraint “reject FALSE”, and treat UNKNOWN more
similar
to TRUE, but again, not exactly like TRUE. For example, an
expression such as
CHECK(col1 = col2) will reject a row where col1 = 5
and col2 = 10, but will
accept col1 = 5 and col2 = NULL. But as I
mentioned earlier, if col1 = col2
evaluates to UNKNOWN, NOT col1 =
col2 still evaluates to UNKNOWN. So in this
case, when
col1 = 5 and col2 = 10 the row will be accepted this time (because
NOT
FALSE is TRUE), but col1 = 5 and col2 = NULL will also be accepted.
T-SQL already provides an answer for cases where you want to explicitly
check
when the expression is or is not a NULL via the IS NULL and IS
NOT NULL
operators. However, there are cases where you want to
compare
columns/expressions from both sides, and use 2-valued-logic; that
is, you want
the predicate to evaluate to either TRUE or FALSE.
For example, suppose that in certain circumstances you want the predicate
T1.col1 = T2.col1 to evaluate to TRUE when both sides have known values
and are
equal, or when both sides are NULL—in other words, you want to
treat NULLs just
like you do known values. For now, you have to add
some logic such as:
(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
This makes the expression awkward, and also you would probably cross
your
fingers in hope that this will not hurt optimization. Note that this
expression
will “accept” only the cases you want to treat as a match and
“reject” all
others, but it still has 3-valued-logic, e.g.,
T1.col1 = 10, T2.col1 = NULL
evaluates to UNKNOWN. This means
that:
NOT(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
Will evaluate to UNKNOWN in those cases as well and not to TRUE. So if
you want
all non matches of the positive predicate to be accepted, you
won’t use the
positive predicate and apply NOT on top, rather rewrite your
logic, e.g.,
T1.col1 <> T2.col1
OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL)
OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)
In short, there are cases where you want predicates to use 2-valued-logic,
and
for now, the solutions are not pretty, and we can only hope that they are
optimized well.
SQL Server 2008 introduces new language elements that add scenarios
where NULLs
are involved and you would want predicates to use 2-
valued-logic, e.g., grouping
sets related features and the MERGE statement.
I will cover those in detail in
my October and November columns. For now,
even though you may not be familiar
with grouping sets and the MERGE
statement, focus on the ON clause of the
following MERGE statement
handling incremental updates of aggregates:
WITH LastDay AS
(
SELECT
GROUPING_ID(
custid, empid,
YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id,
custid, empid,
YEAR(orderdate) AS orderyear,
MONTH(orderdate) AS ordermonth,
DAY(orderdate) AS orderday,
SUM(qty) AS qty
FROM dbo.Orders
WHERE orderdate = '20080419'
GROUP BY
CUBE(custid, empid),
ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate))
)
MERGE INTO dbo.MyGroupingSets AS TGT
USING LastDay AS SRC
ON (TGT.grp_id = SRC.grp_id)
AND (TGT.orderyear = SRC.orderyear
OR (TGT.orderyear IS NULL AND SRC.orderyear IS NULL))
AND (TGT.ordermonth = SRC.ordermonth
OR (TGT.ordermonth IS NULL AND SRC.ordermonth IS NULL))
AND (TGT.orderday = SRC.orderday
OR (TGT.orderday IS NULL AND SRC.orderday IS NULL))
AND (TGT.custid = SRC.custid
OR (TGT.custid IS NULL AND SRC.custid IS NULL))
AND (TGT.empid = SRC.empid
OR (TGT.empid IS NULL AND SRC.empid IS NULL))
WHEN MATCHED THEN
UPDATE SET
TGT.qty += SRC.qty
WHEN NOT MATCHED THEN
INSERT (grp_id, orderyear, ordermonth, orderday, custid, empid)
VALUES (SRC.grp_id, SRC.orderyear, SRC.ordermonth, SRC.orderday, SRC.custid, SRC.empid);
Notice how awkward it is.
I had a discussion on the subject with fellow MVPs in a private forum. Steve
Kass did some research and discovered that ANSI SQL has an answer to
this
problem in the form of operators:
IS [NOT] DISTINCT FROM
It does require getting used to, but once you do, it makes perfect sense.
DISTINCT treats NULLs like known values. That is, one NULL is not
distinct from
another NULL, but a NULL is distinct from known values. 5 is
distinct from 10,
but 5 is not distinct from 5.
Now let’s go back to the example I mentioned earlier:
T1.col1 = T2.col1
To have this predicate use 2-valued-logic, and have NULLs treated like
known
values, instead of using:
(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
You would use:
T1.col1 IS NOT DISTINCT FROM T2.col1
This is ultimately simpler than the awkward forms used nowadays and has
much
better potential for good optimization in terms of utilizing indexes
efficiently. As for the inverse; take the following predicate:
T1.col1 <> T2.col1
In order to treat NULLs like known values, instead of using:
T1.col1 <> T2.col1
OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL)
OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)
You would use:
T1.col1 IS DISTINCT FROM T2.col1
Besides being simpler and lending itself to better optimization, the IS [NOT]
DISTINCT FROM operator give you pure 2-valued-logic, namely, apply
NOT on top
and you will get the exact inverse.
I find this standard language element very important, and this will become
more
evident in SQL Server 2008. Unfortunately, it was not implemented in
T-SQL.
Steve Kass who researched and discovered it submitted a
suggestion for this
enhancement in connect.microsoft.com. I encourage you
to vote for this
suggestion
here;
the more votes it gets, the better are the
chances you will see it in a future
version of SQL Server.
Cheers,
--
BG
|
|
|
|