back to blog index

Title

Treatment of NULLs adds a fair amount of complexity to SQL. Here I’d
like to point out a specific issue that has to do with NULL treatment and
point out a standard SQL language element that addresses this issue and
was not yet implemented in SQL Server. At the end of this article I’ll
provide a URL where you can vote for adding it to SQL Server.
When you compare columns/expressions, say in the ON or WHERE
clauses of a query, predicates follow what’s called the 3-valued-logic. For
example:

SELECT ...
FROM dbo.T1 JOIN dbo.T2
  ON T1.col1 = T2.col1;
The predicate can evaluate to TRUE, FALSE or UNKNOWN. The
predicate evaluates to UNKNOWN whenever NULLs are involved. For
example (T1.col1 = 5, T2.col1 = NULL), (T1.col1 = NULL, T1.col2 =
10) and even (T1.col1 = NULL, T2.col1 = NULL). Some T-SQL language
elements will handle UNKNOWN differently than others. For example,
ON, WHERE, HAVING ,IF and CASE “accept TRUE”, and will treat
UNKNOWN somewhat similar to FALSE, but not exactly like FALSE.
For example, in the above example, when T1.col1 = T2.col1 evaluates to
UNKNOWN, like with FALSE, this will be considered a nonmatch.
However, the reason that I said that the treatment here is not exactly like
with FALSE is that if T1.col1 = T2.col1 evaluates to UNKNOWN, NOT
T1.col1 = T2.col1 also evaluates to UNKNOWN. Other language elements
like a CHECK constraint “reject FALSE”, and treat UNKNOWN more
similar to TRUE, but again, not exactly like TRUE. For example, an
expression such as CHECK(col1 = col2) will reject a row where col1 = 5
and col2 = 10, but will accept col1 = 5 and col2 = NULL. But as I
mentioned earlier, if col1 = col2 evaluates to UNKNOWN, NOT col1 =
col2 still evaluates to UNKNOWN. So in this case, when
col1 = 5 and col2 = 10 the row will be accepted this time (because NOT
FALSE is TRUE), but col1 = 5 and col2 = NULL will also be accepted.
T-SQL already provides an answer for cases where you want to explicitly
check when the expression is or is not a NULL via the IS NULL and IS
NOT NULL operators. However, there are cases where you want to
compare columns/expressions from both sides, and use 2-valued-logic; that
is, you want the predicate to evaluate to either TRUE or FALSE.
For example, suppose that in certain circumstances you want the predicate
T1.col1 = T2.col1 to evaluate to TRUE when both sides have known values
and are equal, or when both sides are NULL—in other words, you want to
treat NULLs just like you do known values. For now, you have to add
some logic such as:
(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
This makes the expression awkward, and also you would probably cross
your fingers in hope that this will not hurt optimization. Note that this
expression will “accept” only the cases you want to treat as a match and
“reject” all others, but it still has 3-valued-logic, e.g.,
T1.col1 = 10, T2.col1 = NULL evaluates to UNKNOWN. This means
that:
NOT(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
Will evaluate to UNKNOWN in those cases as well and not to TRUE. So if
you want all non matches of the positive predicate to be accepted, you
won’t use the positive predicate and apply NOT on top, rather rewrite your
logic, e.g.,
T1.col1 <> T2.col1
  OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL)
  OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)
In short, there are cases where you want predicates to use 2-valued-logic,
and for now, the solutions are not pretty, and we can only hope that they are
optimized well.
SQL Server 2008 introduces new language elements that add scenarios
where NULLs are involved and you would want predicates to use 2-
valued-logic, e.g., grouping sets related features and the MERGE statement.
I will cover those in detail in my October and November columns. For now,
even though you may not be familiar with grouping sets and the MERGE
statement, focus on the ON clause of the following MERGE statement
handling incremental updates of aggregates:
WITH LastDay AS
(
  SELECT 
    GROUPING_ID(
      custid, empid,
      YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id,
    custid, empid,
    YEAR(orderdate) AS orderyear,
    MONTH(orderdate) AS ordermonth,
    DAY(orderdate) AS orderday,
    SUM(qty) AS qty
  FROM dbo.Orders
  WHERE orderdate = '20080419'
  GROUP BY
    CUBE(custid, empid),
    ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate))
)
MERGE INTO dbo.MyGroupingSets AS TGT
USING LastDay AS SRC
  ON     (TGT.grp_id = SRC.grp_id)
     AND (TGT.orderyear = SRC.orderyear
          OR (TGT.orderyear IS NULL AND SRC.orderyear IS NULL))
     AND (TGT.ordermonth = SRC.ordermonth
          OR (TGT.ordermonth IS NULL AND SRC.ordermonth IS NULL))
     AND (TGT.orderday = SRC.orderday
          OR (TGT.orderday IS NULL AND SRC.orderday IS NULL))
     AND (TGT.custid = SRC.custid
          OR (TGT.custid IS NULL AND SRC.custid IS NULL))
     AND (TGT.empid = SRC.empid
          OR (TGT.empid IS NULL AND SRC.empid IS NULL))
WHEN MATCHED THEN
  UPDATE SET
    TGT.qty += SRC.qty
WHEN NOT MATCHED THEN
  INSERT (grp_id, orderyear, ordermonth, orderday, custid, empid)
  VALUES (SRC.grp_id, SRC.orderyear, SRC.ordermonth, SRC.orderday, SRC.custid, SRC.empid);
Notice how awkward it is.
I had a discussion on the subject with fellow MVPs in a private forum. Steve
Kass did some research and discovered that ANSI SQL has an answer to
this problem in the form of operators:
IS [NOT] DISTINCT FROM
It does require getting used to, but once you do, it makes perfect sense.
DISTINCT treats NULLs like known values. That is, one NULL is not
distinct from another NULL, but a NULL is distinct from known values. 5 is
distinct from 10, but 5 is not distinct from 5.
Now let’s go back to the example I mentioned earlier:
T1.col1 = T2.col1
To have this predicate use 2-valued-logic, and have NULLs treated like
known values, instead of using:
(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
You would use:
T1.col1 IS NOT DISTINCT FROM T2.col1
This is ultimately simpler than the awkward forms used nowadays and has
much better potential for good optimization in terms of utilizing indexes
efficiently. As for the inverse; take the following predicate:
T1.col1 <> T2.col1
In order to treat NULLs like known values, instead of using:
T1.col1 <> T2.col1
  OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL)
  OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)
You would use:
T1.col1 IS DISTINCT FROM T2.col1
Besides being simpler and lending itself to better optimization, the IS [NOT]
DISTINCT FROM operator give you pure 2-valued-logic, namely, apply
NOT on top and you will get the exact inverse.
I find this standard language element very important, and this will become
more evident in SQL Server 2008. Unfortunately, it was not implemented in
T-SQL. Steve Kass who researched and discovered it submitted a
suggestion for this enhancement in connect.microsoft.com. I encourage you
to vote for this suggestion here; the more votes it gets, the better are the
chances you will see it in a future version of SQL Server.

Cheers,
--
BG


You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

Considering that many joins of this type will be two fields optionally referencing a common synthetic key, the following works:

where isnull(T1.col1, -1) = isnull(T2.col1, -1)

Now this won't take advantage of an index, but this very well may:

where (T1.col1 = T2.col1 or isnull(T1.col1, -1) = isnull(T2.col1, -1))

I've used the latter; it tends to generate nested loops over hash joins and performs fairly well.

You can also create a view and index an output column. Wouldn't it be nice to create an index based on a function?

While I would use "IS DISTINCT FROM", I am a little worried how the average developer will understand it!

tgalasso

Article Rating 5 out of 5

 

     1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
       
or

More blogs about technology,
databases, and SQL Server.