• subscribe
December 23, 2008 12:00 AM

Handling Arrays as Inputs

Use a T-SQL split function
SQL Server Pro
InstantDoc ID #100657
Downloads
100657.zip

Executive Summary: Learn how to use a T-SQL split function to handle arrays as inputs for Microsoft SQL Server stored procedures.

Although SQL Server stored procedures don’t support using arrays as inputs, DBAs often need to run procedures that can accept an unknown number of input parameters. For example, suppose you need to develop a stored procedure in the Northwind database that accepts a list of order IDs as input and returns all orders from the Orders table whose keys appear in the input list. A common solution is to use dynamic SQL; in “Handling Arrays” (InstantDoc ID 100453), I explain why this solution isn’t preferable.

Another option is to create a table function that accepts a string with a separated list of elements as input (@arr), and possibly a second argument with the separator character (@sep), and returns a table result with a row for each element. You can then use such a split function in your stored procedure and join its output table with the Orders table to identify the qualifying orders.

In this article I discuss how to use a T-SQL implementation of the split function solution. Next month I’ll discuss a Common Language Runtime (CLR) implementation of the split function approach.

Getting Started
To use the T-SQL split function approach, first run the code in Listing 1 to create and populate an auxiliary table of numbers. Next, run the code in Listing 2 to create the T-SQL implementation of the fn_split function. The split logic implemented by the function’s query can be divided into three parts: generating copies, extracting elements, and calculating position.

Generating copies. The first part of the query produces a copy of the @arr value for each element in the array. This action requires the Nums table that the script in Listing 1 created; as you can see, the query in Listing 2 is against the Nums table. The predicate in the WHERE clause is in charge of producing a match for each element in the array. In simple terms, a match is found for each number n from the Nums table that represents the character position of @sep (the list’s separator) in @arr. For x number of elements, there are x - 1 separators. In order to obtain x matches for x elements, you must add a separator in front of the first element. That is, you need to look for @sep in @sep + @arr instead of in @arr. In addition, because the Nums table can have many more numbers than the number of characters in @arr, you filter only the numbers that are smaller than or equal to the length of @arr plus one for the separator you added in front of the first element. Here’s the complete predicate in the query’s WHERE clause:

WHERE n <= LEN(@arr) + 1
  AND SUBSTRING(@sep + @arr, n, 1) = @sep
This filter gives you a match for each element in the array; within that match, n represents the position of @sep in @sep + @arr. Interestingly, in respect to @arr, n simply represents the position of the character where the element starts. Next, you need to extract a different element from each copy of @arr.

Extracting elements. The second part of the script in Listing 2 extracts elements from each copy of @arr. The column n in the Nums table represents the character position where the element of interest starts. Thus, a call to the SUBSTRING function extracts the element:

SUBSTRING(@arr, n, CHARINDEX(@sep, @arr + @sep,
  n) - n)

The SUBSTRING function operates on @arr as the input string and extracts a substring starting at character position n. The length to extract is the position of the next occurrence of @sep in @arr (calculated by the CHARINDEX function) minus n. Two things to note about the use of CHARINDEX to calculate the position of the next occurrence of @sep in @arr are that (1) in order to ensure that the separator is also found after the last element, the function looks for @sep in @arr + @sep rather than just in @arr and (2) the last argument to the function tells the function where to start looking for @sep. In my example, you would tell CHARINDEX to look for the next occurrence of @sep starting at the beginning of the current element, which is represented by n; hence n is provided as the last argument to the function.

Calculating position. The final portion of the script in Listing 2 calculates the position of the element within the array. In SQL Server 2000 the solution is quite tricky. The position of the element starting at character n can be calculated by counting the number of occurrences of @sep prior to the nth character (in the left n - 1 characters) plus 1. To count the number of occurrences of a character c in a string s, use the REPLACE function to remove all occurrences of c from s (call the result s2) by substituting them with an empty string, then simply subtract the length of s2 from s. The length of the original string minus the length of the string after removal of all occurrences of the separator character gives you the number of occurrences of the separator character. The original length of our string of interest is by definition (n - 1), therefore the complete expression that calculates the number of occurrences of @sep in the left n - 1 characters in @arr is:

(n - 1) - LEN(REPLACE(LEFT(@arr, n-1), @sep, N'))
By adding 1 to the expression above, you get the position of the element within the array. Note that if you’re working with SQL Server 2005 or later, you can calculate the position of the element with a simple expression based on the ROW_NUMBER function:
ROW_NUMBER() OVER(ORDER BY n)

Also, remember that as of SQL Server 2005 you can work with VARCHAR(MAX) or NVARCHAR(MAX) as the input parameters to the function and stored procedure, and thus support very long input strings (up to 2GB).

Now that you understand the logic behind this solution, try running the following code to test the function:

SELECT * FROM dbo.fn_split(N'10248,10250,10249',
    N',');

You’ll get the output in Table 1, which includes a row for each element and indicates its position in the array.

Continue on Page 2



ARTICLE TOOLS

Comments
  • Ognjen
    3 years ago
    Dec 04, 2009

    When I insert result of fn_split function in to the temp variable and then join that temp variable with my other two tables I get nested loop join which is optimal for my query.

    Ognjen

    [Reply: Makes sense. :)

    Cheers,
    Itzik]

  • Ognjen
    3 years ago
    Dec 04, 2009

    Itzik,
    Artical is great but I have one question.
    I get wrong estimation of rows when I use fn_split function and because of that I have suboptimal query plan. Fn_split function in my case returns only two rows but estimated number of rows is 1100000 and becaouse of that I get hash join instead of nested loop with two other tables. Do you have some tip how to get a good estimation of rows.

    Thanks,
    Ognjen

    [Reply: An inaccurate estimate could be a result of reuse of a previously cached plan. The estimate is based on compile time. Could also be a matter of referring to a variable who's value cannot be sniffed. In both cases, adding OPTION(RECOMPILE) to the query could help. If that’s not the case and it’s just a matter of inaccurate estimates there’s the option to first insert the result of the split into a temp table and then joining the temp table with the other. This will add some cost to the process due to the use of the temp table, but the estimates should be better, so overall the process could improve.

    Cheers,
    Itzik]

  • Steve
    3 years ago
    Jan 09, 2009

    -- try that again; hopefully my returns will persist this time;

    WITH splitterCte (pos, idx, element) AS (

    SELECT CAST(1 AS INT) pos
    ,CHARINDEX(@sep, @arr+@sep, 1) idx
    ,SUBSTRING(@arr, 0, CHARINDEX(@sep, @arr+@sep, 1)) element

    UNION ALL

    SELECT pos+1 pos
    ,CHARINDEX(@sep, @arr+@sep, idx+1) idx
    ,SUBSTRING(@arr, idx+1, CHARINDEX(@sep, @arr+@sep, (idx+1))-(idx+1)) element
    FROM splitterCte
    WHERE CHARINDEX(@sep, @arr+@sep, idx+1)>0
    )
    SELECT pos, element
    FROM splitterCte OPTION (MAXRECURSION 32767);

  • Steve
    3 years ago
    Jan 09, 2009

    Thank you for the article, enjoyable as always.

    This is a road I, like so many others, been down many times in the past. Using the Nums table is a clever technique for achieving a set based solution that is compatible with SQL8.

    I can understand why reader danoland may be hesitant to create a permanent Nums table, but I also believe that he doesn’t fully comprehend how it is used. The Nums table is used as something of an anchor, something to select against when discovering the starting positions of the array elements; it’s the key to the set based solution. So, it doesn’t matter what the data type of the array is as long as the Nums table is sequential and the string length of the ‘array’ passed is not greater than than MAX(Nums.n).

    I’ve found that if backwards compatibility is not an issue a recursive CTE solution runs significantly faster. One notable caveat is that due to MAXRECURSION the array length is limited to 32,767 items.

    -----------------------------------------------
    WITH splitterCte (pos, idx, element) AS (

    SELECT CAST(1 AS INT) pos
    ,CHARINDEX(@sep, @arr+@sep, 1) idx
    ,SUBSTRING(@arr, 0, CHARINDEX(@sep, @arr+@sep, 1)) element

    UNION ALL

    SELECT pos+1 pos
    ,CHARINDEX(@sep, @arr+@sep, idx+1) idx
    ,SUBSTRING(@arr, idx+1, CHARINDEX(@sep, @arr+@sep, (idx+1))-(idx+1)) element
    FROM splitterCte
    WHERE CHARINDEX(@sep, @arr+@sep, idx+1)>0
    )
    SELECT pos, element
    FROM splitterCte OPTION (MAXRECURSION 32767);
    -----------------------------------------------

    --Steve--

    [Reply: Thanks for the input, Steve. BTW, my friend and fellow MVP Erland Sommarskog did very thorough benchmarks of the various methods and you can find his results here: http://www.sommarskog.se/arrays-in-sql.html. You can see the numbers for the CTE option vs. the numbers table among others.

    Cheers,
    Itzik]

  • Dan
    4 years ago
    Dec 31, 2008

    Thank you for this article - it is very useful for something I am working on right now.

    I don't like having dbo.nums as a separate table. It seems that I would also need a separate tables for strings and dates. Furthermore, I think I would need to make sure that the table is empty before processing.

    I think I would like to create a temporary table or a table variable within the stored procedure that is calling the split function. It seems that the functionality of dbo.nums is temporary and ephemeral, exactly the sort of thing that temporary tables and table variables were designed for. However, I don't understand the performance considerations for such an implementation. I would think that if I put it in a stored procedure, then the query plan would be cached, and the performance wouldn't be too bad.

    [Reply: There are many practical uses for a numbers table, including generating a sequence of date and time values. I would strongly recommend having a permanent one in the database.

    Cheers,
    Itzik]

You must log on before posting a comment.

Are you a new visitor? Register Here