The second query executes significantly faster than the first query. I was recently involved in a customer deployment in which the NONEMPTYCROSSJOIN() function changed a query's performance from 15 minutes to about 3 seconds. That's compelling! Notice that I said the second query is nearly equivalent to the first. The difference between the queries introduces another challenge for your data analysis. The NONEMPTYCROSSJOIN() function doesn't include calculated members in the result set, as a typical CROSSJOIN() function would. Because of this difference, the second query might return a different set of top 10 items than the first query does. My company has had several clients whose cubes were large enough to require the NONEMPTYCROSSJOIN() function for acceptable performance, but the fact that NONEMPTYCROSSJOIN() removes calculated members was a serious problem for those clients because they risked losing calculated dimension members that were added after the cube was created.
As BOL describes it, the NONEMPTYCROSSJOIN() function in the second query is conceptually equivalent to the following MDX expression:
FILTER( CROSSJOIN( Products.Name.Members,
SalesReps.Name.Members ), NOT ISEMPTY(
California ) )
However, the two methods aren't exactly equivalent. The syntax of this example implies that the cross join occurs first, then the query eliminates empty items by specifying that the items contain California sales data. But what actually happens is that Analysis Services retrieves the non-empty portion of the cube for California, then cross-joins the products and sales representatives that exist in that portion of the cube. Performing the NONEMPTYCROSSJOIN() function in this order gives the function a significant performance advantage over filtering the items after the cross join. Because cubes are generally sparse, Analysis Services stores only the non-null values; and when the client-side driver, PivotTable Service, requests data from the server, it receives only non-null blocks of cube data. NONEMPTYCROSSJOIN() takes advantage of this method of storing and communicating data in Analysis Services. If you think about it, creating a long list of empty entries with the CROSSJOIN() function only to turn around and eliminate them with the FILTER() function is a giant waste of CPU time.
So what if you're developing an OLAP application and you want the improved performance of NONEMPTYCROSSJOIN(), but you don't want to risk losing any calculated members that might be in the cube? I spoke with the architect of Analysis Services, Amir Netz, about the problem, and he identified a couple of solutions.
The first approach is to determine whether the sets identify any calculated members before the NONEMPTYCROSSJOIN() function. If the sets include calculated members, fall back to using CROSSJOIN() with the FILTER() function. For example, consider the following generic function:
NONEMPTYCROSSJOIN( <set1>, <set2>,
<filter criteria>, 2 )
First, you can execute the following two queries and compare the counts they return:
COUNT( <set1> )
COUNT( STRIPCALCULATEDMEMBERS( <set1> ) )
Then, repeat the two queries for <set2>. If the counts are different for either set, the result of your cross-join query might include calculated members, and you must use the following filter syntax with the CROSSJOIN() function to guarantee the correct result:
FILTER( CROSSJOIN( <set1>, <set2> ),
NOT ISEMPTY(
<filter criteria> ) )
This approach does the best job of determining when you can use NONEMPTYCROSSJOIN(), but it also incurs the overhead of running four extra queries. These queries on a single dimension should execute quickly, but always performance-test your application before putting it into production. If running four extra queries for every query that includes a cross join makes you nervous, a more conservative approach exists.
The more conservative approach is to see whether any calculated members exist in the dimensions that the cross join references. You can check for calculated members programmatically by using the OpenSchema() function call in ADO MD. Request the list of dimension members from the MDSCHEMA_MEMBERS rowset by using the member type MDMEMBER_TYPE_FORMULA. If the request returns any members, you know the dimension contains calculated members and the set definitions included in the cross join might include a calculated member. This approach isn't exact and might cause you to avoid using the NONEMPTYCROSSJOIN() function more often than you need to. But the determination that uses OpenSchema is faster than using four extra queries and thus adds less overhead.
I recommend choosing your approach based on which situation you believe will occur more frequently in your environment. For example, if you know that most cross-join situations in your application will benefit from a change to a NONEMPTYCROSSJOIN(), you might want to use the first approach. However, if you believe that the large dimensions that can benefit from the NONEMPTYCROSSJOIN() optimization don't (and won't) have any calculated members, the second approach is more efficient; and the second approach still works correctly if a calculated member is added.
I hope that the next release of Analysis Services includes a NONEMPTYCROSSJOIN() function that doesn't strip calculated members. Meanwhile, if you're developing analysis applications based on large, sparse data sets, consider using one of these approaches to optimize your cross joins.
Finally, if you're looking for a challenge, check out the October MDX Puzzle on the Web. Go to http://www.sqlmag.com and enter 26327 in the InstantDoc ID text box.
End of Article
Prev. page
1
[2]
next page -->