A CASE expression in the SELECT list
determines whether interpolation is required.
If the input mark (@mark) is equal to one of
the existing marks in the range, the CASE
expression returns the corresponding percentile
rank; otherwise, the CASE expression
uses the aforementioned formula to apply
interpolation. The code in Listing 4 returns the
value 0.766667 (76.6667 percent).
With minor revisions to the code in Listing 4,
you can calculate the percentile ranks of
multiple values stored in a table. Listing 5
shows how to do so. The code populates a table
variable called @MyMarks with a few random
marks. The outer query in Listing 5 joins the
table variable with PctRankRanges to apply
the calculation to each mark from the table
variable, as opposed to a single mark in a scalar
variable. When I ran the code in Listing 5 on
my system, I got the output shown in Table
3. Note that the calculation returns
NULLs for marks that are lower than the
minimum or higher than the maximum in the
Marks table.
Percentiles
As I explained, a percentile is a value below
which a certain percent of values fall. I’ll
describe the way Excel calculates the PERCENTILE
function, then I’ll explain how to implement
similar logic with T-SQL.
Given an ordered set of values v[1], v[2], …,
v[cnt], and a request for percentile pct, Excel’s
PERCENTILE function calculates the percentile
for pct as follows:
Calculate n as pct * (cnt - 1) + 1
Using the marks from our Marks table as an
example, the ordered set of marks is: v[1] = 40,
v[2] = 55, v[3] = 70, v[4] = 70, v[5] = 70, v[6] = 75,
v[7] = 75, v[8] = 80, v[9] = 80, v[10] = 95,
v[11] = 100. In our case, cnt = 11. For pct = 0.95
(95th percentile), n = 0.95 * (11 - 1) + 1 = 10.5.
Now, let k be the integer part of n and d be
the decimal part of n. In our case, k = 10 and
d = 0.5. If d = 0, return v[k]. That is, when n is
whole (n = k), return the kth value in the ordered
set. If d <> 0, interpolate to produce the
percentile as follows:
When k = cnt, return v[k-1] + d * (v[k] - v[k-1])
When k < cnt, return v[k] + d * (v[k+1] - v[k])
In our case, k < cnt (10 < 11); therefore, the
calculation is v[k] + d * (v[k+1] - v[k]):
95 + 0.5 * (100 - 95) = 97.5.
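The arithmetic above can be checked with a few scalar variables. This is only a sketch: the variable names are mine, not taken from the listings, and the two marks at positions 10 and 11 (95 and 100) are typed in by hand rather than read from the Marks table.

```sql
-- Worked example: 95th percentile of the 11 sample marks.
-- All variable names here are illustrative assumptions.
DECLARE
  @pct AS DECIMAL(5, 4),
  @cnt AS INT,
  @n   AS DECIMAL(9, 4),
  @k   AS INT,
  @d   AS DECIMAL(9, 4);

SET @pct = 0.95;
SET @cnt = 11;
SET @n   = @pct * (@cnt - 1) + 1;  -- position in the ordered set: 10.5
SET @k   = FLOOR(@n);              -- integer part of n: 10
SET @d   = @n - @k;                -- decimal part of n: 0.5

-- k < cnt, so interpolate between positions k and k + 1
-- (marks 95 and 100, hardcoded from the sample data)
SELECT @n AS n, @k AS k, @d AS d,
       95 + @d * (100 - 95) AS percentile;
```

The percentile column should evaluate to 97.5, the same value as the manual calculation. DECIMAL is used instead of FLOAT so that the decimal part of n is exact.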
The code in Listing 6 implements this logic and
calculates the percentiles for a whole set of
percents represented by the CTE called Pcts.
Play with your own percents to see how the
output changes. The CTE called CntMarks has
a single row with the value cnt representing
the count of rows in the Marks table. The query
defining the CTE called PctCnt_n calculates the
value n described earlier for each percent from
Pcts. The query in the CTE called PctCnt_ndk
breaks n into its integer and decimal components,
k and d, respectively.
The query defining the CTE called MarksRn
simply calculates row numbers (column rn) for
the marks from the Marks table.
Think of the row numbers as the positions in
the ordered set (1, 2, …, cnt), and the
corresponding marks as v[1], v[2], …, v[cnt].
The outer query joins PctCnt_ndk and MarksRn,
matching to each percent from PctCnt_ndk the
relevant marks from MarksRn. Recall that when
d = 0, the relevant mark is v[k]; else, when
k = cnt, the two relevant marks are v[k-1] and
v[k]; else, when k < cnt, the relevant marks are
v[k] and v[k+1].
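In all three cases, for each percent, the
formulas for calculating the percentiles can be
generalized to:
MIN(mark) + d * (MAX(mark) - MIN(mark))
When d = 0, only one mark is relevant, so the
second term is zero and the expression returns
that mark. Otherwise, MIN(mark) and MAX(mark)
are the lower and higher of the two relevant
marks.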
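To see the generalized formula at work outside of Listing 6, here is a minimal, self-contained sketch for a single percent. It is not the code from Listing 6: the table variable @Marks, the variable @pct, and the CTE name Ndk are assumptions of mine, and the join simply picks positions k and k + 1 in every case, relying on d = 0 to cancel the second mark when n is whole. The row constructor in the INSERT statement assumes SQL Server 2008 or later.

```sql
-- Sample marks from the article, held in a hypothetical table variable
DECLARE @Marks TABLE (mark INT NOT NULL);

INSERT INTO @Marks (mark) VALUES
  (40), (55), (70), (70), (70), (75), (75), (80), (80), (95), (100);

-- Requested percent (0.95 = 95th percentile)
DECLARE @pct AS DECIMAL(5, 4);
SET @pct = 0.95;

WITH MarksRn AS
(
  -- Row numbers act as positions 1, 2, ..., cnt in the ordered set
  SELECT mark, ROW_NUMBER() OVER (ORDER BY mark) AS rn
  FROM @Marks
),
Ndk AS
(
  -- n = pct * (cnt - 1) + 1, split into integer part k and decimal part d
  SELECT FLOOR(n) AS k, n - FLOOR(n) AS d
  FROM (SELECT @pct * (COUNT(*) - 1) + 1 AS n FROM @Marks) AS C
)
SELECT MIN(M.mark) + N.d * (MAX(M.mark) - MIN(M.mark)) AS percentile
FROM Ndk AS N
  JOIN MarksRn AS M
    ON M.rn IN (N.k, N.k + 1)  -- simplification: always take positions k and k + 1
GROUP BY N.d;
```

With @pct = 0.95 this should return 97.5, matching the manual calculation. Setting @pct to 0.5 or 0.25 should return 75 and 70, the median and lower quartile of the sample marks.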
The output of the
code in Listing 6 is
shown in Table 4. As
you can see in the
output, the median
is the mark 75, the
lower quartile is the mark 70, and so on.
Note that Excel interpolates to calculate the
percentile when n is not whole. Another way to
calculate percentiles is to round n and return
the value at the rounded position in the ordered
set, but I’ll leave this method for you as an
exercise. [Editor’s Note: If you try this method,
please email us with your experiences. Send your
feedback to lpeters@penton.com.]
Statistically Speaking
You can easily use T-SQL to implement statistical
calculations such as percentile and percentile
rank. In addition, you’ll likely encounter
many other Excel-supported statistical calculations
and other types of calculations that are
handy for use in a database. Besides the usefulness
of such functions in statistical analysis,
trying to implement them with T-SQL is a great
exercise.