| Executive Summary: The GROUP BY clause is a powerful feature of T-SQL. Depending on the level of detail you need in your reports, you can use the GROUP BY clause in SELECT statements to group data by values in one or more columns. You can even use the HAVING clause to further refine your reports. |
You can begin to tap into the true power of T-SQL by using the GROUP BY clause in SELECT statements. Grouping data lets you produce reports that answer complex questions instead of reports that answer only basic questions. For example, with the GROUP BY clause, you can produce a report that answers the question “What is the average size of the bonus paid to each employee over the past 10 years?” instead of “What is the average size of the bonus paid out to employees?”
Depending on the level of detail you need in your reports, you can use the GROUP BY clause to group data by values in one or more columns. You can even use the HAVING clause to further refine your reports. Before I show you how to do so, though, you might want to create and populate a couple of tables so you can follow along.
The Prerequisites
To help demonstrate grouping, I created two tables: Bonus and MovieReview. The Bonus table contains the bonus payments given to eight employees in the past 10 years. This table contains three columns: EmployeeID, Amount, and PaymentDate. The MovieReview table contains the ratings that the five employees have given to movies they’ve watched in their spare time. This table contains four columns: EmployeeID, Genre, MovieName, and Stars. The Stars field specifies the movie’s rating, where 1 star is the worst rating and 5 is the best rating.
You can create and populate these tables by following these steps:
- Download the CodeToCreateBonusTable.sql, CodeToPopulateBonusTable.sql, CodeTo
CreateMovieReviewTable.sql, and CodeTo PopulateMovieReviewTable.sql files. Go to
www.sqlmag.com, enter 98711 in the InstantDoc ID text box, click Go, then click the 98711.zip hotlink.
- Create the Bonus table. Open SQL Server 2005’s SQL Server Management Studio (SSMS) or SQL Server 2000’s Query Analyzer and copy the code in CodeToCreateBonusTable.sql into the query window. In the first line of code, change MyDB to the name of your database. Execute the code.
- Populate the Bonus table by running CodeToPopulateBonusTable.sql in SSMS or Query Analyzer.
- Create the MovieReview table. Copy the code in CodeToCreateMovieTable.sql into SSMS’s or Query Analyzer’s query window. In the first line of code, change MyDB to the name of your database. Execute the code.
- Populate the MovieReview table by running CodeToPopulateMovieReviewTable.sql in SSMS or Query Analyzer.
Grouping Data Using One Column
I remember spending hours writing COBOL programs to produce reports. They took even longer to debug. Today, it takes me only a few minutes to crank out the equivalent T-SQL code to produce similar reports, thanks in part to the GROUP BY clause.
When you use the GROUP BY clause in a SELECT statement, two things happen:
- GROUP BY uses the grouping criterion you specify to group the data being returned by the SELECT query. Typically, the grouping criterion is a column, in which case GROUP BY groups the data into the possible values in that column. For example, the EmployeeID column in the Bonus table has eight possible values (1 through 8), so GROUP BY would group the data being returned by the SELECT query into those eight groups. Similarly, the Stars column in the MovieReview table has five possible values (1 through 5), so GROUP BY would group the data into those five groups. GROUP BY returns only one row for each possible group. For example, GROUP BY would return eight rows when you group by the EmployeeID column and five rows when you group by the Stars column, assuming there is data in each group.
- After GROUP BY is done grouping the data, the aggregate function specified in the SELECT query is performed on each group rather than on the entire result set. So, for example, if the SELECT query specifies to use the AVG function to get the average of the values in the Amount column in the Bonus table and you grouped the data by the EmployeeID column, individual averages will be calculated for each of the eight groups instead of one overall average.
These concepts can be confusing for people new to T-SQL, so let’s take a look at a few examples. Note that in trying to keep these examples as straightforward as possible, I’m minimizing the use of the T-SQL concepts that I haven’t yet covered. People who are more familiar with T-SQL might question the column definitions I have chosen or the lack of differential integrity in the tables. Those concepts, while important, aren’t relevant to what I’m covering here. I’ll explore those topics in future lessons. Until then, please bear with me.
Suppose you need to query the Bonus table to determine the total bonus amount paid to each employee over the past 10 years. In Lesson 3, I showed you how to use the SUM function in a SELECT statement to obtain the total of all the values in a specified column. If you were to use this function with the Amount column in the Bonus table, you’d get a single dollar figure representing the total amount that the company paid out in bonuses for the past 10 years. To get the total bonus amount paid to each employee over the past 10 years, you can use a GROUP BY clause in which the specified column is EmployeeID:
SELECT EmployeeID,
SUM(Amount) AS 'Total Bonus'
FROM Bonus
GROUP BY EmployeeID
As you can see in the results in Figure 1, employee 6 has the lowest total bonus payout. If you cross reference this result with the data in the Employee table created in Lesson 3, you’ll see that employee 6 is Napoleon Lawrence, one of several employees most recently hired in 2006. You’ll also see that Napoleon has the lowest salary of all the employees. Thus, in all likelihood, his total bonus payout is less than the other employees because he hasn’t been with the company that long and his salary is low. (Typically, employees with lower salaries receive lower bonuses.)
Now let’s go a bit further and determine the average bonus per employee along with the number of bonuses paid to each employee. You can do this by using the AVG and COUNT functions and grouping their results by EmployeeID:
SELECT EmployeeID,
AVG(Amount) AS 'Average Bonus',
COUNT(Amount) AS 'Bonuses Paid'
FROM Bonus
GROUP BY EmployeeID
As the results in Figure 2 show, Napoleon (employee 6) has indeed earned only two bonuses and his average payout is $1,000, which is significantly lower than all the other employees.
Now let’s compare each employee’s average bonus with the overall average paid out by the company. You first need to determine the overall average of the bonuses paid out by the company:
SELECT AVG(Amount)
AS 'Corporate Average'
FROM Bonus
The result is $7787.50. Next, you need to plug this corporate average into a calculation that determines how each employee’s average bonus compares with the corporate average in terms of percentage:
SELECT EmployeeID,
AVG(Amount) AS 'Average Bonus',
7787.50 AS 'Corporate Average',
(AVG(Amount) - (7787.50)) / (7787.50) * 100
AS '% Higher/Lower than Corp Avg'
FROM Bonus
GROUP BY EmployeeID
Figure 3 shows the results. Note that if you wanted to calculate the corporate average on the fly, you could use several subqueries in the SELECT statement. Or even better, you could calculate the corporate average and store the result in a local variable. You could then reference this variable in the SELECT statement instead of hard-coding the corporate average. However, this is an advanced topic that I’ll cover in a later lesson.
Prev. page  
[1]
2
next page