5  Aggregating with GROUP BY

5.1 Learning Objectives

By the end of this chapter, students will be able to:

  • Write GROUP BY queries to compute aggregate summaries for each unique value in one or more columns.
  • Apply the five core aggregate functions (COUNT, SUM, AVG, MIN, MAX) to summarize data within groups.
  • Distinguish between WHERE (filters rows before grouping) and HAVING (filters groups after aggregation) and apply each correctly.
  • Use the FILTER clause to compute conditional aggregates without restructuring the query.
  • Apply ROLLUP to generate subtotals and a grand total in a single query pass.
  • Apply CUBE to generate all possible subtotal combinations for a set of grouping columns.
  • Use COALESCE and GROUPING() to label NULL subtotal rows in ROLLUP and CUBE output.
  • Write GROUPING SETS and explain its relationship to ROLLUP and CUBE as a generalization of both.
  • Diagnose errors caused by referencing non-aggregated, non-grouped columns in a SELECT with GROUP BY.

The previous chapters covered queries that return one row per row in the table. Every SELECT statement you have written so far has been a one-in, one-out operation: each row that passes the WHERE filter produces exactly one row in the result.

GROUP BY changes that. It collapses multiple rows into a single summary row, one per group. Instead of asking “what are the oscar nominations for each actor,” you can ask “what is the average oscar nomination count per genre.” Instead of listing every actor, you get one row per genre with a summary value computed across all actors in that group.

This is one of the most frequently used features in SQL for data analysis, and it pairs directly with the aggregate functions introduced briefly in the SELECT chapter: COUNT, SUM, AVG, MIN, and MAX.

5.2 Aggregate Functions

An aggregate function takes a set of rows as input and returns a single value. You have already seen COUNT(*) used on its own. When combined with GROUP BY, aggregate functions compute a separate result for each group rather than for the entire table.

The five core aggregate functions are:

Function Returns
COUNT(*) Number of rows
COUNT(column) Number of non-NULL values in the column
SUM(column) Total of all values
AVG(column) Arithmetic mean of all values
MIN(column) Smallest value
MAX(column) Largest value

Before using GROUP BY, it is worth seeing these functions applied to the entire table, without any grouping:

SELECT
    COUNT(*)                              AS total_actors,
    SUM(oscar_wins)                       AS total_wins,
    ROUND(AVG(oscar_nominations), 1)      AS avg_nominations,
    MAX(total_box_office_usd) / 1000000   AS highest_box_office_millions
FROM actors;
1 records
total_actors total_wins avg_nominations highest_box_office_millions
200 142 2.9 9200

This returns a single row summarizing all 200 actors. GROUP BY lets you compute that same summary separately for each value of a categorical column.

5.3 The GROUP BY Clause

SELECT   grouping_column, aggregate_function(column)
FROM     table_name
GROUP BY grouping_column;

The GROUP BY clause comes after WHERE and before ORDER BY. The column you group by must appear in the SELECT list. Every column in SELECT that is not inside an aggregate function must appear in GROUP BY — this is a hard rule PostgreSQL enforces.

The following query counts actors and totals Oscar wins for each primary genre:

SELECT
    primary_genre,
    COUNT(*)        AS actor_count,
    SUM(oscar_wins) AS total_wins
FROM actors
GROUP BY primary_genre
ORDER BY actor_count DESC;
8 records
primary_genre actor_count total_wins
Drama 135 115
Action 22 5
Comedy 20 11
Musical 7 5
Thriller 7 4
Romance 5 1
Horror 3 1
Sci-Fi 1 0

The result has one row per genre. Drama dominates with 135 actors and 115 wins. Action comes second by count but its wins are comparatively low — a reflection of the Academy’s historical preference for dramatic performances over action roles.

5.3.1 Grouping by Multiple Columns

You can group by more than one column by listing them in GROUP BY, separated by commas. The database creates one group per unique combination of those columns.

SELECT
    primary_genre,
    sex,
    COUNT(*)                            AS actor_count,
    ROUND(AVG(oscar_nominations), 1)    AS avg_nominations
FROM actors
GROUP BY primary_genre, sex
ORDER BY primary_genre, sex;
14 records
primary_genre sex actor_count avg_nominations
Action F 3 0.3
Action M 19 0.8
Comedy F 11 2.7
Comedy M 9 2.2
Drama F 70 3.5
Drama M 65 3.6
Horror F 3 2.3
Musical F 4 2.3
Musical M 3 2.0
Romance F 4 1.3
Romance M 1 0.0
Sci-Fi F 1 3.0
Thriller F 3 2.0
Thriller M 4 1.8

Each row now represents a unique genre-sex pairing. This lets you compare, for example, whether male and female Drama actors differ in their average nomination counts.

5.3.2 Conditional Aggregation with FILTER

The FILTER clause attaches a condition to a single aggregate function, telling it to count (or sum, or average) only the rows that satisfy that condition. This lets you compute multiple conditional summaries in the same query without writing separate subqueries.

aggregate_function(expression) FILTER (WHERE condition)

The following query counts all actors per genre, then adds separate columns for those who have won at least one Oscar, those with an honorary Oscar, and those who are still living — all in a single pass over the table:

SELECT
    primary_genre,
    COUNT(*)                                        AS total_actors,
    COUNT(*) FILTER (WHERE oscar_wins > 0)          AS competitive_winners,
    COUNT(*) FILTER (WHERE has_honorary_oscar)      AS honorary_recipients,
    COUNT(*) FILTER (WHERE death_date IS NULL)      AS still_living
FROM actors
GROUP BY primary_genre
ORDER BY total_actors DESC;
8 records
primary_genre total_actors competitive_winners honorary_recipients still_living
Drama 135 84 10 82
Action 22 5 2 12
Comedy 20 10 2 8
Musical 7 4 2 3
Thriller 7 2 0 7
Romance 5 1 1 4
Horror 3 1 0 1
Sci-Fi 1 0 0 1

Each FILTER condition is evaluated independently. An actor can satisfy multiple conditions and be counted in multiple filtered columns simultaneously.

FILTER is specific to PostgreSQL and a handful of other databases. In databases that do not support it, the equivalent technique is to use SUM with a CASE expression: SUM(CASE WHEN oscar_wins > 0 THEN 1 ELSE 0 END). The FILTER form is cleaner and more readable when it is available.

5.4 Filtering Groups with HAVING

WHERE filters rows before grouping. Once groups are formed, you cannot use WHERE to filter them — the groups do not exist yet when WHERE runs. HAVING is the clause that filters groups after they have been computed.

SELECT   grouping_column, aggregate_function(column)
FROM     table_name
GROUP BY grouping_column
HAVING   aggregate_condition;

The following query returns only birth countries that have produced at least five actors in the dataset:

SELECT
    birth_country,
    COUNT(*)        AS actor_count,
    SUM(oscar_wins) AS total_wins
FROM actors
GROUP BY birth_country
HAVING COUNT(*) >= 5
ORDER BY actor_count DESC;
7 records
birth_country actor_count total_wins
USA 100 96
UK 35 24
France 14 3
Italy 9 3
Australia 8 5
Sweden 6 4
Spain 5 2

Without HAVING COUNT(*) >= 5, this query would return all 19 birth countries, including several with only one or two actors. HAVING lets you focus on groups that meet a meaningful threshold.

A useful way to remember the difference: WHERE filters rows before grouping; HAVING filters groups after grouping. If the condition references an aggregate function like COUNT() or AVG(), it belongs in HAVING. If it references a plain column value, it belongs in WHERE.

5.4.1 Combining WHERE and HAVING

WHERE and HAVING can appear in the same query. WHERE narrows the rows that are fed into the grouping step, and HAVING then narrows the resulting groups.

The following query asks: among actors who are still living, which birth countries have produced at least three of them, and what is their combined box office?

SELECT
    birth_country,
    COUNT(*)                                        AS living_actors,
    SUM(total_box_office_usd) / 1000000             AS combined_box_office_millions
FROM actors
WHERE death_date IS NULL
GROUP BY birth_country
HAVING COUNT(*) >= 3
ORDER BY combined_box_office_millions DESC;
9 records
birth_country living_actors combined_box_office_millions
USA 54 95160
UK 20 39630
Australia 7 26470
India 4 7540
Spain 5 5463
France 9 5000
Mexico 3 3060
Japan 3 1560
Italy 4 1125

The WHERE death_date IS NULL clause excludes deceased actors before any grouping occurs. HAVING COUNT(*) >= 3 then keeps only countries where at least three living actors remain.

5.5 Subtotals with ROLLUP and CUBE

Standard GROUP BY returns one row per unique combination of the grouping columns, with no subtotals. ROLLUP and CUBE are extensions that automatically add summary rows to the result.

5.5.1 ROLLUP

ROLLUP generates subtotals along a hierarchy. GROUP BY ROLLUP(a, b) produces:

  • One row per unique (a, b) combination
  • One subtotal row per unique value of a (aggregating across all values of b)
  • One grand total row (aggregating across everything)

The subtotal and grand total rows have NULL in the columns that were rolled up. By default, PostgreSQL sorts NULL values after all non-NULL values in ascending order and before them in descending order. Because ROLLUP results contain meaningful NULL values in the grouping columns, the ORDER BY clause uses NULLS LAST to push the subtotal and grand total rows to the bottom regardless of sort direction, keeping the detail rows together at the top.

The following query counts actors by birth country and genre for three countries, with subtotals per country and a grand total:

SELECT
    birth_country,
    primary_genre,
    COUNT(*) AS actors
FROM actors
WHERE birth_country IN ('USA', 'UK', 'France')
GROUP BY ROLLUP (birth_country, primary_genre)
ORDER BY birth_country NULLS LAST, primary_genre NULLS LAST;
22 records
birth_country primary_genre actors
France Action 3
France Comedy 1
France Drama 7
France Romance 2
France Thriller 1
France NULL 14
UK Action 3
UK Comedy 3
UK Drama 23
UK Musical 2
UK Romance 1
UK Thriller 3
UK NULL 35
USA Action 8
USA Comedy 14
USA Drama 68
USA Horror 2
USA Musical 5
USA Sci-Fi 1
USA Thriller 2
USA NULL 100
NULL NULL 149

The rows where primary_genre is NULL are the per-country subtotals — for example, the row with birth_country = 'UK' and primary_genre = NULL shows the total actor count across all genres for UK actors. The final row where both columns are NULL is the grand total.

The NULL values in ROLLUP and CUBE results represent subtotal aggregations, not missing data. This can cause confusion when your grouping columns also contain genuine NULL values. The GROUPING() function can distinguish the two cases, but for most analytical work the context makes the distinction clear.

5.5.2 CUBE

CUBE generates subtotals for every possible combination of the grouping columns, not just the hierarchical rollup. GROUP BY CUBE(a, b) produces rows for (a, b), (a), (b), and () — all four combinations of including or excluding each column.

The following query uses two countries and three genres to keep the output readable:

SELECT
    birth_country,
    primary_genre,
    COUNT(*) AS actors
FROM actors
WHERE birth_country IN ('USA', 'UK')
  AND primary_genre IN ('Drama', 'Action', 'Comedy')
GROUP BY CUBE (birth_country, primary_genre)
ORDER BY birth_country NULLS LAST, primary_genre NULLS LAST;
12 records
birth_country primary_genre actors
UK Action 3
UK Comedy 3
UK Drama 23
UK NULL 29
USA Action 8
USA Comedy 14
USA Drama 68
USA NULL 90
NULL Action 11
NULL Comedy 17
NULL Drama 91
NULL NULL 119

Compared to ROLLUP, the result includes two additional rows: one per genre summing across both countries (where birth_country is NULL but primary_genre is not), and one grand total. CUBE is most useful when there is no natural hierarchy and you want to slice the data every possible way in a single query.

5.5.3 Labeling Subtotals with COALESCE and GROUPING

The NULL values in ROLLUP and CUBE output are technically correct but can look awkward in a report. COALESCE addresses this: it takes two or more arguments and returns the first one that is not NULL.

COALESCE(expression, fallback_value)

There is a catch, however. Once COALESCE replaces the NULLs in the SELECT list, ORDER BY ... NULLS LAST stops working — the alias now resolves to the COALESCE result, which contains no NULLs, so labels like 'Grand Total' sort alphabetically among the real values rather than at the end.

The solution is GROUPING(column), a companion function designed specifically for ROLLUP and CUBE results. It returns 1 when a column’s NULL is a rollup marker and 0 when it holds a real grouping value. Sorting by GROUPING(column) first places subtotal rows after the detail rows they summarize, with no dependence on NULLs:

SELECT
    COALESCE(birth_country, 'Grand Total')  AS birth_country,
    COALESCE(primary_genre, 'All Genres')   AS primary_genre,
    COUNT(*) AS actors
FROM actors
WHERE birth_country IN ('USA', 'UK', 'France')
GROUP BY ROLLUP (birth_country, primary_genre)
ORDER BY GROUPING(birth_country), birth_country,
         GROUPING(primary_genre), primary_genre;
22 records
birth_country primary_genre actors
France Action 3
France Comedy 1
France Drama 7
France Romance 2
France Thriller 1
France All Genres 14
UK Action 3
UK Comedy 3
UK Drama 23
UK Musical 2
UK Romance 1
UK Thriller 3
UK All Genres 35
USA Action 8
USA Comedy 14
USA Drama 68
USA Horror 2
USA Musical 5
USA Sci-Fi 1
USA Thriller 2
USA All Genres 100
Grand Total All Genres 149

For each country, GROUPING(primary_genre) is 0 for the detail rows and 1 for the per-country subtotal row, so the subtotal always appears after the genres it summarizes. GROUPING(birth_country) is 1 only for the grand total row, pushing it to the very end. COALESCE then handles the display without interfering with the ordering.

To reverse that arrangement and place each subtotal row before its detail rows, sort GROUPING(primary_genre) descending. Everything else stays the same:

SELECT
    COALESCE(birth_country, 'Grand Total')  AS birth_country,
    COALESCE(primary_genre, 'All Genres')   AS primary_genre,
    COUNT(*) AS actors
FROM actors
WHERE birth_country IN ('USA', 'UK', 'France')
GROUP BY ROLLUP (birth_country, primary_genre)
ORDER BY GROUPING(birth_country), birth_country,
         GROUPING(primary_genre) DESC, primary_genre;
22 records
birth_country primary_genre actors
France All Genres 14
France Action 3
France Comedy 1
France Drama 7
France Romance 2
France Thriller 1
UK All Genres 35
UK Action 3
UK Comedy 3
UK Drama 23
UK Musical 2
UK Romance 1
UK Thriller 3
USA All Genres 100
USA Action 8
USA Comedy 14
USA Drama 68
USA Horror 2
USA Musical 5
USA Sci-Fi 1
USA Thriller 2
Grand Total All Genres 149

The All Genres subtotal for each country now leads that country’s block, followed by its individual genre rows. Which layout to use depends on how the report will be read; leading with the subtotal works well when readers care primarily about the summary and scan the detail rows only for context.

COALESCE is not limited to ROLLUP and CUBE output. It is useful anywhere a column may contain NULLs and a default value is preferable in the result: substituting 0 for a NULL box office figure, 'Living' for a NULL death date, or a placeholder string for any optional text field.

5.5.4 GROUPING SETS

ROLLUP and CUBE are both shortcuts for GROUPING SETS, which is the underlying primitive. GROUPING SETS lets you specify exactly which combinations of grouping columns should produce rows in the output. Each set in the list is one combination; every other combination is omitted.

The two shortcuts expand to their GROUPING SETS equivalents like this:

-- These are identical:
GROUP BY ROLLUP (a, b)
GROUP BY GROUPING SETS ((a, b), (a), ())

-- These are identical:
GROUP BY CUBE (a, b)
GROUP BY GROUPING SETS ((a, b), (a), (b), ())

ROLLUP always gives you the full hierarchy from most-specific to grand total. CUBE always gives you every possible combination. GROUPING SETS lets you pick exactly the combinations you want, which matters when neither shortcut fits.

The following query returns actor counts at two levels: the (primary_genre, sex) detail level and the grand total, skipping the intermediate per-genre and per-sex subtotals that CUBE would add:

SELECT
    primary_genre,
    sex,
    COUNT(*) AS actors
FROM actors
WHERE primary_genre IN ('Drama', 'Action', 'Comedy')
GROUP BY GROUPING SETS (
    (primary_genre, sex),
    ()
)
ORDER BY primary_genre NULLS LAST, sex NULLS LAST;
7 records
primary_genre sex actors
Action F 3
Action M 19
Comedy F 11
Comedy M 9
Drama F 70
Drama M 65
NULL NULL 177

The row where both columns are NULL is the grand total. Every other row is a specific genre-sex pairing. There are no per-genre or per-sex subtotals, because those sets were not listed. CUBE on these two columns would have produced eight additional rows; GROUPING SETS produces exactly what was requested.

5.6 The Logical Order of Execution

The full logical order of a query with aggregation is:

  1. FROM — identify the table
  2. WHERE — filter individual rows
  3. GROUP BY — form groups
  4. Aggregate functions — compute summary values per group
  5. HAVING — filter groups
  6. SELECT — compute and name output columns
  7. ORDER BY — sort the result
  8. LIMIT — trim to the requested row count

This order has a practical consequence: aggregate functions and column aliases defined in SELECT cannot be referenced in HAVING, because HAVING runs before SELECT. You must repeat the aggregate expression:

-- This fails: 'avg_noms' is not defined when HAVING runs
SELECT primary_genre, ROUND(AVG(oscar_nominations), 1) AS avg_noms
FROM actors
GROUP BY primary_genre
HAVING avg_noms > 2;

-- This works: repeat the expression
SELECT primary_genre, ROUND(AVG(oscar_nominations), 1) AS avg_noms
FROM actors
GROUP BY primary_genre
HAVING AVG(oscar_nominations) > 2;

5.7 NULL Values in Aggregation

Aggregate functions handle NULL values consistently: they ignore them. AVG(column) computes the mean of only the non-NULL values. SUM(column) adds only the non-NULL values. COUNT(column) counts only the non-NULL values. COUNT(*) is the exception — it counts every row regardless of NULLs.

The actors table has no NULL values in the numeric columns used in this chapter, but the behavior is worth knowing. If a new actor were added with total_box_office_usd left NULL, AVG(total_box_office_usd) would exclude that actor from the average entirely rather than treating the NULL as zero.

5.8 Putting It All Together

The following query combines all the clauses covered in this chapter to answer a specific question: for each primary genre with more than two actors born outside the USA, what are the actor count, total Oscar wins, and average box office in millions?

SELECT
    primary_genre,
    COUNT(*)                                         AS actor_count,
    SUM(oscar_wins)                                  AS total_wins,
    ROUND(AVG(total_box_office_usd) / 1000000, 0)    AS avg_box_office_millions
FROM actors
WHERE birth_country <> 'USA'
GROUP BY primary_genre
HAVING COUNT(*) > 2
ORDER BY actor_count DESC;
5 records
primary_genre actor_count total_wins avg_box_office_millions
Drama 67 38 1080
Action 14 2 2099
Comedy 6 1 388
Romance 5 1 1114
Thriller 5 2 2110

5.9 Exercises

5.9.1 Reflection

1. Explain the difference between WHERE and HAVING. Write an example of a condition that must go in WHERE and one that must go in HAVING, and explain why each belongs where it does.


2. A student writes the following query intending to find genres where the average number of Oscar nominations exceeds 2, but PostgreSQL returns an error. Identify the problem and fix it.

SELECT primary_genre, ROUND(AVG(oscar_nominations), 1) AS avg_noms
FROM actors
GROUP BY primary_genre
HAVING avg_noms > 2;

3. Explain why COUNT(*) and COUNT(birth_name) would return different values when used inside a GROUP BY query. In what situation would they return the same value for every group?


5.9.2 Coding

4. Write a query that counts the number of actors from each birth country. Return only countries with at least two actors, sorted from most to fewest.


5. Write a query that shows, for each primary genre, the total number of Oscar nominations and the total number of Oscar wins. Include a third column that expresses wins as a percentage of nominations, rounded to one decimal place. Sort by total nominations descending.


6. Write a query that returns the birth country, actor count, and average height in centimeters (rounded to one decimal place) for all birth countries that have produced at least three actors. Sort by average height descending.


7. Write a query that shows the number of living actors and the number of deceased actors for each primary genre using the FILTER clause. Include a third column with the total actor count per genre. Filter to genres that have at least five actors total. Sort by total actors descending.


8. Write a query that finds the single birth country with the highest total box office across all its actors. Return the country name and the total in billions, rounded to two decimal places.


9. Write a query using ROLLUP that shows actor counts by primary_genre and sex, with subtotals per genre and a grand total. Limit the result to actors born in the USA. Order by genre (NULLs last) then sex (NULLs last).


10. Write a query using CUBE on primary_genre and sex that produces counts for every combination of those two dimensions, including the per-genre totals, per-sex totals, and the grand total. Limit to Drama, Action, and Comedy genres. How many rows does the result contain?