By the end of this chapter, students will be able to:
WriteGROUP BY queries to compute aggregate summaries for each unique value in one or more columns.
Apply the five core aggregate functions (COUNT, SUM, AVG, MIN, MAX) to summarize data within groups.
Distinguish between WHERE (filters rows before grouping) and HAVING (filters groups after aggregation) and apply each correctly.
Use the FILTER clause to compute conditional aggregates without restructuring the query.
ApplyROLLUP to generate subtotals and a grand total in a single query pass.
ApplyCUBE to generate all possible subtotal combinations for a set of grouping columns.
UseCOALESCE and GROUPING() to label NULL subtotal rows in ROLLUP and CUBE output.
WriteGROUPING SETS and explain its relationship to ROLLUP and CUBE as a generalization of both.
Diagnose errors caused by referencing non-aggregated, non-grouped columns in a SELECT with GROUP BY.
The previous chapters covered queries that return one row per row in the table. Every SELECT statement you have written so far has been a one-in, one-out operation: each row that passes the WHERE filter produces exactly one row in the result.
GROUP BY changes that. It collapses multiple rows into a single summary row, one per group. Instead of asking “what are the oscar nominations for each actor,” you can ask “what is the average oscar nomination count per genre.” Instead of listing every actor, you get one row per genre with a summary value computed across all actors in that group.
This is one of the most frequently used features in SQL for data analysis, and it pairs directly with the aggregate functions introduced briefly in the SELECT chapter: COUNT, SUM, AVG, MIN, and MAX.
5.2 Aggregate Functions
An aggregate function takes a set of rows as input and returns a single value. You have already seen COUNT(*) used on its own. When combined with GROUP BY, aggregate functions compute a separate result for each group rather than for the entire table.
The five core aggregate functions are:
Function
Returns
COUNT(*)
Number of rows
COUNT(column)
Number of non-NULL values in the column
SUM(column)
Total of all values
AVG(column)
Arithmetic mean of all values
MIN(column)
Smallest value
MAX(column)
Largest value
Before using GROUP BY, it is worth seeing these functions applied to the entire table, without any grouping:
SELECTCOUNT(*) AS total_actors,SUM(oscar_wins) AS total_wins,ROUND(AVG(oscar_nominations), 1) AS avg_nominations,MAX(total_box_office_usd) /1000000AS highest_box_office_millionsFROM actors;
1 records
total_actors
total_wins
avg_nominations
highest_box_office_millions
200
142
2.9
9200
This returns a single row summarizing all 200 actors. GROUP BY lets you compute that same summary separately for each value of a categorical column.
The GROUP BY clause comes after WHERE and before ORDER BY. The column you group by must appear in the SELECT list. Every column in SELECT that is not inside an aggregate function must appear in GROUP BY — this is a hard rule PostgreSQL enforces.
The following query counts actors and totals Oscar wins for each primary genre:
SELECT primary_genre,COUNT(*) AS actor_count,SUM(oscar_wins) AS total_winsFROM actorsGROUPBY primary_genreORDERBY actor_count DESC;
8 records
primary_genre
actor_count
total_wins
Drama
135
115
Action
22
5
Comedy
20
11
Musical
7
5
Thriller
7
4
Romance
5
1
Horror
3
1
Sci-Fi
1
0
The result has one row per genre. Drama dominates with 135 actors and 115 wins. Action comes second by count but its wins are comparatively low — a reflection of the Academy’s historical preference for dramatic performances over action roles.
5.3.1 Grouping by Multiple Columns
You can group by more than one column by listing them in GROUP BY, separated by commas. The database creates one group per unique combination of those columns.
SELECT primary_genre, sex,COUNT(*) AS actor_count,ROUND(AVG(oscar_nominations), 1) AS avg_nominationsFROM actorsGROUPBY primary_genre, sexORDERBY primary_genre, sex;
14 records
primary_genre
sex
actor_count
avg_nominations
Action
F
3
0.3
Action
M
19
0.8
Comedy
F
11
2.7
Comedy
M
9
2.2
Drama
F
70
3.5
Drama
M
65
3.6
Horror
F
3
2.3
Musical
F
4
2.3
Musical
M
3
2.0
Romance
F
4
1.3
Romance
M
1
0.0
Sci-Fi
F
1
3.0
Thriller
F
3
2.0
Thriller
M
4
1.8
Each row now represents a unique genre-sex pairing. This lets you compare, for example, whether male and female Drama actors differ in their average nomination counts.
5.3.2 Conditional Aggregation with FILTER
The FILTER clause attaches a condition to a single aggregate function, telling it to count (or sum, or average) only the rows that satisfy that condition. This lets you compute multiple conditional summaries in the same query without writing separate subqueries.
The following query counts all actors per genre, then adds separate columns for those who have won at least one Oscar, those with an honorary Oscar, and those who are still living — all in a single pass over the table:
SELECT primary_genre,COUNT(*) AS total_actors,COUNT(*) FILTER (WHERE oscar_wins >0) AS competitive_winners,COUNT(*) FILTER (WHERE has_honorary_oscar) AS honorary_recipients,COUNT(*) FILTER (WHERE death_date ISNULL) AS still_livingFROM actorsGROUPBY primary_genreORDERBY total_actors DESC;
8 records
primary_genre
total_actors
competitive_winners
honorary_recipients
still_living
Drama
135
84
10
82
Action
22
5
2
12
Comedy
20
10
2
8
Musical
7
4
2
3
Thriller
7
2
0
7
Romance
5
1
1
4
Horror
3
1
0
1
Sci-Fi
1
0
0
1
Each FILTER condition is evaluated independently. An actor can satisfy multiple conditions and be counted in multiple filtered columns simultaneously.
FILTER is specific to PostgreSQL and a handful of other databases. In databases that do not support it, the equivalent technique is to use SUM with a CASE expression: SUM(CASE WHEN oscar_wins > 0 THEN 1 ELSE 0 END). The FILTER form is cleaner and more readable when it is available.
5.4 Filtering Groups with HAVING
WHERE filters rows before grouping. Once groups are formed, you cannot use WHERE to filter them — the groups do not exist yet when WHERE runs. HAVING is the clause that filters groups after they have been computed.
The following query returns only birth countries that have produced at least five actors in the dataset:
SELECT birth_country,COUNT(*) AS actor_count,SUM(oscar_wins) AS total_winsFROM actorsGROUPBY birth_countryHAVINGCOUNT(*) >=5ORDERBY actor_count DESC;
7 records
birth_country
actor_count
total_wins
USA
100
96
UK
35
24
France
14
3
Italy
9
3
Australia
8
5
Sweden
6
4
Spain
5
2
Without HAVING COUNT(*) >= 5, this query would return all 19 birth countries, including several with only one or two actors. HAVING lets you focus on groups that meet a meaningful threshold.
A useful way to remember the difference: WHERE filters rows before grouping; HAVING filters groups after grouping. If the condition references an aggregate function like COUNT() or AVG(), it belongs in HAVING. If it references a plain column value, it belongs in WHERE.
5.4.1 Combining WHERE and HAVING
WHERE and HAVING can appear in the same query. WHERE narrows the rows that are fed into the grouping step, and HAVING then narrows the resulting groups.
The following query asks: among actors who are still living, which birth countries have produced at least three of them, and what is their combined box office?
The WHERE death_date IS NULL clause excludes deceased actors before any grouping occurs. HAVING COUNT(*) >= 3 then keeps only countries where at least three living actors remain.
5.5 Subtotals with ROLLUP and CUBE
Standard GROUP BY returns one row per unique combination of the grouping columns, with no subtotals. ROLLUP and CUBE are extensions that automatically add summary rows to the result.
5.5.1 ROLLUP
ROLLUP generates subtotals along a hierarchy. GROUP BY ROLLUP(a, b) produces:
One row per unique (a, b) combination
One subtotal row per unique value of a (aggregating across all values of b)
One grand total row (aggregating across everything)
The subtotal and grand total rows have NULL in the columns that were rolled up. By default, PostgreSQL sorts NULL values after all non-NULL values in ascending order and before them in descending order. Because ROLLUP results contain meaningful NULL values in the grouping columns, the ORDER BY clause uses NULLS LAST to push the subtotal and grand total rows to the bottom regardless of sort direction, keeping the detail rows together at the top.
The following query counts actors by birth country and genre for three countries, with subtotals per country and a grand total:
SELECT birth_country, primary_genre,COUNT(*) AS actorsFROM actorsWHERE birth_country IN ('USA', 'UK', 'France')GROUPBYROLLUP (birth_country, primary_genre)ORDERBY birth_country NULLSLAST, primary_genre NULLSLAST;
22 records
birth_country
primary_genre
actors
France
Action
3
France
Comedy
1
France
Drama
7
France
Romance
2
France
Thriller
1
France
NULL
14
UK
Action
3
UK
Comedy
3
UK
Drama
23
UK
Musical
2
UK
Romance
1
UK
Thriller
3
UK
NULL
35
USA
Action
8
USA
Comedy
14
USA
Drama
68
USA
Horror
2
USA
Musical
5
USA
Sci-Fi
1
USA
Thriller
2
USA
NULL
100
NULL
NULL
149
The rows where primary_genre is NULL are the per-country subtotals — for example, the row with birth_country = 'UK' and primary_genre = NULL shows the total actor count across all genres for UK actors. The final row where both columns are NULL is the grand total.
The NULL values in ROLLUP and CUBE results represent subtotal aggregations, not missing data. This can cause confusion when your grouping columns also contain genuine NULL values. The GROUPING() function can distinguish the two cases, but for most analytical work the context makes the distinction clear.
5.5.2 CUBE
CUBE generates subtotals for every possible combination of the grouping columns, not just the hierarchical rollup. GROUP BY CUBE(a, b) produces rows for (a, b), (a), (b), and () — all four combinations of including or excluding each column.
The following query uses two countries and three genres to keep the output readable:
SELECT birth_country, primary_genre,COUNT(*) AS actorsFROM actorsWHERE birth_country IN ('USA', 'UK')AND primary_genre IN ('Drama', 'Action', 'Comedy')GROUPBYCUBE (birth_country, primary_genre)ORDERBY birth_country NULLSLAST, primary_genre NULLSLAST;
12 records
birth_country
primary_genre
actors
UK
Action
3
UK
Comedy
3
UK
Drama
23
UK
NULL
29
USA
Action
8
USA
Comedy
14
USA
Drama
68
USA
NULL
90
NULL
Action
11
NULL
Comedy
17
NULL
Drama
91
NULL
NULL
119
Compared to ROLLUP, the result includes two additional rows: one per genre summing across both countries (where birth_country is NULL but primary_genre is not), and one grand total. CUBE is most useful when there is no natural hierarchy and you want to slice the data every possible way in a single query.
5.5.3 Labeling Subtotals with COALESCE and GROUPING
The NULL values in ROLLUP and CUBE output are technically correct but can look awkward in a report. COALESCE addresses this: it takes two or more arguments and returns the first one that is not NULL.
COALESCE(expression, fallback_value)
There is a catch, however. Once COALESCE replaces the NULLs in the SELECT list, ORDER BY ... NULLS LAST stops working — the alias now resolves to the COALESCE result, which contains no NULLs, so labels like 'Grand Total' sort alphabetically among the real values rather than at the end.
The solution is GROUPING(column), a companion function designed specifically for ROLLUP and CUBE results. It returns 1 when a column’s NULL is a rollup marker and 0 when it holds a real grouping value. Sorting by GROUPING(column) first places subtotal rows after the detail rows they summarize, with no dependence on NULLs:
SELECTCOALESCE(birth_country, 'Grand Total') AS birth_country,COALESCE(primary_genre, 'All Genres') AS primary_genre,COUNT(*) AS actorsFROM actorsWHERE birth_country IN ('USA', 'UK', 'France')GROUPBYROLLUP (birth_country, primary_genre)ORDERBYGROUPING(birth_country), birth_country,GROUPING(primary_genre), primary_genre;
22 records
birth_country
primary_genre
actors
France
Action
3
France
Comedy
1
France
Drama
7
France
Romance
2
France
Thriller
1
France
All Genres
14
UK
Action
3
UK
Comedy
3
UK
Drama
23
UK
Musical
2
UK
Romance
1
UK
Thriller
3
UK
All Genres
35
USA
Action
8
USA
Comedy
14
USA
Drama
68
USA
Horror
2
USA
Musical
5
USA
Sci-Fi
1
USA
Thriller
2
USA
All Genres
100
Grand Total
All Genres
149
For each country, GROUPING(primary_genre) is 0 for the detail rows and 1 for the per-country subtotal row, so the subtotal always appears after the genres it summarizes. GROUPING(birth_country) is 1 only for the grand total row, pushing it to the very end. COALESCE then handles the display without interfering with the ordering.
To reverse that arrangement and place each subtotal row before its detail rows, sort GROUPING(primary_genre) descending. Everything else stays the same:
SELECTCOALESCE(birth_country, 'Grand Total') AS birth_country,COALESCE(primary_genre, 'All Genres') AS primary_genre,COUNT(*) AS actorsFROM actorsWHERE birth_country IN ('USA', 'UK', 'France')GROUPBYROLLUP (birth_country, primary_genre)ORDERBYGROUPING(birth_country), birth_country,GROUPING(primary_genre) DESC, primary_genre;
22 records
birth_country
primary_genre
actors
France
All Genres
14
France
Action
3
France
Comedy
1
France
Drama
7
France
Romance
2
France
Thriller
1
UK
All Genres
35
UK
Action
3
UK
Comedy
3
UK
Drama
23
UK
Musical
2
UK
Romance
1
UK
Thriller
3
USA
All Genres
100
USA
Action
8
USA
Comedy
14
USA
Drama
68
USA
Horror
2
USA
Musical
5
USA
Sci-Fi
1
USA
Thriller
2
Grand Total
All Genres
149
The All Genres subtotal for each country now leads that country’s block, followed by its individual genre rows. Which layout to use depends on how the report will be read; leading with the subtotal works well when readers care primarily about the summary and scan the detail rows only for context.
COALESCE is not limited to ROLLUP and CUBE output. It is useful anywhere a column may contain NULLs and a default value is preferable in the result: substituting 0 for a NULL box office figure, 'Living' for a NULL death date, or a placeholder string for any optional text field.
5.5.4 GROUPING SETS
ROLLUP and CUBE are both shortcuts for GROUPING SETS, which is the underlying primitive. GROUPING SETS lets you specify exactly which combinations of grouping columns should produce rows in the output. Each set in the list is one combination; every other combination is omitted.
The two shortcuts expand to their GROUPING SETS equivalents like this:
-- These are identical:GROUPBYROLLUP (a, b)GROUPBYGROUPING SETS ((a, b), (a), ())-- These are identical:GROUPBYCUBE (a, b)GROUPBYGROUPING SETS ((a, b), (a), (b), ())
ROLLUP always gives you the full hierarchy from most-specific to grand total. CUBE always gives you every possible combination. GROUPING SETS lets you pick exactly the combinations you want, which matters when neither shortcut fits.
The following query returns actor counts at two levels: the (primary_genre, sex) detail level and the grand total, skipping the intermediate per-genre and per-sex subtotals that CUBE would add:
SELECT primary_genre, sex,COUNT(*) AS actorsFROM actorsWHERE primary_genre IN ('Drama', 'Action', 'Comedy')GROUPBYGROUPING SETS ( (primary_genre, sex), ())ORDERBY primary_genre NULLSLAST, sex NULLSLAST;
7 records
primary_genre
sex
actors
Action
F
3
Action
M
19
Comedy
F
11
Comedy
M
9
Drama
F
70
Drama
M
65
NULL
NULL
177
The row where both columns are NULL is the grand total. Every other row is a specific genre-sex pairing. There are no per-genre or per-sex subtotals, because those sets were not listed. CUBE on these two columns would have produced eight additional rows; GROUPING SETS produces exactly what was requested.
5.6 The Logical Order of Execution
The full logical order of a query with aggregation is:
FROM — identify the table
WHERE — filter individual rows
GROUP BY — form groups
Aggregate functions — compute summary values per group
HAVING — filter groups
SELECT — compute and name output columns
ORDER BY — sort the result
LIMIT — trim to the requested row count
This order has a practical consequence: aggregate functions and column aliases defined in SELECT cannot be referenced in HAVING, because HAVING runs before SELECT. You must repeat the aggregate expression:
-- This fails: 'avg_noms' is not defined when HAVING runsSELECT primary_genre, ROUND(AVG(oscar_nominations), 1) AS avg_nomsFROM actorsGROUPBY primary_genreHAVING avg_noms >2;-- This works: repeat the expressionSELECT primary_genre, ROUND(AVG(oscar_nominations), 1) AS avg_nomsFROM actorsGROUPBY primary_genreHAVINGAVG(oscar_nominations) >2;
5.7 NULL Values in Aggregation
Aggregate functions handle NULL values consistently: they ignore them. AVG(column) computes the mean of only the non-NULL values. SUM(column) adds only the non-NULL values. COUNT(column) counts only the non-NULL values. COUNT(*) is the exception — it counts every row regardless of NULLs.
The actors table has no NULL values in the numeric columns used in this chapter, but the behavior is worth knowing. If a new actor were added with total_box_office_usd left NULL, AVG(total_box_office_usd) would exclude that actor from the average entirely rather than treating the NULL as zero.
5.8 Putting It All Together
The following query combines all the clauses covered in this chapter to answer a specific question: for each primary genre with more than two actors born outside the USA, what are the actor count, total Oscar wins, and average box office in millions?
SELECT primary_genre,COUNT(*) AS actor_count,SUM(oscar_wins) AS total_wins,ROUND(AVG(total_box_office_usd) /1000000, 0) AS avg_box_office_millionsFROM actorsWHERE birth_country <>'USA'GROUPBY primary_genreHAVINGCOUNT(*) >2ORDERBY actor_count DESC;
5 records
primary_genre
actor_count
total_wins
avg_box_office_millions
Drama
67
38
1080
Action
14
2
2099
Comedy
6
1
388
Romance
5
1
1114
Thriller
5
2
2110
5.9 Exercises
5.9.1 Reflection
1. Explain the difference between WHERE and HAVING. Write an example of a condition that must go in WHERE and one that must go in HAVING, and explain why each belongs where it does.
2. A student writes the following query intending to find genres where the average number of Oscar nominations exceeds 2, but PostgreSQL returns an error. Identify the problem and fix it.
SELECT primary_genre, ROUND(AVG(oscar_nominations), 1) AS avg_nomsFROM actorsGROUPBY primary_genreHAVING avg_noms >2;
3. Explain why COUNT(*) and COUNT(birth_name) would return different values when used inside a GROUP BY query. In what situation would they return the same value for every group?
5.9.2 Coding
4. Write a query that counts the number of actors from each birth country. Return only countries with at least two actors, sorted from most to fewest.
5. Write a query that shows, for each primary genre, the total number of Oscar nominations and the total number of Oscar wins. Include a third column that expresses wins as a percentage of nominations, rounded to one decimal place. Sort by total nominations descending.
6. Write a query that returns the birth country, actor count, and average height in centimeters (rounded to one decimal place) for all birth countries that have produced at least three actors. Sort by average height descending.
7. Write a query that shows the number of living actors and the number of deceased actors for each primary genre using the FILTER clause. Include a third column with the total actor count per genre. Filter to genres that have at least five actors total. Sort by total actors descending.
8. Write a query that finds the single birth country with the highest total box office across all its actors. Return the country name and the total in billions, rounded to two decimal places.
9. Write a query using ROLLUP that shows actor counts by primary_genre and sex, with subtotals per genre and a grand total. Limit the result to actors born in the USA. Order by genre (NULLs last) then sex (NULLs last).
10. Write a query using CUBE on primary_genre and sex that produces counts for every combination of those two dimensions, including the per-genre totals, per-sex totals, and the grand total. Limit to Drama, Action, and Comedy genres. How many rows does the result contain?