By the end of this chapter, students will be able to:
Write a SELECT statement that retrieves specific columns from a table using SELECT ... FROM.
Apply column aliases using AS to rename columns in query output.
Construct text concatenation expressions using the || operator to combine column values and literal strings.
Write arithmetic expressions in SELECT to compute new values from existing numeric columns.
ApplyDISTINCT to return only unique values from a column or combination of columns.
UseCOUNT(*), COUNT(column), and COUNT(DISTINCT column) to count rows and measure data completeness.
ApplyWHERE with comparison operators (=, <>, <, >, BETWEEN, IN, LIKE, IS NULL) to filter rows.
Combine filter conditions using AND, OR, and NOT, applying parentheses to control evaluation order.
Explain why WHERE column = NULL always returns zero rows and use IS NULL instead.
Describe the NOT IN / NULL trap and explain why a single NULL in the list causes no rows to be returned.
Sort query results using ORDER BY with ASC and DESC across one or more columns.
UseLIMIT to restrict the number of rows returned and explain its interaction with ORDER BY.
Describe the logical order in which SQL clauses are evaluated and explain why column aliases cannot be referenced in WHERE.
Now that your database environment is up and running, it is time to write some SQL. The most fundamental operation in SQL is the query: a question you ask the database, written in a structured way that the database engine can understand and answer. In this chapter, you will write your first queries against the actors database that ships with the course Docker image.
The actors table contains one row per actor, covering 200 entries drawn from film history. Each row records the actor’s name, birth name (for those who used a stage name), sex, birth and death dates, country of birth, height, Oscar nominations and wins, and primary film genre. It is a rich dataset for learning SQL because it mixes text columns, fixed-length character fields, whole numbers, and dates in a single table.
By the end of this chapter, you will know how to retrieve data from a table, choose which columns to display, compute new values from existing ones, filter rows by any condition you can describe, sort the results, and cap how many rows are returned.
3.2 The SELECT and FROM Clauses
Every query that retrieves data starts with a SELECT statement. The two required pieces are:
SELECT specifies which columns you want in the result.
FROM tells the database which table to look in.
3.2.1 Selecting All Columns
The quickest way to see everything a table contains is to use an asterisk (*) in place of column names. The asterisk is shorthand for “every column, in the order they are defined.”
SELECT*FROM actorsLIMIT5;
5 records
actor_id
first_name
last_name
birth_name
sex
birth_date
death_date
birth_country
height_cm
oscar_nominations
oscar_wins
primary_genre
has_honorary_oscar
notable_role
total_box_office_usd
1
Tom
Hanks
NULL
M
1956-07-09
NULL
USA
183
5
2
Drama
FALSE
Forrest Gump
4.8e+09
2
Meryl
Streep
NULL
F
1949-06-22
NULL
USA
168
21
3
Drama
FALSE
The Devil Wears Prada
2.1e+09
3
Cate
Blanchett
NULL
F
1969-05-14
NULL
Australia
174
8
2
Drama
FALSE
Elizabeth
5.1e+09
4
Marlon
Brando
NULL
M
1924-04-03
2004-07-01
USA
175
8
2
Drama
FALSE
The Godfather
1.6e+09
5
Audrey
Hepburn
Audrey Kathleen Ruston
F
1929-05-04
1993-01-20
UK
170
5
1
Romance
TRUE
Breakfast at Tiffany’s
9.2e+08
The LIMIT 5 at the end restricts the output to five rows. Without it, the query would return all 200 rows in the table. This is similar to the head() function in R or Python. You will learn more about LIMIT later in this chapter, but getting into the habit of adding it while exploring an unfamiliar table is good practice.
3.2.2 Selecting Specific Columns
In most real situations, you will not want every column. Listing only the columns you need makes results easier to read and faster to retrieve from large tables.
Column names are listed after SELECT, separated by commas. The database returns them in exactly the order you specify, regardless of how they are arranged inside the table.
Relying on SELECT * in production code is considered poor practice because it breaks silently when someone adds, removes, or reorders columns in the table. Naming your columns explicitly makes queries more predictable and easier to maintain.
3.3 Shaping Your Output
Beyond choosing which columns to include, SQL lets you rename columns, combine text from multiple columns, and compute new values on the fly. None of these operations modify the data in the table; they only affect what is shown in the result.
3.3.1 Column Aliases
By default, a column in your result inherits the name it has in the table. You can override this with the AS keyword, which assigns an alias to a column. Aliases are useful for giving computed values a meaningful name, shortening long column names, or improving readability.
SELECT first_name AS given_name, last_name AS surname, birth_country AS country, primary_genre AS genre, oscar_nominations AS nominationsFROM actors;
Displaying records 1 - 25
given_name
surname
country
genre
nominations
Tom
Hanks
USA
Drama
5
Meryl
Streep
USA
Drama
21
Cate
Blanchett
Australia
Drama
8
Marlon
Brando
USA
Drama
8
Audrey
Hepburn
UK
Romance
5
Humphrey
Bogart
USA
Drama
3
Katharine
Hepburn
USA
Drama
12
Jack
Nicholson
USA
Drama
12
Denzel
Washington
USA
Drama
9
Ingrid
Bergman
Sweden
Drama
7
James
Stewart
USA
Drama
5
Bette
Davis
USA
Drama
10
Sidney
Poitier
USA
Drama
2
Jane
Fonda
USA
Drama
7
Al
Pacino
USA
Drama
9
Robert
De Niro
USA
Drama
8
Sophia
Loren
Italy
Drama
5
Marcello
Mastroianni
Italy
Drama
3
Jeanne
Moreau
France
Drama
0
Jean-Paul
Belmondo
France
Action
0
Catherine
Deneuve
France
Drama
2
Isabelle
Huppert
France
Drama
2
Gérard
Depardieu
France
Drama
1
Alec
Guinness
UK
Drama
5
Peter
Sellers
UK
Comedy
3
The AS keyword is optional in PostgreSQL; you can write the alias directly after the column name without it. Using AS explicitly is recommended, however, because it makes the intent of the query immediately clear to anyone reading it.
3.3.2 Combining Text Columns
PostgreSQL uses the || operator to concatenate (join) strings together. This lets you build a single display value out of multiple columns, or add literal text between column values.
SELECT first_name ||' '|| last_name AS full_name, birth_country, primary_genreFROM actors;
Displaying records 1 - 25
full_name
birth_country
primary_genre
Tom Hanks
USA
Drama
Meryl Streep
USA
Drama
Cate Blanchett
Australia
Drama
Marlon Brando
USA
Drama
Audrey Hepburn
UK
Romance
Humphrey Bogart
USA
Drama
Katharine Hepburn
USA
Drama
Jack Nicholson
USA
Drama
Denzel Washington
USA
Drama
Ingrid Bergman
Sweden
Drama
James Stewart
USA
Drama
Bette Davis
USA
Drama
Sidney Poitier
USA
Drama
Jane Fonda
USA
Drama
Al Pacino
USA
Drama
Robert De Niro
USA
Drama
Sophia Loren
Italy
Drama
Marcello Mastroianni
Italy
Drama
Jeanne Moreau
France
Drama
Jean-Paul Belmondo
France
Action
Catherine Deneuve
France
Drama
Isabelle Huppert
France
Drama
Gérard Depardieu
France
Drama
Alec Guinness
UK
Drama
Peter Sellers
UK
Comedy
You can add as many literal strings as you like. Here, a comma and space are inserted between the last name and the birth country to produce a compact bio label:
If any value being concatenated is NULL, the entire result of || becomes NULL. The actors table has a birth_name column that is NULL for most actors. If you concatenated birth_name into a label without accounting for that, those rows would produce a NULL result. The COALESCE function, covered in a later chapter, handles this situation cleanly.
3.3.3 Arithmetic Operations
SQL treats numeric columns like variables in an algebraic expression. You can add, subtract, multiply, and divide columns and constants using standard arithmetic operators (+, -, *, /).
As an example, height_cm stores each actor’s height in centimeters. You can convert that to inches by dividing by 2.54:
SELECT first_name ||' '|| last_name AS full_name, height_cm,ROUND(height_cm /2.54::numeric, 1) AS height_inFROM actorsORDERBY height_cm DESCLIMIT10;
10 records
full_name
height_cm
height_in
Max von Sydow
196
77.2
Clint Eastwood
193
76.0
Gregory Peck
191
75.2
Charlton Heston
191
75.2
Chris Hemsworth
190
74.8
James Stewart
188
74.0
Idris Elba
188
74.0
Sidney Poitier
188
74.0
Hugh Jackman
188
74.0
Henry Fonda
188
74.0
ROUND(value, 1) rounds the result to one decimal place. The ::numeric cast converts the 2.54 literal to a decimal type before dividing, which ensures a decimal result rather than integer division.
3.4 Exploring Data with DISTINCT and COUNT
Two features of the SELECT clause are especially useful when you first sit down with an unfamiliar database: DISTINCT, which removes duplicate values from your results, and COUNT, which tallies rows. Together they let you quickly understand what a table contains and how its data is distributed before you start filtering or transforming it.
3.4.1 DISTINCT
Adding the DISTINCT keyword immediately after SELECT tells the database to return only unique combinations of the specified columns, removing any duplicate rows from the result.
A common use is to see every unique value that appears in a categorical column. The actors table has 200 rows but only eight distinct primary genres. Without DISTINCT, selecting the primary_genre column would return 200 rows with many repeated values. With it, you get the eight unique values once each:
DISTINCT applies to the full combination of columns listed, not just the first one. If you write SELECT DISTINCT primary_genre, birth_country, the result contains each unique pairing of those two values rather than unique genres alone.
SELECT DISTINCT is useful for a quick look at what values exist in a column, but it is not a substitute for proper data profiling. For large tables, counting how many rows correspond to each distinct value tells you much more than the list of values alone. GROUP BY, covered in a later chapter, is the right tool for that.
3.4.2 COUNT
COUNT is an aggregate function: rather than returning one row per input row, it collapses results into a single summary value. The most common form, COUNT(*), returns the total number of rows in the result.
SELECTCOUNT(*) AS total_actorsFROM actors;
1 records
total_actors
200
Paired with a WHERE clause, COUNT(*) counts only the rows that satisfy the filter:
SELECTCOUNT(*) AS drama_actorsFROM actorsWHERE primary_genre ='Drama';
1 records
drama_actors
135
COUNT(column_name), with a specific column instead of *, counts only rows where that column is not NULL. This makes it a convenient way to measure how complete your data is:
SELECTCOUNT(*) AS total_rows,COUNT(birth_name) AS rows_with_birth_name,COUNT(death_date) AS rows_with_death_dateFROM actors;
1 records
total_rows
rows_with_birth_name
rows_with_death_date
200
70
82
The birth_name column records an actor’s legal name when they perform under a stage name. The result shows that only 70 of the 200 actors have a recorded birth name, meaning 130 either perform under their real name or have no birth name on record. Similarly, death_date is NULL for living actors, so COUNT(death_date) reveals how many actors in the table are deceased.
COUNT(DISTINCT column) combines both ideas, counting how many unique values appear in a column:
SELECTCOUNT(DISTINCT birth_country) AS num_countriesFROM actors;
1 records
num_countries
19
COUNT belongs to a family of aggregate functions that summarize groups of rows into a single value. Others include SUM(), AVG(), MIN(), and MAX(). These become significantly more powerful when paired with GROUP BY, which computes a separate summary for each category in a column, such as the number of actors per genre or the average height per birth country. GROUP BY is covered in its own chapter.
3.5 Filtering Rows with WHERE
SELECT and FROM control which columns and tables you work with. The WHERE clause controls which rows appear in the result. Any row that does not satisfy the condition after WHERE is excluded from the output.
Use a single equals sign (=) to match a text value exactly. Text comparisons in PostgreSQL are case-sensitive by default, so 'Drama' and 'drama' are not the same.
Fifteen actors in the table have received five or more Oscar nominations. Meryl Streep leads with 21 nominations, more than any other actor in the dataset.
3.5.3 Comparing Date Values
Date columns can be compared with the same operators as numbers. PostgreSQL accepts date literals in 'YYYY-MM-DD' format.
This query returns actors born in 1980 or later, representing the youngest generation in the dataset.
3.5.4 BETWEEN
BETWEEN is shorthand for a range filter. The following two forms are exactly equivalent:
-- Using comparison operatorsWHERE birth_date >='1950-01-01'AND birth_date <='1969-12-31'-- Using BETWEENWHERE birth_date BETWEEN'1950-01-01'AND'1969-12-31'
BETWEEN is inclusive on both ends: a row where the column value equals either boundary is included in the result. It works with dates, integers, and strings.
With strings, BETWEEN uses alphabetical (lexicographic) order. The boundary behavior is the same — both ends are inclusive — but the upper boundary catches only that exact string, not everything that begins with it. last_name BETWEEN 'D' AND 'H' would miss Hanks, Hepburn, and every other name that starts with H but sorts after the bare letter H. The correct way to express “last names beginning with D through H” is a half-open range using explicit operators:
The lower bound >= 'D' is inclusive and works as expected. The upper bound uses < 'I' rather than <= 'H' precisely because 'H' alone would exclude names like Hanks and Hoffman that sort after the single character H.
Prefer explicit >= and <= comparisons over BETWEEN for most situations. The inclusive behavior is easy to forget, which makes off-by-one errors common, and BETWEEN cannot express half-open ranges. The alphabetical edge cases with strings make it especially error-prone there. Explicit operators are unambiguous. BETWEEN is fine when you encounter it in existing code or reach for it with simple integer or date ranges, but it is not the preferred form for new queries.
3.5.5 NULL Values and IS NULL
NULL is a special marker in SQL that represents the absence of a value. It is not the same as zero, an empty string, or a blank space. NULL means that no value was recorded for that column in that row at all.
The death_date column is NULL for actors who are still living. You can find living actors using IS NULL:
To find rows where a value is present, use IS NOT NULL. The following query returns only deceased actors by requiring that death_date has been recorded:
You cannot test for NULL using =. Writing WHERE birth_name = NULL always returns zero rows, because NULL is not considered equal to anything, not even another NULL. Always use IS NULL or IS NOT NULL when checking for missing values.
3.5.6 Combining Conditions with AND and OR
A WHERE clause can include multiple conditions joined by AND or OR.
AND requires both conditions to be true for a row to be included.
This query returns Drama actors who have won at least one Oscar, sorted by wins and then by total nominations to break ties.
When combining AND and OR in the same WHERE clause, use parentheses to make your intentions explicit. SQL evaluates AND before OR, which can produce surprising results if conditions are not grouped clearly.
-- Returns actors in Drama or Comedy who have at least one nominationWHERE (primary_genre ='Drama'OR primary_genre ='Comedy')AND oscar_nominations >0
3.5.7 Matching Patterns with LIKE
The LIKE operator filters text columns using simple patterns. Two wildcard characters are available:
% matches any sequence of zero or more characters.
_ matches exactly one character.
The following example finds all actors whose last names begin with H:
Lowercase 'h%' would return no results here because text matching with LIKE is case-sensitive, and no last names in this table begin with a lower-case h. ILIKE works identically to LIKE but ignores case. Replacing LIKE with ILIKE makes the same pattern work regardless of how the value is capitalized:
Both queries return the same fourteen rows because the data happens to be consistently capitalized. The difference becomes meaningful when data entry is inconsistent — for example, if primary_genre values were stored as a mix of 'drama', 'Drama', and 'DRAMA', a LIKE filter for 'drama' would miss the capitalized variants while ILIKE would catch all of them.
With % wildcards on both sides of a search term, you can find text that occurs anywhere in a column. The following query finds actors whose birth names contain the word “Schwartz” or any other substring:
The NOT keyword reverses the result of a condition. It can be placed in front of LIKE, IN, BETWEEN, and similar operators to express “everything except what this pattern matches.”
NOT LIKE excludes rows that match a pattern. The query below returns actors whose last names do not begin with a vowel:
IS NOT NULL, introduced in the previous section, follows the same pattern and is technically a NOT form as well.
NOT can also be written as a prefix before a parenthesized condition:
WHERENOT (primary_genre ='Drama')
This is equivalent to WHERE primary_genre <> 'Drama'. The prefix form is more useful when negating a compound condition that would be awkward to rewrite with <> or other operators.
3.5.9.1 NOT IN and NULL Values
NOT IN behaves unexpectedly when NULL values are involved, and understanding this is important for using it safely.
When a NULL appears anywhere in the list passed to NOT IN, the condition returns no rows at all. To see this, try adding a NULL to an otherwise straightforward filter:
This returns zero rows despite the fact that most actors in the table are neither Drama nor Action performers. The reason comes down to how SQL evaluates NULL comparisons. The expression primary_genre NOT IN ('Drama', 'Action', NULL) expands internally to:
The third condition, primary_genre <> NULL, never evaluates to TRUE or FALSE. Any comparison with NULL produces NULL, meaning the result is unknown. When one part of an AND chain is unknown, the entire expression is unknown, and the row is excluded from the output. Because this applies to every row in the table, the query returns nothing.
A single NULL anywhere in the list passed to NOT IN silently causes the entire condition to return no rows. Because no error is raised, this problem can go unnoticed for a long time. When building a list for NOT IN, verify that it contains no NULL values. A later chapter covers how this same issue arises when the list comes from a subquery rather than a literal set of values, and introduces NOT EXISTS as the preferred alternative in those situations. Unlike NOT IN, NOT EXISTS handles NULL values correctly and does not produce this silent failure.
3.6 Sorting Results with ORDER BY
By default, a database returns rows in no particular order. The physical order of data in storage has nothing to do with alphabetical order, nomination count, or any other meaningful sequence. The ORDER BY clause lets you specify the order you want.
You can sort by more than one column by listing them after ORDER BY, separated by commas. The database sorts by the first column, then uses the second column to break ties, then the third, and so on.
This query first groups rows by genre alphabetically, then within each genre sorts actors from most nominated to least. The result lets you compare the most decorated actors within each genre at a glance.
3.7 Limiting Results with LIMIT
The LIMIT clause caps the number of rows returned by a query. It is most useful when combined with ORDER BY, because the two together let you answer “top N” questions: the ten most nominated actors, the five tallest actors, and so on.
Using LIMIT without ORDER BY returns an arbitrary subset of rows. Which rows you get depends on how the database happens to store and retrieve data at that moment, which can vary between runs. If you need a specific subset (such as the most nominated, tallest, or earliest-born), always pair LIMIT with an ORDER BY that defines what “first” means.
3.8 Putting It All Together
The clauses covered in this chapter can be combined in a single query. PostgreSQL evaluates them in a fixed logical order, regardless of how you arrange them in the text:
FROM (identify the table)
WHERE (filter rows)
SELECT (compute and choose columns)
ORDER BY (sort the result)
LIMIT (cut the result to the requested number of rows)
The query below uses all of these clauses together to find the ten most Oscar-nominated actors born outside the United States, showing their full name, birth country, nomination total, and wins.
Reading a query like this from top to bottom is natural once you understand what each clause contributes. As you move through subsequent chapters and your queries grow more complex, this structured way of thinking about each clause separately will continue to serve as a reliable foundation.
3.9 SQL as a Declarative Language
Now that you have a sense for how a basic SELECT statement is written, it is important to understand something fundamental about how SQL thinks, because it is different from how R and Python think.
If you have experience with either of those languages, you are accustomed to writing programs that specify a sequence of steps. You tell the computer exactly what to do and in what order: load this data, filter these rows, select these columns, sort the result. This style is called imperative programming, because you are issuing commands.
SQL is declarative. Rather than specifying steps, you describe the result you want and let the database engine figure out how to produce it.
To make the contrast concrete, consider retrieving Drama actors sorted by nomination count. In Python with pandas, you might write:
The SQL query does not describe steps. It describes a result: “I want the first_name, last_name, and oscar_nominations columns, from the actors table, restricted to rows where primary_genre is 'Drama', sorted by oscar_nominations descending.” The database engine reads that description and decides for itself how to retrieve the data efficiently. It may use an index, reorder internal operations, or parallelize work across processors. You do not control those decisions, and most of the time you do not need to.
3.9.1 Set-Based Thinking
One practical consequence of SQL’s declarative nature is that it operates on sets of rows rather than individual records. When you write a WHERE clause, you are not writing a loop that examines each row in turn. You are describing a condition that applies to the entire table at once, and the database evaluates it however it finds most efficient.
This can feel unfamiliar at first. In Python or R, reaching for a loop or an apply-style function is instinctive. In SQL, you resist that instinct and instead describe what the complete result set should look like. Queries that try to mimic row-by-row processing are almost always slower and harder to read than queries that embrace the set-based model.
3.9.2 The Logical Order of Execution
There is one aspect of SQL’s declarative model that surprises many people coming from R or Python: the order in which you write a query’s clauses is not the order in which the database evaluates them.
A query is written as:
SELECT...FROM...WHERE...ORDERBY...LIMIT...
But the database processes the clauses in this logical order:
FROM (identify the source table)
WHERE (filter rows)
SELECT (compute and choose columns)
ORDER BY (sort the result)
LIMIT (trim to the requested number of rows)
The most common practical consequence of this is that column aliases defined in SELECT cannot be referenced in a WHERE clause, because WHERE runs before SELECT. The following query fails for exactly that reason:
-- This fails: 'height_in' is not defined when WHERE runsSELECTROUND(height_cm /2.54::numeric, 1) AS height_inFROM actorsWHERE height_in >72;
You have to repeat the expression in WHERE instead:
SELECTROUND(height_cm /2.54::numeric, 1) AS height_inFROM actorsWHEREROUND(height_cm /2.54::numeric, 1) >72;
ORDER BY is an exception: most databases, including PostgreSQL, allow ORDER BY to reference aliases defined in SELECT, because ORDER BY is evaluated after SELECT. That is why ORDER BY height_in would work in the query above even though WHERE height_in does not.
Keeping this logical order in mind will make otherwise cryptic error messages much easier to diagnose as your queries grow more complex.
3.10 Exercises
3.10.1 Reflection
SQL is described in this chapter as a “declarative” programming language, while R and Python are “imperative.” In your own words, explain what that distinction means. How does it change the way you approach writing a query compared to writing a data pipeline in R or Python?
A colleague writes the following query hoping to find all actors who are over six feet tall (approximately 183 cm), but PostgreSQL returns an error. Identify the problem and explain how to fix it.
Explain the difference between COUNT(*) and COUNT(birth_name) as they apply to the actors table. Under what circumstances would the two expressions return different values?
A classmate wants to retrieve all actors for whom no birth name has been recorded. They write the query below, but it always returns zero rows. Why? What should they write instead?
For each exercise below, write a SQL query against the actors database.
Write a query that retrieves the first name, last name, birth country, and primary genre for every actor in the table. Sort the results alphabetically by last name.
Write a query that returns the full name (first and last name combined into a single column called full_name), birth country, and Oscar nomination count for all actors born in the United Kingdom, sorted from most to least nominated.
Write a query that shows only the distinct birth countries represented in the actors table, sorted alphabetically. Then write a second query using COUNT(DISTINCT birth_country) to confirm the total number of unique countries.
Write a query that returns the first name, last name, height in centimeters, and height converted to inches (rounded to one decimal place, named height_in) for all actors over 185 cm tall. Sort from tallest to shortest.
Write a query that returns the first name, last name, primary genre, Oscar nominations, and Oscar wins for all actors who meet every one of the following conditions: they are still living (no death date recorded), they have received at least three Oscar nominations, and they were born in the USA. Sort the results by Oscar nominations in descending order.