In Development: This textbook is currently
undergoing development and should not be used as an authoritative
source of... well, ANYTHING.
Learning Objectives
By the end of this chapter, students will be able to:
- Define primary key, foreign key, and unique constraint and explain the role each plays in a relational schema.
- Identify candidate keys in a table and select an appropriate primary key.
- Write
CREATE TABLE statements that declare PRIMARY KEY and FOREIGN KEY constraints.
- Explain referential integrity and describe what happens when a foreign key constraint is violated.
- Apply
ON DELETE and ON UPDATE referential actions (CASCADE, SET NULL, RESTRICT, NO ACTION) and predict their behavior.
- Distinguish between natural keys and surrogate keys and evaluate the trade-offs of each for a given table.
- Diagnose constraint violations from PostgreSQL error messages and identify the underlying data problem.
From Joins to Design
The joins chapter introduced primary keys, foreign keys, and cardinality as the tools you need to write queries that span multiple tables. This chapter approaches the same concepts from a different angle: not as the machinery behind a query, but as design decisions you make when defining a table.
Choosing the right key, understanding what referential integrity actually guarantees, and deciding what should happen when a parent row is deleted are decisions that determine whether a database stays reliable over months and years of use. The chapter that follows this one translates all of these concepts into SQL using CREATE TABLE statements.
What Makes Something a Key?
A key is a column or combination of columns that uniquely identifies every row in a table. Two properties are required:
- Uniqueness: no two rows share the same value (or combination of values) for the key columns.
- Non-nullability: key columns cannot be NULL, because NULL represents an unknown value and no two unknowns can be compared to confirm uniqueness.
Superkeys and Candidate Keys
Any set of columns that satisfies the uniqueness requirement is called a superkey. Most superkeys contain redundant columns. Consider a students table:
Several superkeys exist for this table:
(student_id): unique on its own
(email): unique on its own
(student_id, email): also unique, but email is redundant
(student_id, email, first_name, last_name): unique, but all extra columns are redundant
A candidate key is a minimal superkey: a superkey from which no column can be removed without losing uniqueness. In this table, student_id and email are both candidate keys. (student_id, email) is not, because removing either column still yields a unique identifier.
Every table has at least one candidate key. A table always has the set of all its columns as a superkey, and among all superkeys there is always at least one minimal one.
Primary Keys and Alternate Keys
When a table has multiple candidate keys, the designer chooses one as the primary key: the official row identifier, and the column that foreign keys in other tables will reference. The remaining candidate keys become alternate keys.
In the students example, student_id is the natural choice for primary key: it is stable, compact, and has no meaning outside the database (meaning no external force can make it change). The email column is a legitimate alternate key and can be declared UNIQUE to enforce that no two students share an address.
The distinction is practical: other tables that reference a student will use student_id as the foreign key column, not email. If a student later changes their email, only the students table needs updating. No other table is affected.
Surrogate vs. Natural Keys
The joins chapter introduced this distinction briefly. It is worth examining at greater depth because it is one of the more consequential decisions you make when designing a schema.
A natural key is a value that already exists in the data and is inherently unique: an ISBN, a country’s ISO code, a social security number, an email address. Natural keys have the advantage of being human-readable and sometimes self-explanatory.
A surrogate key is an artificial value with no meaning outside the database, typically an auto-incrementing integer assigned by the database when a row is inserted. It is stable precisely because nothing outside the database cares about its value.
The case for natural keys: No extra column to manage. Queries that need to filter by a known natural identifier (a country code, a product SKU) do not require a join to look it up.
The case against natural keys: Real-world “unique” values are often not as stable or unique as they first appear.
- ISBNs have been reused.
- Social security numbers have been reassigned.
- Email addresses change when people switch employers, get married, or simply choose a new provider.
- What seems unique at a national scale (a driver’s license number) may collide with records from another country in a merged dataset.
When a natural key changes, every table that holds it as a foreign key must be updated simultaneously. With a surrogate key, a name or email change is a single-row update in one table.
In practice, most production schemas use surrogate keys as primary keys and enforce natural uniqueness separately with a UNIQUE constraint on the natural identifier column. You get the stability of surrogates and retain the ability to query by natural identifier. The CREATE TABLE chapter shows how to express this pattern in SQL.
Composite Keys
When no single column uniquely identifies every row, two or more columns together serve as the primary key. The joins chapter introduced this concept when describing the observations table (keyed on country_code, indicator_code, and year). The normalization chapter showed another example: the order_items table, where (order_id, product) together identify each line item.
Composite keys arise most often in two situations:
- Junction tables that resolve many-to-many relationships, where each row pairs one entity from each side
- Measurement tables where a row records a value for a specific combination of dimensions (country, indicator, and time period)
A useful check: if you find yourself reaching for a surrogate key on a junction table, it may be a sign that the table is trying to represent something more than just the relationship. If the table records only the pairing, a composite key is the right choice.
Foreign Keys in Depth
A foreign key column holds values that reference the primary key of another table. This section covers aspects that the joins chapter set aside.
The Referential Integrity Guarantee
Referential integrity is the guarantee that every value in a foreign key column either exists as a primary key value in the referenced table, or is NULL. When the database enforces this constraint, it is impossible to create an orphaned row: a row whose foreign key points to something that does not exist.
Without this enforcement, the gap between what the data says and what it means grows over time. Suppose an application deletes a customer without first handling their orders. The orders table now contains rows with a customer_id that does not exist in customers. Queries that join the two tables will silently lose those orders. Any report that depends on the customer-order relationship produces wrong answers that are difficult to detect.
With referential integrity enforced, the database simply rejects the deletion until the relationship is resolved.
NULL in Foreign Key Columns
A foreign key column can allow NULL. A NULL in a foreign key column does not violate referential integrity; it means the row has no parent, or the relationship is unknown.
Whether this is appropriate depends on business rules. In an orders table, a NULL in customer_id might represent a guest order placed by someone without an account. That may be intentional, or it may be a data quality problem. Declaring the foreign key column NOT NULL says “every order must belong to a known customer.” Allowing NULL says “orders without a customer are a permitted state.”
This choice should be deliberate, not accidental. The default in most databases is to allow NULL in foreign key columns unless NOT NULL is explicitly declared.
Cascade Behavior
When a row in a parent table is deleted or its primary key is updated, the database must decide what to do with child rows that reference it. SQL provides four options, specified in the foreign key constraint declaration.
RESTRICT (the default in PostgreSQL): The operation is rejected if any child rows reference the parent row. It prevents orphaned rows by preventing the deletion in the first place. The calling application must resolve the dependent rows before the parent can be removed.
CASCADE: The operation propagates automatically. Deleting a parent row also deletes all child rows that reference it. Updating a primary key value updates the matching foreign key values in child rows. This is convenient but can silently delete large amounts of data if used carelessly.
SET NULL: The foreign key column in each child row is set to NULL. The child rows are not deleted; they simply lose their reference to the parent. This is only valid if the foreign key column allows NULL.
SET DEFAULT: The foreign key column in each child row is set to its default value. This is uncommon in practice.
Using the orders schema as a concrete example, consider deleting customer 1 (Alice), who has orders 1 and 3:
| RESTRICT |
The DELETE is rejected. Alice cannot be deleted while orders reference her. |
| CASCADE |
Orders 1 and 3 are deleted. Their order_items rows are also deleted if order_items cascades on order deletion. |
| SET NULL |
Orders 1 and 3 remain, with customer_id set to NULL. Alice’s purchase history is preserved but disowned. |
| SET DEFAULT |
Orders 1 and 3 remain, with customer_id set to whatever default was declared for the column. |
The right choice is a business logic decision. CASCADE suits situations where child data has no meaning independent of the parent (a line item on a deleted order is meaningless). RESTRICT suits situations where the application should be forced to handle the relationship explicitly. SET NULL suits situations where the child records should survive the parent’s deletion but no longer need to point to it.
ON UPDATE CASCADE is less frequently needed than ON DELETE CASCADE, but it matters when a natural key is used as a primary key. If the natural key value changes (an ISO code is reassigned, a product SKU is reformatted), CASCADE propagates the new value to all foreign key columns that reference it automatically.
Self-Referencing Relationships
A foreign key does not have to reference a different table. A table can contain a foreign key that references its own primary key. This is called a self-referencing relationship, and it represents a hierarchy within a single entity type.
The classic example is an employee reporting structure:
| 1 |
Sandra |
NULL |
| 2 |
Marcus |
1 |
| 3 |
Priya |
1 |
| 4 |
Tom |
2 |
manager_id is a foreign key referencing employee_id in the same table. Sandra has no manager, so her manager_id is NULL. Marcus and Priya report to Sandra. Tom reports to Marcus. This single table represents an arbitrarily deep hierarchy. Querying it across levels requires a self-join: joining the employees table to itself under two different aliases.
Self-referencing foreign keys use the same cascade options as any other foreign key. ON DELETE SET NULL is a common choice: if a manager leaves, their reports become unmanaged (NULL) rather than being deleted.
Exercises
Reflection
1. What is the difference between a superkey and a candidate key? Give an example of a superkey that is not a candidate key, using a table of your own design.
2. A database designer is modeling a table for university courses with the columns: course_id (auto-generated integer), course_code (like “CS101”), title, department_code, and credit_hours. Identify all candidate keys. Which would you choose as the primary key, and why? What would you do with the remaining candidate keys?
3. Why might a designer choose a surrogate key over a natural key even when a stable natural key exists? Give an example of a natural key that seemed reliable but would cause problems in practice.
4. What is the difference between NOT NULL and UNIQUE as column constraints? Can a column be both NOT NULL and UNIQUE? What does that combination effectively make the column?
5. Explain the difference between ON DELETE CASCADE and ON DELETE RESTRICT. Describe a business scenario where each would be the appropriate choice.
Structural Analysis
6. Consider a database for a hospital with the following tables:
patients: patient_id, first_name, last_name, date_of_birth, national_health_id
doctors: doctor_id, name, license_number, specialty
appointments: records which patient was seen by which doctor on which date and time
For each table: (a) identify all candidate keys, (b) choose a primary key and justify the choice, and (c) identify any columns that should be declared as alternate keys.
Then describe the foreign key relationships between the tables. For each foreign key, state whether the column should allow NULL and what the appropriate ON DELETE behavior is.
7. A development team is debating the primary key for a customers table. One developer argues for using the customer’s email address as the primary key, since emails are already required to be unique. Another argues for a surrogate integer key plus a separate UNIQUE constraint on email. Write a brief argument for each position, then give your own recommendation.