Adding a NOT NULL Column Without Locking in Postgres
Adding a NOT NULL column to a large Postgres table without downtime requires a multi-step migration — add the column as nullable first, backfill in batches, then enforce the constraint — rather than a single ALTER TABLE that holds an ACCESS EXCLUSIVE lock for the duration; this pattern is covered in depth as part of Zero-Downtime Schema Migration Strategies.
Quick Answer
A naive single-step migration rewrites the entire table under an ACCESS EXCLUSIVE lock, blocking all reads and writes for minutes on a large table. The correct approach splits the operation into discrete steps, each taking only a brief or weaker lock.
Before — the wrong approach (blocks reads and writes)
# migrations/versions/xxxx_add_status_wrong.py
from alembic import op
import sqlalchemy as sa
def upgrade() -> None:
# DANGER: Takes ACCESS EXCLUSIVE lock for the full table rewrite.
# On a 50M-row table this can run for 30+ minutes.
op.add_column(
"orders",
sa.Column("status", sa.String(50), nullable=False, server_default="pending"),
)
After — the correct multi-step recipe
# migrations/versions/xxxx_add_status_safe.py
from alembic import op
import sqlalchemy as sa
from sqlalchemy import text
def upgrade() -> None:
conn = op.get_bind()
# Step 0: Set a short lock_timeout so this migration fails fast
# rather than queuing behind a long-running transaction.
conn.execute(text("SET lock_timeout = '2s'"))
conn.execute(text("SET statement_timeout = '0'")) # No cap on batch UPDATEs
# Step 1: Add the column as nullable — instant, no table rewrite.
op.add_column(
"orders",
sa.Column("status", sa.String(50), nullable=True),
)
# Step 2: Backfill in batches of 10,000 rows outside a long transaction.
# Each batch is its own short transaction — avoids long-held row locks.
batch_size = 10_000
while True:
result = conn.execute(
text(
"UPDATE orders SET status = 'pending' "
"WHERE id IN ("
" SELECT id FROM orders WHERE status IS NULL LIMIT :batch"
")"
),
{"batch": batch_size},
)
conn.commit() # Commit each batch immediately
if result.rowcount == 0:
break # No more NULL rows
# Step 3a (PG 11+, constant default): If a constant server_default is
# acceptable, this single statement is near-instant on PG 11+ because
# Postgres stores the default in pg_attribute — no table rewrite occurs.
# Use this path instead of Steps 3b/3c when the default is a fixed literal.
#
# op.alter_column("orders", "status", server_default="pending", nullable=False)
#
# Step 3b: For PG 12+ or when you need the strongest guarantee, use the
# CHECK NOT VALID + VALIDATE CONSTRAINT trick (see Advanced section below).
conn.execute(
text(
"ALTER TABLE orders "
"ADD CONSTRAINT orders_status_not_null "
"CHECK (status IS NOT NULL) NOT VALID"
)
)
conn.commit()
# Step 3c: VALIDATE takes only ShareUpdateExclusiveLock — reads and writes
# continue unblocked during the full-table scan.
conn.execute(
text("ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null")
)
conn.commit()
# Step 3d: On PG 12+ this is near-instant because Postgres finds the
# existing CHECK and skips re-scanning the table.
conn.execute(
text("ALTER TABLE orders ALTER COLUMN status SET NOT NULL")
)
conn.commit()
# Step 4: Drop the now-redundant CHECK constraint and clean up the default.
conn.execute(
text("ALTER TABLE orders DROP CONSTRAINT orders_status_not_null")
)
op.alter_column("orders", "status", server_default=None)
def downgrade() -> None:
op.drop_column("orders", "status")
Execution Context & Async Workflow Integration
Alembic migrations typically run in a synchronous context invoked from the command line or a deployment script, but in an async SQLAlchemy project you may be calling run_sync from an async engine. The multi-step recipe above is event-loop safe because each conn.commit() releases locks and returns the connection to a clean state between batch iterations. There is no long-held transaction that could starve the async connection pool.
When Alembic is wired to an AsyncEngine via run_sync, the pattern requires wrapping the synchronous migration logic correctly:
# env.py — async Alembic environment configuration
import asyncio
from logging.config import fileConfig
from alembic import context
from sqlalchemy.ext.asyncio import create_async_engine
from myapp.db import Base
config = context.config
fileConfig(config.config_file_name)
target_metadata = Base.metadata
def run_migrations_online() -> None:
async def _run() -> None:
connectable = create_async_engine(
config.get_main_option("sqlalchemy.url"),
# Keep pool_size small for migration runners — they are sequential.
pool_size=2,
max_overflow=0,
)
async with connectable.connect() as connection:
await connection.run_sync(do_run_migrations)
await connectable.dispose()
asyncio.run(_run())
def do_run_migrations(connection: sa.engine.Connection) -> None:
context.configure(
connection=connection,
target_metadata=target_metadata,
# Required so individual op.* calls use the same connection.
compare_type=True,
)
with context.begin_transaction():
context.run_migrations()
run_migrations_online()
The batched UPDATE loop inside upgrade() commits each batch independently. This is intentional: in an async web server running concurrently, long-held row locks from a single large UPDATE would queue async tasks waiting for those rows, causing cascading timeouts in the connection pool. Batching keeps individual lock windows under a few milliseconds.
The lock_timeout = '2s' set at migration start is a circuit breaker. If a long-running OLAP query or a vacuum is holding a conflicting lock, the migration fails immediately with a LockNotAvailable error rather than queuing silently and blocking the entire connection pool for minutes. Retry with exponential back-off from your deployment tooling.
Resolving Warnings, Errors & Common Mistakes
| Exact error / warning string | Root cause | Production fix |
|---|---|---|
ERROR: column "status" of relation "orders" contains null values | ALTER COLUMN ... SET NOT NULL executed before all rows were backfilled. | Verify backfill completed: SELECT COUNT(*) FROM orders WHERE status IS NULL. Re-run the batch loop until it returns 0. |
ERROR: canceling statement due to lock timeout | Another session holds a conflicting lock (e.g., a long-running SELECT, autovacuum, or another DDL). | Retry the migration during low-traffic. Increase lock_timeout only as a last resort; investigate what holds the lock with SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock'. |
ERROR: cannot use a subquery in a check constraint | Attempted CHECK (status IN (SELECT ...)) or similar correlated subquery inside the ADD CONSTRAINT statement. | Check constraints must use only simple expressions referencing the current row. Replace subqueries with explicit value lists: CHECK (status IN ('pending', 'active', 'closed')). |
ERROR: deadlock detected | Batch UPDATE transactions are large enough to intersect with each other (parallel migration runs) or with OLTP writes touching the same rows. | Reduce batch size (try 1,000 rows). Add an index on the column being backfilled (CREATE INDEX CONCURRENTLY orders_status_null_idx ON orders (id) WHERE status IS NULL) so each batch scan is an index seek, not a sequential scan. Ensure only one migration worker runs at a time. |
WARNING: there is already a transaction in progress | conn.commit() called inside an Alembic begin_transaction() context that auto-manages the transaction. | Use op.get_bind() and call conn.execute(text("COMMIT")) directly, or restructure the migration to use connection.execution_options(no_begin=True) for the batch loop. |
Advanced NOT NULL Constraint Optimization
PostgreSQL 11+ Fast Default Path
On PostgreSQL 11 and later, when you add a column with a constant server_default and mark it NOT NULL in a single ALTER TABLE, Postgres stores the default value in pg_attribute (the system catalog) rather than physically writing it into every row. The table itself is not rewritten. The operation is essentially instantaneous regardless of table size.
# migrations/versions/xxxx_add_priority_pg11.py
from alembic import op
import sqlalchemy as sa
from sqlalchemy import text
def upgrade() -> None:
conn = op.get_bind()
conn.execute(text("SET lock_timeout = '2s'"))
# On PG 11+: near-instant even on a 100M-row table.
# Postgres records the default in pg_attribute; rows read the default
# lazily until they are next modified (VACUUM or UPDATE rewrites them).
op.add_column(
"orders",
sa.Column(
"priority",
sa.Integer(),
nullable=False,
server_default="0",
),
)
# Remove the server_default so new inserts must supply an explicit value.
op.alter_column("orders", "priority", server_default=None)
def downgrade() -> None:
op.drop_column("orders", "priority")
When this applies: The default must be a constant literal (an integer, a string, a function call like now() counts as non-constant and triggers a rewrite on older PG versions). Verify your Postgres version with SELECT version() before relying on this. On PG 10 and earlier, the same statement rewrites every row.
The CHECK NOT VALID + VALIDATE CONSTRAINT Trick in Detail
This is the most broadly applicable technique because it works across PG 10–17 and handles non-constant defaults:
ADD CONSTRAINT ... CHECK (...) NOT VALID— acquiresACCESS EXCLUSIVElock only briefly to record the constraint in the catalog. It does not scan existing rows. New rows and updated rows are validated immediately. Existing rows remain unchecked.VALIDATE CONSTRAINT— scans all existing rows but takes onlySHARE UPDATE EXCLUSIVElock. This lock is compatible with concurrentSELECT,INSERT,UPDATE, andDELETE. Your application continues serving traffic during the scan.ALTER COLUMN ... SET NOT NULL— on PG 12+, Postgres inspects the catalog and finds the validatedCHECK (status IS NOT NULL)constraint. It concludes a full table scan is unnecessary and the operation takes only a briefACCESS EXCLUSIVElock to update column metadata.
# Demonstration of the three lock levels
# Step 3b lock: ACCESS EXCLUSIVE (brief, catalog write only)
conn.execute(text(
"ALTER TABLE orders "
"ADD CONSTRAINT orders_status_not_null "
"CHECK (status IS NOT NULL) NOT VALID"
))
# Step 3c lock: SHARE UPDATE EXCLUSIVE (long scan, but non-blocking for reads/writes)
conn.execute(text(
"ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null"
))
# Step 3d lock: ACCESS EXCLUSIVE (near-instant on PG 12+ — constraint already verified)
conn.execute(text(
"ALTER TABLE orders ALTER COLUMN status SET NOT NULL"
))
On PG 11 and older, step 3d still performs a full table scan even with the validated CHECK constraint, so the safe option on older versions is to leave the CHECK constraint in place instead of converting it to a NOT NULL column attribute.
Frequently Asked Questions
Does the batch UPDATE need to run inside the Alembic migration, or can it be a separate script?
It can be a separate script run before the migration — this is often preferable for very large tables (hundreds of millions of rows) where the backfill takes hours. In that case, step 1 (add nullable column) is one migration deployed first, the backfill script runs separately and can be monitored, and steps 3–4 (enforce constraint) are a second migration deployed after the backfill completes. Splitting it this way avoids migration timeout issues in deployment pipelines.
Why use WHERE id IN (SELECT id FROM orders WHERE status IS NULL LIMIT :batch) rather than a plain UPDATE ... WHERE status IS NULL LIMIT :batch?
Postgres does not support LIMIT in UPDATE statements directly. The subquery pattern is the idiomatic workaround. The inner SELECT uses an index on id and a filter on status IS NULL, making each batch a bounded operation. Ensure there is an index that supports the WHERE status IS NULL filter — either the primary key index (if status is the only NULL column and a partial index exists) or a dedicated CREATE INDEX CONCURRENTLY on (id) WHERE status IS NULL.
What happens to rows inserted during the backfill window?
Application code inserting rows during the backfill window should always supply a value for status explicitly (since it has no DEFAULT at this point — step 1 added it as nullable with no default). If application code cannot be deployed simultaneously, add a server_default in step 1 so the database fills in the default for any insert that omits the column, then remove the server_default at step 4.
Can I combine steps 3b–3d into a single transaction?
No. VALIDATE CONSTRAINT must be committed before ALTER COLUMN ... SET NOT NULL for PG 12+ to recognize the existing validated CHECK and skip the rescan. If both statements run in the same transaction, the planner has not yet written the validated constraint to the catalog, and SET NOT NULL will perform a full table scan under ACCESS EXCLUSIVE lock — exactly what you wanted to avoid.
Related
- Zero-Downtime Schema Migration Strategies — parent section covering the full range of lock-safe migration patterns
- Configuring Alembic with Async SQLAlchemy Engines — how to wire
AsyncEngineintoenv.pyso migrations run in async projects - Alembic Async Migrations and Schema Evolution — pillar covering all migration topics for async SQLAlchemy projects