Zero-Downtime Schema Migration Strategies with Alembic and PostgreSQL
Running schema changes against a live PostgreSQL database without downtime requires deliberate sequencing, careful lock management, and a disciplined split between schema evolution and application deployment. The core principle is that your database schema and your application code must remain mutually compatible across at least two consecutive releases — the current version and the one being rolled out. This guide covers the expand-contract pattern, safe column additions, concurrent index builds, lock timeout guards, and how to orchestrate destructive changes across separate deployments using Alembic with async SQLAlchemy engines.
Concept & Execution Model
Zero-downtime migrations rest on a single invariant: at no point during a rolling deployment should the running application version be unable to read or write the database. This rules out any migration that removes a column before all app instances stop using it, adds a NOT NULL constraint before backfilling every row, or holds an AccessExclusiveLock long enough to queue writes behind it.
PostgreSQL's locking model is the primary obstacle. DDL statements — ALTER TABLE, DROP COLUMN, ADD CONSTRAINT — acquire an AccessExclusiveLock on the target table that blocks every concurrent read and write for the duration of the statement. On large tables, a single unconstrained ALTER TABLE ADD COLUMN DEFAULT 'x' could rewrite millions of rows under lock, stalling production traffic for seconds or minutes. PostgreSQL 11 eliminated the rewrite for simple defaults, but constraints, indexes, and nullable changes still demand care.
The expand-contract pattern structures schema evolution into three distinct phases that map cleanly to release boundaries:
- Expand — add the new column, table, or index alongside the existing structure. Old code continues to work unchanged; new code can optionally use the new structure.
- Migrate — backfill data, create supporting indexes, and validate constraints. The database and all running app versions remain compatible.
- Contract — remove the now-obsolete column, table, or constraint after all application instances have been updated and verified.
Each phase ships as one or more Alembic migration scripts, each targeting a specific release. The SVG below illustrates the release timeline:
Importantly, the expand and contract phases must ship in separate releases separated by enough time to confirm a full fleet rollout. Collapsing them into a single deployment is the most common source of production incidents.
Migration Construction & Alembic Execution Patterns
Adding Columns with Defaults Safely
PostgreSQL 11 introduced a major optimisation: adding a column with a DEFAULT value no longer rewrites the table. Instead, PostgreSQL stores the default in pg_attribute and serves it lazily for existing rows. This makes the following pattern safe even on tables with hundreds of millions of rows:
"""Add status column to orders with a safe server default.
Revision ID: a1b2c3d4e5f6
Revises: 9z8y7x6w5v4u
Create Date: 2026-06-18 09:00:00.000000
"""
from __future__ import annotations
from alembic import op
import sqlalchemy as sa
revision = "a1b2c3d4e5f6"
down_revision = "9z8y7x6w5v4u"
branch_labels = None
depends_on = None
def upgrade() -> None:
# PostgreSQL 11+: server_default avoids a full table rewrite.
# Use server_default (DDL-level) NOT default (Python-level) for this guarantee.
op.add_column(
"orders",
sa.Column(
"fulfillment_status",
sa.String(length=32),
nullable=True, # Nullable in the expand phase; NOT NULL comes later
server_default="pending",
),
)
def downgrade() -> None:
op.drop_column("orders", "fulfillment_status")
The critical distinction is server_default versus default. server_default is a DDL-level expression evaluated by PostgreSQL itself, enabling the fast-path optimisation. default is a Python-side value that SQLAlchemy inserts into every INSERT statement — it provides no DDL instruction and therefore triggers a full table rewrite for the column addition.
Avoid adding a NOT NULL column in the same migration as the column creation. Adding NOT NULL requires PostgreSQL to verify every row, which acquires an AccessExclusiveLock for the duration of the table scan. The correct sequence is: expand (nullable column with server_default), migrate (backfill, then add a CHECK NOT VALID constraint), contract (validate the constraint and alter to NOT NULL). The detailed walkthrough lives in adding a NOT NULL column without locking in Postgres.
CREATE INDEX CONCURRENTLY
Building an index on a large table without CONCURRENTLY acquires a ShareLock that blocks all writes for the index build duration — potentially minutes on large tables. CREATE INDEX CONCURRENTLY performs the build in two passes without blocking writes, at the cost of taking longer overall.
The critical constraint: CREATE INDEX CONCURRENTLY cannot run inside a transaction. Alembic wraps each migration in a transaction by default, so you must explicitly leave the transaction context. The correct approach uses op.get_bind() to retrieve the connection and issues the statement in autocommit mode via an explicit execution option:
"""Create concurrent index on invoices.tenant_id.
Revision ID: b2c3d4e5f6a7
Revises: a1b2c3d4e5f6
Create Date: 2026-06-18 10:00:00.000000
"""
from __future__ import annotations
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
revision = "b2c3d4e5f6a7"
down_revision = "a1b2c3d4e5f6"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Disable the Alembic transaction wrapper for this migration.
# CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
op.execute(sa.text("COMMIT")) # End the implicit transaction Alembic opened
# Execute in autocommit so PostgreSQL can run the concurrent build.
conn = op.get_bind()
conn.execution_options(isolation_level="AUTOCOMMIT").execute(
sa.text(
"CREATE INDEX CONCURRENTLY IF NOT EXISTS "
"ix_invoices_tenant_id ON invoices (tenant_id)"
)
)
def downgrade() -> None:
conn = op.get_bind()
conn.execution_options(isolation_level="AUTOCOMMIT").execute(
sa.text("DROP INDEX CONCURRENTLY IF EXISTS ix_invoices_tenant_id")
)
A common anti-pattern is wrapping the index build inside alembic.context.begin_transaction() or calling migration_context.connection.execute() while still inside the Alembic transaction context. Both approaches raise psycopg2.errors.ActiveSqlTransaction (or the asyncpg equivalent) because PostgreSQL rejects CREATE INDEX CONCURRENTLY inside any open transaction block.
When using an async engine, the pattern adapts slightly. Because Alembic's run_async_migrations function already manages connection lifecycle, you should configure the migration script to run in non-transactional mode from the start for any migration containing concurrent index operations:
"""Async-compatible concurrent index migration.
Revision ID: c3d4e5f6a7b8
Revises: b2c3d4e5f6a7
Create Date: 2026-06-18 11:00:00.000000
"""
from __future__ import annotations
from alembic import op
import sqlalchemy as sa
revision = "c3d4e5f6a7b8"
down_revision = "b2c3d4e5f6a7"
branch_labels = None
depends_on = None
# Tell Alembic not to wrap this migration in a transaction.
# Required for CREATE INDEX CONCURRENTLY with asyncpg.
transactional_ddl = False # Alembic respects this attribute
def upgrade() -> None:
op.execute(
sa.text(
"CREATE INDEX CONCURRENTLY IF NOT EXISTS "
"ix_products_category_id ON products (category_id)"
)
)
def downgrade() -> None:
op.execute(
sa.text("DROP INDEX CONCURRENTLY IF EXISTS ix_products_category_id")
)
Setting transactional_ddl = False at module level in the migration file instructs Alembic to skip the implicit BEGIN / COMMIT wrapping, which is the cleanest path when the entire migration consists of concurrent DDL operations. See configuring Alembic with async SQLAlchemy engines for the env.py setup that wires async connections into the Alembic migration context.
Transaction Boundaries & Lock Management
Setting Lock Timeouts
Long-running schema changes that queue behind an existing long-running query cause cascading problems: they hold an AccessExclusiveLock while waiting, and every subsequent query to that table queues behind them. A 2-second lock_timeout combined with a statement_timeout cap is the minimal guard for any migration that touches a live table.
"""Migration with explicit lock and statement timeouts.
Revision ID: d4e5f6a7b8c9
Revises: c3d4e5f6a7b8
Create Date: 2026-06-18 12:00:00.000000
"""
from __future__ import annotations
from alembic import op
import sqlalchemy as sa
revision = "d4e5f6a7b8c9"
down_revision = "c3d4e5f6a7b8"
branch_labels = None
depends_on = None
def upgrade() -> None:
# Fail fast rather than queue behind a long-running transaction.
# lock_timeout: abort if we cannot acquire the lock within 2 seconds.
# statement_timeout: abort the entire statement if it runs longer than 30s.
op.execute(sa.text("SET lock_timeout = '2s'"))
op.execute(sa.text("SET statement_timeout = '30s'"))
op.add_column(
"tenants",
sa.Column("plan_tier", sa.String(length=16), nullable=True),
)
# Reset to session defaults so subsequent statements aren't constrained.
op.execute(sa.text("SET lock_timeout = DEFAULT"))
op.execute(sa.text("SET statement_timeout = DEFAULT"))
def downgrade() -> None:
op.execute(sa.text("SET lock_timeout = '2s'"))
op.execute(sa.text("SET statement_timeout = '30s'"))
op.drop_column("tenants", "plan_tier")
op.execute(sa.text("SET lock_timeout = DEFAULT"))
op.execute(sa.text("SET statement_timeout = DEFAULT"))
SET lock_timeout and SET statement_timeout are session-level settings that take effect immediately and persist only for the duration of the current connection. They do not require a DDL lock themselves. Setting them at the very start of upgrade() ensures that any subsequent DDL that cannot immediately acquire its lock will raise psycopg2.errors.LockNotAvailable within 2 seconds rather than building a queue.
Understanding the Lock Queue
PostgreSQL's lock queue is FIFO. When a CREATE INDEX or ALTER TABLE waits for an AccessExclusiveLock, every new query to that table queues behind it — even SELECT queries that would normally use AccessShareLock. The combination of lock_timeout and scheduling migrations during low-traffic windows (combined with retry logic in your deployment pipeline) is the standard approach: if the migration times out, your deployment fails fast and cleanly rather than degrading production for minutes.
For the most demanding tables — millions of rows, high write concurrency, zero tolerance for lock waits — consider pg_try_advisory_lock to coordinate with application-level connection draining before running the DDL. This is an advanced pattern outside the scope of Alembic's built-in facilities.
Advanced Zero-Downtime Patterns
Backwards-Compatible Deploys
Your application fleet runs multiple versions simultaneously during a rolling deployment. For a two-stage expand-contract cycle, this means:
- Release N ships the Alembic migration that adds the new column (nullable, with
server_default). - The new app code writes to both the old and new columns during the transition window.
- Release N+1 removes the write to the old column and reads only from the new one.
- Release N+2 ships the Alembic migration that drops the old column.
The "write to both columns" pattern in the transition release is sometimes called a dual-write. It is essential when the old column is still being read by concurrently deployed instances of the previous release.
from __future__ import annotations
from sqlalchemy import String, select, update
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import Mapped, mapped_column, DeclarativeBase
class Base(DeclarativeBase):
pass
class User(Base):
__tablename__ = "users"
id: Mapped[int] = mapped_column(primary_key=True)
# Legacy column — still present during transition
username: Mapped[str | None] = mapped_column(String(128), nullable=True)
# New canonical column — added in expand phase
display_name: Mapped[str | None] = mapped_column(String(128), nullable=True)
async def update_user_display(
session: AsyncSession, user_id: int, name: str
) -> None:
"""Dual-write during transition: populate both columns simultaneously."""
await session.execute(
update(User)
.where(User.id == user_id)
.values(
username=name, # Keep populating for old app instances
display_name=name, # Populate new column for new app instances
)
)
await session.commit()
Once Release N+1 is fully deployed and verified, the dual-write simplifies to display_name only, and the username column is safe to drop in Release N+2.
Renaming a Column
PostgreSQL supports ALTER TABLE RENAME COLUMN which acquires AccessExclusiveLock only briefly (it updates catalog metadata, not rows). However, the application must not reference the old column name after the rename. The safe rename sequence is:
- Expand: add the new column, dual-write to both.
- Migrate: backfill the new column from the old.
- Contract release 1: remove reads of the old column from application code.
- Contract release 2:
ALTER TABLE RENAME COLUMN old TO neworDROP COLUMN old.
Avoid using Alembic's op.alter_column(..., new_column_name=...) on tables that are actively read in production without the dual-write transition in place.
Splitting Destructive Changes Across Releases
Dropping a column is the canonical example of a destructive change. The rule is absolute: the Alembic migration that drops a column must ship in a separate release after all application code that reads or writes that column has been removed and deployed across the entire fleet.
"""PHASE 3 ONLY: Drop legacy username column after full fleet migration.
This migration MUST NOT be merged until Release N+1 (which removes all
application references to 'username') is fully deployed and verified.
Revision ID: e5f6a7b8c9d0
Revises: d4e5f6a7b8c9
Create Date: 2026-06-18 14:00:00.000000
"""
from __future__ import annotations
from alembic import op
import sqlalchemy as sa
revision = "e5f6a7b8c9d0"
down_revision = "d4e5f6a7b8c9"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.execute(sa.text("SET lock_timeout = '2s'"))
op.drop_column("users", "username")
def downgrade() -> None:
# Restore the column (data loss — downgrade restores structure only).
op.add_column(
"users",
sa.Column("username", sa.String(length=128), nullable=True),
)
Include the prerequisite deployment condition in the migration docstring so it is visible in version control history. Some teams enforce this with a migration comment convention checked by their CI pipeline. When reviewing autogenerated migration scripts, search for drop_column and drop_table operations and verify they are guarded by the appropriate release sequencing.
Backfilling Large Tables
Backfilling millions of rows in a single transaction holds locks for the entire duration of the UPDATE. The zero-downtime approach is to batch the backfill using ctid ranges or primary key pagination, committing after each batch:
from __future__ import annotations
import asyncio
from sqlalchemy import text, update
from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine
async def backfill_display_name_in_batches(engine: AsyncEngine) -> None:
"""Backfill display_name from username in batches of 5000 rows.
Run this as a one-off script during the migrate phase, not inside
an Alembic migration function, to allow safe interruption and resumption.
"""
batch_size = 5_000
last_id = 0
async with engine.connect() as conn:
while True:
result = await conn.execute(
text(
"UPDATE users SET display_name = username "
"WHERE id > :last_id AND display_name IS NULL "
"ORDER BY id LIMIT :batch_size "
"RETURNING id"
),
{"last_id": last_id, "batch_size": batch_size},
)
rows = result.fetchall()
await conn.commit()
if not rows:
break
last_id = max(r[0] for r in rows)
print(f"Backfilled up to id={last_id}")
# Brief pause to reduce I/O pressure on the primary
await asyncio.sleep(0.05)
Run this backfill script outside of Alembic's migration runner — as a standalone administrative script — so it can be interrupted, monitored, and resumed safely. Place it in a scripts/ or migrations/backfills/ directory with a comment linking to the corresponding Alembic revision.
Hybrid Strategies & 1.4 → 2.0 Migration Path
op.batch_alter_table for SQLite
SQLite does not support ALTER TABLE ADD COLUMN with constraints, foreign keys, or NOT NULL in the same way PostgreSQL does. Alembic's op.batch_alter_table handles SQLite by performing a table recreation: it creates a new table with the target schema, copies data, drops the old table, and renames the new one. This is safe for development and test environments but is not suitable for production zero-downtime migrations on PostgreSQL.
"""SQLite-compatible migration using batch_alter_table.
On PostgreSQL, prefer native op.add_column / op.drop_column.
On SQLite (dev/test), batch_alter_table handles schema reconstruction.
Revision ID: f6a7b8c9d0e1
Revises: e5f6a7b8c9d0
Create Date: 2026-06-18 15:00:00.000000
"""
from __future__ import annotations
from alembic import op
import sqlalchemy as sa
from sqlalchemy.engine import Inspector
revision = "f6a7b8c9d0e1"
down_revision = "e5f6a7b8c9d0"
branch_labels = None
depends_on = None
def upgrade() -> None:
bind = op.get_bind()
if bind.dialect.name == "sqlite":
# SQLite requires table recreation for constraint changes.
with op.batch_alter_table("products") as batch_op:
batch_op.add_column(
sa.Column("sku", sa.String(length=64), nullable=True)
)
else:
# PostgreSQL: native ALTER TABLE, no rewrite needed.
op.execute(sa.text("SET lock_timeout = '2s'"))
op.add_column(
"products",
sa.Column("sku", sa.String(length=64), nullable=True),
)
def downgrade() -> None:
bind = op.get_bind()
if bind.dialect.name == "sqlite":
with op.batch_alter_table("products") as batch_op:
batch_op.drop_column("sku")
else:
op.execute(sa.text("SET lock_timeout = '2s'"))
op.drop_column("products", "sku")
Migrating 1.4 Migration Scripts to 2.0 Patterns
Alembic migration files authored against SQLAlchemy 1.4 are almost always forward-compatible: op.add_column, op.drop_column, and op.create_index function identically in 2.0. The main change is in env.py — specifically, replacing synchronous engine_from_config with create_async_engine and wrapping context.run_migrations() in asyncio.run(run_async_migrations()). The migration script files themselves rarely need edits.
However, if your 1.4 migrations use op.get_bind().execute(string_sql) with raw string queries (rather than sa.text(...)), these will raise ObjectNotExecutableError in SQLAlchemy 2.0 because plain strings are no longer accepted as executable. Replace every bare string with sa.text(...):
# 1.4 pattern — raises ObjectNotExecutableError in 2.0
op.get_bind().execute("UPDATE orders SET status = 'legacy' WHERE status IS NULL")
# 2.0 compatible
op.execute(sa.text("UPDATE orders SET status = 'legacy' WHERE status IS NULL"))
The full env.py configuration for async engines is covered in configuring Alembic with async SQLAlchemy engines.
Production Pitfalls & Anti-Patterns
- Dropping a column in the same release that removes it from application code. If any instance of the previous release is still running when the migration executes, reads against the dropped column will raise
UndefinedColumnError. Always ship the drop migration one release after the code removal. - Using
op.execute("CREATE INDEX ...")inside a transactional migration. Alembic wraps migrations inBEGIN/COMMITby default.CREATE INDEX CONCURRENTLYinside a transaction raisespsycopg2.errors.ActiveSqlTransaction. Usetransactional_ddl = Falseorop.get_bind().execution_options(isolation_level="AUTOCOMMIT"). - Applying a
NOT NULLconstraint without a prior backfill.ALTER TABLE ALTER COLUMN SET NOT NULLacquiresAccessExclusiveLockand scans every row to verify nullability. On a table with unfilled rows, this fails immediately with a constraint violation. UseADD CONSTRAINT ... NOT VALIDfollowed byVALIDATE CONSTRAINTin a separate step —VALIDATE CONSTRAINTuses a weakerShareUpdateExclusiveLockthat does not block reads. - Mixing concurrent index creation with other DDL in the same migration.
CREATE INDEX CONCURRENTLYcannot share a transaction withALTER TABLEor other DDL. Isolate concurrent index migrations in their own revision file withtransactional_ddl = False. - Relying on
op.execute(sa.text("COMMIT"))as the sole mechanism to escape the transaction. This commits the Alembic transaction prematurely, leaving the migration state partially updated inalembic_version. If the migration fails after the manualCOMMIT, the revision record may not be written. Usetransactional_ddl = Falseinstead to keep Alembic's bookkeeping intact. - Setting
statement_timeoutglobally inpostgresql.confinstead of per-migration. A globalstatement_timeoutthat is too aggressive will abort legitimate long-running analytical queries. Setstatement_timeoutat the session level insideupgrade()and reset it withSET statement_timeout = DEFAULTbefore the function returns.
Frequently Asked Questions
Can I add multiple columns in a single Alembic migration without locking?
Yes, for PostgreSQL 11+. Each ADD COLUMN with a server_default is a metadata-only operation that avoids table rewrites and requires only a brief AccessExclusiveLock to update the catalog. Combining multiple ADD COLUMN statements in a single ALTER TABLE reduces the number of lock acquisitions to one. Use op.add_column calls within the same migration, and Alembic will batch them if the dialect supports it. Nullable columns without defaults are always instant.
What happens if CREATE INDEX CONCURRENTLY fails partway through?
PostgreSQL marks the index as INVALID in pg_index. The index consumes space but is not used by the query planner. You must DROP INDEX the invalid index and retry. Use IF NOT EXISTS in your migration and check pg_indexes for invalid entries before retrying. Alembic does not automatically clean up invalid indexes.
How do I handle a migration that times out because lock_timeout fires?
The migration fails with psycopg2.errors.LockNotAvailable. Alembic rolls back the transaction (or the operation fails without rollback if using transactional_ddl = False). Your deployment pipeline should treat this as a retriable failure. Schedule a retry during a lower-traffic window, or investigate the long-running query holding the conflicting lock using pg_stat_activity and pg_locks.
Is it safe to run Alembic upgrade head in a Kubernetes init container?
Yes, with caveats. Multiple replicas starting simultaneously may attempt to run migrations concurrently. Alembic does not provide distributed locking out of the box. Use an advisory lock at the start of your migration runner:
import asyncio
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine
from alembic import command
from alembic.config import Config
async def run_migrations_with_advisory_lock() -> None:
engine = create_async_engine("postgresql+asyncpg://user:pass@host/db")
async with engine.connect() as conn:
# pg_try_advisory_lock returns true only for the first caller.
# Other callers get false and should skip migration.
result = await conn.execute(text("SELECT pg_try_advisory_lock(12345)"))
acquired = result.scalar()
if not acquired:
print("Another instance is running migrations — skipping.")
return
cfg = Config("alembic.ini")
command.upgrade(cfg, "head")
Does op.batch_alter_table work safely on PostgreSQL?
op.batch_alter_table performs a table recreation on SQLite, but on PostgreSQL it delegates to native ALTER TABLE statements. It is safe on PostgreSQL and produces identical results to the native operations. The main use case for batch_alter_table on PostgreSQL is writing dialect-agnostic migrations that work in both environments without branching on bind.dialect.name.
How do I verify that a migration is backwards-compatible before deploying?
Review the generated migration with the autogenerate workflow and check for: any drop_column or drop_table operations (require prior code removal), any NOT NULL constraint additions without a preceding backfill, any CREATE INDEX without CONCURRENTLY, and any missing lock_timeout guards on tables above a few million rows.
Related
- Configuring Alembic with Async SQLAlchemy Engines — setting up
env.pyto drive Alembic migrations through anasyncpg-backed async engine, including therun_async_migrationspattern and connection lifecycle management. - Autogenerating and Reviewing Migration Scripts — how to use
alembic revision --autogenerate, what it cannot detect automatically, and the review checklist for catching destructive or lock-heavy operations before they reach production. - Adding a NOT NULL Column Without Locking in Postgres — the three-step sequence for safely constraining a column to
NOT NULLon a large live table usingNOT VALIDconstraints and deferred validation. - Setting Up asyncpg Connection Pool Size for High Concurrency — tuning the connection pool available to your migration runner and application layer for concurrent workloads.
- Alembic Async Migrations and Schema Evolution — the parent reference covering the full scope of Alembic + async SQLAlchemy: engine setup, script generation, migration execution, and schema evolution patterns.