Configuring Pool Pre-Ping to Handle Stale Connections

Set pool_pre_ping=True in create_async_engine() to automatically discard connections whose underlying TCP socket was closed by the database or a network device while the connection sat idle in the pool — this is the direct fix for OperationalError: server closed the connection unexpectedly and asyncpg.exceptions.ConnectionDoesNotExistError in the handling connection leaks and pool exhaustion family of problems.

Quick Answer

# Before — stale connections handed out after idle timeout, causing OperationalError
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@db.internal/app",
    pool_size=10,
    max_overflow=5,
)

# After — pool validates each connection before checkout; stale sockets discarded
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@db.internal/app",
    pool_size=10,
    max_overflow=5,
    pool_pre_ping=True,       # send a lightweight probe before each checkout
    pool_recycle=1800,        # also recycle connections after 30 minutes (complement)
)

Both parameters address the same root failure — a connection whose socket is no longer alive — from different angles. pool_pre_ping detects the problem at checkout time. pool_recycle prevents long-lived connections from ever reaching the age where cloud infrastructure is likely to terminate them.

The error they prevent is easy to reproduce: start your application, let the connection pool fill with established connections, wait longer than the database's idle connection timeout (or trigger a failover), then make a new request. Without pre-ping or recycle, the first few requests receive dead sockets and raise OperationalError. With pre-ping, those dead connections are transparently replaced with fresh ones before your application code ever sees them.

Execution Context & Async Workflow Integration

Why Connections Go Stale

When a connection is checked back into AsyncAdaptedQueuePool, it is kept open and reused for future checkouts. The connection's underlying TCP socket remains open to the database server. Several infrastructure events can close that socket while the connection appears healthy inside the pool:

  • Cloud database idle connection timeouts — AWS RDS Postgres defaults to terminating connections idle for 8 hours via tcp_keepalives_idle. Aurora and Cloud SQL have similar timeouts, some as short as 10–15 minutes for serverless tiers.
  • Failover and replica promotion — during an RDS Multi-AZ failover, the old primary's connections are terminated. The pool holds references to sockets that now point at a dead endpoint.
  • NAT gateway and load balancer timeouts — AWS NLBs time out idle TCP flows after 350 seconds by default. Connections sitting in the pool longer than this threshold become half-open sockets that appear active to the application but are dropped by the intermediary.
  • Database pg_terminate_backend() calls — manual DBA intervention or automated maintenance scripts may terminate long-lived idle connections.

In all these cases, SQLAlchemy's pool has no way to know the socket is dead until it tries to use the connection. Without pre-ping, the first query on the stale connection raises an error.

What pool_pre_ping Does

When pool_pre_ping=True is set, SQLAlchemy executes a dialect-appropriate probe query immediately before handing a connection to the caller. For asyncpg, this is a lightweight SELECT 1 executed over the existing socket. The sequence:

  1. Coroutine requests a connection via async with engine.connect() or async with AsyncSessionLocal() as session.
  2. Pool selects the least-recently-used connection from its queue.
  3. Pool sends SELECT 1 to the database over that connection's socket.
  4. If the probe succeeds: connection is handed to the caller normally.
  5. If the probe fails with a disconnect error: the connection is discarded, a fresh connection is opened, and the new connection is handed to the caller.

The caller never sees the stale connection. From the coroutine's perspective, checkout simply takes slightly longer than usual when a stale connection is encountered and replaced.

Latency Cost and Event Loop Impact

The probe adds one network round-trip per checkout for every connection that has been idle. On a LAN or VPC with 0.5–2 ms RTT to the database, this cost is negligible for most workloads. On high-throughput services making thousands of requests per second, the aggregate overhead becomes measurable.

Profile your actual p99 checkout latency with and without pool_pre_ping enabled using the pool event listeners described in the connection leaks and pool exhaustion guide. If pre-ping adds more than 10% to your checkout overhead, use pool_recycle as the primary defence and reserve pool_pre_ping for environments with unpredictable disconnects (multi-AZ failovers, spot instance databases).

You can observe when pre-ping triggers by watching the sqlalchemy.pool log output with echo_pool=True. A successful pre-ping produces no visible output — only a failed pre-ping (one that discards a stale connection) produces a log line similar to:

Pool pre-ping on connection <asyncpg...> failed, will attempt to reconnect (...)

Seeing this message frequently is a signal that your pool_recycle setting is too long relative to the infrastructure's idle timeout, or that your database is restarting/failing over frequently.

pool_recycle as a Complement

pool_recycle=N tells the pool to discard any connection that has been open for more than N seconds, regardless of activity. It prevents connections from living long enough for infrastructure timeouts to close them in the first place:

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@db.internal/app",
    pool_size=10,
    max_overflow=5,
    pool_recycle=1800,   # recycle before RDS's 8-hour idle timeout
    pool_pre_ping=True,  # catch any that slip through during failover
)

A practical rule: set pool_recycle to less than half the database's idle connection timeout. For RDS Postgres (default 8 hours), pool_recycle=1800 (30 minutes) is conservative. For Aurora Serverless with a 10-minute idle pause timeout, use pool_recycle=300.

The two parameters are not redundant — they protect against different failure modes:

Scenariopool_pre_ping protects youpool_recycle protects you
Database idle connection timeout (predictable)Partially — catches at next checkoutYes — connection never reaches the age threshold
Multi-AZ failover (sudden)Yes — detects dead socket at checkoutNo — recycle doesn't help if failover happens between recycle and next checkout
NAT gateway TCP idle timeout (unpredictable)YesOnly if recycle interval is shorter than NAT timeout
High-latency environment (p99 checkout > 5ms SLA)Adds latency — consider disablingNo latency at checkout

For cloud PostgreSQL on RDS, Aurora, or Cloud SQL, both parameters together form the minimum viable stale-connection defence.

Resolving Warnings, Errors & Common Mistakes

Error / WarningRoot CauseProduction Fix
OperationalError: server closed the connection unexpectedlyTCP socket closed by DB/network while connection was idle in poolAdd pool_pre_ping=True; set pool_recycle below DB idle timeout
asyncpg.exceptions.ConnectionDoesNotExistErrorasyncpg attempted to use a connection whose socket is goneSame — pool_pre_ping=True catches the stale connection before asyncpg touches it
asyncpg.exceptions.InterfaceError: connection is closedConnection already invalidated (often by a prior failed transaction)Ensure async with session.begin(): is used; add pool_pre_ping=True for robustness
sqlalchemy.exc.TimeoutError: QueuePool limit of size X overflow Y reachedConnections not returned to pool (leak), unrelated to pre-pingFix context-manager discipline; pre-ping does NOT help with leaks
OperationalError: SSL SYSCALL error: EOF detectedSSL connection terminated mid-flight (often during failover)Add pool_pre_ping=True; also configure asyncpg ssl='require' with reconnect logic
High checkout latency (p99 spikes)Pre-ping probe on every checkout under high throughputUse pool_recycle instead of pre-ping for latency-sensitive paths; or tune pool_size up to reduce stale-connection frequency
WARNING: pool pre-ping failed, reconnecting in logsPre-ping detected a dead connection; a new connection is being madeThis is the intended behaviour — only a concern if it fires very frequently (investigate DB restarts or NAT timeout)

Advanced Pool Pre-Ping Optimization

Selective Pre-Ping with Connection Age Tracking

If pre-ping overhead is measurable but you still need stale-connection protection, you can apply the probe selectively: only check connections that have been idle longer than a threshold. This requires wrapping the checkout event to track last-use time:

import time
from sqlalchemy import event
from sqlalchemy.ext.asyncio import create_async_engine, AsyncEngine

# Build engine WITHOUT global pre-ping — we will implement selective checking
engine = create_async_engine(
    "postgresql+asyncpg://user:pass@db.internal/app",
    pool_size=10,
    max_overflow=5,
    pool_recycle=1800,
    pool_pre_ping=False,  # disabled globally
)

STALE_THRESHOLD_SECONDS = 60  # probe connections idle for more than 60s


def attach_selective_pre_ping(eng: AsyncEngine) -> None:
    sync_pool = eng.sync_engine.pool
    last_checkin: dict[int, float] = {}

    @event.listens_for(sync_pool, "checkin")
    def record_checkin(dbapi_conn, conn_record) -> None:
        last_checkin[id(dbapi_conn)] = time.monotonic()

    @event.listens_for(sync_pool, "checkout")
    def selective_ping(dbapi_conn, conn_record, conn_proxy) -> None:
        conn_id = id(dbapi_conn)
        idle_for = time.monotonic() - last_checkin.get(conn_id, 0)
        if idle_for > STALE_THRESHOLD_SECONDS:
            # Manually test the connection using the dialect's ping mechanism
            try:
                dbapi_conn.ping(reconnect=False)
            except Exception:
                # invalidate forces the pool to open a fresh connection
                conn_record.invalidate()
                raise


attach_selective_pre_ping(engine)

Note: dbapi_conn.ping() is available on aiomysql connections. For asyncpg, the pool-level invalidation on checkout failure is handled by SQLAlchemy's pre-ping machinery itself — the most reliable approach for asyncpg remains the built-in pool_pre_ping=True.

Combining pool_pre_ping with Retry Logic

Pre-ping handles the idle-pool case, but a stale connection can also appear mid-transaction during a failover. For full resilience, add application-level retry around the operation:

import asyncio
from sqlalchemy.exc import OperationalError
from sqlalchemy.ext.asyncio import AsyncSession


async def execute_with_retry(session: AsyncSession, stmt, *, retries: int = 2):
    """Execute a statement, retrying once on disconnect errors."""
    for attempt in range(retries + 1):
        try:
            result = await session.execute(stmt)
            return result
        except OperationalError as exc:
            if attempt < retries and "server closed the connection" in str(exc):
                await asyncio.sleep(0.1 * (2 ** attempt))  # brief backoff
                await session.rollback()
                continue
            raise

This retry wrapper should be narrow — only catching OperationalError with known disconnect messages — to avoid masking genuine query errors.

Frequently Asked Questions

Does pool_pre_ping work with asyncpg specifically? Yes. For the postgresql+asyncpg dialect, SQLAlchemy's pre-ping sends SELECT 1 through asyncpg's connection and catches asyncpg.exceptions.ConnectionDoesNotExistError and asyncpg.exceptions.InterfaceError as indicators of a dead connection. The stale connection is invalidated and a fresh one is transparently substituted before the connection reaches your code.

Should I use pool_pre_ping=True for every async engine? Use it whenever: your application is deployed against a cloud-managed database (RDS, Cloud SQL, Azure Database), your traffic has long idle periods between requests (the pool holds connections but no queries run for minutes), or you operate in a multi-AZ or replica-set topology where failovers terminate connections. Skip it (or combine with selective logic) if you have a latency SLA under 5 ms and your pool is never idle long enough for sockets to go stale.

What is the difference between pool_pre_ping and pool_recycle?pool_pre_ping probes the connection at checkout time — reactive, adds latency, catches any disconnect regardless of cause. pool_recycle proactively discards connections older than N seconds before they can go stale — preventive, zero latency at checkout, but does not protect against unexpected disconnects like failovers. Use both together for the strongest protection.

Will pool_pre_ping prevent the QueuePool limit error? No. Pool exhaustion (QueuePool limit of size X overflow Y reached) is caused by connections being checked out and never returned — a leak. Pre-ping only acts on connections that are already back in the pool (idle, checked in). If your pool is exhausted, the problem is context-manager discipline or long-running transactions, not stale sockets. See the connection leaks and pool exhaustion guide for the correct diagnosis.