Configuring Pool Pre-Ping to Handle Stale Connections
Set pool_pre_ping=True in create_async_engine() to automatically discard connections whose underlying TCP socket was closed by the database or a network device while the connection sat idle in the pool — this is the direct fix for OperationalError: server closed the connection unexpectedly and asyncpg.exceptions.ConnectionDoesNotExistError in the handling connection leaks and pool exhaustion family of problems.
Quick Answer
# Before — stale connections handed out after idle timeout, causing OperationalError
from sqlalchemy.ext.asyncio import create_async_engine
engine = create_async_engine(
"postgresql+asyncpg://user:pass@db.internal/app",
pool_size=10,
max_overflow=5,
)
# After — pool validates each connection before checkout; stale sockets discarded
from sqlalchemy.ext.asyncio import create_async_engine
engine = create_async_engine(
"postgresql+asyncpg://user:pass@db.internal/app",
pool_size=10,
max_overflow=5,
pool_pre_ping=True, # send a lightweight probe before each checkout
pool_recycle=1800, # also recycle connections after 30 minutes (complement)
)
Both parameters address the same root failure — a connection whose socket is no longer alive — from different angles. pool_pre_ping detects the problem at checkout time. pool_recycle prevents long-lived connections from ever reaching the age where cloud infrastructure is likely to terminate them.
The error they prevent is easy to reproduce: start your application, let the connection pool fill with established connections, wait longer than the database's idle connection timeout (or trigger a failover), then make a new request. Without pre-ping or recycle, the first few requests receive dead sockets and raise OperationalError. With pre-ping, those dead connections are transparently replaced with fresh ones before your application code ever sees them.
Execution Context & Async Workflow Integration
Why Connections Go Stale
When a connection is checked back into AsyncAdaptedQueuePool, it is kept open and reused for future checkouts. The connection's underlying TCP socket remains open to the database server. Several infrastructure events can close that socket while the connection appears healthy inside the pool:
- Cloud database idle connection timeouts — AWS RDS Postgres defaults to terminating connections idle for 8 hours via
tcp_keepalives_idle. Aurora and Cloud SQL have similar timeouts, some as short as 10–15 minutes for serverless tiers. - Failover and replica promotion — during an RDS Multi-AZ failover, the old primary's connections are terminated. The pool holds references to sockets that now point at a dead endpoint.
- NAT gateway and load balancer timeouts — AWS NLBs time out idle TCP flows after 350 seconds by default. Connections sitting in the pool longer than this threshold become half-open sockets that appear active to the application but are dropped by the intermediary.
- Database
pg_terminate_backend()calls — manual DBA intervention or automated maintenance scripts may terminate long-lived idle connections.
In all these cases, SQLAlchemy's pool has no way to know the socket is dead until it tries to use the connection. Without pre-ping, the first query on the stale connection raises an error.
What pool_pre_ping Does
When pool_pre_ping=True is set, SQLAlchemy executes a dialect-appropriate probe query immediately before handing a connection to the caller. For asyncpg, this is a lightweight SELECT 1 executed over the existing socket. The sequence:
- Coroutine requests a connection via
async with engine.connect()orasync with AsyncSessionLocal() as session. - Pool selects the least-recently-used connection from its queue.
- Pool sends
SELECT 1to the database over that connection's socket. - If the probe succeeds: connection is handed to the caller normally.
- If the probe fails with a disconnect error: the connection is discarded, a fresh connection is opened, and the new connection is handed to the caller.
The caller never sees the stale connection. From the coroutine's perspective, checkout simply takes slightly longer than usual when a stale connection is encountered and replaced.
Latency Cost and Event Loop Impact
The probe adds one network round-trip per checkout for every connection that has been idle. On a LAN or VPC with 0.5–2 ms RTT to the database, this cost is negligible for most workloads. On high-throughput services making thousands of requests per second, the aggregate overhead becomes measurable.
Profile your actual p99 checkout latency with and without pool_pre_ping enabled using the pool event listeners described in the connection leaks and pool exhaustion guide. If pre-ping adds more than 10% to your checkout overhead, use pool_recycle as the primary defence and reserve pool_pre_ping for environments with unpredictable disconnects (multi-AZ failovers, spot instance databases).
You can observe when pre-ping triggers by watching the sqlalchemy.pool log output with echo_pool=True. A successful pre-ping produces no visible output — only a failed pre-ping (one that discards a stale connection) produces a log line similar to:
Pool pre-ping on connection <asyncpg...> failed, will attempt to reconnect (...)
Seeing this message frequently is a signal that your pool_recycle setting is too long relative to the infrastructure's idle timeout, or that your database is restarting/failing over frequently.
pool_recycle as a Complement
pool_recycle=N tells the pool to discard any connection that has been open for more than N seconds, regardless of activity. It prevents connections from living long enough for infrastructure timeouts to close them in the first place:
from sqlalchemy.ext.asyncio import create_async_engine
engine = create_async_engine(
"postgresql+asyncpg://user:pass@db.internal/app",
pool_size=10,
max_overflow=5,
pool_recycle=1800, # recycle before RDS's 8-hour idle timeout
pool_pre_ping=True, # catch any that slip through during failover
)
A practical rule: set pool_recycle to less than half the database's idle connection timeout. For RDS Postgres (default 8 hours), pool_recycle=1800 (30 minutes) is conservative. For Aurora Serverless with a 10-minute idle pause timeout, use pool_recycle=300.
The two parameters are not redundant — they protect against different failure modes:
| Scenario | pool_pre_ping protects you | pool_recycle protects you |
|---|---|---|
| Database idle connection timeout (predictable) | Partially — catches at next checkout | Yes — connection never reaches the age threshold |
| Multi-AZ failover (sudden) | Yes — detects dead socket at checkout | No — recycle doesn't help if failover happens between recycle and next checkout |
| NAT gateway TCP idle timeout (unpredictable) | Yes | Only if recycle interval is shorter than NAT timeout |
| High-latency environment (p99 checkout > 5ms SLA) | Adds latency — consider disabling | No latency at checkout |
For cloud PostgreSQL on RDS, Aurora, or Cloud SQL, both parameters together form the minimum viable stale-connection defence.
Resolving Warnings, Errors & Common Mistakes
| Error / Warning | Root Cause | Production Fix |
|---|---|---|
OperationalError: server closed the connection unexpectedly | TCP socket closed by DB/network while connection was idle in pool | Add pool_pre_ping=True; set pool_recycle below DB idle timeout |
asyncpg.exceptions.ConnectionDoesNotExistError | asyncpg attempted to use a connection whose socket is gone | Same — pool_pre_ping=True catches the stale connection before asyncpg touches it |
asyncpg.exceptions.InterfaceError: connection is closed | Connection already invalidated (often by a prior failed transaction) | Ensure async with session.begin(): is used; add pool_pre_ping=True for robustness |
sqlalchemy.exc.TimeoutError: QueuePool limit of size X overflow Y reached | Connections not returned to pool (leak), unrelated to pre-ping | Fix context-manager discipline; pre-ping does NOT help with leaks |
OperationalError: SSL SYSCALL error: EOF detected | SSL connection terminated mid-flight (often during failover) | Add pool_pre_ping=True; also configure asyncpg ssl='require' with reconnect logic |
| High checkout latency (p99 spikes) | Pre-ping probe on every checkout under high throughput | Use pool_recycle instead of pre-ping for latency-sensitive paths; or tune pool_size up to reduce stale-connection frequency |
WARNING: pool pre-ping failed, reconnecting in logs | Pre-ping detected a dead connection; a new connection is being made | This is the intended behaviour — only a concern if it fires very frequently (investigate DB restarts or NAT timeout) |
Advanced Pool Pre-Ping Optimization
Selective Pre-Ping with Connection Age Tracking
If pre-ping overhead is measurable but you still need stale-connection protection, you can apply the probe selectively: only check connections that have been idle longer than a threshold. This requires wrapping the checkout event to track last-use time:
import time
from sqlalchemy import event
from sqlalchemy.ext.asyncio import create_async_engine, AsyncEngine
# Build engine WITHOUT global pre-ping — we will implement selective checking
engine = create_async_engine(
"postgresql+asyncpg://user:pass@db.internal/app",
pool_size=10,
max_overflow=5,
pool_recycle=1800,
pool_pre_ping=False, # disabled globally
)
STALE_THRESHOLD_SECONDS = 60 # probe connections idle for more than 60s
def attach_selective_pre_ping(eng: AsyncEngine) -> None:
sync_pool = eng.sync_engine.pool
last_checkin: dict[int, float] = {}
@event.listens_for(sync_pool, "checkin")
def record_checkin(dbapi_conn, conn_record) -> None:
last_checkin[id(dbapi_conn)] = time.monotonic()
@event.listens_for(sync_pool, "checkout")
def selective_ping(dbapi_conn, conn_record, conn_proxy) -> None:
conn_id = id(dbapi_conn)
idle_for = time.monotonic() - last_checkin.get(conn_id, 0)
if idle_for > STALE_THRESHOLD_SECONDS:
# Manually test the connection using the dialect's ping mechanism
try:
dbapi_conn.ping(reconnect=False)
except Exception:
# invalidate forces the pool to open a fresh connection
conn_record.invalidate()
raise
attach_selective_pre_ping(engine)
Note: dbapi_conn.ping() is available on aiomysql connections. For asyncpg, the pool-level invalidation on checkout failure is handled by SQLAlchemy's pre-ping machinery itself — the most reliable approach for asyncpg remains the built-in pool_pre_ping=True.
Combining pool_pre_ping with Retry Logic
Pre-ping handles the idle-pool case, but a stale connection can also appear mid-transaction during a failover. For full resilience, add application-level retry around the operation:
import asyncio
from sqlalchemy.exc import OperationalError
from sqlalchemy.ext.asyncio import AsyncSession
async def execute_with_retry(session: AsyncSession, stmt, *, retries: int = 2):
"""Execute a statement, retrying once on disconnect errors."""
for attempt in range(retries + 1):
try:
result = await session.execute(stmt)
return result
except OperationalError as exc:
if attempt < retries and "server closed the connection" in str(exc):
await asyncio.sleep(0.1 * (2 ** attempt)) # brief backoff
await session.rollback()
continue
raise
This retry wrapper should be narrow — only catching OperationalError with known disconnect messages — to avoid masking genuine query errors.
Frequently Asked Questions
Does pool_pre_ping work with asyncpg specifically?
Yes. For the postgresql+asyncpg dialect, SQLAlchemy's pre-ping sends SELECT 1 through asyncpg's connection and catches asyncpg.exceptions.ConnectionDoesNotExistError and asyncpg.exceptions.InterfaceError as indicators of a dead connection. The stale connection is invalidated and a fresh one is transparently substituted before the connection reaches your code.
Should I use pool_pre_ping=True for every async engine? Use it whenever: your application is deployed against a cloud-managed database (RDS, Cloud SQL, Azure Database), your traffic has long idle periods between requests (the pool holds connections but no queries run for minutes), or you operate in a multi-AZ or replica-set topology where failovers terminate connections. Skip it (or combine with selective logic) if you have a latency SLA under 5 ms and your pool is never idle long enough for sockets to go stale.
What is the difference between pool_pre_ping and pool_recycle?pool_pre_ping probes the connection at checkout time — reactive, adds latency, catches any disconnect regardless of cause. pool_recycle proactively discards connections older than N seconds before they can go stale — preventive, zero latency at checkout, but does not protect against unexpected disconnects like failovers. Use both together for the strongest protection.
Will pool_pre_ping prevent the QueuePool limit error?
No. Pool exhaustion (QueuePool limit of size X overflow Y reached) is caused by connections being checked out and never returned — a leak. Pre-ping only acts on connections that are already back in the pool (idle, checked in). If your pool is exhausted, the problem is context-manager discipline or long-running transactions, not stale sockets. See the connection leaks and pool exhaustion guide for the correct diagnosis.
Related
- Handling Connection Leaks and Pool Exhaustion — parent guide covering the full spectrum of pool problems including the
QueuePool limiterror and context-manager discipline. - Configuring Async Engines and Connection Pools — complete reference for
pool_size,max_overflow,pool_timeout, andNullPoolconfiguration. - Tuning Connection Pools for Cloud Databases — RDS, Aurora, and Cloud SQL specific idle-timeout values and recommended
pool_recyclesettings.