Setting pool_size and max_overflow for AWS RDS
Set pool_size = floor((rds_max_connections - reserved) / instance_count * 0.7) and max_overflow = floor(pool_size / 2), then verify that (pool_size + max_overflow) * instance_count never exceeds the usable connection budget — this is the core formula for safe RDS pool sizing, and the full derivation with cloud-specific context lives in Tuning Connection Pools for Cloud Databases.
Quick Answer
RDS max_connections is derived from instance RAM, not a configurable static value. The authoritative number is available from SHOW max_connections; connected to the instance. The formula AWS uses internally is approximately LEAST(DBInstanceClassMemory / 9531392, 5000). Subtract a fixed reserve for the superuser, administrative sessions, and monitoring agents, then divide the remainder evenly across all application instances. Apply a 70/30 split between pool_size and max_overflow.
Before:
# Legacy / unsafe — pool_size is a guess, no instance-count accounting
from sqlalchemy import create_engine
engine = create_engine(
"postgresql+psycopg2://app:secret@mydb.us-east-1.rds.amazonaws.com/orders",
pool_size=100, # ← arbitrary, does not account for multiple instances
max_overflow=20,
pool_pre_ping=False,
)
After — SQLAlchemy 2.0 async with derived values:
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
# Inputs (replace with your actuals):
RDS_MAX_CONNECTIONS = 683 # db.r6g.large: LEAST(17179869184 / 9531392, 5000)
SUPERUSER_RESERVED = 3 # superuser_reserved_connections default
ADMIN_RESERVE = 10 # monitoring agent, Alembic, interactive psql
APP_INSTANCE_COUNT = 4 # e.g. 4 ECS tasks / Kubernetes pods
usable = RDS_MAX_CONNECTIONS - SUPERUSER_RESERVED - ADMIN_RESERVE # = 670
per_instance = usable // APP_INSTANCE_COUNT # = 167
pool_size = int(per_instance * 0.7) # = 116
max_overflow = int(per_instance * 0.3) # = 50
engine = create_async_engine(
"postgresql+asyncpg://app:secret@mydb.cluster-xyz.us-east-1.rds.amazonaws.com/orders",
pool_size=pool_size,
max_overflow=max_overflow,
pool_timeout=30.0,
pool_recycle=1800, # 30 min — RDS default idle timeout is 8 h
pool_pre_ping=True,
echo=False,
)
AsyncSessionLocal = async_sessionmaker(
engine, class_=AsyncSession, expire_on_commit=False
)
At scale-out the worst-case total is (116 + 50) × 4 = 664, safely within the 670-connection budget.
Execution Context & Async Workflow Integration
Why the formula matters for async Python specifically
In a threaded server, each thread holds a connection for the duration of a request. Thread count is an upper bound on connection demand and is usually modest (8–32). In an async server (Uvicorn, Hypercorn), a single OS thread runs thousands of concurrent coroutines — but each coroutine still requires a real database connection while a query is in-flight. The pool controls how many simultaneous queries reach RDS.
With asyncpg and AsyncAdaptedQueuePool, connection checkout is non-blocking: a coroutine that cannot immediately acquire a connection is suspended and yields control to other coroutines. This means pool exhaustion does not hang the event loop — but it does cause all waiting coroutines to accumulate timeout pressure. When pool_timeout expires, SQLAlchemy raises sqlalchemy.exc.TimeoutError (surfaced to the caller as QueuePool limit of size X overflow Y reached, connection timed out, timeout 30).
Scaling pool_size when instance count changes
Instance count changes when you autoscale. If your application scales from 4 to 8 ECS tasks, and each task still has pool_size=116, max_overflow=50, worst-case connections jump from 664 to 1328 — nearly double max_connections=683. The application will hit FATAL: remaining connection slots are reserved immediately.
The correct pattern is to parameterize pool sizes from an environment variable set by the deployment system:
import os
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
RDS_MAX_CONNECTIONS = int(os.environ["RDS_MAX_CONNECTIONS"]) # set in task definition
APP_INSTANCE_COUNT = int(os.environ["APP_INSTANCE_COUNT"]) # set by autoscaler or fixed
ADMIN_RESERVE = int(os.environ.get("DB_ADMIN_RESERVE", "15"))
usable = RDS_MAX_CONNECTIONS - ADMIN_RESERVE
per_instance = max(1, usable // APP_INSTANCE_COUNT)
pool_size = max(1, int(per_instance * 0.7))
max_overflow = max(1, int(per_instance * 0.3))
engine = create_async_engine(
os.environ["DATABASE_URL"],
pool_size=pool_size,
max_overflow=max_overflow,
pool_timeout=30.0,
pool_recycle=1800,
pool_pre_ping=True,
echo=False,
connect_args={
"command_timeout": 15,
"server_settings": {"statement_timeout": "20000"},
},
)
AsyncSessionLocal = async_sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)
Passing APP_INSTANCE_COUNT as an environment variable requires the deployment system to inject the desired replica count at startup rather than reading it from the database. Many teams manage this with ECS task definition environment overrides or Kubernetes downward API, setting it equal to the replica count in the deployment spec.
Verifying the formula with pg_stat_activity
After deploying, confirm actual connection counts match predictions by querying pg_stat_activity from a psql session or a monitoring query:
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker
from sqlalchemy import text
diagnostic_engine = create_async_engine(
"postgresql+asyncpg://admin:secret@mydb.rds.amazonaws.com/orders",
pool_size=1,
max_overflow=0,
)
DiagnosticSession = async_sessionmaker(diagnostic_engine, class_=AsyncSession)
async def check_connection_budget() -> None:
"""Print current active connections grouped by application name."""
async with DiagnosticSession() as session:
result = await session.execute(
text(
"""
SELECT application_name,
state,
count(*) AS conn_count
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY application_name, state
ORDER BY conn_count DESC;
"""
)
)
for row in result:
print(f"{row.application_name!r:40s} state={row.state!r:15s} count={row.conn_count}")
If conn_count for your application name exceeds pool_size per instance, either max_overflow connections are being actively used (acceptable under traffic spikes) or connections are leaking. Compare against pool.checkedout() from the engine introspection API to distinguish the two cases.
Aurora vs RDS: connection limit differences
Aurora clusters present a subtlety: each Aurora instance (writer or reader) has its own max_connections, derived from its own instance class. The writer and each reader are separate endpoints with separate connection limits. If you route writes to cluster.cluster-xyz.rds.amazonaws.com and reads to cluster.cluster-ro-xyz.rds.amazonaws.com, each engine's pool budget is calculated independently against the respective instance's max_connections.
Aurora Serverless v2 auto-scales ACUs (Aurora Capacity Units), which means max_connections is not fixed. The minimum capacity's connection limit applies when the Aurora instance is cold; at maximum capacity, it scales proportionally. For Aurora Serverless v2, use a lower pool_size (5–15) and rely on RDS Proxy for connection multiplexing to avoid exhausting the cold-start connection limit during scale-up events.
Resolving Warnings, Errors & Common Mistakes
| Error / Warning | Root Cause | Production Fix |
|---|---|---|
FATAL: remaining connection slots are reserved for non-replication superuser connections | (pool_size + max_overflow) × instance_count exceeds max_connections - superuser_reserved_connections. The superuser reserve (default 3) is the last line of defence. | Recalculate budget with the formula above. Check if autoscaling added instances without reducing per-instance pool sizes. Lower pool_size or add an RDS Proxy. |
sqlalchemy.exc.TimeoutError: QueuePool limit of size X overflow Y reached, connection timed out, timeout 30 | All pool_size + max_overflow connections are checked out and pool_timeout elapsed before one became available. | First, confirm the database actually has available connections (SHOW max_connections; SELECT count(*) FROM pg_stat_activity;). If yes, increase pool_size or optimize long-running transactions. If no, scale the RDS instance. |
asyncpg.exceptions.TooManyConnectionsError: FATAL: too many connections for role "app" | The database role has a CONNECTION LIMIT set (via ALTER ROLE app CONNECTION LIMIT 50) that is lower than the total connections your instances attempt. | Run ALTER ROLE app CONNECTION LIMIT -1; to remove the limit, or raise it to a value above your calculated worst-case total. |
asyncpg.exceptions.ConnectionDoesNotExistError | An idle connection in the pool was silently dropped by the RDS TCP idle timeout (default 8 h) or a security group change, then handed to a coroutine without validation. | Enable pool_pre_ping=True and set pool_recycle=1800. For environments with aggressive idle timeouts (RDS Proxy default 1800 s), set pool_recycle=900. |
WARNING: there are already too many connections to the database | PostgreSQL is at or near max_connections. Postgres logs this before refusing the next connection. | Monitor SELECT count(*) FROM pg_stat_activity GROUP BY state;. Reduce pool sizes or add a proxy. |
TimeoutError on first request after low-traffic period | Pool connections were recycled while idle; all slots need fresh TCP handshakes simultaneously (cold burst). | Pre-warm the pool during application startup: async with engine.connect() as conn: await conn.execute(text("SELECT 1")). Repeat for pool_size // 4 connections. |
sqlalchemy.exc.OperationalError: (asyncpg.exceptions.QueryCanceledError) ERROR: canceling statement due to statement timeout | RDS Proxy or server_settings.statement_timeout terminated a long-running query, and the connection returned to the pool in a failed state. | Set pool_pre_ping=True to validate connections on checkout. Ensure statement_timeout in connect_args matches or is lower than the proxy-level timeout. |
Advanced Pool Sizing Optimization
Adapting pool_size per worker type in a Celery deployment
Applications that run both an async API server and Celery workers against the same RDS instance must account for Celery worker connections in the total budget. Celery workers typically use synchronous drivers (psycopg2), which means each Celery worker thread holds one connection for the duration of a task.
import os
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
# Shared budget constants (same RDS instance)
RDS_MAX_CONNECTIONS = int(os.environ["RDS_MAX_CONNECTIONS"]) # e.g. 683
CELERY_WORKER_COUNT = int(os.environ.get("CELERY_WORKERS", "8"))
CELERY_CONCURRENCY = int(os.environ.get("CELERY_CONCURRENCY", "4")) # threads per worker
ADMIN_RESERVE = 15
celery_connections = CELERY_WORKER_COUNT * CELERY_CONCURRENCY # 8 × 4 = 32
api_budget = RDS_MAX_CONNECTIONS - ADMIN_RESERVE - celery_connections # 683-15-32 = 636
api_instance_count = int(os.environ.get("API_INSTANCE_COUNT", "4"))
per_api_instance = api_budget // api_instance_count # 636 // 4 = 159
pool_size = int(per_api_instance * 0.7) # 111
max_overflow = int(per_api_instance * 0.3) # 47
api_engine = create_async_engine(
os.environ["DATABASE_URL"],
pool_size=pool_size,
max_overflow=max_overflow,
pool_timeout=30.0,
pool_recycle=1800,
pool_pre_ping=True,
echo=False,
)
ApiSession = async_sessionmaker(api_engine, class_=AsyncSession, expire_on_commit=False)
Treating Celery connections as a fixed reservation keeps the API pool safe from unexpected exhaustion when Celery workers spike.
IAM token authentication and pool_recycle interaction
When using IAM database authentication with RDS, the IAM token used as the connection password is valid for 15 minutes. A connection created with an expired token continues to work — the token is only validated at connection time — but a recycled connection that initiates a new TCP handshake after the 15-minute window will fail authentication if the token is not refreshed.
The safest approach is to regenerate a fresh IAM token in a SQLAlchemy creator function, which is called each time the pool creates a new underlying connection:
import boto3
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
from sqlalchemy.pool import NullPool
rds_client = boto3.client("rds", region_name="us-east-1")
RDS_HOST = "mydb.cluster-xyz.us-east-1.rds.amazonaws.com"
RDS_PORT = 5432
DB_USER = "app"
DB_NAME = "orders"
def get_iam_token() -> str:
return rds_client.generate_db_auth_token(
DBHostname=RDS_HOST,
Port=RDS_PORT,
DBUsername=DB_USER,
)
# asyncpg creator receives the sync connection-creation call
import asyncpg # type: ignore
async def _create_asyncpg_conn() -> asyncpg.Connection:
token = get_iam_token()
return await asyncpg.connect(
host=RDS_HOST,
port=RDS_PORT,
user=DB_USER,
password=token,
database=DB_NAME,
ssl="require",
statement_cache_size=0,
)
# Use NullPool so each connection goes through the creator with a fresh token
iam_engine = create_async_engine(
f"postgresql+asyncpg://{DB_USER}:placeholder@{RDS_HOST}/{DB_NAME}",
poolclass=NullPool,
echo=False,
)
For production IAM auth with a QueuePool, set pool_recycle to 800 (well below the 900-second token validity window minus connection handshake time) and regenerate the token in the creator callback. This ensures every recycled connection picks up a fresh token automatically.
Frequently Asked Questions
How do I find the exact max_connections for my RDS instance class without connecting?
The AWS RDS documentation publishes connection limits per instance class, but they can lag behind actual values after engine version upgrades. The reliable approach is to query SHOW max_connections; directly, or look it up in the RDS Parameter Group in the AWS Console under the max_connections parameter — it shows the formula-derived value for your instance.
Should I always use the 70/30 split between pool_size and max_overflow?
The 70/30 split is a safe default. For batch-heavy workloads that have long transaction bursts followed by idle periods, a 60/40 split gives more burst headroom. For high-concurrency APIs with short-lived transactions, an 80/20 split reduces the number of overflow connections that are discarded after use, improving connection reuse. Validate with load testing: measure p99 checkout latency and pool.overflow() under peak traffic.
Can I set pool_size=0 and rely entirely on max_overflow?
No. A pool with pool_size=0 creates only overflow connections, all of which are discarded after use. Every checkout incurs a fresh TCP handshake and TLS negotiation — 10–100 ms on RDS. Under sustained load this causes severe latency degradation. Use NullPool explicitly if you want no pooling (e.g., behind RDS Proxy), which makes the intent clear.
What is the right pool_timeout for RDS?30.0 seconds is a reasonable default. Set it lower (10–15 s) if your API has strict SLA requirements and you'd rather fail fast and return an error than queue indefinitely. Set it higher (60 s) only for batch jobs where user impact is low and a brief queue during a traffic spike is acceptable.
Related
- Tuning Connection Pools for Cloud Databases — Full cloud pooling guide covering GCP Cloud SQL, Azure, PgBouncer, RDS Proxy, and Lambda patterns.
- Configuring Async Engines and Connection Pools — Baseline
create_async_engineinitialization and pool parameter reference. - Handling Connection Leaks and Pool Exhaustion — Diagnosing
PoolTimeoutand connection leak patterns in production. - Using SQLAlchemy Async with Celery Task Workers — Integrating async engines alongside synchronous Celery workers sharing the same RDS instance.