We recently chased a p95 latency regression at Qargo. Most requests were fine, but some spent seconds inside COMMIT long after their last query had returned. The cause was an interaction between row locks, foreign keys, and deferred constraints.
The symptom
The slow traces all had the same shape. Our business transactions are long and run through a lot of SELECT, UPDATE and INSERT statements, but the COMMIT at the end was the slowest part. Sometimes it took longer than all the preceding queries combined.
A slow COMMIT doesn’t show up in EXPLAIN. By the time you reach it, the queries are done and Postgres just has to flush the WAL and return. A COMMIT that takes seconds is usually a synchronous_commit wait on a replica, or a lock somewhere. Replication wasn’t the issue for us, so we went looking for the lock.
Finding the blocker
pg_blocking_pids() returns the PIDs that a given backend is waiting on. Joining it against pg_stat_activity tells you who is blocking whom, and what each side is doing:
SELECT blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
Running this during one of the slow commits showed something unexpected: the blocked statement was an INSERT into a child table, and the transaction blocking it was sitting on a SELECT ... FOR UPDATE of a row in the parent table. A parent row the slow request never read or wrote.
Row locks, briefly
Postgres has four row-level lock modes, weakest to strongest:
FOR KEY SHAREFOR SHAREFOR NO KEY UPDATEFOR UPDATE
The one to know for this story is FOR KEY SHARE. Whenever you write a row that has a foreign key reference to some parent row, Postgres acquires a FOR KEY SHARE lock on that parent. The FK constraint needs to know the parent’s key isn’t going to change or disappear under it.
FOR KEY SHARE conflicts with FOR UPDATE, and only with FOR UPDATE. So a transaction holding FOR UPDATE on a parent row blocks any other transaction trying to write a child row pointing at it, even when the two transactions are otherwise working on completely unrelated tables.
How we ended up with FOR UPDATE
Django’s ORM doesn’t make the lock mode obvious. This:
with transaction.atomic():
obj = MyModel.objects.select_for_update().get(pk=pk)
# ... business logic ...
issues SELECT ... FOR UPDATE, the strongest row-level lock. We were using it to serialize a piece of business logic on a parent row. In the meantime, unrelated requests were inserting into child tables that had foreign keys pointing at that row, and queueing up behind the lock.
Here’s how the two transactions overlap in time:
Why the wait shows up at COMMIT
The other surprise was the timing. The INSERT into the child table ran fast, and only the COMMIT was slow. The lock was acquired far away in time from the statement that caused it.
The reason is deferred foreign key constraints. By default Postgres checks FKs at the end of each statement. A constraint declared DEFERRABLE INITIALLY DEFERRED is checked at commit instead. That’s useful when you need to write rows in an order that temporarily violates the constraint, like cycles between tables.
We use deferred constraints. So when a child row is inserted, the FK check, and the FOR KEY SHARE lock it needs on the parent, doesn’t run at the INSERT. It runs at COMMIT. From the outside that looks like a fast INSERT followed by a slow COMMIT, while the contention is on a lock held by a different transaction.
Why plain UPDATE doesn’t hit this
UPDATEs on the parent could in theory change its primary key too, so why don’t they cause the same blocking?
Because Postgres can inspect what the UPDATE is doing. If none of the updated columns is one that a foreign key from another table could reference (in practice, the primary key), it takes the weaker FOR NO KEY UPDATE instead of FOR UPDATE. FOR NO KEY UPDATE doesn’t conflict with FOR KEY SHARE, so it doesn’t block child writes. If the UPDATE does touch such a column, it upgrades to FOR UPDATE and the blocking comes back.
SELECT ... FOR UPDATE gives Postgres no such information. You haven’t named any columns, and the row could be modified arbitrarily later in the transaction, primary key included. The strongest lock is the only safe choice.
The fix
Tell Postgres what you actually intend. If you’re not going to change the primary key, take FOR NO KEY UPDATE:
SELECT ... FOR NO KEY UPDATE;
You still get mutual exclusion on the row, which was the point of the lock. But the weaker mode doesn’t conflict with FOR KEY SHARE, so child writes and their deferred FK checks can proceed.
Django exposes this as no_key=True (added in Django 3.2):
with transaction.atomic():
obj = MyModel.objects.select_for_update(no_key=True).get(pk=pk)
# ... business logic ...
After switching the relevant call sites, the slow commits disappeared.