Drupal Post-Mortem Examines CPU Spike Caused by Bot Traffic and Cache Misses
A technical post-mortem uses production instability on a Drupal 11 platform to examine how automated traffic pushed application CPU usage to 94% and increased average response times. The illustrative scenario describes a containerized stack using PHP 8.3, PHP-FPM, Redis, Varnish, Cloudflare, New Relic, and Blackfire. According to the article, traffic rose from about 40 to 140 requests per second within 15 minutes, triggering an investigation into both infrastructure behaviour and application performance.
The analysis identifies bot traffic as the immediate trigger but attributes the incident to multiple weaknesses. Access logs showed requests with constantly changing query-string parameters. Because each variation produced a different cache key, Varnish stored and looked up every variant separately, which raised the cache miss ratio to 78% from a normal level of about 15%. The article outlines mitigation measures, including Cloudflare blocking and challenge rules, Varnish query-string normalisation, cache-miss monitoring, rate limiting for sensitive paths, incident-response runbooks, and regular load testing. It argues that production failures are often caused by a combination of infrastructure gaps rather than a single technical fault, and it highlights cache-key management, observability, and traffic controls as important safeguards for Drupal platforms.
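The effect of query-string normalisation on cache keys can be illustrated with a minimal Python sketch. It is not the article's Varnish configuration, which is not reproduced here; it only models the idea of dropping parameters that do not affect the page and sorting the rest before the URL is used as a cache key. The allowlist, the function name `normalise_url`, and the `cb` cache-busting parameter are all hypothetical choices made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist: parameters assumed to change the rendered page.
# A real site would derive this list from its own routes and views.
ALLOWED_PARAMS = frozenset({"page", "sort", "q"})


def normalise_url(url: str, allowed: frozenset = ALLOWED_PARAMS) -> str:
    """Return a canonical form of `url` suitable for use as a cache key.

    Parameters outside the allowlist are dropped and the remaining ones
    are sorted, so equivalent requests map to the same key.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    kept.sort()
    # The fragment is discarded because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    # Simulated bot requests: same page, a different junk parameter each time.
    bot_urls = [f"/news?page=1&cb={i}" for i in range(1000)]

    raw_keys = set(bot_urls)
    normalised_keys = {normalise_url(u) for u in bot_urls}

    print("distinct keys without normalisation:", len(raw_keys))
    print("distinct keys with normalisation:", len(normalised_keys))
```

In this sketch an allowlist is used rather than a blocklist, on the assumption that bots can invent parameter names faster than a blocklist can be maintained. The trade-off is that any legitimate parameter missing from the allowlist would be ignored by the cache, so the list would need to be checked against the site's real routes.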
