Site Service is experiencing heavy load

Incident Report for Arc XP

Postmortem

Customer Impact

On Wednesday, February 12, at 06:17 ET, Site Service began experiencing heavy CPU usage in one region, which degraded performance at the API level and in the dependent products Composer and WebSked. Capacity was added, and service was fully restored at 06:52 ET.

Root Cause

The issue was caused by an incorrect cache configuration in a release deployed that day, which increased load across the systems. Performance was restored once the configuration was corrected.

Timeline

All times are in ET and use a 24-hour clock.

Time	Event
06:15	Latency increases slightly
06:17	Latency increases, service becomes degraded
06:42	Capacity is increased
06:47	New instances come online and latency starts decreasing
06:52	Service is fully restored
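
The intervals implied by the timeline can be worked out with a short script. This is a minimal sketch for illustration only: the dictionary keys and the names time_to_mitigate and time_to_restore are labels chosen here, not terms used by Arc XP, and the calculation assumes every entry falls on the same day.

```python
from datetime import datetime

# Timeline entries copied from the table above (ET, 24-hour clock).
# The key names are illustrative labels, not official terminology.
TIMELINE = {
    "latency_rises": "06:15",
    "degraded": "06:17",
    "capacity_increased": "06:42",
    "instances_online": "06:47",
    "restored": "06:52",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Minutes from the start of degradation until capacity was increased.
time_to_mitigate = minutes_between(TIMELINE["degraded"], TIMELINE["capacity_increased"])

# Minutes from the start of degradation until full restoration.
time_to_restore = minutes_between(TIMELINE["degraded"], TIMELINE["restored"])

print(f"Degraded to capacity increase: {time_to_mitigate} min")
print(f"Degraded to full restoration: {time_to_restore} min")
```

Under these assumptions, capacity was increased 25 minutes after the service became degraded, and the service was fully restored after 35 minutes.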

Arc Next Steps

The capacity for this region has been scaled up, and additional monitoring will be added to the service to better identify the source of the CPU increase. Furthermore, additional checks will be implemented around the PageBuild Engine integration systems to monitor traffic.

Posted Feb 19, 2025 - 06:02 EST

Resolved

We added the additional capacity and service has been restored.

Posted Feb 12, 2025 - 07:21 EST

Monitoring

Service has been restored, and we are continuing to monitor.

Posted Feb 12, 2025 - 06:56 EST

Identified

We have identified heavy load on Site Service in one region, which is causing issues in some Arc applications (Composer, WebSked) and can also affect rendering performance. We are adding capacity to the service to handle the load.

Posted Feb 12, 2025 - 06:50 EST

This incident affected: Content Platform (Publishing Platform).