High latency for customers due to cache configuration error

Incident Report for Arc XP

Postmortem

Customer Impact

Beginning Tuesday, February 11th, at 10:15 AM ET until Wednesday, February 12th, at 06:52 AM ET, Site Service experienced a sudden increase in requests due to a CDN cache drop, leading to degraded performance in Site Service API which cascaded to dependent products such as Composer and WebSked. Additional capacity was added, impacted bundles were redeployed, and service was restored.

As a result of this issue, customer bundles created between February 11th at 10:15 AM and February 12th at 11:00 AM had an incorrect cache configuration, resulting in service degradation.

Root Cause

The issue was caused by an incorrect cache configuration which resulted in increased load across the systems. After correcting the configuration, performance was restored. Bundles deployed between February 11th at 10:15 AM and February 12th at 11:00 AM were terminated and redeployed.

Timeline

All times ET + 24 hour clock

Time Event
Feb 11th 10:15 New changes pushed to the Deployer.
Feb 11th 09:04 Initial customer report received later connected with this issue.
Feb 12th 06:15 Arc becomes aware of latency increases, service becomes degraded, and receives multiple customer reports.
Feb 12th 06:42 Capacity is increased
Feb 12th 06:47 New instances come online and latency starts decreasing
Feb 12th 06:52 Site Service is fully restored, and customers able to perform publish changes.
Feb 12th 11:30 The issue was identified with the new Deployer, and a solution was provided to the impacted clients while we worked on a permanent fix to use code bundles created before the new changes were published.
Feb 12th 12:15 The system has been restored by reverting the new Deployer application, and all impacted customers have been notified about the next steps. The impacted customer bundles are being terminated due to incorrect cache configuration.
Feb 12th 01:36 All Services are fully restored

Arc Next Steps

Additional integration checks and monitoring points will be added to assist us in upgrading the Deployer and further monitoring will be implemented to better identify CPU increases and enable auto-scaling.

Posted Feb 25, 2025 - 11:35 EST

Resolved

This incident has been resolved.
Posted Feb 12, 2025 - 14:48 EST

Monitoring

Site Service latency dropped to pre-incident levels at 15:35Z EU-Central-1 and 17:10Z EST in US-East-1.
Posted Feb 12, 2025 - 12:53 EST

Update

We have made a change that will allow for any new deployment to resolve the issue if your site is affected. It is also still effective to re-promote a bundle from prior to 16:00Z 2/11/25.

If redeploying, please note that this must be a new deployment, it is not sufficient to re-promote a bundle from between 16:00Z yesterday 2/11/25 and today at 16:30Z 1/12/25

Any deployment made between 16:00Z yesterday 2/11/25 and today at 16:30Z 2/12/25 today should be terminated.
Posted Feb 12, 2025 - 11:41 EST

Identified

We have identified an issue causing latency for customers in the EU and US, related to a cache configuration error.

Customers who have deployed a bundle since 16:00Z yesterday, February 11th, will need to deploy a new bundle to update their configuration.
Posted Feb 12, 2025 - 11:30 EST
This incident affected: Creator Apps (Composer, WebSked) and Content Platform (Publishing Platform).