From Tuesday, February 11th, at 10:15 AM ET to Wednesday, February 12th, at 06:52 AM ET, Site Service experienced a sudden increase in requests due to a CDN cache drop. This degraded the performance of the Site Service API and cascaded to dependent products such as Composer and WebSked. Additional capacity was added, impacted bundles were redeployed, and service was restored.
As a result of this issue, customer bundles created between February 11th at 10:15 AM and February 12th at 11:00 AM had an incorrect cache configuration, resulting in service degradation.
The issue was caused by an incorrect cache configuration which resulted in increased load across the systems. After correcting the configuration, performance was restored. Bundles deployed between February 11th at 10:15 AM and February 12th at 11:00 AM were terminated and redeployed.
All times are in ET and use a 24-hour clock.
Time | Event |
---|---|
Feb 11th 10:15 | New changes pushed to the Deployer. |
Feb 11th 21:04 | Initial customer report received; it was later connected with this issue. |
Feb 12th 06:15 | Arc becomes aware of latency increases and degraded service, and receives multiple customer reports. |
Feb 12th 06:42 | Capacity is increased. |
Feb 12th 06:47 | New instances come online and latency starts decreasing. |
Feb 12th 06:52 | Site Service is fully restored, and customers are able to publish changes. |
Feb 12th 11:30 | The issue is traced to the new Deployer. Impacted clients are given a workaround, using code bundles created before the new changes were published, while work continues on a permanent fix. |
Feb 12th 12:15 | The system has been restored by reverting the new Deployer application, and all impacted customers have been notified about the next steps. The impacted customer bundles are being terminated due to incorrect cache configuration. |
Feb 12th 13:36 | All services are fully restored. |
Additional integration checks and monitoring points will be added to support future Deployer upgrades. Further monitoring will also be implemented to better identify CPU increases and enable auto-scaling.
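
As an illustration of what one such integration check could look like, the sketch below validates a bundle's cache configuration before it is deployed. It rests on assumptions not stated in this report: that the cache configuration is expressed as an HTTP `Cache-Control` header, and that a 60-second minimum TTL is acceptable. The function names and the threshold are hypothetical and are not part of the Deployer.

```python
from typing import Dict, List

# Hypothetical threshold chosen for illustration; not taken from this report.
MIN_TTL_SECONDS = 60


def parse_cache_control(header: str) -> Dict[str, str]:
    """Split a Cache-Control header value into a {directive: value} mapping."""
    directives: Dict[str, str] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"')
    return directives


def cache_config_problems(header: str, min_ttl: int = MIN_TTL_SECONDS) -> List[str]:
    """Return the reasons this header would send traffic past a shared cache."""
    directives = parse_cache_control(header)
    problems: List[str] = []
    for blocked in ("no-store", "private"):
        if blocked in directives:
            problems.append(f"'{blocked}' prevents shared caching")
    # For shared caches such as a CDN, s-maxage takes precedence over max-age.
    ttl_text = directives.get("s-maxage", directives.get("max-age"))
    if ttl_text is None:
        problems.append("no max-age or s-maxage directive")
    elif not ttl_text.isdecimal():
        problems.append(f"TTL is not a non-negative integer: {ttl_text!r}")
    elif int(ttl_text) < min_ttl:
        problems.append(f"TTL {ttl_text}s is below the {min_ttl}s minimum")
    return problems
```

A deployment pipeline could call `cache_config_problems` on each bundle's configured header and refuse to publish when the returned list is not empty, so that a misconfiguration is caught before it reaches the CDN.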