Content Operations, including Publishing, Impacted

Incident Report for Arc XP

Postmortem

Customer Impact

On Oct 22, 2025, a subset of customers in us-east-1 experienced publishing delays due to a service degradation in our event processing infrastructure.

  • Higher-priority content actions (Arc-Priority: Standard): Delays of up to 4 hours from the time of submission.
  • Lower-priority content actions (Arc-Priority: Ingestion): Delays of up to 21 hours in some cases.
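For programmatic callers, the priority tiers above correspond to a request header. A minimal sketch of tagging a publish request by priority, assuming a hypothetical endpoint and story payload (the URL and payload shape below are illustrative, not Arc XP's actual API):

```python
import json
import urllib.request

# Hypothetical endpoint -- illustrative only, not Arc XP's actual API surface.
CONTENT_API_URL = "https://api.example.arcpublishing.com/draft/v1/story"

def build_publish_request(story: dict, bulk_ingestion: bool) -> urllib.request.Request:
    """Build a publish request, tagging bulk/migration traffic with the lower
    priority so interactive (Composer-style) actions are processed first."""
    headers = {
        "Content-Type": "application/json",
        # "ingestion" for bulk loads, "standard" for editorial actions
        "Arc-Priority": "ingestion" if bulk_ingestion else "standard",
    }
    return urllib.request.Request(
        CONTENT_API_URL,
        data=json.dumps(story).encode("utf-8"),
        headers=headers,
        method="POST",
    )

req = build_publish_request({"headline": "Example story"}, bulk_ingestion=True)
# Note: urllib normalizes header keys via str.capitalize().
print(req.headers["Arc-priority"])  # ingestion
```

Migration tooling that defaults to the ingestion priority avoids competing with editorial traffic, which is why bulk-migrating customers saw the longer delays during this incident.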

The time required for content actions to complete processing varied by customer activity and overall publishing volume.

  • Customers undergoing large-scale migrations saw longer delays (closer to 21 hours).
  • Customers with typical publishing activity saw shorter delays (around 4 hours).
  • Some customers experienced intermittent degradation of up to 16 hours for individual documents or batches of content.

During this period, customers may have observed delays in:

  • Updates syncing to Content API.
  • Composer Search indexing.
  • End-to-end publishing to live sites.
  • Stale reads in systems dependent on recent Content API updates.
  • Updates syncing to Kinesis mirrors.
  • Updates triggering the corresponding IFX events.

Root Cause

On Oct 22, 2025, our monitoring infrastructure detected elevated system usage that was delaying content publishing. Shortly after, we identified that one of our core event processing clusters had filled its storage capacity faster than expected, which caused a slowdown in content updates.

As teams worked to expand capacity, the system experienced additional delays due to network connectivity issues within one of the cluster nodes. This created a backlog of publishing events, resulting in slower-than-normal delivery of updates to customer sites.
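The backlog dynamic described above follows simple queueing arithmetic: a queue only drains as fast as processing capacity exceeds the arrival rate, so even a modest capacity shortfall stretches recovery over many hours. A sketch with illustrative numbers (not actual incident figures):

```python
def drain_hours(backlog_events: int, process_rate: int, arrival_rate: int) -> float:
    """Hours to clear a backlog when processing outpaces arrivals.
    Rates are in events/hour; returns infinity if the queue can never drain."""
    net = process_rate - arrival_rate
    if net <= 0:
        return float("inf")
    return backlog_events / net

# Illustrative only: a 2.4M-event backlog, 500k events/hour of capacity,
# with 300k events/hour still arriving, takes 12 hours to clear.
print(drain_hours(2_400_000, 500_000, 300_000))  # 12.0
```

This is why the remediation combined two levers: adding capacity (raising the processing rate) and escalating queues (reordering which events consume that capacity first).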

Timeline

All times are ET (24-hour clock).

Time          Event
Oct 22, 2025
16:23         Initial alert received regarding cluster performance.
16:30–17:30   Increased system storage to address capacity constraints.
17:30–18:30   Investigated and resolved AWS network connectivity issues affecting message delivery.
Oct 23, 2025
00:00–06:00   Continued scaling efforts and cleared high-priority publishing events.
05:00         Identified and resolved a secondary issue related to message distribution within the cluster.
06:00–10:00   Processed remaining low-priority updates to stabilize system throughput. All lower-priority queues were temporarily escalated to ensure full recovery.
13:17         All systems verified stable; incident resolved and “all clear” issued.

Arc XP Next Steps

  • Evaluating scaling policies and capacity thresholds to proactively prevent similar storage saturation events.
  • Implementing additional monitoring to detect partial replications earlier and prevent secondary delays.
  • Reviewing synchronization mechanisms to improve per-customer consistency and reduce latency during high-load conditions.
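One common approach to the first item is alerting on projected time-to-full rather than a fixed utilization percentage, so a fast fill-up is caught before storage saturates. A sketch under that assumption (thresholds and figures are illustrative, not Arc XP's actual monitoring rules):

```python
def storage_alert(used_gb: float, capacity_gb: float,
                  growth_gb_per_hour: float,
                  min_headroom_hours: float = 24.0) -> bool:
    """Alert when projected time-to-full drops below the headroom window.
    Catches rapid fill-ups that a fixed percent threshold would miss."""
    free_gb = capacity_gb - used_gb
    if growth_gb_per_hour <= 0:
        return False  # not filling; nothing to project
    return free_gb / growth_gb_per_hour < min_headroom_hours

# Illustrative: only 70% full, but filling at 50 GB/hour -> ~6 hours of
# headroom left, so the alert fires well before the disk is full.
print(storage_alert(used_gb=700, capacity_gb=1000, growth_gb_per_hour=50))  # True
```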
Posted Oct 29, 2025 - 08:54 EDT

Resolved

This incident has been resolved.
Posted Oct 23, 2025 - 19:40 EDT

Update

The backlog has been fully cleared. All past updates have been processed, and new updates are being handled live for both Composer and programmatic ingestion. Any remaining errors can be resolved by republishing affected content. Customers should now be able to work normally across all environments.
Posted Oct 23, 2025 - 13:17 EDT

Update

The majority of content published in the past 12 hours has now been processed. New content may still experience delays while full recovery continues. We will confirm once all processing times have returned to normal levels.
Posted Oct 23, 2025 - 11:40 EDT

Update

We've increased capacity, and content updates are now processing quickly. Full recovery may take a few more hours; we'll continue providing updates and confirm once systems are fully back to normal. Thank you for your patience as we work through this.
Posted Oct 23, 2025 - 10:45 EDT

Update

As mentioned in our previous updates, we have largely increased capacity and the backlog of content updates is being processed rapidly. However, due to the large volume of updates, we expect full recovery to still take some time (a few hours, not days). We will continue to provide regular updates, as well as an "all clear" once we are back to normal.

We appreciate your patience as we work through this recovery.
Posted Oct 23, 2025 - 09:27 EDT

Update

Due to the large backlog of operations, some stories that were created, saved, or updated in the past couple of hours may not yet be processed. This may result in stories not showing up on the web or in Composer search. In most cases, simply republishing those stories via Composer is enough to rectify them.

We are adding capacity and taking steps to accelerate the processing of the backlog to remediate this situation as fast as possible.

We will continue posting updates on a regular basis.
Posted Oct 23, 2025 - 07:27 EDT

Update

We have resolved the secondary issue with the Kafka cluster, and the processing pipeline is now working as expected. Changes in Composer (and other updates with priority "standard") should reflect normally in Content API for all organizations. For programmatic updates (Arc-Priority: ingestion), there is a backlog that is currently being processed and may take up to a couple of hours to clear. Some of this backlog is within the range of our normal processing times for customers with large ingestion volumes, but may be more elevated for some.

If some important stories are out of sync, a re-publish in Composer should bring them to an updated status.

We will continue providing regular updates until all metrics are back to normal levels.
Posted Oct 23, 2025 - 06:36 EDT

Update

We have identified a secondary issue with our Kafka cluster and are working with our infrastructure provider to resolve it. This affects a subset of customers for whom stories are either delayed or not being processed. We will continue to provide updates as the situation progresses.
Posted Oct 23, 2025 - 05:10 EDT

Update

Operations with Arc-Priority: ingestion are still being processed with a delay. As we continued our investigation, however, we became aware of a disruption for messages from Composer (messages with Arc-Priority: standard) for a subset of customers. Changes in Composer may not reflect in Content API, and thus on the web. This is being actively investigated.
Posted Oct 23, 2025 - 04:13 EDT

Update

Operations with Arc-Priority: ingestion are still being processed with a delay. As we continued our investigation, however, we became aware of a disruption for messages with priority "standard" for a subset of customers. Changes in stories may not reflect in Content API, and thus on the web. This is being actively investigated.
Posted Oct 23, 2025 - 04:08 EDT

Update

Many publishing features have returned to normal operation. We are continuing to monitor metrics as recovery continues.
Posted Oct 23, 2025 - 03:25 EDT

Update

At this time, the Composer and WebSked applications are expected to be responsive to search and publishing actions. Updates from applications using Arc-Priority: standard are also expected to arrive in a timely manner.
Updates from customer applications using Arc-Priority: ingestion may be slightly delayed but are actively being processed.
Posted Oct 23, 2025 - 02:20 EDT

Update

System monitoring points to solid recovery metrics and delays are trending downwards. Engineering teams continue to be engaged until the event is resolved.
Posted Oct 23, 2025 - 01:33 EDT

Update

Content operations continue to clear through the backlog queue. Monitoring by the full response team is ongoing.
Posted Oct 23, 2025 - 00:54 EDT

Update

We have observed an improvement in content operations clearing through the backlog queue. Our team is still engaged after scaling infrastructure components and we continue to monitor the situation.
Posted Oct 23, 2025 - 00:15 EDT

Update

We continue to process the backlog of content operations and are working to restore the stability of publishing times.
Posted Oct 22, 2025 - 23:42 EDT

Update

We continue to process the backlog of content operations and are seeing an improvement in article publish times. The team is scaling additional infrastructure components to further increase processing velocity and restore publish times to normal levels.
Posted Oct 22, 2025 - 23:09 EDT

Update

We are continuing to monitor the progress of mitigation of delays across the region.
Posted Oct 22, 2025 - 22:19 EDT

Update

Photo Center and Video Center are back to full capacity and normal operation.
We are continuing to monitor for any further issues.
Posted Oct 22, 2025 - 21:33 EDT

Update

We are continuing to monitor for any further issues.
Posted Oct 22, 2025 - 20:52 EDT

Update

Processing times have continued to improve as expected. We expect normal operational levels within an hour.
Posted Oct 22, 2025 - 20:20 EDT

Update

The cluster reboot has resolved the connectivity issues, and Kafka queues are now processing messages normally. The team has scaled up message topics and supporting infrastructure to accelerate backlog processing. Customers should see continued improvement as the queues clear and content operations return to normal performance.
Posted Oct 22, 2025 - 20:00 EDT

Update

We are investigating a potential root cause related to a recent AWS patch that may have affected the stability of our Kafka cluster. As part of mitigation efforts, we are rebooting the cluster to re-establish component connectivity.
Posted Oct 22, 2025 - 19:47 EDT

Update

We are now observing an increase in delays affecting content operations, including publishing in the us-east-1 region. Engineering is investigating elevated latency in the content synchronization layer and is working to mitigate the impact. Further updates will be provided as we learn more.
Posted Oct 22, 2025 - 19:09 EDT

Update

We are continuing to monitor the progress of mitigation of delays across the region. Some customers will still see delays, but we expect this to improve over time.
Posted Oct 22, 2025 - 18:17 EDT

Monitoring

A fix has been applied and is propagating across the system. The content operations delay is now about five minutes and is continuing to decrease.
Posted Oct 22, 2025 - 17:37 EDT

Identified

A delay in our internal routing for content synchronization began at around 4:30 PM ET, leading to a 10-minute delay in content operations. This incident is limited to customers located in the us-east-1 region.
Posted Oct 22, 2025 - 17:17 EDT
This incident affected: Creator Apps (Composer, WebSked, PageBuilder Editor, Photo Center, Video Center), Content Platform (Publishing Platform), and Platform & Delivery Acceleration (Web Delivery).