INCIDENT: MetaKube Control Plane issues, region FES

Affected Components: MetaKube Control Planes, region FES

Incident Start: 2025-02-11 06:00 UTC+01:00 (CET)

State: Resolved

Description:

The MetaKube API may be unreachable.
Following scheduled network maintenance in FES, the MetaKube control cluster (which hosts the customer control planes) has problems reaching DNS. This is causing issues for the customer control planes.
This also affects Database as a Service and Observability as a Service.
All times below are CET.

UPDATE 2025-02-13 12:20

We consider all service disruptions of the incident to be mitigated.
Although we are not expecting any more service disruptions, we are still watching all systems closely.

Previous Updates in chronological order

UPDATE 2025-02-11 10:00

Still investigating the DNS issue. We sent out a notifier to all potentially affected customers.

UPDATE 2025-02-11 11:15

We have used the time since the last update to narrow down the root cause of the incident. We excluded some possibilities but did not find the root cause. We are now preparing a partial rollback to downgrade the SDN again.

UPDATE 2025-02-11 12:05

We completed a partial OVN/SDN downgrade. However, this has not yet resolved the incident.
We are investigating further.

UPDATE 2025-02-11 12:25

We are exploring further downgrade approaches (the previous rollbacks were, as announced, partial) and are continuing to investigate in parallel.

UPDATE 2025-02-11 13:00

As we originally updated the SDN due to a critical security gap, we have decided not to perform a full OVN/SDN rollback to the initial state.
We have now activated several teams who will develop and evaluate different solutions until 13:30. An update on how we will proceed will follow then.

UPDATE 2025-02-11 13:50

Our teams will continue developing and evaluating solutions in breakout sessions, as there are further leads but no breakthrough yet.
In parallel, we are preparing a failover for IAM and Alloy to DUS/HAM.

UPDATE 2025-02-11 15:30

UPDATE 2025-02-11 17:47

Some services are still not functional.
We are still working hard to resolve the issues, but we will roll back the update of the SDN if no progress is made.
The planned maintenance period begins today, 11 February 2025, at 23:00 and is expected to last until around 06:00 CET on 12 February 2025, during which time there may be repeated interruptions to services.
The maintenance has also been announced via notifier. You will be informed of the end of the maintenance via notifier as well.

UPDATE 2025-02-11 20:15

IAM, Alloy and Observability as a Service are restored to full functionality.
The Database as a Service API has also been restored, but the API still has some issues which are related to the wider SDN problem. The databases themselves were at no point affected by the incident.

UPDATE 2025-02-11 23:00

UPDATE 2025-02-12 10:15

The previous maintenance work did not achieve the desired result.
Workarounds have been implemented, so operations should be able to continue without disruptions.
Maintenance work will continue during the upcoming night to fully resolve the incident.

UPDATE 2025-02-12 14:55

We have scheduled another maintenance window for tonight, February 12th, from 23:00 to 06:00 the following day.
During this maintenance window, there may be brief interruptions or limited availability of certain services.
The goal is the complete resolution of the incident caused by Monday's update.
An RfO will be available in our Helpdesk after the incident is mitigated.

UPDATE 2025-02-13 15:30

INCIDENT: MetaKube Control Plane issues, region FES Tuesday 11th February 2025 08:04:00