Kubernetes systems outage

Incident Report for OSG Consortium

Update

Work is wrapping up for the evening.

Capacity issues remain on the cluster: too few servers are left for the number of services running. We will continue to triage, taking down development instances to preserve the ability to run production. The main outages currently affect accounting (GRACC) and reporting; glideins are being launched, jobs are running, and OSDF transfers are flowing.

One of the storage systems in use was located entirely on hardware that was lost. This filesystem primarily held non-critical data, having been replaced by a larger, more robust storage system. However, some services had not yet migrated, so their status will need to be evaluated one by one to determine the best path forward. About half of the hosted CEs are affected by the storage system loss.
Posted May 17, 2026 - 02:30 UTC

Update

We are continuing to work on a fix for this issue.
Posted May 16, 2026 - 21:48 UTC

Update

A total of 20 hosts failed to come back after the outage.

The failover to the backup hosts was successful and Kubernetes core services, such as Harbor, are coming back online. End-user service restoration is beginning.

Due to capacity issues, some less-critical services will likely be kept offline over the weekend.
Posted May 16, 2026 - 21:47 UTC

Update

Several critical pieces of hardware have failed to come back online after the power outage. Staff are failing over core services to backup hardware.
Posted May 16, 2026 - 20:23 UTC

Identified

Systems administrators are in the data center, recovering hosts.
Posted May 16, 2026 - 17:22 UTC

Investigating

The Kubernetes cluster at UW-Madison, Tiger, suffered an apparent power outage overnight; most hosts are offline, along with a majority of services. The outage appears to coincide with thunderstorms in the Madison area around 1:00 AM Central Time.

Staff will need to travel to the data center to physically diagnose the situation.
Posted May 16, 2026 - 14:09 UTC
This incident affects: Software Repositories (Yum Repos, OSG Hub), Accounting (GRACC Frontend, GRACC Backend), Hosted GlideinWMS (IGWN GWMS Frontend), Kubernetes Infrastructure (Tiger), Hosted CEs (Hosted CE Infrastructure), Websites (Topology), and Open Science Data Federation (Pelican Director, Pelican Registry).