Update - Work is wrapping up for the evening.
Capacity issues remain on the cluster: too few servers are left for the number of services running. We will continue to triage, taking down development instances to preserve the ability to run production. The main outages currently affect accounting (GRACC) and reporting; glideins are being launched, jobs are running, and OSDF transfers are flowing.
One of the storage systems in use was located entirely on hardware that was lost. This filesystem primarily held non-critical data, having been replaced by a larger, more robust storage system. However, some services had not yet migrated, so each one will need to be evaluated individually to determine the best path forward. About half of the hosted CEs are affected by the storage system loss.
May 17, 2026 - 02:30 UTC
Update - We are continuing to work on a fix for this issue.
May 16, 2026 - 21:48 UTC
Update - A total of 20 hosts failed to come back after the outage.
The failover to the backup hosts was successful and Kubernetes core services, such as Harbor, are coming back online. End-user service restoration is beginning.
Due to capacity issues, some less-critical services will likely be kept offline over the weekend.
May 16, 2026 - 21:47 UTC
Update - Several critical pieces of hardware have failed to come back online after the power outage. Staff are failing over core services to backup hardware.
May 16, 2026 - 20:23 UTC
Identified - Systems administrators are in the data center, recovering hosts.
May 16, 2026 - 17:22 UTC
Investigating - The Kubernetes cluster at UW-Madison, Tiger, suffered an apparent power outage overnight; most hosts and a majority of services are offline. The outage appears to coincide with thunderstorms in the Madison area around 1:00 a.m. Central Time.
Staff will need to travel to the data center to physically diagnose the situation.
May 16, 2026 - 14:09 UTC