Update - Work is wrapping up for the evening.

Capacity issues remain on the cluster: too few servers are left for the number of services running. We will continue to triage, taking down development instances to preserve the ability to run production. The main outages currently affect accounting (GRACC) and reporting; glideins are being launched, jobs are running, and OSDF transfers are flowing.

One of the storage systems in use was located entirely on hardware that was lost. This filesystem primarily held non-critical data, having been replaced by a larger, more robust storage system. However, some services had not yet migrated, so each will need to be evaluated one by one to determine the best path forward. About half of the hosted CEs are affected by the storage system loss.

May 17, 2026 - 02:30 UTC
Update - We are continuing to work on a fix for this issue.
May 16, 2026 - 21:48 UTC
Update - A total of 20 hosts failed to come back after the outage.

The failover to the backup hosts was successful and Kubernetes core services, such as Harbor, are coming back online. End-user service restoration is beginning.

Due to capacity issues, some less-critical services will likely be kept offline over the weekend.

May 16, 2026 - 21:47 UTC
Update - Several critical pieces of hardware have failed to come back online after the power outage. Staff are failing over core services to backup hardware.
May 16, 2026 - 20:23 UTC
Identified - Systems administrators are in the data center, recovering hosts.
May 16, 2026 - 17:22 UTC
Investigating - The Kubernetes cluster at UW-Madison, Tiger, suffered an apparent power outage overnight; most hosts are offline, along with a majority of services. The outage appears to coincide with thunderstorms in the Madison area around 1:00 a.m. Central Time.

Staff will need to travel to the data center to physically diagnose the situation.

May 16, 2026 - 14:09 UTC
OSPool - Operational - 96.79% uptime (90 days)
AP 23 - Operational - 94.22% uptime (90 days)
AP 40 - Operational - 94.21% uptime (90 days)
AP 41 - Operational - 94.21% uptime (90 days)
AP 42 - Operational - 94.22% uptime (90 days)
AP 43 - Operational - 94.22% uptime (90 days)
Jupyter Notebooks - Operational - 100.0% uptime (90 days)
OSPool GlideinWMS Frontend - Operational - 100.0% uptime (90 days)
OSPool Central Managers / Collectors - Operational - 100.0% uptime (90 days)
OSPool Site EPs - Operational - 100.0% uptime (90 days)
Open Science Data Federation - Operational - 99.84% uptime (90 days)
StashCache Redirector - Operational - 100.0% uptime (90 days)
CVMFS Synchronization - Operational - 100.0% uptime (90 days)
Data Federation Accounting Service - Operational - 100.0% uptime (90 days)
Caches - Operational - 100.0% uptime (90 days)
Pelican Director - Operational - 99.64% uptime (90 days)
Pelican Registry - Operational - 99.42% uptime (90 days)
Hosted CEs - Partial Outage - 99.25% uptime (90 days)
Hosted CE Infrastructure - Partial Outage - 99.25% uptime (90 days)
Message Bus - Operational - 100.0% uptime (90 days)
GlideinWMS Factory - Operational - 100.0% uptime (90 days)
OASIS - Operational - 100.0% uptime (90 days)
Network Monitoring Pipeline - Operational - 100.0% uptime (90 days)
Software Repositories - Major Outage - 99.47% uptime (90 days)
Yum Repos - Major Outage - 98.86% uptime (90 days)
GridCF Repo - Operational - 100.0% uptime (90 days)
OSG Hub - Operational - 99.57% uptime (90 days)
Accounting - Major Outage - 99.24% uptime (90 days)
GRACC Frontend - Major Outage - 98.86% uptime (90 days)
GRACC Backend - Major Outage - 98.86% uptime (90 days)
GRACC APEL Reporting - Operational - 100.0% uptime (90 days)
Websites - Operational - 99.92% uptime (90 days)
Display - Operational - 100.0% uptime (90 days)
Main Website - Operational - 100.0% uptime (90 days)
DNS - Operational - 100.0% uptime (90 days)
OSGConnect Website - Operational - 100.0% uptime (90 days)
Topology - Operational - 99.64% uptime (90 days)
Hosted Submit - Operational - 100.0% uptime (90 days)
Hosted Submit Infrastructure - Operational - 100.0% uptime (90 days)
Hosted GlideinWMS - Major Outage - 99.77% uptime (90 days)
IGWN GWMS Frontend - Major Outage - 98.86% uptime (90 days)
JLAB GWMS Frontend - Operational - 100.0% uptime (90 days)
GLUEX GWMS Frontend - Operational - 100.0% uptime (90 days)
UCSD CMS GWMS Frontend - Operational - 100.0% uptime (90 days)
UCSD CMS VO Collector - Operational - 100.0% uptime (90 days)
Kubernetes Infrastructure - Partial Outage - 99.88% uptime (90 days)
Tiger - Partial Outage - 99.65% uptime (90 days)
River - Operational - 100.0% uptime (90 days)
Tempest - Operational - 100.0% uptime (90 days)
PATh Facility - Operational - 100.0% uptime (90 days)
AP 1 - Operational - 100.0% uptime (90 days)
AP 1 Origin - Operational - 100.0% uptime (90 days)
Collaborations - Operational - 100.0% uptime (90 days)
AP 23 - Operational - 100.0% uptime (90 days)
May 17, 2026

Unresolved incident: Kubernetes systems outage.

May 16, 2026
May 15, 2026

No incidents reported.

May 14, 2026

No incidents reported.

May 13, 2026

No incidents reported.

May 12, 2026

No incidents reported.

May 11, 2026

No incidents reported.

May 10, 2026

No incidents reported.

May 9, 2026

No incidents reported.

May 8, 2026

No incidents reported.

May 7, 2026

No incidents reported.

May 6, 2026

No incidents reported.

May 5, 2026
Resolved - Login to Access Points ap40, ap41, ap42, and ap43 is now available.

Jobs will start running on available computing capacity.

For more context on the state of OSPool services and actions to take, see this page on our website: https://portal.osg-htc.org/documentation/support_and_training/copyfail26/

May 5, 19:12 UTC
Identified - OSPool/PATh services are offline while we patch a serious Linux vulnerability.

Last evening, a serious Linux exploit was broadly published online. This exploit allows any user to easily obtain root (admin) access, bypassing standard security controls.

We pre-emptively shut down our systems this morning to protect them and our users, and we are working to patch the vulnerability. OSPool/PATh services will remain offline through tomorrow (and possibly longer).

We will share more updates as the situation changes. We will provide the most up-to-date status of the system on our status page: https://status.osg-htc.org/incidents/lr3ntcjgjg8q

Apr 30, 21:32 UTC
Investigating - All OSPool Access Points are being taken down.
Apr 30, 14:28 UTC
May 4, 2026

No incidents reported.

May 3, 2026

No incidents reported.