Identified - We are continuing to bring services back online following the major outage over the weekend. Jobs are running, but users should expect lower-than-normal throughput while recovery efforts continue. Some jobs may be experiencing errors with OSDF transfers. OSPool Jupyter Notebook instances are experiencing significantly longer launch times.

For the latest updates on service recovery, please continue to monitor our Status Page: https://status.osg-htc.org/

May 18, 2026 - 20:24 UTC
Update - Work is wrapping up for the evening.

Capacity issues remain on the cluster: too few servers are left for the number of services running. We will continue to triage, taking down development instances to preserve the ability to run production. The main outages are currently in accounting (GRACC) and reporting; glideins are being launched, jobs are running, and OSDF transfers are flowing.

One of the storage systems in use was located entirely on hardware that was lost. This filesystem primarily held non-critical data, having been replaced by a larger, more robust storage system. However, some services had not migrated, so their status will need to be evaluated one by one to determine the best path forward. About half of the hosted CEs are affected by the storage system loss.

May 17, 2026 - 02:30 UTC
Update - We are continuing to work on a fix for this issue.
May 16, 2026 - 21:48 UTC
Update - A total of 20 hosts failed to come back after the outage.

The failover to the backup hosts was successful and Kubernetes core services, such as Harbor, are coming back online. End-user service restoration is beginning.

Due to capacity issues, some less-critical services will likely be kept offline over the weekend.

May 16, 2026 - 21:47 UTC
Update - Several critical pieces of hardware have failed to come back online after the power outage. Staff are failing over core services to backup hardware.
May 16, 2026 - 20:23 UTC
Identified - Systems administrators are in the data center, recovering hosts.
May 16, 2026 - 17:22 UTC
Investigating - The Kubernetes cluster at UW-Madison, Tiger, suffered an apparent power outage overnight; most hosts are offline, along with a majority of services. The service outage appears to coincide with thunderstorms in the Madison area around 1:00 a.m. Central Time.

Staff will need to travel to the data center to physically diagnose the situation.

May 16, 2026 - 14:09 UTC
OSPool: Partial Outage (96.42% uptime, past 90 days)
AP 23: Operational (94.22% uptime, past 90 days)
AP 40: Operational (94.21% uptime, past 90 days)
AP 41: Operational (94.21% uptime, past 90 days)
AP 42: Operational (94.22% uptime, past 90 days)
AP 43: Operational (94.22% uptime, past 90 days)
Jupyter Notebooks: Degraded Performance (100.0% uptime, past 90 days)
OSPool GlideinWMS Frontend: Operational (100.0% uptime, past 90 days)
OSPool Central Managers / Collectors: Operational (100.0% uptime, past 90 days)
OSPool Site EPs: Partial Outage (96.7% uptime, past 90 days)
Open Science Data Federation: Operational (99.84% uptime, past 90 days)
StashCache Redirector: Operational (100.0% uptime, past 90 days)
CVMFS Synchronization: Operational (100.0% uptime, past 90 days)
Data Federation Accounting Service: Operational (100.0% uptime, past 90 days)
Caches: Operational (100.0% uptime, past 90 days)
Pelican Director: Operational (99.64% uptime, past 90 days)
Pelican Registry: Operational (99.42% uptime, past 90 days)
Hosted CEs: Partial Outage (95.54% uptime, past 90 days)
Hosted CE Infrastructure: Partial Outage (95.54% uptime, past 90 days)
Message Bus: Operational (100.0% uptime, past 90 days)
GlideinWMS Factory: Operational (100.0% uptime, past 90 days)
OASIS: Operational (100.0% uptime, past 90 days)
Network Monitoring Pipeline: Operational (100.0% uptime, past 90 days)
Software Repositories: Major Outage (95.35% uptime, past 90 days)
Yum Repos: Major Outage (86.49% uptime, past 90 days)
GridCF Repo: Operational (100.0% uptime, past 90 days)
OSG Hub: Operational (99.57% uptime, past 90 days)
Accounting: Major Outage (90.99% uptime, past 90 days)
GRACC Frontend: Major Outage (86.49% uptime, past 90 days)
GRACC Backend: Major Outage (86.49% uptime, past 90 days)
GRACC APEL Reporting: Operational (100.0% uptime, past 90 days)
Websites: Operational (99.92% uptime, past 90 days)
Display: Operational (100.0% uptime, past 90 days)
Main Website: Operational (100.0% uptime, past 90 days)
DNS: Operational (100.0% uptime, past 90 days)
OSGConnect Website: Operational (100.0% uptime, past 90 days)
Topology: Operational (99.64% uptime, past 90 days)
Hosted Submit: Operational (100.0% uptime, past 90 days)
Hosted Submit Infrastructure: Operational (100.0% uptime, past 90 days)
Hosted GlideinWMS: Major Outage (97.29% uptime, past 90 days)
IGWN GWMS Frontend: Major Outage (86.49% uptime, past 90 days)
JLAB GWMS Frontend: Operational (100.0% uptime, past 90 days)
GLUEX GWMS Frontend: Operational (100.0% uptime, past 90 days)
UCSD CMS GWMS Frontend: Operational (100.0% uptime, past 90 days)
UCSD CMS VO Collector: Operational (100.0% uptime, past 90 days)
Kubernetes Infrastructure: Partial Outage (98.64% uptime, past 90 days)
Tiger: Partial Outage (95.94% uptime, past 90 days)
River: Operational (100.0% uptime, past 90 days)
Tempest: Operational (100.0% uptime, past 90 days)
PATh Facility: Operational (100.0% uptime, past 90 days)
AP 1: Operational (100.0% uptime, past 90 days)
AP 1 Origin: Operational (100.0% uptime, past 90 days)
Collaborations: Operational (100.0% uptime, past 90 days)
AP 23: Operational (100.0% uptime, past 90 days)
May 28, 2026

No incidents reported today.

May 27, 2026
Completed - The scheduled maintenance has been completed.
May 27, 17:34 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
May 27, 14:34 UTC
Scheduled - We are bringing ap1, ap40, ap41, ap42, ap43 into downtime to apply important updates to the kernel.

The last incident was scheduled at the wrong time. We apologize for the inconvenience.

May 27, 14:33 UTC
Completed - The scheduled maintenance has been completed.
May 27, 17:00 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
May 27, 06:00 UTC
Scheduled - We are bringing ap1, ap40, ap41, ap42, ap43 into downtime to apply important updates to the kernel.

Users will not be able to log in or submit jobs during the downtime.

May 21, 17:24 UTC
May 26, 2026

No incidents reported.

May 25, 2026

No incidents reported.

May 24, 2026

No incidents reported.

May 23, 2026

No incidents reported.

May 22, 2026

No incidents reported.

May 21, 2026

No incidents reported.

May 20, 2026

No incidents reported.

May 19, 2026

No incidents reported.

May 18, 2026

Unresolved incident: OSPool and OSDF Service Interruption.

May 17, 2026

Unresolved incident: Kubernetes systems outage.

May 16, 2026
May 15, 2026

No incidents reported.

May 14, 2026

No incidents reported.