OSDF Caches slow and unresponsive
Incident Report for OSG Consortium
Postmortem

The OSDF caches are hosted on the National Research Platform (NRP). Each cache keeps its on-disk data cache on local storage, but all caches worldwide write their logs to a single shared drive. Writes to this shared drive slowed significantly on Oct. 11, which slowed the caches' processing of data requests and eventually caused timeouts and failed downloads.

Corrective Actions

To restore cache functionality immediately, we moved the cache logs to local disk. Long-term, we are investigating central logging solutions that can be plugged into the NRP.
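
As a general illustration of why a slow log destination can stall request handling, and of one common way to decouple the two, here is a minimal Python sketch using the standard library's queue-based logging handlers. It is not how the OSDF caches are implemented and is not the central logging solution under investigation; the names `build_async_logger` and `CollectingHandler` are hypothetical and exist only for this example.

```python
import logging
import logging.handlers
import queue


class CollectingHandler(logging.Handler):
    """Hypothetical stand-in for a slow destination such as a shared drive."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # A real handler would write to the slow destination here.
        self.messages.append(record.getMessage())


def build_async_logger(name, slow_handler):
    """Return (logger, listener); log calls only enqueue, a thread does the I/O."""
    log_queue = queue.Queue(-1)  # unbounded queue: enqueueing never blocks
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger's handlers
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The listener drains the queue on a background thread, so a slow
    # handler delays log delivery rather than the caller.
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return logger, listener


if __name__ == "__main__":
    handler = CollectingHandler()
    logger, listener = build_async_logger("cache-demo", handler)
    logger.info("served request %s", "/example/object")
    listener.stop()  # flushes queued records before returning
    print(handler.messages)
```

The trade-off in this kind of design is that log records can be delayed, or lost if the process exits before the queue is drained, in exchange for request handling that does not wait on the log destination.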

Posted Oct 12, 2023 - 17:30 UTC

Resolved
Caches became slow to respond and eventually became completely unresponsive to requests. Users of the OSDF caches would have noticed either slow data downloads or completely failed downloads.
Posted Oct 11, 2023 - 14:00 UTC