2018-05-11

EPEL Outage Report 2018-11-05

Problem Description:

On 2018-05-11 04:00 UTC reports started coming into centos IRC channels about EPEL being corrupted and causing breakages. These were then reported to #fedora-admin and #epel-devel. The problem would show up as something like:

 One of the configured repositories failed (Unknown),
 and yum doesn't have enough cached data to continue. At this point the only
 safe thing yum can do is fail. There are a few ways to work "fix" this:

The problem was examined and turned out to be that an NFS problem on the backend systems causing the createrepo_c to create the repositories to create a corrupted SQL file. A program which was to catch this did not work for some reason still being investigated and the corrupted sqllite file was mirrored out.

Admins began filling up the #epel and #centos channel asking why their systems were broken. I would like to thank avij, tmz and others who worked on answering as many of the people as possible. I would also like to thank Kevin Fenzi for figuring out the problem, regenerating the builds and unstopping the NFS blockage.

Solution:

Because of the way mirroring works, this problem may affect clients for hours after the fix has been made on the server. There are three things a client can do:
  1. If you have a dedicated mirror, have the mirror update itself with the upstream mirrors.
  2. On client systems you may need to do a yum clean all in order to remove the bad sql in case yum thinks it is still good to cache from.
  3. You can skip yum on updates with:
    
    yum --disablerepo=epel update

Notes:

This will be filled out later as more information and future steps are taken.
  1. Mirrormanager did not have anything to do with this. It's job is to check that mirrors match the master site and in this case the master site was borked so it happily told people to go to mirrors which matched that.
  2. The problem showed up at 04:00 UTC because most servers are set up using GMT/UTC as their clock. At 04:00 the cron.daily starts up and many sites use a daily yum update which broke and mailed them.

No comments: