Cody R. Posted April 4, 2011
We'll be performing a kernel upgrade on Wednesday, April 6th between 12PM and 3PM PDT. While we use KSplice to keep our kernels up to date with critical bug and security fixes, it doesn't pull in new features from the upstream kernel. As a result, we perform semi-annual kernel updates to keep everything up to date and in line with upstream.
Date: 04/06/2011
Start time (PDT): 12:00pm
End time (PDT): 3:00pm
Duration: 1 hour
Estimated Down Time: 10 minutes
Cody R. Posted April 6, 2011
There was an issue during the maintenance of this machine. We're actively investigating and working on resolving it. We'll post more updates shortly. We're currently waiting for data center technicians to resolve a few issues with the IPMI/virtual media we use.
Cody R. Posted April 6, 2011
There have been some unforeseen issues with the IPMI and network connectivity that are preventing us from investigating further and taking corrective action. We're working with data center technicians to get this resolved so we can actively investigate the machine. We'll post more updates as we receive them.
Tony Posted April 7, 2011
We believe we're close to having the machine repaired and functional again. We hope to have an update within the next few hours regarding it being online again.
Brian Posted April 7, 2011
Our sysadmin team is still actively and fully engaged in this issue, and we're progressing as expected. If everything continues as planned, we should have services restored soon.
Cody R. Posted April 7, 2011
We'll be providing more information on this issue within the next 24 hours; however, the machine is currently online. We're wrapping up our initial maintenance and expect everything to be online shortly. We'll update this thread when everything is fully online.
Brian Posted April 7, 2011
All sites and services should be fully accessible at this time. Thank you to everyone for being patient throughout the downtime. As Cody mentioned, we'll be providing a more detailed explanation of today's issues within the next 24 hours, once we're able to compile all the necessary information about what led to the crash.
Tony Posted April 22, 2011
This issue was caused by a software bug that resulted in major corruption of the operating system, requiring us to repair it from backups. The bug was supposedly fixed several years ago; however, as best we can tell, it was somehow reintroduced. It requires a specific set of circumstances to trigger. Unfortunately, one of our machines produced them, and we could not stop the resulting problems quickly enough to avoid affecting system availability. Once the problem was identified, we had issues using our rescue systems on the server. This required further assistance, this time from our datacenter, to restore the rescue system, which we needed in order to restore services. Once we had a working rescue system, we spent the rest of the time repairing just the operating system, which took extensive testing before we were confident that everything was corrected and would function as it did before.