Skyline Issues [3/23/2008]


Tony

The Skyline server has been taken down because websites were loading incorrectly or not at all, due to problems with one of the server's file systems. Just a minute ago the file system went completely read-only and would not come out of that state. The services still using the file system were critical operating system services, which left us no choice but to reboot the system.

Unfortunately, there also appears to be a problem with the remote management device on the server, leaving us unable to control its console remotely. Because of this, we have a technician on site attempting to resolve the hardware problem and get the server to boot up. At this point we have no ETA on a resolution.


This issue has been resolved.

Now to explain fully what was going on here:

For several hours the /var partition had been throwing errors at random, causing Apache and some other services to throw errors at random as well. One user might have seen a 500 error while the next loaded the same page fine.
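For anyone curious how intermittent failures like this show up in practice, here is a minimal sketch that tallies HTTP status codes from Apache access log lines. It assumes the standard Common/Combined Log Format; the function names and the sample lines are made up for illustration and are not taken from the Skyline server.

```python
import re
from collections import Counter

# Matches the quoted request field followed by the 3-digit status code,
# as laid out in Apache's Common/Combined Log Format.
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')


def status_counts(lines):
    """Count how many times each HTTP status code appears."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def server_error_rate(lines):
    """Fraction of parsed requests that returned a 5xx status."""
    counts = status_counts(lines)
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items() if code.startswith("5"))
    return errors / total if total else 0.0


# Hypothetical sample lines: one failed request, one successful request.
sample = [
    '203.0.113.5 - - [23/Mar/2008:10:00:01 -0500] "GET /index.php HTTP/1.1" 500 312',
    '203.0.113.9 - - [23/Mar/2008:10:00:02 -0500] "GET /index.php HTTP/1.1" 200 5120',
]
```

A rate that sits well above zero but well below one is consistent with the "some users fine, some users not" pattern described above.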

We decided we could not resolve this issue easily. We gave it a few pokes, at which point it went into complete read-only mode and all services stopped working. At that point we knew we needed to run an fsck on it (essentially like Windows ScanDisk: it checks the file system and tries to fix errors).
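As a rough illustration of how a read-only file system can be spotted, the sketch below lists mount points whose options include `ro`. It assumes a Linux host that exposes the mount table at /proc/mounts (the post does not say which operating system the server runs), and `read_only_mounts` is a hypothetical helper name.

```python
def read_only_mounts(mounts_text):
    """Return mount points whose option list contains 'ro'.

    Expects text in the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


if __name__ == "__main__":
    # Assumption: Linux-style /proc/mounts; other systems may not have it.
    try:
        with open("/proc/mounts") as handle:
            print(read_only_mounts(handle.read()))
    except OSError:
        print("/proc/mounts is not available on this system")
```

Matching on the exact option `ro` keeps an option such as `errors=remount-ro` from being mistaken for a read-only mount.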

Unfortunately, the server would not let us unmount /var, so we attempted to reboot the server remotely through its IPMI device. That did not work because the IPMI device itself had failed, meaning we could not solve the problem remotely. An on-site technician was then sent to resolve the issue.

The technician resolved the issue with the IPMI device and rebooted the server, which appears to be functioning fine once again. /var seems to be working properly, and at this point we believe this will not be an ongoing issue. However, we'll be monitoring the server for the next 24 hours to make sure everything is fine.


I didn't even notice a problem.

However, I do appreciate these thorough explanations. Most hosts will not explain problems like this beyond saying "technical difficulties".

Thanks for the hard work you put into uptime and keeping customers informed.
