Tony, posted July 25, 2008: Today the Skyline server reported that it had found a bad block on the RAID array. This could mean that one drive in the array is failing, or it could just be a problem with a single write to the array. We expect absolutely zero downtime due to this; however, the array will need to be verified in order to check for possible drive failures or simply to correct the bad block. During the verify operation there will be a slight degradation of performance. This post will be updated when the verify has completed, and we will say whether or not we need to schedule a maintenance window to replace a hard drive (no downtime is required for that either). (Start times are no longer valid; the issue has been escalated from non-service-impacting to service-impacting.)
Tony, posted July 25, 2008: We are currently experiencing problems with the machine related to the RAID array, which are causing the machine to become unresponsive. We're actively working to correct the problem.
Tony, posted July 25, 2008: At this point in time we believe these bad blocks may have been part of the file system. We're having a technician check the RAID card and see what it is doing. We'll update this thread with any new information as it comes in.
Tony, posted July 25, 2008: At this point in time it appears that a separate drive from the one with the bad blocks failed out and caused one RAID 1 member of the RAID 10 array to go into degraded mode. We will allow that RAID 1 member to finish rebuilding, then we will replace the bad drive in the other RAID 1 member. At that point we will check the operating system and see if everything is still working. If there appears to be data loss and it's unrecoverable, we will bare-metal restore the machine from the backup that was generated this morning. At this point in time we do not believe this to be the case, but please be aware that it is a possibility.
Tony, posted July 25, 2008: Unfortunately, things appear to be much worse than first anticipated. The RAID card on the system has now failed while rebuilding one side of the degraded array. This puts the machine in a very delicate situation. We are going to replace the RAID card and see if we can once again start rebuilding that side of the array. We will then bring the system back up and see if anything is still working. If things are working, we will make another backup and continue with the rebuilding of the array.
Cody R., posted July 25, 2008: Just to answer people about backups: we have a full machine backup as of Jul 24 00:48:12 CDT (this morning). We're having the datacenter tech look into whether we can simply rebuild the array(s) or have to do a full restore. We'll keep the updates coming via this thread. Sorry for any inconvenience!
Tony, posted July 25, 2008: The RAID card on the machine has been replaced. We are now working on diagnosing the actual array to determine what needs to be done next.
Tony, posted July 25, 2008: The new RAID card detected our RAID 10 array and shows both RAID 1 arrays in degraded mode. We are going to attempt to boot into the operating system to determine if there is any corruption.
Tony, posted July 25, 2008: I have excellent news here. The system has not lost any data, just the 2 drives, which are currently rebuilding. This will take several hours, so for the time being we are running a non-redundant system. We are already generating a backup in case of further failure.
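For readers wondering how an array can lose 2 drives without losing data, here is a minimal sketch of the general RAID 10 layout described in this thread: two RAID 1 mirrors of two drives each. It is only an illustration, not how the RAID card itself works. The names `Mirror` and `raid10_status` are made up for the example, and the sketch assumes that the two failed drives sat in different mirrors, which is what "both RAID 1 arrays in degraded mode" suggests.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mirror:
    """One RAID 1 pair (hypothetical model). True means the drive is healthy."""
    drives: Tuple[bool, bool]

    def has_data(self) -> bool:
        # A mirror still holds its data if at least one drive survives.
        return any(self.drives)

    def is_redundant(self) -> bool:
        # A mirror is redundant only while both drives are healthy.
        return all(self.drives)


def raid10_status(mirrors: List[Mirror]) -> str:
    """Classify a RAID 10 array built from RAID 1 mirrors."""
    if not all(m.has_data() for m in mirrors):
        return "failed"      # some mirror lost both drives: data is gone
    if all(m.is_redundant() for m in mirrors):
        return "optimal"     # every mirror has both drives
    return "degraded"        # data intact, but at least one mirror has no spare copy


# Assumed situation: one failed drive in each mirror.
array = [Mirror((True, False)), Mirror((False, True))]
print(raid10_status(array))
```

In this model, losing one drive from each mirror leaves the data intact but with no redundancy until the rebuilds finish, while losing both drives of the same mirror would have meant a restore from backup.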
Cody R., posted July 25, 2008: For any discussion regarding this, please use this thread: http://www.hawkhost.com/forums/showthread.php?t=152 Thanks!
Cody R., posted July 25, 2008: Well, as of Tony's reply everything is online - we'll be keeping this thread open until everything is 100% operational (RAID rebuilds). Thanks for everyone's patience!
Tony, posted July 25, 2008: We've checked websites and everything is fine on that front; nothing appears to be broken. We have reason to believe the RAID card was to blame for the problems all along. After replacing the card everything seems to be fine, with no data loss or anything. We will be monitoring the rebuild process closely until it completes. The total amount of downtime due to these failures was about 1hr and 40mins, which, all things considered, is not too bad.
Cody R., posted July 25, 2008: A quick update: it appears 3 of the 4 drives are online and fully functional, and the last one is finishing up rebuilding.
Cody R., posted July 25, 2008: The last drive has finished rebuilding and the server is running at 100% again with full redundancy. Once again, thanks for being patient during this whole process - I hope we've given you some faith when it comes to hardware failures. Cheers!