Skyline Raid Array Problems [7/24/08]



Posted

Today the Skyline server reported it had found a bad block on the raid array. This could mean that one drive in the array is failing, or simply that a write to the array went wrong.

We expect absolutely zero downtime due to this; however, the array will need to be verified to check for possible drive failures or simply to correct the bad block. During the verify operation there will be a slight degradation of performance.

This post will be updated when the verify has completed, and we will say whether or not we need to schedule a maintenance window to replace a hard drive (no downtime is required for that either).

(Start times are no longer valid; the issue has been escalated from non-service-impacting to service-impacting.)

Posted

We are currently experiencing some problems with the machine related to the raid array, which are causing the machine to become unresponsive. We're actively working to correct the problem.

Posted

At this point in time we believe these bad blocks may have been part of the file system. We're having a technician check the raid card and see what it is doing. We'll update this thread with any new information as it comes in.

Posted

At this point in time it appears that a separate drive from the one with the bad blocks failed out and caused its raid 1 member of the raid 10 array to go into degraded mode. We will allow that raid 1 member to finish rebuilding, then we will replace the bad drive in the other affected member. At that point we will check the operating system and see if everything is still working.

If there appears to be data loss and it's unrecoverable, we will bare-metal restore the machine to the backup that was generated this morning. At this point in time we do not believe this to be the case, but please be aware that it is a possibility.

Posted

Unfortunately, things appear to be much worse than first anticipated. The raid card on the system has now failed while rebuilding one side of the degraded array. This puts the machine in a very delicate situation.

We are going to replace the raid card and see if we can once again start rebuilding the one side of the array. We are going to put the system back up and see if anything is still working. If things are working we will then proceed to make another backup and continue with the rebuilding of the array.

Posted

Just to answer people asking about backups: we have a full machine backup as of Jul 24 00:48:12 CDT (this morning).

We're having the datacenter tech look into whether we can simply rebuild the array(s) or will have to do a full restore.

We'll keep the updates coming via this thread. Sorry for any inconvenience!

Posted

The raid card on the machine has been replaced. We are now working on diagnosing the actual array to determine what needs to be done next.

Posted

The new raid card detected our raid-10 array and shows both raid-1 arrays in degraded mode. We are going to attempt to boot into the operating system to determine if there is any corruption.

Posted

I have excellent news here.

The system has not lost any data, just the 2 drives, which are currently rebuilding. This will take several hours, so for the time being we are running a non-redundant system. We are already generating a backup in case of further failure.
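
For anyone wondering how the array can lose 2 drives without losing data: a raid 10 array stays readable as long as every raid 1 mirror still has at least one healthy drive. Here is a rough sketch of that rule. The four-drive, two-mirror layout and the function name `raid10_state` are assumptions for illustration only; this is not output from the raid card or its tools.

```python
def raid10_state(mirrors):
    """Classify a raid 10 array from the health of its raid 1 mirrors.

    mirrors: list of mirrors, each a list of booleans (True = healthy drive).
    The layout passed in below is an assumed example, not the real server's.
    """
    # If any mirror has lost every drive, that stripe's data is gone.
    if any(not any(mirror) for mirror in mirrors):
        return "failed"
    # Every drive in every mirror is healthy: full redundancy.
    if all(all(mirror) for mirror in mirrors):
        return "optimal"
    # Data is intact, but at least one mirror has no spare copy.
    return "degraded"


# One drive lost in each mirror: still readable, but non-redundant.
print(raid10_state([[True, False], [True, False]]))
# Both drives lost in the same mirror: the array is lost.
print(raid10_state([[False, False], [True, True]]))
```

This is why the state we were in is survivable but delicate: one more failure on the wrong drive would take the array down, hence the extra backup.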

Posted

Well, as of Tony's reply everything is online - we'll be keeping this thread open until everything is 100% operational (raid rebuilds).

Thanks for everyone's patience!

Posted

We've checked the websites and everything is fine on that front; nothing appears to be broken.

We have reason to believe the raid card was to blame for the problems all along. After replacing the card everything seems to be fine, with no data loss or anything.

We will be monitoring the rebuild process closely until it completes.

The total amount of downtime due to these failures was about 1 hr and 40 mins, which, all things considered, is not too bad.

Posted

The last drive has finished rebuilding, and the server is running at 100% again with full redundancy.

Once again thanks for being patient during this whole process - I hope we've given you some faith when it comes to hardware failures :).

Cheers!
