Pluto bad drives issue


Roger

You cannot migrate to a different drive type live; you would need to wipe out all the data to do it. We are also in a situation where this machine cannot be replaced without user IPs changing. So, unlike with other machines, going as far as replacing it is not an option.

We guarantee 99.9% uptime, and we're well within that even if there is an outage once in a while.
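For a sense of scale, here is a minimal sketch of how much downtime a 99.9% guarantee leaves room for. The 30-day month, the 365-day year, and the function name are assumptions made for illustration; the actual measurement window of the guarantee is not stated in this thread.

```python
def downtime_budget_minutes(uptime_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in a period at a given uptime percentage."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - uptime_pct / 100)


# Assumed windows: a 30-day month and a 365-day year.
monthly = downtime_budget_minutes(99.9, 30)
yearly = downtime_budget_minutes(99.9, 365)

print(f"{monthly:.1f} minutes per 30-day month")
print(f"{yearly / 60:.2f} hours per 365-day year")
```

Under those assumptions the budget works out to roughly 43 minutes a month, or a little under 9 hours a year.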

This is really the nature of hard drives. We have had quite a few drive failures on one machine and none on others. Pluto was deployed on 02-20-09, and we did not see a single drive failure issue affecting service until 10-10-09. It's had a string of them as of late, but it's just a luck-of-the-draw type of thing. We could deploy a new machine next week with SAS drives and have 2 drives fail the week after. Even among SATA systems, our backup machine, which is a 12-drive setup, has had zero failures. Heck, even with home computers, I've installed hard drives that failed 2 weeks later. It's really the luck of the draw with these things. We've just had some bad luck with some machines, where drives cause problems requiring a reboot before they fail outright. Others fail, and we post a notice and silently replace them.


Problems like this can be annoying, and it has happened quite often, but I have been with many hosts before HawkHost and had more problems with them than here. At least when something happens here, it is usually put right quickly. Problems happen, but I think what matters more is how they are dealt with and how quickly.


You cannot migrate to a different drive type live; you would need to wipe out all the data to do it.

Yeah, I know, but you have a backup system, don't you?

We are also in a situation where this machine cannot be replaced without user IPs changing. So, unlike with other machines, going as far as replacing it is not an option.

I can see how that complicates things; I hope your datacenter will come up with a solution in the future.

We guarantee 99.9% uptime, and we're well within that even if there is an outage once in a while.

I wasn't complaining about the service. The way you address the problem right away and keep us informed is pretty much why I like Hawk Host so much.

This is really the nature of hard drives. We have had quite a few drive failures on one machine and none on others. Pluto was deployed on 02-20-09, and we did not see a single drive failure issue affecting service until 10-10-09. It's had a string of them as of late, but it's just a luck-of-the-draw type of thing. We could deploy a new machine next week with SAS drives and have 2 drives fail the week after. Even among SATA systems, our backup machine, which is a 12-drive setup, has had zero failures. Heck, even with home computers, I've installed hard drives that failed 2 weeks later. It's really the luck of the draw with these things. We've just had some bad luck with some machines, where drives cause problems requiring a reboot before they fail outright. Others fail, and we post a notice and silently replace them.

Well, you can't really blame a hard drive failure on luck, and you can't compare a desktop drive to a SAS drive, which is made for the enterprise level.

Also, you can't compare a busy web server that is also serving databases to a storage server that has far fewer random seek operations, if any.

VelociRaptor drives are high-performance desktop drives which are also compatible with RAID configurations, but they are certainly not made for the enterprise level and/or to withstand such long periods of intense activity. That, I think, is the reason for the constant drive failures on Pluto, not luck or the nature of hard drives.

Finally, as I said before, I wasn't complaining, and as long as you keep replacing the faulty drives fast we'll all be fine. Let's just hope it never comes to the whole array failing at the same time (highly unlikely, I know, but still).

Edited by asambler

The majority of web hosts use SATA disks, not SAS, as SAS disks are quite expensive. We were using SATA almost exclusively until recently, and some machines never had a drive failure, while others had a lot.

Those Raptor drives are actually considered enterprise drives, believe it or not: http://www.wdc.com/en/products/products.asp?driveid=495

Designed and manufactured to mission-critical enterprise-class standards to provide enterprise reliability in high duty cycle environments. With 1.4 million hours MTBF, these drives have the highest available reliability rating on a high capacity SATA drive.
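To put the quoted 1.4 million hours MTBF next to the "luck of the draw" argument, here is a rough sketch of what that figure would imply per year. It assumes a constant failure rate (an exponential model) and independent drives, which is a simplification and not something the datasheet or this thread claims; the function names are made up for illustration. The 12-drive count is the backup machine mentioned earlier in the thread.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous operation


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def prob_at_least_one_failure(mtbf_hours: float, drives: int) -> float:
    """Chance that at least one of `drives` independent drives fails within a year."""
    p = annual_failure_probability(mtbf_hours)
    return 1 - (1 - p) ** drives


afr = annual_failure_probability(1_400_000)       # advertised MTBF from the quote
p_any = prob_at_least_one_failure(1_400_000, 12)  # 12-drive setup, as an example

print(f"per-drive annual failure probability: {afr:.2%}")
print(f"at least one failure among 12 drives: {p_any:.2%}")
```

Taken at face value, the advertised number gives well under 1% per drive per year and only a few percent for a 12-drive array, so a string of failures on one machine within months is either bad luck or a sign that the datasheet figure does not describe that workload. The model cannot tell those two apart.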

They make enterprise SATA drives in 7.2K RPM form as well as in 10K RPM form. They advertise them with high reliability numbers and all of that.

As far as the backup machines go, they hold several TB of data and are pretty heavily written to, with very randomized access (we defrag them to help deal with this). It's surprising we haven't had a drive failure; they take a beating when we're doing backups.

It would not surprise me if, once we finish replacing the drive(s) on Pluto that are showing errors, we don't see anything for a year. One drive failed and at the same time another started reporting bad blocks. That is probably why nothing got dropped at first: it was not until the second drive, on the other side of the RAID, had issues that problems came up.


I am one of the Pluto users.

However, outages caused by defective hardware are no problem for me.

I have never seen such a flow of status information as here.

Here in Europe it is not possible to get this kind of information from European hosting companies:

Everybody says "we are best" and "No outages", but in real life there are outages.

Too bad if you are a customer and don't know what is happening with your hosting.

Anyway, Tony and his team are doing really great work, and the information we get here is a marker not only of "best service"; I would call this "first-class service"!

Best greetings from Austria (here it is now 10 AM)

tom


We're going to see what we can do about the Pluto server; maybe we can get creative and come up with some sort of solution, I'm not sure. We'd love to move entirely off SATA disks on our web servers. We have other servers that use them too, and those were tricky to migrate as well. So we'll see what can be done, no promises though.


Is Pluto down again? How about a spare server, so you can just pull Pluto's hard drives and plug them into the spare? That would cover cases where there is a mechanical failure or whatnot.

Check the thread; it's been updated. Something went wrong again with the file system, and we're taking corrective measures and investigating why this has happened again.


The majority of web hosts use SATA disks, not SAS, as SAS disks are quite expensive. We were using SATA almost exclusively until recently, and some machines never had a drive failure, while others had a lot.

Well, there's a reason why SAS drives are more expensive: you can't just stick to a cheaper piece of hardware and expect it to withstand the same load.

Web hosting servers can be especially demanding on the hardware. With your ongoing growth you should really consider lightening the load on Pluto and Titan somehow, and do your best to move to SAS.

Those Raptor drives are actually considered enterprise drives, believe it or not: http://www.wdc.com/en/products/products.asp?driveid=495

I think it's clear you can't always trust what the advertising says, a very important business rule.

They make enterprise SATA drives in 7.2K RPM form as well as in 10K RPM form. They advertise them with high reliability numbers and all of that.

Of course they do; that doesn't mean they are right for every application.

As far as the backup machines go, they hold several TB of data and are pretty heavily written to, with very randomized access (we defrag them to help deal with this). It's surprising we haven't had a drive failure; they take a beating when we're doing backups.

That's my point: backup servers only work when they do backups, and then they rest for a while.

It would not surprise me if, once we finish replacing the drive(s) on Pluto that are showing errors, we don't see anything for a year. One drive failed and at the same time another started reporting bad blocks. That is probably why nothing got dropped at first: it was not until the second drive, on the other side of the RAID, had issues that problems came up.

Let's first hope the server can get out of this in one piece. We're counting on you guys.


It's online with no data loss. We're currently evaluating logs and hardware to try to pinpoint the issue.


Right now this machine is a very troubling one: it has close to 100 IPs, which means 100 users with their own IPs. Whenever we've migrated to new hardware, we've always re-routed the user IPs, so there was no downtime at all. Users do not even experience DNS propagation, since their IPs do not change.

Unfortunately, it's looking like we won't be able to do this. We could move some users onto other machines on the same VLAN so that they keep their IPs. For users on shared IPs, though, that will definitely not happen, as we simply do not have the capacity on any of the machines to take on basically an entire other server.

So it's going to be a fun few days. We'll probably know what we're going to do in a few hours; then we'll post notices and the fun will start.


I would gladly pay an extra $2/mo to be moved to another machine while you work on Pluto; I think many would.

Haaaa... good old geek fun, there ain't anything like it.


Well, this is pretty amusing: we deployed the replacement Pluto only to discover a failed drive! It died literally right after deployment. So much for SAS being more reliable; the name Pluto is what's cursed! In all seriousness, we're just going to deploy a new set of I/O hardware (RAID, cables, drives) for this new machine. It just shows that with SATA, SAS or whatever, failures do happen and are totally unpredictable. In this case, though, the machine just became very slow until the drive was removed.


I never said that SAS drives never fail. Then again, this is an unrelated comparison, since a drive failure on deployment usually means a manufacturer defect, and it's much better this way than having drives dying from exhaustion every few months.

SAS drives are simply better, tougher, more reliable and, most importantly, they are made exactly for this kind of job.

Raptor drives are great, but I think they have proven that they won't do for this job anymore.

That is, if Pluto's curse doesn't decide to prove otherwise. :o

Joking aside, I have worked with Raptor drives and they don't have much thermal efficiency, not to mention that 10k rpm for a 2.5" drive isn't the best design for reliability.

I still think SAS is the way to go on this issue, and if I were you I would watch Titan very closely.

Edited by asambler
