Pluto bad drives issue

Roger · February 22, 2010

Today is the third time in less than a year that the Pluto server has had an outage related to a bad hard drive.

Maybe it's time to think about moving to SAS.

Other than that, great service, guys.

Tony · February 23, 2010

You cannot migrate to a different drive type live. You would need to wipe out all data to do it. We are also in a situation where this machine cannot be replaced without user IPs changing. So, unlike with our other machines, simply replacing it is not an option.

We guarantee 99.9% uptime, and we're well within that even if there is an outage once in a while.
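For a sense of scale, a 99.9% guarantee can be turned into a downtime allowance with simple arithmetic. The sketch below is only an illustration: the thread does not say whether the guarantee is measured per month or per year, so both windows are shown, and the 30-day month and 365-day year are assumptions.

```python
def downtime_budget_hours(uptime_percent: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given uptime percentage."""
    return period_hours * (1.0 - uptime_percent / 100.0)


HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month

yearly_hours = downtime_budget_hours(99.9, HOURS_PER_YEAR)
monthly_minutes = downtime_budget_hours(99.9, HOURS_PER_MONTH) * 60

print(f"{yearly_hours:.2f} hours of downtime allowed per year")
print(f"{monthly_minutes:.1f} minutes of downtime allowed per 30-day month")
```

Under those assumptions the allowance works out to a little under nine hours a year, or about three quarters of an hour in a month.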

This is really the nature of hard drives. We have had quite a few drive failures on one machine and none on others. Pluto was deployed on 02-20-09, and we did not see a single issue caused by a drive failure until 10-10-09. It's had a string of them lately, but it's just a luck-of-the-draw type of thing. We could deploy a new machine next week with SAS drives and have 2 drives fail the week after. Even among SATA systems, our backup machine, which is a 12-drive setup, has had zero failures. Heck, even with home computers, I've installed hard drives that failed 2 weeks later. It's really the luck of the draw with these things. We've just had some bad luck with some machines where drives cause problems requiring a reboot before they fail outright. Others fail, and we post a notice and silently replace them.

Fowler · February 23, 2010

Problems like this can be annoying, and they have happened quite often, but I have been with many hosts before HawkHost, and with those hosts we had more problems than here. At least when something happens here, it is usually put right quickly. Problems happen, but I think what matters more is how they are dealt with and how quickly.

Roger · February 23, 2010

You cannot migrate to a different drive type live. You would need to wipe out all data to do it.

Yeah, I know, but you have a backup system, don't you?

We are also in a situation where this machine cannot be replaced without user IPs changing. So, unlike with our other machines, simply replacing it is not an option.

I can see how that complicates things. I hope your datacenter will come up with a solution in the future.

We guarantee 99.9% uptime, and we're well within that even if there is an outage once in a while.

I wasn't complaining about the service. The way you address the problem right away and keep us informed is pretty much why I like Hawk Host so much.

This is really the nature of hard drives. We have had quite a few drive failures on one machine and none on others. Pluto was deployed on 02-20-09, and we did not see a single issue caused by a drive failure until 10-10-09. It's had a string of them lately, but it's just a luck-of-the-draw type of thing. We could deploy a new machine next week with SAS drives and have 2 drives fail the week after. Even among SATA systems, our backup machine, which is a 12-drive setup, has had zero failures. Heck, even with home computers, I've installed hard drives that failed 2 weeks later. It's really the luck of the draw with these things. We've just had some bad luck with some machines where drives cause problems requiring a reboot before they fail outright. Others fail, and we post a notice and silently replace them.

Well, you can't really blame a hard drive failure on luck, and you can't compare a desktop drive to a SAS drive, which is made for enterprise-level use.

Also, you can't compare a busy web server that is also serving databases to a storage server that has far fewer random seek operations, if any.

VelociRaptor drives are high-performance desktop drives which are also compatible with RAID configurations, but they are certainly not made for enterprise-level use or to withstand such long periods of intense activity. That, I think, is the reason for the constant drive failures on Pluto, not luck or the nature of hard drives.

Finally, as I said before, I wasn't complaining, and as long as you keep replacing the faulty drives quickly we'll all be fine. Let's just hope it never comes to the whole array failing at the same time (highly unlikely, I know, but still).

Edited February 23, 2010 by asambler

Tony · February 23, 2010

The majority of web hosts are using SATA disks rather than SAS, as SAS drives are quite expensive. We were using SATA almost exclusively until recently, and we had some machines that never had drive failures. Some machines had a lot of drive failures, others none.

Those raptor drives are actually considered enterprise drives believe it or not: http://www.wdc.com/en/products/products.asp?driveid=495

Designed and manufactured to mission-critical enterprise-class standards to provide enterprise reliability in high duty cycle environments. With 1.4 million hours MTBF, these drives have the highest available reliability rating on a high capacity SATA drive.
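As a rough aid to reading that figure, an MTBF rating can be converted into an approximate annualized failure rate. The sketch below assumes a constant failure rate (an exponential lifetime model) and drives that run around the clock; both are simplifying assumptions and not something the datasheet quote states. The 12-drive count is borrowed from the backup machine mentioned earlier purely as an illustration, since the thread does not say how many drives Pluto has or that the backup machine uses this model.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: drives powered on 24 hours a day, 365 days a year


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def prob_any_failure(afr: float, n_drives: int) -> float:
    """Chance that at least one of n independent drives fails within a year."""
    return 1.0 - (1.0 - afr) ** n_drives


afr = annualized_failure_rate(1_400_000)   # the rating quoted above
p_any_of_12 = prob_any_failure(afr, 12)    # illustrative drive count only

print(f"Per-drive annualized failure rate: {afr:.2%}")
print(f"Chance of at least one failure among 12 drives in a year: {p_any_of_12:.2%}")
```

Under these assumptions the rating corresponds to well under 1% per drive per year. An MTBF is a population statistic taken from the manufacturer's own testing, so it says little about any single drive or about how a particular workload affects it.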

They make enterprise SATA drives in 7.2K RPM form as well as in 10K RPM form. They advertise them with high reliability numbers and all of that.

As far as the backup machines go, they hold several TB of data and are written to pretty heavily and in a very randomized way (we defrag them to help deal with this). It's surprising we haven't had a drive failure, as they take a beating when we're doing backups.

It would not surprise me if, once we finish replacing the drive(s) on Pluto that are showing errors, we do not see anything for a year. There was one drive that failed, and at the same time another started reporting bad blocks. That is probably why things did not get dropped: it was not until the second drive, on the other side of the RAID, had issues that problems came up.

tomtom76 · February 23, 2010

I am one of the Pluto users.

However, outages caused by defective hardware are no problem for me.

I have never seen such a flow of status information as there is here.

Here in Europe it is not possible to get this kind of information from European hosting companies:

Everybody says "we are the best" and "no outages", but in real life there are outages.

Too bad if you are a customer and don't know what is happening with your hosting.

Anyway, Tony and his team are doing really great work, and the information we get here is a mark of not only "best service"; I would call this "first class service"!

Best greetings from Austria (here it is now 10 AM)

tom

Tony · February 23, 2010

We're going to see what we can do about the Pluto server. Maybe we can get creative and come up with some sort of solution; I'm not sure. We'd love to move entirely off SATA disks on our web servers. We have other servers that use them too, and those were tricky to migrate as well. So we'll see what can be done, but no promises.

jonee54 · February 23, 2010

Is Pluto down again? How about a spare server where you can just pull Pluto's hard drives and plug them into the spare? That would be for cases where there is a mechanical failure or whatnot.

tomtom76 · February 23, 2010

poor pluto

What about this funny routing?

Edited February 23, 2010 by tomtom76

tomtom76 · February 23, 2010

and this is the routing to hawkhost.com

Much better!

Cody R. · February 23, 2010

poor pluto

What about this funny routing?

If possible submit a ticket to Support with the trace route / originating IP (your IP most likely) and ask for it to be escalated. We'll see if it's a routing issue on our end or your ISP's.

Cody R. · February 23, 2010

Is Pluto down again? How about a spare server where you can just pull Pluto's hard drives and plug them into the spare? That would be for cases where there is a mechanical failure or whatnot.

Check the thread; it's been updated. Something went wrong again with the file system, and we're taking corrective measures and investigating why this has happened again.

Roger · February 23, 2010

The majority of web hosts are using SATA disks rather than SAS, as SAS drives are quite expensive. We were using SATA almost exclusively until recently, and we had some machines that never had drive failures. Some machines had a lot of drive failures, others none.

Well, there's a reason why SAS drives are more expensive. You can't just stick with a cheaper piece of hardware and expect it to withstand the same load.

Web hosting servers can be especially demanding on the hardware. With your ongoing growth, you should really consider lightening the load on Pluto and Titan somehow and do your best to move to SAS.

Those raptor drives are actually considered enterprise drives believe it or not: http://www.wdc.com/en/products/products.asp?driveid=495

I think it's clear you can't always trust what the advertising says, a very important business rule.

They make enterprise SATA drives in 7.2K RPM form as well as in 10K RPM form. They advertise them with high reliability numbers and all of that.

Of course they do, but that doesn't mean they are right for every application.

As far as the backup machines go, they hold several TB of data and are written to pretty heavily and in a very randomized way (we defrag them to help deal with this). It's surprising we haven't had a drive failure, as they take a beating when we're doing backups.

That's my point: backup servers only work when they are doing backups, and then they rest for a while.

It would not surprise me if, once we finish replacing the drive(s) on Pluto that are showing errors, we do not see anything for a year. There was one drive that failed, and at the same time another started reporting bad blocks. That is probably why things did not get dropped: it was not until the second drive, on the other side of the RAID, had issues that problems came up.

Let's hope first that the server can get out of this in one piece. We're counting on you guys.

Cody R. · February 23, 2010

Well, there's a reason why SAS drives are more expensive. You can't just stick with a cheaper piece of hardware and expect it to withstand the same load.

Web hosting servers can be especially demanding on the hardware. With your ongoing growth, you should really consider lightening the load on Pluto and Titan somehow and do your best to move to SAS.

I think it's clear you can't always trust what the advertising says, a very important business rule.

Of course they do, but that doesn't mean they are right for every application.

That's my point: backup servers only work when they are doing backups, and then they rest for a while.

Let's hope first that the server can get out of this in one piece. We're counting on you guys.

It's online with no data loss. We're currently evaluating logs and hardware to try to pinpoint the issue.

Tony · February 23, 2010

Right now this machine is a very troubling one. We have close to 100 IPs on it, which means 100 users with their own IPs. When migrating to new hardware, we have always re-routed the user IPs so there is no downtime at all. The users do not even experience DNS propagation since their IPs do not change.

Unfortunately, it's looking like we won't be able to do this. We could move some users onto other machines on the same VLAN so that they keep their IPs. Unfortunately, for users on shared IPs that will definitely not happen, as we simply do not have the capacity on any of the machines to take on basically an entire other server.

So it's going to be a fun few days. We'll probably know what we're going to do in a few hours, then we'll post notices and the fun will start.

Roger · February 23, 2010

Right now this machine is a very troubling one. We have close to 100 IPs on it, which means 100 users with their own IPs. When migrating to new hardware, we have always re-routed the user IPs so there is no downtime at all. The users do not even experience DNS propagation since their IPs do not change.

Unfortunately, it's looking like we won't be able to do this. We could move some users onto other machines on the same VLAN so that they keep their IPs. Unfortunately, for users on shared IPs that will definitely not happen, as we simply do not have the capacity on any of the machines to take on basically an entire other server.

So it's going to be a fun few days. We'll probably know what we're going to do in a few hours, then we'll post notices and the fun will start.

I will gladly pay an extra $2/mo to be moved to another machine while you work on Pluto. I think many will.

Haaaa... good old geek fun, there ain't anything like it.

Tony · February 23, 2010

You'd switch IPs, which means DNS propagation. If you're not worried about DNS propagation, we can move a single site to another server, no problem. That said, we may have a solution to the Pluto issue that doesn't involve people changing IPs.

tomtom76 · February 23, 2010

You could move my account to a server hosted in Washington, DC (because of better routing to Europe).

It would be great to get an account on a WDC server.

Then I would move my stuff myself.

Doesn

Tony · February 23, 2010

You could move my account to a server hosted in Washington, DC (because of better routing to Europe).

It would be great to get an account on a WDC server.

Then I would move my stuff myself.

Doesn

Tony · February 23, 2010

Well, no worries guys: we're going to be replacing the Pluto server, starting tomorrow, with new hardware with 15K SAS drives. More information will follow once the machine starts getting built.

tomtom76 · February 23, 2010

very good news

Roger · February 23, 2010

Indeed, great news.

Tony · February 23, 2010

Notice posted: http://forums.hawkhost.com/showthread.php?t=907

Basically 15K SAS over 10K SATA drives is the only change.

Tony · February 24, 2010

Well, this is pretty amusing: we deployed the replacement Pluto only to discover a failed drive! It died literally right after deployment. So much for SAS being more reliable; the name Pluto is what's cursed! In all seriousness, we're just going to deploy a new set of I/O components (RAID, cables, drives) for this new machine. It just shows you that, SATA, SAS or whatever, failures do happen and are totally unpredictable. In this case, though, the machine just became very slow before the drive was removed.

Roger · February 24, 2010

Well, this is pretty amusing: we deployed the replacement Pluto only to discover a failed drive! It died literally right after deployment. So much for SAS being more reliable; the name Pluto is what's cursed! In all seriousness, we're just going to deploy a new set of I/O components (RAID, cables, drives) for this new machine. It just shows you that, SATA, SAS or whatever, failures do happen and are totally unpredictable. In this case, though, the machine just became very slow before the drive was removed.

I never said that SAS drives never fail. Then again, this is an unrelated comparison, since a drive failure on deployment usually means a manufacturer defect, and it's much better this way than having drives die from exhaustion every few months.

SAS drives are simply better, tougher, more reliable and, most important of all, made exactly for this kind of job.

Raptor drives are great, but I think they have proven that they won't do for this job anymore.

That is, if Pluto's curse doesn't decide to prove otherwise.

Joking aside, I have worked with Raptor drives and they don't have much thermal efficiency, not to mention that 10K RPM for a 2.5" drive isn't the best design for reliability.

I still think SAS is the way to go on this issue, and if I were you I would watch Titan very closely.

Edited February 24, 2010 by asambler
