SNG119 Massive load - recurrent problem

Ozgeek · October 15, 2019

For the last 2 days my accounts on SNG119 (reseller) have been going down in association with a massive server load and memory usage. At no stage has the server status page acknowledged an issue.

The first time, I got a message that they had recovered without intervention (FSN-257-26805). It did not explain the issue, or why no one appeared to notice before I notified them (automatic monitoring?).

The next time - after almost 4 hours - described by support as "brief"! (SKY-864-65269) - an "issue" was "fixed" - but the problem persists. Again no explanation. It is shared hosting so I assume someone's account was using massive resources or similar?

Couple of points/questions:

1) Do you have an automated monitoring system to keep an eye on the servers or do you rely on clients telling support what's going on?
2) Is the status page in any way relevant/accurate? It still shows that there were no issues (which there obviously were/are).
3) It would be reassuring to be told what the issue was and what steps had been taken to mitigate further problems.

Can someone please look into this? It is very disconcerting to be fed standard lines with no explanations and a dud server status page.

Thanks

Tim

Ozgeek · October 15, 2019

Well, whatever the issue was, it has now been resolved (thanks Tony). I see that the server status page has finally been updated to reflect the issue albeit many hours after the problems started to happen. I assume it is a manual update process which is only really of value retrospectively.

Irritatingly, there is still no explanation of what happened and no answers to the specific questions I asked in my ticket.

OOPS spoke too soon - server load back up to 300 and error 503 - inaccessible... Intermittent problem

tgonhawk1 · October 16, 2019

You can request being moved to another server
(which may or may not help, depending on what the underlying problem is).

If it is another user hogging resources, you'll get away from that,
but there is no guarantee the other server won't have similar problems.

I agree that hosts should act on their own when problems like this occur.
Unlike isolated users, they are in a position to see what is going on server-wide.
"Proactively" is the buzzword du jour I believe.

Ozgeek · October 16, 2019

3 hours ago, tgonhawk1 said:

You can request being moved to another server
(which may or may not help, depending on what the underlying problem is).

If it is another user hogging resources, you'll get away from that,
but there is no guarantee the other server won't have similar problems

To be fair, I have never had this issue before (which is why it is so frustrating now!). In the past I have always expected HH to do the right thing about monitoring resource usage etc. and ensuring customers abide by their TOS. Accordingly, I suspect that the issue NOW was more likely a rogue process eating up resources. It would be nice to know though!!

Brian · October 16, 2019

Hello,

I apologize for the issues you've seen on this server. Two days ago we did see a big load spike on the server which was resolved with a reboot. We believe that was a kernel bug as the problem has not recurred since.

The downtime you saw yesterday was due to maintenance that was expected to have no service impact at all, but a fault in a device caused it to affect service. As to your questions:

1) We have both internal and external monitoring on all servers. This notifies us if a server sees a spike in CPU usage or other resources, for example, and also if a server goes completely offline.
2) The status page is used for extended or large outages. We would not post a notice, though, if just a single server is down or showing high load. Yesterday, for example, during the Singapore outage we did have a notice up.
3) As mentioned above, you saw two separate issues: one from the kernel bug and the other from the maintenance.
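
For readers wondering what the kind of check in point 1 might look like, here is a minimal sketch in Python. It is purely illustrative and is not the host's actual monitoring: the function name `check_server`, the `load_factor` threshold, and the alert strings are all hypothetical. It flags a server that is unreachable, or whose 1-minute load average is well above its CPU count (the load of 300 reported earlier in this thread would trip it on any ordinary machine).

```python
import os


def check_server(load_1min, cpu_count, reachable, load_factor=4.0):
    """Return a list of alert strings for one server.

    load_factor is a hypothetical threshold: alert when the 1-minute
    load average exceeds cpu_count * load_factor.
    """
    if not reachable:
        # An offline server cannot report load, so this is the only alert.
        return ["offline"]
    alerts = []
    if load_1min > cpu_count * load_factor:
        alerts.append("high load")
    return alerts


if __name__ == "__main__":
    # Local example (Unix only): check the machine this script runs on.
    load_1min, _, _ = os.getloadavg()
    print(check_server(load_1min, os.cpu_count() or 1, reachable=True))
```

A real setup would run a check like this from outside the server as well as on it, since a machine that is offline or overloaded may not be able to report on itself.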

It's worth noting we did bring additional servers online to help manage some other load spikes we saw in the Singapore cloud, so there is quite a bit more available CPU and RAM now. At this point you should be seeing the performance you've come to expect from that plan.

Sorry again for the troubles and I appreciate your patience throughout all this.

Ozgeek · October 17, 2019

Much appreciated Brian - that's the sort of input I was hoping to get.
