Ozgeek Posted October 15, 2019

For the last 2 days my accounts on SNG119 (reseller) have been going down, coinciding with massive server load and memory usage. At no stage has the server status page acknowledged an issue.

The first time, I got a message that they had recovered without intervention (FSN-257-26805) - which did not explain the issue or the fact that no one appeared to notice before I notified them (automatic monitoring?).

The next time - after almost 4 hours, described by support as "brief"! (SKY-864-65269) - an "issue" was "fixed", but the problem persists. Again no explanation.

It is shared hosting, so I assume someone's account was using massive resources or similar?

Couple of points/questions:

Do you have an automated monitoring system to keep an eye on the servers, or do you rely on clients telling support what's going on?

Is the status page in any way relevant/accurate? It still shows that there were no issues (which there obviously were/are).

It would be reassuring to be told what the issue was and what steps had been taken to mitigate further problems.

Can someone please look into this? It is very disconcerting to be fed standard lines with no explanations and a dud server status page.

Thanks
Tim
Ozgeek Posted October 15, 2019

Well, whatever the issue was, it has now been resolved (thanks Tony). I see that the server status page has finally been updated to reflect the issue, albeit many hours after the problems started. I assume it is a manual update process, which is only really of value retrospectively. Irritatingly, there is still no explanation of what happened and no answers to the specific questions I asked in my ticket.

OOPS, spoke too soon - server load back up to 300 and error 503 - inaccessible... Intermittent problem.
tgonhawk1 Posted October 16, 2019

You can request being moved to another server (which may or may not help, depending on what the underlying problem is). If it is another user hogging resources, you'll get away from that, but there is no guarantee the other server won't have similar problems.

I agree that hosts should act on their own when problems like this occur. Unlike isolated users, they are in a position to see what is going on server-wide. "Proactively" is the buzzword du jour, I believe.
Ozgeek Posted October 16, 2019

3 hours ago, tgonhawk1 said: "You can request being moved to another server (which may or may not help, depending what the underlying problem is). If it is another user hogging resources, you'll get away from that, but there is no guarantee the other server won't have similar problems"

To be fair, I have never had this issue before (which is why it is so frustrating now!). In the past I have always expected HH to do the right thing about monitoring resource usage etc. and ensuring customers abide by their TOS. Accordingly, I suspect that the issue NOW was more likely a rogue process eating up resources. It would be nice to know though!!
Brian Posted October 16, 2019

Hello,

I apologize for the issues you've seen on this server. Two days ago we did see a big load spike on the server, which was resolved with a reboot. We believe that was a kernel bug, as the problem has not recurred since. The downtime you saw yesterday was due to maintenance that was believed to have absolutely no service impact, but a fault in a device caused it to have an impact.

As to your questions:

1) We have both internal and external monitoring on all servers. This notifies us if a server sees a spike in CPU usage or other resources, for example, and also if a server goes completely offline.

2) The status page is used for extended or large outages. We would not post a notice, though, if just a single server is down or showing high load. Yesterday, for example, during the Singapore outage we did have a notice up.

3) Mentioned above, but you saw two separate issues: one from the kernel bug and the other from the maintenance.

It's worth noting we did bring additional servers online to help manage some other load spikes we saw in the Singapore cloud, so there is quite a bit more available CPU and RAM now. At this point you should be seeing the performance you've come to expect from that plan.

Sorry again for the troubles, and I appreciate your patience throughout all this.
Ozgeek Posted October 17, 2019

Much appreciated Brian - that's the sort of input I was hoping to get.