Disk space reminder


President-Anonymous


We have monitoring set up for each partition on every system, with thresholds that alert when the remaining drive space reaches a warning level and a critical level. Should a partition fill up for whatever reason (this happens very, very rarely), we also have safeguards in place to clear space on the partition so the system continues to function.
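The post doesn't say what those safeguards actually do. As a purely hypothetical illustration of this kind of cleanup, here is a minimal Python sketch. It deletes the oldest regular files in a single directory, assumed to be safe to purge (for example, old rotated logs), until free space on that partition reaches a target. The function names, the directory, and the target are assumptions, not the host's actual setup.

```python
import os
import shutil
from typing import List, Tuple

def plan_cleanup(files: List[Tuple[str, float, int]], free_bytes: int, target_free: int) -> List[str]:
    """Choose the oldest files to delete until free space would reach target_free.

    `files` holds (path, mtime, size_in_bytes) tuples for files assumed safe to purge.
    """
    to_delete: List[str] = []
    for path, _mtime, size in sorted(files, key=lambda f: f[1]):  # oldest first
        if free_bytes >= target_free:
            break
        to_delete.append(path)
        free_bytes += size  # apparent size; actual blocks freed may differ
    return to_delete

def cleanup_directory(directory: str, target_free: int, dry_run: bool = True) -> List[str]:
    """Purge the oldest regular files in `directory` (non-recursive) to free space on its partition."""
    candidates = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                candidates.append((entry.path, st.st_mtime, st.st_size))
    free = shutil.disk_usage(directory).free
    victims = plan_cleanup(candidates, free, target_free)
    if not dry_run:
        for path in victims:
            os.remove(path)
    return victims
```

`dry_run` defaults to True so the list can be reviewed before anything is removed. Also note that deleting a file only frees its space once no process still holds it open, so a runaway writer can keep a partition full even after a cleanup like this runs.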

One other thing to note is that percentages can be misleading depending on the size of the partition. You could get a warning at 80% usage while there is still 10GB free on that partition, which may be 5x more space than that particular partition should ever need.
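To illustrate the point about percentages, here is a minimal Python sketch that classifies a partition by absolute free space against per-partition thresholds and reports percent used only for display. The threshold values (2 GiB warning, 512 MiB critical) are hypothetical. `shutil.disk_usage` is the standard-library call for reading a partition's total and free bytes.

```python
import shutil
from typing import Tuple

MIB = 1024 ** 2
GIB = 1024 ** 3

def percent_used(total: int, free: int) -> float:
    """Percent of the partition in use (reserved blocks count as used here)."""
    return 100.0 * (total - free) / total

def classify(free: int, warn_free: int, crit_free: int) -> str:
    """Alert on absolute free bytes rather than on a percentage."""
    if free <= crit_free:
        return "CRITICAL"
    if free <= warn_free:
        return "WARNING"
    return "OK"

def check_partition(path: str, warn_free: int = 2 * GIB, crit_free: int = 512 * MIB) -> Tuple[str, float]:
    """Return (status, percent_used) for the partition containing `path`."""
    usage = shutil.disk_usage(path)
    return classify(usage.free, warn_free, crit_free), percent_used(usage.total, usage.free)

if __name__ == "__main__":
    status, pct = check_partition("/var")
    print(f"/var: {status} ({pct:.1f}% used)")
```

With a hypothetical 50 GiB partition and 10 GiB free, `percent_used` returns 80.0, so an alert keyed to 80% would fire, while `classify` still reports OK under these thresholds.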


  • 9 months later...


AMS001 just hit 100% usage on /var. The result was 500 errors and white pages. A little while later it dropped to 97% and the sites came back online. The page still doesn't make for good viewing, but at least things are working again.

The reason I mention this is:

1) How can you lot let it get to 100%?

2) What were the "safeguards"? To be honest, I don't think they worked.

For weeks I have been watching it get closer and closer to 100%, but I kept this topic in mind and thought surely they would not let it reach 100%. Unfortunately, I was wrong.

Seriously guys... You need to up your game a bit. This is just the latest of a long line of issues with you. I really would struggle to recommend you lot to my worst enemies.


1) How can you lot let it get to 100%?

It doesn't take much, unfortunately. I'm not going to make excuses and say that sitting at 95% or higher is okay, but the percentage matters less than the actual amount of disk space left. Even at 97% full, /var still has over 4GB free. AMS001 is not being used for new accounts, so in theory there is no reason for usage on any given partition to grow rapidly in a short period of time. How does it happen, then? There are plenty of reasons, most of which are one-off issues that get fixed once we identify them. In this instance our backup software went haywire and wrote a large amount of data in a short amount of time. I'd venture a guess that even if there had been 10GB free on /var, we might still have hit 100% at the rate it was writing to the partition. Mistakes happen, but this one wasn't due to a lack of due diligence.
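The point that 10GB might not have been enough comes down to simple arithmetic: time until full is free space divided by the write rate. Below is a minimal Python sketch of a linear projection from two usage samples. The 3000 MiB-per-minute figure is hypothetical, not a measurement of the backup software in question. The same projection works for slow growth over weeks, just with a much smaller rate.

```python
from typing import Optional

MIB = 1024 ** 2
GIB = 1024 ** 3

def write_rate(t0: float, used0: int, t1: float, used1: int) -> float:
    """Average growth in bytes per second between two usage samples."""
    if t1 <= t0:
        raise ValueError("samples must be in chronological order")
    return (used1 - used0) / (t1 - t0)

def seconds_until_full(free_bytes: int, rate_bytes_per_s: float) -> Optional[float]:
    """Linear projection; returns None if usage is flat or shrinking."""
    if rate_bytes_per_s <= 0:
        return None
    return free_bytes / rate_bytes_per_s

if __name__ == "__main__":
    # Hypothetical: 3000 MiB written in one minute, with 10 GiB free.
    rate = write_rate(0.0, 0, 60.0, 3000 * MIB)
    eta = seconds_until_full(10 * GIB, rate)
    print(f"{rate / MIB:.0f} MiB/s -> full in {eta:.0f} s")
```

At that hypothetical 50 MiB/s, 10 GiB lasts roughly 205 seconds, which leaves very little time for anyone to react to an alert.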

 

2) What were the "safeguards"? To be honest, I don't think they worked.

You're right, in this case they didn't. They do work 99% of the time, though, and that is why we have so little downtime and so few persistent server issues. I hate to say it, but in this case even our safeguards couldn't have prevented the issue; the backup software was simply on a mission to fill that partition.

For weeks I have been watching it get closer and closer to 100%, but I kept this topic in mind and thought surely they would not let it reach 100%. Unfortunately, I was wrong.

I don't see the harm in giving us a nudge if you notice something like this. You ultimately stand to benefit if you report an issue that might cause downtime for your website at some point.

Seriously guys... You need to up your game a bit. This is just the latest of a long line of issues with you. I really would struggle to recommend you lot to my worst enemies.

I'd like to know more about this. I think we're having fewer issues now than we were 6 months ago, and the issues we do have are handled more efficiently and with a stronger focus on customer service than before. I'm proud of the changes we've made, and I definitely don't think we've taken a step back in any aspect of our operation.

I think you've been around long enough to know we don't sweep issues under the rug. When something is our fault, we hold ourselves accountable and do our damnedest to fix it permanently. I'm not happy that this happened, and I don't like seeing customers as upset as you are, but at the same time I realize we're not going to have 100% uptime 365 days a year. I think we fixed this very quickly, and we have already begun to address the problem.

Feel free to PM me if you want to continue the discussion. I'm all ears.
