http://www.dreamhoststatus.com/2008/03/27/filer-problems-with-blingy-cluster/
Filer problems with blingy cluster
We are currently having a problem with a filer which has crashed and is recovering at this time. While this is happening some customers in the blingy cluster will experience problems loading their websites/email. We apologize for the outage and service is expected to return to normal as soon as the filer recovers.
UPDATE 3:01 AM PDT
The filer has finished recovering and all services are back up and running. We are working with the filer vendor to find the source of the crash to prevent any further outages.
Update 24/03/08 10am: We're working on the file server again to alleviate the load that's causing problems with web, mail and mysql services. Sorry about that.
Update 27/03/08: We are doing emergency data moves to stem the problems recently caused by your file server. During these moves, your data may be inaccessible. We are moving data off as fast as possible. Very sorry about the continued inconvenience!
Update 27/03/08: This series of moves has finished. We are going to keep an eye on things to see how much it helped, and we may have to do more moves tonight and tomorrow morning to get everything working smoothly again. This post will be updated with more information as soon as possible.
Update 29/03/08
We are continuing to move data off the problematic file server, but it's a bit of a catch-22 because customers on that machine are continuing to add data at a very high rate. It filled up for a while this morning, causing device-full errors as well as mail problems and issues serving websites (when a file server fills up, it causes problems across the board).
To explain in more detail: when we move data, it does not immediately disappear from the old server. A 'snapshot' of the old data remains in case there was a problem with the move. That ensures we do not lose customer data, but until the admin team can check that the move went through properly, we cannot delete the old data. We just cleared some of that old data and have some breathing room again, and of course more moves are still in progress.
We are asking customers on this cluster to help us by holding off on any non-essential uploads for the next couple of days. As soon as a significant portion of the data has been removed, the problematic file server will begin to function properly (and additional moves will go much more quickly and smoothly), but right now we're having trouble moving data off more quickly than it's being added. If everyone could please limit uploads to absolutely essential data until we reach that turning point, this will be resolved much more quickly. For example, if you are setting up a repository of large files, you'll actually be better off waiting a couple of days for the all clear from us, because you'll then be able to access that data reliably instead of cramming it on there now and slowing the recovery process.
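For anyone curious why moving data off does not free space right away, here is a rough Python sketch of the snapshot behaviour described above. It is only an illustration: the `Filer` class, its method names, and every number in it are made up for this example and do not reflect the real file server or its capacity.

```python
from dataclasses import dataclass


@dataclass
class Filer:
    """Toy model of a file server; all names and sizes are hypothetical."""
    capacity_gb: float
    live_gb: float             # data customers currently see on this server
    snapshot_gb: float = 0.0   # moved data still held in unverified snapshots

    @property
    def free_gb(self) -> float:
        # Snapshots count against capacity just like live data.
        return self.capacity_gb - (self.live_gb + self.snapshot_gb)

    def upload(self, gb: float) -> None:
        # New customer data lands on the server immediately.
        self.live_gb += gb

    def move_off(self, gb: float) -> None:
        # The data leaves the live set but is kept as a snapshot,
        # so free space does not change yet.
        self.live_gb -= gb
        self.snapshot_gb += gb

    def verify_and_delete_snapshots(self) -> float:
        # Only after the move has been checked is the space released.
        freed = self.snapshot_gb
        self.snapshot_gb = 0.0
        return freed


filer = Filer(capacity_gb=1000.0, live_gb=950.0)
filer.move_off(100.0)
print(filer.free_gb)   # still 50.0: the snapshot holds the space
filer.upload(40.0)
print(filer.free_gb)   # 10.0: uploads keep eating into what is left
filer.verify_and_delete_snapshots()
print(filer.free_gb)   # 110.0: space returns only after verification
```

In this toy model, the only things that raise free space are verified deletions, while every upload lowers it straight away, which is why limiting uploads helps more than it might seem.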
In the meantime we'll be doing everything we can to safely and quickly move data off and get things back to normal.
Added information: Some of the people recently moved to the new file server are seeing errors because the data did not get set up completely (loading the site will work but just show an empty index). The admin team has been running an rsync that will fully restore all data and should hopefully finish by 9 PM PST - once that is finished all site and email data will be available for those users.
Update 30/03/08
We're still racing to keep ahead of new data being added, so any help we can get on that front is greatly appreciated (we're still asking customers to limit uploads as much as possible to speed up the recovery process). Some customers who are being moved are still seeing blank directories, but those are due to moves in progress and the data will be fully restored when those moves complete.
Severity: High | Resolved: No