Server outages

Message boards : Number crunching : Server outages

BarryAZ
Joined: 21 Aug 09
Posts: 251
Credit: 13,219,203,758
RAC: 23,562,759
Message 21787 - Posted: 28 Nov 2015, 16:11:02 UTC

I'm wondering, given the increasingly frequent database server crashes, whether something might be done to make them planned instead of unplanned, and thus much shorter in duration.

I get it that with the very short sieve units, the processing load has increased a lot.

My own suspicion, perhaps ill-informed, is that the database server is encountering some memory leak (as I suspect it always had), which is made worse by the higher volume processing.

Since it appears that resolving the actual problem is not an option for whatever reason, how about pre-empting it?

My (admittedly novice) suggestion would be a pair of scripts.

One would take down the database server *gracefully* at a programmed time of day (perhaps every day).

The other would restart the database server about 10 minutes later.

Perhaps something along these lines would restore the server to a 'memory clean slate' each cycle.
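A rough sketch of that pair of scripts (purely illustrative on my part -- I'm assuming a Linux host where the database runs as a systemd service named mysql, and picking an arbitrary quiet hour) could be two system crontab entries:

```
# /etc/crontab fragment -- the service name "mysql" and the 03:50/04:00
# window are assumptions; use the server's actual service name and its
# quietest hour
50 3 * * * root /usr/bin/systemctl stop mysql    # graceful shutdown, flushes buffers to disk
0  4 * * * root /usr/bin/systemctl start mysql   # bring it back ten minutes later
```

A BOINC-aware variant would also stop the project daemons first, so the scheduler isn't trying to serve requests while MySQL is down.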

Just a thought from one of the users.

James Lee*
Joined: 10 Sep 15
Posts: 27
Credit: 4,284,523,040
RAC: 1,317,793
Message 21836 - Posted: 14 Dec 2015, 3:55:28 UTC

Since there are a lot of server outages, is there any way to download more than 5 hours of work? I end up with more downtime here during the outage recoveries. It seems I can't get more than 50 tasks queued per GPU -- which is only about 5 hours of queue depth.

BarryAZ
Joined: 21 Aug 09
Posts: 251
Credit: 13,219,203,758
RAC: 23,562,759
Message 21837 - Posted: 14 Dec 2015, 6:17:33 UTC - in response to Message 21836.
Last modified: 14 Dec 2015, 6:17:46 UTC

James, not for now.

Slicker (he's the project's one-man operation) is working on work units with significantly longer run times. When those arrive, things may change.

For now, inasmuch as there is an outage every 36 to 60 hours, and each outage lasts anywhere from 6 to 20 hours, the deal is to have a secondary GPU project.

I have three different GPU projects I can shift to (depending on the GPU): GPUGrid for the NVIDIA GPUs, and Moowrap and Poem for the ATI GPUs.

You can configure the resource shares to use the other projects during outages -- though that doesn't work well for GPUGrid, as it offers long work units with short deadlines.

On my GTX 750 Tis, a GPUGrid unit runs around 24 hours.

Collatz work units run about 10 minutes on that GPU -- so when there is a Collatz outage I have about 8 to 9 hours of queued work.

I typically catch the outage within about 4 hours or so, and then simply shift over to the other projects manually (and suspend processing on Collatz).

mikey
Joined: 11 Aug 09
Posts: 3242
Credit: 1,687,981,267
RAC: 6,076,082
Message 21838 - Posted: 15 Dec 2015, 12:10:40 UTC - in response to Message 21837.


I typically catch the outage within about 4 hours or so, and then simply shift over to the other projects manually (and suspend processing on Collatz).


If you set your backup GPU (or even CPU) projects to a zero resource share, BOINC will run them automatically: when it needs more work units it checks here first, and only if it doesn't find any work here does it fetch work from your backup project.

Through a simple <exclude_gpu> set of lines inside the <options> section of a cc_config.xml, like this:
<exclude_gpu>
  <url>http://moowrap.net/</url>
  <device_num>1</device_num>
  <type>NVIDIA</type>
</exclude_gpu>

then you can have a backup project for each kind of GPU, AMD or NVIDIA. The lines above exclude NVIDIA device 1 from the Moowrap project. The other options for the line specifying the type of GPU are:
<type>NVIDIA|ATI|intel_gpu</type>

Just keep the one you want in there. The ONE thing to be careful about is the <device_num> line above the GPU type: each GPU has its own number, and the number must match the device you mean to exclude.
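For context (my own sketch, based on the standard BOINC client configuration layout): the <exclude_gpu> block sits inside the <options> section of cc_config.xml in the BOINC data directory, so a complete minimal file using the values from the example above would look like this:

```xml
<cc_config>
  <options>
    <!-- keep NVIDIA device 1 away from Moowrap; the device stays
         available to every other attached project -->
    <exclude_gpu>
      <url>http://moowrap.net/</url>
      <device_num>1</device_num>
      <type>NVIDIA</type>
    </exclude_gpu>
  </options>
</cc_config>
```

After saving it, tell the client to reread the file (in BOINC Manager's Advanced view, Options -> Read config files) or just restart the client.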

BarryAZ
Joined: 21 Aug 09
Posts: 251
Credit: 13,219,203,758
RAC: 23,562,759
Message 21839 - Posted: 15 Dec 2015, 16:46:26 UTC - in response to Message 21838.

Mikey, I've considered that -- but GPUGrid in particular is something of a sticking point: the work units there run between 22 and 36 hours or so and have short deadlines, so I need to intervene manually there anyway.

Also, to be honest, I like staying active in the other projects to a degree -- it's a combination of supporting them over the long haul *and* rewarding projects that, unlike Collatz, are not in repeated-crash mode.

For others, the approach you and others have suggested likely has merit.

Regarding the Collatz crashes, there seems to be a bit more of a pattern of late.

That is, it seems a large proportion of the crashes happen late at night (or early in the morning), between about 3 AM and 6 AM PST. When they happen at that hour, the recovery cycle appears to run for 22 hours, until between 1 AM and 4 AM the following day.

The variation at the moment is whether the up cycle is one day or two -- it hasn't been longer than that for at least a week. Perhaps more people are running Collatz, increasing the stress on the server.

A few weeks ago, Collatz was running 3 to 5 days prior to a crash.

I suppose if people back off (perhaps due to frustration), the uptime will increase from the current effective 50% to 67%.

If Slicker is able to get the long run units up and running, then the up time is likely to increase (at least we can hope).

I do believe it is important to realize that the underlying problem (SQL unresponsive) -- which I believe is due to something of a stress-related memory leak -- remains for Collatz, as it has for a very long time. It is simply that the higher workload for the project is increasing the frequency of the server crashes.

jjwhalen
Joined: 17 Apr 10
Posts: 20
Credit: 202,526,055
RAC: 0
Message 21840 - Posted: 15 Dec 2015, 17:26:09 UTC - in response to Message 21839.

but GPUGrid in particular is something of a sticking point -- the work units there run between 22 and 36 hours or so and have short deadlines.


It also doesn't help that GPUGRID currently has little to no work available -- mostly the latter. Personally I use PrimeGrid/PPS Sieve as my GPU reserve project: it has an effectively endless supply of numbers to test, and server uptime is well above 99%. But whatever works for you.
____________
Best wishes:)

BarryAZ
Joined: 21 Aug 09
Posts: 251
Credit: 13,219,203,758
RAC: 23,562,759
Message 21841 - Posted: 15 Dec 2015, 18:16:45 UTC - in response to Message 21840.

With GPUGrid the work is available if you regularly ping for it -- but I do have MooWrap and Poem for alternatives as well. Also MilkyWay on a couple of systems, and for that matter some PrimeGrid as well.

James Lee*
Joined: 10 Sep 15
Posts: 27
Credit: 4,284,523,040
RAC: 1,317,793
Message 21842 - Posted: 15 Dec 2015, 22:56:28 UTC
Last modified: 15 Dec 2015, 22:56:54 UTC

Thanks ALL for your help. I do use others, e.g. PrimeGrid, as filler, but I'll try a few more of your suggestions. Also, lol, I occasionally end up with a full page of tasks sitting at "Ready to report" even though the systems are running, so I have to do a manual update. Is that common?

BarryAZ
Joined: 21 Aug 09
Posts: 251
Credit: 13,219,203,758
RAC: 23,562,759
Message 21843 - Posted: 15 Dec 2015, 23:54:23 UTC - in response to Message 21842.
Last modified: 15 Dec 2015, 23:55:25 UTC

The need for a manual update can happen when the server has only recently come back online -- there is an automatic backoff timer.

The first time the client finds the feeder offline, you get a one-hour deferral; as the outage progresses through the day, the deferral grows with each failed ping -- sometimes to 2 or 3 hours.

If the server comes back online before that backoff timer has lapsed, you'll need to update manually.

For me, that also happens when I suspend Collatz during an outage to let the back up projects process (especially GPUGrid with its long work unit process times).
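The deferral pattern described above can be sketched as a capped exponential backoff. This is only an illustration, not BOINC's actual client code -- the base of one hour, the cap, and the jitter range are my own assumptions:

```python
import random

def next_deferral(n_failures: int, base: float = 3600.0,
                  cap: float = 4 * 3600.0) -> float:
    """Seconds to defer the next scheduler request after n failed contacts."""
    raw = base * (2 ** n_failures)   # deferral doubles with each failed ping
    return min(cap, raw)             # but never grows past the cap

def next_deferral_jittered(n_failures: int) -> float:
    # real clients randomize the wait so thousands of hosts don't all
    # hammer a recovering server at the same instant
    return random.uniform(0.5, 1.0) * next_deferral(n_failures)
```

This is why a manual update helps: it skips whatever remains of the current deferral instead of waiting it out.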

The test for the next day here -- will MySQL barf tonight -- or will it barf tomorrow night?

mikey
Joined: 11 Aug 09
Posts: 3242
Credit: 1,687,981,267
RAC: 6,076,082
Message 21844 - Posted: 16 Dec 2015, 11:48:57 UTC - in response to Message 21842.

Thanks ALL for your help. I do use others, e.g. PrimeGrid, as filler, but I'll try a few more of your suggestions. Also, lol, I occasionally end up with a full page of tasks sitting at "Ready to report" even though the systems are running, so I have to do a manual update. Is that common?


That depends... it could be, as Barry suggested, just a time delay that you override by manually forcing an update, or it could be that your PC just doesn't need any new work right now, so it isn't reporting the completed units until you do. Every connection to the server takes time, and that time multiplied by 10,000+ users adds up very quickly, especially if each of them connects multiple times per day. Each connection puts load on the server, so BOINC is designed to connect only when it needs to, which works without a problem for most of us. When you finish a work unit, the result files get uploaded, but the small "task completed successfully" report is held back until the client has another reason to contact the server.
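The decision mikey describes can be sketched roughly like this. It is an illustrative toy, not BOINC's actual scheduler logic: the function names and the one-hour reporting slack are my own assumptions.

```python
import time

REPORT_SLACK = 3600.0  # assumed safety margin before a report deadline

def should_contact_server(work_buffer_secs, min_buffer_secs,
                          completed_deadlines, now=None):
    """Contact the server only when the work cache is low or a finished
    task is close to its report deadline; finished tasks piggyback on
    whatever request is eventually made."""
    now = time.time() if now is None else now
    if work_buffer_secs < min_buffer_secs:
        return True  # need more work, report completed tasks while we're at it
    # otherwise report early only if a finished task's deadline is near
    return any(d - now < REPORT_SLACK for d in completed_deadlines)
```

So a full "Ready to report" page with a well-stocked queue is normal: the client simply has no reason to call home yet.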

mikey
Joined: 11 Aug 09
Posts: 3242
Credit: 1,687,981,267
RAC: 6,076,082
Message 21845 - Posted: 16 Dec 2015, 11:50:36 UTC - in response to Message 21843.

The need for a manual update can happen when the server has only recently come back online -- there is an automatic backoff timer.

The first time the client finds the feeder offline, you get a one-hour deferral; as the outage progresses through the day, the deferral grows with each failed ping -- sometimes to 2 or 3 hours.

If the server comes back online before that backoff timer has lapsed, you'll need to update manually.

For me, that also happens when I suspend Collatz during an outage to let the back up projects process (especially GPUGrid with its long work unit process times).

The test for the next day here -- will MySQL barf tonight -- or will it barf tomorrow night?


It's cold, meaning it could be hunting season; Slicker could be in the woods.

BarryAZ
Joined: 21 Aug 09
Posts: 251
Credit: 13,219,203,758
RAC: 23,562,759
Message 21846 - Posted: 17 Dec 2015, 4:01:40 UTC - in response to Message 21845.
Last modified: 17 Dec 2015, 4:27:52 UTC

Yeah -- I know that -- and he doesn't take his server with him <rueful smile>

Maybe he's hunting for primes...







Copyright © 2018 Jon Sonntag; All rights reserved.