Near permanent calc error

Message boards : Windows : Near permanent calc error

Philippe85
Joined: 19 Nov 10
Posts: 15
Credit: 114,642,116
RAC: 0
Message 23712 - Posted: 10 Jan 2017, 9:32:24 UTC

All the tasks that fail at the same moment report an identical memory address, but the older ones do not report the same address.

To be safe, I tested the memory (with a software test) and found no problems.

Thanks for your answer

Philippe85
Joined: 19 Nov 10
Posts: 15
Credit: 114,642,116
RAC: 0
Message 23714 - Posted: 10 Jan 2017, 12:02:31 UTC

To complete the diagnosis: the error appears when the task comes out of suspension.

The last one occurred after five days of calculation.

step2000
Joined: 1 Aug 13
Posts: 96
Credit: 1,480,279,400
RAC: 1,948,208
Message 23715 - Posted: 10 Jan 2017, 13:29:32 UTC

So your computer is set to sleep, and when you wake it up it hangs? If that is the case, the issue is how your system writes its sleep state to the drive. Try this:

Set the computer to never sleep and let only the monitor sleep. See if that works. If it does, you'll need to check whether the hard drive used in your system has a newer replacement that can handle the writes back at a slow rate.

Philippe85
Joined: 19 Nov 10
Posts: 15
Credit: 114,642,116
RAC: 0
Message 23740 - Posted: 13 Jan 2017, 17:32:41 UTC
Last modified: 13 Jan 2017, 17:33:46 UTC

Windows is not sleeping. This is BOINC suspension, when I use it (pause).

My workstation never sleeps (special use and synchronization).

Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 23742 - Posted: 13 Jan 2017, 18:55:42 UTC

When the BOINC client "suspends" the workunits, it doesn't actually do that. Rather, it first tries to tell the application to exit properly. If the application doesn't respond quickly enough, BOINC kills the application. When a GPU application is killed, the GPU may not free the memory properly. When BOINC "resumes" the GPU app, it is actually restarting it. If the app didn't free up the memory when "suspended" then it may crash when it tries to allocate the memory a second time and there isn't enough.

The choice to kill rather than suspend GPU apps was done because some people feel the need to run a bazillion projects at a time. When BOINC suspended the GPU apps and switched to another project, there were times when there wasn't enough memory for that project and it would run out of memory. IMO, the issue is that BOINC should wait longer before killing the app. When the app exits properly, it deallocates all the memory and gives the GPU the appropriate commands to free up any RAM it allocated as well. While Windows will keep track of the system RAM and make sure it gets freed, I don't think it has the ability to do the same for the GPU RAM since the drivers are often unique per GPU.
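To make the "ask to exit, then kill" sequence concrete, here is a minimal Python sketch. It is only an illustration and not BOINC's actual code: the child process, the `request_quit_then_kill` helper, the "quit" message sent over stdin, and both timing values are assumptions made up for this example. The child stands in for a GPU app that needs some time to finish its kernels and free memory before it can exit.

```python
import subprocess
import sys

# Hypothetical stand-in for a GPU app: it waits for a quit request on stdin,
# then needs some time (argv[1] seconds) before it can exit cleanly.
CHILD = (
    "import sys, time\n"
    "sys.stdin.readline()\n"
    "time.sleep(float(sys.argv[1]))\n"
)


def request_quit_then_kill(cleanup_seconds, grace_seconds):
    """Ask the child to quit; kill it if it has not exited after grace_seconds.

    Returns "exited" if the child finished on its own, "killed" otherwise.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(cleanup_seconds)],
        stdin=subprocess.PIPE,
    )
    try:
        proc.stdin.write(b"quit\n")  # the polite request
        proc.stdin.flush()
        proc.wait(timeout=grace_seconds)
        return "exited"
    except subprocess.TimeoutExpired:
        # The client ran out of patience: the child never reaches its cleanup.
        proc.kill()
        proc.wait()
        return "killed"
    finally:
        proc.stdin.close()


if __name__ == "__main__":
    print(request_quit_then_kill(cleanup_seconds=0.0, grace_seconds=5.0))
    print(request_quit_then_kill(cleanup_seconds=30.0, grace_seconds=0.5))
```

In the second call the child is killed before it gets to its own cleanup, which is the situation where GPU memory may be left allocated. A longer grace period is the only thing that changes the outcome.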

If it happened randomly, I'd agree it is a memory stick but since it always happens when resuming, it is likely due to the app being killed and not being able to resume.

Another possibility is the way the driver optimizes the GPU code when OpenCL compiles it at runtime. If the driver gets too aggressive with the optimization, it may try to reuse a variable rather than create a new one and, as has been seen for years with the Mac + AMD combination, the code won't run properly unless all optimization is turned off, which slows it down by 100 times or more.

valterc
Joined: 21 Sep 09
Posts: 39
Credit: 14,533,614,159
RAC: 15,422,468
Message 23743 - Posted: 13 Jan 2017, 20:34:07 UTC - in response to Message 23742.
Last modified: 13 Jan 2017, 20:36:24 UTC

I also did a lot of checking around after getting this kind of error. The error appears (sometimes) when the task is suspended and restarted (for BOINC-related or personal reasons). The common (good) behavior in stderr is the following:


...
Resuming at 3093177794165886418944
actual threads 64
Suspending...
Collatz Config Settings:
...

But sometimes the "Suspending..." line is missing and the task errors out afterwards; see this one: https://boinc.thesonntags.com/collatz/result.php?resultid=124541366
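For anyone who wants to check their own results the same way, here is a rough Python sketch that scans a task's stderr text for restarts that were not preceded by a "Suspending..." line. It rests on an assumption taken only from the excerpt above: that "Collatz Config Settings:" is printed once each time the app starts. The function name `count_unclean_restarts` is hypothetical, and the sample text in the usage note is made up from the lines shown above.

```python
def count_unclean_restarts(stderr_text):
    """Count app starts that follow a run with no "Suspending..." line.

    Assumes "Collatz Config Settings:" marks each start of the app.
    """
    unclean = 0
    in_run = False              # have we seen any activity from a run yet?
    suspended_cleanly = False   # did the current run print "Suspending..."?
    for raw_line in stderr_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Suspending..."):
            suspended_cleanly = True
        elif line.startswith("Resuming at"):
            in_run = True
        elif line.startswith("Collatz Config Settings:"):
            # A new start while the previous run never announced a suspend.
            if in_run and not suspended_cleanly:
                unclean += 1
            in_run = True
            suspended_cleanly = False
    return unclean


if __name__ == "__main__":
    sample = (
        "Collatz Config Settings:\n"
        "Resuming at 3093177794165886418944\n"
        "actual threads 64\n"
        "Collatz Config Settings:\n"
    )
    print(count_unclean_restarts(sample))
```

A count above zero only says the app was restarted without announcing a suspend. It does not prove that the restart caused the calculation error.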

Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 23744 - Posted: 13 Jan 2017, 22:33:19 UTC - in response to Message 23743.

If there isn't a "Suspending..." line it is because the BOINC client didn't want to wait for the kernel to finish and just killed the app. BOINC's impatience is causing the memory leak. It shouldn't be killing the app since the app sets the "don't suspend me now" flag whenever a kernel is running and clears it when finished. Collatz loads up items_per_reduction number of kernels and then they all run asynchronously on the GPU. There isn't a way to stop them once submitted short of forcing the GPU to reset. The projects which do 80% on the CPU and 20% on the GPU don't run into that problem. They spend the majority of their time copying data back and forth from the host to the GPU. Collatz copies the data initially and then runs kernel after kernel on the GPU with very little data going between the host and the GPU. The better optimized the Collatz app is, the closer it gets to 100% on the GPU and 0% on the CPU and BOINC was never designed with that in mind.
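The "don't suspend me now" flag can be pictured with a small single-threaded Python simulation. This is a sketch under stated assumptions and not the Collatz source: `SuspendGate`, `run_batch`, and the log strings are hypothetical names for illustration. In the real BOINC API the flag is, as far as I know, set and cleared with `boinc_begin_critical_section()` and `boinc_end_critical_section()`. The point is that a suspend request arriving while kernels are queued should be deferred until the batch finishes, not answered with a kill.

```python
from contextlib import contextmanager


class SuspendGate:
    """Defers suspend requests while GPU work is in flight (simulation only)."""

    def __init__(self):
        self.kernels_in_flight = 0
        self.suspend_requested = False
        self.log = []

    @contextmanager
    def critical_section(self):
        # Set the "don't suspend me now" flag for the duration of the batch.
        self.kernels_in_flight += 1
        try:
            yield
        finally:
            self.kernels_in_flight -= 1
            # Flag cleared: honor any request that arrived in the meantime.
            if self.kernels_in_flight == 0 and self.suspend_requested:
                self._suspend()

    def request_suspend(self):
        if self.kernels_in_flight > 0:
            self.suspend_requested = True
            self.log.append("suspend deferred")
        else:
            self._suspend()

    def _suspend(self):
        self.suspend_requested = False
        self.log.append("Suspending...")


def run_batch(gate, items_per_reduction, suspend_at=None):
    """Queue items_per_reduction kernels; optionally request a suspend mid-batch."""
    with gate.critical_section():
        for i in range(items_per_reduction):
            if i == suspend_at:
                gate.request_suspend()
            gate.log.append(f"kernel {i} done")
    return gate.log


if __name__ == "__main__":
    for entry in run_batch(SuspendGate(), items_per_reduction=3, suspend_at=1):
        print(entry)
```

If the client kills the process while the flag is still set, the final "Suspending..." entry is never written, which matches the missing line in the failing results.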

step2000
Joined: 1 Aug 13
Posts: 96
Credit: 1,480,279,400
RAC: 1,948,208
Message 23745 - Posted: 13 Jan 2017, 22:39:53 UTC

Slicker is correct, and the OpenCL side of the driver on AMD cards does a poor job of compiling the packets into the EXE for the project. Try this and see if it helps.

Update the AMD driver. If you already have the latest, go back one or maybe two revisions. I'm not sure which card you are using, as some work great and some fail. Make sure not to overclock your GPU, as this will cause major issues (MAJOR) with the files compiled here (AMD).

If that fails, I'm not sure what else, if anything, will fix this, as tracking down a GPU compile error or address space error is not fun.

Hope this helps!

Philippe85
Joined: 19 Nov 10
Posts: 15
Credit: 114,642,116
RAC: 0
Message 23750 - Posted: 16 Jan 2017, 7:44:39 UTC

Thanks a lot.

I am not sure I understand all the underlying concepts (memory, CUDA ...), but it seems there is nothing I can really do.

I have tried the latest drivers from AMD. Depending on the version, they either do not resolve anything or, for the latest ones, do not use the GPU at all. It is an old card (6 years!).

So I am stopping Collatz and continuing with other projects. I could be back if I change my workstation (not on my roadmap; it takes too much time to configure a new one).

Bye.

valterc
Joined: 21 Sep 09
Posts: 39
Credit: 14,533,614,159
RAC: 15,422,468
Message 23764 - Posted: 20 Jan 2017, 19:38:12 UTC - in response to Message 23750.

So, the problem here occurs when BOINC suspends/kills the application (maybe we could ask the developers to wait a little longer before doing this). Anyway, what I noticed is that I get problems only on the boxes that also have some share with the Moo project. Even if the share is low (say 100% Collatz, 10% Moo), BOINC downloads a lot of Moo workunits and after some time it keeps stopping Collatz (high priority, I guess). Maybe it is the way the Moo application leaves the GPU memory after using it, I don't know. There are no problems at all if the share is with Milkyway.


Copyright © 2018 Jon Sonntag; All rights reserved.