Joined: 28 Dec 11
Message 13209 - Posted: 2 Jan 2012, 15:43:43 UTC

Here is a CUDA task that errored out after many hours of work with the Collatz v2.06 application:

It seems to have happened in the middle of the night, when nobody's using the computer. Has anyone else seen this kind of error? Is there anything that can be done about such problems? Should I reinstall my CUDA drivers once more and hope it doesn't happen again?

This was on a MacBook Pro (with NVIDIA 9400M & 9600M GT GPUs) running Mac OS X 10.7.2. Let me know if you need more info.


Joined: 11 Jun 09
Message 13219 - Posted: 4 Jan 2012, 15:03:44 UTC

It restarted the app 32 times? Was that because you were running another CUDA project as well?

After the app has been terminated that many times without a chance to release its GPU memory, the driver may reset. Switching between GPU apps has always been a challenge, since a well-written GPU app uses almost no CPU. The CPU calls the GPU, and until the GPU is done there is NOTHING the CPU can do to stop it. There is no abort. If BOINC wants to shut down while the GPU is running, the app gets terminated without having a chance to release the GPU memory or other resources. So, when BOINC starts the next app, that app has fewer resources with which to work. Eventually they run out, and either BOINC won't start another app or the driver crashes.

Spending 50% of a CPU to monitor BOINC and having the GPU run really small, short kernels (50% utilization instead of 98-99%) would help as well. Of course, a fast CPU would then do about as much work as the GPU, so credits would be really bad: the app would be wasting its abilities trying to play nice with others instead of getting work done. There have been some changes in recent BOINC clients to notify the app that BOINC plans to terminate it, but those changes are not in the current Collatz CUDA apps yet. I have added them to the OpenCL apps, but those are still in testing.
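To make the trade-off concrete, here is a rough Python sketch. It is a simulation of the control flow only, not the Collatz app's actual code and not real GPU code. `FakeGpu`, `run`, and the `quit_requested` callback are made-up names standing in for a GPU context, the app's main loop, and whatever quit notification the client provides. The point it shows: the host can only check for a quit request between launches, so smaller chunks mean a quicker, cleaner exit at the cost of more launches.

```python
from dataclasses import dataclass


@dataclass
class FakeGpu:
    """Hypothetical stand-in for a GPU context (no real GPU calls)."""
    allocated: bool = False
    launches: int = 0

    def alloc(self):
        self.allocated = True

    def launch(self, start, stop):
        # Models a kernel launch: once started, the host cannot interrupt it.
        self.launches += 1
        return sum(range(start, stop))  # placeholder for the real work

    def free(self):
        self.allocated = False


def run(gpu, total_items, chunk_size, quit_requested):
    """Process total_items in chunks, polling quit_requested between launches.

    Returns (items_done, partial_result). The GPU memory is released on
    every exit path, including an early quit.
    """
    gpu.alloc()
    done, result = 0, 0
    try:
        while done < total_items and not quit_requested():
            stop = min(done + chunk_size, total_items)
            result += gpu.launch(done, stop)
            done = stop
    finally:
        gpu.free()
    return done, result
```

With `chunk_size` equal to `total_items` there is a single launch and the quit check never gets a second chance to run, which is the "98-99% utilization, no abort" case. With a small `chunk_size` the loop can stop after any chunk and still free its memory, but it pays for many more launches.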

Another possibility is the driver itself. If the OS X drivers are anything like the Windows versions, half of them should never have been released to the public, because they really don't work, or they fix one thing and break two others. Sometimes installing the latest driver, or even going back to a previous version, will solve the problem.

Message boards : Macintosh : Computation error: "Unspecified driver error" after many hours of working fine

