Errors on CUDA workunit
BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20538 - Posted: 11 Jun 2015, 16:00:52 UTC

Hello,

I've recently returned to this project and noticed that my system had a number of tasks with errors. As my system (GTX-750) does not produce any errors on other projects (Einstein, SETI, POEM) I took another look. It turns out that other wingmen that were also running CUDA returned exactly the same error message.
Here are a couple of examples:

http://boinc.thesonntags.com/collatz/workunit.php?wuid=15778372
http://boinc.thesonntags.com/collatz/workunit.php?wuid=15779377

Is this a known issue? (I tried to search the forum but could not find anything about it.) If so, is a fix expected in the near future?

Thanks,

Tom

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20539 - Posted: 11 Jun 2015, 17:16:16 UTC - in response to Message 20538.

I typed a lengthy response about heat and so on, which is usually the cause, but after re-reading your post I double-checked the other errors. Congratulations: you appear to have found a compatibility issue.

All of you are running a 64-bit version of Windows. Even though the flag on the server is set to send only 64-bit apps to 64-bit operating systems, BOINC decided to send you 32-bit applications just in case they ran faster. So there are really two bugs. The first is that the CUDA55 app doesn't return the proper results when run as Win32 (32-bit). The second is BOINC's handling of the "use preferred platform" flag.

Why does BOINC do that at all? Some project admins are so technology-challenged (that's a nice way of saying they probably shouldn't even be in IT) that they have 64-bit apps that run slower than their 32-bit versions and don't remove them. Instead, they expect BOINC to figure out which of their versions runs faster on which operating systems. That's why BOINC randomly sends 32-bit apps to 64-bit operating systems.

The only workaround I know of is to set the no_alt_platform option in your BOINC cc_config.xml. The drawback is that if you also crunch for another project that has only 32-bit applications, you won't get any work from it.
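For reference, a minimal sketch of what that setting looks like. It assumes the usual cc_config.xml layout, with the flag inside the <options> element, and that the file lives in the BOINC data directory and is re-read when the client restarts. The helper name no_alt_platform_enabled is made up for this example; it only checks the text of a config, it does not touch a real BOINC installation.

```python
import xml.etree.ElementTree as ET

# Minimal cc_config.xml with the workaround enabled.
# Assumption: no other options are needed; a real file may contain more.
CC_CONFIG = """<cc_config>
  <options>
    <no_alt_platform>1</no_alt_platform>
  </options>
</cc_config>
"""


def no_alt_platform_enabled(xml_text: str) -> bool:
    """Return True if the given cc_config.xml text sets no_alt_platform to 1."""
    root = ET.fromstring(xml_text)
    node = root.find("options/no_alt_platform")
    return node is not None and (node.text or "").strip() == "1"


print(no_alt_platform_enabled(CC_CONFIG))  # prints True
```

Checking the file this way is only a sanity test that the flag sits in the right element; it says nothing about whether the scheduler then actually sends the 64-bit app.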

The fix would be to change the BOINC server code so that it never sends 32-bit apps to 64-bit operating systems, at least on Collatz (yet another server-specific code change that I'll have to merge back in the next time I upgrade the server software).

Hopefully the new sieve apps won't behave the same way. In my testing so far, the 32-bit nVidia apps run just fine on Win 7 x64 and on Win 8.1 x64. I don't have a CUDA-specific version compiled yet (just OpenCL).

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20541 - Posted: 11 Jun 2015, 17:35:55 UTC - in response to Message 20539.

Thanks for the explanation. I will try the no_alt_platform option (I would not have thought of that without you mentioning it). I got a lot of error messages after restarting BOINC, but I don't think there are any serious issues.
If it does cause problems, I will let you know.

Tom

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20554 - Posted: 12 Jun 2015, 5:35:51 UTC

Even with the no_alt_platform option I'm still seeing some errors.

http://boinc.thesonntags.com/collatz/result.php?resultid=18191525

Looking at the earlier errors, I found it strange that not a single one of the workunits that failed for me was later handled correctly by another CUDA host. All other CUDA tasks sent for the same workunit failed with exactly the same message.
So my question: are you absolutely sure this is a 32/64-bit platform issue? I get the impression the problem is related to the CUDA app and not to 32/64-bit issues. Is there a way I can see whether a task was handled by a 32-bit or a 64-bit app?

Thanks,

Tom

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20603 - Posted: 16 Jun 2015, 17:40:10 UTC

I just ran one of the WUs listed as failing on all the CUDA devices, and it ran fine on my laptop's nVidia GTX 770M processor. I ran it using the 64-bit Windows app. So either the problem is with the 32-bit app on a 64-bit platform, or the CUDA drivers sporadically return garbage (which is why the CPU double-checks the results).

Collatz Conjecture v6.04 Windows x86_64 for CUDA 5.5
Based on the AMD Brook+ kernels by Gipsel
Name GeForce GTX 770M
Compute 3.0
Parameters --device 0
Start 2397317568673582415872
Checking 98784247808 numbers
Numbers/Kernel 131072
Kernels/Reduction 64
Numbers/Reduction 8388608
Reductions/WU 11776
Threads 64
Using: verbose=1 items_per_kernel=131072 kernels_per_reduction=64 threads=64 sleep=1

Highest Steps 1630 for 2397317568674010636665
Total Steps 54673909814675
Avg Steps 553
CPU time 11.7656 seconds
Total time 380.716 seconds
11:58:53 (8676): called boinc_finish
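
The figures in the log above are consistent with each other, which can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up, and it assumes the app reports Avg Steps as Total Steps divided by the number of values checked, truncated to an integer.

```python
# Values copied from the log above.
numbers_per_kernel = 131072
kernels_per_reduction = 64
reductions_per_wu = 11776
total_steps = 54673909814675

# Numbers/Reduction = Numbers/Kernel * Kernels/Reduction
numbers_per_reduction = numbers_per_kernel * kernels_per_reduction

# "Checking N numbers" = Numbers/Reduction * Reductions/WU
numbers_checked = numbers_per_reduction * reductions_per_wu

# Assumption: Avg Steps is the truncated integer average.
avg_steps = total_steps // numbers_checked

print(numbers_per_reduction)  # 8388608
print(numbers_checked)        # 98784247808
print(avg_steps)              # 553
```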

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20605 - Posted: 16 Jun 2015, 18:37:11 UTC

Here's the output of the WU you listed, run on my laptop, which is Win 8.1 x64. It was run with the 32-bit CUDA 5.5 app.

Name GeForce GTX 770M
Compute 3.0

Resuming at 2397317568751269314560
Using: verbose=1 items_per_kernel=131072 kernels_per_reduction=64 threads=64 sleep=1

Highest Steps 1630 for 2397359114451010593145
Total Steps 55007879824742
Avg Steps 556
CPU time 11.6719 seconds
Total time 381.856 seconds
12:44:12 (6064): called boinc_finish


That means the assumption that all of the Windows CUDA apps are buggy is incorrect. In fact, both apps work. The bad news is that issues where it works for some and not for others are the most difficult to track down: since the app does produce correct results, the sporadic failures could also be caused by the driver, memory, heat, etc.

Have you tried the sieve app at all? If not, I'd be curious whether you get the same results with it.

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20614 - Posted: 17 Jun 2015, 16:24:25 UTC - in response to Message 20605.

Hello Slicker, thank you for trying.

I know it is hard to find problems if you can't reproduce them (I am a software engineer with over 25 years of experience).

I will try the sieve app when I am able to get some work (so far I have had 9 failed requests):

17-6-2015 18:03:58 | Collatz Conjecture | Sending scheduler request: Requested by user.
17-6-2015 18:03:58 | Collatz Conjecture | Requesting new tasks for NVIDIA GPU
17-6-2015 18:04:00 | Collatz Conjecture | Scheduler request completed: got 0 new tasks

Is there a way I can check (in the task output) whether a 32-bit or a 64-bit app is used? All I see is:

Collatz Conjecture v6.04 Windows x86_64 for CUDA 5.5

My BOINC thesonntags.com_collatz directory only contains a single executable named:

mini_collatz_6.04_windows_x86_64__cuda55.exe

Based on the name I would assume this is a 64 bit executable.

How do I find or recognize the 32-bit app?
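
Independent of the file name, a Windows executable records its target architecture in its PE header, so the bitness can be read from the file itself. The following is a minimal sketch, not something the project provides: the function name pe_bitness is made up, and it only distinguishes the two machine types relevant here (0x014C for 32-bit x86, 0x8664 for x86-64).

```python
import struct

# COFF "Machine" values for the two architectures discussed in this thread.
MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x86-64)"}


def pe_bitness(path: str) -> str:
    """Read the PE header of a Windows executable and describe its architecture."""
    with open(path, "rb") as f:
        dos_header = f.read(64)
        if len(dos_header) < 64 or dos_header[:2] != b"MZ":
            raise ValueError("not a DOS/PE executable")
        # Offset 0x3C of the DOS header holds the file offset of the PE header.
        (pe_offset,) = struct.unpack_from("<I", dos_header, 0x3C)
        f.seek(pe_offset)
        pe_start = f.read(6)
    if len(pe_start) < 6 or pe_start[:4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 2-byte Machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", pe_start, 4)
    return MACHINE_NAMES.get(machine, f"unknown machine type 0x{machine:04x}")
```

It would be called with the full path to the executable in the project directory, for example the mini_collatz_6.04_windows_x86_64__cuda55.exe file mentioned above.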

What also surprises me is that I still got errors after adding:

<no_alt_platform>1</no_alt_platform>

to my cc_config.xml file. The errors occurred after restarting BOINC (causing errors on some other projects) and requesting new tasks.

Is there another way to force using the 64-bit app? I know that you can specify the executables used in a separate file (I used the anonymous platform over on SETI@home in the past).

If there is anything I can do to help figure out what is causing this problem, please let me know. I think it is highly unlikely that this problem is caused by hardware or heat issues: in that case the results would be more random, and not all hosts would fail in exactly the same way. Driver problems may be an issue, but I checked my wingmen on the failed results and they were using many different driver versions (but no, that does not prove anything ...).


Tom

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20615 - Posted: 17 Jun 2015, 20:19:08 UTC - in response to Message 20614.

Managed to get the sieve app up and running (I not only needed to select the sieve app, but also had to enable 'test'). After 6 minutes it is at 0.5%. The system is very sluggish (not nice to work with anymore). GPU at 99%, memory usage above 900 MB. Will let it run for now ...

Tom

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20623 - Posted: 18 Jun 2015, 19:29:39 UTC - in response to Message 20614.

x86_64 = 64-bit
intelx86 = 32-bit
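
The same mapping can be applied mechanically to an app banner, a plan class line, or an executable name. This is a small sketch with a made-up helper name (app_bits). The two entries x86_64 and intelx86 come from the mapping above; treating "i686" as 32-bit is an addition based on the "Windows i686" banner that the sieve app prints and on i686 being a 32-bit x86 target.

```python
import re
from typing import Optional

# Platform tokens and their bitness. "i686" is an addition to the
# two-entry mapping given in the post above.
PLATFORM_BITS = {"x86_64": 64, "intelx86": 32, "i686": 32}

_PLATFORM_RE = re.compile("|".join(re.escape(name) for name in PLATFORM_BITS))


def app_bits(text: str) -> Optional[int]:
    """Return 32 or 64 if a known platform token appears in text, else None."""
    match = _PLATFORM_RE.search(text)
    return PLATFORM_BITS[match.group(0)] if match else None


print(app_bits("Collatz Conjecture v6.04 Windows x86_64 for CUDA 5.5"))  # 64
print(app_bits("intelx86 opencl_nvidia_gpu"))                            # 32
```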

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20624 - Posted: 18 Jun 2015, 19:38:25 UTC - in response to Message 20615.

It will likely take twice as long to run as a large WU, but it calculates 4K times as many numbers in that period.

When the 2.x apps were released, they were also sluggish, but the people running them had them on 24/7 crunching machines. When the OpenCL apps were first released, everyone complained that they only used 70% of the GPU. Sosiris worked really hard to optimize the new sieve apps so that they would run as fast as possible. That also means they run at 99-100% GPU. Even on fast GPUs, I have to set it to not run while the computer is in use.

If I can't find settings that will slow down the sieve app and allow better video response, I may have to create a completely different version and make the current one an optimized app for those where video response is not an issue.

Dr Who Fan
Joined: 27 May 14
Posts: 21
Credit: 4,562,054
RAC: 0
Message 20626 - Posted: 18 Jun 2015, 20:07:41 UTC

So far BOTH of the Sieve tasks I have downloaded crashed almost immediately and also crashed my GPU, forcing a restart of the GPU.

COMPUTER 146029

Exit status: 5 (0x5) Unknown error number

Stderr output

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
Access is denied.
(0x5) - exit code 5 (0x5)
</message>

<stderr_txt>
Collatz Conjecture Sieve 1.0.4 Windows i686 for OpenCL
Written by Slicker (Jon Sonntag) of team SETI.USA
Based on the AMD Brook+ kernels by Gipsel of team Planet 3DNow!
Sieve code and OpenCL optimization provided by Sosiris of team BOINC@Taiwan
Collatz Config Settings:
verbose=1
kernels_per_reduction=32
threads=1024
lut_size=11
sleep=1
reducecpu=0 (no)
Platform: NVIDIA
Device: 01AA1D38
OpenCL context created
OpenCL program created
OpenCL program copiled
Max Work Item Dimensions: 3
Max Work Item Size: 1024 1024 64
Max Work Group Size: 1024
Max Kernel Work Group Size: 1024
WU Name: collatz_sieve_2397292516180884455424_844424930131968
Start: 2397292516180884455424
Stop: 2397293360605814587392
Using:
verbose=1
threads=1024
sleep=1
lookup table size=2048
reduction on CPU=no
kernels per reduction=32
Error: (-5)Out of resources at 1303 of collatzOpenCL::RunSteps
Error: GPU steps do not match CPU steps. Workunit processing aborted.
01:55:21 (7916): called boinc_finish


</stderr_txt>
]]>

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20628 - Posted: 18 Jun 2015, 20:35:07 UTC - in response to Message 20626.

The defaults for the 1.04 sieve app are too high for some GPUs, causing the video driver to crash. That, or the config settings being used are too high. There is no safety net, since settings that work for one GPU may be too high for another.

Try editing the collatz_sieve*.config file(s) to be:

verbose=1
kernels_per_reduction=2
threads=6
sleep=1
lut_size=15
reduceCPU=0

If they work, I'll release a 1.06 version with the above as the defaults so that people don't need to edit the config unless they want better performance.
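
To make the effect of these values easier to see, here is a small sketch that parses a key=value config like the one above. It is not the app's own parser: the function name parse_collatz_config is made up, keys are lower-cased purely for convenience, and the lookup-table calculation assumes the size is 2 to the power of lut_size, which matches the failing log (lut_size=11 reported as "lookup table size=2048") but is otherwise unconfirmed.

```python
SUGGESTED_CONFIG = """\
verbose=1
kernels_per_reduction=2
threads=6
sleep=1
lut_size=15
reduceCPU=0
"""


def parse_collatz_config(text: str) -> dict:
    """Parse key=value lines into a dict of ints, skipping blank or malformed lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Keys are lower-cased here only to make lookups simple in this sketch.
        settings[key.strip().lower()] = int(value)
    return settings


settings = parse_collatz_config(SUGGESTED_CONFIG)

# Assumption: table size = 2 ** lut_size (11 -> 2048 in the failing log).
lookup_table_entries = 2 ** settings["lut_size"]

print(settings["threads"], settings["kernels_per_reduction"])  # 6 2
print(lookup_table_entries)                                    # 32768
```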

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20634 - Posted: 19 Jun 2015, 16:55:47 UTC - in response to Message 20623.

All my mini_collatz tasks (both the failed and the successful ones) report:

Collatz Conjecture v6.04 Windows x86_64 for CUDA 5.5

Also there is no 32-bit mini_collatz app in my collatz directory.
This, combined with the fact that the problem also occurs with no_alt_platform set in my cc_config.xml, leads me to believe that the problem has nothing to do with 32-bit apps running on a 64-bit system.
As the problem always seems to happen in the first seconds of execution, I think it is more likely that there is some kind of initialization issue (uninitialized variables?). Uninitialized variables can be hard to find, especially when they are on the stack, because they may consistently receive the same value left behind by the previously called function.

Tom

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20644 - Posted: 21 Jun 2015, 8:53:42 UTC

Another indication that this problem has nothing to do with 32-bit apps on 64-bit platforms.
Have a look at this host:

http://boinc.thesonntags.com/collatz/show_host_detail.php?hostid=127401

It is a 32-bit Windows system and it has exactly the same problems.

Tom

HAL9000
Joined: 19 Nov 09
Posts: 15
Credit: 104,993,705
RAC: 0
Message 20669 - Posted: 26 Jun 2015, 14:31:56 UTC - in response to Message 20644.

Looking back it seems like this started after June 10th.
Although nothing else seemed to happen around that same time.

Dr Who Fan
Joined: 27 May 14
Posts: 21
Credit: 4,562,054
RAC: 0
Message 20670 - Posted: 26 Jun 2015, 16:08:30 UTC - in response to Message 20669.

Yes, the problem is STILL around. Hal, that is what I suspect also: corruption of data and/or other hardware damage from the power surge/lightning strike.

6 more tasks (so far) have crashed for me since the project came back online.

HAL9000
Joined: 19 Nov 09
Posts: 15
Credit: 104,993,705
RAC: 0
Message 20672 - Posted: 26 Jun 2015, 18:57:39 UTC - in response to Message 20670.
Last modified: 26 Jun 2015, 18:58:01 UTC

Only you can access your full task list. Everyone else only gets to see an error message. However I suspect that the tasks you are referring to are mostly on this host.

GonoszTopi
Joined: 17 Nov 13
Posts: 1
Credit: 5,945,263
RAC: 0
Message 20673 - Posted: 26 Jun 2015, 19:50:14 UTC

I have also met with this "GPU steps do not match CPU steps" error. Linking some of the failed ones in case it might help...

intelx86 opencl_nvidia_gpu:
http://boinc.thesonntags.com/collatz/result.php?resultid=18639055

x86_64 opencl_intel_cpu:
http://boinc.thesonntags.com/collatz/result.php?resultid=18632574

x86_64 opencl_intel_gpu:
http://boinc.thesonntags.com/collatz/result.php?resultid=18644538
http://boinc.thesonntags.com/collatz/result.php?resultid=18637267
http://boinc.thesonntags.com/collatz/result.php?resultid=18632564

Jon Fox
Joined: 6 Sep 09
Posts: 36
Credit: 352,895,343
RAC: 270,088
Message 20674 - Posted: 26 Jun 2015, 20:43:54 UTC

Given when these errors started to appear, I wanted to note the similar thread at http://boinc.thesonntags.com/collatz/forum_thread.php?id=1279 just in case there's some correlation.

--
Jon

Dr Who Fan
Joined: 27 May 14
Posts: 21
Credit: 4,562,054
RAC: 0
Message 20680 - Posted: 29 Jun 2015, 3:51:39 UTC - in response to Message 20672.
Last modified: 29 Jun 2015, 3:52:12 UTC

Still seeing a high error rate on GPU tasks for Host ID 146029.

Almost all are errors of the type "At offset xxxxxxxxx got xxx from the GPU when expecting xxx", where each "x" stands for a varying digit.

NOTE TO SLICKER: As HAL9000 pointed out in a previous post, I suspect these errors are related to the electrical problems YOUR residence had a few weeks ago.
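
The wording of that error suggests an element-by-element comparison between step counts returned by the GPU and values recomputed on the CPU. Below is a minimal sketch of such a cross-check. It is not the project's code: the function names are made up, and it assumes one step per halving or per 3n+1 operation, which may differ from how the Collatz apps count steps.

```python
from typing import Optional, Sequence, Tuple


def collatz_steps(n: int) -> int:
    """Count steps until n reaches 1 (assumed convention: n/2 and 3n+1 each count as one)."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def first_mismatch(start: int, gpu_steps: Sequence[int]) -> Optional[Tuple[int, int, int]]:
    """Compare GPU step counts for start, start+1, ... against the CPU.

    Returns (offset, got, expected) for the first disagreement, or None.
    """
    for offset, got in enumerate(gpu_steps):
        expected = collatz_steps(start + offset)
        if got != expected:
            return offset, got, expected
    return None


print(first_mismatch(6, [8, 16, 3]))  # None
print(first_mismatch(6, [8, 15, 3]))  # (1, 15, 16)
```

A check like this cannot say why the values differ; it only shows where the first disagreement occurs, which is why the same message can result from an app bug, a driver problem, or faulty hardware.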
Copyright © 2018 Jon Sonntag; All rights reserved.