Computation Errors

Nflight
Send message
Joined: 11 Jul 09
Posts: 3
Credit: 123,281,377
RAC: 0
Message 20613 - Posted: 17 Jun 2015, 14:45:27 UTC
Last modified: 17 Jun 2015, 14:55:08 UTC

Recently, within the last 10 days, I have been seeing computation errors pop up: as soon as the last WU finishes, the next one to start runs for just 2 seconds and then errors out. This usually happens in pairs, then stops, and the next WU runs flawlessly.

Is anyone else seeing this behavior, and is there anything else that can be done to correct it?

OS Win 7 Pro
GPU = AMD R 200 series

Profile valterc
Send message
Joined: 21 Sep 09
Posts: 39
Credit: 14,558,023,737
RAC: 15,204,941
Message 20618 - Posted: 18 Jun 2015, 9:16:19 UTC - in response to Message 20613.
Last modified: 18 Jun 2015, 9:19:51 UTC

I noticed the same behavior on two different machines (all with R280-X GPUs): (unknown error) - exit code -16777217 (0xfeffffff)

In some cases many different computers fail in the same way; see http://boinc.thesonntags.com/collatz/workunit.php?wuid=15790536

Profile mikey
Avatar
Send message
Joined: 11 Aug 09
Posts: 3246
Credit: 1,700,925,421
RAC: 4,845,256
Message 20620 - Posted: 18 Jun 2015, 9:39:46 UTC - in response to Message 20618.

I noticed the same behavior on two different machines (all with R280-X GPUs): (unknown error) - exit code -16777217 (0xfeffffff)

In some cases many different computers fail in the same way; see http://boinc.thesonntags.com/collatz/workunit.php?wuid=15790536


I am getting the same thing with an Nvidia 760 GPU; I don't think it's 'us'.

cyrusNGC_224@P3D
Send message
Joined: 5 May 14
Posts: 7
Credit: 2,376,123,622
RAC: 866,595
Message 20629 - Posted: 18 Jun 2015, 22:12:11 UTC - in response to Message 20613.
Last modified: 18 Jun 2015, 22:12:26 UTC

Recently, within the last 10 days, I have been seeing computation errors pop up: as soon as the last WU finishes, the next one to start runs for just 2 seconds and then errors out. This usually happens in pairs, then stops, and the next WU runs flawlessly.

And the same here with several R3 Kalindi GPUs on Debian Jessie.
http://boinc.thesonntags.com/collatz/result.php?resultid=18364275

I have not changed anything.

BetelgeuseFive
Send message
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20635 - Posted: 19 Jun 2015, 17:01:55 UTC - in response to Message 20629.

Recently, within the last 10 days, I have been seeing computation errors pop up: as soon as the last WU finishes, the next one to start runs for just 2 seconds and then errors out. This usually happens in pairs, then stops, and the next WU runs flawlessly.

And the same here with several R3 Kalindi GPUs on Debian Jessie.
http://boinc.thesonntags.com/collatz/result.php?resultid=18364275

I have not changed anything.


This one is interesting:

http://boinc.thesonntags.com/collatz/workunit.php?wuid=15913445

Exactly the same error message for both an AMD and an NVIDIA GPU:

At offset 16777216 got 627 from the GPU when expecting 203

My guess: this is the same problem as discussed in this thread:

http://boinc.thesonntags.com/collatz/forum_thread.php?id=1272

Tom
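
For context on the "got X from the GPU when expecting Y" message: the application recomputes step counts on the CPU and compares them against what the GPU returned. The sketch below only illustrates what such a "step count" and comparison look like; the function and the placeholder values are assumptions, not the project's actual code, which uses 128-bit arithmetic plus sieve and lookup-table acceleration.

#include <cstdint>
#include <cstdio>

// Hypothetical CPU reference: count Collatz steps for n until it reaches 1.
// The real apps use 128-bit arithmetic plus sieve/lookup-table acceleration;
// this only illustrates what the "steps" in the error message refer to.
static uint32_t collatz_steps_cpu(uint64_t n)
{
    uint32_t steps = 0;
    while (n > 1) {
        if (n & 1)
            n = 3 * n + 1;   // odd: 3n + 1
        else
            n >>= 1;         // even: n / 2
        ++steps;
    }
    return steps;
}

int main()
{
    uint64_t sample    = 27;   // hypothetical number to verify
    uint32_t gpu_steps = 110;  // placeholder for a (wrong) value returned by the GPU
    uint32_t cpu_steps = collatz_steps_cpu(sample);   // 27 needs 111 steps
    if (gpu_steps != cpu_steps)
        std::printf("got %u from the GPU when expecting %u\n", gpu_steps, cpu_steps);
    return 0;
}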

cyrusNGC_224@P3D
Send message
Joined: 5 May 14
Posts: 7
Credit: 2,376,123,622
RAC: 866,595
Message 20637 - Posted: 19 Jun 2015, 21:33:01 UTC

Possible.
But only a few are failing; most WUs are OK.

These errors started suddenly about a week ago. Nothing on the clients has changed.

HAL9000
Avatar
Send message
Joined: 19 Nov 09
Posts: 15
Credit: 104,993,705
RAC: 0
Message 20643 - Posted: 20 Jun 2015, 21:13:37 UTC - in response to Message 20637.
Last modified: 20 Jun 2015, 21:18:16 UTC

Possible.
But only a few are failing; most WUs are OK.

These errors started suddenly about a week ago. Nothing on the clients has changed.

Yeah, in the CUDA issue thread it looks like they are thinking it is some 64-bit vs 32-bit issue. But for opencl_amd_gpu there is only the 32-bit app. Given that the new 6.05 app was put up on June 1st, I would hazard a guess that is when it started. However, I started getting the same "At offset 1769472 got 446 from the GPU when expecting 534 Error: GPU steps do not match CPU steps. Workunit processing aborted." error on my iGPU around the same time, and those apps are from April 2014.

Also it looks like my HD5750 is getting those about 25% of the time. At least they are basically instant fails rather than tying up the GPU for hours.

I guess I should add that my 5 hosts with HD3450s running the ati14 app, which has both 64-bit and 32-bit versions, have had 0 of these errors.

Mike
Send message
Joined: 18 Sep 13
Posts: 17
Credit: 33,587,437
RAC: 0
Message 20647 - Posted: 21 Jun 2015, 13:35:00 UTC - in response to Message 20635.

Recently, within the last 10 days, I have been seeing computation errors pop up: as soon as the last WU finishes, the next one to start runs for just 2 seconds and then errors out. This usually happens in pairs, then stops, and the next WU runs flawlessly.

And the same here with several R3 Kalindi GPUs on Debian Jessie.
http://boinc.thesonntags.com/collatz/result.php?resultid=18364275

I have not changed anything.


This one is interesting:

http://boinc.thesonntags.com/collatz/workunit.php?wuid=15913445

Exactly the same error message for both an AMD and an NVIDIA GPU:

At offset 16777216 got 627 from the GPU when expecting 203

My guess: this is the same problem as discussed in this thread:

http://boinc.thesonntags.com/collatz/forum_thread.php?id=1272

Tom


And to complete the set, I get the same error behaviour on my integrated Intel GPU. Usually 2 or 3 units error out after 2 seconds, then one or more run OK.

Profile valterc
Send message
Joined: 21 Sep 09
Posts: 39
Credit: 14,558,023,737
RAC: 15,204,941
Message 20660 - Posted: 24 Jun 2015, 9:37:57 UTC

The number of workunits erroring out is increasing. My guess is that we have reached a point in the computation (the numbers being checked) which triggers a small bug in the application.

Profile Skivelitis2
Avatar
Send message
Joined: 28 Mar 15
Posts: 17
Credit: 250,229,716
RAC: 1,183,398
Message 20664 - Posted: 24 Jun 2015, 15:30:05 UTC - in response to Message 20647.
Last modified: 24 Jun 2015, 15:32:21 UTC

And to complete the set, I get the same error behaviour on my integrated Intel GPU. Usually 2 or 3 units error out after 2 seconds, then one or more run OK.



Same here with an Intel GPU, but at this point only about 1 in 4 are failing. With only a 2-second runtime I can live with it.

Jon Fox
Send message
Joined: 6 Sep 09
Posts: 36
Credit: 352,667,240
RAC: 268,665
Message 20665 - Posted: 24 Jun 2015, 16:33:53 UTC

I too have started to see elevated levels of computational errors. The most noticeable volume is on my 21-inch 2014 iMac, where the error rate over the past ten days has been about 25%.

I've included the "Host Details" and "Results" links below.

http://boinc.thesonntags.com/collatz/show_host_detail.php?hostid=135259

http://boinc.thesonntags.com/collatz/results.php?hostid=135259&offset=0&show_names=0&state=6&appid=

--
Jon

Dr Who Fan
Avatar
Send message
Joined: 27 May 14
Posts: 21
Credit: 4,562,054
RAC: 0
Message 20666 - Posted: 24 Jun 2015, 19:51:05 UTC - in response to Message 20660.

Seeing the same thing with a few of the tasks sent to me and my wingmates.
So far I only see problems with opencl_nvidia_gpu tasks:

http://boinc.thesonntags.com/collatz/workunit.php?wuid=15773486 Collatz Sieve v1.05 (opencl_nvidia_gpu)

http://boinc.thesonntags.com/collatz/workunit.php?wuid=15780517 Collatz Sieve v1.05 (opencl_nvidia_gpu)

http://boinc.thesonntags.com/collatz/workunit.php?wuid=16060118 Solo Collatz Conjecture v6.04 (opencl_nvidia_gpu)

http://boinc.thesonntags.com/collatz/workunit.php?wuid=16124064 Mini Collatz Conjecture v6.04 (opencl_nvidia_gpu)

Profile valterc
Send message
Joined: 21 Sep 09
Posts: 39
Credit: 14,558,023,737
RAC: 15,204,941
Message 20688 - Posted: 30 Jun 2015, 13:02:11 UTC - in response to Message 20666.
Last modified: 30 Jun 2015, 13:03:19 UTC

I do not see a relationship between GPU type and error rate; all of my AMD cards are producing errors, sometimes. Furthermore, some workunits fail completely because of this high error rate; see this one for example:
http://boinc.thesonntags.com/collatz/workunit.php?wuid=15991927.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20689 - Posted: 30 Jun 2015, 13:54:38 UTC - in response to Message 20660.

The number of workunits erroring out is increasing. My guess is that we have reached a point in the computation (the numbers being checked) which triggers a small bug in the application.


I think I found a bug in the resume-from-checkpoint code which shows up only if a new high step count for the work unit is found after resuming. Since it doesn't report a tie, the best result normally occurs in the first several million numbers checked, which is why it only shows up some of the time. At least I think that's the issue.
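
To illustrate the kind of checkpoint bug being described (a hypothetical sketch, not the project's source): if the running best step count is not saved and restored along with the resume position, a new high found after resuming is compared against a stale value and the final result no longer matches the wingman's.

#include <cstdint>
#include <cstdio>

// Hypothetical checkpoint record; the field names are illustrative only.
struct Checkpoint {
    uint64_t next_index;    // where to resume within the work unit
    uint32_t best_steps;    // highest step count found so far
    uint64_t best_number;   // number that produced it (low 64 bits here)
};

// The failure mode described above: if best_steps/best_number are written but
// not read back (or vice versa), a new high found after resuming is judged
// against a stale value.
bool save_checkpoint(std::FILE* f, const Checkpoint* cp)
{
    return std::fwrite(cp, sizeof(*cp), 1, f) == 1 && std::fflush(f) == 0;
}

bool load_checkpoint(std::FILE* f, Checkpoint* cp)
{
    return std::fread(cp, sizeof(*cp), 1, f) == 1;   // must restore *all* fields
}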

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20690 - Posted: 30 Jun 2015, 14:00:48 UTC - in response to Message 20688.

I do not see a relationship between GPU type and error rate; all of my AMD cards are producing errors, sometimes. Furthermore, some workunits fail completely because of this high error rate; see this one for example:
http://boinc.thesonntags.com/collatz/workunit.php?wuid=15991927.


The nVidia driver crashes when using 4 million items per kernel, which is the size of the 2^32 sieve. AMD works as long as a kernel doesn't take more than 33 ms; beyond that, its driver crashes as well. So I now have to make the sieve size configurable so that older or slower GPUs, or GPUs with less RAM, can process without errors. Since the app was written expecting a static sieve size, a fair amount of the code for the sieve app needs to be altered.
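
As a rough illustration of what "making the sieve size configurable" could mean (an assumption, not the actual implementation), the per-kernel batch could be clamped so a single launch stays under the driver limits mentioned above; the items_per_kernel name mirrors the setting visible in the stderr output later in this thread.

#include <cstdint>
#include <algorithm>

// Hypothetical helper: choose how many numbers each kernel launch processes.
// The full 2^32 sieve corresponds to about 4M items per kernel, which the
// posts above report can crash the NVIDIA driver (and AMD's once a launch
// exceeds roughly 33 ms), so slower cards need a smaller batch.
uint32_t pick_items_per_kernel(uint32_t requested, double trial_kernel_ms)
{
    const uint32_t kMax = 4u * 1024u * 1024u;   // upper bound (2^32 sieve)
    const uint32_t kMin = 64u * 1024u;          // keep launches reasonably efficient
    uint32_t items = std::min(std::max(requested, kMin), kMax);

    // If a trial launch ran too long, halve the batch until it fits the budget
    // (assuming run time scales roughly linearly with the batch size).
    while (trial_kernel_ms > 33.0 && items > kMin) {
        items /= 2;
        trial_kernel_ms /= 2.0;
    }
    return items;
}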

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20696 - Posted: 1 Jul 2015, 14:27:17 UTC - in response to Message 20689.

The number of workunits erroring out is increasing. My guess is that we have reached a point in the computation (the numbers being checked) which triggers a small bug in the application.


I think I found a bug in the resume-from-checkpoint code which shows up only if a new high step count for the work unit is found after resuming. Since it doesn't report a tie, the best result normally occurs in the first several million numbers checked, which is why it only shows up some of the time. At least I think that's the issue.


Nope. That wasn't the problem. When I take the exact same kernel code and run it on the CPU, it returns the correct steps. When it runs on the GPU, it returns an incorrect number of steps, even after changing the code so it runs only one kernel, which eliminates the read/write access. Not sure what to try next.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20700 - Posted: 2 Jul 2015, 4:50:33 UTC - in response to Message 20696.

The number of workunits erroring out is increasing. My guess is that we have reached a point in the computation (the numbers being checked) which triggers a small bug in the application.


We have a winner! There was an error in the lookup table, but it only appears when certain numbers are checked. The new sieve app uses different code to generate the lookup table. When I use the lookup table generated by the sieve app, it works OK. So, I moved that sieve code into the v6 code base and have updated the CUDA apps to 6.05.
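
For background, table-driven Collatz apps commonly precompute, for every k-bit suffix b, how many odd ("3n+1") steps occur and what value the suffix reaches after k shortcut steps, so a kernel can advance k steps per table lookup. The generator below is a generic sketch of that technique under those assumptions, not the project's actual table code.

#include <cstdint>
#include <vector>

// Hypothetical lookup-table builder for taking k Collatz steps at once.
// Write n = (a << k) + b. Applying the shortcut map T (n/2 if even,
// (3n+1)/2 if odd) k times gives T^k(n) = 3^c(b) * a + d(b), so a kernel
// only needs the per-suffix values c(b) and d(b). Names are illustrative.
struct StepTable {
    std::vector<uint32_t> exp3;   // c(b): number of odd ("3n+1") steps
    std::vector<uint64_t> add;    // d(b): value of T^k(b)
};

StepTable build_step_table(unsigned k)
{
    StepTable t;
    const uint64_t size = 1ull << k;
    t.exp3.resize(size);
    t.add.resize(size);
    for (uint64_t b = 0; b < size; ++b) {
        uint64_t d = b;
        uint32_t c = 0;
        for (unsigned step = 0; step < k; ++step) {
            if (d & 1) { d = (3 * d + 1) / 2; ++c; }   // odd step
            else       { d >>= 1; }                    // even step
        }
        t.exp3[b] = c;
        t.add[b]  = d;
    }
    return t;
}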

Profile valterc
Send message
Joined: 21 Sep 09
Posts: 39
Credit: 14,558,023,737
RAC: 15,204,941
Message 20701 - Posted: 2 Jul 2015, 9:30:31 UTC - in response to Message 20700.

The number of workunits erroring out is increasing. My guess is that we have reached a point in the computation (the numbers being checked) which triggers a small bug in the application.


We have a winner! There was an error in the lookup table, but it only appears when certain numbers are checked. The new sieve app uses different code to generate the lookup table. When I use the lookup table generated by the sieve app, it works OK. So, I moved that sieve code into the v6 code base and have updated the CUDA apps to 6.05.

That's great! These kinds of bugs are usually very hard to find. Please check the OpenCL applications as well.

Profile entigy
Send message
Joined: 1 Jul 10
Posts: 11
Credit: 155,922,554
RAC: 234,529
Message 20704 - Posted: 2 Jul 2015, 20:48:55 UTC

...and the first new WUs I downloaded have both fallen over:

Stderr output

<core_client_version>7.6.3</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1 (0xffffffff)
</message>
<stderr_txt>
Collatz Conjecture v6.05 Windows x86_64 for CUDA 5.5
Based on the AMD Brook+ kernels by Gipsel
Using optimizations provided by Sosirus
Config: verbose=1 items_per_kernel=131072 kernels_per_reduction=64 threads=512 sleep=1
Name GeForce GTX 970
Compute 5.2
Parameters --device 0
Start 2397771633835886247936
Checking 107374182400 numbers
Numbers/Kernel 131072
Kernels/Reduction 64
Numbers/Reduction 8388608
Reductions/WU 12800
Threads 512
Using: verbose=1 items_per_kernel=131072 kernels_per_reduction=64 threads=512 sleep=1
cudaSafeCall() failed at CollatzCudaKernel11.cu:362 : device not ready
21:46:17 (1072): called boinc_finish

</stderr_txt>
]]>
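
The "cudaSafeCall() failed ... device not ready" line suggests an error-check wrapper of the common form sketched below (the body is an assumption; the project's actual macro may differ). "device not ready" is the text CUDA reports for cudaErrorNotReady, which is what an event-timing query returns when the event has not completed yet.

#include <cstdio>
#include <cuda_runtime.h>

// A common error-check wrapper of the kind the log line above suggests;
// the project's actual macro may differ.
#define cudaSafeCall(call)                                                  \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            std::fprintf(stderr, "cudaSafeCall() failed at %s:%d : %s\n",   \
                         __FILE__, __LINE__, cudaGetErrorString(err));      \
        }                                                                   \
    } while (0)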

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 20746 - Posted: 7 Jul 2015, 15:19:08 UTC

Short answer: v6.06 was released, which should fix the errors people have been seeing.

Long answer: the stop event was being issued on a different stream than the start event. Because the event methods are asynchronous and were being launched on different streams, the app was trying to get the elapsed time before both the start and stop events had occurred, which resulted in a "device not ready" error. In version 6.06, both events run on the same stream, and a cudaEventSynchronize is also used to make sure they are complete before getting the elapsed time.
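
A minimal sketch of the timing pattern described above, assuming a helper of roughly this shape (not the actual v6.06 source): both events are recorded on the same stream and the stop event is synchronized before cudaEventElapsedTime is called, so the "device not ready" (cudaErrorNotReady) result cannot occur.

#include <cuda_runtime.h>

// Record start/stop on the SAME stream and wait on the stop event before
// asking for the elapsed time, so cudaEventElapsedTime never sees an event
// that has not completed ("device not ready").
float time_kernel_ms(cudaStream_t stream, void (*launch)(cudaStream_t))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);   // same stream as the work being timed
    launch(stream);                   // enqueue the kernel(s)
    cudaEventRecord(stop, stream);    // stop event on the same stream

    cudaEventSynchronize(stop);       // make sure both events have occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}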
