New Windows CUDA and OpenCL Versions Released
Message boards : News : New Windows CUDA and OpenCL Versions Released

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20745 - Posted: 7 Jul 2015, 15:13:40 UTC

CUDA version 6.06 and OpenCL version 6.08 were released for Windows today. The CUDA version fixes (I hope) a "device not ready" bug seen on some fast GPUs. While I could not duplicate the error, the code now waits for the events to synchronize, which should eliminate the error.

The OpenCL version, 6.08, now includes the fix for the bug where the GPU and CPU results did not always match. It uses several of the optimizations Sosirus has provided and includes a new "lut_size" configuration option, which now defaults to 12 (4096 items). The previous version used a 2^20-entry lookup table, which did not fit into the GPU's cache and made the app memory-bound rather than processor-bound. So you should see higher GPU utilization with the new version, and it should not be as dependent on memory speed as the previous versions were.
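
For anyone curious what a lookup table like that holds: the usual trick is to precompute the combined effect of lut_size Collatz steps for every possible low-bit pattern, so the GPU can take lut_size steps per table read. Here is a rough Python sketch of the idea; it is only an illustration and does not claim to match the app's exact table layout.

# Sketch: a k-step Collatz lookup table with 2**k entries of two integers each.
# Writing n = q*2**k + r, k "shortcut" steps (n/2 if even, (3n+1)/2 if odd)
# turn n into mult[r]*q + off[r], where mult[r] = 3**(number of odd steps seen for r).
def build_table(k):
    mult, off = [], []
    for r in range(1 << k):
        c, s = 0, r
        for _ in range(k):
            if s & 1:
                s = (3 * s + 1) // 2   # odd step
                c += 1
            else:
                s //= 2                # even step
        mult.append(3 ** c)
        off.append(s)
    return mult, off

# lut_size=12 (the new default) gives 4096 entries.
mult, off = build_table(12)

# Sanity check against plain single stepping, using a start value quoted later in this thread:
def step_k(n, k):
    for _ in range(k):
        n = (3 * n + 1) // 2 if n & 1 else n // 2
    return n

n = 2398044770915275767808
q, r = divmod(n, 1 << 12)
assert mult[r] * q + off[r] == step_k(n, 12)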

Linux and OS X versions with the same fixes will follow soon.

As usual, let me know if you have any issues with the new versions.

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20748 - Posted: 7 Jul 2015, 16:14:01 UTC

Thank you !

The first two (CUDA) units completed and validated.
Now running two OpenCL units. I don't like it very much that these are also using two CPU cores. Is there a way to select CUDA only?

Tom

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20749 - Posted: 7 Jul 2015, 16:16:00 UTC - in response to Message 20748.

Found it in my account settings. Disabled OpenCL there.

Thank you !

The first two (CUDA) units completed and validated.
Now running two OpenCL units. I don't like it very much that these are also using two CPU cores. Is there a way to select CUDA only?

Tom

Profile valterc
Joined: 21 Sep 09
Posts: 39
Credit: 14,479,595,721
RAC: 15,718,031
Message 20750 - Posted: 7 Jul 2015, 16:35:38 UTC - in response to Message 20745.
Last modified: 7 Jul 2015, 16:40:32 UTC

Any hints about the right value for the "lut_size" configuration option for a high end gpu (r290-x r280-x)?

Stick
Joined: 30 Apr 10
Posts: 20
Credit: 10,818,888
RAC: 51,892
Message 20752 - Posted: 7 Jul 2015, 20:57:54 UTC

WU 16437712 is my first with Solo Collatz Conjecture v6.08 (opencl_intel_gpu). I am guessing there is a problem with the WU (or maybe with the new apps). That is, one wingman has errored out and another aborted. And my unit is progressing VERY slowly - only about 4% complete after about 5 hours. I will suspend it for now but keep it in the cache in case there are suggestions for fixing it and/or requests for more info.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20753 - Posted: 7 Jul 2015, 21:20:04 UTC - in response to Message 20750.

Any hints about the right value for the "lut_size" configuration option for a high end gpu (r290-x r280-x)?


The goal is to have the lookup table and the other variables used in the app fit into the GPU's L2 cache.

The default is "lut_size=12" which means 2^12 items in the lookup table and each entry the table is 8 bytes (2 integers). So, the default is 4096 items x 8 bytes/item, or 32,768 bytes (32k). That's kind of small since the R9 290x has a 1MB L2 cache. It will probably work fastest with a value of 15 or 16.

In general, start with a low value (e.g. the default of 12) and increase it by 1 until the app runs slower than the previous run. By slower, I mean 10% or more, since the WUs can vary a little in size. This is best done by running only the mini WUs, since they complete fairly fast. Then switch to the solo or large WUs once you have it dialed in.
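
To put numbers on that, here is a quick Python sketch of the table footprint at each lut_size, assuming 8 bytes per entry as above and a 1MB L2 cache like the R9 290x's (other cards' caches differ):

L2_CACHE = 1024 * 1024                       # 1 MB, as on the R9 290x

for lut_size in range(12, 19):
    table_bytes = (1 << lut_size) * 8        # 2**lut_size entries, 8 bytes each
    verdict = "fits in" if table_bytes <= L2_CACHE else "exceeds"
    print(f"lut_size={lut_size}: {table_bytes // 1024} KB, {verdict} a 1 MB L2 cache")

# The table is not the only thing that has to live in the cache, which is why a value
# one or two notches below the exact fit (15 or 16 here) is usually the sweet spot.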

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20754 - Posted: 7 Jul 2015, 21:29:09 UTC - in response to Message 20752.

WU 16437712 is my first with Solo Collatz Conjecture v6.08 (opencl_intel_gpu). I am guessing there is a problem with the WU (or maybe with the new apps). That is, one wingman has errored out and another aborted. And my unit is progressing VERY slowly - only about 4% complete after about 5 hours. I will suspend it for now but keep it in cache in case there are suggestions for fixing and/or requests more info.


Take a look at the WU again. It failed on the Brook+ ATI app, which has not been updated. The next result was aborted by the user, possibly because it was running the 6.04 app, which is known to have the bug (and is why the 6.08 app was released).

So, the only issue is that it is running slower. FYI, the app doesn't even read the WU. The filename is all it needs in order to process, so unless you aren't even getting a WU, that isn't the cause. Since it is a new app, it needs to have the new config file set to the same settings as the old one. Have you edited the config file? What settings are you using in it?
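
For reference, a WU name such as mini_collatz_2398044770915275767808_103079215104 (one appears in a stderr log later in this thread) already carries the start value and how many numbers to check. Here is a rough Python sketch of decoding it; the field meanings are inferred from the Start/Stop lines of that log, not taken from the app's source, so treat them as an assumption.

def parse_wu_name(name):
    # Assumed layout: <app>_<start>_<count>, with stop = start + count.
    parts = name.rsplit("_", 2)
    start, count = int(parts[-2]), int(parts[-1])
    return start, start + count

start, stop = parse_wu_name("mini_collatz_2398044770915275767808_103079215104")
print(start, stop)   # 2398044770915275767808 2398044771018354982912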

Anthony Ayiomamitis
Joined: 21 Jan 15
Posts: 48
Credit: 1,040,463,428
RAC: 10,071,377
Message 20756 - Posted: 7 Jul 2015, 22:33:12 UTC

To get the new version, do we reset and then re-add the project via the BOINC manager, or is there a more direct route?

Rymorea
Joined: 14 Oct 14
Posts: 100
Credit: 200,411,819
RAC: 5
Message 20757 - Posted: 7 Jul 2015, 23:15:17 UTC
Last modified: 7 Jul 2015, 23:19:36 UTC

My ATI 270X only shows about 35% GPU usage with the new v6.08 solo Collatz OpenCL AMD app.
How can I raise GPU usage to 90% or more?

Stick
Joined: 30 Apr 10
Posts: 20
Credit: 10,818,888
RAC: 51,892
Message 20758 - Posted: 8 Jul 2015, 1:07:57 UTC - in response to Message 20754.

Have you edited the config file? What settings are you using in it?

Haven't touched it.

However, I just figured out why it's so slow: it's a full-size unit, and I have been getting a steady diet of minis. In fact, it's been so long since I last had a full-size one that I don't remember the size difference. Minis usually take 5 to 6 hours on this computer.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20759 - Posted: 8 Jul 2015, 3:44:08 UTC - in response to Message 20757.

My ATI 270X only shows about 35% GPU usage with the new v6.08 solo Collatz OpenCL AMD app.
How can I raise GPU usage to 90% or more?


The instructions are located in the "Optimizing Collatz v6.xx OpenCL and CUDA Applications" thread.

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 20761 - Posted: 8 Jul 2015, 5:36:29 UTC

Over 60 units (mostly CUDA, only a couple of OpenCL) completed and validated, not a single error. Problems fixed, I'd say.

Thanks again,

Tom

ahorek's team
Joined: 17 Mar 10
Posts: 1
Credit: 9,389,048
RAC: 0
Message 20777 - Posted: 8 Jul 2015, 20:46:46 UTC

I'm getting errors on my very old GeForce 8600 GT; previous versions worked fine. I've also tested newer cards without any problems.

http://boinc.thesonntags.com/collatz/result.php?resultid=19085953

Using: verbose=1 items_per_kernel=131072 kernels_per_reduction=64 threads=512 sleep=1
cudaSafeCall() failed at CollatzCudaKernel11.cu:352 : too many resources requested for launch

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20778 - Posted: 8 Jul 2015, 21:11:02 UTC - in response to Message 20777.

I'm getting errors on my very old GeForce 8600 GT; previous versions worked fine. I've also tested newer cards without any problems.

http://boinc.thesonntags.com/collatz/result.php?resultid=19085953

Using: verbose=1 items_per_kernel=131072 kernels_per_reduction=64 threads=512 sleep=1
cudaSafeCall() failed at CollatzCudaKernel11.cu:352 : too many resources requested for launch


That error means your 8600 GT doesn't have enough registers, RAM, processors, etc. to be able to process all those items and threads. The kernels_per_reduction is invalid. The limit is that the items per kernel multiplied by the kernels per reduction cannot exceed 2^32. The values in the config are exponents (powers of 2), so if items_per_kernel is 16, that is really 2^16 items per kernel, or 65,536. The kernels per reduction is also a power of 2, so the 4 suggested below means 2^4 = 16, and 16 x 65,536 is well under 2^32. The value you tried using is 64. Hmmmm.... 2^64 is just a wee bit more than 2^32. ;-)
For your 8600 GT, I would suggest trying much lower values such as

verbose=1
items_per_kernel=16
kernels_per_reduction=4
threads=6
sleep=1
lut_size=10

If that works, then increment the lut_size and run the next WU. If that works and is not considerably slower (more than 10%), increment the lut_size again. Repeat until it runs slower or crashes due to not enough resources, and then back it off by at least one.

The items per kernel works the same way. Increase it by 1 and let it run. If it works, increase it again. Eventually the video response will become horrible and/or the driver will crash. Again, when that happens, go back to the smaller number that worked previously.

FYI, larger numbers are not always faster for the lut_size, since having the table fit into the L2 cache, where it can be read quickly, is usually much faster than being able to look ahead a few extra steps but having to get the data from slower RAM. With the latest GPUs, the difference between GDDR5 RAM speed and the L2 cache may not be as big. For example, if reading from the L2 cache is only 50% faster than from GDDR5 RAM, a lut_size of 20 would probably be faster even though it may not fit entirely into the L2 cache. With slower GDDR2 or GDDR3 RAM, fitting the lookup table into the cache speeds things up considerably; it almost doubles the speed on AMD GPUs. nVidia seems to do a better job of memory and cache management, so the difference there isn't quite as big.
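
Here is a small Python sketch that checks items_per_kernel and kernels_per_reduction the way described above (the values are exponents and their expanded product is limited to 2^32); it just restates the arithmetic from this post, nothing app-specific.

def check_config(items_per_kernel, kernels_per_reduction):
    # Both config values are exponents; the expanded product must stay within 2**32.
    total = (1 << items_per_kernel) * (1 << kernels_per_reduction)
    ok = items_per_kernel + kernels_per_reduction <= 32
    print(f"2^{items_per_kernel} items x 2^{kernels_per_reduction} kernels = {total:,} "
          f"-> {'OK' if ok else 'too many resources'}")
    return ok

check_config(16, 4)    # the example above: 65,536 x 16, well under 2**32
check_config(16, 64)   # an old-style value of 64 gets read as 2**64 and blows the limit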

mengpangwang
Joined: 28 Jun 15
Posts: 3
Credit: 28,450,169
RAC: 26,809
Message 20787 - Posted: 9 Jul 2015, 15:10:16 UTC

The previous version, v6.04, reported an error related to the CPU and GPU steps not matching.
Will that errored result be removed from our account history? I believe it's due to an OpenCL error.

Information:
The task stopped almost immediately, giving an error after just 2.84 seconds.
It reported "GPU steps do not match CPU steps. Workunit processing aborted."

http://boinc.thesonntags.com/collatz/result.php?resultid=18788204

OS: Win7 x64 SP1
CPU: Intel i7-2600 @ 3.40 GHz
GPU: AMD R9-270 with 14.12 Omega Driver

Note: The new version, v6.08, works pretty well; no more errors have occurred. I'm just wondering whether the errors related to v6.04 will be removed from the task history. Thanks.

Profile valterc
Joined: 21 Sep 09
Posts: 39
Credit: 14,479,595,721
RAC: 15,718,031
Message 20791 - Posted: 10 Jul 2015, 10:01:28 UTC
Last modified: 10 Jul 2015, 10:50:14 UTC

I started to get validation errors on two different computers. Just two, nothing to be worried about statistically speaking, but I never got them before. I checked the stderr logs and they seem fine.

For your reference: http://boinc.thesonntags.com/collatz/results.php?userid=2089&offset=0&show_names=0&state=5&appid=

Regarding performance, I have some statistics on a couple of R280-X, config parameters are:

items_per_kernel=20
kernels_per_reduction=9
threads=8
sleep=1
(lut_size=16 or 18 for 6.08; there is no noticeable difference between the two)

6.04: Large 18,730.05 sec. ave, 45,866.05 credit/hour
6.08: Solo 1,336.77 sec. ave, 31,791.19 credit/hour

dschonew
Joined: 26 Feb 12
Posts: 1
Credit: 57,698,797
RAC: 0
Message 20803 - Posted: 11 Jul 2015, 22:59:42 UTC - in response to Message 20756.
Last modified: 11 Jul 2015, 23:08:14 UTC

To get the new version, do we reset and then re-add the project via the BOINC manager, or is there a more direct route?


Seconding this question: what is the easiest way to upgrade to the new application version? Do we have to reset the project, or is there a less destructive path? I just tried resetting the project to wipe out all of the existing 6.04 versions of the applications, to no avail. It simply wiped out the ~30 min of progress on my current WU and downloaded 6.04 again. I also tried completely removing the project from BOINC, with the same result. Any assistance is appreciated.

Anthony Ayiomamitis
Joined: 21 Jan 15
Posts: 48
Credit: 1,040,463,428
RAC: 10,071,377
Message 20804 - Posted: 11 Jul 2015, 23:35:50 UTC
Last modified: 11 Jul 2015, 23:39:47 UTC

Edit: due to a temporary connection problem, the message above was posted twice.

Anthony Ayiomamitis
Joined: 21 Jan 15
Posts: 48
Credit: 1,040,463,428
RAC: 10,071,377
Message 20805 - Posted: 11 Jul 2015, 23:36:32 UTC

I was fine with v6.04, but the world caved in with the sudden jump to v6.08. After running for a couple of seconds, I get the following type of error dump:


<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -16777217 (0xfeffffff)
</message>
<stderr_txt>
Collatz Conjecture v6.08 Windows x86_64 for OpenCL
Based on the AMD Brook+ kernels by Gipsel
Optimizations provided by Sosirus
Config: verbose=1 items_per_kernel=262144 kernels_per_reduction=8 threads=64 sleep=1
Platform: INTEL
Device: 000007FEF07C1300
OpenCL context created
OpenCL command queue created
OpenCL program created
OpenCL program copiled
OpenCL kernel created
Max Work Item Dimensions: 3
Max Work Item Size: 512 512 512
Max Work Group Size: 512
Max Kernel Work Group Size: 512
Init complete.
Allocate memory complete.
Device Vendor: Intel(R) Corporation
Address Bits: 64
Name: Intel(R) HD Graphics 4000
Driver Version: 8.15.10.2712
Device Version: OpenCL 1.1
Max Clock: 350
Compute Units: 16
Alignment: 1024
Constant Buffer: 1024
WU Name: mini_collatz_2398044770915275767808_103079215104
Start: 2398044770915275767808 Stop: 2398044771018354982912
Read checkpoint complete.
Using: verbose=1 items_per_kernel=262144 kernels_per_reduction=8 threads=64 sleep=1
At offset 0 got 268435455 from the GPU when expecting 291
Error: GPU steps do not match CPU steps. Workunit processing aborted.
Memory deallocation complete
02:25:10 (6664): called boinc_finish

</stderr_txt>
]]>


With V6.04, I did not use any configuration files and simply ran with the defaults. My set-up is as follows:

Core i7-3630QM CPU @ 2.40GHz (8 processors)
NVIDIA GeForce GTX 660M (2048MB) driver: 327.23 OpenCL: 1.01,
INTEL Intel(R) HD Graphics 4000 (1624MB) OpenCL: 1.01
Win 7 Pro 64-bit SP1

I mainly use the Intel HD 4000 GPU for processing Mini and Solo WUs.

Profile Richard Jablonski
Joined: 1 Jun 14
Posts: 2
Credit: 119,245,559
RAC: 0
Message 20815 - Posted: 13 Jul 2015, 16:43:08 UTC

Thank you. I had had several GPU failures.
