Optimizing Collatz Sieve

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20542 - Posted: 11 Jun 2015, 17:54:33 UTC
Last modified: 23 Sep 2015, 17:47:57 UTC

Edit the file C:\ProgramData\BOINC\projects\boinc.thesonntags.com_collatz\<app_name>.config

The default file is empty. You may add any of the following using the format:
<option>=<value>
For example:
verbose=1

Use one option per line. Options are case-sensitive. Unknown options will be ignored.

verbose - valid values are 1 or 0. 1 will result in more detail in the output. The default is 0.

kernels_per_reduction - this is the number of kernels that will be run before doing a reduction. Valid values are 1 to 64. The default is 32. Too high a number may cause a video driver crash or poor video response. Too low a number will slow down processing. Suggested values are between 8 and 48 depending upon the speed of the GPU.

threads - this is the exponent N of the local size 2^N (a.k.a. work group size or threads). The default is 6 (2^6, or 64 threads), which is also the minimum. The max value allowed is 11. A higher value results in more threads, but that means more registers being used. If too many registers are used, the GPU falls back to slower non-register memory. The goal is to use as many threads as possible, but not so many that processing slows down. AMD GPUs tend to work best with a value of 6 or 7 even though they can support values of up to 10 or 11. nVidia GPUs seem to work as well with higher values as with lower values.

lut_size - this is the size (as a power of 2) of the lookup table. Valid options are 2 to 31. Chances are that any value over 20 will cause the GPU driver to crash and processing to hang. The default is 10, which results in 2^10 or 1024 items. Each item uses 8 bytes, so 10 results in 2^10 * 8 bytes or 8192 bytes. Larger is better so long as the table fits in the GPU's L1/L2 cache. Once it exceeds the cache size, a WU will actually take longer to complete since the table has to be read from slower global memory rather than high-speed cached memory.
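
To make the lut_size arithmetic concrete, here is a minimal Python sketch that computes the lookup-table footprint and compares it with a cache size. The function names and the 512 KB cache figure are assumptions chosen for illustration; they are not part of the Collatz application, and the real cache size depends on the GPU.

```python
BYTES_PER_ITEM = 8  # each lookup-table item uses 8 bytes, per the description above


def lut_bytes(lut_size: int) -> int:
    """Return the lookup-table size in bytes for a given lut_size (a power of 2)."""
    if not 2 <= lut_size <= 31:
        raise ValueError("lut_size must be between 2 and 31")
    return (1 << lut_size) * BYTES_PER_ITEM


def fits_in_cache(lut_size: int, cache_bytes: int) -> bool:
    """True if the whole table fits in a cache of cache_bytes bytes."""
    return lut_bytes(lut_size) <= cache_bytes


if __name__ == "__main__":
    cache = 512 * 1024  # hypothetical cache size; look up the value for your own GPU
    for n in (10, 12, 16, 17, 20):
        print(f"lut_size={n}: {lut_bytes(n)} bytes, fits={fits_in_cache(n, cache)}")
```

This only checks whether the table fits; it says nothing about how much of the cache the kernel needs for other data, so treat the result as an upper bound to test from.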

sleep - this is the number of milliseconds to sleep while waiting for a kernel to complete. The default is 1. A higher value may result in less CPU utilization and improve video response, but it also may lengthen the processing time.

reduce_cpu - valid values are 1 or 0. Setting to 1 will result in more CPU utilization but may make the video more responsive. The default is 0 which will do the total steps summation and high steps comparison on the GPU. I have yet to find a reason to do the reduction on the CPU other than for testing the output of new versions.

sieve_size - valid values are 15-32. It controls both the size of the sieve used (2^15 through 2^32) and the items per kernel, as they are directly associated with the sieve size. A sieve size of 26 uses approx 1 million items per kernel. Each value higher roughly doubles the amount. Each value lower decreases the amount by about half. Too high a value will crash the video driver. (new in Collatz sieve version 1.07)

A sample config file looks like:

verbose=1
kernels_per_reduction=48
threads=6
lut_size=12
sleep=1
reduce_cpu=0
sieve_size=26
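
Since unknown options are silently ignored, a typo in the config file can go unnoticed. The following is a minimal Python sketch of a checker for the <option>=<value> format, using only the option names and ranges listed in this post. The function name check_config and the warning wording are assumptions for illustration; the real application may parse the file differently, and options introduced in other versions will be reported here as undocumented rather than as wrong.

```python
# Valid ranges taken from the option descriptions in this post.
# None as an upper bound means the post gives no maximum.
RANGES = {
    "verbose": (0, 1),
    "kernels_per_reduction": (1, 64),
    "threads": (6, 11),
    "lut_size": (2, 31),
    "sleep": (0, None),
    "reduce_cpu": (0, 1),
    "sieve_size": (15, 32),
}


def check_config(text):
    """Return (settings, warnings) for config text in <option>=<value> format."""
    settings, warnings = {}, []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        name, sep, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if not sep:
            warnings.append(f"line {lineno}: expected <option>=<value>")
            continue
        if name not in RANGES:  # option names are case-sensitive
            warnings.append(f"line {lineno}: '{name}' is not a documented option")
            continue
        try:
            number = int(value)
        except ValueError:
            warnings.append(f"line {lineno}: '{value}' is not an integer")
            continue
        low, high = RANGES[name]
        if number < low or (high is not None and number > high):
            warnings.append(f"line {lineno}: {name}={number} is out of range")
            continue
        settings[name] = number
    return settings, warnings


if __name__ == "__main__":
    sample = "verbose=1\nkernels_per_reduction=48\nthreads=6\nlut_size=12\n"
    print(check_config(sample))
```

A warning from this sketch only means the line does not match the list above, so it is worth double-checking against the option list for the app version you are running.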

Profile sosiris
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 20581 - Posted: 14 Jun 2015, 15:14:33 UTC - in response to Message 20542.

I'm testing this sieve App with my PC (win8.1x6, i5-4440, and AMD HD7850), and this config.


verbose=1
kernels_per_reduction=10
threads=7
lut_size=16
sleep=1
reduceCPU=1


95% GPU utilization, almost no video lag.

kernels_per_reduction: from what I tested, it affects GPU usage and video lag the most.
threads: in the profiler I didn't see much difference once the items per work-group is more than the wavefront size (64) of my HD7850.
lut_size: I chose 16 (65536 items in the lookup table) because it would fit into the L2 cache (512KB) of GCN devices. IMHO it could be 20 for NV GPUs, just like previous apps, because NV GPUs have better caching.
reduceCPU: I chose to do the reduction on the CPU because AMD OpenCL apps will take up a CPU core no matter what you do (aka 'busy waiting') and because I want better video response.

Just curious, where is items_per_kernel? Is it gone or is it a fixed value so we could not change?
____________
Sosiris, team BOINC@Taiwan

Profile Richard Jablonski
Joined: 1 Jun 14
Posts: 2
Credit: 119,245,559
RAC: 0
Message 20589 - Posted: 15 Jun 2015, 14:38:27 UTC

Stopped my sieve because it showed over 35,000 to completion and that is literally years. After 5 days it was 0.003% done.

zombie67 [MM]
Volunteer tester
Joined: 3 Jul 09
Posts: 156
Credit: 612,749,213
RAC: 158
Message 20881 - Posted: 24 Jul 2015, 5:11:56 UTC

I am using the following with my 7970s, and they are running at 99%. FWIW, I reserve a full thread per GPU.

verbose=0
kernels_per_reduction=9
threads=9
lut_size=12
sleep=1
reduceCPU=0
____________
Dublin, California
Team: SETI.USA

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20907 - Posted: 27 Jul 2015, 19:00:18 UTC - in response to Message 20581.

Just curious, where is items_per_kernel? Is it gone or is it a fixed value so we could not change?


It is currently fixed but I need to change that since it is too many items for some GPUs and causes them to run out of resources and error out almost immediately.

[AF>Amis des Lapins] Jean-Luc
Joined: 11 Jun 10
Posts: 6
Credit: 2,565,041,701
RAC: 56,930
Message 20915 - Posted: 28 Jul 2015, 17:51:38 UTC - in response to Message 20907.

Hello,

I come from "Alliance Francophone".
Sorry for my bad english.
I have an AMD R9 290X.
I tried different values, but it is not as good as before with Large_6.04.

For example:

verbose=1
kernels_per_reduction=16
threads=11
lut_size=18
sleep=1
reduceCPU=1

Does anybody have values for R9 290X ?

Thanks.

Jean-Luc

TUKIA
Joined: 19 Jun 12
Posts: 9
Credit: 5,058,573,451
RAC: 0
Message 20917 - Posted: 29 Jul 2015, 5:55:20 UTC - in response to Message 20915.
Last modified: 29 Jul 2015, 6:33:16 UTC

Try these, I have good results with my R9 290X and Large_6.08

verbose=1
items_per_kernel=22
kernels_per_reduction=9
threads=10
lut_size=17
sleep=1

[AF>Amis des Lapins] Jean-Luc
Joined: 11 Jun 10
Posts: 6
Credit: 2,565,041,701
RAC: 56,930
Message 20924 - Posted: 29 Jul 2015, 14:52:55 UTC - in response to Message 20917.

OK TUKIA, thank you for your help.
I tried a lot of combinations.
For me, the best values are :

verbose=1
items_per_kernel=22
kernels_per_reduction=12
threads=8
lut_size=16
sleep=1

Impossible to do better.
But this is still not as good as before, with the 6.04 WUs!

Jean-Luc

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20994 - Posted: 13 Aug 2015, 17:12:07 UTC

Each kernel will check 2^sieve_size numbers / 4. For each sieve_size, the number of items that remain after the sieve is:

15=1295
16=2114
17=4228
18=7495
19=14990
20=27328
21=46611
22=93222
23=168807
24=286581
25=573162
26=1037374
27=1762293
28=3524586
29=6385637
30=12771274
31=23642078
32=41347483

Note: The numbers checked are padded with zeros in order to be divisible by the number of threads.
For example, using a sieve size of 26 would result in 2^26 numbers checked. The sieve allows it to skip all but 1037374 of those numbers. If using 64 threads (threads=6, which results in 2^6 threads), it would have to check 1037376 numbers, which is the next highest number divisible by 64. If using 512 threads (threads=9), it would need to check 1037824 numbers, or 448 more numbers per kernel call than if using 64 threads. If, and only if, using 512 threads keeps the GPU better occupied, it will actually run faster.
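
The padding rule in the note can be sketched in a few lines of Python. This is only an illustration of rounding up to the next multiple of the thread count; the function name padded_items is an assumption and the actual kernel launch code is not shown in this thread.

```python
def padded_items(items: int, threads_exp: int) -> int:
    """Round items up to the next multiple of 2**threads_exp."""
    threads = 1 << threads_exp
    # Ceiling division using integers only, then scale back up.
    return -(-items // threads) * threads


if __name__ == "__main__":
    items = 1037374  # items left by the sieve for sieve_size=26 (from the table above)
    for exp in (6, 9):
        total = padded_items(items, exp)
        print(f"threads={exp} ({1 << exp} threads): {total} items, {total - items} padded")
```

The padding is always smaller than the thread count, so even at threads=11 it is tiny compared with the roughly one million items per kernel; occupancy is what decides which setting is faster.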

Profile Crunch3r
Volunteer moderator
Project developer
Project tester
Joined: 30 Jun 09
Posts: 219
Credit: 7,515,635,101
RAC: 12,688
Message 21013 - Posted: 14 Aug 2015, 17:11:47 UTC - in response to Message 20994.
Last modified: 14 Aug 2015, 17:31:28 UTC

fwiw,

1.07 is a major blow performance-wise (40% performance drop compared to 1.04).
Before the release I was able to circumvent the degraded performance of 1.05 by using the 1.04 ocl app (nvidia) instead.

However it seems that the old (and way faster 1.04) app can't handle new WUs, at least for me they all error out.

It seems to be impossible to get 1.07 up to the performance of 1.04... I've tried the old config parameters and it's not even close.

The new sieve_size parameter doesn't help either.

Besides all that, is there a specific reason why NVIDIA gpus seem to be performing way better using the sieve app compared to AMD ?
____________

Team BOINC United.Join Science that matters.

TUKIA
Joined: 19 Jun 12
Posts: 9
Credit: 5,058,573,451
RAC: 0
Message 21278 - Posted: 12 Sep 2015, 8:42:38 UTC
Last modified: 12 Sep 2015, 9:06:58 UTC

Sieve v1.21 is slowing down. First the WU needs more and more processing time, then the start of a new WU after the end of the old one takes as long as 26-32 seconds. This has already happened on three PCs.

Rymorea
Joined: 14 Oct 14
Posts: 100
Credit: 200,411,819
RAC: 5
Message 21279 - Posted: 12 Sep 2015, 18:15:46 UTC

I noticed something interesting when cleaning up temp files. The sieve app creates OCL****.tmp files for each WU. I deleted more than ten thousand OCL*.tmp files from the temp directory. I didn't see these temp files before. Is this normal or not?
____________
Seti@home Classic account User ID 955 member since 8 Sep 1999 classic CPU time 539,770 hours

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 21280 - Posted: 13 Sep 2015, 4:29:29 UTC - in response to Message 21278.

Sieve v1.21 is slowing down. First the WU needs more and more processing time, then the start of a new WU after the end of the old one takes as long as 26-32 seconds. This has already happened on three PCs.


I'm trying to determine whether caching the sieve is working or not. It should be present in the project directory between WUs. It takes a few seconds to create the sieve file (the larger the sieve, the longer it takes), so if it isn't caching the file, it could add to the completion time since the percent done isn't reported until after the sieve is created.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 21329 - Posted: 23 Sep 2015, 18:12:47 UTC - in response to Message 21279.

I noticed something interesting when cleaning up temp files. The sieve app creates OCL****.tmp files for each WU. I deleted more than ten thousand OCL*.tmp files from the temp directory. I didn't see these temp files before. Is this normal or not?


The only files the Collatz application creates are:

stderr.txt - what you see when you look at a result on the web site

out.txt - the file uploaded to the server with the results

sieveXX.bin - the sieve file where XX is the sieve size; if cache_sieve=0 then it is deleted when the workunit is completed

So, the files are created either by BOINC (unlikely) or possibly by the AMD video driver when it compiles the OpenCL kernel. These files do not exist on my laptop, which has an nVidia GPU, so my assumption is that the files are from the AMD video driver.

Profile nenym
Joined: 21 Jul 09
Posts: 11
Credit: 778,839,848
RAC: 208,919
Message 21333 - Posted: 23 Sep 2015, 22:55:01 UTC

The problem is tons of ATI/AMD OCL temporary files (size 0 bytes).
I suggest running EraseOCL.bat via the scheduler every 60 minutes (if crunching MW or Collatz sieving). For other OCL apps, every four hours is enough.

erase C:\Users\<user_name>\AppData\Local\Temp\OCL*.tmp

"C:" is the drive where the system is installed. Drivers newer than 13.9 produce this garbage.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 21334 - Posted: 24 Sep 2015, 4:39:10 UTC - in response to Message 21333.

The problem is tons of ATI/AMD OCL temporary files (size 0 bytes).
I suggest running EraseOCL.bat via the scheduler every 60 minutes (if crunching MW or Collatz sieving). For other OCL apps, every four hours is enough.
erase C:\Users\<user_name>\AppData\Local\Temp\OCL*.tmp

"C:" is the drive where the system is installed. Drivers newer than 13.9 produce this garbage.


Or, if you want the batch file to work for any user, use the command:

del %temp%\OCL*.tmp
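
For anyone who would rather not delete blindly, here is a minimal Python sketch of the same cleanup. It is an alternative to the batch command, not something shipped with the project: the function name is hypothetical, it assumes Python is installed, and it deliberately removes only zero-byte OCL*.tmp files (the size reported earlier in this thread), with a dry-run mode that only lists them.

```python
import glob
import os
import tempfile


def remove_empty_ocl_temp_files(temp_dir=None, dry_run=True):
    """Find zero-byte OCL*.tmp files in temp_dir; delete them unless dry_run is True.

    Returns the list of matching paths. temp_dir defaults to the current
    user's temp directory as reported by tempfile.gettempdir().
    """
    temp_dir = temp_dir or tempfile.gettempdir()
    matched = []
    for path in sorted(glob.glob(os.path.join(temp_dir, "OCL*.tmp"))):
        try:
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                if not dry_run:
                    os.remove(path)
                matched.append(path)
        except OSError:
            # The file may be in use or already gone; skip it and move on.
            continue
    return matched


if __name__ == "__main__":
    found = remove_empty_ocl_temp_files(dry_run=True)
    print(f"{len(found)} empty OCL*.tmp files found (dry run, nothing deleted)")
```

Call it with dry_run=False to actually delete. Skipping non-empty files is a cautious design choice in case one is still being written.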

McShane of TSBT
Joined: 17 Jun 09
Posts: 7
Credit: 5,293,817,140
RAC: 291,969
Message 21345 - Posted: 27 Sep 2015, 6:01:47 UTC

I wonder if it is just my setup or there is a shortage of Nvidia WU's?

Knobi
Joined: 21 Jul 15
Posts: 4
Credit: 291,298,515
RAC: 0
Message 21372 - Posted: 30 Sep 2015, 18:14:42 UTC - in response to Message 21345.

Hey Guys,

take it to the edge...Fury NANO ! 64 sec Sieve_Runtime ! Have fun ...

verbose=1
kernels_per_reduction=48
threads=8
lut_size=17
sleep=0
cache_sieve=1
reduce_cpu=0
sieve_size=30

Sincerely Knobi

Knobi
Joined: 21 Jul 15
Posts: 4
Credit: 291,298,515
RAC: 0
Message 21373 - Posted: 30 Sep 2015, 18:31:47 UTC

Slicker give me a small Hint !


collatz_sieve_1.21_windows_intelx86__opencl_amd_gpu
<core_client_version>7.6.9</core_client_version>
<![CDATA[
<stderr_txt>
Collatz Conjecture Sieve 1.21 Windows i686 for OpenCL
Written by Slicker (Jon Sonntag) of team SETI.USA
Based on the AMD Brook+ kernels by Gipsel of team Planet 3DNow!
Sieve code and OpenCL optimization provided by Sosiris of team BOINC@Taiwan
Collatz Config Settings:
verbose 1 (yes)
kernels/reduction 48
threads 2^8 (256)
lut_size 17 (1048576 bytes)
sieve_size 2^30 (51085096 bytes)
sleep 1
cache_sieve 1 (yes)
reducecpu 0 (no)
Platform ADVANCED MICRO DEVICES
Device 007B85A8
Max Dimensions 3
Max Work Items 256 256 256
Max Work Groups 256
Max Kernel Threads 256
Device Vendor Advanced Micro Devices, Inc.
Name Fiji
Driver Version 1800.11 (VM)
OpenCL Version OpenCL 1.2 AMD-APP (1800.11)
actual threads 256
Start 2404600162929176739840
Stop 2404600169526246506496
Best 2404600164423821990939
Highest steps 1798
Total steps 44876264478933
Average steps 571
CPU time 49.6406 seconds
Elapsed time 52.7782seconds
20:18:11 (7000): called boinc_finish

</stderr_txt>
]]>

collatz_sieve_1.21_windows_x86_64__opencl_amd_gpu

<core_client_version>7.6.9</core_client_version>
<![CDATA[
<stderr_txt>
Collatz Conjecture Sieve 1.21 Windows x86_64 for OpenCL
Written by Slicker (Jon Sonntag) of team SETI.USA
Based on the AMD Brook+ kernels by Gipsel of team Planet 3DNow!
Sieve code and OpenCL optimization provided by Sosiris of team BOINC@Taiwan
Collatz Config Settings:
verbose 1 (yes)
kernels/reduction 48
threads 2^8 (256)
lut_size 17 (1048576 bytes)
sieve_size 2^30 (51085096 bytes)
sleep 1
cache_sieve 1 (yes)
reducecpu 0 (no)
Platform ADVANCED MICRO DEVICES
Device 000000A68BFB0AD0
Max Dimensions 3
Max Work Items 256 256 256
Max Work Groups 256
Max Kernel Threads 256
Device Vendor Advanced Micro Devices, Inc.
Name Fiji
Driver Version 1800.11 (VM)
OpenCL Version OpenCL 2.0 AMD-APP (1800.11)
actual threads 256
Start 2404588360771364192256
Stop 2404588367368433958912
Best 2404588367095947113199
Highest steps 1860
Total steps 45504160010639
Average steps 579
CPU time 59.1563 seconds
Elapsed time 62.0799seconds
20:09:42 (6408): called boinc_finish

</stderr_txt>
]]>

Can I choose the faster version?

Thx 4 reply Knobi
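
To compare two results like the ones above on equal footing, one option is to divide the range covered (Stop minus Start) by the elapsed time. The Python sketch below does that for the Start, Stop and Elapsed time lines copied from the two logs. The function names are assumptions for illustration, and a single pair of WUs is a small sample, so this is a rough comparison rather than a benchmark.

```python
import re


def parse_stderr(text):
    """Pull Start, Stop and Elapsed time out of a Collatz sieve stderr block."""
    def find(pattern):
        match = re.search(pattern, text, re.MULTILINE)
        if match is None:
            raise ValueError(f"pattern not found: {pattern}")
        return match.group(1)

    return {
        "start": int(find(r"^\s*Start\s+(\d+)")),
        "stop": int(find(r"^\s*Stop\s+(\d+)")),
        # The log prints e.g. "52.7782seconds" with no space before the unit.
        "elapsed": float(find(r"^\s*Elapsed time\s+(\d+(?:\.\d+)?)\s*seconds")),
    }


def numbers_per_second(fields):
    """Size of the range covered divided by the elapsed wall-clock time."""
    return (fields["stop"] - fields["start"]) / fields["elapsed"]


if __name__ == "__main__":
    logs = {
        "intelx86": "Start 2404600162929176739840\n"
                    "Stop 2404600169526246506496\n"
                    "Elapsed time 52.7782seconds\n",
        "x86_64": "Start 2404588360771364192256\n"
                  "Stop 2404588367368433958912\n"
                  "Elapsed time 62.0799seconds\n",
    }
    for label, log in logs.items():
        fields = parse_stderr(log)
        width = fields["stop"] - fields["start"]
        print(f"{label}: range {width}, {numbers_per_second(fields):.3e} numbers/s")
```

Python integers are arbitrary precision, so the 22-digit Start and Stop values are subtracted exactly before the division.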

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 21377 - Posted: 30 Sep 2015, 19:26:18 UTC - in response to Message 21345.

I wonder if it is just my setup or there is a shortage of Nvidia WU's?


I adjusted the shared mem and the WU cache so it should be getting better. Let me know if it isn't.

Copyright © 2018 Jon Sonntag; All rights reserved.