Optimizing Collatz Sieve

Message boards : Number crunching : Optimizing Collatz Sieve

ExtraTerrestrial Apes
Joined: 22 Aug 09
Posts: 56
Credit: 262,212,529
RAC: 207,115
Message 22326 - Posted: 1 May 2016, 21:18:48 UTC

Can the app be tested in a stand-alone mode, i.e. with a small yet representative task and without BOINC? This might be neat to find the optimal parameters automatically.

MrS
____________
Scanning for our furry friends since Jan 2002

Anthony Ayiomamitis
Joined: 21 Jan 15
Posts: 48
Credit: 1,041,331,091
RAC: 10,072,437
Message 22333 - Posted: 2 May 2016, 22:45:34 UTC - in response to Message 22317.
Last modified: 2 May 2016, 22:46:58 UTC

I.e. for my HD530 that's 524 kB, an unusually large size for such a relatively weak GPU. Increasing the lut size showed very nice performance gains (I'm approaching 2x speed-up in my testing now) over the default setting.

ExtraTerrestrial,

Can you share the complete contents of your config file for the HD 530? It would be very much appreciated.

Also, what are your typical run times now?

ExtraTerrestrial Apes
Joined: 22 Aug 09
Posts: 56
Credit: 262,212,529
RAC: 207,115
Message 22339 - Posted: 3 May 2016, 20:41:57 UTC - in response to Message 22333.

I'm still testing, but can say so far that increasing the LUT size improved the throughput the most, by far. I'm currently working with a value of 20, i.e. an 8 MB Look-Up Table. I could probably improve things further by going to larger numbers, but would not generally recommend doing so:

I'm using DDR4-3200 in dual channel, i.e. pretty fast memory. When increasing the LUT size I can see the power consumption of my DRAM rising from 1.7 W to around 3.7 W (HWiNFO64, for example, shows this). I know that if I run SETI and reach over 4 W, the performance of other tasks suffers, which I want to avoid. My i3 only has 3 MB of L3 cache, i.e. an 8 MB LUT exceeds the cache by a significant amount, hence the increased main memory access. On a fast GPU this should reduce performance, but apparently it's faster for the HD530 to access main memory than to recompute the values stored in the LUT. With an i5 or i7 you may want to go higher; with slower main memory or other demanding tasks you may want to reduce it.
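
As a rough illustration of the cache argument above, this Python sketch converts a lut_size exponent into a table size and compares it with a cache size. The 8 bytes per entry is an assumption inferred from the figures in this thread (lut_size=20 corresponding to 8 MB); the helper names are made up for this example and are not part of the Collatz app.

```python
# Assumption: each LUT entry occupies 8 bytes, inferred from the thread
# (lut_size=20 is described as an 8 MB table). Not confirmed by the app.
BYTES_PER_ENTRY = 8


def lut_bytes(lut_size: int) -> int:
    """Size in bytes of a table with 2**lut_size entries."""
    return (1 << lut_size) * BYTES_PER_ENTRY


def fits_in_cache(lut_size: int, cache_bytes: int) -> bool:
    """True if the whole table fits into a cache of the given size."""
    return lut_bytes(lut_size) <= cache_bytes


L3_CACHE = 3 * 1024 * 1024  # 3 MB L3 cache of the i3 discussed above

for n in (18, 19, 20):
    mib = lut_bytes(n) / (1024 * 1024)
    print(f"lut_size={n}: {mib:.0f} MiB, fits in L3: {fits_in_cache(n, L3_CACHE)}")
```

Under that assumption only lut_size=18 fits into a 3 MB cache, which matches the observation that larger values cause extra main memory traffic.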

Apart from that I can add a word of caution towards changing the sieve size: increasing it reduced my runtimes significantly, but reduced the credits as well. The "seconds per credit" remain almost constant, with the default (26) performing the best in my case.
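
The "seconds per credit" comparison can be sketched as follows. This is only a minimal illustration: the runtimes and credits in the table are hypothetical placeholders, not measurements from this thread, and the function name is invented for the example.

```python
def seconds_per_credit(runtime_s: float, credit: float) -> float:
    """Lower is better: how many seconds of GPU time one credit costs."""
    return runtime_s / credit


# HYPOTHETICAL sample values (sieve_size -> (runtime in s, granted credit)).
# Replace them with your own task results; they are not from the thread.
samples = {
    26: (1800.0, 900.0),
    28: (1200.0, 590.0),
}

for sieve_size, (runtime_s, credit) in samples.items():
    print(f"sieve_size={sieve_size}: {seconds_per_credit(runtime_s, credit):.3f} s/credit")

# Pick the setting with the lowest cost per credit.
best = min(samples, key=lambda k: seconds_per_credit(*samples[k]))
print("best sieve_size:", best)
```

The point of the metric is that a shorter runtime alone says nothing if the granted credit shrinks along with it.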

MrS
____________
Scanning for our furry friends since Jan 2002

Anthony Ayiomamitis
Joined: 21 Jan 15
Posts: 48
Credit: 1,041,331,091
RAC: 10,072,437
Message 22340 - Posted: 4 May 2016, 6:34:19 UTC - in response to Message 22339.
Last modified: 4 May 2016, 6:35:12 UTC

I also have an i3 (6300, 3.8 GHz) processor with Crucial Ballistix Tactical (2x4 GB) DDR4-2666 memory, which is dual channel, dual rank.

ExtraTerrestrial Apes
Joined: 22 Aug 09
Posts: 56
Credit: 262,212,529
RAC: 207,115
Message 22343 - Posted: 4 May 2016, 19:33:11 UTC - in response to Message 22340.

Then go for this config :)

lut_size=20

... and maybe higher in a few days, if you want to test it. In the worst case CC and other BOINC projects would become slower.

MrS
____________
Scanning for our furry friends since Jan 2002

Anthony Ayiomamitis
Joined: 21 Jan 15
Posts: 48
Credit: 1,041,331,091
RAC: 10,072,437
Message 22344 - Posted: 5 May 2016, 0:12:52 UTC - in response to Message 22343.
Last modified: 5 May 2016, 0:13:10 UTC

Will do, thanks.

Other projects are not an issue since I generally have one project at a time running on any one system.

ExtraTerrestrial Apes
Joined: 22 Aug 09
Posts: 56
Credit: 262,212,529
RAC: 207,115
Message 22347 - Posted: 5 May 2016, 11:51:52 UTC - in response to Message 22344.

I've checked the impact on the GPU utilization of my GTX970 running POEM:

without CC: 92.8%
LUT 18 (2 MB): 92.8%
LUT 19 (4 MB): 91.8%
LUT 20 (8 MB): 87.5%

This makes a lot of sense since my i3 has 3 MB of L3 cache: the 2 MB LUT still fits, the larger ones do not. With LUT=18 my iGPU would reach 211k RAC (if it runs CC 24/7), whereas with LUT=20 it would reach 232k RAC. I'm going for the slower config with LUT=18 now because for me the work POEM does is worth (a lot) more than CC.
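
The trade-off can be put into numbers with a few lines of Python. This is plain arithmetic on the figures quoted in this post; how much weight each project deserves is of course a personal choice.

```python
def relative_change_percent(old: float, new: float) -> float:
    """Change from old to new, in percent of old."""
    return (new - old) / old * 100.0


# Figures quoted in this post.
rac_gain = relative_change_percent(211_000, 232_000)  # iGPU RAC, LUT 18 -> 20
util_loss = relative_change_percent(92.8, 87.5)       # GTX970 utilization, LUT 18 -> 20

print(f"iGPU RAC gain with LUT=20:      {rac_gain:+.1f}%")
print(f"GTX970 utilization with LUT=20: {util_loss:+.1f}%")
```

So LUT=20 buys roughly 10% more Collatz output on the iGPU at the cost of close to 6% of the discrete GPU's utilization.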

And generally both are respectable numbers, especially considering that the iGPU needs less than 10 W for that (I have mine running at reduced voltage)!

MrS
____________
Scanning for our furry friends since Jan 2002

HAL9000
Joined: 19 Nov 09
Posts: 15
Credit: 104,993,705
RAC: 0
Message 22351 - Posted: 6 May 2016, 1:25:42 UTC - in response to Message 22317.
Last modified: 6 May 2016, 1:27:03 UTC

Slicker, would it make sense to read some parameters of the OpenCL device upon app startup to set it optimally? Building upon HAL's comment those could be:

threads:
Should match the GPU's max work group size from clinfo. 2^7=128, 2^8=256, 2^9=512

and lut_size:
clinfo can read out the amount of L2 cache, so setting something that fits in there is better than setting too small a value just to be safe for any GPU. I.e. for my HD530 that's 524 kB, an unusually large size for such a relatively weak GPU. Increasing the lut size showed very nice performance gains (I'm approaching 2x speed-up in my testing now) over the default setting.

You could easily support a manual override of this as well: "if the user has set anything in the .config file, use this value instead".

MrS
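
The threads setting is an exponent of two (the app log later in this thread prints, for example, "threads 2^5 (32)"), so matching it to a work group size is a base-2 logarithm. A minimal Python sketch, with a made-up function name, assuming the work group size is read manually from clinfo:

```python
def threads_exponent(max_work_group_size: int) -> int:
    """Largest n with 2**n <= max_work_group_size (floor of log2)."""
    if max_work_group_size < 1:
        raise ValueError("work group size must be positive")
    return max_work_group_size.bit_length() - 1


for size in (128, 256, 512):
    print(f"max work group size {size} -> threads={threads_exponent(size)}")
```

Whether the maximum work group size is actually the fastest choice on a given GPU still has to be measured.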

Which clinfo value are you reading to determine the GPU cache size?
For my R9 390X clinfo shows Cache size: 16384. However it has 16 64 KB blocks of L2 cache, or a total of 1024 KB of L2 cache.
For my HD6870 clinfo shows Cache size: 0. However it has 4 128KB blocks of L2 cache or a total of 512KB of L2 cache.

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 22354 - Posted: 6 May 2016, 11:37:14 UTC - in response to Message 22343.

Then go for this config :)
lut_size=20

... and maybe higher in a few days, if you want to test it. In the worst case CC and other BOINC projects would become slower.

MrS


Thank you for sharing this!
For me (i5-6500 with 2 x 8 GB dual-channel DDR4-2133), setting lut_size to 20 meant that run times went from approx. 2800 seconds to less than 1800 seconds.
Did you make any other changes besides lut_size? (I can't check your results as your computers are hidden.)

Thanks,

Tom

ExtraTerrestrial Apes
Joined: 22 Aug 09
Posts: 56
Credit: 262,212,529
RAC: 207,115
Message 22371 - Posted: 9 May 2016, 21:19:40 UTC - in response to Message 22354.
Last modified: 9 May 2016, 21:22:29 UTC

I have now settled on:

threads=4
kernels_per_reduction=64
lut_size=18
sieve_size=26

I hesitate to recommend these values generally, though, as I don't have my display attached to the iGPU and am not measuring responsiveness, so this config may lag like hell.

I tested all other parameters and they did not make any statistically significant difference in these ranges:

threads=4..6: almost identical, 7+ slower
kernels_per_reduction=32..64: doesn't matter, did not test lower
sieve_size=25..29: no difference, slight tendency towards worse results at 30

The threads showed the biggest response, so I should also check lower values.

MrS
____________
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Joined: 22 Aug 09
Posts: 56
Credit: 262,212,529
RAC: 207,115
Message 22372 - Posted: 9 May 2016, 21:21:53 UTC - in response to Message 22351.

Which clinfo value are you reading to determine the GPU cache size?
For my R9 390X clinfo shows Cache size: 16384. However it has 16 64 KB blocks of L2 cache, or a total of 1024 KB of L2 cache.
For my HD6870 clinfo shows Cache size: 0. However it has 4 128KB blocks of L2 cache or a total of 512KB of L2 cache.

Ouch! Apparently I was a bit naive, thinking that OpenCL would provide means to reliably check which hardware it is running on (to optimize for this at run-time). Or would there be other means? I have no experience programming OpenCL.

MrS
____________
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Joined: 22 Aug 09
Posts: 56
Credit: 262,212,529
RAC: 207,115
Message 22431 - Posted: 22 May 2016, 11:28:14 UTC - in response to Message 22354.
Last modified: 22 May 2016, 11:55:54 UTC

Update: reducing the number of threads improved performance on my HD530. Surprisingly the optimum is 1, whereas AMD and nVidia prefer much higher values. The difference is small, i.e. ~3% going from 4 to 1 thread, but it's been consistent in my measurements nevertheless. Finally I would recommend these values:

Core i3 and lower

threads=1
kernels_per_reduction=64
lut_size=18
sieve_size=26


Core i5
threads=1
kernels_per_reduction=64
lut_size=19
sieve_size=26


Core i7
threads=1
kernels_per_reduction=64
lut_size=20
sieve_size=26


CPUs with Crystal Well (Iris Pro) may be able to profit further from far higher lut_size values. And again, I didn't test the screen responsiveness, but decreasing the number of threads should not have made it any worse.
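
For anyone who wants to generate these settings rather than type them, here is a small Python sketch that renders the three recommendations above as key=value text. The function and dictionary names are invented for this example, and the config file name and location are deliberately left out because they are not stated in this part of the thread. Note also that a later post in this thread points out that very small threads values are not honored by the app.

```python
# lut_size per CPU class, exactly as recommended in the post above.
RECOMMENDED_LUT_SIZE = {"i3": 18, "i5": 19, "i7": 20}


def render_config(cpu_class: str) -> str:
    """Return the recommended settings as key=value lines."""
    settings = {
        "threads": 1,
        "kernels_per_reduction": 64,
        "lut_size": RECOMMENDED_LUT_SIZE[cpu_class],
        "sieve_size": 26,
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


print(render_config("i5"))
```

The output can be pasted into the app's config file; an unknown CPU class raises a KeyError instead of silently guessing a value.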
____________
Scanning for our furry friends since Jan 2002

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 22442 - Posted: 23 May 2016, 18:32:36 UTC - in response to Message 22431.

Thanks for the info.
On my i5-6500 lut_size=20 seems to work better than lut_size=19.
What exactly does threads=1 do ? I see no difference in the output file:

threads 2^5 (32)
actual threads 32

My config file:

verbose=1
kernels_per_reduction=64
lut_size=20
sieve_size=26
threads=1

With threads=1 I would expect to see:

threads 2^1 (2)
actual threads 2

Am I missing something here ?
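
This kind of mismatch can be checked automatically. The Python sketch below parses the config text and the two log lines quoted above and compares the requested threads exponent with the one the app reports. The parsing rules are assumptions based only on the lines shown in this post; the real output file may contain other formats.

```python
import re
from typing import Dict, Optional


def parse_config(text: str) -> Dict[str, int]:
    """Parse key=value lines into a dict, skipping anything else."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = int(value)
    return config


def reported_threads_exponent(log_text: str) -> Optional[int]:
    """Extract n from a line like 'threads 2^n (m)', or None if absent."""
    match = re.search(r"^threads 2\^(\d+) \((\d+)\)", log_text, re.MULTILINE)
    return int(match.group(1)) if match else None


config_text = "verbose=1\nkernels_per_reduction=64\nlut_size=20\nsieve_size=26\nthreads=1\n"
log_text = "threads 2^5 (32)\nactual threads 32\n"

requested = parse_config(config_text)["threads"]
reported = reported_threads_exponent(log_text)
if requested != reported:
    print(f"requested threads={requested}, but the app reports 2^{reported}")
```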

Thanks,

Tom

ExtraTerrestrial Apes
Joined: 22 Aug 09
Posts: 56
Credit: 262,212,529
RAC: 207,115
Message 22443 - Posted: 23 May 2016, 20:10:13 UTC - in response to Message 22442.

Good catch! In the 1st post it's written that 6 is actually the minimum for threads. And it makes sense that too small values should not be allowed, since we've got vector ALUs and not scalar ones. I wrote that the difference between those thread values was quite small, 3% at most, so maybe I simply didn't average over enough WUs.

Regarding lut_size: yes, values 1 to 2 higher than what I wrote perform better. But you should see increased DRAM power consumption with that (doesn't matter much) and, more importantly, increased memory bandwidth consumption, which probably slows down your other tasks. Hence I would not generally recommend this, especially if your CPU is also feeding a fast discrete GPU.
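
A simple way to judge whether a ~3% difference is real is to compare the gap between two mean runtimes with their combined standard error. The Python sketch below does that; the runtimes are hypothetical placeholders, not data from this thread, and the "2 standard errors" threshold is just a common rule of thumb, not a rigorous test.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple


def mean_and_stderr(runtimes: List[float]) -> Tuple[float, float]:
    """Mean runtime and standard error of that mean."""
    return mean(runtimes), stdev(runtimes) / sqrt(len(runtimes))


def clearly_different(a: List[float], b: List[float], k: float = 2.0) -> bool:
    """True if the means differ by more than k combined standard errors."""
    mean_a, se_a = mean_and_stderr(a)
    mean_b, se_b = mean_and_stderr(b)
    return abs(mean_a - mean_b) > k * sqrt(se_a ** 2 + se_b ** 2)


# HYPOTHETICAL runtimes in seconds for two settings (not from the thread).
setting_a = [1800.0, 1850.0, 1760.0, 1820.0]
setting_b = [1750.0, 1830.0, 1790.0, 1770.0]

print("difference is clear:", clearly_different(setting_a, setting_b))
```

With only four work units per setting and this much scatter, a gap of about 1% cannot be told apart from noise, which is why more WUs are needed before trusting small differences.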

MrS
____________
Scanning for our furry friends since Jan 2002

BetelgeuseFive
Joined: 14 Nov 09
Posts: 26
Credit: 3,052,082
RAC: 1
Message 22445 - Posted: 24 May 2016, 5:41:24 UTC - in response to Message 22443.

Good catch! In the 1st post it's written that 6 is actually the minimum for threads. And it makes sense that too small values should not be allowed, since we've got vector ALUs and not scalar ones. I wrote that the difference between those thread values was quite small, 3% at most, so maybe I simply didn't average over enough WUs.

Regarding lut_size: yes, values 1 to 2 higher than what I wrote perform better. But you should see increased DRAM power consumption with that (doesn't matter much) and, more importantly, increased memory bandwidth consumption, which probably slows down your other tasks. Hence I would not generally recommend this, especially if your CPU is also feeding a fast discrete GPU.

MrS


Thanks for the extra info. I am now back to lut_size=19.

Tom

vdvogt
Joined: 10 Jan 16
Posts: 38
Credit: 1,090,698,551
RAC: 0
Message 22618 - Posted: 23 Jun 2016, 22:57:30 UTC

Hi,
I just discovered this optimisation thread.

I have a dual-GPU graphics card and unfortunately only one GPU is calculating tasks for Collatz. The second is idle.

Are any of these settings responsible for this phenomenon?

Could or should I change some for a better calculation?
Would this bring up the second GPU to calculation?

regards
Veit

nanoprobe
Joined: 21 Jun 16
Posts: 4
Credit: 30,870,695
RAC: 0
Message 22623 - Posted: 24 Jun 2016, 21:39:56 UTC - in response to Message 22618.

Hi,
I just discovered this optimisation thread.

I have a dual-GPU graphics card and unfortunately only one GPU is calculating tasks for Collatz. The second is idle.

Are any of these settings responsible for this phenomenon?

Could or should I change some for a better calculation?
Would this bring up the second GPU to calculation?

regards
Veit

Try adding this to your cc_config.xml file and then restart BOINC.

<use_all_gpus>1</use_all_gpus>

If the file has only log flags you'll also need to add the "options" flag. Your file would then look like this for the use_all_gpus option to work.
<cc_config>
<log_flags>
[ ... ]
</log_flags>
<options>
<use_all_gpus>1</use_all_gpus>
</options>
</cc_config>
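
To double-check that the option really ended up inside the options section, the file can be parsed with Python's standard library. This is only a sketch: the sample XML below mirrors the structure shown above with the log flags left empty, and the function name is invented for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<cc_config>
<log_flags>
</log_flags>
<options>
<use_all_gpus>1</use_all_gpus>
</options>
</cc_config>"""


def uses_all_gpus(xml_text: str) -> bool:
    """True if <options><use_all_gpus> is present and set to 1."""
    root = ET.fromstring(xml_text)  # root element is <cc_config>
    node = root.find("options/use_all_gpus")
    return node is not None and (node.text or "").strip() == "1"


print(uses_all_gpus(SAMPLE))
```

A tag placed outside the options section (for example directly under cc_config) makes the function return False, which is the mistake this check is meant to catch.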

vdvogt
Joined: 10 Jan 16
Posts: 38
Credit: 1,090,698,551
RAC: 0
Message 22627 - Posted: 25 Jun 2016, 10:27:47 UTC - in response to Message 22623.

Hi,
this is already my cc_config.xml

<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
<coproc_debug>1</coproc_debug>
</log_flags>
<options>
<use_all_gpus>1</use_all_gpus>
</options>
</cc_config>

All your suggestions are in.

regards
Veit

nanoprobe
Joined: 21 Jun 16
Posts: 4
Credit: 30,870,695
RAC: 0
Message 22636 - Posted: 27 Jun 2016, 19:00:49 UTC - in response to Message 22627.
Last modified: 27 Jun 2016, 19:04:41 UTC

Hi,
this is already my cc_config.xml

<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
<coproc_debug>1</coproc_debug>
</log_flags>
<options>
<use_all_gpus>1</use_all_gpus>
</options>
</cc_config>

All your suggestions are in.

regards
Veit

If you're running CPU tasks alongside the GPU tasks, try suspending the CPU tasks and see if the second GPU starts crunching. Also, I wonder about some of the log flags you're using, especially the <task>1</task> one. What is that for? I never use any except the debug flag when there is a problem, but that's just me.

arkayn
Volunteer tester
Joined: 30 Aug 09
Posts: 219
Credit: 676,877,192
RAC: 23,722
Message 22637 - Posted: 27 Jun 2016, 19:02:41 UTC

On my 1070/960 machine, this seems to work out nicely for both cards.

verbose=1
kernels_per_reduction=48
threads=8
lut_size=17
sleep=1
reduce_cpu=0
sieve_size=28

If I go up more it slows down the 960.
____________


Copyright © 2018 Jon Sonntag; All rights reserved.