Suggestion : Use sieve to speed up

sosiris
Message 19644 - Posted: 20 Jun 2014, 8:45:44 UTC - in response to Message 19641.

I have quite a few versions of reduction kernels. The most efficient versions use shared memory, require calling barrier(CLK_LOCAL_MEM_FENCE), and must be 2^k items in size.

FYI, the Collatz applications are designed so that they can run the reduction on the GPU, the CPU, or both. That allows me to verify that the reduction kernel has no bugs. Even if only the GPU is used for reductions, any time higher step counts are found, the CPU is used to validate that they are correct.

I agree with you. After some searching, a two-stage reduction appears to be the most efficient approach on the GPU, with local memory used in the second stage.
http://pastebin.com/usLerssD
The work-group size (local work size, 'threads') should be 2^N, the global work size k * (work-group size), and arrSize j * (global work size).
Interestingly, the CPUs perform better than the GPUs in this scenario.
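
For anyone who doesn't want to open the pastebin, the first stage looks roughly like this (a minimal sketch with my own names, not the exact pastebin code; a second, much smaller launch, or the CPU, then reduces the per-group results the same way):

// Stage-1 sketch (invented names): each work-item first folds several
// array elements into a private value, then the work group reduces those
// values in local memory.  The work-group size must be 2^N so the
// halving loop below pairs items up exactly.
kernel void reduceMaxStage1(global const uint* src,
                            const uint arrSize,
                            local uint* scratch,        // localSize items
                            global uint* partialMax)    // one per group
{
    const size_t lid = get_local_id(0);
    uint acc = 0;

    // Sequential part: stride over the array so reads stay coalesced.
    for (size_t i = get_global_id(0); i < arrSize; i += get_global_size(0))
        acc = max(acc, src[i]);
    scratch[lid] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree part: halve the number of active items every pass.
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] = max(scratch[lid], scratch[lid + s]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Stage 2 (not shown) reduces partialMax the same way, or on the CPU.
    if (lid == 0)
        partialMax[get_group_id(0)] = scratch[0];
}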

As for the getSieve kernel, it was ALU-bound on the GPU, but transferring the data to the host (5 GB/s using a pre-pinned buffer) and doing remove_copy() (if a value is not zero, append it to the array) were the bottleneck, even with asynchronous data transfer and kernel execution. I haven't come up with something like reduction() for it yet because the data is too 'dynamic' for the GPU to process. Maybe it's already fast enough, since this kernel is executed only when the sieve binary file is not available, so it's not the 'rate-determining step' of the whole Collatz app.

Slicker (project administrator)
Message 19648 - Posted: 20 Jun 2014, 22:43:52 UTC

If you add "#define N arrSize/globalSize-1" as a build option, you could use "#pragma unroll N" before the for loop, which should improve performance.
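
Roughly like this (an invented example kernel, assuming the compiler expands the build-time define inside the pragma):

// Sketch: the host computes the trip count and bakes it in at build
// time, e.g.  clBuildProgram(prog, 1, &dev, "-D N=15", NULL, NULL);
// so the loop bound is a compile-time constant the compiler can unroll.
kernel void sumStrided(global const uint* src, global uint* dst)
{
    const size_t gid = get_global_id(0);
    uint acc = src[gid];

    #pragma unroll N
    for (uint i = 1; i <= N; ++i)                  // N known at build time
        acc += src[gid + i * get_global_size(0)];  // -> fully unrollable
    dst[gid] = acc;
}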

sosiris
Message 19673 - Posted: 24 Jun 2014, 15:18:58 UTC - in response to Message 19648.

> If you add "#define N arrSize/globalSize-1" as a build option, you could use "#pragma unroll N" before the for loop, which should improve performance.

Yes, but unrolling also increases register pressure. For example, the getSieve() kernel uses 3 times as many registers after loop unrolling is applied, and kernel occupancy drops to ~40%. (There is no performance penalty, however, since it is still ALU-bound.)

As a side note, I'm trying to use mad24(), which is 4 times faster than a regular 32-bit multiplication (1 clock cycle vs. 4), in kernelSteps().

sosiris
Message 19675 - Posted: 26 Jun 2014, 6:00:30 UTC - in response to Message 19673.

Some corrections here:

> FYI, in your kernel, instead of using a conditional statement in every kernel instance to check whether the offset is 0 and then clear the result, I recommend enqueuing a separate 'clear result' kernel as needed (in fact, OpenCL 1.2 has clEnqueueFillBuffer() for exactly this, but v1.1 does not). Also, by using select(), expensive 'if' statements can be avoided to optimize the kernel further.

In fact, the first 'if' statement in your kernel (the one before the while loop) does not hurt performance as much as I thought, because all threads in a warp/wavefront take the same branch (all true or all false), so there is no divergence.
The second 'if' statement (the one after the while loop), however, does diverge (some threads true, some false); it may be faster to use the select() function there.
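
For example (a toy sketch with made-up names, not the project kernel):

// Keep the running minimum and the step at which it occurred, without a
// divergent branch.  Scalar select(a, b, c) returns b when c is
// non-zero, otherwise a.
void trackMin(uint val, uint step, uint* best, uint* bestStep)
{
    const int take = (val < *best);            // same test an 'if' would do
    *bestStep = select(*bestStep, step, take); // both operands are already
    *best     = select(*best, val, take);      // computed, so no divergence
}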

> mad24(), which is 4 times faster than regular 32-bit multiplication

That's for AMD GPUs only, where a 24-bit multiply runs at the single-precision rate while a 32-bit multiply runs at the double-precision rate. On NVIDIA GPUs and CPUs it's faster to use 32-bit multiplies, and on CPU devices 64-bit multiplies may be better still.
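
For the record, the usage is simply (sketch):

// mul24()/mad24() use only the low 24 bits of their operands, which is
// what maps to the fast path on AMD hardware.  Only safe when x and y
// are known to fit in 24 bits.
uint madd24(uint x, uint y, uint z)
{
    return mad24(x, y, z);   // = mul24(x, y) + z, 24-bit fast path
    // vs.  x * y + z;       // full 32-bit multiply, slower on AMD
}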

sosiris
Message 19687 - Posted: 2 Jul 2014, 6:46:27 UTC - in response to Message 19644.

> As for the getSieve kernel, it was ALU-bound on the GPU, but transferring the data to the host (5 GB/s using a pre-pinned buffer) and doing remove_copy() (if a value is not zero, append it to the array) were the bottleneck, even with asynchronous data transfer and kernel execution. I haven't come up with something like reduction() for it yet because the data is too 'dynamic' for the GPU to process. Maybe it's already fast enough, since this kernel is executed only when the sieve binary file is not available, so it's not the 'rate-determining step' of the whole Collatz app.


If one wants to implement getSieve() + remove_copy() entirely in OpenCL, the following steps are needed as far as I know (see the sketch after this list):
1. Each thread evaluates N numbers (for performance) and puts those that pass the sieve into local memory (again, for performance).
2. Combine the valid values within the work group and write them to global memory. The count of valid values per work group is also written to another buffer (call it 'outputSizes').
3. Compute the prefix sum (scan) of outputSizes to pinpoint the output positions used later.
4. Use the positions from step 3 to compact the chunks from step 2 into one contiguous array of valid values.

Step 3 is the most complicated; it requires 1 kernel function and 3-4 auxiliary functions. I don't fully understand them yet, though there are lots of examples.
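
Steps 1 and 2 would look roughly like this (a simplified sketch with invented names; the predicate is a stand-in, keeping residues 3 mod 4 the way a tiny 2-step sieve would):

#define NUM_PER_THREAD 4

bool passesSieve(uint v)             // placeholder predicate
{
    return (v & 3u) == 3u;
}

kernel void sieveCompact(global const uint* candidates,
                         global uint* chunks,       // per-group staging
                         global uint* outputSizes,  // one count per group
                         local  uint* buf,          // one chunk's capacity
                         local  uint* count)        // single uint
{
    const size_t lid = get_local_id(0);
    const size_t grp = get_group_id(0);
    const size_t lsz = get_local_size(0);
    const size_t cap = lsz * NUM_PER_THREAD;

    if (lid == 0) *count = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Step 1: each work-item tests NUM_PER_THREAD candidates (coalesced);
    // survivors are packed into local memory with a local atomic counter.
    for (uint i = 0; i < NUM_PER_THREAD; ++i) {
        const uint v = candidates[(grp * NUM_PER_THREAD + i) * lsz + lid];
        if (passesSieve(v))
            buf[atomic_inc(count)] = v;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Step 2: write the packed chunk and its size; the scan over
    // outputSizes (step 3) later turns the sizes into final offsets.
    for (size_t i = lid; i < *count; i += lsz)
        chunks[grp * cap + i] = buf[i];
    if (lid == 0) outputSizes[grp] = *count;
}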

sosiris
Message 19690 - Posted: 3 Jul 2014, 12:56:10 UTC

> As a side note, I'm trying to use mad24(), which is 4 times faster than a regular 32-bit multiplication (1 clock cycle vs. 4), in kernelSteps().

Tested: it's twice as slow as the normal version (7-9 ms for the normal version vs. 16-18 ms with mad24, using items_per_kernel = 22).

Slicker (project administrator)
Message 19691 - Posted: 3 Jul 2014, 19:49:04 UTC - in response to Message 19690.

> As a side note, I'm trying to use mad24(), which is 4 times faster than a regular 32-bit multiplication (1 clock cycle vs. 4), in kernelSteps().
>
> Tested: it's twice as slow as the normal version (7-9 ms for the normal version vs. 16-18 ms with mad24, using items_per_kernel = 22).


I kind of expected that. Gipsel and I tried that with the early v2 apps, with both CAL and CUDA, with the same results. We also tried using floats with fewer digits, and that didn't help either. The bit shifts were still faster, and they can only be done on integers. Casting back and forth between float and int was also slower.
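
For reference, the integer inner step looks roughly like this (a sketch, not the actual app code):

// Everything stays in integer registers and the halving is one shift,
// which is why the float experiments never paid off.
ulong collatzStep(ulong v)
{
    if (v & 1UL)
        return 3UL * v + 1UL;   // odd: 3v+1
    return v >> 1;              // even: a single shift replaces the divide
}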

sosiris
Message 19694 - Posted: 7 Jul 2014, 5:07:34 UTC - in response to Message 19691.

> I kind of expected that. Gipsel and I tried that with the early v2 apps, with both CAL and CUDA, with the same results. We also tried using floats with fewer digits, and that didn't help either. The bit shifts were still faster, and they can only be done on integers. Casting back and forth between float and int was also slower.


Speaking of floats, I'm working on computing the Collatz glide using a look-up table and a scoring system based on log2 and log3. (For any positive integer N > 1, the glide is the lowest index k for which S_k < N.) This lets us count far fewer steps than computing the full step counts (aka 'delays'). In fact, before the sieve, my first idea for this project was computing the glide => fewer loop iterations => faster speed. The sieve is a static form of the same idea.
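
Roughly, the bookkeeping works like this (my sketch of the idea): if the trajectory so far has taken a of the 3x+1 steps and b halvings, the current value is about N * 3^a / 2^b (the +1 terms only nudge it upward), so

    log2(val / N) ~= a * log2(3) - b

The glide is reached at the first step where this score can actually go negative, i.e. where val can drop below N; the table's worst case ('nadir') bounds how far the score can still fall during the next lookahead steps.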

sosiris
Message 19695 - Posted: 7 Jul 2014, 14:08:16 UTC - in response to Message 19687.

> If one wants to implement getSieve() + remove_copy() entirely in OpenCL, the following steps are needed as far as I know:
> 1. Each thread evaluates N numbers (for performance) and puts those that pass the sieve into local memory (again, for performance).
> 2. Combine the valid values within the work group and write them to global memory. The count of valid values per work group is also written to another buffer (call it 'outputSizes').
> 3. Compute the prefix sum (scan) of outputSizes to pinpoint the output positions used later.
> 4. Use the positions from step 3 to compact the chunks from step 2 into one contiguous array of valid values.
>
> Step 3 is the most complicated; it requires 1 kernel function and 3-4 auxiliary functions. I don't fully understand them yet, though there are lots of examples.


Tested. It's too complicated to be quick: compiling the OpenCL kernel took nearly 1 second (the previous one took only 0.06 s, which is a bit strange since I did not add that much code).

Slicker (project administrator)
Message 19696 - Posted: 7 Jul 2014, 18:58:50 UTC - in response to Message 19695.

> If one wants to implement getSieve() + remove_copy() entirely in OpenCL, the following steps are needed as far as I know:
> 1. Each thread evaluates N numbers (for performance) and puts those that pass the sieve into local memory (again, for performance).
> 2. Combine the valid values within the work group and write them to global memory. The count of valid values per work group is also written to another buffer (call it 'outputSizes').
> 3. Compute the prefix sum (scan) of outputSizes to pinpoint the output positions used later.
> 4. Use the positions from step 3 to compact the chunks from step 2 into one contiguous array of valid values.
>
> Step 3 is the most complicated; it requires 1 kernel function and 3-4 auxiliary functions. I don't fully understand them yet, though there are lots of examples.
>
> Tested. It's too complicated to be quick: compiling the OpenCL kernel took nearly 1 second (the previous one took only 0.06 s, which is a bit strange since I did not add that much code).


I wouldn't count a 1-second compile time as too long. If a large WU runs for 5 hours, 1 second is only 1/18000 of the time, so if the WU runs even 1% faster, the extra second of compile time is worth it. That's why I always run complete WUs after making a change and then compare the total time against previous runs of the same WU. If a kernel takes 2 ms longer, it may not seem like much, but called thousands or millions of times, it adds up. The reverse also holds: if 0.94 seconds seems like a lot, consider that it is paid only once, so if more than 0.94 seconds can be gained across all the kernel invocations, it is still worth the effort.

sosiris
Message 19697 - Posted: 8 Jul 2014, 3:00:16 UTC - in response to Message 19696.


> I wouldn't count a 1-second compile time as too long. If a large WU runs for 5 hours, 1 second is only 1/18000 of the time, so if the WU runs even 1% faster, the extra second of compile time is worth it. That's why I always run complete WUs after making a change and then compare the total time against previous runs of the same WU. If a kernel takes 2 ms longer, it may not seem like much, but called thousands or millions of times, it adds up. The reverse also holds: if 0.94 seconds seems like a lot, consider that it is paid only once, so if more than 0.94 seconds can be gained across all the kernel invocations, it is still worth the effort.


I count a 1-second compile time as too long because the original getSieve-opencl generates a 32-step sieve in 1.6 seconds total, including compiling the kernel code, running the getSieve kernels, and the CPU gathering the results. The old and new getSieve-opencl produce the same sieve table, so this does not affect the kernelSteps() kernel execution later.

sosiris
Message 19702 - Posted: 11 Jul 2014, 15:09:10 UTC - in response to Message 19694.

> Speaking of floats, I'm working on computing the Collatz glide using a look-up table and a scoring system based on log2 and log3. (For any positive integer N > 1, the glide is the lowest index k for which S_k < N.) This lets us count far fewer steps than computing the full step counts (aka 'delays'). In fact, before the sieve, my first idea for this project was computing the glide => fewer loop iterations => faster speed. The sieve is a static form of the same idea.


I finally got the collatzGlide() kernel working. The steps are:
1. Compute Collatz steps as in the kernelSteps() kernel and record a 'score', an approximation of log2(val / original val). The score is updated with: score += powerOf3 * log2(3) - lookahead.
2. When the score plus the 'nadir' value of the look-up table for the next val goes negative, i.e. log2(next val / original val) < 0, meaning next val < original val, the value has dropped below its start, so jump to step 3. The nadir value is obtained while generating the lookahead table.
3. Find at which step val drops below its original value, and compute val up to that point.
4. Check that the approximation is correct (is the computed val really less than the original?). The recorded step is the 'glide'. In contrast, kernelSteps() counts full step counts, called 'delays' on this website:
http://www.ericr.nl/wondrous/

Talk is cheap. Here's the code:
kernel code: http://pastebin.com/dPSauZ0L
host code for table generation: http://pastebin.com/hFeJpTda
host code for validation: http://pastebin.com/LARagpgd

collatzGlide() runs in around 5.8 ms per 2^22 threads, versus about 10 ms per 2^22 threads for kernelSteps(), so it is about 1.7x faster. Performance counters show 90% ALU usage; the limiting factor is currently most likely thread divergence due to the loops.
What are your thoughts on this, Slicker?

sosiris
Message 19704 - Posted: 14 Jul 2014, 7:04:50 UTC - in response to Message 19697.

> I count a 1-second compile time as too long because the original getSieve-opencl generates a 32-step sieve in 1.6 seconds total, including compiling the kernel code, running the getSieve kernels, and the CPU gathering the results. The old and new getSieve-opencl produce the same sieve table, so this does not affect the kernelSteps() kernel execution later.


I now use 4 command queues concurrently and a 10-step lookahead in my getSieve kernel, which cut the (32-step) sieve generation time from 1.7 s to 1 s. remove_copy() runs on the CPU.
Here's the code:
kernel code: http://pastebin.com/rDNq7ExK
host code: http://pastebin.com/dnepQB1k

sosiris
Message 19713 - Posted: 17 Jul 2014, 12:42:41 UTC - in response to Message 19704.

I'm investigating the use of the scoring system to accelerate sieve generation: e.g., instead of stepping every residue 32 times to create a 32-step sieve, one can apply a 16-step jump table (with scores) twice and (hopefully) get the same result, like the sketch below. I'll test it on the CPU ASAP.
Maybe this can also be used to create a higher-step sieve on the fly, to further reduce the numbers to compute.
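
The idea in code (a sketch with my own names; the real tables also carry the scores):

// With the shortcut step T(x) = x/2 for even x and (3x+1)/2 for odd x,
// sixteen steps send 2^16*k + R to 3^a * k + c, where a is the number of
// odd steps among R's first 16 steps and c = T^16(R).  Both depend only
// on R, so a 2^16-entry table gives 16 steps per lookup, and two
// lookups give 32.
constant ulong POW3[17] = {
    1, 3, 9, 27, 81, 243, 729, 2187, 6561, 19683, 59049, 177147,
    531441, 1594323, 4782969, 14348907, 43046721 };

ulong jump16(global const uint2* table, ulong x)
{
    const uint2 e = table[x & 0xFFFF];    // e.x = a, e.y = c for this R
    return POW3[e.x] * (x >> 16) + e.y;   // = 3^a * k + c
}
// Applying jump16() twice advances 32 steps, matching the 32-step sieve.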

sosiris
Message 19715 - Posted: 21 Jul 2014, 3:55:30 UTC - in response to Message 19713.

> I'm investigating the use of the scoring system to accelerate sieve generation: e.g., instead of stepping every residue 32 times to create a 32-step sieve, one can apply a 16-step jump table (with scores) twice and (hopefully) get the same result. I'll test it on the CPU ASAP.
> Maybe this can also be used to create a higher-step sieve on the fly, to further reduce the numbers to compute.


It's a significant advance in terms of performance. The CPU version of the new getSieve32() takes only 1.2 seconds to create a 32-step sieve, 13 times faster than the old version. Of course, one can alter the look-up tables to obtain a sieve of a different depth.
code : http://pastebin.com/6JdW6Sy5

And I tried the same thing in OpenCL:
kernel code: http://pastebin.com/ck08yvXJ
1. Query the output size for each thread; every thread effectively evaluates 2^16 numbers.
2. Inclusively scan 'size' to get the insert positions in memory.
3. Each thread writes its output to the result buffer (the sieve table).

host code :
clHelper.h : http://pastebin.com/h1ZUGhe7
host code : http://pastebin.com/9QbADN9N

sieve32OCL() function call: 891 ms
  building program: 108 ms
  creating command queue: 40 ms
  kernels: 86 ms (all memory-bound, ALU utilization 10% at most)
  read buffer: 27 ms

Perhaps we can 'stream' data from the getSieve() kernel into the kernelSteps() kernel, since that would make using a higher-step sieve (like 64 steps) possible. I'll investigate ASAP.

sosiris
Message 19719 - Posted: 22 Jul 2014, 14:03:49 UTC - in response to Message 19715.

> Perhaps we can 'stream' data from the getSieve() kernel into the kernelSteps() kernel, since that would make using a higher-step sieve (like 64 steps) possible. I'll investigate ASAP.


Here's the code:
kernel code: http://pastebin.com/AvVsqRLB
host code: http://pastebin.com/inZ0Pg0s

I store the look-up tables in local memory (shared memory in CUDA terms) and achieved 95% ALU utilization in the filter() kernel. The filter rate is about 2^26 numbers per ms when creating a 32-step sieve (slightly faster than the previous version I posted), and about 2^25 numbers per ms for a 64-step sieve. The 64-step sieve still needs more testing, though. The local-memory staging looks roughly like the sketch below.
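
Something like this (a sketch with invented names and sizes, not the real filter() kernel):

#define TBL 1024                       // a 2^10-entry table fits in LDS

constant ulong POW3_10[11] = {
    1, 3, 9, 27, 81, 243, 729, 2187, 6561, 19683, 59049 };

kernel void filterSketch(global const uint2* jumpTable,
                         global const ulong* in,
                         global ulong* out,
                         local  uint2* cached)     // TBL entries
{
    // Cooperative copy into local memory; all work-items participate,
    // so afterwards the main loop never touches global memory / L2
    // for table reads.
    event_t ev = async_work_group_copy(cached, jumpTable, TBL, 0);
    wait_group_events(1, &ev);

    const size_t gid = get_global_id(0);
    const ulong x = in[gid];
    const uint2 e = cached[x & (TBL - 1)];     // one 10-step jump, from LDS
    out[gid] = POW3_10[e.x] * (x >> 10) + e.y; // = 3^a * k + c
}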

sosiris
Message 19727 - Posted: 25 Jul 2014, 6:13:05 UTC - in response to Message 19715.

> Perhaps we can 'stream' data from the getSieve() kernel into the kernelSteps() kernel, since that would make using a higher-step sieve (like 64 steps) possible. I'll investigate ASAP.


Or maybe not. First, I underestimated the size of 2^64: validating the 64-step sieve with my algorithm would take ~80 years on my machine. Second, the filter rate is not fast enough to feed either the kernelSteps() or the collatzGlide() kernel. Anyway, we now have a better algorithm (using look-up tables) for generating the sieve table.

Since this thread is getting long, I'll post a summary of its main points, for anyone who is interested, as soon as I can.

Slicker (project administrator)
Message 19731 - Posted: 25 Jul 2014, 20:07:50 UTC

Thanks for all the work you've done. A summary would be appreciated.

sosiris
Message 19732 - Posted: 26 Jul 2014, 3:29:59 UTC - in response to Message 19731.

> Since this thread is getting long, I'll post a summary of its main points, for anyone who is interested, as soon as I can.


To sum up what we've done in this thread:

1. Using the sieve table speeds up the app by 20x (in theory, 80-100x).

What is the sieve? Write a number as X = 2^N * k + R (0 <= R < 2^N); the fate of X within N Collatz steps is determined only by R (the least significant N bits). We can look ahead N steps and identify the R's that drop below their starting value; any X with such an R will drop below itself within N steps and can safely be filtered out, if all we want is a counterexample to the Collatz conjecture. The inputs that remain, i.e. the R's that pass the N-step sieve, are only 1/80 to 1/100 of all numbers (for N = 29-32), which greatly increases the app's search speed. The kernel itself is almost unchanged at this point.
Reference: http://en.wikipedia.org/wiki/Collatz_conjecture#Modular_restrictions
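
A tiny worked example with N = 2, using the shortcut step T(x) = x/2 for even x and (3x+1)/2 for odd x: X = 4k goes to 2k < X in one step; X = 4k+2 goes to 2k+1 < X; X = 4k+1 goes to 6k+2 and then 3k+1 < X (for k >= 1). Only X = 4k+3 survives, going to 6k+5 and then 9k+8, which is still above 4k+3. So a 2-step sieve keeps only R = 3 (mod 4), i.e. 1/4 of all numbers, and deeper sieves keep proportionally fewer.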

2. Improved sieve-table generation algorithm: from the initial 30 seconds down to 1.5 seconds (CPU version). The OpenCL version is even faster, finishing within a second.

Cleaned-up code coming soon.

3. Matching the look-up table size to the L2 cache of GCN cards improves kernel execution speed by 2x.

On GCN cards (and possibly older AMD cards), the ALUs spend most of their time waiting on memory fetches because the L2 cache (512 KB - 768 KB) is far smaller than the look-up table (16 MB). By decreasing the table's lookahead steps, we get 3x the ALU usage and twice the speed.
NVIDIA GPUs presumably cache differently; doing the same there hurts performance a lot.

4. Using multiple command queues on a device enables asynchronous data transfer and kernel execution.

5. Counting the glide instead of the full step count speeds up the app by 1.7x.

http://boinc.thesonntags.com/collatz/forum_thread.php?id=1157&postid=19702
Because this is the newest change, more testing is needed.

sosiris
Message 19738 - Posted: 27 Jul 2014, 9:46:03 UTC

> 2. Improved sieve-table generation algorithm: from the initial 30 seconds down to 1.5 seconds (CPU version). The OpenCL version is even faster, finishing within a second.

*Edit* added kernel files to the zip file

The code (files packed in a zip file):
https://drive.google.com/file/d/0B_zdIaR2HqBEUWRuOHFNbWU2UEU/edit?usp=sharing

Please see getSieve() (in both the CPU and OpenCL versions) and getSieve32() (in the latest OpenCL version; kernel execution time is ~90 ms on my HD7850). Please reply if anything in my code is unclear.

PS. I also wrote a custom OpenCL command-queue class that wraps the one in the C++ bindings provided by Khronos, to simplify my coding. A few C++11 features are used as well.
