Suggestion : Use sieve to speed up
Message boards : Science : Suggestion : Use sieve to speed up
Profile sosiris
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19133 - Posted: 24 Apr 2014, 11:05:17 UTC

According to Wikipedia, "Modular restrictions" under "Optimizations" in the Collatz conjecture article: http://en.wikipedia.org/wiki/Collatz_conjecture#Optimizations

I wrote a small program to test how effective it is. I applied the 2 rules below:
For n = a*2^k + b (0 <= b < 2^k),
1. Omit the b's that will drop below the original within k steps.
(e.g. even numbers are omitted because they drop in 1 step)
2. Omit the b's that will share the same "path" with other, smaller b's within k steps.

That results in a significant reduction of the numbers we need to 'feed' the collatz app. For example, at k = 26, only 1037374 residues 'survived'. That's a potential 64x speed-up (1037374 residues represent all 2^26 numbers of a block, and 1037374 is close to 2^20 = 1048576). Only an additional 4MB static array is required. What do you think, Slicker?

BTW, this board has had no posts for over 2 months lol.

Sosiris, team Taiwan.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 19134 - Posted: 24 Apr 2014, 14:03:12 UTC - in response to Message 19133.

BTW, this board has had no posts for over 2 months lol.


SETI hasn't produced positive results in 40 years, and 2 months seems like a long time? LOL

I really like your idea though. How fast does the sieve work?

Profile sosiris
Message 19156 - Posted: 25 Apr 2014, 3:22:28 UTC - in response to Message 19134.

I really like your idea though. How fast does the sieve work?


Working on it.

Profile sosiris
Message 19158 - Posted: 25 Apr 2014, 7:01:24 UTC - in response to Message 19156.

Working on it.


Oops! Flooded by error messages when trying to compile the app (scratching my head). I'll keep trying.

Profile Slicker
Message 19160 - Posted: 25 Apr 2014, 14:21:57 UTC

I assume the 4MB array is for 32-bit numbers, and that for 64-bit numbers it would double in size and for 96-bit numbers (the current collatz number size) it would triple. Or am I mistaken?

Profile sosiris
Message 19188 - Posted: 26 Apr 2014, 2:22:27 UTC - in response to Message 19160.

I assume the 4MB array is for 32-bit numbers, and that for 64-bit numbers it would double in size and for 96-bit numbers (the current collatz number size) it would triple. Or am I mistaken?


Thanks for your reply, Slicker. To be clear, let me show my 'simplified' kernel, adapted from yours here: http://devgurus.amd.com/thread/166544

__kernel void kernelSteps(
    __global const uint4 *table,  // Look-up table (mosc)
    __global const uint *sieve,   // The sieve I proposed
    const uint4 start26,          // Starting value, 128-bit, should be a multiple of 2^26
    __global uint *steps)         // The result (written to, so not const)
{
    size_t gid = get_global_id(0);          // Global WU size = size of sieve
    uint8 val = (uint8)(start26, (uint4)0); // val : 256-bit integer for convenience
    val.s0 += sieve[gid];                   // Sieve entry used as an offset
    uint stepCount = 0;                     // Collatz step count for this number
    // The rest is similar to your kernel....
    // ....Done!
    steps[gid] = stepCount;
}
PS. Sorry, I don't understand what offset, stepIn, stepOut, and uint4 *steps really do, so I did not use them.

The sieve acts as an offset and is added to the starting value in the kernel. For example, the 5-step sieve contains 4 numbers: 7, 15, 27, and 31. We only need to take 7, 15, 27, 31, 39 (7+32*1), 47 (15+32*1), 59 (27+32*1), 63 (31+32*1), 71 (7+32*2), ... into account, instead of 1, 2, 3, 4, 5, .... This results in a potential 8x speed-up.

As for the 26-step sieve, only 1037374 numbers (nearly 2^20) need to be taken into account instead of 67108864 (2^26). (That's the potential 64x speed-up.) And it consists only of 32-bit integers (27, 31, 47, 63, ..., 67108863), so 4 bytes * 2^20 = 4MB, if my memory serves me right.

Sosiris, team BOINC@Taiwan.

Profile sosiris
Message 19398 - Posted: 8 May 2014, 15:07:43 UTC

After some trial and error, I finally got my small collatz app working. Running an old-solo-sized (3*2^38 numbers) WU takes 81 seconds on my HD7850. As a comparison, the original app took 20 min 20 sec (1220 sec). That's about a 15x speed boost. Of course, there is room for further optimization, since I only used simple methods to get it done.

These are basically the things I did:
1. Load the look-up-table and 'sieve' arrays (external files)
2. Initialize OpenCL, including loading the kernel (a separate .cl file)
3. In the kernel, values from the 'sieve' array are added to the starting value (set by host code). Then get the Collatz steps of the values and save them into the 'steps' array.
4. The CPU (host code) gets the maximum step count and the corresponding index from the 'steps' array. It also sums up the steps.

That's it. Thanks for reading. I really hope my idea can be integrated into your project to make it faster. :)

Sosiris, team BOINC@Taiwan.

Profile sosiris
Message 19406 - Posted: 9 May 2014, 1:33:03 UTC

Here's the code, please have a look:

kernel: http://pastebin.com/V5tLxVeu

host code : http://pastebin.com/VJdTJy3u

Sosiris, team BOINC@Taiwan.

Profile Slicker
Message 19412 - Posted: 9 May 2014, 15:12:15 UTC

The stepsIn and stepsOut are used so that, rather than checking the output after each kernel, it can run multiple kernels in a row before checking. Combined with cl_events so it doesn't need to block while waiting for the last kernel to finish, it drastically increases the GPU load.

For efficiency, your code alternates between two sets of data so it basically does:

run kernel
read results
check results on the CPU while simultaneously running the second kernel
read the second results
check results on the cpu

If the kernel size is matched to the CPU such that the kernel runtime and checking the results take the same amount of time, it would use both 100% CPU and 100% GPU. In order for the GPU apps to use as little CPU as possible, stepsIn and stepsOut are used to swap the input and output data between each subsequent kernel call, like:

x = offset
for (i = 0; i < kernels_per_reduction; i += 2) {
    call kernel(stepsIn, stepsOut, x)
    x += items_per_kernel
    call kernel(stepsOut, stepsIn, x)
    x += items_per_kernel
}
read the results (non-blocking with clEvent)
wait until clEvent status is complete
check results on cpu

So, if using 1024 kernels per reduction, it will only read and check the results every 1024 kernel calls, which reduces the CPU load, especially when used with asynchronous reads. And yet, the recent video drivers choose to run the kernels in blocking mode, which defeats the whole purpose.

Profile Slicker
Message 19413 - Posted: 9 May 2014, 15:25:57 UTC - in response to Message 19398.

After some trial and error, I finally got my small collatz app working. ...


Well done. It looks like the next version will be adding sieve logic.

How long does it take to load the lookup and sieve tables compared to generating them on the fly? When Gipsel wrote the original version, his tests showed that there was little difference in time, so it was just a matter of I/O vs. CPU load for < 1 second.

On a side note, AMD support forwarded the issue on to the engineering team almost a year ago but still hasn't followed up on the post to let us know the status. The 7970s seem to be doing OK on Collatz, so it is fixed for some hardware. But the problem still exists on OS X, so they haven't fixed it completely.

Profile sosiris
Message 19419 - Posted: 10 May 2014, 7:58:04 UTC - in response to Message 19412.

The stepsIn and stepsOut are used ...


Wow, that's brilliant. I learned something good.

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19420 - Posted: 10 May 2014, 8:15:39 UTC - in response to Message 19413.

How long does it take to load the lookup and sieve tables compared to generating them on the fly?


My results :

Creating lut on-the-fly: 66 ms
Loading lut (text): 1318 ms
Loading lut (binary): 8 ms
Creating sieve on-the-fly: 704 ms
Loading sieve (text): 439 ms
Loading sieve (binary): 2 ms

System specs:
CPU : i5-2500k @ 4.5GHz
RAM : DDR3-1600 4G *2
GPU : AMD HD7850
HDD (where the files are) : WD 1TB Blue @ 7200rpm

The code (the project packed in .7z format):

https://drive.google.com/file/d/0B_zdIaR2HqBEd3BYRDJsNDlGN3c/edit?usp=sharing

Profile sosiris
Message 19440 - Posted: 11 May 2014, 9:43:15 UTC - in response to Message 19420.

Oops, there's an error in my sieve-generating app.

Line 115 in sieve.cpp:

for (unsigned int lim = static_cast<unsigned int>(threshold - 3); b < lim; b += 2)

should be:

for (unsigned int lim = static_cast<unsigned int>(threshold - 1); b < lim; b += 2)

It does affect the last number of the sieve; the sieve size is not affected.

I'm glad I have the habit of writing detailed comments, so the error was clearly seen.

Profile sosiris
Message 19453 - Posted: 12 May 2014, 14:22:40 UTC - in response to Message 19412.

The stepsIn and stepsOut are used ...

I made a second kernel that uses stepsIn and stepsOut. GPU usage reached 99%. The kernel took 70 seconds to finish with WU size = 3*(2^38) = old solo size. As a comparison, my first kernel took 81 sec and the original collatz app took 1220 sec.

That's about 17x faster than the original, still far less than the theoretical 64x speed-up. I don't know why; maybe introducing another global memory read hurts the performance.

As a side note, my second kernel took a whole CPU core (25%) while checking most of the results on the GPU, whereas my first, slower kernel only took 5% while checking all results on the CPU. It looks like the OpenCL problem you mentioned (pushing kernels into the GPU's queue eats the CPU).


And yet, the recent video drivers choose to run the kernels in blocking mode which defeats the whole purpose.

Yeah. Many OpenCL platforms don't support out-of-order command queues. Maybe 2 command queues running simultaneously will work.

Profile Slicker
Message 19454 - Posted: 12 May 2014, 14:41:39 UTC

Global memory access is quite slow. Using textures with CAL gave a big speed increase, but it hasn't seemed to make a difference with OpenCL or CUDA. The sieve table is probably too large for either constant or local/shared memory.

The way the original AMD app got its speed increase was by using a single array for both reads and writes. OpenCL crashed or gave sporadic results when I did that (bad optimization in the drivers, which allow reads or writes but not both within the same kernel), so I had to use separate stepsIn and stepsOut.

Profile Slicker
Message 19455 - Posted: 12 May 2014, 18:11:20 UTC - in response to Message 19406.

Here's the code, please have a look:

kernel: http://pastebin.com/V5tLxVeu

host code : http://pastebin.com/VJdTJy3u

Sosiris, team BOINC@Taiwan.


FYI, in the kernel:
uint8 val = (uint8)(start26.s0 + sOffest, start26.s1, start26.s2, start26.s3, (uint4)0);

You should probably check whether start26.s0 + sOffest exceeds 32 bits, in which case you need to add 1 to start26.s1. That of course means checking that .s1 doesn't overflow so that you have to add 1 to .s2, etc. Otherwise, the steps reported may not be from the number you expected.

Profile sosiris
Message 19460 - Posted: 13 May 2014, 3:15:30 UTC - in response to Message 19455.


FYI, in the kernel:
uint8 val = (uint8)(start26.s0 + sOffest, start26.s1, start26.s2, start26.s3, (uint4)0);

You should probably check if start26.s0 + sOffest exceeds 32 bits such that you need to add 1 to start26.s1. That of course means checking that .s1 doesn't overflow such that you have to add 1 to .s2 etc. Otherwise, the steps reported may not be from the number you expected.


It won't, AFAIK: start26.s0 is a multiple of 2^26 somewhere between 2^26 and 2^32, and sOffset is less than 2^26 (its maximum value is 2^26 - 1). Adding the two cannot exceed 2^32 - 1.

Profile sosiris
Message 19473 - Posted: 14 May 2014, 3:11:39 UTC - in response to Message 19413.

How long does it take to load the lookup and sieve tables compared to generating them on the fly? When Gipsel wrote the original version, his tests showed that there was little difference in time, so it was just a matter of I/O vs. CPU load for < 1 second.


Update: I accidentally found that even without a blacklist (an unordered set) recording every coefficient put into the sieve array, the results are still the same. Plus, by using a 5-step sieve to speed it up, sieve creation time was cut from 704 ms to 252 ms. Much faster.

Code : http://pastebin.com/NmCZ4KtK

Sosiris

Profile Slicker
Message 19474 - Posted: 14 May 2014, 5:33:57 UTC

I was running tests on the first sieve version (not the latest 5-step-sieve version) and was getting about 20x faster than the solo collatz app on an nVidia GTX 670M. That was done reading the sieve as a binary file.

I've also been trying to do "smart" padding. For example, if using 1024 work items, then rather than padding the sieve to 1048576, it only needs to be padded with zeros to 1038336 since that's the lowest number evenly divisible by 1024 that is greater than or equal to 1037374 (the size of sieve26). Less padding means fewer redundant calculations.

It looks like a sieve of 2^29 is about the best as far as performance on my laptop's GPU. I need to run more tests on several other nVidia GPUs to see whether that remains the case with either faster or slower GPUs.

Increasing the work items beyond 128 seems to slow it down, but that's probably due to the number of registers available per work item, and I know that will change depending upon the compute capability of the nVidia GPUs.

Profile sosiris
Message 19475 - Posted: 14 May 2014, 6:33:17 UTC - in response to Message 19474.

I was running tests on the first sieve version (not the latest 5-step sieve) and was getting about 20x faster than the solo collatz app on an nVidia GTX 670M processor. That was done with reading the sieve as a binary file.

The only advantage of my second version is that less RAM and time are required for sieve creation, because I removed the blacklist. It doesn't affect reading.

I've also been trying to do "smart" padding. For example, if using 1024 work items, then rather than padding the sieve to 1048576, it only needs to be padded with zeros to 1038336 since that's the lowest number evenly divisible by 1024 that is greater than or equal to 1037374 (the size of sieve26). Less padding means fewer redundant calculations.

I agree. I liked the 26-step sieve just because its size is slightly less than 1048576, which is convenient for GPU computation. Using smart padding eliminates the need to stick to the 26-step sieve, and sieves of various step counts can be tested as desired.
____________
Sosiris, team BOINC@Taiwan


Copyright © 2018 Jon Sonntag; All rights reserved.