Suggestion : Use sieve to speed up
Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19748 - Posted: 31 Jul 2014, 10:06:19 UTC - in response to Message 19738.

5. Counting the glide instead of the full step count speeds up the app by about 1.7x

http://boinc.thesonntags.com/collatz/forum_thread.php?id=1157&postid=19702
Because this is a new approach, more testing is needed.

The collatzGlide() kernel file : http://pastebin.com/Qv3PbVSS
I made minor adjustments compared to the previous one.

It looks like it produces correct results; I tried several different starting values.
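
For reference (terminology as on Eric Roosendaal's pages, linked later in this thread): the glide of N is the number of Collatz steps until the trajectory first drops below N, while the delay is the number of steps needed to reach 1. A minimal, unoptimized CPU sketch in C of both counts, 64-bit only and with names of my own choosing, just to pin down what is being counted:

#include <stdint.h>
#include <stdio.h>

/* Glide: Collatz steps (3n+1 and n/2 counted separately) until the
 * trajectory first drops below the starting value.  Assumes n > 1 and
 * that 3*v+1 never overflows 64 bits for the values tried. */
static uint32_t glide(uint64_t n)
{
    uint64_t v = n;
    uint32_t steps = 0;
    while (v >= n) {
        v = (v & 1) ? 3 * v + 1 : v / 2;
        ++steps;
    }
    return steps;
}

/* Delay: Collatz steps until the trajectory reaches 1. */
static uint32_t delay(uint64_t n)
{
    uint32_t steps = 0;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n / 2;
        ++steps;
    }
    return steps;
}

int main(void)
{
    uint64_t n = 27;
    printf("n=%llu glide=%u delay=%u\n", (unsigned long long)n, glide(n), delay(n));
    return 0;
}

Counting the glide lets a thread stop as soon as the value dips below its start instead of going all the way to 1, which is where the speed-up comes from.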

Performance:
~4.8 ms per kernel execution (with items_per_kernel = 22 and threads = 8)
The kernelSteps() kernel takes ~10 ms per execution under the same conditions.

The most interesting thing is that this kernel is neither memory bound (mem busy = 81%) nor ALU bound (ALU busy = 88%). It's the do...while loops that limit its speed: only 46% of the threads in a warp/wavefront are active because of thread divergence, and that would be challenging to fix given the randomness of the hailstone sequences.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19749 - Posted: 31 Jul 2014, 12:32:59 UTC - in response to Message 19748.

Turning off validation inside the kernel (the second half of it) pushes the execution time down to 4.4 ms per kernel and produces the same results (I still use the CPU to validate them). The decrease in ALU utilization (which is affected by thread divergence) in the new kernel suggests that the divergence is mainly located in the first half of the kernel.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19752 - Posted: 1 Aug 2014, 3:25:55 UTC - in response to Message 19749.

I also tried doing a reduction within the work group in the collatzGlide() kernel.
http://pastebin.com/6gYek6x2

Although doing so adds ~10% to the execution time, I can push items_per_kernel higher (as high as 2^25) while keeping the output buffers at a reasonable size. There is a small net performance gain (33.6 ms / 2^25 threads, equivalent to 4.2 ms / 2^22 threads), and it reduces host-device transfer demand as well.
____________
Sosiris, team BOINC@Taiwan

merle van osdol
Send message
Joined: 30 Nov 09
Posts: 34
Credit: 12,489,622
RAC: 0
Message 19889 - Posted: 28 Sep 2014, 9:57:24 UTC

This was a very interesting discussion, but nothing has been posted recently.
Are you still working on this or another optimization?
Thanks

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19891 - Posted: 29 Sep 2014, 4:24:12 UTC - in response to Message 19889.

To merle van osdol: only because I'm out of ideas for making it even faster :) And Slicker definitely needs some time to test the optimizations and put them into practice.
____________
Sosiris, team BOINC@Taiwan

merle van osdol
Send message
Joined: 30 Nov 09
Posts: 34
Credit: 12,489,622
RAC: 0
Message 19898 - Posted: 29 Sep 2014, 13:21:17 UTC - in response to Message 19891.

Great and thanks for your reply.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 20455 - Posted: 18 May 2015, 18:12:28 UTC

I have found over the years that heat, overclocking, and drivers can lead to GPUs calculating invalid results, which is why every number that results in a new "high steps" for the WU gets checked by the CPU as well. That way, if the GPU is spitting out garbage, the task can error out. Otherwise, if the bad results are only caught by validation on the server, the host can trash thousands of WUs in a very short time.

I've come across an issue trying to validate the results of the collatzGlide kernel on the CPU. The issue is that the first "do while" loop can result in either a higher or a lower number. The statement in the kernel, "The first iteration, val will always greater than its original because the sieve is applied;", is only valid if NOT using a lookup table. Because it jumps ahead a number of steps, it may report a much higher glide than the actual one. For example, the number 2^32 + 109 yields a glide of 73 when in reality it is only 3: the first loop jumps ahead more than 3 iterations, so it has to continue until, after 73 steps, the number is once again below the start.

Technically, the output isn't wrong, in that it has determined that a glide exists and therefore the conjecture is not disproven by that number. The problem is that I can't verify the returned glide using a "brute force" method because it won't match. So the only way to validate is to use the same algorithm on the CPU. While that would confirm that the GPU is generating the same results, it doesn't test whether the optimized algorithm always returns a valid result, because the algorithm isn't being checked against a non-optimized method (e.g. using GMP to verify each highest glide in the WU).
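
For what it's worth, such a GMP brute-force check is only a few lines. A sketch (file and function names are mine) that reproduces the example above, glide(2^32 + 109) = 3:

/* glide_check.c - brute-force glide with GMP; build with: gcc glide_check.c -lgmp */
#include <stdio.h>
#include <gmp.h>

/* Count Collatz steps (3n+1 and n/2 counted separately) until the value
 * first drops below the start. */
static unsigned long glide_gmp(const mpz_t start)
{
    mpz_t v;
    unsigned long steps = 0;
    mpz_init_set(v, start);
    while (mpz_cmp(v, start) >= 0) {
        if (mpz_odd_p(v)) {
            mpz_mul_ui(v, v, 3);
            mpz_add_ui(v, v, 1);
        } else {
            mpz_fdiv_q_2exp(v, v, 1);   /* v /= 2 */
        }
        ++steps;
    }
    mpz_clear(v);
    return steps;
}

int main(void)
{
    mpz_t n;
    mpz_init(n);
    mpz_ui_pow_ui(n, 2, 32);   /* n = 2^32 */
    mpz_add_ui(n, n, 109);     /* n = 2^32 + 109 */
    printf("glide = %lu\n", glide_gmp(n));   /* prints 3 */
    mpz_clear(n);
    return 0;
}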

So it looks like I can use the sieve, but I won't be able to use the "score > negNadir" logic, which means returning the steps instead of the glide.

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 20456 - Posted: 19 May 2015, 3:07:18 UTC - in response to Message 20455.

Thanks for your reply, Slicker. (Both in this thread and in the PM)

First of all, the 'lut jump' is shorter than the sieve table in my settings, so the number always goes higher (otherwise it would not be in the sieve table), rendering the first check redundant.

Actually, the reason I calculate the glide, which is more complex (and maybe more error-prone) to compute, is not performance but the flexibility of the sieve table. Sieve tables with different step counts give different results when counting delay steps, but not glide steps. The most prominent example is that some even numbers appear on the highest-steps page; that would not be possible once a sieve table is applied. So if you want to use, for instance, a 29-step sieve table, all apps should use it for consistent results.

IMHO, the easiest thing to do in the first place is to tweak the size of the look-up tables for AMD GPUs. Reducing the size actually improves speed because of the smaller caches on those devices; it makes the kernel ALU-bound instead of waiting on memory accesses.

I'm thinking about generating a sieve table on the fly with a higher step count, say 64, which is about 6 times less 'dense' than a 32-step one. However, that requires either lots of CPU-GPU communication or dynamic parallelism (OpenCL 2.0, not widely supported yet). It's not worth the cost right now.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 20515 - Posted: 6 Jun 2015, 8:21:08 UTC - in response to Message 20456.


I'm thinking about generating a sieve table on the fly with a higher step count, say 64, which is about 6 times less 'dense' than a 32-step one. However, that requires either lots of CPU-GPU communication or dynamic parallelism (OpenCL 2.0, not widely supported yet). It's not worth the cost right now.


Or alternatively, we can 'scan' the numbers in another way. Assume we want to use a 128-step sieve (with a density of about 60 ppm) for a theoretical 166666x speed-up. Instead of having the client apps produce all the sieve entries, the server assigns an entry A that passes the 128-step sieve (i.e. A < 2^128 and the glide of A is >= 128 steps) to the clients and lets them crunch the numbers X = k*(2^128) + A, k being an integer. After some time, when k reaches our goal (say k = 2^64 or something similarly large), it assigns another entry B, the next one passing the 128-step sieve; then rinse and repeat until all entries of the 128-step sieve are covered. Then assign A again with an even larger k.

We could gain a lot of speed without occupying much RAM or disk space. The only additional work for the server is keeping track of a few currently active entries and the k value. Finding the entries can itself be a special WU if you want. For the client apps, sieve-table creation is no longer required, and the apps are basically the same as before except that they change the higher bits of val instead of the lower bits.
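
To illustrate the client side, a rough C sketch (word layout and names are my own assumptions, not the actual app): the 128-bit entry A sits in the four low 32-bit words of the starting value, and the assigned k is simply written into the words above bit 127.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ENTRY_WORDS 4   /* 128-bit sieve entry A, little-endian 32-bit words */
#define VAL_WORDS   6   /* room for A plus a 64-bit k on top */

/* Build X = k * 2^128 + A: low words come from the sieve entry, high words from k. */
static void make_start(uint32_t x[VAL_WORDS], const uint32_t A[ENTRY_WORDS], uint64_t k)
{
    memset(x, 0, VAL_WORDS * sizeof(uint32_t));
    memcpy(x, A, ENTRY_WORDS * sizeof(uint32_t));
    x[ENTRY_WORDS]     = (uint32_t)k;           /* bits 128..159 */
    x[ENTRY_WORDS + 1] = (uint32_t)(k >> 32);   /* bits 160..191 */
}

int main(void)
{
    const uint32_t A[ENTRY_WORDS] = { 27u, 0u, 0u, 0u };   /* placeholder, not a real sieve entry */
    uint32_t x[VAL_WORDS];
    /* a work unit would loop k over its assigned range and run the Collatz loop on each x */
    make_start(x, A, 5u);
    for (int i = VAL_WORDS - 1; i >= 0; --i) printf("%08x ", x[i]);
    printf("\n");
    return 0;
}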

Any feedback is appreciated.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 21082 - Posted: 19 Aug 2015, 15:33:44 UTC - in response to Message 20515.
Last modified: 19 Aug 2015, 15:36:59 UTC

The kernel looks like this (it uses a 256-step sieve): http://pastebin.com/06F23Cc4

Scanning the numbers 'horizontally', as in the previous kernels, requires the app to generate all the sieve elements, so we cannot push the sieve step count very high or memory becomes an issue; the current sieve app with its 32-step sieve already uses 800+ MB.

To reach a higher sieve step count, we can scan the numbers 'vertically'. That is, the app takes only one sieve element from the server and goes through the numbers by changing the higher bits. When the clients reach an adequate 'height', such as 2^64, the server issues another sieve element to the clients.

There are some benefits and requirements to this new method:
Benefits:
1. The sieve step count can be much higher than before. The more steps in the sieve, the lower the 'density' of surviving numbers: for a 128-step sieve it's about 60 per million; for a 256-step sieve, about 0.4 per million, i.e. a theoretical 2.5-million-times speed boost.
Source: http://www.ericr.nl/wondrous/terras.html
2. Lower memory usage, since the app doesn't have to load and store the whole sieve table.
3. You can precalculate N steps when using an N-step sieve, another speed boost.

Requirements:
1. The server has to remember the current sieve elements and needs to issue a special workunit to the clients to find a new sieve number when required.
2. The higher the sieve step count, the more bits we need to calculate. I used 16 ints for the kernel above, which is expected to be (at least) 2.5x slower than the 6 ints used in previous kernels (a plain C sketch of this wider arithmetic follows below).
This kernel is not tested yet, as I'm fixing some problems with the GPU driver and the profiler right now. Any feedback is appreciated.
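
The core operation on those 16 ints is an ordinary multi-word multiply-add on the shifted value, essentially the update the kernels later in this thread perform per look-up-table jump. A plain C sketch (constants and names are mine; the value proper is assumed to occupy the low 15 words, with the top word reserved for overflow):

#include <stdint.h>
#include <stdio.h>

#define VAL_WORDS 16   /* 15 words of value + 1 overflow word, little-endian */

/* val = (val >> shift) * mul + add, with 0 < shift < 32.
 * mul is e.g. 3^x from the look-up table and add its additive part.
 * The 64-bit accumulator cannot overflow: carry < 2^32 and mul*w <= (2^32-1)^2. */
static void madd_shift(uint32_t val[VAL_WORDS], unsigned shift, uint32_t mul, uint32_t add)
{
    uint64_t carry = add;
    for (int i = 0; i < VAL_WORDS - 1; ++i) {
        uint32_t w = (val[i] >> shift) | (val[i + 1] << (32 - shift));
        uint64_t acc = carry + (uint64_t)mul * w;
        val[i] = (uint32_t)acc;
        carry = acc >> 32;
    }
    val[VAL_WORDS - 1] = (uint32_t)carry;   /* overflow word */
}

int main(void)
{
    uint32_t v[VAL_WORDS] = { 0x0001ABCDu };   /* demo value, remaining words zero */
    madd_shift(v, 16, 27u, 5u);                /* v = (v >> 16) * 27 + 5, demo numbers only */
    printf("%08x %08x\n", v[1], v[0]);
    return 0;
}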
____________
Sosiris, team BOINC@Taiwan

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 21089 - Posted: 19 Aug 2015, 18:01:07 UTC

Looking at the sieve sizes as the power increases, it seems like the size roughly doubles for each increase in power, e.g. 2^30 contains 12,771,274 items, 2^31 has 23,642,078 items, and 2^32 has 41,347,483 items. At 4 bytes per sieve number (so long as the numbers are less than 2^32), that's 160MB. For 64-bit numbers it would be 320MB for a 2^32 sieve. For a 2^256 sieve, it would require about 70GB just to store the sieve numbers. Given the way MySQL stores data, I assume that would be 140GB to store the data and the index. Since MySQL is... poop, it only works well when the entire database fits into RAM, and there's no way I'm upgrading to a couple hundred GB. So..... if the sieve could be generated incrementally as required, only the numbers in progress plus the numbers that haven't yet been sent would need to exist in the database. That would reduce the storage of the sieve on the server to a few MB. The finished numbers could be archived as the next sieve values are generated, the same way the results are archived today.

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 21092 - Posted: 20 Aug 2015, 3:52:06 UTC - in response to Message 21089.

Here's the code for finding the sieve number: http://pastebin.com/xewUGEQ3

Just pass in the starting value and the desired step count, and you get the sieve number.

So the server doesn't have to store all the sieve numbers, only a couple of them.
The code would look like this:


sieveNum1 = getSieveNum(0, 256, "reference sieve to speed up if you want");
sieveNum2 = getSieveNum(sieveNum1+1, 256);
sieveNum3 = getSieveNum(sieveNum2+1, 256);
......


Dispatch the sieveNums to the clients. After they are 'finished', move on to the next ones. Moreover, getSieveNum() and the transformed coefficients can also be offloaded to the clients as a special workunit.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 21093 - Posted: 20 Aug 2015, 7:47:45 UTC - in response to Message 21092.

http://pastebin.com/Wmyq0YvM
Also, this is the code that precalculates N steps for an N-step sieve number. It returns the precalculated coefficients (i.e. a256 and b256 in the comments in the kernel file) and the number of odd steps (actually, INITSTEPS = 256 + oddOccurrence) for the kernel.
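
The idea is easy to show on a smaller scale. A C sketch for a 32-step sieve (so everything fits in 64-bit integers; the real code does 256 steps with multi-word arithmetic): write the start as N0 = 2^32*k + b0 and track the value as c*k + b while applying the shortcut map T(n) = n/2 for even n and (3n+1)/2 for odd n. The parity at each of the first 32 steps depends only on b, so after 32 steps c = 3^X, where X is the number of odd steps.

#include <stdint.h>
#include <stdio.h>

#define SIEVE_STEPS 32   /* small enough that c and b fit in 64 bits */

/* For N0 = 2^SIEVE_STEPS * k + b0, apply SIEVE_STEPS shortcut steps
 * T(n) = n/2 (even) or (3n+1)/2 (odd).  Writing the value as c*k + b,
 * the parity is decided by b alone because c keeps a factor of 2 until
 * the last step, so:
 *   odd step:  c -> 3*(c/2),  b -> (3b+1)/2
 *   even step: c -> c/2,      b -> b/2
 * Each shortcut step counts as one or two ordinary Collatz steps, hence
 * INITSTEPS = SIEVE_STEPS + X. */
static void precalc(uint64_t b0, uint64_t *c_out, uint64_t *b_out, unsigned *odd_out)
{
    uint64_t c = 1ull << SIEVE_STEPS;
    uint64_t b = b0;
    unsigned odd = 0;
    for (int i = 0; i < SIEVE_STEPS; ++i) {
        if (b & 1) { c = 3 * (c / 2); b = (3 * b + 1) / 2; ++odd; }
        else       { c = c / 2;       b = b / 2; }
    }
    *c_out = c; *b_out = b; *odd_out = odd;
}

int main(void)
{
    uint64_t c, b;
    unsigned odd;
    precalc(27, &c, &b, &odd);   /* 27 is just a demo residue, not a real sieve entry */
    printf("a32 = %llu (= 3^%u), b32 = %llu, initsteps = %u\n",
           (unsigned long long)c, odd, (unsigned long long)b, SIEVE_STEPS + odd);
    return 0;
}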
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 21237 - Posted: 7 Sep 2015, 7:35:08 UTC

I created a GitHub repo to ease code management, and anyone who is interested in this project can have a look or give some feedback:
https://github.com/SosirisTseng/collatz-opencl

The development will focus on utilizing the 256-step sieve and (maybe) testing something interesting like a persistent-thread model to counteract thread divergence. The kernel code is almost complete; however, the host code is still WIP.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 21661 - Posted: 3 Nov 2015, 8:28:11 UTC

The first working example using the 256-step sieve.
https://drive.google.com/file/d/0B_zdIaR2HqBES01yMTg4QllPZ0k/view?usp=sharing

Performance-wise, it runs 2^20 numbers per 28 ms, about 4 times slower than the current sieve app. That's expected, because 512-bit arithmetic is used and the average step count is about 2800 since the inputs are larger.

Some possible optimizations are:
* Pre-calculate 256 steps before sending the numbers into the kernel, for a ~10% improvement.
* Persistent threads (fewer threads per kernel, but one thread processes many numbers) to cover the thread divergence (~20% currently) from the while loop.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 21737 - Posted: 15 Nov 2015, 13:26:14 UTC - in response to Message 21661.

Update: improved the host code so that lots of options are configurable.

https://drive.google.com/file/d/0B_zdIaR2HqBEMWZpM3Iwa1F1OFU/view?usp=sharing

Code on GitHub also updated.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 22047 - Posted: 1 Mar 2016, 8:23:42 UTC - in response to Message 21737.

After a gap of more than a hundred days, I finally figured out how to make the precalculation kernel not too complicated. On my R9 380, the naive kernel took 60 ms for 2^22 numbers, and the one with pre-calculation took 53 ms (an expected ~11% speed increase). Any feedback is appreciated.

Code :
collatz256.cl is the naive kernel with 256-step sieve
collatzPreCalc.cl is the one with 256-step precalculation
https://drive.google.com/file/d/0B_zdIaR2HqBENm5iOS1abkRmbVU/view?usp=sharing

Also on the master branch of my github repo:
https://github.com/SosirisTseng/collatz-opencl

What to do next:
1. Organize the code, and remember to do it on another branch so I don't break the master branch.
2. There is about 20% thread divergence. Maybe a thread should grab a new number once it has finished counting the Collatz steps of its previous number (a CPU-side sketch of the idea follows below).
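
Roughly, that is the persistent-worker pattern with a shared atomic counter. This sketch assumes C11 atomics and POSIX threads on the CPU and is only an analogy for the kernel change:

/* persistent.c - a pool of workers pulls numbers from a shared atomic counter,
 * so a thread that finishes a short trajectory immediately grabs the next one.
 * Build with: gcc -pthread persistent.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 4
#define NUM_ITEMS   (1u << 20)
#define BASE        3u            /* demo: count steps for BASE, BASE+1, ... */

static atomic_uint next_item;

/* Collatz delay of n; 64-bit only, fine for these small demo inputs. */
static uint32_t delay(uint64_t n)
{
    uint32_t s = 0;
    while (n != 1) { n = (n & 1) ? 3 * n + 1 : n / 2; ++s; }
    return s;
}

static void *worker(void *arg)
{
    uint64_t total = 0;
    for (;;) {
        unsigned i = atomic_fetch_add(&next_item, 1);   /* grab the next number */
        if (i >= NUM_ITEMS) break;
        total += delay(BASE + i);
    }
    *(uint64_t *)arg = total;
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    uint64_t sums[NUM_THREADS], grand = 0;
    for (int i = 0; i < NUM_THREADS; ++i) pthread_create(&t[i], NULL, worker, &sums[i]);
    for (int i = 0; i < NUM_THREADS; ++i) { pthread_join(t[i], NULL); grand += sums[i]; }
    printf("total steps: %llu\n", (unsigned long long)grand);
    return 0;
}

On the GPU the counter would live in global memory and be bumped with an atomic add; whether that wins against the extra atomic traffic is exactly what would need testing.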
____________
Sosiris, team BOINC@Taiwan

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 22195 - Posted: 1 Apr 2016, 21:03:33 UTC - in response to Message 21661.

The first working example using the 256-step sieve.
https://drive.google.com/file/d/0B_zdIaR2HqBES01yMTg4QllPZ0k/view?usp=sharing

Performance-wise, it runs 2^20 numbers per 28 ms, about 4 times slower than the current sieve app. That's expected, because 512-bit arithmetic is used and the average step count is about 2800 since the inputs are larger.

Some possible optimizations are:
* Pre-calculate 256 steps before sending the numbers into the kernel, for a ~10% improvement.
* Persistent threads (fewer threads per kernel, but one thread processes many numbers) to cover the thread divergence (~20% currently) from the while loop.


Even with 256 bits, there were tests showing that switching down to 192 bits once the number was small enough was faster, even with the branching. That never made it into the final kernels because an alternate way of doing it was found to be even faster. But if only a small percentage of the numbers need 512 bits, it may be more efficient to branch, provided all of the other numbers being calculated branch the same way. E.g. if a small enough percentage of the numbers need 512 bits, then using an "if" to cut back to 256 or even 192 bits can actually be faster when all of the numbers in the kernel can cut back to 256 or 192 bits. It's only when some need 256 and others need 512 that it would be less efficient than the 512-bit-only calculation. If that happens rarely enough, the net result may be that it's faster with the branching. ...just a thought.

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 22528 - Posted: 6 Jun 2016, 6:36:47 UTC - in response to Message 22195.

That's a brilliant idea. The kernel code would look like this:


#ifdef TEST_COMPILE
//Constants required while compiling this kernel
#define LUTSIZE_LOG2 16 //Size of look-up table, in log2(size)
#define DELAYSIZE_LOG2 20 //Size of delay table, in log2(size)
#define INITSTEP 350 //collatz steps added in the precalculation
#endif //TEST_COMPILE

#define VAL_LENGTH 16 //15 uints + 1 uint for overflow bits by default
#define LUTMASK ((1u<<LUTSIZE_LOG2)-1)
#define DELAYMASK ((1u<<DELAYSIZE_LOG2)-1)
#define MAXSTEP 0xFFFFFFFu

__constant uint power3[] =
{
    1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
    59049u,177147u,531441u,1594323u,4782969u,14348907u,43046721u,129140163u,387420489u,1162261467u,
    3486784401u
};

inline ulong mul64L(const uint a, const uint b)
{
    return upsample(mul_hi(a,b) /*hi*/, a*b /*lo*/);
}

//returns length of val in terms of 32-bit ints, assuming val != 0
inline uint getValLength(const __private uint * restrict val, uint idx)
{
    while(val[idx] == 0) --idx;
    return (idx + 1);
}

inline uint isOverflow(const uint valLength)
{
    return valLength == VAL_LENGTH;
}

inline uint isNormalExit(const uint valLength, const uint val)
{
    uint pred = (valLength == 1) && (val <= DELAYMASK);
    return pred;
}

inline uint isOverSteps(const uint stepCount)
{
    return stepCount >= MAXSTEP;
}

//////////////////////////////////////////////////////////////////////
//
// Collatz openCL kernel to find the counter-example of Collatz conjecture, optimized by 256-step sieve
//
// N0 = 2**256 * k + b0, where 0 <= b0 = sieveNum < 2**256
//    = 2**256 * (gid + kOffset + startHi) + b0
// After 256 collatz iterations:
// N256 = 3**X * (gid + kOffset + startHi) + b256
//      = (3**X * gid) + (3**X * (startHi + kOffset) + b256);
// Increment of kOffset each kernel launch = GLOBALSIZE
//
/////////////////////////////////////////////////////////////////////////

__kernel void collatzVariableLength(
    __global uint * restrict g_maxStep,      /* maximum step for this thread */
    __global ulong * restrict g_maxPos,      /* position where the max step is */
    __global ulong * restrict g_totalSteps,  /* total collatz steps calculated */
    __global const ulong * restrict g_lut,   /* look-up table. lo: powerOf3 ; hi: addVal */
    __global const uint * restrict g_delays, /* collatz steps for # < 2**(DELAYSIZE_LOG2) */
    __constant uint * restrict c_a256,       /* 3**X */
    const uint16 c_baseVal,                  /* 3**X * (startHi + kOffset) + b256 */
    uint valLength,                          /* length of baseVal */
    const ulong kOffset
){
    //val = baseVal
    uint val[VAL_LENGTH];
    vstore16(c_baseVal, 0, val);
    //val += (3**X * kOffset_lo)
    ulong addRes = 0ul;
    uint pred = get_global_id(0) + convert_uint(kOffset); //pred as mulVal
    for(uint i = 0; i < valLength; ++i)
    {
        addRes = (addRes>>32) + val[i] + mul64L(pred, c_a256[i]);
        val[i] = convert_uint(addRes);
    } //for()
    val[valLength] = convert_uint(addRes>>32);

    //fill 0's for rest of digits
    for(uint i = valLength + 1; i < VAL_LENGTH; ++i)
    {
        val[i] = 0u;
    } //for()
    valLength += val[valLength] > 0; //Adjust valLength
    uint stepCount = INITSTEP;
    do
    {
        addRes = g_lut[val[0] & LUTMASK]; //most time-consuming global mem. access in this kernel
        pred = power3[convert_uint(addRes)]; //pred as multiplier
        stepCount += convert_uint(addRes) + LUTSIZE_LOG2;
        //val = (val >> LUTSIZE_LOG2) * mulVal + addVal
        for(uint i = 0; i < valLength; ++i)
        {
            addRes = (addRes >> 32) + mul64L(pred, (val[i] >> LUTSIZE_LOG2) | (val[i+1] << (32 - LUTSIZE_LOG2)));
            val[i] = convert_uint(addRes);
        } //for()
        val[valLength] = convert_uint(addRes >> 32);
        //Took advantage of that valLength will increase or decrease by 1 at most
        valLength += val[valLength] > 0;
        valLength -= (val[valLength] == 0) && (val[valLength - 1] == 0);
        pred = (isOverflow(valLength) << 2) | (isOverSteps(stepCount) << 1) | isNormalExit(valLength, val[0]);
    } while(pred == 0);
    stepCount += g_delays[val[0] & DELAYMASK];
    stepCount = select(stepCount, MAXSTEP, pred > 1);
    g_totalSteps[get_global_id(0)] += stepCount;
    valLength = g_maxStep[get_global_id(0)];
    g_maxStep[get_global_id(0)] = max(stepCount, valLength);
    addRes = g_maxPos[get_global_id(0)];
    g_maxPos[get_global_id(0)] = select(addRes, kOffset | get_global_id(0), convert_ulong(stepCount > valLength));
} //collatzVariableLength()


As you can see, the kernel code becomes leaner (and maybe faster) than the previous one.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 22795 - Posted: 6 Aug 2016, 1:40:39 UTC

To my surprise, the variable-length kernel is slower than the fixed-length one (68 ms vs 55 ms per 2^22 items), probably because of loop unrolling in the fixed version and more branching instructions in the variable one. Those factors outweigh the benefit of having fewer words to calculate.

Fixed length:
https://drive.google.com/open?id=15eLvvnNfOtWWq4T_yQeKAtzJ7u4JjixXbre9Ll1frl0

Variable Length:
https://drive.google.com/open?id=1VIGAJ-_3upkyynGTnc4I_cm49sCRkE6Hq5Yk02vrCF4

Kernel code (I merged the kernels into one source, since both use the same logic):


#ifdef TEST_COMPILE
//Constants required while compiling this kernel
#define LUTSIZE_LOG2 16 //Size of look-up table, in log2(size)
#define DELAYSIZE_LOG2 20 //Size of delay table, in log2(size)
#define INITSTEP 350 //collatz steps added in the precalculation
#endif //#ifdef TEST_COMPILE

#define VAL_LENGTH 16 //15 uints for value itself and 1 for overflow

#define LUTMASK ((1u<<LUTSIZE_LOG2)-1)
#define DELAYMASK ((1u<<DELAYSIZE_LOG2)-1)
#define MAXSTEP 0xFFFFFFFu

__constant uint power3[] =
{
    1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
    59049u,177147u,531441u,1594323u,4782969u,14348907u,43046721u,129140163u,387420489u,1162261467u,
    3486784401u
};

inline ulong mul64L(uint a, uint b)
{
    return upsample(mul_hi(a,b), a*b);
}

#ifdef VARIABLE_LENGTH //Functions for variable length kernel

//Assuming at least one element is not zero
inline uint getValLength(uint* val)
{
    uint idx = VAL_LENGTH - 1;
    while(val[idx] == 0) --idx;
    return idx + 1;
}
inline uint isOverflow(uint valLength, uint* val)
{
    return valLength >= VAL_LENGTH;
}
inline uint isNormalExit(uint valLength, uint* val)
{
    return valLength == 1 && val[0] <= DELAYMASK;
}

#else //Functions for fixed length kernel

inline uint isOverflow(uint valLength, uint* val)
{
    return val[VAL_LENGTH - 1] > 0;
}
inline uint isNormalExit(uint valLength, uint* val)
{
    uint pred = 0;
    #pragma unroll
    for(uint i = 1; i < VAL_LENGTH-1; ++i) {
        pred |= val[i];
    }
    return pred == 0 && val[0] <= DELAYMASK;
}
#endif //#ifdef VARIABLE_LENGTH

inline uint isOverSteps(uint stepCount)
{
    return stepCount >= MAXSTEP;
}

//////////////////////////////////////////////////////////////////////
//
// Collatz openCL kernel to find the counter-example of Collatz conjecture, optimized by the 256-step sieve
//
// N0 = 2**256 * k + b0, where 0 <= b0 = sieveNum < 2**256
//    = 2**256 * (gid + kOffset) + b0
// After 256 collatz iterations:
// N256 = 3**X * (gid + kOffset) + b256
//      = (3**X * gid) + (3**X * kOffset + b256)
//
/////////////////////////////////////////////////////////////////////////

__kernel void collatz(
    __global uint * restrict g_maxStep,      /* maximum collatz step */
    __global ulong * restrict g_maxPos,      /* the very kOffset where the max step is */
    __global ulong * restrict g_totalSteps,  /* total collatz steps calculated */
    __global const ulong * restrict g_lut,   /* look-up table. lo: powerOf3 ; hi: addent */
    __global const uint * restrict g_delays, /* collatz steps for # < 2**(DELAYSIZE_LOG2) */
    __constant uint * restrict c_multiplier, /* 3**X */
    const uint16 c_addent,                   /* 3**X * kOffset + b256, should be updated when launching kernels */
    const ulong kOffset
){
    uint val[VAL_LENGTH];
    vstore16(c_addent, 0, val); //val = c_addent
    uint pred = get_global_id(0); //pred as multiplier
    //val += gid * c_multiplier
    ulong addRes = val[0] + mul64L(pred, c_multiplier[0]);
    val[0] = convert_uint(addRes);
    #pragma unroll
    for(uint i = 1; i < VAL_LENGTH; ++i) {
        addRes = (addRes>>32) + val[i] + mul64L(pred, c_multiplier[i]);
        val[i] = convert_uint(addRes);
    }
#ifdef VARIABLE_LENGTH
    uint valLength = getValLength(val);
#else
#define valLength (VAL_LENGTH-1)
#endif
    uint stepCount = INITSTEP;
    do {
        addRes = g_lut[val[0] & LUTMASK]; //most time-consuming global mem. access in this kernel
        pred = convert_uint(addRes);
        stepCount += pred + LUTSIZE_LOG2;
        pred = power3[pred]; //pred as multiplier
        //val = (val >> LUTSIZE_LOG2) * multiplier + addend, only "valLength" numbers in val array are calculated
        for(uint i = 0; i < valLength; ++i)
        {
            addRes = (addRes >> 32) + mul64L(pred, (val[i] >> LUTSIZE_LOG2) | (val[i+1] << (32 - LUTSIZE_LOG2)));
            val[i] = convert_uint(addRes);
        }
        val[valLength] = convert_uint(addRes >> 32);
#ifdef VARIABLE_LENGTH
        //valLength changes by 1 at most
        valLength += val[valLength] > 0;
        valLength -= val[valLength] == 0 && val[valLength - 1] == 0;
#endif //#ifdef VARIABLE_LENGTH
        pred = (isOverflow(valLength, val) << 2) | (isOverSteps(stepCount) << 1) | isNormalExit(valLength, val);
    } while(pred == 0);
    stepCount += g_delays[val[0] & DELAYMASK];
    stepCount = select(stepCount, MAXSTEP, pred > 1);
    pred = g_maxStep[get_global_id(0)];
    addRes = g_maxPos[get_global_id(0)];
    g_totalSteps[get_global_id(0)] += stepCount;
    g_maxStep[get_global_id(0)] = max(stepCount, pred);
    g_maxPos[get_global_id(0)] = select(addRes, kOffset + get_global_id(0), convert_ulong(stepCount > pred));
} //collatz()

//clearRes() : clears result buffers, could use clEnqueueFillBuffer() in openCL 1.2 or above
__kernel void clearRes(
    __global uint * restrict g_maxStep,     /* maximum step for this thread */
    __global ulong * restrict g_maxPos,     /* position where the max step is */
    __global ulong * restrict g_totalSteps  /* total collatz (delay) steps calculated */
) {
    g_maxStep[get_global_id(0)] = 0u;
    g_maxPos[get_global_id(0)] = 0ul;
    g_totalSteps[get_global_id(0)] = 0ul;
} //clearRes()

____________
Sosiris, team BOINC@Taiwan
