Suggestion : Use sieve to speed up


Message boards : Science : Suggestion : Use sieve to speed up



5. Counting the glide instead of the whole step count speeds up the app by 1.7x.

The collatzGlide() kernel file: http://pastebin.com/Qv3PbVSS

I made minor adjustments compared to the previous one. It looks like it produces correct results after I tried some different starting values.

The performance part: ~4.8 ms per kernel execution (with items_per_kernel = 22 and threads = 8); the kernelSteps() kernel takes ~10 ms per execution under the same circumstances.

The most interesting thing is that this kernel is neither memory bound (mem busy = 81%) nor ALU bound (ALU busy = 88%). It's the do...while loops that affect its speed: only 46% of the threads in a warp/wavefront are active because of thread divergence. That would be challenging to solve due to the randomness of the hailstone sequence.
____________
Sosiris, team BOINC@Taiwan
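For reference, the "glide" being counted here is the number of Collatz steps until the trajectory first drops below its starting value. A minimal single-threaded C sketch of that early-exit condition (for illustration only; this is not the collatzGlide() kernel from the pastebin link):

```c
#include <stdint.h>

/* Glide of n0: the number of Collatz steps until the value first drops
 * below the starting value. Counting only the glide (instead of the
 * full step count) lets a thread leave the loop much earlier.
 * Assumes n0 >= 2 (for n0 == 1 the trajectory never drops below 1). */
unsigned glide(uint64_t n0) {
    uint64_t n = n0;
    unsigned steps = 0;
    do {
        n = (n & 1) ? 3 * n + 1 : n / 2;  /* one hailstone step */
        ++steps;
    } while (n >= n0);  /* exit as soon as we dip below the start */
    return steps;
}
```

For example, glide(27) is 96 while its full step count is 111, which is where the saving comes from; the data-dependent trip count of this loop is also exactly the source of the thread divergence described above.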

ID: 19748

Turning off validation inside the kernel (the second half of it) pushes the execution time down to 4.4 ms per kernel and produces the same results (I still use the CPU to validate them). The decrease in ALU utilization (affected by thread divergence) in the new kernel means that thread divergence is mainly located in the first half of the kernel.

ID: 19749

I also tried doing reduction in the work group in the collatzGlide() kernel.

ID: 19752

This was a very interesting discussion, but nothing has been posted recently.

ID: 19889

To merle van osdol: just because I'm out of ideas to make it even faster :) And Slicker definitely needs some time to test the optimizations and put them into practice.

ID: 19891

Great, and thanks for your reply.

ID: 19898

I have found over the years that heat, overclocking, and drivers can lead to GPUs calculating invalid results, which is why every number checked that results in a new "high steps" for the WU gets checked by the CPU as well. That way, if the GPU is spitting out garbage, the task can error out. Otherwise, if the validation is only caught on the server, the host can trash thousands of WUs in a very short time.
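A minimal sketch of that safeguard in plain C, with hypothetical names (cpu_steps, validate_high_steps); this illustrates the idea, not the project's actual validation code:

```c
#include <stdint.h>

/* Reference CPU implementation: total Collatz steps to reach 1.
 * Hypothetical helper shown only to illustrate the safeguard above. */
unsigned cpu_steps(uint64_t n) {
    unsigned steps = 0;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n / 2;
        ++steps;
    }
    return steps;
}

/* Re-check a GPU-reported "high steps" result on the CPU. If the GPU is
 * producing garbage (heat, overclocking, driver bugs), the mismatch is
 * caught locally and the task can error out, instead of the host
 * trashing thousands of WUs before the server notices. */
int validate_high_steps(uint64_t n, unsigned gpu_reported_steps) {
    return cpu_steps(n) == gpu_reported_steps;
}
```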

ID: 20455

Thanks for your reply, Slicker. (Both in this thread and in the PM.)

ID: 20456

Or alternatively, we can 'scan' the numbers in another way. Assume we want to use a 128-step sieve (with density about 60 ppm) for a theoretical 166666x speed-up. Instead of producing all the sieve entries in the client apps, the server assigns an entry A that passes the 128-step sieve (i.e. A < 2^128 and the glide of A >= 128) to the clients and lets them crunch the numbers X = k*(2^128) + A, where k is an integer. After some time, when k reaches our goal (say k = 2^64 or something large), assign another entry B, which is the next one passing the 128-step sieve; then rinse and repeat until all entries of the 128-step sieve are covered. Then assign A again with an even larger k.

We could gain a lot of speed without occupying too much RAM or disk space. The additional work for the server is only keeping track of a few currently active entries and their k values. Finding the entries can itself be a special WU if you want. For the client apps, sieve-table creation is not required anymore, and the apps are basically the same as before except that they change the higher bits of val instead of the lower bits. Any feedback is appreciated.
____________
Sosiris, team BOINC@Taiwan
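The server-side bookkeeping this proposal requires can be sketched as follows (all names hypothetical, not Collatz@Home server code): one active sieve entry A plus the next unassigned k, with ranges of k handed out to clients.

```c
#include <stdint.h>

/* The server keeps one active 128-step sieve entry A (A < 2^128, glide
 * of A >= 128, stored here as four 32-bit limbs) plus the next
 * unassigned k. A client then crunches X = k * 2^128 + A for every k
 * in the range it was handed. */
typedef struct {
    uint32_t entry[4]; /* sieve survivor A, little-endian 32-bit limbs */
    uint64_t next_k;   /* first k not yet handed out for this entry */
    uint64_t k_goal;   /* once next_k reaches this, move to entry B */
} active_entry;

/* Hand out the next batch of `count` k values for this entry. Returns
 * the first k of the batch, or UINT64_MAX when the entry is exhausted
 * and the server should switch to the next sieve survivor. */
uint64_t assign_batch(active_entry *e, uint64_t count) {
    if (e->next_k >= e->k_goal)
        return UINT64_MAX;
    uint64_t first = e->next_k;
    e->next_k += count;
    return first;
}
```

The state per active entry is just a few machine words, which is why the scheme needs almost no server-side RAM or disk.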

ID: 20515

The kernel looks like this; it uses a 256-step sieve: http://pastebin.com/06F23Cc4

ID: 21082

Looking at the sieve sizes as the power increases, it seems like the size about doubles for each increase in power, e.g. 2^30 contains 12,771,274 items, 2^31 has 23,642,078 items, and 2^32 has 41,347,483 items. At 4 bytes per sieve number (so long as the numbers are less than 2^32), that's 160MB. For 64-bit numbers it would use 320MB for the 2^32 sieve. For a sieve that is 2^256, it would require about 70GB just to store the sieve numbers. Given the way MySQL stores data, I assume that would be 140GB to store the data and index. Since MySQL is... poop, it only works well when the entire database fits into RAM, and there's no way I'm upgrading to a couple hundred GB.

So..... if the sieve could be generated incrementally as required, only the number in progress plus the numbers that haven't yet been sent would need to exist in the database. That would reduce the storage of the sieve on the server to a few MB. The finished numbers could be archived as the next sieve values are generated, the same way the results are archived today.

ID: 21089

Here's the code for finding the sieve numbers: http://pastebin.com/xewUGEQ3

sieveNum1 = getSieveNum(0, 256, "reference sieve to speed up if you want");
sieveNum2 = getSieveNum(sieveNum1+1, 256);
sieveNum3 = getSieveNum(sieveNum2+1, 256);
......

Dispatch the sieveNums to the clients. After they are 'finished', go on to the next ones. Moreover, getSieveNum() and the transformed coefficients can also be offloaded to the clients as a special workunit.
____________
Sosiris, team BOINC@Taiwan

ID: 21092

http://pastebin.com/Wmyq0YvM

ID: 21093

I created a GitHub repo to ease code management, and anyone who is interested in this project can have a look or give some feedback:

ID: 21237

The first working example using the 256-step sieve.

ID: 21661

Update: improved the host code so lots of options are configurable.

ID: 21737

After a gap of more than a hundred days, I finally figured out how to make the precalculation kernel not too complicated. On my R9-380, the naive kernel took 60 ms for 2^22 numbers, and the one with precalculation took 53 ms (roughly the expected 11% increase in speed). Any feedback is appreciated.

ID: 22047

"The first working example using the 256-step sieve."

Even with 256 bits, there were tests which showed that switching to 192 bits when the number was of a sufficient size was faster even with the branching. That never made it into the final kernels because an alternate way of doing it was found to be even faster. But if only a small percentage of the numbers would need to use 512 bits, it may be more efficient to branch if all of the other numbers being calculated will branch the same way. E.g. if a small enough percentage of the numbers need 512 bits, then using an "if" to cut back to 256 or even 192 bits can actually be faster when all of the numbers in the kernel can cut back to 256 or 192 bits. It's only when some need 256 and others need 512 that it would be less efficient than the 512-bit-only calculations. If that happens few enough times, the net result is that it may be faster with the branching. ...just a thought.
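The decision rule above can be sketched as a width selector: pick one arithmetic width per batch, and only fall back to 512-bit math when at least one value in the batch needs it. A hypothetical host-side helper in plain C (illustrating the idea, not code from the actual kernels):

```c
#include <stdint.h>

#define LIMBS_512 16  /* 512 bits as sixteen 32-bit limbs */

/* Number of 32-bit limbs actually used by val (at least 1). */
static unsigned limbs_used(const uint32_t val[LIMBS_512]) {
    unsigned i = LIMBS_512;
    while (i > 1 && val[i - 1] == 0)
        --i;
    return i;
}

/* Pick one arithmetic width for a whole batch: when every value fits in
 * 192 or 256 bits, the entire batch branches the same way and can take
 * the narrower (faster) path with no divergence; a mixed batch must use
 * 512-bit math throughout. */
unsigned batch_width_bits(const uint32_t vals[][LIMBS_512], unsigned n) {
    unsigned max_limbs = 1;
    for (unsigned i = 0; i < n; ++i) {
        unsigned u = limbs_used(vals[i]);
        if (u > max_limbs)
            max_limbs = u;
    }
    if (max_limbs <= 6) return 192;  /* 6 limbs = 192 bits */
    if (max_limbs <= 8) return 256;  /* 8 limbs = 256 bits */
    return 512;
}
```

The branch only hurts when the batch is mixed; if mixed batches are rare enough, the narrow-path wins outweigh them, which is the trade-off described above.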

ID: 22195

That's a brilliant idea. The kernel code would look like this:

```c
#ifdef TEST_COMPILE
//Constants required while compiling this kernel
#define LUTSIZE_LOG2 16   //Size of look-up table, in log2(size)
#define DELAYSIZE_LOG2 20 //Size of delay table, in log2(size)
#define INITSTEP 350      //collatz steps added in the precalculation
#endif //TEST_COMPILE

#define VAL_LENGTH 16 //15 uints + 1 uint for overflow bits by default
#define LUTMASK ((1u<<LUTSIZE_LOG2)-1)
#define DELAYMASK ((1u<<DELAYSIZE_LOG2)-1)
#define MAXSTEP 0xFFFFFFFu

__constant uint power3[] = {
    1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
    59049u,177147u,531441u,1594323u,4782969u,14348907u,
    43046721u,129140163u,387420489u,1162261467u,3486784401u };

inline ulong mul64L(const uint a, const uint b) {
    return upsample(mul_hi(a,b) /*hi*/, a*b /*lo*/);
}

//returns length of val in terms of 32-bit ints, assuming val != 0
inline uint getValLength(const __private uint * restrict val, uint idx) {
    while(val[idx] == 0) --idx;
    return (idx + 1);
}

inline uint isOverflow(const uint valLength) {
    return valLength == VAL_LENGTH;
}

inline uint isNormalExit(const uint valLength, const uint val) {
    uint pred = (valLength == 1) && (val <= DELAYMASK);
    return pred;
}

inline uint isOverSteps(const uint stepCount) {
    return stepCount >= MAXSTEP;
}

//////////////////////////////////////////////////////////////////////
//
// Collatz openCL kernel to find the counter-example of Collatz
// conjecture, optimized by the 256-step sieve
//
// N0 = 2**256 * k + b0, where 0 <= b0 = sieveNum < 2**256
//    = 2**256 * (gid + kOffset + startHi) + b0
// After 256 collatz iterations:
// N256 = 3**X * (gid + kOffset + startHi) + b256
//      = (3**X * gid) + (3**X * (startHi + kOffset) + b256)
// Increment of kOffset each kernel launch = GLOBALSIZE
//
//////////////////////////////////////////////////////////////////////
__kernel void collatzVariableLength(
    __global uint * restrict g_maxStep,      /* maximum step for this thread */
    __global ulong * restrict g_maxPos,      /* position where the max step is */
    __global ulong * restrict g_totalSteps,  /* total collatz steps calculated */
    __global const ulong * restrict g_lut,   /* look-up table. lo: powerOf3 ; hi: addVal */
    __global const uint * restrict g_delays, /* collatz steps for # < 2**(DELAYSIZE_LOG2) */
    __constant uint * restrict c_a256,       /* 3**X */
    const uint16 c_baseVal,                  /* 3**X * (startHi + kOffset) + b256 */
    uint valLength,                          /* length of baseVal */
    const ulong kOffset )
{
    //val = baseVal
    uint val[VAL_LENGTH];
    vstore16(c_baseVal, 0, val);
    //val += (3**X * kOffset_lo)
    ulong addRes = 0ul;
    uint pred = get_global_id(0) + convert_uint(kOffset); //pred as mulVal
    for(uint i = 0; i < valLength; ++i) {
        addRes = (addRes>>32) + val[i] + mul64L(pred, c_a256[i]);
        val[i] = convert_uint(addRes);
    } //for()
    val[valLength] = convert_uint(addRes>>32);
    //fill 0's for rest of digits
    for(uint i = valLength + 1; i < VAL_LENGTH; ++i) {
        val[i] = 0u;
    } //for()
    valLength += val[valLength] > 0; //Adjust valLength
    uint stepCount = INITSTEP;
    do {
        addRes = g_lut[val[0] & LUTMASK]; //most time-consuming global mem. access in this kernel
        pred = power3[convert_uint(addRes)]; //pred as multiplier
        stepCount += convert_uint(addRes) + LUTSIZE_LOG2;
        //val = (val >> LUTSIZE_LOG2) * mulVal + addVal
        for(uint i = 0; i < valLength; ++i) {
            addRes = (addRes >> 32) + mul64L(pred, (val[i] >> LUTSIZE_LOG2) | (val[i+1] << (32 - LUTSIZE_LOG2)));
            val[i] = convert_uint(addRes);
        } //for()
        val[valLength] = convert_uint(addRes >> 32);
        //Took advantage of the fact that valLength increases or decreases by 1 at most
        valLength += val[valLength] > 0;
        valLength -= (val[valLength] == 0) && (val[valLength - 1] == 0);
        pred = (isOverflow(valLength) << 2) | (isOverSteps(stepCount) << 1) | isNormalExit(valLength, val[0]);
    } while(pred == 0);
    stepCount += g_delays[val[0] & DELAYMASK];
    stepCount = select(stepCount, MAXSTEP, pred > 1);
    g_totalSteps[get_global_id(0)] += stepCount;
    valLength = g_maxStep[get_global_id(0)];
    g_maxStep[get_global_id(0)] = max(stepCount, valLength);
    addRes = g_maxPos[get_global_id(0)];
    g_maxPos[get_global_id(0)] = select(addRes, kOffset | get_global_id(0), convert_ulong(stepCount > valLength));
} //collatzVariableLength()
```

As you can see, the kernel code becomes leaner (and maybe faster) than the previous one.
____________
Sosiris, team BOINC@Taiwan

ID: 22528

To my surprise, the variable value-length kernel is slower than the fixed-length one (68 ms vs 55 ms per 2^22 items), probably because of loop unrolling in the fixed version and more branching instructions in the variable one. Those factors outweigh the benefit of having fewer numbers to calculate.

```c
#ifdef TEST_COMPILE
//Constants required while compiling this kernel
#define LUTSIZE_LOG2 16   //Size of look-up table, in log2(size)
#define DELAYSIZE_LOG2 20 //Size of delay table, in log2(size)
#define INITSTEP 350      //collatz steps added in the precalculation
#endif //#ifdef TEST_COMPILE

#define VAL_LENGTH 16 //15 uints for value itself and 1 for overflow
#define LUTMASK ((1u<<LUTSIZE_LOG2)-1)
#define DELAYMASK ((1u<<DELAYSIZE_LOG2)-1)
#define MAXSTEP 0xFFFFFFFu

__constant uint power3[] = {
    1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
    59049u,177147u,531441u,1594323u,4782969u,14348907u,
    43046721u,129140163u,387420489u,1162261467u,3486784401u };

inline ulong mul64L(uint a, uint b) {
    return upsample(mul_hi(a,b), a*b);
}

#ifdef VARIABLE_LENGTH
//Functions for variable length kernel
//Assuming at least one element is not zero
inline uint getValLength(uint* val) {
    uint idx = VAL_LENGTH - 1;
    while(val[idx] == 0) --idx;
    return idx + 1;
}
inline uint isOverflow(uint valLength, uint* val) {
    return valLength >= VAL_LENGTH;
}
inline uint isNormalExit(uint valLength, uint* val) {
    return valLength == 1 && val[0] <= DELAYMASK;
}
#else
//Functions for fixed length kernel
inline uint isOverflow(uint valLength, uint* val) {
    return val[VAL_LENGTH - 1] > 0;
}
inline uint isNormalExit(uint valLength, uint* val) {
    uint pred = 0;
    #pragma unroll
    for(uint i = 1; i < VAL_LENGTH-1; ++i) {
        pred |= val[i];
    }
    return pred == 0 && val[0] <= DELAYMASK;
}
#endif //#ifdef VARIABLE_LENGTH

inline uint isOverSteps(uint stepCount) {
    return stepCount >= MAXSTEP;
}

//////////////////////////////////////////////////////////////////////
//
// Collatz openCL kernel to find the counter-example of Collatz
// conjecture, optimized by the 256-step sieve
//
// N0 = 2**256 * k + b0, where 0 <= b0 = sieveNum < 2**256
//    = 2**256 * (gid + kOffset) + b0
// After 256 collatz iterations:
// N256 = 3**X * (gid + kOffset) + b256
//      = (3**X * gid) + (3**X * kOffset + b256)
//
//////////////////////////////////////////////////////////////////////
__kernel void collatz(
    __global uint * restrict g_maxStep,      /* maximum collatz step */
    __global ulong * restrict g_maxPos,      /* the very kOffset where the max step is */
    __global ulong * restrict g_totalSteps,  /* total collatz steps calculated */
    __global const ulong * restrict g_lut,   /* look-up table. lo: powerOf3 ; hi: addent */
    __global const uint * restrict g_delays, /* collatz steps for # < 2**(DELAYSIZE_LOG2) */
    __constant uint * restrict c_multiplier, /* 3**X */
    const uint16 c_addent, /* 3**X * kOffset + b256, should be updated when launching kernels */
    const ulong kOffset )
{
    uint val[VAL_LENGTH];
    vstore16(c_addent, 0, val); //val = c_addent
    uint pred = get_global_id(0); //pred as multiplier
    //val += gid * c_multiplier
    ulong addRes = val[0] + mul64L(pred, c_multiplier[0]);
    val[0] = convert_uint(addRes);
    #pragma unroll
    for(uint i = 1; i < VAL_LENGTH; ++i) {
        addRes = (addRes>>32) + val[i] + mul64L(pred, c_multiplier[i]);
        val[i] = convert_uint(addRes);
    }
#ifdef VARIABLE_LENGTH
    uint valLength = getValLength(val);
#else
    #define valLength (VAL_LENGTH-1)
#endif
    uint stepCount = INITSTEP;
    do {
        addRes = g_lut[val[0] & LUTMASK]; //most time-consuming global mem. access in this kernel
        pred = convert_uint(addRes);
        stepCount += pred + LUTSIZE_LOG2;
        pred = power3[pred]; //pred as multiplier
        //val = (val >> LUTSIZE_LOG2) * multiplier + addend,
        //only "valLength" numbers in val array are calculated
        for(uint i = 0; i < valLength; ++i) {
            addRes = (addRes >> 32) + mul64L(pred, (val[i] >> LUTSIZE_LOG2) | (val[i+1] << (32 - LUTSIZE_LOG2)));
            val[i] = convert_uint(addRes);
        }
        val[valLength] = convert_uint(addRes >> 32);
#ifdef VARIABLE_LENGTH
        //valLength changes by 1 at most
        valLength += val[valLength] > 0;
        valLength -= val[valLength] == 0 && val[valLength - 1] == 0;
#endif //#ifdef VARIABLE_LENGTH
        pred = (isOverflow(valLength, val) << 2) | (isOverSteps(stepCount) << 1) | isNormalExit(valLength, val);
    } while(pred == 0);
    stepCount += g_delays[val[0] & DELAYMASK];
    stepCount = select(stepCount, MAXSTEP, pred > 1);
    pred = g_maxStep[get_global_id(0)];
    addRes = g_maxPos[get_global_id(0)];
    g_totalSteps[get_global_id(0)] += stepCount;
    g_maxStep[get_global_id(0)] = max(stepCount, pred);
    g_maxPos[get_global_id(0)] = select(addRes, kOffset + get_global_id(0), convert_ulong(stepCount > pred));
} //collatz()

//clearRes() : clears result buffers,
//could use clEnqueueFillBuffer() in openCL 1.2 or above
__kernel void clearRes(
    __global uint * restrict g_maxStep,    /* maximum step for this thread */
    __global ulong * restrict g_maxPos,    /* position where the max step is */
    __global ulong * restrict g_totalSteps /* total collatz (delay) steps calculated */ )
{
    g_maxStep[get_global_id(0)] = 0u;
    g_maxPos[get_global_id(0)] = 0ul;
    g_totalSteps[get_global_id(0)] = 0ul;
} //clearRes()
```

____________
Sosiris, team BOINC@Taiwan

ID: 22795



Copyright © 2018 Jon Sonntag; All rights reserved.