Posts by sosiris


1)
Message boards :
Science :
Suggestion : Use sieve to speed up
(Message 23942)
Posted 350 days ago by sosiris It's me again. After some personal things, I was able to sit down and think about more optimizations for this project.

The first one is to get the glide steps correctly and efficiently, so we don't have to go all the way down to the full stopping time (a.k.a. delay steps). My first attempt was a failure because of inaccuracies in floating-point numbers. This time I used a sliding window over the cumulative parity vector, checked against a minimal-requirement one; the kernel drops the numbers whose requirements are not met. Because all of these are integer ops, it should be accurate.

The second one is to increase the look-ahead steps of the sieve. Currently, the 32-step sieve lets 1% of the numbers pass through; the 256-step one only allows 4 per 10 million. The more look-ahead steps the sieve has, the fewer numbers we need to check. While the running time goes up linearly (proportional to the bits of the value), the problem size goes down approximately exponentially, which is great (see lemma 6 of http://www.ericr.nl/wondrous/terras.html). And since we only need a couple of residues instead of the entire sieve to get the kernels working, being farsighted should not cause storage issues.

Right now I'm writing a Collatz glide kernel which utilizes a 5000-step sieve with an 8192-bit value. Each thread holds 32 bits, so there are 256 threads in a thread block. It would be slower to compute one single value, but each value is now worth about a trillion. And it's not an April Fools' joke. Believe me :) |
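The parity-vector test above can be sketched in a few lines of Python (an illustration, not the project's kernel code; the function name is mine). A residue b mod 2^k is dropped as soon as its cumulative odd-step count c fails the minimal requirement 3^c >= 2^i after i steps, since every number in that residue class has then provably fallen below its starting value:

```python
def survivors(k):
    """Residues b mod 2^k that a k-step sieve cannot eliminate.

    Uses the shortcut map T(n) = n/2 (even) or (3n+1)/2 (odd).
    After i steps, n_i is roughly (3^c / 2^i) * n_0, so 3^c < 2^i
    means the whole residue class has dropped below its start.
    """
    out = []
    for b in range(1 << k):
        x, c, alive = b, 0, True
        for i in range(1, k + 1):
            if x & 1:
                c += 1
                x = (3 * x + 1) // 2
            else:
                x //= 2
            if 3 ** c < 2 ** i:  # minimal parity requirement failed
                alive = False
                break
        if alive:
            out.append(b)
    return out
```

For k = 2 only the residue 3 survives, and for k = 4 only 7, 11 and 15 do (3 in 16), matching the trend that survivors thin out roughly exponentially as the look-ahead grows.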

2)
Message boards :
Science :
Suggestion : Use sieve to speed up
(Message 22795)
Posted 589 days ago by sosiris To my surprise, the variable-length kernel is slower than the fixed-length one (68 ms vs. 55 ms per 2^22 items), probably because of loop unrolling in the fixed version and more branching instructions in the variable one. Those factors outweigh the benefit of fewer numbers to calculate.

Fixed length: https://drive.google.com/open?id=15eLvvnNfOtWWq4T_yQeKAtzJ7u4JjixXbre9Ll1frl0
Variable length: https://drive.google.com/open?id=1VIGAJ-_3upkyynGTnc4I_cm49sCRkE6Hq5Yk02vrCF4

Kernel code (I merged the kernels into one since both use the same logic):

#ifdef TEST_COMPILE
//Constants required while compiling this kernel
#define LUTSIZE_LOG2 16   //Size of look-up table, in log2(size)
#define DELAYSIZE_LOG2 20 //Size of delay table, in log2(size)
#define INITSTEP 350      //collatz steps added in the precalculation
#endif //#ifdef TEST_COMPILE

#define VAL_LENGTH 16 //15 uints for the value itself and 1 for overflow
#define LUTMASK ((1u<<LUTSIZE_LOG2)-1)
#define DELAYMASK ((1u<<DELAYSIZE_LOG2)-1)
#define MAXSTEP 0xFFFFFFFu

__constant uint power3[] = {
    1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
    59049u,177147u,531441u,1594323u,4782969u,14348907u,43046721u,129140163u,387420489u,1162261467u,
    3486784401u };

inline ulong mul64L(uint a, uint b) { return upsample(mul_hi(a,b), a*b); }

#ifdef VARIABLE_LENGTH
//Functions for the variable-length kernel
//Assuming at least one element is not zero
inline uint getValLength(uint* val)
{
    uint idx = VAL_LENGTH - 1;
    while(val[idx] == 0) --idx;
    return idx + 1;
}
inline uint isOverflow(uint valLength, uint* val) { return valLength >= VAL_LENGTH; }
inline uint isNormalExit(uint valLength, uint* val) { return valLength == 1 && val[0] <= DELAYMASK; }
#else
//Functions for the fixed-length kernel
inline uint isOverflow(uint valLength, uint* val) { return val[VAL_LENGTH - 1] > 0; }
inline uint isNormalExit(uint valLength, uint* val)
{
    uint pred = 0;
    #pragma unroll
    for(uint i = 1; i < VAL_LENGTH-1; ++i) { pred |= val[i]; }
    return pred == 0 && val[0] <= DELAYMASK;
}
#endif //#ifdef VARIABLE_LENGTH

inline uint isOverSteps(uint stepCount) { return stepCount >= MAXSTEP; }

//////////////////////////////////////////////////////////////////////
//
// Collatz OpenCL kernel to find a counter-example of the Collatz
// conjecture, optimized by the 256-step sieve
//
// N0 = 2**256 * k + b0, where 0 <= b0 = sieveNum < 2**256
//    = 2**256 * (gid + kOffset) + b0
// After 256 collatz iterations:
// N256 = 3**X * (gid + kOffset) + b256
//      = (3**X * gid) + (3**X * kOffset + b256)
//
//////////////////////////////////////////////////////////////////////

__kernel void collatz(
    __global uint * restrict g_maxStep,      /* maximum collatz step */
    __global ulong * restrict g_maxPos,      /* the very kOffset where the max step is */
    __global ulong * restrict g_totalSteps,  /* total collatz steps calculated */
    __global const ulong * restrict g_lut,   /* look-up table. lo: powerOf3 ; hi: addend */
    __global const uint * restrict g_delays, /* collatz steps for # < 2**(DELAYSIZE_LOG2) */
    __constant uint * restrict c_multiplier, /* 3**X */
    const uint16 c_addent, /* 3**X * kOffset + b256, should be updated when launching kernels */
    const ulong kOffset )
{
    uint val[VAL_LENGTH];
    vstore16(c_addent, 0, val); //val = c_addent
    uint pred = get_global_id(0); //pred as multiplier
    //val += gid * c_multiplier
    ulong addRes = val[0] + mul64L(pred, c_multiplier[0]);
    val[0] = convert_uint(addRes);
    #pragma unroll
    for(uint i = 1; i < VAL_LENGTH; ++i) {
        addRes = (addRes>>32) + val[i] + mul64L(pred, c_multiplier[i]);
        val[i] = convert_uint(addRes);
    }
#ifdef VARIABLE_LENGTH
    uint valLength = getValLength(val);
#else
    #define valLength (VAL_LENGTH-1)
#endif
    uint stepCount = INITSTEP;
    do {
        addRes = g_lut[val[0] & LUTMASK]; //most time-consuming global mem. access in this kernel
        pred = convert_uint(addRes);
        stepCount += pred + LUTSIZE_LOG2;
        pred = power3[pred]; //pred as multiplier
        //val = (val >> LUTSIZE_LOG2) * multiplier + addend;
        //only "valLength" numbers in the val array are calculated
        for(uint i = 0; i < valLength; ++i) {
            addRes = (addRes >> 32) + mul64L(pred, (val[i] >> LUTSIZE_LOG2) | (val[i+1] << (32 - LUTSIZE_LOG2)));
            val[i] = convert_uint(addRes);
        }
        val[valLength] = convert_uint(addRes >> 32);
#ifdef VARIABLE_LENGTH
        //valLength changes by 1 at most
        valLength += val[valLength] > 0;
        valLength -= val[valLength] == 0 && val[valLength - 1] == 0;
#endif //#ifdef VARIABLE_LENGTH
        pred = (isOverflow(valLength, val) << 2) | (isOverSteps(stepCount) << 1) | isNormalExit(valLength, val);
    } while(pred == 0);
    stepCount += g_delays[val[0] & DELAYMASK];
    stepCount = select(stepCount, MAXSTEP, pred > 1);
    pred = g_maxStep[get_global_id(0)];
    addRes = g_maxPos[get_global_id(0)];
    g_totalSteps[get_global_id(0)] += stepCount;
    g_maxStep[get_global_id(0)] = max(stepCount, pred);
    g_maxPos[get_global_id(0)] = select(addRes, kOffset + get_global_id(0), convert_ulong(stepCount > pred));
} //collatz()

//clearRes() : clears result buffers; could use clEnqueueFillBuffer() in OpenCL 1.2 or above
__kernel void clearRes(
    __global uint * restrict g_maxStep,    /* maximum step for this thread */
    __global ulong * restrict g_maxPos,    /* position where the max step is */
    __global ulong * restrict g_totalSteps /* total collatz (delay) steps calculated */
)
{
    g_maxStep[get_global_id(0)] = 0u;
    g_maxPos[get_global_id(0)] = 0ul;
    g_totalSteps[get_global_id(0)] = 0ul;
} //clearRes() |
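The inner loop above advances the big number by LUTSIZE_LOG2 shortcut steps per iteration using the standard identity T^L(2^L * a + m) = 3^c * a + T^L(m), where c is the number of odd values among the first L iterates of m; the table stores (c, T^L(m)) for every m < 2^L. A small Python sketch of building and applying such a table (illustrative only; the names are mine, not the project's):

```python
def T(n):
    """Shortcut Collatz map: halve if even, (3n+1)/2 if odd."""
    return (3 * n + 1) // 2 if n & 1 else n // 2

def build_lut(L):
    """lut[m] = (c, d) with T^L(2^L * a + m) == 3^c * a + d for all a >= 0."""
    lut = []
    for m in range(1 << L):
        c, x = 0, m
        for _ in range(L):
            c += x & 1
            x = T(x)
        lut.append((c, x))
    return lut

def jump(n, L, lut):
    """Advance n by L shortcut steps in one multiply-add, as the kernel does."""
    c, d = lut[n & ((1 << L) - 1)]
    return 3 ** c * (n >> L) + d
```

Each jump also advances the delay count by L + c, which is what stepCount += pred + LUTSIZE_LOG2 does in the kernel: every odd shortcut step bundles one halving with the 3n+1.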

3)
Message boards :
Number crunching :
PCI-e Bandwidth Usage
(Message 22599)
Posted 634 days ago by sosiris Theoretically, the Collatz sieve kernel uses little bandwidth once the look-up tables are loaded into VRAM: just about one number per kernel launch. |

4)
Message boards :
Science :
Suggestion : Use sieve to speed up
(Message 22528)
Posted 649 days ago by sosiris That's a brilliant idea. The kernel code would look like this:

#ifdef TEST_COMPILE
//Constants required while compiling this kernel
#define LUTSIZE_LOG2 16   //Size of look-up table, in log2(size)
#define DELAYSIZE_LOG2 20 //Size of delay table, in log2(size)
#define INITSTEP 350      //collatz steps added in the precalculation
#endif //TEST_COMPILE

#define VAL_LENGTH 16 //15 uints + 1 uint for overflow bits by default
#define LUTMASK ((1u<<LUTSIZE_LOG2)-1)
#define DELAYMASK ((1u<<DELAYSIZE_LOG2)-1)
#define MAXSTEP 0xFFFFFFFu

__constant uint power3[] = {
    1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
    59049u,177147u,531441u,1594323u,4782969u,14348907u,43046721u,129140163u,387420489u,1162261467u,
    3486784401u };

inline ulong mul64L(const uint a, const uint b) { return upsample(mul_hi(a,b) /*hi*/, a*b /*lo*/); }

//returns length of val in terms of 32-bit ints, assuming val != 0
inline uint getValLength(const __private uint * restrict val, uint idx)
{
    while(val[idx] == 0) --idx;
    return (idx + 1);
}

inline uint isOverflow(const uint valLength) { return valLength == VAL_LENGTH; }

inline uint isNormalExit(const uint valLength, const uint val)
{
    uint pred = (valLength == 1) && (val <= DELAYMASK);
    return pred;
}

inline uint isOverSteps(const uint stepCount) { return stepCount >= MAXSTEP; }

//////////////////////////////////////////////////////////////////////
//
// Collatz OpenCL kernel to find a counter-example of the Collatz
// conjecture, optimized by the 256-step sieve
//
// N0 = 2**256 * k + b0, where 0 <= b0 = sieveNum < 2**256
//    = 2**256 * (gid + kOffset + startHi) + b0
// After 256 collatz iterations:
// N256 = 3**X * (gid + kOffset + startHi) + b256
//      = (3**X * gid) + (3**X * (startHi + kOffset) + b256)
// Increment of kOffset each kernel launch = GLOBALSIZE
//
//////////////////////////////////////////////////////////////////////

__kernel void collatzVariableLength(
    __global uint * restrict g_maxStep,      /* maximum step for this thread */
    __global ulong * restrict g_maxPos,      /* position where the max step is */
    __global ulong * restrict g_totalSteps,  /* total collatz steps calculated */
    __global const ulong * restrict g_lut,   /* look-up table. lo: powerOf3 ; hi: addVal */
    __global const uint * restrict g_delays, /* collatz steps for # < 2**(DELAYSIZE_LOG2) */
    __constant uint * restrict c_a256,       /* 3**X */
    const uint16 c_baseVal, /* 3**X * (startHi + kOffset) + b256 */
    uint valLength,         /* length of baseVal */
    const ulong kOffset )
{
    //val = baseVal
    uint val[VAL_LENGTH];
    vstore16(c_baseVal, 0, val);
    //val += (3**X * kOffset_lo)
    ulong addRes = 0ul;
    uint pred = get_global_id(0) + convert_uint(kOffset); //pred as mulVal
    for(uint i = 0; i < valLength; ++i) {
        addRes = (addRes>>32) + val[i] + mul64L(pred, c_a256[i]);
        val[i] = convert_uint(addRes);
    } //for()
    val[valLength] = convert_uint(addRes>>32);
    //fill 0's for the rest of the digits
    for(uint i = valLength + 1; i < VAL_LENGTH; ++i) { val[i] = 0u; } //for()
    valLength += val[valLength] > 0; //Adjust valLength
    uint stepCount = INITSTEP;
    do {
        addRes = g_lut[val[0] & LUTMASK]; //most time-consuming global mem. access in this kernel
        pred = power3[convert_uint(addRes)]; //pred as multiplier
        stepCount += convert_uint(addRes) + LUTSIZE_LOG2;
        //val = (val >> LUTSIZE_LOG2) * mulVal + addVal
        for(uint i = 0; i < valLength; ++i) {
            addRes = (addRes >> 32) + mul64L(pred, (val[i] >> LUTSIZE_LOG2) | (val[i+1] << (32 - LUTSIZE_LOG2)));
            val[i] = convert_uint(addRes);
        } //for()
        val[valLength] = convert_uint(addRes >> 32);
        //Takes advantage of the fact that valLength changes by 1 at most
        valLength += val[valLength] > 0;
        valLength -= (val[valLength] == 0) && (val[valLength - 1] == 0);
        pred = (isOverflow(valLength) << 2) | (isOverSteps(stepCount) << 1) | isNormalExit(valLength, val[0]);
    } while(pred == 0);
    stepCount += g_delays[val[0] & DELAYMASK];
    stepCount = select(stepCount, MAXSTEP, pred > 1);
    g_totalSteps[get_global_id(0)] += stepCount;
    valLength = g_maxStep[get_global_id(0)];
    g_maxStep[get_global_id(0)] = max(stepCount, valLength);
    addRes = g_maxPos[get_global_id(0)];
    g_maxPos[get_global_id(0)] = select(addRes, kOffset | get_global_id(0), convert_ulong(stepCount > valLength));
} //collatzVariableLength()

As you can see, the kernel code becomes leaner (and maybe faster) than the previous one. |
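The prologue above computes val = c_baseVal + mulVal * c_a256 one 32-bit limb at a time through a 64-bit accumulator (the mul64L pattern); carry + limb + 32x32-bit product is at most 2^64 - 1, so the accumulator can never overflow. Here is a Python model of that limb loop (illustrative; the helper names are mine), checked against exact integer arithmetic:

```python
MASK32 = 0xFFFFFFFF

def to_limbs(n, count):
    """Split n into `count` little-endian 32-bit limbs."""
    return [(n >> (32 * i)) & MASK32 for i in range(count)]

def from_limbs(limbs):
    """Reassemble little-endian 32-bit limbs into an integer."""
    return sum(v << (32 * i) for i, v in enumerate(limbs))

def mul_add_limbs(base, mult, scalar):
    """base + scalar * mult over 32-bit limbs, as in the kernel prologue.

    Per limb: acc <= (2^32-1) + (2^32-1) + (2^32-1)^2 = 2^64 - 1,
    which is why a ulong accumulator in OpenCL holds it exactly.
    """
    out, acc = [], 0
    for b, m in zip(base, mult):
        acc = (acc >> 32) + b + scalar * m
        out.append(acc & MASK32)
    out.append(acc >> 32)  # final carry limb (val[valLength] in the kernel)
    return out
```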

5)
Message boards :
Cafe :
The Last One to Post Wins! XXI
(Message 22053)
Posted 745 days ago by sosiris Okay. Winning then. It's me, winning! |

6)
Message boards :
Science :
Suggestion : Use sieve to speed up
(Message 22047)
Posted 746 days ago by sosiris After a gap of more than a hundred days, I finally figured out how to make the precalculation kernel not too complicated. On my R9-380, the naive kernel took 60 ms for 2^22 numbers, and the one with pre-calculation took 53 ms (roughly the expected 11% improvement). Any feedback is appreciated.

Code: collatz256.cl is the naive kernel with the 256-step sieve; collatzPreCalc.cl is the one with 256-step precalculation.
https://drive.google.com/file/d/0B_zdIaR2HqBENm5iOS1abkRmbVU/view?usp=sharing
Also on the master branch of my GitHub repo: https://github.com/SosirisTseng/collatz-opencl

What to do next:
1. Organize the code, and remember to do it on another branch so I don't break the master branch.
2. There is about 20% thread divergence. Maybe a thread should grab a new number once it has finished counting the Collatz steps of its previous number. |

7)
Message boards :
Number crunching :
Optimizing Collatz Sieve
(Message 21966)
Posted 779 days ago by sosiris Hello, kcharuso. For CPUs and NV GPUs, lut_size = 20 is recommended since they have better caching; however, you should try it yourself to find the sweet spot. For ATI GPUs (at least for GCN ones), the size of the look-up table (LUT) should be limited in accordance with the L2 cache, or the GPU has to fetch (some of) the entries from VRAM, which is very slow. For instance, the L2 of my HD 7850 is 512 KB and one look-up table entry is 3 * 32-bit integers = 96 bits = 12 bytes, so I can fit 512 KB / 12 B = ~43K entries, about 2^15 to 2^16 (lut_size = 15~16). |
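The sizing rule above can be written as a one-liner; this is an illustrative Python helper (the cache and entry sizes are the post's figures, not queried from any device, and the function name is mine):

```python
import math

def max_lut_size_log2(cache_bytes, entry_bytes):
    """Largest lut_size_log2 whose entire table still fits in the cache."""
    return int(math.log2(cache_bytes // entry_bytes))

# HD 7850: 512 KiB of L2, 12-byte entries (3 x 32-bit ints)
# 512 * 1024 // 12 = 43690 entries, so 2^15 fits but 2^16 does not
```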

8)
Message boards :
Number crunching :
(unknown error) - exit code -1073741515 (0xc0000135)
(Message 21751)
Posted 851 days ago by sosiris
Should I adjust the kernel code so that the kernel uses different buffers for reading /writing? |

9)
Message boards :
Science :
Suggestion : Use sieve to speed up
(Message 21737)
Posted 853 days ago by sosiris Update : improved host code so lots of options are configurable https://drive.google.com/file/d/0B_zdIaR2HqBEMWZpM3Iwa1F1OFU/view?usp=sharing Code on GitHub also updated. |

10)
Message boards :
Number crunching :
WUs completing too quickly & getting too few WUs
(Message 21727)
Posted 854 days ago by sosiris What I've been running on the Nvidia side are a 750 and a 750 Ti. Partly because I tested the kernel on my 7850, that GPU got most of my 'love' and ran faster. |

11)
Message boards :
Number crunching :
WUs completing too quickly & getting too few WUs
(Message 21709)
Posted 856 days ago by sosiris Don't worry. I'm currently working on a more compute-intensive (at least 2.5x) sieve app which adopts the 256-step sieve instead of the 32-step one currently used, and it generates 2.5 million times more 'science' than the one without a sieve. BTW, it also consumes much less GPU RAM, which is good news. |

12)
Message boards :
Science :
Suggestion : Use sieve to speed up
(Message 21661)
Posted 865 days ago by sosiris The first working example using the 256-step sieve: https://drive.google.com/file/d/0B_zdIaR2HqBES01yMTg4QllPZ0k/view?usp=sharing

Performance-wise, it runs at 2^20 numbers per 28 ms, about 4 times slower than the current sieve app. That's expected, because 512-bit arithmetic is used and the average step count is about 2800 since the input is larger. Some possible optimizations are:
* Pre-calculate 256 steps before sending the numbers into the kernel (~10% improvement).
* Persistent threads (fewer threads per kernel, but one thread processes lots of numbers) to cover the thread divergence (~20% currently) from the while loop. |

13)
Message boards :
Science :
Suggestion : Use sieve to speed up
(Message 21237)
Posted 922 days ago by sosiris I created a GitHub repo to ease code management; anyone who is interested in this project can have a look or give some feedback: https://github.com/SosirisTseng/collatz-opencl The development will focus on utilizing the 256-step sieve, and (maybe) testing something interesting like using a persistent-thread model to counteract thread divergence. The kernel code is almost complete; however, the host code is still WIP. |

14)
Message boards :
News :
Collatz Sieve 1.10 Released for Windows
(Message 21171)
Posted 928 days ago by sosiris I looked at the assembly code the OpenCL compiler generated for AMD GPUs and found that 64-bit additions had about the same number of instructions as 32-bit ones, i.e. they are about the same speed. So ulongs might be better than do-it-yourself 64-bit integers for additions. |

15)
Message boards :
News :
Collatz Sieve 1.10 Released for Windows
(Message 21159)
Posted 931 days ago by sosiris I was suspicious about the result and did a deeper analysis today. It turned out the 32-bit kernel got stuck in the for loop but not the 64-bit one (weird!). They actually took the same number of GPU cycles on the add-with-carry statements. As for half adders (no carry in, just a + b), the 32-bit and 64-bit kernels are equal in speed, too. |

16)
Message boards :
News :
Collatz Sieve 1.10 Released for Windows
(Message 21158)
Posted 931 days ago by sosiris *edited out |

17)
Message boards :
News :
Collatz Sieve 1.10 Released for Windows
(Message 21150)
Posted 932 days ago by sosiris I did some testing on my Intel GPU too, because 'full adders' are also required in my new kernel. Using uint and carry: 3.3 ms. Using ulong: 0.32 ms. It looks like my Intel GPU handles 64-bit integers very well.

Code:

#define NLENGTH 16
#define NLOOPS 1024

__kernel void add32( __global uint *g_result, __constant uint *c_addVal )
{
    uint val[16];
    uint add[16];
    for(int i = 0; i < NLENGTH; ++i) { val[i] = g_result[i]; add[i] = c_addVal[i]; }
    for(int i = 0; i < NLOOPS; ++i) {
        uint carryIn = 0;
        for(int j = 0; j < NLENGTH; ++j) {
            uint sum = val[j] + add[j];
            uint carryOut = sum < val[j];
            sum += carryIn;
            carryIn = carryOut | (sum < carryIn);
            val[j] = sum;
        } //for(j)
    } //for(i)
    for(int i = 0; i < NLENGTH; ++i) { g_result[i] = val[i]; }
} //add32()

__kernel void add64( __global uint *g_result, __constant uint *c_addVal )
{
    uint val[16];
    uint add[16];
    for(int i = 0; i < NLENGTH; ++i) { val[i] = g_result[i]; add[i] = c_addVal[i]; }
    for(int i = 0; i < NLOOPS; ++i) {
        ulong addRes = 0;
        for(int j = 0; j < NLENGTH; ++j) {
            //promote to 64-bit before adding so the carry of val[j] + add[j] is kept
            addRes = convert_ulong(val[j]) + add[j] + (addRes >> 32);
            val[j] = convert_uint(addRes);
        } //for(j)
    } //for(i)
    for(int i = 0; i < NLENGTH; ++i) { g_result[i] = val[i]; }
} //add64()

I tested the kernels in the Intel kernel builder, both producing the same results. My GPU is the HD4000 in an i7-3610QM. |
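The add32 kernel's carry detection can be checked in isolation. Below is a Python model (not OpenCL; the function name is mine) of one add-with-carry cell as written above, verified against exact addition; the second comparison catches the case where adding the carry-in itself wraps:

```python
MASK32 = 0xFFFFFFFF

def adc32(a, b, carry_in):
    """One cell of the add32 carry chain: (a + b + carry_in) mod 2^32
    plus carry-out, detecting overflow via unsigned wrap-around."""
    s = (a + b) & MASK32
    carry_out = 1 if s < a else 0          # a + b wrapped past 2^32
    s = (s + carry_in) & MASK32
    carry_out |= 1 if s < carry_in else 0  # s + carry_in wrapped
    return s, carry_out
```

The add64 kernel instead promotes to 64 bits and keeps the carry in the high word, which is both simpler and (on this GPU) roughly 10x faster; the promotion has to happen before the addition, or the 32-bit sum wraps first and the carry is lost.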

18)
Message boards :
Science :
Are workunits on individual numbers or ranges of numbers?
(Message 21132)
Posted 934 days ago by sosiris About 2 ^ 40 numbers for a large WU, AFAIK. |

19)
Message boards :
News :
Collatz Sieve 1.10 Released for Windows
(Message 21128)
Posted 935 days ago by sosiris It looks like there is a bug with the offset calculation in the kernel. That is a result of my changing it to work with multiple sieve sizes to keep GPUs from running out of RAM and crashing the video driver.

I heard that one could test A + B >= 2^32 (both 32-bit uints) by (A >= -B), because -B == 2^32 - B in OpenCL (it's two's complement). Note that this mispredicts a carry when B == 0, so the safe form (A > ~B) is used below. So this may be what you want:

uint4 addu4u4(const uint4 a, const uint4 b)
{
    uint4 s; //sum
    uint c;  //carry
    s.s0 = a.s0 + b.s0;
    c = (a.s0 > ~b.s0);
    s.s1 = a.s1 + b.s1 + c;
    c = (a.s1 > ~b.s1) | (c > s.s1); //for the case that a+b == 0xFFFFFFFF && c == 1
    s.s2 = a.s2 + b.s2 + c;
    c = (a.s2 > ~b.s2) | (c > s.s2);
    s.s3 = a.s3 + b.s3 + c;
    return s; //the carry out of s.s3 is dropped
}

But still, using INT64 makes the code cleaner. I'm not sure if it's faster or slower; it may depend on the hardware. And if my memory serves me right, OpenCL has supported INT64 since version 1.0 (FP64 is an optional feature in 1.0, though). |
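The carry trick above can be sanity-checked exhaustively at the edge values. This Python sketch (mine, not from the post) compares both predicates against the true carry; the >= form mispredicts exactly when B == 0, while A > ~B is exact for all inputs:

```python
M32 = 0xFFFFFFFF

def carry_geq_neg(a, b):
    """Predicate 'a >= -b' with two's-complement -b, as in the post."""
    return int(a >= ((-b) & M32))

def carry_gt_not(a, b):
    """Safe form: a + b >= 2^32 iff a > ~b (~b = 2^32 - 1 - b)."""
    return int(a > (b ^ M32))

def carry_exact(a, b):
    """Ground truth: the real carry out of a 32-bit addition."""
    return (a + b) >> 32
```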

20)
Message boards :
Science :
Suggestion : Use sieve to speed up
(Message 21093)
Posted 940 days ago by sosiris http://pastebin.com/Wmyq0YvM This is the code that precalculates N steps for an N-step sieve number. It returns the precalculated coefficients (i.e. a256 and b256 in the comments in the kernel file) and the odd-number occurrence count (actually, INITSTEPS = 256 + oddOccurrence) for the kernel. |

Copyright © 2018 Jon Sonntag; All rights reserved.