Posts by sosiris
1) Message boards : Science : Suggestion : Use sieve to speed up (Message 23942)
Posted 292 days ago by Profile sosiris
It's me again. After dealing with some personal matters, I was finally able to sit down and think about more optimizations for this project.
The first one is to compute the glide steps correctly and efficiently, so we don't have to go all the way down to the full stopping time (a.k.a. the delay steps). My first attempt was a failure because of inaccuracies in floating-point numbers. This time I used a sliding window over the cumulative parity vector, checked against a minimal-requirement vector; the kernel drops the numbers whose requirements are not met. Because all of these are integer ops, the result should be exact.
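For the idea, here is a rough host-side C sketch (my illustration only; glideOfResidue and minOdd are made-up names, and it only works for small k before 64-bit overflow). For N = 2^k * n + b, the iterate after j steps is 3^o * 2^(k-j) * n + b_j, so for large n the glide of the whole residue class ends exactly when 3^o < 2^j, i.e. when the cumulative odd-step count o falls below a precomputed integer threshold minOdd[j] = ceil(j * ln2 / ln3):

#include <stdint.h>

/* Glide test in pure integer ops: drop the residue as soon as its
 * cumulative odd-step count falls below the precomputed threshold. */
static unsigned glideOfResidue(uint64_t b, unsigned k, const unsigned *minOdd)
{
    unsigned o = 0;                    /* cumulative parity: count of odd steps */
    for (unsigned j = 1; j <= k; ++j) {
        if (b & 1) { b = (3 * b + 1) / 2; ++o; }  /* odd T-step */
        else       { b >>= 1; }                   /* even T-step */
        if (o < minOdd[j])             /* 3^o < 2^j: requirement not met */
            return j;                  /* glide ended at step j; drop it */
    }
    return 0;                          /* survives all k look-ahead steps */
}

Since minOdd[] is computed once on the host, no floating point ever enters the kernel.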
The second one is to increase the look-ahead steps of the sieve. Currently, the 32-step sieve lets 1% of the numbers pass through; the 256-step one only allows 4 per 10 million. The more look-ahead steps the sieve has, the fewer numbers we need to check: while the running time per number goes up linearly (proportional to the bit length of the value), the problem size goes down approximately exponentially, which is great (see lemma 6 of http://www.ericr.nl/wondrous/terras.html). And since we only need a couple of residues instead of the entire sieve to get the kernels working, being farsighted should not cause storage issues.
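To see how fast the survivor fraction shrinks, here is a small brute-force C sketch (again just my illustration, practical only for small k; the real 256-step sieve needs big integers and a smarter enumeration):

#include <stdio.h>
#include <stdint.h>

/* Count the k-step sieve survivors among residues b mod 2^k.  A residue is
 * eliminated as soon as 3^odd < 2^j after j steps, because then every large
 * number of the form 2^k*n + b has already dropped below its start. */
int main(void)
{
    const unsigned k = 20;             /* try 16, 20, 24, ... */
    uint64_t pow3[64] = { 1 };
    for (unsigned i = 1; i <= k; ++i) pow3[i] = 3 * pow3[i - 1];

    uint64_t survivors = 0;
    for (uint64_t b = 0; b < (1ull << k); ++b) {
        uint64_t x = b;
        unsigned odd = 0;
        int alive = 1;
        for (unsigned j = 1; j <= k && alive; ++j) {
            if (x & 1) { x = (3 * x + 1) / 2; ++odd; }
            else       { x >>= 1; }
            alive = pow3[odd] >= (1ull << j);
        }
        survivors += alive;
    }
    printf("%llu of 2^%u residues survive\n",
           (unsigned long long)survivors, k);
    return 0;
}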
Right now I'm writing a Collatz glide kernel which utilizes a 5000-step sieve with an 8192-bit value. Each thread holds 32 bits, so there are 256 threads in a thread block. It would be slower at computing one single value, but each value is now worth about a trillion ordinary ones.
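To give a flavor of that layout, here is a toy OpenCL sketch (not the WIP kernel; the name and the fixed 256-round carry loop are illustrative only) of one 8192-bit addition spread across a 256-thread block, each work-item owning one 32-bit limb:

/* Toy sketch: one 8192-bit value spread over 256 work-items, one 32-bit limb
 * each, added to a second value of the same layout.  Carries ripple through
 * local memory; 256 fixed rounds is the worst case, a real kernel would stop
 * as soon as no carries remain. */
__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void bignumAdd8192(__global uint * restrict g_val,
                   __global const uint * restrict g_add)
{
    __local uint l_carry[257];
    const uint lid = get_local_id(0);

    ulong s = convert_ulong(g_val[get_global_id(0)]) + g_add[get_global_id(0)];
    uint limb = convert_uint(s);
    if (lid == 0) l_carry[0] = 0u;          /* no carry into the lowest limb */
    l_carry[lid + 1] = convert_uint(s >> 32);
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint round = 0u; round < 256u; ++round) {
        const uint cin = l_carry[lid];      /* carry emitted by limb lid-1 */
        barrier(CLK_LOCAL_MEM_FENCE);
        s = convert_ulong(limb) + cin;
        limb = convert_uint(s);
        l_carry[lid + 1] = convert_uint(s >> 32);
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    g_val[get_global_id(0)] = limb;         /* carry out of limb 255 = overflow flag */
}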
And it's not April Fools. Believe me :)
2) Message boards : Science : Suggestion : Use sieve to speed up (Message 22795)
Posted 530 days ago by Profile sosiris
To my surprise, the variable-length kernel is slower than the fixed-length one (68 ms vs. 55 ms per 2^22 items), probably because of the loop unrolling in the fixed version and the extra branching instructions in the variable one. Those factors outweigh the benefit of having fewer limbs to calculate.

Fixed length:
https://drive.google.com/open?id=15eLvvnNfOtWWq4T_yQeKAtzJ7u4JjixXbre9Ll1frl0

Variable Length:
https://drive.google.com/open?id=1VIGAJ-_3upkyynGTnc4I_cm49sCRkE6Hq5Yk02vrCF4

Kernel code : (I merged the kernels into one since both use the same logic)

#ifdef TEST_COMPILE
//Constants required while compiling this kernel
#define LUTSIZE_LOG2 16 //Size of look-up table, in log2(size)
#define DELAYSIZE_LOG2 20 //Size of delay table, in log2(size)
#define INITSTEP 350 //collatz steps added in the precalculation
#endif //#ifdef TEST_COMPILE

#define VAL_LENGTH 16 //15 uints for value itself and 1 for overflow

#define LUTMASK ((1u<<LUTSIZE_LOG2)-1)
#define DELAYMASK ((1u<<DELAYSIZE_LOG2)-1)
#define MAXSTEP 0xFFFFFFFu

__constant uint power3[] =
{
1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
59049u,177147u,531441u,1594323u,4782969u,14348907u,43046721u,129140163u,387420489u,1162261467u,
3486784401u
};

inline ulong mul64L(uint a, uint b)
{
return upsample(mul_hi(a,b), a*b);
}

#ifdef VARIABLE_LENGTH //Functions for variable length kernel

//Assuming at least one element is not zero
inline uint getValLength(uint* val)
{
uint idx = VAL_LENGTH - 1;
while(val[idx] == 0) --idx;
return idx + 1;
}
inline uint isOverflow(uint valLength, uint* val)
{
return valLength >= VAL_LENGTH;
}
inline uint isNormalExit(uint valLength, uint* val)
{
return valLength == 1 && val[0] <= DELAYMASK;
}

#else //Functions for fixed length kernel

inline uint isOverflow(uint valLength, uint* val)
{
return val[VAL_LENGTH - 1] > 0;
}
inline uint isNormalExit(uint valLength, uint* val)
{
uint pred = 0;
#pragma unroll
for(uint i = 1; i < VAL_LENGTH-1; ++i) {
pred |= val[i];
}
return pred == 0 && val[0] <= DELAYMASK;
}
#endif //#ifdef VARIABLE_LENGTH

inline uint isOverSteps(uint stepCount)
{
return stepCount >= MAXSTEP;
}

//////////////////////////////////////////////////////////////////////
//
// Collatz OpenCL kernel to search for a counterexample to the Collatz conjecture, optimized by the 256-step sieve
//
// N0 = 2**256 * k + b0, where 0 <= b0 = sieveNum < 2**256
// = 2**256 * (gid + kOffset) + b0
// After 256 collatz iterations:
// N256 = 3**X * (gid + kOffset) + b256
// = (3**X * gid) + (3**X * kOffset + b256)
//
/////////////////////////////////////////////////////////////////////////

__kernel void collatz(
__global uint * restrict g_maxStep, /* maximum collatz step */
__global ulong * restrict g_maxPos, /* the very kOffset where the max step is */
__global ulong * restrict g_totalSteps, /* total collatz steps calculated */
__global const ulong * restrict g_lut, /* look-up table. lo: powerOf3 ; hi: addend */
__global const uint * restrict g_delays, /* collatz steps for # < 2**(DELAYSIZE_LOG2) */
__constant uint * restrict c_multiplier, /* 3**X */
const uint16 c_addent, /* 3**X * kOffset + b256, should be updated when launching kernels */
const ulong kOffset
){
uint val[VAL_LENGTH];
vstore16(c_addent, 0, val); //val = c_addent
uint pred = get_global_id(0); //pred as multiplier
//val += gid * c_multiplier
ulong addRes = val[0] + mul64L(pred, c_multiplier[0]);
val[0] = convert_uint(addRes);
#pragma unroll
for(uint i = 1; i < VAL_LENGTH; ++i) {
addRes = (addRes>>32) + val[i] + mul64L(pred, c_multiplier[i]);
val[i] = convert_uint(addRes);
}
#ifdef VARIABLE_LENGTH
uint valLength = getValLength(val);
#else
#define valLength (VAL_LENGTH-1)
#endif
uint stepCount = INITSTEP;
do {
addRes = g_lut[val[0] & LUTMASK]; //most time-consuming global mem. access in this kernel
pred = convert_uint(addRes);
stepCount += pred + LUTSIZE_LOG2;
pred = power3[pred]; //pred as multiplier
//val = (val >> LUTSIZE_LOG2) * multiplier + addend, only "valLength" numbers in val array are calculated
for(uint i = 0; i < valLength; ++i)
{
addRes = (addRes >> 32) + mul64L(pred, (val[i] >> LUTSIZE_LOG2) | (val[i+1] << (32 - LUTSIZE_LOG2)));
val[i] = convert_uint(addRes);
}
val[valLength] = convert_uint(addRes >> 32);
#ifdef VARIABLE_LENGTH
//valLength changes by 1 at most; read the new high limb once so we never index past val[]
pred = val[valLength]; //pred as the new high limb
valLength += pred > 0;
valLength -= (pred == 0) && (val[valLength - 1] == 0);
#endif //#ifdef VARIABLE_LENGTH
pred = (isOverflow(valLength, val) << 2) | (isOverSteps(stepCount) << 1) | isNormalExit(valLength, val);
} while(pred == 0);
stepCount += g_delays[val[0] & DELAYMASK];
stepCount = select(stepCount, MAXSTEP, pred > 1);
pred = g_maxStep[get_global_id(0)];
addRes = g_maxPos[get_global_id(0)];
g_totalSteps[get_global_id(0)] += stepCount;
g_maxStep[get_global_id(0)] = max(stepCount, pred);
g_maxPos[get_global_id(0)] = select(addRes, kOffset + get_global_id(0), convert_ulong(stepCount > pred));
} //collatz()

//clearRes() : clears result buffers, could use clEnqueueFillBuffer() in openCL 1.2 or above
__kernel void clearRes(
__global uint * restrict g_maxStep, /* maximum step for this thread */
__global ulong * restrict g_maxPos, /*position where the max step is*/
__global ulong * restrict g_totalSteps /* total collatz (delay) steps calculated */
) {
g_maxStep[get_global_id(0)] = 0u;
g_maxPos[get_global_id(0)] = 0ul;
g_totalSteps[get_global_id(0)] = 0ul;
} //clearRes()
3) Message boards : Number crunching : PCI-e Bandwidth Usage (Message 22599)
Posted 576 days ago by Profile sosiris
Theoretically, the Collatz sieve kernel uses little PCI-e bandwidth once the look-up tables are loaded into VRAM: just about one number per kernel launch.
4) Message boards : Science : Suggestion : Use sieve to speed up (Message 22528)
Posted 591 days ago by Profile sosiris
That's a brilliant idea. The kernel code would look like this:


#ifdef TEST_COMPILE
//Constants required while compiling this kernel
#define LUTSIZE_LOG2 16 //Size of look-up table, in log2(size)
#define DELAYSIZE_LOG2 20 //Size of delay table, in log2(size)
#define INITSTEP 350 //collatz steps added in the precalculation
#endif //TEST_COMPILE

#define VAL_LENGTH 16 //15 uints + 1 uint for overflow bits by default
#define LUTMASK ((1u<<LUTSIZE_LOG2)-1)
#define DELAYMASK ((1u<<DELAYSIZE_LOG2)-1)
#define MAXSTEP 0xFFFFFFFu

__constant uint power3[] =
{
1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
59049u,177147u,531441u,1594323u,4782969u,14348907u,43046721u,129140163u,387420489u,1162261467u,
3486784401u
};

inline ulong mul64L(const uint a, const uint b)
{
return upsample(mul_hi(a,b) /*hi*/, a*b /*lo*/);
}

//returns length of val in terms of 32-bit ints, assuming val != 0
inline uint getValLength(const __private uint * restrict val, uint idx)
{
while(val[idx] == 0) --idx;
return (idx + 1);
}

inline uint isOverflow(const uint valLength)
{
return valLength == VAL_LENGTH;
}

inline uint isNormalExit(const uint valLength, const uint val)
{
uint pred = (valLength == 1) && (val <= DELAYMASK);
return pred;
}

inline uint isOverSteps(const uint stepCount)
{
return stepCount >= MAXSTEP;
}

//////////////////////////////////////////////////////////////////////
//
// Collatz OpenCL kernel to search for a counterexample to the Collatz conjecture, optimized by the 256-step sieve
//
// N0 = 2**256 * k + b0, where 0 <= b0 = sieveNum < 2**256
// = 2**256 * (gid + kOffset + startHi) + b0
// After 256 collatz iterations:
// N256 = 3**X * (gid + kOffset + startHi) + b256
// = (3**X * gid) + (3**X * (startHi + kOffset) + b256);
// Increment of kOffset each kernel launch = GLOBALSIZE
//
/////////////////////////////////////////////////////////////////////////

__kernel void collatzVariableLength(
__global uint * restrict g_maxStep, /* maximum step for this thread */
__global ulong * restrict g_maxPos, /* position where the max step is */
__global ulong * restrict g_totalSteps, /* total collatz steps calculated */
__global const ulong * restrict g_lut, /* look-up table. lo: powerOf3 ; hi: addVal */
__global const uint * restrict g_delays, /* collatz steps for # < 2**(DELAYSIZE_LOG2) */
__constant uint * restrict c_a256, /* 3**X */
const uint16 c_baseVal, /* 3**X * (startHi + kOffset) + b256 */
uint valLength, /* length of baseVal*/
const ulong kOffset
){
//val = baseVal
uint val[VAL_LENGTH];
vstore16(c_baseVal, 0, val);
//val += (3**X * kOffset_lo)
ulong addRes = 0ul;
uint pred = get_global_id(0) + convert_uint(kOffset); //pred as mulVal
for(uint i = 0; i < valLength; ++i)
{
addRes = (addRes>>32) + val[i] + mul64L(pred, c_a256[i]);
val[i] = convert_uint(addRes);
} //for()
val[valLength] = convert_uint(addRes>>32);

//fill 0's for rest of digits
for(uint i = valLength + 1; i < VAL_LENGTH; ++ i)
{
val[i] = 0u;
} //for()
valLength += val[valLength] > 0;//Adjust valLength
uint stepCount = INITSTEP;
do
{
addRes = g_lut[val[0] & LUTMASK]; //most time-consuming global mem. access in this kernel
pred = power3[convert_uint(addRes)]; //pred as multiplier
stepCount += convert_uint(addRes) + LUTSIZE_LOG2;
//val = (val >> LUTSIZE_LOG2) * mulVal + addVal
for(uint i = 0; i < valLength; ++i)
{
addRes = (addRes >> 32) + mul64L(pred, (val[i] >> LUTSIZE_LOG2) | (val[i+1] << (32 - LUTSIZE_LOG2)));
val[i] = convert_uint(addRes);
} //for()
val[valLength] = convert_uint(addRes >> 32);
//valLength changes by 1 at most; read the new high limb once so we never index past val[]
pred = val[valLength]; //pred as the new high limb
valLength += pred > 0;
valLength -= (pred == 0) && (val[valLength - 1] == 0);
pred = (isOverflow(valLength) << 2) | (isOverSteps(stepCount) << 1) | isNormalExit(valLength, val[0]);
} while(pred == 0);
stepCount += g_delays[val[0] & DELAYMASK];
stepCount = select(stepCount, MAXSTEP, pred > 1);
g_totalSteps[get_global_id(0)] += stepCount;
valLength = g_maxStep[get_global_id(0)];
g_maxStep[get_global_id(0)] = max(stepCount, valLength);
addRes = g_maxPos[get_global_id(0)];
g_maxPos[get_global_id(0)] = select(addRes, kOffset | get_global_id(0), convert_ulong(stepCount > valLength));
} //collatzVariableLength()


As you can see, the kernel code becomes leaner (and maybe faster) than the previous one.
5) Message boards : Cafe : The Last One to Post Wins! XXI (Message 22053)
Posted 686 days ago by Profile sosiris
Okay. Winning then.


You only think you are winning, in fact I am winning!!


Not yet. Winning!


Now?


Sooner or later. Winning!


Okay then I choose now for me to be winning!!


Not now. Maybe later.


Now?


Nope. Winning!


It's me, winning!
6) Message boards : Science : Suggestion : Use sieve to speed up (Message 22047)
Posted 688 days ago by Profile sosiris
After a gap of more than a hundred days, I finally figured out how to make the precalculation kernel not too complicated. On my R9 380, the naive kernel took 60 ms for 2^22 numbers, and the one with pre-calculation took 53 ms (about the expected ~11% increase in speed). Any feedback is appreciated.

Code :
collatz256.cl is the naive kernel with 256-step sieve
collatzPreCalc.cl is the one with 256-step precalculation
https://drive.google.com/file/d/0B_zdIaR2HqBENm5iOS1abkRmbVU/view?usp=sharing

Also on the master branch of my github repo:
https://github.com/SosirisTseng/collatz-opencl

What to do next:
1. Organize the code, and remember to do it on another branch so I don't break the master branch.
2. There is about 20% thread divergence. Maybe a thread should grab a new number once it finishes counting the Collatz steps of its previous one; see the sketch below.
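A rough OpenCL sketch of that idea (my illustration only; the kernel name, buffers, and toy payload are all made up): work-items keep pulling the next index from a global atomic counter, so a thread that finishes a short trajectory immediately starts a fresh number instead of idling until the whole warp is done.

/* Persistent-thread sketch: each work-item repeatedly claims the next number
 * from a shared atomic counter instead of handling exactly one number. */
__kernel void collatzPersistent(
__global uint * restrict g_steps,          /* steps per number */
volatile __global uint * restrict g_next,  /* shared work index, host inits to 0 */
const uint total)                          /* how many numbers in this batch */
{
    for (;;) {
        const uint idx = atomic_inc(g_next);  /* claim the next number */
        if (idx >= total) return;             /* batch exhausted */
        /* Toy payload; the real kernel would run the sieve logic instead. */
        ulong n = 2ul * idx + 3ul;            /* some odd starting value */
        uint steps = 0u;
        while (n > 1ul && steps < 100000u) {
            n = (n & 1ul) ? (3ul * n + 1ul) >> 1 : n >> 1;
            ++steps;
        }
        g_steps[idx] = steps;
    }
}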
7) Message boards : Number crunching : Optimizing Collatz Sieve (Message 21966)
Posted 721 days ago by Profile sosiris
Hello, kcharuso.

For CPUs and NV GPUs, lut_size=20 is recommended since they have better caching. However, you should try it yourself to find the sweet spot.

For ATI GPUs (at least the GCN ones), the size of the look-up table (lut) should be limited in accordance with the L2 cache, or the GPU has to fetch (some of) the entries from VRAM, which is very slow. For instance, the L2 of my HD 7850 is 512 KB, and one look-up table entry is 3 * 32-bit integers = 96 bits = 12 bytes. So it can fit 512 KB / 12 B ≈ 43K entries, i.e. about 2^15~2^16 (lut_size = 15~16).
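That rule of thumb fits in a tiny helper; a hedged C sketch (the 12-byte entry size is from the estimate above, the function name is mine):

/* Largest lut_size_log2 whose table still fits in the given cache. */
unsigned lutSizeLog2(unsigned long cacheBytes, unsigned entryBytes)
{
    const unsigned long maxEntries = cacheBytes / entryBytes;
    unsigned log2Size = 0;
    while ((2ul << log2Size) <= maxEntries) /* largest power of two that fits */
        ++log2Size;
    return log2Size;
}

lutSizeLog2(512 * 1024, 12) returns 15, matching the HD 7850 estimate above.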
8) Message boards : Number crunching : (unknown error) - exit code -1073741515 (0xc0000135) (Message 21751)
Posted 793 days ago by Profile sosiris

The AMD drivers for Apple almost all crash. It looks like their "new" drivers are several years old. It mostly has to do with the way the driver does its optimization. The newer drivers seem to be able to handle both reading and writing a kernel parameter, whereas the older drivers would crash when the kernel tried to write to a parameter after it had previously read from the same variable. At least PCs have the ability to update the drivers; Apple users are "stuck" with the drivers shipped with the OS version.


Should I adjust the kernel code so that the kernel uses different buffers for reading and writing?
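If that is the fix, it would presumably look like this hedged sketch (names are mine): the max-step update reads from one buffer and writes to another, so no parameter is both read and written, and the host swaps the two buffers between launches.

/* Read/write split: g_maxStepIn is only read, g_maxStepOut only written. */
__kernel void updateMaxSplit(
__global const uint * restrict g_maxStepIn,
__global uint * restrict g_maxStepOut,
__global const uint * restrict g_newSteps)
{
    const size_t gid = get_global_id(0);
    g_maxStepOut[gid] = max(g_newSteps[gid], g_maxStepIn[gid]);
}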
9) Message boards : Science : Suggestion : Use sieve to speed up (Message 21737)
Posted 795 days ago by Profile sosiris
Update: improved the host code so lots of options are configurable.

https://drive.google.com/file/d/0B_zdIaR2HqBEMWZpM3Iwa1F1OFU/view?usp=sharing

Code on GitHub also updated.
10) Message boards : Number crunching : WUs completing too quickly & getting too few WUs (Message 21727)
Posted 796 days ago by Profile sosiris
What I've been running on the Nvidia side are 750, and 750ti.

The 750ti cards push out Collatz work units a bit slower than my ATI 7850.


That's partly because I tested the kernel on my 7850, so this GPU got most of my 'love' and runs faster.
11) Message boards : Number crunching : WUs completing too quickly & getting too few WUs (Message 21709)
Posted 798 days ago by Profile sosiris
Don't worry. I'm currently working on a more compute-intensive (at least 2.5x) sieve app which adopts the 256-step sieve instead of the 32-step one currently used. It generates 2.5 million times more 'science' than the one without a sieve. BTW, it also consumes much less GPU RAM, which is good news.
12) Message boards : Science : Suggestion : Use sieve to speed up (Message 21661)
Posted 807 days ago by Profile sosiris
The first working example using the 256-step sieve:
https://drive.google.com/file/d/0B_zdIaR2HqBES01yMTg4QllPZ0k/view?usp=sharing

Performance-wise, it runs 2^20 numbers per 28 ms, about 4 times slower than the current sieve app. That's expected, because 512-bit arithmetic is used and the average step count is about 2800 since the inputs are larger.

Some possible optimizations are:
* Pre-calculate 256 steps before sending the numbers into the kernel: ~10% improvement.
* Persistent threads (fewer threads per kernel, but one thread processes lots of numbers) to cover the thread divergence (~20% currently) from the while loop.
13) Message boards : Science : Suggestion : Use sieve to speed up (Message 21237)
Posted 864 days ago by Profile sosiris
I created a GitHub repo to ease code management, and anyone who is interested in this project can have a look or give some feedback:
https://github.com/SosirisTseng/collatz-opencl

The development will focus on utilizing the 256-step sieve, and (maybe) testing something interesting like a persistent-thread model to counteract thread divergence. The kernel code is almost complete; however, the host code is WIP.
14) Message boards : News : Collatz Sieve 1.10 Released for Windows (Message 21171)
Posted 870 days ago by Profile sosiris
I looked at the assembly code the OpenCL compiler generated for the AMD GPUs, and found that 64-bit additions took about the same number of instructions as 32-bit ones, i.e. they are about the same speed. So ulongs might be better than do-it-yourself 64-bit integers for additions.
15) Message boards : News : Collatz Sieve 1.10 Released for Windows (Message 21159)
Posted 873 days ago by Profile sosiris
I was suspicious about the result and did a deeper analysis today. It turned out the 32-bit kernel got stuck in the for loop but the 64-bit one did not (weird!). They actually took the same GPU cycles on the add-with-carry statements. As for half adders (no carry in, just a+b), the 32-bit and 64-bit kernels are equally fast, too.
16) Message boards : News : Collatz Sieve 1.10 Released for Windows (Message 21158)
Posted 873 days ago by Profile sosiris
*edited out
17) Message boards : News : Collatz Sieve 1.10 Released for Windows (Message 21150)
Posted 874 days ago by Profile sosiris
I did some testing on my Intel GPU too because 'full adders' are also required in my new kernel.

using uint and carry: 3.3 ms.
using ulong: 0.32 ms.
It looks like my Intel GPU handles 64-bit integers very well.

code:

#define NLENGTH 16
#define NLOOPS 1024

__kernel void add32(
__global uint *g_result,
__constant uint* c_addVal
){
uint val[16];
uint add[16];
for(int i=0;i<NLENGTH;++i)
{
val[i] = g_result[i];
add[i] = c_addVal[i];
}
for(int i=0 ;i<NLOOPS;++i)
{
uint carryIn =0;
for(int j=0;j<NLENGTH;++j)
{
uint sum = val[j] + add[j];
uint carryOut = sum < val[j];
sum += carryIn;
carryIn = carryOut | (sum < carryIn);
val[j] = sum;
} //for(j)
}//for(i)
for(int i=0;i<NLENGTH;++i)
{
g_result[i] = val[i];
}
} //add32()

__kernel void add64(
__global uint *g_result,
__constant uint* c_addVal
){
uint val[16];
uint add[16];
for(int i=0;i<NLENGTH;++i)
{
val[i] = g_result[i];
add[i] = c_addVal[i];
}
for(int i=0 ;i<NLOOPS;++i)
{
ulong addRes =0;
for(int j=0;j<NLENGTH;++j)
{
addRes = convert_ulong(val[j]) + add[j] + (addRes>>32); //widen first so the carry out of the 32-bit sum isn't lost
val[j] = convert_uint(addRes);
} //for(j)
}//for(i)
for(int i=0;i<NLENGTH;++i)
{
g_result[i] = val[i];
}
} //add64()


I tested the kernels in the Intel kernel builder, both producing the same results.
My GPU is the HD4000 in an i7-3610QM.
18) Message boards : Science : Are workunits on individual numbers or ranges of numbers? (Message 21132)
Posted 876 days ago by Profile sosiris
About 2^40 numbers for a large WU, AFAIK.
19) Message boards : News : Collatz Sieve 1.10 Released for Windows (Message 21128)
Posted 876 days ago by Profile sosiris
It looks like there is a bug with the offset calculation in the kernel. That is a result of my changing it to work with multiple sieve sizes to keep GPUs from running out of RAM and crashing the video driver.

To fix it, I need to find an efficient way to do 64-bit addition, or multiple 32-bit additions with carry logic, in the kernel. Since I don't think OpenCL 1.0 supported 64-bit logic, that presents a bit of a challenge which may result in different kernels running on older vs. newer hardware.



I heard that one can test whether A+B >= 2^32 (both 32-bit uints) with (A >= -B), because -B == 2^32 - B in OpenCL (it's two's complement). Note the trick misfires when B == 0 (then -B == 0 and the test is always true), so that case needs a guard. So this may be what you want:


uint4 addu4u4(const uint4 a, const uint4 b)
{
uint4 s; //sum
uint c; //carry
s.s0 = a.s0 + b.s0;
c = (b.s0 != 0) & (a.s0 >= -b.s0); //b != 0 guard: -0 == 0 would yield a false carry
s.s1 = a.s1 + b.s1 + c;
c = ((b.s1 != 0) & (a.s1 >= -b.s1)) |
(c > s.s1); //for the case that a+b == 0xffffffff && c == 1
s.s2 = a.s2 + b.s2 + c;
c = ((b.s2 != 0) & (a.s2 >= -b.s2)) | (c > s.s2);
s.s3 = a.s3 + b.s3 + c;
return s; //don't forget to return the sum
}


But still, using INT64 makes the code cleaner. I'm not sure if it's faster or slower; it may depend on the hardware. And if my memory serves me right, OpenCL has supported INT64 since ver 1.0 (FP64 is an optional feature in ver 1.0, though).
20) Message boards : Science : Suggestion : Use sieve to speed up (Message 21093)
Posted 882 days ago by Profile sosiris
http://pastebin.com/Wmyq0YvM
Also, this is the code that precalculates N steps for an N-step sieve number. It returns the precalculated coefficients (i.e. a256 and b256 in the comments in the kernel file) and the odd-step occurrence (actually, INITSTEPS = 256 + oddOccurrence) for the kernel.
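For reference, the math behind it as a small host-side C sketch (my illustration with made-up names, small k only; the real code runs k = 256 with big integers). For N = 2^k * n + b, k steps of T(x) = x/2 (even) or (3x+1)/2 (odd) turn N into 3^odd * n + T^k(b), and each odd T-step counts as two Collatz steps, hence INITSTEPS = k + oddOccurrence:

#include <stdint.h>

/* Precalculate k sieve steps for residue b: outputs a = 3^odd and
 * bK = T^k(b) so that 2^k*n + b iterates to a*n + bK after k steps. */
void precalcSieveStep(uint64_t b, unsigned k,
                      uint64_t *aOut, uint64_t *bOut, unsigned *oddOut)
{
    unsigned odd = 0;
    for (unsigned j = 0; j < k; ++j) {
        if (b & 1) { b = (3 * b + 1) / 2; ++odd; }  /* odd T-step */
        else       { b /= 2; }                      /* even T-step */
    }
    uint64_t a = 1;
    for (unsigned j = 0; j < odd; ++j) a *= 3;      /* a = 3^odd */
    *aOut = a;
    *bOut = b;
    *oddOut = odd;   /* the kernel then starts from INITSTEP = k + odd */
}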

