Suggestion : Use sieve to speed up
Message boards : Science : Suggestion : Use sieve to speed up

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19505 - Posted: 20 May 2014, 0:44:46 UTC - in response to Message 19474.

Increasing the work items beyond 128 seems to slow it down, but that is probably due to the number of registers available per work item, and I know that will change depending upon the compute capability of the nVidia GPUs.


Setting it to 64 already achieved 100% kernel occupancy on my GPU (HD7850). I even modified the kernel so it uses fewer variables (for example, lut also acts as stepIn after the loop is completed, and cont / carry use the same var).

BTW, codeXL gave me a blue screen whenever I tried to profile the kernel at first. After reverting Catalyst from 14.4 to 14.3beta, it magically worked. Yet another driver-related issue.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19506 - Posted: 20 May 2014, 1:12:48 UTC - in response to Message 19505.

As a side note, VALU-busy of the kernel is only 30% (SALU-busy ~5%). MemoryUnitBusy is 98%. That means this kernel is fetch-bound. (bounded by global memory access, I guess)
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19507 - Posted: 20 May 2014, 6:34:29 UTC - in response to Message 19506.

As a side note, VALU-busy of the kernel is only 30% (SALU-busy ~5%). MemoryUnitBusy is 98%. That means this kernel is fetch-bound. (bounded by global memory access, I guess)




After I found the kernel was fetch-bound, I started tweaking it. The look-up table was shrunk to 64KB (i.e. 12-step) to fit into constant memory on my AMD card (I'll call the new kernel v2, in comparison with the original v1). I ran 10 kernels in a row for each version, and they gave consistent results. A 29-step sieve was used as it gave a large enough global work size.

Time: (ms per kernel)
v1 : 38.6
v2 : 20 (almost a half)

VALU busy: (percentage of GPU time when vector ALU instructions are processed)
v1 : 30%
v2 : 98% (3 times)

SALU busy : (percentage of GPU time when scalar ALU instructions are processed)
v1 : 6%
v2 : 19% (3 times, too)

MemUnitBusy : (percentage of GPU time when the memory unit is active)
v1 : 98%
v2 : 55% (about a half)

MemUnitStalled : (less is better)
v1 : 6.4%
v2 : 0.5% (less than 1/10)

So moving the look-up table from global memory to constant memory really helps performance: it turned the kernel from fetch-bound into ALU-bound (at least on my 7850) and potentially makes it faster.
____________
Sosiris, team BOINC@Taiwan

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 19515 - Posted: 20 May 2014, 13:49:25 UTC

The original Brook version used texture memory, which was noticeably faster than global memory. When it was ported to OpenCL, AMD didn't support texture memory at first. Also, there is nothing in the spec that says OpenCL image buffers must use AMD textures, so the implementation is up to the hardware vendor. When AMD finally got image support into their OpenCL version, it wasn't any faster than global memory, which is why the lookup table (lut) was never changed back.

So, moving the lut to constant memory made a huge difference. I was under the impression that memory allocated via CL_MEM_READ_ONLY would be treated as constant memory. I guess that only happens if the variable is smaller than the constant memory buffer size. Have you tried moving the sieve to constant memory as well?

For example, if sending 2^24 numbers allows 100% occupancy but only 2^12 items fit in the sieve, have the main kernel break it into groups of 2^12 numbers, which it would send to a second kernel to compute. That would essentially be like calling the kernel with an offset parameter.

That, or only send 2^12 items to the main kernel and have it loop N times internally so that it actually does N * 2^12 iterations within the same kernel call. That would allow the program to dynamically change the kernel size to match the GPU so that no kernel call takes > 33 milliseconds (which would crash the video driver) and yet can keep the GPU as busy as possible. The app would need to check the timing when first starting up to determine the correct value for N, but that should take < 1 second. After that, it should run at 97-99% load without having to mess around with a config file and manual settings.
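A minimal host-side sketch of that calibration idea (illustrative only; names like calibrateN, the argument index, and the 16 ms threshold are assumptions, not the project's code):

// Sketch only: find an N such that one call of a kernel that loops N times
// internally stays safely under the ~33 ms watchdog limit. 'queue' and
// 'kernel' are assumed to be set up already, and N is assumed to be the
// kernel argument at index nArgIndex.
#include <CL/cl.hpp>
#include <chrono>

cl_uint calibrateN(cl::CommandQueue& queue, cl::Kernel& kernel,
                   size_t globalSize, cl_uint nArgIndex)
{
    cl_uint n = 1;
    for (;;) {
        kernel.setArg(nArgIndex, n);
        auto t0 = std::chrono::steady_clock::now();
        queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                                   cl::NDRange(globalSize), cl::NullRange);
        queue.finish();   // block so the wall-clock measurement is real
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        // Doubling a call that already takes >16 ms would risk passing 33 ms,
        // so stop here; also cap n to keep the calibration loop bounded.
        if (ms > 16.0 || n >= (1u << 20))
            return n;
        n <<= 1;
    }
}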

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 19517 - Posted: 20 May 2014, 16:27:04 UTC

Changing the lookup table from global 2^20 to constant 2^12 caused the kernel on my nVidia GTX 670M to take 5 times longer. Did you hard-code the table as a constant or pass it in to the kernel as a constant argument?

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19521 - Posted: 21 May 2014, 2:31:12 UTC - in response to Message 19517.

Changing the lookup table from global 2^20 to constant 2^12 caused the kernel on my nVidia GTX 670M to take 5 times longer. Did you hard-code the table as a constant or pass it in to the kernel as a constant argument?


I just changed the kernel parameter list:

__kernel void kernelSteps (
__constant uint4 *table,
__global const uint *sieve, const uint sieveStep,
const uint4 start,
const uint offset, //offset = 1 means offset = 1*2^(sieveStep);
const __global uint4 *resIn, __global uint4 *resOut){...}

The host code is the same, except that tableStep was changed to 12.
AFAIK, __global const means read-only global memory; the __constant qualifier is required for constant memory.
Perhaps the slowdown is hardware-specific (a speed-up on AMD, a slowdown on NV). But it should not be 5 times slower, since theoretically it would only be about 40% slower if all other conditions were identical. I recommend using a profiler (for NV cards, please see the Visual Profiler) to see where the bottleneck is. I'll try my GT640M on my laptop once it's set up.

So, moving the lut to constant memory made a huge difference. I was under the impression that memory allocated via CL_MEM_READ_ONLY would be treated as constant memory. I guess that only happens if the variable is smaller than the constant memory buffer size. Have you tried moving the sieve to constant memory as well?

The __constant qualifier is required; CL_MEM_READ_ONLY only tells OpenCL the buffer is read-only, if my memory serves me right.
Because constant memory is only 64 KB in total on AMD/NV cards, the sieve cannot fit once a 64 KB lut is allocated; the OpenCL API raises an error otherwise.
Moreover, it's not necessary to put the sieve into constant memory. First, it is read only once in the kernel. Second, the access is coalesced (reading ~16 ints at a time), so it's quite efficient. In contrast, the lut is read dozens of times per kernel with random access, so reading it causes long waits. Third, the v2 kernel I mentioned is already ALU-bound (memory units waiting for ALUs to finish computation), so faster memory access probably would not make the kernel any faster in this case.
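To make that distinction concrete, here is a minimal host-side sketch (hypothetical names, not the project's actual host code): the read-only flag is chosen when the buffer is created, while the constant address space comes from the __constant qualifier on the kernel parameter shown above.

// Sketch: CL_MEM_READ_ONLY is a buffer flag set by the host; the constant
// address space comes from the __constant qualifier on the kernel parameter.
// The returned buffer must stay alive until the kernel has finished running.
#include <CL/cl.hpp>
#include <vector>

cl::Buffer makeLutBuffer(const cl::Context& context, cl::Kernel& kernelSteps,
                         const std::vector<cl_uint4>& lutHost) // 2^12 uint4 = 64 KB
{
    cl::Buffer lutBuf(context,
                      CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,  // read-only to the device
                      lutHost.size() * sizeof(cl_uint4),
                      const_cast<cl_uint4*>(lutHost.data()));
    // Kernel side: __kernel void kernelSteps(__constant uint4 *table, ...)
    kernelSteps.setArg(0, lutBuf);    // arg 0 is the __constant table
    return lutBuf;
}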

That would allow the program to dynamically change the kernel size to match the GPU so that no kernel call takes > 33 milliseconds (which would crash the video driver) and yet can keep the GPU as busy as possible. The app would need to check the timing when first starting up to determine the correct value for N, but that should take < 1 second. After that, it should run at 97-99% load without having to mess around with a config file and manual settings.


Smart adjustment of the workload would be great!

One more thing: since global memory is as fast as local memory on the CPU, there's no need to change the code (as I did) for kernels running there. Perhaps we can provide separately tuned kernels for CPUs and GPUs.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19522 - Posted: 21 May 2014, 8:09:18 UTC - in response to Message 19521.

for NV cards, please see the Visual Profiler

I spent an afternoon trying to profile the v2 kernel (the one that uses constant memory) on my GT640M and found that OpenCL support was removed in Visual Profiler v5.0 (it currently supports only CUDA), so I need to find an older version (maybe v4.2). Wish me luck.
____________
Sosiris, team BOINC@Taiwan

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 19525 - Posted: 21 May 2014, 15:08:03 UTC

After trying the __constant argument to the kernel, I also changed the kernel so that the lookup table was pre-defined as a constant, so it didn't need to be passed in as an argument with each kernel call. That improved performance significantly but was still slower than using a 2^20-size lookup table. I'm going to compare the speed again with your kernel and see whether the results are the same when passing in an array of uint rather than uint4 (needed for back-to-back kernel calls). It may have to do with how efficiently nVidia vs. AMD transfers data and/or the type of data to the GPU, in addition to the access speed of constant vs. local vs. global variables.

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19526 - Posted: 21 May 2014, 16:33:14 UTC - in response to Message 19522.

for NV cards, please see the Visual Profiler

I spent an afternoon trying to profile the v2 kernel (the one that uses constant memory) on my GT640M and found that OpenCL support was removed in Visual Profiler v5.0 (it currently supports only CUDA), so I need to find an older version (maybe v4.2). Wish me luck.


I could not get any profiler running for NV OpenCL kernels after several attempts, but I reproduced your problem on my GT640M.

Running 10 kernels with a 26-step sieve:
v1 (global mem, 20-step) : 307 ms
v2 (constant mem, 12-step) : 2976 ms (almost 10 times slower)

I don't know the exact reason because of the lack of available profilers to count ALU usage, memory usage, etc. But according to the NVIDIA OpenCL Best Practices Guide, "for all threads of a half warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address. Accesses to different addresses by threads within a half warp are serialized." Perhaps the waiting time was just too long, making it slower than global memory.

As to AMD cards, according to AMD's documentation, reading constant memory is about the same speed as reading global memory for a look-up table. But constant memory data is also cached in the L2 cache, so if there is a cache hit, it can be faster than global memory (much less waiting). The cache-hit rate of the v2 kernel is ~99% (v1 is 18-19%). That's probably the reason.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19527 - Posted: 22 May 2014, 3:57:05 UTC - in response to Message 19526.

I found something even more interesting after I tested the effectiveness of global memory caching.

(All of the kernels use a 29-step sieve; each kernel was run 20 times)
v1: 20-step, global mem
v2: 12-step, constant mem
v3: 12-step, global mem
v4: 15-step, global mem (512KB, the same as the L2 cache size on the HD7850)

Execution time : (ms per kernel)
v1 : 38.5 ms
v2 = v3 : 20-21 ms
v4 : 16.5 ms

VALU use:
v1: 30%
v2 = v3 = v4 : 98.5%

Cache hits:
v1 : 19%
v2 = v3 : 93.5%
v4 : 80%

The data gathered by CodeXL (csv file) :
https://drive.google.com/file/d/0B_zdIaR2HqBERVVIczNUaVBGTDg/edit?usp=sharing

Looks like it does not matter whether the look-up table is located in constant or global memory; what really matters is cache hits. The 20-step look-up table (16MB) is just too big for the L2 cache (512KB) on GCN devices. Smaller tables that fit in the L2 are read much faster. I have no idea whether NV cards or pre-GCN AMD cards behave the same; more benchmarking is still needed. A user-defined lut step may be the way to adjust it for different GPUs/CPUs.
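One possible way to pick the lut step per device at run time (a rough sketch only; it assumes CL_DEVICE_GLOBAL_MEM_CACHE_SIZE is a usable stand-in for the cache the table ends up in, which isn't guaranteed everywhere, and 'device' is a cl::Device already chosen by the host):

// Sketch: choose the largest LUTSTEP whose uint4 table still fits the
// device's reported global memory cache, clamped to the 12..20 range
// discussed in this thread.
cl_ulong cacheBytes = device.getInfo<CL_DEVICE_GLOBAL_MEM_CACHE_SIZE>();
if (cacheBytes == 0)                      // some devices report 0; pick a fallback
    cacheBytes = 512 * 1024;
unsigned lutStep = 12;
while (lutStep < 20 &&
       (sizeof(cl_uint4) << (lutStep + 1)) <= cacheBytes)
    ++lutStep;
// A 512 KB cache yields lutStep == 15 (2^15 * 16 bytes = 512 KB), matching v4 above.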

As a side note, the host code can define a constant using the preprocessor in the build options if the value never changes after the OpenCL program is built. It's a trick from the Intel® OpenCL SDK optimization guide.
For example, if we want to build a kernel for a 20-step lut, attach the build option "-D LUTSTEP=20" in the host code. It is equivalent to #define LUTSTEP 20 in the kernel code, so passing it via arguments is not needed, saving some overhead. If we want another LUTSTEP value, the kernel code must be recompiled (not a problem since the Collatz app uses the same kernel until it exits). Hope this helps.
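A minimal host-side sketch of that trick with the C++ wrapper (function and variable names are just illustrative):

// Sketch: bake LUTSTEP/SIEVESTEP into the kernel at build time via -D options,
// so they never need to be passed as kernel arguments.
#include <CL/cl.hpp>
#include <string>
#include <vector>

cl::Kernel buildKernelSteps(const cl::Context& context, const cl::Device& device,
                            const std::string& kernelSource) // contents of the .cl file
{
    cl::Program program(context, kernelSource);
    std::vector<cl::Device> devices(1, device);
    program.build(devices, "-D LUTSTEP=20 -D SIEVESTEP=29"); // acts like #define in the kernel
    return cl::Kernel(program, "kernelSteps");
}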
____________
Sosiris, team BOINC@Taiwan

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 19532 - Posted: 22 May 2014, 16:05:36 UTC

I'm revisiting use of cl_ulong instead of cl_uint in the kernel since using a sieve size of 2^29 with padding only allows 4 kernels per reduction or it overflows cl_uint. I am hoping that running more kernels back to back will offset the time needed to copy twice as much data back and forth.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 19535 - Posted: 22 May 2014, 19:44:31 UTC

The following kernel crashes CodeXL when it tries to compile it. Unhandled exception and then total crash. It compiles and runs OK on my laptop's nVidia GPU though. Looks like AMD still hasn't fixed their optimization logic. This problem is the whole reason I first posted to the AMD forum over a year ago. If you disable optimization "-O0" then it compiles OK but performance is terrible.


__constant unsigned int LOOKAHEAD=20;
__constant unsigned int BITS=32;
__constant unsigned long maxStep = 0x1000000ul;

inline ulong2 mul128(const unsigned long a, const unsigned long b) {
    return (ulong2)(a*b,mul_hi(a,b));
}

__kernel void kernelSteps64(__global const uint *sieve, const unsigned long offset, const ulong2 start, __global ulong4 *steps, __global const uint4 *mosc) {
    const uint lookahead = LOOKAHEAD;
    const uint4 sc = (uint4)((1<<lookahead)-1,BITS-lookahead,(1<<lookahead)+1,0);
    const uint t_offset = get_global_id(sc.w);
    const unsigned long totalOffset = offset + sieve[t_offset];
    ulong2 carry,mul_r;
    uint4 lut;
    ulong2 icont;
    ulong4 stepsOut,val;
    val.x = start.x;
    val.y = start.y;
    icont.x = sc.w;
    val.x += totalOffset;
    carry.x = (val.x < totalOffset);
    val.y += carry.x;
    val.z = val.w = sc.w;
    icont.y = 1;
    while(icont.y)
    {
        lut = mosc[val.x & sc.x];

        mul_r = mul128((val.x >> lookahead) + (val.y << sc.y), (unsigned long)lut.x);
        val.x = mul_r.x + lut.y;
        carry.x = mul_r.y + (val.x < mul_r.x);

        mul_r = mul128((val.y >> lookahead) + (val.z << sc.y), (unsigned long)lut.x);
        val.y = mul_r.x + carry.x;
        carry.y = mul_r.y + (val.y < mul_r.x);
        mul_r = mul128((val.z >> lookahead) + (val.w << sc.y), (unsigned long)lut.x);
        val.z = mul_r.x + carry.y;
        carry.x = mul_r.y + (val.z < mul_r.x);
        mul_r = mul128((val.w >> lookahead), (unsigned long)lut.x);
        val.w = mul_r.x + carry.x;
        carry.y = mul_r.y + (val.w < mul_r.x);
        icont.x += lut.z;
        icont.y = ((val.x > (unsigned long)sc.z) | val.y | val.z | val.w | carry.y) && (icont.x<maxStep);
    }
    icont.x += (unsigned long)mosc[(val.x-2u) & sc.x].w;
    if(carry.y)
        icont.x = 0x1000000u;
    if (offset == 0)
        val = (ulong4)(0,0,0,0);
    else
        val = steps[t_offset];
    stepsOut.z = carry.x = val.z + icont.x;
    stepsOut.w = val.w + (carry.x < icont.x);
    if (icont.x > val.x)
    {
        stepsOut.x = icont.x;
        stepsOut.y = totalOffset;
    } else {
        stepsOut.x = val.x;
        stepsOut.y = val.y;
    }
    steps[t_offset] = stepsOut;
}

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19536 - Posted: 23 May 2014, 5:24:50 UTC - in response to Message 19535.

The following kernel crashes CodeXL when it tries to compile it. Unhandled exception and then total crash. It compiles and runs OK on my laptop's nVidia GPU though. Looks like AMD still hasn't fixed their optimization logic. This problem is the whole reason I first posted to the AMD forum over a year ago. If you disable optimization "-O0" then it compiles OK but performance is terrible.


Yeah, I got the same error when trying to compile your kernel with codeXL. Intel kernel builder worked fine.

BTW, I'm also trying to simplify the look-up table to reduce memory bandwidth demand, because s0 is determined by s2 (s0 = 3^(s2-lookahead)). So I need to make __constant power3[21] = {1,3,9,...} in the program scope; the table is reduced to uint2 (storing the original s1 and s2), and s3 becomes a separate table (I'll call it delays) because it's used only once after exiting the loop.

I'm going to travel abroad from Sunday(May. 25) to Friday without a computer (I don't have a smartphone, either), so I won't be able to reply next week.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19537 - Posted: 23 May 2014, 5:59:37 UTC - in response to Message 19532.

I'm revisiting use of cl_ulong instead of cl_uint in the kernel since using a sieve size of 2^29 with padding only allows 4 kernels per reduction or it overflows cl_uint. I am hoping that running more kernels back to back will offset the time needed to copy twice as much data back and forth.


My approach is to use 2 uints to record the total offset: one is the sieve value (sieveVal), the other is the kernel set number (kNo).

The host code looks like:
for(uint kNo=0; kNo<kPerRdx; ++kNo) kernel(.....,kNo,...);
//...
uint64_t total_offset = ((uint64_t)kNo<<sieveStep) + sieveVal;


The kernel code:
uint8 val = (uint8)(start.s0 + sieveVal + (kNo << SIEVESTEP), start.s1 + (kNo >> (32-SIEVESTEP)), start.s23, (uint4)0);


I (personally) would not like to deal with 64-bit integers on the GPU because 64-bit ops are emulated by multiple 32-bit ops; 64-bit IOPS is just a quarter of 32-bit IOPS.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19538 - Posted: 23 May 2014, 9:35:09 UTC - in response to Message 19536.

BTW, I'm also trying to simplify the look-up table to reduce memory bandwidth demand, because s0 is determined by s2 (s0 = 3^(s2-lookahead)). So I need to make __constant power3[21] = {1,3,9,...} in the program scope; the table is reduced to uint2 (storing the original s1 and s2), and s3 becomes a separate table (I'll call it delays) because it's used only once after exiting the loop.


OK, I tried it.
The kernel:

//NOTICE: This kernel must define LUTSTEP and SIEVESTEP

//Table: power of 3
__constant uint power3[21] =
{1u,3u,9u,27u,81u,243u,729u,2187u,6561u,19683u,
59049u,177147u,531441u,1594323u,4782969u,14348907u,43046721u,129140163u,387420489u,1162261467u,
3486784401u
};

inline uint2 mul64(const uint a, const uint b){
return (uint2)(a*b, mul_hi(a,b));
}

__kernel void kernelSteps ( __global const uint2 *table, //look-up table for jumping in collatz algorithm. Size = 2^(LUTSTEP)
__global const uint *delay, //Collatz delays for numbers under 2^(LUTSTEP). Size = 2^(LUTSTEP)
__global const uint *sieve, //Numbers needed to compute for every 2^(SIEVESTEP) numbers. size = global work size
const uint4 start, //Starting value (128-bit), must be multiple of 2^(SIEVESTEP)
const uint kNo, //'kernel set' number, kNo = 1 means offset = 1* 2^(SIEVESTEP). kNo = 0 means clear the result
__global uint4 *results //Result: s0 : max collatz step; s1: kNo; s2: sieveVal; s3 : sum of steps. Size = global work size
)
{
const uint gid = get_global_id(0);
const uint bitDiff = 32 - LUTSTEP, tableSizeMinusOne = (1u << LUTSTEP) - 1;
const uint maxStep = 0xfffffff; //maximum step allowed by this kernel (2^28-1)
const uint sieveVal = sieve[gid]; //sieveVal : sieve value, as an kNo
uint2 lut, mulResult, val_h; //lut : table item. mulResult : mul64() result, val_h : val and val_h combined to be a 192-bit integer
uint4 val = start;
uint contCarry = 1u, stepCount = 0, p3 = 0, overflow = 0; //contCarry : if loop should continue and hold carry bits, stepCount : count collatz Delay, p3 : store power3[]
//Do Val += sieveVal + (kNo << SIEVESTEP)
val.s0 += sieveVal + (kNo << SIEVESTEP); //bit 0-31
val.s1 += (kNo >> (32-SIEVESTEP)) + (val.s0 < start.s0); //bit 32-63
val.s2 += (val.s1 < start.s1); //bit 64-95
val.s3 += (val.s2 < start.s2); //bit 96-127
val_h = (uint2) (val.s3 < start.s3, 0); //bit 128-195
//Loop
while(contCarry){
//get look-up table item
lut = table[val.s0 & tableSizeMinusOne];
p3 = power3[lut.s0];
//Do n = (n>>LUTSTEP)*a + b
//bit 0-31
mulResult = mul64((val.s0 >> LUTSTEP) + (val.s1 << bitDiff), p3); //Multiply 'a'
val.s0 = mulResult.s0 + lut.s1; //Add 'b'
contCarry = mulResult.s1 + (val.s0 < mulResult.s0); //count carry bits
//bit 32-63
mulResult = mul64((val.s1 >> LUTSTEP) + (val.s2 << bitDiff), p3);
val.s1 = mulResult.s0 + contCarry;
contCarry = mulResult.s1 + (val.s1 < mulResult.s0);
//bit 64-95
mulResult = mul64((val.s2 >> LUTSTEP) + (val.s3 << bitDiff), p3);
val.s2 = mulResult.s0 + contCarry;
contCarry = mulResult.s1 + (val.s2 < mulResult.s0);
//bit 96-127
mulResult = mul64((val.s3 >> LUTSTEP) + (val_h.s0 << bitDiff), p3);
val.s3 = mulResult.s0 + contCarry;
contCarry = mulResult.s1 + (val.s3 < mulResult.s0);
//bit 128-159
mulResult = mul64((val_h.s0 >> LUTSTEP) + (val_h.s1 << bitDiff), p3);
val_h.s0 = mulResult.s0 + contCarry;
contCarry = mulResult.s1 + (val_h.s0 < mulResult.s0);
//bit 160-191 and overflow detection
mulResult = mul64((val_h.s1 >> LUTSTEP), p3);
val_h.s1 = mulResult.s0 + contCarry;
overflow = mulResult.s1 + (val_h.s1 < mulResult.s0);
stepCount += (lut.s0 + LUTSTEP); //add step
//if val < tableSize or overflow or step >= maxStep, exit loop (set contCarry to 0), else set it to 1
contCarry = ((val.s0 > tableSizeMinusOne) | (val.s1 | val.s2 | val.s3 | val_h.s0 | val_h.s1)) && (!(overflow | (stepCount >= maxStep)));
}
//If overflow is true, stepCount = maxStep; otherwise, stepCount += delay[val.s0&(tableSize-1)], contCarry is used as a temporary int
contCarry = stepCount + delay[val.s0 & tableSizeMinusOne];
stepCount = select(contCarry, maxStep, overflow);

//Use val as the result to save space as it's no longer used
val = results[gid];
//Compare "stepCount from this kernel" to "result from resIn array". contCarry is used here to decide to replace or not
//If 1. kNo is new (==0) or 2. the step from this kernel > the one from the array, replace it with the result from this kernel. I avoid using 'if' because GPUs handle it poorly.
contCarry = (stepCount > val.s0) | (kNo == 0);
val.s0 = select(val.s0, stepCount, contCarry);
val.s1 = select(val.s1, kNo, contCarry);
val.s2 = select(val.s2, sieveVal, contCarry);
//If kNo is new (==0), start a new sum; else, accumulate the sum
val.s3 = select(val.s3 + stepCount, stepCount, kNo == 0);

//save result to output array
results[gid] = val;
} //end of kernel


LUTSTEP = 16 (to fit the lut into the 512KB L2) and SIEVESTEP = 29.
Speed is ~16 ms per kernel (about the same as the previous v4 kernel). The ALUs are still busy and the memory units are still waiting for the ALUs to complete.
Performance counters:
https://drive.google.com/file/d/0B_zdIaR2HqBEOGE3eThfeU1odG8/edit?usp=sharing
I don't know whether this design is better or not. It's totally up to you.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19588 - Posted: 1 Jun 2014, 9:20:36 UTC - in response to Message 19538.

I made some kernel changes last week.


    1. The look-up table size and delay table size are decoupled. The results will be correct as long as the delay table is larger than the look-up table. We can expand the delay table a little and hope the value 'hits' the delay table sooner, which decreases kernel execution time.
    2. Changed the result array so it's more like yours, because I noticed sieveVal can be read in the host code and it's not necessary to keep it in the result array.
    3. Rearranged statements so that fewer registers are used.



kernel code :

http://pastebin.com/bDYHAhS2

V6 execution time went down from 16 ms (v5) to 14 ms (v6) per kernel (a 12.5% improvement) using LUTSTEP = 16, SIEVESTEP = 29, DELAYSTEP = 22 (v5 used LUTSTEP = 16, SIEVESTEP = 29, DELAYSTEP = 16). A little gain without complications.

It would be great if the future Collatz app provided options to adjust these 'step' parameters for anyone who really has the time to test and find the best values for their hardware.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19591 - Posted: 1 Jun 2014, 14:37:56 UTC - in response to Message 19588.

I also tried the latest Catalyst 14.6 beta; so far, so good with CodeXL.
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19613 - Posted: 16 Jun 2014, 16:14:24 UTC - in response to Message 19515.

That, or only send 2^12 items to the main kernel and have it loop N times internally so that it actually does N * 2^12 iterations within the same kernel call. That would allow the program to dynamically change the kernel size to match the GPU so that no kernel call takes > 33 milliseconds (which would crash the video driver) and yet can keep the GPU as busy as possible. The app would need to check the timing when first starting up to determine the correct value for N, but that should take < 1 second. After that, it should run at 97-99% load without having to mess around with a config file and manual settings.


Since the sieve step should be the same across all hosts (otherwise the results will differ from one host to another), I came up with the idea of using a 32-step sieve, the largest available with a 32-bit integer. Of course you can use any step you like; just stick to the same one.
It contains 41347483 numbers (thus 157MB in size). Large, but it can be partitioned into several 2^(items_per_kernel)-sized blocks and loaded sequentially. The last block can be sized according to "threads" (the smart padding you mentioned). The sieve can either be created on the fly or loaded from a binary file.
As for sieve generation, creating a 32-step sieve takes 15-17 seconds when running sequentially on the CPU (i5-2500k @ 4.5GHz), which is pretty long. So I made an OpenCL-accelerated getSieve() last week.
kernel code: http://pastebin.com/0RJv6m8p
host code : http://pastebin.com/aRuys8yM
The kernel code is very straightforward. The host code uses 2 command queues with interleaved commands to overlap data transfers and kernel execution (like asynchronous transfer). The execution time was under 1.6 secs (about 1/10 of the CPU reference).

FYI, in your kernel, instead of using a conditional in every kernel instance to check whether the offset is 0 and then clear the result, I recommend inserting a "clear result" kernel as needed (in fact, OpenCL 1.2 has clEnqueueFillBuffer() doing the same work, but v1.1 does not). Also, by using select(), expensive 'if' statements can be avoided to further optimize the kernel.
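A minimal sketch of that suggestion (illustrative only, not Slicker's actual code): a one-line clear kernel that works on OpenCL 1.1, with the 1.2 fill call shown as an alternative.

// Sketch: clear the result buffer once per work unit instead of branching on
// "offset == 0" inside every kernelSteps instance.
__kernel void clearResults(__global uint4 *results)
{
    results[get_global_id(0)] = (uint4)(0, 0, 0, 0);
}

/* Host side (C++ wrapper), enqueued once before the first kernelSteps call:
     queue.enqueueNDRangeKernel(clearKernel, cl::NullRange,
                                cl::NDRange(globalWSize), cl::NullRange);
   or, if OpenCL 1.2 is available:
     cl_uint4 zero = {{0, 0, 0, 0}};
     queue.enqueueFillBuffer(resultsBuf, zero, 0, globalWSize * sizeof(cl_uint4));
*/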
____________
Sosiris, team BOINC@Taiwan

Profile sosiris
Send message
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 19639 - Posted: 19 Jun 2014, 3:25:27 UTC

I'm currently trying to offload most of the reduction to an OpenCL kernel, called a parallel reduction, until the remaining result is too small for the OpenCL kernel to be efficient.
The reduction kernel looks like this:
(Note: the result and sieve buffers are assumed to have a size of 2^N)

//reduce() : compares the former half of the result buffer to the latter half, saves the results in the former half, thus effectively reducing the result buffer by half
//result : buffer from kernelSteps(), size = global work size from kernelSteps(), ASSUME SIZE = 2^n. s0 : max collatz step; s1: kNo; s23 : sum of steps
//sieve: the sieve value of the corresponding result, size = sizeof(result)
__kernel void reduce(__global uint4 *result, __global uint* sieve){
    const uint tid = get_global_id(0), farHalf = tid + (get_global_size(0) >> 1);
    uint4 resA = result[tid], resB = result[farHalf];
    uint sieveA = sieve[tid], sieveB = sieve[farHalf]; //index
    const uint larger = resA.s0 < resB.s0; //whether maxStep of resB is larger than that of resA
    resA.s0 = max(resA.s0, resB.s0);
    resA.s1 = select(resA.s1, resB.s1, larger);
    sieveA = select(sieveA, sieveB, larger);
    resA.s2 += resB.s2; //accumulate sum of steps
    resA.s3 += (resA.s2 < resB.s2) + resB.s3; // if s2 overflows, carry to s3
    //save results
    result[tid] = resA;
    sieve[tid] = sieveA;
} //reduce()


The host code looks like this:

for (cl_uint workSize = globalWSize; workSize > REDUCTION_LIMIT; workSize >>= 1){
cq.enqueueNDRangeKernel(kerReduc, cl::NullRange, cl::NDRange(workSize), lRange);
}

//Read the first 'REDUCTION_LIMIT' items respectively from result and sieve buffer...


That should further reduce CPU usage by a large factor: e.g. if items_per_kernel = 22 and REDUCTION_LIMIT = 1024, the theoretical CPU work is 2^22 / 2^10 = 4096x less per reduction.
____________
Sosiris, team BOINC@Taiwan

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar
Send message
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 1
Message 19641 - Posted: 19 Jun 2014, 15:34:53 UTC

I have quite a few versions of reduction kernels. The most efficient versions use shared memory, require calling barrier(CLK_LOCAL_MEM_FENCE), and must be 2^k items in size.
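For reference, the generic shape of such a shared-memory reduction looks roughly like the sketch below. It is a textbook version, not one of the project's kernels: it assumes the work-group size is a power of two and, for brevity, reduces only a running sum rather than the max-step/index/sum combination used above.

// Sketch: classic tree reduction in local memory. Each work-group collapses
// get_local_size(0) inputs into one partial result written by work item 0.
__kernel void reduceSum(__global const uint *in, __global uint *out,
                        __local uint *scratch)
{
    const uint lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);              // all loads visible before reducing

    for (uint stride = get_local_size(0) >> 1; stride > 0; stride >>= 1) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);          // finish this level before the next
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];     // one partial sum per work-group
}

Each launch shrinks the data by a factor of the work-group size, so only a few passes are needed before the remainder is small enough to finish on the CPU.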

FYI, the collatz applications are designed such that they can run the reduction on the GPU, the CPU, or both. That allows me to verify that the reduction kernel has no bugs. Even if only the GPU is used for reductions, any time higher steps are found, the CPU is used to validate that they are correct.
