Posts by Gipsel
1) Message boards : Number crunching : All is "NOT" sunny in Tahiti... (Message 14164)
Posted 2056 days ago by Profile Gipsel
The OpenCL version works. It is slower than the version written in IL (though I think a HD7970 with OpenCL is still faster than a HD6970 with IL), but it produces correct results on HD7000 series GPUs.
Btw., I'm currently looking into some kind of hybrid IL/OpenCL application combining the speed of the old version with HD7000 series compatibility. Unfortunately, I haven't had as much time as I wished so far.
2) Message boards : Windows : Task doesn't suspend when on battery (Message 14031)
Posted 2069 days ago by Profile Gipsel
From what I recall, it is more a problem of BOINC than of the Collatz application itself. I remember modding the BOINC API lib (two years ago or so) to get rid of this behaviour. What happens is that suspend commands are basically ignored if they happen to arrive while the application is within a critical section (which happens quite often in GPU applications, at least if one wants maximum stability, because otherwise the application can crash under some circumstances [BOINC may kill the process on the host CPU while the kernel on the GPU is still running => bad]). That is a design error of the BOINC API lib. It should remember the suspend command until the application exits the critical section, not simply drop the command as it does (or did, maybe they fixed it in later versions).
It is somewhat connected to the problem here. The proposed solution there is to include exit code in the application so the context on the GPU also gets released. But I still think the whole suspend/exit procedure should work under the application's regime, as the proposed solution there only fixes the symptoms, not the cause of the problem (the critical sections would solve the problem too if they worked correctly [and are used correctly in the application]).

Btw., on my system the CUDA versions (I tried the automatically distributed one as well as the optimized one for the newer CUDA version) run without problems and also suspend is working without problems. Hmm.
3) Message boards : Windows : Collatz Conjecture OpenCL kernel causes crash with minimum settings. (Message 13999)
Posted 2077 days ago by Profile Gipsel
Yes, I ran the batch file.
Also, I program in OpenCL, so I can tell you first hand that if your OpenCL kernel runs slower than your Brook+ app or doesn't run in the first place, you're either using deprecated logic or have a few optimizations that need to be made.

It isn't actually slower than a Brook+ version, it is slower than some hand-tuned kernels written in IL. ;)
You simply can't squeeze that speed out of a high-level C-like language before some (more) intrinsics (like carries for integer adds) are exposed in OpenCL by some extension. And the compiler optimizations are not that good so far. For example, the code generated for evaluating conditionals is often not as efficient as it could be (which is something you can get by writing in IL).

And the speed of that IL code on the HD7000 series would be terrific (as Collatz achieved only a medium occupancy of the VLIW slots [below 3 of 5], lots of dependencies in the code), except that it doesn't return correct results. Or it returns correct results only sometimes. The problem is obviously some synchronization/coherency issue when doing read-write accesses to textures/images in the old version. GCN changed the whole memory structure and the caches for that, so the old rules are not valid anymore.

Btw., I would like to try writing ISA/assembler directly for GCN. It is a really clean architecture with quite some goodies. That would probably be even faster than IL (as not all capabilities of GCN are exposed in IL so far). But the documentation for it is spotty at best.
4) Message boards : Number crunching : 6850 performance? (Message 10804)
Posted 2572 days ago by Profile Gipsel
You are doing units in around 360 seconds and my 6850 is doing them in around 540 seconds, seems about right by comparison.

5870s are crunching faster than 6970s - that's why I'm a little disappointed. My 6970 GPU load: 95%

The problem is the very specific nature of the Collatz code. It uses only integer arithmetic (and a lot of it), something which is not common for normal graphics stuff.
Remember the huge advantage ATI cards had when compared to G80/G92/GT200 cards? The reason was simply that nvidia GPUs more or less emulated 32bit integer arithmetic with multiple operations, while the 5th t-unit of AMD GPUs could directly do 32bit integer multiplications. But while nvidia beefed up the (formerly abysmal) integer multiplication performance with the GF1xx line, AMD cut some corners there (as it can be considered not of much importance for normal graphics stuff). In fact, AMD deleted the 5th unit in each VLIW group, which was responsible for the 32bit integer multiplications. They are now done jointly by the remaining 4 units. That means each group (now consisting of only 4 units instead of 5) can still do a single 32bit integer multiplication each cycle (i.e. the theoretical peak performance stays the same). The difference is that the older GPUs could use the other units in parallel for the additions and bit shifts also used heavily by Collatz. On Cayman (HD6900 GPUs) this is not possible, which means it needs more cycles to complete the calculation.

To sum it up, Collatz is one of the rare examples where the new VLIW4 units cost some performance compared to the older VLIW5 architecture. And this happens even though the occupancy of the functional units with the Collatz code is actually relatively low (about 55%, which would normally mean the reduction to 4 units is not a problem). This is in contrast to Milkyway, where despite the high occupancy of the functional units (about 4.3 of 5 on Cypress iirc, i.e. ~86%) the performance of Cayman is nevertheless higher than Cypress (simply through the increased number of VLIW groups, which easily compensates for the reduction to VLIW4 there).
5) Message boards : Science : collatz parity help (Message 8728)
Posted 2754 days ago by Profile Gipsel
but if I only apply the F function 5 times, there can only be a max of 5 "odds along the way", c can't be greater than 5, can it? or am I missing some part of information.

But then you are using k=5.
Actually, the maximum c is exactly k. So if you use k=5, then c can't be greater than 5 (as in the wikipedia example).

But for k=16, c can of course be 16, and you will need to store 3^16 (43,046,721) as the maximum number in your lookup table for your p, as you have to calculate a*43,046,721 for a number where the last 16 bits are all set (which means you have to apply *(3x+1)/2 sixteen times in a row).
This can be shortened by shifting the whole number 16 bits to the right (/2^16), multiplying it by 3^16 and adding a quite large number d to it (I think it is 3^16-1=43,046,720).
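The k=16 shortcut described above can be sketched in a few lines of Python; its arbitrary-precision integers stand in for the multi-word arithmetic of the real application, and the names (c, d, jump) are just for illustration, not taken from the project code:

```python
# Sketch of the k-step shortcut for k = 16. For every 16-bit value b we
# precompute c[b] (number of odd values seen while applying f 16 times)
# and d[b] (the result of applying f 16 times to b). Then jumping 16
# "divisions by two" ahead is a single multiply-add.

K = 16

def f(n):
    """One step of the shortcut Collatz function: n/2 or (3n+1)/2."""
    return (3 * n + 1) // 2 if n & 1 else n // 2

c = [0] * (1 << K)
d = [0] * (1 << K)
for b in range(1 << K):
    n, odd = b, 0
    for _ in range(K):
        odd += n & 1
        n = f(n)
    c[b], d[b] = odd, n

POW3 = [3 ** i for i in range(K + 1)]

def jump(n):
    """Advance n by K shortcut steps at once: a*3^c[b] + d[b]."""
    a, b = n >> K, n & ((1 << K) - 1)
    return a * POW3[c[b]] + d[b]

# The largest table entries are exactly 3^16 and 3^16 - 1, as stated above.
assert max(POW3[c[b]] for b in range(1 << K)) == 3 ** 16   # 43,046,721
assert max(d) == 3 ** 16 - 1                               # 43,046,720
```

Applying f sixteen times in a row and calling jump once give the same result for any starting number, which is the whole point of the lookup table.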

Oops, I didn't refresh the page. Didn't see that you had found this already.
6) Message boards : Science : collatz parity help (Message 8726)
Posted 2754 days ago by Profile Gipsel
That packing of 3 values into 8 bytes only works up to k=17 and breaks down at k=18, doesn't it?

actually no, //{c,p,zz,dddd} - This is how my data is stored, d is 32 bits, c is 8 bits, p is 8 bits. p is 3 to the power of c, precalculated. I only access c to add to the step counter.

The problem I see is that if c is 19 for instance, 3^19 needs 31 bits to store, the maximum occurring offset (d) also needs 31 bits, and c itself 5 bits. If you don't partition your p=3^c in some strange way (is the zz used for that? But that will hurt performance later when you have to reconstruct it for the multiplication with a), I don't see how you fit those 67 bits into 64 bits.
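The bit counting above is easy to double-check with Python's arbitrary-precision integers (a quick sanity check; the variable names are just for illustration):

```python
# Bits needed for the k = 19 case: p = 3^19, the maximum offset
# d = 3^19 - 1, and the step count c itself (values 0..19).
k = 19
p_bits = (3 ** k).bit_length()        # 31 bits for 3^19 = 1,162,261,467
d_bits = (3 ** k - 1).bit_length()    # 31 bits for the maximum offset
c_bits = k.bit_length()               # 5 bits for c
assert p_bits + d_bits + c_bits == 67  # doesn't fit into 64 bits
```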

I need to actually make it so that c will subtract from the step count if necessary, due to the fact that applying the F function 5 times to, let's say, the number 1 would result in an overflow into the repeating sequence, when we count the real steps as ending at one.

For that I use my fourth entry ;)
7) Message boards : Science : collatz parity help (Message 8724)
Posted 2754 days ago by Profile Gipsel
I've tried k=20, but I notice a slowdown compared to k=16, as it doesn't fit into my CPU's cache (AMD Phenom); k=16 on my CPU has the fastest speeds. Keep in mind AMD CPUs aren't the same as Intel's. If the stock CPU apps are using k=20, you might consider testing AMD versions with k=16.
I've also implemented a method of storing all 3 data requirements in a single 8-byte location; as the first access loads at least 8 bytes into the cache, subsequent accesses bypass memory because it's already loaded. This resulted in a 5% speedup. I'd also take any suggestions you might have on optimizations.

Actually I've tested it on a Phenom 9750 and k=20 appeared to be the fastest. And I store it even 16 byte aligned (4x32bits).

But to tell the truth, the 16-byte alignment came from the GPU version, where it was just laziness on my side to pack 4 values into a single LUT entry. Using smaller ones may buy one or two percent performance, but at that time there were lower-hanging fruit for performance gains. And on CPUs a 16-byte alignment is probably still better than aligning on 12-byte borders ;)

That packing of 3 values into 8 bytes only works up to k=17 and breaks down at k=18, doesn't it?

It should be easy for you to shorten your numbers to 192 bits and run the project applications (the 64bit one is definitely the fastest) side by side with your version for performance comparisons. The checked number range can be seen in plain text in the WU files and also in the task details, so it should be easy to set up a test with the same checked numbers (and don't forget to compare the results ;).


Just found the benchmarks on the Phenom 9750 (running at 2.55 GHz) I did back then for some reduced size WUs:

 k   runtime [s]   size
20      61.97      16 MB
19      63.01       8 MB
18      64.21       4 MB
17      72.66       2 MB
16      68.20       1 MB
15      69.75     512 kB
14      74.97     256 kB
13      80.09     128 kB
12      88.25      64 kB
10     108.86      16 kB  (could use 2 byte per value)

So one indeed sees some effects from the cache when increasing the size of the lookup table, but the trend toward lower times still prevails.
All runs were done for 402653184 consecutive numbers starting with 2361183348087637911912. Total executed steps were 204,260,372,158, i.e. just short of 3.3 billion steps per second on that 2.55 GHz Phenom.

Running Collatz Conjecture (3x+1) application version 0.1 by Gipsel
Reading input file ... done.
Checking 402653184 numbers starting with 2361183348087637911912

CPU: AMD Phenom(tm) 9750 Quad-Core Processor (4 cores/threads) 2.5492 GHz (2ms)

Initializing lookup table (16384 kB) ... done
needed 1467 steps for 2361183348087997950857
204260372158 total executed steps for 402653184 numbers

WU completed.
CPU time: 61.9688 seconds, wall clock time: 61.91 seconds, CPU frequency: 2.54961 GHz
8) Message boards : Science : collatz parity help (Message 8716)
Posted 2755 days ago by Profile Gipsel
If anyone is interested, I have it up and running with k=16, which seems to fit nicely in my CPU's cache. It's processing at about ~1.3 billion steps per second on 320-bit starting integers, which can expand up to as big as memory allows (8 gig in my case);

Just for comparison, running the official project applications, a HD5870 does about 255 billion steps per second and a 2.4GHz Core2 about 2.37 billion steps per second (the 64bit version I think, but I may be wrong, it was benchmarked a long time ago). But the project currently uses only 192bit integers (and detects overflow if that isn't enough). So your implementation appears to be competitive.

By the way, you can go up to k=20 (for k=21 the values in the tables don't fit 32bit integers anymore).
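The k=20 limit follows directly from the sizes of the table entries; a one-line check (assuming, as above, that the largest stored values are 3^k and 3^k-1):

```python
# Why k = 20 is the limit for 32-bit table entries: the largest values
# kept are p = 3^k and the maximum offset d = 3^k - 1, which must fit
# into an unsigned 32-bit word.
assert 3 ** 20 < 2 ** 32    # 3,486,784,401 still fits
assert 3 ** 21 > 2 ** 32    # 10,460,353,203 does not
```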
9) Message boards : Science : collatz parity help (Message 8701)
Posted 2756 days ago by Profile Gipsel
It is also known that {4,2,1} is the only repeating cycle possible with fewer than 35400 terms.

I aim to test numbers that have more terms than this. I assume terms means iterations, or the number size.

It means the shortest possible cycle (besides the "normal" 4,2,1,4) consists of at least 35,400 steps if it exists.
10) Message boards : Science : collatz parity help (Message 8698)
Posted 2756 days ago by Profile Gipsel
I'm having a lot of trouble understanding the step ahead k steps using a parity sequence.

The "parity" section above gives a way to speed up simulation of the sequence. To jump ahead k steps on each iteration
(using the f function from that section), break up the current number into two parts, b (the k least significant bits, interpreted as an integer),
and a (the rest of the bits as an integer). The result of jumping ahead k steps can be found as:

f^k(a·2^k + b) = a·3^c[b] + d[b].
The c and d arrays are precalculated for all possible k-bit numbers b, where d[b] is the result of applying the f function k times to b,
and c[b] is the number of odd numbers encountered on the way. For example, if k=5, you can jump ahead 5 steps on each iteration by separating out the 5 least significant bits of a number and using:

c[0...31] = {0,3,2,2,2,2,2,4,1,4,1,3,2,2,3,4,1,2,3,3,1,1,3,3,2,3,2,4,3,3,4,5}
d[0...31] = {0,2,1,1,2,2,2,20,1,26,1,10,4,4,13,40,2,5,17,17,2,2,20,20,8,22,8,71,26,26,80,242}.
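These two tables can be reproduced with a short Python sketch, using the f function from the quoted section (n/2 for even n, (3n+1)/2 for odd n):

```python
# Rebuild the k = 5 tables quoted above: c[b] counts the odd numbers
# encountered while applying f five times to b, and d[b] = f^5(b).

def f(n):
    return (3 * n + 1) // 2 if n & 1 else n // 2

c, d = [], []
for b in range(32):
    n, odd = b, 0
    for _ in range(5):
        odd += n & 1
        n = f(n)
    c.append(odd)
    d.append(n)

assert c == [0,3,2,2,2,2,2,4,1,4,1,3,2,2,3,4,
             1,2,3,3,1,1,3,3,2,3,2,4,3,3,4,5]
assert d == [0,2,1,1,2,2,2,20,1,26,1,10,4,4,13,40,
             2,5,17,17,2,2,20,20,8,22,8,71,26,26,80,242]
```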

Using the above example and k = 5:
Take for example the number 25; in 5 steps I know my result should be 44 (76 38 19 29 44). I know that 3^c[b]+d[b] ends up as 49, being c=3 and d=22.
I've tried every way of applying the number 49 to come up with the result of 44, and I cannot find it (nor can I use it to come up with any number in the sequence).

Actually you don't calculate the number after k steps (it depends on how you count). In fact, you calculate the number after 5 divisions by two, so that would be f_{k+c[b]} in case you count every step (3x+1 and the following /2 as two separate ones), so in your example it would be the number after 8 steps in the notation of this project.

f_1(25) = 76
f_2(25) = 38 (would be f_1 if you only count divisions)
f_3(25) = 19 (f_2)
f_4(25) = 58
f_5(25) = 29 (f_3)
f_6(25) = 88
f_7(25) = 44 (f_4)
f_8(25) = 22 (would be f_5 if you only count divisions)

Now, remember that in your example a is zero, b=25, c[b] = 3, and d[b] = 22; the whole thing comes out as:

f_{5+c[b]}(25) = f_8(25) = a*3^c[b] + d[b] = 0*3^3 + 22 = 22

Problem solved.
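The worked example above can be checked mechanically, using the shortcut function f(n) = (3n+1)/2 for odd n and n/2 for even n (so each application of f is one or two steps in this project's notation):

```python
# Verify the worked example: 25 = 0*2^5 + 25, so a = 0 and b = 25.

def f(n):
    return (3 * n + 1) // 2 if n & 1 else n // 2

n = 25
for _ in range(5):   # five applications of f = five divisions by two
    n = f(n)

# 25 -> 38 -> 19 -> 29 -> 44 -> 22, i.e. f_8(25) = 22 in project notation
assert n == 22
# ... which matches a*3^c[b] + d[b] with a = 0, c[25] = 3, d[25] = 22:
assert n == 0 * 3 ** 3 + 22
```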
11) Message boards : Number crunching : Only one ATI card in boinc (Message 8091)
Posted 2795 days ago by Profile Gipsel
Or if the cards are from the same series simply activate Crossfire and one doesn't need a monitor at all.
12) Message boards : Number crunching : ati13ati is burning up a full cpu, way more than cuda (Message 7217)
Posted 2836 days ago by Profile Gipsel
Agreed, the CPU is not anywhere near 99%. However, I disagree with the reasoning behind substituting GPU time for CPU time. I do not see any problem with milkyway, and they return CPU time in the CPU field, which you can see in the following statistics.
Substituting GPU time for CPU time is misleading. I do not see that being done on your cuda tasks as shown here

The reasoning behind this decision was very clear: BOINC had severe problems with the DCF calculation when very low CPU times were reported. The result was that the client simply refused to fetch new work even when the system was idling.
I don't know if that is/was specific to ATI (handled slightly differently by the BOINC client) or if it also applies to nvidia, but even you admitted that it is/was a problem. And due to the method by which that factor is/was calculated, reporting a higher time lessens the severity of the problem (the DCF doesn't drop as fast or as low). You are free to disagree, but that doesn't change the valid reasons why it was introduced.

Later BOINC versions have changed a lot in that regard. I hope the DCF calculation in more recent versions isn't as borked as it was. That is the reason why I switched back to reporting real CPU time there (i.e. MW). But that was done just last week with MW 0.23. All prior versions also reported GPU time.

Btw., I have not written the CUDA app. While it uses an almost 1:1 adaptation of the CPU and/or ATI code I gave Slicker (with the CUDA-specific stuff done by Slicker), what an application reports is more or less at the discretion of the guy actually building the app (one has to change one or two lines in the BOINC API code), and that was not me.
13) Message boards : Number crunching : ati13ati is burning up a full cpu, way more than cuda (Message 7213)
Posted 2837 days ago by Profile Gipsel
However, what I meant to report was the difference between Collatz 2.02 and Collatz 2.09 (the CUDA -vs- the ATI). I am guessing that the ATI version has the CPU polling to see if the ATI is finished while the CUDA interrupts the CPU to tell it is done.

No, the ATI app doesn't use a full CPU and it doesn't busy-wait for the GPU (unless you tell it to do so by some command line parameters).
What you see in the task manager is real. I just had a look at your tasks and you are using roughly 2 seconds of CPU time per WU. The difference between the CUDA and the ATI app is really that the CUDA app reports the CPU time, but the ATI app reports the GPU time. So in fact it is roughly the GPU utilization you see there. Look at the task details (or the task manager) to see the real CPU load for the ATI app. It is very low with the standard options. Kashi is completely right that it is a cosmetic issue.
14) Message boards : Number crunching : What's your bet for GTX470/480 performance at Collatz? (Message 7162)
Posted 2839 days ago by Profile Gipsel
A GTX470 crunching Collatz needs roughly 10 minutes for a WU. A GTX480 would probably need around 8 minutes.
15) Message boards : Number crunching : nvidia cuda (Message 7107)
Posted 2841 days ago by Profile Gipsel
If a kernel runs too long, ATI thinks it is hung and aborts. That's why Gipsel calculates how long it will take to run and then splits each chunk into smaller chunks so as not to exceed the max time per kernel call.

Beginning with Vista, this driver reset after 2 seconds is actually a Windows feature ;). Under WinXP the ATI driver at least waits 10 seconds or so.

I'm also doing it to get a more responsive user interface. Any kernel call blocks other accesses to the GPU during its runtime (the same is true for nvidia's CUDA); not even the screen can be updated.
DirectX11 cards (HD5000 and GTX4xx) are supposed to do better (allowing several kernels to run concurrently, which I would assume means the GPU is not completely blocked by the start of a kernel), but the software support for this feature is still lacking. Maybe we need DX11.1 or an updated Windows graphics driver model or something like that.
16) Message boards : Number crunching : nvidia cuda (Message 7106)
Posted 2841 days ago by Profile Gipsel
Started with an ATI HD5770 and put in a HD4850 last week. I wonder what the difference in performance is, since one is single precision and the 4850 double precision, but its GPU is, in fact, less powerful.
Both are rated ~1000 GFLOPS each.

Another thing, numbers are getting bigger, seeing a rise in time.
predicted runtime per iteration is 60 ms (33.3333 ms are allowed), dividing each iteration in 2 parts
borders of the domains at 0 2048 4096

predicted runtime per iteration is 80 ms (33.3333 ms are allowed), dividing each iteration in 3 parts
borders of the domains at 0 1368 2736 4096 .

The latter is the HD4850 and obviously slower than the 5770.

The HD48x0 and HD57x0 GPUs have roughly the same theoretical peak performance, that is true. But the newer series has enhancements that make it possible to use more of this potential in some scenarios, i.e. to get the real performance closer to the theoretical one.

For Collatz the flops rating is somewhat irrelevant as Collatz does pure integer calculations. And the integer capabilities have improved between generations. Even when looking at cards with the same number of stream processors, the HD4000 series is faster than the HD2000/3000 series, and the HD5000 series still tops that.
HD4000 cards can do bit shifts (frequently used here) a whopping 5 times as fast (with the same number of SPs) as the HD2000/3000s. The HD5000 series introduced some new instructions again speeding up the bit shifts (one can now shift 64 bits at once in one instruction, not only 32) as well as enabling a more efficient handling of additions of large integers (especially the carry calculation).

I've done some profiling for the different GPUs. That means the application "knows" how fast a certain GPU is going to be and uses that to predict the runtimes. That prediction takes into account the GPU family, the number of SPs of the used GPU, the core clockspeed and also the size of the numbers to be checked (it estimates the needed steps for the WU from the size of the numbers within). That's why you see different predictions for HD4800 and HD5700 cards even if they have the same theoretical peak performance. The prediction should closely reflect the actual performance, otherwise it would be wrong ;)

A major improvement of the integer capabilities is expected for nvidia's soon-to-be-released GTX470/480 GPUs. Older nvidia GPUs suck at integer multiplications (also used frequently by Collatz); it should get a nice boost with a GTX4x0 GPU.
17) Message boards : Number crunching : nvidia cuda (Message 7079)
Posted 2843 days ago by Profile Gipsel
Yes, the new 400 series coming out on the 12th will be much faster. Whether it'll be faster than the current ATI, I don't know. I assume this project works mostly with Integer operations? If so, The GTX 480 performs at ~672GIOPS. I do not know the performance of the ATI cards to make a direct comparison, but maybe someone else does and can complete this post :)

See here:
It depends on the instruction. For adds/subs, shifts and logical instructions it is 1360 GIOps/s on a HD5870; 24bit integer multiplies are done with 1088 GIOps/s (it can actually do a multiply-add at the same rate, which would mean even 2176 GOperations/s) and 32bit integer multiplies with 272 GIOps/s. Furthermore, you reach the peak values only if the compiler finds enough parallel instructions (32bit multiplies don't block the execution of other instructions at the same time, for instance). To find the real-world throughput for a certain algorithm, the easiest thing is to look at the average filling of the five slots of each execution unit.

As we already had Collatz@home as an example: the average utilization in the innermost loop is about 58% for HD5000 cards. As 100% would equal 1360 GigaInstructions/s, one arrives at roughly 785 GInstructions/second. That is already above the theoretical throughput of a GTX480, even if the nvidia GPU reached 100% efficiency. I therefore doubt a bit that it will beat the crap out of a HD5870 there. But to tell the truth, the efficiency of the cache and memory system also affects the performance with Collatz. Nevertheless, I fully expect a 20% to 30% advantage of a HD5870 against a GTX480 there, which is actually quite an improvement compared to a GTX285 (which has less than 30% of the speed of a HD5870; I expect a GTX480 to roughly triple that).

By the way, Collatz doesn't use 24 bit instructions, only 32bit ones.
18) Message boards : Number crunching : GPU memory underclock? (Message 6967)
Posted 2849 days ago by Profile Gipsel
Can the GPU memory be underclocked without affecting the WU runtime like it can at MW?
(W7 64, Q6600 @3.1, HD4870 x2)

A HD4870 (or a HD3870) is probably a bit more tolerant to a lower memory clock than a HD4850 for instance, as it has 80% higher bandwidth than a HD4850 at default clock. But I've done no tests with different clocks.

Generally, Collatz needs quite some memory bandwidth (that's a difference to MW) and some cards may even be severely limited by it (a HD4650 with DDR2 memory comes to mind). As nvidia GPUs are quite a bit slower at the moment (the GTX470/480 will hopefully change that), they should be able to work quite well with downclocked memory.
19) Message boards : Number crunching : Future collatz applications? (Message 6880)
Posted 2852 days ago by Profile Gipsel
Would it be possible that, not in the distant future, collatz apps would be run on OpenCL compute units, to take advantage of all the power from the shaders?
I think it would be more efficient if the apps could be run, let's say using my own 4650 mobility card, on 8 OpenCL compute units (8 * 8 * 5 = 320 shaders) independently, in the same way that boinc has an option for how many cores to use on multiprocessor systems.

Don't worry, the GPU applications use all available shaders automatically. A single WU alone already uses all shaders.

It is currently not possible to "split" the GPU and let the compute units (an OpenCL compute unit is a SIMD on ATI GPUs or an SM on nvidia GPUs) work independently on different tasks. The hardware supports that starting with the HD5000 line or nvidia's GF100 (GTX480/470), but the APIs don't really expose it right now. Furthermore, this would increase the memory requirements (each compute unit would need the amount necessary for a WU). Generally, it only provides benefits for manageability and especially for tasks with low parallelism (i.e. tasks which can't use the full GPU). Collatz is an embarrassingly parallel problem; it is very easy to use all units for a single WU without compromising the scaling. That's the reason a GPU with twice the shader units (at the same clock and with enough bandwidth) is exactly twice as fast.
20) Message boards : Number crunching : Future collatz applications? (Message 6879)
Posted 2852 days ago by Profile Gipsel
To support what Slicker just wrote: OpenCL is a good idea when you are just thinking about starting with some GPGPU stuff, or if you want a somewhat easier port from CUDA to OpenCL (which is often easier than to CAL/Brook) to have *something* to run on ATI hardware at all. In its current state it is very likely it won't perform very well, let alone close to the optimum.

To sum it up, the current ATI application is extremely streamlined. I would say it is simply impossible to get a faster version of the current algorithm with OpenCL, even a year from now with matured tools and SDKs.



Copyright © 2018 Jon Sonntag; All rights reserved.