ATI vs CUDA relative to Quad Core
jjoshua2
Joined: 29 Aug 09
Posts: 4
Credit: 2,953
RAC: 0
Message 747 - Posted: 29 Aug 2009, 22:51:31 UTC

So is ATI or CUDA faster, and how much faster than a quad core?

SuperViruS
Joined: 28 Jul 09
Posts: 5
Credit: 86,322,091
RAC: 0
Message 751 - Posted: 30 Aug 2009, 8:22:13 UTC

Depends on the model of GPU and CPU.

For example, in CUDA my Asus GTX275 takes about 11 minutes elapsed time per WU, and the Q6600 about 150 minutes.



_hiVe*
Joined: 9 Aug 09
Posts: 106
Credit: 162,673,032
RAC: 0
Message 760 - Posted: 30 Aug 2009, 15:54:51 UTC

A Mobility HD4650 does 1 WU in about 30~35 mins.
Painfully slow indeed...
Seems like the CUDA app works faster.
Anyone with a GeForce 9600GT that I could compare to?

Bymark
Joined: 28 Jul 09
Posts: 78
Credit: 586,899,108
RAC: 1,121,162
Message 762 - Posted: 30 Aug 2009, 16:16:12 UTC - in response to Message 760.
Last modified: 30 Aug 2009, 16:20:04 UTC

Got one 9600GT ....

http://boinc.thesonntags.com/collatz/show_host_detail.php?hostid=1202
<core_client_version>6.4.7</core_client_version>
<![CDATA[
<stderr_txt>
Beginning processing...
Collatz CUDA v1.10 (GPU Optimized Application)
worker: trying boinc_get_init_data()...
Looking for checkpoint file...
No checkpoint file found. Starting at beginning.
Success in SetCUDABlockingSync for device 0
Generating result output.
2361184737438691207528
2361184737442986174824
2361184737439328684543
1467
2339134884609
Elapsed time: 2673.19 seconds
</stderr_txt>
]]>

_hiVe*
Joined: 9 Aug 09
Posts: 106
Credit: 162,673,032
RAC: 0
Message 765 - Posted: 30 Aug 2009, 17:59:20 UTC

Thanks a lot!
That would mean a GeForce 9600GT spits out 1 WU in approx. 44 mins?
I'm not quite familiar with the CUDA app, so I've got no idea whether it processes 2 concurrent WUs at a time or only 1.

Please clarify~
Thanks again...

Bymark
Joined: 28 Jul 09
Posts: 78
Credit: 586,899,108
RAC: 1,121,162
Message 766 - Posted: 30 Aug 2009, 18:14:23 UTC - in response to Message 765.
Last modified: 30 Aug 2009, 18:25:12 UTC

Only 1, but that's enough; that host is doing about 6,000 credits/day at Collatz.

This computer, which I'm writing from and using all the time, has an Asus 250 and averages 10k+ credits/day. Not bad. The 260 cards are getting almost the same credit. Don't ask why....

_hiVe*
Joined: 9 Aug 09
Posts: 106
Credit: 162,673,032
RAC: 0
Message 770 - Posted: 30 Aug 2009, 19:32:44 UTC

Are you saying that the 250 and the 260s are yielding about the same credit? :o Intriguing...
Anyway, could you please tell me the top/max RAC you've reached with the GeForce 9600GT?

I'd be able to extrapolate mine based on your info~
Thanks in advance~

Profile kevint
Joined: 18 Jun 09
Posts: 34
Credit: 246,186,607
RAC: 0
Message 776 - Posted: 30 Aug 2009, 22:58:31 UTC - in response to Message 747.

So is ATI or CUDA faster, and how much faster than a quad core?

That is kind of like asking, "What color is a car?"


Donnie
Joined: 14 Jul 09
Posts: 75
Credit: 140,070,034
RAC: 0
Message 777 - Posted: 30 Aug 2009, 23:47:55 UTC - in response to Message 776.
Last modified: 31 Aug 2009, 0:08:58 UTC

So is ATI or CUDA faster, and how much faster than a quad core?

That is kind of like asking, "What color is a car?"

So this is the color of my car:

I ran 2 GTX 280s on a single mobo; Doze XP SP3 32-bit. One card was 648 MHz and the other was 621 MHz (EVGA had to RMA both of them). Anyway, these cards would finish a WU in 810 to 1100 seconds elapsed time. Most would finish WUs in about 15 minutes.

2 ATI 4850 cards running at 625 MHz on the same system take about 1600 seconds, or roughly 25-26 minutes. I changed the cmdline to -k512 -n1 to run them at top speed. My other 32-bit system also has 2 4850 cards at the same clock speed and takes roughly the same amount of time. I suppose the difference is that the GTX 280 cards have a shader clock between 1350 and 1404 MHz and the ATIs only have shader clocks of 993 MHz.

The Doze 64-bit, Vista Ultimate machine has 2 4850 X2s in it with the same clock speeds of 625/993 MHz. These WUs take about 45 minutes elapsed time each, even though they have the same -k512 -n1 settings.

I couldn't tell you how long the CPU WUs take as I haven't run any. It's difficult for me to determine why the ATI cards take 2 to 3 times longer than the CUDA cards, especially with the 64-bit app (240 vs. 800 stream processors), unless it's the shader clock speed.

I also noticed on the 64-bit quad system that the CPU load on each core was 75% when running the ATI app (all 4 CPUs), while the CPU load on the dual-core 32-bit XP SP3 machines doesn't go above 40%. Go figure.

All WU comparisons are for WUs granted between 158 and 170 credits per WU. My computers aren't hidden, so they can be identified. Ignore the 3850, as it's the slowboat of the family, although not much slower than the twin 4850 X2s in the 64-bit machine.

I don't run any CPU projects on any of the 4 machines.

Profile kashi
Joined: 28 Jul 09
Posts: 164
Credit: 100,303,718
RAC: 0
Message 783 - Posted: 31 Aug 2009, 4:19:13 UTC

HD 4890 @ 950 MHz, about 9.5 minutes.

<avg_ncpus>1</avg_ncpus>, <cmdline>-k512 -n1</cmdline>, Windows 7 64-bit, Catalyst 9.4. W3520 @ 2.67 GHz, 1 core left free to support the GPU, WCG Flu on the other 7 cores. Collatz is showing a CPU load of 7%, total CPU usage 94%.

_hiVe*
Joined: 9 Aug 09
Posts: 106
Credit: 162,673,032
RAC: 0
Message 787 - Posted: 31 Aug 2009, 6:20:28 UTC - in response to Message 777.
Last modified: 31 Aug 2009, 6:23:33 UTC




2 ATI 4850 cards running at 625 MHz....the ATIs only have shader clocks of 993 MHz.

I suppose you meant to say "shader clock of 625 MHz"... 993 MHz is the frequency of the memory. The shader domain on ATI cards works at the same frequency as the core.

It is really interesting, though, why the CUDA cards are faster despite the clear superiority of the CAL cards.

Would any dev like to enlighten us?

Profile Paul D. Buck
Volunteer tester
Joined: 30 Aug 09
Posts: 412
Credit: 185,735,226
RAC: 0
Message 788 - Posted: 31 Aug 2009, 8:16:26 UTC - in response to Message 787.




2 ATI 4850 cards running at 625 MHz....the ATIs only have shader clocks of 993 MHz.

I suppose you meant to say "shader clock of 625 MHz"... 993 MHz is the frequency of the memory. The shader domain on ATI cards works at the same frequency as the core.

It is really interesting, though, why the CUDA cards are faster despite the clear superiority of the CAL cards.

Would any dev like to enlighten us?

If the code structures and data structures are optimized for CUDA and they did a minimalist port to CAL, then you can see this kind of effect, where the faster hardware produces slower results.

There are significant differences in the ways the cards are organized and the ways the processing elements are programmed, so to get the best results from both cards you almost have to write two completely different programs.

_hiVe*
Joined: 9 Aug 09
Posts: 106
Credit: 162,673,032
RAC: 0
Message 795 - Posted: 31 Aug 2009, 11:29:10 UTC - in response to Message 788.

Thanks for the info, Paul. If I remember correctly you're a dweller from GPUGRID; how are things going there? :)

OK, back to the issue...
I would concur with your logical explanation, but as far as I know the development of the Collatz code was heavily influenced by the MW code, which is CAL-optimized, and it's here that I'm getting stuck :|
Any thoughts?

Profile STE\/E
Joined: 12 Jul 09
Posts: 581
Credit: 761,710,729
RAC: 0
Message 796 - Posted: 31 Aug 2009, 11:38:34 UTC - in response to Message 787.




2 ATI 4850 cards running at 625 MHz....the ATIs only have shader clocks of 993 MHz.

I suppose you meant to say "shader clock of 625 MHz"... 993 MHz is the frequency of the memory. The shader domain on ATI cards works at the same frequency as the core.

It is really interesting, though, why the CUDA cards are faster despite the clear superiority of the CAL cards.

Would any dev like to enlighten us?


There is no "shader clock" on ATI GPUs; the shader units run 1:1 at the engine clock, more commonly known as the core clock, which is the 625 mentioned, and the memory is the 993 mentioned.

borandi
Joined: 12 Aug 09
Posts: 6
Credit: 23,162,912
RAC: 0
Message 799 - Posted: 31 Aug 2009, 12:32:59 UTC
Last modified: 31 Aug 2009, 12:33:36 UTC

GTX280 does ~14k creds/day
4850 does ~12k creds/day
4670 does ~5k creds/day
Q6600 does ~2k creds/day

From my experience.
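As a rough answer to the thread's original question, those credits/day figures can be turned into speedups over the Q6600. A trivial Python sketch using borandi's estimates above (credits/day is only a proxy for throughput, so treat the ratios as ballpark):

# Rough speedups over a Q6600 quad core, from the credits/day figures above.
creds_per_day = {"GTX 280": 14000, "HD 4850": 12000, "HD 4670": 5000, "Q6600": 2000}
baseline = creds_per_day["Q6600"]
for device, creds in creds_per_day.items():
    print(f"{device}: {creds / baseline:.1f}x a Q6600")
# GTX 280: 7.0x, HD 4850: 6.0x, HD 4670: 2.5x, Q6600: 1.0x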

Profile Gipsel
Volunteer moderator
Project developer
Project tester
Joined: 2 Jul 09
Posts: 279
Credit: 77,193,069
RAC: 77,543
Message 801 - Posted: 31 Aug 2009, 13:28:33 UTC - in response to Message 787.

It is really interesting, though, why the CUDA cards are faster despite the clear superiority of the CAL cards.

Would any dev like to enlighten us?

Actually, the current ATI version isn't slower than the CUDA one.

Just to cite some numbers from this thread:

HD 4890 @ 950 MHz, about 9.5 minutes

I ran 2 GTX 280s on a single mobo; Doze XP SP3 32-bit. One card was 648 MHz and the other was 621 MHz (EVGA had to RMA both of them). Anyway, these cards would finish a WU in 810 to 1100 seconds elapsed time. Most would finish WUs in about 15 minutes.

The times Donnie sees with his dual HD4850X2s (4 GPUs in a single system) show an issue with the multi-GPU support.

Generally, the Collatz code run on the GPU is not "influenced" by MW. That would be quite impossible, as MW does double precision floating point calculations and Collatz uses exclusively integer arithmetic, especially quite a lot of bitshifts. The two projects do something completely different. The only place where you can see such an influence is the handling of the GPU calls, the GPU detection and so on. But this doesn't apply to the functional part (where the execution time is spent), only to the framework of the app.
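For illustration, a minimal Python sketch of the shift-and-add integer iteration described above; this is the textbook Collatz step, not the project's optimized kernel:

# One Collatz step needs only shifts and adds, since 3n+1 == (n << 1) + n + 1.
def collatz_steps(n: int) -> int:
    """Count iterations of the 3n+1 map until n reaches 1."""
    steps = 0
    while n != 1:
        if n & 1:
            n = (n << 1) + n + 1   # odd: 3n+1 via a shift and two adds
        else:
            n >>= 1                # even: halve via a right shift
        steps += 1
    return steps

print(collatz_steps(27))  # 111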

But back to the performance and the ATI<->nvidia differences. The new version still in the works (the CPU and ATI versions are done; there are only some problems with the CUDA version left) will use quite a lot of multiplications instead of only bitshifts and additions. Furthermore, it will have less inherent instruction-level parallelism within each thread (it gets more difficult to fill the 5 slots of the ATI units). These changes will be beneficial to nvidia (multiplications use the SFU units, so far unused, in parallel to the normal ALUs) and detrimental to ATI performance (filling of the slots), respectively.

In the end, it could result (if the remaining problem is solved) in the CUDA version on a GTX280 being faster than the ATI app on an HD4870, which is not the case now. But both will literally destroy the current versions. Hell, the new CPU version on a reasonable processor is as fast as an HD4890 is now!

And none of the new versions use optimizations for the architecture of a specific GPU (that would be hard to do). Besides the actual calculation (which is a fixed algorithm executed the same way on all architectures(*)), the task for the GPUs (and CPUs too) is to fetch a lot of data from random memory locations. The needed bandwidth is quite close to the available bandwidth, especially on cards with a restricted memory interface (HD4770, HD4850 or integrated chipsets, for instance), so an efficient memory controller can be beneficial. But there is nothing one could do to "coalesce" these random accesses into contiguous ones. The same applies to the new CPU version.

Generally speaking, we traded the number of instructions needed to get the result against some precomputation and massive table lookups; the table sizes far exceed the caches. Actually we have driven it to the max: larger tables are not possible on 32-bit systems (because of the values in them, not because of the size). But it is the fastest solution according to my tests, even if the memory interface may start to limit the performance.

(*):
64-bit systems can process the data in chunks twice as large, so one needs only half the instructions to get the same work done as on a 32-bit CPU or a GPU (which are not 64-bit capable so far). This translates into about twice the performance for the 64-bit version compared to the same CPU in 32-bit mode. Because of this, the memory interface is also more important on 64-bit systems than on 32-bit systems (where it does not limit even an AthlonXP with SDR memory). I have not tested it, but I would expect a Core i7 in 64-bit mode to be without competition. Besides GPUs, of course ;)
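To make the "precomputation and massive table lookups" trade-off concrete, here is a hypothetical Python sketch of taking several Collatz steps at once via tables indexed by the low bits of n. The tiny table size and the names EXP3/RESID are illustrative assumptions, not the app's actual implementation:

# Take K combined Collatz steps (T(n) = n/2 if even, (3n+1)/2 if odd) at once.
# Writing n = (a << K) + b, K steps of T give 3**odd_count(b) * a + T^K(b),
# so two tables indexed by the low K bits replace K iterations.
K = 8  # the real apps use far larger tables; K = 8 keeps the demo tiny

def T(n):
    return (3 * n + 1) // 2 if n & 1 else n // 2

EXP3, RESID = [], []  # hypothetical table names
for b in range(1 << K):
    x, odd = b, 0
    for _ in range(K):
        odd += x & 1
        x = T(x)
    EXP3.append(3 ** odd)  # 3**(number of odd steps taken)
    RESID.append(x)        # T applied K times to b

def step_k(n):
    """Advance n by K Collatz steps with two table lookups."""
    a, b = n >> K, n & ((1 << K) - 1)
    return EXP3[b] * a + RESID[b]

# Sanity check against stepping one at a time.
n = m = 123456789
for _ in range(K):
    m = T(m)
assert step_k(n) == m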

_hiVe*
Joined: 9 Aug 09
Posts: 106
Credit: 162,673,032
RAC: 0
Message 802 - Posted: 31 Aug 2009, 13:50:02 UTC - in response to Message 801.

Nice, very exhaustive~
It was you I had in mind while composing my message. I think this clears up a lot of questions and a ton of confusion!
Thanks for the quick reply =)

Only one question remains: will the new apps be distributed within September?
(So I could estimate when my/our boxes will need attention again^^)

cheers

Profile Gipsel
Volunteer moderator
Project developer
Project tester
Joined: 2 Jul 09
Posts: 279
Credit: 77,193,069
RAC: 77,543
Message 805 - Posted: 31 Aug 2009, 15:11:57 UTC - in response to Message 802.

Only one question remains: will the new apps be distributed within September?
(So I could estimate when my/our boxes will need attention again^^)

I would think so. As I said, Slicker has already tested the new CPU and ATI versions. Only the CUDA version is not finished. But Slicker is working on it, and in principle it already runs.

From what Slicker told me, there is some weird problem when copying large lookup tables to the graphics card's memory. With smaller ones it works as it should. But using the exact same table size as the ATI and CPU versions would probably result in almost twice the performance compared to the table sizes the CUDA version currently runs with. I guess you understand why Slicker is trying to solve this problem first ;)

Donnie
Joined: 14 Jul 09
Posts: 75
Credit: 140,070,034
RAC: 0
Message 843 - Posted: 1 Sep 2009, 17:59:43 UTC - in response to Message 801.

It is really interesting, though, why the CUDA cards are faster despite the clear superiority of the CAL cards.

Would any dev like to enlighten us?

Actually, the current ATI version isn't slower than the CUDA one.

Just to cite some numbers from this thread:

HD 4890 @ 950 MHz, about 9.5 minutes

I ran 2 GTX 280s on a single mobo; Doze XP SP3 32-bit. One card was 648 MHz and the other was 621 MHz (EVGA had to RMA both of them). Anyway, these cards would finish a WU in 810 to 1100 seconds elapsed time. Most would finish WUs in about 15 minutes.

The times Donnie sees with his dual HD4850X2s (4 GPUs in a single system) show an issue with the multi-GPU support.


I see my original point was missed when I posted that my dual GTX 280 cards were finishing most WUs in roughly 15 minutes at core clock speeds between 621 and 648 MHz.

The dual ATI HD4850s at a 625 MHz core clock take roughly 23 minutes.

The clock speeds were quoted to compare apples to apples, not 621 MHz to 950 MHz.

Sorry for the confusion.

Profile Gipsel
Volunteer moderator
Project developer
Project tester
Joined: 2 Jul 09
Posts: 279
Credit: 77,193,069
RAC: 77,543
Message 845 - Posted: 1 Sep 2009, 21:55:52 UTC - in response to Message 843.

It is really interesting, though, why the CUDA cards are faster despite the clear superiority of the CAL cards.

Would any dev like to enlighten us?

Actually, the current ATI version isn't slower than the CUDA one.

Just to cite some numbers from this thread:

HD 4890 @ 950 MHz, about 9.5 minutes

I ran 2 GTX 280s on a single mobo; Doze XP SP3 32-bit. One card was 648 MHz and the other was 621 MHz (EVGA had to RMA both of them). Anyway, these cards would finish a WU in 810 to 1100 seconds elapsed time. Most would finish WUs in about 15 minutes.

The times Donnie sees with his dual HD4850X2s (4 GPUs in a single system) show an issue with the multi-GPU support.

I see my original point was missed when I posted that my dual GTX 280 cards were finishing most WUs in roughly 15 minutes at core clock speeds between 621 and 648 MHz.

The dual ATI HD4850s at a 625 MHz core clock take roughly 23 minutes.

The clock speeds were quoted to compare apples to apples, not 621 MHz to 950 MHz.

Sorry for the confusion.

I guess you didn't get my point. The problem with the current ATI version is that it doesn't correctly support multi-GPU setups such as your HD4850X2s. It is most likely running a lot slower on your system than it could.

If one adjusts the 9.5 minutes of the HD4890 at 950 MHz for clock speed (performance does indeed scale linearly with clock speed), you get 12 minutes for 750 MHz or 14.5 minutes for 625 MHz.
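That linear scaling is just elapsed time multiplied by the clock ratio; a quick Python check of the arithmetic, assuming perfectly linear scaling as stated above:

# Scale the HD4890's 9.5 min at 950 MHz down to lower core clocks.
base_minutes, base_mhz = 9.5, 950
for mhz in (750, 625):
    print(f"{mhz} MHz -> {base_minutes * base_mhz / mhz:.1f} min")
# 750 MHz -> 12.0 min
# 625 MHz -> 14.4 min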
