2x290-x (problems)

Profile valterc
Joined: 21 Sep 09
Posts: 39
Credit: 14,470,992,947
RAC: 15,827,806
Message 19501 - Posted: 19 May 2014, 14:55:59 UTC
Last modified: 19 May 2014, 14:56:14 UTC

Hi all,
I just got this new PC (http://boinc.thesonntags.com/collatz/show_host_detail.php?hostid=145423), installed Win7 and Catalyst 14.4, and started crunching Collatz workunits. I am getting a lot of computation errors (around 50%), so I started investigating the cause, including playing with the config file. What I noticed is that, regardless of the items_per_kernel value, the GPU usage (monitored with both Afterburner and GPU-Z) stays fixed at 100% for a while, then drops to another value (which depends on items_per_kernel) and then goes back to 100%. This is strange behavior: on any other PC the GPU usage, after proper configuration, stays almost fixed at around 97-98%. I don't know if this may be related to my error rate...

Any hints or suggestions?
Thanks a lot in advance.

Profile Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 19513 - Posted: 20 May 2014, 12:57:10 UTC - in response to Message 19501.

What settings are you using in the Collatz config?
What settings are you using in the app_config.xml?

You will need the latter, since the faster GPUs pretty much require multiple WUs running at once in order to keep busy. The errors show that the GPU is working but returning bad data. That's probably due to being clocked too high and/or running too hot. It could also be flaky RAM on the GPU, but more often than not it is heat or overclocking that causes it. That, or AMD has a buggy driver which is over-optimizing the GPU code and losing data that way.
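
For anyone who has not set up an app_config.xml before, here is a rough sketch of how one could be generated. It uses the standard BOINC elements (app_config, app, name, gpu_versions, gpu_usage, cpu_usage); the app name below is only a placeholder, since the real one has to be copied from your client_state.xml, and the cpu_usage value is just an assumed starting point.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, wus_per_gpu: int, cpu_usage: float = 1.0) -> str:
    """Return app_config.xml text that runs `wus_per_gpu` tasks on each GPU."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    # gpu_usage is the fraction of a GPU each task claims: 0.5 means two tasks per GPU.
    ET.SubElement(gpu, "gpu_usage").text = str(1.0 / wus_per_gpu)
    # cpu_usage is the fraction of a CPU core reserved per task (assumed value).
    ET.SubElement(gpu, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "your_collatz_app_name" is a hypothetical placeholder, not a real app name.
    print(build_app_config("your_collatz_app_name", wus_per_gpu=2))
```

The output would be saved as app_config.xml in the project's folder inside the BOINC data directory, and picked up after re-reading the config files or restarting the client.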

Profile valterc
Message 19529 - Posted: 22 May 2014, 11:00:45 UTC - in response to Message 19513.

After some testing I think I may have some kind of hardware problem. Everything seems fine if I crunch Collatz on the GPU only, and likewise if I crunch CPU workunits only or test the system with Prime95. I start to get Windows freezes if I use BOTH the CPU and the GPU (the last freeze somehow messed up my network switches...). I have to dig deeper... BTW, there are two Sapphire R9 290X Tri-X cards, the CPU is a 4930K and the PSU is a Corsair RM1000.

Profile Slicker
Message 19531 - Posted: 22 May 2014, 15:51:32 UTC - in response to Message 19501.

How many WUs are you running at a time on each GPU? What collatz config settings are you using?

Profile valterc
Message 19534 - Posted: 22 May 2014, 16:43:07 UTC - in response to Message 19531.

Just 1 WU on each GPU, with the following config:

verbose=1
items_per_kernel=20
kernels_per_reduction=9
threads=8
sleep=1

This is a snapshot of the gpu utilization:

Profile Slicker
Message 19544 - Posted: 24 May 2014, 15:18:07 UTC

Every 512 kernels (kernels_per_reduction=9, 2^9=512), each of which contains 1,048,576 numbers (items_per_kernel=20, 2^20=1,048,576), the CPU summarizes the results and verifies that the GPU is returning valid data. On most systems the GPU load drops ever so slightly during that period, but it should occur at regular intervals so long as you are running only one WU per GPU. The only other time I would expect the GPU load to drop is when a few of the 1,048,576 numbers have a higher than average number of steps. If all the other numbers in that kernel have only an average number of steps, their stream processors will sit idle until the long-running ones have completed. GPUs do everything in parallel, so as soon as branching occurs, the processors have to wait until the others are finished so they can all proceed to the next instruction. I wonder if that is happening here.
Have you tried running two WUs at a time via an app_config file?
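
To put numbers on the above, here is a small CPU-side sketch. It is not the project's GPU code, only a simplified model that assumes every lane in a batch waits for the slowest one; the function names are made up for illustration.

```python
def reduction_size(items_per_kernel: int, kernels_per_reduction: int) -> int:
    """Numbers checked between two CPU reductions: 2^kernels * 2^items."""
    return (2 ** kernels_per_reduction) * (2 ** items_per_kernel)


def collatz_steps(n: int) -> int:
    """Count the Collatz steps needed for n to reach 1."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps


def idle_fraction(step_counts: list[int]) -> float:
    """Share of lane-time wasted if all lanes wait for the slowest one."""
    longest = max(step_counts)
    if longest == 0:
        return 0.0
    return 1 - sum(step_counts) / (len(step_counts) * longest)


if __name__ == "__main__":
    # With items_per_kernel=20 and kernels_per_reduction=9 from this thread:
    print(reduction_size(20, 9))  # 536870912, i.e. 2^29
    # One long-running number (27) among short ones leaves most lanes waiting.
    batch = [collatz_steps(n) for n in (24, 25, 26, 27)]
    print(batch, round(idle_fraction(batch), 2))
```

In this toy model a single outlier in a batch is enough to leave the other lanes idle for most of the batch time, which is the kind of dip in GPU load described above.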

Profile valterc
Message 19551 - Posted: 26 May 2014, 11:34:56 UTC - in response to Message 19544.

I have not tried running two WUs at a time, nor have I raised items_per_kernel above 21. That's because:
1- I have to fight with temperatures; the top card is running a little bit hot... as expected...
2- I may also have some hardware problems... memtest86 gave me errors with the higher (1866 MHz) XMP profile; the other profile (1600 MHz) seems okay, but I have to do some further testing...

Profile Slicker
Message 19558 - Posted: 27 May 2014, 2:30:16 UTC - in response to Message 19551.

It could be that the temps are causing the card to slow down. I know that happens with my GTX 770M, which only runs for a few seconds at 100% before getting slowed down because, well..., because laptop cooling fans are total garbage. If I run the CPU, Intel GPU, and NVIDIA GPU at the same time, it literally shuts the laptop down. Running only the GPU, it slows down to 60-70% because of running too hot.



Copyright © 2018 Jon Sonntag; All rights reserved.