Posts by nedmanjo

1) Message boards : Number crunching : Computation Errors (Message 1744)
Posted 8 May 2019 by nedmanjo
Post:
That's a possibility. I'll give it a try. Does anyone know what Exit status -102 (0xFFFFFF9A) ERR_READ actually means?
2) Message boards : Number crunching : Computation Errors (Message 1739)
Posted 7 May 2019 by nedmanjo
Post:
Random error while computing, usually 5-6 seconds into processing. The error repeats every several hours.

- Outcome Computation error
- Client state Compute error
- Exit status -102 (0xFFFFFF9A) ERR_READ

System Config: Supermicro SYS-7047R-TRF 4U Server, X9DA7 MB, two Xeon E5-2697 V2, two Nvidia GTX 1080 TI

I used DDU to clear the drivers and reinstalled an older driver, v388.13. No difference from the newer driver.
GPU is running at stock factory settings.

Configuration:

verbose=1
kernels_per_reduction=48
threads=9
lut_size=18
reduce_CPU=0
sieve_size=30
cache_sieve=1
sleep=0

Have tried this:

verbose=1
kernels_per_reduction=48
threads=8
lut_size=17
reduce_CPU=0
sieve_size=30
cache_sieve=1
sleep=0

These are two cards, not new; I'm currently running one card at a time due to thermals (summer weather). Both cards behave similarly, so I'm pretty sure it's not the cards.

Any ideas?
3) Message boards : Number crunching : Optimizing the apps (Message 1321)
Posted 15 Dec 2018 by nedmanjo
Post:
~ 2 minutes per WU. Nice!
4) Message boards : Number crunching : Optimizing the apps (Message 1286)
Posted 29 Nov 2018 by nedmanjo
Post:
Just the EVGA GTX 1070 SC. Ti's should perform much better.
5) Message boards : Number crunching : Optimizing the apps (Message 1280)
Posted 28 Nov 2018 by nedmanjo
Post:
Interesting update. It's winter here, so I have some free cooling to employ, and I noticed a nice reduction in thermals as well as run times. Then I got the bright idea to pull 4 empty drive bay cassettes from my Supermicro and open the front door, and what a difference that made. The thermals dropped significantly. Now a single WU per GTX 1080 FE runs in the 5 min 30 second range, and speeds are only negligibly faster when running two WU's at a time. The extra cooling helped a lot, but getting more airflow through the chassis made the biggest difference. So perhaps the real difference is the chassis airflow design. I am surprised, given the Supermicro's push-pull design: 3x middle 8cm (5000 rpm) PWM fans & 2x exhaust 8cm (5000 rpm) PWM fans.
6) Message boards : Number crunching : Error while downloading (Message 1279)
Posted 27 Nov 2018 by nedmanjo
Post:
Thanks Mikey! Appreciate it. I'd been boosting my OC a bit and am glad to know the errors and the OC aren't connected.
7) Message boards : Number crunching : Error while downloading (Message 1273)
Posted 26 Nov 2018 by nedmanjo
Post:
I've generated 10 of these over the past few days and am not sure what to make of it. Can anyone tell me specifically what causes this error?

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>collatz_sieve_5c28b912-cdcf-40fd-9917-8ae84ff3f77a</file_name>
<error_code>-200 (wrong size)</error_code>
</file_xfer_error>
</message>
]]>
8) Message boards : Number crunching : Optimizing the apps (Message 1200)
Posted 25 Oct 2018 by nedmanjo
Post:
Hi Corsair, happy to help. When running two tasks simultaneously, be sure to divide your results by two when comparing. Some folks produce measurably better results running one task per GPU, others running two tasks per GPU.
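
For example (illustrative numbers, not a measurement): if two concurrent tasks each show ~14 minutes of elapsed time, that works out to ~7 minutes per task of effective throughput, and that per-task figure is what you compare against a one-task-per-GPU run.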
9) Message boards : Number crunching : Optimizing the apps (Message 1197)
Posted 24 Oct 2018 by nedmanjo
Post:
Hey Corsair, both my 1070's and now my 1080's run two (2) tasks per GPU using this config. Try it with the gpu_versions statement.

<app_config>
<app>
<name>collatz_sieve</name>
<max_concurrent>2</max_concurrent>
<gpu_versions>
<gpu_usage>0.50</gpu_usage>
<cpu_usage>0.25</cpu_usage>
</gpu_versions>
</app>
</app_config>
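
(For what it's worth, as I understand it: the gpu_usage of 0.50 is what lets two tasks share each card, and max_concurrent caps the total number of tasks for the app. The app_config.xml goes in the Collatz project folder, and BOINC picks it up after Options > Read config files or a client restart, if I recall correctly.)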
10) Message boards : Number crunching : Optimizing the apps (Message 1152)
Posted 13 Oct 2018 by nedmanjo
Post:
Good question. I ran one (1) WU per GPU at stock and OC, and then two (2) WU's per GPU at stock and OC. My results favor two (2) WU's per GPU.

With stock settings, just one (1) WU per GPU, I posted 437.39 seconds (7.3 min.) and 444.68 seconds (7.4 min.) per task.
Reverting to two (2) WU's per GPU with the same settings, I posted ~814.8 seconds (13.6 min.), or 6.8 min. per task, and 821.9 seconds (13.7 min.), or 6.85 min. per task.

With a slight OC, just one (1) WU per GPU, I posted 357.43 seconds (6 min.) and 404.16 seconds (6.7 min.) per task.
Reverting to two (2) WU's per GPU with the same settings, I posted ~702.6 seconds (11.7 min.), or 5.85 min. per task, and 752.9 seconds (12.5 min.), or 6.3 min. per task.

verbose=1
kernels_per_reduction=48
threads=8
lut_size=17
sieve_size=30
cache_sieve=1
sleep=1
11) Message boards : Number crunching : Optimizing the apps (Message 1151)
Posted 13 Oct 2018 by nedmanjo
Post:
I'm running some single GPU WU's now. Initial WU's are at ~7.5 minutes. If I set threads=9 I produce "Error while computing" WU's.

verbose=1
kernels_per_reduction=48
threads=8
lut_size=17
sieve_size=30
cache_sieve=1
sleep=1
12) Message boards : Number crunching : Optimizing the apps (Message 1149)
Posted 13 Oct 2018 by nedmanjo
Post:
Swapped out my two GPU's again. Sold the two EVGA GTX 1070 SC ACX's for two Nvidia GTX 1080 Founders Editions. Using the recommended settings I was getting "Error while computing", but after dropping threads and sieve_size by a point each I'm getting stable operation and valid results.

verbose=1
kernels_per_reduction=48
threads=8
lut_size=17
sieve_size=29
cache_sieve=1
sleep=1

Still running two WU's per GPU, which measures faster than 1 WU per GPU. Utilization is at a steady 100% as currently configured.

<app_config>
<app>
<name>period_search</name>
<max_concurrent>4</max_concurrent>
<gpu_versions>
<gpu_usage>0.50</gpu_usage>
<cpu_usage>0.25</cpu_usage>
</gpu_versions>
</app>
<app>
<name>collatz_sieve</name>
<max_concurrent>4</max_concurrent>
<gpu_versions>
<gpu_usage>0.50</gpu_usage>
<cpu_usage>0.25</cpu_usage>
</gpu_versions>
</app>
</app_config>
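
(As I read it, that's two GPUs x two tasks each, which is why max_concurrent is 4 here.)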

If anyone has suggestions for further improvement please do tell.
13) Message boards : Number crunching : Optimizing the apps (Message 1047)
Posted 18 Sep 2018 by nedmanjo
Post:
Woot! Hit > 900M Collatz credits today!!! And hit > 1,000M total credits in BOINC!!! Picked up a couple of used EVGA 1070 SC ACX's; yup, sold the Vega. These 1070's are humming along at a combined > 10M credits per day! Shooting for > 1,000M on Collatz! Power consumption is no more than the Vega's, for 4M more credits per day. The posted configuration data is priceless. Thanks to all.

verbose=1
kernels_per_reduction=48
threads=9
lut_size=17
sieve_size=30
cache_sieve=1
sleep=1

<app_config>
<app>
<name>period_search</name>
<max_concurrent>4</max_concurrent>
<gpu_versions>
<gpu_usage>0.50</gpu_usage>
<cpu_usage>0.25</cpu_usage>
</gpu_versions>
</app>
<app>
<name>collatz_sieve</name>
<max_concurrent>4</max_concurrent>
<gpu_versions>
<gpu_usage>0.50</gpu_usage>
<cpu_usage>0.25</cpu_usage>
</gpu_versions>
</app>
</app_config>
14) Message boards : Number crunching : Optimizing the apps (Message 1032)
Posted 14 Sep 2018 by nedmanjo
Post:
Configuration file location (shown for Windows; adjust the leading path components accordingly on other OSs):

C:\ProgramData\BOINC\projects\boinc.thesonntags.com_collatz\<app_name>.config

During project initialization on your client, empty <app_name>.config files will be created for each of the application versions that match your GPUs. You can enter parameters into these files in order to deviate from default values, and they will be picked up as soon as a Collatz GPU task starts.
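
If I have the naming right, the file for the NVIDIA OpenCL application would therefore be something like C:\ProgramData\BOINC\projects\boinc.thesonntags.com_collatz\collatz_sieve_1.30_windows_x86_64__opencl_nvidia_gpu.config (the version number depending on the currently deployed app).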

Configuration file format

Plain text file with one "parameter=value" pair per line. Unrecognized parameter names are simply ignored (you can use this to comment out parameters during testing), and missing parameters fall back to their default values.

Example (suitable for a GTX 1080):
kernels_per_reduction=48
threads=9
lut_size=17
sieve_size=30
cache_sieve=1
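
And since unrecognized parameter names are simply ignored (see the format note above), you can keep an alternative value parked in the same file while testing. A sketch, where the "off_" prefix is just an arbitrary unrecognized name:
off_lut_size=18
lut_size=17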

Parameters

cache_sieve
default: 1 (?)
range: 0 or 1 (?)
definition: "any setting other than 1 will add several seconds to the run time as it will re-create the sieve for each WU run rather than re-using it"

kernels_per_reduction
default: 32
range: 1...64
definition: "the number of kernels that will be run before doing a reduction. Too high a number may cause a video driver crash or poor video response. Too low a number will slow down processing. Suggested values are between 8 and 48 depending upon the speed of the GPU."
comment: "affects GPU usage and video lag the most from what I [sosiris] tested."

lut_size
default: 10
range: 2...31
definition: "the size (in power of 2) of the lookup table. Chances are that any value over 20 will cause the GPU driver to crash and processing to hang. The default results in 2^10 or 1024 items. Each item uses 8 bytes. So 10 would result in 2^10 * 8 bytes or 8192 bytes. Larger is better so long as it will fit in the GPUs L1/L2 cache. Once it exceeds the cache size, it will actually take longer to complete a WU since it has to read from slower global memory rather than high speed cached memory."
comment: "I [sosiris] choose 16, 65536 items for the look up table because it would fit into the L2$ (512KB) in GCN devices. IMHO it could be 20 for NV GPUs, just like previous apps, because NV GPUs have better caching."

reduce_cpu
default: 0
range: 0 or 1
definition: "The default is 0 which will do the total steps summation and high steps comparison on the GPU. Setting to 1 will result in more CPU utilization but may make the video more responsive. I have yet to find a reason to do the reduction on the CPU other than for testing the output of new versions."
comment: "I [sosiris] choose to do the reduction on the CPU because AMD OpenCL apps will take up a CPU core no matter what you do (aka 'busy waiting') and because I want better video response."

sieve_size
default: ?
range: 15...32
definition: "controls both the size of the sieve used 2^15 thru 2^32 as well as the items per kernel are they are directly associated with the sieve size. A sieve size of 26 uses approx 1 million items per kernel. Each value higher roughly doubles the amount. Each value lower decreases the amount by about half. Too high a value will crash the video driver."

sleep
default: 1
range: ?
definition: "the number of milliseconds to sleep while waiting for a kernel to complete. A higher value may result in less CPU utilization and improve video response, but it also may lengthen the processing time."

threads
default: 6
range: 6...11
definition: "the 2^N size of the local size (a.k.a. work group size or threads). Too high a value results in more threads but that means more registers being used. If too many registers are used, it will use slower non-register memory. The goal is to use as many as possible, but not so many that processing slows down. AMD GPUs tend to work best with a value of 6 or 7 even though they can support values of up to 10 or 11. nVidia GPUs seem to work as well with higher values as lower values."
comment: "I [sosiris] didn't see lots of difference once items per work-group is more than wavefront size (64) of my HD7850 in the profiler."

verbose
default: 0
range: 0 or 1
definition: "1 will result in more detail in the output."

Definitions are taken from Slicker's post from June 2015, last modified in September 2015.
Comments are taken from sosiris' post from June 2015.
Edit, April 28 2018: added the definition of cache_sieve from a post by Slicker from April 2018.
15) Message boards : Number crunching : Optimizing the apps (Message 940)
Posted 25 Aug 2018 by nedmanjo
Post:
Smokin' fast! Nice!
16) Message boards : Number crunching : Optimizing the apps (Message 935)
Posted 22 Aug 2018 by nedmanjo
Post:
Back up and running. The Vega Frontier failed and took my MB out with it. The good: an RMA replacement GPU, thank you AMD. The bad: a new MB. So far, so good. No thermal issues or other issues.
17) Message boards : Number crunching : Optimizing the apps (Message 806)
Posted 19 Jul 2018 by nedmanjo
Post:
I take it this is the CPU config file: "collatz_sieve_1.40_windows_x86_64"? Are there any optimizations for it like those for the "collatz_sieve_1.30_windows_x86_64__opencl_nvidia_gpu" config file?
18) Message boards : Number crunching : Optimizing the apps (Message 642)
Posted 6 Jul 2018 by nedmanjo
Post:
Hi Martin, I appreciate your advice. I may get to that in time. Prior to flashing the BIOS I found an error: SMBIOS 0x01 P1-DIMMB1 Single Bit ECC Memory Error. It could be that the error originated from the multiple system crashes, or it could be that I have a bad memory module. Happens. I'm creating a Memtest86 bootable USB drive to check it out.

Regarding the GPU: it simply won't run any GPU tasks (Collatz, GPUGrid, Amicable...) with any of the three AMD drivers written specifically for this card, not at default settings, not tweaked. It runs for just a few minutes, then the screen corrupts, goes black, and there's no getting it back. I had no such issues for the first week. It ran stable 24/7 at default settings but throttled constantly. Downclocking eliminated the throttling, and it was stable at ~1400 MHz, 20 degrees below its thermal ceiling. It ran great for several days. I had my best day ever at 6,610,612 credits!

Something changed... bad hardware, software, settings... don't know.

So, reseated the hardware, no issues found. Flashed the BIOS so it's factory configured with minor but necessary adjustments. I'll run a full test of the memory modules and we'll see where we go from there.

Thanks, I appreciate you reaching out.
19) Message boards : Number crunching : Optimizing the apps (Message 635)
Posted 5 Jul 2018 by nedmanjo
Post:
Update... I have no clue. No stability with the GPU under load, at stock settings or otherwise. So, start from scratch:

Flash BIOS
Restore Defaults
Reseat hardware
Reinstall OS & Drivers
Test
Upon Fail... stick cattle prod in box, then activate.

Many apologies for all the lost WU's.

:(
20) Message boards : Number crunching : Optimizing the apps (Message 599)
Posted 4 Jul 2018 by nedmanjo
Post:
Problem sorted but not fully understood. It was tied to running LHC, but not at 75% as I assumed; I was running 23 WU's at 100%. So much for going from memory. MB thermals were fine, but I guess this PC couldn't take it. I reduced the CPU use % to 75% and I'm stable again. It's not the first time I've run CPU projects at 100%; some projects don't drive high thermals, and when they do I reduce the % use. Lesson learned.

