Intel Phi and Collatz programming

Message boards : Cafe : Intel Phi and Collatz programming

Joined: 25 Dec 09
Posts: 2
Credit: 322,399,576
RAC: 914,916
Message 20392 - Posted: 19 Apr 2015, 23:44:04 UTC


The short version of a long story: I sysadmin a couple of large Linux clusters, and the company just bought a *bunch* of Intel gear (software/servers/Xeons/Phis/etc.). I ended up with a personal Intel Xeon Phi 31S1P + Pro software pack whose sole purpose in life is for me to tinker with. This system is *not* a rock star by any means, with only a single-processor 6-core 1.6 GHz Xeon + 8 GB memory, /but/ it does rock a 31S1P Phi...

Now to get more to the point, I am not really a programmer. I am around code all day and I help debug users' code all the time, but I don't write a lot of code myself. Anyway, since I have a personal Phi AND everything I hear about the future Knights Landing is pretty awesome, I figured I should at least learn *something* about programming this beast. It would be a shame to let a TeraFLOP just sit idle at my feet... I started by hacking some OpenMP C code together on some basic math problems, and I decided to kick it up a notch to see if I couldn't code up a bigger example. So I started looking at BOINC for inspiration.

I am really surprised that there isn't more being done with the Phi. At a $150 price point, with companies like Advanced Clustering and Colfax International /giving/ away 31S1Ps and 5100s with the purchase of a ~$1800-$2000 system, I really thought people would have jumped on this.

Well then I saw this:

Apparently Slicker is working on it??
I know in his profile he says he likes the Collatz problem because it parallelizes well. With 57-61 cores sporting 4 threads each, the Phi certainly does parallel!

Any chance anyone knows if there has been any progress on the Phi?

With regard to the Collatz math, it doesn't look too difficult. The challenge, I think, would be optimizing it for a device. Utilizing my vast 5 minutes of Wikipedia reading ;-D it looks like the math could be done in straight binary; coupled with vectors, it could potentially scream on the Phi...
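For reference, the math in question can be sketched in a few lines of C++. This is only a minimal serial version, not the project's code: the function name `collatz_steps` is made up for illustration, and it assumes every intermediate value fits in 64 bits, which does not hold for numbers of real work-unit size.

```cpp
#include <cassert>
#include <cstdint>

// Count the steps needed for n to reach 1 under the Collatz rule:
// even n -> n / 2, odd n -> 3n + 1.
// Assumption: every intermediate value fits in 64 bits.
std::uint64_t collatz_steps(std::uint64_t n)
{
    assert(n > 0);
    std::uint64_t steps = 0;
    while (n != 1) {
        if (n & 1)          // low bit set: n is odd
            n = 3 * n + 1;
        else
            n >>= 1;        // divide by 2 with a right shift
        ++steps;
    }
    return steps;
}
```

Both the parity test and the halving are plain bit operations, which is what makes the "straight binary" idea attractive.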

So just to give me a challenge to think about I am going to see if I can't get a quick sloppy version running and then start tweaking it. Any advice on tackling large numbers of the current WU size would be much appreciated.


Send message
Joined: 25 Dec 09
Posts: 2
Credit: 322,399,576
RAC: 914,916
Message 20393 - Posted: 20 Apr 2015, 0:43:30 UTC - in response to Message 20392.

So first simple serial program was a snap as expected. :-)

Now I have two challenges (I think):

1)

#include <iostream>
using namespace std;

int main()
{
    // Casting -1 to an unsigned type yields that type's maximum value.
    unsigned long long a = (unsigned long long) -1;
    cout << "long long: " << sizeof(long long) << "\t" << a << endl;
    return 0;
}

$ icpc size.cpp; ./a.out
long long: 8 18446744073709551615

Crap...I am so going to hit a limit real quick...Looks like I am going to have to go learn how to deal with really big numbers...
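One common way past the 64-bit limit is to build a wider integer out of several machine words and carry between them by hand. The sketch below is only an illustration of that idea under assumed names (`U128`, `add`, and `three_n_plus_one` are hypothetical, not from any library): it stores a 128-bit value as two 64-bit halves and reports when 3n + 1 no longer fits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 128-bit unsigned value built from two 64-bit halves.
struct U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Add two 128-bit values; overflow is set if the sum does not fit.
U128 add(U128 a, U128 b, bool &overflow)
{
    U128 r;
    r.lo = a.lo + b.lo;
    std::uint64_t carry = (r.lo < a.lo) ? 1 : 0;  // unsigned wrap means carry
    r.hi = a.hi + b.hi;
    bool wrapped_hi = r.hi < a.hi;
    r.hi += carry;
    bool wrapped_carry = r.hi < carry;
    overflow = wrapped_hi || wrapped_carry;
    return r;
}

// Compute 3n + 1 as (n << 1) + n + 1, tracking overflow at each stage.
U128 three_n_plus_one(U128 n, bool &overflow)
{
    bool shift_lost_bit = (n.hi >> 63) != 0;
    U128 twice = { n.lo << 1, (n.hi << 1) | (n.lo >> 63) };
    bool o1 = false, o2 = false;
    U128 r = add(twice, n, o1);
    r = add(r, U128{1, 0}, o2);
    overflow = shift_lost_bit || o1 || o2;
    return r;
}
```

The same pattern extends to more words; the price is an extra carry check per word on every operation.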

2) On one hand, there is a multiply-add, which means the threads can do that calculation in a single clock cycle, which is good. On the other hand, there is a basic multiplication (invert the division, since division can be 6 times slower than multiplication), which is also good. The problem is that between the two there is an if statement. That if statement is going to kill any chance at vectorization...
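One possible way around the if statement is to compute both outcomes and pick one with a bit mask, so every lane executes the same instructions. The following is a sketch under that assumption, with made-up function names; whether a given compiler actually vectorizes the loop depends on the compiler and its flags, and lanes that have already reached 1 will simply keep cycling through 4, 2, 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One Collatz step without a branch.
// mask is all ones when n is odd and all zeros when n is even.
std::uint64_t collatz_step_branchless(std::uint64_t n)
{
    std::uint64_t mask = std::uint64_t(0) - (n & 1);
    std::uint64_t odd_result  = 3 * n + 1;   // used when n is odd
    std::uint64_t even_result = n >> 1;      // used when n is even
    return (odd_result & mask) | (even_result & ~mask);
}

// Advance every element by one step; the loop body has no branch,
// so each iteration does identical work.
void collatz_step_lanes(std::uint64_t *values, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        values[i] = collatz_step_branchless(values[i]);
}
```

The trade-off is that both results are computed every time and one is thrown away.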


sosiris
Joined: 11 Dec 13
Posts: 123
Credit: 55,800,869
RAC: 0
Message 20448 - Posted: 12 May 2015, 6:22:43 UTC - in response to Message 20393.

Hello, Stack.

The OpenCL kernel code of this project is something like this:

It uses a couple of 32-bit ints to simulate 192-bit integer arithmetic and a look-up table to 'jump' 20 steps in one iteration. Hope it helps.
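The look-up table idea can be illustrated on a small scale. The sketch below is not the project's kernel: it works on plain 64-bit values, uses the "shortcut" map (odd n goes to (3n + 1) / 2), and all names are hypothetical. It relies on the fact that the next k shortcut steps depend only on the low k bits of n, so writing n = a * 2^k + b gives the result a * 3^c + d, where c is the number of odd steps starting from b and d is where b ends up after k steps.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Shortcut Collatz map: the halving after 3n + 1 is folded in.
std::uint64_t shortcut_step(std::uint64_t n)
{
    return (n & 1) ? (3 * n + 1) / 2 : n / 2;
}

// Hypothetical table for jumping k shortcut steps at once.
struct JumpTable {
    unsigned k;
    std::vector<std::uint64_t> pow3;    // pow3[b] = 3^(odd steps taken from b)
    std::vector<std::uint64_t> offset;  // offset[b] = b after k shortcut steps
};

JumpTable build_jump_table(unsigned k)
{
    assert(k > 0 && k <= 20);
    JumpTable t;
    t.k = k;
    std::uint64_t size = std::uint64_t(1) << k;
    t.pow3.resize(size);
    t.offset.resize(size);
    for (std::uint64_t b = 0; b < size; ++b) {
        std::uint64_t v = b, p = 1;
        for (unsigned i = 0; i < k; ++i) {
            if (v & 1) p *= 3;          // one more odd step
            v = shortcut_step(v);
        }
        t.pow3[b] = p;
        t.offset[b] = v;
    }
    return t;
}

// Apply k shortcut steps to n with one multiply and one add.
// Assumption: the result fits in 64 bits.
std::uint64_t jump(const JumpTable &t, std::uint64_t n)
{
    std::uint64_t a = n >> t.k;
    std::uint64_t b = n & ((std::uint64_t(1) << t.k) - 1);
    return a * t.pow3[b] + t.offset[b];
}
```

A larger k means fewer iterations per number but a table with 2^k entries, so the right size depends on how much fast memory the device has.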

I also tried to speed things up by proposing a rather radical approach:
That's 70 times faster; impressive, isn't it?

As to your questions, first of all, division is done by bit shifting because the number is always divided by 2^N. Secondly, the conditional statement is a necessary evil, or the program will not know when to stop. However, the compiler will try its best to vectorize the code, I think.
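As a rough illustration of the bit-shifting point, all factors of two can be removed with a single shift once the number of trailing zero bits is known. This is a sketch with hypothetical function names; it uses the `__builtin_ctzll` builtin when compiled with GCC or Clang and falls back to a plain loop elsewhere.

```cpp
#include <cassert>
#include <cstdint>

// Count the trailing zero bits of a nonzero value.
unsigned trailing_zeros(std::uint64_t n)
{
    assert(n != 0);
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_ctzll(n));  // compiler builtin
#else
    unsigned count = 0;
    while ((n & 1) == 0) {   // portable fallback: shift until odd
        n >>= 1;
        ++count;
    }
    return count;
#endif
}

// Divide n by the largest power of two that divides it, using one shift.
std::uint64_t remove_factors_of_two(std::uint64_t n)
{
    return n >> trailing_zeros(n);
}
```

For example, 40 is 5 * 2^3, so a single shift by 3 takes it straight to 5.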

IMHO, you could try using 64-bit ints instead of 32-bit ints. Processing 64-bit ints is a lot slower on GPUs but may be faster on CPUs and the Xeon Phi.
Sosiris, team BOINC@Taiwan


Copyright © 2018 Jon Sonntag; All rights reserved.