Clearing up the Backlog (12/05/2013)

Slicker
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 11 Jun 09
Posts: 2525
Credit: 740,580,099
RAC: 2
Message 18032 - Posted: 6 Dec 2013, 4:15:07 UTC

In order to clear up the backlog of half a million workunits that the BOINC daemons felt the need to create when the database was unresponsive, a number of them have been redesignated as solo_collatz workunits. Those workunits that were originally mini_collatz workunits will run in 1/8th the time and receive 1/8th the credit. Those workunits that were originally collatz workunits are the same size as the solo_collatz workunits and will take the normal time to crunch and receive the normal solo_collatz credit.

Slicker
Message 18033 - Posted: 6 Dec 2013, 4:16:16 UTC

Also, the transitioner backlog is almost a full day now, so it will take a while for the pending solo_collatz results to validate. Please try to be patient.

Tarmo Ilves
Joined: 6 Jun 10
Posts: 8
Credit: 1,636,802,721
RAC: 354,358
Message 18045 - Posted: 6 Dec 2013, 19:59:08 UTC

Thanks for the information. It's good to know what's happening and why :)

Sphynx
Joined: 7 Jan 11
Posts: 22
Credit: 1,138,480,505
RAC: 176,413
Message 18046 - Posted: 6 Dec 2013, 22:17:06 UTC

Seems to have broken loose now. My pendings dropped from 825 to 140 in the last hour. Thanks, Jon!

Tackleway
Joined: 29 Sep 13
Posts: 53
Credit: 1,734,204,499
RAC: 1,741,752
Message 18055 - Posted: 8 Dec 2013, 1:28:28 UTC - in response to Message 18046.

Well, lucky ol' you!

My solo validation pending list has been growing all day! The server reports all is good(-ish), and the computing status states no transitioner backlog and next to nothing awaiting validation! So what's actually the true state of affairs? Is that page reporting 'iffy' info at this time?

Can anyone please explain? Patient but curious, thanks.

UBT - NaRyan
Joined: 15 May 10
Posts: 2
Credit: 133,336,986
RAC: 0
Message 18072 - Posted: 8 Dec 2013, 20:35:54 UTC
Last modified: 8 Dec 2013, 20:44:45 UTC

I'm guessing that page is reporting "iffy" info.
It is validating the solo workunits, although it seems a bit random: some hours it does several, then it does nothing for a few hours. It also seems to validate the newer workunits before the older ones.

Also since you are a member of UK Boinc Team, you can look at Temujin's Collatz stats and if you click the graph after your name and other credit related info you can see what you have got per hour.
So that will help you see when things have validated.

Edit: I just noticed, looking at the workunits that are still pending, that they are actually collatz workunits that were sent out as solo_collatz. They still have a quorum of 2, so they have to wait for a wingman.

Slicker
Message 18081 - Posted: 9 Dec 2013, 18:50:54 UTC

I changed the minimum quorum on the solo_collatz WUs to 1 and then re-submitted them all for validation. On the up side, the solo WUs won't be asking for a wingman, and those already returned __should__ be validated once the validator catches up; it now has a backlog, as does the transitioner (again). The down side is that all the WUs that were already validated are now reporting an error while validating. The error just means that they were already granted credit, so ignore those errors.

Slicker
Message 18082 - Posted: 9 Dec 2013, 18:53:11 UTC - in response to Message 18072.

Also since you are a member of UK Boinc Team, you can look at Temujin's Collatz stats and if you click the graph after your name and other credit related info you can see what you have got per hour.
So that will help you see when things have validated.


That is a really nice stats page. Almost makes a person want to implement something like that on the project site...

Sphynx
Message 18085 - Posted: 9 Dec 2013, 21:14:39 UTC

I just downloaded some fresh solos and the quorum is still 2, FYI.

Sphynx
Message 18086 - Posted: 9 Dec 2013, 21:21:58 UTC

Sorry, this is what I'm seeing:

minimum quorum 1
initial replication 2

What does that mean?

Pooh Bear 27
Joined: 1 Aug 10
Posts: 54
Credit: 108,227,920
RAC: 0
Message 18087 - Posted: 9 Dec 2013, 22:16:35 UTC

Yes, all current solos going out are still being sent to two hosts, even though the minimum quorum is 1. When they were originally regular collatz workunits, they were already tagged to create two results in the replication phase, so two will still be created.

I am unsure if there is a way to fix that. The issue is that BoincStats is currently running a Challenge, and the way things are happening will throw the challenge numbers way off.

Let's hope there is a fix for the current situation.

Slicker
Message 18088 - Posted: 9 Dec 2013, 23:23:16 UTC

Thanks for catching that. I had only changed the min_quorum before. Now, I updated the solo WUs so that the initial replication (a.k.a. target_nresults column in the BOINC database) is 1 as well.

Pooh Bear 27
Message 18095 - Posted: 10 Dec 2013, 2:05:28 UTC

As I said, they were already tagged for creation of two. Since they were already created, there are two, and they are still going out to two people even though all the numbers say one. Until those units are cleared, or you manually take down the database and delete the duplicates, we will just have to live with this.
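
As a rough illustration of the two settings discussed above, here is a minimal Python sketch. It is a toy model, not BOINC server code: `min_quorum` and `target_nresults` are the names used in the thread, while the `Workunit` class, the `results_created` field, and both helper functions are hypothetical. It assumes that initial replication controls how many copies get created and that minimum quorum controls how many returned results are needed before validation.

```python
from dataclasses import dataclass


@dataclass
class Workunit:
    min_quorum: int           # returned results needed before validation
    target_nresults: int      # copies to create up front ("initial replication")
    results_created: int = 0  # hypothetical counter of copies already created


def results_still_to_create(wu: Workunit) -> int:
    """Copies still to be created to reach the replication target."""
    return max(0, wu.target_nresults - wu.results_created)


def can_validate(wu: Workunit, returned: int) -> bool:
    """True once enough results are back to satisfy the minimum quorum."""
    return returned >= wu.min_quorum


# What Sphynx saw: minimum quorum 1, initial replication 2.
fresh = Workunit(min_quorum=1, target_nresults=2)
print(results_still_to_create(fresh))   # two copies will be created
print(can_validate(fresh, returned=1))  # one return is enough to validate

# A workunit whose two copies already existed before the target was lowered to 1.
converted = Workunit(min_quorum=1, target_nresults=1, results_created=2)
print(results_still_to_create(converted))  # nothing new, but both copies remain
```

Under these assumptions, a workunit with minimum quorum 1 and initial replication 2 still goes to two hosts but can validate on the first return, and lowering `target_nresults` after both copies exist does not remove the extra copy.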

Sphynx
Message 18104 - Posted: 10 Dec 2013, 14:18:12 UTC

Just curious. I see that all solo wus have now been sent. Will we be waiting until they are all returned before the solo work generator is turned back on?

Slicker
Message 18105 - Posted: 10 Dec 2013, 15:08:45 UTC - in response to Message 18104.

Just curious. I see that all solo wus have now been sent. Will we be waiting until they are all returned before the solo work generator is turned back on?


That's the same logic BOINC uses, and it is causing lots and lots of problems. You would think that is the case, but in reality there are thousands upon thousands of solo WUs. Because the transitioner is backlogged, it looks like there aren't any, and that causes the work generator (if running) to create even more work. If you look at the collatz WUs, you will see that it has a few hundred thousand WUs today, whereas yesterday it showed 0. The work generator is supposed to stop when it has created 1K WUs, but because of the bass-ackwards way that the transitioner, rather than the work generator, actually inserts the result records, the left hand doesn't know what the right hand is doing. Once the transitioner catches up, I can re-enable the work generators. Until then, doing so will just make matters worse.

Scribe
Joined: 22 Dec 11
Posts: 10
Credit: 12,743,541
RAC: 0
Message 18109 - Posted: 10 Dec 2013, 16:27:31 UTC

.....once everything has settled out, how do you prevent it all happening again? What caused it, do you know?
____________
Alan

Slicker
Message 18111 - Posted: 10 Dec 2013, 17:50:09 UTC - in response to Message 18109.

.....once everything has settled out, how do you prevent it all happening again? What caused it, do you know?


The problem is that the BOINC work generator doesn't take into account whether the BOINC transitioner is backlogged or not running. The work generator creates the WU record and the physical WU files. The transitioner creates the result records (what you think of as a WU) according to the quorum rules. The work generator creates new WUs whenever the number of results is below a set threshold. If the transitioner is backlogged or not running at all, the work generator just keeps making more and more WUs, which adds to the backlog. For Collatz, because the WUs are really simple (just two numbers), creating a WU is as fast as creating the result records, so it becomes a catch-22. BOINC delays 5 seconds before creating more work, which normally is enough. But once a backlog of more than 5 seconds exists, the transitioner can't keep up with the work generator, and the endless cycle begins: the generator keeps creating more work because it doesn't think it has enough.

I've submitted a change request along with some code that will help fix the issue to the BOINC development mailing list so hopefully that will become a part of the server code so it won't happen again. That, or they have a better way to fix it.
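
The feedback loop described above can be sketched as a small simulation. This is a toy model in Python, not BOINC code and not the change submitted to the mailing list: the `simulate` function, the batch size, the transitioner rate, and the `backlog_aware` flag are all assumptions made for illustration. Each loop iteration stands for one work generator pass (one 5-second delay), and the model ignores hosts downloading results.

```python
def simulate(cycles, stalled_cycles, threshold=1000, batch=1000,
             transitioner_rate=1000, backlog_aware=False):
    """Return (total WUs created, WUs awaiting the transitioner, result records)."""
    pending_wus = 0     # WUs created but not yet turned into result records
    unsent_results = 0  # result records, the only thing the naive generator counts
    total_created = 0

    for cycle in range(cycles):
        # Work generator pass: create a batch if it thinks work is running low.
        visible = unsent_results
        if backlog_aware:
            visible += pending_wus  # assumed fix: also count untransitioned WUs
        if visible < threshold:
            pending_wus += batch
            total_created += batch

        # Transitioner pass: does nothing while stalled, then works at a fixed rate.
        if cycle >= stalled_cycles:
            done = min(pending_wus, transitioner_rate)
            pending_wus -= done
            unsent_results += done

    return total_created, pending_wus, unsent_results


naive = simulate(cycles=20, stalled_cycles=10)
aware = simulate(cycles=20, stalled_cycles=10, backlog_aware=True)
print("naive:", naive)
print("backlog-aware:", aware)
```

With the transitioner stalled for the first 10 passes, the naive generator in this model creates 11,000 WUs against a 1,000 target, while the variant that also counts untransitioned WUs creates only 1,000.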

Scribe
Message 18112 - Posted: 10 Dec 2013, 17:52:58 UTC

Was this the first time it happened, if so do you know what the "trigger" was?
____________
Alan

Slicker
Message 18117 - Posted: 11 Dec 2013, 2:41:03 UTC - in response to Message 18112.

Was this the first time it happened, if so do you know what the "trigger" was?


http://boinc.thesonntags.com/collatz/forum_thread.php?id=1092&postid=17974#17974

Scribe
Message 18118 - Posted: 11 Dec 2013, 5:54:54 UTC - in response to Message 18117.

Thanks
____________
Alan


Copyright © 2018 Jon Sonntag; All rights reserved.