
Retry counts not working / jobs re-executed beyond retry limits
Closed, Declined · Public

Description

The job retry counters don't seem to be working, at least for ParsoidCache* jobs:

gwicke@tin:~$ mwscript showJobs.php --wiki=enwiki --type=ParsoidCacheUpdateJobOnDependencyChange --list | grep ParsoidCache | grep -c '=201401'
86

gwicke@fluorine:/a/mw-log$ grep ParsoidCache runJobs.log | grep -c '=201401'
4006

Normally I'd expect those jobs to be retried at most three times (the default 'attempts' setting in puppet/modules/mediawiki/templates/jobrunner/jobrunner.conf.erb). Instead, these jobs have been around since January, and the 86 jobs have been executed 4006 times today alone, which works out to over 46 executions per job on average, far more than three retries.

Example runJobs.log line:

2014-10-08 08:35:49 mw1007 enwiki: ParsoidCacheUpdateJobOnDependencyChange Module:Message_box/configuration pages=array(12) rootJobSignature=dd78b0f638bd2f608d82c1e31574b46271060622 rootJobTimestamp=20140127010733 table=templatelinks type=OnDependencyChange t=14415 good

t=14415 would be over the default timeout of 3600.


Version: unspecified
Severity: normal

Related Objects

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:57 AM
bzimport set Reference to bz71853.
bzimport added a subscriber: Unknown Object (MLST).

I don't get this math at all. You are comparing current pending jobs with a root timestamp prefix to all of those run today(ish) with that root timestamp prefix.

A slightly more interesting query is:

aaron@fluorine:~/mw-log$ head -n 100000 runJobs.log | grep -Po '\w+: ParsoidCacheUpdateJobOnDependencyChange .+STARTING' | sort | uniq -c | sort -k1 -n -r | head -n 20

310 svwiki: ParsoidCacheUpdateJobOnDependencyChange Mall:Infobox_biografi pages=array(12) rootJobSignature=a5adce8cb4b7f6429c63373cf750eb7fa58d4952 rootJobTimestamp=20141008081240 table=templatelinks type=OnDependencyChange STARTING
232 enwiktionary: ParsoidCacheUpdateJobOnDependencyChange Module:utilities pages=array(12) rootJobSignature=06aaf41a1bd13c4ebe635e76eb45193d8b9fb46a rootJobTimestamp=20140612102850 table=templatelinks type=OnDependencyChange STARTING
164 enwiktionary: ParsoidCacheUpdateJobOnDependencyChange Module:languages/data2 pages=array(12) rootJobSignature=e458a770b9fdf5efb61fb51efd18f26bacc7780f rootJobTimestamp=20140908024938 table=templatelinks type=OnDependencyChange STARTING

This is still useless, since pages=array(12) does not imply that it was the *same* 12 pages; that's the expected result of the partitioning scheme. The query would work if each job had a hash of its params printed to the log, though...
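
A hypothetical way to do that (a sketch, not the actual runJobs logging code; it assumes a MediaWiki Job object with getType() and getParams()):

<?php
// Hypothetical sketch: print a stable digest of a job's parameters so that
// executions of the *same* work unit can be counted in runJobs.log.
function jobParamsHash( Job $job ) {
	$params = $job->getParams();
	ksort( $params ); // make the digest independent of parameter order
	return substr( sha1( $job->getType() . serialize( $params ) ), 0, 12 );
}

// e.g. append ' paramsHash=' . jobParamsHash( $job ) to the runJobs log line, then:
// grep -Po 'paramsHash=\w+' runJobs.log | sort | uniq -c | sort -k1 -n -r | head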

(In reply to Aaron Schulz from comment #1)

I don't get this math at all. You are comparing current pending jobs with a
root timestamp prefix to all of those run today(ish) with that root
timestamp prefix.

The point is that all of those are retries, and that there are many more retries of those very old jobs than we'd expect given the default retry limit of 3. And that is from looking at the logs of a single day only; those jobs have been retried many more times since January.

Looking in Redis I also don't see any jobs tried more than thrice now:

var_dump( $conn->hGetAll( wfMemcKey( 'jobqueue', $type, 'h-attempts' ) ), $prop );

array(4) {
  ["dd3198047ef64b6c8d6b9bc4d9455f72"]=>
  string(1) "3"
  ["cfe7372acf9c412da3fc30f58bc16073"]=>
  string(1) "3"
  ["79df259a1a3b4227bceab760c11cc64d"]=>
  string(1) "1"
  ["8166de1e7c3140aeb4fbfb9b31ce87bb"]=>
  string(1) "1"
}

As was the case a few weeks ago, the abandoned (and therefore recently retried) job count for that queue is very low.

Actually, looking at the code long enough made me notice a somewhat opposite bug (i.e. not enough retries), fixed in https://gerrit.wikimedia.org/r/#/c/166681/.

Just re-checked on tin:

gwicke@tin:~$ mwscript showJobs.php --wiki=enwiki --type=ParsoidCacheUpdateJobOnDependencyChange --list | grep ParsoidCache | grep -c '=201401'
86

So all of those jobs are still in the queue, despite being retried many times since I opened this bug (I count 3250 executions in today's log).

So perhaps the retry counter is not incremented properly for jobs that succeed after the timeout has elapsed?

@aaron: Should we just clear out the old jobs by temporarily merging https://gerrit.wikimedia.org/r/#/c/165635/ ? If this is still a problem with new jobs, then we'll find new ones to investigate soon. If it's actually fixed and this was some freak corruption issue, then we'd avoid spending too much time on the investigation.

It doesn't seem to be the redis job queue retry system. Maybe it's the partitioning code in the Parsoid extension?

aaron@fluorine:~/mw-log$ head -n 10000000 runJobs.log | grep -Po '\w+: ParsoidCacheUpdateJobOnDependencyChange .+STARTING' | grep -Po '\(uuid=\w+,' | sort | uniq -c | sort -k1 -n -r | head -n 20

1 (uuid=ffffe377d4694ceda951d80a215ffb77,
1 (uuid=ffffd36e7f654c4397d76b344a0cf58e,
1 (uuid=ffffacd744294446909b57e1882b2d87,
1 (uuid=ffff831d0d2f47a8833487fb3ccfaa6f,
1 (uuid=ffff7c6b2df845328b6a6e5545dfdfd2,
1 (uuid=fffef1cd5a3b4ef69ae06cf2056b3e9c,
1 (uuid=fffed88560724c3a9daa4a1ae8e28a15,
1 (uuid=fffeb2b7138946298e2871bcf56697c4,
1 (uuid=fffe9e3930cb49419595dabcdd9e873f,
1 (uuid=fffe961daccb4654acacf6eb624f9560,
1 (uuid=fffe93496ad947e0a98e9caabe7ef117,
1 (uuid=fffe8ae575884c54ba6bd134750e2dd9,
1 (uuid=fffe29d2643542b5828c1707abb6fc76,
1 (uuid=fffe0d75e1b149c98e78c14a24ee6cf3,
1 (uuid=fffdefdfa4764fc7a3ad041f9177893a,
1 (uuid=fffdeddcfd664ed08f8da0043df51d35,
1 (uuid=fffdbbe0e9454adfb08a9f3749b2361f,
1 (uuid=fffd9f3395084cde91865dbe41d7db14,
1 (uuid=fffd90b99c1c40a095be8d6d66aaa5f1,
1 (uuid=fffd850a6b344fffaf904fc1a5df567c,

aaron@fluorine:~/mw-log$ grep -Po '\w+: refreshLinks .+STARTING' runJobs.log | grep -Po '\(uuid=\w+,' | sort | uniq -c | sort -k1 -n -r | head -n 20

2 (uuid=a5cabb857ffd4953a4484d1a9c75d91c,
2 (uuid=67be1fc2aec44fc9821092478fbefcbb,
1 (uuid=fffff607cfd2421384a2addb06fdf67d,
1 (uuid=ffffeff90821463f8b006e5358dd6a67,
1 (uuid=ffffef61e5554ecfb5ff5b002b3788c0,
1 (uuid=ffffeee9ec4942cf9a1e46a7759da168,
1 (uuid=ffffec525411482d892480d6bc9bdefa,
1 (uuid=ffffe9cbf6c04bfdbd7dbb39b134afdf,
1 (uuid=ffffe7522cd545f6b201b6f38ba5fe13,
1 (uuid=ffffe6c011414031b88d7e735728e3c3,
1 (uuid=ffffe51c1d8a4cc4bbead6e5bb285e94,
1 (uuid=ffffe44f1c2e48d7ae0b1a9aec87eca4,
1 (uuid=ffffe02ca1c443da9ff2fe1485eeec0f,
1 (uuid=ffffde5755014d918c98818a2a937f34,
1 (uuid=ffffd6897a01417fbb996918c8fdbdc8,
1 (uuid=ffffd2fde873473297b4f194e3791508,
1 (uuid=ffffd17e8a2f40de85940d24000bbea7,
1 (uuid=ffffc1bec5fa44ed841ba2cf4811d488,
1 (uuid=ffffc187ad06449f9a5b0b862f3b7eca,
1 (uuid=ffffb30fd1b54aa494c09ccc8d599d30,

Change 176621 had a related patch set uploaded (by GWicke):
Return early if partitionBacklinkJob is called with a leaf job

https://gerrit.wikimedia.org/r/176621


Change 176621 abandoned by GWicke:
Return early if partitionBacklinkJob is called with a leaf job

Reason:
Won't actually make a difference

https://gerrit.wikimedia.org/r/176621

Maybe it's the partitioning code in the Parsoid extension?

That's just calling BacklinkJobUtils::partitionBacklinkJob, and the way it's called seems to be correct.

After some Redis wobbles earlier today the old jobs apparently disappeared from the queue, and the load on the Parsoid cluster dropped by ~12%. Let's see if the problem reappears at some point.

Another option we could consider is unconditionally dropping jobs whose rootJobTimestamp is older than some threshold, either in the Parsoid job or BacklinkJobUtils.
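
A minimal sketch of what such an age check might look like in a job's run() method. This is illustrative only: it assumes the job params carry 'rootJobTimestamp' in MediaWiki timestamp format, and the 60-day cutoff simply mirrors the change merged further down in this task.

<?php
// Illustrative sketch of dropping stale jobs by root job age; not the actual
// extension code. A job that returns true is treated as a success and not retried.
class RestbaseUpdateJob extends Job {
	const MAX_ROOT_JOB_AGE = 5184000; // 60 days, in seconds

	public function run() {
		if ( isset( $this->params['rootJobTimestamp'] ) ) {
			$age = time() - wfTimestamp( TS_UNIX, $this->params['rootJobTimestamp'] );
			if ( $age > self::MAX_ROOT_JOB_AGE ) {
				return true; // too old to still be useful; drop silently
			}
		}
		// ... normal dependency-change processing would go here ...
		return true;
	}
}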

We have recently seen a strange pattern of elevated p99 in RESTBase for 1-3 hours each night, somewhat coinciding with an influx of apparently old jobs (based on 4xx rate from the If-Unmodified-Since check rising). @aaron, is there any cyclical activity in the job queue related to retries that could possibly be connected to this?

We have recently seen a strange pattern of elevated p99 in RESTBase for 1-3 hours each night, somewhat coinciding with an influx of apparently old jobs (based on 4xx rate from the If-Unmodified-Since check rising). @aaron, is there any cyclical activity in the job queue related to retries that could possibly be connected to this?

Not that I know of.

As for delay, retried jobs are re-enqueued to the end of the queue, so if the queue is long it could be much longer than 1 hour before the retry actually happens. That's easy to change though.

I'm seeing a couple of *really* old jobs again, like this one from March (!):

2015-07-24 18:56:01 mw1004 commonswiki runJobs DEBUG: RestbaseUpdateJobOnDependencyChange File:Cc-sa_white.svg pages=array(4) rootJobSignature=17abf67d15931b7b4246c8700de67bf02f6b1685 rootJobTimestamp=20150327131504 table=imagelinks type=OnDependencyChange ts=1427462104 (uuid=b9bcb8acdb3747349b65e66739a5f0f9,timestamp=1437751260,QueuePartition=rdb2) STARTING

I'm pretty sure the job queue was completely drained since then (we had five workers per job runner node for several months in between), and even now, with three workers and after a slow RESTBase period, there are only 36k of those jobs for commonswiki in the queue. Additionally, RESTBase jobs should never be retried.

If they should never be retried, then RestbaseUpdateJob should override allowRetries() to return false.
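
For reference, such an override is tiny. A sketch, assuming RestbaseUpdateJob extends the core Job class (whose default allowRetries() returns true):

<?php
// Sketch of the override suggested above; not the actual extension code.
class RestbaseUpdateJob extends Job {
	public function allowRetries() {
		// Never re-enqueue this job type after a failure or timeout.
		return false;
	}
}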

@aaron, aha. Didn't know about that & shall add that. However, https://github.com/wikimedia/mediawiki-extensions-RestBaseUpdateJobs/blob/master/RestbaseUpdateJob.php#L174 always returns true, so I would expect it to never be retried. Are there situations where it would still be legitimately retried?

@aaron, aha. Didn't know about that & shall add that. However, https://github.com/wikimedia/mediawiki-extensions-RestBaseUpdateJobs/blob/master/RestbaseUpdateJob.php#L174 always returns true, so I would expect it to never be retried. Are there situations where it would still be legitimately retried?

Exceptions.

Fatal errors and network partitions are also possible. Preventing retries for those would involve updating the jobrunner service json config (limits => attempts => <jobtype>).
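
For illustration, that per-type attempt limit lives in the jobrunner service's JSON config roughly as below. The layout follows the limits => attempts => <jobtype> structure mentioned above, but the concrete values (and whether the RESTBase type would be set to 1) are assumptions here, not the production settings.

{
	"limits": {
		"attempts": {
			"*": 3,
			"RestbaseUpdateJobOnDependencyChange": 1
		}
	}
}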

Change 226900 had a related patch set uploaded (by GWicke):
Explicitly disallow retries of RestbaseUpdateJobs

https://gerrit.wikimedia.org/r/226900

Change 226901 had a related patch set uploaded (by GWicke):
Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries

https://gerrit.wikimedia.org/r/226901

@aaron, thanks for the pointers; see patches above.

Given the age of these jobs, do you think these were retried more than three times?

Change 226900 merged by jenkins-bot:
Explicitly disallow retries of RestbaseUpdateJobs

https://gerrit.wikimedia.org/r/226900

Change 226901 merged by Ori.livneh:
Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries

https://gerrit.wikimedia.org/r/226901

Change 231625 had a related patch set uploaded (by GWicke):
Drop jobs with a rootJobTimestamp older than 60 days

https://gerrit.wikimedia.org/r/231625

There are still 50k+ really old RESTBase jobs being recycled forever. Example from tail runJobs.log just now:

2015-08-14 19:16:13 mw1001 commonswiki runJobs INFO: RestbaseUpdateJobOnDependencyChange File:Checked_copyright_icon.svg pages=array(4) rootJobSignature=f583c2cb8ebfc23cb2c11137c415bff28133bace rootJobTimestamp=20150326184916 table=imagelinks type=OnDependencyChange ts=1427395756 (uuid=a626554f3a894ca19afa2c73228a9326,timestamp=1439570206,QueuePartition=rdb2) t=784 good

Change 231625 merged by jenkins-bot:
Drop jobs with a rootJobTimestamp older than 60 days

https://gerrit.wikimedia.org/r/231625

Change 232088 had a related patch set uploaded (by GWicke):
Lower maximum job age to 30 days

https://gerrit.wikimedia.org/r/232088

See https://gist.github.com/gwicke/0908d952f7739f824bd3 for current data. Note that there are fairly old root jobs in the queue, including some from January and early February.

gwicke@fluorine:/a/mw-log$ tail -10000 runJobs.log | grep rootJobTimestamp=20160201 | tail
2016-03-16 06:42:08 mw1003 commonswiki 1.27.0-wmf.16 runJobs DEBUG: refreshLinks Module:Coordinates pages=array(1) rootJobSignature=7f9ae1b7eba847fecb76332538ab98647d1ea3e2 rootJobTimestamp=20160201125021 masterPos=db1040-bin.001598/626031307 triggeredRecursive=1 (uuid=66b6913eaf3e4c22b116bf0379793b68,timestamp=1458110034,QueuePartition=rdb3-6381) STARTING 

gwicke@fluorine:/a/mw-log$ tail -1000000 runJobs.log | grep -c rootJobSignature=7f9ae1b7eba847fecb76332538ab98647d1ea3e2
354523

Some notes from last night's IRC conversation:

[23:41] <gwicke> interesting.. that job from Feb 1 was responsible for 360k out of the last 1 million job executions
[23:41] <ebernhardson> leaf nodes do a single update, range jobs when run queue up ~300 leaf jobs, and then some more range jobs, and the base job queues the range jobs
[23:42] <ori> gwicke: what job id?
[23:42] <gwicke> 2016-03-16 06:42:08 mw1005 commonswiki 1.27.0-wmf.16 runJobs DEBUG: refreshLinks Module:Coordinates pages=array(1) rootJobSignature=7f9ae1b7eba847fecb76332538ab98647d1ea3e2 rootJobTimestamp=20160201125021 masterPos=db1040-bin.001598/626031307 triggeredRecursive=1 (uuid=b0fcbf6d8c1d4965a6d23b8bf04fdc6e,timestamp=1458110077,QueuePartition=rdb3-6379) STARTING
[23:43] <ebernhardson> that Module:Coordinates ... me and max noticed it earlier today as well
[23:43] <ebernhardson> i ran a count on fluorine, it's run more than 100M times in the last 5 days
[23:43] <MaxSem> heheh: delete that page, wait couple hours, restore
[23:43] <MaxSem> so duplication's bonkers
[23:44] <MaxSem> however, I don't see any changes in job or backlinks cache code in the last couple weeks
[23:44] <gwicke> out of the last million lines, that root job id was executed 354523 times
[23:44] <_joe_> (we also need to understand why this began all of a sudden)
[23:44] <gwicke> also, this is from Feb 1st
[23:45] <ori> pages=array(1)?
[23:46] <MaxSem> gwicke, Feb 1 in page_touched
[23:46] <MaxSem> s/in/is/
[23:46] <gwicke> rootJobTimestamp=20160201125021
[23:46] <_joe_> MaxSem: rootJobTimestamp=20160201125021
[23:47] <_joe_> I think that's what gwicke is referring to
[23:47] <_joe_> honestly we should reap any job older than say one day
[23:47] <_joe_> any root job
[23:47] <_joe_> at least on queues like refreshlink or htmlcacheupdate, imo
[23:47] <ebernhardson> that timestamp is misleading, it's the page_touched not the age of the job
[23:47] <ebernhardson> i confused that earlier today too
[23:48] <_joe_> ebernhardson: ahah ok maybe we can change that logging, I guess :P
[23:48] <gwicke> ebernhardson: touched has a much wider range
[23:48] <gwicke> some of the jobs operate on pages that were touched last in March last year
[23:48] <ori> that array(1) is not a serialization artifact; that is the value
[23:49] <_joe_> ori: I thought it was some representation
[23:49] <ori> and RefreshLinksJob.php treats it as an array of $pageId => $nsAndKey
[23:49] <ori> (L110)
[23:50] <ebernhardson> i thought it was some representation as well, but didn't look
[23:50] <MaxSem> actually, 'rootJobTimestamp' => wfTimestampNow()
[23:50] <ori> it's always array(1), too
[23:50] <_joe_> ori: every job I can see has pages = array(1)
[23:50] <_joe_> yes
[23:50] <ebernhardson> MaxSem: hmm, interesting
[23:51] <MaxSem> so timestamp is BS, root timestamp is not
[23:51] <ori> this should be adopted as formal notation of computational complexity
[23:51] <MaxSem> we should have alerts for maximum job age
[23:51] <ori> O(1), O(n), O(log n), array(1)
[23:51] <ebernhardson> :)
[23:52] <MaxSem> O(+INF)
[23:52] * gwicke nominates array(1) as a measure of accidental complexity
[23:52] <ori> haha
[23:52] <ori> yes, I think this has to happen now
[23:53] <ori> OK, going to confirm that it is actually iterating on that value in the $pageId => $nsAndKey loop
[23:57] <ebernhardson> i just pulled the first 10 Module:Coordinates jobs from the queue on commonswiki, they all have valid looking page arrays
[00:03] <gwicke> there is a template / module usage stat tool somewhere, but I don't seem to be able to find it right now
[00:04] <legoktm> https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Templates_transcluded_on_the_most_pages ?
[00:04] <gwicke> http://tools.wmflabs.org/templatecount/, but it doesn't cover commons
[00:05] <MaxSem> gwicke, ~7.5M
[00:05] <legoktm> "data as of 16:47, 08 September 2013" aw :(
[00:05] <gwicke> isn't 7.5 million the top one on enwiki?
[00:06] <ori> ebernhardson: ack, I was wrong
[00:10] <gwicke> if we assume that this module is also used in ~7 million pages (which I doubt), then I don't see why it isn't fully processed by now, given that we have had 360k executions in the last 1 million executions alone
[00:11] <ebernhardson> the templatelinks table on commons seems to think 7.5M for that module, it's basically on every image that has a geo-coordinate
[00:11] <ebernhardson> but indeed, it's been run more than 100M times in the last few days. 
[00:12] <_joe_> so I repeat my question from the other day
[00:13] <_joe_> would it help if I flushed the queues of all the backlog of refreshlink jobs?
[00:13] <_joe_> that would allow us to debug what's new
[00:14] <gwicke> it would certainly lighten the load, but it would also remove some of those self-replicating jobs we are interested in studying
[00:14] <ori> yeah, I think we're close
[00:14] <ori> and this issue has been biting us since at least 2014 (T73853)
[00:17] <gwicke> I'm too tired to actually debug the code now, and didn't spot the root cause in a cursory look a while ago
[00:17] <_joe_> gwicke: so you're convinced as well that this has to do with deduplication failing for some mysterious reason
[00:18] <gwicke> _joe_: deduplication is a different mechanism; this issue seems to be jobs being recycled / re-tried forever
[00:18] <gwicke> deduplication tries to eliminate jobs before they even start, but it's not intended to handle retries
[00:18] <gwicke> it also only works on root jobs
[00:18] <gwicke> (IIRC)
[00:19] <gwicke> so, if you edit a template two times in quick succession, then there is a chance that deduplication would remove one of those root jobs
[00:21] <gwicke> the new data seems to be very similar to the old one, where jobs were retried many more times than expected / configured over a long time
[00:23] <gwicke> in any case, I should head to bed
[00:24] <gwicke> see you tomorrow!
[09:02] <_joe_> 03:24:41> good night!
[09:02] <MaxSem> 03:31:43> refreshLinks Module:Coordinates pages=array(1) rootJobSignature=7f9ae1b7eba847fecb76332538ab98647d1ea3e2 rootJobTimestamp=20160201125021 masterPos=db1040-bin.001598/626031307 triggeredRecursive=1 (uuid=6ccf600d08734635b271d4503559bf8a,timestamp=1458111840,QueuePartition=rdb4-6380)
[09:02] <MaxSem> 03:32:04> grrr, against what table this is?
[09:02] <_joe_> 03:35:19> MaxSem: what is Шаблон:Неоднозначность/данные
[09:02] <_joe_> 03:35:23> ?
[09:02] <MaxSem> 03:35:38> some russian template
[09:02] <_joe_> 03:35:39> (the fact that we have localized logs is priceless when debugging)
[09:02] <_joe_> 03:35:48> yeah I was asking for a translation :P
[09:02] <_joe_> 03:36:35> because out of the refreshlink jobs, most are either from that module or from module:wikidata2
[09:02] <ebernhardson> 03:36:51> so my script finally finished counting, of all the 'range' refreshLinks jobs for Module:Coordinates, there are 60M subranges currently enqueued
[09:02] <MaxSem> 03:36:56> template:disambig/data
[09:02] <ebernhardson> 03:36:57> on commonswiki
[09:02] <MaxSem> 03:37:11> what it means? trying to understand myself :P
[09:02] <_joe_> 03:38:12> MaxSem: ahah
[09:02] <ebernhardson> 03:40:21> 60m subranges means there pretty much have to be duplicates in there, but not sure how they would have got there...i gotta get some sleep and think about this in the morning though
[09:02] <ebernhardson> 03:41:19> fwiw, almost all the jobs on commonswiki are for 4 pages: Module:Coordinates, Template:Compass rose file, Template:City and Template:Header
GWicke renamed this task from Retry counts not working to Retry counts not working / jobs re-executed beyond retry limits. Mar 16 2016, 5:30 PM

Today between 5:00 and 5:20 UTC, the read load on db1018 (the s2 master) dropped dramatically back to "normal" levels.

This is a follow-up to T129830, which was merged into T129517, as refreshLinks was found to be the cause.

I can't pinpoint it more exactly than that just yet. Tailing the log of articles being refreshed showed a large but finite set of jobs cycling endlessly through the queue. One way this could happen is if there was a cycle somewhere, such that refreshing some page set off a sequence of surrogate jobs which eventually comes back around to scheduling a refresh of the page that initiated the sequence. If that is indeed the case, we may be able to isolate the problem by making jobs keep track of their lineage.

I think jobs that spawn other jobs must include some kind of reference or identifier in order to track lineage. Ideally this identifier would also include a counter of some kind. Should a separate task be filed tracking this specific request or is it covered by this task?
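
As a thought experiment, lineage tracking could look roughly like this when a job spawns its successors. The parameter names ('lineageId', 'hopCount') and the depth cutoff are invented for illustration; nothing like this exists in the job classes today.

<?php
// Hypothetical sketch: propagate a lineage id and hop counter to spawned jobs,
// and refuse to extend suspiciously deep chains (a likely cycle).
function spawnChildJobs( Job $parent, array $childSpecs ) {
	$params = $parent->getParams();
	$lineageId = isset( $params['lineageId'] )
		? $params['lineageId']
		: sha1( $parent->toString() . mt_rand() );
	$hopCount = ( isset( $params['hopCount'] ) ? $params['hopCount'] : 0 ) + 1;

	if ( $hopCount > 100 ) {
		wfDebugLog( 'runJobs', "Dropping job chain $lineageId at depth $hopCount" );
		return;
	}

	$jobs = [];
	foreach ( $childSpecs as $spec ) {
		$jobs[] = new RefreshLinksJob( $spec['title'], $spec['params'] + [
			'lineageId' => $lineageId,
			'hopCount' => $hopCount,
		] );
	}
	JobQueueGroup::singleton()->push( $jobs );
}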

We're going to continue to work on isolating this problem and fixing it properly. (T73853 is the task to follow.) In the interim, we are going to minimize the user impact of any recurrences of this issue by processing RefreshLinks jobs in a separate queue. This should prevent a spike in RefreshLinks jobs from causing the processing of other job types to be delayed.
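
For illustration only: the jobrunner service config groups runners by job type, so isolating RefreshLinks amounts to excluding refreshLinks from the default group and giving it a dedicated group. The group names and runner counts below are made up; this sketches the idea, not the production change.

{
	"groups": {
		"basic": {
			"runners": 5,
			"include": [ "*" ],
			"exclude": [ "refreshLinks" ]
		},
		"refreshLinksOnly": {
			"runners": 2,
			"include": [ "refreshLinks" ]
		}
	}
}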

How many job queues are there now for a wiki such as the English Wikipedia? Two? Three? More?

We plan to address this by adding additional metrics, alerts, and logging, and then using those to guide further refactoring work.

Can you provide additional details? What do we currently have besides https://grafana.wikimedia.org/dashboard/db/job-queue-health? What else is needed?

Joe raised the priority of this task from Medium to High. Apr 20 2017, 8:42 AM
Joe subscribed.

Raised the priority as this keeps biting us again and again, see T163418

Krinkle subscribed.

Closing for now. Various patches have been made toward addressing this, but some (or all) of the problems we observed seem to stem from other causes (i.e. not from exceeding any retry limit).

@Krinkle, are you saying that we are confident that jobs are no longer retried for more times than the retry limit would nominally allow?

I just looked into HTMLCacheUpdate jobs executed in the last 15 hours, and the number of really old jobs still being executed (presumably retried or respawned) is greater than I would expect with a retry limit of 3 (or 2?):

   1233 20170407
   8237 20170408
     18 20170423
     18 20170426
     20 20170429
     50 20170430
     18 20170502
     18 20170504
     20 20170509
     10 20170512
     18 20170513
     16 20170523
     22 20170528
     10 20170529
     40 20170606
     20 20170617
     18 20170622
     21 20170625
     16 20170627
     10 20170628
     10 20170630
     36 20170701
     20 20170705
     28 20170708
     18 20170712
     10 20170715
     16 20170717
     18 20170724
     42 20170725
     20 20170726
     20 20170728
     17 20170729
     34 20170803
     46 20170804
     30 20170805
     50 20170807
     54 20170808
    260 20170809
    137 20170810
     16 20170811
     17 20170812
     84 20170813
     36 20170814
     10 20170815
     72 20170816
    445 20170817
     82 20170818
     67 20170819
  21452 20170820
   1825 20170821
     81 20170822
    176 20170823
   4810 20170824
   9773 20170825
  21842 20170826
 218770 20170827
   8087 20170828
 183142 20170829
3805398 20170830

Of particular note are the jobs from April 7th and 8th. There were some switchdc tests done around that time (in preparation for the DC switch-over), but I don't see anything obviously related happening on those exact dates.

The replication issues discussed in T163337 could play a role in duplication / keeping old jobs alive.

The replication issues discussed in T163337 could play a role in duplication / keeping old jobs alive.

Not really: bad replication can create inconsistent job structures, but only in replicas, which we don't use during normal operations. So I'd tend to rule that out.