This happened again. See the last messages on T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes) for the problem described there; @Bawolff asked me to file a separate ticket. I have now closed the UW tab in my browser, opened a new tab, tried to upload 9 files, and got this. Translation: 1 - "Server doesn't respond in time", 2 - "Wait your queue..."
Description
Related Objects
- Mentioned In
  - T378609: Monitoring to surface "low-traffic" jobs isolation failure
  - T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes)
- Mentioned Here
  - T378609: Monitoring to surface "low-traffic" jobs isolation failure
  - T379035: Consider lifting AssembleUploadChunks and PublishStashedFile out of the low-traffic consumer
  - T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes)
Event Timeline
@Scott_French, Bawolff mentioned you in the last ticket. What can we do to fix this problem?
Thanks for the report. Indeed, this seems to have been another low-traffic consumer isolation failure, this time caused by a large influx of ChangeDeletionNotification jobs [0]. Those appear to have returned to normal insertion rates by ~ 1:20 UTC on the 10th, then by ~ 14:40 their backlog had drained off [1] and AssembleUploadChunks backlog time largely returned to normal [2].
While I've not looked closely at the trigger for the ChangeDeletionNotification jobs, at least to some extent, it does not really matter - i.e., this kind of event is always possible as long as these jobs share a consumer. Given that, I'll aim to prioritize T379035 and T378609 this week.
[0] https://grafana.wikimedia.org/goto/mXvr7oGHg?orgId=1
[1] https://grafana.wikimedia.org/goto/7dz67TMNg?orgId=1
[2] https://grafana.wikimedia.org/goto/sANWVoGNR?orgId=1
Alright, as noted in T379035#10318390, since ~ 17:50 UTC today all three job types that are critical to (async) uploads have been lifted out of the low-traffic consumer and into dedicated rules. Thus, they should no longer be exposed to this kind of isolation failure.
Given that, I'll mark this task resolved, while work continues in parallel in T378609 for the monitoring aspect (i.e., so we can respond by isolating antagonist workloads from other job types).
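For readers unfamiliar with this failure mode, here is a rough sketch of why a burst of one job type delays every other type on a shared consumer, and why a dedicated rule removes that exposure. It is a toy model, not the actual JobQueue or consumer configuration: the `route` and `backlog_seconds` helpers, the burst size of 1000 jobs, and the 0.5 s per-job cost are all assumptions chosen for illustration, and each consumer is modelled as a single FIFO worker with no concurrency.

```python
from collections import deque


def route(jobs, dedicated):
    """Split jobs into per-consumer FIFO queues.

    Job types listed in `dedicated` get their own queue; every other
    type shares the catch-all "low-traffic" queue (hypothetical names).
    """
    queues = {}
    for job_type in jobs:
        consumer = job_type if job_type in dedicated else "low-traffic"
        queues.setdefault(consumer, deque()).append(job_type)
    return queues


def backlog_seconds(queue, job_type, seconds_per_job=0.5):
    """Time until the first job of `job_type` starts on a FIFO consumer
    that handles one job at a time (assumed fixed cost per job)."""
    for position, queued_type in enumerate(queue):
        if queued_type == job_type:
            return position * seconds_per_job
    return None


# Illustrative burst: many notification jobs arrive just before one upload job.
jobs = ["ChangeDeletionNotification"] * 1000 + ["AssembleUploadChunks"]

# Shared consumer: the upload job waits behind the whole burst.
shared = route(jobs, dedicated=set())
shared_wait = backlog_seconds(shared["low-traffic"], "AssembleUploadChunks")

# Dedicated rule: the upload job has its own queue and starts immediately.
isolated = route(jobs, dedicated={"AssembleUploadChunks"})
isolated_wait = backlog_seconds(
    isolated["AssembleUploadChunks"], "AssembleUploadChunks"
)

print(f"shared consumer wait:    {shared_wait} s")
print(f"dedicated consumer wait: {isolated_wait} s")
```

In this model the upload job's wait depends only on what shares its queue, which is why the trigger for the burst matters less than the fact that the job types shared a consumer.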