
Repeated UploadWizard failures: "Server did not respond in time"
Closed, Resolved · Public · BUG REPORT

Description

It happened again. See the last messages on T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes); this is the problem described there, and @Bawolff asked me to file a separate ticket. I closed the UploadWizard tab in my browser, opened a new tab, tried to upload 9 files, and got this. Translation of the messages in the screenshot: 1 - The server did not respond in time; 2 - Wait for your queue...

{0F0A62A2-0DD6-4ABC-B22B-7BF85AD7E4C0}.png (1×915 px, 73 KB)

Event Timeline

@Scott_French, Bawolff mentioned you in the last ticket. What can we do to fix this problem?

Aklapper renamed this task from UploadWizard doesn't work to Repeated UploadWizard failures: "Server did not respond in time". Nov 10 2024, 6:42 PM
Aklapper updated the task description.

Thanks for the report. Indeed, this seems to have been another low-traffic consumer isolation failure, this time caused by a large influx of ChangeDeletionNotification jobs [0]. Those appear to have returned to normal insertion rates by ~1:20 UTC on the 10th; by ~14:40 their backlog had drained [1] and AssembleUploadChunks backlog time had largely returned to normal [2].

While I've not looked closely at what triggered the ChangeDeletionNotification jobs, to some extent it does not really matter: this kind of event is always possible as long as those jobs share a consumer with the upload jobs. Given that, I'll aim to prioritize T379035 and T378609 this week.

[0] https://grafana.wikimedia.org/goto/mXvr7oGHg?orgId=1

[1] https://grafana.wikimedia.org/goto/7dz67TMNg?orgId=1

[2] https://grafana.wikimedia.org/goto/sANWVoGNR?orgId=1
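
For illustration, here is a minimal sketch of the head-of-line blocking described above, under simplified assumptions: a single shared consumer draining jobs strictly in arrival order, a made-up burst size, and a made-up comparison with a dedicated consumer. This is not the actual changeprop/Kafka implementation, just the failure mode in miniature.

```python
# Toy model of the isolation failure: not the actual changeprop/Kafka
# consumer, just an illustration of head-of-line blocking.
from collections import deque

def wait_for_upload_job(shared_consumer: bool) -> int:
    """Count how many jobs an AssembleUploadChunks job waits behind."""
    # Assumed workload: a burst of 10,000 ChangeDeletionNotification jobs
    # arrives just ahead of a single AssembleUploadChunks job.
    burst = ["ChangeDeletionNotification"] * 10_000
    upload = ["AssembleUploadChunks"]

    if shared_consumer:
        # One shared low-traffic consumer drains everything in arrival
        # order, so the upload job sits behind the entire burst.
        queue = deque(burst + upload)
    else:
        # A dedicated consumer for upload jobs only ever sees upload
        # jobs, so the burst cannot delay it.
        queue = deque(upload)

    waited = 0
    while queue:
        if queue.popleft() == "AssembleUploadChunks":
            return waited
        waited += 1
    return waited

print(wait_for_upload_job(shared_consumer=True))   # 10000 jobs ahead
print(wait_for_upload_job(shared_consumer=False))  # 0 jobs ahead
```

Any shared consumer is exposed to this kind of delay whenever an unrelated job type bursts, which is why the remediation below moves the upload jobs onto dedicated rules.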

jijiki triaged this task as High priority. Nov 11 2024, 1:06 PM
jijiki moved this task from Incoming 🐫 to Production Errors 🚜 on the serviceops board.

Alright, as noted in T379035#10318390, since ~ 17:50 UTC today all three job types that are critical to (async) uploads have been lifted out of the low-traffic consumer and into dedicated rules. Thus, they should no longer be exposed to this kind of isolation failure.
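
Conceptually, the routing change looks something like the sketch below. This is only an illustration: the real rules live in the changeprop-jobqueue configuration, and only AssembleUploadChunks is named in this task, so the other two job-type names here are assumptions.

```python
# Hypothetical sketch of the routing change, not the real
# changeprop-jobqueue configuration.

# Upload-critical job types now matched by dedicated rules.
# Only AssembleUploadChunks is named in this task; the other two
# entries are assumed for illustration.
DEDICATED_UPLOAD_JOB_TYPES = {
    "AssembleUploadChunks",
    "PublishStashedFile",
    "UploadFromUrl",
}

def consumer_for(job_type: str) -> str:
    """Route a job type to its dedicated rule if it has one, otherwise
    fall back to the shared low-traffic consumer."""
    if job_type in DEDICATED_UPLOAD_JOB_TYPES:
        return f"dedicated-{job_type}"
    return "low-traffic-jobs"

# A ChangeDeletionNotification burst still lands on the shared consumer,
# but it can no longer sit in front of upload jobs.
assert consumer_for("AssembleUploadChunks") == "dedicated-AssembleUploadChunks"
assert consumer_for("ChangeDeletionNotification") == "low-traffic-jobs"
```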

Given that, I'll mark this task resolved, while work continues in parallel in T378609 on the monitoring aspect (i.e., so we can respond reactively and isolate antagonistic workloads from other job types).