
Repeated UploadWizard failures: "Server did not respond in time"
Closed, Resolved · Public · BUG REPORT

Description

It happened again. See the last messages on T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes); this is the problem described there, and @Bawolff asked me to file a separate ticket. I closed the UploadWizard tab in my browser, opened a new tab, tried to upload 9 files, and got this. Translation of the messages in the screenshot: 1 - The server did not respond in time; 2 - Wait for your queue...

{0F0A62A2-0DD6-4ABC-B22B-7BF85AD7E4C0}.png (1×915 px, 73 KB)

Event Timeline

@Scott_French, Bawolff mentioned you in the last ticket. What can we do to fix this problem?

Aklapper renamed this task from UploadWizard doesn't work to Repeated UploadWizard failures: "Server did not respond in time". Nov 10 2024, 6:42 PM
Aklapper updated the task description.

Thanks for the report. Indeed, this seems to have been another low-traffic consumer isolation failure, this time caused by a large influx of ChangeDeletionNotification jobs [0]. Those appear to have returned to normal insertion rates by ~1:20 UTC on the 10th; by ~14:40 their backlog had drained [1] and AssembleUploadChunks backlog time had largely returned to normal [2].

While I've not looked closely at what triggered the ChangeDeletionNotification jobs, to some extent it does not really matter: this kind of event is always possible as long as those jobs share a consumer with the upload jobs. Given that, I'll aim to prioritize T379035 and T378609 this week.

[0] https://grafana.wikimedia.org/goto/mXvr7oGHg?orgId=1

[1] https://grafana.wikimedia.org/goto/7dz67TMNg?orgId=1

[2] https://grafana.wikimedia.org/goto/sANWVoGNR?orgId=1
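
For illustration, here is a minimal sketch of the head-of-line blocking described above, under simplified assumptions: a single shared consumer draining jobs strictly in arrival order, a made-up burst size, and a made-up comparison with a dedicated consumer. This is not the actual changeprop/Kafka implementation, just the failure mode in miniature.

```python
# Toy model of the isolation failure: not the actual changeprop/Kafka
# consumer, just an illustration of head-of-line blocking.
from collections import deque

def wait_for_upload_job(shared_consumer: bool) -> int:
    """Count how many jobs an AssembleUploadChunks job waits behind."""
    # Assumed workload: a burst of 10,000 ChangeDeletionNotification jobs
    # arrives just ahead of a single AssembleUploadChunks job.
    burst = ["ChangeDeletionNotification"] * 10_000
    upload = ["AssembleUploadChunks"]

    if shared_consumer:
        # One shared low-traffic consumer drains everything in arrival
        # order, so the upload job sits behind the entire burst.
        queue = deque(burst + upload)
    else:
        # A dedicated consumer for upload jobs only ever sees upload
        # jobs, so the burst cannot delay it.
        queue = deque(upload)

    waited = 0
    while queue:
        if queue.popleft() == "AssembleUploadChunks":
            return waited
        waited += 1
    return waited

print(wait_for_upload_job(shared_consumer=True))   # 10000 jobs ahead
print(wait_for_upload_job(shared_consumer=False))  # 0 jobs ahead
```

Any shared consumer is exposed to this kind of delay whenever an unrelated job type bursts, which is why the remediation below moves the upload jobs onto dedicated rules.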

jijiki triaged this task as High priority. Nov 11 2024, 1:06 PM
jijiki moved this task from Incoming 🐫 to Production Errors 🚜 on the serviceops board.

Alright, as noted in T379035#10318390, since ~ 17:50 UTC today all three job types that are critical to (async) uploads have been lifted out of the low-traffic consumer and into dedicated rules. Thus, they should no longer be exposed to this kind of isolation failure.
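
Conceptually, the routing change looks something like the sketch below. This is only an illustration: the real rules live in the changeprop-jobqueue configuration, and only AssembleUploadChunks is named in this task, so the other two job-type names here are assumptions.

```python
# Hypothetical sketch of the routing change, not the real
# changeprop-jobqueue configuration.

# Upload-critical job types now matched by dedicated rules.
# Only AssembleUploadChunks is named in this task; the other two
# entries are assumed for illustration.
DEDICATED_UPLOAD_JOB_TYPES = {
    "AssembleUploadChunks",
    "PublishStashedFile",
    "UploadFromUrl",
}

def consumer_for(job_type: str) -> str:
    """Route a job type to its dedicated rule if it has one, otherwise
    fall back to the shared low-traffic consumer."""
    if job_type in DEDICATED_UPLOAD_JOB_TYPES:
        return f"dedicated-{job_type}"
    return "low-traffic-jobs"

# A ChangeDeletionNotification burst still lands on the shared consumer,
# but it can no longer sit in front of upload jobs.
assert consumer_for("AssembleUploadChunks") == "dedicated-AssembleUploadChunks"
assert consumer_for("ChangeDeletionNotification") == "low-traffic-jobs"
```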

Given that, I'll mark this task resolved, while work continues in parallel in T378609 on the monitoring aspect (i.e., so we can respond reactively and isolate antagonistic workloads from other job types).