We will be failing over the Toolforge and Project NFS in 10 minutes to move the main interface to 10Gb Ethernet. The previous work should make this fairly non-disruptive, but we have believed that before as well.
Brooke Storm
Cloud Service Team
In yet another effort to restore replication and preserve the redundancy of the data in ToolsDB (the user-writable database in Toolforge), we need to take the database (tools.db.svc.eqiad.wmflabs) completely offline at 1700 UTC on 16 Dec. Apps that depend on the ToolsDB service will fail during the outage, which will last at least an hour. We aren't sure exactly how long it will take, so expect multiple hours. This will still be much faster than the last outage because we are doing a straight copy of the binary database files between the servers. Details of this mess and of the efforts to restore the replication service can be found at https://phabricator.wikimedia.org/T266587
If we succeed in producing a viable copy of the database on another system, we will also perform an upgrade on the hypervisor it is on before closing the maintenance period. That should be an additional hour or so.
We appreciate your patience with this process. It is very important that we establish a second copy of this database, especially in light of recent crashes (https://phabricator.wikimedia.org/T253738).
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
Hi there!
Today, 2020-12-10, at 15:30 UTC we will upgrade the Toolforge
Kubernetes cluster [0].
We don't expect any major disruption of the service, but past upgrades have
shown that some components might be restarted, causing brief interruptions of
network flows.
Given the number of worker nodes we have (more than 50), the operation will take
us at least a couple of hours.
Tool maintainers: you don't have to do anything during this operation, but if
you notice anything unusual, please contact us in the Phabricator task [0],
in the IRC channel #wikimedia-cloud, or on the cloud(a)lists.wikimedia.org [1]
mailing list.
Regards.
[0] https://phabricator.wikimedia.org/T263284
[1] https://lists.wikimedia.org/mailman/listinfo/cloud
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation