Switch Datacenter: Difference between revisions

From Wikitech
For example, if currently our primary DC is <code>codfw</code> and for the upcoming switchover we will be switching to <code>eqiad</code>, '''the direction for a live test is eqiad→codfw:'''


cumin1002:~# cookbook sre.switchdc.mediawiki --live-test --task-id TXXXXXX --ro-reason "Datacenter MediaWiki switchover live-test" '''eqiad codfw'''
<entering cookbook menu>
> 00-reduce-ttl


'''Note:''' If circular replication is not yet enabled everywhere, the <code>Check that all core primaries in DC_TO are in sync with the core primaries in DC_FROM</code> step of the <code>03-set-db-readonly</code> cookbook will fail, but the error is suppressed in <code>--live-test</code> mode. Consider checking with [[SRE/Data_Persistence/About| Data Persistence]] about whether this is expected to fail.


==== Dry Run ====
=== Preparation - a few days before ===
<u>Data Persistence checklist:</u>
* There is no ongoing long-running maintenance that affects database availability, capacity or lag (schema changes, upgrades, hardware issues, etc.)
* Replication is flowing from eqiad -> codfw and from codfw -> eqiad, set up by running the <code>sre.switchdc.databases.prepare</code> [[Spicerack/Cookbooks#Run a single Cookbook|cookbook]].


<u>Service Operations checklist:</u>
* '''Check capacity in the destination datacentre'''. More specifically, ensure that MediaWiki deployments share at least the same number of pods in both datacentres.
** High-traffic services with a large RO component, such as <code>mw-web</code> and <code>mw-api-ext</code>, may need additional upsizes before Day 1 to accommodate all traffic in the destination datacentre. See {{phabricator|T371273}} for example analysis to determine appropriate upsizes.
* Prepare all patches
** Day 1: Any patches necessary to augment capacity as described above.
** Day 2:
*** [[gerrit:#/c/operations/dns/+/1073897/|#1 Update DNS records for master DBs]]
*** [[gerrit:#/c/operations/dns/+/1073898/|#2 Update DNS records for maintenance host]]
*** [[gerrit:#/c/operations/dns/+/1073899/|#3 geo-maps: set default datacentre]]
*** [[gerrit:#/c//operations/mediawiki-config/+/1073895/|#4 debug.json: List primary DC servers first ]]
** Day 3:
*** [[gerrit:#/c/operations/dns/+/1073900/|#5 update deployment DNS record ]]
*** [[gerrit:#/c/operations/puppet/+/1073894 |#6 update deployment_server on puppet]]


== Per-service switchover instructions ==
GeoDNS (User-facing) Routing:


Use the <code>sre.dns.admin</code> cookbook to depool all GeoDNS services from the source DC. See [[DNS#Change_GeoDNS_/_Depool_a_Site|Change GeoDNS / Depool a Site]] for details. Example run: if you are depooling eqiad,

cookbook sre.dns.admin depool eqiad

(<code>authdns-update</code> is not required, you only need to run the cookbook)


==== Day 8: Switch to Multi-DC again ====
* active/passive ones will be switched over to the alternative DC, per user input


However, there are a few services we completely exclude from this process. These are hardcoded in the [[gerrit:plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/discovery/datacenter.py| sre.discovery.datacenter ]] cookbook (see <code>EXCLUDED_SERVICES</code>).


What the cookbook does is, for each service:


# Reduce the TTL of the DNS discovery records to 10 seconds
# Depool the datacenter we're moving away from in confctl / discovery
# Poll all authoritative nameservers until their responses match the new intent
# Restore the original TTL
# Flush the associated discovery DNS record from all recursive resolver caches
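The depool-and-poll sequence above can be sketched as a small convergence loop. This is an illustrative sketch only, not the cookbook's code: the nameserver names and the <code>query</code> callable are hypothetical stand-ins for real DNS lookups (e.g. dnspython pointed at each authoritative server).

```python
import time

def wait_for_convergence(record, expected, nameservers, query, timeout=60):
    """Poll every authoritative nameserver until all of them answer
    `record` with `expected`, or give up after `timeout` seconds.
    `query(ns, record)` stands in for a real per-server DNS lookup."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        answers = {ns: query(ns, record) for ns in nameservers}
        if all(answer == expected for answer in answers.values()):
            return True  # every authoritative server reflects the new intent
        time.sleep(1)    # brief pause before re-polling the stragglers
    return False

# Toy run with a fake zone in which both nameservers already converged.
fake_zone = {"ns0": "10.2.1.1", "ns1": "10.2.1.1"}
converged = wait_for_convergence("svc.discovery.example", "10.2.1.1",
                                 ["ns0", "ns1"],
                                 lambda ns, rec: fake_zone[ns])
```

The real cookbook additionally restores the TTL and flushes recursor caches once convergence is confirmed, as listed above.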



==== Day 1: Depooling source DC ====


<li><u>Manual</u> '''scap lock:''' Add a scap lock on a separate tmux/screen on the deployment server. This will block any scap deployments, and it will stay there waiting for your input to unlock it.
<code>scap lock --all "Datacenter Switchover - T12345"</code></li>


<li><code>[[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-disable-puppet.py|00-disable-puppet]]</code>: Disables puppet on maintenance hosts in both eqiad and codfw</li>
<li><code>[[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-reduce-ttl.py|00-reduce-ttl]]:</code> Reduces TTL for various DNS discovery entries. '''Make sure that at least 5 minutes (the old TTL) have passed before moving to Phase 1. The cookbook should force you to wait anyway'''.</li>


<li>(Optional-Skip)<code>[[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-optional-warmup-caches.py|00-optional-warmup-caches]]:</code> Warms up shared (e.g., Memcache) and local (e.g., APCu) caches in <code>DC_TO</code> using the mediawiki-cache-warmup tool. The warmup queries will repeat automatically until the response times stabilize, and include:
* The global "urls-cluster" warmup against <code>mw-web</code>.
* The "urls-server" warmup against all pods in each of <code>mw-web</code>, <code>mw-api-ext</code>, and <code>mw-api-int</code>.</li>


<li><code>[[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-downtime-db-readonly-checks.py|00-downtime-db-readonly-checks]]:</code> Sets downtime for Read only checks on mariadb masters changed on Phase 3 so they don't page.
* You can confirm downtimes have been set at https://icinga.wikimedia.org (navigate to Downtime > Scheduled Service Downtime).</li>
</ol>
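The mandated wait in <code>00-reduce-ttl</code> above follows from how DNS caching works: a recursive resolver that cached a discovery record just before the TTL was lowered may keep serving the stale answer for up to the old TTL (5 minutes). A minimal sketch of that bound, for illustration only; the cookbook enforces the wait itself:

```python
OLD_TTL = 300  # seconds -- the old 5-minute TTL mentioned above

def earliest_safe_time(ttl_reduced_at, old_ttl=OLD_TTL):
    """A record cached at the moment the TTL was reduced expires no
    later than this; after that, all caches honour the new, short TTL."""
    return ttl_reduced_at + old_ttl

# If the TTL was reduced at t=1000s, caches may be stale until t=1300s.
safe_at = earliest_safe_time(1000)
```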


==== Phase 1 - stop maintenance ====
*<code>[[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/01-stop-maintenance.py|01-stop-maintenance]]:</code> Stops maintenance jobs and kills all the periodic jobs (systemd timers) on maintenance hosts in both datacenters. Keep in mind there is a chance of a manual job running. Check again with your peers; usually the way forward is to kill the job by force.
** The logic to validate that all timers were disabled can fail if any unit is in a failed state. If this happens, you can clear failed states with <code>sudo systemctl reset-failed <failed unit></code> on the maintenance host, and then re-run the cookbook.


{{Note|type=error|text='''Final GO/NOGO before read-only''': Ask what time it is. ''<u>This is the point of no return.</u> The following steps until Phase 7 need to be executed in quick succession to minimise read-only time''}}


==== Phase 2 - read-only mode ====


*<code> [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/04-switch-mediawiki.py|04-switch-mediawiki]]:</code> Switches the discovery records and MediaWiki active datacenter
** Flips <code> [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/__init__.py#26| MEDIAWIKI_SERVICES]]</code> to <code>pooled=true</code> in destination DC
** Flips <code>WMFMasterDatacenter</code> from <code>DC_FROM</code> to <code>DC_TO</code>
** Flips <code> [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/__init__.py#26| MEDIAWIKI_SERVICES]]</code> to <code>pooled=false</code> in source DC


After this, DNS will be changed for the source DC and internal applications (except MediaWiki) will start hitting the new DC.
</li>


<li><u>Manual</u> [[gerrit:#/c/operations/dns/+/1073897/|#1 Update DNS records for master DBs]]: merge and run <code>authdns-update</code>
* Please use the following to SAL log: <code>!log Phase 9: Update DNS records for new database masters</code>
</li>
<li>
<code>[[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/09-run-puppet-on-db-masters.py|09-run-puppet-on-db-masters]]:</code> Runs Puppet on the database masters in both DCs, to update expected read-only state.
* This also removes the downtimes set in Phase 0.
* As before, you can confirm the downtimes have been removed at https://icinga.wikimedia.org (navigate to Downtime > Scheduled Service Downtime).
</li>


</li>
<li>
<u>Manual</u> [[gerrit:#/c/operations/dns/+/1073898/|#2 Update DNS records for maintenance host]]: merge and run <code>authdns-update</code>
</li>
<li>
<u>Manual</u> [[gerrit:#/c/operations/dns/+/1073899/|#3 geo-maps: set default datacentre]]: merge and run <code>authdns-update</code>
* This default only affects a small portion of traffic, so this is mostly about logical consistency (when we have no idea where to route a request, we prefer the primary DC).
</li>
<li>
<u>Manual</u> [[gerrit:#/c//operations/mediawiki-config/+/1073895/|#4 debug.json: List primary DC servers first ]]: Re-order noc.wm.o's debug.json to have primary servers listed first, see [[phab:T289745|T289745]]. Run [[Backport_windows/Deployers#Using_scap_backport|scap backport]] to deploy.
</li>
</ol>
* <code>curl -s -H 'Accept: application/json' https://stream.wikimedia.org/v2/stream/recentchange | jq .</code></li>
<li> <u>Manual</u> '''Email:''' Ensure email works via [[mw:Special:EmailUser|test an email]]
. For the following commands, the total messages in queue should mostly be 0, but the value will fluctuate as new mail is received and then sent.</li>
<code> mx-in1001:~$ sudo watch qshape </code>
<br>
<code> mx-out1001:~$ sudo watch qshape </code>
</ol>


==== Dashboards ====


* [https://grafana.wikimedia.org/d/a46755c2-ea4c-4cff-92c4-58ec0cf77fa9/mw-on-k8s-overview?orgId=1&from=now-3h&to=now&refresh=1m MW-on-k8s overview]
* High-traffic services:
** [https://grafana.wikimedia.org/d/AhCt7bdVk/mw-web?orgId=1&refresh=1m&from=now-3h&to=now mw-web]
** [https://grafana.wikimedia.org/d/_qKzVxO4z/mw-api-ext?orgId=1&refresh=1m&from=now-3h&to=now mw-api-ext]
** [https://grafana.wikimedia.org/d/t7EiVbdVk/mw-api-int?orgId=1&refresh=1m&from=now-3h&to=now mw-api-int]
** [https://grafana.wikimedia.org/d/MVDqnbOVk/mw-jobrunner?orgId=1&from=now-3h&to=now&refresh=1m mw-jobrunner]
** [https://grafana.wikimedia.org/d/aSiSoKoSk/mw-parsoid?orgId=1&from=now-3h&to=now&refresh=1m mw-parsoid]
*[https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=esams%20prometheus%2Fops&var-layer=backend&var-cluster=text&from=now-3h&to=now&refresh=1m ATS cluster view (text)]
*[https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=text&var-origin=mw-api-ext.discovery.wmnet&var-origin=restbase.discovery.wmnet&var-origin=mw-web.discovery.wmnet&var-origin=mw-web-ro.discovery.wmnet&var-origin=mw-api-ext-ro.discovery.wmnet&var-site=All&from=now-3h&to=now ATS backends<->Origin servers overview (mw-web(-ro), mw-api-ext(-ro), restbase) (text)]
*[https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors Logstash: mediawiki-errors]


=== Databases ===
Main document: [[MariaDB/Switch Datacenter]]

Once we're confident that the switchover will not be rolled back run the <code>sre.switchdc.databases.finalize</code> [[Spicerack/Cookbooks#Run a single Cookbook|cookbook]].


=== Other miscellaneous ===


{{Tracked|T370962}}
'''March 2025'''
* Services + Traffic: [https://zonestamp.toolforge.org/1742310000 Tuesday, 18 March 2025 14:00 UTC]
* MediaWiki: [https://zonestamp.toolforge.org/1742396400 Wednesday, 19 March 2025 14:00 UTC]


== Past Switches ==
* MediaWiki: Wednesday, March 20th, 2024 14:00UTC
* Read only: 3 minutes 8 seconds
{{tracked|T370962}}
'''September'''

* Services + Traffic: Tuesday, 24 September 2024 @ 15:00 UTC
* MediaWiki: Wednesday, 25 September 2024 @ 15:00 UTC
* Read only: 2 minutes 46 seconds


=== 2023 switches ===
{{tracked|T345263}}
;September
* Services + Traffic: [https://zonestamp.toolforge.org/1695132049 Tuesday, September 19th, 2023 14:00 UTC]
* MediaWiki: [https://zonestamp.toolforge.org/1695218452 Wednesday, September 20th, 2023 14:00 UTC]
{{tracked|T327920}}

;February
* Services: [https://zonestamp.toolforge.org/1677592832 Tuesday, February 28th, 2023 14:00 UTC]
* Traffic: [https://zonestamp.toolforge.org/1677596427 Tuesday, February 28th, 2023 15:00 UTC]
* MediaWiki: [https://zonestamp.toolforge.org/1677679227 Wednesday, March 1st, 2023 14:00 UTC]
* [[listarchive:list/wikitech-l@lists.wikimedia.org/thread/QXNSWHT7G2TUZRTYKLOGJR7IHEAHXWK7/|Recap]]
* Read only: 1 minute 59 seconds



'''Switching back:'''


=== 2021 switches ===
{{tracked|T281515}}

;Schedule
* Services: [https://zonestamp.toolforge.org/1624888854 Monday, June 28th, 2021 14:00 UTC]
* Traffic: [https://zonestamp.toolforge.org/1624892434 Monday, June 28th, 2021 15:00 UTC]
* MediaWiki: [https://zonestamp.toolforge.org/1624975258 Tuesday, June 29th, 2021 14:00 UTC]


'''Switching back:'''

{{Tracked|T287539}}
* Services: [https://zonestamp.toolforge.org/1631541650 Monday, Sept 13th 14:00 UTC]
*Traffic: [https://zonestamp.toolforge.org/1631545256 Monday, Sept 13th 15:00 UTC]
*MediaWiki: [https://zonestamp.toolforge.org/1631628029 Tuesday, Sept 14th 14:00 UTC]
;Reports
[[listarchive:list/wikitech-l@lists.wikimedia.org/message/6UZCCACCBCZLN5MHROZQXUG6ZOQTDCLO/|Datacenter switchover recap]] on wikitech-l
* Read only duration: 2 minutes 42 seconds


=== 2020 switches ===
{{tracked|T243314}}
;Schedule
* Services: Monday, August 31st, 2020 14:00 UTC
* Traffic: Monday, August 31st, 2020 15:00 UTC
* MediaWiki: Tuesday, September 1st, 2020 14:00 UTC
;Reports
[[Incident documentation/2020-09-01 data-center-switchover]]
* Read only duration: 2 minutes 49 seconds


=== 2018 switches ===
{{tracked|T199073}}
;Schedule
* Services: Tuesday, September 11th 2018 14:30 UTC
*Media storage/Swift: Tuesday, September 11th 2018 15:00 UTC
*Traffic: Tuesday, September 11th 2018 19:00 UTC

Latest revision as of 16:31, 7 November 2024

Introduction

Datacenter switchovers are a standard response to certain types of situations, in which traffic is shifted from one site to another. Technology organisations regularly practice them to ensure that tooling and hardware will respond appropriately in case of an emergency. Moreover, switching between datacenters makes room for potentially disruptive maintenance work on inactive servers, such as database upgrades/changes, hardware replacement, etc. In other words, while we're serving traffic from the active datacentre, we do our regular upkeep work on the inactive one to maintain its efficiency and reliability.

What?

At Wikimedia, a datacentre switchover means switching over different components between our two main datacentres: eqiad and codfw.

When?

We perform two datacenter switchovers annually, during the week of the solar equinox:

  • Northward: ~21st March
  • Southward: ~21st September

See #Upcoming Switches for the next switchover dates, and Switch Datacenter/Switchover Dates for a pre-calculated list of switchover dates through 2050.

Our switchover process is broken down into stages, where some can progress independently, while others need to progress in lockstep. This page documents all the steps needed for this work, broken down by component. SRE/Service_Operations is driving the process and maintains the software necessary to run the switchover, with a little help from their friends.

Impact

Impact of a switchover is expected to be 2-3 minutes of read-only for MediaWiki, including extensions. Any other services/features/infrastructure not participating directly in the switchover will continue to work as normal. However, anything relying on MediaWiki indirectly (e.g. via some data pipeline) may experience some minor impact, for example a delay in receiving events. This is expected.

What does read-only mean?

Read-only is a two-step process: we first set MediaWiki itself read-only, and then the MediaWiki databases. We allow some time between the two so that the last in-flight edits can land safely. All read-only functionality will continue to work as usual.

During read-only, any kind of writes reaching our MediaWiki databases (UPDATE, DELETE, INSERT in SQL terms) will be denied. Additionally, any features ignoring the global MediaWiki read-only configuration will not function during this time window. This scheduled read-only period adds about 0.001% of MediaWiki edit unavailability per year.
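The 0.001% figure is straightforward arithmetic: two scheduled switchovers per year, each read-only for roughly three minutes. A back-of-the-envelope check (using the upper end of the 2-3 minute estimate above):

```python
# Yearly MediaWiki edit unavailability from scheduled switchovers.
switchovers_per_year = 2
read_only_minutes = 3             # upper end of the 2-3 minute window
minutes_per_year = 365 * 24 * 60  # 525600

unavailability_pct = 100 * switchovers_per_year * read_only_minutes / minutes_per_year
# ~0.00114%, i.e. roughly the quoted 0.001% per year
```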

Note: Non-MediaWiki databases are not part of the switchover.

High Level switchover flow

Scheduling details

Datacenter switchovers take place in the work week of a solar equinox, where we assume that the northward solar equinox happens on March 21st and the southward solar equinox on September 21st. This deliberately does not match the exact astronomical event.
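For illustration, the Day 1/Day 2 dates can be derived from the assumed equinox dates. This helper is hypothetical (the pre-calculated list at Switch Datacenter/Switchover Dates is authoritative), and the move-to-next-week rule for a 21st that falls on a weekend is inferred from past schedules (e.g. September 2024):

```python
from datetime import date, timedelta

def switchover_days(year, month):
    """Tuesday (Services+Traffic) and Wednesday (MediaWiki) of the work
    week used for the switchover, assuming the equinox is the 21st and
    that a 21st falling on a weekend pushes the work to the next week."""
    equinox = date(year, month, 21)
    monday = equinox - timedelta(days=equinox.weekday())
    if equinox.weekday() >= 5:  # Saturday/Sunday -> use the next work week
        monday += timedelta(days=7)
    return monday + timedelta(days=1), monday + timedelta(days=2)

# March 2025: the 21st is a Friday, so Day 1 is Tue the 18th and
# Day 2 is Wed the 19th, matching the dates under "Upcoming Switches".
tue, wed = switchover_days(2025, 3)
```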

Disruptive operations such as the MediaWiki Switchover (see below) will target 14:00 UTC as their start time. However, SRE Service Operations reserves the right to adjust this by up to +/- 2h with sufficient prior notification.

A controlled switchover occurs in a span of 8 days:

Day 1 - Tuesday: Traffic+Services

Non read-only parts of the switchover always take place on a Tuesday. This process is non-disruptive and lower-risk; it may be scheduled at 14:00 UTC, but that is not necessary.

  • Traffic: Disable caching in the origin datacenter - Switch_Datacenter#Traffic
    • ~20 minutes for disabling caching completely from origin dc_from datacentre
  • Services: Depool services in the origin datacenter to destination - Switch_Datacenter#Services
    • ~15-40 minutes to switchover services to destination dc_to
      • Leave active/active services pooled only to destination dc_to
      • Switchover active/passive services from origin dc_from to destination dc_to

Day 2 - Wednesday: MediaWiki

The MediaWiki switchover (read-only) will always take place on the Wednesday of the above-mentioned week. During read-only (2-3 minutes), no wikis will be editable and editors will see a warning message asking them to try again later. Read-only starts at 14:00 UTC (subject to change with sufficient prior notification, as noted above). Readers should experience no changes for the entirety of the event.

  • Switch Mediawiki itself to destination datacentre Switch_Datacenter#MediaWiki
    • ~35 minutes for a complete run of the cookbook, from disabling puppet to re-enabling it, if timed right for the read-only part of the cookbook to fall at the start of the announced window. Doing it in an emergency can be done faster since there is no need to wait for a set time.

Note: For the next 7 calendar days after the MW read-only phase, traffic will flow solely to the destination datacentre, rendering the other datacentre effectively inactive.

Day 3 - Thursday: Deployment Server + Special cases

At your convenience, after coordinating with deployers, you may switch the Special cases.

Day 8 - Wednesday: Pool back inactive DC

A week later, we activate caching and services in the inactive/secondary datacenter again. With traffic flowing to both DCs, we are back in the normal Multi-DC mode. This period may be extended, depending on how maintenance work progresses at the inactive DC.

Note: As of September 2023, we are running each datacenter as primary for half of the year. The two datacentres are considered coequal, alternating roles every 6 months.

Weeks in advance: communication, testing, and preparation

Communication - 10 weeks before

See Switch_Datacenter/Coordination, coordinate dates and communication plan with involved groups.

Testing - 3 weeks before

Run a "live test" of the MediaWiki cookbook, and a dry-run for everything.

Depending on what changes have occurred to our infrastructure/production from the previous switchover, code changes in cookbooks are expected. The purpose of the live-test and the dry-run is to test most of the existing and updated codepaths, and identify potential issues there.

Note: Always use the --dry-run flag when running cookbooks for testing purposes

Live Test

The live test (--live-test) flag will skip actions that could harm the primary DC, or perform them on the secondary DC instead; it is available only for the sre.switchdc.mediawiki cookbook. Be careful that the direction is reversed: we "switch" from the currently secondary DC to the currently primary DC. While the live-test process will log your actions to SAL, please remember to announce in #wikimedia-sre and #wikimedia-operations that you will be running this test. Unless something goes really badly, this is a non-disruptive test.

For example, if currently our primary DC is codfw and for the upcoming switchover we will be switching to eqiad, the direction for a live test is eqiad→codfw:

cumin1002:~# cookbook sre.switchdc.mediawiki --live-test --task-id TXXXXXX --ro-reason "Datacenter MediaWiki switchover live-test" eqiad codfw
<entering cookbook menu>

> 00-disable-puppet
> 00-reduce-ttl 

Note: If circular replication is not yet enabled everywhere, the Check that all core primaries in DC_TO are in sync with the core primaries in DC_FROM step of the 03-set-db-readonly cookbook will fail, but the error is suppressed in --live-test mode. Consider checking with Data Persistence about whether this is expected to fail.

Dry Run

A dry-run is available for both cookbooks we use during a switchover; sre.switchdc.mediawiki and sre.discovery.datacenter. During a dry-run, the direction is the one we have announced.

For example, if we are currently on codfw, switching over to eqiad, a dry-run's direction would be codfw→eqiad, as follows:

cumin1002:~# cookbook --dry-run sre.switchdc.mediawiki codfw eqiad
<entering cookbook menu>

> 00-disable-puppet
> 00-reduce-ttl
cumin1002:~# cookbook --dry-run sre.discovery.datacenter depool codfw \
                      --all --reason "Datacenter services switchover dry-run" \
                      --task-id T357547

Preparation - a few days before

Data Persistence checklist:

  • There is no ongoing long-running maintenance that affects database availability, capacity or lag (schema changes, upgrades, hardware issues, etc.)
  • Replication is flowing from eqiad -> codfw and from codfw -> eqiad, enabled by running the sre.switchdc.databases.prepare cookbook.

Service Operations checklist:

Per-service switchover instructions

Traffic

General procedure

See: Global traffic routing.

Day 1: Depool source datacentre

Make sure you have gone through Testing and Preparation, including patches.

GeoDNS (User-facing) Routing:

Use the sre.dns.admin cookbook to depool all GeoDNS services from the source DC. See Change GeoDNS / Depool a Site for details. Example run: if you are depooling eqiad,

cookbook sre.dns.admin depool eqiad

(authdns-update is not required, you only need to run the cookbook)
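To double-check that the depool took effect, you can query an authoritative nameserver directly. This is a hedged sketch: the nameserver, record name, and address in the usage example are assumptions for illustration, not values taken from the cookbook.

```shell
# Print the A record a given authoritative nameserver currently returns
# for a name; useful for spot-checking GeoDNS state after a (de)pool.
served_addr() {
    dig +short "@$1" "$2" A
}

# Placeholder usage (names and IP are assumptions, verify before relying on them):
#   if [ "$(served_addr ns0.wikimedia.org dyna.wikimedia.org)" != "208.80.154.224" ]; then
#       echo "eqiad's text-lb address is no longer being served"
#   fi
```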

Day 8: Switch to Multi-DC again

Same procedure as above, just using the pool command rather than depool.

Dashboards

Services

General procedure

For a global switchover we use the sre.discovery.datacenter cookbook to depool all services from a DC:

  • active-active services in DNS discovery will be depooled from said DC
  • active/passive ones will be switched over to the alternative DC, per user input

However, there are a few services we completely exclude from this process. These are hardcoded in the sre.discovery.datacenter cookbook (see EXCLUDED_SERVICES).

For each service, the cookbook will:

  1. Depool the datacenter we're moving away from in confctl / discovery
  2. Poll all authoritative nameservers until their responses match the new intent
  3. Flush the associated discovery DNS record from all recursive resolver caches
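The polling in step 2 can be sketched roughly as follows. The record name, expected address, and nameserver names in the example are placeholders; the real cookbook uses its own internal checks.

```shell
# Return 0 once the nameserver ($1) answers the record ($2) with the expected IP ($3).
record_matches() {
    [ "$(dig +short "@$1" "$2" A)" = "$3" ]
}

# Poll one nameserver until it serves the new answer, one second between tries.
wait_for_propagation() {
    ns=$1
    record=$2
    expected=$3
    until record_matches "$ns" "$record" "$expected"; do
        sleep 1
    done
    echo "$ns serves $record -> $expected"
}

# Placeholder usage (names and IP are assumptions):
#   for ns in ns0.wikimedia.org ns1.wikimedia.org ns2.wikimedia.org; do
#       wait_for_propagation "$ns" appservers-rw.discovery.wmnet 10.2.2.1
#   done
```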

Day 1: Depooling source DC

Make sure you have gone through Testing and Preparation, including patches.

Before depooling any service, do not forget to review (and copy/paste) the current status of all services by running:

cookbook sre.discovery.datacenter status all

The following command will depool all active/active services from a DC, and will prompt to move or skip the active/passive ones.

# Switch all services to eqiad
$ sudo cookbook sre.discovery.datacenter depool codfw --all --reason "Datacenter Switchover" --task-id T12345

Day 8: Switch to Multi-DC again

The following command will repool all active/active services to a DC, and will prompt to move or skip the active/passive ones.

# Repool codfw
$ sudo cookbook sre.discovery.datacenter pool codfw --reason "Datacenter switch to Multi-DC" --task-id T12345

MediaWiki

We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while subtasks are executed sequentially. The phase number is referred to in the names of the tasks in the operations/cookbooks repository, under the cookbooks/sre/switchdc/mediawiki/ path.

Day 2: MediaWiki Switchover

Make sure you have gone through Testing and Preparation, including patches.
Audible indicator: Put Listen to Wikipedia in the background during the switchover. Silence indicates read-only; when it starts to make sounds again, edits are back up.

Execution tip: The best way to run this multi-step cookbook is to start it in interactive mode from the cookbook root:
sudo cookbook sre.switchdc.mediawiki --ro-reason 'DC switchover (TXXXXXX)' codfw eqiad

and proceed through the steps.
Start the following steps about 30-60 minutes before the scheduled switchover time, in a tmux or screen session.

Phase 0 - preparation

  1. Manual StatusPage: Add a scheduled maintenance (Maintenances -> Schedule Maintenance)
  2. Manual scap lock: Add a scap lock on a separate tmux/screen on the deployment server. This will block any scap deployments, and it will stay there waiting for your input to unlock it. scap lock --all "Datacenter Switchover - T12345"
  3. 00-disable-puppet: Disables puppet on maintenance hosts in both eqiad and codfw
  4. 00-reduce-ttl: Reduces TTL for various DNS discovery entries. Make sure that at least 5 minutes (the old TTL) have passed before moving to Phase 1. The cookbook should force you to wait anyway.
  5. (Optional-Skip) 00-optional-warmup-caches: Warms up shared (e.g., Memcached) and local (e.g., APCu) caches in DC_TO using the mediawiki-cache-warmup tool. The warmup queries will repeat automatically until the response times stabilize, and include:
    • The global "urls-cluster" warmup against mw-web.
    • The "urls-server" warmup against all pods in each of mw-web, mw-api-ext, and mw-api-int.
  6. 00-downtime-db-readonly-checks: Sets downtime for the read-only checks on the MariaDB masters that will change in Phase 3, so they don't page.
Stop for GO/NOGO: Ask your peers for Go or NoGo
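The TTL reduction in step 4 can be spot-checked with dig before moving on. A hedged sketch: the record name in the example is a placeholder, and the reduced value (10 seconds here) is an assumption.

```shell
# Print the remaining TTL for a record: dig's answer section puts the TTL
# in the second column of each answer line.
record_ttl() {
    dig +noall +answer "$1" | awk '{print $2; exit}'
}

# Placeholder usage: proceed only once the discovery TTL is small.
#   [ "$(record_ttl appservers-rw.discovery.wmnet)" -le 10 ] && echo "TTL reduced"
```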

Phase 1 - stop maintenance

  • 01-stop-maintenance: Stops maintenance jobs in both datacenters and kills all periodic jobs (systemd timers) on maintenance hosts in both datacenters. Keep in mind there is a chance a manual job is still running. Check again with your peers; usually the way forward is to kill the job by force.
    • The logic to validate that all timers were disabled can fail if any unit is in a failed state. If this happens, you can clear failed states with sudo systemctl reset-failed <failed unit> on the maintenance host, and then re-run the cookbook.
Final GO/NOGO before read-only: Check the time and ask your peers for the final Go/NoGo. This is the point of no return. The following steps, up to Phase 7, need to be executed in quick succession to minimise read-only time.
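If the timer validation in 01-stop-maintenance trips over failed units, they can be listed and cleared in one go. A hedged sketch: the systemctl flags are standard, but run this on the maintenance host and sanity-check the list before clearing anything.

```shell
# Print the names of units currently in a failed state: the unit name is the
# first column of "systemctl --failed" output, with --plain/--no-legend
# stripping the bullet decoration and the header/footer.
failed_units() {
    systemctl --failed --plain --no-legend | awk '{print $1}'
}

# Clear each failed unit so the cookbook's timer validation can pass:
#   for u in $(failed_units); do sudo systemctl reset-failed "$u"; done
```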

Phase 2 - read-only mode

  • 02-set-readonly: Sets read-only mode by changing the ReadOnly conftool value

Phase 3 - lock down database masters

  • 03-set-db-readonly: Puts the origin DC's (DC_FROM) core DB masters (shards: s1-s8, x1, es4-es5) in read-only mode and waits for the destination DC's (DC_TO) databases to catch up with replication
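The replication catch-up wait can also be spot-checked by hand. A hedged sketch, assuming the db-mysql wrapper is available on the cumin host and using placeholder hostnames; the cookbook performs its own, more thorough comparison.

```shell
# Print the replica's executed binlog position (Exec_Master_Log_Pos) for a
# given database host. The host name passed in is a placeholder.
exec_master_log_pos() {
    sudo db-mysql "$1" -e 'SHOW SLAVE STATUS\G' | awk '/Exec_Master_Log_Pos/ {print $2}'
}

# Placeholder usage: compare against "Position" from SHOW MASTER STATUS on the
# DC_FROM primary until the two values stop diverging.
#   exec_master_log_pos db2103
```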

Phase 4 - switch active datacenter configuration

After this, DNS will be changed for the source DC and internal applications (except MediaWiki) will start hitting the new DC.

Phase 5 - DEPRECATED - Invert Redis replication for MediaWiki sessions

Phase 6 - Set new site's databases to read-write

  • 06-set-db-readwrite: Sets destination DC's core DB masters (shards: s1-s8, x1, es4-es5) in read-write mode

Phase 7 - Set MediaWiki to read-write

  • 07-set-readwrite: Goes back to read-write mode by changing the ReadOnly conftool value
You are now out of read-only mode.

Take a breath, smile!

Phase 8 - Restore rest of MediaWiki

  1. 08-restart-envoy-on-jobrunners: Restarts pods on the (now) inactive jobrunners, triggering changeprop to re-resolve the DNS name and connect to the destination DC
    • A steady rate of 500s is expected until this step is completed, as changeprop may still be sending edits to the source DC, where the database master will reject them.
  2. 08-start-maintenance: Starts maintenance on destination DC
    • Runs puppet on the maintenance hosts, which will reactivate systemd timers in destination DC
    • Most Wikidata-editing bots will restart once this is done and the "dispatch lag" has recovered. This should bring us back to 100% of editing traffic.
  3. Manual StatusPage: End the planned maintenance

Phase 9 - Post read-only

  1. 09-restore-ttl: Sets the TTL for the DNS records to 300 seconds again
  2. Manual #1 Update DNS records for master DBs: merge and run authdns-update
    • Please use the following to SAL log: !log Phase 9: Update DNS records for new database masters
  3. 09-run-puppet-on-db-masters: Runs Puppet on the database masters in both DCs, to update expected read-only state.
  4. Manual CentralNotice banner: Ensure the banner informing users of the read-only period is removed. There is some minor HTTP caching involved (~5 min) here too.
  5. Manual Scap lock: Go back to the terminal where you added the lock and press enter
  6. Manual #2 Update DNS records for maintenance host: merge and run authdns-update
  7. Manual #3 geo-maps: set default datacentre: merge and run authdns-update
    • This default only affects a small portion of traffic, so this is mostly about logical consistency (when we have no idea where to route a request, we prefer the primary DC).
  8. Manual #4 debug.json: List primary DC servers first : Re-order noc.wm.o's debug.json to have primary servers listed first, see T289745. Run scap backport to deploy.

Phase 10 - verification and troubleshooting

  1. Manual Reading and Editing: Ensure they work! :)
  2. Manual Recent Changes: Ensure recent changes are flowing
  3. Manual Email: Ensure email works by sending a test email. For the following commands, the Total messages in queue should mostly be 0, though the value will fluctuate as new mail is received and then sent:
     mx-in1001:~$ sudo watch qshape
     mx-out1001:~$ sudo watch qshape
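If you prefer a scriptable check over watching the output, qshape's TOTAL row can be parsed. Treat this as a hedged sketch, since qshape's exact output layout may vary between Postfix versions.

```shell
# Print the total number of queued messages: qshape's summary row starts with
# "TOTAL" and (here, by assumption) carries the total count in its second field.
queue_total() {
    qshape "$@" | awk '$1 == "TOTAL" { print $2; exit }'
}

# Placeholder usage:
#   [ "$(queue_total)" -eq 0 ] && echo "mail queue is empty"
```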

Dashboards

ElasticSearch

General context on how to switchover

CirrusSearch talks by default to the local datacenter ($wmgDatacenter). No special actions are required when disabling a datacenter.

Manually switching CirrusSearch to a specific datacenter can always be done. Point CirrusSearch to codfw by editing $wgCirrusSearchDefaultCluster in ext-CirrusSearch.php.

To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.

Dashboards


Special cases

Exclusions

Exclusions have been implemented in the switchover cookbook. The next section is kept for historical and informational purposes; while it will probably not be needed, it's still useful information to have around.

If you need to exclude services, using the old sre.switchdc.services cookbook is still necessary until exclusion is implemented:

# Switch all services to codfw, excluding parsoid and cxserver
$ sudo cookbook sre.switchdc.services --exclude parsoid cxserver -- eqiad codfw

Single service

If you are switching only one service, using the old sre.switchdc.services cookbook is still necessary:

# Switch the service "parsoid" to codfw-only
$ sudo cookbook sre.switchdc.services --services parsoid -- eqiad codfw

apt

During the March 2023 switchover, we identified issues with switching over apt.wikimedia.org. As of the September 2023 switchover, those haven't been solved yet, and apt.wikimedia.org won't participate in the switchover.

apt.wikimedia.org needs a puppet change

restbase-async

As of September 2023, this is no longer needed: we leave restbase-async pooled in both DCs from now on. This is kept in the doc for historical purposes.

Restbase-async is a bit of a special case, being pooled active/passive with the active in the secondary datacenter. As such, it needs an additional step if we're just switching active traffic over and not simulating a complete failover:

  1. pool restbase-async everywhere
    sudo cookbook sre.discovery.service-route --reason T123456 pool --wipe-cache $dc_from restbase-async
    sudo cookbook sre.discovery.service-route --reason T123456 pool --wipe-cache $dc_to restbase-async
    
  2. depool restbase-async in the newly active dc, so that async traffic is separated from real-user traffic as much as possible.
    sudo cookbook sre.discovery.service-route --reason T123456 depool --wipe-cache $dc_to restbase-async
    

When simulating a complete failover, keep restbase pooled in $dc_to for as long as possible to test capacity, then switch it to $dc_from by using the above procedure.

As it is async, we trade the added latency from running it in the secondary datacenter for the lightened load on the primary datacenter's appservers.

Manual switch

These services require manual changes to be switched over and have not yet been included in service::catalog

  • planet.wikimedia.org
    • The DNS discovery name planet.discovery.wmnet needs to be switched from one backend to another as in example change gerrit:891369. No other change is needed.
  • people.wikimedia.org
    • In puppet hieradata the rsync_src and rsync_dst hosts need to be flipped as in example change gerrit:891382.
    • FIXME: manual rsync command has to be run
    • The DNS discovery name peopleweb.discovery.wmnet needs to be switched from one backend to another as in example change gerrit:891381.
  • noc.wikimedia.org This is no longer applicable as of September 2023, noc.wikimedia.org is now active/active in mw-on-k8s.
    • The noc.wikimedia.org DNS name points to DNS discovery name mwmaint.discovery.wmnet that needs to be switched from one backend to another as in example change gerrit:896118. No other change is needed.

Dashboards

Databases

Main document: MariaDB/Switch Datacenter

Once we're confident that the switchover will not be rolled back run the sre.switchdc.databases.finalize cookbook.

Other miscellaneous

Predictable, Recurring Switchovers

A few months after the 2023 switchback, and following a feedback-gathering process, a proposal to move to a predictable set of dates, while also increasing the switchover duration to 6 months, was adopted and turned into a process. The document can be found at the link below:

Recurring, Equinox-based, Data Center Switchovers

Upcoming Switches

See Switch Datacenter/Switchover Dates for a pre-calculated list up to 2050.

March 2025

Past Switches

2024 switches

March

  • Services + Traffic: Tuesday, March 19th, 2024 14:00 UTC
  • MediaWiki: Wednesday, March 20th, 2024 14:00 UTC
  • Read only: 3 minutes 8 seconds

September

  • Services + Traffic: Tuesday, 24 September 2024 @ 15:00 UTC
  • MediaWiki: Wednesday, 25 September 2024 @ 15:00 UTC
  • Read only: 2 minutes 46 seconds

2023 switches

September

  • Services + Traffic: Tuesday, September 19th, 2023 14:00 UTC

February

  • Services: Tuesday, February 28th, 2023 14:00 UTC

Reports

  • Recap
  • Read only: 1 minute 59 seconds

Switching back:

Schedule

Reports

  • Read only: 3 minutes 1 second

2021 switches

Schedule
  • Services: Monday, June 28th, 2021 14:00 UTC
Reports

Switching back:

Reports

Datacenter switchover recap on wikitech-l

  • Read only duration: 2 minutes 42 seconds

2020 switches

Schedule
  • Services: Monday, August 31st, 2020 14:00 UTC
  • Traffic: Monday, August 31st, 2020 15:00 UTC
  • MediaWiki: Tuesday, September 1st, 2020 14:00 UTC
Reports

Incident documentation/2020-09-01 data-center-switchover

  • Read only duration: 2 minutes 49 seconds

Switching back:

  • Traffic: Thursday, September 17th, 2020 17:00 UTC
  • MediaWiki: Tuesday, October 27th, 2020 14:00 UTC
  • Services: Wednesday, October 28th, 2020 14:00 UTC

2018 switches

Schedule
  • Services: Tuesday, September 11th 2018 14:30 UTC
  • Media storage/Swift: Tuesday, September 11th 2018 15:00 UTC
  • Traffic: Tuesday, September 11th 2018 19:00 UTC
  • MediaWiki: Wednesday, September 12th 2018: 14:00 UTC
Reports

Switching back:

Schedule
  • Traffic: Wednesday, October 10th 2018 09:00 UTC
  • MediaWiki: Wednesday, October 10th 2018: 14:00 UTC
  • Services: Thursday, October 11th 2018 14:30 UTC
  • Media storage/Swift: Thursday, October 11th 2018 15:00 UTC
Reports

2017 switches

Schedule
  • Elasticsearch: elasticsearch is automatically following mediawiki switch
  • Services: Tuesday, April 18th 2017 14:30 UTC
  • Media storage/Swift: Tuesday, April 18th 2017 15:00 UTC
  • Traffic: Tuesday, April 18th 2017 19:00 UTC
  • MediaWiki: Wednesday, April 19th 2017 14:00 UTC (user visible, requires read-only mode)
  • Deployment server: Wednesday, April 19th 2017 16:00 UTC
Reports

Switching back:

Schedule
  • Traffic: Pre-switchback in two phases: Mon May 1 and Tue May 2 (to avoid cold-cache issues Weds)
  • MediaWiki: Wednesday, May 3rd 2017 14:00 UTC (user visible, requires read-only mode)
  • Elasticsearch: elasticsearch is automatically following mediawiki switch
  • Services: Thursday, May 4th 2017 14:30 UTC
  • Swift: Thursday, May 4th 2017 15:30 UTC
  • Deployment server: Thursday, May 4th 2017 16:00 UTC
Reports

2016 switches

Schedule
  • Deployment server: Wednesday, January 20th 2016
  • Traffic: Thursday, March 10th 2016
  • MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
  • Elasticsearch: Thursday, April 7th 2016, 12:00 UTC
  • Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
  • Services: Monday, April 18th 2016, 10:00 UTC
  • MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
Reports

Switching back:

  • MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
  • Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done

Monitoring Dashboards

Aggregated list of interesting dashboards