The guy watching South Park all day. Used to be a student in Computer Science (specialised in InfoSec, open source infrastructures, computer networking, and ethical security & network research). Formerly employed in InfoSec as well.
User Details
- User Since: Oct 12 2014, 7:12 AM (525 w, 4 d)
- Availability: Available
- IRC Nick: Southparkfan
- LDAP User: Southparkfan
- MediaWiki User: Southparkfan [ Global Accounts ]
Sep 7 2024
As partially noted on IRC as well: the flapping has been going on for a while, there didn't seem to be any critical hosts in D4 (assuming the line card numbering matches the physical racks properly, in all VCs), and hence it was not Klaxon-worthy to me. Nevertheless, they're still production hosts running on a switch with interface issues, sometimes for up to two hours. And unless the eqiad VC cabling is different from a perfect spine-leaf topology, this means the D4 asw only had one remaining uplink, which is an issue.
Sep 5 2024
Sep 4 2024
I can confirm the ldap group has been added to my account.
Sep 3 2024
For the record: there seem to be a few IdP-related issues in Netbox (T373702), but despite that, this LDAP access request is still valid.
Aug 28 2024
Email has been sent!
Aug 27 2024
Ta da: https://wikitech.wikimedia.org/w/index.php?title=Adding_and_removing_transit_providers&diff=2218856&oldid=2042295. Can you verify this is correct? There are lots of references to private Phabricator tasks, and of course I have never dealt with WMF's transit providers before.
Before working on automating the verification (that we didn't forget any step) or the actual implementation, we should look at using Netbox as source of truth.
As far as I understand, the actual circuits are already managed in Netbox, but the Homer template for the Transit group (which is what actually manages the BGP sessions on the CRs) is expanded based on yaml configuration in config/devices.yaml. On top of that, we have site-wide policies, and BFD + FlowSpec configuration for transit providers in config/common.yaml. Using Netbox to manage the AS-specific import/export policies does not really make sense (the policies are free-form text), and I'm not sure what Netbox data model is suitable for modelling BGP sessions. Something I can think of is some kind of CI/test that checks whether the Homer transit and transit_provider keys contain ASes that are not "transit" Providers in Netbox, to at least have some kind of Netbox-Homer verification, and said data may also be useful for cross-checking the IRR databases and ASPA objects. If Wikimedia has some custom data model for any of the use cases listed, let me know!
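The Netbox-Homer cross-check suggested above could look roughly like the sketch below. It is only an illustration: the layout of the Homer data (a per-device `transit` mapping keyed by ASN and a site-wide `transit_provider` mapping keyed by ASN) is an assumption, the function names are hypothetical, and the Netbox side is reduced to a plain set of ASNs that a real test would have to fetch from the Netbox API.

```python
from typing import Iterable


def homer_transit_asns(devices_cfg: dict, common_cfg: dict) -> set[int]:
    """Collect every ASN referenced by the (assumed) Homer transit keys."""
    asns: set[int] = set()
    # Assumed layout: {device_name: {"transit": {asn: {...}}}}
    for device in devices_cfg.values():
        asns.update(int(asn) for asn in device.get("transit", {}))
    # Assumed layout: {"transit_provider": {asn: {...}}}
    asns.update(int(asn) for asn in common_cfg.get("transit_provider", {}))
    return asns


def asns_missing_from_netbox(
    homer_asns: Iterable[int], netbox_transit_asns: Iterable[int]
) -> list[int]:
    """Return ASNs configured in Homer that Netbox does not list as transit."""
    return sorted(set(homer_asns) - set(netbox_transit_asns))
```

A CI job could fail whenever `asns_missing_from_netbox` returns a non-empty list. The same set of ASNs might then be reused for the IRR and ASPA cross-checks mentioned above.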
Yep, that would be ideal. Unfortunately, one `prefix-list` can only have one apply-path, so we cannot take the union of BGP peer IPs and interface/tunnel IPs. Furthermore, these apply-paths do not seem to support regexes(?), so we cannot craft a regex that only matches interfaces that are part of the external-links group ("set policy-options prefix-list egress-ranges4 apply-groups xxxx" is valid, but apply-groups is meant for "inheriting configuration data", not so much for retrieving IP addresses from interfaces assigned to this...?).
Aug 12 2024
Follow-up from IRC: Wikimedia uses Hosted RPKI, but we assume the ARIN portal just doesn't support anything other than ROAs. There is an ASPA record for AS11358, whose ASN is controlled by ARIN, and we think they use Hybrid RPKI, where ARIN still hosts the RPKI objects through their Repository Publication Service. Technically, Krill could both act as the CA and create ASPA records, and it is possible that AS11358 and others are doing this.
Aug 9 2024
Jul 24 2024
sessionstorage04 is no longer.
Jul 23 2024
Couldn't upgrade Buster to 4.x, because there are no packages in buster-wikimedia. Installing Cassandra was a rather interesting process.
I didn't get a response in -sre, but Andrew has provided me with extra information.
Jul 22 2024
Puppet fails to install the Cassandra instance:
Error: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0]
Error: /Stage[main]/Cassandra/Cassandra::Instance[default]/Exec[install-/var/lib/cassandra/data]/returns: change from 'notrun' to ['0'] failed: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0] (corrective)
Error: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0]
Error: /Stage[main]/Cassandra/Cassandra::Instance[default]/Exec[install-/var/lib/cassandra/data]/returns: change from 'notrun' to ['0'] failed: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0] (corrective)
Had to delete sessionstorage05 (bookworm) due to T357791, will replace with a bullseye instance for Cassandra
Jul 20 2024
Great, can't ssh into my new instance:
$ ssh deployment-sessionstore05.deployment-prep.eqiad1.wikimedia.cloud
Connection closed by UNKNOWN port 65535
@Jgiannelos hey! Is deployment-restbase-bullseye (created by you last year) ready to take over the work from restbase04? Other than changing the references to restbase04 in Horizon hiera and LabsServices.php, and in the changeprop Chart (deployment--charts), it should be possible to switch, although the restbase service is not listening on port 7231 on -bullseye. Any idea what's wrong?
@hnowlan I see you have created deployment-maps-master02. Other than possibly replacing the old master in https://github.com/wikimedia/maps-kartotherian-deploy/blob/master/scap/environments/beta/targets, is there anything needed before deleting master01?
As soon as the above changes have been merged, urldownloader03 can be deleted.
@BTullis I see you have created deployment-snapshot05 (Bullseye), although this new host was not part of https://gerrit.wikimedia.org/r/c/operations/dumps/scap/+/1008451, and neither is it part of the mediawiki-installation dsh group. Do we have to add snapshot05 to your 'scap' repository as well, or is it fine to just add it to the dsh group?
Given that Puppet does not have a flag to stop the periodic MediaWiki jobs, I had to disable Puppet on mwmaint03 and kill the jobs myself (just like the DC switchover cookbooks do). Can be re-enabled as soon as mwmaint02 is gone (deleted + removed from dsh groups).
It looks like the shellbox container is broken on the new host:
root@deployment-shellbox01:~# /usr/bin/docker run --rm=true --env-file /etc/shellbox/env -p 8081:8081 -v shellbox:/etc/shellbox -v /run/shared:/run/shared -v /srv/shellbox/config/:/srv/app/config -v /srv/shellbox/src:/srv/app/src --name spftest docker-registry.wikimedia.org/wikimedia/mediawiki-libs-shellbox:2024-06-13-133425-video --nodaemonize
[20-Jul-2024 14:43:13] ERROR: unable to bind listening socket for address '/run/shared/fpm-www.sock': Permission denied (13)
[20-Jul-2024 14:43:13] ERROR: unable to bind listening socket for address '/run/shared/fpm-www.sock': Permission denied (13)
[20-Jul-2024 14:43:13] ERROR: FPM initialization failed
[20-Jul-2024 14:43:13] ERROR: FPM initialization failed
deployment-parsoid14 has been installed with a Bullseye image.
Upgrade to bullseye/bookworm blocked due to T332015. @MoritzMuehlenhoff, can I help you to get poolcounter-prometheus-exporter imported to bullseye and/or bookworm (preferably both)?
^ after merging this change, deployment-push-notifications01 can be replaced with a Bookworm instance.
mediawiki11 and mediawiki12 are no longer in use, but still receive scap deployments. As soon as the two changes above have been merged, we can delete these instances.
@Jgiannelos I see you have created mobileapps02 with a Bullseye image. Is mobileapps01 ready for removal?
Jul 19 2024
Done (volume has been deleted as well)
Done :)
deployment-jobrunner04 has been shut down (had to reboot due to errors in Jenkins). As soon as https://gerrit.wikimedia.org/r/1055394 and https://gerrit.wikimedia.org/r/1055412 are merged, we can delete that instance.
Instance is offline, seems to be superseded by deployment-changeprop-1.deployment-prep.eqiad1.wikimedia.cloud per T357476#9540192. @Urbanecm_WMF, do you agree we can delete deployment-docker-cpjobqueue01?
Instance does not exist anymore?
After merging https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1055306, the irc.beta.wmflabs.org RRset can be removed, and the floating IP can be removed from deployment-ircd03 as well; if you want to test the IRC server, you can run irssi on a Cloud VPS instance.
Jul 18 2024
Loop fixed by setting profile::base::remove_python2_on_bullseye: false on prefix level, also done in production.
role::mw_rc_irc seems to work fine on a Bullseye box, except for a loop.
Jun 14 2024
Relevant: T127717#9671526 (i.e. can be deleted without issues, but preferably only before other instances are deleted)
May 10 2024
Thanks for your help, Riccardo! Given current time constraints, I'm afraid most of this work will take multiple months, but nevertheless, to see whether kea_python still works with the Kea packages provided by Debian, I felt it was time to bite the bullet and build the bindings manually.
Apr 17 2024
Thank you for your reply! My comments:
As I understand it, no server in production VLANs (that is: starting with {analytics,private,public} - excluding frack infrastructure?) should rely on DHCP for any purpose other than reimaging, because the IPv4 address will be set statically in d-i. For that reason, I can see why we would like to refuse DHCP requests if no syslinux path is provided by NetBox. I wouldn't classify it as a security measure against malevolent administrators, but rather as a failsafe to mitigate the impact of operator error.
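The failsafe described above can be sketched as a small decision function. This is an illustration under assumptions, not the actual DHCP implementation: the function name, the way the VLAN name and syslinux path are passed in, and the choice to leave non-production VLANs untouched are all hypothetical.

```python
from typing import Optional

# Production VLAN name prefixes, as listed in the comment above.
PRODUCTION_VLAN_PREFIXES = ("analytics", "private", "public")


def should_answer_dhcp(vlan_name: str, syslinux_path: Optional[str]) -> bool:
    """Decide whether a DHCP request should be answered.

    Hosts in production VLANs only need DHCP while reimaging, so a request
    without a syslinux path from Netbox is refused there. Requests from any
    other VLAN are answered unconditionally (an assumption of this sketch).
    """
    if vlan_name.startswith(PRODUCTION_VLAN_PREFIXES):
        return bool(syslinux_path)
    return True
```

With this shape, an operator who forgets to mark a host for reimaging gets no lease at all instead of an unexpected one, which is the operator-error mitigation the comment argues for.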
Apr 9 2024
@ayounsi and I have discussed my first findings, and we thought it made sense to share them here.
Mar 28 2024
Mar 27 2024
Haven't made a lot of progress on this, unfortunately. Scheduled for April.
Nov 20 2023
I'll work on this.
Nov 15 2023
Production migration from the gnutls driver to the openssl driver can be tracked in T324623.
Nov 3 2023
Oct 13 2023
Alternative to consider: injecting REDIRECTs for traffic meant for a VIP. See the second section at http://www.linuxvirtualserver.org/docs/arp.html. I haven't tested it and it requires some sort of Netfilter implementation on the realservers, but it avoids MTU-related issues (when tunneling traffic). Never mind, the ARP problem is solved at Wikimedia by not announcing ARP. MTU is a challenge when using any type of encapsulation (in this case IPIP), but that's a different issue :)
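To make the MTU point concrete: IPIP wraps each packet in an extra IPv4 header, which is 20 bytes when it carries no options, so the inner packet has that much less room. A minimal sketch of the arithmetic, assuming option-less IPv4 and TCP headers (the function names are made up for illustration):

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options


def ipip_inner_mtu(link_mtu: int) -> int:
    """MTU left for the inner packet after the outer IPv4 header is added."""
    return link_mtu - IPV4_HEADER_BYTES


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES
```

On a 1500-byte link, `ipip_inner_mtu(1500)` gives 1480 and `tcp_mss(1480)` gives 1440, compared with an MSS of 1460 without the tunnel.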
Oct 3 2023
Aug 5 2023
May 12 2023
Feb 1 2023
I have expanded https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Auth_logging. The 'known limitations' section shows there is still plenty of work to do, but to avoid a never-ending task, I am fine with resolving this task when T127717#8505600 has been applied to Cloud VPS. I find the lack of monitoring to be a blocker too, though.
Dec 14 2022
Standalone puppetmasters are also affected by this Git update:
$ git push -f project_puppetmaster HEAD:production
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
remote: fatal: detected dubious ownership in repository at '/var/lib/git/operations/puppet'
remote: To add an exception for this directory, call:
remote:
remote: git config --global --add safe.directory /var/lib/git/operations/puppet
Dec 7 2022
I have tested https://gerrit.wikimedia.org/r/c/operations/puppet/+/865731 by using rsyslog-openssl on one syslog client and one syslog server running buster + one syslog client and one syslog server running bullseye. All works as expected.
Status: we chose #3 (Let's Encrypt via acme-chief). We've gotten stuck on a bug in the gnutls driver for rsyslog: T324623