User Details
- User Since
- Dec 15 2021, 9:19 PM (158 w, 6 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- BKing (WMF)
Fri, Dec 20
Note that if we do want to run in Kubernetes, we need to build a wdqs image, similar to what WMDE does here
Thanks for the update, @Andrew. Per IRC conversation, I've changed this request from the wikidata-query project to search, as that project doesn't have a hyphen. Let us know if you are able to fulfill this request.
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag for this task. Thanks!
Thu, Dec 19
I checked the alert files on the Prometheus hosts using cumin:
Wed, Dec 18
We're now shipping the metrics correctly (thanks volans and dcausse).
Tue, Dec 17
I ran a pcap on wdqs1011 and I can confirm that the prometheus hosts are scraping the correct port.
Per today's DPE SRE standup, I've grabbed this ticket and will try to move it forward.
@Zache I ran your example query against commons-query and it appears to be giving a response. As such, I'm going to resolve this ticket. That being said: if it does not work, respond here and the ticket will re-open.
Mon, Dec 16
Per IRC conversation with @elukey, I just wanted to let y'all know that I successfully reimaged cloudelastic1012 just now. No Puppet 5, no CSR or any other errors.
Since benchmarking is not my forte, I've created a benchmarking plan that sticks as closely as possible to Brendan Gregg's recommendations. Feel free to review/update the document as necessary.
Thu, Dec 12
@Jelto see above, Puppet laid down the ferm rules when the patch you reviewed was merged, but ferm didn't actually load them, even when I reloaded ferm.service.
Tue, Dec 10
Hello. This is Brian and I'm an SRE for the Search Platform team.
@RKemper I believe this one is finished, but not 100% sure. Moving to "Needs Review" so you can take a look. Feel free to close if everything is done.
Mon, Dec 9
@elukey I'm fine with focusing our efforts on UEFI, it seems like the best use of our time. Ping me in IRC if I can do anything to help test.
It took a few tries, but wdqs1025 is now running off UEFI. I left some notes here about my experience. Closing...
Wed, Dec 4
diff -U0 _srv_config-master_pybal_codfw_wdqs-internal-scholarly.tmpl _srv_config-master_pybal_codfw_wdqs-scholarly.tmpl
-{{range $node := ls "/pools/codfw/wdqs-internal-scholarly/wdqs/"}}{{ $key := printf "/pools/codfw/wdqs-internal-scholarly/wdqs/%s" $node }}{{ $data := json (getv $key) }}
+{{range $node := ls "/pools/codfw/wdqs-scholarly/wdqs-scholarly/"}}{{ $key := printf "/pools/codfw/wdqs-scholarly/wdqs-scholarly/%s" $node }}{{ $data := json (getv $key) }}
All dashboards have been migrated to Prometheus metrics. As such, I'm closing this ticket.
Tue, Dec 3
No problem...as you said, we can come back to this when the time is right.
Search Platform shouldn't be getting these alerts anymore; Data Platform SRE should be the sole responder. I thought I fixed this in T379182, but it appears I need to take another look.
Mon, Dec 2
Nov 27 2024
Per the above merge, Data Platform SRE is the new recipient of these alerts. Search Platform will no longer receive them. Closing...
Nov 26 2024
This is an interesting one...zram0 is a compressed RAMdisk, so it should not be in scope for any SMART (hard drive health) checks. I believe we are the first at WMF to use zRAM, so we'll probably need to find the SMART monitoring config and exclude zram devices.
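As a rough illustration of the exclusion idea in the comment above (not the actual WMF SMART monitoring config, which has not been located yet), a minimal Python sketch could filter zram devices out of a list of block devices before any SMART check runs. The function name `smart_candidates` and the name pattern are hypothetical; the sketch only assumes that zram devices are named `zram0`, `zram1`, and so on.

```python
import re

# Assumption: zram devices are named zram<N>, optionally prefixed with /dev/.
ZRAM_PATTERN = re.compile(r"^(?:/dev/)?zram\d+$")


def smart_candidates(devices):
    """Return the devices that should still be considered for SMART checks.

    zram devices are compressed RAM disks with no physical drive behind
    them, so they are dropped from the list.
    """
    return [dev for dev in devices if not ZRAM_PATTERN.match(dev)]


if __name__ == "__main__":
    # Hypothetical device list, for illustration only.
    print(smart_candidates(["/dev/sda", "/dev/zram0", "/dev/nvme0n1"]))
```

In a real deployment the same rule would more likely live in the monitoring tool's own device-exclusion setting, once that config is found.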
Nov 25 2024
Per IRC conversation with @dcausse, we now have an alternate way of reindexing that does not involve cookbooks. As such, we can close out this ticket.
Closing as a duplicate of T379182
This alert has cleared, although the WDQS hosts appear frequently in the "slow but successful probes" on the linked dashboards. If this keeps up, we might try restarting the services, but for now I don't think we need to take any actions on that front.
I've created the cephs.home.meta and cephs.home.data pools as required by step 1 (ref https://wikimedia.slack.com/archives/C055QGPTC69/p1732203662967269)
Nov 22 2024
Nov 21 2024
Nov 20 2024
I failed to re-open this in December; re-opening now for the same reasons.
As the tasks above have been completed, I'll go ahead and close out this ticket.
Nov 19 2024
Thank you very much @Jclark-ctr !