Here are some thoughts on alt text:
Question 1: If we have the wikitext of an article, how do we tell if it has images with missing alt text?
In principle this can be done with a regex. Supposing that we have the full wikitext of an article, we can search it for the following regex:
\[\[File:((?!\|\s*alt\s*=)[^]])*\]\]
If this regex produces matches, then each of those matches will be an instance of a [[File:...]] link that doesn’t have an alt= parameter.
Keep in mind, though, that the File: namespace name and the alt= parameter name are specific to English Wikipedia and would need to be localized for other language wikis. (I'm pretty sure we hard-code a list of File namespace localizations, and the alt parameter is a magic word that can be localized via the API.)
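As a rough illustration of the detection step, here is a minimal Python sketch. The function names and the namespace/alt_keyword parameters are made up for this example, the closing bracket inside the character class is escaped for portability across regex engines, and the German localization in the usage note is only an assumed example.

```python
import re
from typing import List, Tuple


def build_missing_alt_pattern(namespace: str = "File", alt_keyword: str = "alt") -> "re.Pattern[str]":
    """Compile the missing-alt regex for a given (possibly localized) namespace and alt keyword."""
    ns = re.escape(namespace)
    alt = re.escape(alt_keyword)
    # Same idea as the regex above: consume characters up to the closing ]],
    # but refuse to step over a "|alt=" parameter.
    return re.compile(r"\[\[" + ns + r":(?:(?!\|\s*" + alt + r"\s*=)[^\]])*\]\]")


def find_links_missing_alt(wikitext: str, pattern: "re.Pattern[str]") -> List[Tuple[int, int, str]]:
    """Return (start, end, link_text) for every file link that has no alt parameter."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(wikitext)]


sample = "Intro [[File:A.jpg|thumb|A caption]] and [[File:B.jpg|thumb|alt=A cat|Caption]]."
matches = find_links_missing_alt(sample, build_missing_alt_pattern())
for start, end, link in matches:
    print(start, end, link)
```

For another wiki, the same helper could be called with that wiki's names, for example build_missing_alt_pattern("Datei", "alternativtext"), assuming those are the localized forms in use there.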
Question 2: How do we insert alt text into an existing File link?
For each match of the regex above, insert alt text as follows:
- Go to the location of the regex match.
- Parse until the end of the "File:..." name, i.e. until you reach the first pipe | character or the closing ]] brackets.
- Insert |alt=<alt text> at that location.
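The steps above can be sketched in Python roughly as follows. The helper names insert_alt and add_missing_alt, and the alt_for callback that supplies the alt text for a link, are hypothetical; the sketch also assumes the alt text itself contains no | or ]] characters, which would need escaping in real wikitext.

```python
import re
from typing import Callable

# Regex from Question 1, with the closing bracket escaped for portability.
FILE_LINK_MISSING_ALT = re.compile(r"\[\[File:(?:(?!\|\s*alt\s*=)[^\]])*\]\]")


def insert_alt(link: str, alt_text: str) -> str:
    """Insert |alt=... right after the file name of a single [[File:...]] link."""
    closing = len(link) - 2           # index of the closing ]]
    pipe = link.find("|")             # first pipe, if the link has any parameters
    insert_at = pipe if pipe != -1 else closing
    return link[:insert_at] + "|alt=" + alt_text + link[insert_at:]


def add_missing_alt(wikitext: str, alt_for: Callable[[str], str]) -> str:
    """Rewrite every file link that lacks alt text, asking alt_for(link) for the text."""
    return FILE_LINK_MISSING_ALT.sub(lambda m: insert_alt(m.group(0), alt_for(m.group(0))), wikitext)


before = "[[File:A.jpg|thumb|Cap]] [[File:B.jpg|alt=x]] [[File:C.png]]"
after = add_missing_alt(before, lambda link: "desc")
print(after)
```

Links that already carry an alt= parameter are left untouched, because the regex never matches them in the first place.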
Question 3: How do we get a list and/or queue of articles that have images with missing alt text?
This is a little trickier. Theoretically it’s possible to feed the same regex directly into CirrusSearch, and search for insource:/\[\[File:((?!\|\s*alt\s*=)[^]])*\]\]/
This would search the contents of all articles using that regex and return the matches. The problem is that regex searches in CirrusSearch are very expensive and will likely cause timeouts and other issues, so this approach is not recommended.
This means we have to use a more efficient (if imperfect) method to find candidate articles, and then run the regex ourselves on those results.
Idea 1:
This uses generator=pageswithprop, which gives us pages that have a specific page property; the property we look for is page_image_free. This ensures that every returned page has at least one image. You can then fetch the wikitext of each of these articles and run the regex search from above on it. The downside of this generator is that it's not randomized, and returns the names of articles in alphabetical order.
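The exact query is not reproduced in these notes, so the following Python sketch is only one guess at how such a request could be built with the MediaWiki Action API. The endpoint URL, the function name, the default limit, and the choice to fetch wikitext in the same request via prop=revisions are all assumptions.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint; other wikis would use their own api.php URL.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def pages_with_image_url(limit: int = 50, continue_token: Optional[str] = None) -> str:
    """Build a query URL listing pages that have the page_image_free property,
    together with the wikitext of their latest revision."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "pageswithprop",
        "gpwppropname": "page_image_free",
        "gpwplimit": str(limit),
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
    }
    if continue_token is not None:
        # Continuation value taken from a previous response, used to page through results.
        params["gpwpcontinue"] = continue_token
    return API_ENDPOINT + "?" + urlencode(params)


url = pages_with_image_url(limit=10)
print(url)
```

Because the results come back in a fixed order, a client walking this list would need to keep the continuation value between requests rather than restarting from the top each time.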
Idea 2:
This uses generator=random, which gives us random articles (within the main namespace), and we then keep only the articles that have a page_image_free property, which ensures that the article has at least one image. You can then fetch the wikitext of each of these articles and run the regex search from above on it. The downside is that random will produce a lot of misses. However, if you take a large enough random sample (the query above gives 50), it's highly likely that a few of them will have images. And then, out of those articles that have images, it's ~90% likely that they're missing alt text.
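A minimal Python sketch of this idea, under assumptions: the endpoint URL and function names are made up for illustration, the request asks for formatversion=2 so that pages arrive as a list, and the sample response below is a hand-written stand-in rather than real API output.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

# Assumed endpoint; other wikis would use their own api.php URL.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def random_pages_url(limit: int = 50) -> str:
    """Build a query URL for random main-namespace pages plus their page_image_free prop."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "random",
        "grnnamespace": "0",
        "grnlimit": str(limit),
        "prop": "pageprops",
        "ppprop": "page_image_free",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def titles_with_images(response: Dict[str, Any]) -> List[str]:
    """Keep only the pages whose pageprops include page_image_free."""
    pages = response.get("query", {}).get("pages", [])
    return [p["title"] for p in pages if "page_image_free" in p.get("pageprops", {})]


# Hand-written stand-in for a response, for illustration only.
fake_response = {
    "query": {
        "pages": [
            {"pageid": 1, "title": "Has image", "pageprops": {"page_image_free": "A.jpg"}},
            {"pageid": 2, "title": "No image"},
        ]
    }
}
print(random_pages_url())
print(titles_with_images(fake_response))
```

The titles that survive the filter are the ones worth fetching wikitext for; the misses are simply dropped, and another random batch can be requested if too few remain.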
Idea 3:
Set up a new backend service that pre-populates a list of articles (with a db query that runs periodically) and serves that list to clients, similar to GrowthExperiments or recommendation-api.