User:TweetsFactsAndQueries/Darwinís Fox


At WikiCon16, I met Tobias1984, who showed me how to use Pywikibot with the awesome PAWS interface (and who also wrote this great tutorial). I remembered that I had noticed an error with some taxon common name (P1843) statements a while ago: many of them contain the character “í” where they should contain an apostrophe – for example, “Darwinís Fox” for Darwin's fox (Q631167). At the time, I didn’t know how to fix it, but now I decided to try fixing this with Pywikibot.

We start off by opening PAWS. Log in, start your server, and create a new Python 3 notebook (New > Python 3). (Notebooks are public, so you can see the one I used here, and read along as this post unfolds. Note that the notebook includes a lot of output, so it will take some time to load.)

Insert some boring boilerplate code:

import pywikibot as pwb
site = pwb.Site("wikidata", "wikidata")
repo = site.data_repository()

(You send off each input unit – which can contain multiple lines – with Shift+Enter.)

We first want to fix one item “manually”, because automated editing is dangerous! So let’s start with Darwin’s Fox, Q631167. We obtain its item:

darwins_fox = pwb.ItemPage(repo, "Q631167")

We can see the broken taxon name with:

darwins_fox.get()["claims"]["P1843"][0].getTarget().text

Okay, that’s a lot of code out of nowhere. Where did it come from? I’m not going to write it all up, but essentially it’s a combination of

  • tab completion: enter darwins_fox., press Tab, and PAWS will tell you that the object has a get() method.
  • printing an intermediate result and looking at what’s available in it: for example, if you print darwins_fox.get(), you can see that the result is a dictionary which contains the key "claims".
  • looking at the tutorial and copying snippets from there.
  • listing the object: you can show the methods and properties of any object with dir(), but that’s usually a last resort because it includes a lot of useless internal results (__foobar).
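The dir() approach from the last bullet gets less noisy if you filter out the internal names. Here is a minimal sketch using a stand-in class rather than a real pywikibot object; public_names and Example are hypothetical names for illustration, not part of pywikibot:

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class Example:
    """Stand-in object; in the notebook you would pass darwins_fox instead."""

    def __init__(self):
        self.claims = {}

    def get(self):
        return {"claims": self.claims}


print(public_names(Example()))
```

In the notebook, public_names(darwins_fox) should list roughly what tab completion offers.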

Now we want to edit the item. I’m still playing around with pywikibot, so let’s just see what happens if I do:

darwins_fox.get()["claims"]["P1843"][0].getTarget().text = "Darwin’s fox"

Nothing happens so far. That’s good. I tried to see what methods darwins_fox has with tab completion. This one sounded promising:

darwins_fox.editEntity()

No error message – and the item is updated! Whoa. That’s really easy. And also, I should probably stress that this was pretty irresponsible of me (sorry!). Let’s use the sandbox item for playing around some more.

sandbox = pwb.ItemPage(repo, "Q4115189")
sandbox.get()["claims"]["P1843"][0].getTarget().text

I had added a faux taxon common name (P1843) to the sandbox item, so this still printed “Darwinís fox”. (When you read this, that statement will probably be gone.)

Let’s try to fix up the name without hard-coding the proper solution. That is, instead of setting it to the fixed string “Darwin’s fox”, replace the broken character “í” with an apostrophe.

target = sandbox.get()["claims"]["P1843"][0].getTarget()
target.text = target.text.replace("í", "’")
sandbox.editEntity(summary="test edit from pywikibot")
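Before saving anything, it can help to preview the replacement as plain string handling, without touching Wikidata. This is only a sketch: fix_name and REPLACEMENTS are hypothetical names, and the table holds just the one broken character discussed so far.

```python
REPLACEMENTS = {"í": "’"}  # broken character -> intended character


def fix_name(text, replacements=REPLACEMENTS):
    """Return (fixed_text, changed) without touching Wikidata."""
    fixed = text
    for broken, intended in replacements.items():
        fixed = fixed.replace(broken, intended)
    return fixed, fixed != text


print(fix_name("Darwinís fox"))
print(fix_name("Darwin’s fox"))
```

The second element of the result says whether an edit is needed at all, so an already correct name would not trigger a save.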

That looks good! Okay, let’s move onto SPARQL, since we’ll want to fix all items, not one. I had already written this SPARQL query back when I found this issue:

SELECT ?taxon ?commonName ?fixedCommonName
WHERE
{
  ?taxon wdt:P1843 ?commonName.
  FILTER(CONTAINS(?commonName, "í") && CONTAINS(?commonName, " ")).
  BIND(REPLACE(REPLACE(?commonName, "í", "'"), "‰", "ä") AS ?fixedCommonName).
}

This query already shows the fixed taxon name by replacing the two broken characters. We don’t need that here, since we’ll do the replacing in Python. We also need to change the ?taxon variable to ?item because pywikibot’s SPARQL iterator looks for that particular variable name.

SELECT ?item
WHERE
{
  ?item wdt:P1843 ?commonName.
  FILTER(CONTAINS(?commonName, "í") && CONTAINS(?commonName, " ")).
}

Let’s save that query.

query = """
SELECT ?item
WHERE
{
  ?item wdt:P1843 ?commonName.
  FILTER(CONTAINS(?commonName, "í") && CONTAINS(?commonName, " ")).
}
"""

We also now need the page generators package, so let’s import it:

from pywikibot import pagegenerators as pg

And get a page generator for our query:

generator = pg.WikidataSPARQLPageGenerator(query, site)

And let’s first just browse the query results: loop over the iterator, print the taxon common name.

for item in generator:
    for claim in item.get()["claims"]["P1843"]:
        print(claim.getTarget().text)

The results look good: they all look like names that we want to fix. So let’s try that. But first, we need to recreate the generator, because it’s now exhausted and we can’t iterate over it a second time:

generator = pg.WikidataSPARQLPageGenerator(query, site)
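The exhaustion is ordinary Python generator behaviour, not something specific to pywikibot. A small sketch with a stand-in generator (numbers is a made-up name) shows the effect:

```python
def numbers():
    """A tiny generator standing in for the SPARQL page generator."""
    yield from [1, 2, 3]


gen = numbers()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to yield

fresh = list(numbers())  # a newly created generator starts over
print(first_pass, second_pass, fresh)
```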

And then, edit the items as before – but this time in the loop!

for item in generator:
    for claim in item.get()["claims"]["P1843"]:
        target = claim.getTarget()
        text = target.text
        if "í" in text:
            target.text = text.replace("í", "’").replace("‰", "ä")
            item.editEntity(summary="Fix broken taxon common name: see https://www.wikidata.org/wiki/User:TweetsFactsAndQueries/Darwin%C3%ADs_Fox")
            print("Edited {} to change {} to {}".format(item, text, target.text))
            break

I put that break in there because automated editing is still scary, so I wanted to edit only one item at first. But I forgot that this only breaks the inner loop (for claim in ...), not the outer loop (for item in generator), so this actually does edit all the items. Turns out that’s not too bad, because my account doesn’t have the bot flag, so pywikibot has to sleep for about ten seconds after each edit to prevent running into rate limits. If the edits had turned out to be bad, I could’ve just stopped the program (with the big button at the top of the notebook) and reverted the bad edits manually. But they look good, so I just let the thing run until it’s done. The SPARQL query returned 336 results earlier, so at ten seconds per edit, we can estimate that this will take about an hour.
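To really stop after the first edit, one option is to put the nested loops in a function and use return, which leaves both loops at once. The sketch below works on made-up sample data instead of real items, and edit_first_match is a hypothetical name; it only computes the fixed name and does not edit anything.

```python
def edit_first_match(items):
    """Return (item_id, fixed_name) for the first broken name, then stop."""
    for item_id, names in items:
        for name in names:
            if "í" in name:
                # return leaves both loops at once, unlike break
                return item_id, name.replace("í", "’")
    return None


# Made-up sample data standing in for the query results.
sample = [
    ("item-1", ["Plain name"]),
    ("item-2", ["Darwinís fox"]),
    ("item-3", ["Stellerís sea cow"]),
]
print(edit_first_match(sample))
```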

Alright – an hour later, that’s done. We’re not done, though: Many of those items also have aliases that are broken in the same way.

We start again with Darwin’s fox, and look at its aliases:

aliases = darwins_fox.get()["aliases"]

And then we fix them in a loop. (Remember, this won’t do anything until we call some edit method.)

for lang in aliases:
    for index, alias in enumerate(aliases[lang]):
        if "í" in alias:
            aliases[lang][index] = alias.replace("í", "’").replace("‰", "ä")

We do the edit – this time, with the editAliases method.

darwins_fox.editAliases(aliases, summary="See https://www.wikidata.org/wiki/User:TweetsFactsAndQueries/Darwin%C3%ADs_Fox")

There is another complication. In some items, the fixed alias has already been added – as in, they have both “somethíng” and “someth’ng” as aliases. What happens in this case with the aliases? Let’s try it out with one sample item, Tanimbar Corella (Q757402).

goffins_cockatoo = pwb.ItemPage(repo, "Q757402")
aliases = goffins_cockatoo.get()["aliases"]
for lang in aliases:
    for index, alias in enumerate(aliases[lang]):
        if "í" in alias:
            aliases[lang][index] = alias.replace("í", "’").replace("‰", "ä")
aliases["en"][1] == aliases["en"][2] # prints True: those two aliases are now the same
goffins_cockatoo.editAliases(aliases, summary="See https://www.wikidata.org/wiki/User:TweetsFactsAndQueries/Darwin%C3%ADs_Fox")

A quick look at the item page confirms that the alias is now not present twice: it looks like either pywikibot or the Wikidata API quietly removed the duplicate alias. That’s exactly what we want, so no need to worry here!
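Since it’s unclear which side removed the duplicate, you could also de-duplicate the list yourself before sending it. A minimal sketch, with dedupe as a hypothetical helper and made-up alias strings:

```python
def dedupe(values):
    """Remove repeated entries while keeping the order of first occurrence."""
    return list(dict.fromkeys(values))


# Made-up aliases: after fixing, the last two entries are identical.
fixed = ["Tanimbar cockatoo", "Goffin’s cockatoo", "Goffin’s cockatoo"]
print(dedupe(fixed))
```

Applied to each aliases[lang] list before calling editAliases, this avoids relying on the duplicate being dropped silently.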

Let’s move on to the mass edit, then! Here’s a SPARQL query that finds all items with an alias that contains broken characters and, when fixed, is identical to the taxon common name of the item.

SELECT ?item
WHERE
{
  ?item wdt:P1843 ?commonName;
        skos:altLabel ?alias.
  FILTER(CONTAINS(?alias, "í") && CONTAINS(?alias, " ")).
  FILTER(REPLACE(REPLACE(STR(?alias), "í", "’"), "‰", "ä") = STR(?commonName)).
}

And here it is in Python:

query = """
SELECT ?item
WHERE
{
  ?item wdt:P1843 ?commonName;
        skos:altLabel ?alias.
  FILTER(CONTAINS(?alias, "í") && CONTAINS(?alias, " ")).
  FILTER(REPLACE(REPLACE(STR(?alias), "í", "’"), "‰", "ä") = STR(?commonName)).
}
"""

So let’s put the pieces together: the SPARQL generator just like for the common names, and the alias editing from just now.

for item in pg.WikidataSPARQLPageGenerator(query, site):
    aliases = item.get()["aliases"]
    for lang in aliases:
        for index, alias in enumerate(aliases[lang]):
            if "í" in alias:
                aliases[lang][index] = alias.replace("í", "’").replace("‰", "ä")
                print("Editing {} to change {} to {}".format(item, alias, aliases[lang][index]))
    item.editAliases(aliases, summary="Fix broken taxon common name alias: see https://www.wikidata.org/wiki/User:TweetsFactsAndQueries/Darwin%C3%ADs_Fox")
    raise Exception # break out of loop, because mass editing is still scary

This time, I used raise Exception instead of break to break out of the loop (even though in this case break would also work), to be on the safe side. The result looks good, so let’s remove that raise and run again:

for item in pg.WikidataSPARQLPageGenerator(query, site):
    aliases = item.get()["aliases"]
    for lang in aliases:
        for index, alias in enumerate(aliases[lang]):
            if "í" in alias:
                aliases[lang][index] = alias.replace("í", "’").replace("‰", "ä")
                print("Editing {} to change {} to {}".format(item, alias, aliases[lang][index]))
    item.editAliases(aliases, summary="Fix broken taxon common name alias: see https://www.wikidata.org/wiki/User:TweetsFactsAndQueries/Darwin%C3%ADs_Fox")
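For a trial run like the one with raise Exception, another option is to cap the generator with itertools.islice from the standard library, so the loop body stays identical between the trial and the full run. The results function below is a stand-in for pg.WikidataSPARQLPageGenerator(query, site), not real pywikibot code:

```python
from itertools import islice


def results():
    """Stand-in for pg.WikidataSPARQLPageGenerator(query, site)."""
    yield from ["item-1", "item-2", "item-3"]


# Only the first result is processed; the rest is never requested.
trial = list(islice(results(), 1))
print(trial)
```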

Again, pywikibot has to sleep for about ten seconds after each edit to avoid running into rate limits (my account doesn’t have the bot flag), so this will take another hour.

Some time later, that’s done as well. I looked over the results, and there was actually a fair number of false positives :( mostly for two reasons:

  • “Delfín” is spelled with an í in Spanish.
  • At least one language seems to use í for the genitive case.

Both of these could have been avoided if I’d limited the query and replacements to English labels. (Which is also why the problem didn’t occur in the taxon common name (P1843) fix above – there are fewer “taxon common name” statements than aliases, and mostly just English ones.) I could have stopped pywikibot and improved the query, but instead I decided to let it run and fix the broken cases manually: Commerson's dolphin (Q724354), Steller’s sea cow (Q187484), Fraser's dolphin (Q751965), Angraecum sesquipedale (Q133818), Vipera latastei (Q1247992), Kitti's hog-nosed bat (Q370965), and Etruscan shrew (Q369889). (I also went back and looked through the taxon common name changes, but didn’t find any edits that stood out as incorrect. Diospyros revaughanii (Q15341736) looked suspicious, but a quick Google search revealed that “Bois d’ebene feuilles” seems to be correct.)
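On the Python side, limiting the replacements to English could look roughly like this. It’s a sketch on a plain dictionary shaped like item.get()["aliases"] (language code mapped to a list of strings); fix_aliases and its languages parameter are hypothetical, and the sample data is made up.

```python
def fix_aliases(aliases, languages=("en",)):
    """Return a fixed copy of aliases, touching only the given languages."""
    fixed = {}
    for lang, names in aliases.items():
        if lang in languages:
            fixed[lang] = [
                name.replace("í", "’").replace("‰", "ä") if "í" in name else name
                for name in names
            ]
        else:
            # Other languages are copied unchanged, so a legitimate í survives.
            fixed[lang] = list(names)
    return fixed


sample = {"en": ["Commersonís dolphin"], "es": ["Delfín de Commerson"]}
print(fix_aliases(sample))
```

The Spanish alias keeps its í, while the English one gets its apostrophe back.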

There’s one final problem. In the alias results, I also noticed the item Indirana diplosticta (Q3150486), with taxon common name and alias “G¸nther’s frog”. Clearly, more characters than ’ and ä were broken – in this case, the ¸ should be an ü.

A quick SPARQL query –

SELECT ?taxon ?name WHERE {
  ?taxon wdt:P1843 ?name.
  FILTER(CONTAINS(?name, "¸")).
}

– reveals that only eight items like that exist, so I opted to fix those manually instead of with pywikibot.
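One observation that may help with hunting down further cases: all three substitutions seen here (í for ’, ‰ for ä, ¸ for ü) are consistent with text that was stored as Windows-1252 bytes and then read as Mac OS Roman. That’s an assumption about the cause, not something confirmed by the importer, so treat the following as a sketch for checking candidates by hand rather than as a safe bulk fix; repair is a hypothetical helper name.

```python
def repair(text):
    """Re-encode as Mac OS Roman and decode as Windows-1252.

    Assumes the whole string was garbled the same way; returns the
    input unchanged when the round trip is not possible.
    """
    try:
        return text.encode("mac_roman").decode("cp1252")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


for broken in ["Darwinís fox", "G¸ntherís frog", "K‰fer"]:
    print(broken, "->", repair(broken))
```

Note that this garbles strings that were correct to begin with (a legitimate “Delfín” would lose its í), and also strings that are only partly fixed, so it has the same false-positive problem as the plain replacements.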

And now we’re done! There are probably some more broken taxon common names, or perhaps even other properties that are broken because of the same bug in some importer tool, but that’s all I know of.