Hi Simon - I finally finished going through your file Wikidata-ORCID mismatched names with papers 20240426.csv - only a little over 4 months of work :) I think I caught everything. The most common issue was that whoever added the author record had mixed up two different authors on a paper - often the first and last author, or two neighboring authors. This may be because the publisher, Crossref, or some other secondary DB attached an ORCID to the wrong author. There were a lot of other special cases too. Sometimes the ORCID name was just a variant of the author name, and things were correct as they were (I added the variant name as an alias) - for example, for women who had changed their name. I also ran across quite a lot where the ORCID name bore no resemblance to any name on any paper with that author; no idea what happened there.
I'm hoping these were all temporary glitches in data imports - I don't think I ran across any bad ORCIDs from 2021 or later.
Anyway, after going through this I thought there were a few improvements you could make to this dataset, if you run it again:
- Your file was missing papers with no DOI - I think it would be fine to leave the paper author name blank in these cases, or otherwise indicate the problem, but I think it's nicer to have all the papers listed in one place even if the author names are unknown.
- Your file was also missing Wikidata items where the author had no "series ordinal" number. Again, just leave the name and author number blank; that should be clear.
- It might be good to compare not just to the main name from ORCID, but also to the alternate names and the first/last name breakdown. I'm not sure quite what you did there, but some of the "correct" cases were clearly matches when looking at the ORCID alternate names listed in the UI.
- There were a very small number of cases where the author sequence numbers you had were apparently different (often off by 1) from the actual author list in the paper, resulting in a mismatch on name when it was in fact ok. Not sure what happened here but one common issue was the handling of collective names - in Wikidata they seem to usually be in the regular author list just as they were on the paper, while they seemed to be skipped in the sequence you used.
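To make the alternate-name suggestion concrete, here is a minimal sketch in Python. It is not how the file was actually generated; the function names and parameters (`given`, `family`, `credit_name`, `other_names`) are hypothetical and only loosely follow the kinds of name fields an ORCID record shows. It assumes exact matching after normalization, which is deliberately strict, and it returns `None` rather than a mismatch when the paper author name is blank (as for papers with no DOI or authors with no series ordinal).

```python
import unicodedata


def normalize(name):
    """Lowercase, strip accents, and turn punctuation into single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in stripped.lower())
    return " ".join(cleaned.split())


def orcid_name_variants(given, family, credit_name=None, other_names=()):
    """Collect every normalized name form to compare against.

    All parameter names are assumptions made for this sketch.
    """
    variants = set()
    if given and family:
        variants.add(normalize(f"{given} {family}"))
        variants.add(normalize(f"{family} {given}"))
    for name in (credit_name, *other_names):
        if name:
            variants.add(normalize(name))
    variants.discard("")
    return variants


def matches_any_variant(paper_author_name, variants):
    """Return True/False for a comparison, or None if the paper name is unknown."""
    if not paper_author_name:
        return None  # blank name: keep the row, but do not call it a mismatch
    return normalize(paper_author_name) in variants
```

For example, an author listed on a paper under a former name would count as a match as long as that name appears among the ORCID alternate names passed in as `other_names`. A looser comparison (initials, token overlap) would catch more variants at the cost of more false matches.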
Thanks for generating this file. I'm glad to have gone through it, and I'm hoping we will have far fewer issues with mismatched ORCIDs going forward!