2019:Quality/Data Quality in Wikidata

From Wikimania

Slides

Slides can be found here:

Transcription of post-its gathered in the poster session and in this talk/workshop

Poster

Question 1: What quality dimension should we better support or what task should be (more) automated?

  • Fit for purpose
  • How can we handle situations where Wikidata actually has better data than the “trusted” source?
  • How to easily find if a P date already exists so that I don’t create a duplicate?
  • Feature N importance?
  • Changing Wikidata values … when editing Wikipedia article infobox
  • Shape Expressions / Entity Schemas
  • How can I communicate the trust of a source across Wikipedia languages?
  • Editathons for subject matter experts in field X who can translate labels for language Y (need user friendly tools)
  • More transparency and easy-to-understand explanations that allow people, students to use the data


Question 2: Which data quality dimension do you work on?

  • Identifiers interlinking
  • Input from Wikipedians very welcome
  • What’s complete depends on the context / use case / community
  • Gender perspective in Wikidata. Classify by gender
  • Completeness because it is easy to tackle.
  • Data Quality is often defined outside WD
  • Ethical dimension of Google using the data
  • {similarity of items sic.} John Adams! Example -> additional data needed

Question 3: On which data quality dimension would you like to work in the future?

  • Diversity
  • Interlinking (value gained for source)
  • References needed for conflicting data
  • External references (need to replace the internal ones)
  • MORE TRUST
  • External Identifiers
  • Quality ranks for sources


Talk / Workshop Input

Q1: What critical data quality issues did you spot in Wikidata?

  • One topic can have multiple Wikidata entries in different languages.
  • Descriptions in different languages translate differently.
  • There are still old manual interwiki translations left in Wikidata.
  • Better support for differing (and maybe even sourced) data values (like birthdate etc.).
  • First showcase item I checked to get help was wrong → control showcase items more
  • Not every entry has or even needs a reference.
  • Wikipedians cannot create new / missing properties.
  • Some questions about Wikidata are not answered in weeks.
  • Data changes in Wikidata are often not seen by Wikipedians.
  • Completeness and references.
  • How to know when an item is properly complete.
  • Duplication, lack of reference, data from Wikipedia.
  • Different topics can have the same Wikidata item. Need to be separated.
  • Sources.
  • URL reference without a date-visited qualification.
  • Wrong ontologies.
  • Wrong type of instances or subClassOf.
  • Duplicates.
  • The major quality concern, or unused potential of the Wikidata design, is that many claims are unsourced.
  • Confusion about properties pertaining to types of items (e.g., “location” or “located in the administrative entity”): usage is unclear to the average Wikidata editor / contributor.
  • Make it easier to do multiple Q references for the same source, instead of having to copy/paste repeatedly.
  • Gaps in subclass tree.
  • Inconsistent ontology / property usage.
  • BLP problems.
  • Incorrect or inaccurate subclasses being used.
  • Use of wrong properties / inconsistency.
  • Wikiproject standards hidden.
  • Lack of statements.
  • Bad language mappings / false friends.
  • Lack of references.
  • Search results seem to be overtaken by the growing body of article citations; make it easier to filter results appropriate to the search.
  • Missing statements → automatically suggest a range of statements when type of item is chosen.
  • People sometimes don’t understand complex constraints.
  • Old imports from Wikipedia.
  • Too many items still have no statements; they are difficult to improve or merge.
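
One of the issues above, URL references that lack a date-visited qualification, lends itself to a mechanical check. The following is a minimal sketch, not an existing Wikidata tool. It assumes the entity is already loaded as a Python dict in the Wikibase JSON layout (claims, then statements, then references, then snaks), and that P854 is “reference URL” and P813 is “retrieved”. The function name and the sample entity are hypothetical.

```python
REFERENCE_URL = "P854"  # Wikidata property "reference URL"
RETRIEVED = "P813"      # Wikidata property "retrieved"


def url_refs_without_retrieved(entity):
    """Return (property_id, statement_id) pairs where a reference
    carries a URL but no retrieval date."""
    hits = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            for ref in statement.get("references", []):
                snaks = ref.get("snaks", {})
                if REFERENCE_URL in snaks and RETRIEVED not in snaks:
                    hits.append((prop, statement.get("id")))
                    break  # report each statement at most once
    return hits


# Hypothetical, heavily trimmed entity used only for illustration.
sample = {
    "claims": {
        "P569": [
            {"id": "stmt-a", "references": [{"snaks": {"P854": [{}]}}]},
            {"id": "stmt-b", "references": [{"snaks": {"P854": [{}], "P813": [{}]}}]},
        ]
    }
}
print(url_refs_without_retrieved(sample))
```

Here only the first statement would be reported, since its reference has a URL but no retrieval date.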


Q2: How should we organize data quality management in the community?

  • Removing references if value has been modified.
  • DQ Management better procedure.
  • Tools to check duplicates.
  • Tag the quality issue in the UI.
  • Ask for more references.
  • Introduce quality levels for data:
      • Valid reference?
      • Who brought it in?
      • All quality dimensions respected?
  • Give good sight to relevant changes (for articles on my list).
  • More automatic processes.
  • Mediators between the contents and substance community and the technical community.
  • Data quality night: more “canned” templates for common topics, with likely Q’s / properties
  • Have a way to add a check/mark or something like that for verification of data Q’s by editor who didn’t make original Q / reference (could be automated verification maybe?)
  • Measurement / queries tools to help spot weak quality e.g., bad quality image (low definition).
  • Suggest referencing unreferenced statements through a list.
  • “Gamify” by adding a tab where simple fixing tasks are listed (“todo” list).
  • Create “wiki love” campaigns specific for data.
  • Custom editing interfaces for special projects (art, wikicite, biographies).
  • Concentrate more topical / thematic discussions.
  • Standard queries and dashboard to spot bad data.
  • Wikiproject banners more prominent in item view.
  • More tools.
  • More specific metrics.
  • Address problems and point them out, to work more collaboratively.
  • Statement specific talk / spaces.
  • Coordinate a way for editors to show what you want comments on.
  • Anti-vandalism bots
  • Better coordination and discovery of project discussions.
  • Promote the development of data models (findable!). Discuss quality and data models in specific wikiprojects; they need more advertisement.
  • Improve multilingualism also in village pumps. There are too many discussions in the project chat, which is difficult to follow.
  • (Not familiar enough with the Wikidata community to give feedback).
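
Two of the suggestions above, listing unreferenced statements and offering a “todo” list of simple fixes, could be combined. The sketch below shows one way this might look, assuming entities are available as Python dicts in the Wikibase JSON layout, where a statement without sources simply has no `references` entry. The function name `todo_list` and the entity IDs in the sample are hypothetical.

```python
def todo_list(entities):
    """Map each entity ID to the sorted property IDs that have
    at least one statement without any reference."""
    todo = {}
    for entity in entities:
        missing = sorted(
            prop
            for prop, statements in entity.get("claims", {}).items()
            if any(not s.get("references") for s in statements)
        )
        if missing:
            todo[entity["id"]] = missing
    return todo


# Hypothetical, heavily trimmed entities used only for illustration.
entities = [
    {
        "id": "Q-example-1",
        "claims": {
            "P31": [{"references": [{"snaks": {}}]}],  # referenced
            "P569": [{}],                              # no references
        },
    },
    {
        "id": "Q-example-2",
        "claims": {"P31": [{"references": [{"snaks": {}}]}]},
    },
]
print(todo_list(entities))
```

Only entities with something left to fix appear in the result, so the output can be shown directly as a worklist.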


Q3: What did we learn from Wikipedia and/or other WM projects?

  • Keep people from different Wikiprojects involved and improve their understanding of the importance of Wikidata.
  • It’s better to source statements from the beginning and not after a lot of time
  • Problems of data modelling (e.g., a library should have 2 items: the institution and the building) should be solved quickly in order to avoid confusion or work and edit on.
  • It’s possible to have a multilanguage and multicultural project (see for example Wikimedia Commons).
  • Don’t rely on the information in other Wikipedias / projects.
  • Show Wikipedians how they can profit from Wikidata.
  • When I find a wrong fact, it is difficult to correct it. Result: I leave the wrong information on the wiki.
  • Don’t close things down, protect things too quickly.
  • Be more open to institutions and “role accounts”
  • Keep the community fun and innovative, don’t ossify.
  • Inclusionism is a virtue.
  • Be user friendly (e.g., create an equivalent of VisualEditor on Wikidata -- suggest list of statements, etc.)
  • Talk pages don’t get used much, it seems.
  • Structured data is a good thing! (across wiki projects).
  • Better linking between related projects across different Wikimedia projects.
  • Outreach in Education → more Wikidata Use in University Courses + Data Science.
  • We have to work / learn a little bit more from each part of Wikipedia. E.g., developers, admin, user → feel every perspective.
  • Like Wikimedia is one common language, utopic feeling, many people, many languages, and *one* goal.
  • Global and local collaborative working.
  • Write now, cite later.
  • The path from discovering a mistake to correcting it must be short and easy.
  • Maybe like Wikipedia, Wikidata will work in practice, but not in theory.
  • Be more clear on *where* discussion should take place.
  • Translating items into more languages improves quality.
  • That different camps (e.g., inclusionists vs. deletionists) are not opposites but complementary.
  • Bottom-up via communities of interest → so, give it space to organise itself, don’t “organise it”. Bottom-up is the Wiki USP, don’t lose it!
  • Referencing is [...] complicated.