
Rework how value and reference changes are handled
Closed, ResolvedPublic

Description

The current workflow of the updater requires loading the triples from the RDF store before sending an update to it.
This makes the process very sensitive to update order.
It also prevents further refactoring to introduce a queue holding the triples to change (i.e. to avoid calling Special:EntityData once per node for every update).

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse triaged this task as Medium priority.

References and values are identified by a hash computed over their properties. It is not a unique ID, as it is always generated on the fly when extracting the entity data.
The current RDF projection makes each value/reference a resource that is referenced from other triples. The problem is that when some of the properties of the value/reference change, the hash changes too, which makes this process prone to leaving orphans in the RDF store.
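To illustrate how an edit can leave an orphan behind, here is a minimal Python sketch. The hash function, the `hasValue` predicate and the in-memory `store` are all hypothetical stand-ins, not the real Wikibase hashing or RDF mapping; the only point is that a node keyed by a content hash gets a new key whenever its content changes.

```python
import hashlib

def node_hash(properties: dict) -> str:
    """Hypothetical content hash (NOT the real Wikibase algorithm)."""
    canonical = "|".join(f"{k}={properties[k]}" for k in sorted(properties))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Toy triple store: a set of (subject, predicate, object) tuples.
store = set()

def write_value(statement: str, properties: dict) -> str:
    """Write the value node's triples plus the link from the statement to it."""
    h = node_hash(properties)
    for pred, obj in properties.items():
        store.add((h, pred, str(obj)))
    store.add((statement, "hasValue", h))  # "hasValue" is an illustrative predicate
    return h

old = write_value("stmt1", {"amount": "10", "unit": "metre"})

# The statement is edited: its old link is dropped, then the new value is written.
store.discard(("stmt1", "hasValue", old))
new = write_value("stmt1", {"amount": "12", "unit": "metre"})

# The triples of the old value node are still in the store, but nothing points to them.
orphaned = {s for (s, _, _) in store if s == old}
still_referenced = any(o == old for (_, _, o) in store)
```

After the edit, `old` and `new` differ, `orphaned` contains the old hash, and `still_referenced` is false: the old node's triples stay in the store until something explicitly removes them.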
Another complication is that the hashes may be shared with other entities, which makes the cleanup even harder: a value or reference can only be cleaned up when it is no longer used anywhere else. This cleanup is currently done in real time whenever a batch of entities is updated.
The query is as follows:

select ?s ?p ?o WHERE {
  VALUES ?s { list of values/references }
  # Since values are shared we can only clear the values on them when they are no longer used
  # anywhere else.
  
  FILTER NOT EXISTS {
    ?someEntity ?someStatementPred ?s .
    FILTER(?someStatementPred != wikibase:quantityNormalized)
  }
  ?s ?p ?o .
}

and is done once for values and once for references.
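The same selection logic can be sketched in Python over an in-memory set of triples. This is only an illustration of what the query selects, not the updater's code; the `orphan_triples` helper and the sample data are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
QUANTITY_NORMALIZED = "wikibase:quantityNormalized"

def orphan_triples(store: Set[Triple], candidates: Iterable[str]) -> Set[Triple]:
    """Return the triples of candidate nodes that nothing else points to.

    Mirrors the FILTER NOT EXISTS clause: a link made through
    wikibase:quantityNormalized does not count as a use.
    """
    candidates = set(candidates)
    referenced = {
        o for (_, p, o) in store
        if o in candidates and p != QUANTITY_NORMALIZED
    }
    return {t for t in store if t[0] in candidates and t[0] not in referenced}

# Illustrative data: wdv:aaa is still used by a statement, wdv:bbb is not.
sample_store = {
    ("wds:stmt1", "psv:P2048", "wdv:aaa"),
    ("wdv:aaa", "wikibase:quantityAmount", "10"),
    ("wdv:aaa", QUANTITY_NORMALIZED, "wdv:aaa"),
    ("wdv:bbb", "wikibase:quantityAmount", "12"),
    ("wdv:bbb", QUANTITY_NORMALIZED, "wdv:bbb"),
}

to_delete = orphan_triples(sample_store, ["wdv:aaa", "wdv:bbb"])
```

Here `to_delete` holds only the two `wdv:bbb` triples. The self-link through `wikibase:quantityNormalized` does not keep `wdv:bbb` alive, which is why the query excludes that predicate.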
We should probably try to monitor how much time is spent trying to clean up orphaned values & references, and also count how many values & references are duplicated in the dump.
This would help answer the following questions:

  • is it worthwhile to continue to do orphan detection during realtime updates
  • is it worthwhile to investigate an offline method to prune orphaned values & references
  • is it worthwhile to dedup values&references at import time

References and values are identified by a hash computed over their properties. It is not a stable ID as it is always generated on the fly when extracting the entity data.

They’re fairly stable in practice, though – I think it’s been a while since the last time we broke the hashes. See also T167759: Reference hash is not stable for more discussion.

But also count how many values & references are duplicated in the dump.

A full ?reference (COUNT(*)) GROUP BY ?reference probably isn’t possible on the live query service, but “imported from English Wikipedia” seems like a good contender for one of the most common references, and is used on 14,021,522 statements. Imports from German (4,987,584) and Russian Wikipedia (3,035,493) are also common, though I suspect they’re beaten by some external database (“stated in”) that I can’t think of right now.

Some numbers extracted from a dump:

  • number of values: 20,659,551
  • number of unique values: 11,028,526
  • number of references: 60,078,314
  • number of unique references: 58,876,057

So to the question:

is it worthwhile to dedup values&references at import time

Probably not: given that we have between 3 and 5 triples per value, this would save roughly 27M to 45M duplicate triple inserts out of ~8B triples.
For references it's clearly not worthwhile.
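As a sanity check on the order of magnitude, the saving can be recomputed from the dump counts above. With the unrounded duplicate count the range comes out a little higher (about 29M to 48M triples), which does not change the conclusion. The 3 to 5 triples per value and the ~8B total are the figures quoted in this comment; the variable names are only illustrative.

```python
# Counts taken from the dump figures listed above.
values_total = 20_659_551
values_unique = 11_028_526
refs_total = 60_078_314
refs_unique = 58_876_057

dup_values = values_total - values_unique  # 9,631,025 duplicated value nodes
dup_refs = refs_total - refs_unique        # 1,202,257 duplicated reference nodes

# Between 3 and 5 triples per value node.
low, high = dup_values * 3, dup_values * 5

total_triples = 8_000_000_000  # ~8B triples, approximate

print(f"duplicate values: {dup_values:,} -> {low:,} to {high:,} triples")
print(f"share of the store: {low / total_triples:.2%} to {high / total_triples:.2%}")
print(f"duplicate references: {dup_refs:,} ({dup_refs / refs_total:.1%} of references)")
```

Either way the saving is well under 1% of the store, and only about 2% of references are duplicated at all.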

References and values are identified by a hash computed over their properties. It is not a stable ID as it is always generated on the fly when extracting the entity data.

They’re fairly stable in practice, though – I think it’s been a while since the last time we broke the hashes. See also T167759: Reference hash is not stable for more discussion.

Thanks for the link and the context. "Stable ID" was probably misleading; I'll rephrase it as "non-unique ID".

Change 556032 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] Track existing values and refs outside of the munger

https://gerrit.wikimedia.org/r/556032

Change 556032 merged by jenkins-bot:
[wikidata/query/rdf@master] Track existing values and refs outside of the munger

https://gerrit.wikimedia.org/r/556032

The munger has been reworked so that it no longer deals with this cleanup. The next-gen updater will address this cleanup in a different way. For the current updater, one thing to keep in mind is that the ref cleanup was disabled some time ago (while investigating T194325: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/437362) and has never been re-enabled since. We could imagine disabling the values cleanup as well; this could give us some room with the current updater.