Implement Lua access to Lexemes, Senses and Forms
Closed, Resolved (Public)

Description

Task to collect some preliminary work on T212843: [EPIC] Access to Wikidata's lexicographical data from Wiktionaries and other WMF sites. This initial implementation will likely not feature fine-grained usage tracking yet, and parser functions are out of scope for now.

Event Timeline

Change 544205 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Add rudimentary mw.wikibase.lexeme Lua module

https://gerrit.wikimedia.org/r/544205

Change 544206 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Add rudimentary mw.wikibase.lexeme.entity.lexeme Lua module

https://gerrit.wikimedia.org/r/544206

Change 544207 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Make mw.wikibase.lexeme.entity.lexeme inherit mw.wikibase.entity

https://gerrit.wikimedia.org/r/544207

Change 544208 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Specify Lua module to be used for Lexeme entities

https://gerrit.wikimedia.org/r/544208

Change 544234 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Add documentation for rudimentary Lua modules

https://gerrit.wikimedia.org/r/544234

The patches linked above add support for code of the following sort:

mw.wikibase.lexeme.getLanguage( 'L1' )
mw.wikibase.getEntity( 'L2' ):getLexicalCategory()

Missing features:

  • Lua modules for Senses and Forms, likewise wired up with mw.wikibase.getEntity()
  • getSenses() and getForms() functions/methods in the Lexeme modules, returning “instances” of the corresponding modules

Also, lots of cleanup and testing is probably still needed.

Usage tracking is also going to be interesting. Currently, it’s strictly entity-based, as far as I can see (as opposed to page-based), both on the repo (wb_changes_subscription) and on the client (wbc_entity_usage). Does this mean that a Wiktionary page for one lexeme may end up with dozens, if not hundreds, of wbc_entity_usage rows, one per form (and aspect)? Or should we say that entity usage stops at subentities, and any usage of a lexeme implies usage of all of its forms? Or do we somehow group usages together, as is done for other aspects, and turn form usages into one “all forms of this lexeme” usage once they exceed a certain threshold?
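The third option (grouping form usages past a threshold) could look roughly like the sketch below. This is only an illustration in plain Lua, not how Wikibase tracks usage: the function name collapseFormUsages, the threshold parameter and the "#forms" key are all made up for the example, and "#X" is used to stand for an “all” usage.

```lua
-- Hypothetical sketch: collapse many per-form usages into one "all forms"
-- usage. Neither collapseFormUsages nor the "#forms" key exists in Wikibase.
local function collapseFormUsages( formIds, threshold )
	-- Group Form IDs such as "L1-F3" by their Lexeme prefix ("L1"),
	-- remembering the order in which Lexemes were first seen.
	local byLexeme = {}
	local order = {}
	for _, formId in ipairs( formIds ) do
		local lexemeId = string.match( formId, '^(L%d+)%-F%d+$' )
		if lexemeId then
			if not byLexeme[lexemeId] then
				byLexeme[lexemeId] = {}
				table.insert( order, lexemeId )
			end
			table.insert( byLexeme[lexemeId], formId )
		end
	end

	local usages = {}
	for _, lexemeId in ipairs( order ) do
		local forms = byLexeme[lexemeId]
		if #forms > threshold then
			-- Too many individual rows: record a single grouped usage.
			table.insert( usages, lexemeId .. '#forms' )
		else
			-- Few enough: keep one "all" usage per form.
			for _, formId in ipairs( forms ) do
				table.insert( usages, formId .. '#X' )
			end
		end
	end
	return usages
end
```

With a threshold of 2, three forms of L1 would collapse into one grouped usage, while a single form of L2 would keep its own row.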

Change 545377 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Add all-usage for all subentities

https://gerrit.wikimedia.org/r/545377

Change 545378 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Add getLemmas function to Lua modules

https://gerrit.wikimedia.org/r/545378

Change 545379 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Add Lua module for Forms

https://gerrit.wikimedia.org/r/545379

Change 545537 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Add Lua module for Senses

https://gerrit.wikimedia.org/r/545537

Change 544205 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add rudimentary mw.wikibase.lexeme Lua module

https://gerrit.wikimedia.org/r/544205

Change 544206 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add rudimentary mw.wikibase.lexeme.entity.lexeme Lua module

https://gerrit.wikimedia.org/r/544206

Change 544207 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Make mw.wikibase.lexeme.entity.lexeme inherit mw.wikibase.entity

https://gerrit.wikimedia.org/r/544207

Change 544208 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Specify Lua module to be used for Lexeme entities

https://gerrit.wikimedia.org/r/544208

Change 544234 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add documentation for rudimentary Lua modules

https://gerrit.wikimedia.org/r/544234

Change 545377 abandoned by Lucas Werkmeister (WMDE):
Add all-usage for all subentities

Reason:
not necessary after all

https://gerrit.wikimedia.org/r/545377

Change 545378 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add getLemmas function to Lua modules

https://gerrit.wikimedia.org/r/545378

Change 550662 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Change function declarations to Lua style

https://gerrit.wikimedia.org/r/550662

Change 554116 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Capitalize Lexeme more consistently

https://gerrit.wikimedia.org/r/554116

Change 554117 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Add mw.wikibase.lexeme.splitLexemeId function

https://gerrit.wikimedia.org/r/554117

Change 554116 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Capitalize Lexeme more consistently

https://gerrit.wikimedia.org/r/554116

Change 554117 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add mw.wikibase.lexeme.splitLexemeId function

https://gerrit.wikimedia.org/r/554117

One request: could we guard the code behind a per-project feature flag? That way we can deploy it but switch it on and off through configuration.

It already is behind a feature flag, $wgLexemeEnableDataTransclusion (after all, the first changes were already merged).

Change 545379 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Add Lua module for Forms

https://gerrit.wikimedia.org/r/545379

Change 545537 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Add Lua module for Senses

https://gerrit.wikimedia.org/r/545537

Change 550662 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Change function declarations to Lua style

https://gerrit.wikimedia.org/r/550662

This is now merged and will ship (behind a feature flag) in 1.38.0-wmf.6.

The currently merged code tracks lots of ‘X’ (“all”) usages, but it still doesn’t track enough usage. Specifically, if you use mw.wikibase.getEntity( 'L1-S1' ), then the page will get a usage for L1-S1#X, but not for L1; and because we only look for pages using L1 when dispatching changes, the page won’t be notified when the lexeme is edited, and may continue to show outdated data.

I think fixing this is a hard requirement before we enable lexeme data transclusion in production. The easiest solution would be to make sure that mw.wikibase.getEntity( 'L1-S1' ) also tracks an L1#X usage; I’ll see if I can make that work.
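To illustrate the idea of recording the usage on the parent Lexeme, here is a minimal plain-Lua sketch that maps any Lexeme, Sense or Form ID to the “all” usage of its Lexeme. The helper name lexemeUsageFor is hypothetical and is not part of WikibaseLexeme; the actual change may well be implemented differently.

```lua
-- Hypothetical helper, not part of WikibaseLexeme: map a Lexeme, Sense or
-- Form ID to the "all" (X) usage of its parent Lexeme.
local function lexemeUsageFor( entityId )
	-- Accept "L1", "L1-S1" or "L1-F1" and capture the leading Lexeme ID.
	local lexemeId = string.match( entityId, '^(L%d+)$' )
		or string.match( entityId, '^(L%d+)%-[SF]%d+$' )
	if not lexemeId then
		-- Not a Lexeme-related ID (for example an Item ID).
		return nil
	end
	return lexemeId .. '#X'
end
```

Under this sketch, both 'L1-S1' and 'L1-F1' yield the usage 'L1#X', so a page using only a Sense or Form would still be found when changes to L1 are dispatched.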

Change 732998 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseLexeme@master] Track “all” usage for whole Lexeme instead of Sense/Form

https://gerrit.wikimedia.org/r/732998

Hm, there’s another thing that I forgot wasn’t done yet: the senses (and probably forms) of a returned lexeme entity aren’t entities themselves, they’re ordinary tables. Only the custom getForms() and getSenses() methods take care of properly creating entities.

mw.wikibase.getEntity('L1').senses[1]:getGlosses()
-- error: attempt to call method 'getGlosses' (a nil value).
mw.wikibase.getEntity('L1'):getSenses()[1]:getGlosses()
-- works

This isn’t as serious as the other issue – by the time getEntity('L1') returns, we’ve already tracked an “all” usage on L1, so being able to get the senses/forms without proper metatables doesn’t constitute a bypass of usage tracking or anything – but it’s still kind of strange, I guess…
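The difference between the two calls comes down to metatables. The following plain-Lua sketch shows why a method call fails on a bare table and succeeds once a metatable is attached. The Sense table here is a simplified stand-in written for this example, not the real module, and the glosses data is deliberately reduced to a single string per language.

```lua
-- Simplified stand-in for a Sense module; the real one is provided by
-- WikibaseLexeme and is more involved.
local Sense = {}
Sense.__index = Sense

function Sense:getGlosses()
	return self.glosses
end

-- A plain table, comparable to an element of .senses: data only, no methods.
local data = { id = 'L1-S1', glosses = { en = 'example gloss' } }

-- Without a metatable, data.getGlosses is nil, so the method call errors.
local okBefore = pcall( function() return data:getGlosses() end )

-- Attaching the metatable makes method lookups fall through to Sense.
setmetatable( data, Sense )
local okAfter, glosses = pcall( function() return data:getGlosses() end )
```

Here okBefore is false and okAfter is true, which mirrors the behaviour of .senses[1] versus getSenses()[1] described above.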

Change 732998 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Track “all” usage for whole Lexeme instead of Sense/Form

https://gerrit.wikimedia.org/r/732998

The senses and forms of a returned lexeme entity aren’t entities themselves, they’re ordinary tables. Only the custom getForms() and getSenses() methods take care of properly creating entities.

I think we can leave this open for feedback after the initial Beta rollout. Should getForms() and getSenses() exist at all? Or should .forms and .senses contain entity objects already? And in either case, should they be indexed numerically (1, 2, …) or by ID (L1-F1, L1-F2, … – or just F1, F2, …?)? Maybe the initial testers have some feedback on this.
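For comparison, an ID-keyed view can be derived from a numerically indexed list on the caller’s side, so either indexing choice can be converted into the other. The sketch below uses plain Lua; indexById is a hypothetical helper, and it assumes each element exposes its ID in an id field, which is true of the stand-in tables used here but is an assumption for real entity objects.

```lua
-- Hypothetical helper: build an ID-keyed lookup from a numerically indexed
-- list. Assumes each element carries its ID in an `id` field.
local function indexById( entities )
	local byId = {}
	for _, entity in ipairs( entities ) do
		byId[entity.id] = entity
	end
	return byId
end

-- Stand-in data shaped like a list of forms.
local forms = { { id = 'L1-F1' }, { id = 'L1-F2' } }
local formsById = indexById( forms )
```

After this, formsById['L1-F2'] refers to the same table as forms[2]. Note that the ID-keyed table loses the original order, which is one argument for keeping numeric indexes as the primary form.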

I think we can leave .forms and .senses as they are at the moment – not documented as part of the stable interface, but not particularly hidden either. Similar to the .claims on all entities (I suppose they’re .statements on MediaInfo?), where we expect users to use :getAllStatements() and other functions instead.

Change 805771 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseLexeme@master] Declare Lexeme Lua interface stable

https://gerrit.wikimedia.org/r/805771

Change 805771 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Declare Lexeme Lua interface stable

https://gerrit.wikimedia.org/r/805771