Wikidata:Lexicographical data/Documentation

This is the main documentation page for lexicographical data on Wikidata. It is intended to describe general information about Wikidata lexemes: the way they are structured, how one may edit them, and what may be added to enrich them.

Note that while the information on this page may be broadly applicable across most languages, what works for modeling one language will not always work for modeling another language. For information about modeling lexemes for specific languages, visit the documentation pages for them.

More technical documentation may separately be found for the WikibaseLexeme extension for MediaWiki, which provides support for lexemes on Wikidata.

A Glossary of Wikidata Lexicographical terms is available.

Data model

Visualization of the Lexeme data model

The data model of WikibaseLexeme describes the structure of the data that is handled as "lexemes" in Wikidata. The text below is merely a summary; for more detailed information, see the corresponding WikibaseLexeme documentation page.

A lexeme is a lexical element of a language, such as a word, a phrase, or a prefix. (More information about lexemes in general may be found on Wikipedia.)

Lexemes, like items and properties, are Wikibase entities; they too have individual identifiers and can be separately accessed and queried.

There are seven components of a lexeme, described in each of the following subsections:

  1. its LID;
  2. its lemmata;
  3. its language;
  4. its lexical category;
  5. its (top-level) statements;
  6. its senses; and
  7. its forms.
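
As a rough illustration of how these seven components fit together, the following Python sketch models a lexeme as plain data classes. The class and field names are hypothetical and chosen only for this example; they are not the WikibaseLexeme JSON or PHP structures. The QID used for the lexical category (Q1084, assumed here to be the item for "noun") should be verified before reuse.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    sense_id: str                                           # e.g. "L3746552-S4"
    glosses: dict[str, str] = field(default_factory=dict)   # language code -> gloss
    statements: dict[str, list] = field(default_factory=dict)


@dataclass
class Form:
    form_id: str                                            # e.g. "L3746552-F4"
    representations: dict[str, str] = field(default_factory=dict)
    grammatical_features: list[str] = field(default_factory=list)  # item QIDs
    statements: dict[str, list] = field(default_factory=dict)


@dataclass
class Lexeme:
    lid: str                                                # 1. lexeme ID
    lemmata: dict[str, str]                                 # 2. language tag -> lemma
    language: str                                           # 3. QID of a language item
    lexical_category: str                                   # 4. QID of a category item
    statements: dict[str, list] = field(default_factory=dict)  # 5. top-level statements
    senses: list[Sense] = field(default_factory=list)       # 6. senses
    forms: list[Form] = field(default_factory=list)         # 7. forms


# The English lexeme L3435 ('umbrella'), with only its basic information filled in.
umbrella = Lexeme(
    lid="L3435",
    lemmata={"en": "umbrella"},
    language="Q1860",            # English
    lexical_category="Q1084",    # assumption: the item for "noun"
)
```

Keying lemmata by language tag mirrors the rule that a lexeme may have only one lemma per tag.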

Lexeme ID

Lexemes have identifiers starting with an "L" followed by a number using the digits 0-9, such as L3746552. These IDs (often called "LIDs", for "lexeme identifiers") are unique within Wikidata and are assigned automatically when a lexeme is created.

The RDF URI for a lexeme is http://www.wikidata.org/entity/ followed by the lexeme ID.
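
A minimal sketch of the two rules above, assuming only what this section states: an LID is "L" followed by decimal digits, and the RDF URI is the entity base followed by the LID. The function and constant names are illustrative.

```python
import re

ENTITY_BASE = "http://www.wikidata.org/entity/"
LID_PATTERN = re.compile(r"L[0-9]+")  # "L" followed by digits 0-9


def lexeme_uri(lid: str) -> str:
    """Return the RDF URI for a lexeme ID, rejecting malformed IDs."""
    if not LID_PATTERN.fullmatch(lid):
        raise ValueError(f"not a lexeme ID: {lid!r}")
    return ENTITY_BASE + lid


print(lexeme_uri("L3746552"))  # http://www.wikidata.org/entity/L3746552
```

Note that the URI uses the http scheme: it is an identifier in the RDF data rather than the address of the page a reader would visit.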

Lexeme lemmata

The lemmata (singular lemma) of a lexeme are primarily used as human-readable representations of the lexeme. Each lemma consists of a string accompanied by a valid IETF language tag. Usually lemmata are the written forms of a word, phrase, or affix that would be found in a dictionary describing them, whether they are considered the 'base' or 'stem' forms morphologically or not.

  • e.g. the English lexeme Lexeme:L3435 has the lemma 'umbrella' because most English dictionaries provide information about this lexeme under the heading 'umbrella' and not under something like 'umbrellas' or "umbrella's" or "umbrellas'".
  • e.g. the Italian lexeme Lexeme:L1196965 has the lemma 'volare' because most Italian dictionaries provide information about it under that heading and not under something like 'volo', 'volante', or 'volato'.
  • e.g. the Korean lexeme Lexeme:L17 has the lemma '먹다' because most Korean dictionaries provide information about it under that form, rather than something like '먹-', '먹어', or even '먹습니다'.

Lexemes can have several lemmata, particularly when there are differences in the writing system or other orthographic conventions within a given language. Different lemmata are indicated with different language tags, and a lexeme may only have one lemma for a given language tag.

  • e.g. the Hindustani lexeme Lexeme:L641622 has two lemmata, 'चाचा' with code hi and 'چاچا' with code ur, which are representations of the same dictionary form (pronounced /t͡ʃɑː.t͡ʃɑː/) in the Devanagari script (used for Hindi) and the Arabic script (used for Urdu).
  • e.g. the Hebrew lexeme Lexeme:L63672 has two lemmata, 'אדום' with code he and 'אָדֹם' with code he-x-Q21283070, which reflect differences in how the same word form is spelled depending on whether diacritics are present.
  • e.g. the Southern Min lexeme Lexeme:L308008 has three lemmata, '城市' with code nan-hani, 'siânn-tshī' with code nan-x-Q56929, and 'siâⁿ-chhī' with code nan-x-Q559173. These represent using either Chinese characters or one of two romanization systems, each corresponding to the same word form.

Note that some of the language codes above contain an -x- in them. There are two main reasons this would be present in a language code:

  1. For languages whose language codes are not yet supported, a last-resort option is to use the base code mis with a private-use subtag containing the QID of the Wikidata item for the language.
  2. If a language has a supported language code, but a variation whose language code is not supported, the private-use subtag may be attached directly to the existing supported code.
    • e.g. lexemes in the Varendri (Q48726757) dialect of Bengali, such as Lexeme:L672268, have a lemma with the code bn-x-Q48726757 (where 'bn' is the existing supported code).
    • e.g. lemmata in Devanagari Sindhi (Q116688933) for lexemes in Sindhi use the language code sd-x-q116688933 (where 'sd' is the existing supported code).
    • e.g. lemmata in Adlam (Q19606346) for lexemes in Fula use the language code ff-x-q19606346 (where 'ff' is the existing supported code).
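
The two conventions above can be sketched as a pair of small helpers. This is an illustration under stated assumptions, not a validator for IETF language tags: the function names are hypothetical, and the code only handles the base-x-QID shape described on this page.

```python
import re
from typing import Optional

QID_PATTERN = re.compile(r"[Qq][0-9]+")  # both Q... and q... appear in the examples above


def private_use_code(qid: str, base: str = "mis") -> str:
    """Build a code such as 'mis-x-Q123' (unsupported language)
    or 'bn-x-Q48726757' (unsupported variety of a supported language)."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"not a QID: {qid!r}")
    return f"{base}-x-{qid}"


def split_private_use(code: str) -> tuple[str, Optional[str]]:
    """Split a code into (base code, private-use QID or None)."""
    base, separator, private = code.partition("-x-")
    return (base, private if separator else None)


print(private_use_code("Q48726757", base="bn"))   # bn-x-Q48726757
print(split_private_use("he-x-Q21283070"))        # ('he', 'Q21283070')
print(split_private_use("nan-hani"))              # ('nan-hani', None)
```

Leaving base at its default of mis corresponds to the last-resort case in item 1; passing a supported code corresponds to item 2.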

Lexeme lemmata are what are displayed when using the {{L}} template to link to a lexeme on Wikidata (including later on this page).

Lexeme language

The language to which a lexeme belongs is a reference to a Wikidata item for a language.

For most languages, this is a straightforward determination: English (Q1860), Thai (Q9217), Manchu (Q33638), and Gun (Q3111668) are just four of the many possibilities, since they have supported language codes en, th, mnc, and guw.

Some languages, however, require that particular language items be used for their lexemes; see the documentation pages for those languages for more information.

Lexical category

The lexical category to which a lexeme belongs is a reference to a Wikidata item for a particular group of words with specific syntactic behavior in a language. This usually corresponds with the "part of speech" of the lexeme: nouns, verbs, adjectives, adverbs, and so on.

The lexical category of a lexeme should be fairly general, reflecting broadly how the lexeme behaves syntactically in its language, even where a more specific description would also be accurate. More specific items like count noun (Q1520033), separable verb (Q3254028), and relative pronoun (Q1050744), where applicable, should be added as the values of instance of (P31) statements instead.

Different languages may use different lexical categories, but some are frequent enough across languages that a comparison may be made. See the full documentation page on lexical categories for a table comparing such categories across languages.

Lexeme statements

Lexemes, like items or properties, have statements (claims) that provide information about the lexeme that is not specific to one of its forms or senses. Depending on how a particular language works, and depending on the lexical category of the lexeme, some statements will be more applicable to a given lexeme than others.

Many common properties applicable directly to lexemes are listed in Template:Lexicographical properties.

Lexeme senses

Senses describe the different meanings of a lexeme.

A sense consists of three parts: 1) the sense ID, 2) glosses, and 3) statements.

  1. The sense ID starts with the ID of the lexeme it belongs to, followed by a hyphen ("-") and an "S", followed by a natural number in decimal notation: e.g. L3746552-S4. These IDs are unique within Wikidata; when a new sense is created within a lexeme, an entirely new sense ID is provided for it. Like an LID, a sense ID may be appended to http://www.wikidata.org/entity/ to form a unique URI for the sense.
  2. Glosses define the meaning of the sense using natural language. For a lexeme in a given language X, the gloss in language X should be a more detailed explanation of the meaning of the sense, while the glosses in other languages Y and Z may be less detailed, as long as they explain the meaning to speakers of Y and Z clearly enough.
  3. Like lexemes, items, and properties, senses can have statements further describing the sense and its relations to other senses and to Wikidata items.
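
The structure of a sense ID described in item 1 can be checked and taken apart with a short pattern. The sketch below relies only on the format stated above (lexeme ID, hyphen, "S", decimal number); the helper names are illustrative.

```python
import re

ENTITY_BASE = "http://www.wikidata.org/entity/"
SENSE_ID_PATTERN = re.compile(r"(L[0-9]+)-S([0-9]+)")


def parse_sense_id(sense_id: str) -> tuple[str, int]:
    """Split a sense ID into (lexeme ID, sense number)."""
    match = SENSE_ID_PATTERN.fullmatch(sense_id)
    if match is None:
        raise ValueError(f"not a sense ID: {sense_id!r}")
    return match.group(1), int(match.group(2))


def sense_uri(sense_id: str) -> str:
    """Return the URI for a sense, validating the ID first."""
    parse_sense_id(sense_id)
    return ENTITY_BASE + sense_id


print(parse_sense_id("L3746552-S4"))  # ('L3746552', 4)
print(sense_uri("L3746552-S4"))       # http://www.wikidata.org/entity/L3746552-S4
```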

Many common properties applicable to lexeme senses are listed in Template:Lexicographical properties.

Lexeme forms

Forms describe the different realizations of a lexeme in speech or writing.

Depending on how a language behaves morphologically, there may be exactly one form of a lexeme or there may be multiple forms. In general, languages that are strongly isolating or analytic, as well as those that are strongly agglutinative or polysynthetic, may benefit from having one form per lexeme. Lexemes in many fusional languages, by contrast, typically have multiple forms, one for each particular combination of grammatical features.

A form consists of four parts: 1) the form ID, 2) form representations, 3) grammatical features, and 4) statements.

  1. The form ID starts with the ID of the lexeme it belongs to, followed by a hyphen ("-") and an "F", followed by a natural number in decimal notation: e.g. L3746552-F4. These IDs are unique within Wikidata; when a new form is created within a lexeme, an entirely new form ID is provided for it. Like an LID or a sense ID, a form ID may be appended to http://www.wikidata.org/entity/ to form a unique URI for the form.
  2. Form representations are strings with language tags that signify how a particular form is used. As with lemmata, there may be multiple representations on a single form to handle differences in writing system or orthographic variation within a language.
  3. Grammatical features are references to Wikidata items that define the syntactic circumstances in which a given form applies.
  4. Like lexemes, senses, items, and properties, forms can have statements further describing the form and its relations to other forms and to Wikidata items.
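
To illustrate how grammatical features identify a form, the sketch below selects the forms whose feature set contains every requested feature. Everything here is illustrative: the Form class is hypothetical, the form IDs and their features were not checked against the live lexeme L3435, and the QIDs assumed for "singular" (Q110786) and "plural" (Q146786) should be verified before reuse.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    form_id: str
    representations: dict[str, str]                       # language tag -> string
    grammatical_features: frozenset[str] = field(default_factory=frozenset)


def forms_matching(forms, wanted_features):
    """Return the forms that carry all of the wanted grammatical features."""
    wanted = frozenset(wanted_features)
    return [form for form in forms if wanted <= form.grammatical_features]


SINGULAR = "Q110786"  # assumption: item for "singular"
PLURAL = "Q146786"    # assumption: item for "plural"

forms = [
    Form("L3435-F1", {"en": "umbrella"}, frozenset({SINGULAR})),
    Form("L3435-F2", {"en": "umbrellas"}, frozenset({PLURAL})),
]

plural_forms = forms_matching(forms, {PLURAL})
print([form.representations["en"] for form in plural_forms])  # ['umbrellas']
```

A subset test is used rather than equality so that a query for one feature still matches forms that carry several, as is common in fusional languages.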

Many common properties applicable to lexeme forms are listed in Template:Lexicographical properties.

Lexeme inclusion criteria

In some cases or languages, there may be multiple entities for related words, whereas in other languages, there may be just one. The table below provides an overview of how nouns in particular may be linked:

One or several lexemes for nouns?

difference in | 1 lexeme | 2+ lexemes
sense | add several senses | add applicable sense to lexeme; link other(s) with homograph lexeme; duplicate forms on each
etym. | add etym. to each sense | add etym. to lexeme base; link other(s) with homograph lexeme; duplicate forms on each
gender | add gender to each sense | add gender to lexeme base; link other(s) with homograph lexeme; duplicate forms on each
common/proper | add several senses; use lexical category "noun" | add applicable sense to lexeme; link other(s) with homograph lexeme; duplicate forms on each
caps/lowercase | add several forms; qualify forms to applicable senses | add applicable sense to lexeme; link other(s) with homograph lexeme; add only applicable forms
singular/plural | add several forms; qualify forms to applicable senses | add applicable sense if possible; link other(s) with homograph lexeme; add only applicable forms
pronunciation | add the same form twice; qualify forms to applicable senses, add pronunciation | add applicable sense if possible; link other(s) with homograph lexeme; add form and applicable pronunciation
forms/spelling | add several forms or alternate forms; qualify forms to applicable senses | add applicable sense if possible; link other(s) with homograph lexeme; add only applicable forms

For a given language and criterion (first column), just one of the two might apply.

Interface

The following section details steps to take in Wikidata's user interface to perform common tasks involving editing lexemes.

Lexemes

Screenshot of the lexeme creation page (as it appeared prior to November 2022)

Create a new lexeme

  1. Go to Special:NewLexeme.
  2. Under "Lemma", enter a lemma (see Lexeme lemmata for more information).
  3. Under "Lexeme's language", enter the language of the lexeme, either by typing the name of the language or its QID (see Lexeme language for more information).
    1. If you are prompted to do so, under "Spelling variant of the Lemma", enter the language code of the lemma (see Lexeme lemmata for more information).
  4. Under "Lexical category", enter the lexical category of the lexeme, either by typing its name or its QID (see Lexical category for more information).
  5. Click "Create" to save your changes.
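
For readers who script their edits, the same basic information can be expressed as request parameters for the MediaWiki Action API. The sketch below only builds the parameters and sends nothing. It assumes the wbeditentity module with new=lexeme and the JSON keys lemmas, language, and lexicalCategory; check these against the WikibaseLexeme documentation before relying on them. A real request would also need an authenticated session and a CSRF token, which are omitted here. The helper name is hypothetical, and Q1084 is assumed to be the item for "noun".

```python
import json


def new_lexeme_params(lemma: str, lemma_code: str,
                      language_qid: str, category_qid: str) -> dict[str, str]:
    """Build (but do not send) API parameters for creating a lexeme."""
    data = {
        # Step 2 and step 3.1: the lemma and its language code ("spelling variant")
        "lemmas": {lemma_code: {"language": lemma_code, "value": lemma}},
        # Step 3: the lexeme's language, as a QID
        "language": language_qid,
        # Step 4: the lexical category, as a QID
        "lexicalCategory": category_qid,
    }
    return {
        "action": "wbeditentity",
        "new": "lexeme",
        "data": json.dumps(data, ensure_ascii=False),
        "format": "json",
    }


params = new_lexeme_params("umbrella", "en", "Q1860", "Q1084")
print(params["data"])
```

Before creating a lexeme by any route, it is worth searching for an existing one with the same lemma, language, and lexical category to avoid duplicates.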

You have now created a lexeme with the most basic information. Because it contains so little, it cannot meaningfully be used until more information is added to it, such as statements, senses, and forms (for which see later in this page).

Edit a lexeme's lemmata, language, or lexical category

Screenshot of the top of a Lexeme page
  1. Next to the lemmata, click the 'edit' button.
  2. Lemmata may be edited as follows:
    1. To add a lemma, first click the "+" that appears beside the lemmata.
    2. In the new lemma, under "Lemma", add the representation of the new lemma.
    3. Also in the new lemma, under "Spelling variant", add its language code.
    4. To remove a particular lemma, simply click the "x" appearing beside "Lemma" in that lemma.
  3. To change the language of the lexeme, use the search box appearing beside "Language" to pick an item for a language.
  4. To change the lexical category of the lexeme, use the search box appearing beside "Lexical category" to pick an item for a lexical category.
  5. Click "publish" to save your changes.

Add, edit or delete a lexeme's statements

Screenshot of the interface to edit a statement

Adding a statement to a lexeme entails the following steps:

  1. Click "add statement".
  2. Enter a property, typing its name in the property field (such as derived from lexeme (P5191)) and selecting it in the suggester.
  3. Enter a value for the property.
    Note: Properties that take senses as values (each a Wikidata property for lexicographic senses (Q54275340)), such as translation (P5972) or synonym (P5973), do not currently support searching for senses, either by lexeme lemmata or by sense glosses. To enter a value for such a statement, you need to enter the precise sense ID of the sense you want.
    Screenshot: searching by name does not find lexemes or their senses.

    Screenshot: searching by a precise sense ID returns a publishable result.
  4. If you wish to add qualifiers and references to the statement, feel free to do so.
  5. Save the statement by clicking "publish".
  6. To edit a statement, click "edit".
  7. To delete a statement, click "edit", then click "remove".

Delete a lexeme

To delete a lexeme, you may request its deletion at Wikidata:Requests for deletions, just as is done with items. If you have the Merge gadget enabled, you may submit deletion requests for lexemes using it.

Search for a Lexeme

To look for a lexeme via Special:Search or the search box on any page, you may use its LID, one of its lemmata, or a representation of one of its forms.

The simplest way to do this is to prefix "L:" to one of these, and you will automatically see results in the Lexeme namespace for your search. For example, lexeme L301993 has the lemma "হৃদয়" and one of its forms has the representation "হৃদয়েতে". Searching for "L:L301993", "L:হৃদয়", or "L:হৃদয়েতে" will return the same lexeme in the results.

You may alternatively search without the "L:" prefix (e.g. using "L301993", "হৃদয়", or "হৃদয়েতে"), then select the "Lexeme" namespace under "Search in:" and rerun the search to get the same lexeme returned.

Note that the selector (the drop-down menu that pops up to suggest results) does not support the Lexeme namespace yet. Pressing Enter or clicking the search icon after typing your keyword, however, will show you the results.
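
The "L:" prefix can also be used when building a search link programmatically. The sketch below assumes the usual MediaWiki URL layout for Special:Search (index.php with title and search parameters); the helper name is hypothetical, and no request is actually made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_ENDPOINT = "https://www.wikidata.org/w/index.php"  # assumed URL layout


def lexeme_search_url(term: str) -> str:
    """Return a Special:Search URL restricted to lexemes via the 'L:' prefix."""
    query = urlencode({"title": "Special:Search", "search": f"L:{term}"})
    return f"{SEARCH_ENDPOINT}?{query}"


# An LID, a lemma, and a form representation can all be used as the term.
for term in ("L301993", "হৃদয়", "হৃদয়েতে"):
    url = lexeme_search_url(term)
    # Round-trip check: the percent-encoded query decodes back to the prefixed term.
    assert parse_qs(urlsplit(url).query)["search"] == [f"L:{term}"]
    print(url)
```

Percent-encoding is left to urlencode so that non-Latin lemmata are encoded as UTF-8 without manual handling.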

Senses

Create a new sense

  1. In the "Senses" section of a lexeme, click "add Sense".
  2. Under "Language", enter a language code for the gloss.
  3. Under "Gloss", enter the gloss.
  4. To add new glosses, click "add" and repeat steps 2 and 3.
  5. Click "publish" to save your changes.

Edit a sense's glosses

  1. Next to the sense glosses, click "edit".
  2. To add a new gloss, do the following:
    1. Underneath the existing sense glosses, click the smaller "add" link. (Be careful that you do not accidentally click on the "add statement" or "add Sense" links used to add a new statement or sense instead!)
    2. Under "Language", enter a language code for the new gloss.
    3. Under "Gloss", enter the new gloss.
    4. Repeat these steps for each new gloss you wish to add.
  3. To remove a gloss, click "remove" next to the gloss.
  4. Click "publish" to save your changes.

Remove a sense

  1. Next to the sense glosses, click "edit".
  2. Click "remove".

Forms

Adding a Form

Create a new form

  1. In the "Forms" section of a lexeme, click "add Form".
  2. Under "Representation", fill in a representation for the new form.
  3. Under "Spelling variant", fill in the language code for that representation.
  4. To add more representations, click the "+" next to the existing representations and repeat steps 2 and 3 for the new representation.
  5. Next to "Grammatical features", enter one or several grammatical features, by typing their name and selecting them in the list of items that appears.
  6. Click "publish" to save your changes.
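
The fields filled in above correspond to a small JSON structure. The sketch below builds request parameters for adding a form and sends nothing. It assumes the wbladdform API module with a lexemeId parameter and the JSON keys representations and grammaticalFeatures; confirm these against the WikibaseLexeme documentation. Authentication and the CSRF token are omitted, the helper name is hypothetical, and Q146786 is assumed to be the item for "plural".

```python
import json


def add_form_params(lexeme_id: str, representations: dict[str, str],
                    grammatical_features: list[str]) -> dict[str, str]:
    """Build (but do not send) API parameters for adding a form to a lexeme."""
    data = {
        # Steps 2 to 4: one entry per representation, keyed by its language code
        "representations": {
            code: {"language": code, "value": text}
            for code, text in representations.items()
        },
        # Step 5: grammatical features as a list of item QIDs
        "grammaticalFeatures": list(grammatical_features),
    }
    return {
        "action": "wbladdform",
        "lexemeId": lexeme_id,
        "data": json.dumps(data, ensure_ascii=False),
        "format": "json",
    }


params = add_form_params("L3435", {"en": "umbrellas"}, ["Q146786"])
print(params["data"])
```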

Edit a form's representations or grammatical features

  1. Next to the form's representations, click "edit".
  2. Representations may be edited as follows:
    1. To add a representation, first click the "+" that appears beside the representations.
    2. In the new representation, under "Representation", add the new representation for the form.
    3. Also in the new representation, under "Spelling variant", add its language code.
    4. To remove a particular representation, simply select the "x" appearing beside "Representation" in that representation.
  3. To add a grammatical feature, type its name at the end of the text box and select the appropriate item in the list of items that appears.
  4. To remove a grammatical feature, click the "x" that appears next to it.
  5. Click "publish" to save your changes.

Delete a form

  1. Next to the form's representations, click "edit".
  2. Click "remove".

Features

See also: Wikidata:Lexicographical data/Development

What is included in the first version

  • New datatypes: Lexeme, Form
  • Add, edit, and delete Lexemes
  • Add, edit, and delete Forms
  • Add, edit, and delete statements
  • Add, edit, and delete qualifiers
  • Add, edit, and delete references
  • Linking from a Lexeme or a Form to an Item
  • Linking from a Lexeme, a Form, or an Item to another Lexeme
  • Search and suggestions when entering a value
  • Basic internal APIs (used for UI; you should not use them)

Updates done in subsequent releases

  • Search for content with Special:Search
  • Display the lemma in the history pages, recent changes, and watchlist
  • Add, edit, and delete Senses
  • RDF support and ability to query the data on query.wikidata.org
  • Data access on clients (other Wikimedia projects)

What will be added in the future

Ordered from near-term to long-term plans:

  • Better API support
  • Automatic generation of Forms
  • Editing data directly from Wiktionary

See also
