User:SM5POR/Languages

Issues

Done Area Listed Issue Question or proposal Posted Resolution Resolved
Symbols 2020-06-06 TeX string (P1993) has a few "unique value" constraint violations, possibly related to the property descriptions in several languages referring to the "concept" rather than "symbol" expressed using the TeX string. Either remove the constraint, or make sure the affected concepts are provided with notation property pointers to the corresponding symbols to make the conflicting properties redundant.
Grammar 2022-01-28 Declaring each grammatical category (Q980357), such as case (Q128234), to be a subclass of (P279) grammatical category (Q980357) implies that the former item inherits a number of properties from the latter, including what it is an instance of (P31) (either defined explicitly by a claim for that item, or in turn inherited from its parent class). In effect, case (Q128234) itself becomes a class (a subclass) of grammatical categories, which it in reality isn't (it's a class of grammemes). The appropriate property to use with grammatical category (Q980357) is instance of (P31), as it breaks the chain of inheritance.
Ontology 2022-09-25 When words or phrases from one language or another end up as items in Wikidata Main namespace (due to Wikipedia articles being written about them, or for other reasons), they should not be confused with the concepts those words refer to. As an example, a curriculum (Q207137) is not a Latin phrase (Q3062294), but the English word "curriculum" is. Now, is Q90219924 a preposition (Q4833830) in the English language or a relation (Q930933) that may be written in different ways in different languages? Develop queries and methods to identify this kind of conflation, and write guidelines on how to avoid introducing such errors.
Semantics 2023-01-08 Senses require a large number of semantic items for interpretation. Employ qualifiers with item for this sense (P5137) to generate a more diverse effective set of target values.
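
For the Grammar issue listed above, a query along the following lines might list the items currently declared as a subclass of (P279) grammatical category (Q980357) without also being declared an instance of (P31) it. This is only a sketch: like the other queries on this page it relies on the prefixes predefined by the Wikidata Query Service, and the variable names are arbitrary.

```sparql
# Items declared as subclasses of grammatical category (Q980357)
# that are not also declared as instances of it.
SELECT ?category ?categoryLabel WHERE {
  ?category wdt:P279 wd:Q980357.
  FILTER NOT EXISTS { ?category wdt:P31 wd:Q980357. }
  SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
}
```

Each hit would still need to be reviewed by hand before replacing the claim, since some items may legitimately be subclasses.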

Word/subject conflation


Identify anomalies


These items are likely to confuse properties of a subject with the properties of the word for this subject in one or more languages:
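
One way to list such candidates might look like the sketch below. It assumes that a conflated item is declared an instance of (P31) a word class, here preposition (Q4833830), while also being the target of item for this sense (P5137). The choice of word class and the variable names are illustrative only, and the query relies on the prefixes predefined by the Wikidata Query Service.

```sparql
# Items declared to be prepositions (Q4833830), i.e. words,
# that also serve as the target of item for this sense (P5137).
SELECT DISTINCT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q4833830.
  ?sense wdt:P5137 ?item.
  SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
}
```

Whether a hit is a real conflation still has to be judged item by item.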

As I plan to demonstrate below, adpositions (prepositions, postpositions or circumpositions) without context aren't easily translated between languages, as there is no one-to-one mapping between the set of adpositions in a language and the semantic relations they denote.

The following analysis focuses on the English preposition Q90219924:

The item Q90219924 was created in April 2020 and claimed to be an exact match (P2888) of the English and Russian lexemes in (L2987) and в/въ (L2109), respectively, but those (mutual) claims were soon removed (exact match (P2888) is probably not meant to be used with lexemes) and unidirectional item for this sense (P5137) links were left on the lexemes in their place. Other properties were added later, as well as more lexemes.

However, as almost any preposition typically has numerous different uses within its language, it won't easily map to a single item or translate to a corresponding word in another language. in (L2987) currently lists only two senses, described as "within" and "into" respectively, and they both link to Q90219924, turning that item into (!) a union of two senses (in contrast, the Russian lexeme в/въ (L2109) lists as many as 22 different senses). This is hardly how item for this sense (P5137) is supposed to be used, and in a dictionary a preposition may in reality have dozens of senses.
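
The situation described above, where several senses of one lexeme point to the same item, could be searched for with a query such as this sketch. It relies on the prefixes predefined by the Wikidata Query Service; the variable names and the LIMIT are arbitrary choices.

```sparql
# Items that are the target of item for this sense (P5137)
# from two or more senses of one and the same lexeme.
SELECT ?item ?lexeme ?lemma (COUNT(DISTINCT ?sense) AS ?senses) WHERE {
  ?lexeme ontolex:sense ?sense;
          wikibase:lemma ?lemma.
  ?sense wdt:P5137 ?item.
}
GROUP BY ?item ?lexeme ?lemma
HAVING (COUNT(DISTINCT ?sense) > 1)
ORDER BY DESC(?senses)
LIMIT 100
```

A lexeme with several lemma spellings would appear once per spelling in the result.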

To test this, I composed a few sentences in English involving the preposition "in" and added translations for the languages to which the linked lexemes belong. The translations from English have been made by Google Translate, but I have verified (and corrected) the German and Swedish translations only. The Russian translations are verified by User:Infovarius. The Punjabi translations remain unverified.

English German Swedish Russian Punjabi Bengali Hindi
in (L2987) in (L6748) i (L35761) в/въ (L2109) ਵਿਚ/وِچ (L679728) মধ্যে (L595057) in (L2987)
I don't think we are in Kansas anymore. Ich glaube nicht, dass wir mehr in Kansas sind. Jag tror inte att vi är i Kansas längre. Я не думаю, что мы ещё в Канзасе. ਮੈਨੂੰ ਨਹੀਂ ਲੱਗਦਾ ਕਿ ਅਸੀਂ ਹੁਣ ਕੰਸਾਸ ਵਿੱਚ ਹਾਂ। আমি মনে করি না আমরা আর ক্যান্সাসে আছি।
The train will leave Princeton in half an hour. Der Zug verlässt Princeton in einer halben Stunde. Tåget kommer att lämna Princeton om en halvtimme. Поезд отходит из Принстона через полчаса. ਟ੍ਰੇਨ ਅੱਧੇ ਘੰਟੇ ਵਿੱਚ ਪ੍ਰਿੰਸਟਨ ਤੋਂ ਰਵਾਨਾ ਹੋਵੇਗੀ। ট্রেনটি আধ ঘন্টার মধ্যে প্রিন্সটন ছেড়ে যাবে।
War and Peace was originally written in Russian. Krieg und Frieden wurde ursprünglich auf Russisch geschrieben. Krig och fred skrevs ursprungligen på ryska. Война и мир изначально была написана на русском языке. ਜੰਗ ਅਤੇ ਸ਼ਾਂਤੀ ਮੂਲ ਰੂਪ ਵਿੱਚ ਰੂਸੀ ਵਿੱਚ ਲਿਖੀ ਗਈ ਸੀ। যুদ্ধ ও শান্তি মূলত রুশ ভাষায় লেখা হয়েছিল।
Yuri Gagarin became the first human in space in 1961. Juri Gagarin flog 1961 als erster Mensch ins All. Jurij Gagarin blev den första människan i rymden 1961. Юрий Гагарин стал первым человеком в космосе в 1961 году. ਯੂਰੀ ਗਾਗਰਿਨ 1961 ਵਿੱਚ ਪੁਲਾੜ ਵਿੱਚ ਜਾਣ ਵਾਲਾ ਪਹਿਲਾ ਮਨੁੱਖ ਬਣਿਆ। ইউরি গ্যাগারিন সর্বপ্রথম ব্যক্তি যিনি ১৯৬১ সালে মহাকাশ ভ্রমণ করেন।
There are 366 days in a leap year. Ein Schaltjahr hat 366 Tage. Det går 366 dagar på ett skottår. В високосном году 366 дней. There are 366 days in a leap year. 'অধিবর্ষে ৩৬৬ দিন'। 'लीप वर्ष में अधिक दिन'.

As should be illustrated by the table above, the English preposition "in" seems to correspond fairly well to the Punjabi postposition "ਵਿੱਚ" in its usage in these six different contexts (or senses), but gradually less so to the Russian, German, and Swedish prepositions ("в", "in", and "i" respectively). In Swedish, only the spatial "in" becomes "i", while the other senses are indicated by "på", "om" or simply no word at all.

Class trees


For this reason, I believe lexeme senses should be mapped (using the item for this sense (P5137) property) to different items depending on the exact semantics of those senses in their source language. These items may in turn be linked to each other using the subclass of (P279) property, thereby forming one or more class trees under relation (Q930933) and possibly other concepts. Here is an example:


Example of grammatical relation class tree
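
A class tree of this kind could be inspected with a query like the sketch below, which lists the direct subclass of (P279) links at or below relation (Q930933) for items that at least one lexeme sense points to. It relies on the prefixes predefined by the Wikidata Query Service, and restricting the tree to sense targets is my assumption about what is of interest here.

```sparql
# Direct subclass links among items at or below relation (Q930933),
# restricted to child items that some lexeme sense points to.
SELECT DISTINCT ?item ?itemLabel ?parent ?parentLabel WHERE {
  ?item wdt:P279 ?parent.
  ?parent wdt:P279* wd:Q930933.
  FILTER EXISTS { ?sense wdt:P5137 ?item. }
  SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
}
```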

Given that we have the lexeme database, I doubt that the Main Wikidata namespace also needs an item for each language-specific lexeme, unless entries in other Wikimedia projects require such items. Where an item currently serves a double purpose as a word and a sense, and it has never had any Wikimedia links, I would suggest removing the language-specific properties and attributes, leaving a refined language-independent item that describes a single sense only. As one of the aliases for map–territory relation (Q1963130) reads, the word is not the thing!

Grammar

grammatical category (Q980357) grammeme (Q2374489) Number of items Item examples
part of speech (Q82042)

Find grammars

Grammatical categories


The class of grammatical category (Q980357) may well be divided into sub-classes as the need arises, for instance to describe different kinds of grammar, such as those found in the Tamil language.

Grammar Grammatical categories Area of grammar
letter (Q9788)
word (Q8171)
Q20559207
Tamil prosody (Q19576072)
stylistic device (Q182545)

Lexemes

Word classes

Also known as parts of speech.

Reference used below: CODCE9 The Concise Oxford Dictionary of Current English, ninth edition (1995), part of Concise Oxford English Dictionary (Q2992058) series

Adpositions

Including prepositions, postpositions, and circumpositions.

English adpositions

These are mostly prepositions.

against
ago (postposition)
from

CODCE9 identifies 23 different senses (plus 14 as an adverb and 3 as an adjective). See #in discussion below.

into

CODCE9 identifies 5 different senses.

CODCE9 identifies 10 different senses.

CODCE9 identifies 15 different senses (plus 2 as an adverb).

under
upon

German adpositions


These are mostly prepositions.

See #in discussion below.

innerhalb
nach

Spanish adpositions


These are mostly prepositions.

ante
bajo
hacia
hasta

Swedish adpositions


These are mostly prepositions.

för
för ... sedan (circumposition)

See #in discussion below.

See #in discussion below.

See #in discussion below.

till

Lexeme properties

Find properties for lexemes

Find properties actually used with lexemes

Find lexemes with a rich set of properties

Find types of properties for which examples of using them on lexemes exist

Find redundant statements on items and their corresponding senses

Difference between namespaces

Language-independent queries
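# For each item used as a sense target, count the languages whose lexemes
# link to it, grouped by lexical category.
SELECT DISTINCT ?subject ?subjectLabel ?category ?categoryLabel ?languages ?image ?video WHERE {
  {
    SELECT DISTINCT ?subject ?category (COUNT(DISTINCT ?language) AS ?languages) ?image ?video WHERE {
      # Uncomment the next line to restrict the query to a single item.
      #VALUES ?subject {wd:Q2}
      ?sense wdt:P5137 ?subject.                   # item for this sense
      ?lexeme ontolex:sense ?sense.
      ?lexeme wikibase:lexicalCategory ?category.
      ?lexeme dct:language ?language.
      # ?image and ?video stay unbound while the two lines below are disabled.
      #OPTIONAL {?subject wdt:P18 ?image.}
      #OPTIONAL {?subject wdt:P10 ?video.}
    }
    GROUP BY ?subject ?category ?image ?video
  }
  SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
}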
Language-dependent queries
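# List the forms of lexemes whose senses link to the given item,
# together with optional pronunciation and media data.
SELECT DISTINCT ?subject ?language ?speech ?ipa ?writing ?image ?video WHERE {
  VALUES ?subject {wd:Q2}
  ?sense wdt:P5137 ?subject.
  ?lexeme ontolex:sense ?sense.
  ?lexeme dct:language ?language.
  ?lexeme ontolex:lexicalForm ?form.
  OPTIONAL {?form wdt:P443 ?speech}   # pronunciation audio
  OPTIONAL {?form wdt:P898 ?ipa}      # IPA transcription
  # Assumption: ?writing is meant to hold the written representation of the form.
  OPTIONAL {?form ontolex:representation ?writing}
  OPTIONAL {?sense wdt:P18 ?image.}
  OPTIONAL {?sense wdt:P10 ?video.}
}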

Model property proposals


While Wikidata property example for lexemes (P5192) offers suggestions for how to use a specific property in the lexeme domain, demonstrating how to combine multiple properties and other attributes when documenting a word may require a model lexeme, similar to the model item used to show how to design items in the Main entity namespace.

These proposals may be out of date, as there is now at least a model lexeme (P11464) property.

  • Model lexeme
  • Model sense
  • Model form

Statements

Statement Model lexeme Model sense Model form
instance of (P31) Wikidata property (Q18616576) Wikidata property (Q18616576) Wikidata property (Q18616576)
described at URL (P973)
Wikidata item of this property (P1629) Wikidata model lexeme Wikidata model sense Wikidata model form
Wikidata usage instructions (P2559)
Wikidata property example (P1855) noun (Q1084) noun (Q1084) noun (Q1084)
inverse label item (P7087)
expected completeness (P2429) always incomplete (Q21873886) always incomplete (Q21873886) always incomplete (Q21873886)
related property (P1659)
property proposal discussion (P3254)

Constraints

Constraint Model lexeme Model sense Model form
subject type constraint (Q21503250) class (P2308) / relation (P2309) class (P2308) / relation (P2309) class (P2308) / relation (P2309)

allowed qualifiers constraint (Q21510851) property (P2306) property (P2306) property (P2306)
allowed-entity-types constraint (Q52004125) item of property constraint (P2305) item of property constraint (P2305) item of property constraint (P2305)
property scope constraint (Q53869507) property scope (P5314) property scope (P5314) property scope (P5314)

Lexeme statistics


Note: These statistics seem mostly redundant, as they are less extensive than the statistics gathered by the Wikidata Lexicographical project. I'm retaining this section anyway as a toolbox to be able to compare my numbers with those of the project and verify that I understand the lexeme structural relationships correctly, as well as to conduct some in-depth analysis of specific statistical quantities not described elsewhere.

Number of languages


Find languages with currently at least 10,000 lexemes

Number of lexemes, senses and forms


Updated 2022-09-04

Language Lexemes Senses Forms
Aragonese (Q8765) 10127 4 29290
Basque (Q8752) 22931 30737 1256971
Bokmål (Q25167) 17525 23346 118708
Czech (Q9056) 14196 5237 715522
Danish (Q9035) 14947 7526 66185
English (Q1860) 71660 28688 130461
Estonian (Q9072) 83208 55 2916037
French (Q150) 13784 8852 86541
German (Q188) 27498 9209 230588
Hebrew (Q9288) 29912 6029 451625
Indonesian (Q9240) 19685 71 412071
Latin (Q397) 32183 556 1198579
Malayalam (Q36236) 63316 11333 749411
Russian (Q7737) 101432 10697 1237781
Slovak (Q9058) 16475 959 235263
Spanish (Q1321) 21056 7042 281386
Swedish (Q9027) 36858 8708 254157
Ukrainian (Q8798) 15967 128 507567
All 909 languages 684223 218317 11171815

Update statistics for previously identified top languages

Update cross-language totals

Number of lexemes per lexical category

Find all lexical categories

Word classes (parts of speech)

Find main categories

Updated 2022-09-18

Language Categories Words Nouns Verbs Adjectives Numerals Interjections Adverbs Function words
Aragonese (Q8765) 6712 9 3405 0 0 0 0
Basque (Q8752) 14495 3968 277 0 41 21 10
Bokmål (Q25167) 11013 3406 2725 0 93 310 194
Czech (Q9056) 4992 290 4871 96 13 3276 194
Danish (Q9035) 8638 3546 1385 69 56 306 216
English (Q1860) 28431 7435 12506 42 264 20216 306
Estonian (Q9072) 60137 7932 9146 176 627 4436 754
French (Q150) 8444 1523 1765 251 17 573 103
German (Q188) 16225 3550 2710 243 319 2353 392
Hebrew (Q9288) 19748 4706 4269 26 29 107 131
Indonesian (Q9240) 6700 12782 173 1 1 2 15
Latin (Q397) 15885 6544 7307 124 99 1922 212
Malayalam (Q36236) 53387 3979 197 134 7 88 109
Russian (Q7737) 101096 56 60 26 7 20 40
Slovak (Q9058) 7037 3378 4001 145 56 816 406
Spanish (Q1321) 12253 3815 4178 0 8 223 89
Swedish (Q9027) 25979 4500 4007 60 28 908 150
Ukrainian (Q8798) 87 4 15830 2 0 0 3
All 921 languages 205 688820 436832 79222 85540 2817 1823 38195 4996

Update statistics for previously identified top languages; update cross-language totals

Function words

Find function word categories

Language Categories Function words Conjunctions Adpositions Particles Determiners Pro-forms Interrogative words
Aragonese (Q8765)
All 911 languages

Update statistics for previously identified top languages; update cross-language totals

Morphemes


Find morpheme categories

Language Categories Morphemes Affixes Roots Clitics Confixes
Aragonese (Q8765)
All 911 languages

Find speech recordings for lexemes

Find related language

Find languages belonging to a particular family

Map senses to items


Find items linked to senses

Find items covering potentially multiple senses

Expand the effective number of semantic target objects


The property item for this sense (P5137) exists to map each lexeme sense in any language to a single language-independent item identifying the semantic contents of the sense. The number of actual items is however unlikely to ever match the combined diversity of vocabularies from every language, for the following reasons, among others:

  • Due to the way dictionaries and encyclopedias (including Wikipedia) are written, most items correspond to and describe nouns, leaving few options for adjectives or verbs.
  • Even within the same part of speech and item corresponding to a sense, individual languages may have distinct lexemes for multiple aspects of the item not recognized in most languages, and therefore not represented in the item.
  • Some variation in vocabulary may be due to varying language style or level of education of the speaker or the intended audience.

Even when multiple items exist to match variation in a source language when doing a translation, the target language may lack the same nuances with respect to the item, rendering some words untranslatable.

One approach towards solving this problem involves adding qualifiers to the item for this sense (P5137) statement, resulting in an effective number of distinct statement values that is the product of the number of items and the total number of qualifier value combinations. Since item for this sense (P5137) typically links numerous languages and senses to the same item, the variation can be expected to appear on the lexeme/sense subject side of the statement, suggesting subject has role (P2868) as a suitable qualifier. Multiple aspects may be represented using different sets of qualifier value items:

  • Level of understanding
    • child level
    • general level (also default)
    • academic level
  • Socio-linguistic context
    • slang
    • popular
    • professional
    • spiritual
  • Language style
    • casual
    • factual
    • formal
    • poetic
  • Grammatical context
    • possessive action
    • production
    • consumption
    • bringing
    • removing
    • sounding
    • has quality like
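
If subject has role (P2868) were used as a qualifier in this way, the resulting combinations could be inspected with a query along these lines. This is a sketch under that assumption, not a description of established practice; it relies on the prefixes predefined by the Wikidata Query Service, and wd:Q2 is just the sample item used in the queries elsewhere on this page.

```sparql
# For one target item, count the senses per qualifier value
# of subject has role (P2868) on item for this sense (P5137) statements.
SELECT ?item ?role (COUNT(DISTINCT ?sense) AS ?senses) WHERE {
  VALUES ?item {wd:Q2}
  ?sense p:P5137 ?statement.
  ?statement ps:P5137 ?item.
  OPTIONAL {?statement pq:P2868 ?role.}   # unbound when the statement is unqualified
}
GROUP BY ?item ?role
ORDER BY DESC(?senses)
```

As a rough illustration of the product mentioned above: with the four example sets listed (3, 4, 4 and 7 values) and exactly one value chosen from each, a single item would yield 3 × 4 × 4 × 7 = 336 distinct qualified values.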

Instance of term considered harmful


Find instances of term that are probably conflations

Find homographs


Find declared homographs within each language

Find languages telling different kinds of events apart

Find words referring to "events"

Find senses in English, German and Swedish

Find lexemes

Find lexemes of particular languages

Find lexemes of particular lexical categories

Not working yet

Broken query

Find subclasses of a given set of classes, listing additional clues for those without English labels


Not working yet

Broken query

Labels

Wikidata label statistics

Property labels in most languages per language family

Proper names

Compare number of unique names used across multiple languages

Finding classes of items other items are named after


Find classes of items other items are named after

Translation

Phonetics

Synthetic speech

Visual language

Symbols

Finding concepts with corresponding symbols sharing the same notational property


Find concepts with corresponding symbols sharing the same notational property

Typography

List usage of typeface/font used (P2739)

Writing systems


Finding ontological relations between writing systems, scripts, alphabets, and letters


Find ontological relations between writing systems, scripts, alphabets, and letters

Mongolian script
