Keyword (linguistics): Difference between revisions
Tag: Reverted |
minor ce |
||
(29 intermediate revisions by 6 users not shown) | |||
Line 1: | Line 1: | ||
{{ |
{{Short description|Word which occurs in a text more often than we would expect to occur by chance alone}} |
||
In [[corpus linguistics]] a '''key word''' is a word which occurs in a text more often than we would expect to occur by chance alone. |
In [[corpus linguistics]] a '''key word''' is a word which occurs in a text more often than we would expect to occur by chance alone.{{sfn|Scott and Tribble|2006|page=55}} Key words are calculated by carrying out a [[statistical test]] (e.g., [[Log-linear analysis|loglinear]] or [[chi-squared test|chi-squared]]) which compares the word frequencies in a text against their expected frequencies derived in a much larger corpus, which acts as a reference for general language use. '''Keyness''' is then the quality a word or phrase has of being "key" in its context. Combinations of nouns with [[parts of speech]] that human readers would not likely notice, such as prepositions, time adverbs, and pronouns can be a relevant part of keyness. Even separate pronouns can constitute keywords.{{sfn|Scott and Tribble|2006|page=72}} |
||
⚫ | Compare this with [[collocation]], the quality linking two words or phrases usually assumed to be within a given span of each other. Keyness is a ''textual'' feature, not a language feature (so a word has keyness in a certain textual context but may well not have keyness in other contexts, whereas a node and collocate are often found together in texts of the same genre so collocation is to a considerable extent a ''language'' phenomenon). The set of keywords found in a given text share keyness, they are co-key. Words typically found in the same texts as a key word are called ''associates''. |
||
==Keyword as a language feature== |
|||
==Sociological aspects== |
|||
Probabilistic methods for keyword extraction are widely used in corpus linguistics, literary and linguistic computing, and digital humanities. These methods, originating from information retrieval and computer science, were initially developed for text indexing and term search. One widely used approach is "[[Tf–idf | Term frequency–inverse document frequency ]]" (tf–idf), which weighs the importance of terms in information retrieval systems. Despite its popularity, tf–idf is considered an empirical method with various possible variations. Studies in this area often focus on quantifying term significance and relevance in retrieval processes, utilizing measures such as frequency, signal-to-noise ratio, and relevance weighting methods. Additionally, fields like computational terminology and machine learning employ statistical measures like chi-squared statistics, pairwise mutual information, Dice coefficient, log-likelihood ratio, and Jaccard similarity for automatic term extraction and feature subset selection. |
|||
⚫ | In politics, sociology and [[critical discourse analysis]], the key reference for keywords was [[Raymond Williams]] (1976),{{sfn|Williams|1976}} but Williams was resolutely [[Marxist]], and Critical Discourse Analysis has tended to perpetuate this political meaning of the term: keywords are part of ideologies and studying them is part of [[social criticism]]. [[Cultural studies]] has tended to develop along similar lines. This stands in stark contrast to present day linguistics which is wary of political analysis, and has tended to aspire to non-political objectivity. The development of technology, new techniques and methodology relating to massive corpora have all consolidated this trend. |
||
===Translatability=== |
|||
Keyword extraction in natural language processing and linguistic applications, including text mining, involves extracting valuable insights from large volumes of textual data using machine-driven, human-assisted, or hybrid methods. A key challenge is extracting keywords from texts without prior information. [[ Hans Peter Luhn | Luhn ]] pioneered unsupervised keyword extraction by leveraging Zipf's frequency analysis, which orders words by occurrence frequency. Zipf's law observes that word frequency is inversely proportional to its rank. Luhn's method involves discarding words at the extremes of the frequency list and considering the rest as keywords. |
|||
⚫ | There are, however, numerous political dimensions that come into play when keywords are studied in relation to cultures, societies and their histories. The Lublin Ethnolinguistics School studies Polish and European keywords in this fashion. Anna Wierzbicka (1997),{{sfn|Wierzbicka|1997}} probably the best known cultural linguist writing in English today, studies languages as parts of cultures evolving in society and history. And it becomes impossible to ignore politics when keywords migrate from one culture to another. Underhill and Gianninoto {{sfn|Underhill & Gianninoto|2019}} demonstrate the way political terms like, "citizen" and "individual" are integrated into the Chinese worldview over the course of the 19th and 20th century. They argue that this is part of a complex readjustment of conceptual clusters related to "the people". Keywords like "citizen" generate various translations in Chinese, and are part of an ongoing adaptation to global concepts of individual rights and responsibilities. Understanding keywords in this light becomes crucial for understanding how the politics of China evolves as Communism emerges and as the free market and citizens' rights develop. Underhill and Gianninoto argue that this is part of the complex ways ideological worldviews interact with the language as an ongoing means of perceiving and understanding the world. |
||
⚫ | Barbara Cassin studies keywords in a more traditional manner, striving to define the words specific to individual cultures, in order to demonstrate that many of our keywords are partially "untranslatable" into their "equivalents. The Greeks may need four words to cover all the meanings English-speakers have in mind when speaking of "love". Similarly, the French find that "liberté" suffices, while English-speakers attribute different associations to "liberty" and "freedom": "freedom of speech" or "freedom of movement", but "the Statue of Liberty".{{sfn|Cassin|2014}} |
||
Another unsupervised probabilistic approach to keyword extraction involves using Shannon's [[ Entropy (information theory) | entropy ]] to measure the content of information for each word. Shannon's entropy, widely applied in physics literature, finds relevance in linguistics and natural language studies. Applications include DNA sequence analysis, measuring long-range correlations, language acquisition studies, resolving authorship disputes, communication modeling, and statistical analysis of word roles in corpora. |
|||
Statistical keyword extraction methods typically rely on modeling the distribution of term frequencies (i.e. word counts) within a corpus, utilizing [[ document-term matrix | document-term matrices]]. In computational linguistics, the [[ Negative binomial distribution ]] (NBD) was proposed as a candidate for describing natural language data, mirroring its utilization in ecology and biostatistics. This choice is primarily attributed to NBD's ability to account for [[ overdispersion ]], a common phenomenon observed in word counts. Overdispersion arises from the tendency of content words to aggregate, leading to a skewed distribution of term frequencies within a text. Thus, parameters of NBD, when estimated, may capture this variability. It makes them a useful measure in extracting keywords that effectively represent such salient features of the text as named entities. This approach, adopted from analysis of ecological systems, offers a robust framework for keyword extraction and also sheds light on the underlying statistical properties of linguistic data, contributing to a better understanding of word use dynamics. |
|||
==Keyword as a textual phenomenon== |
|||
⚫ | Compare this with [[collocation]], the quality linking two words or phrases usually assumed to be within a given span of each other. Keyness is a ''textual'' feature, not a language feature (so a word has keyness in a certain textual context but may well not have keyness in other contexts, whereas a node and collocate are often found together in texts of the same genre so collocation is to a considerable extent a ''language'' phenomenon). The set of keywords found in a given text share keyness, they are co-key. Words typically found in the same texts as a key word are called ''associates''. |
||
==Software-assisted identification== |
|||
⚫ | In politics, sociology and critical discourse analysis, the key reference for keywords was Raymond Williams (1976), but Williams was resolutely Marxist, and Critical Discourse Analysis has tended to perpetuate this political meaning of the term: keywords are part of ideologies and studying them is part of social criticism. Cultural |
||
Keywords are identified by software that compares a word-list of the text with a word-list based on a larger reference corpus. Software such as e.g. [[WordSmith (software)|WordSmith]], lists keywords and phrases and allows plotting their occurrence as they appear in texts. |
|||
⚫ | There are, however, numerous political dimensions that come into play when keywords are studied in relation to cultures, societies and their histories. The Lublin Ethnolinguistics School studies Polish and European keywords in this fashion. Anna Wierzbicka (1997), probably the best known cultural linguist writing in English today, studies languages as parts of cultures evolving in society and history. And it becomes impossible to ignore politics when keywords migrate from one culture to another. Gianninoto |
||
==See also== |
|||
⚫ | Barbara Cassin studies keywords in a more traditional manner, striving to define the words specific to individual cultures, in order to demonstrate that many of our keywords are partially "untranslatable" into their "equivalents. The Greeks may need four words to cover all the meanings English-speakers have in mind when speaking of "love". Similarly, the French find that "liberté" suffices, while English-speakers attribute different associations to "liberty" and "freedom": "freedom of speech" or "freedom of movement", but "the Statue of Liberty". |
||
*[[Transition (linguistics)]] |
|||
==References== |
==References== |
||
Line 26: | Line 23: | ||
==Bibliography== |
==Bibliography== |
||
* |
*{{cite book |last1=Cassin|first1=Barbara|year=2014|title=Dictionary of Untranslatables|location=Oxford|publisher=Princeton University Press|isbn=9780691138701|ref={{sfnref|Cassin|2014}}}} |
||
* Scott |
*{{cite book |last1=Scott |first1=Mike |last2=Tribble |first2=Chris |title=Textual patterns: key words and corpus analysis in language education |date=2006 |publisher=John Benjamins Pub |location=Amsterdam ; Philadelphia |url=https://coehuman.uodiyala.edu.iq/uploads/Coehuman%20library%20pdf/English%20library%D9%83%D8%AA%D8%A8%20%D8%A7%D9%84%D8%A7%D9%86%D9%83%D9%84%D9%8A%D8%B2%D9%8A/linguistics/registers.pdf|isbn=9789027222930|ref={{sfnref|Scott and Tribble|2006}}}} especially chapters 4 & 5. |
||
* |
* {{cite book |last1=Underhill|first1=James|last2=Gianninoto|first2=Rosamaria|year=2019|title=Migrating Meanings: Sharing keywords in a global world|location=Edinburgh|publisher=Edinburgh University Press|doi=10.3366/edinburgh/9780748696949.001.0001|isbn=9780748696949|ref={{sfnref|Underhill & Gianninoto|2019}}}} |
||
* |
* {{cite book |last1=Wierzbicka|first1=Anna|year=1997|title=Understanding Cultures through their Key Words|location=Oxford|publisher=Oxford University Press|url=https://dl1.cuni.cz/pluginfile.php/415674/mod_resource/content/1/Wierzbicka_Libertas.pdf|isbn=9780195088366|ref={{sfnref|Wierzbicka|1997}}}} |
||
* |
* {{cite book |last1=Williams|first1=Raymond|year=1976|title=Keywords: A Vocabulary of culture and society|location=New York|publisher=Oxford University Press|ref={{sfnref|Williams|1976}}}} |
||
==External links== |
==External links== |
Latest revision as of 21:11, 23 April 2024
In corpus linguistics, a key word is a word which occurs in a text more often than would be expected by chance alone.[1] Key words are identified by carrying out a statistical test (e.g., loglinear or chi-squared) which compares the word frequencies in a text against their expected frequencies derived from a much larger corpus, which acts as a reference for general language use. Keyness is then the quality a word or phrase has of being "key" in its context. Combinations of nouns with parts of speech that human readers would be unlikely to notice, such as prepositions, time adverbs, and pronouns, can be a relevant part of keyness. Even separate pronouns can constitute keywords.[2]
Compare this with collocation, the quality linking two words or phrases usually assumed to be within a given span of each other. Keyness is a textual feature, not a language feature: a word has keyness in a certain textual context but may well not have keyness in other contexts, whereas a node and collocate are often found together in texts of the same genre, so collocation is to a considerable extent a language phenomenon. The keywords found in a given text share keyness; they are co-key. Words typically found in the same texts as a key word are called associates.
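As an illustration of the chi-squared comparison mentioned above, the test for a single word can be written out as follows. This is a sketch in notation introduced here, not taken from the cited sources: a and b are the word's frequencies in the text and in the reference corpus, and c and d are the total word counts of the text and the reference corpus.

```latex
% Observed 2x2 table: word vs. all other words, text vs. reference corpus
O_{11} = a, \quad O_{12} = b, \quad O_{21} = c - a, \quad O_{22} = d - b, \quad N = c + d

% Expected count of each cell under the null hypothesis of equal relative frequency
E_{ij} = \frac{\bigl(\sum_{k} O_{ik}\bigr)\bigl(\sum_{k} O_{kj}\bigr)}{N}

% Pearson chi-squared statistic (1 degree of freedom)
\chi^{2} = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}
```

With one degree of freedom, a value above roughly 3.84 corresponds to p < 0.05; which threshold is appropriate for a given study is a choice left to the analyst.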
Sociological aspects
In politics, sociology and critical discourse analysis, the key reference for keywords was Raymond Williams (1976),[3] but Williams was resolutely Marxist, and Critical Discourse Analysis has tended to perpetuate this political meaning of the term: keywords are part of ideologies and studying them is part of social criticism. Cultural studies has tended to develop along similar lines. This stands in stark contrast to present-day linguistics, which is wary of political analysis and has tended to aspire to non-political objectivity. The development of technology, new techniques and methodology relating to massive corpora have all consolidated this trend.
Translatability
There are, however, numerous political dimensions that come into play when keywords are studied in relation to cultures, societies and their histories. The Lublin Ethnolinguistics School studies Polish and European keywords in this fashion. Anna Wierzbicka (1997),[4] probably the best known cultural linguist writing in English today, studies languages as parts of cultures evolving in society and history. And it becomes impossible to ignore politics when keywords migrate from one culture to another. Underhill and Gianninoto[5] demonstrate the way political terms like "citizen" and "individual" were integrated into the Chinese worldview over the course of the 19th and 20th centuries. They argue that this is part of a complex readjustment of conceptual clusters related to "the people". Keywords like "citizen" generate various translations in Chinese, and are part of an ongoing adaptation to global concepts of individual rights and responsibilities. Understanding keywords in this light becomes crucial for understanding how the politics of China evolves as Communism emerges and as the free market and citizens' rights develop. Underhill and Gianninoto argue that this is part of the complex ways ideological worldviews interact with the language as an ongoing means of perceiving and understanding the world.
Barbara Cassin studies keywords in a more traditional manner, striving to define the words specific to individual cultures, in order to demonstrate that many of our keywords are partially "untranslatable" into their "equivalents". The Greeks may need four words to cover all the meanings English-speakers have in mind when speaking of "love". Similarly, the French find that "liberté" suffices, while English-speakers attribute different associations to "liberty" and "freedom": "freedom of speech" or "freedom of movement", but "the Statue of Liberty".[6]
Software-assisted identification
Keywords are identified by software that compares a word list of the text with a word list based on a larger reference corpus. Software such as WordSmith lists keywords and phrases and allows their occurrences to be plotted as they appear in texts.
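The word-list comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how WordSmith or any other tool works internally: it uses the log-likelihood ratio as the statistic, keeps only words that are relatively more frequent in the text than in the reference corpus, and uses 3.84 (the p < 0.05 critical value for one degree of freedom) as a default cut-off. The function names `log_likelihood` and `keywords` are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score for one word.

    a: frequency of the word in the text
    b: frequency of the word in the reference corpus
    c: total number of tokens in the text
    d: total number of tokens in the reference corpus
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / e1)
    if b > 0:
        score += b * math.log(b / e2)
    return 2 * score


def keywords(text_tokens, ref_tokens, threshold=3.84):
    """Return (word, score) pairs for words that are key in text_tokens.

    The threshold is an assumed default; real studies often use a
    stricter cut-off.
    """
    text_counts = Counter(text_tokens)
    ref_counts = Counter(ref_tokens)
    c, d = len(text_tokens), len(ref_tokens)
    results = []
    for word, a in text_counts.items():
        b = ref_counts.get(word, 0)
        # Keep only "positive" keywords: relatively more frequent in the text.
        if a / c <= b / d:
            continue
        score = log_likelihood(a, b, c, d)
        if score >= threshold:
            results.append((word, score))
    # Highest keyness first.
    return sorted(results, key=lambda pair: -pair[1])
```

Calling `keywords(text_tokens, ref_tokens)` with two token lists returns the candidate keywords ranked by score. Tokenisation, case folding and the choice of reference corpus are left out here, although in practice they strongly affect the result.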
See also
References
1. Scott and Tribble 2006, p. 55.
2. Scott and Tribble 2006, p. 72.
3. Williams 1976.
4. Wierzbicka 1997.
5. Underhill & Gianninoto 2019.
6. Cassin 2014.
Bibliography
- Cassin, Barbara (2014). Dictionary of Untranslatables. Oxford: Princeton University Press. ISBN 9780691138701.
- Scott, Mike; Tribble, Chris (2006). Textual patterns: key words and corpus analysis in language education (PDF). Amsterdam; Philadelphia: John Benjamins Pub. ISBN 9789027222930. Especially chapters 4 & 5.
- Underhill, James; Gianninoto, Rosamaria (2019). Migrating Meanings: Sharing keywords in a global world. Edinburgh: Edinburgh University Press. doi:10.3366/edinburgh/9780748696949.001.0001. ISBN 9780748696949.
- Wierzbicka, Anna (1997). Understanding Cultures through their Key Words (PDF). Oxford: Oxford University Press. ISBN 9780195088366.
- Williams, Raymond (1976). Keywords: A Vocabulary of culture and society. New York: Oxford University Press.
External links
- Understanding the role of text length, sample size and vocabulary size in determining text coverage, by Kiyomi Chujo and Masao Utiyama
- Frequency Level Checker Archived 2010-08-06 at the Wayback Machine