Lyndon word

Last updated

In mathematics, in the areas of combinatorics and computer science, a Lyndon word is a nonempty string that is strictly smaller in lexicographic order than all of its rotations. Lyndon words are named after mathematician Roger Lyndon, who investigated them in 1954, calling them standard lexicographic sequences. [1] Anatoly Shirshov introduced Lyndon words in 1953 calling them regular words. [2] Lyndon words are a special case of Hall words; almost all properties of Lyndon words are shared by Hall words.

Contents

Definitions

Several equivalent definitions exist.

A -ary Lyndon word of length is an -character string over an alphabet of size , and which is the unique minimum element in the lexicographical ordering in the multiset of all its rotations. Being the singularly smallest rotation implies that a Lyndon word differs from any of its non-trivial rotations, and is therefore aperiodic. [3]

Alternately, a word is a Lyndon word if and only if it is nonempty and lexicographically strictly smaller than any of its proper suffixes, that is for all nonempty words such that and is nonempty.

Another characterisation is the following: A Lyndon word has the property that it is nonempty and, whenever it is split into two nonempty substrings, the left substring is always lexicographically less than the right substring. That is, if is a Lyndon word, and is any factorization into two substrings, with and understood to be non-empty, then . This definition implies that a string of length is a Lyndon word if and only if there exist Lyndon words and such that and . [4] Although there may be more than one choice of and with this property, there is a particular choice, called the standard factorization, in which is as long as possible. [5]

Enumeration

The Lyndon words over the two-symbol binary alphabet {0,1}, sorted by length and then lexicographically within each length class, form an infinite sequence that begins

0, 1, 01, 001, 011, 0001, 0011, 0111, 00001, 00011, 00101, 00111, 01011, 01111, ...

The first string that does not belong to this sequence, "00", is omitted because it is periodic (it consists of two repetitions of the substring "0"); the second omitted string, "10", is aperiodic but is not minimal in its permutation class as it can be cyclically permuted to the smaller string "01".

The empty string also meets the definition of a Lyndon word of length zero. The numbers of binary Lyndon words of each length, starting with length zero, form the integer sequence

1, 2, 1, 2, 3, 6, 9, 18, 30, 56, 99, 186, 335, ... (sequence A001037 in the OEIS )

Lyndon words correspond to aperiodic necklace class representatives and can thus be counted with Moreau's necklace-counting function. [3] [6]

Generation

Duval (1988) provides an efficient algorithm for listing the Lyndon words of length at most with a given alphabet size in lexicographic order. If is one of the words in the sequence, then the next word after can be found by the following steps:

  1. Repeat and truncate it to a new word of length exactly .
  2. As long as the final symbol of is the last symbol in the sorted ordering of the alphabet, remove it, producing a shorter word.
  3. Replace the final remaining symbol of by its successor in the sorted ordering of the alphabet.

For example, suppose we have generated the binary Lyndon words of length up to 7, and we have generated up to , then the steps are:

  1. Repeat and truncate to get
  2. Since the last symbol is 0, it is not the final symbol.
  3. Increment the last symbol to obtain .

The worst-case time to generate the successor of a word by this procedure is . However, if the words being generated are stored in an array of length , and the construction of from is performed by adding symbols onto the end of rather than by making a new copy of , then the average time to generate the successor of (assuming each word is equally likely) is constant. Therefore, the sequence of all Lyndon words of length at most can be generated in time proportional to the length of the sequence. [7] A constant fraction of the words in this sequence have length exactly , so the same procedure can be used to efficiently generate words of length exactly or words whose length divides , by filtering out the generated words that do not fit these criteria.

Standard factorization

According to the Chen–Fox–Lyndon theorem, every string may be formed in a unique way by concatenating a sequence of Lyndon words, in such a way that the words in the sequence are nonincreasing lexicographically. [8] The final Lyndon word in this sequence is the lexicographically smallest suffix of the given string. [9] A factorization into a nonincreasing sequence of Lyndon words (the so-called Lyndon factorization) can be constructed in linear time. [9] Lyndon factorizations may be used as part of a bijective variant of the Burrows–Wheeler transform for data compression, [10] and in algorithms for digital geometry. [11]

Such factorizations can be written (uniquely) as finite binary trees, with the leaves labelled by the alphabet, with each rightward branch given by the final Lyndon word in the sequence. [12] Such trees are sometimes called standard bracketings and can be taken as factorization of the elements of a free group or as the basis elements for a free Lie algebra. These trees are a special case of Hall trees (as Lyndon words are a special case of Hall words), and so likewise, the Hall words provide a standard order, called the commutator collecting process for groups, and basis for Lie algebras. [13] Indeed, they provide an explicit construction for the commutators appearing in the Poincaré–Birkhoff–Witt theorem needed for the construction of universal enveloping algebras.

Every Lyndon word can be understood as a permutation, the suffix standard permutation.

Duval algorithm

Duval (1983) developed an algorithm for finding the standard factorization that runs in linear time and constant space. It iterates over a string trying to find as long a Lyndon word as possible. When it finds one, it adds it to the result list and proceeds to search the remaining part of the string. The resulting list of strings is the standard factorization of the given string. More formal description of the algorithm follows.

Given a string S of length N, one should proceed with the following steps:

  1. Let m be the index of the symbol-candidate to be appended to the already collected symbols. Initially, m = 1 (indices of symbols in a string start from zero).
  2. Let k be the index of the symbol we would compare others to. Initially, k = 0.
  3. While k and m are less than N, compare S[k] (the k-th symbol of the string S) to S[m]. There are three possible outcomes:
    1. S[k] is equal to S[m]: append S[m] to the current collected symbols. Increment k and m.
    2. S[k] is less than S[m]: if we append S[m] to the current collected symbols, we'll get a Lyndon word. But we can't add it to the result list yet because it may be just a part of a larger Lyndon word. Thus, just increment m and set k to 0 so the next symbol would be compared to the first one in the string.
    3. S[k] is greater than S[m]: if we append S[m] to the current collected symbols, it will be neither a Lyndon word nor a possible beginning of one. Thus, add the first m-k collected symbols to the result list, remove them from the string, set m to 1 and k to 0 so that they point to the second and the first symbol of the string respectively.
  4. When m > N, it is essentially the same as encountering minus infinity, thus, add the first m-k collected symbols to the result list after removing them from the string, set m to 1 and k to 0, and return to the previous step.
  5. Add S to the result list.

Connection to de Bruijn sequences

If one concatenates together, in lexicographic order, all the Lyndon words that have length dividing a given number n, the result is a de Bruijn sequence, a circular sequence of symbols such that each possible length-n sequence appears exactly once as one of its contiguous subsequences. For example, the concatenation of the binary Lyndon words whose length divides four is

0 0001 0011 01 0111 1

This construction, together with the efficient generation of Lyndon words, provides an efficient method for constructing a particular de Bruijn sequence in linear time and logarithmic space. [14]

Additional properties and applications

Lyndon words have an application to the description of free Lie algebras, in constructing a basis for the homogeneous part of a given degree; this was Lyndon's original motivation for introducing these words. [4] Lyndon words may be understood as a special case of Hall sets. [4]

For prime p, the number of irreducible monic polynomials of degree d over the field is the same as the number of Lyndon words of length d in an alphabet of p symbols; they can be placed into explicit correspondence. [15]

A theorem of Radford states that a shuffle algebra over a field of characteristic 0 can be viewed as a polynomial algebra over the Lyndon words. More precisely, let A be an alphabet, let k be a field of characteristic 0 (or, more general, a commutative -algebra), and let R be the free noncommutative k-algebra kxa | aA. The words over A can then be identified with the "noncommutative monomials" (i.e., products of the xa) in R; namely, we identify a word (a1,a2,...,an) with the monomial xa1xa2...xan. Thus, the words over A form a k-vector space basis of R. Then, a shuffle product is defined on R; this is a k-bilinear, associative and commutative product, which is denoted by ш, and which on the words can be recursively defined by

1 ш v = v for any word v;
u ш 1 = u for any word u;
ua ш vb = (u ш vb)a + (ua ш v)b for any a,b ∈ A and any words u and v.

The shuffle algebra on the alphabet A is defined to be the additive group R endowed with ш as multiplication. Radford's theorem [16] now states that the Lyndon words are algebraically independent elements of this shuffle algebra, and generate it; thus, the shuffle algebra is isomorphic to a polynomial ring over k, with the indeterminates corresponding to the Lyndon words. [16]

See also

Notes

  1. Lyndon (1954).
  2. Shirshov (1953).
  3. 1 2 Berstel & Perrin (2007); Melançon (2001).
  4. 1 2 3 Melançon (2001).
  5. Berstel & Perrin (2007).
  6. Ruskey (2003) provides details of these counts for Lyndon words and several related concepts.
  7. Berstel & Pocchiola (1994).
  8. Melançon (2001). Berstel & Perrin (2007) write that although this is commonly credited to Chen, Fox & Lyndon (1958), and follows from results in that paper, it was not stated explicitly until Schützenberger (1965). It was formulated explicitly by Shirshov (1958), see Schützenberger & Sherman (1963).
  9. 1 2 Duval (1983).
  10. Gil & Scott (2009); Kufleitner (2009).
  11. Brlek et al. (2009).
  12. Glen (2012).
  13. Melançon (1992); Melançon & Reutenauer (1989); Hohlweg & Reutenauer (2003)
  14. According to Berstel & Perrin (2007), the sequence generated in this way was first described (with a different generation method) by Martin (1934), and the connection between it and Lyndon words was observed by Fredricksen & Maiorana (1978).
  15. Golomb (1969).
  16. 1 2 Radford (1979)

Related Research Articles

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

In abstract algebra, the free monoid on a set is the monoid whose elements are all the finite sequences of zero or more elements from that set, with string concatenation as the monoid operation and with the unique sequence of zero elements, often called the empty string and denoted by ε or λ, as the identity element. The free monoid on a set A is usually denoted A. The free semigroup on A is the subsemigroup of A containing all elements except the empty string. It is usually denoted A+.

In mathematics, the lexicographic or lexicographical order is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a totally ordered set.

de Bruijn sequence Cycle through all length-k sequences

In combinatorial mathematics, a de Bruijn sequence of order n on a size-k alphabet A is a cyclic sequence in which every possible length-n string on A occurs exactly once as a substring (i.e., as a contiguous subsequence). Such a sequence is denoted by B(k, n) and has length kn, which is also the number of distinct strings of length n on A. Each of these distinct strings, when taken as a substring of B(k, n), must start at a different position, because substrings starting at the same position are not distinct. Therefore, B(k, n) must have at leastkn symbols. And since B(k, n) has exactlykn symbols, de Bruijn sequences are optimally short with respect to the property of containing every string of length n at least once.

<span class="mw-page-title-main">Sturmian word</span> Kind of infinitely long sequence of characters

In mathematics, a Sturmian word, named after Jacques Charles François Sturm, is a certain kind of infinitely long sequence of characters. Such a sequence can be generated by considering a game of English billiards on a square table. The struck ball will successively hit the vertical and horizontal edges labelled 0 and 1 generating a sequence of letters. This sequence is a Sturmian word.

In mathematics, a free Lie algebra over a field K is a Lie algebra generated by a set X, without any imposed relations other than the defining relations of alternating K-bilinearity and the Jacobi identity.

In combinatorics, a squarefree word is a word that does not contain any squares. A square is a word of the form XX, where X is not empty. Thus, a squarefree word can also be defined as a word that avoids the pattern XX.

In formal language theory, an alphabet, sometimes called a vocabulary, is a non-empty set of indivisible symbols/glyphs, typically thought of as representing letters, characters, digits, phonemes, or even words. Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and, depending on its purpose, may be finite, countable, or even uncountable.

In mathematics and theoretical computer science, an automatic sequence (also called a k-automatic sequence or a k-recognizable sequence when one wants to indicate that the base of the numerals used is k) is an infinite sequence of terms characterized by a finite automaton. The n-th term of an automatic sequence a(n) is a mapping of the final state reached in a finite automaton accepting the digits of the number n in some fixed base k.

Combinatorics on words is a fairly new field of mathematics, branching from combinatorics, which focuses on the study of words and formal languages. The subject looks at letters or symbols, and the sequences they form. Combinatorics on words affects various areas of mathematical study, including algebra and computer science. There have been a wide range of contributions to the field. Some of the first work was on square-free words by Axel Thue in the early 1900s. He and colleagues observed patterns within words and tried to explain them. As time went on, combinatorics on words became useful in the study of algorithms and coding. It led to developments in abstract algebra and answering open questions.

In computer science, the complexity function of a word or string is the function that counts the number of distinct factors of that string. More generally, the complexity function of a formal language counts the number of distinct words of given length.

In mathematics, in the areas of group theory and combinatorics, Hall words provide a unique monoid factorisation of the free monoid. They are also totally ordered, and thus provide a total order on the monoid. This is analogous to the better-known case of Lyndon words; in fact, the Lyndon words are a special case, and almost all properties possessed by Lyndon words carry over to Hall words. Hall words are in one-to-one correspondence with Hall trees. These are binary trees; taken together, they form the Hall set. This set is a particular totally ordered subset of a free non-associative algebra, that is, a free magma. In this form, the Hall trees provide a basis for free Lie algebras, and can be used to perform the commutations required by the Poincaré–Birkhoff–Witt theorem used in the construction of a universal enveloping algebra. As such, this generalizes the same process when done with the Lyndon words. Hall trees can also be used to give a total order to the elements of a group, via the commutator collecting process, which is a special case of the general construction given below. It can be shown that Lazard sets coincide with Hall sets.

In mathematics, a shuffle algebra is a Hopf algebra with a basis corresponding to words on some set, whose product is given by the shuffle productXY of two words X, Y: the sum of all ways of interlacing them. The interlacing is given by the riffle shuffle permutation.

In computer science, the lexicographically minimal string rotation or lexicographically least circular substring is the problem of finding the rotation of a string possessing the lowest lexicographical order of all such rotations. For example, the lexicographically minimal rotation of "bbaaccaadd" would be "aaccaaddbb". It is possible for a string to have multiple lexicographically minimal rotations, but for most applications this does not matter as the rotations must be equivalent. Finding the lexicographically minimal rotation is useful as a way of normalizing strings. If the strings represent potentially isomorphic structures such as graphs, normalizing in this way allows for simple equality checking. A common implementation trick when dealing with circular strings is to concatenate the string to itself instead of having to perform modular arithmetic on the string indices.

In mathematics, a factorisation of a free monoid is a sequence of subsets of words with the property that every word in the free monoid can be written as a concatenation of elements drawn from the subsets. The Chen–Fox–Lyndon theorem states that the Lyndon words furnish a factorisation. The Schützenberger theorem relates the definition in terms of a multiplicative property to an additive property.

In mathematics and computer science, a morphic word or substitutive word is an infinite sequence of symbols which is constructed from a particular class of endomorphism of a free monoid.

In mathematics, a sesquipower or Zimin word is a string over an alphabet with identical prefix and suffix. Sesquipowers are unavoidable patterns, in the sense that all sufficiently long strings contain one.

In mathematics and theoretical computer science, a pattern is an unavoidable pattern if it is unavoidable on any finite alphabet.

In mathematics, a recurrent word or sequence is an infinite word over a finite alphabet in which every factor occurs infinitely many times. An infinite word is recurrent if and only if it is a sesquipower.

In mathematics and computer science, the critical exponent of a finite or infinite sequence of symbols over a finite alphabet describes the largest number of times a contiguous subsequence can be repeated. For example, the critical exponent of "Mississippi" is 7/3, as it contains the string "ississi", which is of length 7 and period 3.

References