DOI: 10.1145/1096601.1096637
Article

Injecting information into atomic units of text

Published: 02 November 2005

Abstract

This paper presents a new approach to text processing, based on textemes. These are atomic text units generalising the concepts of character and glyph by merging them in a common data structure, together with an arbitrary number of user-defined properties. In the first part, we give a survey of the notions of character and glyph and their relation with Natural Language Processing models, some visual text representation issues and strategies adopted by file formats (SVG, PDF, DVI) and software (Uniscribe, Pango). In the second part we show applications of textemes in various text processing issues: ligatures, variant glyphs and other OpenType-related properties, hyphenation, color and other presentation attributes, Arabic form and morphology, CJK spacing, metadata, etc. Finally we describe how the Omega typesetting system implements texteme processing as an example of a generalised approach to input character stream parsing, internal representation of text, and modular typographic transformations. In the data flow from input to output, whether in memory or through serializations in auxiliary data files, textemes progressively accumulate information that is used by Omega's paragraph builder engine and included in the output DVI file. We show how this additional information increases efficiency of conversions to other file formats such as PDF or SVG. We conclude this paper by presenting interesting potential applications of texteme methods in document engineering.
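As an illustration of the data structure the abstract describes, here is a minimal sketch of a texteme that carries a character, a glyph and open-ended properties in one unit. It is an assumption-laden toy, not Omega's implementation: the names `Texteme`, `from_string`, `apply_fi_ligature` and `plain_text`, the glyph name `f_i`, and the property keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Texteme:
    """Hypothetical atomic text unit: character + glyph + user-defined properties."""
    character: Optional[str] = None   # underlying Unicode character(s), if any
    glyph: Optional[str] = None       # glyph name, once a glyph has been chosen
    properties: Dict[str, Any] = field(default_factory=dict)


def from_string(text: str) -> List[Texteme]:
    """Turn an input character stream into textemes that have no glyph yet."""
    return [Texteme(character=ch) for ch in text]


def apply_fi_ligature(textemes: List[Texteme]) -> List[Texteme]:
    """Toy transformation: merge an 'f' + 'i' pair into one ligature texteme.

    The merged texteme keeps the original characters and the properties of
    both inputs, so information accumulates instead of being lost.
    """
    out: List[Texteme] = []
    i = 0
    while i < len(textemes):
        cur = textemes[i]
        nxt = textemes[i + 1] if i + 1 < len(textemes) else None
        if nxt is not None and cur.character == "f" and nxt.character == "i":
            merged_props = {**cur.properties, **nxt.properties, "ligature": True}
            out.append(Texteme(character="fi", glyph="f_i", properties=merged_props))
            i += 2  # both input textemes were consumed
        else:
            out.append(cur)
            i += 1
    return out


def plain_text(textemes: List[Texteme]) -> str:
    """Recover the underlying character string from a texteme sequence."""
    return "".join(t.character or "" for t in textemes)


run = from_string("fin")
run[0].properties["color"] = "red"   # a presentation attribute attached to one unit
shaped = apply_fi_ligature(run)
```

In this sketch the ligature texteme still knows that it stands for the characters "fi", which is the kind of retained information that would make text recoverable after conversion to a glyph-oriented output format.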


Published In

DocEng '05: Proceedings of the 2005 ACM symposium on Document engineering
November 2005
252 pages
ISBN:1595932402
DOI:10.1145/1096601
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OpenType
  2. PDF
  3. SVG
  4. Unicode
  5. character
  6. glyph
  7. multilingual typesetting
  8. omega
  9. texteme

Qualifiers

  • Article

Conference

DocEng05: ACM Symposium on Document Engineering
November 2 - 4, 2005
Bristol, United Kingdom

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%
