Handheld Video Document Scanning: A Robust On-Device Model for Multi-Page Document Scanning
Document capture applications on smartphones have emerged as popular tools for digitizing documents. For many individuals, capturing documents with their smartphones is more convenient than using dedicated photocopiers or scanners, even if the quality of ...
Which is the most suitable scanner resolution for documents? Detailing the answer given to the question raised by Professor George Nagy
Defining the correct image resolution is fundamental to preserving all the information in a document while minimizing image acquisition and processing times, storage space, and the network bandwidth needed for transmission, allowing ...
ZigZag: A Robust Adaptive Approach to Non-Uniformly Illuminated Document Image Binarization
In the era of mobile imaging, the quality of document photos captured by smartphones often suffers due to adverse lighting conditions. Traditional document analysis and optical character recognition systems encounter difficulties with images that have ...
Texture-based Document Binarization
Image binarization, the conversion of a color image into its monochromatic version, plays a key role in many document processing pipelines. The technical literature presents over a hundred different algorithms for document image binarization yielding ...
A Heuristic Algorithm for Mathematical Markup Encoding Based on the Relative Positions of Characters
Mathematical expressions (MEs) are the most crucial technical content in scientific documents, yet their presentations are not easy to describe. However, LaTeX, one of the primary markup languages in mathematics, enables MEs to be easily understood by ...
Graph Detective: A User Interface for Intuitive Graph Exploration Through Visualized Queries
Graph databases are used across several domains due to the intuitive structure of graphs. They are well-suited for storing document collections together with their interlinkages through metadata and annotations. Yet, querying such graphs requires ...
CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool (DocumentLabeler) for Engineering System Design
In the realm of document engineering and Natural Language Processing (NLP), the integration of digitally born catalogs into product design processes presents a novel avenue for enhancing information extraction and interoperability. This paper introduces ...
TopicTag: Automatic Annotation of NMF Topic Models Using Chain of Thought and Prompt Tuning with LLMs
- Selma Wanna,
- Nicholas Solovyev,
- Ryan Barron,
- Maksim E. Eren,
- Manish Bhattarai,
- Kim Ø. Rasmussen,
- Boian S. Alexandrov
Topic modeling is a technique for organizing and extracting themes from large collections of unstructured text. Non-negative matrix factorization (NMF) is a common unsupervised approach that decomposes a term frequency-inverse document frequency (TF-IDF) ...
Post-OCR Correction with OpenAI's GPT Models on Challenging English Prosody Texts
The digitization of historical documents faces challenges with the accuracy of Optical Character Recognition (OCR). Noting the success of large language models (LLMs) on many text-based tasks, this paper explores the potential of OpenAI's GPT models (3.5-...
Detecting AI-Generated Texts in Cross-Domains
Existing tools to detect text generated by a large language model (LLM) have met with some success, but their performance can drop when dealing with texts in new domains. To tackle this issue, we train a ranking classifier called RoBERTa-Ranker, a ...
Competition on Binarizing Photographed Document Images 2024 Quality, Time and Space Report
- Rafael Dueire Lins,
- Gustavo P. Chaves,
- Gabriel de F. P e Silva,
- Thaylor Vieira,
- Ricardo da Silva Barboza,
- Steven J. Simske
Many document processing platforms have image binarization as a key step. The performance of binarization algorithms depends on several factors that span from the quality of the digitalization devices to the intrinsic features of the document itself and ...
Assessing Abstractive and Extractive Methods for Automatic News Summarization
Automatic Text Summarization (ATS) is a research area that originated in the late 1950s and has gained increasing importance with the surge of text data available today. ATS approaches are generally classified into extractive and abstractive methods. ...
Assessing the Reliability and Validity of the Measures for Automatic Text Summarization
Automatic Text Summarization (ATS) is a research area that originated in the late 1950s and has gained increasing importance with the surging amount of text data available today. One of the key challenges in this area is how to quantitatively assess the ...
An Efficient PDF Malware Detection Method Using Highly Compact Features
The growing use of PDFs has made them a prime target for malware attacks. Machine learning-based approaches for detecting PDF malware are increasingly popular due to their high accuracy and efficiency. However, the effectiveness of these systems largely ...
Automatically producing accessible and reusable PDFs with LaTeX
In this application note we outline the goals of the "LaTeX Tagged PDF" project, describe its current status, show how it can already be used to create accessible and reusable PDFs, and outline our future plans for a successful completion. Further ...
LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors
Sparse retrieval methods like BM25 are based on lexical overlap, focusing on the surface form of the terms that appear in the query and the document. The use of inverted indices in these methods leads to high retrieval efficiency. On the other hand, ...
Similarity Problems in Paragraph Justification: An Extension to the Knuth-Plass Algorithm
In high-quality typography, consecutive lines beginning or ending with the same word or sequence of characters are considered a defect. We have implemented an extension to TeX's paragraph justification algorithm which handles this problem. Experimentation ...
Index Terms
- Proceedings of the ACM Symposium on Document Engineering 2024
Acceptance Rates
| Year | Submitted | Accepted | Rate |
|---|---|---|---|
| DocEng '24 | 27 | 16 | 59% |
| DocEng '23 | 27 | 9 | 33% |
| DocEng '19 | 77 | 30 | 39% |
| DocEng '17 | 71 | 13 | 18% |
| DocEng '16 | 35 | 11 | 31% |
| DocEng '15 | 31 | 11 | 35% |
| DocEng '14 | 41 | 15 | 37% |
| DocEng '13 | 50 | 16 | 32% |
| DocEng '10 | 42 | 13 | 31% |
| DocEng '08 | 62 | 21 | 34% |
| DocEng '02 | 46 | 21 | 46% |
| DocEng '01 | 55 | 18 | 33% |
| Overall | 564 | 194 | 34% |