DOI:
10.1145/3685650
DocEng '24: Proceedings of the ACM Symposium on Document Engineering 2024
ACM 2024 Proceeding
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
DocEng '24: ACM Symposium on Document Engineering 2024, San Jose, CA, USA, August 20-23, 2024
ISBN:
979-8-4007-1169-5
Published:
18 September 2024

research-article
Handheld Video Document Scanning: A Robust On-Device Model for Multi-Page Document Scanning

Document capture applications on smartphones have emerged as popular tools for digitizing documents. For many individuals, capturing documents with their smartphones is more convenient than using dedicated photocopiers or scanners, even if the quality of ...

short-paper
Which is the most suitable scanner resolution for documents? Detailing the answer given to the question raised by Professor George Nagy

Defining the correct image resolution is a fundamental issue to preserve all the information in a document, keeping the minimum image acquisition and processing times, as well as the storage space and computer bandwidth for network transmission, allowing ...

research-article
Open Access
Best Paper
ZigZag: A Robust Adaptive Approach to Non-Uniformly Illuminated Document Image Binarization

In the era of mobile imaging, the quality of document photos captured by smartphones often suffers due to adverse lighting conditions. Traditional document analysis and optical character recognition systems encounter difficulties with images that have ...

research-article
Texture-based Document Binarization

Image binarization, the conversion of a color image into its monochromatic version, plays a key role in many document processing pipelines. The technical literature presents over a hundred different algorithms for document image binarization yielding ...

research-article
Open Access
A Heuristic Algorithm for Mathematical Markup Encoding Based on the Relative Positions of Characters

Mathematical expressions (MEs) are the most crucial technical content in scientific documents, yet their presentations are not easy to describe. However, LaTeX, one of the primary markup languages in mathematics, enables MEs to be easily understood by ...

research-article
Open Access
Best Student Paper
Graph Detective: A User Interface for Intuitive Graph Exploration Through Visualized Queries

Graph databases are used across several domains due to the intuitive structure of graphs. They are well-suited for storing document collections together with their interlinkages through metadata and annotations. Yet, querying such graphs requires ...

research-article
CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool (DocumentLabeler) for Engineering System Design

In the realm of document engineering and Natural Language Processing (NLP), the integration of digitally born catalogs into product design processes presents a novel avenue for enhancing information extraction and interoperability. This paper introduces ...

short-paper
Open Access
TopicTag: Automatic Annotation of NMF Topic Models Using Chain of Thought and Prompt Tuning with LLMs

Topic modeling is a technique for organizing and extracting themes from large collections of unstructured text. Non-negative matrix factorization (NMF) is a common unsupervised approach that decomposes a term frequency-inverse document frequency (TF-IDF) ...

short-paper
Open Access
Post-OCR Correction with OpenAI's GPT Models on Challenging English Prosody Texts

The digitization of historical documents faces challenges with the accuracy of Optical Character Recognition (OCR). Noting the success of large language models (LLMs) on many text-based tasks, this paper explores the potential of OpenAI's GPT models (3.5-...

short-paper
Open Access
Detecting AI-Generated Texts in Cross-Domains

Existing tools to detect text generated by a large language model (LLM) have met with certain success, but their performance can drop when dealing with texts in new domains. To tackle this issue, we train a ranking classifier called RoBERTa-Ranker, a ...

panel
Competition on Binarizing Photographed Document Images 2024 Quality, Time and Space Report

Many document processing platforms have image binarization as a key step. The performance of binarization algorithms depends on several factors that span from the quality of the digitalization devices to the intrinsic features of the document itself and ...

research-article
Assessing Abstractive and Extractive Methods for Automatic News Summarization

Automatic Text Summarization (ATS) is a research area that originated in the late 1950s and has gained increasing importance with the surge of text data available today. ATS approaches are generally classified into extractive and abstractive methods. ...

short-paper
Assessing the Reliability and Validity of the Measures for Automatic Text Summarization

Automatic Text Summarization (ATS) is a research area that originated in the late 1950s and has gained increasing importance with the surging amount of text data available today. One of the key challenges in this area is how to quantitatively assess the ...

short-paper
Open Access
An Efficient PDF Malware Detection Method Using Highly Compact Features

The growing use of PDFs has made them a prime target for malware attacks. Machine learning-based approaches for detecting PDF malware are increasingly popular due to their high accuracy and efficiency. However, the effectiveness of these systems largely ...

short-paper
Automatically producing accessible and reusable PDFs with LaTeX

In this application note we outline the goals of the "LaTeX Tagged PDF" project, describe its current status, show how it can already be used to create accessible and reusable PDFs, and outline our future plans for a successful completion. Further ...

research-article
LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

Sparse retrieval methods like BM25 are based on lexical overlap, focusing on the surface form of the terms that appear in the query and the document. The use of inverted indices in these methods leads to high retrieval efficiency. On the other hand, ...

short-paper
Similarity Problems in Paragraph Justification: An Extension to the Knuth-Plass Algorithm

In high-quality typography, consecutive lines beginning or ending with the same word or sequence of characters are considered a defect. We have implemented an extension to TeX's paragraph justification algorithm which handles this problem. Experimentation ...

Acceptance Rates

DocEng '24 Paper Acceptance Rate 16 of 27 submissions, 59%;
Overall Acceptance Rate 194 of 564 submissions, 34%
Year        Submitted  Accepted  Rate
DocEng '24  27         16        59%
DocEng '23  27         9         33%
DocEng '19  77         30        39%
DocEng '17  71         13        18%
DocEng '16  35         11        31%
DocEng '15  31         11        35%
DocEng '14  41         15        37%
DocEng '13  50         16        32%
DocEng '10  42         13        31%
DocEng '08  62         21        34%
DocEng '02  46         21        46%
DocEng '01  55         18        33%
Overall     564        194       34%