2015 13th International Conference on Document Analysis and Recognition (ICDAR)
Viral Transcript Alignment
Gil Sadeh, Lior Wolf, Nachum Dershowitz
Tel Aviv University, Tel Aviv, Israel

Tal Hassner
The Open University, Raanana, Israel

Daniel Stökl Ben-Ezra
École Pratique des Hautes Études, Paris, France
Abstract—We present an end-to-end system for aligning transcript letters to their coordinates in a manuscript image. An intuitive GUI and an automatic line-detection method enable the user to perform an exact alignment of parts of document pages. In order to bridge large regions between annotations and to augment the manual effort, the system employs an optical-flow engine that directly matches, at the pixel level, the image of a line of historical text with a synthetic image created from the transcript's matching line. Meanwhile, by accumulating aligned letters and performing letter spotting, the system is able to bootstrap a rapid semi-automatic transcription of the remaining text. Thus, the amount of manual work is greatly diminished, and the transcript alignment task becomes practical regardless of the corpus size.
I. INTRODUCTION
While high-quality imaging is currently the most effective
way to create digital copies of historical manuscripts, having
a searchable and processable text is often equally important to
scholars. Unfortunately, optical character recognition (OCR)
of historical documents is notoriously difficult. On the other
hand, it is a very common scenario for important texts to have
been transcribed manually. This is the case for many of the
most valuable collections recently digitized and made available
online. Examples include the Dead Sea Scrolls, Greek papyri,
Codex Sinaiticus, some of the Cairo Genizah documents, much
of the Tibetan Buddhist Canon, Taiwanese deeds and court
papers in the Taiwan History Digital Library, medieval Latin
and English manuscripts, the Early English Laws collection,
the George Washington Papers,1 and many others.
We have designed an end-to-end interactive system which
learns to align each letter of a transcript with the corresponding pixels of the letter in the digitized images of scanned
manuscripts. After identifying the lines of text in the image,
the system begins with a first approximate alignment using
an optical-flow engine, as in [1]. The interface we propose
allows the user to naturally correct any alignment errors.
Thereafter, the system learns to match individual letters based
on corrections introduced by the user. As the system learns to
better recognize letters in the images, the amount of manual
effort required decreases dramatically, from one line to the
next. As a consequence, transcript alignment becomes easier
and faster as the user progresses through the document.
1 Available online, respectively: http://www.deadseascrolls.org.il, http://www.papyrology.ox.ac.uk/Ancient Lives, http://codexsinaiticus.org, http://www.genizah.org, http://www.tbrc.org, http://idp.bl.uk, http://thdl.ntu.edu.tw, http://scriptorium.english.cam.ac.uk/manuscripts, http://www.earlyenglishlaws.ac.uk, http://rotunda.upress.virginia.edu/founders/GEWN.html.
II. PREVIOUS WORK
The problem of matching text with images of the text was
raised in [2], but only a limited amount of research has been
devoted to the issue. A straightforward approach to alignment
is to perform OCR on the image and then find the best string match between the OCR output and the transcription. A word-level recognizer is another possibility (see, e.g., [3], [4], [5]); it can also be used in training semi-automatic transcription [6],
[7], [8]. But OCR for handwritten text, with which we are
dealing here, is difficult. In [9], [10], [11] and others, the
sequence of word images and the transcript are viewed as time
series, and dynamic time warping (DTW) is used to align them.
Hidden Markov models (HMM) have been used in [12], [13],
[14], [15], [7], for example. Geometric models of characters
and punctuation (including such features as character size
and inter-character gaps) have recently been used to reduce
segmentation errors (e.g., for Japanese in [16] and for Chinese
in [17]). In [18], six visually significant elements, which combine into signatures for the different characters of quill-written pre-Gothic Latin, were selected and detected by means of gradients and connected components. The work in [5] uses a minimized
cost function based on the relative length of words and the
best combination of spaces. Alignment on the word level with
a classification technique constrained by the number of the
words is performed in [19]. The method in [7] also involves an
application of the Viterbi algorithm. In contrast to the above-mentioned methods, we employ a rendered image of the text
as the main representation of the transcript, and then use direct
image-to-image matching techniques [1].
Our system incorporates a letter-spotting engine, which is based on techniques previously used for word spotting. Examples of word-spotting efforts include [20], [21], [22],
[23]. Two approaches are possible in searching for occurrences
of words in documents: one can first segment the text into
words [24] and then compare each target word with the query,
or one can search for a match to the query using a sliding
window of some sort. An example of word spotting among
segmented images is [25].
In our scenario, since letter segmentation is even harder
than word segmentation, we cannot assume prior segmentation.
Among the works that do not require segmentation are [26],
[27], [28]. An in-between approach is to work with multiple
overlapping target regions, as in [29]. Our letter-spotting
engine also employs multiple overlapping regions and is based
on our previous word-spotting work [30]. The architecture is hierarchical and is inspired by the work of Liao et al. [31]
in the domain of face recognition.
Fig. 1. A screenshot of our system. The left pane is the image pane; the right pane contains the transcribed text. The middle pane is used to zoom in and interactively align a single line at a time. In this middle pane, the first two lines are used to edit anchor points. Hovering over one of the anchor lines highlights the matching anchor in the other line. Anchors can be added, dragged, or removed. The other three lines in the middle pane depict the interpolated alignment: a line from the image, a line of warped synthesized text, and the line of synthesized text. Manuscript: Thomasin Circlaere, Welscher Gast, Cod. Pal. germ. 389, fol. 1r, courtesy of the Universitätsbibliothek Heidelberg at http://digi.ub.uni-heidelberg.de/diglit/cpg389/0013; transcription by Jakub Šimek, Heidelberg, SFB 933, Teilprojekt B06.
III. BASELINE INTERACTIVE SYSTEM
The goal of the transcript alignment system is to produce
accurate results with minimal user intervention. The user
interface, depicted in Figure 1, comprises a manuscript image
pane, a document text pane, and an alignment pane, which is
used to align one line of text at a time (though not addressed in this paper, the extension to vertical scripts is straightforward).
The process starts by automatically detecting the lines of
the manuscript using a line-detection algorithm, here the method described in [32]. This method is based on binarization followed by a horizontal projection and a peak-finding algorithm. Once lines are found, they are individually
trimmed at their left and right boundaries, as detected through
a vertical projection. Our GUI allows the user to then add
lines which were missed, remove spurious lines, and adjust
the location of the separators between the lines.
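To illustrate the projection-profile approach, a minimal Python sketch follows. It is a stand-in for the method of [32]; the global threshold, smoothing width, and peak-prominence value are illustrative assumptions rather than tuned parameters of the system.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def detect_lines(gray, smooth_sigma=3.0):
    """gray: 2D array in [0, 1], dark ink on a light background.
    Returns the row coordinates of text-line centers."""
    ink = gray < 0.5                                # crude global binarization
    profile = ink.sum(axis=1).astype(float)        # horizontal projection
    profile = gaussian_filter1d(profile, smooth_sigma)
    # Peaks of the smoothed profile correspond to text-line centers.
    peaks, _ = find_peaks(profile, prominence=0.1 * profile.max())
    return peaks

def trim_line(gray, top, bottom):
    """Trim a line at its left/right boundaries via a vertical projection."""
    ink = (gray[top:bottom] < 0.5).sum(axis=0)
    cols = np.flatnonzero(ink > 0)
    return (cols[0], cols[-1]) if cols.size else (0, gray.shape[1] - 1)
```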
Then, individual manuscript lines are aligned with the
corresponding text line by line. This alignment is performed in
a semi-automatic manner. Anchor points along the horizontal
lines are used to describe the location of text letters in the
image. Each anchor point divides a segment of the manuscript line into a left part and a right part; it also divides the matching text segment into two parts. Such anchor points can be inserted by
the user or set automatically by the system (see Section IV).
The insertion of such points, as well as the adjustment and possible removal of misplaced automatic points, constitutes the main form of interaction in the proposed alignment process.
In order to interpolate between the anchor points, we
employ a solution previously proposed by the authors [1].
Since this is a general solution, which does not rely on the
existence of OCR capabilities, it can be used to align text and
images in a wide variety of scripts with minimal effort required
for adaptation. Instead of detecting individual graphemes, it
robustly employs an optical-flow based technique to directly
match the historical image with a synthetic image created from
the text at the pixel level. The ability to synthesize such images
assumes that, given a manuscript, a similar-looking font is available. The method, however, was shown to perform well even when the manuscript's script differs greatly from the font used for synthesis.
The method of [1] is independently applied to every
segment—every stretch of pixels lying between two adjacent
anchor points. This local text-to-image alignment problem thus
reduces to that of aligning a cropped image out of a line of text
and its matching, synthetic transcript image. For completeness,
we briefly describe the alignment method below.
Text-to-image alignment
The process starts by synthesizing a reference image of
the transcript segment using a suitable font. The synthesis keeps track of the provenance of every pixel of the resulting image; i.e., we know, for every pixel, which letter it corresponds to.
Then, the cropped image line and the matching generated
image are both represented by the same image encoding
pipeline. First, each image I is converted to a Four-Patch LBP (FPLBP) code image C, where each pixel is assigned an integer value in the range [0..15] [33]. Next, local histograms of FPLBP codes are pooled at each image location. Since most of the ambiguity is in the horizontal direction for horizontal
scripts, these histograms are gathered from elliptical domains; a 2D Gaussian filter, with sigmas of 2.5 (horizontal) and 1 (vertical), is used for this purpose. Lastly, to compute the optical flow between the two images, the SIFT-flow method of [34] is applied, with its original Dense SIFT representation [35] replaced by these per-pixel histograms.
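To make the encoding concrete, the following Python sketch computes the per-pixel histogram stack; the FPLBP coding itself (detailed in [33]) is abstracted behind an assumed helper `fplbp_codes`, and the resulting channels would replace the Dense SIFT layer in a SIFT-flow style solver [34]. This is a sketch under those assumptions, not the system's implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def soft_histogram_image(codes, n_codes=16, sigmas=(1.0, 2.5)):
    """codes: 2D int array of FPLBP codes in [0..n_codes-1]
    (obtained from an assumed helper, e.g. codes = fplbp_codes(image)).
    Returns an (H, W, n_codes) array holding, for each pixel, a
    Gaussian-weighted histogram of the codes around it (sigma 1
    vertically, 2.5 horizontally, matching the elliptical domains)."""
    h, w = codes.shape
    hists = np.zeros((h, w, n_codes), dtype=np.float32)
    for c in range(n_codes):
        # Smoothing the indicator image of code c counts, with Gaussian
        # weights, how often c occurs near each pixel; the larger
        # horizontal sigma tolerates the larger horizontal ambiguity.
        hists[:, :, c] = gaussian_filter((codes == c).astype(np.float32), sigmas)
    return hists
```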
Once the method is applied, each pixel in the manuscript
image segment is mapped to a pixel of the synthetic image,
which is, in turn, mapped to the original text letters. Each
manuscript image pixel is therefore aligned to the text. This
very efficient method can be applied interactively to each
image segment once the anchor points are placed or modified.
However, it is often the case that in order to obtain very
accurate results, the anchor points cannot be spread far apart;
that is, larger image sections raise the likelihood that local
matching errors will occur.
In order to reduce the manual effort of adding many anchor
points by hand, we attempt to learn from parts of the document
which have already been accurately aligned. One avenue for
improvement would be to synthesize more accurate images
using letter samples that were already aligned. However, the
method of [1] is robust to specific letter forms and the resulting
improvement would be limited. Instead, we choose to use the
letter samples in order to automatically place anchor points,
as is described next.
IV. GOING VIRAL
The main motivation for transcript alignment is that it provides letter samples that are accessible for both human analysis
and machine learning. In our system, automatic analysis already takes place as part of the transcript alignment process, in order to reduce the required effort. This is done through letter spotting, using a method adapted from the efficient word-spotting method we previously proposed [30].
We scale each manuscript line to have a height of 21 pixels
and consider letters of a width of 9 pixels. Previously aligned
text provides letter samples of varying widths. Letter samples
that are too narrow are discarded. For letter samples that are
wider, only the central part of the specified width is used.
The letter samples are stored in a dataset from which they are
retrieved during search.
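A minimal sketch of this normalization step, assuming OpenCV for resizing; the boundary handling at line edges is an assumption of the sketch.

```python
import cv2

LINE_H, LETTER_W = 21, 9   # line height and letter width, as above

def normalize_letter(line_img, x0, x1):
    """line_img: grayscale line image; [x0, x1) is the letter's aligned
    x-range in original coordinates. Returns a 21x9 crop, or None."""
    scale = LINE_H / line_img.shape[0]
    line = cv2.resize(line_img, None, fx=scale, fy=scale)
    x0, x1 = int(x0 * scale), int(x1 * scale)
    if x1 - x0 < LETTER_W:
        return None                       # too narrow: discard the sample
    cx = (x0 + x1) // 2                   # wide letters: keep the center
    left = max(0, min(cx - LETTER_W // 2, line.shape[1] - LETTER_W))
    return line[:, left:left + LETTER_W]
```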
For the insertion of automatic anchor points to a line of
text, the manuscript segment of the entire line is divided into
overlapping windows (stride of 1) of the specified width. These
are used as queries in the letter spotting search. Each dataset or
query image patch is represented as a vector as follows: First, the patch is divided into 3×7 non-overlapping cells of size 3×3, each of which is encoded by a HOG descriptor [36] of length 31.
All descriptors are concatenated and L2 normalized. The HOG
descriptor r is therefore of dimensionality 31 × 3 × 7 = 651.
A matrix $M \in \mathbb{R}^{5250 \times 651}$, which consists of the vector representations (of the same form as $r$) of 5250 random 9×21 bounding boxes taken from all manuscript lines, is then considered. The vector $r$ is transformed to a vector $s \in \mathbb{R}^{5250}$ by means of a linear projection: $s = Mr$. In other words, the normalized descriptor vector is represented by its cosine similarities to a predetermined set of exemplars.
Then, a max-pooling step takes place. The set of indices [1..5250] is randomly split into fixed groups $I_i$ of size 15. Given a vector $s$, max pooling simply takes one measurement per group $I_i$: the maximal value among the 15 indices of $I_i$ in the vector $s$. Put differently, let $t$ be the vector of length 350 that results from applying max pooling to the vector $s$. Then $t_i = \max_{j \in I_i} s_j$.
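In code, the representation pipeline might look as follows; `hog31` stands in for the 31-dimensional HOG variant [36] and, like the fixed random seed, is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
GROUPS = rng.permutation(5250).reshape(350, 15)   # fixed random groups I_i

def patch_descriptor(patch, hog31):
    """patch: 21x9 grayscale array. Returns the 651-dim unit vector r."""
    cells = [hog31(patch[y:y + 3, x:x + 3])        # 7x3 grid of 3x3 cells
             for y in range(0, 21, 3) for x in range(0, 9, 3)]
    r = np.concatenate(cells)                      # 21 cells x 31 dims = 651
    return r / (np.linalg.norm(r) + 1e-8)          # L2 normalization

def pooled_representation(r, M):
    """M: (5250, 651) matrix of exemplar descriptors. Returns t (length 350)."""
    s = M @ r                                      # cosine similarities
    return s[GROUPS].max(axis=1)                   # t_i = max_{j in I_i} s_j
```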
Queries are performed based on the Euclidean distance
between the vectors t associated with the patches extracted
from the image line and the vectors t associated with the
already stored letter samples. Only the top-1 retrieval is
considered. Among overlapping image regions, non-maximal suppression is employed; i.e., from a group of overlapping scan windows in the new line to be transcribed, only the scan window with the lowest Euclidean distance to its top-1 search result is considered as a possible match.
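A simple quadratic-time sketch of this suppression step, assuming each window already carries the Euclidean distance to its top-1 retrieval:

```python
import numpy as np

def suppress_overlaps(xs, dists, width=9):
    """xs: sorted window x-positions (stride 1); dists: top-1 distances.
    Returns the indices of the windows that survive the suppression."""
    keep = []
    for i, (x, d) in enumerate(zip(xs, dists)):
        neighbors = [j for j, xj in enumerate(xs) if abs(xj - x) < width]
        if all(d <= dists[j] for j in neighbors):  # lowest distance wins
            keep.append(i)
    return keep
```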
Eliminating matching conflicts: The process above is
based on visual similarity alone and contains many false
matches. Next, we select an optimal set of anchor points
by adding the information arising from the alphabet letters
associated with the letter samples and the actual text to
be aligned. This is done using binary linear programming,
which is computed very efficiently using standard solvers for
problems of the scale at hand.
This process takes into account the visual distance $d_i$ of a potential letter sample $p_i$ from the retrieving query window $q_i$, based on the corresponding vectors $t$. It also considers the expected distance $d_{ij}$, along the horizontal axis, of each text letter $l_j$ from the position of $q_i$. In order to compute this distance, $l_j$ is first associated with an image location based on the optical-flow method described in Section III.

We formulate the problem as a bipartite matching problem. Each matched dataset letter sample $p_i$ is a potential candidate for a text letter $l_j$ in the new line's transcript if they represent the same alphabet letter. In this case, we define $W_{ij} = \exp(-d_i - d_{ij}/\sigma)$, for some parameter $\sigma$. If the letters associated with $p_i$ and $l_j$ differ, $W_{ij} = 0$. The value of $\sigma$ is set to 50 in all our experiments, to allow a very relaxed consideration of the predicted location along the horizontal axis.
The following optimization problem is solved in order to recover the selected matches, which are represented by the binary variables $x_{ij}$:
$$\max_{x_{ij} \in \{0,1\}} \sum_{i,j} W_{ij} x_{ij}$$
subject to
$$\sum_i x_{ij} \le 1 \;\; \forall j, \qquad \sum_j x_{ij} \le 1 \;\; \forall i,$$
$$x_{i_1,j_1} + x_{i_2,j_2} \le 1 \;\; \forall i_1 > i_2,\ j_1 < j_2.$$
The first two constraints state that no query window $q_i$ can be associated with multiple text letters $l_j$, and vice versa. Note that it is possible for the same database sample to appear twice, i.e., $p_i = p_{i'}$ for $i \neq i'$. The third constraint enforces the order of matches. It cannot happen that two image patches
$q_{i_1}$ and $q_{i_2}$ appear in the line in a different order than that of the matched $l_{j_1}$ and $l_{j_2}$.
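The program is small enough to be handed to an off-the-shelf solver. The sketch below uses the PuLP modeling library as an assumed dependency (the paper does not name a solver); pairs with zero weight simply receive no variable.

```python
import pulp

def select_anchors(W):
    """W: n-by-m list of lists of weights W_ij (0 where letters differ).
    Returns the selected (window, letter) index pairs."""
    n, m = len(W), len(W[0])
    prob = pulp.LpProblem("anchor_selection", pulp.LpMaximize)
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in range(n) for j in range(m) if W[i][j] > 0}
    prob += pulp.lpSum(W[i][j] * x[i, j] for (i, j) in x)      # objective
    for j in range(m):                                # each text letter once
        prob += pulp.lpSum(x[i, jj] for (i, jj) in x if jj == j) <= 1
    for i in range(n):                                # each query window once
        prob += pulp.lpSum(x[ii, j] for (ii, j) in x if ii == i) <= 1
    for (i1, j1) in x:                                # enforce match order
        for (i2, j2) in x:
            if i1 > i2 and j1 < j2:
                prob += x[i1, j1] + x[i2, j2] <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [(i, j) for (i, j) in x if x[i, j].value() == 1]
```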
Finally, we note that some letters tend to create more false
matches than others. For each alphabet letter, we therefore maintain usage counters: once the automatic anchors created for a letter have been rejected more than 30% of the time, no further suggestions are made based on that letter.
V. EXPERIMENTS
The complete system was implemented and evaluated on
a variety of historical documents in various scripts. In each
experiment, the system was run line by line. In each line,
the transcript alignment was performed to the highest possible
accuracy with as many manual interventions as needed. The
amount of manual intervention in each line was recorded. In
order to eliminate ambiguities in the evaluation, dragging of
anchor points was not permitted. Only insertion and deletion
were used. Had dragging been allowed, it could have been used to create anchors at locations unrelated to the letter-spotting results, hence masking the successes and failures of the automatic anchor placement.
Figures 2 and 3 depict the alignment effort on sample
documents. The number of detected anchor points, which is
twice the number of spotted letters, is shown, as well as the
number of anchor points removed or added. Overall, as the
alignment process progresses, the number of spotted letters
increases and the manual effort decreases. In addition, it should
be noted that removing a false anchor point is much easier
than adding a new one. The latter requires twice the number
of mouse clicks and a much higher accuracy.
Figure 4 shows a typical example of the amount of manual labor required. The automatic alignment, obtained from the anchors arising from the spotted letters combined with the optical-flow solution, is quite decent. However, in order to obtain highly accurate results, some adjustments are necessary.
VI. CONCLUSION
The combination of letter spotting and a robust interpolation mechanism allows for accurate and rapid transcript
alignment. While the idea that document analysis should be
bootstrapped with some manual labor, becoming increasingly
automatic, is intuitive, very few document analysis systems
to date have demonstrated this property. We believe that
the success of the system described here comes from the
realization that instead of using the obtained knowledge in
order to improve the underlying alignment engine through
machine learning, an optimization mechanism can select from
an over-complete set of hints that are derived from the growing
set of annotated results.
The entire code of the system will be made publicly
available as an open source project.
Fig. 2. Sample historical documents and plots depicting the amount of user intervention required at every line. (a) Samples in various languages and script types. (b) For each line (x-axis), the y-axis counts the number of spotted anchors (dashed purple line), the number of manually added anchor points (solid brown line), and the number of automatically placed anchor points that were removed (dotted black line). These have been normalized by twice the number of letters per line, since one spotted letter provides two anchors. As can be seen, the number of automatically detected letters increases as more aligned lines become available, reducing the amount of manual work required. From top to bottom: (i) MS Kaufmann A50, fol. 128r, col. 2, by courtesy of the Oriental Collection of the Library and Information Centre of the Hungarian Academy of Sciences; transcription by Daniel Stökl Ben Ezra and Michael Krupp. (ii) The Commentary on Habakkuk Scroll (1QpHab); Qumran, Cave 1; accession number 95.57/28; courtesy of the Israel Museum and the Shrine of the Book. (iii) See Fig. 1. (iv) See Fig. 1 (fol. 7r).
ACKNOWLEDGMENTS
This tool is the result of a cooperation between the EPHE
(École Pratique des Hautes Études), Paris, Tel Aviv University,
and the Open University of Israel. The TAU authors would
like to thank the Ministry of Science, Technology and Space
(Israel-Taiwan grant number 3-10341) and the German-Israeli
Foundation (grant I-145-101.3-2013) for their support. This
research was also supported in part by the Israel Science
Foundation (grant 1330/14). This research was initiated at
the Dagstuhl Perspectives Workshop 12382, “Computation and
Palaeography: Potentials and Limits”.
Fig. 3. A cursive document from the IAM database [37]. Please refer to the caption of Fig. 2 for details.

Fig. 4. A single aligned line from the middle of the document (1QpHab col. VI, line 5). Each plot includes an image line and a transcript line with the anchor points marked; an image line with marked letter locations; a warped synthesized transcript line; and the synthesized transcript line with matching letters marked in matching colors. (a) The automatic alignment, which includes 22 anchor points added automatically by spotting 11 letters. (b) An improved alignment after adding five anchor points.

REFERENCES

[1] T. Hassner, L. Wolf, and N. Dershowitz, "OCR-free transcript alignment," in ICDAR, 2013, pp. 1310–1314.
[2] J. D. Hobby, "Matching document images with ground truth," IJDAR, vol. 1, no. 1, pp. 52–61, 1998.
[3] V. Lavrenko, T. M. Rath, and R. Manmatha, "Holistic word recognition for handwritten historical documents," in DIAL, 2004.
[4] C. Huang and S. N. Srihari, "Mapping transcripts to handwritten text," in ICFHR, 2006.
[5] S. Zinger, J. Nerbonne, and L. Schomaker, "Text-image alignment for historical handwritten documents," in Document Recognition and Retrieval, 2009, pp. 703–724.
[6] J. Puigcerver, A. H. Toselli, and E. Vidal, "Word-graph and character-lattice combination for KWS in handwritten documents," in ICFHR, 2014, pp. 181–186.
[7] A. Toselli, V. Romero, and E. Vidal, "Alignment between text images and their transcripts for handwritten documents," in Language Technology for Cultural Heritage, 2011, pp. 23–37.
[8] J. A. Sánchez, V. Bosch, V. Romero, K. Depuydt, and J. de Does, "Handwritten text recognition for historical documents in the transcriptorium project," in Proc. Int. Conf. on Digital Access to Textual Cultural Heritage, 2014.
[9] E. M. Kornfield, R. Manmatha, and J. Allan, "Text alignment with handwritten documents," in DIAL, 2004.
[10] D. Jose, A. Bhardwaj, and V. Govindaraju, "Transcript mapping for handwritten English documents," in DRR, ser. SPIE Proceedings, B. A. Yanikoglu and K. Berkner, Eds., vol. 6815. SPIE, 2008, p. 68150.
[11] L. Lorigo and V. Govindaraju, "Transcript mapping for handwritten Arabic documents," in Electronic Imaging, 2007.
[12] M. Zimmermann and H. Bunke, "Automatic segmentation of the IAM off-line database for handwritten English text," in ICPR, 2002.
[13] J. L. Rothfeder, R. Manmatha, and T. M. Rath, "Aligning transcripts to automatically segmented handwritten manuscripts," in Document Analysis Systems, ser. Lecture Notes in Computer Science, H. Bunke and A. L. Spitz, Eds., vol. 3872. Springer, 2006, pp. 84–95.
[14] A. Fischer, V. Frinken, A. Fornés, and H. Bunke, "Transcription alignment of Latin manuscripts using HMMs," in HIP, 2011.
[15] A. Toselli and E. Vidal, "Fast HMM-filler approach for key word spotting in handwritten documents," in ICDAR, 2013.
[16] B. Zhu and M. Nakagawa, "Online handwritten Japanese text recognition by improving segmentation quality," in ICFHR, 2008.
[17] F. Yin, Q.-F. Wang, and C.-L. Liu, "Integrating geometric context for text alignment of handwritten Chinese documents."
[18] Y. Leydier, V. Eglin, S. Bres, and D. Stutzmann, "Learning-free text-image alignment for medieval manuscripts," in ICFHR, 2014.
[19] N. Stamatopoulos, G. Louloudis, and B. Gatos, "Efficient transcript mapping to ease the creation of document image segmentation ground truth with text-image alignment," in ICFHR, 2010.
[20] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke, "A novel word spotting method based on recurrent neural networks," PAMI, 2012.
[21] B. Gatos and I. Pratikakis, "Segmentation-free word spotting in historical printed documents," in ICDAR, 2009.
[22] J. A. Rodríguez-Serrano and F. Perronnin, "Local gradient histogram features for word spotting in unconstrained handwritten documents," in ICFHR, 2008.
[23] A. Fischer, A. Keller, V. Frinken, and H. Bunke, "HMM-based word spotting in handwritten documents using subword models," in ICPR, 2010.
[24] R. Manmatha and N. Srimal, "Scale space technique for word segmentation in handwritten documents," in Scale-Space Theories in Computer Vision, 1999.
[25] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Handwritten word spotting with corrected attributes," in ICCV, 2013.
[26] L. Rothacker, M. Rusiñol, and G. Fink, "Bag-of-features HMMs for segmentation-free word spotting in handwritten documents," in ICDAR, 2013.
[27] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Efficient exemplar word spotting," in BMVC, 2012.
[28] M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós, "Browsing heterogeneous document collections by a segmentation-free word spotting method," in ICDAR, 2011.
[29] A. J. Newell and L. D. Griffin, "Multiscale histogram of oriented gradient descriptors for robust character recognition," in ICDAR, 2011.
[30] A. Kovalchuk, L. Wolf, and N. Dershowitz, "A simple and fast word spotting method," in ICFHR, 2014, pp. 3–8.
[31] Q. Liao, J. Z. Leibo, Y. Mroueh, and T. Poggio, "Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines?" CoRR, vol. abs/1311.4082, 2013.
[32] L. Wolf, R. Littman, N. Mayer, T. German, N. Dershowitz, R. Shweka, and Y. Choueka, "Identifying join candidates in the Cairo Genizah," IJCV, 2011.
[33] L. Wolf, T. Hassner, and Y. Taigman, "Descriptor based methods in the wild," in Post-ECCV Faces in Real-Life Images Workshop, 2008.
[34] C. Liu, J. Yuen, and A. Torralba, "SIFT flow: Dense correspondence across scenes and its applications," PAMI, 2011.
[35] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," in Proc. Int. Conf. on Multimedia, 2010.
[36] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[37] U.-V. Marti and H. Bunke, "The IAM-database: An English sentence database for offline handwriting recognition," IJDAR, 2002.