
Beyond the Ground Truth: Alternative Quality Measures of Document Binarizations

2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016

Arie Shaus, Barak Sober, Eli Turkel
The Department of Applied Mathematics, Tel Aviv University, Tel Aviv, Israel
[email protected], [email protected], [email protected]

Eli Piasetzky
The Sackler School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
[email protected]

Abstract— This article discusses the quality assessment of binary images. The customary ground-truth-based methodology used in the literature is shown to be problematic due to its subjective nature. Several previously suggested alternatives are surveyed and are also found to be inadequate in certain scenarios. A new approach, quantifying the adherence of a binarization to its document image, is proposed and tested using six different measures of accuracy. The measures are evaluated experimentally on datasets from the DIBCO and H-DIBCO competitions, with respect to different kinds of binarization degradations.

Keywords— Ground truth, binarization, evaluation, quality measure.

[Figure 1. Standard binarization quality evaluation process. The document image is gray-scale, while the binarization and the ground truth are black and white images. The quality metric measures the adherence of the binarization to the ground truth.]

I. INTRODUCTION

The established methodology of document binarization assessment relies upon ground truth (GT) images (see the competitions [1-6]). This is motivated by the need for binarization quality criteria. A manually created GT image is presumed to be a close approximation of the binarization ideal. Consequently, the different binarized images are scored according to their adherence to the GT image. The entire evaluation process, depicted in Fig. 1, consists of the following stages:

Preliminary step: A black and white GT is created manually, based upon a gray-scale document image. This process is driven by human-operated tools (e.g. [7-10]).

Algorithms application: The same document image serves as an input for the various binarization algorithms, resulting in binary images (herein: binarizations).

Algorithms evaluation: These binarizations are judged against the GT, using quality assessment metrics (such as F-measure, pseudo F-measure, PSNR, Negative Rate Metric, Distance Reciprocal Distortion Metric and Misclassification Penalty Metric; see [1-6] for details).

Due to certain drawbacks in this methodology (detailed below), we present two alternative solutions. The first is an evaluation of the binarizations directly against the document image, avoiding the use of a GT altogether. The second is strengthening the existing methodology by assessing the GT quality prior to its usage. Both solutions rely on an identical mechanism and are therefore considered together.

[Figure 2. Proposed binarization quality evaluation process. The quality of a binarization or a ground truth is assessed by measuring its adherence to the document image.]

The main contribution of this article is the suggestion of several new measures, enabling the assessment of the accuracy of black and white images (binarizations or GTs) directly vs. the document image (see Fig. 2).
The rest of the paper is organized as follows: Section II deals with the pitfalls of the existing methodology. Section III gives a brief survey of previous attempts to solve or avoid these drawbacks. Section IV specifies our solution, while Section V deals with the experimental setting and results. Finally, Section VI summarizes the outcomes and proposes future research directions.

II. METHODOLOGICAL PITFALLS

Several papers deal with the deficiencies of the existing methodology. All of them emphasize the subjectivity and the inherent inconsistency of the GT creation process.

In [11], the variability of five binarization algorithms was compared to that of different manual GTs. Significant irregularities in the GTs of the same document were found. Surprisingly, the results revealed that the variance between the binarizations was smaller than the variance between the different GTs.

Article [12] deals with GTs of First Temple period Hebrew inscriptions, created by several experts. Their GTs were shown to be of markedly different quality.

Paper [13] performed binarization classifier training, based on three variants of GT. The performance of the classifiers varied significantly with respect to the underlying GT.

We therefore conclude that [11-13] demonstrate that the GT is inherently subjective, with large deviations between different human operators and creation techniques, influencing the performance of the algorithms "downstream". This problem was noted already in [14], where automatic systems were found to be more reliable than the human "ground truther".

III. EXISTING SOLUTIONS

The aforementioned methodological pitfalls were addressed by several articles in the past. This section gives a brief survey of these proposed solutions, which are found to be inadequate in certain scenarios.

Article [10] aims at presenting an objective evaluation methodology for document image binarization, performed in the following fashion:

Preliminary steps: A skeleton of the GT is created via the algorithms [15-16], and corrected manually. The document image edges are extracted by the Canny method [17].

Algorithms evaluation: The GT skeleton is dilated within each binarization, until 50% of the edges inside each connected component are covered. This results in a new, "evaluated GT".

This approach has several shortcomings. First, it includes a manual stage; according to our tests, the impact of this stage is not negligible. Second, the method constructs a different "evaluated GT" for each binarization. Therefore, every binarization is judged against its own GT, with no common ground for comparison.
Finally, no justification is given for preferring the proposed intricate scheme over the existing methodology. The similarity of the outcomes in [10] (as well as Occam's razor) suggests that the existing methodology should be favored.

A later article [18] attempted to improve upon [10], yet did not avoid the manually performed stages (e.g. "The user shall verify that at least one dilation marker exists within the borders of each ground truth component"; "the user shall close any edge disconnections", etc.).

Another approach, presented in [19], is an elaboration on the same theme. The main changes are dropping the manual correction phase and dilating with respect to the binarizations [20-22]. This avoids the creation of a different GT for each binarization, as well as the potential for human error. However, this approach merely creates another, albeit sophisticated, binarization procedure. Though this is certainly an "objective" way to handle the binarization evaluation, it in fact presupposes that the presented procedure creates the perfect binarization, which is not proved by the authors.

A different approach [23-24] is to avoid the GT creation step altogether. A clean, binary image of a document is marked as GT. This image is combined with any desired type of noise, in order to create a synthetic document image. The evaluated binarization algorithms are activated on the synthetic document image and are judged against the perfect GT. This elegant technique avoids the need for the creation of GT images. On the other hand, it cannot evaluate binarizations of already existing degraded documents. In addition, if no clean version of a given type of handwriting or typeface exists, or if the noise model cannot be adequately deduced, the method is also inapplicable.

Yet another, "goal-directed" approach [25] tries to avoid ground-truthing altogether. The results of different binarization techniques are used as inputs for other algorithms (e.g. OCR systems), whose outputs are the ones being evaluated. However, with any sufficiently complicated goal, the tuning of the parameters "downstream" may have a major influence on the outcomes. In certain cases (e.g. historical documents), the binarization may also be the desired end product, with no further processing required.

IV. SUGGESTED SOLUTION

The main contribution of this article is the proposal of several new metrics assessing either the binarization or the GT. A first step in that direction was undertaken in [12], where different GTs of the same historical inscription were compared. The technique superimposed the GTs over the document image, and the quality of the fit was used in order to rank the different GTs. A similar methodology can be used in order to evaluate the quality of either the binarization (bypassing the GT), or the GT itself (thereby adding a verification step to the existing scheme).

A. Preliminary Definitions

We assume:
1. A black and white image $BW(x,y)$, with $BW: [1,M] \times [1,N] \to \{0, 255\}$, which can be either a binarization or a GT, is superimposed over a gray-scale document image $D(x,y)$ of the same dimensions (if needed, a preliminary registration is performed, e.g. [12]).
2. A measure $m$, taking into account certain correspondences between $BW$ and $D$, is used in order to evaluate the quality of $BW$.

In the considered situation, the correspondence between the $BW$ and $D$ images defines the foreground and background sets of pixels, $F = \{(x,y) \mid BW(x,y) = 0\}$ and $B = \{(x,y) \mid BW(x,y) = 255\}$ respectively (with $\#F + \#B = MN$). The measure $m$ may take into account the properties of these two populations within $D$. We use the following notation:

$\mu_F$ and $\mu_B$ are the foreground and background means respectively, where
$$\mu_S = \frac{1}{\#S} \sum_{(x,y)\in S} D(x,y), \qquad S = F, B;$$

$\sigma_F$ and $\sigma_B$ are their respective standard deviations, defined in a similar fashion;

$n_F = \frac{\#F}{\#F + \#B}$ and $n_B = \frac{\#B}{\#F + \#B}$ are respectively the proportions of the foreground and the background pixels;

$$f_i = \frac{\#\{(x,y)\in F \mid D(x,y)=i\}}{\#F}, \qquad b_i = \frac{\#\{(x,y)\in B \mid D(x,y)=i\}}{\#B}, \qquad i = 0,\ldots,255,$$
are the empirical distributions (histograms) of $F$ and $B$.
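As an illustration, the following sketch computes the above statistics with NumPy. It assumes `bw` and `d` are 8-bit arrays of identical shape with foreground ink encoded as 0; the function name and the returned dictionary layout are ours, not part of the paper.

```python
import numpy as np

def fg_bg_statistics(bw, d):
    """Foreground/background statistics of D induced by BW (a sketch, not the authors' code)."""
    d = np.asarray(d, dtype=np.float64)
    fg = d[np.asarray(bw) == 0]        # gray levels of D over the foreground set F
    bg = d[np.asarray(bw) == 255]      # gray levels of D over the background set B
    stats = {
        "mu_F": fg.mean(), "mu_B": bg.mean(),              # class means
        "sigma_F": fg.std(), "sigma_B": bg.std(),          # class standard deviations
        "n_F": fg.size / d.size, "n_B": bg.size / d.size,  # class proportions
    }
    # empirical distributions (histograms) f_i and b_i, i = 0, ..., 255
    stats["f"] = np.bincount(fg.astype(np.uint8), minlength=256) / fg.size
    stats["b"] = np.bincount(bg.astype(np.uint8), minlength=256) / bg.size
    return stats
```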
B. Proposed Measures

We consider the following measures.

Adapted Otsu: Article [26] used a thresholding criterion minimizing the intra-class variance for background-foreground separation. A similar measure can be used in order to assess the intra-class variance, dropping the requirement of hard thresholding. Thus:
$$m_{Otsu} = n_F \sigma_F^2 + n_B \sigma_B^2. \qquad (1)$$
It is assumed that smaller values of $m_{Otsu}$ reflect a better quality of $BW$.

Adapted Kapur: Paper [27] used an entropy-based thresholding criterion for binarization, maximizing the sum of entropies of the background and foreground populations. Again, dropping the requirement for a threshold, we get:
$$m_{Kapur} = -\sum_{i=0}^{255} f_i \log(f_i) - \sum_{j=0}^{255} b_j \log(b_j), \qquad (2)$$
with $x\log(x)$ considered zero at $x = 0$. Our expectation is that larger values of $m_{Kapur}$ indicate a better $BW$.

Adapted Kittler-Illingworth (KI): The authors of [28] presumed normally distributed foreground and background pixel populations. The derived criterion function tries to reduce the classification error rate under this supposition. Again, we use a similar measure, with no hard thresholding:
$$m_{KI} = 1 + 2\big[ n_B \log(\sigma_B) + n_F \log(\sigma_F) \big] - 2\big[ n_B \log(n_B) + n_F \log(n_F) \big]. \qquad (3)$$
Our expectation is that smaller $m_{KI}$ values reflect a better $BW$.

CMI: Paper [12] deals with the quality assessment of GTs of historical inscriptions. As such, this is not an adapted method, but a measure developed directly in order to handle similar tasks (also see [29-30] for additional usages):
$$m_{CMI} = \mu_B - \mu_F. \qquad (4)$$
Larger values of $m_{CMI}$ should point to a better $BW$.

Potential Contrast (PC): The concept of "Potential Contrast" was presented in [31], for the purpose of the quality assessment of multispectral images. The rationale behind this measure is an optimization of $m_{CMI}$ under all possible gray-level transformations of the document image. It can be shown that this is achieved by:
$$m_{PC} = 255 \cdot \sum_{i:\, f_i \le b_i} (b_i - f_i). \qquad (5)$$
As in the case of $m_{CMI}$, it is assumed that a better $BW$ is indicated by a larger $m_{PC}$.

Remark 1: Some of the abovementioned measures are adaptations of global binarization techniques. Indeed, assessing a binarization by "looking back" at the document image can be considered a dual problem to the task of arriving at the binarization itself.

Remark 2: As seen above, different approaches prefer either small or large measure values. For the sake of consistency, in the experimental section below we negate the Otsu and KI measures. Thus, it is assumed that a better $BW$ always corresponds to a higher value of a given measure.

Additional "classical" measures for image (or matrix) comparison can also be utilized for our purpose, in particular the L1, L2 and PSNR measures.

L1: Defined by:
$$m_{L1} = \sum_{(x,y)} \big| D(x,y) - BW(x,y) \big|. \qquad (6)$$

L2: Defined by:
$$m_{L2} = \sqrt{\sum_{(x,y)} \big( D(x,y) - BW(x,y) \big)^2}. \qquad (7)$$
Again, consistency-wise, these two measures ought to be negated.

PSNR (used in [1-6] vs. the GT): Defined by:
$$m_{PSNR} = 10 \cdot \log_{10}\!\left( 255^2 \Big/ \frac{m_{L2}^2}{MN} \right). \qquad (8)$$
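A minimal sketch of the six measures (Eqs. 1-8), built on the `fg_bg_statistics` helper above, is given below. The function name, the sign conventions applied at the end (negating Otsu and KI per Remark 2), and the lack of handling for degenerate cases (e.g. an empty foreground) are our own simplifications, not the authors' implementation.

```python
import numpy as np

def quality_measures(bw, d):
    """Adherence of a binarization/GT bw to the document image d (Eqs. 1-8); a sketch."""
    s = fg_bg_statistics(bw, d)
    f, b = s["f"], s["b"]

    # (1) adapted Otsu: weighted intra-class variance
    m_otsu = s["n_F"] * s["sigma_F"] ** 2 + s["n_B"] * s["sigma_B"] ** 2

    # (2) adapted Kapur: sum of class entropies, with 0*log(0) treated as 0
    with np.errstate(divide="ignore", invalid="ignore"):
        m_kapur = -(np.nansum(f * np.log(f)) + np.nansum(b * np.log(b)))

    # (3) adapted Kittler-Illingworth criterion
    m_ki = (1 + 2 * (s["n_B"] * np.log(s["sigma_B"]) + s["n_F"] * np.log(s["sigma_F"]))
              - 2 * (s["n_B"] * np.log(s["n_B"]) + s["n_F"] * np.log(s["n_F"])))

    # (4) CMI: contrast between the background and foreground means
    m_cmi = s["mu_B"] - s["mu_F"]

    # (5) Potential Contrast: CMI optimized over gray-level transformations
    keep = f <= b
    m_pc = 255.0 * np.sum(b[keep] - f[keep])

    # (8) PSNR; by Propositions 1 and 2 below it ranks like the negated L1/L2 of (6)-(7)
    mse = np.mean((np.asarray(d, float) - np.asarray(bw, float)) ** 2)
    m_psnr = 10.0 * np.log10(255.0 ** 2 / mse)

    # negate Otsu and KI (Remark 2) so that larger always indicates a better bw
    return {"Otsu": -m_otsu, "Kapur": m_kapur, "KI": -m_ki,
            "CMI": m_cmi, "PC": m_pc, "PSNR": m_psnr}
```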
Definition: Two given measures $m_1$ and $m_2$ are denoted as equivalent, $m_1 \sim m_2$, if for a constant $D$ and different $BW$ and $BW^*$ the monotonicity is maintained jointly, i.e.:
$$m_1(BW, D) > m_1(BW^*, D) \iff m_2(BW, D) > m_2(BW^*, D). \qquad (9)$$

Proposition 1: The PSNR measure is equivalent to the negated L2, i.e. $m_{PSNR} \sim -m_{L2}$.

Proof: Indeed, due to the monotonicity of $Cx^2$ (for a constant $0 \ne C \in \mathbb{R}$), $\log_{10}$, $1/x$ and $\sqrt{x}$ (for $x \ge 0$):
$$m_{PSNR} = 10 \log_{10}\!\left( \frac{255^2 MN}{m_{L2}^2} \right) \sim \frac{255^2 MN}{m_{L2}^2} \sim -m_{L2}. \qquad (10)$$
∎

Proposition 2: If $BW(x,y) \in \{0, 255\}$ (as in our setting), then $m_{L1} \sim m_{L2}$.

Proof: The norms are influenced by the foreground and background populations induced by $BW$.
Indeed, on the one hand:
$$m_{L1} = \sum_{(x,y)} \big| D(x,y) - BW(x,y) \big| = \sum_{(x,y)\in F} D(x,y) + \sum_{(x,y)\in B} \big( 255 - D(x,y) \big). \qquad (11)$$
Subtracting a constant (the sum over the unvarying $D(x,y)$) results in an equivalent measure, therefore:
$$m_{L1} \sim \sum_{F} D(x,y) + \sum_{B} \big( 255 - D(x,y) \big) - \sum_{(x,y)} D(x,y) = \sum_{B} \big( 255 - 2D(x,y) \big). \qquad (12)$$
On the other hand:
$$m_{L2} = \sqrt{\sum_{(x,y)} \big( D(x,y) - BW(x,y) \big)^2} \sim \sum_{(x,y)} \big( D(x,y) - BW(x,y) \big)^2, \qquad (13)$$
and moreover:
$$\sum_{(x,y)} \big( D(x,y) - BW(x,y) \big)^2 = \sum_{(x,y)\in F} D(x,y)^2 + \sum_{(x,y)\in B} \big( 255 - D(x,y) \big)^2 = \sum_{(x,y)\in F \cup B} D(x,y)^2 + 255 \sum_{B} \big( 255 - 2D(x,y) \big). \qquad (14)$$
Since the first term is constant, and since a multiplicative non-zero constant results in an equivalent measure, we get:
$$m_{L2} \sim \sum_{B} \big( 255 - 2D(x,y) \big). \qquad (15)$$
∎

From Propositions 1 and 2 it follows that, despite the seeming dissimilarity of the last three measures, they in fact yield the same ranking of the binarizations. Therefore, in what follows, we use only the $m_{PSNR}$ measure.

V. EXPERIMENTAL SETTING AND RESULTS

This section compares the performance of the six quality measures described above. We begin with the experimental setting and continue with the results.

A. Experimental Setting

Goal: The goal of this experiment is to compare the performance of the measures under controlled deterioration of high-quality binarizations of various documents. We require the measures to maintain a monotonic decrease with respect to the increasing worsening of the binarizations. This may be seen as an "axiomatic" (and certainly reasonable) requirement for the measures. We stress that in this experiment, the elements under examination are the different measures, and not the binarizations.

Methodology: We tested the measures on purposely engineered binary images with gradually diminishing quality. For each document image, its corresponding high-quality binarization was used in order to obtain a sequence of progressively inferior black and white images. Three different types of deterioration were pursued (a sketch of such a procedure is given at the end of this subsection):
1. An addition of increasing levels of random salt and pepper (S&P) noise (1%, 2%, etc., stopping at 10%), imitating isolated artifacts of the binarization process (e.g. stains; see [29], [32] for examples and methods for their handling). In order to ensure the significance of the results, each noise level was added independently 25 times (thus 25 different binary images were created with 1% noise, 25 more with 2% noise, etc.).
2. A continuing morphological dilation of the foreground (4-connectivity; dilations of 1 up to 10 pixels), emulating a binarization algorithm prone to False Positive errors near the edges (e.g. due to a miscalculated threshold), or an operator with a preference for wide strokes creating the GT.
3. A continuing morphological erosion of the foreground (4-connectivity; erosions of 1 up to 3 pixels), mimicking a binarization algorithm prone to False Negative errors near the edges (e.g. due to a miscalculated threshold), or an operator with a preference for narrow strokes creating the GT.
As already stated, our expectation was a constantly declining score with the continuing deterioration of the engineered binarizations.

Dataset: Heterogeneous and openly available data from several past binarization competitions were used, in particular DIBCO 2009 [1] (5 handwritten and 5 printed documents), H-DIBCO 2010 [2] (10 handwritten documents), DIBCO 2011 [3] (8 handwritten and 8 printed documents), H-DIBCO 2012 [4] (14 handwritten documents), DIBCO 2013 [5] (8 handwritten and 8 printed documents), and H-DIBCO 2014 [6] (10 handwritten documents); a total of 76 documents. As the measures require a gray-scale document image, RGB document images were converted to gray-scale by channel averaging. Within the datasets, each document image is accompanied by its corresponding GT. The GTs were taken as a high-quality basis for our deterioration procedures, resulting in 2064 different binarizations tested.

Success criterion (for each image, each type of deterioration and each measure): Monotonic decrease of the score sequence (e.g., maximal score for the original binary image, the next for 1% S&P noise, etc.). A non-observance of correct monotonic behavior between two consecutive deteriorated binarizations (e.g. the score increasing between 3% and 4% of S&P noise) was counted as a "break of monotonicity".

Note: The abovementioned setting ensures the significance and the reproducibility of our results.
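The following sketch illustrates the deterioration procedures and the break-of-monotonicity count under our own assumptions: SciPy's ndimage is used for the 4-connected morphology, S&P noise is modeled as independent per-pixel flips of the given fraction, and `deteriorate`/`monotonicity_breaks`, as well as the reuse of `quality_measures` from the earlier sketch, are illustrative names rather than the authors' code.

```python
import numpy as np
from scipy import ndimage

CROSS = ndimage.generate_binary_structure(2, 1)    # 4-connectivity structuring element

def deteriorate(bw, kind, level, seed=0):
    """Engineer one degraded version of a clean binarization bw (values in {0, 255})."""
    ink = np.asarray(bw) == 0                       # boolean foreground mask
    if kind == "sp":                                # flip a `level` fraction of pixels (S&P noise)
        flip = np.random.default_rng(seed).random(ink.shape) < level
        ink = np.where(flip, ~ink, ink)
    elif kind == "dilate":                          # widen strokes by `level` pixels
        ink = ndimage.binary_dilation(ink, CROSS, iterations=level)
    elif kind == "erode":                           # thin strokes by `level` pixels
        ink = ndimage.binary_erosion(ink, CROSS, iterations=level)
    return np.where(ink, 0, 255).astype(np.uint8)

def monotonicity_breaks(scores):
    """Count consecutive pairs where the score fails to decrease (the success criterion)."""
    return sum(1 for prev, cur in zip(scores, scores[1:]) if cur >= prev)

# Example: the clean GT followed by 1-3 pixel erosions, scored against the document image d
# seq = [gt] + [deteriorate(gt, "erode", k) for k in (1, 2, 3)]
# print(monotonicity_breaks([quality_measures(img, d)["PSNR"] for img in seq]))
```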
B. Experimental Results

A summary of the results for the different types of deterioration is presented in Tables I, II and III.

TABLE I. RESULTS FOR SALT AND PEPPER DETERIORATION (% OF BREAKS OF MONOTONICITY)

Dataset (a)      #Files   Otsu    Kapur   KI      CMI     PC      PSNR
DIBCO2009 H      5        0%      26%     0%      0%      0%      0%
DIBCO2009 P      5        0%      82%     0%      0%      0%      0%
H-DIBCO2010 H    10       0%      22%     0%      0%      0%      0%
DIBCO2011 H      8        0%      41%     0%      0%      0%      13%
DIBCO2011 P      8        0%      71%     0%      0%      0%      13%
H-DIBCO2012 H    14       0%      30%     0%      0%      0%      0%
DIBCO2013 H      8        0%      26%     0%      0%      0%      0%
DIBCO2013 P      8        0%      80%     0%      0%      0%      0%
H-DIBCO2014 H    10       0%      37%     0%      0%      0%      0%
Mean                      0%      43.4%   0%      0%      0%      2.6%

(a) H = Handwritten, P = Printed.

Table I presents the results of the S&P noising experiment. It can be seen that the Otsu, KI, CMI and PC measures perform perfectly in this setting, with 0% ordering mistakes in all the sequences. The PSNR measure also behaves well in most cases; unfortunately, it shows 2.6% monotonicity breaks. On in-depth inspection, these cases correlate with the existence of bright stripes across the document. In such cases, the PSNR (and consequently the equivalent L1 and L2 measures) might "prefer" a presence of foreground pixels mistaken for background, which may indeed happen with this type of noise. Finally, the Kapur measure (with 43.4% mistakes) is unreliable in this experiment. Moreover, we do not consider this measure well-founded, as it ignores the gray-level values altogether (a permutation of the histogram results in the same score).

TABLE II. RESULTS FOR DILATION OF THE FOREGROUND (% OF BREAKS OF MONOTONICITY)

Dataset          #Files   Otsu    Kapur   KI      CMI     PC      PSNR
DIBCO2009 H      5        24%     26%     4%      0%      0%      0%
DIBCO2009 P      5        0%      20%     2%      0%      0%      0%
H-DIBCO2010 H    10       0%      12%     6%      0%      0%      0%
DIBCO2011 H      8        0%      20%     1%      0%      0%      13%
DIBCO2011 P      8        0%      29%     0%      0%      0%      15%
H-DIBCO2012 H    14       0%      19%     6%      0%      0%      0%
DIBCO2013 H      8        0%      20%     3%      0%      0%      0%
DIBCO2013 P      8        0%      25%     0%      0%      0%      0%
H-DIBCO2014 H    10       0%      11%     4%      0%      0%      0%
Mean                      1.6%    19.5%   3.2%    0%      0%      2.9%

Table II shows the results of the morphological dilation experiment. The CMI and PC measures still perform perfectly, with 0% mistakes. Otsu (1.6% breaks of monotonicity, all in a single dataset), PSNR (2.9% mistakes) and KI (3.2% mistakes) also exhibit good performance. A close examination shows that all the Otsu mistakes are attributed to the presence of dark stains covering a large part of the document. In such a case, the Otsu metric may "prefer" a relocation of some $B$ pixels to $F$, in order to reduce the variance $\sigma_B^2$. As before, the Kapur metric does not show reliable behavior.

TABLE III. RESULTS FOR EROSION OF THE FOREGROUND (% OF BREAKS OF MONOTONICITY)

Dataset          #Files   Otsu    Kapur   KI      CMI     PC      PSNR
DIBCO2009 H      5        0%      7%      20%     100%    60%     7%
DIBCO2009 P      5        0%      7%      0%      73%     20%     0%
H-DIBCO2010 H    10       0%      37%     0%      80%     47%     47%
DIBCO2011 H      8        0%      13%     21%     88%     71%     4%
DIBCO2011 P      8        0%      4%      0%      75%     46%     13%
H-DIBCO2012 H    14       0%      31%     7%      71%     50%     24%
DIBCO2013 H      8        4%      25%     0%      75%     46%     21%
DIBCO2013 P      8        0%      17%     21%     75%     46%     25%
H-DIBCO2014 H    10       0%      20%     0%      70%     37%     37%
Mean                      0.4%    20%     7%      77%     47%     22%

Table III documents a relatively small-scale morphological erosion experiment, limited to 3 erosions (as 4 erosions would result in a complete elimination of the foreground in some binary images). The almost perfectly performing Otsu measure is followed by KI, with 7% mistakes. Most of KI's mistakes were made at the 1-pixel erosion stage, surely within the limits of the original GTs' reliability. The Kapur, PSNR, and particularly the PC and CMI measures were confused by this setting. It is noticeable that the CMI and the PC measures do not take into account the information regarding the sizes of $F$ and $B$. Subsequently, a preference for "thinning" the characters (limiting the foreground to only the most certain "skeleton" pixels, with only a minor penalty to the background statistics) might be observed in these measures.

VI. SUMMARY AND FUTURE DIRECTIONS

We presented several measures which quantify the adherence of a binary image to its gray-scale document image. The binary document can either be a GT or a product of a binarization algorithm; both cases are treated in the same fashion. In order to check the adequacy of the proposed measures, an experimental framework was constructed, utilizing clean binary documents with specifically engineered, increasing deterioration of the binarization. The results indicate that the adapted Otsu and KI measures present the best overall performance for binarization evaluation purposes. The PSNR, PC and CMI measures can probably be useful in scenarios with adequate stroke width. The adapted Kapur measure is not a viable option for a quality measure.

The measures used in this article are of a global nature. Other such measures can be adapted from surveys such as [33]. Additionally, various measures operating on a local (i.e. "moving window") level can also be considered. Another research direction is the elimination of the reliance not only on the GT, but also on the document image itself. This may be possible utilizing the intrinsic properties of the binarization. Such a proposal is hinted at by [34] (where it is performed manually) and [35].

ACKNOWLEDGMENT

The research reported here received initial funding from the Israel Science Foundation – F.I.R.S.T. (Bikura) Individual Grant no. 644/08, as well as the Israel Science Foundation Grant no. 1457/13. The research was also funded by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. 229418, and by an Early Israel grant (New Horizons project), Tel Aviv University. This study was also supported by a generous donation of Mr. Jacques Chahine, made through the French Friends of Tel Aviv University. Arie Shaus is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship. We would also like to acknowledge the valuable help of Shirly Ben-Dor Evian, Sivan Einhorn, Shira Faigenbaum-Golovin, Israel Finkelstein and David Levin.
REFERENCES

[1] B. Gatos, K. Ntirogiannis and I. Pratikakis, "ICDAR 2009 document image binarization contest (DIBCO 2009)," in Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR 2009), pp. 1375-1382, 2009.
[2] I. Pratikakis, B. Gatos and K. Ntirogiannis, "H-DIBCO 2010 – Handwritten document image binarization competition," in Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition (ICFHR 2010), pp. 727-732, 2010.
[3] I. Pratikakis, B. Gatos and K. Ntirogiannis, "ICDAR 2011 document image binarization contest (DIBCO 2011)," in Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR 2011), pp. 1506-1510, 2011.
[4] I. Pratikakis, B. Gatos and K. Ntirogiannis, "H-DIBCO 2012 – Handwritten document image binarization competition," in Proceedings of the 13th International Conference on Frontiers in Handwriting Recognition (ICFHR 2012), pp. 817-822, 2012.
[5] I. Pratikakis, B. Gatos and K. Ntirogiannis, "ICDAR 2013 document image binarization contest (DIBCO 2013)," in Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR 2013), pp. 1471-1476, 2013.
[6] K. Ntirogiannis, B. Gatos and I. Pratikakis, "H-DIBCO 2014 – Handwritten document image binarization competition," in Proceedings of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR 2014), pp. 809-813, 2014.
[7] E. Saund, J. Lind and P. Sarkar, "PixLabeler: User interface for pixel-level labeling of elements in document images," in Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR 2009), pp. 646-650, 2009.
[8] A. Fischer, E. Indermühle, H. Bunke, G. Viehhauser and M. Stolz, "Ground truth creation for handwriting recognition in historical documents," in Proceedings of the 9th IAPR Workshop on Document Analysis Systems (DAS 2010), pp. 3-10, 2010.
[9] C. Clausner, S. Pletschacher and A. Antonacopoulos, "Aletheia - An advanced document layout and text ground-truthing system for production environments," in Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR 2011), pp. 48-52, 2011.
[10] K. Ntirogiannis, B. Gatos and I. Pratikakis, "An objective evaluation methodology for document image binarization techniques," in Proceedings of the 8th IAPR Workshop on Document Analysis Systems (DAS 2008), pp. 217-224, 2008.
[11] E. H. Barney Smith, "An analysis of binarization ground truthing," in Proceedings of the 9th IAPR Workshop on Document Analysis Systems (DAS 2010), pp. 27-33, 2010.
[12] A. Shaus, E. Turkel and E. Piasetzky, "Quality evaluation of facsimiles of Hebrew First Temple period inscriptions," in Proceedings of the 10th IAPR Workshop on Document Analysis Systems (DAS 2012), pp. 170-174, 2012.
[13] E. H. Barney Smith and C. An, "Effect of 'Ground Truth' on image binarization," in Proceedings of the 10th IAPR Workshop on Document Analysis Systems (DAS 2012), pp. 250-254, 2012.
[14] R. M. Brown, T. H. Fay and C. U. Walker, "Handprinted symbol recognition system," Pattern Recognition 21, No. 2, pp. 91-118, 1988.
[15] M. Kamel and A. Zhao, "Extraction of binary character/graphics images from grayscale document images," CVGIP: Computer Vision Graphics and Image Processing 55, No. 3, pp. 203-217, 1993.
[16] H. J. Lee and B. Chen, "Recognition of handwritten Chinese characters via short line segments," Pattern Recognition 25, No. 5, pp. 543-552, 1992.
[17] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence 8, No. 6, pp. 679-698, 1986.
[18] K. Ntirogiannis, B. Gatos and I. Pratikakis, "Performance evaluation methodology for historical document image binarization," IEEE Transactions on Image Processing 22, No. 2, pp. 595-609, 2012.
[19] I. Ben Messaoud, H. El Abed, H. Amiri and V. Märgner, "A design of a preprocessing framework for large database of historical documents," in Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, pp. 177-183, 2011.
[20] J. Sauvola and M. Pietikainen, "Adaptive document image binarization," Pattern Recognition 33, No. 2, pp. 225-236, 2000.
[21] B. Gatos, I. Pratikakis and S. Perantonis, "Adaptive degraded document image binarization," Pattern Recognition 39, pp. 317-327, 2006.
[22] S. Lu, B. Su and C. L. Tan, "Document image binarization using background estimation and stroke edges," International Journal on Document Analysis and Recognition 13, No. 4, pp. 303-314, 2010.
[23] P. Stathis, E. Kavallieratou and N. Papamarkos, "An evaluation technique for binarization algorithms," Journal of Universal Computer Science 14, No. 18, pp. 3011-3030, 2009.
[24] R. Paredes and E. Kavallieratou, "ICFHR 2010 contest: Quantitative evaluation of binarization algorithms," in Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition (ICFHR 2010), pp. 733-736, 2010.
[25] Ø. D. Trier and A. K. Jain, "Goal-directed evaluation of binarization methods," IEEE Transactions on Pattern Analysis and Machine Intelligence 17, No. 12, pp. 1191-1201, 1995.
[26] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man and Cybernetics 9, No. 1, pp. 62-66, 1979.
[27] J. N. Kapur, P. K. Sahoo and A. K. C. Wong, "A new method for gray-level picture thresholding using the entropy of the histogram," Computer Vision, Graphics, and Image Processing 29, No. 3, pp. 273-285, 1985.
[28] J. Kittler and J. Illingworth, "Minimum error thresholding," Pattern Recognition 19, No. 1, pp. 41-47, 1986.
[29] A. Shaus, E. Turkel and E. Piasetzky, "Binarization of First Temple period inscriptions - Performance of existing algorithms and a new registration based scheme," in Proceedings of the 13th International Conference on Frontiers in Handwriting Recognition (ICFHR 2012), pp. 645-650, 2012.
[30] A. Shaus, I. Finkelstein and E. Piasetzky, "Bypassing the eye of the beholder: Automated ostraca facsimile evaluation," Maarav 17.1, pp. 7-20, 2010.
[31] S. Faigenbaum, B. Sober, A. Shaus, M. Moinester, E. Piasetzky, G. Bearman, M. Cordonsky and I. Finkelstein, "Multispectral images of ostraca: Acquisition and analysis," Journal of Archaeological Science 39, No. 12, pp. 3581-3590, 2012.
[32] A. Shaus, B. Sober, E. Turkel and E. Piasetzky, "Improving binarization via sparse methods," in Proceedings of the 16th International Graphonomics Society Conference (IGS 2013), pp. 163-166, 2013.
[33] M. Athimethphat, "A review on global binarization algorithms for degraded document images," Assumption University Journal of Technology 14, No. 3, pp. 188-195, 2011.
[34] Ø. D. Trier and T. Taxt, "Evaluation of binarization methods for document images," IEEE Transactions on Pattern Analysis and Machine Intelligence 17, No. 3, 1995.
[35] S. Faigenbaum, A. Shaus, B. Sober, E. Turkel and E. Piasetzky, "Evaluating glyph binarizations based on their properties," in Proceedings of the 13th ACM Symposium on Document Engineering (DocEng 2013), pp. 127-130, 2013.