Automatic Handwritten Text Detection and Classification
Information
Author: Olle Dahlstedt
Expected completion: 2021-06
Supervisor: Anders Hast
Supervisor's company/institution: Uppsala universitet
Subject reviewer: Anders Brun
Presentation
Presenter: Olle Dahlstedt
Presentation time: 2021-09-01 18:00
Opponent: Jonas Wikström
As more and more organizations digitize their records, the need for automatic document processing software increases. In particular, the rise of ‘digital humanities’ brings with it a new set of problems concerning how to digitize historical archival material efficiently and accurately. The transcription of archival material, such as handwritten spreadsheets, to formats fit for research purposes is still expensive and plagued by tedious manual labor. Over the decades, research in handwritten text recognition has focused on text line extraction and recognition. In this thesis, we examine document images that contain complex details, more categories of text than handwriting alone, and handwritten text that is not easily separated into lines.
This thesis examines the sub-problem of handwritten text segmentation in detail. We propose a broad definition of text segmentation that requires both text detection and text classification, since this enables us to detect multiple kinds of text within the same image. The aim is to design a system that can detect and identify both handwriting and machine-text within the same image. Working with photographs of spreadsheet documents from the years 1871-1951, a top-down, layout-agnostic image processing pipeline is developed. Different kinds of preprocessing are examined, both to correct illumination and enhance contrast before binarization and to detect and clear line contours. To achieve text region detection, we evaluate connected components labeling and MSER as region detectors, extracting textual and non-textual sub-images. On the detected sub-images, we perform a Bag-of-Visual-Words quantization of k-means clustered feature descriptor vectors and perform categorical classification by training a Naïve Bayes classifier on the feature distances to the cluster centroids.
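To illustrate the region detection step, the following is a minimal sketch of connected components labeling on an already binarized image, written in plain Python. It is not the thesis implementation: the function names `label_components` and `crop`, the choice of 8-connectivity, and the list-of-rows image format are assumptions made for this example only.

```python
from collections import deque


def label_components(binary):
    """Label 8-connected foreground regions of a binary image.

    `binary` is a list of rows holding 0 (background) or 1 (foreground).
    Returns (labels, boxes): `labels` has the same shape as `binary`
    (0 = background), and `boxes` maps each label to an inclusive
    bounding box (top, left, bottom, right).
    """
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    boxes = {}
    next_label = 0

    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels that already have a label.
            if not binary[r][c] or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            # Breadth-first flood fill over the 8-neighbourhood.
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            boxes[next_label] = (top, left, bottom, right)
    return labels, boxes


def crop(image, box):
    """Cut the sub-image covered by an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

Each bounding box returned by `label_components` can be passed to `crop` to obtain a candidate sub-image, which a later stage would then classify as handwriting, machine-text, or non-text. In practice, an image processing library would typically be used for this step instead of a pure Python loop.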
Results include a novel two-stage illumination correction and contrast enhancement algorithm that improves document quality as a precursor to binarization, increasing the mean grayscale values of an image while retaining low grayscale variance. Region detectors are evaluated on images with different types of preprocessing and the results show that clearing document outlines influences text region detection. Training on a small sample of sub-images, the categorical classification model proves viable for discrimination between machine-text and handwriting, enabling the use of this model for further recognition purposes.