A key step in handwriting recognition is text line extraction, which aims to extract individual text lines from the text regions of a manuscript page. In this work, we propose a novel text line extraction algorithm for color manuscript pages that requires no prior binarization. Our algorithm uses seam carving to compute separating seams between text lines. Unconstrained seam carving, however, tends to produce seams that cut through the gaps between multiple text lines whenever those gaps are the lowest-energy regions of the neighboring image space. We therefore constrain the seam carving computation to the region between two consecutive text lines, which lets us generate seams that correctly separate them.
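To illustrate the idea of constraining a seam to the region between two consecutive text lines, here is a minimal sketch in Python. It is not the released MATLAB code: the function name `constrained_seam`, the per-column bounds `upper` and `lower` (standing in for the row positions of the two neighboring text lines), and the toy energy map are all assumptions made for illustration. The sketch uses the standard seam carving dynamic program, restricted to the rows allowed in each column.

```python
from typing import List, Sequence

INF = float("inf")


def constrained_seam(
    energy: Sequence[Sequence[float]],
    upper: Sequence[int],
    lower: Sequence[int],
) -> List[int]:
    """Minimum-energy left-to-right seam with upper[c] <= row <= lower[c].

    energy is a row-major grid; upper/lower are hypothetical per-column row
    bounds (e.g. the positions of the text lines above and below the gap).
    Returns one row index per column.
    """
    n_rows, n_cols = len(energy), len(energy[0])
    cost = [[INF] * n_cols for _ in range(n_rows)]
    back = [[-1] * n_cols for _ in range(n_rows)]

    # Only rows inside the allowed band may start a seam.
    for r in range(upper[0], lower[0] + 1):
        cost[r][0] = energy[r][0]

    # Standard seam carving recurrence, evaluated only inside the band.
    for c in range(1, n_cols):
        for r in range(upper[c], lower[c] + 1):
            for pr in (r - 1, r, r + 1):  # 8-connected predecessor rows
                if 0 <= pr < n_rows:
                    cand = cost[pr][c - 1] + energy[r][c]
                    if cand < cost[r][c]:
                        cost[r][c] = cand
                        back[r][c] = pr

    last = n_cols - 1
    end = min(range(n_rows), key=lambda r: cost[r][last])
    if cost[end][last] == INF:
        raise ValueError("no seam fits inside the given bounds")

    # Follow the back-pointers from the cheapest end point.
    seam = [0] * n_cols
    seam[last] = end
    for c in range(last, 0, -1):
        seam[c - 1] = back[seam[c]][c]
    return seam


# Toy energy map: rows 0 and 4 are the cheapest gaps, row 2 is a gap
# between two "text lines" at rows 1 and 3.
energy = [
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [1, 1, 1, 1],
    [9, 9, 9, 9],
    [0, 0, 0, 0],
]
free = constrained_seam(energy, [0] * 4, [4] * 4)    # drifts to a cheaper gap
banded = constrained_seam(energy, [1] * 4, [3] * 4)  # stays between the lines
```

With bounds covering the whole image, the seam escapes to a cheaper gap elsewhere; with the band restricted to rows 1 to 3, it stays in the gap between the two lines, which is the behavior the constraint is meant to enforce.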
Datasets and Results
We evaluate our algorithm on the original manuscript pages of the work Aline by the Swiss-French writer Charles-Ferdinand Ramuz, which we obtained from the Bibliothèque Cantonale et Universitaire de Lausanne (BCU). For copyright reasons, the manuscript pages are not available online.
In order to show the applicability of our method to diverse manuscripts, we also apply our algorithm to the dataset of , which is organized in four collections and contains 215 manuscript pages in English, Spanish and Arabic. Furthermore, we apply our algorithm to a smaller dataset that was provided to us by the authors of . In all cases, we outperform the state-of-the-art algorithm of . See our paper [pdf] for more details.
The code of our algorithm, which reproduces the results in our paper, and the generated seams on the dataset of  are available below. For any questions regarding the code or the results, please contact Nikolaos Arvanitopoulos (firstname.lastname@example.org).
Paper (Infoscience link) [pdf]
MATLAB Code: text_line_extraction_code.zip
Seam results: seam_results.zip
Below we show a result of our algorithm. In red we show the seams that separate two consecutive text lines from each other. In blue, we show the seams that approximate the medial axis of each text line.