Text Recognition in Natural Images Using Multiclass Hough Forests

Gökhan Yildirim, Radhakrishna Achanta, Sabine Süsstrunk

8th International Conference on Computer Vision Theory and Applications (VISAPP), Barcelona, Spain, February 21-24, 2013

Abstract

Text detection and recognition in natural images are popular yet unsolved problems in computer vision. Several feature selection and classification techniques have been used to model and solve these two related problems. In this paper, we propose a technique that attempts to solve both problems in a unified manner. We modify an object detection framework called Hough Forests by introducing “Cross-Scale Binary Features”, which compare the information in the same image patch at different scales. We use this modified technique to produce likelihood maps for every text character. These maps and a word-formation cost function are used to detect and recognize the text in natural images. We test our technique on the Street View House Numbers (SVHN) and ICDAR 2003 datasets. On the SVHN dataset, our algorithm outperforms recent methods and achieves comparable performance while using fewer training samples. We also exceed the state-of-the-art word recognition performance on the ICDAR 2003 dataset by 4%.


Hough Forests

The generalized Hough transform is a technique for finding the position and parameters of arbitrarily shaped objects using local information in a voting scheme. In this paper, we use multiclass Hough forests (Gall et al., 2011), an efficient method for computing the generalized Hough transform for multiple object types by adapting random forests.
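The voting scheme can be sketched as follows. In this hypothetical stand-in for a trained forest, each image patch is assumed to have reached a leaf that stores displacement vectors toward candidate character centers and a class weight; the votes are accumulated into a per-class likelihood map:

```python
import numpy as np

def hough_vote_map(patch_votes, image_shape):
    """Accumulate per-patch votes into a likelihood map for one class.

    patch_votes: list of (x, y, offsets, weight) tuples, where `offsets`
    is a list of (dx, dy) displacements toward the object center stored
    at the leaf the patch reached, and `weight` is the leaf's class
    probability. (Illustrative placeholders, not the trained forest.)
    """
    h, w = image_shape
    vote_map = np.zeros((h, w), dtype=np.float64)
    for x, y, offsets, weight in patch_votes:
        for dx, dy in offsets:
            cx, cy = x + dx, y + dy
            if 0 <= cx < w and 0 <= cy < h:
                # each stored offset casts an equal share of the leaf weight
                vote_map[cy, cx] += weight / len(offsets)
    return vote_map
```

Maxima of such a map indicate likely character centers, one map per character class.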


Figure 1: Generalized Hough transform of the lowercase character “a” using Hough forests (image taken from the ICDAR 2003 dataset)

 

Cross-Scale Binary Features

In our work, we modify the binary features of (Gall et al., 2011) by allowing comparisons between two scales. Each feature effectively compares the mean intensities of two randomly positioned rectangles with random dimensions.
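A minimal sketch of such a binary test, assuming for simplicity that the second rectangle is sampled in a 2x-downscaled copy of the patch (the rectangle coordinates and the fixed scale pair are illustrative):

```python
import numpy as np

def downscale2x(img):
    """Downscale a grayscale patch by 2 via block averaging."""
    h2, w2 = img.shape[0] // 2, img.shape[1] // 2
    return img[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).mean(axis=(1, 3))

def cross_scale_feature(patch, r1, r2, tau=0.0):
    """Binary test: compare the mean intensity of rectangle r1 in the
    original patch against rectangle r2 in a downscaled copy.
    Rectangles are (x, y, width, height); tau is a decision threshold."""
    small = downscale2x(patch)
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    m1 = patch[y1:y1 + h1, x1:x1 + w1].mean()
    m2 = small[y2:y2 + h2, x2:x2 + w2].mean()
    return int(m1 - m2 > tau)
```

In a forest, the rectangles, scales, and threshold of each test would be sampled at random during training, and the test that best splits the training patches is kept at each node.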

 


Figure 2: A representation of a cross-scale binary feature; the feature value is the outcome of the average-intensity comparison between the two gray rectangles

Word Formation

Letters can resemble each other either locally or globally. For example, the upper part of the letter “P” could be detected as a letter “D” at a smaller scale. Depending on the font style, the letter “W” can be confused with two successive “V”s or “U”s, and vice versa. Therefore, instead of recognizing characters individually, we use a pictorial structure model and produce lexicon priors with the help of an English word list.
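The word-formation search can be sketched as a small dynamic program: each character of a candidate word contributes its likelihood at some horizontal position, and a left-to-right ordering constraint with a minimum gap links successive characters. The per-column scores below are hypothetical, and the paper's actual cost additionally uses lexicon priors:

```python
import numpy as np

def best_word_score(col_scores, min_gap=2):
    """col_scores: one 1-D array per character of the word, giving that
    character's likelihood at each image column (from its likelihood map).
    Returns the maximal total score subject to x_{i+1} >= x_i + min_gap."""
    n_cols = len(col_scores[0])
    # prev[x] = best score for characters seen so far, last one at column x
    prev = col_scores[0].copy()
    for scores in col_scores[1:]:
        # running maximum of prev enforces the left-to-right ordering
        run_max = np.maximum.accumulate(prev)
        cur = np.full(n_cols, -np.inf)
        for x in range(min_gap, n_cols):
            cur[x] = scores[x] + run_max[x - min_gap]
        prev = cur
    return prev.max()
```

Scoring every lexicon word this way and keeping the best-scoring one resolves ambiguities such as “W” versus “VV” at the word level rather than per character.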

Figure 3: From left to right: word image, likelihood maps for “W”, “A”, and “Y”. The first row shows the beginning of the word search and the second row shows the minimized path between likelihood maps.

Training Dataset

We create a training dataset using a subset of 200 computer fonts under random affine transformations. In addition, we place other characters around the main character according to their probabilities of occurrence in English words. Finally, we blend these images with random non-text backgrounds to simulate characters in a natural scene. The MATLAB code and other files needed to generate the dataset can be downloaded here.
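Two of the geometric steps above, warping a glyph under a random affine transform and blending it with a background crop, can be sketched in pure NumPy (nearest-neighbour resampling; the rotation and shear ranges are illustrative, and the actual font rendering is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_affine(glyph, max_rot=0.2, max_shear=0.1):
    """Warp a glyph mask with a random rotation + shear, using inverse
    nearest-neighbour mapping about the patch center."""
    h, w = glyph.shape
    a = rng.uniform(-max_rot, max_rot)   # rotation angle (radians)
    sh = rng.uniform(-max_shear, max_shear)
    A = np.array([[np.cos(a), -np.sin(a) + sh],
                  [np.sin(a),  np.cos(a)]])
    Ainv = np.linalg.inv(A)
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs - cx, ys - cy])          # 2 x h x w, centered
    src = Ainv @ coords.reshape(2, -1)             # inverse-map each pixel
    sx = np.rint(src[0] + cx).astype(int)
    sy = np.rint(src[1] + cy).astype(int)
    ok = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=glyph.dtype)
    out[ok] = glyph[sy[ok], sx[ok]]
    return out.reshape(h, w)

def blend_with_background(glyph, background, fg=1.0):
    """Alpha-blend a glyph mask (values in [0, 1]) over a background crop."""
    return glyph * fg + (1 - glyph) * background
```

The released MATLAB code performs the full pipeline (font rendering, neighbour placement, and blending); this fragment only illustrates the warp-and-blend idea.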

 

Results

We also test our algorithm on cropped words from the ICDAR 2003 database. As in (Wang et al., 2011) and (Mishra et al., 2012), we ignore words with non-alphanumeric characters and words shorter than three letters, giving us a total of 827 words. Note that the proper nouns and brand names that appear in the dataset are also in our search space.

Method          ICDAR 2003 (%)   Time
Hough Forest    85.7             3 minutes
Mishra et al.   81.78
Wang et al.     76               15 seconds

Table 1: Cropped-word recognition results (in %) on the ICDAR 2003 database.

References

Gall, J., Yao, A., Razavi, N., Gool, L. V., and Lempitsky, V. (2011). Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2188–2202.

Mishra, A., Alahari, K., and Jawahar, C. V. (2012). Top-down and bottom-up cues for scene text recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2687–2694.

Wang, K., Babenko, B., and Belongie, S. (2011). End-to-end scene text recognition. In Proc. of the International Conference on Computer Vision, pages 1457–1464.