Offline Arabic Handwritten Character Recognizer Based on Feature Extraction and Support Vector Machine
Computers and Technology
Submitted By thahirsh
Offline handwritten Arabic character recognizer based on Feature extraction and Support vector machine
Assistant professor in MCA department
Sankara College of Science and Commerce,
ABSTRACT: Since the problem of Arabic text recognition is a large and complex one, it makes sense to try a simple method to see what performance can be achieved. The characters are written by many people using a great variety of sizes, writing styles, and instruments, and with widely varying amounts of care. Some of the characters or words are poorly formed and are hard to classify, even for a human. Of the 280 sample characters used for training, 280 have been used for test purposes. The captured image of each character is normalized and reduced to eight feature values, which form the components of a feature vector. Each character is then trained with the SVM (Support Vector Machine) algorithm. The method attempts to work with a subset of the features that a human would typically use to identify Arabic characters.
1. Introduction
One of the classical applications of the Artificial Neural Network is the Character Recognition System. Because such systems are cost-effective and save time, businesses, post offices, banks, security systems, and even the field of robotics employ them as the base of their operations. Handwriting recognition can be defined as the task of transforming text represented in the spatial form of graphical marks into its symbolic representation. A recognition system can be either “on-line” or “off-line”. It is “on-line” if the temporal sequence of points traced out by the pen is available, as with electronic personal data assistants that require the user to “write” on the screen using a stylus. It is “off-line” if it is applied to previously written text, such as images captured by a scanner. The on-line problem is usually easier than the off-line problem because more information is available. This article is restricted to off-line Arabic character recognition only.
1.1. Arabic writing:
The Arabic alphabet contains 28 letters. Each has between two and four shapes, and the choice of which shape to use depends on the position of the letter within its word or sub-word. The shapes correspond to the four positions: beginning of a (sub-)word, middle of a (sub-)word, end of a (sub-)word, and in isolation. Some variations are shown in the figure.
Figure: Four handwritten examples of laam-alif suggest the allowable variation
1.2. Proposed approach:
This method recognizes isolated handwritten Arabic letters only. First, the image is cleaned with image processing techniques. It may be converted to a more concise representation, such as the slope or height of the character, and features are then detected from the image. With the features as input, a recognizer returns the identified character. The aim is to develop a system capable of recognizing handwritten characters supplied as character images (off-line). It provides a means of training on the input characters first, followed by a classification option for characters. The input image is compared with the stored model of character image values using the SVM algorithm.
2. Support Vector Machines (SVM)
Support Vector Machines (SVM) are a method for creating functions from a set of labeled training data. The function can be a classification function (the output is binary: is the input in a category or not) or a general regression function. An SVM performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. In the parlance of SVM literature, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. The goal of SVM modeling is therefore to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side. The vectors nearest the hyperplane are the support vectors. The figure below presents an overview of the SVM process.
Figure: SVM algorithm
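For reference, the two-class case described above can be written in the standard textbook (hard-margin, linear) form. The symbols w, b, x_i and y_i are not used in the original text and are introduced here only for illustration; y_i takes the values +1 or -1 for the two categories.

```latex
f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^{\top}\mathbf{x} + b\right),
\qquad
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1, \quad i = 1, \dots, n
```

The training vectors for which the constraint holds with equality are the support vectors, and minimizing the norm of w maximizes the margin 2/||w|| between the two categories.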
3.1 METHODOLOGY:
The major steps in building the classifier are pre-processing, feature selection and construction, feature extraction, SVM, and multiple SVM classifiers. The block diagram illustrates the methodology.
Figure 3.1: Outline of the system; the decision is based on the maximum output of the SVMs
Each character image is considered from four different views, and from each view two features are extracted; these are combined to obtain 8 features. Using these features, multiple SVM classifiers are trained to separate the different classes of characters. The ‘ALLGS’ database is used for training and testing the SVM classifiers. It provides 280 samples for training and 100 samples for testing, taken from real-life samples.
Figure 3.2: Some samples of the characters in the database
3.1.1 Pre-processing:
The images in the database have different sizes (as shown in the figure). Pre-processing normalizes all the images by cropping each one to the exact extent of the character, eliminating the surrounding white space. Feature extraction then makes the features invariant to size and translation, so that features are extracted in the same way for all samples.
Figure 3.3: a) original image b) cropped image
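The cropping step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the image is already binarized and stored as a list of rows with 1 for ink and 0 for background, and the function name crop_to_ink is hypothetical.

```python
def crop_to_ink(image):
    """Crop a binary image (list of rows, 1 = ink, 0 = background)
    to the smallest rectangle that contains all ink pixels."""
    # Indices of rows that contain at least one ink pixel.
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        return []  # blank image: nothing to crop
    # Indices of columns that contain at least one ink pixel.
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Usage: a 4x4 image with white space on every side.
sample = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
cropped = crop_to_ink(sample)
```

Cropping to the bounding box removes the dependence on where the character sits in the scanned image, which is what gives the later features their translation invariance.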
3.1.2 Feature Extraction:
The goal of a feature extractor is to characterize an object by measurements whose values are very similar for objects in the same category but very different for objects in different categories. Different features and their combinations were tried, and we selected a set of 8 features. The procedure involves:
1) For feature construction: The cropped image is divided into four equal parts (namely ll, lh, hl, hh), as shown in the figure.
2) For feature selection: The mean and variance of each part of the image are calculated. These real values are collected in a one-dimensional matrix, so the dataset d consists of eight parameters per image. The parameters are selected as the features (the feature vector) and passed to the SVM recognizer. For the above image, the extracted feature values are d = 1.0e+004 * [0.0463 0.0000 -0.0000 0.0000 1.8962 0.0600 0.0849 0.0301]. Generally, the features are normalized to the range between -1 and +1.
The main idea is to transform the features of each character into one-dimensional signals (functions of one variable, for example distance or time), and to process these parameters or weights to obtain further features for recognition. For each image, the one-dimensional derived dataset d is computed. With our normalization method, 8 parameter values are obtained for each character image and are used as features in the classification step. These features are easy to interpret and compute, and they carry good information about the structure of the character. The parameters are passed to the LIBSVM algorithm to train on the samples and to test with a new dataset. The algorithm is explained in Section 3.2.
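The two steps above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' implementation: the function name quadrant_features is hypothetical, the mapping of ll, lh, hl, hh to the top-left, top-right, bottom-left and bottom-right quarters is an assumption, and so is the ordering of the output (four means followed by four variances). The image is assumed to be a list of rows of numeric pixel values with at least two rows and two columns.

```python
from statistics import fmean, pvariance


def quadrant_features(image):
    """Return 8 features for a cropped image: the mean of each of the
    four quarters, followed by the (population) variance of each quarter."""
    height, width = len(image), len(image[0])
    mid_r, mid_c = height // 2, width // 2
    # Assumed layout: ll = top-left, lh = top-right,
    #                 hl = bottom-left, hh = bottom-right.
    quarters = [
        [row[:mid_c] for row in image[:mid_r]],  # ll
        [row[mid_c:] for row in image[:mid_r]],  # lh
        [row[:mid_c] for row in image[mid_r:]],  # hl
        [row[mid_c:] for row in image[mid_r:]],  # hh
    ]
    # Flatten each quarter into a single list of pixel values.
    pixels = [[p for row in quarter for p in row] for quarter in quarters]
    return [fmean(p) for p in pixels] + [pvariance(p) for p in pixels]


# Usage on a small 4x4 binary image.
sample = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
features = quadrant_features(sample)
```

Because each feature is an average over a quarter of the image, the length of the vector stays at 8 regardless of the image size, which matches the size invariance claimed for the features.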
3.1.4 Multiple SVM classifier
Since we have 28 classes of characters, we need 28 SVM classifiers, or 28 hyperplanes, to separate the characters from each other. For example, one classifier for the character ‘alif ا’ (so-called SVM0) separates all the samples of ‘alif’ from the other characters, and so on. This method of designing multiple SVM classifiers, called one-against-the-others, is illustrated in the figure.
Figure 3.6: One-against-the-others method for a three-class problem
From the 28 outputs of the 28 SVM classifiers, we take the maximum. The SVM classifier that gives the maximum output determines the class label for the input character.
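The maximum-output decision rule can be sketched as below. The helper predict_one_vs_rest and the three toy scoring functions are hypothetical stand-ins for trained SVM decision functions; they exist only to show how the label with the largest output is chosen.

```python
def predict_one_vs_rest(decision_functions, features):
    """Pick the class whose classifier returns the largest output.

    decision_functions: dict mapping a class label to a callable that
    takes a feature vector and returns a real-valued score.
    """
    scores = {label: score(features) for label, score in decision_functions.items()}
    return max(scores, key=scores.get)


# Toy stand-ins for three trained classifiers (a three-class problem).
toy_classifiers = {
    "alif": lambda x: x[0] - x[1],
    "baa": lambda x: x[1] - x[0],
    "taa": lambda x: -1.0,
}

label_a = predict_one_vs_rest(toy_classifiers, [0.9, 0.1])
label_b = predict_one_vs_rest(toy_classifiers, [0.1, 0.9])
```

With 28 letters, the dictionary would simply hold 28 entries, one per classifier.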
4. Implementation
Our implementation employs the LIBSVM library to train the classifiers and predict new test samples. Feature extractors generate the feature vector in a format usable by LIBSVM, and features are scaled between 0 and 1. A radial basis function (RBF) kernel is used for the classifiers. The best values of the parameters γ and C are obtained by performing a grid search using cross-validation on the data, taking γ = 2^-15, 2^-13, ..., 2^3 and C = 2^-5, 2^-3, ..., 2^15. The characters were read one by one from each user’s sample file and separated uniformly into 5 files. One of these files was used for training and the rest for testing. Thus, a training size of 20% was taken, and 5 experiments were carried out by training the classifier on one file and testing it on the remainder. Parameters γ and C were determined by performing a grid search using 5-fold cross-validation on the training set with the help of the grid search tool. The effect of alpha on the overall combined accuracy in the 5 experiments is drawn in graph 1. The accuracy rates are analyzed in graph 2.
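The parameter grid and the five-way split can be sketched as follows. This is an illustration only: it reads the grid as powers of two with the exponent increasing in steps of 2 (from -15 to 3 for the kernel parameter and from -5 to 15 for C), it assumes that "separated uniformly" means dealing samples out in round-robin order, and the names rbf_kernel and split_uniform are hypothetical. The kernel formula is the usual RBF form exp(-gamma * ||x - z||^2).

```python
import math

# Candidate values: 2^-15, 2^-13, ..., 2^3 and 2^-5, 2^-3, ..., 2^15.
gammas = [2.0 ** e for e in range(-15, 4, 2)]
costs = [2.0 ** e for e in range(-5, 16, 2)]
grid = [(gamma, cost) for gamma in gammas for cost in costs]


def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def split_uniform(samples, parts=5):
    """Deal samples out in round-robin order into `parts` lists."""
    return [samples[i::parts] for i in range(parts)]


# Usage: each experiment trains on one part and tests on the other four.
parts = split_uniform(list(range(10)))
experiments = [
    (parts[i], [s for j, part in enumerate(parts) if j != i for s in part])
    for i in range(len(parts))
]
```

Each (gamma, cost) pair in grid would be scored by cross-validation on the training part, and the best pair kept for the final classifier.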
Figure 4.2: Accuracy rate
Results: Experiments were performed and the success rates are listed in the table. Due to time constraints, the first 100 letters in the database were used for the results, rather than the full database. On 280 characters written by 5 writers, SVM gives a recognition rate of 90.4%. Moreover, hybrid features from various other methods of character recognition may give a higher recognition rate.
5 Conclusion and future enhancement
From the training results, the system has more trouble identifying ‘ر’ and ‘د’, because the image is cropped to the height and width of the character. The baseline is not followed, so ascending and descending strokes are not taken into account. To avoid this kind of trouble in future, another feature is required, chiefly to identify the baseline. Future work also includes the development of algorithms for use with larger lexicons and more variability in the appearance of words. This research work has described the automatic recognition of Arabic handwriting. It has been tested on a small dataset.
--- Algorithm: SVM: DataSet and the libsvm homepage: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
--- Support vector machines
--- Off-line Arabic Handwriting Recognition – A survey
--- Character Recognition by Feature Point Extraction by Eric W. Brown