Matching Strings of Renaissance Letters

Using Finite Numerical Sequences

Nikolay Metodiev Sirakov1 and Fernando Humberto Muge2

1,2 C.V.R.M., Instituto Superior Tecnico

Av.Rovisco Pais, 1049-001 Lisbon, Portugal

Tel. +351-1-84 17 396, Fax: +351-1-84 17 442.

E-mail: 1 nmsir@alfa.ist.utl.pt  and 2 pcmuge@alfa.ist.utl.pt

 

 

Abstract: This article presents an approach for shape matching of Renaissance letters and words, which are derived from Renaissance’s books from 16-en century. The method is based on notions as regularities, consistencies and essential points of letter’s border. The word’s model is developed as aspects graph tree, where each consistency falls apart to four finite numerical sequences. Since the semantic meaning of the words is not taken into account, they are considered as strings. Thus the string’s model may be presented as a set of finite numerical sequences. This way the problem for matching words, which contain Renaissance letters, is reduced to the problem for matching finite numerical sequences. In order to solve the latest one a set of rules is developed. It allows for decreasing of the algorithm’s complexity. A software tool is created on the basis of the theoretical concepts. Employing this tool a set of experiments was performed. As a case study set of words is used from the 7th page of the Renaissance book Lusiadas - Biblioteca Geral da Universidade de Coimbra. The results are shown in the article.

 

Keywords: shape description, aspect graph tree, and matching Renaissance letters and words.

 

1.  Introduction

This study presents an approach for shape matching of Renaissance letters and strings, which are derived from Renaissance books. The research is a part from the EU project “Digital Access to Books of the Renaissance”, Acronym- DEBORA, DG XIII/Telematic Program/LB-5608/A.

The indexing is an important problem related to digitising and analysis of Renaissance books. The calculation of “words frequency” in a given page, chapter of book or in a whole book may provide better opportunity for study of Renaissance language.

Let’s suppose that a digital Renaissance book is given and the pages are saved as images. In order to perform correct, efficient and fast indexing and calculation of the word’s frequency the following problems have to be solved before:

$             pre-processing the pages (images) in order to clean the noise;

$              segmentation of each page to text and pictures;

$              line detection;

$              words segmentation.

Let suppose that upper mentioned steps are accomplished and key word is given. The words segmentation algorithm derives a list of words, whose enveloping rectangle has almost the same size as the rectangle enveloping the key one. It provides for significant decreasing the number of the words, which have to be matched with the key one.

Since the semantic meaning of the words is not taken in to account in case of shape matching, the notion string will be used hereafter.

A number of different methods may be employed for solving the string matching problem [1], [2], [3], [4], Some of them are structural, statistical or combined [3], other are based on Hidden Markov Methods and Neural Networks  [1], [4]. These methods are oriented mainly to hand written characters recognition.

Due to the ancient technology used for printing of Renaissance books the letter’s shape varies from page to page and number of different distortions of the letter’s shape may be observed. Some letters are broken to two ore more parts, some letters are damaged (a part is missing), and other letters are fused together. But from other hand, the Renaissance letters shape does not vary so much as in hand written case. Since all these reasons it may be considered that the Renaissance letters have “difficulty” of matching being somewhere between the difficulty of type written and hand written letters. It makes us to conclude that shape-matching approaches should be more appropriate in the case of Renaissance letters.

The shape representing notions such as convexity and connectivity are suggested in [5] as useful to be applied for shape description of letters, but methodology for matching is not provided there. Formfactor, roudnes, aspect ration, elongation and curl [9] may be also applied for letters shape measurement, but they will need additional information to provide a correct matching in case of digital Renaissance books.

An approach for shape matching of Renaissance letters and strings is developed in this article. The method is based on notions such as regularities, consistencies, essential border points [6], [7]. The string’s model is developed as an aspect graph tree. The shape of each letter is presented as consistency (vector of consistencies, when several objects present the letter’s shape (see Fig.2)). Further each consistency falls apart to four finite numerical sequences (FNS). This way the problem for matching of Renaissance letters and strings is reduced to the problem for matching of FNS. It allows for significant reduction of the calculation complexity. A software tool for matching strings of Renaissance letters is developed on the basis of the theoretical concepts. An experiment was performed using a list of words, whose enveloping rectangle has almost the same size as enveloping rectangle of the word “Occeano”. The list is derived from the 7-th page of the book “Lusiadas”, which belongs to Biblioteca Geral da Universidade de Coimbra. The results of matching are provided in the article.

2. RENAISSANCE LETTER’S MODEL

In this paragraph the model of a string (word) containing Renaissance letters is developed as an aspect graph tree, using the introduced notions "regularities", "consistencies", "essential points of the letter’s border " and FNS [6], [7], [8]. The 2D object border is approximated with a given accuracy by a closed polygon.

"Regularity" is considered to be the direction of motion of a point over a closed polygon (see Fig.1A). The angles satisfied by regularities are defined below:

                                (1).

Regularity represents just one edge of a closed polygon, which approximate the letter’s border by a given accuracy, while consistency is defined as consecutively connected regularities.  As an example, the border of the Renaissance letter “E” is approximated with accuracy 0.5 by the closed polygon shown in Fig. 1B. The polygon is described by the consistency presented in (2). Cl closes the consistency.

                  

  Cl                                            (2).

Vertices that link different regularities are called "essential points of the letter’s border "- [7], [8]. Vertices that link regularities of one and the same type are called "non-essential".

A vertex is used as a starting point of a polygon description that is accomplished by employing consistencies. In our case this is the polygon vertex with co-ordinates (xmin, ymin).

                     A)

                      B)

 

Figure 1. A) The plane points satisfy 8 regularities. B) Letter border, the regularities of its edges satisfy.

Further on, each consistency falls apart to four finite numerical sequences: Rg - sequence of regularities that build the consistency; Rp - a sequence containing repetition indicators and showing how many times each regularity from Rg consecutively participates in the consistency; An - the angles, that the regularities of Rg conclude with the axis Ox+; Le - the lengths of the regularities belonging to Rg. As a result, the consistency shown in (2) falls into four numerical sequences:

Rg = 5,1,3,2,3,2,8,7,5,7,6,8,4,3,8,6,4,3,2,6,7,6;    Rp = 1,1,3,1,1,1,1,1,1,1,1,1,1,2,1,4,2,1,1,2,1,1;        An = A1,A2,........,A10;         Le = l1,....,l30    (3).

It follows from the definition of the regularities S1, S2, S7, S8 that they have Rp = 1.

Thus, the geometrical model of each closed polygon (contour) approximating a letter border (when the letter is presented by only one object, see Fig.2) may be presented as four finite numerical sequences.

Employing the idea of finite numerical sequences, the model of a string containing Renaissance letters is developed as an aspect graph tree with six levels (see Fig.3). The following denotes are used there:

Ok  – string of Renaissance letters;

         Cki- the letter i that belong to the string k. If the letter contains only one object (for example “i” and “j” contain 2 objects), then Cki denote consistency;

         regularity that belongs to the i-th consistency of the k-th string (word);

, , - are the repetition indicator of , angle and length,  respectively.

 

Figure 2. The letters “h” and “n” are broken and two objects present each letter.

Note that in the case of Renaissance books some of the letters are broken to two or more objects (see Fig. 2). In this case, Cji denotes the following vector:

Cji = (, …,)                                                               (4),

where  ,for v=1,…,u, is a consistency and u is the number of the objects that present the corresponding Renaissance letter. The consistencies in the upper vector are ordered depend on the position of the object in the letter and using the following description rule: objects description is performed from the left corner of the letter to the right one and from bottom to top.

 

Figure 3. The string’s model as an aspect graph tree.

Using the presented model, the problem for matching of strings, which contain Renaissance letters, is reduced to matching aspect graph trees. It allows for a significant decreasing of calculation complexity of the algorithms.

3. MATCHING OF STRING’S MODELS

An approach for matching models of strings containing Renaissance letters is developed in this paragraph. Only strings (words) with “almost” the same size of the enveloping rectangle are considered as subject of matching. The other words can be discriminated by word segmentation algorithm, as it was explained in the Introduction.

Having a look at the graph structure it may be observed that the second level is a sequence of vectors and four nodes are consecutively connected with each vector element. A number may be considered with each graph node by using the following mappings:

Cki à i, à v,  à mj, à the number of the consecutive repetitions of , à the angle of  ,  à the length of . This way the string model shown in Fig. 3 may be presented by set of finite numerical sequences (FNS).

Follows that the problem for matching of graph trees may be restricted to matching of finite numerical sequences. A set of rules is given hereafter in order to solve this problem.

By definition Si and Sj are called different if  ij. They are called almost similar if  i = j, but  , for d10 and d20. When d1 = 0, the regularities are similar, and when d1 = 0, d2 = 0 they are equal to one another. The terms are denoted as follows: almost similarity by  "", similarity by "" and equality by “”.

Consider FNS – a1, a2, and two elements: e1i a1, e2j  a2. The numbers i and j give the position of the element in FNS.

We consider the elements e1i and e2j almost similar if: , where d30 and d40. When  but 0 the elements are similar, and when  they are equal to one another.

If FNS a1 and a2 satisfy the inequality  they are called almost similar when m is the number of the almost similar elements; similar - when m is the number of the similar elements; equal - when m is the number of equal elements between the sequences. d5 denotes a threshold.

It follows from the above definition that two consistencies ci and cj  are:

almost similar (ci  cj), if                                                        Rg Rgj ,   Rp  Rpj ,   An  Anj ,  Ln  Lnj            (5);

are similar ( ci  cj), if                                                             Rgi    Rgj ,  Rp  Rpj ,  An   Anj ,  Ln  Lnj            (6);

equal (ci  cj), if                                                                      Rgi    Rgj ,  Rp  Rpj ,  Ani    Anj , Lni    Lnj            (7).

and the following inequalities are satisfied

      ,            .

In the last definition the following notations are used: d – threshold; Li for i=1,2 - the total length of the corresponding consistency; LN - the total length of the equal parts of the FNS -Rg. We use the sequence Le (see (3)) to calculate L. Moreover, to calculate LN we use the sequences Le and the result of the comparison between sequences Rg.

     Note that the matching result depends on the thresholds d1,...,d5. The user may choose the values, depending on his knowledge about the considered Renaissance letters.

In the case of broken letter, more than one closed polygon (objects) describing it. The notion "morphological similarity" and the approach presented in [8] are applied here, in order to define similar objects that belong to one and the same letter.

Remember that the problem for matching Renaissance words is reduced to matching aspect graph tree, which we restricted to processing and matching of finite numerical sequences. It lead to significant decreasing of calculation complexity of the matching algorithms, which is very important when large number of Renaissance words have to be matched in quasi real time.

4. RESULTS

A software tool capable for matching of strings (words) containing Renaissance letters is developed on the basis of the theoretical concepts. The tool is running under DOS in order to increase the speed. As input it use a file, containing list of strings (words), which have to be compared with a given keyword. The matching begins form the top of the list.

Two types of outputs from the tool are provided: images and a text file. Each image contains the result of matching between the keyword and the corresponding one from the list. The similar letters are painted in one and the same colour and the corresponding similar parts of the border are connected with straight elements. The text file contains the FNS describing the border of each letter, the essential border points and the result of matching.

                         A)

              B)

            C)

Figure 4.  A) A list of words, whose enveloping rectangle has the same size as “Occeano”. B) Matching the first word from the list with the key word. The last three letters only are found as similar. C) Matching the second word with the key one. The words are equal since they have the same number of letters, which are similar.

 

           A)

           B)

           C)

Figure 5.  Matching the key word with the 3-th, 4-th and the 5-th words. All the words are different. A disadvantage of our algorithm may be observed in C). The letters “O” and “G” are found as similar, since the first one is broken and has consistency similar to the consistency of “G”.

            A)

             B)

             C)

Figure 6.  Matching the key word with the 6-th, 7-th and the 8-th words. The first couple of words are different since all pairs of corresponding letters are different. C) The words are equal since they have the same number of letters, which are similar.

              

 

Figure 7. Matching the key word with the last one from the list. A fusion of letters in the lower word may be observed.

      

Figure 8. Matching a letter (second one from the first and second strings) with a broken one.

Two strings (words) are considered as equal if and only if they have the same number of letters and the corresponding letters are similar. The notion “corresponding” means that only letters situated on one and the same place in the strings are subject of matching (first with first, second with second etc).

The developed software tool was employed to process the list presented on Fig. 4A. The words are derived from the 7-th page of the book Lusiadas, which belongs to  Biblioteca Geral da Universidade de Coimbra . The sizes of the enveloping rectangle of each word are almost equal to the size of the enveloping rectangle of “Oceano”. The results of matching are provided in Figs. 4. 5. 6. and 7. Two words are found equal to the word “Oceano”.

In Fig. 8 is shown the matching of a letter “n” with a broken one. The different particles of the lower letter are found as similar to different parts of the upper one.

The running time of the software tool for performing each matching, given in the upper Figures, is around 100 ms using PC 300 MHz.

About 90 words were matched with corresponding key words. The words and the key words were derived from the Renaissance book “Lusiadas”, which belongs to Biblioteca Geral da Universidade de Coimbra. The rate for definition of letters similarity is about 90%.

Applying the presented method and software tool the following advantages may be obtaiuned:

:   description of the letter’s shape by means of finite numerical sequences. It allows for significant decreasing the software running time, which is essential in case of on-line matching of large set of words containing Renaissance letters;

:   definition of shape similarity between Renaissance letters;

:   definition of shape similarity between fussed letters, between fused and normal;

:   definition of shape similarity between broken letters, between broken and normal;

:   a tool for analysis of Renaissance letters is available.

Comparing our approach, with other existing in the field of matching strings the following differences may be observed. Although the Neural Networks provide high recognition rate (in some cases it is 99.35% [4]), there is necessary a training to be performed for recognition of each alphabet letter depend on the printing technique used for the corresponding Renaissance book. Unfortunately it is not convenient for use by library experts, since they do not always have the knowledge necessary for it.

For the statistical methods there have a correlation between the recognition rate, recognition time and the number of “classes” involved in the recognition process [3].

Both types of methods are more appropriate for matching hand-written letters, where infinite different manners of writing letter usually exist.

5. CONCLUSIONS

The research will continue with investigation of set digital Renaissance books in order to define the most common and important distortions in the Renaissance letters shape. Note that there is not available Data Base capable for providing of digital information about the shape of the Renaissance letters from 16-en century. On the basis of the accumulated knowledge and colected Data Base a set of additional letter’s discriminating features will be defined. Further, it will allows to develop an approach and software tool which own higher rate for definition of similarity between Renaissance words. This tool will lie in the root of a software system for automatically definition of words frequencies in digital Renaissance books in quasi real time. It will allows for library and linguistic experts to study by better way the Renaissance books and language.

6. ACKNOWLEDGEMENTS

Thanks to CVRM-IST and EU project Digital Access to Books of the Renaissance, LB-5608/A-DEBORA, for providing the experimental data. Thanks to INVOTAN NATO Fellowships CP(BU)8/99/po and BNSF projects 5025/98, I 706/97 for supporting this research.

References

[1] Bouchaffra, D., Govindaraju, V., Srihari, S.N.,1999. Recognition of Strings using Non-Stationary Markovian Models: An Application to Zip Code Recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 23-25, 1999, Fort Collins, Colorado, USA.

[2] Cha, S. H., Shihand, Y. C., Srihari, S.N. 1999. Approximate Stroke Sequence String Mathing Algorithm for Character Recognition and Analysis. Proc. of the 5-th International Conference on Document Analysis and Recognition, Bangalore, India, Sept 1999.

[3] Heute, L., J.V. Moreau, T paquet, Y Lecourtier, and C. Olivier, 1996. Combining of Structural and Statistikal Features for the Recognition of Hand-written Characters. (13th ICPR, TU Vienna, Austria, August 25-29. 1996) Pattern Recognition and Signal Analysis, IEEE Computer Society, Vol. II, 1996, pp.210-215.

[4] Rigoll, G., A., Kosmada, J., Rottland, Neukirchen, C., 1996. A Comparison Between Continuous and Discrete Density Hidden Markov Models for Cursive Hand-written Recognition, (13th ICPR, TU Vienna, Austria, August 25-29. 1996) Pattern Recognition and Signal Analysis, IEEE Computer Society, Vol. II, 1996, pp. 205-209.

[5] Russ, J.C., 1998. The image processing- handbook. Springer Verlag, Heidelberg, Germany, 771 pp.

[6] Sirakov, N.M., “Recognition of shape from a finite series of  plane  figures”, Shape in Pictures; Edited by: O Lie and A. Toet, H.J.A.M. Heijmans, D. Foster, P. Meer; Printed by Springer Verlag, the Netherland, 1994, pp.453-462.

[7] Sirakov, N.M., F.H. Muge, "Comparing 2D Borders Using Regular Structures", Proc. of RecPad'94, Lisbon, Portugal, 1994, 169 -176.

[8] Sirakov, N.M., 1996.Automatic Reconstruction of 3D Branching Objects. (13th ICPR, TU Vienna, Austria, August 25-29. 1996) Pattern Recognition and Signal Analysis, IEEE Computer Society, Vol. II, 1996, pp. 620-624.

[9] Soille, P., 1999. Morphological Image Analysis – Principles and Applications,. Springer Verlag, Berlin, Heidelberg, New York, Barcelona, 310 p.