Minimum edit distance on a probabilistic string
Description
Undergraduate honors thesis / Open access
Abstract
Optical character recognition, or OCR, is a method used to convert images of typed or
handwritten text into machine-encoded text. Oftentimes, text can be illegible or worn out, and
therefore ambiguous. In these situations, OCR models can output a probabilistic string, or
sequence of characters, with a ranking of several less likely options as well. In order to quantify
how dissimilar the output string is from another string, Levenshtein distance or another edit
distance algorithm is used. These algorithms count the minimum number of operations required to
convert one string into another. The possible operations that can be performed in most edit
distance algorithms consist of inserting a character, deleting a character, and replacing a
character. The smaller the Levenshtein distance between two strings, the more similar the strings
are to one another. There are various edit distance algorithms, each with its own run-time,
efficiency, and readability; however, most of these algorithms do not take probabilistic strings
into account. This paper’s contribution is to survey prominent methods of calculating the
minimum edit distance of two strings and to evaluate each method's run-time, efficiency, and
ability to work with probabilistic strings. This will help automate the process of finding
missing and extra letters in scrolls that were copied from the Masoretic Text, which can help
Biblical researchers.
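The edit operations described in the abstract can be illustrated with a minimal Python sketch. The first function is the standard dynamic-programming Levenshtein distance with unit costs. The second, `expected_cost_distance`, is a hypothetical illustration only and is not taken from the thesis: it assumes a probabilistic string is represented as a list of dictionaries mapping candidate characters to probabilities, and it assumes a substitution cost of one minus the probability of the target character at that position.

```python
def levenshtein(source: str, target: str) -> int:
    """Standard edit distance: unit cost for insert, delete, and replace."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting all i characters of the source prefix
        for j, t_char in enumerate(target, start=1):
            replace = previous[j - 1] + (s_char != t_char)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(replace, delete, insert))
        previous = current
    return previous[-1]


def expected_cost_distance(prob_string, target: str) -> float:
    """Hypothetical variant for a probabilistic string (an assumption, not
    the thesis's method). Each element of prob_string is a dict mapping a
    candidate character to its probability; matching position i against a
    target character costs 1 - P(that character at position i)."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, candidates in enumerate(prob_string, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            replace = previous[j - 1] + (1.0 - candidates.get(t_char, 0.0))
            delete = previous[j] + 1.0
            insert = current[j - 1] + 1.0
            current.append(min(replace, delete, insert))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` returns 3 (two replacements and one insertion). Under the assumed cost model, a position where the OCR output is uncertain contributes a fractional cost rather than a full unit, so a target that agrees with the most likely reading scores lower than one that agrees with a less likely reading.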
Permanent Link(s)
https://hdl.handle.net/20.500.12202/9035
Citation
Schwartz, D. (2023, May 23). Minimum edit distance on a probabilistic string [Unpublished undergraduate honors thesis]. Yeshiva University.
*This is constructed from limited available data and may be imprecise.