Minimum edit distance on a probabilistic string
Abstract
Optical character recognition (OCR) is a method for converting images of typed or handwritten text into machine-encoded text. Text is often illegible or worn, and therefore ambiguous. In these situations, an OCR model can output a probabilistic string: a sequence of characters in which each position also carries a ranking of less likely alternatives. To quantify how dissimilar the output string is from another string, the Levenshtein distance or another edit distance algorithm is used. These algorithms count the number of operations required to convert one string into another; in most edit distance algorithms, the possible operations are inserting a character, deleting a character, and replacing a character. The smaller the Levenshtein distance between two strings, the more similar the strings are to one another. There are various edit distance algorithms, each with its own run-time, efficiency, and readability; however, most of these algorithms do not take probabilistic strings into account. This paper’s contribution is to survey prominent methods of calculating the minimum edit distance of two strings and to evaluate each method’s run-time, efficiency, and ability to work with probabilistic strings. This will help automate the search for missing and extra letters in scrolls copied from the Masoretic Text, which can aid Biblical researchers.
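As an illustrative sketch (not part of the paper itself), the classic Wagner–Fischer dynamic-programming formulation of the Levenshtein distance uses the three operations described above, with each insertion, deletion, and replacement costing one edit:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance.

    prev[j] holds the edit distance between the first i-1 characters
    of `a` and the first j characters of `b`.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # replacement is free on a match
            curr.append(min(prev[j] + 1,          # delete ca from a
                            curr[j - 1] + 1,      # insert cb into a
                            prev[j - 1] + cost))  # replace ca with cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3 (two replacements, one insertion)
```

Extending this recurrence to a probabilistic string, where the substitution cost at each position could be weighted by the OCR model's confidence in its alternatives, is one direction the surveyed methods are evaluated against.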