Minimum edit distance on a probabilistic string

Date

2023-05-23

Publisher

Yeshiva University

Abstract

Optical character recognition (OCR) converts images of typed or handwritten text into machine-encoded text. The source text is often illegible or worn, and therefore ambiguous. In these situations, an OCR model can output a probabilistic string: a sequence of characters accompanied by a ranking of several less likely alternatives. To quantify how dissimilar the output string is from another string, Levenshtein distance or another edit distance algorithm is used. These algorithms count the minimum number of operations required to convert one string into the other. In most edit distance algorithms, the possible operations are inserting a character, deleting a character, and replacing a character. The smaller the Levenshtein distance between two strings, the more similar the strings are. There are various edit distance algorithms, each with its own run-time, efficiency, and readability; however, most of them do not take probabilistic strings into account. This paper’s contribution is to survey prominent methods of calculating the minimum edit distance between two strings and to evaluate each method in terms of run-time, efficiency, and the ability to work with probabilistic strings. This will help automate the process of finding missing and extra letters in scrolls that were copied from the Masoretic Text, which can aid Biblical researchers.
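For illustration only, the following is a minimal Python sketch of the kind of computation the abstract describes; it is not taken from the thesis. The function `levenshtein` is the standard dynamic-programming formulation with unit costs for insertion, deletion, and replacement. The function `probabilistic_distance` is a hypothetical extension: it assumes a probabilistic string is represented as one dictionary per position mapping candidate characters to probabilities, and it assumes a replacement cost of 1 minus the probability that the position reads as the target character. Both the representation and the cost model are assumptions made for this sketch.

```python
from typing import Dict, Sequence


def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein distance with unit costs, using two rows."""
    # prev[j] = distance between the empty prefix of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def probabilistic_distance(p: Sequence[Dict[str, float]], b: str) -> float:
    """Hypothetical variant for a probabilistic string.

    ASSUMPTION: p[i] maps each candidate character at position i to its
    probability, and replacing position i with character cb costs
    1 - P(position i reads as cb). Insertions and deletions still cost 1.
    """
    prev = [float(j) for j in range(len(b) + 1)]
    for i, dist in enumerate(p, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # delete position i
                curr[j - 1] + 1.0,                        # insert cb
                prev[j - 1] + (1.0 - dist.get(cb, 0.0)),  # soft replacement
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

When every position assigns probability 1 to a single character, the soft replacement cost reduces to 0 or 1, so under these assumptions `probabilistic_distance` agrees with `levenshtein`. Both functions run in time proportional to the product of the two lengths.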

Description

Undergraduate honors thesis / Open access

Keywords

Optical Character Recognition (OCR), probabilistic string, Levenshtein distance, distance algorithms

Citation

Schwartz, D. (2023, May 23). Minimum edit distance on a probabilistic string [Unpublished undergraduate honors thesis]. Yeshiva University.