I’m looking for a utility which takes text files, or subtitle-format files, as input and checks them for common errors and artifacts of OCR’ing raster-image text. Examples:
- Occurrences of lowercase-l instead of uppercase I and vice-versa, when the other character is more likely to fit (e.g. “l’m going home” starting with a lowercase-l).
- Ditto, but for the digit 1 and lowercase-l, or uppercase-I and the digit 1.
- Gratuitous space: “I’ m going home”
- Sequences which seem like they should be a multi-digit number, but have a non-digit, e.g. “19o3”.
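To make the cases above concrete, here is a rough sketch of the kind of check I have in mind. It is only an illustration under my own assumptions: the rule names, the regular expressions and the `find_suspects` function are all made up for this example, they cover only a few of the cases, and they will certainly produce false positives and misses.

```python
import re

# Hypothetical rule set: (rule name, compiled pattern) pairs.
RULES = [
    # A lone lowercase l where the pronoun "I" is more likely, e.g. "l’m".
    ("l-for-I", re.compile(r"\bl(?=['’]|\s|$)")),
    # A space right after the apostrophe of a contraction, e.g. "I’ m".
    ("stray-space", re.compile(r"(?<=\w['’]) (?=(?:m|s|t|d|ll|re|ve)\b)")),
    # A run of digits interrupted by a digit look-alike letter, e.g. "19o3".
    ("letter-in-number", re.compile(r"\b\d+[oOlIS]+\d+\b")),
]


def find_suspects(text):
    """Yield (line number, column, rule name, matched text), 1-based."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RULES:
            for match in pattern.finditer(line):
                yield line_no, match.start() + 1, name, match.group()


if __name__ == "__main__":
    sample = "l’m going home\nI’ m going home\nBorn in 19o3"
    for line_no, col, name, found in find_suspects(sample):
        print(f"{line_no}:{col}: {name}: {found!r}")
```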
What it does with these cases can vary:
- Just heuristically fix them.
- Bring up a dialog (GUI/TUI) asking whether to fix the single occurrence (perhaps with option to “fix all”)
- Print the location of the suspected error, for offline handling by a human or another tool
Desirable features:
- Small size
- Supports multiple modes of operation
- Allows for additional heuristics to be programmed in
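As an illustration of what I mean by multiple modes and programmable heuristics, something along these lines would do. This is a minimal sketch, not an existing tool: the `HEURISTICS` registry, the `heuristic` decorator, the `process` function and the mode names are all hypothetical, and each heuristic here is simply a function mapping a line to a corrected line.

```python
import re

# Hypothetical registry: each heuristic maps a line to a (possibly) fixed line.
HEURISTICS = []


def heuristic(func):
    """Register a line -> line correction function."""
    HEURISTICS.append(func)
    return func


@heuristic
def lowercase_l_as_pronoun(line):
    # A lone lowercase l before an apostrophe, whitespace or end of line.
    return re.sub(r"\bl(?=['’]|\s|$)", "I", line)


@heuristic
def letter_o_inside_number(line):
    # A letter o/O sitting between two digits is probably a zero.
    return re.sub(r"(?<=\d)[oO](?=\d)", "0", line)


def process(text, mode="report"):
    """mode="fix" returns corrected text; mode="report" returns a list of
    (line number, heuristic name) for every heuristic that would change a line."""
    fixed_lines, report = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fixed = line
        for func in HEURISTICS:
            candidate = func(fixed)
            if candidate != fixed:
                report.append((line_no, func.__name__))
                fixed = candidate
        fixed_lines.append(fixed)
    if mode == "fix":
        return "\n".join(fixed_lines)
    if mode == "report":
        return report
    raise ValueError(f"unknown mode: {mode!r}")
```

Adding a heuristic would then just mean writing another function and decorating it with `@heuristic`; an interactive mode could reuse the same report to prompt for each suspected error.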