OCR stands for optical character recognition.
This is useful if you have DVD/BD subtitles or hardcoded subtitles and want to “convert” them to a softsub format such as .srt or .ass. PGS/SUP and VOB subtitles are in fact “softsubs” too, but they are picture-based: every subtitle line is stored as an image. This is for performance reasons on DVD and Blu-ray players, which only have to overlay a finished picture. SRT/ASS are text-based and get rendered during playback, which costs more CPU time. That higher CPU usage is their disadvantage. The obvious advantage is that they can be rendered at any resolution you want and stay crisp even on an 8K display. Hardsubs, which are burned into the video picture, share the trade-off of picture-based subtitles: they are cheap to display, but fixed to the resolution they were made for. However, that also means you can’t simply convert either of them into a text format. You need OCR for that.
What do I need?
OCR with DVD & BD subtitles:
The fact that both PGS (aka SUP) and VOB are already softsubs makes them much easier to deal with than hardcoded subtitles. Just fire up Subtitle Edit and use one of the following “Import” options:
Next, configure your OCR settings. I personally use these settings, but it’s totally up to you. In my tests they seemed to be the most accurate:
Depending on your content you might want to use “Add to names/noise list”, “Add to user dictionary” or “Add pair to OCR replace list”. The more you OCR and the more you fill all three lists, the smarter and faster the OCR gets. When you are done, hit “OK”.
“Add to names/noise list (case sensitive)”
As the name says, you can add “unknown words” to a names list instead of putting them in your dictionary.
“Add to user dictionary”
Again, as the name says. The global dictionary doesn’t know everything, so you can make it smarter. As mentioned above, the smarter you make the OCR, the faster and better it gets.
“Add pair to OCR replace list”
This list is useful for special characters that don’t exist in your language, or for anything that is often detected wrongly. In anime, translators sometimes use characters like āōū. You can just put them on the replace list (keep in mind that the OCR probably detects āōū as different characters):
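To illustrate the idea behind a replace list outside of Subtitle Edit, here is a minimal Python sketch. It is not how Subtitle Edit implements it internally; the function name `apply_replace_list` and the example pairs are made up for illustration, assuming the OCR misreads ō as 6 or 0 and a capital I as a lowercase l.

```python
def apply_replace_list(text: str, pairs: dict) -> str:
    """Replace every known OCR misreading with its corrected form."""
    # Apply longer patterns first so a short pair cannot break up a longer one.
    for wrong in sorted(pairs, key=len, reverse=True):
        text = text.replace(wrong, pairs[wrong])
    return text


# Hypothetical pairs: what the OCR produced -> what it should have been.
PAIRS = {
    "Ry6ma": "Ryōma",
    "Y0ko": "Yōko",
    "l'm": "I'm",
}

if __name__ == "__main__":
    print(apply_replace_list("l'm Ry6ma.", PAIRS))
```

Each pair is keyed by the wrong detection rather than by the character you actually want, which is why it pays to check what the OCR really outputs for āōū before adding a pair.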
If everything looks okay to you, you can save your result.
I personally recommend “srt”, because there’s no typesetting or complex styling anyway.
If you want to fix the subtitles with one of my Subtitle-Pack scripts, srt input probably leads to better results.
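For reference, srt is plain text: a running number, a time range in the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, and the subtitle text, followed by an empty line. The sketch below shows how little there is to it. The helper names `srt_timestamp` and `srt_cue` are my own and not part of any tool mentioned here.

```python
def srt_timestamp(ms: int) -> str:
    """Format milliseconds as an srt timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_cue(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Build one srt entry: number, time range, text."""
    return f"{index}\n{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n{text}\n"


if __name__ == "__main__":
    # Entries in a file are separated by an empty line.
    cues = [srt_cue(1, 1000, 2500, "Hello"), srt_cue(2, 3000, 4200, "World")]
    print("\n".join(cues))
```

Since OCR output from picture-based subtitles is just timed text, this format carries everything there is to keep.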
OCR with hardcoded subtitles:
This is just a quick guide, because so far I have always avoided it (and recommend you do the same if possible).
VideoSubFinder lets you find the frames of a video that contain hardcoded subtitles. You will probably have to tweak it a bit to get fewer false positives. When it is done, go through every extracted picture and delete everything without subtitles on it.
You can use either Subtitle Edit or ABBYY FineReader for OCR. Subtitle Edit is painfully slow with Tesseract, and its other methods aren’t really worth using. You can get some speed boost (at the cost of accuracy) if you use Tesseract 3.02, or a newer version with “Original Tesseract only” instead of LSTM or both. The other option is ABBYY FineReader, which is much faster with pretty decent results. However, it’s paid software. It’s up to you which one you would rather use.
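The pictures that remain carry the subtitle timing in their file names. The sketch below reads it back, assuming names of the form `0_00_05_120__0_00_07_480...jpeg` (start and end time as hours, minutes, seconds and milliseconds, joined by a double underscore). That pattern is an assumption on my side, so check it against your own output folder first. The function name `parse_frame_times` is made up.

```python
import re

# Assumed file name pattern: H_MM_SS_mmm__H_MM_SS_mmm followed by anything.
FRAME_NAME = re.compile(
    r"^(\d+)_(\d{2})_(\d{2})_(\d{3})__(\d+)_(\d{2})_(\d{2})_(\d{3})"
)


def parse_frame_times(filename: str):
    """Return (start_ms, end_ms) for a frame file name, or None if it doesn't match."""
    match = FRAME_NAME.match(filename)
    if match is None:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(part) for part in match.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end


if __name__ == "__main__":
    print(parse_frame_times("0_00_05_120__0_00_07_480.jpeg"))
```

Files that return `None` don't follow the assumed pattern and are worth a manual look.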
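If you would rather run Tesseract yourself on the extracted pictures, the choice between the original engine and LSTM is made with the `--oem` option on the command line (Tesseract 4 or newer): 0 is the legacy engine only, 1 is LSTM only, 2 is both. I assume that “Original Tesseract only” in Subtitle Edit corresponds to mode 0. The sketch below only builds the command. The function name `tesseract_command` and the file names are placeholders.

```python
def tesseract_command(image: str, output_base: str, lang: str = "eng",
                      legacy_only: bool = True) -> list:
    """Build a tesseract command line for one subtitle picture."""
    # --oem 0 = legacy engine only, --oem 1 = LSTM only.
    # The legacy engine needs traineddata files that still contain the legacy model.
    oem = "0" if legacy_only else "1"
    # --psm 6 treats the picture as a single uniform block of text.
    return ["tesseract", image, output_base, "-l", lang, "--oem", oem, "--psm", "6"]


if __name__ == "__main__":
    # Placeholder file names; tesseract writes the text to frame_0001.txt.
    cmd = tesseract_command("frame_0001.jpeg", "frame_0001")
    print(" ".join(cmd))
    # To actually run it (requires tesseract to be installed and on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps file names with spaces intact when it is passed to `subprocess.run`.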
Edit: You can also try this HERE.