
Originally developed at Hewlett-Packard, released as open source in 2005, and sponsored by Google from 2006, Tesseract is considered one of the most accurate OCR engines and works for over 60 languages.

The OCR project support page offers additional details on how character formatting, such as bold and italics, is preserved in the output text: "When processing your document, we attempt to preserve basic text formatting such as bold and italic text, font size and type, and line breaks. However, detecting these elements is difficult and we may not always succeed. Other text formatting and structuring elements such as bulleted and numbered lists, tables, text columns, and footnotes or endnotes are likely to get lost."

Native speakers of several Indian languages (Bangla, Malayalam, Kannada, Odia, Tamil, and Telugu) commented on a Facebook post with feedback after testing the OCR. Tamil-language Wikimedian and Wikimedia India's program director Ravishankar Ayyakkannu said on Facebook after testing: "For some of the languages like Malayalam and Tamil, the OCR works with almost 100% accuracy, along with support in formatting like auto cropping, separating text by discarding images, and ignoring colored backgrounds." However, for a few scripts like Gurmukhi (used to write Punjabi), the OCR output is quite poor and comes back as gibberish text in different scripts.

A tutorial to convert text in Odia (Indian language) from a scanned image using Google's OCR.

Overall, this is quite a large leap for languages that have old texts that have not yet been digitized. Old and valuable text in many languages can now be digitized and shared over the internet using platforms like Wikisource.

Editor's note: Article has been updated based on community feedback.
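The Gurmukhi problem described above, where the output comes back in unrelated scripts, can be spot-checked automatically. The following is a minimal sketch, not part of Google's OCR: the function name `script_share` and the `SCRIPT_RANGES` table are illustrative assumptions, and the check relies only on the Unicode block ranges for each script.

```python
import unicodedata

# Unicode block ranges for a few of the scripts mentioned in the article.
SCRIPT_RANGES = {
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
}


def script_share(text: str, script: str) -> float:
    """Return the fraction of letters and combining marks in `text`
    that fall inside the Unicode block of `script`."""
    low, high = SCRIPT_RANGES[script]
    # Keep only letters (L*) and marks (M*); skip digits, spaces, punctuation.
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    in_script = sum(1 for ch in letters if low <= ord(ch) <= high)
    return in_script / len(letters)


if __name__ == "__main__":
    # A low share for the expected script suggests the OCR picked the wrong script.
    print(script_share("ਪੰਜਾਬੀ", "Gurmukhi"))
    print(script_share("தமிழ்", "Gurmukhi"))
```

A page expected to be Punjabi whose output scores well below 1.0 for Gurmukhi is a candidate for manual review; the threshold to use is a judgment call.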
Google's OCR probably builds on Tesseract, an OCR engine released as free software, or on OCRopus, a free document analysis and optical character recognition (OCR) system that is used primarily in Google Books.
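Tesseract itself can be run locally from the command line, which gives a rough point of comparison with Google's hosted OCR. The sketch below is an illustration only and says nothing about how Google's service is built: the helper names `build_tesseract_command` and `run_ocr` are hypothetical, and it assumes the `tesseract` binary and the relevant language data files are installed.

```python
import shutil
import subprocess

# Tesseract language codes for some of the languages discussed in the article.
TESSERACT_LANG = {
    "Bangla": "ben",
    "Kannada": "kan",
    "Malayalam": "mal",
    "Odia": "ori",
    "Punjabi": "pan",
    "Tamil": "tam",
    "Telugu": "tel",
}


def build_tesseract_command(image_path: str, output_base: str, language: str) -> list:
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", TESSERACT_LANG[language]]


def run_ocr(image_path: str, output_base: str, language: str):
    """Run Tesseract if it is installed; return the output file path, else None."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract is not on PATH; nothing to run.
    subprocess.run(build_tesseract_command(image_path, output_base, language), check=True)
    # Tesseract writes plain text to <outputbase>.txt by default.
    return output_base + ".txt"


if __name__ == "__main__":
    print(build_tesseract_command("page.png", "page", "Odia"))
```

Keeping the command construction separate from the subprocess call makes it possible to inspect the exact invocation without having Tesseract installed.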

The technology extracts text from images, scans of printed text, and even handwriting, which means text can be recovered from almost any old book, manuscript, or image. It is simple to use and can detect most languages with over 90% accuracy.
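The article does not say how these accuracy figures were measured. One common approach, shown here as a minimal sketch under that assumption, is character-level accuracy: compare the OCR output with a hand-typed reference and compute one minus the edit distance divided by the reference length. The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy = 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # One wrong character out of ten in the reference.
    print(char_accuracy("abcdefghij", "abcdefghiX"))
```

Python strings are compared code point by code point, so in Indic scripts a misread vowel sign or other combining mark counts as its own error.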
Google's Optical Character Recognition (OCR) software now works for more than 248 of the world's languages, including all the major South Asian languages.
