Digital Humanities at the Sourasky Central Library
The field of Digital Humanities has reshaped, and in some cases even redefined, traditional research on various subjects, resources, and methodologies by applying scientific techniques and tools. This research field combines the questions and research methods of the traditional humanities and social sciences with digital tools.
Our Mission
At the Sourasky Central Library we offer support and resources for Tel Aviv University researchers and students who wish to use computer-based technologies to answer research questions related to the humanities and arts.
Guiding Rules
- Providing guidance, tutorials, and seminars to develop DH skills for conducting research and teaching.
- Preferring open-source software tools, making research products easily accessible, and preserving the data after the research has been completed (or financial resources have run out).
- Providing support and advice in writing research proposals that include the use of digital tools, from the stage of writing the proposal to the implementation stage.
- Focusing on four major tools: converting images to text (OCR), spatial analysis (GIS), distant reading (DR), and content management systems (CMS).
What is Optical Character Recognition (OCR)?
When we scan a page from a book, a newspaper, or any other textual source, the output is an image of the page, much like a photograph taken with a mobile phone. The computer does not identify the characters in the image, so searching for words or phrases is not possible.
Optical Character Recognition (also known as OCR) is a process in which designated software enables the computer to identify the printed or handwritten characters in a scanned image. The software recognizes the letter shapes in the scanned text and converts each of them into a machine-readable character.
Optical Character Recognition and DH
Today, Optical Character Recognition is the starting point for computational or quantitative analysis of textual sources. Converting many scanned sources into machine-readable text is a mandatory stage in analyzing a large quantity of textual research objects with computational methods. Once sources have been OCRed, simplified text images can be generated from them, text strings (with one meaning or another) can be tagged, and research objects can be analyzed statistically.
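As a small illustration of the statistical analysis that OCRed text makes possible, the following Python sketch counts the most frequent words in a text. The function name `word_frequencies` and the sample sentence are assumptions made for this example and are not part of any library workflow; a real project would read the text files produced by the OCR tool.

```python
import re
from collections import Counter


def word_frequencies(text: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Return the top_n most frequent words in a machine-readable text."""
    # \w+ matches runs of Unicode letters and digits, so Hebrew and
    # Arabic words are matched as well as Latin-script ones.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top_n)


# Hypothetical OCR output, used here only as sample data.
sample_text = "The library scanned the newspaper. The newspaper is now searchable."
print(word_frequencies(sample_text, top_n=2))
```

The same count cannot be computed on the scanned image itself, which is why OCR comes first in this kind of work.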
Optical Character Recognition tools
- Adobe Acrobat Pro: The commercial version of the popular PDF file editor easily and efficiently converts PDF files into searchable files. The program supports 42 languages. Once the OCR process is complete, the original document can be edited and saved in other formats. In our DH Lab, a fully licensed copy of Adobe Acrobat Pro 2017 is installed on two workstations.
- Tesseract: Google’s open-source OCR engine. This engine supports 165 languages, including Hebrew and Arabic (see the full list of languages here). Tesseract does not have a graphical user interface, and using it effectively requires some technical expertise. Students and researchers who need guidance and assistance should contact the Reference and Guidance Department. In our DH Lab there are two workstations with a full installation of Tesseract version 5.
- ABBYY FineReader: Leading commercial software for optical character recognition. The software supports 201 languages, including Hebrew and Arabic. It has advanced image processing capabilities, and it even lets the user train character recognition and create new language patterns. In our DH Lab there is one workstation with a full license for ABBYY FineReader 16.
- OCR on Demand: Send us a PDF file and we will send you back a searchable PDF. See: OCR Service Request.
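Because Tesseract is run from the command line, a short script can help with repeated use. The following Python sketch assembles a Tesseract command that produces a searchable PDF from a scanned image. The helper names (`build_tesseract_command`, `run_ocr`) and the file names are hypothetical; the sketch assumes the `tesseract` program and the Hebrew (`heb`) and English (`eng`) language data are installed, and it is not a description of how the DH Lab workstations are configured.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path: str, output_base: str, languages: list[str]) -> list[str]:
    """Build a Tesseract command line that writes <output_base>.pdf."""
    # General form: tesseract <image> <output base> -l <lang1+lang2> pdf
    return ["tesseract", image_path, output_base, "-l", "+".join(languages), "pdf"]


def run_ocr(image_path: str, output_base: str, languages: list[str]) -> bool:
    """Run Tesseract if it is installed and the image exists; report whether it ran."""
    if shutil.which("tesseract") is None or not Path(image_path).exists():
        return False
    command = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(command, check=True)
    return True


# Hypothetical file names, shown only to illustrate the command that would be run.
print(" ".join(build_tesseract_command("page.png", "page_ocr", ["heb", "eng"])))
```

Joining language codes with `+` asks Tesseract to recognize more than one language on the same page, which is useful for mixed Hebrew and English sources.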
For more information: Main Entrance Hall | cenlib@tauex.tau.ac.il | 03-6404823