Bringing AI to Life: A Real-World Example of Using OCR with RAG Models for LLM Integration
In the ever-evolving world of artificial intelligence, the synergy between Optical Character Recognition (OCR) and Retrieval-Augmented Generation (RAG) models opens up a vast array of possibilities. This guide is designed to walk you through how to get started with OCR using Tesseract and then integrate it with a RAG model for LLM use cases, specifically with OpenAI GPT models.
Real World Example: Extracting and Using Information from a Gym Schedule Photo
When it comes to practical AI applications, nothing beats a real-life example. Let’s say you’ve just snapped a photo of the group workout schedule at your local gym. It’s packed with valuable information, but how do you quickly extract that data and make it usable? This is where the combination of Optical Character Recognition (OCR) and Retrieval-Augmented Generation (RAG) models comes into play.
For instance, using Tesseract OCR, you can extract all the text from the gym schedule, turning an image into searchable, editable content. Once the text is extracted, you can feed it into a RAG model. This allows you to query the schedule in natural language — like asking, “What classes are available on Tuesday at 9:00 AM?” — and get an instant, accurate response, such as “Boot Camp at 9:00 AM in Studio 1.”
Want to skip to the best part? Jump down to the setup steps and try this yourself!
Why Combine OCR with RAG?
The combination of OCR and RAG models is a powerful approach for extracting and leveraging textual information from diverse sources. Here’s why:
- Text Extraction from Images: OCR enables the conversion of various documents — such as scanned papers, PDFs, or photos — into editable and searchable text.
- Enhanced Information Retrieval: RAG models bring together the strengths of retrieval-based models and generation-based models, allowing them to fetch relevant information from a knowledge base and generate coherent responses. When paired with OCR, RAG models can utilize information embedded in images or non-digital text.
- Real-World Applications: This approach is versatile, with applications ranging from automating document processing to enhancing chatbot interactions with hard-to-find data.
Step 1: Setting Up Tesseract for OCR
Tesseract is an open-source OCR engine that supports multiple languages and can be easily integrated into Python workflows.
Installation
You can install Tesseract on different platforms as follows:
For Windows:
- Download the installer from the official GitHub repository.
- Run the installer and follow the instructions.
For macOS:
brew install tesseract
For Linux:
sudo apt-get install tesseract-ocr
Basic Usage with Python
Once installed, you can use Tesseract directly from the command line or integrate it with Python using the pytesseract library (pip install pytesseract Pillow). Here's a basic example:
import pytesseract
from PIL import Image
# Load the image
image = Image.open('path_to_image.png')
# Extract text from image
text = pytesseract.image_to_string(image)
print(text)
This code loads an image, processes it with Tesseract, and outputs the extracted text.
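Real-world photos are often low-contrast or unevenly lit, which hurts recognition accuracy. A little preprocessing with Pillow can help before handing the image to Tesseract; here's a minimal sketch (the threshold of 150 is an assumption you'd tune per image):

import pytesseract
from PIL import Image, ImageOps

# Load the photo and convert it to high-contrast black and white
image = Image.open('path_to_image.png')
gray = ImageOps.grayscale(image)
bw = gray.point(lambda p: 255 if p > 150 else 0)  # assumed threshold; tune per image

text = pytesseract.image_to_string(bw)
print(text)

One other practical note: on Windows, if Python can't locate the Tesseract binary, you can point pytesseract at the executable via pytesseract.pytesseract.tesseract_cmd.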
Step 2: Integrating OCR with a RAG Model
After extracting text using OCR, the next step is to integrate this output with a RAG model. This approach is particularly useful when you need an LLM to generate responses based on external knowledge sources.
Overview of RAG Models
A RAG model consists of two main components:
- Retriever: Fetches relevant documents or text snippets from a large corpus based on a query.
- Generator: Utilizes an LLM to generate a response based on the retrieved information.
This architecture is especially useful for tasks like question answering, where the model needs to consult external knowledge before generating a response.
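In code, that flow is simply "retrieve, then generate." Here's a minimal sketch of the shape (the retriever and llm objects here are placeholders, not any specific library's API):

def answer(query, retriever, llm, top_k=3):
    # Retriever: fetch the most relevant snippets from the knowledge base
    docs = retriever.retrieve(query, top_k=top_k)
    context = "\n".join(doc.content for doc in docs)
    # Generator: ask the LLM to answer using only the retrieved context
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)

We'll build a concrete version of each piece below.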
Practical Example Using a Real Document
Let’s walk through a practical example using a real-world document: a group exercise schedule (thanks to XSport Fitness in South Barrington, IL). Suppose you want to automate the process of answering queries about available classes based on the schedule.
Step 1: Extracting Text from the Schedule
Here’s how to extract text from an image of the exercise schedule using Tesseract:
import pytesseract
from PIL import Image
# Load the image
image = Image.open('path_to_schedule.jpeg')
# Extract text using Tesseract OCR
text = pytesseract.image_to_string(image)
print(text)
The extracted text might look something like this:
BARRINGTON
GROUP EXERCISE SCHEDULE
MONDAY TUESDAY WEDNESDAY THURSDAY FRIDAY SATURDAY SUNDAY
9:00 AM 9:00 AM 9:00 AM 9:00 AM 9:00 AM 9:30 AM [No Class]
Boot Camp Boot Camp Pure Strength Ride and Rip [No Class] Boot Camp [No Class]
(50) (50) (50) (50) (50) (50) [No Class]
Studio 1 Studio 1 Studio 1 Studio 1 Studio 1 Studio 1 [No Class]
...
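Notice that because the schedule is a grid, image_to_string flattens the columns into rows that are hard to parse. If you need the layout, pytesseract can also return Tesseract's word-level bounding boxes, which you could use to reconstruct rows and columns; a minimal sketch:

import pytesseract
from PIL import Image

image = Image.open('path_to_schedule.jpeg')

# Get each word with its pixel position instead of one flat string
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, left, top in zip(data['text'], data['left'], data['top']):
    if word.strip():
        print(f"{word!r} at x={left}, y={top}")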
Step 2: Storing the OCR Text in a Knowledge Base
You can store this text in a searchable index (e.g., Elasticsearch, FAISS) to enable efficient retrieval. The example below uses Haystack 1.x with a FAISS document store (pip install farm-haystack[faiss]):
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import DensePassageRetriever

# Initialize the document store
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

# Add the OCR-extracted text to the document store
document_store.write_documents(
    [
        {
            "content": text,
            "meta": {"name": "group_exercise_schedule"},
        }
    ]
)
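One practical note: if the whole schedule goes in as a single document, the retriever can only ever return the whole schedule. Splitting the OCR output into smaller passages, say one per non-empty line, gives the retriever something meaningful to rank. A minimal sketch under that assumption:

# Store one document per non-empty line of OCR output (a simple
# chunking heuristic; a real schedule may need smarter parsing)
docs = [
    {"content": line, "meta": {"name": "group_exercise_schedule", "line": i}}
    for i, line in enumerate(text.splitlines())
    if line.strip()
]
document_store.write_documents(docs)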
Step 3: Retrieving Relevant Information with RAG
To answer a query such as “What classes are available on Tuesday at 9:00 AM?”, you can implement a retriever to search the knowledge base, and then use a GPT model to generate a response:
# Initialize the retriever (uses DPR's default query/passage encoders)
retriever = DensePassageRetriever(document_store=document_store)

# Compute embeddings for the stored documents so FAISS can search them
document_store.update_embeddings(retriever)

# Query the retriever
retrieved_docs = retriever.retrieve(query="Tuesday 9:00 AM classes")
# Generate a response with GPT, grounded in the retrieved text
# (assumes the openai package's pre-1.0 API and OPENAI_API_KEY set in your environment)
import openai

context = "\n".join(doc.content for doc in retrieved_docs)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a helpful assistant. Answer using only this schedule:\n" + context},
        {"role": "user", "content": "What classes are available on Tuesday at 9:00 AM?"},
    ],
)
print(response.choices[0].message['content'])
Expected Output:
The class available on Tuesday at 9:00 AM is Boot Camp (50) in Studio 1.
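To make this reusable for arbitrary questions, you could wrap retrieval and generation in a single helper (a sketch, reusing the retriever and the legacy openai client from above):

def ask_schedule(question, top_k=3):
    # Retrieve the most relevant schedule snippets for this question
    docs = retriever.retrieve(query=question, top_k=top_k)
    context = "\n".join(doc.content for doc in docs)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant. Answer using only this schedule:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message['content']

print(ask_schedule("What classes are available on Tuesday at 9:00 AM?"))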
Conclusion
This guide has walked you through the process of setting up OCR with Tesseract, integrating the extracted text into a RAG model, and using it with OpenAI GPT models to answer queries based on the content of an image. This approach can be adapted for a variety of use cases, such as automating responses to common questions based on printed schedules or documents.
By combining the strengths of OCR and RAG models, you can create sophisticated AI systems that handle real-world data more effectively, enabling a more seamless interaction between digital and non-digital information sources.
Happy coding!