Gemma 3n’s multimodal capabilities extend beyond audio to a sophisticated understanding of visual information. You can use it to describe what’s in an image, answer specific questions about it (Visual Question Answering or VQA), and even read text from images (Optical Character Recognition or OCR).
This tutorial will guide you through using Gemma 3n’s vision features with Python, again leveraging the mlx-vlm library on Apple Silicon for a streamlined experience.
The Power of On-Device Vision
Having a powerful vision model running locally is a significant step forward for AI applications. It enables use cases like:
- Accessibility tools that describe the world to visually impaired users.
- Smart photo organization that automatically tags and categorizes your images based on their content.
- Interactive learning apps that can “see” and comment on a user’s drawings or diagrams.
- Privacy-first analysis of sensitive images without ever sending them to the cloud.
Step 1: Environment Setup
The setup is the same as in our audio transcription tutorial; if you already have that environment, you can skip this step. Otherwise, in a new project directory, create and activate a virtual environment and install mlx-vlm:
# Create a new directory for your project
mkdir gemma-vision-test && cd gemma-vision-test
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install the required library
pip install mlx-vlm
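If you want to confirm the installation succeeded before moving on, a quick import check is enough (this assumes the package’s import name, mlx_vlm, which the scripts below also use):
# Optional: confirm the package imports correctly
python -c "import mlx_vlm; print('mlx-vlm is ready')"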
Step 2: Prepare Your Image
Find an image you want to analyze. It could be a photograph, a diagram, or a screenshot. For this example, save it as my-image.jpg in your project directory.
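Optionally, you can confirm the image is readable before involving the model. The short check below uses Pillow, which is normally installed alongside mlx-vlm (if it isn’t in your environment, add it with pip install pillow):
from PIL import Image

# Open the image and print its format, dimensions, and color mode
with Image.open("my-image.jpg") as img:
    print(img.format, img.size, img.mode)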
Step 3: Writing the Image Analysis Script
Let’s write a Python script, analyze_image.py, to perform different visual tasks.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the Gemma 3n model and its config. The weights are cached after the first run.
model_id = "gg-hf-gm/gemma-3n-E4B-it"
model, processor = load(model_id)
config = load_config(model_id)

# Define the image path
image_file = "my-image.jpg"

# --- Task 1: General Image Description ---
prompt_describe = "Describe this image in detail."
print("--- Generating Image Description ---")
formatted_describe = apply_chat_template(processor, config, prompt_describe, num_images=1)
response_describe = generate(
    model,
    processor,
    formatted_describe,
    image=[image_file],
    max_tokens=512,
    temp=0.7,
)
print(response_describe)

# --- Task 2: Visual Question Answering (VQA) ---
prompt_vqa = "What color is the main object in the image?"
print("\n--- Answering Question About Image ---")
formatted_vqa = apply_chat_template(processor, config, prompt_vqa, num_images=1)
response_vqa = generate(
    model,
    processor,
    formatted_vqa,
    image=[image_file],
    max_tokens=128,
    temp=0.0,  # Lower temperature for factual answers
)
print(response_vqa)

# --- Task 3: Optical Character Recognition (OCR) ---
# Use an image that contains clear text for this task
image_with_text = "my-text-image.png"
prompt_ocr = "Read the text in this image."
print("\n--- Performing OCR on Image ---")
formatted_ocr = apply_chat_template(processor, config, prompt_ocr, num_images=1)
response_ocr = generate(
    model,
    processor,
    formatted_ocr,
    image=[image_with_text],
    max_tokens=256,
    temp=0.0,
)
print(response_ocr)
Step 4: Run and Observe
Place your my-image.jpg and an image with some text named my-text-image.png in the project directory. Now, run the script:
python analyze_image.py
You will see the model output three distinct responses: a detailed description of the first image, a specific answer to your question about it, and a transcription of the text in the second image.
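Once the basic script works, the same pattern scales to the “smart photo organization” idea from earlier. The sketch below is one way to caption every JPEG in a local photos/ folder; the folder name and caption prompt are just examples, not anything prescribed by mlx-vlm:
from pathlib import Path

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_id = "gg-hf-gm/gemma-3n-E4B-it"
model, processor = load(model_id)
config = load_config(model_id)

# Reuse one formatted prompt for every photo
prompt = "Give a short, one-line caption for this photo."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

# "photos" is a hypothetical folder of JPEGs inside the project directory
for path in sorted(Path("photos").glob("*.jpg")):
    caption = generate(
        model,
        processor,
        formatted_prompt,
        image=[str(path)],
        max_tokens=64,
        temp=0.0,
    )
    print(f"{path.name}: {caption}")
Because everything runs locally, none of the photos ever leave your machine.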
Video Demonstrations
To see Gemma 3n’s vision capabilities in action, including handling multiple images at once, check out these insightful videos:
- Unlocking Multi-Image Magic
- Using Vision for OCR and Understanding
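If you want to experiment with multiple images yourself, mlx-vlm accepts a list of image paths, and apply_chat_template takes a num_images argument. Here is a minimal sketch, assuming the two example images from this tutorial and a comparison prompt of my own choosing:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_id = "gg-hf-gm/gemma-3n-E4B-it"
model, processor = load(model_id)
config = load_config(model_id)

# Pass both images and tell the chat template how many to expect
images = ["my-image.jpg", "my-text-image.png"]
prompt = "Compare these two images. What does each one show?"
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(images))

response = generate(
    model,
    processor,
    formatted_prompt,
    image=images,
    max_tokens=256,
    temp=0.7,
)
print(response)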
Conclusion
Gemma 3n’s ability to seamlessly process and reason about visual information makes it an incredibly versatile tool for developers. By combining its text, audio, and vision capabilities, you can build truly intelligent, multimodal applications that were previously only possible with large, cloud-based models. Experiment with your own images and prompts to discover the full potential of on-device visual AI.