A Python script that uses a local Ollama multimodal model to generate captions for your images in bulk. You can use the prompt to guide the vision model, for example to make it include certain keywords or to describe a particular person by name. It features a rich, interactive terminal user interface (TUI) for easy operation, configuration, and live progress tracking. This is primarily a helper tool for preparing image datasets for training with FLUX: the output is natural-language captions, since unlike Stable Diffusion, FLUX relies on natural-language descriptions rather than keyword tags.
The TUI is built with rich and gum, so there is no need to edit the script to change settings! Your settings are saved to a config.json file for your next session.

## Requirements

Before you begin, ensure you have the following installed and running:
- A Multimodal Ollama Model: You need a model capable of processing images, such as moondream.

  ```
  ollama pull moondream
  ```
- Rich: A Python library for rich text and beautiful formatting in the terminal.

  ```
  pip install rich
  ```
- Gum: A tool for glamorous shell scripts, used for the interactive menus.

  ```
  brew install gum
  ```

## Usage

1. Install dependencies: Make sure you have installed Python, Rich, and Gum as listed in the requirements section.
2. Start Ollama: Ensure the Ollama application is running and the server is active.
3. Run the script: Save the code as ollama_captionizer.py and run it from your terminal:

   ```
   python3 ollama_captionizer.py
   ```
4. Use the menu: You will be greeted by the main menu, where you can adjust your settings and start the captioning run. Your settings persist in config.json between sessions; a minimal sketch of that persistence is shown below.
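The settings persistence could look roughly like the following sketch. This is illustrative only, assuming hypothetical key names (model, prompt, image_dir) rather than the script's actual schema:

```python
import json
from pathlib import Path

CONFIG_FILE = Path("config.json")

# Hypothetical defaults for illustration; the real script's keys may differ.
DEFAULTS = {
    "model": "moondream",
    "prompt": "Describe this image.",
    "image_dir": ".",
}

def load_config() -> dict:
    """Merge any saved settings over the defaults."""
    if CONFIG_FILE.exists():
        return {**DEFAULTS, **json.loads(CONFIG_FILE.read_text())}
    return dict(DEFAULTS)

def save_config(cfg: dict) -> None:
    """Write the current settings back for the next session."""
    CONFIG_FILE.write_text(json.dumps(cfg, indent=2))
```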
Captions will be saved as .txt files with the same name as the original image (e.g., my_photo.jpg -> my_photo.txt).
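In code terms, the naming rule is just an extension swap; a minimal sketch using pathlib (the function names are illustrative):

```python
from pathlib import Path

def caption_path(image_path: str) -> Path:
    # my_photo.jpg -> my_photo.txt, written alongside the original image
    return Path(image_path).with_suffix(".txt")

def save_caption(image_path: str, caption: str) -> None:
    caption_path(image_path).write_text(caption, encoding="utf-8")
```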
## Cross-Platform Compatibility

This script is written in Python and is designed to be cross-platform. It should work on macOS, Linux, and Windows, provided the dependencies are met.
A key feature is that it communicates with the Ollama server over its network API (e.g., http://localhost:11434). This means you do not need to modify the script to handle different executable names like ollama.exe on Windows.
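For reference, a caption request against Ollama's standard /api/generate endpoint looks roughly like this sketch (it assumes the requests package is installed; the prompt and model name are placeholders):

```python
import base64
import requests  # assumes the requests package is installed

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def caption_image(image_path: str, model: str = "moondream",
                  prompt: str = "Describe this image.") -> str:
    # The generate API accepts images as base64-encoded strings.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": prompt,
            "images": [image_b64],
            "stream": False,  # request one complete JSON response
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```

Because the transport is plain HTTP, the same call works regardless of whether the server binary is named ollama or ollama.exe.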
The primary consideration for cross-platform use is ensuring that the gum command-line tool is properly installed and accessible in your system's PATH.
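One way such a script could fail fast when gum is missing (an illustrative check, not necessarily what this script does):

```python
import shutil
import sys

# Abort early with a clear message if the gum binary is not on the PATH.
if shutil.which("gum") is None:
    sys.exit("Error: 'gum' not found on PATH. Install it first (e.g., 'brew install gum').")
```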