How I Built Video Token Optimization for Vision LLMs: Cutting Costs 13-45% with Frame Dedup + Scene Detection

via Dev.to Python, by Pritom Mazumdar

A few weeks ago I launched Token0 -- an open-source proxy that optimizes images before they hit vision LLMs like GPT-4o, Claude, and Ollama models. The reception was good, so I kept building. The most requested feature was video. If images are expensive, video is brutal: every second at 30fps is 30 images. This post covers how I built the video optimization pipeline, what I learned benchmarking it across 5 models, and the model-aware edge case that nearly broke everything.

The Problem with Naive Video

Most apps that analyze video do one of two things:

- Extract frames at 1fps and send every one of them
- Send a handful of manually selected keyframes

Both approaches waste tokens in predictable ways. At 1fps on a 60-second product demo video:

- You get 60 frames
- Frames 1-29 of the same talking head are near-identical (Hamming distance < 10 between perceptual hashes)
- The only frames with unique information are at scene transitions

You're paying for 60 images when 8-12 contain all the information...
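The frame dedup described above can be sketched in a few lines. This is not Token0's actual implementation, just a minimal illustration of the idea: compute a perceptual hash per frame (here a simple difference hash over a downsampled grayscale grid; a real pipeline would use a library like imagehash or OpenCV for resampling), then keep a frame only when its hash is more than ~10 bits away from the last kept frame.

```python
import numpy as np

def dhash(frame: np.ndarray, size: int = 8) -> int:
    """Difference hash: downsample a grayscale frame to a (size, size+1)
    grid of block means, then set one bit per adjacent-column comparison."""
    h, w = frame.shape
    rows = np.linspace(0, h, size + 1, dtype=int)
    cols = np.linspace(0, w, size + 2, dtype=int)
    small = np.array([[frame[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].mean()
                       for j in range(size + 1)] for i in range(size)])
    bits = (small[:, 1:] > small[:, :-1]).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedup_frames(frames, threshold: int = 10):
    """Return indices of frames to keep: a frame survives only if its hash
    differs from the last kept frame's hash by more than `threshold` bits."""
    kept, last = [], None
    for i, frame in enumerate(frames):
        h = dhash(frame)
        if last is None or hamming(h, last) > threshold:
            kept.append(i)
            last = h
    return kept
```

On the 60-second example above, a sequence of near-identical talking-head frames collapses to its first occurrence, and only frames across a scene transition (hash distance well above the threshold) survive.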

Continue reading on Dev.to Python
