# Video Processing Pipelines
This document describes how a football broadcast video is processed end-to-end. It focuses on the sequence of stages, their inputs/outputs, and how they run in parallel. Implementation details are intentionally omitted.
## Goals
- Identify frames/scenes that correspond to the “overhead (tactical) shot”
- Extract the in-game clock via OCR
- Detect players and ball
- Estimate camera geometry by matching field lines and computing a field-aligned transform
- Produce time-aligned outputs that downstream stages can consume (tracking, analytics, overlays)
## High-level flow
- Ingest video and standardize it (fps, resolution, codec)
- Sample frames and compute visual embeddings with a vision backbone
- Cluster frames by embeddings to identify overhead views
- In parallel, run OCR to decode the in-game clock
- In parallel, run fine-tuned object detection for players and ball
- In parallel, estimate camera pose by matching field lines and computing homography/geometry
- Synchronize all branches by timestamps and hand off to subsequent stages (e.g., tracking, analytics)
```mermaid
flowchart TD
    A[Ingest Video] --> B[Frame Sampling]
    B --> C[Vision Backbone Embeddings]
    C --> D[Clustering & Overhead View Classification]
    B --> E[OCR: In-game Clock]
    B --> F[Object Detection: Players & Ball]
    D --> G[Camera Geometry: Field Line Matching & Pose]
    F --> G
    D --> H[Select Overhead Segments]
    E --> I[Time Alignment]
    F --> I
    G --> I
    I --> J[Downstream: Tracking, Analytics, Overlays]
```
## Stage 0 — Ingestion and pre-processing

**Input**
- Single video file or directory of pre-split segments

**Process**
- Normalize fps and resolution (optional but recommended)
- Generate basic metadata (duration, frame count, fps)
- If very long, split into segments (e.g., n-minute chunks) for parallelism

**Output**
- Standardized video (or segments)
- Metadata JSON (fps, duration, segment boundaries)
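The chunking step above is just boundary arithmetic. A minimal sketch, assuming a hypothetical `segment_boundaries` helper (actual splitting would be delegated to a tool such as ffmpeg):

```python
def segment_boundaries(duration_s: float, chunk_s: float = 300.0) -> list[tuple[float, float]]:
    """Split a video of duration_s seconds into (start, end) chunks of at
    most chunk_s seconds each. Hypothetical helper for illustration."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds
```

Each boundary pair becomes a segment record in the metadata JSON, so downstream stages can be dispatched per segment.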
## Stage 1 — Frame sampling and embeddings (Backbone)

**Input**
- Standardized video or segments

**Process**
- Sample frames at a configured rate (e.g., 1–2 fps for global scene clustering)
- Compute feature embeddings with a vision backbone (e.g., ResNet/CLIP/ViT); store per-frame embedding vectors with timestamps

**Output**
- Frame index with: {timestamp, frame_id, embedding}
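The sampling step maps a target sample rate onto the source frame rate. A minimal sketch of that index computation (hypothetical helper; the embedding call itself, e.g. a ResNet/CLIP forward pass per frame, is omitted):

```python
def sample_frames(video_fps: float, duration_s: float, sample_fps: float = 2.0):
    """Yield (frame_id, timestamp) pairs at roughly sample_fps from a video
    recorded at video_fps. Embedding computation happens per yielded frame."""
    step = max(1, round(video_fps / sample_fps))
    n_frames = int(duration_s * video_fps)
    for frame_id in range(0, n_frames, step):
        yield frame_id, frame_id / video_fps
```

The yielded (frame_id, timestamp) pairs become the keys of the frame index; the embedding vector is attached to each one after the backbone runs.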
## Stage 2 — Clustering and overhead-shot classification

**Input**
- Frame embeddings

**Process**
- Cluster embeddings (e.g., k-means/DBSCAN/HDBSCAN)
- Identify cluster(s) corresponding to the overhead view (e.g., via similarity to learned prototypes or centroid-distance heuristics)
- Optionally refine with a lightweight classifier trained to distinguish overhead vs non-overhead

**Output**
- Per-frame labels: {is_overhead: bool, confidence}
- Overhead segments: contiguous time ranges when is_overhead is true
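A minimal sketch of the prototype-similarity heuristic and the segment collapse, assuming a learned prototype embedding is available; the 0.8 threshold is an assumption, not a tuned value:

```python
import numpy as np

def label_overhead(embeddings: np.ndarray, prototype: np.ndarray, threshold: float = 0.8):
    """Label frames as overhead by cosine similarity to a prototype embedding.
    Returns (is_overhead bool array, per-frame confidence scores)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    proto = prototype / np.linalg.norm(prototype)
    sims = emb @ proto
    return sims >= threshold, sims

def overhead_segments(timestamps, labels):
    """Collapse per-frame overhead labels into contiguous (start, end) ranges."""
    segs, start = [], None
    for t, lab in zip(timestamps, labels):
        if lab and start is None:
            start = t
        elif not lab and start is not None:
            segs.append((start, t))
            start = None
    if start is not None:
        segs.append((start, timestamps[-1]))
    return segs
```

The boolean labels and raw similarities map directly onto the {is_overhead, confidence} output schema; the segment list feeds the overhead-gated stages below.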
## Stage 3 — OCR for in-game clock (parallel)

**Input**
- Sampled frames (or low-rate crops of expected clock region)

**Process**
- Detect and/or crop the scoreboard/clock region
- Run OCR to decode the clock string (e.g., mm:ss or mm:ss + half)
- Apply temporal smoothing, error correction, and monotonicity constraints

**Output**
- Time series: {timestamp -> game_clock}
- Confidence scores and missing/uncertain markers
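The monotonicity constraint can be sketched in plain Python. `parse_clock` and `smooth_clock` are hypothetical helpers; a real implementation would also handle half boundaries and clock resets:

```python
def parse_clock(raw: str):
    """Parse an 'mm:ss' OCR string to seconds; None if malformed."""
    try:
        mm, ss = raw.split(":")
        return int(mm) * 60 + int(ss)
    except ValueError:
        return None

def smooth_clock(readings):
    """Enforce a non-decreasing game clock over (timestamp, raw_string)
    OCR readings, dropping malformed or backwards misreads."""
    out, last = [], None
    for ts, raw in readings:
        secs = parse_clock(raw)
        if secs is None or (last is not None and secs < last):
            continue  # skip OCR errors rather than emit a bad reading
        out.append((ts, secs))
        last = secs
    return out
```

Dropped readings become the "missing/uncertain markers" in the output time series rather than being interpolated silently.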
## Stage 4 — Object detection (players and ball) (parallel)

**Input**
- Frames (full-res or downscaled); ideally restricted to overhead segments for efficiency

**Process**
- Run a fine-tuned detection model (e.g., YOLO) for classes: player, ball, referee, goalkeeper
- Produce bounding boxes with confidence and class
- Optional: instance appearance descriptors for downstream tracking

**Output**
- Detections per frame: [{bbox, class, conf, timestamp}]
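A sketch of post-processing on the raw detector output, under the assumption that detections carry a frame_id and that at most one ball is on the pitch (the 0.5 confidence threshold is an assumption):

```python
from collections import defaultdict

def group_detections(dets, min_conf=0.5):
    """Group raw detections [{frame_id, bbox, class, conf, ...}] by frame,
    dropping low-confidence boxes and keeping at most one ball per frame."""
    by_frame = defaultdict(list)
    for d in dets:
        if d["conf"] >= min_conf:
            by_frame[d["frame_id"]].append(d)
    for frame_id, boxes in by_frame.items():
        balls = [b for b in boxes if b["class"] == "ball"]
        if len(balls) > 1:  # keep only the highest-confidence ball box
            best = max(balls, key=lambda b: b["conf"])
            by_frame[frame_id] = [b for b in boxes if b["class"] != "ball"] + [best]
    return dict(by_frame)
```

The per-frame lists are what gets written to the detections output and later joined by timestamp in Stage 6.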
## Stage 5 — Camera geometry (parallel)

**Input**
- Frames (preferably those labeled as overhead)

**Process**
- Detect field lines, circles, arcs (e.g., Hough + learned line segments)
- Match against a canonical pitch model to estimate homography and camera pose
- Use RANSAC/robust fitting to handle outliers
- Optionally compute per-frame transforms to normalized field coordinates

**Output**
- Per-frame geometry: {H (image->field), inliers, quality metrics}
- Camera pose over time, when available
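The core homography fit can be sketched with the standard DLT algorithm. This is a minimal version: point normalization and the RANSAC loop over line/keypoint correspondences, which a production pipeline would need, are omitted:

```python
import numpy as np

def fit_homography(src_pts, dst_pts):
    """Estimate the 3x3 homography H mapping image points (src) to field
    coordinates (dst) from >= 4 correspondences via DLT + SVD."""
    A = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null-space vector of A: last row of V^T.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]
```

Applying H to a homogeneous image point and dividing by the third coordinate yields normalized field coordinates, which is the per-frame transform stored in the output.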
## Stage 6 — Synchronization and handoff

**Input**
- Overhead labels, OCR time series, detections, camera geometry

**Process**
- Align by timestamps and segment boundaries
- Resolve conflicts and propagate uncertainties (e.g., prefer overhead-only outputs for geometry-dependent tasks)
- Produce unified, time-indexed records for downstream steps

**Output**
- Consolidated timeline with: {game_clock, is_overhead, detections, geometry}
- Ready for tracking, identity stitching, team assignment, event detection, analytics
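The timestamp join can be sketched as a nearest-sample lookup with a tolerance; samples with no close match stay None, which is how missing data propagates into the consolidated timeline (the 0.5 s tolerance is an assumption):

```python
import bisect

def nearest(series_ts, series_val, t, tol=0.5):
    """Return the value in a time series (sorted timestamps series_ts,
    parallel values series_val) closest to t, within tol seconds; None
    when no sample is close enough."""
    i = bisect.bisect_left(series_ts, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(series_ts)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(series_ts[j] - t))
    return series_val[best] if abs(series_ts[best] - t) <= tol else None
```

Running this lookup per branch (OCR, detections, geometry) at each master timestamp produces the unified {game_clock, is_overhead, detections, geometry} records.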
## Data model (proposed)

- video.json
  - fps, duration, segments
- frames.parquet
  - frame_id, timestamp, is_overhead, overhead_conf
- embeddings.parquet
  - frame_id, embedding (vector)
- ocr_clock.parquet
  - timestamp, game_clock, conf
- detections.parquet
  - frame_id, timestamp, class, bbox (xywh), conf
- geometry.parquet
  - frame_id, timestamp, H (3x3, flattened), quality, inliers
## Processing of a new video (example)

1) Ingest video; optionally split into N-minute segments for parallelism
2) Sample frames at 1–2 fps; compute embeddings and cluster to label overhead frames
3) In parallel:
   - OCR: read the in-game clock from each sampled frame and smooth over time
   - Object detection: run the player/ball detector; store per-frame boxes
   - Camera geometry: on overhead frames, fit homography to the pitch model and store transforms
4) Synchronize by timestamps; emit a consolidated timeline for downstream tasks (tracking, analytics, overlays)
## Parallelism and orchestration
- The pipeline branches from sampled frames into three parallel paths: OCR, detection, and geometry
- Overhead classification can gate geometry estimation and optionally gate detection to save compute
- Branch results are joined by timestamp; missing data is handled gracefully with confidence/quality fields
- Orchestration can be implemented with a workflow engine (e.g., Metaflow) to schedule parallel steps and joins
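The fan-out/join pattern described above can be sketched with the standard library; a real deployment would use a workflow engine such as Metaflow, and the three branch callables here are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def run_branches(frames, ocr_fn, detect_fn, geometry_fn):
    """Fan the same sampled frames out to the three parallel branches and
    join their results into one dict keyed by branch name."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "ocr": pool.submit(ocr_fn, frames),
            "detections": pool.submit(detect_fn, frames),
            "geometry": pool.submit(geometry_fn, frames),
        }
        # result() blocks until each branch finishes: this is the join point.
        return {name: f.result() for name, f in futures.items()}
```

A workflow engine adds what this sketch lacks: retries, per-step resource requests, and persistence of intermediate artifacts between steps.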
## Quality, monitoring, and fallback
- Persist per-stage quality metrics (OCR confidence, clustering silhouette score, geometry inlier ratios)
- Define thresholds and fallbacks (e.g., if geometry fails, skip field-projected analytics for that window)
- Log timing and throughput for each stage to guide optimization
## Outputs for downstream stages
- Tracking: use detections + embeddings to produce identities and trajectories
- Team assignment: color/cluster-based team ID + goalie detection
- Event detection: possession changes, shots, goals, set-pieces
- Visual overlays: project detections onto a canonical field view using H
This document will evolve as models and contracts stabilize. The focus here is the data flow and stage responsibilities to enable independent development and testing of each component.