Data Formats

This document describes the expected directory layouts for raw match data and Roboflow training datasets, and the environment variables required by the training scripts.

Raw match layout

Match data lives under data/ and is organised by match name:

data/
└── <match_name>/               # e.g. arsenal_mancity
    ├── original_video/         # source video file(s) for the match
    │   └── *.mp4               # one file per half / recording chunk
    └── full_video_frames/      # frames extracted from the source video
        └── *.jpg               # one image per extracted frame (default JPEG)
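
As a rough illustration, the layout above can be checked before running any scripts. The sketch below is not part of footy_track: `check_match_layout` is a hypothetical helper name, and it assumes only the two subdirectories shown in the tree.

```python
from pathlib import Path


def check_match_layout(data_root: Path, match_name: str) -> list[str]:
    """Return a list of problems; an empty list means the layout looks right."""
    match_dir = data_root / match_name
    video_dir = match_dir / "original_video"
    frames_dir = match_dir / "full_video_frames"
    problems = []

    # The source recordings must exist before frames can be extracted.
    if not video_dir.is_dir():
        problems.append(f"missing directory: {video_dir}")
    elif not any(video_dir.glob("*.mp4")):
        problems.append(f"no .mp4 files in {video_dir}")

    # Only the presence of the frames directory is checked, not its contents.
    if not frames_dir.is_dir():
        problems.append(f"missing directory: {frames_dir}")

    return problems
```

Calling `check_match_layout(Path("data"), "arsenal_mancity")` returns an empty list when both directories exist and at least one .mp4 file is present.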

`original_video/`

Contains the raw .mp4 recording(s) for a single match. Long recordings should be split into chunks first using split_video.py (see src/footy_track/scripts/README.md); the resulting parts are stored here.

`full_video_frames/`

Contains the per-frame images produced by extract_frames.py. By default, frames are sampled at 1 FPS and saved as JPEG; PNG and WEBP output are also supported.

# Extract frames from a source video into the expected directory
uv run footy-track-extract-frames \
    data/arsenal_mancity/original_video/match.mp4 \
    --outdir data/arsenal_mancity/full_video_frames

Frames can optionally carry EXIF metadata (--embed-metadata):

EXIF tag	Content
`video_id`	Match identifier string
`frame_index`	Zero-based frame number
`timestamp_seconds`	Position of the frame in the source video, in seconds

Matches used in practice (referenced in notebooks):

Match	Directory name
Arsenal vs Manchester City	`arsenal_mancity`
Arsenal vs Norwich	`arsenal_norwich`
Arsenal vs Aston Villa	`arsenal_astonvilla`
Arsenal vs Bournemouth (1st half)	`arsenal_bournmouth_1st_half`
Arsenal vs Bournemouth (2nd half)	`arsenal_bournmouth_2nd_half`

Roboflow dataset layout

Training datasets are downloaded from Roboflow into data/ when the training scripts run. The layout differs between the two model types.

Detection dataset

Roboflow workspace: egroeg121
Project: footy-track-detection
Download path: data/detection_dataset/roboflow_dataset_<version>/

data/detection_dataset/
└── roboflow_dataset_<version>/     # e.g. roboflow_dataset_3
    ├── data.yaml                   # YOLO dataset config (paths + class names)
    ├── train/
    │   ├── images/                 # training images
    │   └── labels/                 # YOLO .txt annotations (normalised xywh)
    └── val/
        ├── images/                 # validation images
        └── labels/

Classes (7): player, player_sub, coach, referee, keeper, in_play_ball, person

Current default version: 3 (162 train / 49 val images).
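
The `labels/` files use the standard YOLO text format: one line per object, written as `class_id x_center y_center width height`, with coordinates normalised to the image size. A minimal sketch of converting one such line to pixel corners follows. `PixelBox` and `yolo_line_to_pixel_box` are illustrative names, not part of the training scripts, and the mapping from class index to class name should be read from data.yaml rather than assumed from the order listed above.

```python
from dataclasses import dataclass


@dataclass
class PixelBox:
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def yolo_line_to_pixel_box(line: str, image_width: int, image_height: int) -> PixelBox:
    """Convert one normalised YOLO xywh annotation line to pixel corner coordinates."""
    class_id, x_center, y_center, width, height = line.split()

    # Scale the normalised centre and size back to pixels.
    x_c = float(x_center) * image_width
    y_c = float(y_center) * image_height
    w = float(width) * image_width
    h = float(height) * image_height

    # YOLO stores the box centre; subtract and add half the size to get corners.
    return PixelBox(int(class_id), x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)
```

For example, the line `3 0.5 0.5 0.25 0.5` on a 1920x1080 image describes a box spanning x 720 to 1200 and y 270 to 810.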

Classifier dataset

Roboflow workspace: egroeg121
Project: footy-track-broadcast-frame
Download path: data/classifier_dataset/roboflow_dataset_<version>/

data/classifier_dataset/
└── roboflow_dataset_<version>/     # e.g. roboflow_dataset_10
    ├── train/
    │   ├── yes/                    # broadcast / in-play frames
    │   └── no/                     # non-broadcast frames
    ├── val/
    │   ├── yes/
    │   └── no/
    └── test/
        ├── yes/
        └── no/

Classes (2): yes (broadcast frame), no (non-broadcast frame)

Current default version: 10.
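
Because the classifier dataset encodes labels purely through folder names, a quick count per split and class is a useful sanity check after download. The sketch below assumes only the folder structure shown above; `count_classifier_images` is a hypothetical helper, not something provided by the training scripts.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")
CLASSES = ("yes", "no")


def count_classifier_images(dataset_dir: Path) -> dict[str, dict[str, int]]:
    """Count files in every <split>/<class> folder; missing folders count as 0."""
    counts: dict[str, dict[str, int]] = {}
    for split in SPLITS:
        counts[split] = {}
        for label in CLASSES:
            class_dir = dataset_dir / split / label
            if class_dir.is_dir():
                counts[split][label] = sum(1 for p in class_dir.iterdir() if p.is_file())
            else:
                counts[split][label] = 0
    return counts
```

A heavily skewed yes/no ratio in the result, or a zero for any folder, is worth investigating before training.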

Environment variables

Variable	Required	Description
`ROBOFLOW_API_KEY`	Yes	Roboflow API key. Used by all training scripts and upload utilities to authenticate against the Roboflow API. Obtain it from your Roboflow workspace settings.

The Roboflow SDK also reads the key from ~/.config/roboflow/config.json if you have previously run roboflow login.
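
A pre-flight check along these lines can catch missing credentials before a long training run starts. This is a sketch under stated assumptions, not how the training scripts actually resolve the key: `roboflow_credential_source` is a hypothetical helper, and it only tests that the config file exists without inspecting its contents.

```python
import os
from collections.abc import Mapping
from pathlib import Path

# Location the text above gives for the file written by `roboflow login`.
DEFAULT_CONFIG = Path.home() / ".config" / "roboflow" / "config.json"


def roboflow_credential_source(
    environ: Mapping[str, str] = os.environ,
    config_path: Path = DEFAULT_CONFIG,
) -> str:
    """Return "env" or "config" depending on where a Roboflow credential was found."""
    # The environment variable takes priority in this sketch (an assumption).
    if environ.get("ROBOFLOW_API_KEY"):
        return "env"
    if config_path.is_file():
        return "config"
    raise RuntimeError("Set ROBOFLOW_API_KEY or run `roboflow login` first.")
```

Calling `roboflow_credential_source()` with no arguments inspects the current process environment and the default config path.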

Training metrics are logged to Weights & Biases when W&B is authenticated, either by running wandb login or by setting the standard WANDB_API_KEY environment variable, which the W&B SDK reads automatically. No other W&B-specific variable is required.

W&B projects

Script	W&B project
`train_object_detector.py`	`footy_scan_detection`
`train_classifier.py`	`footy_scan_classifier`