Command-line tool for extracting DINO, CLIP and SigLIP features from images and videos

🦕 DINOtool

DINOtool is a command-line tool for extracting visual features from images and videos using modern vision models like DINOv2, CLIP, SigLIP2, and OpenCLIP/timm compatible models. It supports both global (frame-level) and local (patch-level) features, and can optionally visualize feature maps using PCA.

uvx dinotool test.jpg -o out.jpg

✨ Features

- Works with:
  - 📷 Single images
  - 🎞️ Video files
  - 📁 Folders of images
- 🧠 Supports multiple model backends:
  - DINOv2 (default)
  - SigLIP2, CLIP, and any timm/OpenCLIP model
- 💾 Outputs standard formats:
  - .parquet (flat/global features)
  - .zarr / .nc (spatial patch features)
  - .jpg / .mp4 with visualizations
- 🌈 Optional PCA-based side-by-side visualizations
- ⚡ Simple CLI with no coding required

👤 Who is DINOtool for?

DINOtool is designed for:

- Researchers exploring vision models or needing feature extraction for experiments
- Data scientists working with image/video datasets for tasks like clustering, retrieval, or classification
- Developers who want to use DINO, CLIP, or SigLIP2 features without writing model code
- Students and educators looking to visualize and understand patch-based ViT features
- Anyone who wants to preprocess media into standardized visual features for downstream ML tasks, without building a custom pipeline

✨ Examples:

dinotool input.mp4 -o output.mp4

produces output:

[Figure: Video example]

DINOv2 accepts inputs of any size, whereas the OpenCLIP/timm models resize the input. Here is an example with an 896x896 image:

dinotool test/data/bird1.jpg -o dinov2.jpg --model-name vit-b # Shortcut to dinov2_vitb14_reg
dinotool test/data/bird1.jpg -o siglip2.jpg --model-name siglip2 # Shortcut to hf-hub:timm/ViT-B-16-SigLIP2-512

produces outputs (DINOv2 / SigLIP2):

[Figure: DINOv2 and SigLIP2 outputs]

Global features for image folders:

Processing image directories and extracting global or local features for each image is easy with DINOtool:

dinotool image_folder/ -o global_features --save-features 'frame'

produces a global_features.parquet file with global features:

| filename | feature_0 | feature_1 | feature_2 | ... | feature_383 |
|---|---|---|---|---|---|
| `cat_001.jpg` | 0.123 | -0.045 | 0.211 | ... | 0.009 |
| `dog_002.jpg` | 0.097 | 0.033 | 0.187 | ... | -0.012 |
| `tree_003.jpg` | -0.056 | 0.140 | 0.092 | ... | 0.034 |
| `car_004.jpg` | 0.301 | -0.202 | 0.144 | ... | -0.019 |

Similar files can also be produced for local patch features, for videos, and so on.
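A file like this can be read with any Parquet reader. The sketch below shows one possible way to rank images by cosine similarity. It assumes the column layout from the example above (a `filename` column plus `feature_*` columns), which may differ in your DINOtool version, and that pandas with a Parquet engine such as pyarrow is installed. The helper names are hypothetical and not part of DINOtool.

```python
import math


def load_global_features(path):
    """Load a global-feature Parquet file into a {filename: vector} dict.

    Assumption: a 'filename' column and 'feature_*' columns, as in the
    example table. Adjust the column names if your file differs.
    """
    import pandas as pd  # imported lazily; needs pandas plus pyarrow or fastparquet

    df = pd.read_parquet(path)
    feature_cols = [c for c in df.columns if str(c).startswith("feature_")]
    return {
        row["filename"]: [float(row[c]) for c in feature_cols]
        for _, row in df.iterrows()
    }


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(features, query_name, top_k=3):
    """Return the top_k (filename, similarity) pairs closest to query_name."""
    query = features[query_name]
    scores = [
        (name, cosine_similarity(query, vec))
        for name, vec in features.items()
        if name != query_name
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]
```

With a real file, usage might look like `most_similar(load_global_features("global_features.parquet"), "cat_001.jpg")`.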

More examples:

More example commands can be found in test/test_cases.md.

Examples of reading the output file formats are in docs/reading_outputs.ipynb.

docs/masked_pca_demo.ipynb shows a PCA feature visualization that first masks objects using the first PCA features, similar to the DINOv2 demos:

[Figure: Masked PCA visualization]

📦 Installation

Basic install (Linux/WSL2)

If you do not have ffmpeg installed:

sudo apt install ffmpeg

Install via pip:

pip install dinotool

You can check that dinotool is properly installed by testing it on an image:

dinotool test.jpg -o out.jpg

`uv`

If you have uv installed, you can simply run DINOtool with

uvx dinotool test.jpg -o out.jpg

ffmpeg still needs to be installed.

🐍 Conda Environment (Recommended)

For an isolated setup, which is especially useful for managing ffmpeg and other dependencies, install Miniforge and then run:

conda create -n dinotool python=3.12
conda activate dinotool
conda install -c conda-forge ffmpeg
pip install dinotool

Windows notes:

Windows is supported only for CPU usage. If you want GPU support on Windows, we recommend using WSL2 + Ubuntu.
The conda method above is recommended for Windows CPU setups.

🚀 Basic usage

📷 Single images

Extract and visualize DINO features from an image:

dinotool input.jpg -o output.jpg

This produces a .jpg similar to the examples above.

For an easy-to-process Parquet file of the local features without visualization, run

dinotool input.jpg -o out_features --save-features 'flat' --no-vis

🎞️ Video:

Extract global features from a video using SigLIP2:

dinotool input.mp4 -o features --model-name siglip2 --save-features frame

This produces a features.parquet file with a row for each frame of the video.

📁 Folder of Images (or folders of video frames)

Process a folder of images with patch-level output:

dinotool images/ -o results --save-features full

This produces a folder named results containing a visualization .jpg and a NetCDF file for each image.

If the images in the folder can be resized to a fixed size, you can use batch processing by setting a fixed input size (--input-size W H) together with --no-vis:

dinotool images/ -o results2 --save-features 'frame' --input-size 512 512 --batch-size 4 --no-vis

This produces a parquet file with global features for each image.

💾 Feature extraction options

Use --save-features to export features for downstream tasks.

| Mode | Format | Output shape | Best for |
|---|---|---|---|
| `full` | `.nc` (image) / `.zarr` (video, batched image folders) | `(frames, height, width, feature)` | Keeps spatial structure of patches. |
| `flat` | partitioned `.parquet` | `(frames * height * width, feature)` | Reliable long video processing. Faster patch-level analysis. |
| `frame` | `.parquet` | `(frames, feature)` | One global feature vector per frame. |
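To make the output shapes concrete, the following sketch computes them for a given input size. It assumes a patch size of 14 (as in the `dinov2_*14` model names), a feature dimension of 384 (matching the 384 feature columns in the global-feature example above), and that sizes not divisible by the patch size are rounded down. DINOtool's actual handling of such sizes may differ, and `output_shapes` is a hypothetical helper.

```python
def output_shapes(frames, width, height, patch_size=14, feature_dim=384):
    """Return the expected array shape for each --save-features mode."""
    # Number of patches along each axis (assumption: partial patches are dropped).
    grid_h = height // patch_size
    grid_w = width // patch_size
    return {
        "full": (frames, grid_h, grid_w, feature_dim),
        "flat": (frames * grid_h * grid_w, feature_dim),
        "frame": (frames, feature_dim),
    }


# Example: a 100-frame video at 448x224 pixels gives a 16x32 patch grid.
for mode, shape in output_shapes(frames=100, width=448, height=224).items():
    print(mode, shape)
```

The `flat` row count grows with both the frame count and the patch grid, which is why a fixed, smaller `--input-size` can reduce output size considerably.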

`full` - Spatial local features

- Saves full patch feature maps from the ViT (one vector per image patch).
- Useful for reconstructing spatial attention maps or for downstream tasks like segmentation.
- Stored as netCDF for single images and as .zarr for video sequences.
- Saving to .zarr can be memory-intensive and might still fail for large videos.

dinotool input.mp4 -o output.mp4 --save-features full

`flat` - Flattened local features

- Saves the same vectors as above, but discards the 2D spatial layout and saves the output in Parquet format.
- More reliable for longer videos.
- Useful for faster computation of statistics, patch-level similarity, and clustering.
- For a single image input, saves a .parquet file with one row per patch.
- For video inputs, saves a partitioned .parquet directory with indices for frames and patches.

dinotool input.mp4 -o output.mp4 --save-features flat

`frame` - Global features

- Saves one global feature vector per frame/image.
- Useful for temporal tasks and for creating vector databases.
- For a single image input, saves a .txt file with a single vector.
- For image folder and video input, saves a .parquet file with one row per frame/image.

# For a video
dinotool input.mp4 -o output.mp4 --save-features frame

# For an image
dinotool input.jpg -o output.jpg --save-features frame

The output is a side-by-side visualization with PCA of the patch-level features.

🧪 Additional Options

`--model-name`

By default, the value passed to this argument is loaded from facebookresearch/dinov2, meaning that the possible DINOv2 models are:

dinov2_vits14
dinov2_vitb14
dinov2_vitl14
dinov2_vitg14

and their reg variants (recommended), e.g. dinov2_vits14_reg.

See the DINOv2 GitHub repo for more information.

OpenCLIP models:

DINOtool now also supports ViT models that follow the OpenCLIP/timm model API for feature extraction, for example the SigLIP2 models on the Hugging Face Hub. Other models on the Hub, including SigLIP and CLIP models, should also work but have not been fully tested.

The OpenCLIP/timm model name has to be passed in the format hf-hub:timm/<model name>.

Shortcuts: Popular models have predefined shortcuts that can be passed to --model-name:

# DINOv2
"vit-s": "dinov2_vits14_reg"
"vit-b": "dinov2_vitb14_reg"
"vit-l": "dinov2_vitl14_reg"
"vit-g": "dinov2_vitg14_reg"

# SigLIP2
"siglip2": "hf-hub:timm/ViT-B-16-SigLIP2-512"
"siglip2-so400m-384": "hf-hub:timm/ViT-SO400M-16-SigLIP2-384"
"siglip2-so400m-512": "hf-hub:timm/ViT-SO400M-16-SigLIP2-512"
"siglip2-b16-256": "hf-hub:timm/ViT-B-16-SigLIP2-256"
"siglip2-b16-512": "hf-hub:timm/ViT-B-16-SigLIP2-512"
"siglip2-b32-256": "hf-hub:timm/ViT-B-32-SigLIP2-256"
"siglip2-b32-512": "hf-hub:timm/ViT-B-32-SigLIP2-512"

# CLIP
"clip": "hf-hub:timm/vit_base_patch16_clip_224.openai"
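Conceptually, a shortcut is an alias that expands to a full model name before the model is loaded. The sketch below mirrors a subset of the shortcuts listed above. `resolve_model_name` and `backend_for` are hypothetical helpers for illustration, and the rule that an `hf-hub:` prefix selects the OpenCLIP/timm backend is an assumption based on the naming format described above, not DINOtool's actual code.

```python
# Subset of the documented shortcuts, copied from the list above.
MODEL_SHORTCUTS = {
    "vit-s": "dinov2_vits14_reg",
    "vit-b": "dinov2_vitb14_reg",
    "siglip2": "hf-hub:timm/ViT-B-16-SigLIP2-512",
    "clip": "hf-hub:timm/vit_base_patch16_clip_224.openai",
}


def resolve_model_name(name):
    """Expand a shortcut to its full model name; pass other names through."""
    return MODEL_SHORTCUTS.get(name, name)


def backend_for(name):
    """Guess the backend from the resolved name (assumed rule, see lead-in)."""
    full_name = resolve_model_name(name)
    return "openclip/timm" if full_name.startswith("hf-hub:") else "dinov2"


for requested in ("vit-b", "siglip2", "dinov2_vitl14"):
    print(requested, "->", resolve_model_name(requested), f"[{backend_for(requested)}]")
```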

`--input-size`

Setting input size fixes the resolution for all inputs. This is useful for processing HD videos, and mandatory for batch processing of image folders.

# Processing an HD video faster:
dinotool input.mp4 -o output.mp4 --input-size 920 540 --batch-size 16

`--batch-size`

For faster processing, set the batch size as large as your GPU memory allows. Batch processing is possible for video files and for directories of video frames, where all inputs are assumed to be the same size. Frame file names must be convertible to integers, like 00001.jpg.

dinotool input.mp4 -o output.mp4 --batch-size 16

For batch processing of image folders, --input-size must be set, and visualization is not available.
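Before batch processing a directory of video frames, it can help to check that every file name follows the integer naming scheme. The sketch below is a standalone pre-check, not DINOtool's own validation logic (which is not documented here); `frame_number` and `sort_frames` are hypothetical helpers.

```python
from pathlib import Path


def frame_number(name):
    """Return the integer encoded in a frame file name such as '00001.jpg'."""
    stem = Path(name).stem
    # Accept only plain ASCII digits, so names like 'frame_1' are rejected.
    if not (stem.isascii() and stem.isdigit()):
        raise ValueError(f"{name!r} cannot be converted to an integer frame number")
    return int(stem)


def sort_frames(names):
    """Sort frame file names numerically, failing on any non-integer name."""
    return sorted(names, key=frame_number)


print(sort_frames(["00010.jpg", "00002.jpg", "00001.jpg"]))
```

For a real folder, the names could come from `[p.name for p in Path("frames/").iterdir()]`.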

🧑‍💻 Usage reference

🦕 DINOtool: Extract and visualize ViT features from images and videos.

Usage:
  dinotool input_path -o output_path [options]

Arguments:
  input                   Path to image, video file, or folder of frames.
  -o, --output            Path for the output (required).

Options:
  -s, --save-features MODE    Save extracted features: full, flat, or frame
  -m, --model-name MODEL      Model to use (default: dinov2_vits14_reg)
  --input-size W H            Resize input before processing. Must be set for batch
                              processing of image folders
  -b, --batch-size N          Batch size for faster processing
  --only-pca                  Only visualize PCA features.
  --no-vis                    Only output features with no visualization.
                              --save-features must be set.
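When DINOtool is driven from a script, the command line can be assembled programmatically. The sketch below builds an argument list using only the flags from the usage reference above. `build_dinotool_command` is a hypothetical helper, and its validation simply restates the documented constraints rather than reproducing DINOtool's own checks.

```python
import shlex

SAVE_MODES = ("full", "flat", "frame")


def build_dinotool_command(input_path, output_path, save_features=None,
                           model_name=None, input_size=None, batch_size=None,
                           no_vis=False):
    """Build an argv list for the dinotool CLI from the documented options."""
    if save_features is not None and save_features not in SAVE_MODES:
        raise ValueError(f"save_features must be one of {SAVE_MODES}")
    if no_vis and save_features is None:
        raise ValueError("--no-vis requires --save-features to be set")

    cmd = ["dinotool", str(input_path), "-o", str(output_path)]
    if save_features is not None:
        cmd += ["--save-features", save_features]
    if model_name is not None:
        cmd += ["--model-name", model_name]
    if input_size is not None:
        width, height = input_size
        cmd += ["--input-size", str(width), str(height)]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if no_vis:
        cmd.append("--no-vis")
    return cmd


cmd = build_dinotool_command("images/", "results2", save_features="frame",
                             input_size=(512, 512), batch_size=4, no_vis=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once dinotool is installed.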

Release history

- 0.3.0: Oct 1, 2025
- 0.2.2: Jun 23, 2025
- 0.2.1: Jun 18, 2025 (this version)
- 0.2.0: Jun 18, 2025
- 0.1.1: Apr 7, 2025
- 0.1.0: Apr 7, 2025
