Patch-ioner: A Unified Zero-Shot Captioning Framework

Patch-ioner

"One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework" 💍

Official repository containing the code for the paper "One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework".

🧩 Installation

You can install Patch-ioner directly from GitHub using pip:

pip install git+https://github.com/Ruggero1912/Patch-ioner

🚀 Loading a Pretrained Model

You can easily load a pretrained model from Hugging Face using the following API:

from patchioner import Patchioner

MODEL_ID = "Ruggero1912/Patch-ioner_talk2dino_decap_COCO_Captions"

model = Patchioner.from_config(MODEL_ID)

Patch-ioner models can also be loaded with AutoModel.from_pretrained from the transformers library:

from transformers import AutoModel

MODEL_ID = "Ruggero1912/Patch-ioner_talk2dino_decap_COCO_Captions"

model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

You can browse all models in the Patch-ioner Models Collection:

Model Name	Description / Variant	Hugging Face Link
`Ruggero1912/Patch-ioner_talk2dino_decap_COCO_Captions`	Talk2DINO + DeCap variant trained on COCO	🔗
`Ruggero1912/Patch-ioner_talk2dino_capdec_COCO_Captions`	Talk2DINO + CapDec variant trained on COCO	🔗
`Ruggero1912/Patch-ioner_talk2dino_Viecap_COCO_Captions`	Talk2DINO + ViECap variant trained on COCO	🔗
`Ruggero1912/Patch-ioner_talk2dino_Meacap_COCO_Captions`	Talk2DINO + MeaCap variant trained on COCO	🔗
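
As a small convenience, the model IDs in the table can be kept in a list and turned into Hub page URLs. This is only a sketch: `hub_url` and `variant_of` are hypothetical helpers, not part of the `patchioner` package, and `variant_of` assumes the naming pattern visible in the four IDs above.

```python
# Model IDs copied from the table above.
MODEL_IDS = [
    "Ruggero1912/Patch-ioner_talk2dino_decap_COCO_Captions",
    "Ruggero1912/Patch-ioner_talk2dino_capdec_COCO_Captions",
    "Ruggero1912/Patch-ioner_talk2dino_Viecap_COCO_Captions",
    "Ruggero1912/Patch-ioner_talk2dino_Meacap_COCO_Captions",
]


def hub_url(model_id: str) -> str:
    """Return the Hugging Face Hub page for a model ID (standard Hub URL layout)."""
    return f"https://huggingface.co/{model_id}"


def variant_of(model_id: str) -> str:
    """Extract the decoder variant, e.g. 'decap' (assumes the naming pattern above)."""
    name = model_id.split("/", 1)[1]      # drop the "Ruggero1912/" owner prefix
    return name.split("_")[2].lower()     # third underscore-separated field


for model_id in MODEL_IDS:
    print(variant_of(model_id), hub_url(model_id))
```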

Trace Captioning Dataset Test Splits

The trace captioning dataset test splits are based on the Localized Narratives datasets for COCO and Flickr30k. These datasets are available inside this repository in the eval-trace-captioning folder.

Available Datasets

trace_capt_coco_test.json: This dataset contains trace captioning test splits based on the COCO dataset.
trace_capt_flickr30k_test.json: This dataset contains trace captioning test splits based on the Flickr30k dataset.

Dataset Description

The trace captioning datasets are derived from the Localized Narratives annotations, which provide detailed descriptions of images along with the corresponding mouse traces. These traces indicate the sequence in which different parts of the image are described, providing a rich source of information for training and evaluating captioning models.
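
The JSON schema of the test splits is not described here, so a reasonable first step is to inspect a file before writing any loader. The sketch below makes no assumption about field names; `summarize_json` is a hypothetical helper that only reports the top-level structure.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level structure without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        summary["keys"] = sorted(data.keys())
    elif isinstance(data, list):
        summary["length"] = len(data)
        # Peek at the first record only if it is a JSON object.
        if data and isinstance(data[0], dict):
            summary["first_item_keys"] = sorted(data[0].keys())
    return summary
```

For example, `summarize_json("eval-trace-captioning/trace_capt_coco_test.json")` should show whether the split is a list of records or a dictionary, and which keys each record carries.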

Quantitative Experiments

The repository includes code to run quantitative experiments on various captioning tasks. You can find the relevant code in the following folders:

eval-trace-captioning: For evaluating trace captioning tasks.
eval-dense-captioning: For evaluating dense captioning tasks.
eval-region-set-captioning: For evaluating region set captioning tasks.
eval-image-captioning: For evaluating image captioning tasks.

Evaluating the Model on Trace Captioning

To evaluate the model on the trace captioning task, use the eval_trace_captioning.py script. This script runs the model on a specified dataset and computes relevant evaluation metrics.

Running the Evaluation

To perform the evaluation, use the following command:

python eval_trace_captioning.py --model_name <MODEL_NAME> \
                                --evaluation_dataset <DATASET_PATH> \
                                --batch_size 16 \
                                --device cuda

Replace <MODEL_NAME> with the name of the model and <DATASET_PATH> with the path to the dataset.
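
If you prefer to drive the evaluation from Python (for instance to loop over several models), the same command can be assembled as an argument list. This is a minimal sketch: `build_eval_command` is a hypothetical helper, and running from inside the eval-trace-captioning folder is an assumption based on the folder layout described above.

```python
import shlex
import sys


def build_eval_command(script, model_name, evaluation_dataset,
                       batch_size=16, device="cuda", extra_flags=()):
    """Assemble the argument list for one of the evaluation scripts."""
    cmd = [
        sys.executable, script,
        "--model_name", model_name,
        "--evaluation_dataset", str(evaluation_dataset),
        "--batch_size", str(batch_size),
        "--device", device,
    ]
    cmd.extend(extra_flags)  # e.g. ["--use_gaussian_weighting"]
    return cmd


cmd = build_eval_command(
    "eval_trace_captioning.py", "mlp.k", "data/trace_captioning.json",
    batch_size=32,
    extra_flags=["--use_gaussian_weighting", "--gaussian_variance", "0.8"],
)
print(shlex.join(cmd[1:]))  # show the command without the interpreter path
# To launch it (assumed working directory):
#   subprocess.run(cmd, check=True, cwd="eval-trace-captioning")
```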

Available Options

--model_name (str, required): The name of the model to evaluate.
--evaluation_dataset (str, required): Path to the dataset used for evaluation.
--batch_size (int, default=16): Number of samples per batch during evaluation.
--device (str, default='cuda' if available): The device to run the evaluation on (cuda or cpu).
--use_gaussian_weighting (flag): If set, applies Gaussian weighting to the captions.
--gaussian_variance (float, default=1.0): Sets the variance for Gaussian weighting.
--keep_img_ratio (flag): Maintains the image aspect ratio when resizing.
--caption_bboxes_type (str, default=None): Specifies the type of bounding boxes for captions.
--use_attention_weighting (flag): If set, weights patches using the attention map.
--keep_n_best_sims (int, default=None): Stores the top-N similarities for visualization purposes.
--caption_from (str, default='patches'): Specifies whether to generate captions from patches or cls tokens.
--configs_dir (str, default='../configs'): Path to the configuration files directory.
--use_attn_map_for_bboxes (flag): Uses the attention map to define bounding boxes.
--csv_scores_output (str, default='evaluation_results.csv'): Path to save the evaluation results.

Example Usage

Running Evaluation with Gaussian Weighting

python eval_trace_captioning.py --model_name mlp.k \
                                --evaluation_dataset data/trace_captioning.json \
                                --batch_size 32 \
                                --use_gaussian_weighting \
                                --gaussian_variance 0.8 \
                                --device cuda

This command evaluates the mlp.k model on data/trace_captioning.json, applying Gaussian weighting with a variance of 0.8 and running on a GPU (cuda).
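
The option descriptions do not spell out exactly what is weighted or how. One plausible reading, assumed here, is that patches are weighted by a Gaussian centered on the point of interest, so that a smaller `--gaussian_variance` concentrates the weight on nearby patches. The sketch below illustrates that idea only; `gaussian_patch_weights` is a hypothetical function and the repository's implementation may differ.

```python
import math


def gaussian_patch_weights(grid_h, grid_w, center_row, center_col, variance=1.0):
    """Return normalized Gaussian weights over a grid_h x grid_w patch grid.

    Distances are measured in patch units from (center_row, center_col).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    weights = []
    for r in range(grid_h):
        row = []
        for c in range(grid_w):
            d2 = (r - center_row) ** 2 + (c - center_col) ** 2
            row.append(math.exp(-d2 / (2.0 * variance)))
        weights.append(row)
    total = sum(sum(row) for row in weights)
    # Normalize so the weights sum to 1 and can be used for a weighted average.
    return [[w / total for w in row] for row in weights]
```

With a variance of 0.8 on a 3x3 grid centered at (1, 1), the center patch receives the largest weight and the four edge neighbors share equal, smaller weights.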

Evaluating the Model on Dense Captioning

To evaluate the model on the dense captioning task, use the eval_densecap.py script. This script runs the model on a specified dataset and computes relevant evaluation metrics.

Running the Evaluation

To perform the evaluation, use the following command:

python eval_densecap.py --model_name <MODEL_NAME> \
                        --evaluation_dataset <DATASET_PATH> \
                        --batch_size 16 \
                        --device cuda

Replace <MODEL_NAME> with the name of the model and <DATASET_PATH> with the path to the dataset.

Available Options

--model_name (str, required): The name of the model to evaluate.
--evaluation_dataset (str, required): Path to the dataset used for evaluation.
--batch_size (int, default=16): Number of samples per batch during evaluation.
--device (str, default='cuda' if available): The device to run the evaluation on (cuda or cpu).
--use_gaussian_weighting (flag): If set, applies Gaussian weighting to the captions.
--gaussian_variance (float, default=1.0): Sets the variance for Gaussian weighting.
--keep_img_ratio (flag): Maintains the image aspect ratio when resizing.
--caption_bboxes_type (str, default=None): Specifies the type of bounding boxes for captions.
--configs_dir (str, default='../configs'): Path to the configuration files directory.
--compute_scores (bool, default=True): Computes the dense captioning MAP score.
--compute_scores_verbose (bool, default=False): Verbose output for score computation.
--overwrite (bool, default=True): Overwrites existing results.
--overwrite_inference (str, default=None): Overwrites inference results.
--compute_predictions_scores (bool, default=True): Computes prediction scores.
--caption_from (str, default='patches'): Specifies whether to generate captions from patches or cls tokens.
--use_attn_map_for_bboxes (bool, default=False): Uses the attention map to define bounding boxes.
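
Dense captioning mAP, as commonly defined, combines a localization criterion with a caption-quality criterion, and the localization part relies on the intersection-over-union (IoU) between predicted and ground-truth boxes. The sketch below shows only the IoU step; the `(x1, y1, x2, y2)` box format is an assumption and may not match the format used by `eval_densecap.py`.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```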

Example Usage

Running Evaluation with Gaussian Weighting

python eval_densecap.py --model_name mlp.k \
                        --evaluation_dataset data/vg_test_dense_captioning.json \
                        --batch_size 32 \
                        --use_gaussian_weighting \
                        --gaussian_variance 0.8 \
                        --device cuda

This command evaluates the mlp.k model on data/vg_test_dense_captioning.json, applying Gaussian weighting with a variance of 0.8 and running on a GPU (cuda).

Evaluating the Model on Region-Set Captioning

To evaluate the model on the region-set captioning task, use the eval_region_set_captioning.py script. This script runs the model on a specified dataset and computes relevant evaluation metrics.

Running the Evaluation

To perform the evaluation, use the following command:

python eval_region_set_captioning.py --model_name <MODEL_NAME> \
                                     --evaluation_dataset <DATASET_PATH> \
                                     --batch_size 16 \
                                     --device cuda

Replace <MODEL_NAME> with the name of the model and <DATASET_PATH> with the path to the dataset.

Available Options

--model_name (str, required): The name of the model to evaluate.
--evaluation_dataset (str, required): Path to the dataset used for evaluation.
--batch_size (int, default=16): Number of samples per batch during evaluation.
--device (str, default='cuda' if available): The device to run the evaluation on (cuda or cpu).
--use_gaussian_weighting (flag): If set, applies Gaussian weighting to the captions.
--gaussian_variance (float, default=1.0): Sets the variance for Gaussian weighting.
--keep_img_ratio (flag): Maintains the image aspect ratio when resizing.
--caption_bboxes_type (str, default=None): Specifies the type of bounding boxes for captions.
--caption_from (str, default='patches'): Specifies whether to generate captions from patches or cls tokens.
--configs_dir (str, default='../configs'): Path to the configuration files directory.
--use_attn_map_for_bboxes (flag): Uses the attention map to define bounding boxes.
--csv_scores_output (str, default='evaluation_results.csv'): Path to save the evaluation results.
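
Captioning a set of regions from patches implies mapping each box onto the backbone's patch grid. The sketch below shows one way to do that, assuming pixel boxes in `(x1, y1, x2, y2)` format and a patch size of 14 (inferred from the `dinov2_vitb14_reg` backbone named in the example configuration). Both functions are hypothetical and do not reproduce the repository's code.

```python
import math


def patches_in_box(box, image_size, patch_size=14):
    """Return (row, col) indices of patches overlapping a pixel box.

    box: (x1, y1, x2, y2) in pixels; image_size: (width, height).
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    grid_w, grid_h = width // patch_size, height // patch_size
    # Clamp the covered range to the grid; the end index is exclusive.
    col_start = max(0, int(x1 // patch_size))
    row_start = max(0, int(y1 // patch_size))
    col_end = min(grid_w, math.ceil(x2 / patch_size))
    row_end = min(grid_h, math.ceil(y2 / patch_size))
    return [(r, c) for r in range(row_start, row_end)
            for c in range(col_start, col_end)]


def patches_in_region_set(boxes, image_size, patch_size=14):
    """Union of the patches covered by every box in a region set."""
    covered = set()
    for box in boxes:
        covered.update(patches_in_box(box, image_size, patch_size))
    return sorted(covered)
```

The features of the selected patches could then be aggregated (for example averaged) into a single embedding before decoding; how the aggregation is actually done is not specified here.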

Example Usage

Running Evaluation with Gaussian Weighting

python eval_region_set_captioning.py --model_name mlp.k \
                                     --evaluation_dataset data/region_set_captioning.json \
                                     --batch_size 32 \
                                     --use_gaussian_weighting \
                                     --gaussian_variance 1.0 \
                                     --device cuda

This command evaluates the mlp.k model on data/region_set_captioning.json, applying Gaussian weighting with a variance of 1.0 and running on a GPU (cuda).

Evaluating the Model on Image Captioning

To evaluate the model on the image captioning task, use the eval_image_captioning.py script. This script runs the model on a specified dataset and computes relevant evaluation metrics.

Running the Evaluation

To perform the evaluation, use the following command:

python eval_image_captioning.py --model_name <MODEL_NAME> \
                                --evaluation_dataset <DATASET_PATH> \
                                --batch_size 16 \
                                --device cuda

Replace <MODEL_NAME> with the name of the model and <DATASET_PATH> with the path to the dataset.

Available Options

--model_name (str, required): The name of the model to evaluate.
--evaluation_dataset (str, required): Path to the dataset used for evaluation.
--batch_size (int, default=16): Number of samples per batch during evaluation.
--use_gaussian_weighting (flag): If set, applies Gaussian weighting to the captions.
--gaussian_variance (float, default=1.0): Sets the variance for Gaussian weighting.
--keep_img_ratio (flag): Maintains the image aspect ratio when resizing.
--keep_n_best_sims (int, default=None): Stores the top-N similarities for visualization purposes.
--caption_from (str, default='cls'): Specifies whether to generate captions from cls tokens, average self-attention, or patches.
--configs_dir (str, default='../configs'): Path to the configuration files directory.
--device (str, default='cuda' if available): The device to run the evaluation on (cuda or cpu).
--no_scores (flag): If set, does not compute the scores for the captions.

Example Usage

Running Evaluation with Gaussian Weighting

python eval_image_captioning.py --model_name mlp.k \
                                --evaluation_dataset data/coco-test.json \
                                --batch_size 32 \
                                --use_gaussian_weighting \
                                --gaussian_variance 1.0 \
                                --device cuda

This command evaluates the mlp.k model on data/coco-test.json, applying Gaussian weighting with a variance of 1.0 and running on a GPU (cuda).

Available datasets:

coco-test.json
flickr30_test.json

Setup Requirements

To set up the requirements for this repository, follow the steps below:

Prerequisites

Ensure you have the following installed:

Python 3.8 or higher
Git
Conda
CUDA Toolkit (if using GPU)

Installation

Clone the repository:
```
git clone [REDACTED]
cd Patch-ioner
```
Create the Conda environment:

    conda env create -f environment.yml

Training the Decoder

You can train the decoder using the following commands:

Talk2DINO, Memory (~DeCap)

python decoderTraining.py --out_dir weights_dino_b14_karpathy --not-distributed 1 --local-rank 1 --dataset coco_train_karpathy.json --prefix coco_karpathy --talk2dino_weights weights_talk2dino/vitb_mlp_infonce.pth --talk2dino_config configs_talk2dino/vitb_mlp_infonce.yaml --use_dino_feats --pre_extract_features

Talk2DINO, Noise (~CapDec)

python decoderTraining.py --out_dir weights_dino_b14_noise_karpathy --not-distributed 1 --local-rank 1 --dataset coco_train_karpathy.json --prefix coco_karpathy --talk2dino_weights weights_talk2dino/vitb_mlp_infonce.pth --talk2dino_config configs_talk2dino/vitb_mlp_infonce.yaml --use_dino_feats --pre_extract_features --gaussian_noise 0.08
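
The `--gaussian_noise` flag corresponds to the CapDec-style idea of perturbing text embeddings with Gaussian noise during decoder training. The sketch below illustrates that idea in plain Python. Treating 0.08 as a standard deviation and re-normalizing after the perturbation are both assumptions, and `add_gaussian_noise` is a hypothetical function, not the one used by `decoderTraining.py`.

```python
import math
import random


def add_gaussian_noise(embedding, noise_std=0.08, rng=None):
    """Add i.i.d. Gaussian noise to an embedding and L2-normalize the result."""
    rng = rng or random.Random()
    noisy = [x + rng.gauss(0.0, noise_std) for x in embedding]
    norm = math.sqrt(sum(x * x for x in noisy))
    # Guard against a zero vector, which cannot be normalized.
    return [x / norm for x in noisy] if norm > 0 else noisy
```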

CLIP B16, Memory (DeCap) Karpathy Train Split

python decoderTraining.py --out_dir weights_clip_b16_karpathy --not-distributed 1 --local-rank 0 --dataset coco_train_karpathy.json --prefix coco_karpathy

CLIP B32, Memory (DeCap) Karpathy Train Split

python decoderTraining.py --out_dir weights_clip_b32_karpathy --not-distributed 1 --local-rank 0 --dataset coco_train_karpathy.json --prefix coco_karpathy --clip_model ViT-B/32

Model Configuration

You can define a configuration for the model in the configs folder as a YAML file. The allowed options include:

decap_weights: Path to the textual decoder weights file.
prefix_size: Size of the textual embedding prefix.
linear_talk2dino: Boolean flag to use the linear version of Talk2DINO.
support_memory_size: Size of the memory bank.
dino_model: Model type for DINO.
normalize: Boolean flag for normalization of the embeddings in input to the decoder.
kkv_attention: Boolean flag for KKV attention.
projection_type: Path to the projection type file.

Example configuration:

decap_weights: '/raid/datasets/models_weights/decap_weights/talkingdino-ksplits/coco_karpathy-009.pt'
prefix_size: 768
linear_talk2dino: False
support_memory_size: 591753
dino_model: 'dinov2_vitb14_reg'
normalize: True
kkv_attention: False
projection_type: '/raid/datasets/im2txtmemories/coco_train_karpathy.json'
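
Once such a file has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), a quick sanity check can catch typos in key names or wrongly typed values. The sketch below is illustrative only: the expected types are inferred from the example above, `check_config` is a hypothetical helper, and missing keys are not reported because it is not stated which options are required.

```python
# Expected value types, inferred from the example configuration above.
EXPECTED_TYPES = {
    "decap_weights": str,
    "prefix_size": int,
    "linear_talk2dino": bool,
    "support_memory_size": int,
    "dino_model": str,
    "normalize": bool,
    "kkv_attention": bool,
    "projection_type": str,
}


def check_config(config):
    """Return a list of problems found in a parsed configuration dict."""
    problems = []
    for key, value in config.items():
        if key == "viecap":  # nested baseline settings are not checked here
            continue
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown key: {key}")
        # bool is a subclass of int in Python, so reject it for int options.
        elif not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            problems.append(f"{key}: expected {expected.__name__}, got {type(value).__name__}")
    return problems
```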

To set up a ViECap baseline, populate the nested fields under the viecap key. The available options are:

project_length: Length of the learnable prefix projected from vision features.
top_k: The number of detected objects to use as hard prompt.
name_of_entities_text: The name of the collection of entities to use for the hard-prompt.
files_path: Path to directory containing ViECap-related checkpoints and auxiliary data.
weight_path: Path to the learned weights used for prefix projection.
using_hard_prompt: True in the default configuration.
soft_prompt_first: True in the default configuration.
using_greedy_search: True for greedy search, False for beam search.
language_model: GPT-2 in the default configuration.

In the case of MeaCap baselines, set the nested fields at viecap -> meacap. The available options are:

memory_caption_num: The standard value for MeaCap is 5.
vl_model: The CLIP version.
wte_model_path: The default value is "sentence-transformers/all-MiniLM-L6-v2".
parser_checkpoint: Checkpoint for a scene graph parser; the default is "lizhuang144/flan-t5-base-VG-factual-sg".
memory_id: The ID of the memory pool.
memory_base_path: Path to directory containing MeaCap-related checkpoints and auxiliary data.
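
To make the nesting concrete, the sketch below represents a parsed configuration as a Python dictionary with the MeaCap fields nested under viecap -> meacap. Only values stated above are filled in; the remaining options (paths, `vl_model`, `memory_id`) are omitted rather than guessed. `get_nested` is a hypothetical helper for reading nested options safely.

```python
def get_nested(config, *keys, default=None):
    """Walk nested dictionaries and return the value, or default if any key is missing."""
    current = config
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Nested layout: MeaCap options live inside the ViECap block.
config = {
    "viecap": {
        "using_hard_prompt": True,
        "soft_prompt_first": True,
        "meacap": {
            "memory_caption_num": 5,
            "wte_model_path": "sentence-transformers/all-MiniLM-L6-v2",
            "parser_checkpoint": "lizhuang144/flan-t5-base-VG-factual-sg",
        },
    },
}

print(get_nested(config, "viecap", "meacap", "memory_caption_num"))
```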

Credits

This repository contains code from several other repositories. We acknowledge and thank the authors of these repositories for their contributions.

Reference

If you find this code useful, please cite the following paper:

@misc{bianchi2025patchcaptionallunified,
      title={One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework}, 
      author={Lorenzo Bianchi and Giacomo Pacini and Fabio Carrara and Nicola Messina and Giuseppe Amato and Fabrizio Falchi},
      year={2025},
      eprint={2510.02898},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02898}, 
}

Project details

Release history

0.1.1 (this version): Oct 14, 2025
0.1.0: Oct 14, 2025

