LLaVAction: Evaluating and Training Multi-Modal Large Language Models for Action Recognition


License: Apache 2.0

Abstract

Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. Recently developed multi-modal large language models (MLLMs) are promising candidates for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, into the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art on the EPIC-KITCHENS-100 Challenge and outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as VideoMME, PerceptionTest and MVBench.

Code

  • This repository contains the implementation for our preprint on evaluating and training multi-modal large language models for action recognition.
  • Our code is built on LLaVA-NeXT, and files in the directory llavaction/action are related to our work. We thank the authors of LLaVA-NeXT for making their code publicly available.
  • The files in the /eval, /model, /serve and /train directories are taken directly from LLaVA-NeXT, except for the following files, which we modified:
    • /model/llava_arch.py
    • /model/language_model/llava_qwen.py
    • /train/train.py
    • /train/llava_trainer.py
    • /utils.py

Demo

We provide code to run video inference in a Jupyter Notebook (which can be run on Google Colaboratory).

Installation guide for video inference:

conda create -n llavaction python=3.10 -y
conda activate llavaction
pip install --upgrade pip  # Enable PEP 660 support.
pip install --pre llavaction
  • Please see the /example directory for a demo notebook.
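
After running the commands above, it can be useful to confirm that the package actually landed in the active conda environment before opening the demo notebook. The following is a minimal sketch that uses only the Python standard library; the helper name installed_version is our own illustration and is not part of llavaction.

```python
from importlib import metadata


def installed_version(package: str = "llavaction"):
    """Return the installed version string, or None if the package is absent.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import the package itself.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("llavaction is not installed in this environment")
    else:
        print(f"llavaction {version} is installed")
```

Because the check reads metadata rather than importing the package, it stays fast and does not load any model code.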

EPIC-KITCHENS-100-MQA

In our work, we introduce a new way to evaluate MLLMs for action recognition by casting EPIC-KITCHENS-100 into a multi-question-answer benchmark. This has not yet been released [as of 3/2025], but please check the issues or open an issue if you are interested in accessing this resource before the paper is published. We also plan to integrate this into the lmms-eval package.
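
Since the benchmark itself is not yet available, the sketch below only illustrates the general idea of turning one labeled action into a multiple-choice item with hard distractors. Everything in it is an assumption made for illustration: the names MultipleChoiceItem and build_mqa_item, the question wording, and the per-candidate "confusability" scores are hypothetical and do not describe the released format or how the distractors are actually selected.

```python
import random
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    question: str
    options: list[str]  # lettered option strings, e.g. "A. open drawer"
    answer: str         # letter of the correct option


def build_mqa_item(correct, scored_candidates, num_distractors=4, seed=0):
    """Build one multiple-choice item from a ground-truth action (illustrative).

    scored_candidates is a list of (action, score) pairs, where a higher
    score is assumed to mean "more easily confused with the correct action".
    """
    # Keep the highest-scoring wrong answers as the "hard" distractors.
    wrong = [(a, s) for a, s in scored_candidates if a != correct]
    wrong.sort(key=lambda pair: pair[1], reverse=True)
    distractors = [a for a, _ in wrong[:num_distractors]]

    # Shuffle so the correct answer does not sit in a fixed position.
    choices = distractors + [correct]
    random.Random(seed).shuffle(choices)

    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    options = [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return MultipleChoiceItem(
        question="What action is the person performing in the video?",
        options=options,
        answer=letters[choices.index(correct)],
    )
```

With random distractors instead of the top-scoring ones, the same item would typically be much easier, which is the contrast the abstract draws when it says leading MLLMs struggle once difficult incorrect answers are sampled.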

Acknowledgments

We thank the Swiss AI Initiative (Project ID a03) from the Swiss National Supercomputing Centre (CSCS) and the Boehringer Ingelheim Fonds for a PhD stipend (H.Q.). M.W.M. thanks the Vallee Foundation; M.W.M. and A.M. thank the SNSF for grant No. 320030-227871.
