YOLO-World: Real-time Open Vocabulary Object Detection
Tianheng Cheng2,3,*, Lin Song1,📧,*, Yixiao Ge1,🌟,2, Wenyu Liu3, Xinggang Wang3,📧, Ying Shan1,2
* Equal contribution 🌟 Project lead 📧 Corresponding author
1 Tencent AI Lab, 2 ARC Lab, Tencent PCG
3 Huazhong University of Science and Technology
Notice
We recommend that everyone use English to communicate on issues, as this helps developers from around the world discuss, share experiences, and answer questions together.
🔥 Updates
[2024-4-28]:
Long time no see! This update contains bugfixes and improvements: (1) ONNX demo; (2) image demo (supports tensor input); (3) new pre-trained models; (4) image prompts; (5) simple version for fine-tuning / deployment; (6) installation guide (including a requirements.txt).
[2024-3-28]:
We provide: (1) more high-resolution pre-trained models (e.g., S, M, X) (#142); (2) pre-trained models with CLIP-Large text encoders. Most importantly, we preliminarily fix fine-tuning without mask-refine and explore a new fine-tuning setting (#160, #76). In addition, fine-tuning YOLO-World with mask-refine also obtains significant improvements; check more details in configs/finetune_coco.
[2024-3-16]:
We fix the bugs about the demo (#110,#94,#129, #125) with visualizations of segmentation masks, and release YOLO-World with Embeddings, which supports prompt tuning, text prompts and image prompts.
[2024-3-3]:
We add the high-resolution YOLO-World, which supports 1280x1280 resolution with higher accuracy and better performance for small objects!
[2024-2-29]:
We release the newest version of YOLO-World-v2 with higher accuracy and faster speed! We hope the community can join us to improve YOLO-World!
[2024-2-28]:
Excited to announce that YOLO-World has been accepted by CVPR 2024! We're continuing to make YOLO-World faster and stronger, as well as making it better to use for all.
[2024-2-22]:
We sincerely thank RoboFlow and @Skalskip92 for the Video Guide about YOLO-World, nice work!
[2024-2-18]:
We thank @Skalskip92 for developing the wonderful segmentation demo via connecting YOLO-World and EfficientSAM. You can try it now at the 🤗 HuggingFace Spaces.
[2024-2-17]:
The largest model X of YOLO-World is released, which achieves better zero-shot performance!
[2024-2-17]:
We release the code & models for YOLO-World-Seg now! YOLO-World now supports open-vocabulary / zero-shot object segmentation!
[2024-2-15]:
The pre-trained YOLO-World-L with CC3M-Lite is released!
[2024-2-14]:
We provide the image_demo for inference on images or directories.
[2024-2-10]:
We provide the fine-tuning and data details for fine-tuning YOLO-World on the COCO dataset or the custom datasets!
[2024-2-3]:
We support the Gradio demo now in the repo and you can build the YOLO-World demo on your own device!
[2024-2-1]:
We've released the code and weights of YOLO-World now!
[2024-2-1]:
We deploy the YOLO-World demo on HuggingFace 🤗, you can try it now!
[2024-1-31]:
We are excited to launch YOLO-World, a cutting-edge real-time open-vocabulary object detector.
TODO
YOLO-World is under active development, so please stay tuned ☕️! If you have suggestions 📃 or ideas 💡, we would love for you to bring them up in the Roadmap ❤️!
FAQ (Frequently Asked Questions)
We have set up an FAQ about YOLO-World in the GitHub discussions. It collects common questions; we hope everyone will raise issues or share solutions there, and can quickly find answers in it.
Highlights & Introduction
This repo contains the PyTorch implementation, pre-trained weights, and pre-training/fine-tuning code for YOLO-World.
- YOLO-World is pre-trained on large-scale datasets, including detection, grounding, and image-text datasets.
- YOLO-World is the next-generation YOLO detector, with a strong open-vocabulary detection capability and grounding ability.
- YOLO-World presents a prompt-then-detect paradigm for efficient user-vocabulary inference, which re-parameterizes vocabulary embeddings as parameters into the model and achieves superior inference speed. You can try to export your own detection model without extra training or fine-tuning in our online demo!
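The prompt-then-detect idea from the last bullet can be illustrated with a toy sketch: the vocabulary is encoded once, the resulting embeddings are cached, and each later query only computes dot products against the cache. Everything here is an assumption made for illustration. `toy_text_embedding` is a hash-based stand-in for a real text encoder, and `PromptThenDetect` is a hypothetical class, not part of the YOLO-World code.

```python
import hashlib
import math


def toy_text_embedding(text: str, dim: int = 8) -> list:
    """Hypothetical stand-in for a text encoder: a deterministic unit vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


class PromptThenDetect:
    """Toy illustration: encode the vocabulary once, reuse it for every query."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)
        # "Prompt" step: run the (toy) text encoder once per vocabulary.
        self.class_embeddings = [toy_text_embedding(w) for w in self.vocabulary]

    def classify(self, region_feature) -> str:
        # "Detect" step: only dot products against the cached embeddings,
        # no text encoder call at inference time.
        scores = [
            sum(a * b for a, b in zip(region_feature, emb))
            for emb in self.class_embeddings
        ]
        return self.vocabulary[scores.index(max(scores))]


detector = PromptThenDetect(["person", "dog", "bicycle"])
# A fake region feature that coincides with the "dog" embedding.
print(detector.classify(toy_text_embedding("dog")))
```

In the real model the cached embeddings would come from the text encoder and the region features from the detector; the sketch only shows why inference can skip the text encoder once the vocabulary is fixed.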
Model Zoo
We've pre-trained YOLO-World-S/M/L from scratch and evaluated them on LVIS val-1.0 and LVIS minival. We provide the pre-trained model weights and training logs for applications/research or reproducing the results.
Zero-shot Inference on LVIS dataset
Model | Pre-train Data | Size | APmini | APr (mini) | APc (mini) | APf (mini) | APval | APr (val) | APc (val) | APf (val) | Weights |
---|---|---|---|---|---|---|---|---|---|---|---|
YOLO-Worldv2-S | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | 17.3 | 11.3 | 14.9 | 22.7 | HF Checkpoints 🤗 |
YOLO-Worldv2-S | O365+GoldG | 1280🔸 | 24.1 | 18.7 | 22.0 | 26.9 | 18.8 | 14.1 | 16.3 | 23.8 | HF Checkpoints 🤗 |
YOLO-Worldv2-M | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | 23.5 | 17.1 | 20.0 | 30.1 | HF Checkpoints 🤗 |
YOLO-Worldv2-M | O365+GoldG | 1280🔸 | 31.6 | 24.5 | 29.0 | 35.1 | 25.3 | 19.3 | 22.0 | 31.7 | HF Checkpoints 🤗 |
YOLO-Worldv2-L | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | 26.0 | 18.6 | 23.0 | 32.6 | HF Checkpoints 🤗 |
YOLO-Worldv2-L | O365+GoldG | 1280🔸 | 34.6 | 29.2 | 32.8 | 37.2 | 27.6 | 21.9 | 24.2 | 34.0 | HF Checkpoints 🤗 |
YOLO-Worldv2-L (CLIP-Large) 🔥 | O365+GoldG | 640 | 34.0 | 22.0 | 32.6 | 37.4 | 27.1 | 19.9 | 23.9 | 33.9 | HF Checkpoints 🤗 |
YOLO-Worldv2-L (CLIP-Large) 🔥 | O365+GoldG | 800🔸 | 35.5 | 28.3 | 33.2 | 38.8 | 28.6 | 22.0 | 25.1 | 35.4 | HF Checkpoints 🤗 |
YOLO-Worldv2-L | O365+GoldG+CC3M-Lite | 640 | 32.9 | 25.3 | 31.1 | 35.8 | 26.1 | 20.6 | 22.6 | 32.3 | HF Checkpoints 🤗 |
YOLO-Worldv2-X | O365+GoldG+CC3M-Lite | 640 | 35.4 | 28.7 | 32.9 | 38.7 | 28.4 | 20.6 | 25.6 | 35.0 | HF Checkpoints 🤗 |
🔥 YOLO-Worldv2-X | O365+GoldG+CC3M-Lite | 1280🔸 | 37.4 | 30.5 | 35.2 | 40.7 | 29.8 | 21.1 | 26.8 | 37.0 | HF Checkpoints 🤗 |
YOLO-Worldv2-XL | O365+GoldG+CC3M-Lite | 640 | 36.0 | 25.8 | 34.1 | 39.5 | 29.1 | 21.1 | 26.3 | 35.8 | HF Checkpoints 🤗 |
NOTE:
- APmini: evaluated on LVIS minival.
- APval: evaluated on LVIS val 1.0.
- HuggingFace Mirror provides a mirror of HuggingFace, an option for users who are unable to reach HuggingFace.
- 🔸: models fine-tuned with the pre-training data.
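As a reading aid for the table, the following sketch computes the APmini difference between the 640 and 1280 🔸 rows of the same model. The numbers are copied from the table (S, M, and L from the O365+GoldG rows, X from the O365+GoldG+CC3M-Lite rows); the dictionary layout and the function name `high_res_gain` are illustrative choices only.

```python
# (model, input size) -> AP on LVIS minival, copied from the table above.
AP_MINI = {
    ("S", 640): 22.7, ("S", 1280): 24.1,
    ("M", 640): 30.0, ("M", 1280): 31.6,
    ("L", 640): 33.0, ("L", 1280): 34.6,
    ("X", 640): 35.4, ("X", 1280): 37.4,
}


def high_res_gain(model: str) -> float:
    """AP difference between the 1280 and the 640 row, rounded to one decimal."""
    return round(AP_MINI[(model, 1280)] - AP_MINI[(model, 640)], 1)


for model in "SMLX":
    print(model, high_res_gain(model))
```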
Pre-training Logs:
We provide the pre-training logs of YOLO-World-v2. Due to unexpected errors on the local machines, training may have been interrupted several times.
Model | YOLO-World-v2-S | YOLO-World-v2-M | YOLO-World-v2-L | YOLO-World-v2-X |
---|---|---|---|---|
Pre-training Log | Part-1, Part-2 | Part-1, Part-2 | Part-1, Part-2 | Final part |
Getting started
1. Installation
YOLO-World is developed based on torch==1.11.0, mmyolo==0.6.0, and mmdetection==3.0.0. Check more details about requirements and mmcv in docs/installation.
Clone Project
git clone --recursive https://github.com/AILab-CVC/YOLO-World.git
Install
pip install torch wheel -q
pip install -e .
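After installing, it can help to compare the installed versions with the ones the project was developed against. The sketch below is a hedged example: the version numbers come from the text above, while the pip distribution names (`torch`, `mmyolo`, `mmdet`) and the choice to ignore local build tags such as `+cu113` are assumptions. Newer versions may work as well; docs/installation is the authority.

```python
from importlib import metadata

# Versions quoted in the text above; the distribution names are assumptions.
PINNED = {"torch": "1.11.0", "mmyolo": "0.6.0", "mmdet": "3.0.0"}


def find_mismatches(installed: dict, pinned: dict) -> dict:
    """Return {name: (installed_or_None, pinned)} for packages that differ."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        # Treat a local build tag such as "1.11.0+cu113" as matching "1.11.0".
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


def installed_versions(names) -> dict:
    """Look up installed versions; missing packages map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = find_mismatches(installed_versions(PINNED), PINNED)
    for name, (found, wanted) in report.items():
        print(f"{name}: installed={found}, developed against={wanted}")
```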
2. Preparing Data
We provide the details about the pre-training data in docs/data.
Training & Evaluation
We adopt the default training or evaluation scripts of mmyolo.
We provide the configs for pre-training and fine-tuning in configs/pretrain and configs/finetune_coco.
Training YOLO-World is easy:
chmod +x tools/dist_train.sh
# sample command for pre-training, use AMP for mixed-precision training
./tools/dist_train.sh configs/pretrain/yolo_world_l_t2i_bn_2e-4_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py 8 --amp
NOTE: YOLO-World is pre-trained on 4 nodes with 8 GPUs per node (32 GPUs in total). For pre-training, node_rank and nnodes should be specified for multi-node training.
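To make the roles of nnodes and node_rank concrete, here is a small sketch of the rank arithmetic commonly used by distributed launchers. It assumes the usual contiguous layout, where each node owns a consecutive block of global ranks; whether tools/dist_train.sh follows exactly this layout is an assumption, and the function names are illustrative.

```python
def world_size(nnodes: int, gpus_per_node: int) -> int:
    """Total number of processes, one per GPU."""
    return nnodes * gpus_per_node


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of a process, assuming contiguous ranks per node."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank


# The pre-training setup described above: 4 nodes with 8 GPUs each.
print(world_size(nnodes=4, gpus_per_node=8))                  # 32
print(global_rank(node_rank=3, local_rank=7, gpus_per_node=8))  # 31
```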
Evaluating YOLO-World is also easy:
chmod +x tools/dist_test.sh
./tools/dist_test.sh path/to/config path/to/weights 8
NOTE: We mainly evaluate the performance on LVIS-minival for pre-training.
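When the training and evaluation scripts are driven from Python (for example in a sweep over configs), the two command lines above can be assembled as argument lists. This is only a sketch: the helper names `train_command` and `eval_command` are hypothetical, and the argument order simply mirrors the sample commands in this section.

```python
import shlex


def train_command(config: str, gpus: int, amp: bool = False) -> list:
    """Mirror of: ./tools/dist_train.sh <config> <gpus> [--amp]"""
    cmd = ["./tools/dist_train.sh", config, str(gpus)]
    if amp:
        cmd.append("--amp")
    return cmd


def eval_command(config: str, weights: str, gpus: int) -> list:
    """Mirror of: ./tools/dist_test.sh <config> <weights> <gpus>"""
    return ["./tools/dist_test.sh", config, weights, str(gpus)]


# Print the commands; pass the lists to subprocess.run(...) to execute them.
print(shlex.join(train_command("path/to/config", 8, amp=True)))
print(shlex.join(eval_command("path/to/config", "path/to/weights", 8)))
```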
Fine-tuning YOLO-World
We provide the details about fine-tuning YOLO-World in docs/fine-tuning.
Deployment
We provide the details about deployment for downstream applications in docs/deployment. You can directly download the ONNX model through the online demo in Huggingface Spaces 🤗.
Demo
See demo for more details:
- gradio_demo.py: Gradio demo, ONNX export
- image_demo.py: inference with images or a directory of images
- simple_demo.py: a simple demo of YOLO-World, using array (instead of path) as input.
- video_demo.py: inference YOLO-World on videos.
- inference.ipynb: Jupyter notebook for YOLO-World.
- Google Colab Notebook: We sincerely thank Onuralp for sharing the Colab Demo, you can have a try 😊!
Acknowledgement
We sincerely thank mmyolo, mmdetection, GLIP, and transformers for providing their wonderful code to the community!
Citations
If you find YOLO-World useful in your research or applications, please consider giving us a star 🌟 and citing it.
@inproceedings{Cheng2024YOLOWorld,
title={YOLO-World: Real-Time Open-Vocabulary Object Detection},
author={Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying},
booktitle={Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}
Licence
YOLO-World is under the GPL-v3 Licence and is supported for commercial usage.
Hashes for yolo_world_open-0.8.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | b800148758340e95b5eb5a3bd527d3f9cc1674807a6c5c81244af24606e78805 |
MD5 | daf62d4215897517397e2c8b391c33aa |
BLAKE2b-256 | b6834d3f8f64cdcf33fe7100d9b737ec19e8fdfe5cdbcac813d8b1c8d53c9425 |
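A downloaded wheel can be checked against the SHA256 digest listed above. The sketch below uses only the standard library; the digest is copied from the table, while the helper names and the local file path in the usage note are assumptions.

```python
import hashlib

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "b800148758340e95b5eb5a3bd527d3f9cc1674807a6c5c81244af24606e78805"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large wheels need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, after downloading the wheel into the current directory, `matches_expected("yolo_world_open-0.8.0-py3-none-any.whl")` should return `True` if the file is intact.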