Identity-Preserving Text-to-Video Generation by Frequency Decomposition

If you like our project, please give us a star ⭐ on GitHub for the latest updates.


This repository is the official implementation of ConsisID, a tuning-free, DiT-based, controllable IPT2V model that keeps human identity consistent in generated videos. The approach draws inspiration from previous studies on frequency analysis of vision/diffusion transformers.

💡 We also have other video generation projects that may interest you ✨.

Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin, Yunyang Ge, Xinhua Cheng, et al.

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan, Jinfa Huang, Yujun Shi, et al.

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
Shenghai Yuan, Jinfa Huang, Yongqi Xu, et al.

📣 News

  • โณโณโณ Release the full codes & datasets & weights.
  • โณโณโณ Integrate into Diffusers.
  • [2024.12.09] ๐Ÿ”ฅWe release the test set and metric calculation code used in the paper, now your can measure the metrics on your own machine. Please refer to this guide for more details.
  • [2024.12.08] ๐Ÿ”ฅThe code for data preprocessing is out, which is used to obtain the training data required by ConsisID. Please refer to this guide for more details.
  • [2024.12.04] Thanks @shizi for providing ๐Ÿค—Windows-ConsisID and ๐ŸŸฃWindows-ConsisID, which make it easy to run ConsisID on Windows.
  • [2024.12.01] ๐Ÿ”ฅ We provide full text prompts corresponding to all the videos on project page. Click here to get and try the demo.
  • [2024.11.30] ๐Ÿ”ฅ We have fixed the huggingface demo, welcome to try it.
  • [2024.11.29] ๐Ÿ”ฅ The current codes and weights are our early versions, and the differences with the latest version in arxiv can be viewed here. And we will release the full codes in the next few days.
  • [2024.11.28] Thanks @camenduru for providing Jupyter Notebook and @Kijai for providing ComfyUI Extension ComfyUI-ConsisIDWrapper. If you find related work, please let us know.
  • [2024.11.27] ๐Ÿ”ฅ Due to policy restrictions, we only open-source part of the dataset. You can download it by clicking here. And we will release the data processing codes in the next few days.
  • [2024.11.26] ๐Ÿ”ฅ We release the arXiv paper for ConsisID, and you can click here to see more details.
  • [2024.11.22] ๐Ÿ”ฅ All codes & datasets are coming soon! Stay tuned ๐Ÿ‘€!

๐Ÿ˜ Gallery

Identity-Preserving Text-to-Video Generation.

Demo video of ConsisID; you can also click here to watch it.

🤗 Demo

Gradio Web UI

We highly recommend trying our web demo with the following command, which incorporates all features currently supported by ConsisID. An online demo is also available on Hugging Face Spaces.

python app.py

CLI Inference

python infer.py --model_path BestWishYsh/ConsisID-preview

Warning: even with the same seed and prompt, results will differ across machines.
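If you want to run several prompts in a row, the CLI above can be wrapped in a small batch script. This is a sketch only: `--model_path` is the flag shown above, while `--prompt` is an assumed flag used for illustration; check `python infer.py --help` for the flags your version actually supports.

```python
import subprocess

def build_infer_command(model_path, prompt=None):
    """Assemble the infer.py invocation as an argument list.

    --model_path is taken from the README; --prompt is a hypothetical
    flag used here only for illustration.
    """
    cmd = ["python", "infer.py", "--model_path", model_path]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    return cmd

prompts = ["a man is playing guitar.", "a woman is reading a book."]
for p in prompts:
    cmd = build_infer_command("BestWishYsh/ConsisID-preview", p)
    # subprocess.run(cmd, check=True)  # uncomment to launch each generation
    print(" ".join(cmd))
```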

GPU Memory Optimization

# Enable these if you do not have multiple GPUs or enough GPU memory (such as an H100)
pipe.enable_model_cpu_offload()       # offload whole sub-models to CPU between calls
pipe.enable_sequential_cpu_offload()  # offload layer by layer (slower, lowest memory)
pipe.vae.enable_slicing()             # decode the latent batch slice by slice
pipe.vae.enable_tiling()              # decode the latent frames tile by tile

Warning: these options increase inference time and may also reduce quality.

Prompt Refiner

ConsisID has high requirements for prompt quality. You can use GPT-4o to refine the input text prompt; an example follows (original prompt: "a man is playing guitar.").

a man is playing guitar.

Change the sentence above to something like this (add some facial changes, even if they are minor. Don't make the sentence too long): 

The video features a man standing next to an airplane, engaged in a conversation on his cell phone. He is wearing sunglasses and a black top, and he appears to be talking seriously. The airplane has a green stripe running along its side, and there is a large engine visible behind him. The man seems to be standing near the entrance of the airplane, possibly preparing to board or just having disembarked. The setting suggests that he might be at an airport or a private airfield. The overall atmosphere of the video is professional and focused, with the man's attire and the presence of the airplane indicating a business or travel context.

Some sample prompts are available here.
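To automate the refinement step, you can wrap the raw prompt in the instruction above before sending it to GPT-4o (or any other LLM). A minimal sketch: the template text is taken from the example above, and the actual API call is left to the caller.

```python
# Template reproducing the refinement instruction from the example above.
REFINE_TEMPLATE = (
    "{prompt}\n\n"
    "Change the sentence above to something like this (add some facial "
    "changes, even if they are minor. Don't make the sentence too long): "
)

def build_refine_request(raw_prompt: str) -> str:
    """Wrap a raw prompt in the refinement instruction.

    Send the returned string to an LLM such as GPT-4o; the API call
    itself is intentionally omitted here.
    """
    return REFINE_TEMPLATE.format(prompt=raw_prompt.strip())

print(build_refine_request("a man is playing guitar."))
```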

โš™๏ธ Requirements and Installation

We recommend the following requirements.

Environment

git clone --depth=1 https://github.com/PKU-YuanGroup/ConsisID.git
cd ConsisID
conda create -n consisid python=3.11.0
conda activate consisid
pip install -r requirements.txt

Download ConsisID

The weights are available at 🤗 HuggingFace and 🟣 WiseModel and will be downloaded automatically when running app.py or infer.py, or you can download them with the following commands.

# way 1
# if you are in china mainland, run this first: export HF_ENDPOINT=https://hf-mirror.com
cd util
python download_weights.py

# way 2
# if you are in china mainland, run this first: export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --repo-type model \
BestWishYsh/ConsisID-preview \
--local-dir ckpts

# way 3
git lfs install
git clone https://www.wisemodel.cn/SHYuanBest/ConsisID-Preview.git

Once ready, the weights will be organized in this format:

📦 ckpts/
├── 📂 data_process/
├── 📂 face_encoder/
├── 📂 scheduler/
├── 📂 text_encoder/
├── 📂 tokenizer/
├── 📂 transformer/
├── 📂 vae/
├── 📄 configuration.json
└── 📄 model_index.json
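Since a partially downloaded checkpoint tends to fail in non-obvious ways, it can help to sanity-check the folder before running inference. A minimal sketch based on the tree above:

```python
from pathlib import Path

# Expected layout, taken from the weights tree above.
EXPECTED_DIRS = ["data_process", "face_encoder", "scheduler",
                 "text_encoder", "tokenizer", "transformer", "vae"]
EXPECTED_FILES = ["configuration.json", "model_index.json"]

def check_ckpts_layout(root="ckpts"):
    """Return a list of expected entries missing from the weights folder."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing

missing = check_ckpts_layout()
if missing:
    print("Incomplete download, missing:", missing)
```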

๐Ÿ—๏ธ Training

Data preprocessing

Please refer to this guide for how to obtain the training data required by ConsisID. If you want to train a text-to-image-and-video generation model, you need to arrange the dataset in this format:

📦 datasets/
├── 📂 captions/
│   ├── 📄 dataname_1.json
│   └── 📄 dataname_2.json
├── 📂 dataname_1/
│   ├── 📂 refine_bbox_jsons/
│   ├── 📂 track_masks_data/
│   └── 📂 videos/
├── 📂 dataname_2/
│   ├── 📂 refine_bbox_jsons/
│   ├── 📂 track_masks_data/
│   └── 📂 videos/
├── ...
└── 📄 total_train_data.txt
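As with the weights, a quick consistency check of this layout can catch mistakes before training starts. A sketch that pairs each caption JSON in captions/ with its same-named data folder and reports anything unmatched:

```python
from pathlib import Path

def list_dataset_pairs(root="datasets"):
    """Match caption JSONs to data folders under the layout above.

    Returns (paired, unmatched): subset names present on both sides,
    and names present on only one side.
    """
    root = Path(root)
    if not root.is_dir():
        return [], []
    captions = {p.stem for p in (root / "captions").glob("*.json")}
    subsets = {p.name for p in root.iterdir()
               if p.is_dir() and p.name != "captions"}
    return sorted(captions & subsets), sorted(captions ^ subsets)

paired, unmatched = list_dataset_pairs()
print("paired:", paired, "unmatched:", unmatched)
```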

Video DiT training

First, set the hyperparameters:
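The hyperparameters live inside the training scripts themselves. A hypothetical excerpt of what such a configuration might look like (all variable names here are illustrative, not the actual ones used by train_single_rank.sh):

```shell
# Illustrative hyperparameter block -- open train_single_rank.sh to see
# the real variable names and defaults before editing.
MODEL_PATH="BestWishYsh/ConsisID-preview"   # base weights to fine-tune
DATASET_DIR="datasets"                      # root of the dataset layout above
OUTPUT_DIR="outputs"                        # where checkpoints are written
LEARNING_RATE=1e-5
TRAIN_BATCH_SIZE=1
MAX_TRAIN_STEPS=10000
```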

Then, run the following script to start training:

# For single rank
bash train_single_rank.sh
# For multi rank
bash train_multi_rank.sh

🙌 Friendly Links

We found some plugins created by community developers. Thanks for their efforts:

If you find related work, please let us know.

๐Ÿณ Dataset

We release a subset of the data used to train ConsisID. The dataset is available at HuggingFace, or you can download it with the following command. Some samples can be found on our Project Page.

huggingface-cli download --repo-type dataset \
BestWishYsh/ConsisID-preview-Data \
--local-dir BestWishYsh/ConsisID-preview-Data

๐Ÿ› ๏ธ Evaluation

We release the data used for evaluation in ConsisID, which is available at HuggingFace. Please refer to this guide for how to evaluate your customized model.

๐Ÿ‘ Acknowledgement

🔒 License

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The CogVideoX-5B model (Transformers module) is released under the CogVideoX LICENSE.
  • The service is a research preview. Please contact us if you find any potential violations. (shyuan-cs@hotmail.com)

โœ๏ธ Citation

If you find our paper and codes useful in your research, please consider giving us a star :star: and a citation :pencil:.

@article{yuan2024identity,
  title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
  author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyuan and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17440},
  year={2024}
}

๐Ÿค Contributors
