
EraX-VL-7B-V1 - A multimodal vision-language model based on the Qwen2-VL-7B architecture.

Project description

EraX-VL-7B-V1


Introduction

After a month of relentless effort, today we are thrilled to release EraX-VL-7B-V1!

NOTA BENE: EraX-VL-7B-V1 is NOT a typical OCR-only tool like Tesseract; it is a multimodal LLM-based model. To use it effectively, you may have to tailor your prompt carefully to your task.
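
For instance, the same image can be queried with quite different prompts depending on the task. The strings below are illustrative only (adapted from the examples later on this page), not officially tuned prompts:

# Description-style prompt vs. structured-extraction prompt (illustrative only).
prompt_describe = "Diễn tả nội dung bức ảnh này bằng định dạng json."  # "Describe the content of this image in JSON format."
prompt_extract = "Hãy trích xuất toàn bộ chi tiết của bức ảnh này theo đúng thứ tự của nội dung bằng định dạng json và không bình luận gì thêm."  # "Extract all details of this image, in content order, as JSON, without further commentary."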

EraX-VL-7B-V1 is the latest vision-language model in the EraX model family.

Benchmark

Below is a benchmark of global open-source and proprietary multimodal models on the MTVQA Vietnamese test set, conducted by VinBigdata. We plan to run more detailed and diverse evaluations in the near future.

(Benchmark results image - Source: VinBigData, 20:00 23 September 2024)

Quickstart

Below, we provide simple examples showing how to use EraX-VL-7B-V1 with 🤗 Transformers.

The code for EraX-VL-7B-V1 is included in the latest Hugging Face transformers, so we advise you to install transformers from source.

Install the necessary packages:

python -m pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 accelerate
python -m pip install qwen-vl-utils
python -m pip install flash-attn --no-build-isolation
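
After installation, a quick sanity check can confirm the environment. This is a minimal sketch; flash-attn is optional and only needed if you load the model with attn_implementation="flash_attention_2":

import torch
import transformers

print("transformers:", transformers.__version__)      # should be the source build installed above
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # optional; required only for attn_implementation="flash_attention_2"
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print('flash-attn not installed; load the model with attn_implementation="eager"')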

Using Google Colaboratory

  • Ready-to-run notebook on Google Colaboratory: Open In Colab
  • API notebook on Google Colaboratory (API key required): Open In Colab

Using 🤗 Transformers

import base64

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "erax/EraX-VL-7B-V1"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager", # replace with "flash_attention_2" if your GPU is Ampere architecture
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Limit the per-image resolution range (and thus the visual token count); adjust to trade accuracy for memory.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

image_path = "image.jpg"

with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
decoded_image_text = encoded_image.decode('utf-8')
base64_data = f"data:image;base64,{decoded_image_text}"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": base64_data,
            },
            {
                "type": "text",
                "text": "Diễn tả nội dung bức ảnh này bằng định dạng json."
            },
        ],
    }
]

# Prepare the chat-formatted prompt (text only; tokenization happens in the processor call below)
text_prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text_prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generation configs
generation_config                    = model.generation_config
generation_config.do_sample          = True
generation_config.temperature        = 0.2
generation_config.top_k              = 1
generation_config.top_p              = 0.001
generation_config.max_new_tokens     = 2048
generation_config.repetition_penalty = 1.1

# Inference
generated_ids = model.generate(**inputs, generation_config=generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
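
Because the prompt asks for JSON, the decoded answer is usually a JSON string, sometimes wrapped in a markdown code fence. The helper below is a small, optional sketch (not part of the model card) that parses it with the standard library:

import json
import re

def parse_json_output(text):
    """Best-effort parse of the model's JSON answer.
    Strips an optional markdown code fence before calling json.loads."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # not valid JSON; fall back to the raw text if needed

print(parse_json_output(output_text[0]))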

Using API

Install the erax-vl-7b-v1 package:

pip install erax-vl-7b-v1==0.1.1

Then you can use this library for image extraction tasks like this:

from erax_vl_7b_v1.utils import (
    process_lr,
    get_json,
    openBase64_Image,
    add_img_content,
    add_pdf_content,
    add_pdf_content_json
)
from erax_vl_7b_v1.erax_api_lib import (
    API_Image_OCR_EraX_VL_7B_vLLM,
    API_PDF_OCR_EraX_VL_7B_vLLM,
    API_Chat_OCR_EraX_VL_7B_vLLM,
    API_Multiple_Images_OCR_EraX_VL_7B_vLLM,
    API_PDF_Full_OCR_EraX_VL_7B_vLLM
)

ERAX_URL_ID = "EraX's URL ID"
API_KEY = "EraX's API Key"

image_path = "image.jpg"
prompt = """Hãy trích xuất toàn bộ chi tiết của các bức ảnh này theo đúng thứ tự của nội dung bằng định dạng json và không bình luận gì thêm."""

result, history = API_Image_OCR_EraX_VL_7B_vLLM(
    image_paths=image_path,
    is_base64=False,
    prompt=prompt,
    erax_url_id=ERAX_URL_ID,
    API_key=API_KEY,
)

# Parse the JSON string returned in `result` into a Python object.
json_result = get_json(result)

print(json_result)
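
To keep the extracted fields for later processing, you can persist the parsed result with the standard library. This is a small illustrative addition; the file name is arbitrary:

import json

# ensure_ascii=False keeps Vietnamese diacritics readable in the output file.
with open("ocr_result.json", "w", encoding="utf-8") as f:
    json.dump(json_result, f, ensure_ascii=False, indent=2)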

For API inquiries

  • For correspondence regarding this work or to inquire about an API trial, please contact Nguyễn Anh Nguyên at nguyen@erax.ai.

Citation

If you find our project useful, we would appreciate it if you could star our repository and cite our work as follows:

@article{EraX-VL-7B-V1,
  title={EraX-VL-7B-V1: A Highly Efficient Multimodal LLM for Vietnamese, especially for medical forms and bills},
  author={Nguyễn Anh Nguyên and Nguyễn Hồ Nam (BCG) and Hoàng Tiến Dũng and Phạm Đình Thục and Phạm Huỳnh Nhật},
  organization={EraX},
  year={2024},
  url={https://huggingface.co/erax-ai/EraX-VL-7B-V1}
}

Acknowledgement

EraX-VL-7B-V1 is built with reference to the code of the following projects: Qwen2-VL and InternVL, as well as work by Khang Đoàn (5CD-AI). Thanks for their awesome work!
