VILA🌴
Incorporating VIsual LAyout Structures for Scientific Text Classification

Motivation

Scientific papers typically organize their contents in visual groups such as text blocks or lines, and the text within each group usually has the same semantics. We explore different approaches for injecting this group structure into text classifiers, and build models that improve the accuracy or efficiency of the scientific text classification task.

Installation

After cloning the GitHub repo, you can either install the vila library or install only its dependencies:

git clone git@github.com:allenai/VILA.git
cd VILA
conda create -n vila python=3.6
conda activate vila # Activate the new environment before installing
pip install -e . # Option 1: install the `vila` library
pip install -r requirements.txt # Option 2: only install the dependencies

We tested the code and trained the models using Python≥3.6, PyTorch==1.7.1, and transformers==4.4.2.

Usage

Directory Structure

VILA
├─ checkpoints  # For all trained weights 
│  └─ grotoap2  # For each dataset                                 
│     ├─ baseline  # For the experiment type, e.g., baseline, ivila, hvila, ...
│     │  └─ bert-base-uncased  # For the used base model, e.g., bert-base-uncased. 
│     │     ├─ checkpoint-199999                                
│     │     ├─ checkpoint-299999                                 
│     │     ├─ all_results.json                                       
│     │     └─ pytorch_model.bin                         
│     └─ ivila-BLK-row                           
│        └─ microsoft-layoutlm-base-uncased 
└─ data                                       
   ├─ docbank
   ├─ ...
   └─ grotoap2                                 

Note:

  • We will provide the download links to the datasets very soon.
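As a minimal sketch of how the layout above maps to paths: the helper below builds the checkpoint folder for one experiment. The function names are hypothetical and not part of the vila library. Two naming rules are inferred from the example tree and should be treated as assumptions: the `/` in a model name is replaced by `-` (microsoft/layoutlm-base-uncased becomes microsoft-layoutlm-base-uncased), and I-VILA experiments are named ivila-[special-token]-[indicator].

```python
from pathlib import Path


def ivila_experiment_name(special_token: str, indicator: str) -> str:
    """Hypothetical helper: experiment folder name, following `ivila-BLK-row`."""
    return f"ivila-{special_token}-{indicator}"


def checkpoint_dir(root: str, dataset: str, experiment: str, base_model: str) -> Path:
    """Hypothetical helper: checkpoints/<dataset>/<experiment>/<base-model>.

    Assumption: slashes in the base model name are replaced with hyphens,
    as the example tree suggests.
    """
    return Path(root) / "checkpoints" / dataset / experiment / base_model.replace("/", "-")


path = checkpoint_dir(
    "VILA",
    "grotoap2",
    ivila_experiment_name("BLK", "row"),
    "microsoft/layoutlm-base-uncased",
)
print(path.as_posix())
```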

Training

All training scripts are in the ./scripts folder.

  1. Train the baseline models

    cd scripts
    # bash train_baseline.sh [dataset-name] [base-model-name]
    bash train_baseline.sh grotoap2 bert-base-uncased
    bash train_baseline.sh docbank microsoft/layoutlm-base-uncased
    
  2. Train the I-VILA models

    cd scripts
    # bash train_ivila.sh [dataset-name] [how-to-obtain-layout-indicators] [used-special-token] [base-model-name]
    bash train_ivila.sh grotoap2 row BLK microsoft/layoutlm-base-uncased 
      # Row is an alias for textline 
    bash train_ivila.sh docbank block SEP bert-base-uncased
      # We can also use the default special tokens like SEP 
    bash train_ivila.sh s2-vl sentence BLK roberta-base 
      # We can also extract the sentence breaks using spacy and use them as indicators.
    
  3. Train the H-VILA models

    cd tools
    python create_hvila_model_base_weights.py 
    
    cd ../scripts
    # bash train_hvila.sh \
    #  [dataset-name] \
    #  [H-VILA-names] \
    #  [Group-Encoder-Output-Aggregation-Function] \
    #  [How-to-Obtain-Bounding-Box] \
    #  [Use-textline-or-block-as-the-group]
    
    bash train_hvila.sh \
      grotoap2 \
      weak-strong-layoutlm \
      average \
      first \
      row 
    

    In the above example, we use:

    1. the average of the group encoder outputs for all tokens as the group representation
    2. the bounding box of the first token as the group's bounding box
    3. textlines (or rows) to construct the groups
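The options passed to the I-VILA and H-VILA scripts above can be illustrated with a small pure-Python sketch. It is not the library's implementation, and the function names are hypothetical. The first function places a special token at every group boundary, which is our reading of "layout indicators". The bracketed spelling `[BLK]` is an assumption. The second function applies the `average` and `first` choices: it averages per-token vectors within a group and keeps the first token's bounding box.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def insert_layout_indicators(
    words: Sequence[str], group_ids: Sequence[int], indicator: str = "[BLK]"
) -> List[str]:
    """Insert `indicator` wherever the group id changes between two tokens."""
    out: List[str] = []
    for i, word in enumerate(words):
        if i > 0 and group_ids[i] != group_ids[i - 1]:
            out.append(indicator)
        out.append(word)
    return out


def summarize_groups(
    embeddings: Sequence[Sequence[float]],
    bboxes: Sequence[BBox],
    group_ids: Sequence[int],
) -> Dict[int, Tuple[List[float], BBox]]:
    """Return one (average vector, first-token box) pair per group."""
    vectors: Dict[int, List[Sequence[float]]] = {}
    first_box: Dict[int, BBox] = {}
    for vec, box, gid in zip(embeddings, bboxes, group_ids):
        vectors.setdefault(gid, []).append(vec)
        first_box.setdefault(gid, box)  # only the first token's box is kept
    return {
        gid: ([sum(col) / len(vecs) for col in zip(*vecs)], first_box[gid])
        for gid, vecs in vectors.items()
    }


# Toy page: two textlines (rows) with two tokens each.
words = ["Deep", "Learning", "Abstract", "We"]
rows = [0, 0, 1, 1]
tokens_with_indicators = insert_layout_indicators(words, rows)

# Toy 2-d "encoder outputs", standing in for real group encoder outputs.
vecs = [[1.0, 3.0], [3.0, 5.0], [10.0, 0.0], [20.0, 0.0]]
boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 2, 1, 3), (1, 2, 2, 3)]
groups = summarize_groups(vecs, boxes, rows)
```

Using the first token's box keeps the group representation cheap to compute. Whether that is a better choice than, say, the union of all token boxes is something the training options above let you compare.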

Evaluation Toolkit

The evaluation toolkit generates a detailed report of the prediction accuracy (macro F1 scores) and visual layout consistency (group entropy) from the test_predictions.csv prediction files produced by the training scripts.

  1. Generate reports for a group of experiments for a specific dataset

    cd tools
    python generate-eval.py --dataset_name grotoap2 --experiment_name baseline
    # This creates a _reports folder in ../checkpoints/grotoap2/baseline and stores the
    # scores in the report.csv file.

  2. Generate reports for all experiments for a specific dataset

    cd tools
    python generate-eval.py --dataset_name grotoap2
    # This creates reports for all experiments in the ../checkpoints/grotoap2/ folder.
    # It also aggregates all the results and saves them in ../checkpoints/grotoap2/_reports

  3. Generate reports for per-class accuracy

    cd tools
    python generate-eval.py --dataset_name grotoap2 --experiment_name baseline --store_per_class
    # In addition to the report.csv file, this also generates a report_per_class.csv
    # table in the corresponding folder.

Note: the evaluation toolkit might take a long time to run because calculating the group entropy is slow.
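To make the two reported metrics concrete, here is a minimal sketch of both. It is not the toolkit's code. Macro F1 is the unweighted mean of the per-class F1 scores. For group entropy, the sketch assumes the Shannon entropy (base 2) of the predicted labels inside each group, averaged over groups; the toolkit's exact definition (log base, weighting) may differ. Lower entropy means the predictions within a group agree more.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    scores = []
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def mean_group_entropy(y_pred: Sequence[Hashable], group_ids: Sequence[int]) -> float:
    """Average per-group entropy (in bits) of the predicted labels."""
    groups = defaultdict(list)
    for label, gid in zip(y_pred, group_ids):
        groups[gid].append(label)
    entropies = []
    for labels in groups.values():
        total = len(labels)
        probs = [count / total for count in Counter(labels).values()]
        entropies.append(-sum(p * math.log2(p) for p in probs))
    return sum(entropies) / len(entropies) if entropies else 0.0


# Toy example: four tokens in two groups.
y_true = ["title", "title", "body", "body"]
y_pred = ["title", "body", "body", "body"]
group_ids = [0, 0, 1, 1]
f1 = macro_f1(y_true, y_pred)
entropy = mean_group_entropy(y_pred, group_ids)
```

In the toy example the first group receives two different labels and the second group only one, so only the first group contributes to the entropy.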

Model Inference/Prediction

Please refer to the example code below:

import layoutparser as lp # For visualization 

from vila.pdftools.pdf_extractor import PDFExtractor
from vila.predictors import SimplePDFPredictor
# Choose from SimplePDFPredictor,
# LayoutIndicatorPDFPredictor,
# and HierarchicalPDFPredictor

pdf_extractor = PDFExtractor("pdfplumber")
page_tokens, page_images = pdf_extractor.load_tokens_and_image("path-to-your.pdf")

pdf_predictor = SimplePDFPredictor.from_pretrained("path-to-the-trained-weights")

for idx, page_token in enumerate(page_tokens):
    pdf_data = page_token.to_pagedata().to_dict()
    predicted_tokens = pdf_predictor.predict(pdf_data)
    lp.draw_box(page_images[idx], predicted_tokens)
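Besides visualizing the predictions, you may want the predicted text grouped by category. The sketch below assumes that each predicted token exposes `.type` (the predicted label) and `.text` attributes, as a layoutparser TextBlock does. Whether `predict` returns such objects is an assumption to verify against your installed version. `text_by_label` is a hypothetical helper, and the demo uses stand-in objects instead of real predictions.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable


def text_by_label(predicted_tokens: Iterable) -> Dict[str, str]:
    """Join token texts per predicted label, keeping reading order."""
    grouped = defaultdict(list)
    for token in predicted_tokens:
        # Assumption: tokens have `.type` and `.text` attributes.
        grouped[token.type].append(token.text)
    return {label: " ".join(words) for label, words in grouped.items()}


# Stand-ins for predicted tokens; real ones would come from pdf_predictor.predict.
demo_tokens = [
    SimpleNamespace(type="title", text="Incorporating"),
    SimpleNamespace(type="title", text="Visual"),
    SimpleNamespace(type="abstract", text="Scientific"),
    SimpleNamespace(type="abstract", text="papers"),
]
sections = text_by_label(demo_tokens)
```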

Citation

@article{Shen2021IncorporatingVL,
  title={Incorporating Visual Layout Structures for Scientific Text Classification},
  author={Zejiang Shen and Kyle Lo and Lucy Lu Wang and Bailey Kuehl and Daniel S. Weld and Doug Downey},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.00676},
  url={https://arxiv.org/abs/2106.00676}
}
