An academic paper PDF to JSON conversion toolkit.

appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit

appjsonify[^1] is a handy PDF-to-JSON conversion tool for academic papers implemented in Python. appjsonify allows you to obtain a structured JSON file that can be easily used for various downstream tasks such as paper recommendation, information extraction, and information retrieval from papers.
[^1]: Academic Paper PDF jsonify

Requirements

  • Linux or macOS (Not tested on Windows)
  • Python 3.10 or later
  • pdfplumber
  • registrable
  • tqdm
  • pillow
  • pdf2image
  • torch
  • detectron2

    Please install it manually by following the detectron2 installation instructions.

Installation

Prerequisites

If poppler is not installed in your environment, please install it. pdf2image needs it to render PDF pages as images. For more details, refer to the prerequisites in the pdf2image documentation.
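A quick way to check this prerequisite before installing is sketched below. It assumes that poppler is usable when its command-line tools `pdftoppm` or `pdftocairo` (the programs pdf2image calls) are on your PATH. The function name `poppler_available` is only illustrative and is not part of appjsonify.

```python
import shutil


def poppler_available() -> bool:
    """Return True if a poppler rendering tool can be found on PATH."""
    # pdf2image renders pages by calling poppler's pdftoppm or pdftocairo,
    # so finding either executable is a reasonable (not exhaustive) check.
    return any(shutil.which(tool) is not None for tool in ("pdftoppm", "pdftocairo"))


if __name__ == "__main__":
    if poppler_available():
        print("poppler found")
    else:
        print("poppler not found; install it before running appjsonify")
```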

Released version

pip install appjsonify
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

Editable version (Beta)

git clone https://github.com/hitachi-nlp/appjsonify.git
cd appjsonify
python -m pip install --editable .
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
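After installing, it can help to confirm that the packages listed under Requirements are importable. The sketch below is one way to do that with the standard library. The import names are assumptions based on each package's usual top-level module (for example, pillow is imported as `PIL`), and `missing_modules` is a hypothetical helper, not an appjsonify function.

```python
import importlib.util

# Assumed top-level import names for appjsonify and the packages under Requirements.
REQUIRED_MODULES = [
    "appjsonify",
    "pdfplumber",
    "registrable",
    "tqdm",
    "PIL",  # provided by the pillow package
    "pdf2image",
    "torch",
    "detectron2",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```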

Usage

appjsonify offers two options to structure your paper PDF file into a JSON file.

  1. Use the existing templates
    Suitable if a paper adopts the AAAI, ACL, ICML, ICLR, NeurIPS, IEEE, ACM, or Springer styles. See Templates for more details.
  2. Configure pipelines and parameters by yourself
    If a paper does not adopt any of the above styles, you need to specify the processing pipeline and its parameters yourself. Please refer to Build your own pipeline for further information.

Templates

appjsonify provides two templates for each of the following paper types: AAAI, ACL, ICML, ICLR, NeurIPS, IEEE, ACM, and Springer. One is more accurate but slower because it uses machine-learning-based models; the other is faster but less accurate because it takes a rule-based approach.

AAAI papers

Better performance but slower

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type AAAI

If your environment has one or more GPUs, also specify --detectron_device_mode cuda to speed up processing.

Faster but a bit noisy

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type AAAI2

ACL papers

Better performance but slower

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ACL

Faster but a bit noisy

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ACL2

ICML papers

Better performance but slower

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ICML

Faster but a bit noisy

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ICML2

ICLR papers

Better performance but slower

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ICLR

Faster but a bit noisy

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ICLR2

NeurIPS papers

Better performance but slower

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type NeurIPS

Faster but a bit noisy

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type NeurIPS2

IEEE papers

Currently only tested with IEEE BigData papers.

Better performance but slower

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type IEEE

Faster but a bit noisy

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type IEEE2

ACM papers

Currently only tested with TALLIP papers.

Better performance but slower

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ACM

Faster but a bit noisy

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ACM2

Springer papers

Better performance but slower

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type Springer

Faster but a bit noisy

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type Springer2
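The template commands above share one shape: the faster variant uses the paper type with a trailing 2. A minimal sketch of a helper that assembles such a command from Python follows. The name `build_command` and its parameters are illustrative only. Adding --detectron_device_mode only for the slower, model-based templates is an assumption, since the text above mentions that flag only for that variant.

```python
import shlex

# Paper types listed in the Templates section.
PAPER_TYPES = ("AAAI", "ACL", "ICML", "ICLR", "NeurIPS", "IEEE", "ACM", "Springer")


def build_command(pdf_path, output_dir, paper_type, fast=False, use_gpu=False):
    """Build the appjsonify argument list for one of the existing templates."""
    if paper_type not in PAPER_TYPES:
        raise ValueError(f"unsupported paper type: {paper_type}")
    # The faster, rule-based template is selected by appending "2".
    template = paper_type + "2" if fast else paper_type
    cmd = ["appjsonify", pdf_path, output_dir, "--paper_type", template]
    # Assumption: the device flag only matters for the model-based template.
    if use_gpu and not fast:
        cmd += ["--detectron_device_mode", "cuda"]
    return cmd


if __name__ == "__main__":
    cmd = build_command("paper.pdf", "out", "ACL", fast=True)
    print(shlex.join(cmd))  # prints: appjsonify paper.pdf out --paper_type ACL2
    # To actually run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps paths with spaces intact when the command is passed to subprocess.run.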

Useful parameters

  • --verbose: Set this flag to inspect the intermediate processing results. The log files are saved under output_dir. Optionally, combine it with the following flags to add the corresponding information.
    • --show_pos: Bounding box information.
    • --show_font: Font name and size information.
    • --show_style: Style information (e.g., section, body, or abstract).
    • --show_meta: Supplementary information (e.g., information on objects and footnotes).
    • --insert_page_break: Insert breaks between pages.
  • --save_image: Set this flag to save the detected table and figure images. It applies when you use the more accurate but slower templates or load_objects_with_ml. Also specify the output directory for the images with --output_image_dir.
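The flags above can be combined on one command line. The sketch below assembles them in Python, using only the flags named in this section. `optional_flags` and its parameters are hypothetical names, and the paths are placeholders. It pairs --save_image with --output_image_dir and adds --verbose whenever a logging option is requested, following the description above.

```python
SHOW_OPTIONS = ("pos", "font", "style", "meta")


def optional_flags(verbose=False, show=(), insert_page_break=False, image_dir=None):
    """Return the optional appjsonify flags for logging and image saving."""
    flags = []
    # The --show_* and --insert_page_break flags add detail to the verbose logs.
    if verbose or show or insert_page_break:
        flags.append("--verbose")
    for name in show:
        if name not in SHOW_OPTIONS:
            raise ValueError(f"unknown show option: {name}")
        flags.append(f"--show_{name}")
    if insert_page_break:
        flags.append("--insert_page_break")
    # --save_image needs a destination, given with --output_image_dir.
    if image_dir is not None:
        flags += ["--save_image", "--output_image_dir", image_dir]
    return flags


if __name__ == "__main__":
    base = ["appjsonify", "paper.pdf", "out", "--paper_type", "ACL"]
    print(" ".join(base + optional_flags(show=("pos", "font"), image_dir="out/images")))
    # prints: appjsonify paper.pdf out --paper_type ACL --verbose --show_pos --show_font --save_image --output_image_dir out/images
```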

Build your own pipeline

appjsonify also allows users to build their own academic paper PDF-to-JSON processing pipeline. For more details, please refer to Available Modules and Document Handling in appjsonify.

How to add your own module

Users can add their own modules to appjsonify for more flexible document processing. To add modules, appjsonify must be installed in editable mode. See Customize appjsonify for more details and feel free to make a PR if you wish to add your module to this repository and package!

Contributing and Future Work

Contributions are more than welcome! Feel free to raise an issue and/or make a PR. Possible future work is as follows:

  • Better documentation
  • More paper templates
  • More robust reference extraction
  • Powerful mathematical equation support
  • Robust algorithm description detection
  • Multilingual support
  • Add more test scripts

Citation

If you use appjsonify in your work, please cite the following.

@article{Yamaguchi2023appjsonify,
  title={appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit},
  author={Atsuki Yamaguchi and Terufumi Morishita},
  year={2023}
}

License

© 2023 Atsuki Yamaguchi and Terufumi Morishita (Hitachi, Ltd.)

This work is licensed under the MIT license unless otherwise specified.

appjsonify uses the following publicly available works.
