An academic paper PDF to JSON conversion toolkit.
appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit
appjsonify[^1] is a handy PDF-to-JSON conversion tool for academic papers, implemented in Python.
appjsonify allows you to obtain a structured JSON file that can be easily used for various downstream tasks such as paper recommendation, information extraction, and information retrieval from papers.
[^1]: Academic Paper PDF jsonify
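As a rough illustration of how the JSON output could feed a downstream task, the sketch below loads every JSON file from an output directory. It assumes that the output directory contains one .json file per converted paper; the function name load_converted_papers is hypothetical, and no particular JSON schema is assumed.

```python
import json
from pathlib import Path


def load_converted_papers(output_dir):
    """Load every *.json file directly under output_dir, keyed by file stem.

    Assumption: the output directory holds one JSON file per paper.
    The content is returned as parsed by json.load, without assuming
    any schema.
    """
    papers = {}
    for json_path in sorted(Path(output_dir).glob("*.json")):
        with json_path.open(encoding="utf-8") as f:
            papers[json_path.stem] = json.load(f)
    return papers
```

The returned dictionary can then be passed to whatever recommendation, extraction, or retrieval code you use.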
Requirements
- Linux or macOS (Not tested on Windows)
- Python 3.10 or later
- pdfplumber
- registrable
- tqdm
- pillow
- pdf2image
- torch
- detectron2 (please install it manually, following its installation instructions)
Installation
Prerequisites
If your environment does not have poppler, please install it.
This is necessary to obtain PDF images using pdf2image.
For more details, refer to Prerequisites.
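One way to check this prerequisite before installing is to look for the poppler command-line tools on your PATH. The following is a minimal sketch, assuming that pdftoppm and pdfinfo (both shipped with poppler and called by pdf2image) are the tools to look for; the function names are hypothetical helpers, not part of appjsonify.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Map each tool name to its full path on PATH, or None if not found."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Return the names of the tools that could not be found on PATH."""
    return [tool for tool, path in find_poppler_tools(tools).items() if path is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("poppler appears to be installed.")
```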
Released version
pip install appjsonify
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
Editable version (Beta)
git clone https://github.com/hitachi-nlp/appjsonify.git
cd appjsonify
python -m pip install --editable .
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
Usage
appjsonify offers two options to structure your paper PDF file into a JSON file.
- Use the existing templates
  Suitable if a paper adopts the AAAI, ACL, ICML, ICLR, NeurIPS, IEEE, ACM, or Springer styles. See Templates for more details.
- Configure pipelines and parameters by yourself
  If a paper does not adopt any of the above formats, you need to specify the processing pipeline and its parameters. Please refer to Build your own pipeline for further information.
Templates
appjsonify provides two types of templates for each of the following paper types: AAAI, ACL, ICML, ICLR, NeurIPS, IEEE, ACM, and Springer.
One is more accurate but slower because it uses machine-learning-based models; the other is less accurate but faster because it takes a rule-based approach.
AAAI papers
Better performance but slower
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type AAAI
If your environment has one or more GPUs, it is better to also specify --detectron_device_mode cuda to speed up the process.
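If you call appjsonify from a script, the GPU option can be added only when a GPU is actually available. The sketch below is one possible way to do that, assuming PyTorch's torch.cuda.is_available() is an adequate check; detectron_device_args and build_command are hypothetical helpers, and only the flag and value shown above are used.

```python
def detectron_device_args():
    """Return the GPU option when PyTorch reports a usable CUDA device.

    Returns an empty list otherwise, so appjsonify keeps its default.
    """
    try:
        import torch  # imported lazily so the helper works without torch
    except ImportError:
        return []
    if torch.cuda.is_available():
        return ["--detectron_device_mode", "cuda"]
    return []


def build_command(input_path, output_dir, paper_type, extra_args=()):
    """Assemble the appjsonify command line as a list of arguments."""
    return [
        "appjsonify",
        str(input_path),
        str(output_dir),
        "--paper_type",
        paper_type,
        *extra_args,
    ]
```

The resulting list can be passed to subprocess.run, for example subprocess.run(build_command(pdf, out, "AAAI", detectron_device_args()), check=True).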
Faster but a bit noisy
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type AAAI2
ACL papers
Better performance but slower
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ACL
Faster but a bit noisy
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ACL2
ICML papers
Better performance but slower
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ICML
Faster but a bit noisy
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ICML2
ICLR papers
Better performance but slower
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ICLR
Faster but a bit noisy
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ICLR2
NeurIPS papers
Better performance but slower
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type NeurIPS
Faster but a bit noisy
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type NeurIPS2
IEEE papers
Currently only tested with IEEE BigData papers.
Better performance but slower
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type IEEE
Faster but a bit noisy
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type IEEE2
ACM papers
Currently only tested with TALLIP papers.
Better performance but slower
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ACM
Faster but a bit noisy
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type ACM2
Springer papers
Better performance but slower
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type Springer
Faster but a bit noisy
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir --paper_type Springer2
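The commands above follow a simple naming pattern: the style name selects the more accurate template, and the style name followed by 2 selects the faster one. A minimal sketch of that mapping is shown below; the helper paper_type_for and the constant STYLES are hypothetical and only encode the values listed in this section.

```python
# Styles that have templates, as listed in this section.
STYLES = ("AAAI", "ACL", "ICML", "ICLR", "NeurIPS", "IEEE", "ACM", "Springer")


def paper_type_for(style, fast=False):
    """Return the --paper_type value for a style.

    fast=False selects the more accurate but slower template (e.g. "ACL");
    fast=True selects the faster but noisier one (e.g. "ACL2").
    """
    if style not in STYLES:
        raise ValueError(f"No template for style {style!r}; expected one of {STYLES}")
    return f"{style}2" if fast else style
```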
Useful parameters
- --verbose: If you want to check the intermediate processing results, please set this flag. The log files will be saved under output_dir. Optionally, you can use the following four flags to add the corresponding information.
  - --show_pos: Bounding box information.
  - --show_font: Font name and size information.
  - --show_style: Style information (e.g., section, body, abstract).
  - --show_meta: Supplementary information (e.g., information on objects and footnotes).
  - --insert_page_break: Insert breaks between pages.
- --save_image: If you are using the more accurate but slower version of the templates or load_objects_with_ml, appjsonify can save detected table and figure images when this flag is set. In addition, please specify the output directory path with --output_image_dir.
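When driving appjsonify from a script, the flags above can be assembled programmatically. The following is a minimal sketch; verbose_flags and image_flags are hypothetical helpers, and the sketch assumes that the --show_* flags and --insert_page_break are meant to be combined with --verbose and that --save_image is always paired with --output_image_dir.

```python
def verbose_flags(show_pos=False, show_font=False, show_style=False,
                  show_meta=False, insert_page_break=False):
    """Return --verbose plus any requested optional logging flags."""
    flags = ["--verbose"]
    optional = [
        (show_pos, "--show_pos"),
        (show_font, "--show_font"),
        (show_style, "--show_style"),
        (show_meta, "--show_meta"),
        (insert_page_break, "--insert_page_break"),
    ]
    flags.extend(name for enabled, name in optional if enabled)
    return flags


def image_flags(output_image_dir=None):
    """Return the image-saving flags, or an empty list if no directory is given."""
    if output_image_dir is None:
        return []
    return ["--save_image", "--output_image_dir", str(output_image_dir)]
```

Both helpers return plain lists, so they can be appended to an argument list that starts with appjsonify, the input path, the output directory, and --paper_type.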
Build your own pipeline
appjsonify also allows users to build their own academic paper PDF-to-JSON processing pipeline.
For more details, please refer to Available Modules and Document Handling in appjsonify.
How to add your own module
Users can add their own modules to appjsonify for more flexible document processing.
To add modules, appjsonify must be installed in editable mode.
See Customize appjsonify for more details, and feel free to make a PR if you wish to add your module to this repository and package!
Contributing and Future Work
Contributions are more than welcome! Feel free to raise an issue and/or make a PR. Possible future work is as follows:
- Better documentation
- More paper templates
- More robust reference extraction
- Powerful mathematical equation support
- Robust algorithm description detection
- Multilingual support
- Add more test scripts
Citation
If you use appjsonify in your work, please cite the following.
@article{Yamaguchi2023appjsonify,
title={appjsonify: An Academic Paper PDF-to-JSON Conversion Toolkit},
author={Atsuki Yamaguchi and Terufumi Morishita},
year={2023}
}
License
© 2023 Atsuki Yamaguchi and Terufumi Morishita (Hitachi, Ltd.)
This work is licensed under the MIT license unless otherwise specified.
appjsonify uses the following publicly available works.
- pdfplumber by Jeremy Singer-Vine (MIT License).
- registrable by epwalsh (Apache License 2.0).
- tqdm (MIT License, Mozilla Public License 2.0 (MPL 2.0)).
- pillow by Jeffrey A. Clark (Historical Permission Notice and Disclaimer License).
- pdf2image by Edouard Belval (MIT License).
- torch (BSD-style license).
- Detectron2 by Facebook AI Research (Apache License 2.0) in detectron2_demo.
- DocBank pretrained model by Microsoft Research Asia (Apache License 2.0) in docbank.py.
- TableBank pretrained model by Microsoft Research Asia (Apache License 2.0) in tablebank.py.
- PubLayNet pretrained model by hpanwar08 (Apache License 2.0) in publaynet.py.