Logical structure analysis of visually structured documents.
pdf-struct: Logical structure analysis for visually structured documents
This is a tool for extracting fine-grained logical structures (such as boundaries and their hierarchies) from visually structured documents (VSDs) such as PDFs. pdf-struct is easily customizable to different types of VSDs, and it significantly outperformed baselines in identifying different structures in VSDs. For example, our system obtained a paragraph boundary detection F1 score of 0.953, which is significantly better than a popular PDF-to-text tool with an F1 score of 0.739. Please note that the current version of pdf-struct has several limitations:
- It is intended for single-column documents. It does not support multi-column documents.
- Published models are trained on contracts. It may work on general documents, but it has not been tested. Nevertheless, you can train your own model using a corpus of your choice.
Details of pdf-struct can be found in our paper, which was published at the "Natural Legal Language Processing Workshop 2021". You can find the dataset for reproducing the paper here.
Basic Usage
This program runs on Python 3 (tested on 3.8.5). Install pdf-struct:
pip install pdf-struct
pdf-struct predict --model PDFContractEnFeatureExtractor ${PATH_TO_PDF_FILE}
You may choose a pretrained model from https://github.com/stanfordnlp/pdf-struct-models
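If you want to call the command above from a Python script, a minimal sketch is to assemble the argument list and hand it to `subprocess.run`. The helper names `build_predict_command` and `run_predict` are hypothetical and not part of pdf-struct; only the `pdf-struct predict --model ...` invocation comes from the text above.

```python
import shlex
import subprocess
from typing import List


def build_predict_command(pdf_path: str,
                          model: str = "PDFContractEnFeatureExtractor") -> List[str]:
    """Assemble the `pdf-struct predict` invocation shown above as an argv list."""
    return ["pdf-struct", "predict", "--model", model, pdf_path]


def run_predict(pdf_path: str) -> str:
    """Run the command and return its standard output (requires pdf-struct on PATH)."""
    result = subprocess.run(build_predict_command(pdf_path),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only print the command, so the sketch also works without pdf-struct installed.
    print(shlex.join(build_predict_command("contract.pdf")))
```

Passing the arguments as a list avoids shell quoting problems when the PDF path contains spaces.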
Advanced Usage
This section explains the way to create your own dataset and to train your own models.
Prerequisite
To install dependencies, run:
pip install -r requirements.txt
Getting data ready
First, place your raw documents in a directory of your choice. They must have one of the following extensions:
- *.pdf: PDF files with embedded text. PDFs without embedded text (i.e. those that require OCR) and two-column PDFs are not supported.
- *.txt: Plain text files that are visually structured with spaces/line breaks.
- *.hocr: HOCR files.
You may handle HTML files by turning them into PDF files:
# Convert every HTML file under my_input_directory/ to a PDF in data/raw/.
find my_input_directory/ -type f | \
grep -P 'html$|htm$|HTML$|HTM$' | \
while IFS= read -r f; do \
chrome --headless --disable-gpu --print-to-pdf-no-header --print-to-pdf="data/raw/$(basename "$f").pdf" "$f"; \
done
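Before running the conversion, it can help to check which files the loop will pick up and where each PDF will be written. The following is a rough Python sketch of that file selection only; `plan_html_conversions` is a hypothetical helper, and unlike the `grep` pattern above it matches the extension case-insensitively and requires the leading dot.

```python
from pathlib import Path
from typing import List, Tuple

HTML_SUFFIXES = {".html", ".htm"}


def plan_html_conversions(input_dir: str,
                          output_dir: str = "data/raw") -> List[Tuple[Path, Path]]:
    """Pair each HTML file under input_dir with the PDF path the loop would write."""
    pairs = []
    for path in sorted(Path(input_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in HTML_SUFFIXES:
            # Same naming as the shell loop: the original file name plus ".pdf".
            pairs.append((path, Path(output_dir) / (path.name + ".pdf")))
    return pairs
```

Because only the base name is kept, two HTML files with the same name in different subdirectories would map to the same PDF path, so it is worth scanning the plan for duplicates first.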
Creating TSV files for annotation
Create TSV files for annotation:
pdf-struct init-dataset ${FILE_TYPE} ${RAW_DOCUMENTS_DIR} ${OUTPUT_DIR}
where ${FILE_TYPE} should be one of pdf, txt or hocr.
This will output TSV files to ${OUTPUT_DIR}.
Annotating TSV files
Annotate the TSV files that were generated with the init-dataset command.
Each line of a TSV file is organized as follows:
text<tab>pointer<tab>label
text is the text extracted from the input document. It should roughly correspond to a line in the document.
label (default empty) denotes the transition relationship between that line and the next line. It should be one of the following:
- c(ontinuous): The next line is part of the same paragraph
- a(ddress): The next line is part of the same paragraph BUT the line break is meaningful. This is intended for things like addresses.
- b(lock): The next line is the start of a new paragraph BUT within the same "block"
- s(ame level): The next line is the start of a new block (thus a new paragraph)
- d(rop): The next line is the start of a new block (thus a new paragraph) that is a child of the current block
- e(liminate): The current line should be deleted
- x (excluded): The current line should be excluded from both training and evaluation
  - In our experiments, we excluded things like temporal two-column lines, signatures, titles etc.
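To make the file format concrete, here is a small sketch of how one annotated line could be read and checked against the label list above. `Row`, `LABELS` and `parse_row` are hypothetical names for illustration, not pdf-struct's own API, and the pointer column is only converted to an integer here.

```python
from typing import NamedTuple

# One-letter labels from the list above; an empty label is the un-annotated default.
LABELS = {"c", "a", "b", "s", "d", "e", "x"}


class Row(NamedTuple):
    text: str
    pointer: int
    label: str


def parse_row(line: str) -> Row:
    """Split one `text<tab>pointer<tab>label` line and validate its label."""
    # rsplit takes the last two columns, so stray tabs inside the text survive.
    text, pointer, label = line.rstrip("\n").rsplit("\t", 2)
    if label and label not in LABELS:
        raise ValueError(f"unknown label {label!r}")
    return Row(text, int(pointer), label)
```

A line with fewer than three columns, or a pointer that is not an integer, raises `ValueError`, which makes annotation slips easy to spot before training.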
In the annotation, we introduced the concept of a block. This is intended for cases where we want to distinguish listings and paragraphs.
e.g.,
Each party must:
1. Blah blah blah ....
blah blah blah....
Blah blah blah....
blah blah blah....
2. Blah blah blah...
Here, the new paragraph within 1. (the second capitalized "Blah blah blah....") is definitely meaningful, and it should not be treated in the same way as the start of 2.
We say that the relationship between the last line of the first paragraph within 1. and the first line of the second paragraph (i.e. the label for the former line) is b.
That being said, we currently treat the b and s labels in the same way. In fact, some other labels are also merged in training/evaluation:
- c and a will be merged
- b and s will be merged
- c, b, d and s with a pointer are merged
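The merging rules above can be sketched as a small mapping function. This is an illustration under assumptions, not pdf-struct's implementation: the class names `"up"`, `"continuous"` and `"same_level"` are invented here, and a pointer of 0 is taken to mean "no pointer", as stated in the description of pointers.

```python
def merge_label(label: str, pointer: int) -> str:
    """Collapse an annotated (label, pointer) pair into a coarser class."""
    if pointer != 0 and label in {"c", "b", "d", "s"}:
        return "up"          # hypothetical name: c, b, d or s carrying a pointer
    if label in {"c", "a"}:
        return "continuous"  # hypothetical name: c and a merged
    if label in {"b", "s"}:
        return "same_level"  # hypothetical name: b and s merged
    return label             # d, e and x are left unchanged in this sketch
```

The pointer check comes first because a pointer-carrying c or s would otherwise be absorbed into one of the other two merged classes.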
pointer (default 0) is introduced when the hierarchy goes up. It should be used along with c, b, d or s.
We use pointer along with different labels because there are occasions where the hierarchy rises AND the next line is either a continuation of a paragraph or a new paragraph.
e.g.,
Blah blah blah...:<tab>0<tab>d
a. Blah blah blah...<tab>0<tab>s
b. Blah blah blah...<tab>-1<tab>s
Blah blah blah...:<tab>0<tab>d
1. Blah blah blah...<tab>0<tab>d
a) Blah blah blah...<tab>0<tab>c
blah blah blah...<tab>0<tab>s
b) Blah blah blah...<tab>5<tab>c
but this does not include ...<tab>5<tab>s
PAGE 1/2<tab>0<tab>e
2. Blah blah blah...<tab>0<tab>d
As you can see, the eighth line uses a pointer along with c because the ninth line is actually a continuation of the paragraph from the fifth line.
Pointers are 1-indexed (they start from 1) and 0 denotes no pointer.
A pointer can be set to -1 to return to the topmost level of the hierarchy.
The last line should be annotated with pointer -1 and label s (though it is ignored internally).
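One way to sanity-check annotations is to turn them into a nesting depth per line and compare that with the visual indentation of the document. The sketch below is one possible reading of the rules above, not pdf-struct's implementation. It assumes that lines labelled e or x take no part in any transition, that a positive pointer refers to a kept line at or before the current one, and that only d increases the depth. `nesting_depths` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Annotation = Tuple[int, str]  # (pointer, label) of one TSV line


def nesting_depths(rows: List[Annotation]) -> List[Optional[int]]:
    """Return a depth per line; None marks lines labelled e or x."""
    depths: List[Optional[int]] = []
    next_depth = 0  # depth that the next kept line will receive
    for pointer, label in rows:
        if label in {"e", "x"}:
            # Assumption: dropped lines are skipped, the pending transition stays.
            depths.append(None)
            continue
        depths.append(next_depth)
        if pointer == -1:
            base = 0  # return to the topmost level
        elif pointer > 0:
            # Pointers are 1-indexed and must hit an already seen, kept line.
            if pointer > len(depths) or depths[pointer - 1] is None:
                raise ValueError(f"invalid pointer {pointer}")
            base = depths[pointer - 1]
        else:
            base = next_depth  # no pointer: relative to the current line
        # d makes the next line a child; every other label keeps the base depth.
        next_depth = base + 1 if label == "d" else base
    return depths


# (pointer, label) pairs of the eleven-line example above.
example = [(0, "d"), (0, "s"), (-1, "s"), (0, "d"), (0, "d"), (0, "c"),
           (0, "s"), (5, "c"), (5, "s"), (0, "e"), (0, "d")]
print(nesting_depths(example))
```

Under these assumptions, the ninth line lands at the same depth as the fifth line, and the line after the deleted page marker becomes a sibling of the fifth line.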
Evaluating models
You can run experiments with the following command:
pdf-struct evaluate ${FILE_TYPE} ${FEATURE_EXTRACTOR_NAME} ${RAW_DOCUMENTS_DIR} ${ANNOTATED_DOCUMENTS_DIR}
Refer to pdf-struct evaluate --help for the list of feature extractors.
This will run k-fold cross-validation over the data.
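For readers unfamiliar with k-fold cross-validation, the idea is that the data is split into k parts and each part is used once for testing while the rest is used for training. The sketch below illustrates this on a list of document names. It assumes the split is made per document and uses a simple round-robin assignment; how pdf-struct actually splits the data, and which k it uses, is not stated here. `k_fold_splits` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def k_fold_splits(documents: Sequence[str],
                  k: int) -> List[Tuple[List[str], List[str]]]:
    """Return k (train, test) pairs; every document is tested exactly once."""
    if not 2 <= k <= len(documents):
        raise ValueError("k must be between 2 and the number of documents")
    # Round-robin assignment of documents to folds.
    folds = [list(documents[i::k]) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        splits.append((train, test))
    return splits
```

Splitting by document rather than by line keeps all lines of one document on the same side of the split, so a model is never tested on a document it has partly seen.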
Training models
You can train a new model on your dataset.
pdf-struct train ${FILE_TYPE} ${FEATURE_EXTRACTOR_NAME} ${RAW_DOCUMENTS_DIR} ${ANNOTATED_DOCUMENTS_DIR} ${MODEL_OUTPUT_PATH}
You can then feed ${MODEL_OUTPUT_PATH} to the --path option of pdf-struct predict.
Customizing feature extractor
Coming soon!