Use spaCy with PDFs, Word docs and other documents
spaCy Layout: Process PDFs, Word documents and more with spaCy
This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy's familiar Doc objects that let you access labelled text spans like sections, headings, or footnotes.
This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing chunking for RAG pipelines.
📝 Usage
⚠️ This package requires Python 3.10 or above.
pip install spacy-layout
After initializing the spaCyLayout preprocessor with an nlp object for tokenization, you can call it on a document path to convert it to structured data. The resulting Doc object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.
import spacy
from spacy_layout import spaCyLayout
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")
# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)
# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)
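For RAG-style chunking, one option is to group layout spans under the heading that precedes them. The sketch below is an illustration and not part of spacy-layout: `chunk_by_heading` and the `FakeSpan` stand-in are hypothetical names. The function only assumes objects with `label_` and `text` attributes, so it should also accept the spans in `doc.spans["layout"]`.

```python
from dataclasses import dataclass

# Default heading labels, taken from the documented `headings` argument
HEADING_LABELS = ("section_header", "page_header", "title")


@dataclass
class FakeSpan:
    """Hypothetical stand-in exposing the two span attributes used below."""
    label_: str
    text: str


def chunk_by_heading(spans, heading_labels=HEADING_LABELS):
    """Group span texts under the most recent heading span."""
    chunks = []
    current = {"heading": None, "texts": []}
    for span in spans:
        if span.label_ in heading_labels:
            # A new heading closes the previous chunk, if it has any content
            if current["heading"] is not None or current["texts"]:
                chunks.append(current)
            current = {"heading": span.text, "texts": []}
        else:
            current["texts"].append(span.text)
    if current["heading"] is not None or current["texts"]:
        chunks.append(current)
    return chunks


spans = [
    FakeSpan("title", "StarCraft"),
    FakeSpan("text", "A real-time strategy game."),
    FakeSpan("section_header", "Gameplay"),
    FakeSpan("text", "Players gather resources."),
    FakeSpan("list_item", "Build units."),
]
chunks = chunk_by_heading(spans)
for chunk in chunks:
    print(chunk["heading"], "->", " ".join(chunk["texts"]))
```

With a real document, `span._.heading` may be a simpler way to find the heading for a single span. The grouping above is useful when you want one chunk per section.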
If you need to process larger volumes of documents at scale, you can use the spaCyLayout.pipe method, which takes an iterable of paths or bytes instead and yields Doc objects:
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
for doc in layout.pipe(paths):
    print(doc._.layout)
After you've processed the documents, you can serialize the structured Doc objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.
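As a rough sketch of that round trip, the example below uses spaCy's `DocBin` with a hand-built Doc standing in for the output of `layout(...)`, so it runs without converting a real file. The sample text and the manually created "layout" span are assumptions made for illustration.

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# Stand-in for a Doc returned by layout(...): a plain Doc with a "layout" span group
doc = nlp("StarCraft is a real-time strategy game.")
doc.spans["layout"] = [Span(doc, 0, 1, label="title")]

# store_user_data=True also serializes doc.user_data, where spaCy keeps
# the values of custom extension attributes
doc_bin = DocBin(docs=[doc], store_user_data=True)
data = doc_bin.to_bytes()  # doc_bin.to_disk("./file.spacy") writes a file instead

# Restore the Docs later without re-running the conversion
restored = list(DocBin(store_user_data=True).from_bytes(data).get_docs(nlp.vocab))
print(restored[0].spans["layout"][0].label_)
```

Custom extension attributes generally need to be registered again before they can be read from the restored Doc objects.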
spaCy also allows you to call the nlp object on an already created Doc, so you can easily apply a pipeline of components for linguistic analysis or named entity recognition, use rule-based matching or anything else you can do with spaCy.
# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)
🎛️ API
Data and extension attributes
layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)
Attribute | Type | Description |
---|---|---|
Doc._.layout | DocLayout | Layout features of the document. |
Doc._.pages | list[tuple[PageLayout, list[Span]]] | Pages in the document and the spans they contain. |
Doc.spans["layout"] | spacy.tokens.SpanGroup | The layout spans in the document. |
Span.label_ | str | The type of the extracted layout span, e.g. "text" or "section_header". See Docling's documentation for the available labels. |
Span.label | int | The integer ID of the span label. |
Span.id | int | Running index of the layout span. |
Span._.layout | SpanLayout | Layout features of a layout span. |
Span._.heading | Span \| None | Closest heading to a span, if available. |
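To illustrate the shape of `Doc._.pages`, here is a small sketch that counts span labels per page. `labels_per_page` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only carry the documented `page_no` and `label_` attributes.

```python
from collections import Counter
from types import SimpleNamespace


def labels_per_page(pages):
    """Map each page number to a Counter of the span labels on that page."""
    return {
        page.page_no: Counter(span.label_ for span in spans)
        for page, spans in pages
    }


# Hypothetical stand-in shaped like Doc._.pages: list[tuple[PageLayout, list[Span]]]
pages = [
    (
        SimpleNamespace(page_no=1),
        [SimpleNamespace(label_="title"), SimpleNamespace(label_="text")],
    ),
    (
        SimpleNamespace(page_no=2),
        [SimpleNamespace(label_="text"), SimpleNamespace(label_="text")],
    ),
]
counts = labels_per_page(pages)
print(counts)
```

With a processed document, passing `doc._.pages` instead of the stand-in list should give the same kind of summary.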
dataclass PageLayout
Attribute | Type | Description |
---|---|---|
page_no | int | The page number (1-indexed). |
width | float | Page width in pixels. |
height | float | Page height in pixels. |
dataclass DocLayout
Attribute | Type | Description |
---|---|---|
pages | list[PageLayout] | The pages in the document. |
dataclass SpanLayout
Attribute | Type | Description |
---|---|---|
x | float | Horizontal offset of the bounding box in pixels. |
y | float | Vertical offset of the bounding box in pixels. |
width | float | Width of the bounding box in pixels. |
height | float | Height of the bounding box in pixels. |
page_no | int | Number of the page the span is on. |
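Because both the page size and the bounding box are given in pixels, a box can be expressed relative to its page, which helps when comparing documents with different page sizes. The following is a minimal sketch: `PageLayoutStub` and `SpanLayoutStub` are local stand-ins that mirror the documented fields, and `relative_box` is a hypothetical helper, not part of the package.

```python
from dataclasses import dataclass


@dataclass
class PageLayoutStub:
    """Stand-in with the documented PageLayout fields."""
    page_no: int
    width: float
    height: float


@dataclass
class SpanLayoutStub:
    """Stand-in with the documented SpanLayout fields."""
    x: float
    y: float
    width: float
    height: float
    page_no: int


def relative_box(span_layout, page_layout):
    """Return (x, y, width, height) as fractions of the page size."""
    if span_layout.page_no != page_layout.page_no:
        raise ValueError("span and page refer to different pages")
    return (
        span_layout.x / page_layout.width,
        span_layout.y / page_layout.height,
        span_layout.width / page_layout.width,
        span_layout.height / page_layout.height,
    )


page = PageLayoutStub(page_no=1, width=600.0, height=800.0)
box = SpanLayoutStub(x=60.0, y=200.0, width=300.0, height=40.0, page_no=1)
print(relative_box(box, page))
```

The helper only reads attributes, so it should work the same way with `span._.layout` and the matching entry in `doc._.layout.pages`.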
class spaCyLayout
method spaCyLayout.__init__
Initialize the document processor.
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
Argument | Type | Description |
---|---|---|
nlp | spacy.language.Language | The initialized nlp object to use for tokenization. |
separator | str | Token used to separate sections in the created Doc object. The separator won't be part of the layout span. If None, no separator will be added. Defaults to "\n\n". |
attrs | dict[str, str] | Override the custom spaCy attributes. Can include "doc_layout", "doc_pages", "span_layout", "span_heading" and "span_group". |
headings | list[str] | Labels of headings to consider for Span._.heading detection. Defaults to ["section_header", "page_header", "title"]. |
docling_options | dict[InputFormat, FormatOption] | Format options passed to Docling's DocumentConverter. |
RETURNS | spaCyLayout | The initialized object. |
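To make the role of the separator concrete, the sketch below joins section texts and records character offsets that exclude the separator. It is a plain-Python illustration of the idea, not the library's implementation: `join_sections` is a hypothetical name, and concatenating the texts directly when the separator is None is an assumption.

```python
def join_sections(sections, separator="\n\n"):
    """Join section texts and return the text plus (start_char, end_char) per section."""
    parts = []
    offsets = []
    cursor = 0
    for i, section in enumerate(sections):
        # The separator sits between sections and is skipped by the offsets
        if i > 0 and separator is not None:
            parts.append(separator)
            cursor += len(separator)
        offsets.append((cursor, cursor + len(section)))
        parts.append(section)
        cursor += len(section)
    return "".join(parts), offsets


text, offsets = join_sections(["Title", "Body text"])
for start, end in offsets:
    print(repr(text[start:end]))
```

Slicing the joined text with each offset pair returns the section text without the separator, which mirrors how the layout spans are described above.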
method spaCyLayout.__call__
Process a document and create a spaCy Doc object containing the text content and layout spans, available via Doc.spans["layout"] by default.
layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
Argument | Type | Description |
---|---|---|
source | str \| Path \| bytes | Path of the document to process, or its contents as bytes. |
RETURNS | Doc | The processed spaCy Doc object. |
method spaCyLayout.pipe
Process multiple documents and create spaCy Doc objects. You should use this method if you're processing larger volumes of documents at scale.
layout = spaCyLayout(nlp)
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
docs = layout.pipe(paths)
Argument | Type | Description |
---|---|---|
sources | Iterable[str \| Path \| bytes] | Paths of the documents to process, or their contents as bytes. |
YIELDS | Doc | The processed spaCy Doc object. |
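Since `pipe` accepts any iterable of paths, the input can be produced lazily. The helper below is a hypothetical sketch using only `pathlib`. The function name, the default suffixes and the directory name in the usage comment are assumptions.

```python
from pathlib import Path


def iter_document_paths(root, suffixes=(".pdf", ".docx")):
    """Yield files under root whose suffix matches, in sorted order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path


# Usage with an initialized spaCyLayout object (not run here):
# for doc in layout.pipe(iter_document_paths("./documents")):
#     print(doc._.layout)
```

Sorting keeps the processing order stable between runs, which makes it easier to match the yielded Doc objects back to their source files.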