
Use spaCy with PDFs, Word docs and other documents


spaCy Layout: Process PDFs, Word documents and more with spaCy

This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy's familiar Doc objects, which let you access labelled text spans like sections, headings, or footnotes.

This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing chunking for RAG pipelines.


📝 Usage

⚠️ This package requires Python 3.10 or above.

pip install spacy-layout

After initializing the spaCyLayout preprocessor with an nlp object for tokenization, you can call it on a document path to convert it to structured data. The resulting Doc object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")

# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)

# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)
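
For RAG-style chunking, one option is to group layout spans under their closest heading. The sketch below is not part of spacy-layout: chunk_by_heading is a hypothetical helper, and the stand-in objects built with SimpleNamespace only mimic the text, label_ and _.heading attributes shown above, so the example runs without a PDF. With a real document you would pass doc.spans["layout"] instead.

```python
from types import SimpleNamespace


def chunk_by_heading(spans, heading_labels=("section_header", "page_header", "title")):
    """Group span texts under the text of their closest heading (hypothetical helper)."""
    chunks = {}
    for span in spans:
        if span.label_ in heading_labels:
            continue  # headings become chunk keys, not chunk content
        heading = span._.heading
        key = heading.text if heading is not None else ""
        chunks.setdefault(key, []).append(span.text)
    return {key: "\n\n".join(texts) for key, texts in chunks.items()}


def fake_span(text, label, heading=None):
    """Stand-in for a layout span, exposing only the attributes used above."""
    return SimpleNamespace(text=text, label_=label, _=SimpleNamespace(heading=heading))


intro = fake_span("Introduction", "section_header")
spans = [
    intro,
    fake_span("First paragraph.", "text", heading=intro),
    fake_span("Second paragraph.", "text", heading=intro),
    fake_span("Unattached footnote.", "footnote"),
]
chunks = chunk_by_heading(spans)
print(chunks)
```

Spans without a heading are collected under the empty string here; whether to keep, merge or drop them is a design choice that depends on your retrieval setup.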

If you need to process larger volumes of documents, you can use the spaCyLayout.pipe method, which takes an iterable of paths instead and yields Doc objects:

paths = ["one.pdf", "two.pdf", "three.pdf", ...]
for doc in layout.pipe(paths):
    print(doc._.layout)

After you've processed the documents, you can serialize the structured Doc objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.
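
A minimal sketch of that round trip, using spaCy's DocBin container. To stay runnable without a PDF, it builds a small Doc by hand and adds one span to the "layout" group; the text and label are made up for illustration. With real output you would add the Doc objects returned by layout(...) or layout.pipe(...). Passing store_user_data=True asks DocBin to keep Doc.user_data, which is where spaCy stores the values of custom extension attributes.

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# Hand-built stand-in for a Doc produced by spaCyLayout
doc = nlp("Overview This is the first paragraph.")
doc.spans["layout"] = [Span(doc, 0, 1, label="section_header")]

# Serialize to spaCy's binary format (use to_disk / from_disk for files)
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
data = doc_bin.to_bytes()

# Restore later without re-running the conversion
loaded = DocBin(store_user_data=True).from_bytes(data)
restored = list(loaded.get_docs(nlp.vocab))[0]
print(restored.spans["layout"][0].label_)
```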

spaCy also allows you to call the nlp object on an already created Doc, so you can easily apply a pipeline of components for linguistic analysis or named entity recognition, use rule-based matching or anything else you can do with spaCy.

# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)

doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)

🎛️ API

Data and extension attributes

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)

Attribute             Type                                  Description
Doc._.layout          DocLayout                             Layout features of the document.
Doc._.pages           list[tuple[PageLayout, list[Span]]]   Pages in the document and the spans they contain.
Doc.spans["layout"]   spacy.tokens.SpanGroup                The layout spans in the document.
Span.label_           str                                   The type of the extracted layout span, e.g. "text" or "section_header". The available labels are defined by Docling.
Span.label            int                                   The integer ID of the span label.
Span.id               int                                   Running index of the layout span.
Span._.layout         SpanLayout                            Layout features of a layout span.
Span._.heading        Span | None                           Closest heading to a span, if available.

dataclass PageLayout

Attribute   Type    Description
page_no     int     The page number (1-indexed).
width       float   Page width in pixels.
height      float   Page height in pixels.

dataclass DocLayout

Attribute   Type               Description
pages       list[PageLayout]   The pages in the document.

dataclass SpanLayout

Attribute   Type    Description
x           float   Horizontal offset of the bounding box in pixels.
y           float   Vertical offset of the bounding box in pixels.
width       float   Width of the bounding box in pixels.
height      float   Height of the bounding box in pixels.
page_no     int     Number of the page the span is on.
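
The bounding box fields combine into corner coordinates in the usual way. The sketch below uses a stand-in dataclass named Box with the same fields as SpanLayout rather than importing the real class, and to_corners is a hypothetical helper. It assumes that x and y refer to the top-left corner of the box, with y growing downwards; check this against your own documents before relying on it.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Stand-in with the same fields as SpanLayout (not the real class)."""
    x: float
    y: float
    width: float
    height: float
    page_no: int


def to_corners(box):
    """Return (x0, y0, x1, y1), assuming (x, y) is the top-left corner."""
    return (box.x, box.y, box.x + box.width, box.y + box.height)


box = Box(x=72.0, y=100.0, width=200.0, height=50.0, page_no=1)
print(to_corners(box))
```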

class spaCyLayout

method spaCyLayout.__init__

Initialize the document processor.

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

Argument          Type                              Description
nlp               spacy.language.Language           The initialized nlp object to use for tokenization.
separator         str | None                        Token used to separate sections in the created Doc object. The separator won't be part of the layout span. If None, no separator will be added. Defaults to "\n\n".
attrs             dict[str, str]                    Override the custom spaCy attributes. Can include "doc_layout", "doc_pages", "span_layout", "span_heading" and "span_group".
headings          list[str]                         Labels of headings to consider for Span._.heading detection. Defaults to ["section_header", "page_header", "title"].
docling_options   dict[InputFormat, FormatOption]   Format options passed to Docling's DocumentConverter.
RETURNS           spaCyLayout                       The initialized object.
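
To illustrate how a separator can sit between sections without being part of any layout span, here is a plain-Python sketch. It is not the library's implementation: join_sections is a hypothetical helper that joins section texts and records the character offsets of each one.

```python
def join_sections(sections, separator="\n\n"):
    """Join section texts; return (text, offsets) with offsets excluding the separator."""
    sep = separator or ""  # None means no separator is added
    parts = []
    offsets = []
    cursor = 0
    for i, section in enumerate(sections):
        if i > 0:
            parts.append(sep)
            cursor += len(sep)
        offsets.append((cursor, cursor + len(section)))
        parts.append(section)
        cursor += len(section)
    return "".join(parts), offsets


text, offsets = join_sections(["Title", "First paragraph."])
for start, end in offsets:
    print(repr(text[start:end]))
```

Slicing the joined text with each offset pair gives back the section text on its own, which mirrors how the layout spans map into the Doc's text.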

method spaCyLayout.__call__

Process a document and create a spaCy Doc object containing the text content and layout spans, available via Doc.spans["layout"] by default.

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")

Argument   Type                 Description
source     str | Path | bytes   Path of document to process or bytes.
RETURNS    Doc                  The processed spaCy Doc object.

method spaCyLayout.pipe

Process multiple documents and create spaCy Doc objects. You should use this method if you're processing larger volumes of documents at scale.

layout = spaCyLayout(nlp)
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
docs = layout.pipe(paths)

Argument   Type                           Description
paths      Iterable[str | Path | bytes]   Paths of documents to process or bytes.
YIELDS     Doc                            The processed spaCy Doc object.
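
Because pipe accepts any iterable, the paths can come from a generator rather than a hand-written list. A small sketch using only the standard library; iter_documents is a hypothetical helper, and the suffix list is an assumption you would adjust to the formats you process.

```python
import tempfile
from pathlib import Path


def iter_documents(root, suffixes=(".pdf", ".docx")):
    """Yield matching files below root, in sorted order (hypothetical helper)."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path


# Demo on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["one.pdf", "two.PDF", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [path.name for path in iter_documents(tmp)]

print(found)
```

With spacy-layout installed, the generator could be passed straight to the method, for example layout.pipe(iter_documents("./docs")), where "./docs" is a placeholder directory.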

