Use spaCy with PDFs, Word docs and other documents

spaCy Layout: Process PDFs, Word documents and more with spaCy

This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy's familiar Doc objects that let you access labelled text spans like sections, headings, or footnotes.

This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing chunking for RAG pipelines.

📝 Usage

⚠️ This package requires Python 3.10 or above.

pip install spacy-layout

After initializing the spaCyLayout preprocessor with an nlp object for tokenization, you can call it on a document path to convert it to structured data. The resulting Doc object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")

# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)

# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)
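
One way to turn the layout spans into chunks for a RAG pipeline is to merge consecutive spans that share the same closest heading. The sketch below is an illustration, not part of spacy-layout: the helper name `chunk_by_heading` and the sample data are hypothetical, and it works on plain `(heading, text)` pairs so it runs without a PDF.

```python
from itertools import groupby


def chunk_by_heading(pairs):
    """Merge consecutive (heading, text) pairs that share the same heading."""
    chunks = []
    for heading, group in groupby(pairs, key=lambda pair: pair[0]):
        body = "\n\n".join(text for _, text in group)
        chunks.append({"heading": heading, "text": body})
    return chunks


# Hypothetical data shaped like (heading text or None, span.text)
pairs = [
    (None, "StarCraft"),
    ("Gameplay", "Players gather resources."),
    ("Gameplay", "Three species compete."),
    ("Reception", "The game sold well."),
]
chunks = chunk_by_heading(pairs)
for chunk in chunks:
    print(chunk["heading"], "->", repr(chunk["text"]))
```

With a processed document, the pairs could be built from `doc.spans["layout"]` as `(span._.heading.text if span._.heading else None, span.text)`. Depending on the document, you may want to skip spans whose `label_` is itself a heading label.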

If you need to process larger volumes of documents at scale, you can use the spaCyLayout.pipe method, which takes an iterable of paths or bytes instead and yields Doc objects:

paths = ["one.pdf", "two.pdf", "three.pdf", ...]
for doc in layout.pipe(paths):
    print(doc._.layout)
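
Since `pipe` accepts any iterable of paths, the input can be a generator that walks a folder. The helper below is a sketch under assumptions: the name `iter_document_paths` and the default suffixes are hypothetical and not part of spacy-layout.

```python
from pathlib import Path
from typing import Iterator


def iter_document_paths(folder, suffixes=(".pdf", ".docx")) -> Iterator[Path]:
    """Yield files under `folder` with a matching suffix, in sorted order."""
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path
```

The generator could then be passed straight to the preprocessor, for example `layout.pipe(iter_document_paths("./documents"))`, where `./documents` is a placeholder folder.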

After you've processed the documents, you can serialize the structured Doc objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.
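
The round trip can be sketched with spaCy's `DocBin` class. The Doc below is built by hand as a stand-in for the output of spaCyLayout so that the example runs without a PDF; its words and labels are illustrative assumptions.

```python
import spacy
from spacy.tokens import Doc, DocBin, Span

nlp = spacy.blank("en")

# Stand-in for a Doc created by spaCyLayout: two layout spans in one span group
doc = Doc(
    nlp.vocab,
    words=["StarCraft", "\n\n", "A", "strategy", "game", "."],
    spaces=[False, False, True, True, False, False],
)
doc.spans["layout"] = [
    Span(doc, 0, 1, label="title"),
    Span(doc, 2, 6, label="text"),
]

# store_user_data=True asks DocBin to also keep values of custom extensions
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
data = doc_bin.to_bytes()  # doc_bin.to_disk(...) writes a file instead

# Later: restore the Docs without re-running the conversion
restored = list(DocBin(store_user_data=True).from_bytes(data).get_docs(nlp.vocab))
labels = [span.label_ for span in restored[0].spans["layout"]]
print(labels)
```

This sketch only exercises the text and the span group. Custom attributes are restored from the stored user data, so the extensions need to be registered again before loading, for example by initializing spaCyLayout first.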

spaCy also allows you to call the nlp object on an already created Doc, so you can easily apply a pipeline of components for linguistic analysis or named entity recognition, use rule-based matching or anything else you can do with spaCy.

# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)

doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)
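
Once a pipeline has run on the Doc, its annotations can be related back to the layout spans, for instance by listing the entities inside each span. The sketch below makes two assumptions so it runs without a PDF or a downloaded model: the Doc is built by hand, and a rule-based `entity_ruler` with a single illustrative pattern stands in for the trained pipeline.

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")
# Stand-in for a trained pipeline: a rule-based entity recognizer
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Blizzard"}])

# Stand-in for the Doc that spaCyLayout would create from a file
doc = Doc(
    nlp.vocab,
    words=["StarCraft", "\n\n", "StarCraft", "was", "released", "by", "Blizzard", "."],
    spaces=[False, False, True, True, True, True, False, False],
)
doc.spans["layout"] = [
    Span(doc, 0, 1, label="title"),
    Span(doc, 2, 8, label="text"),
]

# Apply the pipeline to the existing Doc; the layout spans are kept
doc = nlp(doc)

# Collect the entity texts that fall inside each layout span
entities_by_span = [
    (span.label_, [ent.text for ent in span.ents])
    for span in doc.spans["layout"]
]
print(entities_by_span)
```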

🎛️ API

Data and extension attributes

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)

Attribute            Type                                 Description
Doc._.layout         DocLayout                            Layout features of the document.
Doc._.pages          list[tuple[PageLayout, list[Span]]]  Pages in the document and the spans they contain.
Doc.spans["layout"]  spacy.tokens.SpanGroup               The layout spans in the document.
Span.label_          str                                  The type of the extracted layout span, e.g. "text" or "section_header". The available labels come from Docling.
Span.label           int                                  The integer ID of the span label.
Span.id              int                                  Running index of the layout span.
Span._.layout        SpanLayout                           Layout features of a layout span.
Span._.heading       Span | None                          Closest heading to a span, if available.

dataclass PageLayout

Attribute  Type   Description
page_no    int    The page number (1-indexed).
width      float  Page width in pixels.
height     float  Page height in pixels.

dataclass DocLayout

Attribute  Type              Description
pages      list[PageLayout]  The pages in the document.

dataclass SpanLayout

Attribute  Type   Description
x          float  Horizontal offset of the bounding box in pixels.
y          float  Vertical offset of the bounding box in pixels.
width      float  Width of the bounding box in pixels.
height     float  Height of the bounding box in pixels.
page_no    int    Number of the page the span is on.
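
Because both the span's bounding box and the page size are given in pixels, a box can be expressed relative to its page, which is handy when drawing it on a rendered image of a different resolution. The sketch below uses stand-in dataclasses that mirror the documented fields rather than importing the real ones, and the helper name `relative_box` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageLayout:  # stand-in with the documented fields
    page_no: int
    width: float
    height: float


@dataclass
class SpanLayout:  # stand-in with the documented fields
    x: float
    y: float
    width: float
    height: float
    page_no: int


def relative_box(span: SpanLayout, page: PageLayout) -> tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) as fractions of the page width and height."""
    if span.page_no != page.page_no:
        raise ValueError("span and page refer to different pages")
    return (
        span.x / page.width,
        span.y / page.height,
        (span.x + span.width) / page.width,
        (span.y + span.height) / page.height,
    )


page = PageLayout(page_no=1, width=600.0, height=800.0)
span = SpanLayout(x=60.0, y=80.0, width=300.0, height=40.0, page_no=1)
box = relative_box(span, page)
print(box)
```

Multiplying the fractions by the pixel size of a rendered page image gives the box in that image's coordinates.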

class spaCyLayout

method spaCyLayout.__init__

Initialize the document processor.

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

Argument         Type                             Description
nlp              spacy.language.Language          The initialized nlp object to use for tokenization.
separator        str                              Token used to separate sections in the created Doc object. The separator won't be part of the layout span. If None, no separator will be added. Defaults to "\n\n".
attrs            dict[str, str]                   Override the custom spaCy attributes. Can include "doc_layout", "doc_pages", "span_layout", "span_heading" and "span_group".
headings         list[str]                        Labels of headings to consider for Span._.heading detection. Defaults to ["section_header", "page_header", "title"].
docling_options  dict[InputFormat, FormatOption]  Format options passed to Docling's DocumentConverter.
RETURNS          spaCyLayout                      The initialized object.
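
As a configuration sketch, the optional arguments can be passed as keywords. The values below are illustrative assumptions, not recommended settings: a single newline as separator, a narrower set of heading labels, and a span group renamed to "sections".

```python
import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(
    nlp,
    # Join sections with a single newline instead of the default "\n\n"
    separator="\n",
    # Only treat these labels as headings for Span._.heading
    headings=["section_header", "title"],
    # Store the layout spans under a different span group name
    attrs={"span_group": "sections"},
)
```

With this configuration, the layout spans of a processed Doc would be expected under `doc.spans["sections"]` rather than `doc.spans["layout"]`.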

method spaCyLayout.__call__

Process a document and create a spaCy Doc object containing the text content and layout spans, available via Doc.spans["layout"] by default.

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")

Argument  Type                Description
source    str | Path | bytes  Path of the document to process, or its contents as bytes.
RETURNS   Doc                 The processed spaCy Doc object.

method spaCyLayout.pipe

Process multiple documents and create spaCy Doc objects. You should use this method if you're processing larger volumes of documents at scale.

layout = spaCyLayout(nlp)
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
docs = layout.pipe(paths)

Argument  Type                          Description
sources   Iterable[str | Path | bytes]  Paths of the documents to process, or their contents as bytes.
YIELDS    Doc                           The processed spaCy Doc object.
