Use spaCy with PDFs, Word docs and other documents

Project description

spaCy Layout: Process PDFs, Word documents and more with spaCy

This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy's familiar Doc objects, which let you access labelled text spans like sections, headings, or footnotes.

This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing chunking for RAG pipelines.

📝 Usage

⚠️ This package requires Python 3.10 or above.

pip install spacy-layout

After initializing the spaCyLayout preprocessor with an nlp object for tokenization, you can call it on a document path to convert it to structured data. The resulting Doc object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")

# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)

# Layout spans for different sections
for span in doc.spans["layout"]:
    # Section text with its token and character offsets into the Doc
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)
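
The span attributes shown above are enough to sketch heading-based chunking, for example for a RAG pipeline. The snippet below is a minimal illustration rather than part of the package: `chunk_by_heading` and the `(text, label, heading)` rows are hypothetical stand-ins, and with a real Doc such rows could be built from `span.text`, `span.label_` and `span._.heading` for each span in `doc.spans["layout"]`.

```python
from typing import Iterable, Optional

# Hypothetical stand-in for layout spans: (text, label, heading text).
# The sample rows below are made up for illustration.
Section = tuple[str, str, Optional[str]]


def chunk_by_heading(
    sections: Iterable[Section],
    skip_labels: tuple[str, ...] = ("section_header", "title"),
) -> list[dict]:
    """Merge consecutive sections that share a heading into one chunk."""
    chunks: list[dict] = []
    for text, label, heading in sections:
        if label in skip_labels:
            continue  # headings become chunk metadata, not chunk text
        if chunks and chunks[-1]["heading"] == heading:
            chunks[-1]["text"] += "\n\n" + text
        else:
            chunks.append({"heading": heading, "text": text})
    return chunks


sections = [
    ("Overview", "section_header", None),
    ("First paragraph.", "text", "Overview"),
    ("Second paragraph.", "text", "Overview"),
    ("Gameplay", "section_header", None),
    ("Third paragraph.", "text", "Gameplay"),
]
for chunk in chunk_by_heading(sections):
    print(chunk["heading"], "->", repr(chunk["text"]))
```

Grouping on the heading keeps related paragraphs together, but as noted above, heading detection depends on the document structure, so chunks may need a size limit or another fallback in practice.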

To process larger volumes of documents, use the spaCyLayout.pipe method, which takes an iterable of paths or bytes and yields Doc objects:

paths = ["one.pdf", "two.pdf", "three.pdf", ...]
for doc in layout.pipe(paths):
    print(doc._.layout)
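
Since pipe takes an iterable, the inputs do not have to be a hard-coded list. As a rough sketch, a generator can collect document paths from a directory. `iter_document_paths` and the suffix list are assumptions made for this example, not part of spacy-layout.

```python
from pathlib import Path
from typing import Iterator


def iter_document_paths(
    root: Path, suffixes: tuple[str, ...] = (".pdf", ".docx")
) -> Iterator[Path]:
    """Yield document paths below root in a stable, sorted order."""
    for path in sorted(root.rglob("*")):
        # Compare suffixes case-insensitively so "report.PDF" is included
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path


# Hypothetical usage with an initialized spaCyLayout object:
# for doc in layout.pipe(iter_document_paths(Path("./documents"))):
#     print(doc._.layout)
```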

After you've processed the documents, you can serialize the structured Doc objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.
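
As a minimal sketch of that round trip, the example below uses spaCy's DocBin with a hand-built Doc standing in for the output of spaCyLayout, so it runs without a PDF. The text and span label are illustrative. Whether the layout extension attributes themselves survive serialization may depend on the package version, so it is worth checking on your own data.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Stand-in for a Doc produced by spaCyLayout: plain text plus one span
# in the "layout" span group (illustrative values).
doc = nlp("Overview\n\nFirst paragraph.")
doc.spans["layout"] = [doc.char_span(0, 8, label="section_header")]

# store_user_data=True also saves Doc.user_data, which is where the
# values of custom extension attributes are kept.
doc_bin = DocBin(docs=[doc], store_user_data=True)
data = doc_bin.to_bytes()  # or: doc_bin.to_disk("./docs.spacy")

# Later: restore the Docs without re-running the conversion
restored = list(DocBin(store_user_data=True).from_bytes(data).get_docs(nlp.vocab))
for span in restored[0].spans["layout"]:
    print(span.label_, repr(span.text))
```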

spaCy also allows you to call the nlp object on an already created Doc, so you can easily apply a pipeline of components for linguistic analysis or named entity recognition, use rule-based matching or anything else you can do with spaCy.

# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)

doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)

🎛️ API

Data and extension attributes

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)

Attribute            Type                                 Description
Doc._.layout         DocLayout                            Layout features of the document.
Doc._.pages          list[tuple[PageLayout, list[Span]]]  Pages in the document and the spans they contain.
Doc.spans["layout"]  spacy.tokens.SpanGroup               The layout spans in the document.
Span.label_          str                                  The type of the extracted layout span, e.g. "text" or "section_header". See here for options.
Span.label           int                                  The integer ID of the span label.
Span.id              int                                  Running index of layout span.
Span._.layout        SpanLayout                           Layout features of a layout span.
Span._.heading       Span | None                          Closest heading to a span, if available.

dataclass PageLayout

Attribute  Type   Description
page_no    int    The page number (1-indexed).
width      float  Page width in pixels.
height     float  Page height in pixels.

dataclass DocLayout

Attribute  Type              Description
pages      list[PageLayout]  The pages in the document.

dataclass SpanLayout

Attribute  Type   Description
x          float  Horizontal offset of the bounding box in pixels.
y          float  Vertical offset of the bounding box in pixels.
width      float  Width of the bounding box in pixels.
height     float  Height of the bounding box in pixels.
page_no    int    Number of the page the span is on.
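
To illustrate how these fields fit together, here is a small sketch that turns a bounding box into page-relative coordinates. The two dataclasses are stand-ins that only mirror the documented fields; with the real package the objects would come from `span._.layout` and `doc._.layout.pages`. The helper `to_relative_box` is hypothetical, and the code assumes that x and y are measured from the top-left corner of the page, which should be verified.

```python
from dataclasses import dataclass


# Stand-ins mirroring the documented fields, not the package's classes
@dataclass
class PageLayout:
    page_no: int
    width: float
    height: float


@dataclass
class SpanLayout:
    x: float
    y: float
    width: float
    height: float
    page_no: int


def to_relative_box(
    span: SpanLayout, pages: list[PageLayout]
) -> tuple[float, float, float, float]:
    """Return (left, top, right, bottom) as fractions of the page size.

    Assumes x/y are offsets from the top-left corner of the page.
    """
    # Look up the page by its number instead of relying on list position
    page = next(p for p in pages if p.page_no == span.page_no)
    left = span.x / page.width
    top = span.y / page.height
    right = (span.x + span.width) / page.width
    bottom = (span.y + span.height) / page.height
    return left, top, right, bottom


pages = [PageLayout(page_no=1, width=600.0, height=800.0)]
box = SpanLayout(x=60.0, y=80.0, width=480.0, height=40.0, page_no=1)
print(to_relative_box(box, pages))  # (0.1, 0.1, 0.9, 0.15)
```

Relative coordinates are convenient when drawing boxes on a page image rendered at a different resolution.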

class spaCyLayout

method spaCyLayout.__init__

Initialize the document processor.

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

Argument         Type                             Description
nlp              spacy.language.Language          The initialized nlp object to use for tokenization.
separator        str                              Token used to separate sections in the created Doc object. The separator won't be part of the layout span. If None, no separator will be added. Defaults to "\n\n".
attrs            dict[str, str]                   Override the custom spaCy attributes. Can include "doc_layout", "doc_pages", "span_layout", "span_heading" and "span_group".
headings         list[str]                        Labels of headings to consider for Span._.heading detection. Defaults to ["section_header", "page_header", "title"].
docling_options  dict[InputFormat, FormatOption]  Format options passed to Docling's DocumentConverter.
RETURNS          spaCyLayout                      The initialized object.
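
The effect of the separator can be pictured with plain strings. The sketch below joins section texts and records character offsets that exclude the separator, which is the behaviour described for layout spans. `join_sections` is a hypothetical helper written for this illustration and is not the package's implementation.

```python
from typing import Optional


def join_sections(
    sections: list[str], separator: Optional[str] = "\n\n"
) -> tuple[str, list[tuple[int, int]]]:
    """Join section texts and return the text plus (start, end) offsets."""
    sep = separator or ""  # None means no separator is added
    parts: list[str] = []
    offsets: list[tuple[int, int]] = []
    cursor = 0
    for i, section in enumerate(sections):
        if i > 0:
            parts.append(sep)
            cursor += len(sep)  # skip the separator so it is not in any span
        offsets.append((cursor, cursor + len(section)))
        parts.append(section)
        cursor += len(section)
    return "".join(parts), offsets


text, offsets = join_sections(["Overview", "First paragraph."])
for start, end in offsets:
    print((start, end), repr(text[start:end]))
```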

method spaCyLayout.__call__

Process a document and create a spaCy Doc object containing the text content and layout spans, available via Doc.spans["layout"] by default.

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")

Argument  Type                Description
source    str | Path | bytes  Path of document to process or bytes.
RETURNS   Doc                 The processed spaCy Doc object.

method spaCyLayout.pipe

Process multiple documents and create spaCy Doc objects. You should use this method if you're processing larger volumes of documents at scale.

layout = spaCyLayout(nlp)
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
docs = layout.pipe(paths)

Argument  Type                          Description
sources   Iterable[str | Path | bytes]  Paths of documents to process or bytes.
YIELDS    Doc                           The processed spaCy Doc object.

Download files

Source Distribution

spacy_layout-0.0.4.tar.gz (8.1 kB view details)

Uploaded Source

Built Distribution

spacy_layout-0.0.4-py2.py3-none-any.whl (7.9 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file spacy_layout-0.0.4.tar.gz.

File metadata

  • Download URL: spacy_layout-0.0.4.tar.gz
  • Size: 8.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for spacy_layout-0.0.4.tar.gz
Algorithm Hash digest
SHA256 609550de9ffc89d37f0e363b70f5daa207d6dd8cc4f50f09b37bdf781a0d102b
MD5 ee54c11efe8494639bc5c18bbb54253b
BLAKE2b-256 19e2a12f5c32fa027496620d34d8706cb31f8b50d67fa8a19217b42fd80aab00

Provenance

The following attestation bundles were made for spacy_layout-0.0.4.tar.gz:

Publisher: publish.yml on explosion/spacy-layout

File details

Details for the file spacy_layout-0.0.4-py2.py3-none-any.whl.

File hashes

Hashes for spacy_layout-0.0.4-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 49133f07af609ab1ba12f5b6ef7d31efda3f9ea9d92f7ffa2d369342b2e2a368
MD5 2284f4770f9c09252e4288cc3c8111f1
BLAKE2b-256 6fc57df488ba0388a9eaec76ac3062b74b66e666f15661159dc7b8f139b051f9

Provenance

The following attestation bundles were made for spacy_layout-0.0.4-py2.py3-none-any.whl:

Publisher: publish.yml on explosion/spacy-layout
