Use spaCy with PDFs, Word docs and other documents

spaCy Layout: Process PDFs, Word documents and more with spaCy

This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy's familiar Doc objects that let you access labelled text spans like sections, headings, or footnotes.

This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing chunking for RAG pipelines.

📝 Usage

⚠️ This package requires Python 3.10 or above.

pip install spacy-layout

After initializing the spaCyLayout preprocessor with an nlp object for tokenization, you can call it on a document path to convert it to structured data. The resulting Doc object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")

# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)

# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)

After you've processed the documents, you can serialize the structured Doc objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.
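As a rough sketch of that serialization step, the example below round-trips a Doc through spaCy's DocBin. To stay runnable without a PDF, it uses a hand-built Doc with a "layout" span group as a stand-in for the output of layout(...); the stand-in words and spans are assumptions for illustration. Setting store_user_data=True tells DocBin to keep the values of custom extension attributes, though whether the layout dataclasses serialize cleanly in your installed spacy-layout version is worth verifying.

```python
import spacy
from spacy.tokens import Doc, DocBin, Span

nlp = spacy.blank("en")

# Hypothetical stand-in for `doc = layout("./starcraft.pdf")`
words = ["Title", "\n\n", "Some", "body", "text", "."]
spaces = [False, False, True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc.spans["layout"] = [
    Span(doc, 0, 1, label="title"),
    Span(doc, 2, 6, label="text"),
]

# store_user_data=True keeps values of custom extension attributes
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
data = doc_bin.to_bytes()  # alternatively: doc_bin.to_disk("./docs.spacy")

# Later: restore the Docs without re-running the conversion
restored_bin = DocBin(store_user_data=True).from_bytes(data)
restored = list(restored_bin.get_docs(nlp.vocab))[0]
restored_labels = [span.label_ for span in restored.spans["layout"]]
print(restored_labels)
```

The restored Doc shares the vocab passed to get_docs, so the span labels resolve back to their string values.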

spaCy also allows you to call the nlp object on an already created Doc, so you can easily apply a pipeline of components for linguistic analysis or named entity recognition, use rule-based matching or anything else you can do with spaCy.

# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)

doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)

🎛️ API

Data and extension attributes

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)

| Attribute | Type | Description |
| --- | --- | --- |
| Doc._.layout | DocLayout | Layout features of the document. |
| Doc._.pages | list[tuple[PageLayout, list[Span]]] | Pages in the document and the spans they contain. |
| Doc.spans["layout"] | spacy.tokens.SpanGroup | The layout spans in the document. |
| Span.label_ | str | The type of the extracted layout span, e.g. "text" or "section_header". The available labels come from Docling. |
| Span.label | int | The integer ID of the span label. |
| Span.id | int | Running index of the layout span. |
| Span._.layout | SpanLayout | Layout features of a layout span. |
| Span._.heading | Span / None | Closest heading to a span, if available. |
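For chunking, e.g. in a RAG pipeline, one simple strategy is to group layout spans under the heading that precedes them. The sketch below does this using only Span.label_ and reading order. The helper name chunk_by_heading and the hand-built stand-in Doc are assumptions for illustration; with a real document, Span._.heading may give a better assignment.

```python
import spacy
from spacy.tokens import Doc, Span

# Labels treated as headings (mirrors the documented default for `headings`)
HEADING_LABELS = {"section_header", "page_header", "title"}


def chunk_by_heading(spans, heading_labels=HEADING_LABELS):
    """Group layout spans into chunks, starting a new chunk at each heading."""
    chunks = []
    current = None
    for span in spans:
        if span.label_ in heading_labels:
            current = {"heading": span.text, "texts": []}
            chunks.append(current)
        else:
            if current is None:
                # Text that appears before the first heading
                current = {"heading": None, "texts": []}
                chunks.append(current)
            current["texts"].append(span.text)
    return chunks


nlp = spacy.blank("en")
# Hypothetical stand-in for a Doc produced by spaCyLayout
words = ["Intro", "\n\n", "First", "paragraph", ".", "\n\n",
         "Methods", "\n\n", "Second", "paragraph", "."]
spaces = [False, False, True, False, False, False,
          False, False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc.spans["layout"] = [
    Span(doc, 0, 1, label="section_header"),
    Span(doc, 2, 5, label="text"),
    Span(doc, 6, 7, label="section_header"),
    Span(doc, 8, 11, label="text"),
]

chunks = chunk_by_heading(doc.spans["layout"])
for chunk in chunks:
    print(chunk["heading"], chunk["texts"])
```

Each chunk keeps its heading alongside the body text, which can be useful as retrieval context.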

dataclass PageLayout

| Attribute | Type | Description |
| --- | --- | --- |
| page_no | int | The page number (1-indexed). |
| width | float | Page width in pixels. |
| height | float | Page height in pixels. |

dataclass DocLayout

| Attribute | Type | Description |
| --- | --- | --- |
| pages | list[PageLayout] | The pages in the document. |

dataclass SpanLayout

| Attribute | Type | Description |
| --- | --- | --- |
| x | float | Horizontal offset of the bounding box in pixels. |
| y | float | Vertical offset of the bounding box in pixels. |
| width | float | Width of the bounding box in pixels. |
| height | float | Height of the bounding box in pixels. |
| page_no | int | Number of the page the span is on. |
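Since span and page dimensions are given in the same units, a bounding box can be rescaled to page-relative coordinates, for example to draw it on a page image rendered at a different resolution. A minimal sketch follows; PageBox, SpanBox and to_relative are hypothetical stand-ins that mirror the documented fields so the example runs without a document. It makes no assumption about where the coordinate origin lies.

```python
from dataclasses import dataclass


@dataclass
class PageBox:
    """Hypothetical stand-in with the same fields as PageLayout."""
    page_no: int
    width: float
    height: float


@dataclass
class SpanBox:
    """Hypothetical stand-in with the same fields as SpanLayout."""
    x: float
    y: float
    width: float
    height: float
    page_no: int


def to_relative(span: SpanBox, page: PageBox) -> tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) as fractions of the page width and height."""
    if span.page_no != page.page_no:
        raise ValueError("span and page refer to different pages")
    return (
        span.x / page.width,
        span.y / page.height,
        (span.x + span.width) / page.width,
        (span.y + span.height) / page.height,
    )


page = PageBox(page_no=1, width=400.0, height=800.0)
span = SpanBox(x=100.0, y=200.0, width=200.0, height=100.0, page_no=1)
rel = to_relative(span, page)
print(rel)
```

With a processed document, the same function should work on span._.layout and the matching entry of doc._.layout.pages, since only the attribute names are used.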

class spaCyLayout

method spaCyLayout.__init__

Initialize the document processor.

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

| Argument | Type | Description |
| --- | --- | --- |
| nlp | spacy.language.Language | The initialized nlp object to use for tokenization. |
| separator | str | Token used to separate sections in the created Doc object. The separator won't be part of the layout span. If None, no separator will be added. Defaults to "\n\n". |
| attrs | dict[str, str] | Override the custom spaCy attributes. Can include "doc_layout", "doc_pages", "span_layout", "span_heading" and "span_group". |
| headings | list[str] | Labels of headings to consider for Span._.heading detection. Defaults to ["section_header", "page_header", "title"]. |
| docling_options | dict[InputFormat, FormatOption] | Format options passed to Docling's DocumentConverter. |
| RETURNS | spaCyLayout | The initialized object. |
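To illustrate what the separator does, the sketch below joins section texts with "\n\n" and creates one span per section whose character offsets skip the separator. This is a conceptual illustration in plain spaCy under assumed section data, not spaCyLayout's actual implementation.

```python
import spacy

nlp = spacy.blank("en")
separator = "\n\n"

# Hypothetical (label, text) pairs standing in for converted sections
sections = [("title", "StarCraft"), ("text", "A strategy game.")]

# The Doc text contains the separator between sections
text = separator.join(content for _, content in sections)
doc = nlp.make_doc(text)

spans = []
offset = 0
for label, content in sections:
    # Each span covers only the section text, not the separator
    spans.append(doc.char_span(offset, offset + len(content), label=label))
    offset += len(content) + len(separator)
doc.spans["layout"] = spans

span_texts = [span.text for span in doc.spans["layout"]]
span_labels = [span.label_ for span in doc.spans["layout"]]
print(span_texts, span_labels)
```

The separator still appears in doc.text as a whitespace token between the spans, so the sections stay visually separated while the span boundaries remain clean.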

method spaCyLayout.__call__

Process a document and create a spaCy Doc object containing the text content and layout spans, available via Doc.spans["layout"] by default.

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")

| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | Path to the document to process. |
| RETURNS | Doc | The processed spaCy Doc object. |


Download files

Source Distribution

spacy_layout-0.0.2.tar.gz (7.4 kB)

Uploaded Source

Built Distribution

spacy_layout-0.0.2-py2.py3-none-any.whl (7.4 kB)

Uploaded Python 2 Python 3

File details

Details for the file spacy_layout-0.0.2.tar.gz.

File metadata

  • Download URL: spacy_layout-0.0.2.tar.gz
  • Upload date:
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for spacy_layout-0.0.2.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 5b32eb2e7fb57810bf94305bdd09ad6ffa2b8a6567dd5f5e8e938bbfcada0e04 |
| MD5 | 8f82765282ee57fd8fdffe234aca2622 |
| BLAKE2b-256 | 3026298e6b53c0bd808bce0d6de1f555ded04fe800b6383b054702284d3c8ea0 |

Provenance

The following attestation bundles were made for spacy_layout-0.0.2.tar.gz:

Publisher: publish.yml on explosion/spacy-layout

File details

Details for the file spacy_layout-0.0.2-py2.py3-none-any.whl.

File hashes

Hashes for spacy_layout-0.0.2-py2.py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 4c96e741a6c5ecec8ac5b46e501e19e4027b3d25d241a3b16d4adda7b17a54ef |
| MD5 | 51c6509435459e030669315b54cfc07c |
| BLAKE2b-256 | f008358d6b6c89770642046a78c3dd48d9b38855097fa2a76bd129cd7a04b2fa |

Provenance

The following attestation bundles were made for spacy_layout-0.0.2-py2.py3-none-any.whl:

Publisher: publish.yml on explosion/spacy-layout
