Use spaCy with PDFs, Word docs and other documents

spaCy Layout: Process PDFs, Word documents and more with spaCy

This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy's familiar Doc objects that let you access labelled text spans like sections, headings, or footnotes.

This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing chunking for RAG pipelines.

📝 Usage

⚠️ This package requires Python 3.10 or above.

pip install spacy-layout

After initializing the spaCyLayout preprocessor with an nlp object for tokenization, you can call it on a document path to convert it to structured data. The resulting Doc object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")

# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)

# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
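
One way to use the labelled spans for chunking, for example in a RAG pipeline, is to start a new chunk at every heading. The sketch below is an illustration only: it works on plain (label, text) pairs such as those you could collect from span.label_ and span.text in the loop above, and the helper name chunk_by_heading is hypothetical, not part of spacy-layout.

```python
def chunk_by_heading(sections, heading_labels=("title", "section_header")):
    """Group (label, text) pairs into chunks that each start at a heading."""
    chunks = []
    current = None
    for label, text in sections:
        is_heading = label in heading_labels
        if is_heading or current is None:
            # A heading (or the very first section) opens a new chunk
            current = {"heading": text if is_heading else None, "texts": []}
            chunks.append(current)
            if is_heading:
                continue
        current["texts"].append(text)
    return chunks


# Stand-in data; with spacy-layout you could build it as
# [(span.label_, span.text) for span in doc.spans["layout"]]
sections = [
    ("title", "StarCraft"),
    ("text", "StarCraft is a strategy game."),
    ("section_header", "Gameplay"),
    ("text", "Players gather resources."),
    ("text", "They build bases."),
]
chunks = chunk_by_heading(sections)
for chunk in chunks:
    print(chunk["heading"], "->", " ".join(chunk["texts"]))
```

Splitting at headings keeps each chunk tied to one section of the document; very long sections may still need a further split by length.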

After you've processed the documents, you can serialize the structured Doc objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.

spaCy also allows you to call the nlp object on an already created Doc, so you can easily apply a pipeline of components for linguistic analysis or named entity recognition, use rule-based matching or anything else you can do with spaCy.

# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)

doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)

🎛️ API

Data and extension attributes

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)

| Attribute | Type | Description |
| --- | --- | --- |
| Span.label_ | str | The type of the extracted layout span, e.g. "text" or "section_header". See here for options. |
| Span._.layout | SpanLayout | Layout features of a layout span. |
| Doc._.layout | DocLayout | Layout features of the document. |
| Doc._.pages | list[tuple[PageLayout, list[Span]]] | Pages in the document and the spans they contain. |
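
To make the shape of Doc._.pages concrete, the following sketch groups spans by page number into (page, spans) tuples. It uses stand-in dataclasses that mirror only the field names documented in this section; FakeSpan and group_by_page are hypothetical, and this is not a description of how the library builds the attribute internally.

```python
from dataclasses import dataclass


@dataclass
class PageLayout:
    # Stand-in mirroring the documented fields, not the library's class
    page_no: int
    width: float
    height: float


@dataclass
class FakeSpan:
    # Hypothetical stand-in for a layout span: text, label and page number
    text: str
    label: str
    page_no: int


def group_by_page(pages, spans):
    """Return a list of (page, spans_on_that_page) tuples in page order."""
    by_page = {page.page_no: [] for page in pages}
    for span in spans:
        by_page[span.page_no].append(span)
    return [(page, by_page[page.page_no]) for page in pages]


pages = [PageLayout(1, 612.0, 792.0), PageLayout(2, 612.0, 792.0)]
spans = [
    FakeSpan("StarCraft", "title", 1),
    FakeSpan("An introduction.", "text", 1),
    FakeSpan("Gameplay", "section_header", 2),
]
grouped = group_by_page(pages, spans)
for page, page_spans in grouped:
    print(page.page_no, [span.label for span in page_spans])
```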

dataclass PageLayout

| Attribute | Type | Description |
| --- | --- | --- |
| page_no | int | The page number (1-indexed). |
| width | float | Page width in pixels. |
| height | float | Page height in pixels. |

dataclass DocLayout

| Attribute | Type | Description |
| --- | --- | --- |
| pages | list[PageLayout] | The pages in the document. |

dataclass SpanLayout

| Attribute | Type | Description |
| --- | --- | --- |
| x | float | Horizontal offset of the bounding box in pixels. |
| y | float | Vertical offset of the bounding box in pixels. |
| width | float | Width of the bounding box in pixels. |
| height | float | Height of the bounding box in pixels. |
| page_no | int | Number of page the span is on. |
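
Because the bounding box is stored as an offset plus a size, downstream code often needs the corner coordinates or a box scaled to the page. The sketch below shows both calculations on a stand-in dataclass that mirrors the documented fields; the helpers corners and relative_box are hypothetical. The origin of the coordinate system is not stated here, and the arithmetic does not depend on it.

```python
from dataclasses import dataclass


@dataclass
class SpanLayout:
    # Stand-in mirroring the documented fields, not the library's class
    x: float
    y: float
    width: float
    height: float
    page_no: int


def corners(layout):
    """Return (x0, y0, x1, y1) for a bounding box given as offset plus size."""
    return (layout.x, layout.y, layout.x + layout.width, layout.y + layout.height)


def relative_box(layout, page_width, page_height):
    """Scale the box to fractions of the page width and height."""
    x0, y0, x1, y1 = corners(layout)
    return (x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height)


box = SpanLayout(x=100.0, y=50.0, width=200.0, height=25.0, page_no=1)
print(corners(box))                     # (100.0, 50.0, 300.0, 75.0)
print(relative_box(box, 400.0, 500.0))  # (0.25, 0.1, 0.75, 0.15)
```

The page width and height passed to relative_box would come from the PageLayout entry whose page_no matches the span.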

class spaCyLayout

method spaCyLayout.__init__

Initialize the document processor.

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

| Argument | Type | Description |
| --- | --- | --- |
| nlp | spacy.language.Language | The initialized nlp object to use for tokenization. |
| separator | str | Token used to separate sections in the created Doc object. The separator won't be part of the layout span. If None, no separator will be added. Defaults to "\n\n". |
| attrs | dict[str, str] | Override the custom spaCy attributes. Can include "doc_layout", "doc_pages", "span_layout" and "span_group". |
| docling_options | dict[InputFormat, FormatOption] | Format options passed to Docling's DocumentConverter. |
| RETURNS | spaCyLayout | The initialized object. |
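
The effect of the separator on span offsets can be illustrated with plain strings. The sketch below joins section texts and records character offsets that skip the separator, so slicing the joined text returns each section unchanged. It is a simplified illustration at the character level with a hypothetical helper, join_sections; the library itself works with spaCy tokens and its internals may differ.

```python
def join_sections(sections, separator="\n\n"):
    """Join section texts and return the text plus (start, end) offsets per section."""
    parts = []
    offsets = []
    position = 0
    for i, section in enumerate(sections):
        if i > 0 and separator is not None:
            # The separator sits between sections and is not covered by any offset
            parts.append(separator)
            position += len(separator)
        start = position
        parts.append(section)
        position += len(section)
        offsets.append((start, position))
    return "".join(parts), offsets


text, offsets = join_sections(["Title", "First paragraph.", "Second paragraph."])
for start, end in offsets:
    print(start, end, repr(text[start:end]))
```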

method spaCyLayout.__call__

Process a document and create a spaCy Doc object containing the text content and layout spans, available via Doc.spans["layout"] by default.

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")

| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | Path to document to process. |
| RETURNS | Doc | The processed spaCy Doc object. |
