
spaCy Chunks

An easy way to chunk spaCy docs.

spaCy Chunks is a custom pipeline component for spaCy that generates overlapping chunks of sentences or tokens from a document. It is useful for NLP tasks that process text in smaller, potentially overlapping segments, such as building context windows for downstream processing.

Features

  • Chunk by sentences or tokens
  • Configurable chunk size
  • Adjustable overlap between chunks
  • Option to truncate incomplete chunks

Installation

spaCy Chunks requires spaCy. Install both packages with pip:

pip install spacy
pip install spacy_chunks

Download a spaCy model:

python -m spacy download en_core_web_sm

Usage

Here's how to use the spaCy Chunks component:

import spacy

# Load a spaCy model
nlp = spacy.load("en_core_web_sm")

# Add the chunking component to the pipeline
nlp.add_pipe("chunking", last=True, config={
    "chunking_method": "sentence",
    "chunk_size": 2,
    "overlap": 1,
    "truncate": True
})

# Process a text
text = "This is the first sentence. This is the second one. And here's the third. The fourth is here. And a fifth."
doc = nlp(text)

# Print the chunks
print("Chunks:")
for i, chunk in enumerate(doc._.chunks, 1):
    print(f"Chunk {i}: {[sent.text for sent in chunk]}")

Output:

Chunks:
Chunk 1: ['This is the first sentence.', 'This is the second one.']
Chunk 2: ['This is the second one.', "And here's the third."]
Chunk 3: ["And here's the third.", 'The fourth is here.']
Chunk 4: ['The fourth is here.', 'And a fifth.']

Configuration

When adding the chunking component to your pipeline, you can configure the following parameters:

  • chunking_method: "sentence" or "token" (default: "sentence"); see the token-based sketch after this list
  • chunk_size: Number of sentences or tokens per chunk (default: 3)
  • overlap: Number of sentences or tokens shared between consecutive chunks (default: 0)
  • truncate: Whether to drop a final chunk that has fewer than chunk_size items (default: True)
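
Token-based chunking presumably follows the same pattern as the sentence example above, with doc._.chunks holding windows of Token objects rather than sentence spans. A minimal sketch under that assumption (the config values here are illustrative, and the exact element type of doc._.chunks in token mode is an assumption):

import spacy

nlp = spacy.load("en_core_web_sm")

# Same component, configured to chunk by tokens instead of sentences
nlp.add_pipe("chunking", last=True, config={
    "chunking_method": "token",
    "chunk_size": 5,
    "overlap": 2,
    "truncate": True
})

doc = nlp("spaCy Chunks can also split a document into overlapping windows of tokens.")
for i, chunk in enumerate(doc._.chunks, 1):
    # Assumption: each chunk is a sequence of spaCy Token objects
    print(f"Chunk {i}: {[token.text for token in chunk]}")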

Changing Configuration Dynamically

You can change the configuration of the chunking component dynamically:

# Change chunk size
nlp.get_pipe("chunking").chunk_size = 3

# Disable truncation
nlp.get_pipe("chunking").truncate = False

# Process the text again with new settings
doc = nlp(text)
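
For intuition, the output shown earlier is consistent with a sliding window whose step size is chunk_size - overlap. The following plain-Python sketch mirrors that presumed logic; it is an illustration, not the package's actual implementation:

def chunk(items, chunk_size=3, overlap=0, truncate=True):
    # Each new chunk starts chunk_size - overlap items after the previous one
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(items), step):
        window = items[start:start + chunk_size]
        # truncate=True drops a trailing chunk shorter than chunk_size
        if truncate and len(window) < chunk_size:
            break
        chunks.append(window)
    return chunks

sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(chunk(sentences, chunk_size=2, overlap=1))
# [['S1.', 'S2.'], ['S2.', 'S3.'], ['S3.', 'S4.'], ['S4.', 'S5.']]

With chunk_size=2 and overlap=1 this reproduces the four chunks shown in the Usage section; setting truncate=False would additionally keep a shorter final chunk when one remains.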

Contributing

Contributions to spaCy Chunks are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.

