# spaCy Chunks

An easy way to chunk spaCy docs.
spaCy Chunks is a custom pipeline component for spaCy that allows you to generate overlapping chunks of sentences or tokens from a document. This component is useful for various NLP tasks that require processing text in smaller, potentially overlapping segments.
## Features
- Chunk by sentences or tokens
- Configurable chunk size
- Adjustable overlap between chunks
- Option to truncate incomplete chunks
## Installation

To use spaCy Chunks, you need to have spaCy installed. You can install both packages using pip:

```shell
pip install spacy
pip install spacy_chunks
```

Then download a spaCy model:

```shell
python -m spacy download en_core_web_sm
```
## Usage

Here's how to use the spaCy Chunks component:

```python
import spacy

# Load a spaCy model
nlp = spacy.load("en_core_web_sm")

# Add the chunking component to the pipeline
nlp.add_pipe("chunking", last=True, config={
    "chunking_method": "sentence",
    "chunk_size": 2,
    "overlap": 1,
    "truncate": True
})

# Process a text
text = "This is the first sentence. This is the second one. And here's the third. The fourth is here. And a fifth."
doc = nlp(text)

# Print the chunks
print("Chunks:")
for i, chunk in enumerate(doc._.chunks, 1):
    print(f"Chunk {i}: {[sent.text for sent in chunk]}")
```
Output:

```
Chunks:
Chunk 1: ['This is the first sentence.', 'This is the second one.']
Chunk 2: ['This is the second one.', "And here's the third."]
Chunk 3: ["And here's the third.", 'The fourth is here.']
Chunk 4: ['The fourth is here.', 'And a fifth.']
```
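The windowing behaviour behind this output can be reproduced in a few lines of plain Python. The sketch below is not the package's internal code, just an illustration of the idea: each chunk holds `chunk_size` items, and each new chunk starts `chunk_size - overlap` items after the previous one, so with a chunk size of 2 and overlap of 1 the five sentences above yield four chunks.

```python
def chunk(items, chunk_size=3, overlap=0, truncate=True):
    """Yield overlapping windows over a sequence.

    Each window holds `chunk_size` items and starts
    `chunk_size - overlap` items after the previous one.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), step)]
    if truncate:
        # Drop a trailing window that is shorter than chunk_size.
        chunks = [c for c in chunks if len(c) == chunk_size]
    return chunks

sentences = [
    "This is the first sentence.", "This is the second one.",
    "And here's the third.", "The fourth is here.", "And a fifth.",
]
for i, c in enumerate(chunk(sentences, chunk_size=2, overlap=1), 1):
    print(f"Chunk {i}: {c}")
```

Note that `overlap` must be smaller than `chunk_size`, otherwise the window would never advance.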
## Configuration

When adding the chunking component to your pipeline, you can configure the following parameters:

- `chunking_method`: "sentence" or "token" (default: "sentence")
- `chunk_size`: number of sentences or tokens per chunk (default: 3)
- `overlap`: number of overlapping sentences or tokens between consecutive chunks (default: 0)
- `truncate`: whether to drop a final chunk that has fewer than `chunk_size` items (default: True)
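To make the effect of `truncate` concrete, here is a small self-contained illustration in plain Python (again a sketch of the windowing logic, not the package's own code): with five items, a chunk size of 2, and no overlap, the last window holds only one item, so `truncate=True` drops it.

```python
def windows(items, chunk_size, overlap=0, truncate=True):
    # Slide a window of `chunk_size` items, advancing by chunk_size - overlap.
    step = chunk_size - overlap
    out = [items[i:i + chunk_size] for i in range(0, len(items), step)]
    # truncate=True drops a trailing window shorter than chunk_size.
    return [w for w in out if len(w) == chunk_size] if truncate else out

items = ["a", "b", "c", "d", "e"]
print(windows(items, chunk_size=2))                  # [['a', 'b'], ['c', 'd']]
print(windows(items, chunk_size=2, truncate=False))  # [['a', 'b'], ['c', 'd'], ['e']]
```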
## Changing Configuration Dynamically

You can change the configuration of the chunking component dynamically:

```python
# Change chunk size
nlp.get_pipe("chunking").chunk_size = 3

# Disable truncation
nlp.get_pipe("chunking").truncate = False

# Process the text again with the new settings
doc = nlp(text)
```
## Contributing

Contributions to spaCy Chunks are welcome! Please feel free to submit a pull request.

## License

This project is licensed under the MIT License.