Wowool Chunks
Chunking documents semantically
The chunks app intelligently segments documents into meaningful, self-contained sections based on their semantic content.
Options
ChunksOptions
interface ChunksOptions {
  max_chunk_size?: number;
  soft_sentence_limit?: number;
  header_level?: number;
  canonicalize?: boolean | Record<string, any>;
  fix_spelling_mistakes?: boolean;
  lowercase: boolean;
  cleanup: boolean;
  lemmas: boolean;
  dates?: boolean;
  add_themes?: boolean;
  add_topics?: boolean;
  add_outline?: boolean;
}

const default_options: ChunksOptions = {
  max_chunk_size: 100,
  soft_sentence_limit: 4,
  header_level: -1,
  canonicalize: true,
  fix_spelling_mistakes: true,
  lowercase: false,
  cleanup: false,
  lemmas: false,
  dates: true,
  add_themes: false,
  add_topics: false,
  add_outline: false,
};
with:
| Property | Description |
|---|---|
| max_chunk_size | Maximum number of tokens allowed in each chunk |
| soft_sentence_limit | Maximum number of sentences to scan for a soft boundary such as a header or paragraph |
| header_level | Use document outlining and split at the given level of the markup or <h.> tags |
| canonicalize | Resolve names of people and companies to their canonical form |
| lowercase | Lowercase the results |
| fix_spelling_mistakes | Fix spelling mistakes. Example: cooool -> cool or lable -> label |
| cleanup | Remove markup characters |
| lemmas | Return all the data using the lemmas |
| dates | Resolve relative dates. Example: The previous year |
| add_topics | Add topics to the output of every chunk |
| add_themes | Add themes/categories to the output of every chunk |
| add_outline | Add the outline of the document to the output of every chunk |
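As a minimal sketch of how these options might be assembled from Python, the snippet below mirrors the documented defaults in a plain dict and builds the step dictionary in the same `{"name": "chunks.app", "options": {...}}` shape used by the pipeline example in this document. The helper `build_chunks_step` and the `DEFAULT_OPTIONS` constant are hypothetical names, not part of the Wowool SDK, and the merge at the end assumes that unset options fall back to the documented defaults.

```python
# Defaults copied from the default_options listing in this document.
DEFAULT_OPTIONS = {
    "max_chunk_size": 100,
    "soft_sentence_limit": 4,
    "header_level": -1,
    "canonicalize": True,
    "fix_spelling_mistakes": True,
    "lowercase": False,
    "cleanup": False,
    "lemmas": False,
    "dates": True,
    "add_themes": False,
    "add_topics": False,
    "add_outline": False,
}


def build_chunks_step(**overrides):
    """Hypothetical helper: build a pipeline step dict for chunks.app.

    Option names that are not documented are rejected, so a typo such as
    "add_topic" is caught before the pipeline is created.
    """
    unknown = sorted(set(overrides) - set(DEFAULT_OPTIONS))
    if unknown:
        raise ValueError(f"Unknown chunks option(s): {', '.join(unknown)}")
    return {"name": "chunks.app", "options": dict(overrides)}


step = build_chunks_step(max_chunk_size=200, add_topics=True)

# Assumption: options that are not passed keep their documented default.
effective = {**DEFAULT_OPTIONS, **step["options"]}
print(step)
print(effective["max_chunk_size"], effective["dates"])
```

Passing only the overrides keeps the step definition short, while the merged `effective` dict is a convenient way to log which settings a run is expected to use.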
Canonicalize
When set to true, all entities that have a canonical form will be replaced with their canonical representation. In some cases, you may want to specify which entities to canonicalize and how to format them. You might want both the literal text and the canonical form when they differ. You can use a dictionary to specify custom formatting for different entity types, such as formatting Person entities differently from other types.
When using a formatted string, you can use three predefined variables:
- {literal}: The literal string as it appears in the text
- {canonical}: The first canonical form
- {canonicals}: A comma-separated list of all canonical forms
For example:
"canonicalize": {
  "Person": "{literal} ({canonical})",
  "Company": "{canonicals}"
}
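To make the three variables concrete, here is a small illustration of how such templates expand. It is not the app's implementation: `format_entity` is a hypothetical helper built on Python's `str.format`, the entity values are invented, and both the `", "` separator and the fallback to the literal when no canonical form exists are assumptions.

```python
def format_entity(template, literal, canonicals):
    """Hypothetical helper: expand a canonicalize template for one entity."""
    return template.format(
        literal=literal,
        # Assumption: fall back to the literal when no canonical form exists.
        canonical=canonicals[0] if canonicals else literal,
        # Assumption: canonical forms are joined with a comma and a space.
        canonicals=", ".join(canonicals),
    )


canonicalize = {
    "Person": "{literal} ({canonical})",
    "Company": "{canonicals}",
}

# Invented sample entities, for illustration only.
person = format_entity(canonicalize["Person"], "Tolkien", ["J.R.R. Tolkien"])
company = format_entity(
    canonicalize["Company"], "IBM", ["IBM", "International Business Machines"]
)

print(person)   # Tolkien (J.R.R. Tolkien)
print(company)  # IBM, International Business Machines
```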
Results
ChunksResult
type ChunksResult = Chunk[];
Chunk
interface Chunk {
  sentences: string[];
  begin_offset: number;
  end_offset: number;
  outline?: string[];
  topics?: Topic[];
  themes?: Theme[];
}
with:
| Property | Description |
|---|---|
| sentences | Sentences in the given chunk |
| begin_offset | Begin offset of the chunk |
| end_offset | End offset of the chunk |
| outline | Outline of the chunk; a hierarchical summary of where the chunk is located in the document |
| topics | Topics found in the chunk |
| themes | Themes or categories found in the chunk |
Theme
interface Theme {
  name: string;
  relevancy: number;
}
Topic
interface Topic {
  name: string;
  relevancy: number;
}
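The result types above can be mirrored in Python for experimentation without running the app. The dataclasses below are stand-ins that only copy the documented field names; they are not classes exported by the library. The helper `top_topics` is hypothetical, the sample values are invented, and the ranking assumes that a higher `relevancy` means a more relevant topic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Topic:
    name: str
    relevancy: float


@dataclass
class Theme:
    name: str
    relevancy: float


@dataclass
class Chunk:
    sentences: List[str]
    begin_offset: int
    end_offset: int
    outline: Optional[List[str]] = None
    topics: Optional[List[Topic]] = None
    themes: Optional[List[Theme]] = None


def top_topics(chunk, limit=3):
    """Hypothetical helper: return up to `limit` topic names, best first."""
    topics = chunk.topics or []  # topics is optional in the Chunk interface
    # Assumption: a higher relevancy value means a more relevant topic.
    ranked = sorted(topics, key=lambda topic: topic.relevancy, reverse=True)
    return [topic.name for topic in ranked[:limit]]


# Invented sample chunk, for illustration only.
chunk = Chunk(
    sentences=["Bilbo Baggins is a reluctant hobbit."],
    begin_offset=0,
    end_offset=36,
    topics=[Topic("hobbit", 0.4), Topic("dragon", 0.9)],
)
print(top_topics(chunk))  # ['dragon', 'hobbit']
```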
API
Examples
Creating Chunks
This script demonstrates the capabilities of the chunks app: the input text is chunked, and information such as outlines, themes, and topics is added to each chunk.
from wowool.sdk import Pipeline
text = """# List of Authors and Their Books
## J.R.R. Tolkien
J.R.R. Tolkien was an English writer, poet, and professor known for his high fantasy works. He created the richly detailed world of Middle-earth, a place inhabited by hobbits, elves, dwarves, and orcs.
### The Hobbit
- Published: 1937
- Genre: Fantasy
- **Abstract**:
*The Hobbit* is a classic tale of adventure and self-discovery. Bilbo Baggins, a reluctant hobbit, is recruited by the wizard Gandalf and a group of thirteen dwarves to help them reclaim their homeland and treasure from the fearsome dragon Smaug. Along the way, Bilbo encounters trolls, goblins, giant spiders, and a mysterious creature named Gollum. The novel explores themes of bravery, friendship, and the unexpected heroism that lies within ordinary individuals. This book also serves as a prelude to *The Lord of the Rings*, setting up the history of Middle-earth.
### The Lord of the Rings
- Published: 1954
- Genre: Epic Fantasy
- **Abstract**:
*The Lord of the Rings* is a monumental epic fantasy trilogy that follows the journey of Frodo Baggins as he attempts to destroy the One Ring, a powerful artifact created by the dark lord Sauron. The ring grants immense power but also corrupts those who possess it. With the help of friends like Samwise Gamgee, Aragorn, Gandalf, and others, Frodo travels across Middle-earth, facing tremendous challenges and internal struggles. The novel delves into themes of good versus evil, the corrupting influence of power, and the importance of hope and perseverance in the face of overwhelming odds. Tolkien's world-building and intricate mythology make this one of the most beloved and influential works of fantasy literature.
## George Orwell
George Orwell was an English novelist, essayist, and critic whose works often focused on social issues, particularly those related to politics, totalitarianism, and personal freedoms. Orwell's keen insights into human nature and government have made his works enduringly relevant.
### 1984
- Published: 1949
- Genre: Dystopian, Political Fiction
- **Abstract**:
*1984* is a chilling portrayal of a dystopian future where the government, led by the omnipresent Big Brother, controls every aspect of life. Citizens are constantly watched through telescreens, and any hint of rebellion or independent thought is ruthlessly suppressed by the Thought Police. The novel follows Winston Smith, a low-ranking government employee who becomes disillusioned with the oppressive regime and secretly longs for freedom. Orwell's work explores themes such as the dangers of totalitarianism, the manipulation of truth, and the loss of individual identity. The novel remains a powerful critique of oppressive governments and the dangers of surveillance and propaganda.
### Animal Farm
- Published: 1945
- Genre: Allegory, Satire
- **Abstract**:
*Animal Farm* is an allegorical novella that uses a group of farm animals to represent the events leading up to the Russian Revolution of 1917 and the subsequent rise of the Soviet Union. The animals, led by the pigs Napoleon and Snowball, overthrow their human farmer in a bid for equality and freedom. However, as time goes on, the pigs become indistinguishable from the humans they replaced, and the farm’s original ideals are betrayed. Orwell uses the fable to critique the corruption of socialist ideals and the rise of totalitarianism, demonstrating how power can corrupt even the most well-intentioned leaders. The famous line, \"All animals are equal, but some animals are more equal than others,\" captures the central theme of the story.
"""
pipeline = Pipeline(
    [
        "english",
        "entity",
        "topics",
        "semantic-theme",
        {
            "name": "chunks.app",
            "options": {
                "add_outline": True,
                "add_themes": True,
                "add_topics": True,
                "header_level": 3,
                "canonicalize": True,
                "cleanup": True,
                "lowercase": True,
                "fix_spelling_mistakes": True,
            },
        },
    ]
)
document = pipeline(text)
for chunk in document.chunks:
    print("Offsets:", chunk.begin_offset, chunk.end_offset)
    print("Outline:", chunk.outline)
    print("Themes:", chunk.themes)
    print("Topics:", chunk.topics)
    print("Sentences:")
    for sentence in chunk.sentences:
        print(" ", sentence)
    print("-" * 30)
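Once chunks are available, a common next step is to flatten them into records for storage or indexing. The sketch below shows one way this might look. It uses a `SimpleNamespace` stand-in with the same attribute names as the script above, so it runs without the SDK or a license; `chunk_to_record`, the record keys, and the sample values are all hypothetical, and the outline is assumed to be a list of strings as in the `Chunk` interface.

```python
import json
from types import SimpleNamespace


def chunk_to_record(chunk):
    """Hypothetical helper: flatten one chunk into a JSON-serializable dict."""
    # Assumption: outline is a list of heading strings, or missing/None.
    outline = getattr(chunk, "outline", None) or []
    return {
        "span": [chunk.begin_offset, chunk.end_offset],
        "breadcrumb": " > ".join(outline),
        "text": " ".join(chunk.sentences),
    }


# Stand-in for one element of document.chunks; the values are made up.
fake_chunk = SimpleNamespace(
    begin_offset=0,
    end_offset=42,
    outline=["List of Authors and Their Books", "George Orwell", "1984"],
    sentences=["1984 is a dystopian novel.", "It follows Winston Smith."],
)

record = chunk_to_record(fake_chunk)
print(json.dumps(record, indent=2))
```

Keeping the outline as a breadcrumb next to the text means each record carries its position in the document, which can help when a chunk is read out of context.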
License
For both non-commercial and commercial use, you will need to acquire a license file at https://www.wowool.com
Non-Commercial
This library is licensed under the GNU AGPLv3 for non-commercial use.
For commercial use, a separate license must be purchased.
Commercial License Terms
1. Grants the right to use this library in proprietary software.
2. Requires a valid license key.
3. Redistribution in SaaS requires a commercial license.