Wowool Chunks

Chunking documents semantically

The chunks app intelligently segments documents into meaningful, self-contained sections based on their semantic content.

Options

ChunksOptions

interface ChunksOptions {
  max_chunk_size?: number;
  soft_sentence_limit?: number;
  header_level?: number;
  canonicalize?: boolean | Record<string, any>;
  fix_spelling_mistakes?: boolean;
  lowercase: boolean;
  cleanup: boolean;
  lemmas: boolean;
  dates?: boolean;
  add_themes?: boolean;
  add_topics?: boolean;
  add_outline?: boolean;
}

const default_options: ChunksOptions = {
  max_chunk_size: 100,
  soft_sentence_limit: 4,
  header_level: -1,
  canonicalize: true,
  fix_spelling_mistakes: true,
  lowercase: false,
  cleanup: false,
  lemmas: false,
  dates: true,
  add_themes: false,
  add_topics: false,
  add_outline: false,
};

with:

Property	Description
`max_chunk_size`	Maximum number of tokens allowed in each chunk
`soft_sentence_limit`	Maximum number of sentences within which to look for a soft boundary, such as a header or paragraph break
`header_level`	Use the document outline and split at this heading level (markup headings or <h.> tags)
`canonicalize`	Resolve names of people and companies to their canonical form.
`lowercase`	Lowercase the results
`fix_spelling_mistakes`	Fix spelling mistakes. Example: cooool -> cool or lable -> label
`cleanup`	Remove markup characters
`lemmas`	Return all the data using the lemmas
`dates`	Resolve relative dates. Example: The previous year
`add_topics`	Add topics to the output of every chunk
`add_themes`	Add themes/categories to the output of every chunk
`add_outline`	Add the outline of the document to the output of every chunk
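
To make the defaults above easier to work with from Python, here is a minimal sketch of a helper that overlays a few overrides on the documented defaults and returns a pipeline step dictionary. The helper name `chunks_step` is hypothetical and not part of the Wowool SDK; the `{"name": "chunks.app", "options": {...}}` shape is taken from the example script later in this document, and passing every default explicitly is assumed to be harmless.

```python
from typing import Any

# Defaults copied from the default_options listing above.
DEFAULT_OPTIONS: dict[str, Any] = {
    "max_chunk_size": 100,
    "soft_sentence_limit": 4,
    "header_level": -1,
    "canonicalize": True,
    "fix_spelling_mistakes": True,
    "lowercase": False,
    "cleanup": False,
    "lemmas": False,
    "dates": True,
    "add_themes": False,
    "add_topics": False,
    "add_outline": False,
}


def chunks_step(**overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: build a pipeline step dict for chunks.app."""
    # Reject misspelled option names early instead of passing them along.
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown chunks option(s): {sorted(unknown)}")
    # Later keys win, so overrides replace the documented defaults.
    return {"name": "chunks.app", "options": {**DEFAULT_OPTIONS, **overrides}}


step = chunks_step(header_level=3, add_topics=True)
```

The resulting `step` can be placed in the list passed to `Pipeline`, in the same position as the `chunks.app` dictionary in the example script.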

Canonicalize

When set to true, all entities that have a canonical form will be replaced with their canonical representation. In some cases, you may want to specify which entities to canonicalize and how to format them. You might want both the literal text and the canonical form when they differ. You can use a dictionary to specify custom formatting for different entity types, such as formatting Person entities differently from other types.

When using a formatted string, you can use three predefined variables:

{literal} : The literal string as it appears in the text
{canonical} : The first canonical form
{canonicals} : A comma-separated list of all canonical forms

For example,

"canonicalize" : 
  { 
    "Person" : "{literal} ({canonical})",
    "Company" : "{canonicals}"
  }
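
The placeholder semantics can be illustrated with plain Python string formatting. This is only a sketch of what the three variables mean, not the library's implementation: the function `render_entity`, the `", "` separator, the fallback to the literal when no canonical form exists, and the sample entities are all assumptions made for illustration.

```python
def render_entity(template: str, literal: str, canonicals: list[str]) -> str:
    """Expand {literal}, {canonical} and {canonicals} for one entity mention."""
    return template.format(
        literal=literal,
        # Assumption: fall back to the literal text when no canonical form exists.
        canonical=canonicals[0] if canonicals else literal,
        canonicals=", ".join(canonicals),
    )


# Same format strings as in the configuration example above.
formats = {"Person": "{literal} ({canonical})", "Company": "{canonicals}"}

person = render_entity(formats["Person"], "Tolkien", ["J.R.R. Tolkien"])
company = render_entity(
    formats["Company"], "IBM", ["IBM", "International Business Machines"]
)
print(person)   # prints: Tolkien (J.R.R. Tolkien)
print(company)  # prints: IBM, International Business Machines
```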

Results

ChunksResult

type ChunksResult = Chunk[];

Chunk

interface Chunk {
  sentences: string[];
  begin_offset: number;
  end_offset: number;
  outline?: string[];
  topics?: Topic[];
  themes?: Theme[];
}

with:

Property	Description
`sentences`	Sentences in the given chunk
`begin_offset`	Begin offset of the chunk
`end_offset`	End offset of the chunk
`outline`	Outline of a chunk; provides a hierarchical summary where the chunk is located
`topics`	Topics found in the chunk
`themes`	Themes or categories found in the chunk

Theme

interface Theme {
  name: string;
  relevancy: number;
}

Topic

interface Topic {
  name: string;
  relevancy: number;
}
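
As a sketch of how these result types might be consumed, the following mirrors `Chunk` and `Topic` as Python dataclasses and ranks a chunk's topics by `relevancy`. The dataclasses are stand-ins written for this example, not SDK classes; the sample values are invented, and the assumption that a higher `relevancy` means a more relevant topic is not confirmed by this document.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    relevancy: float


@dataclass
class Chunk:
    sentences: list[str]
    begin_offset: int
    end_offset: int
    outline: list[str] = field(default_factory=list)
    topics: list[Topic] = field(default_factory=list)


def top_topics(chunk: Chunk, n: int = 3) -> list[str]:
    """Return the names of the n topics with the highest relevancy."""
    ranked = sorted(chunk.topics, key=lambda t: t.relevancy, reverse=True)
    return [t.name for t in ranked[:n]]


# Invented sample data, shaped like one entry of ChunksResult.
sample = Chunk(
    sentences=["The Hobbit is a classic tale of adventure."],
    begin_offset=0,
    end_offset=42,
    outline=["List of Authors and Their Books", "J.R.R. Tolkien", "The Hobbit"],
    topics=[Topic("dragon", 0.4), Topic("hobbit", 0.9), Topic("wizard", 0.6)],
)
best = top_topics(sample, n=2)
print(best)  # prints: ['hobbit', 'wizard']
```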

Examples

Creating Chunks

This script demonstrates the capabilities of the chunks app: the input text is chunked, and information such as outlines, themes, and topics is added to each chunk.

from wowool.sdk import Pipeline

text = """# List of Authors and Their Books

## J.R.R. Tolkien

J.R.R. Tolkien was an English writer, poet, and professor known for his high fantasy works. He created the richly detailed world of Middle-earth, a place inhabited by hobbits, elves, dwarves, and orcs.

### The Hobbit

- Published: 1937
- Genre: Fantasy
- **Abstract**: 
  *The Hobbit* is a classic tale of adventure and self-discovery. Bilbo Baggins, a reluctant hobbit, is recruited by the wizard Gandalf and a group of thirteen dwarves to help them reclaim their homeland and treasure from the fearsome dragon Smaug. Along the way, Bilbo encounters trolls, goblins, giant spiders, and a mysterious creature named Gollum. The novel explores themes of bravery, friendship, and the unexpected heroism that lies within ordinary individuals. This book also serves as a prelude to *The Lord of the Rings*, setting up the history of Middle-earth.

### The Lord of the Rings
- Published: 1954
- Genre: Epic Fantasy
- **Abstract**: 
  *The Lord of the Rings* is a monumental epic fantasy trilogy that follows the journey of Frodo Baggins as he attempts to destroy the One Ring, a powerful artifact created by the dark lord Sauron. The ring grants immense power but also corrupts those who possess it. With the help of friends like Samwise Gamgee, Aragorn, Gandalf, and others, Frodo travels across Middle-earth, facing tremendous challenges and internal struggles. The novel delves into themes of good versus evil, the corrupting influence of power, and the importance of hope and perseverance in the face of overwhelming odds. Tolkien's world-building and intricate mythology make this one of the most beloved and influential works of fantasy literature.

## George Orwell

George Orwell was an English novelist, essayist, and critic whose works often focused on social issues, particularly those related to politics, totalitarianism, and personal freedoms. Orwell's keen insights into human nature and government have made his works enduringly relevant.

### 1984
- Published: 1949
- Genre: Dystopian, Political Fiction
- **Abstract**: 
  *1984* is a chilling portrayal of a dystopian future where the government, led by the omnipresent Big Brother, controls every aspect of life. Citizens are constantly watched through telescreens, and any hint of rebellion or independent thought is ruthlessly suppressed by the Thought Police. The novel follows Winston Smith, a low-ranking government employee who becomes disillusioned with the oppressive regime and secretly longs for freedom. Orwell's work explores themes such as the dangers of totalitarianism, the manipulation of truth, and the loss of individual identity. The novel remains a powerful critique of oppressive governments and the dangers of surveillance and propaganda.

### Animal Farm
- Published: 1945
- Genre: Allegory, Satire
- **Abstract**: 
  *Animal Farm* is an allegorical novella that uses a group of farm animals to represent the events leading up to the Russian Revolution of 1917 and the subsequent rise of the Soviet Union. The animals, led by the pigs Napoleon and Snowball, overthrow their human farmer in a bid for equality and freedom. However, as time goes on, the pigs become indistinguishable from the humans they replaced, and the farm’s original ideals are betrayed. Orwell uses the fable to critique the corruption of socialist ideals and the rise of totalitarianism, demonstrating how power can corrupt even the most well-intentioned leaders. The famous line, \"All animals are equal, but some animals are more equal than others,\" captures the central theme of the story.
"""
pipeline = Pipeline(
    [
        "english",
        "entity",
        "topics",
        "semantic-theme",
        {
            "name": "chunks.app",
            "options": {
                "add_outline": True,
                "add_themes": True,
                "add_topics": True,
                "header_level": 3,
                "canonicalize": True,
                "cleanup": True,
                "lowercase": True,
                "fix_spelling_mistakes": True,
            },
        },
    ]
)
document = pipeline(text)
for chunk in document.chunks:
    print("Offsets:", chunk.begin_offset, chunk.end_offset)
    print("Outline:", chunk.outline)
    print("Themes:", chunk.themes)
    print("Topics:", chunk.topics)
    print("Sentences:")
    for sentence in chunk.sentences:
        print("  ", sentence)

    print("-" * 30)
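
Instead of printing, the chunks can be flattened into plain dictionaries, for example to store them or pass them to an indexer. The sketch below assumes only the attributes used in the script above (`sentences`, `begin_offset`, `end_offset`, `outline`); the function `chunk_to_record` is hypothetical, and a `SimpleNamespace` stands in for a real chunk so that the example runs without the SDK or a license.

```python
import json
from types import SimpleNamespace


def chunk_to_record(chunk) -> dict:
    """Flatten one chunk-like object into a JSON-serializable dict."""
    return {
        # Joining with a single space is an assumption about the desired text form.
        "text": " ".join(chunk.sentences),
        "begin_offset": chunk.begin_offset,
        "end_offset": chunk.end_offset,
        # outline is optional, so tolerate a missing or None value.
        "outline": list(getattr(chunk, "outline", None) or []),
    }


# Stand-in for one element of document.chunks, with invented values.
fake_chunk = SimpleNamespace(
    sentences=["published: 1937", "genre: fantasy"],
    begin_offset=10,
    end_offset=40,
    outline=["J.R.R. Tolkien", "The Hobbit"],
)
record = chunk_to_record(fake_chunk)
print(json.dumps(record))
```

With the real pipeline, the same function could be applied as `[chunk_to_record(c) for c in document.chunks]`, provided the chunk objects expose these attributes as shown in the script.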

License

For both non-commercial and commercial use, you need to acquire a license file at https://www.wowool.com

Non-Commercial

This library is licensed under the GNU AGPLv3 for non-commercial use.  
For commercial use, a separate license must be purchased.

Commercial License Terms

1. Grants the right to use this library in proprietary software.
2. Requires a valid license key.
3. Redistribution in SaaS requires a commercial license.

Release history

Version	Release date
1.3.3 (this version)	Oct 23, 2025
1.3.2	Oct 19, 2025
1.3.1	Sep 26, 2025
1.3.0	Jun 23, 2025
1.2.0	Jun 2, 2025
1.1.1	May 15, 2025
