Wowool Chunks

Chunking documents semantically

The chunks app intelligently segments documents into meaningful, self-contained sections based on their semantic content.

Options

ChunksOptions

interface ChunksOptions {
  max_chunk_size?: number;
  soft_sentence_limit?: number;
  header_level?: number;
  canonicalize?: boolean | Record<string, any>;
  fix_spelling_mistakes?: boolean;
  lowercase: boolean;
  cleanup: boolean;
  lemmas: boolean;
  dates?: boolean;
  add_themes?: boolean;
  add_topics?: boolean;
  add_outline?: boolean;
}

const default_options: ChunksOptions = {
  max_chunk_size: 100,
  soft_sentence_limit: 4,
  header_level: -1,
  canonicalize: true,
  fix_spelling_mistakes: true,
  lowercase: false,
  cleanup: false,
  lemmas: false,
  dates: true,
  add_themes: false,
  add_topics: false,
  add_outline: false,
};

with:

| Property | Description |
|---|---|
| max_chunk_size | Maximum number of tokens allowed in each chunk |
| soft_sentence_limit | Maximum number of sentences within which to look for a soft boundary, such as a header or paragraph break |
| header_level | Use the document outline and split at the given level of the markup headers or <h.> tags |
| canonicalize | Resolve names of people and companies to their canonical form |
| lowercase | Lowercase the results |
| fix_spelling_mistakes | Fix spelling mistakes. Example: cooool -> cool or lable -> label |
| cleanup | Remove markup characters |
| lemmas | Return all the data using the lemmas |
| dates | Resolve relative dates. Example: The previous year |
| add_topics | Add topics to the output of every chunk |
| add_themes | Add themes/categories to the output of every chunk |
| add_outline | Add the outline of the document to the output of every chunk |
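
As an illustration of how the defaults combine with user-supplied options, the following Python sketch overlays a partial options dictionary on the defaults listed above. The `resolve_options` helper is hypothetical and not part of the wowool SDK. It assumes that any option left unspecified falls back to its default and that unknown keys should be rejected.

```python
# Defaults copied from the default_options listing above.
DEFAULT_OPTIONS = {
    "max_chunk_size": 100,
    "soft_sentence_limit": 4,
    "header_level": -1,
    "canonicalize": True,
    "fix_spelling_mistakes": True,
    "lowercase": False,
    "cleanup": False,
    "lemmas": False,
    "dates": True,
    "add_themes": False,
    "add_topics": False,
    "add_outline": False,
}


def resolve_options(user_options=None):
    """Hypothetical helper: overlay user options on the documented defaults."""
    user_options = user_options or {}
    # Assumption: option names outside the documented set are treated as errors.
    unknown = set(user_options) - set(DEFAULT_OPTIONS)
    if unknown:
        raise KeyError(f"Unknown chunk options: {sorted(unknown)}")
    # Later keys win, so user values replace the defaults.
    return {**DEFAULT_OPTIONS, **user_options}


options = resolve_options({"header_level": 3, "add_outline": True})
print(options["header_level"], options["max_chunk_size"])
```

Only the keys passed in change; everything else keeps its default value, which is why the example pipeline further down only needs to list the options it overrides.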

Canonicalize

When set to true, all entities that have a canonical form will be replaced with their canonical representation. In some cases, you may want to specify which entities to canonicalize and how to format them. You might want both the literal text and the canonical form when they differ. You can use a dictionary to specify custom formatting for different entity types, such as formatting Person entities differently from other types.

When using a formatted string, you can use three predefined variables:

  • {literal} : The literal string as it appears in the text
  • {canonical} : The first canonical form
  • {canonicals} : A comma-separated list of all canonical forms

For example,

"canonicalize" : 
  { 
    "Person" : "{literal} ({canonical})",
    "Company" : "{canonicals}"
  }
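
To make the placeholders concrete, the sketch below expands such format strings with Python's `str.format`. It only illustrates the substitution: the `render_entity` helper, the sample entities, and their canonical forms are invented for this example and do not reflect the internals of the chunks app. The exact separator used for `{canonicals}` and the fallback for entities without a canonical form are assumptions.

```python
def render_entity(template, literal, canonicals):
    """Hypothetical helper: fill the three documented placeholders."""
    return template.format(
        literal=literal,
        # Assumption: fall back to the literal text when no canonical form exists.
        canonical=canonicals[0] if canonicals else literal,
        # Assumption: canonical forms are joined with a comma and a space.
        canonicals=", ".join(canonicals),
    )


formats = {
    "Person": "{literal} ({canonical})",
    "Company": "{canonicals}",
}

print(render_entity(formats["Person"], "Tolkien", ["J.R.R. Tolkien"]))
print(render_entity(formats["Company"], "ACME", ["Acme Corp", "Acme Corporation"]))
```

With these made-up inputs, the Person format keeps the literal text and appends the first canonical form in parentheses, while the Company format lists every canonical form.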

Results

ChunksResult

type ChunksResult = Chunk[];

Chunk

interface Chunk {
  sentences: string[];
  begin_offset: number;
  end_offset: number;
  outline?: string[];
  topics?: Topic[];
  themes?: Theme[];
}

with:

| Property | Description |
|---|---|
| sentences | Sentences in the given chunk |
| begin_offset | Begin offset of the chunk |
| end_offset | End offset of the chunk |
| outline | Outline of the chunk: the hierarchy of headers under which the chunk is located |
| topics | Topics found in the chunk |
| themes | Themes or categories found in the chunk |

Theme

interface Theme {
  name: string;
  relevancy: number;
}

Topic

interface Topic {
  name: string;
  relevancy: number;
}
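
As a sketch of how a result with this shape might be consumed, the Python below mirrors the `Chunk` and `Topic` interfaces as dataclasses and selects the most relevant topic of each chunk. The dataclasses, the `top_topic` helper, and the sample data are stand-ins for illustration, not the objects returned by the SDK. Themes are omitted for brevity since they have the same shape as topics.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Topic:
    name: str
    relevancy: float


@dataclass
class Chunk:
    sentences: List[str]
    begin_offset: int
    end_offset: int
    # Optional fields in the interface default to empty lists here.
    outline: List[str] = field(default_factory=list)
    topics: List[Topic] = field(default_factory=list)


def top_topic(chunk: Chunk) -> Optional[str]:
    """Return the name of the topic with the highest relevancy, if any."""
    if not chunk.topics:
        return None
    return max(chunk.topics, key=lambda t: t.relevancy).name


# Invented sample data; the relevancy values are arbitrary.
chunks = [
    Chunk(["Bilbo meets Gollum."], 0, 19, ["The Hobbit"],
          [Topic("hobbit", 0.9), Topic("dragon", 0.4)]),
    Chunk(["No topics here."], 20, 35),
]

for chunk in chunks:
    print(chunk.begin_offset, chunk.end_offset, top_topic(chunk))
```

Because `outline`, `topics`, and `themes` are optional in the interface, code that reads them should handle their absence, as `top_topic` does for chunks without topics.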

API

Examples

Creating Chunks

This script demonstrates the capabilities of the chunks app: the input text is chunked, and information such as outlines, themes, and topics is added to each chunk.

from wowool.sdk import Pipeline

text = """# List of Authors and Their Books

## J.R.R. Tolkien

J.R.R. Tolkien was an English writer, poet, and professor known for his high fantasy works. He created the richly detailed world of Middle-earth, a place inhabited by hobbits, elves, dwarves, and orcs.

### The Hobbit

- Published: 1937
- Genre: Fantasy
- **Abstract**: 
  *The Hobbit* is a classic tale of adventure and self-discovery. Bilbo Baggins, a reluctant hobbit, is recruited by the wizard Gandalf and a group of thirteen dwarves to help them reclaim their homeland and treasure from the fearsome dragon Smaug. Along the way, Bilbo encounters trolls, goblins, giant spiders, and a mysterious creature named Gollum. The novel explores themes of bravery, friendship, and the unexpected heroism that lies within ordinary individuals. This book also serves as a prelude to *The Lord of the Rings*, setting up the history of Middle-earth.

### The Lord of the Rings
- Published: 1954
- Genre: Epic Fantasy
- **Abstract**: 
  *The Lord of the Rings* is a monumental epic fantasy trilogy that follows the journey of Frodo Baggins as he attempts to destroy the One Ring, a powerful artifact created by the dark lord Sauron. The ring grants immense power but also corrupts those who possess it. With the help of friends like Samwise Gamgee, Aragorn, Gandalf, and others, Frodo travels across Middle-earth, facing tremendous challenges and internal struggles. The novel delves into themes of good versus evil, the corrupting influence of power, and the importance of hope and perseverance in the face of overwhelming odds. Tolkien's world-building and intricate mythology make this one of the most beloved and influential works of fantasy literature.

## George Orwell

George Orwell was an English novelist, essayist, and critic whose works often focused on social issues, particularly those related to politics, totalitarianism, and personal freedoms. Orwell's keen insights into human nature and government have made his works enduringly relevant.

### 1984
- Published: 1949
- Genre: Dystopian, Political Fiction
- **Abstract**: 
  *1984* is a chilling portrayal of a dystopian future where the government, led by the omnipresent Big Brother, controls every aspect of life. Citizens are constantly watched through telescreens, and any hint of rebellion or independent thought is ruthlessly suppressed by the Thought Police. The novel follows Winston Smith, a low-ranking government employee who becomes disillusioned with the oppressive regime and secretly longs for freedom. Orwell's work explores themes such as the dangers of totalitarianism, the manipulation of truth, and the loss of individual identity. The novel remains a powerful critique of oppressive governments and the dangers of surveillance and propaganda.

### Animal Farm
- Published: 1945
- Genre: Allegory, Satire
- **Abstract**: 
  *Animal Farm* is an allegorical novella that uses a group of farm animals to represent the events leading up to the Russian Revolution of 1917 and the subsequent rise of the Soviet Union. The animals, led by the pigs Napoleon and Snowball, overthrow their human farmer in a bid for equality and freedom. However, as time goes on, the pigs become indistinguishable from the humans they replaced, and the farm’s original ideals are betrayed. Orwell uses the fable to critique the corruption of socialist ideals and the rise of totalitarianism, demonstrating how power can corrupt even the most well-intentioned leaders. The famous line, \"All animals are equal, but some animals are more equal than others,\" captures the central theme of the story.
"""
pipeline = Pipeline(
    [
        "english",
        "entity",
        "topics",
        "semantic-theme",
        {
            "name": "chunks.app",
            "options": {
                "add_outline": True,
                "add_themes": True,
                "add_topics": True,
                "header_level": 3,
                "canonicalize": True,
                "cleanup": True,
                "lowercase": True,
                "fix_spelling_mistakes": True,
            },
        },
    ]
)
document = pipeline(text)
for chunk in document.chunks:
    print("Offsets:", chunk.begin_offset, chunk.end_offset)
    print("Outline:", chunk.outline)
    print("Themes:", chunk.themes)
    print("Topics:", chunk.topics)
    print("Sentences:")
    for sentence in chunk.sentences:
        print("  ", sentence)

    print("-" * 30)

License

In both cases you will need to acquire a license file at https://www.wowool.com

Non-Commercial

This library is licensed under the GNU AGPLv3 for non-commercial use.  
For commercial use, a separate license must be purchased.  

Commercial License Terms

1. Grants the right to use this library in proprietary software.
2. Requires a valid license key.
3. Redistribution in SaaS requires a commercial license.
