Skip to main content

Polars expression plugins for text analysis

Project description

polars-text

Polars expression plugins for fast, practical text analysis. Use them as expressions or via the pl.col("text").text.* namespace, plus a few Series-based utilities for token frequency stats and topic modeling.

Quick start

import polars as pl
import polars_text as pt

df = pl.DataFrame({
    "text": [
        "Alice said \"Hello world\".",
        "Hello again, world!",
    ]
})

out = df.with_columns([
    pt.clean_text(pl.col("text")).alias("clean"),
    pt.word_count(pl.col("text")).alias("word_count"),
    pt.char_count(pl.col("text")).alias("char_count"),
    pt.sentence_count(pl.col("text")).alias("sentence_count"),
    pt.tokenize(pl.col("text"), lowercase=True, remove_punct=True).alias("tokens"),
])

Expressions and namespace

All expression functions are available both as module functions and through the text namespace on expressions.

Expression functions

  • tokenize(expr, lowercase=True, remove_punct=True)
  • clean_text(expr)
  • word_count(expr)
  • char_count(expr)
  • sentence_count(expr)
  • concordance(expr, search_word, num_left_tokens=5, num_right_tokens=5, regex=False, case_sensitive=False)
  • quotation(expr)

Namespace usage

df = pl.DataFrame({"text": ["Hello world, hello again."]})

out = df.select([
    pl.col("text").text.clean_text().alias("clean"),
    pl.col("text").text.word_count().alias("word_count"),
    pl.col("text").text.tokenize().alias("tokens"),
])

Concordance

Get left/right context windows around a search term. Output is a list of structs that you can explode and unnest for tabular use.

df = pl.DataFrame({"text": ["Hello world, hello again."]})

concordance = (
    pl.col("text")
    .text.concordance("hello", num_left_tokens=1, num_right_tokens=1)
    .list.explode()
    .struct.unnest()
)

out = df.select(concordance)

Quotation extraction

Extract quoted speech along with speaker, verb, and offsets. Output is a list of structs you can explode and unnest.

df = pl.DataFrame({"text": ["Alice said \"Hello world\"."]})

quotes = (
    pl.col("text")
    .text.quotation()
    .list.explode()
    .struct.unnest()
)

out = df.select(quotes)

Token frequencies and stats

Compute corpus token counts and compare corpora with standard statistics.

series_0 = pl.Series("text", ["hello world", "hello again"])
series_1 = pl.Series("text", ["goodbye world"])

freqs_0 = pt.token_frequencies(series_0)
freqs_1 = pt.token_frequencies(series_1)

stats = pt.token_frequency_stats(freqs_0, freqs_1)

Topic modeling

Cluster documents and return topic labels plus per-document topic assignments.

series = pl.Series("text", [
    "Policy changes were announced today.",
    "Elections are coming soon.",
    "The football match was thrilling.",
])

topics, doc_topics = pt.topic_modeling(series, min_points=2, max_terms=3)

topics is a dict of topic_id -> label and doc_topics is a Series of lists of structs with {topic_id, weight}.

Output schemas

Concordance (list of structs):

  • left_context, matched_text, right_context
  • start_idx, end_idx
  • l1, r1 (first token on left/right for quick filtering)

Quotation (list of structs):

  • speaker, speaker_start_idx, speaker_end_idx
  • quote, quote_start_idx, quote_end_idx
  • verb, verb_start_idx, verb_end_idx
  • quote_type, quote_token_count, is_floating_quote

Topic modeling (Series of list structs):

  • topic_id (int), weight (float)

Models and downloads

Some features download Hugging Face models on first use (via hf-hub) and run on CPU:

  • Tokenization: bert-base-uncased (tokenizer.json)
  • Topic modeling embeddings: sentence-transformers/all-MiniLM-L6-v2
  • Quotation POS tagging: vblagoje/bert-english-uncased-finetuned-pos

The initial call may take longer while models download and cache.

Development

Build the extension locally with maturin and then import as polars_text.

For release and publishing procedures, see PUBLISH.md.

make build
make test

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

polars_text-0.1.1.tar.gz (59.5 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_text-0.1.1-cp314-cp314-win_amd64.whl (7.8 MB view details)

Uploaded CPython 3.14Windows x86-64

polars_text-0.1.1-cp314-cp314-manylinux_2_28_x86_64.whl (11.2 MB view details)

Uploaded CPython 3.14manylinux: glibc 2.28+ x86-64

polars_text-0.1.1-cp314-cp314-macosx_11_0_arm64.whl (7.7 MB view details)

Uploaded CPython 3.14macOS 11.0+ ARM64

File details

Details for the file polars_text-0.1.1.tar.gz.

File metadata

  • Download URL: polars_text-0.1.1.tar.gz
  • Upload date:
  • Size: 59.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for polars_text-0.1.1.tar.gz
Algorithm Hash digest
SHA256 9f00c66065fce4722af02fb9798eecda529168637182592bb381087528515035
MD5 7d5465f79c52fba8d53493ac606429cf
BLAKE2b-256 818b1081595db4e4e38c73a8876127624a0f1ca28cd6b1f76c9dfa3247c59aef

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.1.tar.gz:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.1-cp314-cp314-win_amd64.whl.

File metadata

File hashes

Hashes for polars_text-0.1.1-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 5f75ddef203520bc448b4318a3a5b87dd2055b6f47198a3ddcd26c67c4463f55
MD5 458f50ce5aa61af596cccfa98db63354
BLAKE2b-256 39e4eff93a53fc7c00e3d85d595c302bc99201431b6e62a5f6b30a0af978d25e

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.1-cp314-cp314-win_amd64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.1-cp314-cp314-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for polars_text-0.1.1-cp314-cp314-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 081c911bbc148cf96af2dd60d19850e6523bea3e96e937a5eeb78da2b627d0eb
MD5 b8bb2bfdb11a25d12863fd47bd90ccd1
BLAKE2b-256 3803666a2fc6c8749bae2912e51ddf335dc8466cad8552d9b7762e40c8e58ee2

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.1-cp314-cp314-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.1-cp314-cp314-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_text-0.1.1-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2f74897bb965546c3eb1915dd37d180fff2aa3af03b5827d9f24e083cd515c46
MD5 5984a41b5060926e7db523a4d9869999
BLAKE2b-256 5c3b4ebf46a7802e2dfb6d2873aef6efceecd617a4d802ed27487ec155e69344

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.1-cp314-cp314-macosx_11_0_arm64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page