Skip to main content

Polars expression plugins for text analysis

Project description

polars-text

Polars expression plugins for fast, practical text analysis. Use them as expressions or via the pl.col("text").text.* namespace, plus a few Series-based utilities for token frequency stats and topic modeling.

Quick start

import polars as pl
import polars_text as pt

df = pl.DataFrame({
    "text": [
        "Alice said \"Hello world\".",
        "Hello again, world!",
    ]
})

out = df.with_columns([
    pt.clean_text(pl.col("text")).alias("clean"),
    pt.word_count(pl.col("text")).alias("word_count"),
    pt.char_count(pl.col("text")).alias("char_count"),
    pt.sentence_count(pl.col("text")).alias("sentence_count"),
    pt.tokenize(pl.col("text"), lowercase=True, remove_punct=True).alias("tokens"),
])

Expressions and namespace

All expression functions are available both as module functions and through the text namespace on expressions.

Expression functions

  • tokenize(expr, lowercase=True, remove_punct=True)
  • clean_text(expr)
  • word_count(expr)
  • char_count(expr)
  • sentence_count(expr)
  • concordance(expr, search_word, num_left_tokens=5, num_right_tokens=5, regex=False, case_sensitive=False)
  • quotation(expr)

Namespace usage

df = pl.DataFrame({"text": ["Hello world, hello again."]})

out = df.select([
    pl.col("text").text.clean_text().alias("clean"),
    pl.col("text").text.word_count().alias("word_count"),
    pl.col("text").text.tokenize().alias("tokens"),
])

Concordance

Get left/right context windows around a search term. Output is a list of structs that you can explode and unnest for tabular use.

df = pl.DataFrame({"text": ["Hello world, hello again."]})

concordance = (
    pl.col("text")
    .text.concordance("hello", num_left_tokens=1, num_right_tokens=1)
    .list.explode()
    .struct.unnest()
)

out = df.select(concordance)

Quotation extraction

Extract quoted speech along with speaker, verb, and offsets. Output is a list of structs you can explode and unnest.

df = pl.DataFrame({"text": ["Alice said \"Hello world\"."]})

quotes = (
    pl.col("text")
    .text.quotation()
    .list.explode()
    .struct.unnest()
)

out = df.select(quotes)

Token frequencies and stats

Compute corpus token counts and compare corpora with standard statistics.

series_0 = pl.Series("text", ["hello world", "hello again"])
series_1 = pl.Series("text", ["goodbye world"])

freqs_0 = pt.token_frequencies(series_0)
freqs_1 = pt.token_frequencies(series_1)

stats = pt.token_frequency_stats(freqs_0, freqs_1)

Topic modeling

Cluster documents and return topic labels plus per-document topic assignments.

series = pl.Series("text", [
    "Policy changes were announced today.",
    "Elections are coming soon.",
    "The football match was thrilling.",
])

topics, doc_topics = pt.topic_modeling(series, min_points=2, max_terms=3)

topics is a dict of topic_id -> label and doc_topics is a Series of lists of structs with {topic_id, weight}.

Output schemas

Concordance (list of structs):

  • left_context, matched_text, right_context
  • start_idx, end_idx
  • l1, r1 (first token on left/right for quick filtering)

Quotation (list of structs):

  • speaker, speaker_start_idx, speaker_end_idx
  • quote, quote_start_idx, quote_end_idx
  • verb, verb_start_idx, verb_end_idx
  • quote_type, quote_token_count, is_floating_quote

Topic modeling (Series of list structs):

  • topic_id (int), weight (float)

Models and downloads

Some features download Hugging Face models on first use (via hf-hub) and run on CPU:

  • Tokenization: bert-base-uncased (tokenizer.json)
  • Topic modeling embeddings: sentence-transformers/all-MiniLM-L6-v2
  • Quotation POS tagging: vblagoje/bert-english-uncased-finetuned-pos

The initial call may take longer while models download and cache.

Development

Build the extension locally with maturin and then import as polars_text.

For release and publishing procedures, see PUBLISH.md.

make build
make test

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

polars_text-0.1.4.tar.gz (5.4 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_text-0.1.4-cp314-cp314-win_amd64.whl (16.3 MB view details)

Uploaded CPython 3.14Windows x86-64

polars_text-0.1.4-cp314-cp314-manylinux_2_28_x86_64.whl (20.9 MB view details)

Uploaded CPython 3.14manylinux: glibc 2.28+ x86-64

polars_text-0.1.4-cp314-cp314-macosx_11_0_arm64.whl (16.6 MB view details)

Uploaded CPython 3.14macOS 11.0+ ARM64

File details

Details for the file polars_text-0.1.4.tar.gz.

File metadata

  • Download URL: polars_text-0.1.4.tar.gz
  • Upload date:
  • Size: 5.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_text-0.1.4.tar.gz
Algorithm Hash digest
SHA256 4ddd9c0205db327f56ce07b9f57f97c8934e95cd992a16cd878c0b4f87f623db
MD5 7dbb59782d7c156fa189312809960133
BLAKE2b-256 2ac201e8c75cabd0da5331330dd5765b19ed7aa4ffbd572270bca455337ad950

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.4.tar.gz:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.4-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: polars_text-0.1.4-cp314-cp314-win_amd64.whl
  • Upload date:
  • Size: 16.3 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_text-0.1.4-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 56c48280266d169d9803315a9cc3eaaebfb98b10af6f55c45b5b36b62aee6aaa
MD5 2c9ac8f0f46cfab2ca509c6e0b508254
BLAKE2b-256 6189821ed065f37055b797f3ae69b141190fa3369c611112ad4692a8fcd24aaa

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.4-cp314-cp314-win_amd64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.4-cp314-cp314-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for polars_text-0.1.4-cp314-cp314-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 c7d8d49f6eac715e0eb70f25ffb9fcd03cf460b688ff52ed0cf9f949e0ec6573
MD5 0c5c3f963fd7d310ccb4c4ad35b63aa9
BLAKE2b-256 61a4809d9f38cbddbc8215c13a6cc32d5ab2e59cef888ad5da2b6dbbb7bf5823

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.4-cp314-cp314-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.4-cp314-cp314-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_text-0.1.4-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 bd42536ccc859168b924f3df0f15cd02a80a57186bed458c192d52439f21e6a0
MD5 36b1c1db25ede07b8ca1cb7a8e22c73d
BLAKE2b-256 7369e170a04f29946c4ef471ad81a175505681a0ff1d615fa5c342f8df09dafb

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.4-cp314-cp314-macosx_11_0_arm64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page