Polars expression plugins for text analysis

polars-text

Polars expression plugins for fast, practical text analysis. Call them as module-level functions or through the pl.col("text").text.* namespace. The package also includes a few Series-based utilities for token frequency statistics and topic modeling.

Quick start

import polars as pl
import polars_text as pt

df = pl.DataFrame({
    "text": [
        "Alice said \"Hello world\".",
        "Hello again, world!",
    ]
})

out = df.with_columns([
    pt.clean_text(pl.col("text")).alias("clean"),
    pt.word_count(pl.col("text")).alias("word_count"),
    pt.char_count(pl.col("text")).alias("char_count"),
    pt.sentence_count(pl.col("text")).alias("sentence_count"),
    pt.tokenize(pl.col("text"), lowercase=True, remove_punct=True).alias("tokens"),
])

Expressions and namespace

All expression functions are available both as module functions and through the text namespace on expressions.

Expression functions

  • tokenize(expr, lowercase=True, remove_punct=True)
  • clean_text(expr)
  • word_count(expr)
  • char_count(expr)
  • sentence_count(expr)
  • concordance(expr, search_word, num_left_tokens=5, num_right_tokens=5, regex=False, case_sensitive=False)
  • quotation(expr)

Namespace usage

df = pl.DataFrame({"text": ["Hello world, hello again."]})

out = df.select([
    pl.col("text").text.clean_text().alias("clean"),
    pl.col("text").text.word_count().alias("word_count"),
    pl.col("text").text.tokenize().alias("tokens"),
])

Concordance

Get left/right context windows around a search term. Output is a list of structs that you can explode and unnest for tabular use.

df = pl.DataFrame({"text": ["Hello world, hello again."]})

concordance = (
    pl.col("text")
    .text.concordance("hello", num_left_tokens=1, num_right_tokens=1)
    .list.explode()
    .struct.unnest()
)

out = df.select(concordance)
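
To make the idea of a context window concrete, here is a minimal pure-Python sketch of a keyword-in-context search. It is not the plugin's implementation: the kwic helper is hypothetical, tokens are assumed to be simple runs of word characters, and only three of the documented output fields are reproduced.

```python
import re


def kwic(text, search_word, num_left_tokens=5, num_right_tokens=5,
         case_sensitive=False):
    """Return one keyword-in-context row per token equal to search_word."""
    # Simplifying assumption: a token is a run of word characters.
    tokens = re.findall(r"\w+", text)
    norm = (lambda s: s) if case_sensitive else str.lower
    target = norm(search_word)

    rows = []
    for i, tok in enumerate(tokens):
        if norm(tok) != target:
            continue
        # Slice up to N tokens on each side, clamped at the text boundaries.
        left = tokens[max(0, i - num_left_tokens):i]
        right = tokens[i + 1:i + 1 + num_right_tokens]
        rows.append({
            "left_context": " ".join(left),
            "matched_text": tok,
            "right_context": " ".join(right),
        })
    return rows


rows = kwic("Hello world, hello again.", "hello",
            num_left_tokens=1, num_right_tokens=1)
for row in rows:
    print(row)
```

Each match becomes one row, which mirrors why the plugin's list output is typically exploded and unnested before further filtering.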

Quotation extraction

Extract quoted speech along with speaker, verb, and offsets. Output is a list of structs you can explode and unnest.

df = pl.DataFrame({"text": ["Alice said \"Hello world\"."]})

quotes = (
    pl.col("text")
    .text.quotation()
    .list.explode()
    .struct.unnest()
)

out = df.select(quotes)
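
The offsets in the output can be read as positions in the original string. The sketch below illustrates that with a plain regular expression; it is a deliberate simplification, not how the plugin works (the plugin also identifies speaker and verb). The find_quotes helper is hypothetical, only straight double quotes are handled, and treating the offsets as end-exclusive character positions is an assumption.

```python
import re

# Assumption: only straight double quotes delimit a quotation.
QUOTE_RE = re.compile(r'"([^"]*)"')


def find_quotes(text):
    """Return quoted spans with character offsets (end-exclusive)."""
    rows = []
    for match in QUOTE_RE.finditer(text):
        rows.append({
            "quote": match.group(1),
            # Offsets cover the quoted words only, not the quote marks.
            "quote_start_idx": match.start(1),
            "quote_end_idx": match.end(1),
        })
    return rows


text = 'Alice said "Hello world".'
quotes = find_quotes(text)

# Slicing the source text with the offsets recovers the quote.
first = quotes[0]
print(text[first["quote_start_idx"]:first["quote_end_idx"]])
```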

Token frequencies and stats

Compute corpus token counts and compare corpora with standard statistics.

series_0 = pl.Series("text", ["hello world", "hello again"])
series_1 = pl.Series("text", ["goodbye world"])

freqs_0 = pt.token_frequencies(series_0)
freqs_1 = pt.token_frequencies(series_1)

stats = pt.token_frequency_stats(freqs_0, freqs_1)
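
The README does not say which statistics token_frequency_stats returns. As an illustration of the kind of comparison involved, the sketch below counts tokens with the standard library and computes the log-likelihood (G2) keyness score that is common in corpus linguistics. The helper names, the whitespace tokenization, and the choice of statistic are all assumptions.

```python
import math
from collections import Counter


def token_counts(docs):
    """Count lowercase whitespace-separated tokens across documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def log_likelihood(a, b, total_a, total_b):
    """G2 for a token seen a times in corpus A and b times in corpus B."""
    # Expected counts if the token were equally frequent in both corpora.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


counts_0 = token_counts(["hello world", "hello again"])
counts_1 = token_counts(["goodbye world"])
total_0 = sum(counts_0.values())
total_1 = sum(counts_1.values())

# A Counter returns 0 for tokens that are absent from a corpus.
keyness = {
    tok: log_likelihood(counts_0[tok], counts_1[tok], total_0, total_1)
    for tok in set(counts_0) | set(counts_1)
}
for tok, score in sorted(keyness.items(), key=lambda kv: -kv[1]):
    print(f"{tok}: {score:.3f}")
```

Higher scores mark tokens whose relative frequency differs most between the two corpora; a token with the same relative frequency in both scores 0.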

Topic modeling

Cluster documents and return topic labels plus per-document topic assignments.

series = pl.Series("text", [
    "Policy changes were announced today.",
    "Elections are coming soon.",
    "The football match was thrilling.",
])

topics, doc_topics = pt.topic_modeling(series, min_points=2, max_terms=3)

topics is a dict of topic_id -> label and doc_topics is a Series of lists of structs with {topic_id, weight}.
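
A common next step is to pick the highest-weight topic for each document. The sketch below shows this on hand-written stand-in values shaped like the structure described above; the topic ids, labels, and weights are invented for illustration and are not real model output, and dominant_topic is a hypothetical helper. With a real result, doc_topics.to_list() would give the same list-of-lists-of-dicts shape.

```python
# Stand-in values with the described shape (invented, not real model output).
topics = {0: "policy, elections", 1: "football, match"}
doc_topics = [
    [{"topic_id": 0, "weight": 0.9}, {"topic_id": 1, "weight": 0.1}],
    [{"topic_id": 0, "weight": 0.8}],
    [{"topic_id": 1, "weight": 1.0}],
]


def dominant_topic(assignments, labels):
    """Return the label of the highest-weight topic, or None if unassigned."""
    if not assignments:
        return None
    best = max(assignments, key=lambda a: a["weight"])
    return labels.get(best["topic_id"])


dominant = [dominant_topic(a, topics) for a in doc_topics]
print(dominant)
```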

Output schemas

Concordance (list of structs):

  • left_context, matched_text, right_context
  • start_idx, end_idx
  • l1, r1 (first token on left/right for quick filtering)

Quotation (list of structs):

  • speaker, speaker_start_idx, speaker_end_idx
  • quote, quote_start_idx, quote_end_idx
  • verb, verb_start_idx, verb_end_idx
  • quote_type, quote_token_count, is_floating_quote

Topic modeling (Series of lists of structs):

  • topic_id (int), weight (float)

Models and downloads

Some features download Hugging Face models on first use (via hf-hub) and run on CPU:

  • Tokenization: bert-base-uncased (tokenizer.json)
  • Topic modeling embeddings: sentence-transformers/all-MiniLM-L6-v2
  • Quotation POS tagging: vblagoje/bert-english-uncased-finetuned-pos

The initial call may take longer while models download and cache.

Development

Build the extension locally with maturin and then import as polars_text.

For release and publishing procedures, see PUBLISH.md.

make build
make test

Download files

Source Distribution

  • polars_text-0.1.7.tar.gz (5.4 MB): source

Built Distributions

  • polars_text-0.1.7-cp314-cp314-win_amd64.whl (19.5 MB): CPython 3.14, Windows x86-64
  • polars_text-0.1.7-cp314-cp314-manylinux_2_28_x86_64.whl (24.2 MB): CPython 3.14, manylinux (glibc 2.28+), x86-64
  • polars_text-0.1.7-cp314-cp314-macosx_11_0_arm64.whl (19.1 MB): CPython 3.14, macOS 11.0+, ARM64

File details

Details for the file polars_text-0.1.7.tar.gz.

File metadata

  • Download URL: polars_text-0.1.7.tar.gz
  • Upload date:
  • Size: 5.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_text-0.1.7.tar.gz
Algorithm Hash digest
SHA256 90b86af2e6abcaa80e4c71b4217637f6e9f4fd089c5ab9c43e995f40f8c6db32
MD5 efdb0ed250de11381fe0019c9b6803bf
BLAKE2b-256 71ec854ca9bf9ed5f892bab1bc7402b502e023ecd04a72391868e43f56997d5e

Provenance

The following attestation bundles were made for polars_text-0.1.7.tar.gz:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.7-cp314-cp314-win_amd64.whl.

File hashes

Hashes for polars_text-0.1.7-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 c41141f21d1045a439d0bcca88f6ffc20394dac5d18bc09d3b34cdb5ff0c9251
MD5 41230784eae669e99b1fec1743109b3f
BLAKE2b-256 8939f0ed84e9bbeba1a5e7228b55512a04f20909a68e9b74b48e813ee8c635e5

Provenance

The following attestation bundles were made for polars_text-0.1.7-cp314-cp314-win_amd64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

File details

Details for the file polars_text-0.1.7-cp314-cp314-manylinux_2_28_x86_64.whl.

File hashes

Hashes for polars_text-0.1.7-cp314-cp314-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 33cdc3eb27f7f4586dd344136245c636267cdf45c5e2e2e0ce1df76e9422adb5
MD5 611107cd53723170c2716dcfb0a5ac76
BLAKE2b-256 7cefe77aefaebfa84f0bbd820282acb5447070eed9c217f96d251089ed120f01

Provenance

The following attestation bundles were made for polars_text-0.1.7-cp314-cp314-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

File details

Details for the file polars_text-0.1.7-cp314-cp314-macosx_11_0_arm64.whl.

File hashes

Hashes for polars_text-0.1.7-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 dd3b68332d1d44b65c30de7844966d2fb65d94843d06164c8adf7e67e564b200
MD5 26f6688e63a544eaf7d1da8762f18dca
BLAKE2b-256 494bd44f995cdf15947c8a3705cddd8512da5d8d41e150230d6395b85651991d

Provenance

The following attestation bundles were made for polars_text-0.1.7-cp314-cp314-macosx_11_0_arm64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text
