Polars expression plugins for text analysis

polars-text

Polars expression plugins for fast, practical text analysis. Call them as module-level functions or through the pl.col("text").text.* expression namespace. A few Series-based utilities cover token frequency statistics and topic modeling.

Quick start

import polars as pl
import polars_text as pt

df = pl.DataFrame({
    "text": [
        "Alice said \"Hello world\".",
        "Hello again, world!",
    ]
})

out = df.with_columns([
    pt.clean_text(pl.col("text")).alias("clean"),
    pt.word_count(pl.col("text")).alias("word_count"),
    pt.char_count(pl.col("text")).alias("char_count"),
    pt.sentence_count(pl.col("text")).alias("sentence_count"),
    pt.tokenize(pl.col("text"), lowercase=True, remove_punct=True).alias("tokens"),
])

Expressions and namespace

All expression functions are available both as module functions and through the text namespace on expressions.

Expression functions

  • tokenize(expr, lowercase=True, remove_punct=True)
  • clean_text(expr)
  • word_count(expr)
  • char_count(expr)
  • sentence_count(expr)
  • concordance(expr, search_word, num_left_tokens=5, num_right_tokens=5, regex=False, case_sensitive=False)
  • quotation(expr)

Namespace usage

df = pl.DataFrame({"text": ["Hello world, hello again."]})

out = df.select([
    pl.col("text").text.clean_text().alias("clean"),
    pl.col("text").text.word_count().alias("word_count"),
    pl.col("text").text.tokenize().alias("tokens"),
])

Concordance

Get left/right context windows around a search term. Output is a list of structs that you can explode and unnest for tabular use.

df = pl.DataFrame({"text": ["Hello world, hello again."]})

concordance = (
    pl.col("text")
    .text.concordance("hello", num_left_tokens=1, num_right_tokens=1)
    .list.explode()
    .struct.unnest()
)

out = df.select(concordance)
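
To make the windowing idea concrete, here is a minimal pure-Python sketch of a keyword-in-context lookup. It is not the plugin's implementation: the helper name concordance_sketch and the regex tokenizer are assumptions made for illustration, and only three of the output fields are mirrored.

```python
import re


def concordance_sketch(text, search_word, num_left_tokens=5,
                       num_right_tokens=5, case_sensitive=False):
    """Return one dict per match with up to N context tokens on each side."""
    # Assumption: a naive tokenizer that keeps runs of word characters only.
    tokens = re.findall(r"\w+", text)
    needle = search_word if case_sensitive else search_word.lower()
    rows = []
    for i, tok in enumerate(tokens):
        candidate = tok if case_sensitive else tok.lower()
        if candidate != needle:
            continue
        # Slice the windows, clamping the left edge at the start of the text.
        left = tokens[max(0, i - num_left_tokens):i]
        right = tokens[i + 1:i + 1 + num_right_tokens]
        rows.append({
            "left_context": " ".join(left),
            "matched_text": tok,
            "right_context": " ".join(right),
        })
    return rows


rows = concordance_sketch("Hello world, hello again.", "hello",
                          num_left_tokens=1, num_right_tokens=1)
for row in rows:
    print(row)
```

Each dict plays the role of one struct in the list, so one input document can yield several rows after exploding.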

Quotation extraction

Extract quoted speech along with speaker, verb, and offsets. Output is a list of structs you can explode and unnest.

df = pl.DataFrame({"text": ["Alice said \"Hello world\"."]})

quotes = (
    pl.col("text")
    .text.quotation()
    .list.explode()
    .struct.unnest()
)

out = df.select(quotes)
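
For intuition about the output shape only, the sketch below extracts a speaker, verb, and quote with a naive regular expression. The plugin itself relies on a POS-tagging model, so this is a deliberately simplified stand-in. The pattern, the fixed verb list, and the choice of character offsets with an exclusive end are all assumptions of the sketch, not documented behavior of polars-text.

```python
import re

# Assumption: a capitalised speaker, one verb from a small fixed list,
# then a double-quoted span. Real text needs far more than this.
PATTERN = re.compile(
    r'(?P<speaker>[A-Z]\w*)\s+(?P<verb>said|asked|replied)\s+"(?P<quote>[^"]*)"'
)


def quotation_sketch(text):
    """Return one dict per matched quotation, with offsets into text."""
    rows = []
    for m in PATTERN.finditer(text):
        rows.append({
            "speaker": m.group("speaker"),
            "verb": m.group("verb"),
            "quote": m.group("quote"),
            # Character offsets of the quoted span, end exclusive.
            "quote_start_idx": m.start("quote"),
            "quote_end_idx": m.end("quote"),
        })
    return rows


text = 'Alice said "Hello world".'
rows = quotation_sketch(text)
first = rows[0]
# Offsets let you slice the original text to recover the quote.
recovered = text[first["quote_start_idx"]:first["quote_end_idx"]]
print(first["speaker"], first["verb"], recovered)
```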

Token frequencies and stats

Compute corpus token counts and compare corpora with standard statistics.

series_0 = pl.Series("text", ["hello world", "hello again"])
series_1 = pl.Series("text", ["goodbye world"])

freqs_0 = pt.token_frequencies(series_0)
freqs_1 = pt.token_frequencies(series_1)

stats = pt.token_frequency_stats(freqs_0, freqs_1)
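
The README does not say which statistics are computed. As one example of a statistic commonly used to compare corpora, the sketch below counts tokens with the standard library and scores each token with log-likelihood (G2). The helper names and the whitespace tokenizer are assumptions for illustration, and nothing here claims to match the output of pt.token_frequency_stats.

```python
import math
from collections import Counter


def token_counts(docs):
    """Count lowercased, whitespace-separated tokens across documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def log_likelihood(a, b, total_a, total_b):
    """G2 for one token: a and b are its counts, totals are corpus sizes."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    # A zero count contributes nothing (the limit of x * log(x) at 0).
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2.0 * g2


freqs_0 = token_counts(["hello world", "hello again"])
freqs_1 = token_counts(["goodbye world"])
total_0 = sum(freqs_0.values())
total_1 = sum(freqs_1.values())

# Counter returns 0 for tokens missing from one corpus.
stats = {
    tok: log_likelihood(freqs_0[tok], freqs_1[tok], total_0, total_1)
    for tok in sorted(set(freqs_0) | set(freqs_1))
}
for tok, score in stats.items():
    print(f"{tok}: {score:.3f}")
```

A higher score means the token's relative frequency differs more between the two corpora; tokens shared in similar proportions, such as "world" here, score low.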

Topic modeling

Cluster documents and return topic labels plus per-document topic assignments.

series = pl.Series("text", [
    "Policy changes were announced today.",
    "Elections are coming soon.",
    "The football match was thrilling.",
])

topics, doc_topics = pt.topic_modeling(series, min_points=2, max_terms=3)

topics is a dict mapping topic_id to label, and doc_topics is a Series of lists of structs with fields {topic_id, weight}.
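
One way to consume this shape is to pick the highest-weight topic for each document. The sketch below uses plain Python lists and dicts as hypothetical stand-ins for the returned values: the ids, labels, and weights are made up for illustration and are not real model output. With a real Series, converting it with to_list() should give a comparable nested structure.

```python
# Hypothetical stand-ins for the values returned by pt.topic_modeling.
topics = {0: "policy elections", 1: "football match"}
doc_topics = [
    [{"topic_id": 0, "weight": 0.9}, {"topic_id": 1, "weight": 0.1}],
    [{"topic_id": 0, "weight": 0.8}, {"topic_id": 1, "weight": 0.2}],
    [{"topic_id": 1, "weight": 1.0}],
]


def dominant_topic(assignments):
    """Return the topic_id with the largest weight, or None if unassigned."""
    if not assignments:
        return None
    return max(assignments, key=lambda a: a["weight"])["topic_id"]


# Map each document to the label of its dominant topic (None if no topic).
dominant_labels = [topics.get(dominant_topic(doc)) for doc in doc_topics]
print(dominant_labels)
```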

Output schemas

Concordance (list of structs):

  • left_context, matched_text, right_context
  • start_idx, end_idx
  • l1, r1 (first token on left/right for quick filtering)

Quotation (list of structs):

  • speaker, speaker_start_idx, speaker_end_idx
  • quote, quote_start_idx, quote_end_idx
  • verb, verb_start_idx, verb_end_idx
  • quote_type, quote_token_count, is_floating_quote

Topic modeling (Series of lists of structs):

  • topic_id (int), weight (float)

Models and downloads

Some features download Hugging Face models on first use (via hf-hub) and run on CPU:

  • Tokenization: bert-base-uncased (tokenizer.json)
  • Topic modeling embeddings: sentence-transformers/all-MiniLM-L6-v2
  • Quotation POS tagging: vblagoje/bert-english-uncased-finetuned-pos

The initial call may take longer while models download and cache.

Development

Build the extension locally with maturin, then import it as polars_text.

make build
make test

For release and publishing procedures, see PUBLISH.md.

Download files

Source Distribution

polars_text-0.1.5.tar.gz (5.4 MB)

Uploaded Source

Built Distributions

polars_text-0.1.5-cp314-cp314-win_amd64.whl (19.7 MB)

Uploaded CPython 3.14, Windows x86-64

polars_text-0.1.5-cp314-cp314-manylinux_2_28_x86_64.whl (25.4 MB)

Uploaded CPython 3.14, manylinux: glibc 2.28+ x86-64

polars_text-0.1.5-cp314-cp314-macosx_11_0_arm64.whl (19.9 MB)

Uploaded CPython 3.14, macOS 11.0+ ARM64

File details

Details for the file polars_text-0.1.5.tar.gz.

File metadata

  • Download URL: polars_text-0.1.5.tar.gz
  • Size: 5.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_text-0.1.5.tar.gz
Algorithm Hash digest
SHA256 79dc19d647ba73dbfd2ae4378c6c8157deeb83804ca663bb9060cad5300de59e
MD5 8590199cc27688544083fe4161c6865a
BLAKE2b-256 df8fd56f7c2b2939ce6c44a57aeec0be46e00cae3b0060329f39fc4c90301d4d

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.5.tar.gz:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.5-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: polars_text-0.1.5-cp314-cp314-win_amd64.whl
  • Size: 19.7 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_text-0.1.5-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 738a9ddb2fc9420363149a550565b75ea8a20bd53506dac923b322297d231e7f
MD5 bb845ce407a97fd396ee593258521319
BLAKE2b-256 3c37ba93ccce0f01bbe596d7aa7582a72d70581858552c757d50f843fe7dc54b

Provenance

The following attestation bundles were made for polars_text-0.1.5-cp314-cp314-win_amd64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

File details

Details for the file polars_text-0.1.5-cp314-cp314-manylinux_2_28_x86_64.whl.

File hashes

Hashes for polars_text-0.1.5-cp314-cp314-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 9399212fb75e518f82e07bab03702dcc5057c9a384c2f72a4886e41f41cb1e28
MD5 1c13adb7045ac182b5b3bb79feaf93ce
BLAKE2b-256 7ea9ad9e6a9886b89d5ae35d2d6a26a0c28c590d29ec19461e1c89ca0dcebc54

Provenance

The following attestation bundles were made for polars_text-0.1.5-cp314-cp314-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

File details

Details for the file polars_text-0.1.5-cp314-cp314-macosx_11_0_arm64.whl.

File hashes

Hashes for polars_text-0.1.5-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4be1134b14b0868f94598de909f43443b21323b1c6807c2e6c77b750ff6ff780
MD5 fafd11254038de7ee007c2f57b6a0dea
BLAKE2b-256 6bdffe72be8e750e98f69a4091591707c3af5459b4696e03eff17501b823c0f8

Provenance

The following attestation bundles were made for polars_text-0.1.5-cp314-cp314-macosx_11_0_arm64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text
