Skip to main content

Polars expression plugins for text analysis

Project description

polars-text

Polars expression plugins for fast, practical text analysis. Use them as expressions or via the pl.col("text").text.* namespace, plus a few Series-based utilities for token frequency stats and topic modeling.

Quick start

import polars as pl
import polars_text as pt

df = pl.DataFrame({
    "text": [
        "Alice said \"Hello world\".",
        "Hello again, world!",
    ]
})

out = df.with_columns([
    pt.clean_text(pl.col("text")).alias("clean"),
    pt.word_count(pl.col("text")).alias("word_count"),
    pt.char_count(pl.col("text")).alias("char_count"),
    pt.sentence_count(pl.col("text")).alias("sentence_count"),
    pt.tokenize(pl.col("text"), lowercase=True, remove_punct=True).alias("tokens"),
])

Expressions and namespace

All expression functions are available both as module functions and through the text namespace on expressions.

Expression functions

  • tokenize(expr, lowercase=True, remove_punct=True)
  • clean_text(expr)
  • word_count(expr)
  • char_count(expr)
  • sentence_count(expr)
  • concordance(expr, search_word, num_left_tokens=5, num_right_tokens=5, regex=False, case_sensitive=False)
  • quotation(expr)

Namespace usage

df = pl.DataFrame({"text": ["Hello world, hello again."]})

out = df.select([
    pl.col("text").text.clean_text().alias("clean"),
    pl.col("text").text.word_count().alias("word_count"),
    pl.col("text").text.tokenize().alias("tokens"),
])

Concordance

Get left/right context windows around a search term. Output is a list of structs that you can explode and unnest for tabular use.

df = pl.DataFrame({"text": ["Hello world, hello again."]})

concordance = (
    pl.col("text")
    .text.concordance("hello", num_left_tokens=1, num_right_tokens=1)
    .list.explode()
    .struct.unnest()
)

out = df.select(concordance)

Quotation extraction

Extract quoted speech along with speaker, verb, and offsets. Output is a list of structs you can explode and unnest.

df = pl.DataFrame({"text": ["Alice said \"Hello world\"."]})

quotes = (
    pl.col("text")
    .text.quotation()
    .list.explode()
    .struct.unnest()
)

out = df.select(quotes)

Token frequencies and stats

Compute corpus token counts and compare corpora with standard statistics.

series_0 = pl.Series("text", ["hello world", "hello again"])
series_1 = pl.Series("text", ["goodbye world"])

freqs_0 = pt.token_frequencies(series_0)
freqs_1 = pt.token_frequencies(series_1)

stats = pt.token_frequency_stats(freqs_0, freqs_1)

Topic modeling

Cluster documents and return topic labels plus per-document topic assignments.

series = pl.Series("text", [
    "Policy changes were announced today.",
    "Elections are coming soon.",
    "The football match was thrilling.",
])

topics, doc_topics = pt.topic_modeling(series, min_points=2, max_terms=3)

topics is a dict of topic_id -> label and doc_topics is a Series of lists of structs with {topic_id, weight}.

Output schemas

Concordance (list of structs):

  • left_context, matched_text, right_context
  • start_idx, end_idx
  • l1, r1 (first token on left/right for quick filtering)

Quotation (list of structs):

  • speaker, speaker_start_idx, speaker_end_idx
  • quote, quote_start_idx, quote_end_idx
  • verb, verb_start_idx, verb_end_idx
  • quote_type, quote_token_count, is_floating_quote

Topic modeling (Series of list structs):

  • topic_id (int), weight (float)

Models and downloads

Some features download Hugging Face models on first use (via hf-hub) and run on CPU:

  • Tokenization: bert-base-uncased (tokenizer.json)
  • Topic modeling embeddings: sentence-transformers/all-MiniLM-L6-v2
  • Quotation POS tagging: vblagoje/bert-english-uncased-finetuned-pos

The initial call may take longer while models download and cache.

Development

Build the extension locally with maturin and then import as polars_text.

For release and publishing procedures, see PUBLISH.md.

make build
make test

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

polars_text-0.1.2.tar.gz (53.4 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_text-0.1.2-cp314-cp314-win_amd64.whl (7.8 MB view details)

Uploaded CPython 3.14Windows x86-64

polars_text-0.1.2-cp314-cp314-manylinux_2_28_x86_64.whl (11.0 MB view details)

Uploaded CPython 3.14manylinux: glibc 2.28+ x86-64

polars_text-0.1.2-cp314-cp314-macosx_11_0_arm64.whl (7.7 MB view details)

Uploaded CPython 3.14macOS 11.0+ ARM64

File details

Details for the file polars_text-0.1.2.tar.gz.

File metadata

  • Download URL: polars_text-0.1.2.tar.gz
  • Upload date:
  • Size: 53.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for polars_text-0.1.2.tar.gz
Algorithm Hash digest
SHA256 2a080c8e7e0ecef10b3b431fb14dbb90de6faf552e24ffa30f52f0876c5fe5e8
MD5 8ae3981db40658afa85174c06e453b87
BLAKE2b-256 42b91653c8ac742e3b84ebe328d213b5eb09ee96ce6241c27364911a2e29e2aa

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.2.tar.gz:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.2-cp314-cp314-win_amd64.whl.

File metadata

File hashes

Hashes for polars_text-0.1.2-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 5b80da5e9902ecaefc4db227a38d20fc6c328ee4254678319329b38248c2ea4e
MD5 f0aa79918e01dfbca02a5518153770f7
BLAKE2b-256 daa6ff1dd58b29c7ac03e7344311e2b82f211d043321d8f9705a9e8e664db900

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.2-cp314-cp314-win_amd64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.2-cp314-cp314-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for polars_text-0.1.2-cp314-cp314-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 0283306eb5c185cac89db3abf1ecf877b47c4e18a2afc4ad8892a5ab12116b8f
MD5 daf77d645b45cdaad6a8a62e42dfacb0
BLAKE2b-256 a5606417a5e73915e0dbc4c1f0643b41c715428a5350f4b2a97903655b526342

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.2-cp314-cp314-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_text-0.1.2-cp314-cp314-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_text-0.1.2-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f7dbd56bc3c90d87defc9e682e816ff7cdf264f19ee458df190322ba92c10c20
MD5 d04ce43bdfc22c3bc4e7760e08cdf47b
BLAKE2b-256 b460829e2048d45aa63ec9c2634f63bf750b9569aded969920db3240737ea9e9

See more details on using hashes here.

Provenance

The following attestation bundles were made for polars_text-0.1.2-cp314-cp314-macosx_11_0_arm64.whl:

Publisher: release.yml on Australian-Text-Analytics-Platform/polars-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page