Polars expression plugins for text analysis
polars-text
Polars expression plugins for fast, practical text analysis. Use them as
expressions or via the pl.col("text").text.* namespace, plus a few
Series-based utilities for token frequency stats and topic modeling.
Quick start
import polars as pl
import polars_text as pt
df = pl.DataFrame({
    "text": [
        "Alice said \"Hello world\".",
        "Hello again, world!",
    ]
})
out = df.with_columns([
    pt.clean_text(pl.col("text")).alias("clean"),
    pt.word_count(pl.col("text")).alias("word_count"),
    pt.char_count(pl.col("text")).alias("char_count"),
    pt.sentence_count(pl.col("text")).alias("sentence_count"),
    pt.tokenize(pl.col("text"), lowercase=True, remove_punct=True).alias("tokens"),
])
Expressions and namespace
All expression functions are available both as module functions and through
the text namespace on expressions.
Expression functions
- tokenize(expr, lowercase=True, remove_punct=True)
- clean_text(expr)
- word_count(expr)
- char_count(expr)
- sentence_count(expr)
- concordance(expr, search_word, num_left_tokens=5, num_right_tokens=5, regex=False, case_sensitive=False)
- quotation(expr)
Namespace usage
df = pl.DataFrame({"text": ["Hello world, hello again."]})
out = df.select([
    pl.col("text").text.clean_text().alias("clean"),
    pl.col("text").text.word_count().alias("word_count"),
    pl.col("text").text.tokenize().alias("tokens"),
])
Concordance
Get left/right context windows around a search term. Output is a list of
structs that you can explode and unnest for tabular use.
df = pl.DataFrame({"text": ["Hello world, hello again."]})
concordance = (
    pl.col("text")
    .text.concordance("hello", num_left_tokens=1, num_right_tokens=1)
    .list.explode()
    .struct.unnest()
)
out = df.select(concordance)
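To make the window parameters concrete, here is a minimal plain-Python sketch of a keyword-in-context lookup. It is not the plugin's implementation: the regex tokenizer and the helper name concordance_sketch are assumptions made for illustration, and only three of the documented output fields are reproduced.

```python
import re


def concordance_sketch(text, search_word, num_left_tokens=5, num_right_tokens=5):
    """Return one row per case-insensitive match of search_word in text."""
    # Assumption: tokens are runs of word characters. The plugin uses its
    # own tokenizer, so real token boundaries may differ.
    tokens = re.findall(r"\w+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() != search_word.lower():
            continue
        # Slice up to N tokens on each side of the match.
        left = tokens[max(0, i - num_left_tokens):i]
        right = tokens[i + 1:i + 1 + num_right_tokens]
        rows.append({
            "left_context": " ".join(left),
            "matched_text": token,
            "right_context": " ".join(right),
        })
    return rows


rows = concordance_sketch("Hello world, hello again.", "hello", 1, 1)
for row in rows:
    print(row)
```

With a window of one token on each side, the first match has an empty left context because it starts the sentence, and the second match sits between "world" and "again".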
Quotation extraction
Extract quoted speech along with speaker, verb, and offsets. Output is a list
of structs you can explode and unnest.
df = pl.DataFrame({"text": ["Alice said \"Hello world\"."]})
quotes = (
    pl.col("text")
    .text.quotation()
    .list.explode()
    .struct.unnest()
)
out = df.select(quotes)
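For intuition about the quote and offset fields, the sketch below finds double-quoted spans with a regular expression. This is a deliberately simpler stand-in, not how the plugin works: it does no speaker or verb detection, and both the helper name find_quotes and the offset convention (character offsets that exclude the quote marks) are assumptions.

```python
import re

# Assumption: only straight double quotes are handled, and offsets are
# character positions of the quoted text without the surrounding marks.
QUOTE_RE = re.compile(r'"([^"]+)"')


def find_quotes(text):
    """Return one dict per double-quoted span with its character offsets."""
    return [
        {
            "quote": match.group(1),
            "quote_start_idx": match.start(1),
            "quote_end_idx": match.end(1),
        }
        for match in QUOTE_RE.finditer(text)
    ]


found = find_quotes('Alice said "Hello world".')
print(found)
```

The end offset is exclusive, so slicing the input with the two offsets gives back the quoted text.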
Token frequencies and stats
Compute corpus token counts and compare corpora with standard statistics.
series_0 = pl.Series("text", ["hello world", "hello again"])
series_1 = pl.Series("text", ["goodbye world"])
freqs_0 = pt.token_frequencies(series_0)
freqs_1 = pt.token_frequencies(series_1)
stats = pt.token_frequency_stats(freqs_0, freqs_1)
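This README does not say which statistics token_frequency_stats reports. As an illustration of the general idea only, the sketch below counts tokens with the standard library and scores one token with the two-term log-likelihood (G2) measure that is common in corpus comparison. The helper names count_tokens and log_likelihood and the whitespace tokenization are assumptions, not part of the package.

```python
import math
from collections import Counter


def count_tokens(docs):
    """Count lowercased, whitespace-separated tokens across documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def log_likelihood(a, b, total_a, total_b):
    """Two-term G2 for a token seen a times in corpus A and b times in B."""
    # Expected counts if the token were equally frequent in both corpora.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


counts_a = count_tokens(["hello world", "hello again"])
counts_b = count_tokens(["goodbye world"])
total_a = sum(counts_a.values())
total_b = sum(counts_b.values())

for token in sorted(set(counts_a) | set(counts_b)):
    score = log_likelihood(counts_a[token], counts_b[token], total_a, total_b)
    print(token, counts_a[token], counts_b[token], round(score, 3))
```

A score of zero means the token has the same relative frequency in both corpora. Larger scores indicate a bigger difference, without saying which corpus the token favours.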
Topic modeling
Cluster documents and return topic labels plus per-document topic assignments.
series = pl.Series("text", [
    "Policy changes were announced today.",
    "Elections are coming soon.",
    "The football match was thrilling.",
])
topics, doc_topics = pt.topic_modeling(series, min_points=2, max_terms=3)
topics is a dict of topic_id -> label and doc_topics is a Series of lists
of structs with {topic_id, weight}.
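One common follow-up is to reduce each document to its highest-weight topic. The sketch below does that in plain Python on hand-written values that only mimic the documented shapes. The labels and weights are invented for illustration and are not real output, and dominant_topic is a hypothetical helper. With the real result, doc_topics.to_list() should give a comparable list of lists of dicts.

```python
# Hypothetical values shaped like the documented return types.
topics = {0: "politics", 1: "sport"}
doc_topics = [
    [{"topic_id": 0, "weight": 0.9}],
    [{"topic_id": 0, "weight": 0.6}, {"topic_id": 1, "weight": 0.4}],
    [{"topic_id": 1, "weight": 0.8}],
]


def dominant_topic(assignments, labels):
    """Return the label of the highest-weight topic, or None if unassigned."""
    if not assignments:
        return None
    best = max(assignments, key=lambda item: item["weight"])
    return labels.get(best["topic_id"])


dominant = [dominant_topic(doc, topics) for doc in doc_topics]
print(dominant)
```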
Output schemas
Concordance (list of structs):
- left_context, matched_text, right_context
- start_idx, end_idx
- l1, r1 (first token on left/right for quick filtering)
Quotation (list of structs):
- speaker, speaker_start_idx, speaker_end_idx
- quote, quote_start_idx, quote_end_idx
- verb, verb_start_idx, verb_end_idx
- quote_type, quote_token_count, is_floating_quote
Topic modeling (Series of lists of structs):
- topic_id (int), weight (float)
Models and downloads
Some features download Hugging Face models on first use (via hf-hub) and run
on CPU:
- Tokenization: bert-base-uncased (tokenizer.json)
- Topic modeling embeddings: sentence-transformers/all-MiniLM-L6-v2
- Quotation POS tagging: vblagoje/bert-english-uncased-finetuned-pos
The initial call may take longer while models download and cache.
Development
Build the extension locally with maturin and then import as polars_text.
For release and publishing procedures, see PUBLISH.md.
make build
make test