Sequence Graph Transform (SGT) for Polars - Transform sequential data into weighted n-gram representations

polars-sgt

Sequence Graph Transform for Polars

Transform sequential data into weighted n-gram representations with Polars.

polars-sgt brings Sequence Graph Transform (SGT) to Polars, enabling you to:

  • Weighted N-grams: Transform sequences into weighted n-gram features
  • Grouped Analysis: Apply SGT across subsets (e.g., by direction or metric) and merge the results into a single wide DataFrame
  • Billion-Row Scale: Optimized Rust implementation with O(1) time-weight lookups
  • Temporal Dynamics: Capture patterns with multiple decay functions across all n-gram transitions
  • Flexible Time Columns: Support for datetime, date, duration, and numeric time columns
  • Lazy & Parallel: Fully compatible with Polars lazy evaluation and Rayon-backed parallel processing

What is SGT?

Sequence Graph Transform converts sequential data (like user clickstreams, sensor readings, or transaction histories) into weighted n-gram representations. Unlike traditional n-grams, SGT captures:

  • Sequential patterns: Multi-transition dependencies (unigrams, bigrams, trigrams, ...)
  • Temporal dynamics: Weights decay based on the time gaps between events
  • Normalized features: L1/L2 normalization for machine-learning-ready feature spaces
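
To make these three ideas concrete, here is a minimal pure-Python sketch of time-decayed bigram weights followed by L1 normalization. It is not the polars-sgt implementation: the exponential decay form exp(-alpha * gap), the helper names decayed_bigrams and l1_normalize, and the restriction to adjacent pairs are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def decayed_bigrams(states, times, alpha=0.1):
    """Sum an exponentially decayed weight for each adjacent pair of states.

    Hypothetical helper for illustration; the decay form is an assumption.
    """
    weights = defaultdict(float)
    for i in range(len(states) - 1):
        gap = times[i + 1] - times[i]
        weights[(states[i], states[i + 1])] += math.exp(-alpha * gap)
    return dict(weights)

def l1_normalize(weights):
    """Scale the weights so that their absolute values sum to 1."""
    total = sum(abs(w) for w in weights.values())
    return {k: w / total for k, w in weights.items()} if total else dict(weights)

# Toy sequence: login -> view -> purchase at t = 1, 2, 10
raw = decayed_bigrams(["login", "view", "purchase"], [1, 2, 10], alpha=0.1)
features = l1_normalize(raw)
```

In this sketch the ("login", "view") pair, separated by a gap of 1, ends up with a larger weight than ("view", "purchase"), separated by a gap of 8, which is the sense in which the weights carry temporal information.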

Performance at Scale

Optimized for processing billions of rows:

  • O(1) Weight Calculation: Uses cumulative product prefix arrays to calculate multi-transition time weights in constant time.
  • Zero-Cost Abstraction: Written in Rust with Rayon for automatic multi-core utilization.
  • Memory Efficient: Leverages Polars' arrow-backed memory management.
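
The constant-time lookup can be sketched with a cumulative (prefix) product array. The snippet below is an illustrative pure-Python version, not the library's Rust code. It assumes that the weight of a multi-transition n-gram is the product of its per-transition weights, and the function names prefix_products and span_weight are hypothetical.

```python
import math
from itertools import accumulate
from operator import mul

def prefix_products(step_weights):
    """Return prefix, where prefix[i] is the product of step_weights[:i]."""
    return [1.0, *accumulate(step_weights, mul)]

def span_weight(prefix, start, end):
    """Product of step_weights[start:end] from two lookups, whatever the span length.

    Assumes every step weight is non-zero so that the division is valid.
    """
    return prefix[end] / prefix[start]

# Per-transition decay weights for gaps of 1, 8, and 3 time units (alpha = 0.1)
steps = [math.exp(-0.1 * gap) for gap in (1, 8, 3)]
prefix = prefix_products(steps)

bigram_weight = span_weight(prefix, 1, 2)   # second transition only
trigram_weight = span_weight(prefix, 0, 2)  # first two transitions combined
```

Building the prefix array costs one pass over the sequence; after that, each n-gram weight needs only one division. A production implementation might work in log space to avoid underflow on very long sequences, but that is a design guess rather than something this README states.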

Installation

pip install polars-sgt

Quick Start

1. High-Level API: sgt_transform_df

The sgt_transform_df function is the easiest way to generate SGT features. It handles unnesting, exploding, and pivoting into a wide format automatically.

Single Group (Default)

import polars as pl
import polars_sgt as sgt

df = pl.DataFrame({
    "user_id": ["A", "A", "A", "B", "B"],
    "action": ["login", "view", "purchase", "login", "view"],
    "time": [1, 2, 10, 1, 5],
})

# Generate wide-format features merged into one DataFrame
features = sgt.sgt_transform_df(
    df,
    sequence_id_col="user_id",
    state_col="action",
    time_col="time",
    kappa=2,
)

Grouped Sequence Analysis

Calculate separate SGT features for different groups (e.g., event types or directions) and merge them into one wide DataFrame.

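# Calculate SGT features for each 'direction' and 'metric'
# (assumes df also contains 'direction' and 'metric' columns,
# which the Single Group example df does not define)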
result = sgt.sgt_transform_df(
    df,
    sequence_id_col="user_id",
    state_col="action",
    time_col="time",
    group_cols=["direction", "metric"],
    kappa=3,
    time_penalty="exponential",
    alpha=0.7,
    group_name="analysis"
)
# Columns: ['user_id', 'analysis-buy-p_login', 'analysis-sell-p_login', ...]

2. Expression API: sgt_transform

For more control or integration into complex pipelines, use the expression-based API.

# Basic expression usage (returns a struct)
result = df.select(
    sgt.sgt_transform(
        "user_id",
        "action",
        time_col="time",
        kappa=2,
        time_penalty="exponential",
        alpha=0.1,
        mode="l1"
    ).alias("sgt_features")
)

# Extract and explode
features = result.select([
    pl.col("sgt_features").struct.field("sequence_id"),
    pl.col("sgt_features").struct.field("ngram_keys").alias("ngrams"),
    pl.col("sgt_features").struct.field("value").alias("weights"),
]).explode(["ngrams", "weights"])

With DateTime Columns

from datetime import datetime

df = pl.DataFrame({
    "session_id": ["A", "A", "A", "A"],
    "event": ["start", "click", "scroll", "exit"],
    "time": [
        datetime(2024, 1, 1, 10, 0),
        datetime(2024, 1, 1, 10, 5),
        datetime(2024, 1, 1, 10, 7),
        datetime(2024, 1, 1, 10, 15),
    ],
})

result = df.select(
    sgt.sgt_transform(
        "session_id",
        "event",
        time_col="time",
        deltatime="m",  # unit: minutes
        kappa=3,
    )
)
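
For intuition about what a minute-level unit means here, the following standard-library sketch computes the gaps between consecutive events in the session above, expressed in minutes. How polars-sgt converts timestamps internally is not described in this README, so treat this only as an illustration of the unit; the variable names are made up for the example.

```python
from datetime import datetime, timedelta

times = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 5),
    datetime(2024, 1, 1, 10, 7),
    datetime(2024, 1, 1, 10, 15),
]

# Gap between each pair of consecutive events, converted to minutes
gaps_in_minutes = [
    (later - earlier) / timedelta(minutes=1)
    for earlier, later in zip(times, times[1:])
]
print(gaps_in_minutes)  # [5.0, 2.0, 8.0]
```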

Lazy Evaluation & Streaming

result = (
    pl.scan_csv("large_sequences.csv")
    .with_columns(pl.col("timestamp").str.to_datetime())
    .select(
        sgt.sgt_transform(
            "user_id",
            "action",
            time_col="timestamp",
            kappa=2,
            deltatime="h",
        )
    )
    .collect(engine="streaming")
)

API Reference

sgt.sgt_transform_df

The recommended high-level entry point. Returns a wide-format DataFrame.

  • df: Input DataFrame or LazyFrame.
  • sequence_id_col: Column(s) identifying sequences.
  • state_col: Column containing states/events.
  • time_col: Optional timestamp column.
  • group_cols: Optional column(s) to group by before SGT.
  • kappa: Maximum n-gram size.
  • mode: Normalization ("l1", "l2", "none").
  • time_penalty: Decay function ("inverse", "exponential", "linear", "power", "none").
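
The README names the decay functions but does not give their formulas. The sketch below uses common textbook shapes for each name purely to show how such functions differ. The formulas, the DECAYS table, the decay_weight helper, and the way alpha enters each formula are assumptions, and they may not match what polars-sgt computes.

```python
import math

# Illustrative decay shapes only; these formulas are assumptions and are
# not taken from the polars-sgt source.
DECAYS = {
    "inverse": lambda gap, alpha: 1.0 / (1.0 + alpha * gap),
    "exponential": lambda gap, alpha: math.exp(-alpha * gap),
    "linear": lambda gap, alpha: max(0.0, 1.0 - alpha * gap),
    "power": lambda gap, alpha: (1.0 + gap) ** (-alpha),
    "none": lambda gap, alpha: 1.0,
}

def decay_weight(name, gap, alpha=0.5):
    """Look up a decay shape by name and evaluate it for one time gap."""
    return DECAYS[name](gap, alpha)

# Compare how quickly each shape falls off as the gap grows
for name in DECAYS:
    print(name, [round(decay_weight(name, gap), 3) for gap in (0, 1, 4)])
```

In these assumed forms every shape returns 1.0 for a zero gap and never increases as the gap grows; "none" ignores time entirely, and "linear" reaches zero once alpha * gap exceeds 1.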

sgt.sgt_transform (Expression)

Returns a struct with sequence_id, ngram_keys, and value.

# Assumes df has "user_id" and "action" columns, as in the Quick Start
df.select(
    sgt.sgt_transform("user_id", "action", kappa=2).alias("sgt")
).unnest("sgt")

Author & Acknowledgments

Author: Zedd (lytran14789@gmail.com)

Special Thanks: Built upon polars-xdt by Marco Gorelli.

License

MIT

Download files

Source Distribution

  • polars_sgt-0.3.0.tar.gz (1.0 MB): Source

Built Distributions

  • polars_sgt-0.3.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.5 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • polars_sgt-0.3.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.5 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • polars_sgt-0.3.0-cp39-abi3-win_amd64.whl (6.2 MB): CPython 3.9+, Windows x86-64
  • polars_sgt-0.3.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.1 MB): CPython 3.9+, manylinux (glibc 2.17+), x86-64
  • polars_sgt-0.3.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (6.5 MB): CPython 3.9+, manylinux (glibc 2.17+), ARM64
  • polars_sgt-0.3.0-cp39-abi3-macosx_11_0_arm64.whl (5.9 MB): CPython 3.9+, macOS 11.0+ ARM64
  • polars_sgt-0.3.0-cp39-abi3-macosx_10_12_x86_64.whl (6.2 MB): CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file polars_sgt-0.3.0.tar.gz.

  • Size: 1.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7
  • Publisher: CI.yml on 4ursmile/polars-sgt

Hashes for polars_sgt-0.3.0.tar.gz

  • SHA256: ebbbe2e07b25c7a2bfde8671c9d12d6db8ee4f367077ea134bc4e2b5f499ee1d
  • MD5: cbe8c879ef63f5123ad5d4b271b6d920
  • BLAKE2b-256: 8f0cb104a3aa2ec355bfc6ffdfbe9232fa3fdccac4cfefa96c69dad648871568
