GSP (Generalized Sequence Pattern) algorithm in Python
Project description
GSP-Py
GSP-Py: A Python-powered library to mine sequential patterns in large datasets, based on the robust Generalized Sequence Pattern (GSP) algorithm. Ideal for market basket analysis, temporal mining, and user journey discovery.
[!IMPORTANT] GSP-Py is compatible with Python 3.11 and later versions!
📚 Table of Contents
- 🔍 What is GSP?
- 🔧 Requirements
- 🚀 Installation
- 🛠️ Developer Installation
- 📖 Documentation
- 🎯 Interactive Examples
- 💡 Usage
- ⌨️ Typing
- 🌟 Planned Features
- 🤝 Contributing
- 📝 License
- 📖 Citation
🔍 What is GSP?
The Generalized Sequential Pattern (GSP) algorithm is a sequential pattern mining technique based on Apriori principles. Using support thresholds, GSP identifies frequent sequences of items in transaction datasets.
Key Features:
- Ordered (non-contiguous) matching: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. For example, the pattern
('A', 'C')is found in the sequence['A', 'B', 'C']. - Support-based pruning: Only retains sequences that meet the minimum support threshold.
- Candidate generation: Iteratively generates candidate sequences of increasing length.
- Temporal constraints: Support for time-constrained pattern mining with
mingap,maxgap, andmaxspanparameters to find patterns within specific time windows. - General-purpose: Useful in retail, web analytics, social networks, temporal sequence mining, and more.
For example:
- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next" - even if other items appear between bread and milk.
- In a website clickstream, GSP might find patterns like "Users visit A, then eventually go to C" - capturing user journeys with intermediate steps.
🔧 Requirements
You will need Python installed on your system. On most Linux systems, you can install Python with:
sudo apt install python3
For package dependencies of GSP-Py, they will automatically be installed when using pip.
🚀 Installation
GSP-Py can be easily installed from either the repository or PyPI.
Option 1: Clone the Repository
To manually clone the repository and set up the environment:
git clone https://github.com/jacksonpradolima/gsp-py.git
cd gsp-py
Refer to the Developer Installation section and run the setup with uv.
Option 2: Install via pip
Alternatively, install GSP-Py from PyPI with:
pip install gsppy
🛠️ Developer Installation
This project now uses uv for dependency management and virtual environments.
1. Install uv
curl -Ls https://astral.sh/uv/install.sh | bash
Make sure uv is on your PATH (for most Linux setups):
export PATH="$HOME/.local/bin:$PATH"
2. Set up the project environment
Create a local virtual environment and install dependencies from uv.lock (single source of truth):
uv venv .venv
uv sync --frozen --extra dev # uses uv.lock
uv pip install -e .
3. Optional: Enable Rust acceleration
Rust acceleration is optional and provides faster support counting using a PyO3 extension. Python fallback remains available.
Build the extension locally:
make rust-build
Select backend at runtime (auto tries Rust, then falls back to Python):
export GSPPY_BACKEND=rust # or python, or unset for auto
Run benchmarks (adjust to your machine):
make bench-small
make bench-big # may use significant memory/CPU
# or customize:
GSPPY_BACKEND=auto uv run --python .venv/bin/python --no-project \
python benchmarks/bench_support.py --n_tx 1000000 --tx_len 8 --vocab 50000 --min_support 0.2 --warmup
4. Optional: Enable GPU (CuPy) acceleration
GPU acceleration is experimental and currently optimizes singleton (k=1) support counting using CuPy. Non-singleton candidates fall back to the Rust/Python backend.
Install the optional extra (choose a CuPy build that matches your CUDA/ROCm setup if needed):
uv run pip install -e .[gpu]
Select the GPU backend at runtime:
export GSPPY_BACKEND=gpu
If a GPU isn't available, an error will be raised when GSPPY_BACKEND=gpu is set. Otherwise, the default "auto" uses CPU.
5. Common development tasks
After the environment is ready, activate it and run tasks with standard tools:
source .venv/bin/activate
pytest -n auto
ruff check .
pyright
If you prefer, you can also prefix commands with uv without activating:
uv run pytest -n auto
uv run ruff check .
uv run pyright
5. Makefile (shortcuts)
You can use the Makefile to automate common tasks:
make setup # create .venv with uv and pin Python
make install # sync deps (from uv.lock) + install project (-e .)
make test # pytest -n auto
make lint # ruff check .
make format # ruff --fix
make typecheck # pyright + ty
make pre-commit-install # install the pre-commit hook
make pre-commit-run # run pre-commit on all files
# Rust-specific shortcuts
make rust-setup # install rustup toolchain
make rust-build # build PyO3 extension with maturin
make bench-small # run small benchmark
make bench-big # run large benchmark
6. Testing & Quality Assurance
GSP-Py includes comprehensive testing to ensure reliability and correctness:
Running Tests:
# Run all tests in parallel (excluding slow integration tests)
make test
# or
pytest -n auto
# Run specific test suites
pytest tests/test_gsp.py # Core GSP algorithm tests
pytest tests/test_gsp_fuzzing.py # Property-based fuzzing tests
pytest tests/test_gsp_edge_cases.py # Extended edge-case tests
pytest tests/test_cli_fuzzing.py # CLI fuzzing tests
# Run integration tests (slow, marked for CI/merge to master only)
pytest -m integration # Run all integration tests
pytest -m "not integration" # Skip integration tests (default)
# Run with coverage
make coverage
# or
pytest --cov=gsppy --cov-report=html
Property-Based Testing (Fuzzing):
GSP-Py uses Hypothesis for property-based testing, which automatically generates test cases to discover edge cases:
# Run fuzzing test suites (fast tests only, ~2 minutes)
pytest tests/test_gsp_fuzzing.py tests/test_gsp_edge_cases.py -v
# Run specific fuzzing test
pytest tests/test_gsp_edge_cases.py::test_gsp_handles_large_transactions -v
# Reproduce a specific failure using seed
pytest tests/test_gsp_fuzzing.py --hypothesis-seed=12345
The fuzzing tests validate critical properties and invariants:
- ✅ Support monotonicity (lower thresholds → more patterns)
- ✅ Pattern length progression (level k contains k-sequences)
- ✅ Support threshold compliance (all patterns meet minimum support)
- ✅ Determinism (same input → same output)
- ✅ No duplicate patterns within levels
- ✅ Robustness to edge cases (extreme sizes, sparse data, noise, special characters)
Note: Fuzzing tests use small example counts (3-10 examples per test) for fast execution while maintaining good coverage. Individual tests complete in seconds, with the full fuzzing suite running in under a minute.
For more details on writing and extending property-based tests, see the Contributing Guide.
[!TIP] GSP-Py provides reusable Hypothesis strategies in
tests/hypothesis_strategies.pythat you can use to write new property-based tests. These strategies generate various types of test data including extreme transaction sizes, noisy patterns, temporal data, and malformed inputs.
[!NOTE] Tox in this project uses the "tox-uv" plugin. When running
make toxortox, missing Python interpreters can be provisioned automatically via uv (no need to pre-install all versions). This makes local setup faster.
🔏 Release assets and verification
Every GitHub release bundles artifacts to help you validate what you download:
- Built wheels and source distributions produced by the automated publish workflow.
sbom.json(CycloneDX) generated with Syft.- Sigstore-generated
.sigand.pemfiles for each artifact, created using GitHub OIDC identity.
To verify a downloaded artifact from a release:
python -m pip install sigstore # installs the CLI
sigstore verify identity \
--certificate gsppy-<version>-py3-none-any.whl.pem \
--signature gsppy-<version>-py3-none-any.whl.sig \
--cert-identity "https://github.com/jacksonpradolima/gsp-py/.github/workflows/publish.yml@refs/tags/v<version>" \
--cert-oidc-issuer https://token.actions.githubusercontent.com \
gsppy-<version>-py3-none-any.whl
Replace <version> with the numeric package version (for example, 3.1.1) in the filenames; in --cert-identity, this becomes v<version> (for example, v3.1.1). Adjust the filenames for the sdist (.tar.gz) if preferred. The same release page also hosts sbom.json for supply-chain inspection.
📖 Documentation
-
Live site: https://jacksonpradolima.github.io/gsp-py/
-
Build locally:
uv venv .venv uv sync --extra docs uv run mkdocs serve
The docs use MkDocs with the Material theme and mkdocstrings to render the Python API directly from docstrings.
🎯 Interactive Examples
GSP-Py provides interactive marimo notebooks that demonstrate various features with executable code:
- Sequence Abstraction: Learn how to use the Sequence class for structured pattern manipulation
- Itemset Support: Work with itemsets where multiple items occur together
- DataFrame Integration: Use GSP-Py with Polars and Pandas DataFrames
These notebooks can be viewed in the documentation or run locally:
pip install marimo 'gsppy[dataframe]'
git clone https://github.com/jacksonpradolima/gsp-py.git
cd gsp-py
marimo edit notebooks/sequence_example.py
💡 Usage
The library is designed to be easy to use and integrate with your own projects. You can use GSP-Py either programmatically (Python API) or directly from the command line (CLI).
🚦 Using GSP-Py via CLI
GSP-Py provides a command-line interface (CLI) for running the Generalized Sequential Pattern algorithm on transactional data. This allows you to mine frequent sequential patterns from JSON or CSV files without writing any code.
Installation
First, install GSP-Py (if not already installed):
pip install gsppy
This will make the gsppy CLI command available in your environment.
Preparing Your Data
Your input file should be either:
-
JSON: A list of transactions, each transaction is a list of items. Example:
[ ["Bread", "Milk"], ["Bread", "Diaper", "Beer", "Eggs"], ["Milk", "Diaper", "Beer", "Coke"], ["Bread", "Milk", "Diaper", "Beer"], ["Bread", "Milk", "Diaper", "Coke"] ]
-
CSV: Each row is a transaction, items separated by commas. Example:
Bread,Milk Bread,Diaper,Beer,Eggs Milk,Diaper,Beer,Coke Bread,Milk,Diaper,Beer Bread,Milk,Diaper,Coke
-
SPM/GSP Format: Uses delimiters to separate elements and sequences. This format is commonly used in sequential pattern mining datasets.
-1: Marks the end of an element (itemset)-2: Marks the end of a sequence (transaction)
Example:
1 2 -1 3 -1 -2 4 -1 5 6 -1 -2 1 -1 2 3 -1 -2
The above represents:
- Transaction 1:
[[1, 2], [3]]→ flattened to[1, 2, 3] - Transaction 2:
[[4], [5, 6]]→ flattened to[4, 5, 6] - Transaction 3:
[[1], [2, 3]]→ flattened to[1, 2, 3]
String tokens are also supported:
A B -1 C -1 -2 D -1 E F -1 -2
-
Parquet/Arrow Files: Modern columnar data formats (requires 'gsppy[dataframe]')
pip install 'gsppy[dataframe]'
This installs optional dependencies:
polars,pandas, andpyarrowfor DataFrame support.
Running the CLI
Use the following command to run GSPPy on your data:
gsppy --file path/to/transactions.json --min_support 0.3 --backend auto
Or for CSV files:
gsppy --file path/to/transactions.csv --min_support 0.3 --backend rust
For SPM/GSP format files, use the --format spm option:
gsppy --file path/to/data.txt --format spm --min_support 0.3
CLI Options
--file: Path to your input file (JSON, CSV, or SPM format). Required.--format: File format to use for input. Options:auto(default, auto-detect from extension),json,csv,spm,parquet,arrow.--min_support: Minimum support threshold as a fraction (e.g.,0.3for 30%). Default is0.2.--backend: Backend to use for support counting. One ofauto(default),python,rust, orgpu.--output: Path to save mining results to a file. If not specified, results are printed to console.--output-format: Output format for mining results. Options:auto(default, detect from extension),parquet,arrow,csv,json. Requires--outputto be specified.--verbose: Enable detailed logging with timestamps, log levels, and process IDs for debugging and traceability.--mingap,--maxgap,--maxspan: Temporal constraints for time-aware pattern mining (requires timestamped transactions).
Verbose Mode
For debugging or to track execution in CI/CD pipelines, use the --verbose flag:
gsppy --file transactions.json --min_support 0.3 --verbose
This produces structured logging output with timestamps, log levels, and process information:
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Pre-processing transactions...
YYYY-MM-DDTHH:MM:SS | DEBUG | PID:4179 | gsppy.gsp | Unique candidates: [('Bread',), ('Milk',), ...]
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Starting GSP algorithm with min_support=0.3...
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Run 1: 6 candidates filtered to 5.
...
For complete logging documentation, see docs/logging.md.
Example
Suppose you have a file transactions.json as shown above. To find patterns with at least 30% support:
gsppy --file transactions.json --min_support 0.3
Sample output:
Pre-processing transactions...
Starting GSP algorithm with min_support=0.3...
Run 1: 6 candidates filtered to 5.
Run 2: 20 candidates filtered to 3.
Run 3: 2 candidates filtered to 2.
Run 4: 1 candidates filtered to 0.
GSP algorithm completed.
Frequent Patterns Found:
1-Sequence Patterns:
Pattern: ('Bread',), Support: 4
Pattern: ('Milk',), Support: 4
Pattern: ('Diaper',), Support: 4
Pattern: ('Beer',), Support: 3
Pattern: ('Coke',), Support: 2
2-Sequence Patterns:
Pattern: ('Bread', 'Milk'), Support: 3
Pattern: ('Milk', 'Diaper'), Support: 3
Pattern: ('Diaper', 'Beer'), Support: 3
3-Sequence Patterns:
Pattern: ('Bread', 'Milk', 'Diaper'), Support: 2
Pattern: ('Milk', 'Diaper', 'Beer'), Support: 2
Exporting Results
GSP-Py supports exporting mining results to various formats for further analysis or integration with data pipelines:
Export to Parquet (efficient columnar format for large datasets):
gsppy --file transactions.json --min_support 0.3 --output results.parquet
Export to CSV:
gsppy --file transactions.json --min_support 0.3 --output results.csv
Export to JSON:
gsppy --file transactions.json --min_support 0.3 --output results.json
Specify format explicitly:
gsppy --file transactions.json --min_support 0.3 --output results.data --output-format parquet
The exported files contain three columns:
pattern: The sequential pattern (e.g.,('Bread', 'Milk'))support: Number of transactions containing the patternlevel: Length of the pattern sequence
Export formats are particularly useful for:
- Parquet/Arrow: Integration with big data tools (Spark, Polars, Pandas), data lakes, and cloud analytics
- CSV: Easy viewing in spreadsheets and compatibility with traditional tools
- JSON: Structured data for web applications and APIs
Error Handling
- If the file does not exist or is in an unsupported format, a clear error message will be shown.
- The
min_supportvalue must be between 0.0 and 1.0 (exclusive of 0.0, inclusive of 1.0).
Advanced: Verbose Output
To see detailed logs for debugging, add the --verbose flag:
gsppy --file transactions.json --min_support 0.3 --verbose
The following example shows how to use GSP-Py programmatically in Python:
Example Input Data
The input to the algorithm is a sequence of transactions, where each transaction contains a sequence of items:
transactions = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke'],
['Bread', 'Milk', 'Diaper', 'Beer'],
['Bread', 'Milk', 'Diaper', 'Coke']
]
Importing and Initializing the GSP Algorithm
Import the GSP class from the gsppy package and call the search method to find frequent patterns with a support
threshold (e.g., 0.3):
from gsppy.gsp import GSP
# Example transactions: customer purchases
transactions = [
['Bread', 'Milk'], # Transaction 1
['Bread', 'Diaper', 'Beer', 'Eggs'], # Transaction 2
['Milk', 'Diaper', 'Beer', 'Coke'], # Transaction 3
['Bread', 'Milk', 'Diaper', 'Beer'], # Transaction 4
['Bread', 'Milk', 'Diaper', 'Coke'] # Transaction 5
]
# Set minimum support threshold (30%)
min_support = 0.3
# Find frequent patterns
result = GSP(transactions).search(min_support)
# Output the results
print(result)
Verbose Mode for Debugging
Enable detailed logging to track algorithm progress and debug issues:
from gsppy.gsp import GSP
# Enable verbose logging for the entire instance
gsp = GSP(transactions, verbose=True)
result = gsp.search(min_support=0.3)
# Or enable verbose for a specific search only
gsp = GSP(transactions)
result = gsp.search(min_support=0.3, verbose=True)
Verbose mode provides:
- Detailed progress information during execution
- Candidate generation and filtering statistics
- Preprocessing and validation details
- Useful for debugging, research, and CI/CD integration
For complete documentation on logging, see docs/logging.md.
Using Sequence Objects for Rich Pattern Representation
GSP-Py 4.0+ introduces a Sequence abstraction class that provides a richer, more maintainable way to work with sequential patterns. The Sequence class encapsulates pattern items, support counts, and optional metadata in an immutable, hashable object.
Traditional Dict-based Output (Default)
from gsppy import GSP
transactions = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke']
]
gsp = GSP(transactions)
result = gsp.search(min_support=0.3)
# Returns: [{('Bread',): 4, ('Milk',): 4, ...}, {('Bread', 'Milk'): 3, ...}, ...]
for level_patterns in result:
for pattern, support in level_patterns.items():
print(f"Pattern: {pattern}, Support: {support}")
Sequence Objects (New Feature)
from gsppy import GSP
transactions = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke']
]
gsp = GSP(transactions)
result = gsp.search(min_support=0.3, return_sequences=True)
# Returns: [[Sequence(('Bread',), support=4), ...], [Sequence(('Bread', 'Milk'), support=3), ...], ...]
for level_patterns in result:
for seq in level_patterns:
print(f"Pattern: {seq.items}, Support: {seq.support}, Length: {seq.length}")
# Access sequence properties
print(f" First item: {seq.first_item}, Last item: {seq.last_item}")
# Check if item is in sequence
if "Milk" in seq:
print(f" Contains Milk!")
Key Benefits of Sequence Objects
- Rich API: Access pattern properties like
length,first_item,last_item - Type Safety: IDE autocomplete and better type hints
- Immutable & Hashable: Can be used as dictionary keys
- Extensible: Add metadata for confidence, lift, or custom properties
- Backward Compatible: Convert to/from dict format as needed
from gsppy import Sequence, sequences_to_dict, dict_to_sequences
# Create custom sequences
seq = Sequence.from_tuple(("A", "B", "C"), support=5)
# Extend sequences
extended = seq.extend("D") # Creates Sequence(("A", "B", "C", "D"))
# Add metadata
seq_with_meta = seq.with_metadata(confidence=0.85, lift=1.5)
# Convert between formats for compatibility
seq_result = gsp.search(min_support=0.3, return_sequences=True)
dict_format = sequences_to_dict(seq_result[0]) # Convert to dict
For a complete interactive example, see the sequence_example.py marimo notebook.
Loading SPM/GSP Format Files
GSP-Py supports loading datasets in the classical SPM/GSP delimiter format, which is widely used in sequential pattern mining research. This format uses:
-1to mark the end of an element (itemset)-2to mark the end of a sequence (transaction)
Using the SPM Loader
from gsppy.utils import read_transactions_from_spm
from gsppy import GSP
# Load SPM format file
transactions = read_transactions_from_spm('data.txt')
# Run GSP algorithm
gsp = GSP(transactions)
result = gsp.search(min_support=0.3)
SPM Format Examples
Simple sequence file (data.txt):
1 2 -1 3 -1 -2
4 -1 5 6 -1 -2
1 -1 2 3 -1 -2
This represents:
- Transaction 1: Items [1, 2] followed by item [3] → flattened to [1, 2, 3]
- Transaction 2: Item [4] followed by items [5, 6] → flattened to [4, 5, 6]
- Transaction 3: Item [1] followed by items [2, 3] → flattened to [1, 2, 3]
String tokens are also supported:
A B -1 C -1 -2
D -1 E F -1 -2
Token Mapping
For workflows requiring conversion between string tokens and integer IDs, use the TokenMapper:
from gsppy.utils import read_transactions_from_spm
from gsppy import TokenMapper
# Load with mappings
transactions, str_to_int, int_to_str = read_transactions_from_spm(
'data.txt',
return_mappings=True
)
print("String to Int:", str_to_int)
# Output: {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
print("Int to String:", int_to_str)
# Output: {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6'}
# Use the TokenMapper class directly
mapper = TokenMapper()
id_a = mapper.add_token("A")
id_b = mapper.add_token("B")
print(f"A -> {id_a}, B -> {id_b}")
# Output: A -> 0, B -> 1
Edge Cases Handled
The SPM loader gracefully handles:
- Empty lines (skipped)
- Missing
-2delimiter at end of line - Extra or consecutive delimiters
- Mixed-length elements in sequences
- Both integer and string tokens
Output
The algorithm will return a list of patterns with their corresponding support.
Sample Output:
[
{('Bread',): 4, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3, ('Coke',): 2},
{('Bread', 'Milk'): 3, ('Bread', 'Diaper'): 3, ('Bread', 'Beer'): 2, ('Milk', 'Diaper'): 3, ('Milk', 'Beer'): 2, ('Milk', 'Coke'): 2, ('Diaper', 'Beer'): 3, ('Diaper', 'Coke'): 2},
{('Bread', 'Milk', 'Diaper'): 2, ('Bread', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Coke'): 2}
]
- The first dictionary contains single-item sequences with their frequencies (e.g.,
('Bread',): 4means "Bread" appears in 4 transactions). - The second dictionary contains 2-item sequential patterns (e.g.,
('Bread', 'Milk'): 3means the sequence " Bread → Milk" appears in 3 transactions). Note that patterns like('Bread', 'Beer')are detected even when they don't appear adjacent in transactions - they just need to appear in order. - The third dictionary contains 3-item sequential patterns (e.g.,
('Bread', 'Milk', 'Diaper'): 2means the sequence "Bread → Milk → Diaper" appears in 2 transactions).
[!NOTE] The support of a sequence is calculated as the fraction of transactions containing the sequence in order (not necessarily contiguously), e.g.,
('Bread', 'Milk')appears in 3 out of 5 transactions → Support =3 / 5 = 0.6(60%). This insight helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user behavior.
[!IMPORTANT] Non-contiguous (ordered) matching: GSP-Py detects patterns where items appear in the specified order but not necessarily adjacent. For example, the pattern
('Bread', 'Beer')matches the transaction['Bread', 'Milk', 'Diaper', 'Beer']because Bread appears before Beer, even though they are not adjacent. This follows the standard GSP algorithm semantics for sequential pattern mining.
Understanding Non-Contiguous Pattern Matching
GSP-Py follows the standard GSP algorithm semantics by detecting ordered (non-contiguous) subsequences. This means:
- ✅ Order matters: Items must appear in the specified sequence order
- ✅ Gaps allowed: Items don't need to be adjacent
- ❌ Wrong order rejected: Items appearing in different order won't match
Example:
from gsppy.gsp import GSP
sequences = [
['a', 'b', 'c'], # Contains: (a,b), (a,c), (b,c), (a,b,c)
['a', 'c'], # Contains: (a,c)
['b', 'c', 'a'], # Contains: (b,c), (b,a), (c,a)
['a', 'b', 'c', 'd'], # Contains: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), etc.
]
gsp = GSP(sequences)
result = gsp.search(min_support=0.5) # Need at least 2/4 sequences
# Pattern ('a', 'c') is found with support=3 because:
# - It appears in ['a', 'b', 'c'] (with 'b' in between)
# - It appears in ['a', 'c'] (adjacent)
# - It appears in ['a', 'b', 'c', 'd'] (with 'b' in between)
# Total: 3 out of 4 sequences = 75% support ✅
[!TIP] For more complex examples, find example scripts in the
gsppy/testsfolder.
📊 DataFrame Input Support
GSP-Py supports Polars and Pandas DataFrames as input, enabling high-performance workflows with modern data formats like Arrow and Parquet. This feature is particularly useful for large-scale data engineering pipelines and integration with existing data processing workflows.
Installation
Install GSP-Py with DataFrame support:
pip install 'gsppy[dataframe]'
This installs the optional dependencies: polars, pandas, and pyarrow.
DataFrame Input Formats
GSP-Py supports two DataFrame formats:
1. Grouped Format (Transaction ID + Item Columns)
Use when your data has separate rows for each item in a transaction:
import polars as pl
from gsppy import GSP
# Polars DataFrame with transaction_id and item columns
df = pl.DataFrame({
"transaction_id": [1, 1, 2, 2, 2, 3, 3],
"item": ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"],
})
# Run GSP directly on the DataFrame
gsp = GSP(df, transaction_col="transaction_id", item_col="item")
patterns = gsp.search(min_support=0.3)
for level, freq_patterns in enumerate(patterns, start=1):
print(f"\n{level}-Sequence Patterns:")
for pattern, support in freq_patterns.items():
print(f" {pattern}: {support}")
2. Sequence Format (List Column)
Use when each row contains a complete transaction as a list:
import pandas as pd
from gsppy import GSP
# Pandas DataFrame with sequences as lists
df = pd.DataFrame({
"transaction": [
["Bread", "Milk"],
["Bread", "Diaper", "Beer"],
["Milk", "Coke"],
]
})
gsp = GSP(df, sequence_col="transaction")
patterns = gsp.search(min_support=0.3)
DataFrame with Timestamps
DataFrames support temporal constraints for time-aware pattern mining:
import polars as pl
from gsppy import GSP
# Grouped format with timestamps
df = pl.DataFrame({
"transaction_id": [1, 1, 1, 2, 2, 2],
"item": ["Login", "Browse", "Purchase", "Login", "Browse", "Purchase"],
"timestamp": [0, 2, 5, 0, 1, 15], # Time in seconds
})
# Find patterns where consecutive events occur within 10 seconds
gsp = GSP(
df,
transaction_col="transaction_id",
item_col="item",
timestamp_col="timestamp",
maxgap=10
)
patterns = gsp.search(min_support=0.5)
For sequence format with timestamps:
import pandas as pd
from gsppy import GSP
df = pd.DataFrame({
"sequence": [["A", "B", "C"], ["A", "D"]],
"timestamps": [[1, 2, 3], [1, 5]], # Timestamps per item
})
gsp = GSP(df, sequence_col="sequence", timestamp_col="timestamps", maxgap=3)
patterns = gsp.search(min_support=0.5)
Working with Parquet and Arrow Files
DataFrames enable seamless integration with columnar storage formats:
import polars as pl
from gsppy import GSP
# Read directly from Parquet
df = pl.read_parquet("transactions.parquet")
# Run GSP with automatic schema detection
gsp = GSP(df, transaction_col="txn_id", item_col="product")
patterns = gsp.search(min_support=0.2)
# Or use Pandas with Arrow backend
import pandas as pd
df_pandas = pd.read_parquet("transactions.parquet", engine="pyarrow")
gsp = GSP(df_pandas, transaction_col="txn_id", item_col="product")
patterns = gsp.search(min_support=0.2)
Performance Considerations
DataFrames offer performance benefits for large datasets:
- Polars: Leverages Arrow for zero-copy operations and parallel processing
- Pandas: Compatible with Arrow backend for efficient memory usage
- Parquet/Arrow: Columnar storage enables efficient filtering and reading
- Schema validation: Errors are caught early with clear messages
DataFrame Schema Requirements
Grouped Format:
transaction_col: Column containing transaction/sequence IDs (any type)item_col: Column containing items (any type, converted to strings)timestamp_col(optional): Column containing timestamps (numeric)
Sequence Format:
sequence_col: Column containing lists of itemstimestamp_col(optional): Column containing lists of timestamps (must match sequence lengths)
Error Handling
GSP-Py provides clear error messages for schema issues:
import polars as pl
from gsppy import GSP
df = pl.DataFrame({
"txn_id": [1, 2],
"product": ["A", "B"],
})
# ❌ Missing required column
try:
gsp = GSP(df, transaction_col="txn_id", item_col="item") # 'item' doesn't exist
except ValueError as e:
print(f"Error: {e}") # "Column 'item' not found in DataFrame"
# ❌ Invalid format specification
try:
gsp = GSP(df) # Must specify either sequence_col or both transaction_col and item_col
except ValueError as e:
print(f"Error: {e}") # "Must specify either 'sequence_col' or both 'transaction_col' and 'item_col'"
Backward Compatibility
Traditional list-based input continues to work:
from gsppy import GSP
# Lists still work as before
transactions = [["A", "B"], ["A", "C"], ["B", "C"]]
gsp = GSP(transactions)
patterns = gsp.search(min_support=0.5)
DataFrame parameters cannot be mixed with list input:
transactions = [["A", "B"], ["C", "D"]]
# ❌ This raises an error
gsp = GSP(transactions, transaction_col="txn") # ValueError: DataFrame parameters cannot be used with list input
Examples and Tests
For complete examples and edge cases, see:
tests/test_dataframe.py- Comprehensive test suite- DataFrame adapter documentation in
gsppy/dataframe_adapters.py
🔗 Itemset Support
GSP-Py supports itemsets within sequence elements, enabling you to capture co-occurrence of multiple items at the same time step. This is crucial for applications where items occur together rather than in strict sequential order.
What are Itemsets?
- Flat sequences:
['A', 'B', 'C']- each item occurs at a separate time step - Itemset sequences:
[['A', 'B'], ['C']]- items A and B occur together at the first time step, then C occurs later
Why Use Itemsets?
Itemsets are essential when temporal co-occurrence matters in your domain:
- Market basket analysis: Customers buy multiple items in a single shopping trip, then return for more items later
- Web analytics: Users open multiple pages in parallel tabs before moving to the next set of pages
- Event logs: Multiple events can occur simultaneously in complex systems
- Purchase patterns: Items bought together vs. items bought in sequence
Using Itemsets
Basic Example
from gsppy import GSP
# Itemset format: nested lists where inner lists are items that occur together
transactions = [
[['Bread', 'Milk'], ['Eggs']], # Bought Bread & Milk together, then Eggs later
[['Bread', 'Milk', 'Butter']], # Bought all three items together
[['Bread', 'Milk'], ['Eggs']], # Same pattern as customer 1
]
gsp = GSP(transactions)
patterns = gsp.search(min_support=0.5)
# Pattern ('Bread',) will match any itemset containing Bread
# Pattern ('Bread', 'Eggs') will match sequences where Bread appears before Eggs
# (even if they're in different itemsets)
Backward Compatibility with Flat Sequences
GSP-Py automatically normalizes flat sequences to itemsets internally, ensuring full backward compatibility:
from gsppy import GSP
# These are equivalent after normalization:
flat_transactions = [['A', 'B', 'C']] # Flat format
itemset_transactions = [[['A'], ['B'], ['C']]] # Equivalent itemset format
# Both produce the same results
gsp1 = GSP(flat_transactions)
gsp2 = GSP(itemset_transactions)
# Patterns are identical
patterns1 = gsp1.search(min_support=0.5)
patterns2 = gsp2.search(min_support=0.5)
Itemset Matching Semantics
Pattern matching with itemsets uses subset semantics:
- A pattern element matches a sequence element if all items in the pattern element are present in the sequence element
- Example: Pattern
[['A', 'B']]matches sequence element['A', 'B', 'C']because {A, B} ⊆ {A, B, C} - Pattern elements must appear in order across the sequence
from gsppy import GSP
transactions = [
[['A', 'B', 'D'], ['E'], ['C', 'F']], # A,B,D together, then E, then C,F together
]
gsp = GSP(transactions)
# Pattern ('A', 'C') will match because:
# - 'A' is in first itemset ['A', 'B', 'D'] ✓
# - 'C' appears later in third itemset ['C', 'F'] ✓
# - Order is preserved ✓
Reading Itemsets from SPM Format
The SPM/GSP format supports itemsets using delimiters:
-1: End of itemset-2: End of sequence
from gsppy.utils import read_transactions_from_spm
# SPM file content:
# 1 2 -1 3 -1 -2
# 1 -1 3 4 -1 -2
# Read with itemsets preserved
transactions = read_transactions_from_spm("data.txt", preserve_itemsets=True)
# Result: [[['1', '2'], ['3']], [['1'], ['3', '4']]]
# Read with itemsets flattened (backward compatible)
transactions = read_transactions_from_spm("data.txt", preserve_itemsets=False)
# Result: [['1', '2', '3'], ['1', '3', '4']]
Itemsets with Timestamps
Itemsets work seamlessly with temporal constraints:
from gsppy import GSP
# Itemsets with timestamps: [(item, timestamp), ...]
transactions = [
[[('Login', 0), ('Home', 0)], [('Product', 5)], [('Checkout', 10)]],
[[('Login', 0)], [('Home', 2), ('Product', 2)], [('Checkout', 15)]],
]
# Find patterns where events in the same itemset occur together
# and subsequent itemsets occur within maxgap time units
gsp = GSP(transactions, maxgap=10)
patterns = gsp.search(min_support=0.5)
Complete Example
See the itemset_example.py marimo notebook for comprehensive examples including:
- Market basket analysis with itemsets
- Web clickstream with parallel page views
- Comparison of flat vs. itemset semantics
- Reading and processing SPM format files
Key Takeaways
✓ Itemsets capture co-occurrence of items at the same time step
✓ Flat sequences are automatically normalized to itemsets internally
✓ Both formats work seamlessly with GSP-Py
✓ Use itemsets when temporal co-occurrence matters in your domain
✓ SPM format supports both flat and itemset representations
⏱️ Temporal Constraints
GSP-Py supports time-constrained sequential pattern mining with three powerful temporal constraints: mingap, maxgap, and maxspan. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.
Temporal Constraint Parameters
mingap: Minimum time gap required between consecutive items in a patternmaxgap: Maximum time gap allowed between consecutive items in a patternmaxspan: Maximum time span from the first to the last item in a pattern
Using Temporal Constraints
To use temporal constraints, your transactions must include timestamps as (item, timestamp) tuples:
from gsppy.gsp import GSP
# Transactions with timestamps (e.g., in seconds, hours, days, etc.)
timestamped_transactions = [
[("Login", 0), ("Browse", 2), ("AddToCart", 5), ("Purchase", 7)],
[("Login", 0), ("Browse", 1), ("AddToCart", 15), ("Purchase", 20)],
[("Login", 0), ("Browse", 3), ("AddToCart", 6), ("Purchase", 8)],
]
# Find patterns where consecutive events occur within 10 time units
gsp = GSP(timestamped_transactions, maxgap=10)
patterns = gsp.search(min_support=0.6)
# The pattern ("Browse", "AddToCart", "Purchase") will:
# - Be found in transaction 1: gaps are 3 and 2 (both ≤ 10) ✅
# - NOT be found in transaction 2: gap between Browse→AddToCart is 14 (exceeds maxgap) ❌
# - Be found in transaction 3: gaps are 3 and 2 (both ≤ 10) ✅
# Result: Support = 2/3 = 67% (above threshold of 60%)
CLI Usage with Temporal Constraints
# Find patterns with maximum gap of 5 time units
gsppy --file temporal_data.json --min_support 0.3 --maxgap 5
# Find patterns with minimum gap of 2 time units
gsppy --file temporal_data.json --min_support 0.3 --mingap 2
# Find patterns that complete within 10 time units
gsppy --file temporal_data.json --min_support 0.3 --maxspan 10
# Combine multiple constraints
gsppy --file temporal_data.json --min_support 0.3 --mingap 1 --maxgap 5 --maxspan 10
Real-World Examples
Medical Event Mining
from gsppy.gsp import GSP
# Medical events with timestamps in days
medical_sequences = [
[("Symptom", 0), ("Diagnosis", 2), ("Treatment", 5), ("Recovery", 15)],
[("Symptom", 0), ("Diagnosis", 1), ("Treatment", 20), ("Recovery", 30)],
[("Symptom", 0), ("Diagnosis", 3), ("Treatment", 6), ("Recovery", 18)],
]
# Find patterns where treatment follows diagnosis within 10 days
gsp = GSP(medical_sequences, maxgap=10)
result = gsp.search(min_support=0.5)
# Pattern ("Diagnosis", "Treatment") found in sequences 1 & 3 only
# (sequence 2 has gap of 19 days, exceeding maxgap)
Retail Analytics
from gsppy.gsp import GSP
# Customer purchases with timestamps in hours
purchase_sequences = [
[("Browse", 0), ("AddToCart", 0.5), ("Purchase", 1)],
[("Browse", 0), ("AddToCart", 1), ("Purchase", 25)], # Long delay
[("Browse", 0), ("AddToCart", 0.3), ("Purchase", 0.8)],
]
# Find purchase journeys that complete within 2 hours
gsp = GSP(purchase_sequences, maxspan=2)
result = gsp.search(min_support=0.5)
# Full sequence found in 2 out of 3 transactions
# (sequence 2 has span of 25 hours, exceeding maxspan)
User Journey Discovery
from gsppy.gsp import GSP
# Website navigation with timestamps in seconds
navigation_sequences = [
[("Home", 0), ("Search", 5), ("Product", 10), ("Checkout", 15)],
[("Home", 0), ("Search", 3), ("Product", 8), ("Checkout", 180)],
[("Home", 0), ("Search", 4), ("Product", 9), ("Checkout", 14)],
]
# Find navigation patterns with:
# - Minimum 2 seconds between steps (mingap)
# - Maximum 20 seconds between steps (maxgap)
# - Complete within 30 seconds total (maxspan)
gsp = GSP(navigation_sequences, mingap=2, maxgap=20, maxspan=30)
result = gsp.search(min_support=0.5)
Important Notes
- Temporal constraints require timestamped transactions (item-timestamp tuples)
- If temporal constraints are specified but transactions don't have timestamps, a warning is logged and constraints are ignored
- When using temporal constraints, the Python backend is automatically used (accelerated backends don't yet support temporal constraints)
- Timestamps can be in any unit (seconds, minutes, hours, days) as long as they're consistent within your dataset
🔧 Flexible Candidate Pruning
GSP-Py supports flexible candidate pruning strategies that allow you to customize how candidate sequences are filtered during pattern mining. This enables optimization for different dataset characteristics and mining requirements.
Built-in Pruning Strategies
1. Support-Based Pruning (Default)
The standard GSP pruning based on minimum support threshold:
from gsppy.gsp import GSP
from gsppy.pruning import SupportBasedPruning
# Explicit support-based pruning
pruner = SupportBasedPruning(min_support_fraction=0.3)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
2. Frequency-Based Pruning
Prunes candidates based on absolute frequency (minimum number of occurrences):
from gsppy.pruning import FrequencyBasedPruning
# Require patterns to appear at least 5 times
pruner = FrequencyBasedPruning(min_frequency=5)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.2)
Use case: When you need patterns to occur a minimum absolute number of times, regardless of dataset size.
3. Temporal-Aware Pruning
Optimizes pruning for time-constrained pattern mining by pre-filtering infeasible patterns:
from gsppy.pruning import TemporalAwarePruning
# Prune patterns that cannot satisfy temporal constraints
pruner = TemporalAwarePruning(
mingap=1,
maxgap=5,
maxspan=10,
min_support_fraction=0.3
)
gsp = GSP(timestamped_transactions, mingap=1, maxgap=5, maxspan=10, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
Use case: Improves performance for temporal pattern mining by eliminating patterns that cannot satisfy temporal constraints.
4. Combined Pruning
Combines multiple pruning strategies for aggressive filtering:
from gsppy.pruning import CombinedPruning, SupportBasedPruning, FrequencyBasedPruning
# Apply both support and frequency constraints
strategies = [
SupportBasedPruning(min_support_fraction=0.3),
FrequencyBasedPruning(min_frequency=5)
]
pruner = CombinedPruning(strategies)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
Use case: When you want to combine multiple filtering criteria for more selective pattern discovery.
Custom Pruning Strategies
You can create custom pruning strategies by implementing the PruningStrategy interface:
from gsppy.pruning import PruningStrategy
from typing import Dict, Optional, Tuple
class MyCustomPruner(PruningStrategy):
def should_prune(
self,
candidate: Tuple[str, ...],
support_count: int,
total_transactions: int,
context: Optional[Dict] = None
) -> bool:
# Custom pruning logic
# Return True to prune (filter out), False to keep
pattern_length = len(candidate)
# Example: Prune very long patterns with low support
if pattern_length > 5 and support_count < 10:
return True
return False
# Use your custom pruner
custom_pruner = MyCustomPruner()
gsp = GSP(transactions, pruning_strategy=custom_pruner)
result = gsp.search(min_support=0.2)
Performance Characteristics
Different pruning strategies have different performance tradeoffs:
| Strategy | Pruning Aggressiveness | Use Case | Performance Impact |
|---|---|---|---|
| SupportBased | Moderate | General-purpose mining | Baseline performance |
| FrequencyBased | High (for large datasets) | Require absolute frequency | Faster on large datasets |
| TemporalAware | High (for temporal data) | Time-constrained patterns | Significant speedup for temporal mining |
| Combined | Very High | Selective pattern discovery | Fastest, but may miss edge cases |
Benchmarking Pruning Strategies
To compare pruning strategies on your dataset:
# Compare all strategies
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all
# Benchmark a specific strategy
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy frequency
# Run multiple rounds for averaging
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all --rounds 3
See benchmarks/bench_pruning.py for the complete benchmarking script.
⌨️ Typing
gsppy ships inline type information (PEP 561) via a bundled py.typed marker. The public API is re-exported from
gsppy directly—import GSP for programmatic use or reuse the CLI helpers (detect_and_read_file,
read_transactions_from_json, read_transactions_from_csv, and setup_logging) when embedding the tool in
larger applications.
🌟 Planned Features
We are actively working to improve GSP-Py. Here are some exciting features planned for future releases:
- Support for Preprocessing and Postprocessing:
- Add hooks to allow users to transform datasets before mining and customize the output results.
Want to contribute or suggest an improvement? Open a discussion or issue!
🤝 Contributing
We welcome contributions from the community! If you'd like to help improve GSP-Py, read our CONTRIBUTING.md guide to get started.
Development dependencies (e.g., testing and linting tools) are handled via uv. To set up and run the main tasks:
uv venv .venv
uv sync --frozen --extra dev
uv pip install -e .
# Run tasks
uv run pytest -n auto
uv run ruff check .
uv run pyright
General Steps:
- Fork the repository.
- Create a feature branch:
git checkout -b feature/my-feature. - Commit your changes using Conventional Commits format:
git commit -m "feat: add my feature". - Push to your branch:
git push origin feature/my-feature. - Submit a pull request to the main repository!
Looking for ideas? Check out our Planned Features section.
Release Management
GSP-Py uses automated release management with Conventional Commits. When commits are merged to main:
- Releases are triggered by:
fix:(patch),feat:(minor),perf:(patch), orBREAKING CHANGE:(major) - No release for:
docs:,style:,refactor:,test:,build:,ci:,chore: - CHANGELOG.md is automatically updated with structured release notes
- Git tags and GitHub releases are created automatically
See Release Management Guide for details on commit message format and release process.
📝 License
This project is licensed under the terms of the MIT License. For more details, refer to the LICENSE file.
📖 Citation
If GSP-Py contributed to your research or project that led to a publication, we kindly ask that you cite it as follows:
@misc{pradolima_gsppy,
author = {Prado Lima, Jackson Antonio do},
title = {{GSP-Py - Generalized Sequence Pattern algorithm in Python}},
month = Dec,
year = 2025,
doi = {10.5281/zenodo.3333987},
url = {https://doi.org/10.5281/zenodo.3333987}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file gsppy-5.3.0.tar.gz.
File metadata
- Download URL: gsppy-5.3.0.tar.gz
- Upload date:
- Size: 153.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8ef7c1c2f2426aca14851fdae52b0e7c5d22df838396933e5163c0a7ede19825
|
|
| MD5 |
9078ede9e6f241a7162ee818fc4ffb00
|
|
| BLAKE2b-256 |
e115209bbee176429b39a61e030119ddb3f77dcbbae8e651a1daf1a80b20eb0d
|
File details
Details for the file gsppy-5.3.0-py3-none-any.whl.
File metadata
- Download URL: gsppy-5.3.0-py3-none-any.whl
- Upload date:
- Size: 61.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c460884a1682ac103369957babe8cb06d5ff2acb89087eed718251845d9935d9
|
|
| MD5 |
28cccccda0e1ed6e1b8a3b1205d46b67
|
|
| BLAKE2b-256 |
ac67bbbcd7433db8a95fbbbe6a2a4ebb7c32da4fd9fe5926da070af44baaa888
|