PySpark batch analytics: Extract, Transform, Stage, and analytical modules (linear regression, correlation, PCA, t-test).

Batch Analytics

PySpark-based analytics pipeline for ClickHouse data: Extract → Transform → Stage → Analytics. Designed to run as the main application inside a Spark driver container (invoked by analytics_runners via a SparkApplication CRD).

Bundle contents

Only the files required for the batch analytics job runner:

analytics/
├── pyproject.toml
├── requirements.txt          # core + scipy + boto3 + clickhouse-connect (single-file install)
├── requirements-batch.txt    # includes requirements.txt
├── README.md
└── src/
    └── batch_analytics/
        ├── __init__.py
        ├── __main__.py        # python -m batch_analytics
        ├── job_runner.py      # Entry point
        ├── config.py
        ├── extract.py
        ├── transform.py
        ├── log.py
        ├── README.md
        └── analytics/
            ├── __init__.py
            ├── linear_regression.py
            ├── correlation.py
            ├── pca_clustering.py
            └── t_test.py

Install

pip install -e .
# or install every runtime dependency used anywhere in the package, then editable:
pip install -r requirements.txt && pip install -e .
# PyPI install includes numpy and scipy (t-test); extras: s3, clickhouse, output, full
pip install "batch-analytics[full]"

Run

# Via module
python -m batch_analytics

# Via CLI (after pip install -e .); with no flags this runs the full pipeline
batch-analytics

# Analytics only (from staged ClickHouse table)
batch-analytics --from-stage --modules lr corr pca ttest
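
When the job is launched from another Python process (a scheduler task, for example), the commands above can be assembled programmatically. The following is a minimal sketch, not part of the package: `build_command` and `run_pipeline` are hypothetical helper names, and only the `--from-stage` and `--modules` flags shown above are assumed to exist.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence


def build_command(from_stage: bool = False,
                  modules: Optional[Sequence[str]] = None) -> List[str]:
    """Build the argv list for the batch-analytics CLI (hypothetical helper).

    Uses only the flags shown in this README. Whether --modules is accepted
    without --from-stage is an assumption; check the package README.
    """
    cmd = ["batch-analytics"]
    if from_stage:
        cmd.append("--from-stage")
    if modules:
        cmd.append("--modules")
        cmd.extend(modules)
    return cmd


def run_pipeline(from_stage: bool = False,
                 modules: Optional[Sequence[str]] = None) -> int:
    """Run the CLI and return its exit code; raises if the command fails."""
    completed = subprocess.run(build_command(from_stage, modules), check=True)
    return completed.returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(shlex.join(build_command(True, ["lr", "corr", "pca", "ttest"])))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if module names or later options ever contain spaces.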

Configuration

See src/batch_analytics/README.md for environment variables and usage.

Docker image

For Spark on Kubernetes, build an image that includes this package and exposes job_runner.py at the path used by mainApplicationFile (e.g. local:///opt/analytics/job_runner.py).
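
To illustrate where mainApplicationFile fits, here is a rough sketch that builds a SparkApplication manifest as a Python dictionary and prints it as JSON (which kubectl accepts as well as YAML). The application name and image reference are placeholders, the `spark_application_manifest` helper is hypothetical, and the manifest is deliberately incomplete: a real one also needs fields such as sparkVersion and the driver and executor specs for your cluster.

```python
import json
from typing import Dict, List, Optional


def spark_application_manifest(
    name: str,
    image: str,
    main_file: str = "local:///opt/analytics/job_runner.py",
    arguments: Optional[List[str]] = None,
) -> Dict:
    """Return a partial SparkApplication manifest (hypothetical helper).

    Only the fields relevant to this README are filled in. sparkVersion,
    driver, and executor settings are omitted and must be added before
    the manifest can be applied.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,  # image that bundles this package
            "mainApplicationFile": main_file,  # path to job_runner.py inside the image
            "arguments": list(arguments or []),  # passed through to job_runner.py
        },
    }


if __name__ == "__main__":
    manifest = spark_application_manifest(
        name="batch-analytics-example",  # placeholder name
        image="registry.example.com/analytics:latest",  # placeholder image
        arguments=["--from-stage", "--modules", "lr", "corr"],
    )
    print(json.dumps(manifest, indent=2))
```

The `local://` scheme tells Spark that the file already exists inside the container image, so the path must match wherever the image build copies job_runner.py.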
