pyspark-sampling

sparksampling is a PySpark-based gRPC service for sampling and data quality assessment. It supports containerized deployment and Spark on K8s.

Features

  • Common sampling methods: Random, Stratified, Simple
  • Relationship sampling based on a DAG and topological sorting
  • Cloud Native and Spark on K8S support
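
To illustrate the idea behind relationship sampling, here is a minimal pure-Python sketch. It is not sparksampling's implementation or API: the toy tables, the `relations` mapping, and the `relationship_sample` function are all hypothetical. It assumes tables are linked by foreign keys that form a DAG, samples root tables at random, and then visits the remaining tables in topological order, keeping only rows that reference already-sampled parents.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy data: each table is a list of row dicts.
tables = {
    "users": [{"id": i} for i in range(100)],
    "orders": [{"id": i, "user_id": i % 100} for i in range(300)],
    "items": [{"id": i, "order_id": i % 300} for i in range(600)],
}

# Hypothetical relationships: table -> {parent table: (child key, parent key)}.
relations = {
    "users": {},
    "orders": {"users": ("user_id", "id")},
    "items": {"orders": ("order_id", "id")},
}


def relationship_sample(tables, relations, fraction, seed=0):
    """Sample root tables randomly, then filter children to match them."""
    rng = random.Random(seed)
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(parents) for name, parents in relations.items()}
    order = list(TopologicalSorter(graph).static_order())

    sampled = {}
    for name in order:
        rows = tables[name]
        parents = relations[name]
        if not parents:
            # Root table: plain random sample without replacement.
            k = max(1, int(len(rows) * fraction))
            sampled[name] = rng.sample(rows, k)
        else:
            # Child table: keep rows whose foreign keys hit sampled parents.
            kept = rows
            for parent, (child_key, parent_key) in parents.items():
                parent_keys = {row[parent_key] for row in sampled[parent]}
                kept = [row for row in kept if row[child_key] in parent_keys]
            sampled[name] = kept
    return order, sampled


order, sampled = relationship_sample(tables, relations, fraction=0.1)
print(order)
print({name: len(rows) for name, rows in sampled.items()})
```

Processing parents before children is what keeps the sampled tables referentially consistent: every sampled order points at a sampled user, and every sampled item points at a sampled order.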

Quick Start

Installation

To try it out, install the package directly from PyPI:

pip install sparksampling

Then start the service:

sparksampling

The service starts and listens on port 8530.
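
One quick way to confirm that the service is up is to check whether something accepts TCP connections on port 8530. The sketch below uses only the Python standard library. The helper name `port_is_open` is hypothetical, and the check assumes the service runs on the local machine. It does not call the gRPC API itself, only tests that the port is reachable.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumes the service was started locally on its default port.
    print(port_is_open("127.0.0.1", 8530))
```

The same check applies when the service runs in Docker with port 8530 published to the host.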

Docker

docker run -p 8530:8530 wh1isper/pysparksampling:latest

Development

Install in editable mode with the test dependencies, then set up the pre-commit hooks:

pip install -e .[test]
pre-commit install

Run the tests:

pytest -v
