Extract events from Splunk journal archives to raw format (JSON, CSV, Parquet)

These details have been verified by PyPI

Maintainers

ponquersohn

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
Topic
- System :: Archiving

Project description

Splunk DDSS Extractor

Convert Splunk Dynamic Data Self Storage (DDSS) archives from the compressed journal format to raw events.

Overview

Splunk DDSS Extractor is a Python library that processes Splunk journal archives, extracts events, and converts them to raw format for easier analysis and long-term storage. Use it in your own applications, data pipelines, or as a CLI tool.

Note: This project is based on the concept from fionera/splunker, reimplemented in Python with additional features for production use.

Features

Automatic compression detection (.zst, .gz, uncompressed)
Extract events with full metadata (host, source, sourcetype, timestamps)
Multiple output formats (JSON Lines, CSV, Parquet)
Streaming processing for memory efficiency
Simple Python API and CLI interface
Docker support for containerized deployments
Integrates with AWS Lambda, ECS, or any Python environment

Quick Start

Using the Makefile (Recommended)

# Show all available commands
make env

# Complete development setup (venv + dependencies)
make dev-setup

# Run tests
make test

# Build Docker image
make docker

Manual Setup

Installation

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Optional: Install Parquet support
pip install pyarrow

Basic Usage

Extract a journal file:

from splunk_ddss_extractor.extractor import Extractor

extractor = Extractor()

# Extract to JSON Lines
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.json',
    output_format='ndjson'
)

# Extract to CSV
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.csv',
    output_format='csv'
)

# Extract to Parquet
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='output.parquet',
    output_format='parquet'
)

# Extract from S3 to local file (streaming, no download)
extractor.extract(
    input_path='s3://bucket/path/journal.zst',
    output_path='output.json',
    output_format='ndjson'
)

# Extract from local to S3
extractor.extract(
    input_path='/path/to/journal.zst',
    output_path='s3://bucket/output/data.json',
    output_format='ndjson'
)
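When an archive holds many buckets, the calls above can be wrapped in a small batch helper. The sketch below is not part of the library: `plan_outputs`, `extract_all`, and the `SUFFIXES` mapping are hypothetical names, and it assumes each journal file sits in its own uniquely named directory so that the directory name can serve as the output file name. It only relies on the `extract(input_path, output_path, output_format)` call shown above.

```python
from pathlib import Path

# Hypothetical mapping from output_format values to file suffixes.
SUFFIXES = {"ndjson": ".json", "csv": ".csv", "parquet": ".parquet"}


def plan_outputs(journal_paths, output_dir, output_format="ndjson"):
    """Pair each journal path with an output path named after its parent directory."""
    suffix = SUFFIXES[output_format]
    jobs = []
    for raw_path in journal_paths:
        journal = Path(raw_path)
        # Assumption: the parent directory name is unique per journal.
        output_path = Path(output_dir) / f"{journal.parent.name}{suffix}"
        jobs.append((str(journal), str(output_path)))
    return jobs


def extract_all(extractor, journal_paths, output_dir, output_format="ndjson"):
    """Run extractor.extract() once per journal and return the planned jobs."""
    jobs = plan_outputs(journal_paths, output_dir, output_format)
    for input_path, output_path in jobs:
        extractor.extract(
            input_path=input_path,
            output_path=output_path,
            output_format=output_format,
        )
    return jobs
```

Passing the extractor in as an argument keeps the helper easy to test with a stand-in object; in real use it would be called as `extract_all(Extractor(), paths, 'out')`.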

Low-level streaming (advanced):

from splunk_ddss_extractor.decoder import JournalDecoder
import zstandard as zstd

# For low-level access, decoder needs an uncompressed stream
# If reading a compressed file, decompress it first:
with open('/path/to/journal.zst', 'rb') as compressed_file:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(compressed_file) as reader:
        decoder = JournalDecoder(reader=reader)
        while decoder.scan():
            event = decoder.get_event()
            print(f"Host: {decoder.host()}")
            print(f"Source: {decoder.source()}")
            print(f"Sourcetype: {decoder.source_type()}")
            print(f"Timestamp: {event.index_time}")
            print(f"Message: {event.message_string()}")

# For uncompressed journal files:
with open('/path/to/journal', 'rb') as f:
    decoder = JournalDecoder(reader=f)
    while decoder.scan():
        event = decoder.get_event()
        # Process event...
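The decoder loop above can be turned into a generator of plain dictionaries, which is often easier to feed into other code. This is a minimal sketch, not the library's own writer: `event_to_record`, `iter_records`, and `to_json_lines` are hypothetical helpers, and mapping `event.index_time` to a `timestamp` field is an assumption based on the sample output shown under Output Formats. It uses only the decoder and event methods that appear in the example above.

```python
import json


def event_to_record(decoder, event):
    """Build one flat record from the decoder state and the current event."""
    return {
        "timestamp": event.index_time,  # assumption: index time is the record timestamp
        "host": decoder.host(),
        "source": decoder.source(),
        "sourcetype": decoder.source_type(),
        "message": event.message_string(),
    }


def iter_records(decoder):
    """Yield one record per event until decoder.scan() reports no more events."""
    while decoder.scan():
        yield event_to_record(decoder, decoder.get_event())


def to_json_lines(decoder):
    """Yield each record as a single JSON Lines string (without trailing newline)."""
    for record in iter_records(decoder):
        yield json.dumps(record)
```

Because everything is a generator, only one event is held in memory at a time, which matches the streaming approach of the loop it wraps.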

Docker Usage

# Build image
make docker

# Run with local file
docker run -v /path/to/data:/data ghcr.io/ponquersohn/splunk_ddss_extractor:latest

# Use in your own Dockerfile
FROM ghcr.io/ponquersohn/splunk_ddss_extractor:latest
COPY your_script.py /app/
CMD ["python", "/app/your_script.py"]

Architecture

This is a Python library with the following components:

Journal Decoder - Low-level decoder for Splunk's binary journal format
Extractor Interface - High-level API for common extraction tasks
Output Writers - Support for JSON, CSV, and Parquet formats
Compression Detection - Automatic detection and handling of .zst, .gz formats
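How the library detects compression internally is not described here; it may simply look at the file extension. As an illustration of the idea, the sketch below identifies the format from the first bytes of the file instead, using the standard magic numbers for Zstandard frames and gzip members. The names `detect_compression` and `sniff_file` are hypothetical.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame magic number (0xFD2FB528, little-endian)
GZIP_MAGIC = b"\x1f\x8b"          # gzip member header (ID1, ID2)


def detect_compression(header: bytes) -> str:
    """Return 'zstd', 'gzip', or 'none' based on the leading bytes of a file."""
    if header.startswith(ZSTD_MAGIC):
        return "zstd"
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    return "none"


def sniff_file(path: str) -> str:
    """Read the first four bytes of a file and classify its compression."""
    with open(path, "rb") as f:
        return detect_compression(f.read(4))
```

Checking content rather than the extension also works for files that were renamed or stored without a suffix.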

Integration Options:

Direct Python import in your applications
AWS Lambda functions for serverless processing
ECS/Fargate tasks for batch processing
Docker containers for isolated environments
Local scripts for one-off extractions
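As one example of the Lambda option, the sketch below turns an S3 event notification into extraction jobs. It is an illustration under several assumptions rather than a shipped handler: `s3_jobs_from_event` and `handler` are hypothetical names, the `extracted/` prefix is made up, the output bucket is read from the `S3_BUCKET` variable, and it assumes `extract()` accepts `s3://` URIs for input and output in the same call (the usage examples show each direction separately). Object keys in S3 event notifications are URL-encoded, hence the `unquote_plus` call.

```python
import os
from urllib.parse import unquote_plus


def s3_jobs_from_event(event, output_bucket, output_prefix="extracted/"):
    """Build (input_uri, output_uri) pairs from an S3 event notification."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        input_uri = f"s3://{bucket}/{key}"
        output_uri = f"s3://{output_bucket}/{output_prefix}{key}.json"
        jobs.append((input_uri, output_uri))
    return jobs


def handler(event, context, extractor=None):
    """Lambda-style entry point; an extractor can be injected for testing."""
    if extractor is None:
        # Imported lazily so the module loads even where the library is absent.
        from splunk_ddss_extractor.extractor import Extractor
        extractor = Extractor()
    jobs = s3_jobs_from_event(event, os.environ["S3_BUCKET"])
    for input_uri, output_uri in jobs:
        extractor.extract(
            input_path=input_uri,
            output_path=output_uri,
            output_format="ndjson",
        )
    return {"processed": len(jobs)}
```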

See CLAUDE.md for detailed architecture documentation.

Development

Quick Commands

# Run tests
make test

# Run tests with coverage
make test-coverage

# Build Docker image
make docker

# Test Docker locally
make docker-run

# Run all checks (tests)
make check

# Clean temporary files
make clean

Manual Commands

# Run tests
pytest tests/

# Code formatting
black src/ tests/

# Local Docker testing
cd docker
docker-compose up

Configuration

When integrating with AWS or other environments, you may use these environment variables:

OUTPUT_FORMAT: Output format - json, csv, or parquet (default: json)
LOG_LEVEL: Logging level (default: INFO)
AWS_REGION: AWS region for S3 operations (default: us-east-1)
S3_BUCKET: S3 bucket name (for S3 integrations)
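A wrapper script might read these variables once at startup and validate them. The following is a minimal sketch using the names and defaults listed above; the `Settings` class and `load_settings` function are hypothetical and not part of the library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

ALLOWED_FORMATS = {"json", "csv", "parquet"}


@dataclass(frozen=True)
class Settings:
    output_format: str = "json"
    log_level: str = "INFO"
    aws_region: str = "us-east-1"
    s3_bucket: Optional[str] = None  # only needed for S3 integrations


def load_settings(environ: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    output_format = env.get("OUTPUT_FORMAT", "json").lower()
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"Unsupported OUTPUT_FORMAT: {output_format!r}")
    return Settings(
        output_format=output_format,
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        aws_region=env.get("AWS_REGION", "us-east-1"),
        s3_bucket=env.get("S3_BUCKET"),
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes the function straightforward to exercise in tests.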

Output Formats

JSON Lines (default)

{"timestamp": 1234567890, "host": "server01", "source": "/var/log/app.log", "sourcetype": "app", "message": "Event data"}
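Since each line is an independent JSON object, the output can be consumed one line at a time without loading the whole file. A minimal sketch, assuming the field names from the sample line above; `count_events_by_host` is a hypothetical helper:

```python
import json
from collections import Counter


def count_events_by_host(lines):
    """Count events per host from an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines, e.g. a trailing newline
            continue
        counts[json.loads(line)["host"]] += 1
    return counts
```

An open file object is itself an iterable of lines, so `count_events_by_host(open('output.json'))` streams through the file.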

CSV

timestamp,host,source,sourcetype,message
1234567890,server01,/var/log/app.log,app,"Event data"
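Log messages often contain commas, quotes, or line breaks, so the CSV output is best read with a real CSV parser rather than by splitting on commas. A minimal sketch using the standard library, assuming the header row shown above and integer epoch timestamps as in the sample; `parse_events` is a hypothetical helper:

```python
import csv
import io


def parse_events(csv_text):
    """Parse extractor-style CSV text into a list of dicts keyed by column name."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    events = []
    for row in reader:
        row["timestamp"] = int(row["timestamp"])  # assumption: integer epoch seconds
        events.append(row)
    return events
```

For files on disk, the same applies: open them with `newline=''` and pass the file object to `csv.DictReader`.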

Parquet

Columnar format optimized for analytics (requires pyarrow).

Credits

This project is inspired by and based on the concept from fionera/splunker, a Go implementation for extracting Splunk journal files. This Python implementation extends the original concept with:

Streaming S3 support (no temporary files)
Multiple output formats (JSON Lines, CSV, Parquet)
Python library API for easy integration
Docker and AWS deployment options

License

Proprietary

Contributing

See CLAUDE.md for development guidelines.

Release history

0.4.3: Mar 9, 2026
0.4.2: Mar 9, 2026
0.4.1: Mar 9, 2026
0.4.0: Mar 9, 2026
0.3.0 (this version): Mar 9, 2026

Download files

Source Distribution

splunk_ddss_extractor-0.3.0.tar.gz (20.7 kB)

Uploaded Mar 9, 2026 Source

File details

Details for the file splunk_ddss_extractor-0.3.0.tar.gz.

File metadata

Download URL: splunk_ddss_extractor-0.3.0.tar.gz
Upload date: Mar 9, 2026
Size: 20.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for splunk_ddss_extractor-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`130ed8d3530ef383332424788d5ae753c5c99c21c7d41bdd2fc0e50d2d8a903c`
MD5	`f0f2b52d16974fa17ab8cab4863023bb`
BLAKE2b-256	`5728f1d2fc0258a5eacffb3e008cd0176fe0e79e3773c399a8db1116c1665925`

Provenance

The following attestation bundles were made for splunk_ddss_extractor-0.3.0.tar.gz:

Publisher: publish.yml on ponquersohn/splunk_ddss_extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: splunk_ddss_extractor-0.3.0.tar.gz
- Subject digest: 130ed8d3530ef383332424788d5ae753c5c99c21c7d41bdd2fc0e50d2d8a903c
- Sigstore transparency entry: 1066679962
- Sigstore integration time: Mar 9, 2026
Source repository:
- Permalink: ponquersohn/splunk_ddss_extractor@70d9babe6affd2c21cbd48d9e6154139cac47b4d
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ponquersohn
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@70d9babe6affd2c21cbd48d9e6154139cac47b4d
- Trigger Event: release
