Petabyte-scale sonar file catalog with deduplication across distributed NFS systems

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

These details have not been verified by PyPI

Project description

Sonar Catalog

Petabyte-scale sonar file catalog with content-based deduplication across distributed NFS systems. Automatically discovers hosts, resolves NFS mount topologies, crawls filesystems, extracts navigation tracks, and serves everything through a searchable web UI with a CesiumJS 3D globe.

sonar-catalog init
sonar-catalog crawl /mnt/sonar-nas-01
sonar-catalog web

Architecture

graph TB
    subgraph CLI["CLI (cli.py)"]
        init[init]
        discover[discover]
        crawl[crawl / crawl-all]
        search[search / dupes / where]
        extractnav[extract-nav]
        demo[demo]
        export[export]
        web[web]
    end

    subgraph Core["Core Engine"]
        DE[Discovery Engine]
        MR[Mount Resolver]
        FC[File Crawler]
        FH[File Hasher]
        CS[Catalog Search]
    end

    subgraph Plugins["Plugin System"]
        PM[Plugin Manager]
        HS[Hook Specs]
        BI[Built-in Plugin]
        TP[Third-Party Plugins]
    end

    subgraph Storage["Storage Layer"]
        DB[(SQLite / PostgreSQL)]
        FTS[FTS5 Full-Text Index]
    end

    subgraph Extractors["Nav Extractors"]
        JSF[JSF Parser]
        XTF[XTF Parser]
        NMEA[NMEA Parser]
        SC[Sidecar Reader]
    end

    subgraph WebUI["Web Interface"]
        Flask[Flask App]
        API[REST API]
        Globe[CesiumJS Globe]
        Search2[Search UI]
    end

    discover --> DE
    crawl --> FC
    FC --> FH
    FC --> MR
    search --> CS
    extractnav --> Extractors
    web --> Flask

    DE --> DB
    FC --> DB
    CS --> DB
    API --> DB
    FTS -.-> DB

    PM --> HS
    BI --> PM
    TP --> PM
    FC -.-> PM
    Extractors -.-> PM

    Flask --> API
    Flask --> Globe
    Flask --> Search2

Data Flow

How a file goes from disk to searchable catalog entry:

sequenceDiagram
    participant User
    participant CLI
    participant Discovery as Discovery Engine
    participant Resolver as Mount Resolver
    participant Crawler as File Crawler
    participant Hasher as File Hasher
    participant Extractor as Nav Extractor
    participant DB as Database

    User->>CLI: sonar-catalog crawl /mnt/nas01
    CLI->>Resolver: Load NFS mounts (proc, autofs, fstab)
    Resolver-->>CLI: Mount table ready

    CLI->>Crawler: crawl_local("/mnt/nas01")
    loop Each file in directory tree
        Crawler->>Crawler: Walk filesystem, apply filters
        Crawler->>Hasher: Partial hash (first 4MB)
        Hasher-->>Crawler: Fingerprint

        alt New fingerprint
            Crawler->>Hasher: Full content hash (BLAKE3)
            Hasher-->>Crawler: content_hash
        else Known fingerprint
            Crawler->>Crawler: Skip (already cataloged)
        end

        Crawler->>Resolver: Resolve /mnt/nas01/data/file.xtf
        Resolver-->>Crawler: nas01:/export/survey/data/file.xtf

        Crawler->>Extractor: Extract nav from .xtf
        Extractor-->>Crawler: Track [[lat,lon], ...]
    end

    Crawler->>DB: Batch insert (files + locations + nav)
    DB-->>User: Stats: 1,247 files, 892 new, 355 deduped

Database Schema

erDiagram
    files ||--o{ locations : "content_hash"
    files ||--o| file_metadata : "content_hash"

    files {
        text content_hash PK "BLAKE3 / SHA-256"
        bigint file_size
        text partial_hash "First 4MB hash"
        text hash_algorithm
        text mime_type
        text sonar_format "xtf, jsf, s7k, ..."
        timestamp first_seen
        timestamp last_seen
    }

    locations {
        bigint id PK
        text content_hash FK
        text nfs_server "Origin NFS server"
        text nfs_export "Server export path"
        text remote_path "Full path on origin"
        text canonical_path UK "server:/full/path"
        boolean is_local
        text access_path "Local mount path"
        text access_hostname "Crawling machine"
        text mount_source "proc, autofs, fstab"
        text file_name
        text directory
        timestamp mtime
    }

    file_metadata {
        text content_hash PK
        text metadata "JSON (track, properties)"
        real lat_min
        real lat_max
        real lon_min
        real lon_max
        real lat_center
        real lon_center
        integer has_nav
    }

    hosts {
        text ip_address PK
        text hostname
        text discovery_method
        boolean ssh_accessible
        text scan_status
        timestamp last_scan_at
    }

    scans {
        bigint id PK
        text target_path
        text hostname
        text status
        integer files_found
        integer files_new
    }

Key concept: No data is copied. The catalog stores pointers. If the same file appears on 10 NFS servers, there is one row in files and 10 rows in locations, each pointing back to its canonical NFS origin.

Plugin Architecture

Inspired by napari's plugin system. All built-in functionality (format detection, nav extraction, export) is registered through the same hook system that third-party plugins use.

graph LR
    subgraph Discovery["Plugin Discovery"]
        EP["entry_points()"]
        MF["sonar-plugin.yaml"]
    end

    subgraph Manager["Plugin Manager"]
        REG[Register]
        EN[Enable / Disable]
        CALL[Call Hook]
    end

    subgraph Hooks["Hook Specs (7 extension points)"]
        H1["detect_format (firstresult)"]
        H2["extract_nav (firstresult)"]
        H3["get_format_signatures (historic)"]
        H4["get_extension_map (historic)"]
        H5["export_data (firstresult)"]
        H6["get_export_formats (historic)"]
        H7["register_web_routes (historic)"]
    end

    subgraph Builtin["Built-in Plugin"]
        FMT[Format Detection]
        NAV[Nav Extraction]
        EXP[CSV / GeoJSON / JSON]
    end

    subgraph ThirdParty["Third-Party Plugin"]
        TP1["my_sonar_plugin"]
    end

    EP --> REG
    MF --> REG
    REG --> EN
    EN --> CALL
    CALL --> Hooks

    FMT --> H1
    FMT --> H3
    FMT --> H4
    NAV --> H2
    EXP --> H5
    EXP --> H6
    TP1 -.-> H1
    TP1 -.-> H2

Hook Modes

Mode	Behavior	Example
firstresult	First plugin to return non-`None` wins (short-circuits)	Format detection, nav extraction
historic	All plugins contribute results (collected into a list)	Signature registration, export formats

Writing a Plugin

Create a Python package with an entry point:

# pyproject.toml
[project.entry-points."sonar_catalog.plugins"]
my_plugin = "my_sonar_plugin"

# my_sonar_plugin/__init__.py
def register(manager):
    manager.register_plugin(
        name="my-plugin",
        version="1.0.0",
        description="Adds FooBar sonar format support",
    )

    manager.register_hook_impl(
        "detect_format", "my-plugin", detect_foobar
    )
    manager.register_hook_impl(
        "extract_nav", "my-plugin", extract_foobar_nav
    )


def detect_foobar(file_path=None, header=None, extension=None, **kw):
    if header and header[:4] == b"\xFO\x0B\xAR":
        return "foobar"
    return None


def extract_foobar_nav(file_path=None, sonar_format=None, **kw):
    if sonar_format != "foobar":
        return None
    # ... parse binary nav data ...
    from sonar_catalog.extractors.base import NavResult
    return NavResult(track=[[56.0, 3.0], [56.1, 3.1]], source="foobar_nav")

Or use a YAML manifest instead of code:

# sonar-plugin.yaml
name: my-sonar-plugin
version: 1.0.0
contributions:
  formats:
    - name: foobar
      extensions: [".fb", ".fbar"]
      magic_bytes: "f00bar"
      magic_offset: 0
  nav_extractors:
    - format: foobar
      python_name: my_package.extractors:FooBarExtractor

NFS Mount Resolution

The mount resolver determines the canonical origin of every file by cross-referencing multiple mount sources:

flowchart TD
    FILE["/mnt/sonar01/project/line_001.xtf"]

    subgraph Sources["Mount Sources (loaded once)"]
        PROC["/proc/mounts"]
        FSTAB["/etc/fstab"]
        AUTOFS["automount -m"]
        MOUNT["mount command"]
    end

    PROC --> TABLE
    FSTAB --> TABLE
    AUTOFS --> TABLE
    MOUNT --> TABLE

    TABLE["Mount Table"]

    FILE --> MATCH{"Longest prefix match"}
    TABLE --> MATCH

    MATCH --> |"/mnt/sonar01 → nas01:/export/survey"| RESULT

    RESULT["Canonical: nas01:/export/survey/project/line_001.xtf"]

    style RESULT fill:#065f46,stroke:#10b981,color:#fff

This means the same file accessed from different machines (different local mount points) resolves to the same canonical path, enabling cross-machine deduplication.

Host Discovery

flowchart LR
    subgraph Methods["Discovery Methods"]
        A["autofs maps"]
        B["showmount -e"]
        C["ip neigh (ARP)"]
        D["Subnet scan"]
    end

    A --> MERGE
    B --> MERGE
    C --> MERGE
    D --> MERGE

    MERGE["Merge & Deduplicate\n(by IP)"]

    MERGE --> FILTER["Apply hostname\npatterns & blacklist"]

    FILTER --> SSH["SSH Probe\n(parallel)"]

    SSH --> |accessible| READY["Ready to Crawl"]
    SSH --> |unreachable| SKIP["Skipped"]

    style READY fill:#065f46,stroke:#10b981,color:#fff
    style SKIP fill:#7f1d1d,stroke:#ef4444,color:#fff

Nav Extraction Pipeline

flowchart TD
    FILE["Sonar File"]

    FILE --> CHECK{"Format?"}

    CHECK --> |".jsf"| JSF["JSF Parser\n(Message type 80: NMEA)"]
    CHECK --> |".xtf"| XTF["XTF Parser\n(PingHeader lat/lon)"]
    CHECK --> |any| SIDECAR["Sidecar Search\n({stem}.nav, {stem}.csv)"]

    JSF --> NMEA["NMEA Parser\n(GGA, RMC, GLL)"]
    NMEA --> TRACK
    XTF --> TRACK
    SIDECAR --> |found| CSV["CSV / JSON\ncompanion parser"]
    CSV --> TRACK
    SIDECAR --> |not found| NONE["No nav data"]

    TRACK["Raw Track\n[[lat,lon], ...]"]

    TRACK --> DS["Downsample\n(max 1000 points)"]
    DS --> BOUNDS["Compute Bounds\n(bbox + center)"]
    BOUNDS --> DB["Store in\nfile_metadata"]

    style DB fill:#065f46,stroke:#10b981,color:#fff
    style NONE fill:#7f1d1d,stroke:#ef4444,color:#fff

Two-Pass Deduplication

The hasher uses a two-pass strategy to minimize I/O on large files:

flowchart LR
    FILE["File\n(500 MB)"] --> P1["Pass 1: Partial Hash\n(first 4 MB → BLAKE3)"]

    P1 --> CHECK{"Fingerprint\nknown?"}

    CHECK --> |"New"| P2["Pass 2: Full Hash\n(all 500 MB → BLAKE3)"]
    CHECK --> |"Seen before"| SKIP["Skip\n(already cataloged)"]

    P2 --> INSERT["Insert file +\nlocation"]

    style SKIP fill:#065f46,stroke:#10b981,color:#fff
    style INSERT fill:#1e3a5f,stroke:#3b82f6,color:#fff

For a typical NFS share with 80% unchanged files, this skips reading ~80% of bytes on incremental scans.

Web Interface

graph TB
    subgraph Flask["Flask Application"]
        VBP["Views Blueprint\n/ and /globe"]
        ABP["API Blueprint\n/api/*"]
        PLUG["Plugin Routes\n(register_web_routes hook)"]
    end

    subgraph Pages["Pages"]
        SEARCH["Search UI\n(Tailwind CSS)"]
        GLOBE["CesiumJS Globe\n(3D track visualization)"]
    end

    subgraph Endpoints["REST API"]
        E1["GET /api/search"]
        E2["GET /api/stats"]
        E3["GET /api/files/:hash"]
        E4["GET /api/duplicates"]
        E5["GET /api/hosts"]
        E6["GET /api/geo/points"]
        E7["GET /api/geo/track/:hash"]
        E8["GET /api/geo/bounds"]
        E9["GET /api/servers"]
        E10["GET /api/formats"]
    end

    VBP --> SEARCH
    VBP --> GLOBE
    ABP --> Endpoints
    GLOBE --> E6
    GLOBE --> E7
    GLOBE --> E8
    SEARCH --> E1
    SEARCH --> E2

    style GLOBE fill:#1e3a5f,stroke:#3b82f6,color:#fff

The globe view renders file locations as color-coded markers (by sonar format) on a 3D CesiumJS globe. Clicking a marker loads and displays the navigation track as a polyline.

Supported Sonar Formats

Format	Extensions	Detection	Nav Extraction
XTF (eXtended Triton Format)	`.xtf`	Magic bytes `0x0100`	PingHeader lat/lon fields
JSF (EdgeTech)	`.jsf`	Magic bytes `0x1616`	NMEA from message type 80
Kongsberg .all	`.all`	Magic bytes `0x49`	-
Kongsberg KMALL	`.kmall`	Magic bytes `0x4B4D`	-
Reson S7K	`.s7k`	Extension	-
BAG (Bathymetric Attributed Grid)	`.bag`	Extension	-
SEG-Y	`.sgy`, `.segy`	Extension	-
Humminbird	`.db`	Extension	-
Lowrance	`.sl2`, `.sl3`	Extension	-
Raw	`.raw`	Extension	-
Custom	configurable	Magic bytes + extension	Via plugin

Add custom formats at runtime:

sonar-catalog add-magic-byte sample.myformat --format-name my_sonar --length 4

Installation

# Minimal (SQLite only, no external dependencies)
pip install .

# With all optional features
pip install ".[all]"

# Individual extras
pip install ".[web]"       # Flask web UI + CesiumJS globe
pip install ".[blake3]"    # BLAKE3 hashing (faster than SHA-256)
pip install ".[postgres]"  # PostgreSQL backend
pip install ".[magic]"     # python-magic MIME detection

# Development
pip install -e ".[dev,all]"

Zero required dependencies. The core runs on Python 3.10+ stdlib only (sqlite3, hashlib, subprocess, socket).

Quick Start

# Initialize the database
sonar-catalog init

# Option A: Crawl a real NFS mount
sonar-catalog crawl /mnt/sonar-nas-01

# Option B: Load synthetic demo data (no real sonar files needed)
sonar-catalog demo --num-files 100

# Extract navigation tracks from cataloged files
sonar-catalog extract-nav

# Explore
sonar-catalog stats
sonar-catalog search "line_001"
sonar-catalog dupes
sonar-catalog where <content_hash>

# Start the web UI
sonar-catalog web --port 8080
# Open http://localhost:8080        (search)
# Open http://localhost:8080/globe  (3D globe)

Auto-Discovery Mode

# Discover all NFS hosts on the network, then crawl everything
sonar-catalog discover --deep
sonar-catalog crawl-all

CLI Reference

Command	Description
`init`	Initialize database and config
`discover`	Discover NFS hosts and mounts on the network
`crawl <path>`	Crawl and catalog a directory
`crawl-all`	Discover and crawl all accessible hosts
`search`	Search the catalog (supports FTS5)
`dupes`	Find duplicate files across servers
`where <hash>`	Show all locations for a file by content hash
`stats`	Show catalog statistics
`hosts`	List discovered hosts
`extract-nav`	Extract navigation data from cataloged sonar files
`export`	Export data to CSV, GeoJSON, or JSON
`demo`	Load synthetic demo data for exploring the UI
`plugins list`	List installed plugins and their hooks
`plugins enable/disable <name>`	Enable or disable a plugin
`add-magic-byte <file>`	Learn a new format's magic bytes from a sample file
`list-magic-bytes`	Show all registered magic byte signatures
`rebuild-index`	Rebuild the FTS5 search index
`config`	Show current configuration as JSON
`web`	Start the Flask web interface

Configuration

Configuration lives at ~/.config/sonar-catalog/config.json. All fields are optional — defaults work out of the box.

{
  "database": {
    "backend": "sqlite",
    "sqlite_path": "~/.local/share/sonar-catalog/catalog.db"
  },
  "discovery": {
    "ssh_user": "survey",
    "ssh_timeout": 3,
    "hostname_patterns": ["sonar-*", "nas-*"],
    "use_autofs": true,
    "use_showmount": true
  },
  "crawler": {
    "hash_algorithm": "blake3",
    "partial_hash_size": 4194304,
    "hash_workers": 4,
    "batch_size": 1000,
    "incremental": true
  },
  "metadata": {
    "sonar_extensions": [".xtf", ".jsf", ".s7k", ".all", ".kmall", ".bag"],
    "nav_extraction": {
      "enabled": true,
      "max_track_points": 1000,
      "sidecar_patterns": [
        {
          "pattern": "{stem}.nav",
          "format": "csv",
          "lat_field": "lat",
          "lon_field": "lon"
        }
      ]
    }
  },
  "plugins": {
    "disabled_plugins": []
  }
}

Project Structure

sonar_catalog/
    __init__.py
    cli.py                 # CLI entry point (18 subcommands)
    config.py              # Dataclass-based configuration
    database.py            # SQLite + PostgreSQL dual backend
    search.py              # FTS5-powered search engine
    crawler.py             # Filesystem walker + batch processor
    hasher.py              # Two-pass BLAKE3/SHA-256 deduplication
    discovery.py           # Network host + NFS mount discovery
    mount_resolver.py      # NFS mount topology resolution
    demo.py                # Synthetic data generator

    extractors/
        __init__.py        # Dispatcher (plugin-aware)
        base.py            # NavExtractor ABC + NavResult dataclass
        jsf.py             # EdgeTech JSF binary parser
        xtf.py             # eXtended Triton Format parser
        nmea.py            # GGA/RMC/GLL sentence parser
        sidecar.py         # Companion file reader (CSV/JSON)

    plugins/
        __init__.py        # Global singleton + initialize/reset
        hooks.py           # HookSpec with firstresult + historic modes
        manager.py         # PluginManager (discover, register, enable/disable)
        manifest.py        # YAML manifest parser (sonar-plugin.yaml)
        builtin/
            __init__.py    # Built-in plugin registration
            formats.py     # Magic byte signatures + extension map
            nav.py         # Nav extraction hook
            exporters.py   # CSV, GeoJSON, JSON export hooks

    web/
        __init__.py        # Flask app factory
        api.py             # REST API (10 endpoints)
        views.py           # Page routes (/ and /globe)
        templates/
            base.html      # Tailwind CSS layout
            index.html     # Search + stats page
            globe.html     # CesiumJS 3D globe
        static/
            app.js         # Search UI logic
            app.css        # Custom styles
            globe.js       # Globe rendering + track display

tests/                     # 396 tests, 80% coverage

Testing

# Run all tests
pytest

# With coverage
pytest --cov=sonar_catalog --cov-report=term-missing

# Specific module
pytest tests/test_extractors.py -v

396 tests covering the full stack: CLI dispatch, database operations, NFS resolution, format detection, nav extraction, plugin lifecycle, web API, and export hooks.

License

MIT

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

geraci

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.2.0

Feb 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sonar_catalog-0.2.0.tar.gz (1.9 MB view details)

Uploaded Feb 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sonar_catalog-0.2.0-py3-none-any.whl (1.6 MB view details)

Uploaded Feb 13, 2026 Python 3

File details

Details for the file sonar_catalog-0.2.0.tar.gz.

File metadata

Download URL: sonar_catalog-0.2.0.tar.gz
Upload date: Feb 13, 2026
Size: 1.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sonar_catalog-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`59951ece2edd02d3a8c0acfbace8482d65ea1c6b847d99f669263023d9608592`
MD5	`2a9171a24328c2c07799e297a6ec4be2`
BLAKE2b-256	`bbf2c1874e1bd27f3443741e784751bc7757397d89f5d31bbc794f56693dc04f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for sonar_catalog-0.2.0.tar.gz:

Publisher: publish.yml on brandonrc/sonar-catalog

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sonar_catalog-0.2.0.tar.gz
- Subject digest: 59951ece2edd02d3a8c0acfbace8482d65ea1c6b847d99f669263023d9608592
- Sigstore transparency entry: 948033480
- Sigstore integration time: Feb 13, 2026
Source repository:
- Permalink: brandonrc/sonar-catalog@71bdd68a27b3a68b86c452ae1805a2dd6202101c
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/brandonrc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@71bdd68a27b3a68b86c452ae1805a2dd6202101c
- Trigger Event: push

File details

Details for the file sonar_catalog-0.2.0-py3-none-any.whl.

File metadata

Download URL: sonar_catalog-0.2.0-py3-none-any.whl
Upload date: Feb 13, 2026
Size: 1.6 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sonar_catalog-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0611eda2ee3a280c7a09872057663e75d70a29a044573cfec418d30270a6a1d2`
MD5	`94dbea7f67cfd9714730dd2386298cd2`
BLAKE2b-256	`c7afbefa217157e8849a413a3b6ccbd990bc3eeaa8c1f7d8750ccbe9758336a7`

See more details on using hashes here.

Provenance

The following attestation bundles were made for sonar_catalog-0.2.0-py3-none-any.whl:

Publisher: publish.yml on brandonrc/sonar-catalog

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sonar_catalog-0.2.0-py3-none-any.whl
- Subject digest: 0611eda2ee3a280c7a09872057663e75d70a29a044573cfec418d30270a6a1d2
- Sigstore transparency entry: 948033520
- Sigstore integration time: Feb 13, 2026
Source repository:
- Permalink: brandonrc/sonar-catalog@71bdd68a27b3a68b86c452ae1805a2dd6202101c
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/brandonrc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@71bdd68a27b3a68b86c452ae1805a2dd6202101c
- Trigger Event: push

sonar-catalog 0.2.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Sonar Catalog

Architecture

Data Flow

Database Schema

Plugin Architecture

Hook Modes

Writing a Plugin

NFS Mount Resolution

Host Discovery

Nav Extraction Pipeline

Two-Pass Deduplication

Web Interface

Supported Sonar Formats

Installation

Quick Start

Auto-Discovery Mode

CLI Reference

Configuration

Project Structure

Testing

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance