Skip to main content

GPU-powered Structured Data De-identification Engine

Project description

Citadel

Citadel is a policy-driven de-identification tool for JSONL training and evaluation data. It reads one JSON object per line, applies a versioned YAML policy, and writes a compact de-identified JSONL file beside the input.

The normal command is:

uv run tacit.citadel policy.yaml input.jsonl

For input.jsonl, Citadel writes:

input.citadel.jsonl

The CLI currently takes exactly two positional arguments: the policy file and the input JSONL file. The output path is always derived from the input path.

Setup

Citadel is packaged as tacit-citadel and exposes the tacit.citadel console script.

The project supports Python >=3.11,<3.14; the checked-in .python-version currently selects Python 3.11.

uv sync

pyjq is a runtime dependency and builds native code when no compatible wheel is available. On macOS, make sure Xcode command line tools and the autotools chain are installed. If setup fails with No such file or directory: 'autoreconf', install the missing tools and rerun uv sync.

brew install autoconf automake libtool

If the pyjq build finds the command line tools but fails with stdlib.h not found, pass the active macOS SDK path into the build:

SDKROOT=$(xcrun --show-sdk-path) uv sync

CUDA-enabled spaCy is available as an optional extra for Linux CPython 3.12:

uv sync --extra cuda

The default install path does not install CUDA spaCy packages.

Usage

Run the sample policy against the sample record:

uv run tacit.citadel policy.yaml sample.jsonl

This creates:

sample.citadel.jsonl

On success, Citadel prints a short report:

output: sample.citadel.jsonl
records processed: 1
fields changed: 7
llm calls: 1
epoch seed: 1787680000

The epoch seed is generated from the current Unix time unless process_file is called directly with an explicit epoch_seed.

Input

Citadel expects JSONL. Each line must be a complete JSON object.

{"client_id":"007","intake_details":{"date":"2026-01-05","weight":102.4}}

Non-object JSONL lines fail the run. Citadel processes records in chunks of 50 and writes one compact JSON object per output line.

Policy

Policy files are YAML mappings validated with Pydantic. Extra fields are rejected.

Required top-level fields:

version: 1
name: nourish-intake-and-trajectory
description: De-identification policy for Nourish-style records.

llm:
  base_url: http://127.0.0.1:8000/v1
  model: google/gemma-4-12B-it-qat-w4a16-ct
  temperature: 1.0
  top_p: 0.95
  top_k: 64

rules:
  - path: .client_id
    action: drop

Each rule has:

- path: .jq.selector
  action: drop
  required: true
  params: {}

path is a jq selector. Citadel resolves selectors through pyjq and applies actions to the concrete JSON locations returned by path(...).

required defaults to true. If a required rule matches nothing, the run fails. Use required: false for sparse paths that are absent from some records.

Actions

Citadel currently supports four actions.

drop

Removes the matched object field.

- path: .client_id
  action: drop

drop only deletes object fields. It does not remove array elements.

fuzz_number

Shifts numeric values while preserving approximate modelling signal. Boolean and non-numeric values are rejected.

Percent mode:

- path: .intake_details.weight
  action: fuzz_number
  params:
    mode: percent
    max_percent: 5
    precision: 1

Range mode:

- path: .intake_details.age
  action: fuzz_number
  params:
    mode: range
    min_delta: -2
    max_delta: 2
    step: 1

The random generator is seeded once per run. Integer inputs stay integers when the fuzzed value is integral.

date_offset

Replaces an absolute date with a human-readable offset from an anchor date in the same record.

- path: .trajectories[] | select(.type == "set_target").date
  action: date_offset
  required: false
  params:
    anchor_path: .intake_details.date
    output: human_relative

Supported output strings are:

same day
N day after
N days after
N day before
N days before

Date values must be strings accepted by Python's ISO date/datetime parser.

llm_rewrite

Queues selected string fields for rewriting through an OpenAI-compatible chat completion endpoint.

- path: .trajectories[] | select(.type == "messages").thread[].content
  action: llm_rewrite
  required: false
  params:
    system_prompt: You are a high-recall sensitive-data anonymizer.
    user_prompt: |
      Rewrite the INPUT text by replacing sensitive values with typed
      placeholders. Return only the rewritten text.

      INPUT
      {{content}}

Only the matched field value is sent to the model. {{content}} in the system or user prompt is replaced with that selected text.

The LLM client uses the policy's llm.base_url, llm.model, temperature, top_p, and top_k. The API key is set to not-needed, which matches local OpenAI-compatible servers such as vLLM.

Within a run, duplicate source text is rewritten once and reused from an in-memory cache. Cache misses in the same chunk are submitted concurrently.

If a rewrite request fails or is cancelled, Citadel writes <LLM_REWRITE_FAILED> into that field and continues the run.

To smoke-test a local rewrite server directly:

uv run python -m tacit_citadel.llm \
  --base-url http://127.0.0.1:8000/v1 \
  --model google/gemma-4-12B-it-qat-w4a16-ct \
  --text "Hi Jamie, your appointment is on January 12."

Processing Model

For each run, Citadel:

  1. Validates the policy YAML.
  2. Opens the input JSONL file.
  3. Parses each line as a JSON object.
  4. Applies policy rules in order.
  5. Resolves jq selectors to concrete JSON locations.
  6. Queues and runs LLM rewrites for each 50-record chunk.
  7. Writes compact JSONL to a temporary output file.
  8. Atomically replaces the derived output path after the full run succeeds.
  9. Prints a short report.

If a fatal error occurs before replacement, Citadel deletes the temporary file. An existing output file is preserved.

Failure Behavior

Citadel fails the run for:

  • missing policy or input files
  • invalid policy YAML or unsupported policy fields
  • invalid JSONL
  • JSONL lines that are not objects
  • invalid jq selectors
  • unmatched required rule paths
  • action type errors, such as applying fuzz_number to a string
  • invalid or missing date_offset anchors

LLM rewrite request failures are nonfatal. The failed field is replaced with <LLM_REWRITE_FAILED> and processing continues.

Development

Run the test suite:

uv run pytest

Run the lightweight checks:

uv run ruff check .
uv run ty check

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tacit_citadel-0.1.0.tar.gz (17.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tacit_citadel-0.1.0-py3-none-any.whl (12.2 kB view details)

Uploaded Python 3

File details

Details for the file tacit_citadel-0.1.0.tar.gz.

File metadata

  • Download URL: tacit_citadel-0.1.0.tar.gz
  • Upload date:
  • Size: 17.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for tacit_citadel-0.1.0.tar.gz
Algorithm Hash digest
SHA256 bc00dae5b9a70905e4a0c9ffb8c668ea0154c1d6bd9b0dc46e497a6a8d331ee4
MD5 5ca7b417b725c6830d9c92ca227d13e8
BLAKE2b-256 8900c86d610cee138693344f59e61cc1fa9abd8313234494e4f3b5d467055544

See more details on using hashes here.

File details

Details for the file tacit_citadel-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tacit_citadel-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 12.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for tacit_citadel-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7dcd61c045ba12f51c02101fa592ae262ba4d2f5a01685c9f64eee9ba3b06fd4
MD5 de9eb0a47961d96990208e3462f5e2f6
BLAKE2b-256 2fcc5d103e32535f2a85e1149d6f13cdfc286ce97b84ca12ae7b66411ee77352

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page