GPU-powered Structured Data De-identification Engine
Project description
Citadel
Citadel is a policy-driven de-identification tool for JSONL training and evaluation data. It reads one JSON object per line, applies a versioned YAML policy, and writes a compact de-identified JSONL file beside the input.
The normal command is:
uv run tacit.citadel policy.yaml input.jsonl
For input.jsonl, Citadel writes:
input.citadel.jsonl
The CLI currently takes exactly two positional arguments: the policy file and the input JSONL file. The output path is always derived from the input path.
Setup
Citadel is packaged as tacit-citadel and exposes the tacit.citadel console
script.
The project supports Python >=3.11,<3.14; the checked-in .python-version
currently selects Python 3.11.
uv sync
pyjq is a runtime dependency and builds native code when no compatible wheel
is available. On macOS, make sure Xcode command line tools and the autotools
chain are installed. If setup fails with No such file or directory: 'autoreconf', install the missing tools and rerun uv sync.
brew install autoconf automake libtool
If the pyjq build finds the command line tools but fails with stdlib.h not
found, pass the active macOS SDK path into the build:
SDKROOT=$(xcrun --show-sdk-path) uv sync
CUDA-enabled spaCy is available as an optional extra for Linux CPython 3.12:
uv sync --extra cuda
The default install path does not install CUDA spaCy packages.
Usage
Run the sample policy against the sample record:
uv run tacit.citadel policy.yaml sample.jsonl
This creates:
sample.citadel.jsonl
On success, Citadel prints a short report:
output: sample.citadel.jsonl
records processed: 1
fields changed: 7
llm calls: 1
epoch seed: 1787680000
The epoch seed is generated from the current Unix time unless process_file
is called directly with an explicit epoch_seed.
Input
Citadel expects JSONL. Each line must be a complete JSON object.
{"client_id":"007","intake_details":{"date":"2026-01-05","weight":102.4}}
Non-object JSONL lines fail the run. Citadel processes records in chunks of 50 and writes one compact JSON object per output line.
Policy
Policy files are YAML mappings validated with Pydantic. Extra fields are rejected.
Required top-level fields:
version: 1
name: nourish-intake-and-trajectory
description: De-identification policy for Nourish-style records.
llm:
base_url: http://127.0.0.1:8000/v1
model: google/gemma-4-12B-it-qat-w4a16-ct
temperature: 1.0
top_p: 0.95
top_k: 64
rules:
- path: .client_id
action: drop
Each rule has:
- path: .jq.selector
action: drop
required: true
params: {}
path is a jq selector. Citadel resolves selectors through pyjq and applies
actions to the concrete JSON locations returned by path(...).
required defaults to true. If a required rule matches nothing, the run
fails. Use required: false for sparse paths that are absent from some records.
Actions
Citadel currently supports four actions.
drop
Removes the matched object field.
- path: .client_id
action: drop
drop only deletes object fields. It does not remove array elements.
fuzz_number
Shifts numeric values while preserving approximate modelling signal. Boolean and non-numeric values are rejected.
Percent mode:
- path: .intake_details.weight
action: fuzz_number
params:
mode: percent
max_percent: 5
precision: 1
Range mode:
- path: .intake_details.age
action: fuzz_number
params:
mode: range
min_delta: -2
max_delta: 2
step: 1
The random generator is seeded once per run. Integer inputs stay integers when the fuzzed value is integral.
date_offset
Replaces an absolute date with a human-readable offset from an anchor date in the same record.
- path: .trajectories[] | select(.type == "set_target").date
action: date_offset
required: false
params:
anchor_path: .intake_details.date
output: human_relative
Supported output strings are:
same day
N day after
N days after
N day before
N days before
Date values must be strings accepted by Python's ISO date/datetime parser.
llm_rewrite
Queues selected string fields for rewriting through an OpenAI-compatible chat completion endpoint.
- path: .trajectories[] | select(.type == "messages").thread[].content
action: llm_rewrite
required: false
params:
system_prompt: You are a high-recall sensitive-data anonymizer.
user_prompt: |
Rewrite the INPUT text by replacing sensitive values with typed
placeholders. Return only the rewritten text.
INPUT
{{content}}
Only the matched field value is sent to the model. {{content}} in the system
or user prompt is replaced with that selected text.
The LLM client uses the policy's llm.base_url, llm.model, temperature,
top_p, and top_k. The API key is set to not-needed, which matches local
OpenAI-compatible servers such as vLLM.
Within a run, duplicate source text is rewritten once and reused from an in-memory cache. Cache misses in the same chunk are submitted concurrently.
If a rewrite request fails or is cancelled, Citadel writes
<LLM_REWRITE_FAILED> into that field and continues the run.
To smoke-test a local rewrite server directly:
uv run python -m tacit_citadel.llm \
--base-url http://127.0.0.1:8000/v1 \
--model google/gemma-4-12B-it-qat-w4a16-ct \
--text "Hi Jamie, your appointment is on January 12."
Processing Model
For each run, Citadel:
- Validates the policy YAML.
- Opens the input JSONL file.
- Parses each line as a JSON object.
- Applies policy rules in order.
- Resolves jq selectors to concrete JSON locations.
- Queues and runs LLM rewrites for each 50-record chunk.
- Writes compact JSONL to a temporary output file.
- Atomically replaces the derived output path after the full run succeeds.
- Prints a short report.
If a fatal error occurs before replacement, Citadel deletes the temporary file. An existing output file is preserved.
Failure Behavior
Citadel fails the run for:
- missing policy or input files
- invalid policy YAML or unsupported policy fields
- invalid JSONL
- JSONL lines that are not objects
- invalid jq selectors
- unmatched required rule paths
- action type errors, such as applying
fuzz_numberto a string - invalid or missing
date_offsetanchors
LLM rewrite request failures are nonfatal. The failed field is replaced with
<LLM_REWRITE_FAILED> and processing continues.
Development
Run the test suite:
uv run pytest
Run the lightweight checks:
uv run ruff check .
uv run ty check
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tacit_citadel-0.1.0.tar.gz.
File metadata
- Download URL: tacit_citadel-0.1.0.tar.gz
- Upload date:
- Size: 17.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bc00dae5b9a70905e4a0c9ffb8c668ea0154c1d6bd9b0dc46e497a6a8d331ee4
|
|
| MD5 |
5ca7b417b725c6830d9c92ca227d13e8
|
|
| BLAKE2b-256 |
8900c86d610cee138693344f59e61cc1fa9abd8313234494e4f3b5d467055544
|
File details
Details for the file tacit_citadel-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tacit_citadel-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7dcd61c045ba12f51c02101fa592ae262ba4d2f5a01685c9f64eee9ba3b06fd4
|
|
| MD5 |
de9eb0a47961d96990208e3462f5e2f6
|
|
| BLAKE2b-256 |
2fcc5d103e32535f2a85e1149d6f13cdfc286ce97b84ca12ae7b66411ee77352
|