CLI to inspect and normalize messy data files into clean tables
filelens
Turn messy files into clean tables in one command.
filelens is a CLI that helps you understand and clean messy data files.
Ever opened a file where:
- headers start on row 6
- metadata is mixed with data
- columns are inconsistent
filelens lets you:
- inspect structure and issues
- infer a schema
- convert to a clean table (Parquet)
No config. No guessing. Deterministic output.
Quick start
filelens inspect file.csv
filelens convert file.csv --out file.parquet
Install (pip)
Build and install locally:
pip install .
pip install . builds from source and requires Rust/Cargo on your machine.
Build a distributable wheel:
maturin build -b bin --release -o dist
pip install dist/filelens-*.whl
Example
Before
Metadata + mixed rows + unclear structure:
Metadata: Device=LabX
Date: 2024-01-01
Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL
Inspect:
filelens inspect sample.csv
Output:
Detected:
- header row: 4
- metadata rows: 1-3
- columns: 3
Warnings:
- none
One command
filelens convert sample.csv --out sample.parquet
What it does:
- detects structure
- skips metadata
- infers schema
- writes sample.parquet
After
Clean table:
sample_id | value | unit
S1 | 0.45 | mg/mL
S2 | 0.50 | mg/mL
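The Before/After example above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not filelens's actual detection logic: it assumes the table width is the most common field count in the file and that the header is the first row with that width. The names `normalize_name` and `clean_table` are hypothetical.

```python
import csv
import io
import re
from collections import Counter


def normalize_name(name: str) -> str:
    """Lower-case a header and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_table(text: str) -> tuple[list[str], list[list[str]]]:
    """Return (columns, data rows), skipping leading metadata lines."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    # Assumed heuristic: the table width is the most common field count,
    # and the header is the first row that has that width.
    width = Counter(len(row) for row in rows).most_common(1)[0][0]
    header_index = next(i for i, row in enumerate(rows) if len(row) == width)
    columns = [normalize_name(cell) for cell in rows[header_index]]
    data = [row for row in rows[header_index + 1:] if len(row) == width]
    return columns, data


SAMPLE = """Metadata: Device=LabX
Date: 2024-01-01
Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL
"""

columns, data = clean_table(SAMPLE)
print(" | ".join(columns))
for row in data:
    print(" | ".join(row))
```

A real detector has to handle cases this sketch ignores, such as metadata lines that happen to contain the delimiter or tables with ragged rows.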
Supported inputs
Supports common messy data formats used in analytics and healthcare.
- Excel / CSV (messy tabular files): .xlsx, .xlsm, .xls, .csv, .tsv, .psv, .txt
- JSON (nested data): .json, .ndjson
- XML (including cXML / CDA / NAACCR): .xml, .cxml, .xcml
- HL7 (basic extraction): .hl7, .msg
- Compressed text variants: .gz wrappers for supported text formats
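To pre-filter a directory before calling filelens, the extension list above can be encoded as a lookup. The sketch below is illustrative only: the family names, the `input_family` helper, and the rule that gzipped Excel files are rejected (because Excel is not a text format) are assumptions, not filelens behavior.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported-inputs list; family names are made up here.
FAMILIES = {
    "tabular": {".xlsx", ".xlsm", ".xls", ".csv", ".tsv", ".psv", ".txt"},
    "json": {".json", ".ndjson"},
    "xml": {".xml", ".cxml", ".xcml"},
    "hl7": {".hl7", ".msg"},
}
# Assumption: .gz wrappers only make sense around text formats, not Excel workbooks.
BINARY_EXTENSIONS = {".xlsx", ".xlsm", ".xls"}


def input_family(filename: str) -> Optional[str]:
    """Return the format family for a file name, or None if it is not listed."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        suffixes = suffixes[:-1]
    if not suffixes:
        return None
    ext = suffixes[-1]
    if gzipped and ext in BINARY_EXTENSIONS:
        return None
    for family, extensions in FAMILIES.items():
        if ext in extensions:
            return family
    return None


for name in ["data/file.csv.gz", "data/order.cxml", "data/oru_r01.msg", "report.pdf"]:
    print(name, "->", input_family(name))
```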
Design
- deterministic (no AI guessing)
- no config required
- optimized for messy real-world files
When to use filelens
- You opened a file and do not understand its structure
- Your Excel export has metadata rows and broken headers
- You need to convert XML/JSON into a table quickly
- You want clean input for dbt or a data warehouse
Command reference
Inspect:
filelens inspect data/file.xlsx
filelens inspect data/order.cxml
filelens inspect data/patient-example.json
filelens inspect data/oru_r01.msg
filelens inspect data/clinical.xml
filelens inspect data/patient-example.ttl
filelens inspect data/patient-example.ttl.html
Schema:
filelens schema data/file.xlsx
filelens schema data/patient-example.json --parser fhir
Convert:
filelens convert data/file.xlsx --out data/file.parquet
filelens convert data/order.cxml --out data/order.parquet
filelens convert data/nested_lab_result.json --out data/nested_lab_result.parquet
filelens convert data/oru_r01.msg --out data/oru_r01.parquet
filelens convert data/patient-example.ttl --out data/patient-example.ttl.parquet
Optional parser override:
filelens inspect data/file.xml --parser cda
filelens inspect data/file.json --parser json
filelens inspect data/file.json --parser fhir
filelens inspect data/file.msg --parser hl7
filelens inspect data/file.ttl --parser rdf
CXML extraction mode:
# curated canonical fields only
filelens schema data/order.cxml --parser cxml --cxml-mode mapped
# path-based auto-captured fields only (x_* columns)
filelens schema data/order.cxml --parser cxml --cxml-mode auto
# both canonical + path-based fields
filelens convert data/order.cxml --parser cxml --cxml-mode both --out data/order.parquet
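One way to picture the three modes is as a filter on column names, using the x_ prefix mentioned above. This is only a mental model under that assumption: the column names in the example are hypothetical, `select_columns` is not part of filelens, and real output may carry additional columns.

```python
def select_columns(columns: list[str], mode: str) -> list[str]:
    """Pick columns as the mapped / auto / both modes are described above."""
    mapped = [c for c in columns if not c.startswith("x_")]  # curated canonical fields
    auto = [c for c in columns if c.startswith("x_")]        # path-based captured fields
    if mode == "mapped":
        return mapped
    if mode == "auto":
        return auto
    if mode == "both":
        return mapped + auto
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical column names, for illustration only.
example = ["order_id", "x_header_from_identity", "total_amount", "x_request_note"]
print(select_columns(example, "mapped"))
print(select_columns(example, "auto"))
```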
If running from source, use ./target/release/filelens instead of filelens.
Works with dbt
filelens outputs Parquet files that can be loaded into warehouses and modeled with dbt.
Use it in this order:
- Convert files to parquet.
- Load parquet into Postgres raw.filelens_lines.
- Run dbt models.
- Query typed marts.
Setup env vars:
export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt
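A missing variable usually surfaces as a confusing connection error later in the pipeline, so a quick preflight check can help. The sketch below is a hypothetical helper, not part of the filelens scripts; it only checks that the six variables listed above are set and non-empty.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = (
    "PGHOST",
    "PGPORT",
    "PGUSER",
    "PGPASSWORD",
    "PGDATABASE",
    "DBT_PROFILES_DIR",
)


def missing_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env()
if missing:
    print("Set these before running the pipeline:", ", ".join(missing))
else:
    print("Environment looks complete.")
```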
One-command local pipeline (public examples only):
scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh
What this command does:
- loads parquet into raw.filelens_lines
- syncs raw into raw_procurement and raw_clinical
- runs staging models
- runs marts (including typed marts)
- runs tests
- prints row counts and next query hints
Which tables to query:
- analytics_marts.fct_procurement_lines for procurement analytics
- analytics_marts.fct_fhir_resources for FHIR analytics
- analytics_marts.fct_naaccr_cases for NAACCR analytics
- analytics_marts.fct_record_attributes for generic key/value search across all extracted attributes
analytics_registry.idx_filelens_records is a cross-format registry/index table (lineage + canonical fields). It is not the primary end-user analytics table.
Why keep raw -> internal -> marts:
- raw: ingestion/debug layer (what got loaded)
- analytics_internal: normalization layer (map parser-specific columns into stable canonical fields)
- marts: consumption layer (deduped and typed tables for analysts/apps)
Example consumer queries:
select * from analytics_marts.fct_procurement_lines limit 20;
select * from analytics_marts.fct_fhir_resources limit 20;
select * from analytics_marts.fct_naaccr_cases limit 20;
select * from analytics_marts.fct_record_attributes limit 20;
Trace NAACCR attributes back to original source ids:
select
  source_file,
  record_key,
  attribute_scope,
  attribute_source_id,
  attribute_name,
  attribute_value
from analytics_marts.fct_record_attributes
where source_kind = 'naaccr'
  and attribute_source_id in ('grade', 'patientidnumber', 'tumorrecordnumber')
limit 20;
Examples
See examples/ for real sample inputs:
- procurement (cXML/xCML)
- healthcare (FHIR, HL7, CDA, NAACCR)
- RDF/Turtle (.ttl, .ttl.html)
- messy CSV/TSV/PSV/TXT
Build
cargo build --release
Binary path:
./target/release/filelens
Release (GitHub Actions)
Tag-based release:
git tag v0.1.0
git push origin v0.1.0
What happens on tag push (v*):
- builds platform wheels (Linux, macOS Intel/ARM, Windows)
- builds source distribution on Linux
- creates a GitHub Release and uploads dist/* artifacts
Optional PyPI publish:
- configure PyPI Trusted Publisher for this repo (recommended)
- on tag push, distributions are published to PyPI via GitHub OIDC
- or run the Release workflow manually with publish_pypi=true
PyPI Trusted Publisher settings:
- PyPI project -> Manage -> Publishing -> Add a new publisher -> GitHub.
- Set:
  - Owner: <your-github-owner>
  - Repository name: filelens
  - Workflow name: release.yml
  - Environment name: pypi
- Save. No API token secret is required.
Workflow
What this workflow does:
- builds the filelens binary
- converts only examples/public files into parquet under output/public
- loads only output/public/**/*.parquet into Postgres raw tables
- runs dbt staging + marts with --full-refresh (and tests)
- does not include non-public example paths unless you change the command
Why --full-refresh in this demo workflow:
- it rebuilds marts from scratch so the demo is deterministic after parser/model changes
- it avoids stale incremental state while iterating locally
- for recurring production loads, omit --full-refresh and use incremental dbt runs
cargo build --release
scripts/convert_inputs.sh --input-dir examples/public --output-dir output/public
export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt
scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh
scripts/convert_inputs.sh is non-strict by default (skips failures and continues). Add --strict to fail on first conversion error.
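The difference between the two modes can be sketched as a loop that either collects failures or re-raises the first one. This is not the contents of scripts/convert_inputs.sh; `convert_all` and `filelens_convert` are hypothetical helpers, and the only filelens invocation used is the `filelens convert <file> --out <file>` form shown earlier.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable


def filelens_convert(path: Path) -> None:
    """Run `filelens convert` for one file; raises if the command fails."""
    out = path.with_suffix(".parquet")
    subprocess.run(["filelens", "convert", str(path), "--out", str(out)], check=True)


def convert_all(
    inputs: Iterable[Path],
    convert: Callable[[Path], None],
    strict: bool = False,
) -> list[tuple[Path, str]]:
    """Convert every input; return the (path, error) pairs that failed.

    With strict=True the first failure is re-raised instead of being recorded.
    """
    failures: list[tuple[Path, str]] = []
    for path in inputs:
        try:
            convert(path)
        except Exception as exc:
            if strict:
                raise
            failures.append((path, str(exc)))
    return failures
```

Passing `filelens_convert` as the `convert` argument runs the real CLI. Taking the converter as a parameter keeps the loop testable without filelens installed.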
Advanced formats
- RDF/Turtle (.ttl, .rdf): experimental support
- HTML pages containing RDF/Turtle <pre> blocks (for example *.ttl.html)
Why not pandas?
pandas can read files, but it does not:
- detect likely header/metadata layout
- explain quality issues up front
- normalize mixed file families with one deterministic CLI pass
filelens is focused on that first cleanup step before your pipeline.