
CLI to inspect and normalize messy data files into clean tables


filelens

Parse cXML, XML, JSON, and messy data files into clean tables — in one command.

Most tools give you an XML tree. filelens gives you rows you can actually use.

filelens is a CLI that helps you understand and clean messy data files. Most of the time, the hardest part is just understanding the file.

Ever opened a file where:

  • headers start on row 6
  • metadata is mixed with data
  • columns are inconsistent

filelens lets you:

  • inspect structure and issues
  • infer a schema
  • convert to a clean table (Parquet)

No config. No guessing. Deterministic output. Built for real-world data engineering workflows.

Quick start

Install with pip:

pip install filelens

Install with Homebrew:

brew tap kraftaa/filelens
brew install filelens

Use it:

filelens inspect file.csv
filelens convert file-or-folder --out-dir output/

Example

Before

Metadata + mixed rows + unclear structure:

Metadata: Device=LabX
Date: 2024-01-01

Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL

Inspect:

filelens inspect sample.csv


Output:

Detected:
- header row: 4
- metadata rows: 1-3
- columns: 3

Warnings:
- none

One command

filelens convert sample.csv --out sample.parquet

What it does:

  • detects structure
  • skips metadata
  • infers schema
  • writes sample.parquet

After

Clean table:

sample_id | value | unit
S1        | 0.45  | mg/mL
S2        | 0.50  | mg/mL

Read in pandas:

import pandas as pd
df = pd.read_parquet("sample.parquet")
df.head()

That's it

For most use cases, you only need:

filelens inspect file.csv
filelens inspect file.cxml
filelens convert file-or-folder --out-dir output/

Everything below is optional (advanced formats, dbt integration, pipelines).

Default workflow

Golden path:

filelens convert <file-or-folder> --out-dir output/

What it does:

  • auto-detects parser from extension/content
  • uses the cXML default mode mapped (canonical columns)
  • writes parquet output files
  • appends _source_file, _source_kind, _record_id columns
  • writes sidecar report: output/_filelens_report.json (schema + warnings + status per file)
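The sidecar report can also be consumed programmatically, e.g. to flag files that converted with warnings. A minimal sketch, assuming a per-file list with file, status, and warnings fields (the exact JSON layout of _filelens_report.json is not documented here, so treat these field names as illustrative):

```python
import json

def files_with_warnings(report: dict) -> list:
    """Return source files whose conversion produced warnings."""
    return [
        entry["file"]
        for entry in report.get("files", [])
        if entry.get("warnings")
    ]

# Hypothetical report shaped like output/_filelens_report.json;
# in practice: report = json.load(open("output/_filelens_report.json"))
report = {
    "files": [
        {"file": "sample.csv", "status": "ok", "warnings": []},
        {"file": "po.cxml", "status": "ok", "warnings": ["unmapped field: x_comment"]},
    ]
}
flagged = files_with_warnings(report)
```

Adjust the key names to whatever the real report emits.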

If your cXML needs extra nested fields, use:

filelens convert po.cxml --parser cxml --cxml-mode both --out-dir output/

Folder-specific command:

filelens batch examples/public/trade --out-dir output/trade

batch prints:

  • files processed / succeeded / failed
  • total rows written
  • top warnings
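A summary like the one batch prints can be reproduced from per-file results. A sketch with a hypothetical result list (the field names are illustrative, not filelens's API):

```python
from collections import Counter

def summarize(results):
    """Aggregate per-file conversion results into a batch-style summary."""
    succeeded = [r for r in results if r["status"] == "ok"]
    failed = [r for r in results if r["status"] != "ok"]
    warnings = Counter(w for r in results for w in r.get("warnings", []))
    return {
        "processed": len(results),
        "succeeded": len(succeeded),
        "failed": len(failed),
        "total_rows": sum(r.get("rows", 0) for r in succeeded),
        "top_warnings": warnings.most_common(3),
    }

results = [
    {"status": "ok", "rows": 120, "warnings": []},
    {"status": "ok", "rows": 80, "warnings": ["ragged row padded"]},
    {"status": "error", "rows": 0, "warnings": ["unreadable header"]},
]
summary = summarize(results)
```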

Supported inputs

Supports common messy data formats used in analytics and healthcare.

  • Excel / CSV (messy tabular files): .xlsx, .xlsm, .xls, .csv, .tsv, .psv, .txt
  • JSON (nested data): .json, .ndjson
  • XML (including cXML / CDA / NAACCR): .xml, .cxml, .xcml
  • HL7 (basic extraction): .hl7, .msg
  • Compressed text variants: .gz wrappers for supported text formats
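The extension side of auto-detection can be sketched as a lookup table. This is an illustrative map built from the list above, not filelens's actual dispatch logic (which also sniffs file content, omitted here):

```python
from pathlib import Path

# Illustrative extension -> parser map derived from the supported-inputs list.
PARSERS = {
    ".xlsx": "excel", ".xlsm": "excel", ".xls": "excel",
    ".csv": "table", ".tsv": "table", ".psv": "table", ".txt": "table",
    ".json": "json", ".ndjson": "json",
    ".xml": "xml", ".cxml": "cxml", ".xcml": "cxml",
    ".hl7": "hl7", ".msg": "hl7",
    ".ttl": "rdf",
}

def guess_parser(path: str) -> str:
    """Guess a parser from the file extension, unwrapping a .gz suffix."""
    suffixes = Path(path).suffixes
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]          # .csv.gz -> treat as .csv
    ext = suffixes[-1].lower() if suffixes else ""
    return PARSERS.get(ext, "unknown")
```

For ambiguous cases (e.g. a .xml file that is really CDA), the --parser override shown below is the reliable route.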

Design

  • deterministic
  • no config required
  • optimized for messy real-world files

When to use filelens

  • You opened a file and do not understand its structure
  • Your Excel export has metadata rows and broken headers
  • You need to convert XML/JSON into a table quickly
  • You want clean input for dbt or a data warehouse

More examples

Inspect:

filelens inspect data/file.xlsx
filelens inspect data/order.cxml
filelens inspect data/patient-example.json
filelens inspect data/oru_r01.msg
filelens inspect data/clinical.xml
filelens inspect data/patient-example.ttl
filelens inspect data/patient-example.ttl.html

Schema:

filelens schema data/file.xlsx
filelens schema data/patient-example.json --parser fhir

Convert:

filelens convert data/file.xlsx --out data/file.parquet
filelens convert data/order.cxml --out data/order.parquet
filelens convert data/nested_lab_result.json --out data/nested_lab_result.parquet
filelens convert data/oru_r01.msg --out data/oru_r01.parquet
filelens convert data/patient-example.ttl --out data/patient-example.ttl.parquet

Optional parser override:

filelens inspect data/file.xml --parser cda
filelens inspect data/file.json --parser json
filelens inspect data/file.json --parser fhir
filelens inspect data/file.msg --parser hl7
filelens inspect data/file.ttl --parser rdf

CXML extraction mode:

cXML mode controls which columns are emitted:

  • mapped (default): canonical columns only (order_id, line_number, quantity, ...)
  • auto: extracted path-based columns only (x_*)
  • both: union of mapped + auto

If you do not pass --cxml-mode, filelens uses mapped.

# curated canonical fields only
filelens schema data/order.cxml --parser cxml --cxml-mode mapped

# path-based auto-captured fields only (x_* columns)
filelens schema data/order.cxml --parser cxml --cxml-mode auto

# both canonical + path-based fields
filelens convert data/order.cxml --parser cxml --cxml-mode both --out data/order.parquet
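After a both conversion you can separate canonical from auto-captured fields using the x_ prefix convention described above. A sketch with an illustrative column list (in practice, read the real columns from your parquet file, e.g. pd.read_parquet("data/order.parquet").columns):

```python
def split_columns(columns):
    """Partition columns into canonical (mapped) vs auto-captured (x_*) fields."""
    mapped = [c for c in columns if not c.startswith("x_")]
    auto = [c for c in columns if c.startswith("x_")]
    return mapped, auto

# Illustrative column names; x_* names here are hypothetical.
cols = ["order_id", "line_number", "quantity", "x_comments_text", "x_shipto_name"]
mapped, auto = split_columns(cols)
```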

If running from source, use ./target/release/filelens instead of filelens.

Optional: use with dbt

filelens outputs Parquet files that can be loaded into warehouses and modeled with dbt.

Use it in this order:

  1. Convert files to parquet.
  2. Load parquet into Postgres raw.filelens_lines.
  3. Run dbt models.
  4. Query typed marts.

Setup env vars:

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

One-command local pipeline (public examples only):

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

What this command does:

  • loads parquet into raw.filelens_lines
  • syncs raw into raw_procurement and raw_clinical
  • runs staging models
  • runs marts (including typed marts)
  • runs tests
  • prints row counts and next query hints

Which tables to query:

  • analytics_marts.fct_procurement_lines for procurement analytics
  • analytics_marts.fct_fhir_resources for FHIR analytics
  • analytics_marts.fct_naaccr_cases for NAACCR analytics
  • analytics_marts.fct_record_attributes for generic key/value search across all extracted attributes

analytics_registry.idx_filelens_records is a cross-format registry/index table (lineage + canonical fields). It is not the primary end-user analytics table.

Why keep raw -> internal -> marts:

  • raw: ingestion/debug layer (what got loaded)
  • analytics_internal: normalization layer (map parser-specific columns into stable canonical fields)
  • marts: consumption layer (deduped and typed tables for analysts/apps)

Example consumer queries:

select * from analytics_marts.fct_procurement_lines limit 20;
select * from analytics_marts.fct_fhir_resources limit 20;
select * from analytics_marts.fct_naaccr_cases limit 20;
select * from analytics_marts.fct_record_attributes limit 20;

Trace NAACCR attributes back to original source ids:

select
  source_file,
  record_key,
  attribute_scope,
  attribute_source_id,
  attribute_name,
  attribute_value
from analytics_marts.fct_record_attributes
where source_kind = 'naaccr'
  and attribute_source_id in ('grade', 'patientidnumber', 'tumorrecordnumber')
limit 20;

Examples

See examples/ for real sample inputs:

  • procurement (cXML / xCML)
  • healthcare (FHIR, HL7, CDA, NAACCR)
  • RDF/Turtle (.ttl, .ttl.html)
  • messy CSV/TSV/PSV/TXT
  • hard cXML edge-case fixtures for parser testing: examples/hard/cxml

Workflow

What this workflow does:

  • converts only examples/public/** into Parquet under output/public/**
  • loads only output/public/**/*.parquet into Postgres raw.filelens_lines
  • runs dbt staging + marts with --full-refresh (and tests)
  • does not include non-public example paths unless you change the command

Why --full-refresh in this demo workflow:

  • it rebuilds marts from scratch so the demo is deterministic after parser/model changes
  • it avoids stale incremental state while iterating locally
  • for recurring production loads, omit --full-refresh and use incremental dbt runs

Run it:

scripts/convert_inputs.sh --input-dir examples/public --output-dir output/public

export PGHOST=localhost
export PGPORT=5432
export PGUSER=...
export PGPASSWORD=...
export PGDATABASE=postgres
export DBT_PROFILES_DIR=dbt

scripts/auto_load_and_run_dbt.sh --parquet-glob "$PWD/output/public/**/*.parquet" --full-refresh

scripts/convert_inputs.sh uses ./target/release/filelens by default. Use --bin to point to another binary path.

scripts/convert_inputs.sh is non-strict by default (skips failures and continues). Add --strict to fail on first conversion error.
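The non-strict behavior amounts to a catch-and-continue loop. A Python sketch of the same idea (convert here is a stand-in for invoking the filelens binary, not a real API):

```python
def convert_all(paths, convert, strict=False):
    """Convert each input; in non-strict mode, record failures and keep going."""
    failures = []
    for path in paths:
        try:
            convert(path)
        except Exception as exc:
            if strict:
                raise                      # --strict: fail on first error
            failures.append((path, str(exc)))
    return failures

def fake_convert(path):
    # Stand-in for `filelens convert <path>`; fails on one input for the demo.
    if path.endswith("bad.xml"):
        raise ValueError("unparseable root element")

failures = convert_all(["a.csv", "bad.xml", "c.json"], fake_convert)
```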

Advanced formats

  • RDF/Turtle (.ttl, .rdf) — experimental support
  • HTML pages containing RDF/Turtle <pre> blocks (for example *.ttl.html)

Why not pandas?

pandas can read files, but it does not:

  • detect likely header/metadata layout
  • explain quality issues up front
  • normalize mixed file families with one deterministic CLI pass

filelens is focused on that first cleanup step before your pipeline.
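For contrast, here is the hand-written skip logic the sample file from the Example section needs without filelens, using only the stdlib csv module. The header offset is hard-coded after eyeballing the file, which is exactly the guessing filelens automates:

```python
import csv
import io

# The messy sample from the Example section above.
RAW = """Metadata: Device=LabX
Date: 2024-01-01

Sample ID,Value,Unit
S1,0.45,mg/mL
S2,0.50,mg/mL
"""

rows = list(csv.reader(io.StringIO(RAW)))
header_row = 3                      # manually determined: headers start on row 4
header = rows[header_row]
records = [dict(zip(header, r)) for r in rows[header_row + 1:] if r]
```

Every new file layout means re-deriving header_row by hand; filelens inspect detects it for you.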
