Detect and fix oversized vector database metadata before upsert.
vectormeta
Stop vector DB metadata limit errors before upsert.
Website: https://achal13jain.github.io/vectormeta/
vectormeta is a Python CLI package for detecting and fixing oversized metadata in
vector database records. It scans JSON or JSONL vector records, reports the largest
metadata fields, and can move heavy content fields into local JSON sidecar files while
leaving clean filterable metadata in the vector database payload.
The project is designed for developers preparing records for Pinecone, Chroma, Qdrant, Weaviate, or a custom metadata policy. Pinecone is the clearest strict-limit target in the MVP. Other targets use conservative advisory limits that should be adjusted for each deployment.
Why This Exists
Vector database metadata should usually stay small and filterable:
- source
- page
- section
- doc_id
- chunk_id
- tags
- language
Large payloads such as full chunk text, raw HTML, Markdown, OCR text, summaries, tables,
or full documents can push records over service metadata limits and make upserts fail.
vectormeta catches that problem before upload and can rewrite records into a safer
shape:
vector record metadata -> small filterable fields + content_ref
sidecar JSON file -> large text, HTML, tables, summaries, payloads
Features
- Scan JSON arrays and newline-delimited JSON records.
- Measure metadata using compact UTF-8 JSON bytes.
- Report oversized records, largest fields, byte counts, KB counts, and suggested moves.
- Exit with code 1 when oversized records are found, which makes scans useful in CI.
- Move heavy metadata fields into sidecar JSON files.
- Preserve unknown record fields and original record order.
- Sanitize sidecar filenames derived from record IDs.
- Protect output files and sidecars from accidental overwrite.
- Hydrate records back from sidecar references for debugging and migrations.
- Keep core logic independent from Typer and Rich so it can be tested and reused.
Tech Stack
- Python 3.10+
- Typer for the CLI
- Rich for human-readable terminal reports
- Pydantic for YAML config validation
- PyYAML for config loading
- Pytest for tests
- Ruff for linting and formatting
- Mypy for strict type checks
- Setuptools and python -m build for packaging
Installation
Clone and install locally:
git clone https://github.com/Achal13jain/vectormeta.git
cd vectormeta
pip install -e ".[dev]"
Check the CLI:
vectormeta --help
vectormeta --version
python -m vectormeta --help
After the package is published to PyPI, the intended install command is:
pip install vectormeta
Input Format
JSON array:
[
{
"id": "doc_1_chunk_1",
"values": [0.1, 0.2, 0.3],
"metadata": {
"source": "paper.pdf",
"page": 1,
"chunk_text": "large text..."
}
}
]
JSONL:
{"id":"doc_1","values":[0.1],"metadata":{"text":"large text..."}}
{"id":"doc_2","values":[0.2],"metadata":{"text":"large text..."}}
Each record must contain:
- id or _id
- metadata as a JSON object
Vector fields such as values, vector, or embedding are preserved but not deeply
validated by the MVP.
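As a rough illustration of the two input shapes and the required keys, the sketch below parses either a JSON array or JSONL text and checks each record. The function names load_records and validate_record are hypothetical and are not part of the vectormeta API; the array-versus-JSONL detection by leading bracket is also an assumption.

```python
import json
from typing import Any


def validate_record(record: Any, index: int) -> None:
    """Raise ValueError when a record lacks an id or an object-valued metadata."""
    if not isinstance(record, dict):
        raise ValueError(f"record {index}: expected a JSON object")
    if "id" not in record and "_id" not in record:
        raise ValueError(f"record {index}: missing 'id' or '_id'")
    if not isinstance(record.get("metadata"), dict):
        raise ValueError(f"record {index}: 'metadata' must be a JSON object")


def load_records(text: str) -> list[dict[str, Any]]:
    """Parse a JSON array or JSONL string into a list of validated records."""
    stripped = text.lstrip()
    if stripped.startswith("["):
        # Assumption: a leading "[" means the whole input is one JSON array.
        records = json.loads(stripped)
    else:
        # JSONL: one JSON object per non-empty line.
        records = [json.loads(line) for line in stripped.splitlines() if line.strip()]
    for index, record in enumerate(records):
        validate_record(record, index)
    return records
```

Vector fields are passed through untouched here, which mirrors the statement that they are preserved but not deeply validated.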
Quickstart
Scan the included oversized Pinecone example:
vectormeta scan examples/oversized_pinecone_records.json --target pinecone --no-fail
Fix the records:
vectormeta fix examples/oversized_pinecone_records.json \
--target pinecone \
--sidecar examples/sidecar \
--out examples/pinecone_ready.json \
--overwrite
Verify the cleaned file now fits the Pinecone-sized policy:
vectormeta scan examples/pinecone_ready.json --target pinecone --no-fail
Hydrate records for local inspection:
vectormeta hydrate examples/pinecone_ready.json \
--sidecar examples/sidecar \
--out examples/hydrated.json \
--overwrite
Commands
Scan
vectormeta scan chunks.json --target pinecone
Useful options:
- --target pinecone|chroma|qdrant|weaviate|custom
- --limit-kb <number> for custom or overridden limits
- --top <number> for the largest oversized records to show
- --format table|json
- --no-fail to exit 0 even when oversized records are found
Exit codes:
- 0: all records fit, or --no-fail was passed
- 1: oversized records were found
- 2: expected user-facing input, config, target, or overwrite error
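For pipelines driven from Python rather than a shell step, the exit codes can be interpreted with a small wrapper. This is only a sketch: describe_exit and run_scan are hypothetical helper names, and the wrapper assumes the vectormeta executable is on PATH.

```python
import subprocess
import sys

# Meanings taken from the exit-code list in this section.
EXIT_MEANINGS = {
    0: "all records fit, or --no-fail was passed",
    1: "oversized records were found",
    2: "input, config, target, or overwrite error",
}


def describe_exit(code: int) -> str:
    """Map a scan exit code to a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_scan(path: str, target: str = "pinecone") -> int:
    """Run `vectormeta scan` on a file and return its exit code."""
    result = subprocess.run(
        ["vectormeta", "scan", path, "--target", target],
        check=False,  # a non-zero exit is an expected outcome, not an exception
    )
    print(describe_exit(result.returncode), file=sys.stderr)
    return result.returncode
```

A CI job could call run_scan("chunks.json") and pass the returned code to sys.exit so that oversized records fail the build.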
Fix
vectormeta fix chunks.json --target pinecone --sidecar ./sidecar --out pinecone_ready.json
Move explicit fields:
vectormeta fix chunks.json \
--target pinecone \
--move-fields chunk_text,raw_html,summary \
--keep-fields source,page,section,doc_id,chunk_id \
--content-ref-field content_ref \
--sidecar ./sidecar \
--out pinecone_ready.json
Preview without writing:
vectormeta fix chunks.json --target pinecone --sidecar ./sidecar --out ready.json --dry-run
fix does not overwrite files unless --overwrite is passed.
If your input metadata already contains content_ref, choose another reference field:
vectormeta fix chunks.json \
--target pinecone \
--content-ref-field vectormeta_content_ref \
--sidecar ./sidecar \
--out pinecone_ready.json
Hydrate
vectormeta hydrate pinecone_ready.json --sidecar ./sidecar --out hydrated.json
Hydrate sidecar content into a separate record field:
vectormeta hydrate pinecone_ready.json \
--sidecar ./sidecar \
--mode content_field \
--content-field payload \
--out hydrated.json
Limits
vectormeta limits
Current MVP defaults:
| Target | Default | Meaning |
|---|---|---|
| pinecone | 40 KB | Primary strict-limit target for this MVP |
| chroma | 256 KB | Advisory local/configurable policy |
| qdrant | 64 KB | Conservative advisory policy |
| weaviate | 64 KB | Conservative advisory policy |
| custom | none | Requires --limit-kb |
Limits and provider behavior can change. Verify official vector database documentation before treating any preset as a production guarantee.
How Metadata Reduction Works
vectormeta sizes metadata exactly as compact UTF-8 JSON:
json.dumps(metadata, ensure_ascii=False, separators=(",", ":")).encode("utf-8")
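Wrapped in a function, the same expression gives a byte count that can be compared against a limit. The helper names below are hypothetical; only the json.dumps call comes from the project. The second helper sizes each field on its own, which is one plausible way to rank the largest fields.

```python
import json
from typing import Any


def metadata_size_bytes(metadata: dict[str, Any]) -> int:
    """Byte length of metadata serialized as compact UTF-8 JSON."""
    encoded = json.dumps(metadata, ensure_ascii=False, separators=(",", ":")).encode("utf-8")
    return len(encoded)


def field_sizes(metadata: dict[str, Any]) -> list[tuple[str, int]]:
    """Size of each field serialized as a one-key object, largest first."""
    sizes = {key: metadata_size_bytes({key: value}) for key, value in metadata.items()}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

Because ensure_ascii=False is used, a non-ASCII character counts by its UTF-8 width rather than as a six-character escape: metadata_size_bytes({"a": "é"}) is 10 bytes, not 14.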
The fixer reduces metadata in this order:
- Move explicit --move-fields, if provided.
- Otherwise move known heavy fields such as text, chunk_text, raw_html, markdown, summary, tables, and ocr_text.
- If metadata is still above the limit, move the largest non-keep fields one at a time until the record fits.
- Keep fields such as source, page, doc_id, and tags are preserved unless the record cannot fit without moving them.
- When fields are moved, metadata receives a content_ref, and moved fields are written to a sidecar JSON payload.
The logic is covered by tests for Unicode byte sizing, nested metadata sizing, JSON/JSONL input, fixer output, sidecar overwrite protection, hydration, and CLI exit codes. See docs/metadata-reduction.md.
Local Verification
Run the same checks used in CI:
python -m pytest
ruff check .
ruff format --check .
mypy vectormeta
python -m build
Run the acceptance workflow:
vectormeta scan examples/oversized_pinecone_records.json --target pinecone --no-fail
vectormeta fix examples/oversized_pinecone_records.json --target pinecone --sidecar examples/sidecar --out examples/pinecone_ready.json --overwrite
vectormeta scan examples/pinecone_ready.json --target pinecone --no-fail
vectormeta hydrate examples/pinecone_ready.json --sidecar examples/sidecar --out examples/hydrated.json --overwrite
Expected result:
- The original example reports one oversized record.
- The fixed output reports zero oversized records.
- Sidecar files are created under examples/sidecar.
- Hydration restores moved fields for inspection.
Documentation
- Project website
- Architecture overview
- Metadata reduction logic
- Usage guide
- Testing checklist
- Vector database notes
Limitations
- Local JSON sidecars only. Keep the cleaned output file and sidecar directory together; the MVP does not provide an atomic database-backed sidecar store.
- Sidecars are one file per changed record. The MVP does not deduplicate repeated fields such as shared raw_html across chunks from the same document.
- Input support is JSON arrays and JSONL records, but files are currently read into memory. Streaming JSONL scan/fix is planned for larger embedding datasets.
- Vector values are preserved but not deeply validated.
- Provider-specific metadata schemas and value types are not fully validated. For example, Pinecone has metadata format rules beyond byte size.
- Non-Pinecone target limits are conservative advisory defaults, not vendor claims.
- The fixer is policy-based; review cleaned outputs before production ingestion.
Roadmap
Planned ideas include:
- SQLite sidecar backend
- Content-addressed sidecar deduplication
- Streaming JSONL scan/fix
- S3 sidecar backend
- LangChain Document adapter
- LlamaIndex Node adapter
- Pinecone upsert wrapper
- GitHub Action for metadata checks
- HTML report output
See ROADMAP.md.
License
MIT. See LICENSE.