Turn silly json into intelligent embeddings

jsonllm

Tools for working with LLMs on JSON data

Usage | Installation | Why | How

Usage

Usage: jsonllm [OPTIONS] COMMAND [ARGS]...

  Tools for working with LLMs on JSON data

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  embed  Turn a JSON of content into a JSON of embeddings.

Usage: jsonllm embed [OPTIONS]

  Turn a JSON of content into a JSON of embeddings.

Options:
  -i, --input PATH  File to embed
  -m, --model TEXT  Embedding model(s) to use
                    
                    Issue `llm embed-models list` to list available models.
                    
                    Currently installed are: ['3-large', '3-large-1024',
                    '3-large-256', '3-small', '3-small-512', 'ada-002',
                    'clip', 'jina-embeddings-v2-base-en', 'jina-
                    embeddings-v2-large-en', 'jina-embeddings-v2-small-en',
                    'onnx-bge-base', 'onnx-bge-large', 'onnx-bge-micro',
                    'onnx-bge-small', 'onnx-gte-tiny', 'onnx-minilm-l12',
                    'onnx-minilm-l6', 'sentence-transformers/all-MiniLM-L6-v2']
                    
                    You can install more via `llm install ...`
                    
                    You can find available models here: https://llm.datasette.io/en/stable/plugins/directory.html#embedding-models
  -j, --jq TEXT     Embed only the keys that satisfy the given jq filter
                    expression
  --in-arrays       Embed text appearing in arrays too
  --help            Show this message and exit.

CREATE TABLE people (data JSONB);

python tests/gen_people.py 100 |\
jsonllm embed -m clip -j '.name' |\
psql -c "\COPY people(data) FROM stdin"

echo '{"hello": "world"}' | jsonllm embed -m clip

Installation

pip install jsonllm

Available Models

Available embedding models are those provided and installed via the llm package.

llm install llm-sentence-transformers
llm install llm-clip
llm install llm-embed-jina
llm install llm-embed-onnx

For an up-to-date list, see https://llm.datasette.io/en/stable/plugins/directory.html#embedding-models

Why

There are now plenty of tools for getting embeddings out of a corpus of text. Some can even generate embeddings from JSON documents, but they treat JSON as plain text too.

JSON is rarely just text, though: JSON documents have structure and semantics that depend on the application. Most importantly, JSON is a data exchange format and a data aggregation tool, aggregation in the sense of getting data from A to B.

In my case, point A was a JSON object produced by an SQL query against a Postgres database. It was piped through jsonllm and pushed into another Postgres instance set up specifically for AI-related experiments.

How

jsonllm traverses a JSON object recursively and replaces each text value with its embedding array.

Other data types are not modified at all and the overall object structure is not changed.
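
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not jsonllm's actual code: `embed_json`, `toy_embed` and the `in_arrays` parameter are hypothetical names, and `toy_embed` is a stand-in that returns two made-up numbers instead of calling a real embedding model. The array handling is an assumption inferred from the `--in-arrays` option, namely that strings sitting directly inside arrays are left alone unless asked for.

```python
import json
from typing import Any, Callable, List

Embedder = Callable[[str], List[float]]


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real model; NOT a meaningful embedding."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def embed_json(node: Any, embed: Embedder, in_arrays: bool = False,
               _in_array: bool = False) -> Any:
    """Return a copy of `node` with string values replaced by embeddings.

    Keys, numbers, booleans and nulls are passed through unchanged, so the
    overall shape of the document is preserved.
    """
    if isinstance(node, dict):
        # Values of an object are never "directly inside an array".
        return {k: embed_json(v, embed, in_arrays, False) for k, v in node.items()}
    if isinstance(node, list):
        return [embed_json(v, embed, in_arrays, True) for v in node]
    if isinstance(node, str):
        # Assumption: array members are only embedded when in_arrays is set.
        if _in_array and not in_arrays:
            return node
        return embed(node)
    return node


doc = {"name": "Ada", "age": 36, "tags": ["math", "code"],
       "address": {"city": "London"}}
print(json.dumps(embed_json(doc, toy_embed)))
print(json.dumps(embed_json(doc, toy_embed, in_arrays=True)))
```

In the first call only "name" and "address.city" become arrays of floats; in the second the two strings under "tags" are replaced as well. With a real model, `toy_embed` would be swapped for a function that calls the chosen embedding model.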

Development

pip install -e '.[test]'
pytest
