Turn silly json into intelligent embeddings

jsonllm

Tools for working with LLMs on JSON data

Usage | Installation | Why | How

Usage

Usage: jsonllm [OPTIONS] COMMAND [ARGS]...

  Tools for working with LLMs on JSON data

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  embed  Turn a JSON of content into a JSON of embeddings.

Usage: jsonllm embed [OPTIONS]

  Turn a JSON of content into a JSON of embeddings.

Options:
  -i, --input PATH  File to embed
  -m, --model TEXT  Embedding model(s) to use
                    
                    Issue `llm embed-models list` to list available models.
                    
                    Currently installed are: ['3-large', '3-large-1024',
                    '3-large-256', '3-small', '3-small-512', 'ada-002',
                    'clip', 'jina-embeddings-v2-base-en',
                    'jina-embeddings-v2-large-en', 'jina-embeddings-v2-small-en',
                    'onnx-bge-base', 'onnx-bge-large', 'onnx-bge-micro',
                    'onnx-bge-small', 'onnx-gte-tiny', 'onnx-minilm-l12',
                    'onnx-minilm-l6', 'sentence-transformers/all-MiniLM-L6-v2']
                    
                    You can install more via `llm install ...`
                    
                    You can find available models here: https://llm.datasette.io/en/stable/plugins/directory.html#embedding-models
  -j, --jq TEXT     Embed only the keys that satisfy the given jq filter
                    expression
  --in-arrays       Embed text appearing in arrays too
  --help            Show this message and exit.

CREATE TABLE people (data JSONB);

python tests/gen_people.py 100 |\
jsonllm embed -m clip -j '.name' |\
psql -c "\COPY people(data) FROM stdin"

echo '{"hello": "world"}' | jsonllm embed -m clip

Installation

pip install jsonllm

Available Models

The available embedding models are those provided by the llm package and its installed plugins.

llm-sentence-transformers adds support for embeddings using the sentence-transformers library, which provides access to a wide range of embedding models.
llm-clip provides the CLIP model, which can be used to embed images and text in the same vector space, enabling text search against images. See Build an image search engine with llm-clip for more on this plugin.
llm-embed-jina provides Jina AI's 8K text embedding models.
llm-embed-onnx provides seven embedding models that can be executed using the ONNX model framework.

llm install llm-sentence-transformers
llm install llm-clip
llm install llm-embed-jina
llm install llm-embed-onnx

For an up-to-date list, see the llm plugin directory: https://llm.datasette.io/en/stable/plugins/directory.html#embedding-models

Why

There are now plenty of tools for getting embeddings out of a corpus of text. Some can even generate embeddings from JSON documents, but they treat JSON as plain text too.

JSON is rarely just text, though: JSON documents have structure and semantics that depend on the application. Most importantly, JSON is a data exchange format and a data aggregation tool, aggregation in the sense of getting data from A to B.

In my case, point A was a JSON object created by an SQL query against a Postgres database, piped through jsonllm, and pushed into another Postgres instance set up specifically for AI-related experiments.

How

jsonllm traverses a JSON object recursively and replaces each text value with its embedding array.

Other data types are not modified at all and the overall object structure is not changed.
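
The traversal described above can be sketched in a few lines of Python. This is a minimal illustration, not jsonllm's actual source: `embed_json` and `toy_embed` are hypothetical names, `toy_embed` is a stand-in for a real embedding model, and the handling of the `in_arrays` flag is an assumption based only on the `--in-arrays` help text.

```python
from typing import Any, Callable, List

Embedder = Callable[[str], List[float]]


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding model."""
    # Two crude features, only to produce a small numeric vector.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def embed_json(node: Any, embed: Embedder, in_arrays: bool = False,
               _inside_array: bool = False) -> Any:
    """Return a copy of `node` with text values replaced by embeddings.

    Assumption: strings that sit directly inside an array are left alone
    unless `in_arrays` is True (modelled on the --in-arrays option).
    """
    if isinstance(node, str):
        if _inside_array and not in_arrays:
            return node
        return embed(node)
    if isinstance(node, dict):
        # Keys and key order are preserved; only the values are visited.
        return {key: embed_json(value, embed, in_arrays, False)
                for key, value in node.items()}
    if isinstance(node, list):
        return [embed_json(item, embed, in_arrays, True) for item in node]
    # Numbers, booleans and null pass through unchanged.
    return node


doc = {
    "name": "Ada",
    "age": 36,
    "tags": ["math", "poet"],
    "address": {"city": "London"},
}
result = embed_json(doc, toy_embed)
result_with_arrays = embed_json(doc, toy_embed, in_arrays=True)
```

In this sketch `result["name"]` and `result["address"]["city"]` become lists of floats, while `result["age"]` and the overall key layout stay the same. With a real model, `toy_embed` would be replaced by a call into the embedding model selected with `-m`.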

Development

pip install -e '.[test]'
pytest

Release history

0.1.0a2 pre-release (this version), Feb 11, 2024

0.1.0a1 pre-release, Feb 11, 2024

Download files

Source Distribution: jsonllm-0.1.0a2.tar.gz (5.5 kB), uploaded Feb 11, 2024

Built Distribution: jsonllm-0.1.0a2-py3-none-any.whl (5.6 kB), uploaded Feb 11, 2024, Python 3

Hashes for jsonllm-0.1.0a2.tar.gz

Algorithm	Hash digest
SHA256	`02e778b5cc39ff65cd1298a5856abe83e6651aeb424f36d6ca0144ad465e7e59`
MD5	`f542208aa4a98f81ced4aebf1dcb6ae5`
BLAKE2b-256	`a92e96c3892ff2c2916011d15f32ee807a222374a8e43f2d5ef916698d85bc93`

Hashes for jsonllm-0.1.0a2-py3-none-any.whl

Algorithm	Hash digest
SHA256	`d4bb238ec70e3ac5b4279b28e3cf1f9cec27a8b62d03ab8ec9470426bd2823d4`
MD5	`40f87febfe91902ce89685310205a91b`
BLAKE2b-256	`e9d23ec4ba3e05ce62ab8b6c42164e2c9f158cf9fd9932e3116c113280899b30`