Turn silly json into intelligent embeddings
jsonllm
Tools for working with LLMs on JSON data
Usage |
Installation |
Why |
How
Usage
Usage: jsonllm [OPTIONS] COMMAND [ARGS]...
Tools for working with LLMs on JSON data
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
embed Turn a JSON of content into a JSON of embeddings.
Usage: jsonllm embed [OPTIONS]
Turn a JSON of content into a JSON of embeddings.
Options:
-i, --input PATH File to embed
-m, --model TEXT Embedding model(s) to use
Issue `llm embed-models list` to list available models.
Currently installed are: ['3-large', '3-large-1024',
'3-large-256', '3-small', '3-small-512', 'ada-002',
'clip', 'jina-embeddings-v2-base-en', 'jina-
embeddings-v2-large-en', 'jina-embeddings-v2-small-en',
'onnx-bge-base', 'onnx-bge-large', 'onnx-bge-micro',
'onnx-bge-small', 'onnx-gte-tiny', 'onnx-minilm-l12',
'onnx-minilm-l6', 'sentence-transformers/all-MiniLM-L6-v2']
You can install more via `llm install ...`
You can find available models here: https://llm.datasette.io/en/stable/plugins/directory.html#embedding-models
-j, --jq TEXT Embed only the keys that satisfy the given jq filter
expression
--in-arrays Embed text appearing in arrays too
--help Show this message and exit.
CREATE TABLE people (data JSONB);
python tests/gen_people.py 100 |\
jsonllm embed -m clip -j '.name' |\
psql -c "\COPY people(data) FROM stdin"
echo '{"hello": "world"}' | jsonllm embed -m clip
Installation
pip install jsonllm
Available Models
Available embedding models are those provided and installed via the llm package.
- llm-sentence-transformers adds support for embeddings using the sentence-transformers library, which provides access to a wide range of embedding models.
- llm-clip provides the CLIP model, which can be used to embed images and text in the same vector space, enabling text search against images. See Build an image search engine with llm-clip for more on this plugin.
- llm-embed-jina provides Jina AI's 8K text embedding models.
- llm-embed-onnx provides seven embedding models that can be executed using the ONNX model framework.
llm install llm-sentence-transformers
llm install llm-clip
llm install llm-embed-jina
llm install llm-embed-onnx
For an up-to-date list, check the llm plugin directory: https://llm.datasette.io/en/stable/plugins/directory.html#embedding-models
Why
There are now plenty of tools for getting embeddings out of a corpus of text. Some can even generate embeddings from JSON documents, but they treat JSON as plain text too.
JSON is rarely just text, though: JSON documents have structure and semantics that depend on the application they are used in. Most importantly, JSON is a data exchange format and a data aggregation tool, aggregation in the sense of getting data from A to B.
In my case, point A was a JSON object created by an SQL query from a Postgres database, piped through jsonllm and pushed into another Postgres instance specifically designed for AI-related experiments.
How
jsonllm traverses a JSON object recursively and replaces text values with their embedding arrays.
Other data types are not modified at all and the overall object structure is not changed.
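The traversal described above can be sketched roughly as follows. This is a minimal illustration, not jsonllm's actual source: `embed_json`, `toy_embed`, and the `in_arrays` parameter are hypothetical names, and the assumption that strings inside arrays are skipped unless requested is inferred only from the `--in-arrays` help text. `toy_embed` is a stand-in that returns meaningless numbers; a real run would call an embedding model instead.

```python
import json
from typing import Any, Callable, List

# Hypothetical type alias: anything that maps a string to a vector of floats.
Embedder = Callable[[str], List[float]]


def toy_embed(text: str) -> List[float]:
    """Stand-in embedder (NOT a real model): returns two arbitrary numbers."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def embed_json(
    node: Any,
    embed: Embedder,
    in_arrays: bool = False,
    _inside_array: bool = False,
) -> Any:
    """Return a copy of `node` with string values replaced by embeddings.

    Assumption: strings that are direct array items are left alone unless
    `in_arrays` is True, mirroring the wording of the --in-arrays option.
    """
    if isinstance(node, dict):
        # Keys are kept as-is; only values are visited.
        return {key: embed_json(value, embed, in_arrays) for key, value in node.items()}
    if isinstance(node, list):
        return [embed_json(item, embed, in_arrays, True) for item in node]
    if isinstance(node, str):
        if _inside_array and not in_arrays:
            return node
        return embed(node)
    # Numbers, booleans and null pass through unchanged.
    return node


doc = {"name": "Ada", "age": 36, "tags": ["x", "y"], "address": {"city": "London"}}
result = embed_json(doc, toy_embed)
print(json.dumps(result))
```

Passing the embedder in as a callable keeps the traversal independent of any particular model, so the same function works with whatever the llm package provides.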
Development
pip install -e '.[test]'
pytest