Easily create and search text embeddings with OpenAI's API, using JSON for local storage. Just add dicts of info and search! Built for rapid prototyping.

Embedme

Embedme is a Python module that lets you easily create embeddings from text fields with OpenAI's Embedding API and store them in a local folder.

It's like a lazy version of Pinecone. NumPy is actually pretty fast for embeddings stuff at smaller scale, so why overthink it? We store the data and vectors as JSON, build the NumPy array before you search, and keep it around until you add more.
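The store-as-JSON, search-with-NumPy idea can be sketched roughly like this. It is not Embedme's actual code: the file layout, the helper names (`save_items`, `load_items`, `build_matrix`, `top_k`), and the toy 3-dimensional vectors are all assumptions made for illustration.

```python
import json
import os
import tempfile

import numpy as np


def save_items(path, items):
    """Write a list of {'name', 'text', 'vector'} dicts to a JSON file (hypothetical layout)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f)


def load_items(path):
    """Read the list of item dicts back from the JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def build_matrix(items):
    """Stack stored vectors into one (n_items, dim) array with unit-length rows."""
    matrix = np.array([item["vector"] for item in items], dtype=np.float64)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


def top_k(matrix, items, query_vector, k=2):
    """Return (name, score) for the k items with the highest cosine similarity."""
    query = np.asarray(query_vector, dtype=np.float64)
    query = query / np.linalg.norm(query)
    scores = matrix @ query  # dot product of unit vectors = cosine similarity
    order = np.argsort(scores)[::-1][:k]
    return [(items[i]["name"], float(scores[i])) for i in order]


# Toy vectors standing in for real embeddings.
toy_items = [
    {"name": "whale", "text": "a whale", "vector": [1.0, 0.0, 0.0]},
    {"name": "ship", "text": "a ship", "vector": [0.0, 1.0, 0.0]},
    {"name": "sea", "text": "the sea", "vector": [0.7, 0.7, 0.0]},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "items.json")
    save_items(path, toy_items)
    items = load_items(path)

matrix = build_matrix(items)  # built once, reused until items change
results = top_k(matrix, items, [0.9, 0.1, 0.0], k=2)
print(results)
```

With these toy vectors the query sits closest to "whale", then "sea". A brute-force matrix product like this is usually plenty for a few thousand items, which is the scale the description above has in mind.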

Installation

To install Embedme, you can use pip:

pip install embedme

Setup

The only thing you must do before you use Embedme is set up auth with OpenAI. We use it to embed your items and search queries, so it is required. I don't want to touch any of that code, so just sign in the way they tell you to: either in the script via a file holding the key, or through an environment variable for your key.

OpenAI Python Module (With Auth Instructions)
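As a rough sketch of the two options (key file or environment variable), something like the helper below would work. The function name `load_openai_key` and its parameters are hypothetical and not part of Embedme; `OPENAI_API_KEY` is the environment variable the OpenAI Python module reads by default.

```python
import os


def load_openai_key(key_file=None, env_var="OPENAI_API_KEY"):
    """Return the API key from a file if one is given, else from the environment.

    Hypothetical helper for illustration; Embedme itself leaves auth to openai.
    """
    if key_file is not None:
        with open(key_file, "r", encoding="utf-8") as f:
            return f.read().strip()
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} or pass key_file")
    return key


# Typical use, commented out so the sketch runs without the openai package:
# import openai
# openai.api_key = load_openai_key()
```

If the environment variable is already set, the OpenAI module should pick it up on its own and no extra code is needed.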

Usage

Embedme provides a simple interface to use embeddings from text fields with OpenAI's Embedding API and store them in a local folder.

Check out the example notebook for a fuller example, but usage is something like:

import openai
import nltk
from more_itertools import chunked
from embedme import Embedme
from tqdm import tqdm

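# Downloading the NLTK corpus, plus the tokenizer data that sent_tokenize needs
nltk.download('gutenberg')
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead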

# Creating an instance of the Embedme class
embedme = Embedme(data_folder='.embedme', model="text-embedding-ada-002")

# Getting the text
text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

# Splitting the text into sentences
sentences = nltk.sent_tokenize(text)

input("Hey this call will cost you money and take a minute. Like, a few cents probably, but wanted to warn you.")

for i, chunk in enumerate(tqdm(chunked(sentences, 20))):
    data = {'name': f'moby_dick_chunk_{i}', 'text': ' '.join(chunk)}
    embedme.add(data, save=False)

embedme.save()

And to search:

embedme.search("lessons")

You can do anything you want with .vectors after you call .prepare_search() (or just search for something, since it's mostly automatic), like plotting clusters.
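For instance, a cluster plot usually starts by squashing the vectors down to two dimensions. The sketch below does that with a plain NumPy PCA. It assumes `.vectors` is a 2-D array of shape (n_items, dim), which is a guess about Embedme's internals, and it uses random stand-in data so it runs without an API key. `project_2d` is a hypothetical helper name.

```python
import numpy as np


def project_2d(vectors):
    """Project (n_items, dim) vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Stand-in for embedme.vectors (assumed shape): 50 items, 1536 dimensions,
# the embedding size of text-embedding-ada-002.
rng = np.random.default_rng(0)
stand_in = rng.normal(size=(50, 1536))

points = project_2d(stand_in)
print(points.shape)  # (50, 2)
```

The resulting `points[:, 0]` and `points[:, 1]` columns can go straight into a scatter plot with whatever plotting library you like.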

Follow Us

Some friends and I are writing about large language model stuff at SensibleDefaults.io, honest to god free. Follow us (or star this repo!) if this helps you!

Note

Embedme uses OpenAI's Embedding API to get embeddings for text fields, so an API key is required to use it. You can get one from https://beta.openai.com/signup/

The token limit per input is about 8k today, so with chunks like the 20-sentence ones above you're probably fine.
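If you want a cheap guard anyway, a rough sketch is to pack sentences greedily under an estimated token budget. The 4-characters-per-token figure is only a rule of thumb for English text, not a real tokenizer, and the names `rough_token_count` and `group_under_budget` are hypothetical, not part of Embedme.

```python
def rough_token_count(text, chars_per_token=4):
    """Very rough estimate: about 4 characters per token for English text."""
    return -(-len(text) // chars_per_token)  # ceiling division


def group_under_budget(sentences, max_tokens=8000):
    """Greedily pack sentences into chunks whose estimated size stays within max_tokens.

    A single sentence larger than the budget still becomes its own chunk;
    this sketch does not split inside sentences.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        cost = rough_token_count(sentence) + 1  # +1 for the joining space
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


# Three 40-character "sentences" with a tiny budget, to show the packing.
demo = group_under_budget(["a" * 40, "b" * 40, "c" * 40], max_tokens=25)
print(len(demo))  # 2
```

For an exact count you would need a real tokenizer; the estimate here just keeps chunks comfortably away from the limit.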

