Spyder Index

Spyder Index is an open-source framework for building LLM applications.

Embeddings: HuggingFace

Overview

Class for computing text embeddings using HuggingFace models.

API Reference

from spyder_index.embeddings import HuggingFaceEmbeddings

HuggingFaceEmbeddings(model_name: str = "sentence-transformers/all-MiniLM-L6-v2", device: Literal["cpu", "cuda"] = "cpu")

Initialize a HuggingFaceEmbeddings object.

  • model_name (str, optional): Name of the HuggingFace model to be used. Defaults to "sentence-transformers/all-MiniLM-L6-v2".
  • device (Literal["cpu", "cuda"], optional): Device to run the model on. Defaults to "cpu".
get_query_embedding(query: str) -> List[float]

Compute embedding for a query.

  • query (str): Input query to compute embedding.
get_embedding_from_texts(texts: List[str]) -> List[List[float]]

Compute embeddings for a list of texts.

  • texts (List[str]): List of input texts to compute embeddings for.
get_documents_embedding(documents: List[Document]) -> List[List[float]]

Compute embeddings for a list of Documents.

  • documents (List[Document]): List of Document objects to compute embeddings for.
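
Example (a minimal sketch based on the signatures above; the sample strings are illustrative):

from spyder_index.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(device="cpu")

# Embed a single query string; returns one vector as List[float].
query_vector = embeddings.get_query_embedding("What is semantic search?")

# Embed several texts at once; returns one vector per input text.
text_vectors = embeddings.get_embedding_from_texts(["first passage", "second passage"])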

Ingestion: Directory Files

Overview

This class provides functionality to load documents from a directory using various file loaders.

API Reference

from spyder_index.ingestion import DirectoryLoader

load_data(dir: str, metadata: Optional[dict] = None) -> List[Document]

Loads data from the specified directory.

  • dir (str): The directory path from which to load the documents.
  • metadata (Optional[dict]): Additional metadata to include in each loaded document.
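
Example (a minimal sketch; it assumes DirectoryLoader takes no constructor arguments, since only load_data is documented above, and the directory path and metadata are illustrative):

from spyder_index.ingestion import DirectoryLoader

loader = DirectoryLoader()

# Load every supported file under ./docs, attaching shared metadata.
documents = loader.load_data(dir="./docs", metadata={"source": "local"})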

Ingestion: IBM Cloud Object Storage (S3)

Overview

Loads data from IBM Cloud Object Storage (COS) using the S3 interface.

API Reference

from spyder_index.ingestion import IBMS3Loader

IBMS3Loader(bucket: str, ibm_api_key_id: str, ibm_service_instance_id: str, s3_endpoint_url: str)

Initialize an IBMS3Loader object.

  • bucket (str): The name of the IBM COS bucket.
  • ibm_api_key_id (str): The IBM Cloud API key ID for accessing the bucket.
  • ibm_service_instance_id (str): The service instance ID for the IBM COS.
  • s3_endpoint_url (str): The endpoint URL for the IBM COS S3 service.
load_data() -> List[Document]

Loads documents from the configured COS bucket.
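
Example (a minimal sketch; the bucket name and credentials are placeholders, and the endpoint URL shown is the public us-south COS endpoint, used here for illustration):

from spyder_index.ingestion import IBMS3Loader

loader = IBMS3Loader(
    bucket="my-bucket",                           # placeholder bucket name
    ibm_api_key_id="<IBM_CLOUD_API_KEY>",         # placeholder credential
    ibm_service_instance_id="<COS_INSTANCE_ID>",  # placeholder instance ID
    s3_endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)

# Load every object in the bucket as Document objects.
documents = loader.load_data()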

Ingestion: JSON

Overview

Loads data from JSON.

API Reference

from spyder_index.ingestion import JSONLoader

JSONLoader(jq_schema: str, text_content: bool = False)

Initialize a JSONLoader object.

  • jq_schema (str): The jq schema to use to extract the data from the JSON.
  • text_content (bool, optional): Flag indicating whether the content is in string format. Default is False.
load_data(file: str) -> List[Document]

Loads data from the specified JSON file.

  • file (str): The file path of the JSON file to load.
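
Example (a minimal sketch; the jq expression and file path are illustrative and depend on your JSON layout):

from spyder_index.ingestion import JSONLoader

# Extract the "content" field from each entry in a top-level "messages" array.
loader = JSONLoader(jq_schema=".messages[].content", text_content=True)
documents = loader.load_data(file="./data.json")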

Text Splitter: Semantic Splitter

Overview

SemanticSplitter splits text into chunks using semantic understanding: it uses pre-trained embeddings to identify breakpoints in the text and divide it into meaningful segments.

API Reference

from spyder_index.text_splitters import SemanticSplitter

SemanticSplitter(model_name: str = "sentence-transformers/all-MiniLM-L6-v2", buffer_size: int = 1, breakpoint_threshold_amount: int = 95, device: Literal["cpu", "cuda"] = "cpu") -> None

Initialize the SemanticSplitter instance.

  • model_name (str, optional): Name of the pre-trained embeddings model to use. Default is "sentence-transformers/all-MiniLM-L6-v2".
  • buffer_size (int, optional): Size of the buffer for semantic chunking. Default is 1.
  • breakpoint_threshold_amount (int, optional): Threshold percentage for detecting breakpoints. Default is 95.
  • device (Literal["cpu", "cuda"], optional): Device to use for processing, either "cpu" or "cuda". Default is "cpu".
from_text(text: str) -> List[str]

Split text into chunks.

  • text (str): Input text to split.
from_documents(documents: List[Document]) -> List[Document]

Split text from a list of documents into chunks.

  • documents (List[Document]): List of Document objects.
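
Example (a minimal sketch using the defaults above; the input text is illustrative):

from spyder_index.text_splitters import SemanticSplitter

splitter = SemanticSplitter(buffer_size=1, breakpoint_threshold_amount=95)

# Chunks are cut where adjacent sentence embeddings diverge past the threshold,
# so the two topics below should land in separate chunks.
chunks = splitter.from_text(
    "Cats are small domesticated felines. They sleep most of the day. "
    "The stock market closed higher on Friday. Tech shares led the gains."
)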

Text Splitter: Sentence Splitter

Overview

SentenceSplitter splits input text into smaller chunks, which is particularly useful for processing large documents. It provides methods to split raw text and lists of documents into chunks.

API Reference

from spyder_index.text_splitters import SentenceSplitter

SentenceSplitter(chunk_size: int = 512, chunk_overlap: int = 256, length_function = len, separators: List[str] = ["\n\n", "\n", " ", ""]) -> None

Creates a new instance of the SentenceSplitter class.

  • chunk_size (int, optional): Size of each chunk. Default is 512.
  • chunk_overlap (int, optional): Amount of overlap between chunks. Default is 256.
  • length_function (function, optional): Function to compute the length of the text. Default is len.
  • separators (List[str], optional): List of separators used to split the text into chunks. Default separators are ["\n\n", "\n", " ", ""].
from_text(text: str) -> List[str]

Splits the input text into chunks.

  • text (str): Input text to split.
from_documents(documents: List[Document]) -> List[Document]

Splits a list of documents into chunks.

  • documents (List[Document]): List of Document objects.
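
Example (a minimal sketch; with chunk_size=512 and chunk_overlap=256, consecutive chunks share roughly half their content, and the input file is illustrative):

from spyder_index.text_splitters import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=256)

# Text is split on "\n\n" first, then "\n", then spaces, as needed.
with open("report.txt", encoding="utf-8") as f:  # illustrative input file
    long_text = f.read()

chunks = splitter.from_text(long_text)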

Vector Store: Elasticsearch

Overview

The ElasticsearchVectorStore class provides functionality to interact with Elasticsearch for storing and querying document embeddings. It facilitates adding documents, performing similarity searches, and deleting documents from an Elasticsearch index.

API Reference

from spyder_index.vector_stores import ElasticsearchVectorStore

ElasticsearchVectorStore(index_name, es_hostname, es_user, es_password, dims_length, embedding, batch_size=200, ssl=False, distance_strategy="cosine", text_field="text", vector_field="embedding")

Initializes the ElasticsearchVectorStore instance.

  • index_name: The name of the Elasticsearch index.
  • es_hostname: The hostname of the Elasticsearch instance.
  • es_user: The username for authentication.
  • es_password: The password for authentication.
  • dims_length: The length of the embedding dimensions.
  • embedding: An instance of embeddings.
  • batch_size: The batch size for bulk operations. Defaults to 200.
  • ssl: Whether to use SSL. Defaults to False.
  • distance_strategy: The distance strategy for similarity search. Defaults to "cosine".
  • text_field: The name of the field containing text. Defaults to "text".
  • vector_field: The name of the field containing vector embeddings. Defaults to "embedding".
add_documents(documents, create_index_if_not_exists=True)

Adds documents to the Elasticsearch index.

  • documents: A list of Document objects to add to the index.
  • create_index_if_not_exists: Whether to create the index if it doesn't exist. Defaults to True.
similarity_search(query, top_k=4)

Performs a similarity search and returns the documents most similar to the query.

  • query: The query text.
  • top_k: The number of top results to return. Defaults to 4.
delete(ids=None)

Deletes documents from the Elasticsearch index.

  • ids: A list of document IDs to delete. Defaults to None.
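
An end-to-end example (a minimal sketch; the hostname and credentials are placeholders, and dims_length=384 matches the 384-dimensional output of all-MiniLM-L6-v2):

from spyder_index.embeddings import HuggingFaceEmbeddings
from spyder_index.vector_stores import ElasticsearchVectorStore

embedding = HuggingFaceEmbeddings()

store = ElasticsearchVectorStore(
    index_name="my-documents",
    es_hostname="localhost:9200",  # placeholder host
    es_user="elastic",             # placeholder credential
    es_password="<ES_PASSWORD>",   # placeholder credential
    dims_length=384,               # all-MiniLM-L6-v2 outputs 384-dim vectors
    embedding=embedding,
)

# "documents" comes from one of the ingestion loaders above.
store.add_documents(documents)

# Retrieve the 4 most similar documents to the query.
results = store.similarity_search("What is semantic search?", top_k=4)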
