
Extract and structure all the data from a Git repository to make it usable in RAG.

Project description

Apophenia


Apophenia gives meaning to any existing Git repository.

Apophenia extracts and structures all the data from a Git repository to make it usable in RAG pipelines or within AI agents.

Apophenia imposes a meaningful interpretation on a nebulous stimulus (a Git repo).

Install

$ pip install apophenia

Usage

Extract data from a given repository:

$ apophenia extract https://github.com/4383/niet \
  --faiss_path /tmp/results.faiss \
  --metadata_path /tmp/results.json
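
The exact schema of the generated metadata is not documented here; assuming each entry in the JSON file is a dict with a "type" field (the same field the RAG snippet below reads), a minimal stdlib-only sketch for inspecting the output could look like this:

```python
import json
from collections import Counter

def summarize_metadata(entries):
    # Count entries per content type. The "type" key is an assumption:
    # it matches the field read by the RAG snippet in this README.
    return Counter(e.get("type", "unknown") for e in entries)

# With real output you would load the file produced by `apophenia extract`:
#   entries = json.load(open("/tmp/results.json", encoding="utf-8"))
sample = [{"type": "commit"}, {"type": "file"}, {"type": "commit"}, {}]
print(summarize_metadata(sample))
```

This is only a quick sanity check that the extraction produced the content types you expect before wiring the output into a RAG.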

Or chat directly with your GitHub repository:

$ apophenia discuss https://github.com/4383/niet

If you simply want to extract the data and then use it in a RAG yourself, below is an example Python snippet; the discuss subcommand provides this kind of shortcut for you:

import faiss
import json
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FAISS index and the JSON metadata previously generated
def load_index_and_metadata(faiss_path, metadata_path):
    index = faiss.read_index(faiss_path)
    with open(metadata_path, 'r', encoding='utf-8') as f:
        metadata = json.load(f)
    return index, metadata

# Embed the user query
def embed_query(query, model):
    return model.encode(query, convert_to_tensor=True).cpu().numpy()

# Search the FAISS index
def search_in_faiss(index, query_embedding, metadata, k=5):
    distances, indices = index.search(np.array([query_embedding]), k)
    results = []
    for i, idx in enumerate(indices[0]):
        result = metadata[idx]
        # Cast to a plain float so the result stays JSON-serializable
        result['distance'] = float(distances[0][i])
        results.append(result)
    return results

# Build a prompt for a generative model
def build_prompt(query, retrieved_info):
    prompt = f"Answer the following question based on the retrieved information:\n\n"
    prompt += f"Question: {query}\n\n"
    prompt += "Retrieved Information:\n"
    for info in retrieved_info:
        content_type = info.get("type", "unknown")
        content_preview = info.get("content_preview", "No preview available")
        prompt += f"- {content_type.upper()}: {content_preview}\n"
    prompt += "\nYour Answer:"
    return prompt

# Generate a response with a generative model
def generate_response(prompt, model_name="EleutherAI/gpt-neo-125M", max_length=200):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(output[0], skip_special_tokens=True)


def run_rag_system(query, faiss_path, metadata_path, embedding_model_name, generative_model_name):
    # Load data (FAISS index and metadata, and embedding)
    index, metadata = load_index_and_metadata(faiss_path, metadata_path)
    embedding_model = SentenceTransformer(embedding_model_name)

    query_embedding = embed_query(query, embedding_model)

    # Search in FAISS
    retrieved_info = search_in_faiss(index, query_embedding, metadata)

    prompt = build_prompt(query, retrieved_info)

    response = generate_response(prompt, model_name=generative_model_name)

    return response, retrieved_info

if __name__ == "__main__":
    # Configuration
    FAISS_PATH = "results.faiss"
    METADATA_PATH = "results.json"
    EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
    GENERATIVE_MODEL_NAME = "EleutherAI/gpt-neo-125M"

    query = "How does the authentication system work in this repository?"

    response, retrieved_info = run_rag_system(
        query=query,
        faiss_path=FAISS_PATH,
        metadata_path=METADATA_PATH,
        embedding_model_name=EMBEDDING_MODEL_NAME,
        generative_model_name=GENERATIVE_MODEL_NAME
    )

    print("Generated Response:")
    print(response)
    print("\nRetrieved Information:")
    for info in retrieved_info:
        print(info)
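
The snippet above reports raw FAISS distances. Assuming the generated index uses L2 distance (FAISS's IndexFlatL2 is a common default, though this is an assumption about apophenia's output) and the embeddings are unit-normalized, the squared L2 distance can be converted into a cosine similarity score:

```python
import numpy as np

# For unit vectors a and b:  ||a - b||^2 = 2 - 2 * cos(a, b),
# so cosine similarity is recoverable as 1 - distance / 2.
rng = np.random.default_rng(0)
a = rng.normal(size=384); a /= np.linalg.norm(a)
b = rng.normal(size=384); b /= np.linalg.norm(b)

l2_sq = np.sum((a - b) ** 2)
cos = np.dot(a, b)
assert np.isclose(l2_sq, 2 - 2 * cos)
print(f"cosine similarity recovered from L2 distance: {1 - l2_sq / 2:.4f}")
```

Note that sentence-transformers does not normalize embeddings by default (`normalize_embeddings=False`), so check how the index was built before relying on this conversion.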

For more details:

$ apophenia -h

Applications

The results (FAISS vectors and JSON metadata) generated by Apophenia can be used within a RAG (Retrieval-Augmented Generation) system. Here’s a list of potential applications for using apophenia:

  • Generate enriched answers by combining documentation, commit messages, and code. Example questions include: "How do I use the authenticate_user function?" or "What is the structure of this project?"
  • Quickly search for specific parts of the code or documentation. Identify relevant functions or files based on queries like "Where is the authentication logic implemented?" or "Which module handles network connections?"
  • Retrieve historical information to understand bugs or errors. Analyze recent changes with queries like "What are the latest modifications in this file?" or "Which commits mention this bug?"
  • Automatically generate a changelog based on commit messages and diffs for a new release.
  • Identify outdated dependencies or technologies and plan migrations. For instance, answer queries like "Which files are using Eventlet?" or "Which commits introduced asyncio?"
  • Search for changes related to vulnerabilities or critical dependencies. Example questions include "Which files use OpenSSL?" or "Which commits fixed vulnerabilities?"
  • Generate technical guides or manuals from existing code and documentation fragments. For example, create an installation guide from README files and configuration scripts.
  • Understand individual contributions or file evolution by asking questions like "Who wrote this function?" or "What are John Doe's contributions?"
  • Search for specific concepts within the project, such as "Where is the caching logic handled?" or "Which files mention secure connections?"
  • Simplify onboarding for new developers by providing guided answers like "The main features of this project are documented in README.md." or "auth.py handles authentication logic."
  • Identify which files or functions are impacted by a specific commit with questions like "Which files were modified by this commit?" or "Which tests are affected by this change?"
  • Extract code examples from existing fragments in files or commits. For instance, generate a snippet to illustrate how to use a specific function or module.
  • Quickly find useful information to solve a technical issue, such as "Which file is responsible for this exception?" or "Which commit introduced this error?"
  • Identify the libraries used and their versions. Example questions include "Which version of Django is being used?" or "Which commits mention outdated dependencies?"
  • Search for changes related to performance optimization with questions like "Which commits optimized this file?" or "Which functions were refactored for better performance?"
  • Identify team members who are most active in certain areas of the project by asking "Who contributes the most to the networking module?" or "What are the primary files in this project?"
  • Create customized reports on the state or evolution of a project. For example, generate a report on the 10 most significant recent commits or list the main modules and the most modified files.
  • Integrate extracted data into CI/CD pipelines. For instance, identify critical files for a specific build task.
  • Compare versions of files or branches using diffs and commits.
  • Identify areas of the code that need documentation or refactoring by asking "Which files lack associated documentation?" or "Which commits mention suboptimal code?"
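
To make one of these concrete, the changelog use case could be sketched by filtering commit entries out of the metadata and assembling a prompt for the generative model. The "type" and "content_preview" field names follow the RAG snippet above; the "commit" type value is an assumption about apophenia's schema:

```python
def build_changelog_prompt(entries, version):
    # Keep only commit-type entries (the "commit" value is an assumption
    # about apophenia's metadata schema, not its documented API).
    commits = [e["content_preview"] for e in entries
               if e.get("type") == "commit" and "content_preview" in e]
    lines = "\n".join(f"- {c}" for c in commits)
    return (f"Write a changelog for release {version} "
            f"from these commit messages:\n{lines}\nChangelog:")

sample = [
    {"type": "commit", "content_preview": "Fix crash when repo has no tags"},
    {"type": "file", "content_preview": "README.md"},
    {"type": "commit", "content_preview": "Add --metadata_path option"},
]
print(build_changelog_prompt(sample, "0.2.1"))
```

The resulting prompt can be fed to `generate_response` from the snippet above.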

If you recognize yourself in one of these examples then Apophenia is for you:

$ pip install apophenia

Going Further with FAISS

You can use the generated FAISS output with LangChain or with other modern libraries such as LlamaIndex.
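
As a hedged sketch of the LangChain route, apophenia's output could be wrapped in LangChain's FAISS vector store; the "content_preview" field name and the docstore layout below are assumptions, not apophenia's documented API:

```python
import json

def load_entries(metadata_path):
    with open(metadata_path, encoding="utf-8") as f:
        return json.load(f)

def to_langchain_store(faiss_path, metadata_path, embedding):
    # Sketch only: wraps apophenia's FAISS index and JSON metadata in
    # LangChain's FAISS vector store. Assumes each metadata entry has a
    # "content_preview" field usable as the document text.
    import faiss
    from langchain_community.docstore.in_memory import InMemoryDocstore
    from langchain_community.vectorstores import FAISS
    from langchain_core.documents import Document

    index = faiss.read_index(faiss_path)
    entries = load_entries(metadata_path)
    docs = {
        str(i): Document(page_content=e.get("content_preview", ""), metadata=e)
        for i, e in enumerate(entries)
    }
    return FAISS(
        embedding_function=embedding,
        index=index,
        docstore=InMemoryDocstore(docs),
        index_to_docstore_id={i: str(i) for i in range(len(entries))},
    )
```

The returned store can then be queried with `similarity_search` or exposed via `as_retriever()` in a LangChain chain.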

What does Apophenia stand for?

Apophenia (/æpoʊˈfiːniə/) is the tendency to perceive meaningful connections between unrelated things.

Apophenia has also come to describe a human propensity to unreasonably seek definite patterns in random information, such as can occur in gambling.

https://en.wikipedia.org/wiki/Apophenia

Project details


Download files


Source Distribution

apophenia-0.2.1.tar.gz (23.7 kB)

Uploaded Source

Built Distribution


apophenia-0.2.1-py3-none-any.whl (23.1 kB)

Uploaded Python 3

File details

Details for the file apophenia-0.2.1.tar.gz.

File metadata

  • Download URL: apophenia-0.2.1.tar.gz
  • Upload date:
  • Size: 23.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for apophenia-0.2.1.tar.gz:

  • SHA256: f0b79db8bdef7b8521e13e226fe2172e9c9d9d4b08d17f98da9f71f1aed81a90
  • MD5: 9a385bbd055bb47e0b0b77d3ef4ff8da
  • BLAKE2b-256: 2aac059aaf10a8e33e65738bcf049f12bbf363765dc4b8af5fcd1051d33b1eab


Provenance

The following attestation bundles were made for apophenia-0.2.1.tar.gz:

Publisher: main.yml on 4383/apophenia

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file apophenia-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: apophenia-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 23.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for apophenia-0.2.1-py3-none-any.whl:

  • SHA256: 0e2cac26561385b39b64e520dc11439c22f071a4be8463d0d71fbf6f5b62b6c7
  • MD5: 55dfc788636d8d12c2d93b4440302f7a
  • BLAKE2b-256: 606fee0dba0e4e0046719e7ef73ce6fb6231aa56b44c7409098d2ee9b9dd1c2a


Provenance

The following attestation bundles were made for apophenia-0.2.1-py3-none-any.whl:

Publisher: main.yml on 4383/apophenia

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
