Extract and structure all the data from a Git repository to make it usable in RAG.

These details have not been verified by PyPI

Project description

Apophenia

Apophenia gives meaning to any existing Git repository.

Apophenia extracts and structures all the data from a Git repository to make it usable in RAG or with AI agents.

Apophenia imposes a meaningful interpretation on a nebulous stimulus (a Git repo).

Usage

Extract data from a given repository:

$ pip install apophenia
$ apophenia https://github.com/4383/niet \
  --faiss_path /tmp/results.faiss \
  --metadata_path /tmp/results.json
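
Before feeding the output to a RAG pipeline, it can help to check what the metadata file contains. The sketch below uses only the standard library. It assumes the JSON file is a list of objects that may carry a `type` key (the snippet in this README reads that key); this is an assumption about the output format, not a documented guarantee. The helper name `summarize_metadata` is hypothetical.

```python
import json
from collections import Counter


def summarize_metadata(metadata_path):
    """Count metadata entries per "type" value (hypothetical helper)."""
    with open(metadata_path, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    # Assumption: the file holds a list of dicts; entries without a
    # "type" key are counted under "unknown".
    return Counter(entry.get("type", "unknown") for entry in metadata)
```

Calling `summarize_metadata("/tmp/results.json")` returns a `Counter`, so `.most_common()` lists the entry types from most to least frequent.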

Then use the generated data in a RAG pipeline (Python snippet example):

import faiss
import json
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FAISS index and the JSON metadata previously generated
def load_index_and_metadata(faiss_path, metadata_path):
    index = faiss.read_index(faiss_path)
    with open(metadata_path, 'r', encoding='utf-8') as f:
        metadata = json.load(f)
    return index, metadata

# Embed the user query
def embed_query(query, model):
    return model.encode(query, convert_to_tensor=True).cpu().numpy()

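# Search in the FAISS index
def search_in_faiss(index, query_embedding, metadata, k=5):
    # FAISS expects a 2-D float32 array of shape (n_queries, dim)
    query = np.asarray([query_embedding], dtype=np.float32)
    distances, indices = index.search(query, k)
    results = []
    for distance, idx in zip(distances[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than k vectors exist
            continue
        result = dict(metadata[idx])  # copy so the loaded metadata is not mutated
        result['distance'] = float(distance)
        results.append(result)
    return results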

# Build a prompt for a generative model
def build_prompt(query, retrieved_info):
    prompt = "Answer the following question based on the retrieved information:\n\n"
    prompt += f"Question: {query}\n\n"
    prompt += "Retrieved Information:\n"
    for info in retrieved_info:
        content_type = info.get("type", "unknown")
        content_preview = info.get("content_preview", "No preview available")
        prompt += f"- {content_type.upper()}: {content_preview}\n"
    prompt += "\nYour Answer:"
    return prompt

# Generate a response with a generative model
def generate_response(prompt, model_name="EleutherAI/gpt-neo-125M", max_new_tokens=200):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_new_tokens bounds only the generated text; max_length would also count the prompt
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)


def run_rag_system(query, faiss_path, metadata_path, embedding_model_name, generative_model_name):
    # Load the FAISS index, the metadata, and the embedding model
    index, metadata = load_index_and_metadata(faiss_path, metadata_path)
    embedding_model = SentenceTransformer(embedding_model_name)

    query_embedding = embed_query(query, embedding_model)

    # Search in FAISS
    retrieved_info = search_in_faiss(index, query_embedding, metadata)

    prompt = build_prompt(query, retrieved_info)

    response = generate_response(prompt, model_name=generative_model_name)

    return response, retrieved_info

if __name__ == "__main__":
    # Configuration
    # Paths must match the ones passed to the apophenia command above
    FAISS_PATH = "/tmp/results.faiss"
    METADATA_PATH = "/tmp/results.json"
    EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
    GENERATIVE_MODEL_NAME = "EleutherAI/gpt-neo-125M"

    query = "How does the authentication system work in this repository?"

    response, retrieved_info = run_rag_system(
        query=query,
        faiss_path=FAISS_PATH,
        metadata_path=METADATA_PATH,
        embedding_model_name=EMBEDDING_MODEL_NAME,
        generative_model_name=GENERATIVE_MODEL_NAME
    )

    print("Generated Response:")
    print(response)
    print("\nRetrieved Information:")
    for info in retrieved_info:
        print(info)
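
The retrieval step above returns the k nearest entries regardless of how relevant they are. Here is a minimal sketch of a post-filter that could run before `build_prompt`. It assumes each result is a dict carrying the `distance` added by `search_in_faiss` and an optional `type` key. The helper name `filter_results` is hypothetical, and the code assumes a distance metric where lower means closer (such as L2); a sensible threshold depends on the index metric and the embedding model.

```python
def filter_results(results, max_distance=None, allowed_types=None):
    """Keep results that are close enough and of a wanted type (hypothetical helper)."""
    kept = []
    for result in results:
        if max_distance is not None and result["distance"] > max_distance:
            continue
        if allowed_types is not None and result.get("type", "unknown") not in allowed_types:
            continue
        kept.append(result)
    # Smallest distance first (assumes lower distance means a closer match)
    return sorted(kept, key=lambda r: r["distance"])
```

For example, `filter_results(retrieved_info, allowed_types={"commit"})` keeps only entries whose `type` is `commit`, which suits history-oriented questions.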

Why use Apophenia?

Here’s a list of potential applications for using the generated results (FAISS vectors and JSON metadata) within a RAG (Retrieval-Augmented Generation) system:

1. Augmented Documentation

Generate enriched answers by combining documentation, commit messages, and code.
Example questions:
- "How do I use the authenticate_user function?"
- "What is the structure of this project?"

2. Developer Assistance

Quickly search for specific parts of the code or documentation.
Identify relevant functions or files based on queries like:
- "Where is the authentication logic implemented?"
- "Which module handles network connections?"

3. Intelligent Debugging

Retrieve historical information to understand bugs or errors.
Analyze recent changes with queries like:
- "What are the latest modifications in this file?"
- "Which commits mention this bug?"

4. Automated Changelog Generation

Create a changelog based on commit messages and their diffs.
Example use case:
- Automatically generate a structured changelog for a new release.
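
As an illustration of this use case, here is a minimal sketch that turns metadata entries into a plain-text changelog. It assumes commit entries are marked with `"type": "commit"` and carry a `content_preview` holding the commit message. Both keys appear in the usage snippet of this README, but their exact contents are an assumption, and `build_changelog` is a hypothetical helper.

```python
def build_changelog(metadata, title="Changelog"):
    """Render commit entries as a bulleted changelog (hypothetical helper)."""
    lines = [title, "=" * len(title), ""]
    for entry in metadata:
        # Assumption: commits are tagged with type == "commit"
        if entry.get("type") != "commit":
            continue
        preview = entry.get("content_preview", "").strip()
        if not preview:
            continue
        # Keep only the first line, which is conventionally the commit subject
        lines.append(f"- {preview.splitlines()[0]}")
    return "\n".join(lines)
```

In a full RAG setup the resulting text could be passed to the generative model to be grouped or summarized; the sketch covers only the deterministic part.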

5. Code Migration and Modernization

Identify outdated dependencies or technologies.
Plan migrations by answering queries like:
- "Which files are using Eventlet?"
- "Which commits introduced asyncio?"

6. Audit and Compliance

Search for changes related to vulnerabilities or critical dependencies.
Example questions:
- "Which files use OpenSSL?"
- "Which commits fixed vulnerabilities?"

7. Documentation Generation

Generate guides or technical manuals from existing code and documentation fragments.
Example:
- Create an installation guide from README files and configuration scripts.

8. Git History Analysis

Understand individual contributions or file evolution.
Example questions:
- "Who wrote this function?"
- "What are John Doe's contributions?"

9. Keyword or Contextual Search

Search for specific concepts within the project:
- "Where is the caching logic handled?"
- "Which files mention secure connections?"

10. Onboarding Assistance

Simplify onboarding for new developers:
- Provide guided answers like:
  - "The main features of this project are documented in README.md."
  - "auth.py handles authentication logic."

11. Change Impact Analysis

Identify which files or functions are impacted by a specific commit.
Example questions:
- "Which files were modified by this commit?"
- "Which tests are affected by this change?"

12. Code Example Generation

Extract code examples from existing fragments in files or commits.
Example use case:
- Generate a snippet to illustrate how to use a specific function or module.

13. Technical Problem Solving

Quickly find useful information to solve a technical issue.
Example questions:
- "Which file is responsible for this exception?"
- "Which commit introduced this error?"

14. Dependency Auditing

Identify the libraries used and their versions.
Example questions:
- "Which version of Django is being used?"
- "Which commits mention outdated dependencies?"

15. Performance Analysis

Search for changes related to performance optimization.
Example questions:
- "Which commits optimized this file?"
- "Which functions were refactored for better performance?"

16. Contribution Analysis for Project Management

Identify team members who are most active in certain areas of the project.
Example questions:
- "Who contributes the most to the networking module?"
- "What are the primary files in this project?"

17. Technical Report Generation

Create customized reports on the state or evolution of a project.
Examples:
- Report on the 10 most significant recent commits.
- List of main modules and the most modified files.

18. Automating DevOps Workflows

Integrate extracted data into CI/CD pipelines.
Example:
- Identify critical files for a specific build task.

19. Comparative Analysis

Compare versions of files or branches using diffs and commits.

20. Continuous Improvement

Identify areas of the code that need documentation or refactoring.
Example questions:
- "Which files lack associated documentation?"
- "Which commits mention suboptimal code?"

If you recognize yourself in one of these examples, then Apophenia is for you.

Going Further with FAISS

You can use the generated FAISS output with LangChain or with any other modern library such as LlamaIndex.

Project details

Release history

0.2.1

Nov 26, 2024

0.1.6

Nov 26, 2024

0.1.5

Nov 25, 2024

0.1.4

Nov 25, 2024

0.1.3

Nov 25, 2024

This version

0.1.2

Nov 25, 2024

0.1.1

Nov 25, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

apophenia-0.1.2.tar.gz (21.6 kB view details)

Uploaded Nov 25, 2024 Source

Built Distribution

apophenia-0.1.2-py3-none-any.whl (19.1 kB view details)

Uploaded Nov 25, 2024 Python 3

File details

Details for the file apophenia-0.1.2.tar.gz.

File metadata

Download URL: apophenia-0.1.2.tar.gz
Upload date: Nov 25, 2024
Size: 21.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for apophenia-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`8d2b6907f1fd56e75d0b735dc94f924f33977273a2ee7db096188478b5ffec3c`
MD5	`c90a1eb70a30e47910164b5021e890e6`
BLAKE2b-256	`8d40a61232d57dd0683f75410f8553d280a51452c23ff047bce5520366be9ee0`

Provenance

The following attestation bundles were made for apophenia-0.1.2.tar.gz:

Publisher: main.yml on 4383/apophenia

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: apophenia-0.1.2.tar.gz
- Subject digest: 8d2b6907f1fd56e75d0b735dc94f924f33977273a2ee7db096188478b5ffec3c
- Sigstore transparency entry: 151395065
- Sigstore integration time: Nov 25, 2024
Source repository:
- Permalink: 4383/apophenia@9c5cdaef3cc540bdda4883e4ba1e3dff9712cc6e
- Branch / Tag: refs/tags/0.1.2
- Owner: https://github.com/4383
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: main.yml@9c5cdaef3cc540bdda4883e4ba1e3dff9712cc6e
- Trigger Event: push

File details

Details for the file apophenia-0.1.2-py3-none-any.whl.

File metadata

Download URL: apophenia-0.1.2-py3-none-any.whl
Upload date: Nov 25, 2024
Size: 19.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for apophenia-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ec2026fe03d2ae1643128f53dc25778c82b0f16c9b01b8b401d26de554bb1a7f`
MD5	`2d165627d2a455847d7fdf55415a402e`
BLAKE2b-256	`6ac960ecbe62f9610462841e429372760873a017f2304ef108b9ddc269b16959`

Provenance

The following attestation bundles were made for apophenia-0.1.2-py3-none-any.whl:

Publisher: main.yml on 4383/apophenia

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: apophenia-0.1.2-py3-none-any.whl
- Subject digest: ec2026fe03d2ae1643128f53dc25778c82b0f16c9b01b8b401d26de554bb1a7f
- Sigstore transparency entry: 151395077
- Sigstore integration time: Nov 25, 2024
Source repository:
- Permalink: 4383/apophenia@9c5cdaef3cc540bdda4883e4ba1e3dff9712cc6e
- Branch / Tag: refs/tags/0.1.2
- Owner: https://github.com/4383
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: main.yml@9c5cdaef3cc540bdda4883e4ba1e3dff9712cc6e
- Trigger Event: push
