Code Ingestion Service

A production-ready Python service for intelligently chunking source code and ingesting it into RAG (Retrieval-Augmented Generation) pipelines. Features optimized performance, Pinecone vector database integration, and a powerful CLI for both public and private repositories.

Author: Sandeep G
Copyright: © 2025 Sandeep G
License: Apache License 2.0 - see LICENSE file for details

🚀 Features

Core Capabilities

  • Smart Code Chunking: Single-pass CST analysis for structure-aware splitting of source files
  • Pluggable Architecture: Swap embedding providers (Nomic, OpenAI) and vector stores (Pinecone, Weaviate)
  • High Performance: Optimized batching to cut per-request overhead during embedding generation (see the sketch after this list)
  • Production-Ready CLI: Unified interface with provider selection and verbose logging
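
The batching behind this is a simple loop over fixed-size slices; a minimal sketch, assuming a generic embed_batch callable rather than this package's internal API:

from typing import Callable, List

def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,  # hypothetical default; tune per provider
) -> List[List[float]]:
    """Embed texts in fixed-size batches to cut per-request overhead."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_batch(batch))  # one provider call per batch
    return vectors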

Language & Platform Support

  • Java: Full CST parsing with REST API detection and method extraction
  • Git Integration: Shallow clone support for public/private repositories (see the sketch after this list)
  • Multiple Providers: Nomic (default), OpenAI, HuggingFace embedding support
  • Vector Stores: Pinecone (default), with extensible architecture
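
For reference, the shallow-clone step can be reproduced with plain git; a minimal sketch using subprocess (not this package's internal implementation):

from pathlib import Path
import subprocess
import tempfile

def shallow_clone(repo_url: str) -> Path:
    """Clone only the latest commit of a repository into a temporary directory."""
    target = Path(tempfile.mkdtemp(prefix="ingest-"))
    # --depth 1 fetches a single commit, which is all that chunking needs
    subprocess.run(["git", "clone", "--depth", "1", repo_url, str(target)], check=True)
    return target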

Performance & Reliability

  • Optimized Processing: Single-pass CST traversal, method-level context caching
  • Smart Filtering: Include/exclude patterns for selective ingestion
  • Error Handling: Robust processing with cleanup and validation (see the cleanup sketch after this list)
  • Test Coverage: Comprehensive test suite for all components
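
Cleanup is the kind of behaviour a try/finally covers; a minimal sketch of the pattern, generic and with a caller-supplied processing step rather than this package's pipeline:

import shutil
import tempfile
from typing import Callable

def with_temp_workdir(process: Callable[[str], None], cleanup: bool = True) -> None:
    """Hand `process` a scratch directory and remove it afterwards if asked."""
    workdir = tempfile.mkdtemp(prefix="ingest-")
    try:
        process(workdir)  # e.g. clone, chunk, embed, upsert
    finally:
        if cleanup:
            shutil.rmtree(workdir, ignore_errors=True)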

🛠️ Core Components

CodeChunker

The main orchestrator for code chunking operations, responsible for:

  • Parsing source code
  • Analyzing Concrete Syntax Trees (CST)
  • Applying chunking strategies
  • Generating structured code chunks with metadata

Key Features

  • Package & Import Handling: Preserves context by maintaining package and import statements
  • Class-Level Chunking: Creates complete class chunks when appropriate
  • Method-Level Chunking: Breaks down classes into method-level chunks when needed (see the sketch after this list)
  • Intelligent ID Generation: Creates unique identifiers for each code chunk
  • Metadata Management: Tracks comprehensive metadata for each chunk
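
A minimal sketch of the class-vs-method decision, with hypothetical names and a simple size threshold (the package's own heuristics may differ):

from dataclasses import dataclass, field
from hashlib import sha1
from typing import Dict, List

@dataclass
class CodeChunk:
    chunk_id: str
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)

def chunk_java_class(
    file_path: str,
    package_and_imports: str,        # preserved so every chunk keeps its context
    class_source: str,
    method_sources: Dict[str, str],  # method name -> source, taken from the CST
    max_chunk_chars: int = 4000,     # hypothetical threshold
) -> List[CodeChunk]:
    """Emit one class-level chunk when the class fits, otherwise one chunk per method."""
    def make_chunk(unit: str, body: str) -> CodeChunk:
        text = f"{package_and_imports}\n\n{body}"
        chunk_id = sha1(f"{file_path}:{unit}".encode()).hexdigest()[:16]
        return CodeChunk(chunk_id, text, {"file": file_path, "unit": unit})

    if len(class_source) <= max_chunk_chars:
        return [make_chunk("class", class_source)]
    return [make_chunk(name, src) for name, src in method_sources.items()]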

🏗️ Architecture

The service follows a pluggable architecture with these main components (an interface sketch follows the list):

  1. Orchestration: Coordinates the complete ingestion pipeline
  2. Chunkers: Handle intelligent code splitting with CST analysis
  3. Embedding Providers: Generate embeddings (Nomic, OpenAI, HuggingFace)
  4. Vector Stores: Store embeddings (Pinecone, Weaviate, Qdrant)
  5. Data Models: Define structured representations for chunks and metadata
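
The pluggable pieces reduce to two small interfaces; a minimal sketch with hypothetical class and method names (the package's actual abstractions may differ):

from abc import ABC, abstractmethod
from typing import Dict, List

class EmbeddingProvider(ABC):
    """Turns chunk text into vectors (Nomic, OpenAI, HuggingFace, ...)."""

    @abstractmethod
    def embed(self, texts: List[str]) -> List[List[float]]: ...

class VectorStore(ABC):
    """Persists vectors and chunk metadata (Pinecone, Weaviate, Qdrant, ...)."""

    @abstractmethod
    def upsert(self, ids: List[str], vectors: List[List[float]],
               metadata: List[Dict[str, str]]) -> None: ...

# Swapping providers then means constructing a different implementation and
# handing it to the orchestrator; the chunking code does not change.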

💻 Usage

CLI Usage (Recommended)

The CLI provides a simple interface for ingesting repositories into your RAG pipeline:

Basic Usage

# Ingest a local repository
code-ingestion /path/to/your/repo

# Ingest a public GitHub repository
code-ingestion https://github.com/spring-projects/spring-boot

# Ingest with file filtering
code-ingestion https://github.com/kdn251/interviews \
  --include "**/*.java" \
  --exclude "**/test/**" \
  --max-files 50

Advanced Filtering

# Include specific folders and file types
code-ingestion https://github.com/kdn251/interviews \
  --include "company/**/*.java" \
  --include "leetcode/**/*.java" \
  --include "cracking-the-coding-interview/**/*.java" \
  --max-files 30

# Exclude unwanted directories
code-ingestion https://github.com/spring-projects/spring-boot \
  --include "**/*.java" \
  --exclude "**/test/**" \
  --exclude "**/build/**" \
  --exclude "**/target/**"

Provider Selection

# Use different embedding providers
code-ingestion /path/to/repo --embedding-provider nomic     # Default
code-ingestion /path/to/repo --embedding-provider openai
code-ingestion /path/to/repo --embedding-provider huggingface

# Use different vector stores  
code-ingestion /path/to/repo --vector-store pinecone        # Default
code-ingestion /path/to/repo --vector-store weaviate

# Enable detailed logging and progress
code-ingestion /path/to/repo --verbose

CLI Options

  • --embedding-provider: Choose embedding provider (nomic, openai, huggingface)
  • --vector-store: Choose vector store (pinecone, weaviate, qdrant)
  • --verbose: Enable detailed logging and progress reports
  • --include: File patterns to include (supports glob patterns like **/*.java; see the filtering sketch after this list)
  • --exclude: File patterns to exclude (default excludes test, build, node_modules, etc.)
  • --max-files: Limit number of files processed (useful for large repos)
  • --cleanup/--no-cleanup: Control temporary file cleanup (default: cleanup enabled)
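
Include/exclude selection behaves roughly like the glob matching sketched below (a stdlib approximation, not necessarily the exact rules the CLI applies):

from fnmatch import fnmatch
from pathlib import Path
from typing import List, Optional

def select_files(root: str, include: List[str], exclude: List[str],
                 max_files: Optional[int] = None) -> List[Path]:
    """Keep files that match any include pattern and no exclude pattern."""
    selected: List[Path] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if not any(fnmatch(rel, pat) for pat in include):
            continue
        if any(fnmatch(rel, pat) for pat in exclude):
            continue
        selected.append(path)
        if max_files is not None and len(selected) >= max_files:
            break
    return selected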

GitHub Actions Integration (Recommended)

Create automated ingestion workflows for your repositories:

# .github/workflows/ingest-code.yml
name: Ingest Codebase to RAG
on:
  workflow_dispatch:  # Manual trigger
  push:
    branches: [ main ]  # Auto-trigger on main branch updates
  
jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install Code Ingestion Service
        run: |
          pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main
      
      - name: Ingest Current Repository
        run: |
          code-ingestion ./ \
            --include "**/*.java" \
            --include "**/*.py" \
            --exclude "**/test/**" \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}

  # Optional: Ingest external repositories
  ingest-external:
    runs-on: ubuntu-latest
    steps:
      - name: Install Code Ingestion Service
        run: |
          pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main
          
      - name: Ingest Spring Boot (example)
        run: |
          code-ingestion https://github.com/spring-projects/spring-boot \
            --include "**/*.java" \
            --max-files 200 \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}

Multi-Repository Workflow

# Ingest multiple repositories in one workflow
name: Build Knowledge Base
on:
  workflow_dispatch:
  schedule:
    - cron: '0 2 * * 0'  # Weekly on Sunday at 2 AM

jobs:
  ingest-repositories:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        repo: 
          - 'https://github.com/spring-projects/spring-boot'
          - 'https://github.com/apache/kafka' 
          - 'your-org/internal-repo'
    steps:
      - name: Install Code Ingestion Service
        run: |
          pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main
          
      - name: Ingest Repository
        run: |
          code-ingestion ${{ matrix.repo }} \
            --include "**/*.java" \
            --exclude "**/test/**" \
            --max-files 500 \
            --verbose
        env:
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
          PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}

Local Development & Testing

You can test the service locally before setting up CI/CD:

# Test with a local repository
code-ingestion /path/to/your/local/repo --verbose

# Test with a public repository
code-ingestion https://github.com/spring-projects/spring-boot \
  --include "**/*.java" \
  --max-files 50 \
  --verbose

# Test different providers
code-ingestion /path/to/repo \
  --embedding-provider openai \
  --vector-store pinecone \
  --verbose

⚙️ Setup

Prerequisites

  • Python 3.13+
  • Pinecone account and API key
  • Git (for repository cloning)

Installation

Option 1: Direct Installation (Recommended)

# Install directly from GitHub
pip install git+https://github.com/sandeepgovi/code-ingestion-service.git@main

# Test the installation
code-ingestion --help

Option 2: Development Installation

# Clone and install in development mode
git clone https://github.com/sandeepgovi/code-ingestion-service
cd code-ingestion-service
pip install -e .

# Test the installation  
code-ingestion --help

Setup Environment Variables

# Set up required environment variables
export PINECONE_API_KEY=your_api_key_here
export PINECONE_INDEX_NAME=your_index_name

# Or create .env file
echo "PINECONE_API_KEY=your_api_key_here" > .env
echo "PINECONE_INDEX_NAME=your_index_name" >> .env

🔒 Security

Environment Variables

  • Never commit .env files or hardcoded secrets to version control
  • Use the provided .env.example as a template
  • Store API keys and sensitive configuration in environment variables only

Recommended Security Practices

  • Review Dependencies: Regularly audit dependencies for vulnerabilities
  • Access Control: Limit repository access when processing private repositories
  • API Keys: Use read-only API keys when possible, rotate keys regularly
  • Local Processing: Sensitive code processing happens locally before embedding

Supported Environment Variables

# Pinecone (default vector store)
PINECONE_API_KEY=your_pinecone_api_key        # Required for Pinecone integration
PINECONE_INDEX_NAME=your_index_name           # Required for Pinecone integration
PINECONE_BATCH_SIZE=100                       # Optional: batch size for uploads

# Embedding providers
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5  # Default model
OPENAI_API_KEY=your_openai_key                # Required for OpenAI embeddings

# Other vector stores (optional)
WEAVIATE_URL=http://localhost:8080            # For Weaviate vector store
QDRANT_URL=http://localhost:6333              # For Qdrant vector store

Security Audit

Run dependency vulnerability checks:

# Install security audit tool
pip install pip-audit

# Check for vulnerabilities
pip-audit

# Or using pipenv
pipenv check

🤝 Contributing

Contributions are welcome! Feel free to submit pull requests or open issues for:

  • Bug fixes
  • New features
  • Documentation improvements
  • Additional language support

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Attribution Requirements

When using this software, you must:

  • Include the copyright notice and license in any copy or substantial portion of the software
  • State any significant changes made to the original code
  • Include attribution to the original author (Sandeep G) in derivative works
