🚀 Builds a structured markdown knowledge base from external sources such as websites, documents, and GitHub repos with large language models. Ideal for RAG, SEO-friendly LLM contexts (/llms.txt), and chatbots.
🧠 Multi-Source Knowledge Base Builder for LLMs
Python package that transforms diverse content into a powerful, structured knowledge base. This tool seamlessly ingests files (PDF, DOCX, spreadsheets, Markdown, plaintext), websites (HTML pages, sitemaps, XML content), and GitHub repositories, processes them with state-of-the-art large language models (Gemini, GPT-4o, Claude), and produces a comprehensive Markdown knowledge base. Perfect for creating web-crawlable /llms.txt files, powering RAG applications, preprocessing content for vector databases, or building specialized chatbots. Under the hood, documents are preprocessed concurrently through a semaphore-limited pool of asynchronous LLM calls and then merged into a single cohesive knowledge base.
✨ Features
- 📄 Document ingestion – Downloads local or remote documents and extracts structured text.
- 🌐 Website ingestion – Crawls pages from a sitemap or list of pages and extracts clean HTML content.
- 📘 GitHub integration – Fetches Markdown files from public repositories.
- 🧠 LLM-powered summarization – Uses state-of-the-art models to convert raw data into readable, structured Markdown.
- 🔁 Knowledge merging – Combines multiple knowledge base sections into a single cohesive document.
- 🔄 Multiple model providers – Choose between Google Gemini, OpenAI GPT-4o, or Anthropic Claude 3.7 Sonnet.
- ⚡ Performance – Loads files in parallel and issues multiple asynchronous LLM calls to summarize documents.
🚀 Installation
Install from PyPI

```shell
pip install knowledge-base-builder
```

Install from Source

```shell
git clone https://github.com/kostadindev/knowledge-base-builder.git
cd knowledge-base-builder
pip install -e .
```
🚀 Quickstart
1. Set up your .env file
Create a .env file in your project directory with the following variables (add the API keys for the models you intend to use):
```shell
# You need only one of the following
GOOGLE_API_KEY=your_google_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Optional: raises the GitHub API rate limit when fetching repositories
GITHUB_API_KEY=your_github_api_key_here
```
2. Use as a Python Package
```python
import os
from dotenv import load_dotenv
from knowledge_base_builder import KBBuilder

# Load environment variables
load_dotenv()

# API and model configuration
config = {
    'GOOGLE_API_KEY': os.getenv("GOOGLE_API_KEY"),        # For Gemini | free key at https://aistudio.google.com/app/apikey
    'OPENAI_API_KEY': os.getenv("OPENAI_API_KEY"),        # For GPT-4o
    'ANTHROPIC_API_KEY': os.getenv("ANTHROPIC_API_KEY"),  # For Claude
}

# Source documents - unified approach
sources = {
    # Unified files list - automatically detects and processes each file type
    'files': [
        # PDF documents - remote
        "https://kostadindev.github.io/static/documents/cv.pdf",
        "https://kostadindev.github.io/static/documents/sbu_transcript.pdf",
        # Local file path (no file:/// prefix needed)
        "C:/Users/kosta/OneDrive/Desktop/MS Application Materials/emf-ellipse-publication.pdf",
        # Web pages
        "https://kostadindev.github.io/index.html",
        "https://kostadindev.github.io/projects.html",
        # Add other file types as needed
        # "https://example.com/data.csv",
        # "path/to/local/document.docx",  # relative local path example
        # "https://example.com/api-docs.json",
    ],

    # Process all pages from a sitemap
    'sitemap_url': "https://kostadindev.github.io/sitemap.xml",

    # GitHub repositories to process (format: username/repo or full URL)
    'github_repositories': [
        "https://github.com/kostadindev/Knowledge-Base-Builder",
        "https://github.com/kostadindev/GONEXT",
        "https://github.com/kostadindev/GONEXT-ML",
        "https://github.com/kostadindev/meta-me",
        "https://github.com/kostadindev/Recursive-QA",
        "https://github.com/kostadindev/deep-gestures",
        "https://github.com/kostadindev/emf-ellipse",
    ],
}

# Create the KB builder
kbb = KBBuilder(config)

# Build the knowledge base
kbb.build(sources=sources, output_file="final_knowledge_base.md")
```
🔧 Supported Sources
| Source Type | Description | Formats |
|---|---|---|
| Documents | Text documents | PDF, DOCX, TXT, MD, RTF |
| Spreadsheets | Tabular data | CSV, TSV, XLSX, ODS |
| Web Content | Structured web data | HTML, XML, JSON, YAML/YML |
| Websites | Live web pages | Any URL or sitemap |
| GitHub | Repository content | Markdown files from public repos |
All sources can be added through the unified `files` parameter, with automatic format detection.
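The "automatic format detection" behind the unified `files` parameter can be pictured as a simple extension-based dispatch. This is only an illustrative sketch; the handler names and the `detect_format` function are hypothetical, not the package's actual API:

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a handler name.
HANDLERS = {
    ".pdf": "pdf", ".docx": "docx", ".txt": "text", ".md": "markdown",
    ".csv": "spreadsheet", ".tsv": "spreadsheet", ".xlsx": "spreadsheet",
    ".html": "web", ".xml": "web", ".json": "web", ".yaml": "web", ".yml": "web",
}

def detect_format(source: str) -> str:
    """Pick a handler from the URL path or local file extension."""
    path = urlparse(source).path if source.startswith("http") else source
    return HANDLERS.get(Path(path).suffix.lower(), "unknown")
```

Remote URLs and local paths go through the same lookup, which is what lets one list mix both.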
🧠 LLM Providers
| Provider | Models | Features |
|---|---|---|
| Google Gemini | gemini-2.0-flash (default) | Free tier, fast, large context window, cost-effective summaries |
| OpenAI | gpt-4o (default) | High-quality summaries, strong reasoning |
| Anthropic | claude-3-7-sonnet (default) | High-quality summaries, excellent formatting |
Recommended Provider: Google Gemini
Google Gemini is the recommended provider: a free development API key is available, and the model is fast, has a large context window, and performs well on benchmarks. Get your free API key at Google AI Studio.
📥 Output Example
```markdown
# Resume Summary

## Education
- B.S. in Computer Science from XYZ University

## Experience
- Software Engineer at ABC Corp
  - Developed NLP-based document parsers...

---

# Website Summary

## Project Pages
- **Project Alpha**: A machine learning system for ...
- **Blog Post**: How to use Gemini with LangChain ...
```
🔍 Applications
Web Crawlable LLM Context Enhancement
- /llms.txt: Generate a compact, web-crawlable context file (typically 10-20KB) that allows LLMs to access your personal or organizational information during web searches.
- /llms-full.txt: Create an expanded knowledge file (50-100KB) with comprehensive details about your work, expertise, and content that search-powered LLMs can index.
- Web Context Sources: Enable web search LLMs like Perplexity, ChatGPT, Claude, and Gemini to discover and reference your structured information during user queries.
RAG Applications
- Vector Database Preprocessing: Generate clean, structured content before embedding into vector stores like Pinecone, Chroma, or Weaviate, improving retrieval quality.
- Single-Context LLM Applications: Provide a comprehensive knowledge base that fits within a single LLM context window (up to 128K tokens) for domain-specific assistants.
- Hybrid RAG Systems: Combine the full knowledge base with selective vector retrieval for specialized question answering systems with reduced hallucination.
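Whether a generated knowledge base fits a single context window can be estimated with the common ~4 characters-per-token heuristic for English text. This is a rough rule of thumb, not an exact tokenizer:

```python
def fits_context(markdown: str, window_tokens: int = 128_000) -> bool:
    """Rough check using the ~4 chars/token heuristic for English text."""
    estimated_tokens = len(markdown) / 4
    return estimated_tokens <= window_tokens

# A 100 KB knowledge base is roughly 25K tokens -- well within 128K.
```

For precise counts, each provider's own tokenizer should be used instead.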
Personal Knowledge Management
- Professional Portfolio: Create a comprehensive knowledge base integrating your resume, publications, projects, and online presence into a single searchable document.
- Academic Research: Compile research papers, conference proceedings, and citations into a structured knowledge base for literature reviews or thesis preparation.
- Technical Documentation: Consolidate documentation across multiple GitHub repositories, technical blogs, and API references into a unified technical manual.
Enterprise Use Cases
- Company Knowledge Base: Consolidate internal documentation, product specifications, and team information into an easily updatable central resource.
- Customer Support: Transform support tickets, FAQs, and product manuals into a comprehensive knowledge base for support agents or automated systems.
- Competitive Intelligence: Build a structured repository of competitor information from various public sources, updated periodically with the latest data.
- Candidate Evaluation: Generate comprehensive profiles of job candidates by compiling their GitHub contributions, research papers, portfolio, and online presence.
- Onboarding Acceleration: Create personalized knowledge bases for new employees containing company policies, codebase documentation, and team information.
🌲 Algorithm
The knowledge base builder uses a two-step approach for efficient processing:
1. **Parallel Preprocessing**
   - All documents are preprocessed concurrently into structured KBs
   - Uses a semaphore to limit concurrent LLM requests
   - Each document is converted into a well-formatted Markdown knowledge base
   - Optimized for parallel processing with controlled concurrency
2. **Single Merge**
   - All preprocessed KBs are merged in a single operation
   - Maintains logical structure and organization
   - Reduces total LLM calls compared to recursive approaches
   - More predictable memory usage
This approach provides several advantages:
- Fewer total LLM calls (one per document + one final merge)
- Better parallelization of preprocessing
- More predictable memory usage
- Simpler and more maintainable code
- Faster overall processing time
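The two-step approach can be sketched with asyncio. Here `summarize` is a purely illustrative stand-in for an LLM call, and the merge is simulated by joining sections (a real merge would be one more LLM call):

```python
import asyncio

SEM = asyncio.Semaphore(8)  # default concurrency limit from the docs

async def summarize(text: str) -> str:
    """Placeholder for an LLM call that turns raw text into a Markdown KB."""
    async with SEM:              # step 1: throttled, parallel preprocessing
        await asyncio.sleep(0)   # simulate network latency
        return f"## KB for: {text}"

async def build_kb(documents: list[str]) -> str:
    kbs = await asyncio.gather(*(summarize(d) for d in documents))
    # Step 2: one final merge over all preprocessed KBs.
    return "\n\n---\n\n".join(kbs)

final = asyncio.run(build_kb(["resume.pdf", "index.html"]))
```

With N documents this costs N preprocessing calls plus one merge call, matching the "one per document + one final merge" count above.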
⚡ Concurrency Model
The knowledge base builder implements a multi-layer concurrency model to maximize performance while maintaining stability:
1. File Processing Concurrency
```python
# Multiple files processed simultaneously
tasks = [process_file_async(file) for file in files]
await asyncio.gather(*tasks)
```
- Enables parallel processing of multiple files
- Each file type (PDF, DOCX, web page, etc.) is processed independently
- Significantly reduces total processing time for multiple files
2. CPU-Bound Operations
```python
# Blocking download and extraction run in worker threads
path = await asyncio.to_thread(processor.download, url)
text = await asyncio.to_thread(processor.extract_text, path)
```
- Downloads and text extraction run in separate threads
- Prevents blocking the event loop during I/O operations
- Optimizes CPU utilization across cores
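A self-contained version of the `asyncio.to_thread` pattern, with a stand-in extraction function in place of the package's real processor:

```python
import asyncio

def extract_text(path: str) -> str:
    """Stand-in for a blocking download/extraction step."""
    return f"text from {path}"

async def process(paths: list[str]) -> list[str]:
    # Each blocking call runs in a worker thread, keeping the event loop free.
    return await asyncio.gather(
        *(asyncio.to_thread(extract_text, p) for p in paths)
    )

texts = asyncio.run(process(["a.pdf", "b.docx"]))
```

Results come back in input order, even though the threads may finish in any order.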
3. LLM Processing Concurrency
```python
# Controlled concurrent LLM API calls
async with self._sem:
    result = await self.llm_client.run_async(prompt)
```
- Uses a semaphore to limit concurrent LLM API calls
- Prevents overwhelming the LLM API
- Helps stay within API rate limits
- Default concurrency limit: 8 simultaneous requests
4. Final KB Merging
```python
# Concurrent preprocessing followed by a single merge
tasks = [preprocess_text_async(text) for text in texts]
preprocessed_kbs = await asyncio.gather(*tasks)
final_kb = await merge_all_kbs(preprocessed_kbs)
```
- Preprocesses all documents concurrently
- Merges them into a final knowledge base
- Optimizes both speed and memory usage
Performance Considerations
- Resource Management: CPU-bound operations don't block the event loop
- Rate Limiting: LLM API calls are properly throttled
- Scalability: System can handle many files without performance degradation
- Constraints:
  - LLM concurrency limit (default: 8)
  - System resources (CPU, memory, network)
  - LLM API rate limits
⚠️ Limitations
Memory Usage
- Document Processing: Each document is loaded into memory during processing
- LLM Context Windows: Different models have different context window limits:
  - Gemini 2.0 Flash: 1M tokens
  - GPT-4o: 128K tokens
  - Claude 3.7 Sonnet: 200K tokens
- Merge Operations: Final merge operation requires all preprocessed KBs in memory
- Recommendation: Monitor memory usage when processing large documents or many files
Processing Time
- I/O Operations: Each file requires multiple I/O operations:
  - Downloading/reading the file
  - Text extraction
  - LLM API calls
- LLM Latency: Each document requires at least one LLM call:
  - One call per document for preprocessing
  - One final call for merging
Rate Limits
- LLM API Limits: Each provider has different rate limits
- GitHub API: 60 requests per hour (unauthenticated)
- Web Scraping: Some websites may block rapid requests
🧪 Future Improvements
Data Sources Expansion
- Cloud Integration: Add support for Google Drive, Dropbox, and OneDrive documents
- Social & Professional: Add LinkedIn profiles, Twitter feeds, and Medium articles integration
- Academic Sources: Connect to arXiv, Google Scholar, and research databases
Performance Optimizations
- Parallel Processing: Improve multi-document processing with adaptive concurrency control
- Merge Algorithm: Enhance the logarithmic-depth merge tree for better memory efficiency
- Streaming Processing: Implement document streaming for reduced memory footprint
Output & Integration
- Vector DB Export: Direct export to Pinecone, Chroma, Weaviate, and other vector databases
- LangChain Integration: Simplified integration with LangChain for RAG applications
- Custom Schemas: User-definable output schemas for specialized knowledge base formats
Advanced Features
- Incremental Updates: Support for updating existing knowledge bases with new content
- Multi-language Support: Process and merge content across different languages
- Custom Taxonomies: Allow users to define custom categorization schemas for content organization
Performance & Limitations Improvements
- **Memory Optimization**:
  - Implement streaming document processing to reduce memory footprint
  - Add chunking for documents exceeding context windows
  - Develop a smart caching system for processed documents
  - Add memory usage monitoring and automatic batch sizing
- **Processing Speed**:
  - Implement progressive document loading
  - Develop smart retry mechanisms for failed operations
- **Rate Limit Management**:
  - Add automatic rate limit detection and adaptation
  - Implement a smart queuing system for API calls
  - Add support for rotating multiple API keys
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details on how to get started.
🐛 Bug Reports
Found a bug? Please check our Contributing Guidelines for instructions on how to report it.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
MIT © Kostadin Devedzhiev