🚀 Builds a structured markdown knowledge base from external sources such as websites, documents, and GitHub repos with large language models. Ideal for RAG, SEO-friendly LLM contexts (/llms.txt), and chatbots.
🧠 Multi-Source Knowledge Base Builder for LLMs
Python package that transforms diverse content into a powerful, structured knowledge base. This tool seamlessly ingests files (PDF, DOCX, spreadsheets, Markdown, plaintext), websites (HTML pages, sitemaps, XML content), and GitHub repositories, processes them with state-of-the-art large language models (Gemini, GPT-4o, Claude), and produces a comprehensive Markdown knowledge base. Perfect for creating web-crawlable /llms.txt files, powering RAG applications, preprocessing content for vector databases, or building specialized chatbots. Under the hood, documents are preprocessed concurrently through a semaphore-limited pool of asynchronous LLM calls and then merged into a single cohesive knowledge base.
✨ Features
- 📄 Document ingestion – Downloads local or remote documents and extracts structured text.
- 🌐 Website ingestion – Crawls pages from a sitemap or list of pages and extracts clean HTML content.
- 📘 GitHub integration – Fetches Markdown files from public repositories.
- 🧠 LLM-powered summarization – Uses state-of-the-art models to convert raw data into readable, structured Markdown.
- 🔁 Knowledge merging – Combines multiple knowledge base sections into a single cohesive document.
- 🔄 Multiple model providers – Choose between Google Gemini, OpenAI GPT-4o, or Anthropic Claude 3.7 Sonnet.
- ⚡ Performance – Loads files in parallel and issues multiple asynchronous LLM calls to summarize documents.
🚀 Installation
Install from PyPI

```shell
pip install knowledge-base-builder
```

Install from Source

```shell
git clone https://github.com/kostadindev/knowledge-base-builder.git
cd knowledge-base-builder
pip install -e .
```
🚀 Quickstart
1. Set up your .env file
Create a .env file in your project directory with the following variables (add the API keys for the models you intend to use):
```shell
# You need only one of the following
GOOGLE_API_KEY=your_google_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Optional: raises the GitHub API rate limit when fetching repositories
GITHUB_API_KEY=your_github_api_key_here
```
2. Use as a Python Package
```python
import os
from dotenv import load_dotenv
from knowledge_base_builder import KBBuilder

# Load environment variables
load_dotenv()

# API and model configuration
config = {
    'GOOGLE_API_KEY': os.getenv("GOOGLE_API_KEY"),        # For Gemini | free key at https://aistudio.google.com/app/apikey
    'OPENAI_API_KEY': os.getenv("OPENAI_API_KEY"),        # For GPT-4o
    'ANTHROPIC_API_KEY': os.getenv("ANTHROPIC_API_KEY"),  # For Claude
}

# Source documents - unified approach
sources = {
    # Unified files list - automatically detects and processes each file type
    'files': [
        # PDF documents - remote
        "https://kostadindev.github.io/static/documents/cv.pdf",
        "https://kostadindev.github.io/static/documents/sbu_transcript.pdf",
        # Local file path (no file:/// prefix needed)
        "C:/Users/kosta/OneDrive/Desktop/MS Application Materials/emf-ellipse-publication.pdf",
        # Web pages
        "https://kostadindev.github.io/index.html",
        "https://kostadindev.github.io/projects.html",
        # Add other file types as needed
        # "https://example.com/data.csv",
        # "path/to/local/document.docx",  # relative local path example
        # "https://example.com/api-docs.json",
    ],

    # Process all pages from a sitemap
    'sitemap_url': "https://kostadindev.github.io/sitemap.xml",

    # GitHub repositories to process (format: username/repo or full URL)
    'github_repositories': [
        "https://github.com/kostadindev/Knowledge-Base-Builder",
        "https://github.com/kostadindev/GONEXT",
        "https://github.com/kostadindev/GONEXT-ML",
        "https://github.com/kostadindev/meta-me",
        "https://github.com/kostadindev/Recursive-QA",
        "https://github.com/kostadindev/deep-gestures",
        "https://github.com/kostadindev/emf-ellipse",
    ],
}

# Create the KB builder
kbb = KBBuilder(config)

# Build the knowledge base
kbb.build(sources=sources, output_file="final_knowledge_base.md")
```
🔧 Supported Sources
| Source Type | Description | Formats |
|---|---|---|
| Documents | Text documents | PDF, DOCX, TXT, MD, RTF |
| Spreadsheets | Tabular data | CSV, TSV, XLSX, ODS |
| Web Content | Structured web data | HTML, XML, JSON, YAML/YML |
| Websites | Live web pages | Any URL or sitemap |
| GitHub | Repository content | Markdown files from public repos |
All sources can be added through the unified `files` parameter, with automatic format detection.
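The "automatic format detection" behind the unified `files` parameter can be pictured as a simple extension-based dispatch. This is only an illustrative sketch; the handler names and the `detect_format` function are hypothetical, not the package's actual API:

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a handler name.
HANDLERS = {
    ".pdf": "pdf", ".docx": "docx", ".txt": "text", ".md": "markdown",
    ".csv": "spreadsheet", ".tsv": "spreadsheet", ".xlsx": "spreadsheet",
    ".html": "web", ".xml": "web", ".json": "web", ".yaml": "web", ".yml": "web",
}

def detect_format(source: str) -> str:
    """Pick a handler from the URL path or local file extension."""
    path = urlparse(source).path if source.startswith("http") else source
    return HANDLERS.get(Path(path).suffix.lower(), "unknown")
```

Remote URLs and local paths go through the same lookup, which is what lets one list mix both.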
🧠 LLM Providers
| Provider | Models | Features |
|---|---|---|
| Google Gemini | gemini-2.0-flash (default) | Free tier, fast, large context window, cost-effective summaries |
| OpenAI | gpt-4o (default) | High-quality summaries, strong reasoning |
| Anthropic | claude-3-7-sonnet (default) | High-quality summaries, excellent formatting |
Recommended Provider: Google Gemini
Google Gemini is the recommended provider: a free development API key is available, and the model is fast, has a large context window, and performs well on benchmarks. Get your free API key at Google AI Studio.
📥 Output Example
```markdown
# Resume Summary

## Education
- B.S. in Computer Science from XYZ University

## Experience
- Software Engineer at ABC Corp
  - Developed NLP-based document parsers...

---

# Website Summary

## Project Pages
- **Project Alpha**: A machine learning system for ...
- **Blog Post**: How to use Gemini with LangChain ...
```
🔍 Applications
Web Crawlable LLM Context Enhancement
- /llms.txt: Generate a compact, web-crawlable context file (typically 10-20KB) that allows LLMs to access your personal or organizational information during web searches.
- /llms-full.txt: Create an expanded knowledge file (50-100KB) with comprehensive details about your work, expertise, and content that search-powered LLMs can index.
- Web Context Sources: Enable web search LLMs like Perplexity, ChatGPT, Claude, and Gemini to discover and reference your structured information during user queries.
RAG Applications
- Vector Database Preprocessing: Generate clean, structured content before embedding into vector stores like Pinecone, Chroma, or Weaviate, improving retrieval quality.
- Single-Context LLM Applications: Provide a comprehensive knowledge base that fits within a single LLM context window (up to 128K tokens) for domain-specific assistants.
- Hybrid RAG Systems: Combine the full knowledge base with selective vector retrieval for specialized question answering systems with reduced hallucination.
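Whether a generated knowledge base fits a single context window can be estimated with the common ~4 characters-per-token heuristic for English text. This is a rough rule of thumb, not an exact tokenizer:

```python
def fits_context(markdown: str, window_tokens: int = 128_000) -> bool:
    """Rough check using the ~4 chars/token heuristic for English text."""
    estimated_tokens = len(markdown) / 4
    return estimated_tokens <= window_tokens

# A 100 KB knowledge base is roughly 25K tokens -- well within 128K.
```

For precise counts, each provider's own tokenizer should be used instead.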
Personal Knowledge Management
- Professional Portfolio: Create a comprehensive knowledge base integrating your resume, publications, projects, and online presence into a single searchable document.
- Academic Research: Compile research papers, conference proceedings, and citations into a structured knowledge base for literature reviews or thesis preparation.
- Technical Documentation: Consolidate documentation across multiple GitHub repositories, technical blogs, and API references into a unified technical manual.
Enterprise Use Cases
- Company Knowledge Base: Consolidate internal documentation, product specifications, and team information into an easily updatable central resource.
- Customer Support: Transform support tickets, FAQs, and product manuals into a comprehensive knowledge base for support agents or automated systems.
- Competitive Intelligence: Build a structured repository of competitor information from various public sources, updated periodically with the latest data.
- Candidate Evaluation: Generate comprehensive profiles of job candidates by compiling their GitHub contributions, research papers, portfolio, and online presence.
- Onboarding Acceleration: Create personalized knowledge bases for new employees containing company policies, codebase documentation, and team information.
🌲 Algorithm
The knowledge base builder uses a two-step approach for efficient processing:
1. **Parallel Preprocessing**
   - All documents are preprocessed concurrently into structured KBs
   - Uses a semaphore to limit concurrent LLM requests
   - Each document is converted into a well-formatted Markdown knowledge base
   - Optimized for parallel processing with controlled concurrency
2. **Single Merge**
   - All preprocessed KBs are merged in a single operation
   - Maintains logical structure and organization
   - Reduces total LLM calls compared to recursive approaches
   - More predictable memory usage
This approach provides several advantages:
- Fewer total LLM calls (one per document + one final merge)
- Better parallelization of preprocessing
- More predictable memory usage
- Simpler and more maintainable code
- Faster overall processing time
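The two-step approach can be sketched with asyncio. Here `summarize` is a purely illustrative stand-in for an LLM call, and the merge is simulated by joining sections (a real merge would be one more LLM call):

```python
import asyncio

SEM = asyncio.Semaphore(8)  # default concurrency limit from the docs

async def summarize(text: str) -> str:
    """Placeholder for an LLM call that turns raw text into a Markdown KB."""
    async with SEM:              # step 1: throttled, parallel preprocessing
        await asyncio.sleep(0)   # simulate network latency
        return f"## KB for: {text}"

async def build_kb(documents: list[str]) -> str:
    kbs = await asyncio.gather(*(summarize(d) for d in documents))
    # Step 2: one final merge over all preprocessed KBs.
    return "\n\n---\n\n".join(kbs)

final = asyncio.run(build_kb(["resume.pdf", "index.html"]))
```

With N documents this costs N preprocessing calls plus one merge call, matching the "one per document + one final merge" count above.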
⚡ Concurrency Model
The knowledge base builder implements a multi-layer concurrency model to maximize performance while maintaining stability:
1. File Processing Concurrency
```python
# Multiple files processed simultaneously
tasks = [process_file_async(file) for file in files]
await asyncio.gather(*tasks)
```
- Enables parallel processing of multiple files
- Each file type (PDF, DOCX, web page, etc.) is processed independently
- Significantly reduces total processing time for multiple files
2. CPU-Bound Operations
```python
# Blocking download and extraction run in worker threads
path = await asyncio.to_thread(processor.download, url)
text = await asyncio.to_thread(processor.extract_text, path)
```
- Downloads and text extraction run in separate threads
- Prevents blocking the event loop during I/O operations
- Optimizes CPU utilization across cores
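A self-contained version of the `asyncio.to_thread` pattern, with a stand-in extraction function in place of the package's real processor:

```python
import asyncio

def extract_text(path: str) -> str:
    """Stand-in for a blocking download/extraction step."""
    return f"text from {path}"

async def process(paths: list[str]) -> list[str]:
    # Each blocking call runs in a worker thread, keeping the event loop free.
    return await asyncio.gather(
        *(asyncio.to_thread(extract_text, p) for p in paths)
    )

texts = asyncio.run(process(["a.pdf", "b.docx"]))
```

Results come back in input order, even though the threads may finish in any order.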
3. LLM Processing Concurrency
```python
# Controlled concurrent LLM API calls
async with self._sem:
    result = await self.llm_client.run_async(prompt)
```
- Uses a semaphore to limit concurrent LLM API calls
- Prevents overwhelming the LLM API
- Helps stay within API rate limits
- Default concurrency limit: 8 simultaneous requests
4. Final KB Merging
```python
# Concurrent preprocessing followed by a single merge
tasks = [preprocess_text_async(text) for text in texts]
preprocessed_kbs = await asyncio.gather(*tasks)
final_kb = await merge_all_kbs(preprocessed_kbs)
```
- Preprocesses all documents concurrently
- Merges them into a final knowledge base
- Optimizes both speed and memory usage
Performance Considerations
- Resource Management: CPU-bound operations don't block the event loop
- Rate Limiting: LLM API calls are properly throttled
- Scalability: System can handle many files without performance degradation
- Constraints:
  - LLM concurrency limit (default: 8)
  - System resources (CPU, memory, network)
  - LLM API rate limits
⚠️ Limitations
Memory Usage
- Document Processing: Each document is loaded into memory during processing
- LLM Context Windows: Different models have different context window limits:
  - Gemini 2.0 Flash: 1M tokens
  - GPT-4o: 128K tokens
  - Claude 3.7 Sonnet: 200K tokens
- Merge Operations: Final merge operation requires all preprocessed KBs in memory
- Recommendation: Monitor memory usage when processing large documents or many files
Processing Time
- I/O Operations: Each file requires multiple I/O operations:
  - Downloading/reading the file
  - Text extraction
  - LLM API calls
- LLM Latency: Each document requires at least one LLM call:
  - One call per document for preprocessing
  - One final call for merging
Rate Limits
- LLM API Limits: Each provider has different rate limits
- GitHub API: 60 requests per hour (unauthenticated)
- Web Scraping: Some websites may block rapid requests
🧪 Future Improvements
Data Sources Expansion
- Cloud Integration: Add support for Google Drive, Dropbox, and OneDrive documents
- Social & Professional: Add LinkedIn profiles, Twitter feeds, and Medium articles integration
- Academic Sources: Connect to arXiv, Google Scholar, and research databases
Performance Optimizations
- Parallel Processing: Improve multi-document processing with adaptive concurrency control
- Merge Algorithm: Enhance the logarithmic-depth merge tree for better memory efficiency
- Streaming Processing: Implement document streaming for reduced memory footprint
Output & Integration
- Vector DB Export: Direct export to Pinecone, Chroma, Weaviate, and other vector databases
- LangChain Integration: Simplified integration with LangChain for RAG applications
- Custom Schemas: User-definable output schemas for specialized knowledge base formats
Advanced Features
- Incremental Updates: Support for updating existing knowledge bases with new content
- Multi-language Support: Process and merge content across different languages
- Custom Taxonomies: Allow users to define custom categorization schemas for content organization
Performance & Limitations Improvements
- **Memory Optimization**:
  - Implement streaming document processing to reduce memory footprint
  - Add chunking for documents exceeding context windows
  - Develop a smart caching system for processed documents
  - Add memory usage monitoring and automatic batch sizing
- **Processing Speed**:
  - Implement progressive document loading
  - Develop smart retry mechanisms for failed operations
- **Rate Limit Management**:
  - Add automatic rate limit detection and adaptation
  - Implement a smart queuing system for API calls
  - Add support for rotating multiple API keys
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details on how to get started.
🐛 Bug Reports
Found a bug? Please check our Contributing Guidelines for instructions on how to report it.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
MIT © Kostadin Devedzhiev