gitsource
GitHub repository reader with document chunking for RAG/LLM applications.
Features
- Download repositories directly from GitHub using codeload.github.com (no git required)
- Filter files by extension and path patterns
- Parse YAML frontmatter from markdown files
- Chunk documents using sliding windows (preserves metadata)
- Lightweight Jupyter notebook parser
Installation
pip install gitsource
# or
uv add gitsource
Usage
Read GitHub Repository
from gitsource import GithubRepositoryDataReader

reader = GithubRepositoryDataReader(
    repo_owner="evidentlyai",
    repo_name="docs",
    allowed_extensions={"md", "mdx"},
)
files = reader.read()
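For readers curious how a download without git can work, here is a minimal sketch, not gitsource's actual implementation. It assumes the archive is fetched as a zip from codeload.github.com and filtered by extension in memory. The helper names `codeload_zip_url` and `read_zip_files`, the default branch `"main"`, and the returned dictionary keys are all hypothetical.

```python
import io
import posixpath
import zipfile


def codeload_zip_url(owner: str, repo: str, branch: str = "main") -> str:
    """Build the codeload URL for a zip archive of one branch."""
    return f"https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{branch}"


def read_zip_files(zip_bytes: bytes, allowed_extensions: set) -> list:
    """Return text files from an in-memory zip whose extension is allowed."""
    files = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # GitHub archives wrap everything in a top-level "<repo>-<branch>/" folder,
            # so drop the first path component.
            _, _, relative_path = info.filename.partition("/")
            extension = posixpath.splitext(relative_path)[1].lstrip(".").lower()
            if extension not in allowed_extensions:
                continue
            content = archive.read(info).decode("utf-8", errors="replace")
            files.append({"filename": relative_path, "content": content})
    return files
```

The archive bytes could come from `urllib.request.urlopen(codeload_zip_url("evidentlyai", "docs")).read()`; the sketch keeps the network call separate so the filtering logic can be tested offline.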
Parse Frontmatter
from gitsource import GithubRepositoryDataReader

reader = GithubRepositoryDataReader(
    repo_owner="alexeygrigorev",
    repo_name="gitsource",
    allowed_extensions={"md"},
)
files = reader.read()

# Parse YAML frontmatter from markdown files
for file in files:
    data = file.parse()
    print(f"{data['filename']}: {data.get('title', 'No title')}")
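To illustrate what frontmatter parsing involves, the sketch below splits a markdown string into metadata and body. It is a simplified stand-in and not how `file.parse()` is implemented: the function name `split_frontmatter` is hypothetical, and it handles only flat `key: value` lines between `---` delimiters, whereas real YAML (lists, nesting, multi-line values) needs a proper YAML parser.

```python
def split_frontmatter(text: str):
    """Split markdown text into (metadata, body).

    Only flat `key: value` lines are understood; this is not a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at the top
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter, so treat everything as body
    metadata = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            # Strip surrounding whitespace and simple quotes from the value.
            metadata[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:])
    return metadata, body
```

A file without a leading `---` line is returned unchanged with empty metadata, which matches the `data.get('title', 'No title')` fallback pattern used above.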
Process Jupyter Notebooks
from gitsource import GithubRepositoryDataReader, notebook_processor

reader = GithubRepositoryDataReader(
    repo_owner="alexeygrigorev",
    repo_name="gitsource",
    branch="master",
    allowed_extensions={"md", "ipynb"},
    filename_filter=lambda fp: fp.startswith("fixtures/"),
    processors={"ipynb": notebook_processor},  # Convert .ipynb to text
)
files = reader.read()

for file in files:
    print(f"{file.filename}: {file.content[:50]}...")
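A lightweight notebook conversion can be done with the standard library alone, because an .ipynb file is JSON with a top-level `cells` list. The sketch below shows the idea; it is not the code behind `notebook_processor`, and the function name `notebook_to_text` and the choice to keep only markdown and code cell sources (ignoring outputs) are assumptions.

```python
import json


def notebook_to_text(raw: str) -> str:
    """Flatten an .ipynb JSON string into plain text (cell sources only)."""
    notebook = json.loads(raw)
    parts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") not in {"markdown", "code"}:
            continue  # skip raw cells and anything unrecognized
        source = cell.get("source", "")
        # nbformat allows "source" to be a single string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        source = source.strip()
        if source:
            parts.append(source)
    # Separate cells with a blank line so they stay readable as one document.
    return "\n\n".join(parts)
```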
Chunk Documents
from gitsource import chunk_documents

documents = [
    {"content": "Long text here...", "filename": "doc.txt"},
]
chunks = chunk_documents(
    documents,
    size=2000,
    step=1000,
)
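The sliding-window idea can be sketched as follows. This is an illustration and not the source of `chunk_documents`: the function name `sliding_window_chunks`, the assumption that `size` and `step` are measured in characters, and the extra `start` key are all hypothetical.

```python
def sliding_window_chunks(documents, size, step, content_key="content"):
    """Split each document's text into overlapping windows, copying its other fields."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    for doc in documents:
        text = doc[content_key]
        # Everything except the text is treated as metadata and kept on every chunk.
        metadata = {k: v for k, v in doc.items() if k != content_key}
        for start in range(0, len(text), step):
            chunk = {**metadata, "start": start, content_key: text[start:start + size]}
            chunks.append(chunk)
            if start + size >= len(text):
                break  # this window already reaches the end of the text
    return chunks
```

Under these assumptions, `size=2000` with `step=1000` makes consecutive chunks overlap by 1000 characters, so a sentence cut at one chunk boundary appears whole in the neighboring chunk.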
License
WTFPL