A Python package for scraping input data for chATLAS.
chATLAS Scrape
A Python package for scraping and processing data for the chATLAS project. This package provides comprehensive tools for extracting markdown documentation from GitLab repositories and processing it for use in RAG (Retrieval-Augmented Generation) systems.
Overview
chATLAS_Scrape is part of the chATLAS package ecosystem, designed for the ATLAS collaboration's documentation and knowledge management needs. It handles the data ingestion pipeline, extracting and processing markdown content from CERN GitLab repositories.
The GitLab scraping and preprocessing workflows are regularly executed as scheduled GitLab pipelines, ensuring that the chATLAS knowledge base stays up-to-date with the latest documentation changes across ATLAS repositories.
Features
GitLab Scraping
- Multi-stage scraping pipeline: Three-stage process for comprehensive project discovery and content extraction
- Smart filtering: Automatically excludes archived projects, forks, personal repositories, and irrelevant projects
- Rate limiting: Built-in rate limiting to respect GitLab API limits
- Configurable: Customizable filtering criteria, timeouts, and project selection rules
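The filtering and rate-limiting ideas above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the `Project` record, its field names, `keep_project`, and `RateLimiter` are all hypothetical names introduced here.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Project:
    """Hypothetical project record; the field names are illustrative only."""

    path: str
    archived: bool = False
    is_fork: bool = False
    is_personal: bool = False


def keep_project(project: Project, excluded_keywords: Tuple[str, ...]) -> bool:
    """Return True if the project passes every exclusion rule."""
    if project.archived or project.is_fork or project.is_personal:
        return False
    lowered = project.path.lower()
    return not any(keyword in lowered for keyword in excluded_keywords)


class RateLimiter:
    """Enforce a minimum interval between consecutive API calls."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough to honour the minimum interval."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval_s - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()
```

Calling `limiter.wait()` before each request spaces the calls out, and `keep_project` can be applied to each discovered project before any content is downloaded.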
Markdown Processing
- Intelligent chunking: Advanced text splitting that preserves markdown structure
- Content preprocessing: Handles GitLab-specific markdown features (admonitions, content tabs)
- Structural preservation: Maintains document hierarchy and formatting during processing
- Filtering: Content-based filtering to exclude non-relevant files
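To illustrate what structure-preserving chunking can look like, here is a small sketch that splits a document at headings, records the heading path for each chunk, and ignores `#` lines inside fenced code blocks. The function name `chunk_markdown` and the chunk layout are assumptions made for this example; the package's own splitter may work differently.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(text: str) -> list:
    """Split markdown into one chunk per heading section.

    Each chunk keeps its heading path (for example ["Guide", "Install"]) so
    the document hierarchy survives the split. Lines inside fenced code
    blocks are never treated as headings.
    """
    chunks = []
    path = []        # current heading hierarchy
    body = []        # lines collected for the current section
    in_fence = False

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"headings": list(path), "text": content})
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]          # drop deeper or sibling headings
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading path alongside each chunk lets a retrieval system show where a passage came from, even after the document has been split.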
Installation
Install the package using uv (recommended):
cd chATLAS_Scrape
uv sync
Configuration
Environment Variables
Set up your GitLab Personal Access Token:
export GITLAB_PAT="your_gitlab_personal_access_token"
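For reference, a script could read the token along these lines. This is a sketch under assumptions: the helper name `gitlab_auth_headers` is hypothetical, and how chATLAS_Scrape itself consumes `GITLAB_PAT` is not shown here. The `PRIVATE-TOKEN` header is the one GitLab's REST API accepts for personal access tokens.

```python
import os


def gitlab_auth_headers(env=None) -> dict:
    """Build request headers from the GITLAB_PAT environment variable."""
    env = os.environ if env is None else env
    token = env.get("GITLAB_PAT", "").strip()
    if not token:
        raise RuntimeError("GITLAB_PAT is not set; export it before scraping.")
    # GitLab's REST API accepts personal access tokens in this header.
    return {"PRIVATE-TOKEN": token}
```

Failing early with a clear message avoids confusing authentication errors later in a long scraping run.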
Configuration Options
The scraping behavior can be customized in chATLAS_Scrape/gitlab/config.py:
- Base URL: Defaults to https://gitlab.cern.ch
- Project filtering: Exclude projects containing specific keywords
- Minimum file requirements: Set minimum number of markdown files per project
- API timeouts: Configure REST (30s) and GraphQL (60s) request timeouts
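The options above could be grouped as in the sketch below. The class `ScrapeConfig` and its attribute names are hypothetical and are not taken from `config.py`; only the base URL and the two timeout values come from the list above, and the keyword and file-count defaults are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScrapeConfig:
    """Hypothetical settings container; not the package's actual config.py."""

    base_url: str = "https://gitlab.cern.ch"   # default stated above
    excluded_keywords: Tuple[str, ...] = ()    # placeholder: real keywords not listed
    min_markdown_files: int = 1                # placeholder threshold
    rest_timeout_s: float = 30.0               # REST timeout stated above
    graphql_timeout_s: float = 60.0            # GraphQL timeout stated above

    def timeout_for(self, api: str) -> float:
        """Return the request timeout for the "rest" or "graphql" API."""
        timeouts = {"rest": self.rest_timeout_s, "graphql": self.graphql_timeout_s}
        if api not in timeouts:
            raise ValueError(f"unknown API kind: {api!r}")
        return timeouts[api]

    def has_enough_markdown(self, markdown_file_count: int) -> bool:
        """Check a project against the minimum markdown file requirement."""
        return markdown_file_count >= self.min_markdown_files
```

A frozen dataclass keeps the settings immutable during a run, so every stage sees the same values.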
Usage
Basic Scraping Workflow
1. Project Discovery: Find and filter relevant GitLab projects
2. Content Extraction: Download markdown files from selected projects
3. Processing: Clean and chunk the content for downstream use
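The data flow between these three stages can be sketched as a small driver that takes one callable per stage. The function `run_pipeline` and the callable signatures are assumptions for illustration only and do not reflect the package's actual entry points.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    discover: Callable[[], Iterable[str]],
    extract: Callable[[str], Dict[str, str]],
    process: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Run discovery, extraction, and processing in order.

    Returns a mapping from "project/file path" to the processed chunks.
    """
    results: Dict[str, List[str]] = {}
    for project in discover():                            # 1. project discovery
        for file_path, markdown in extract(project).items():  # 2. content extraction
            results[f"{project}/{file_path}"] = process(markdown)  # 3. processing
    return results
```

Passing the stages in as callables makes it easy to swap a real GitLab client for a stub when testing the flow offline.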
Running Tests
Execute the test suite:
uv run pytest
File details
Details for the file chatlas_scrape-0.0.2.tar.gz.
File metadata
- Download URL: chatlas_scrape-0.0.2.tar.gz
- Upload date:
- Size: 30.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 58af23ed420a62254c44e9feb72c3b101be46140ce03b06cd37cd95b3ec83a0e |
| MD5 | 3578e768e096f051c34032b90cab5eb6 |
| BLAKE2b-256 | 05317cd67c96bc87056edc4f5674ee768d6e54b374873d7af496a2252a931148 |
File details
Details for the file chatlas_scrape-0.0.2-py3-none-any.whl.
File metadata
- Download URL: chatlas_scrape-0.0.2-py3-none-any.whl
- Upload date:
- Size: 37.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6367cc18dd38a66ad96f6fe53f55cab8e8d25b5b224408e153dfb1338837f604 |
| MD5 | b099fe4b3279487a2461b301aea255c4 |
| BLAKE2b-256 | 4a82dc5f66247c9c413f933b2e86ac191672e5e6f0f958ea98b2f74e7034907a |