A Python package for scraping input data for chATLAS.

chATLAS Scrape

A Python package for scraping and processing data for the chATLAS project. This package provides comprehensive tools for extracting markdown documentation from GitLab repositories and processing it for use in RAG (Retrieval-Augmented Generation) systems.

Overview

chATLAS_Scrape is part of the chATLAS package ecosystem, designed specifically for the ATLAS collaboration's documentation and knowledge management needs. It handles the data ingestion pipeline, extracting and processing markdown content from CERN GitLab repositories.

The GitLab scraping and preprocessing workflows run as scheduled GitLab pipelines, keeping the chATLAS knowledge base up to date with the latest documentation changes across ATLAS repositories.

Features

GitLab Scraping

  • Multi-stage scraping pipeline: Three-stage process for comprehensive project discovery and content extraction
  • Smart filtering: Automatically excludes archived projects, forks, personal repositories, and irrelevant projects
  • Rate limiting: Built-in rate limiting to respect GitLab API limits
  • Configurable: Customizable filtering criteria, timeouts, and project selection rules
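The filtering and rate-limiting ideas above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: `is_relevant`, `MinIntervalLimiter`, and the keyword list are hypothetical names. The record fields (`archived`, `forked_from_project`, `namespace.kind`, `path_with_namespace`) are the ones the GitLab REST API returns for a project.

```python
import time
from typing import Callable, Iterable, Optional

# Hypothetical keyword list; the package's real exclusion rules live in its config.
EXCLUDED_KEYWORDS = ("test", "sandbox")


def is_relevant(project: dict, excluded_keywords: Iterable[str] = EXCLUDED_KEYWORDS) -> bool:
    """Return True if a GitLab project record passes the filters listed above."""
    if project.get("archived", False):
        return False  # archived project
    if project.get("forked_from_project") is not None:
        return False  # fork of another project
    if project.get("namespace", {}).get("kind") == "user":
        return False  # personal repository (user namespace)
    path = project.get("path_with_namespace", "").lower()
    return not any(keyword in path for keyword in excluded_keywords)


class MinIntervalLimiter:
    """Client-side rate limit: enforce a minimum delay between consecutive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `limiter.wait()` before each API request spaces the requests out. The clock and sleep functions are injectable so the limiter can be tested without real delays.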

Markdown Processing

  • Intelligent chunking: Advanced text splitting that preserves markdown structure
  • Content preprocessing: Handles GitLab-specific markdown features (admonitions, content tabs)
  • Structural preservation: Maintains document hierarchy and formatting during processing
  • Filtering: Content-based filtering to exclude non-relevant files
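As a rough illustration of structure-preserving chunking, the sketch below splits a markdown document at headings, records the heading path for each chunk, and never splits inside a fenced code block. The names `Chunk` and `chunk_markdown` are hypothetical, and the package's own splitter may work differently (for example, it also handles admonitions and content tabs, which this sketch ignores).

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = re.compile(r"^\s*(```|~~~)")


@dataclass
class Chunk:
    breadcrumb: tuple[str, ...]  # heading path, e.g. ("Guide", "Setup")
    text: str                    # body text under that heading


def chunk_markdown(markdown: str) -> list[Chunk]:
    """Split markdown at headings, keeping fenced code blocks intact."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (heading level, title)
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the accumulated body under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(tuple(title for _, title in path), text))
        body.clear()

    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence  # entering or leaving a code fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading path alongside each chunk lets a downstream retrieval system show where in the document a passage came from.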

Installation

From a checkout of the repository, install the package and its dependencies using uv (recommended):

cd chATLAS_Scrape
uv sync

Configuration

Environment Variables

Set up your GitLab Personal Access Token:

export GITLAB_PAT="your_gitlab_personal_access_token"
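For illustration, a script could read the token from the environment and fail early if it is missing. The helper name `gitlab_auth_headers` is hypothetical and not part of the package. `PRIVATE-TOKEN` is the header GitLab's REST API accepts for personal access tokens.

```python
import os
from typing import Mapping


def gitlab_auth_headers(env: Mapping[str, str] = os.environ) -> dict:
    """Build GitLab REST API auth headers from the GITLAB_PAT variable."""
    token = env.get("GITLAB_PAT", "").strip()
    if not token:
        raise RuntimeError("GITLAB_PAT is not set; export it before running the scraper.")
    return {"PRIVATE-TOKEN": token}
```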

Configuration Options

The scraping behavior can be customized in chATLAS_Scrape/gitlab/config.py:

  • Base URL: Defaults to https://gitlab.cern.ch
  • Project filtering: Exclude projects containing specific keywords
  • Minimum file requirements: Set minimum number of markdown files per project
  • API timeouts: Configure REST (30s) and GraphQL (60s) request timeouts
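The options above could be grouped roughly as in the sketch below. The class and field names are assumptions made for illustration and may not match what `config.py` actually defines. The base URL and the two timeout values are taken from the list above; the keyword list and the minimum file count are placeholder values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScrapeConfig:
    """Hypothetical container for the scraping options described above."""

    base_url: str = "https://gitlab.cern.ch"
    excluded_keywords: tuple[str, ...] = ("test", "sandbox")  # placeholder keywords
    min_markdown_files: int = 1       # placeholder minimum per project
    rest_timeout_s: float = 30.0      # REST request timeout
    graphql_timeout_s: float = 60.0   # GraphQL request timeout


default_config = ScrapeConfig()
# Derive a stricter variant without mutating the defaults.
strict_config = replace(default_config, min_markdown_files=5)
```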

Usage

Basic Scraping Workflow

  1. Project Discovery: Find and filter relevant GitLab projects
  2. Content Extraction: Download markdown files from selected projects
  3. Processing: Clean and chunk the content for downstream use
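The three steps above compose naturally into a small pipeline. The sketch below only shows the data flow: `run_pipeline` and the stand-in stage functions are hypothetical, operate on in-memory data, and do not reflect the package's real entry points.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    discover: Callable[[], Iterable[str]],
    extract: Callable[[str], Iterable[str]],
    process: Callable[[str], Iterable[str]],
) -> List[str]:
    """Chain the three stages: projects -> markdown documents -> chunks."""
    chunks: List[str] = []
    for project in discover():                 # 1. project discovery
        for document in extract(project):      # 2. content extraction
            chunks.extend(process(document))   # 3. processing
    return chunks


# Stand-in stages working on in-memory data (no network access).
FAKE_REPOS = {"atlas/docs": ["# Intro\n\nHello.\n\nMore text."]}


def discover_projects() -> List[str]:
    return list(FAKE_REPOS)


def extract_markdown(project: str) -> List[str]:
    return FAKE_REPOS[project]


def split_paragraphs(document: str) -> List[str]:
    return [part.strip() for part in document.split("\n\n") if part.strip()]


result = run_pipeline(discover_projects, extract_markdown, split_paragraphs)
```

Passing the stages in as callables keeps each one independently testable and makes it easy to swap a stub for a real GitLab client.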

Running Tests

Execute the test suite:

uv run pytest
