Python scraper for Oryx equipment loss data, matching the R script approach

These details have not been verified by PyPI

oryx-wat-scraper

Python scraper for Oryx equipment loss data, matching the R script approach from scrape_oryx.

This package scrapes equipment loss data directly from the Oryx blog post and generates CSV files in the same format as the oryx_data repository.

Features

✅ CSV Output: Generates CSV files matching oryx_data format
✅ Type-safe: Full type hints with dataclasses
✅ Context Manager: Proper resource cleanup
✅ Well-tested: Comprehensive test suite
✅ Modern Python: Requires Python 3.10+

Installation

pip install oryx-wat-scraper

Or using uv:

uv add oryx-wat-scraper

Or using poetry:

poetry add oryx-wat-scraper

Quick Start

Python API

from oryx_wat_scraper import OryxScraper

# Initialize the scraper
scraper = OryxScraper()

# Scrape data
data = scraper.scrape()

# Generate CSV files (matching oryx_data format)
scraper.scrape_to_csv('outputfiles')

# Or get JSON
json_data = scraper.scrape_to_json('output.json')

# Close when done
scraper.close()

Context Manager

from oryx_wat_scraper import OryxScraper

with OryxScraper() as scraper:
    # Scrape specific countries
    data = scraper.scrape(countries=['russia', 'ukraine'])

    # Generate CSV files
    scraper.scrape_to_csv('outputfiles')
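
The `with` form above relies on Python's context manager protocol. As a rough illustration of why it guarantees cleanup, the sketch below uses a hypothetical stand-in class (`ClosingResource` is not part of this package, and the package's real implementation may differ) whose `__exit__` calls `close()` even when the body raises.

```python
class ClosingResource:
    """Hypothetical stand-in for an object that owns a closable resource."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # Release the underlying resource (for a scraper, e.g. an HTTP client).
        self.closed = True

    def __enter__(self) -> "ClosingResource":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and on exceptions alike.
        self.close()
        return False  # do not swallow exceptions raised in the body


def use_and_fail() -> ClosingResource:
    """Enter the context, raise inside it, and return the resource afterwards."""
    resource = ClosingResource()
    try:
        with resource:
            raise RuntimeError("simulated failure inside the with block")
    except RuntimeError:
        pass
    return resource
```

With the explicit `scraper.close()` style, wrapping the work in `try`/`finally` gives the same guarantee.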

Command Line

# Generate CSV files
oryx-scraper --csv

# Save to JSON
oryx-scraper -o output.json

# Scrape specific countries
oryx-scraper --csv --countries russia ukraine

# Custom output directory
oryx-scraper --csv --output-dir my_output

Output Formats

CSV Files (matching oryx_data format)

daily_count.csv (columns: country, equipment_type, destroyed, abandoned, captured, damaged, type_total, date_recorded)

country,equipment_type,destroyed,abandoned,captured,damaged,type_total,date_recorded
russia,T-62M,154,5,34,1,194,2024-01-15
ukraine,T-72,45,2,8,0,55,2024-01-15
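
In the sample rows above, type_total equals the sum of the four status columns (154 + 5 + 34 + 1 = 194). A minimal sketch of that check using only the standard library follows. Whether this holds for every row of real output is an assumption drawn from the sample, and `row_is_consistent` is a hypothetical helper, not part of the package.

```python
import csv
import io

# Sample rows copied from the daily_count.csv example above.
SAMPLE = """country,equipment_type,destroyed,abandoned,captured,damaged,type_total,date_recorded
russia,T-62M,154,5,34,1,194,2024-01-15
ukraine,T-72,45,2,8,0,55,2024-01-15
"""

STATUS_COLUMNS = ("destroyed", "abandoned", "captured", "damaged")


def row_is_consistent(row: dict[str, str]) -> bool:
    """Return True if the status columns add up to type_total (assumed invariant)."""
    return sum(int(row[col]) for col in STATUS_COLUMNS) == int(row["type_total"])


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
# Collect the equipment types whose counts do not add up.
inconsistent = [r["equipment_type"] for r in rows if not row_is_consistent(r)]
print(inconsistent)
```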

totals_by_type.csv (columns: country, type, destroyed, abandoned, captured, damaged, total)

country,type,destroyed,abandoned,captured,damaged,total
russia,T-62M,154,5,34,1,194
ukraine,T-72,45,2,8,0,55
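
One way to consume this file is to aggregate the total column per country. The sketch below does this with the standard library on the sample rows above; `totals_by_country` is a hypothetical helper, not part of the package.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the totals_by_type.csv example above.
SAMPLE = """country,type,destroyed,abandoned,captured,damaged,total
russia,T-62M,154,5,34,1,194
ukraine,T-72,45,2,8,0,55
"""


def totals_by_country(csv_text: str) -> dict[str, int]:
    """Sum the 'total' column for each country."""
    totals: dict[str, int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["country"]] += int(row["total"])
    return dict(totals)


country_totals = totals_by_country(SAMPLE)
print(country_totals)
```

For real output, pass the text of the generated file instead of `SAMPLE`.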

JSON Output

{
  "url": "https://www.oryxspioenkop.com/...",
  "date_scraped": "2024-01-15",
  "total_entries": 1000,
  "daily_count": [...],
  "totals_by_type": [...]
}
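
A consumer of the JSON output may want to confirm that the top-level keys shown above are present before using the data. A minimal sketch follows. The key names come from the example above, while `missing_keys` and the placeholder values are illustrative assumptions, not part of the package.

```python
import json

# Top-level keys taken from the JSON example above.
EXPECTED_KEYS = {"url", "date_scraped", "total_entries", "daily_count", "totals_by_type"}


def missing_keys(json_text: str) -> set[str]:
    """Return the expected top-level keys that are absent from the document."""
    payload = json.loads(json_text)
    return EXPECTED_KEYS - set(payload)


# Placeholder document with made-up values, used only to exercise the check.
example = json.dumps(
    {
        "url": "https://example.invalid/placeholder",
        "date_scraped": "2024-01-15",
        "total_entries": 0,
        "daily_count": [],
        "totals_by_type": [],
    }
)
print(missing_keys(example))
```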

API Reference

`OryxScraper`

Main scraper class.

Methods

scrape(countries: List[str] | None = None) -> Dict: Scrape data for specified countries
scrape_to_csv(output_dir: str = 'outputfiles') -> Dict: Scrape and save to CSV files
scrape_to_json(output_file: str | None = None, indent: int = 2) -> str: Scrape and return/save as JSON
close(): Close the HTTP client

Models

EquipmentEntry: Individual equipment entry with status
SystemEntry: Individual system entry with status

Exceptions

OryxScraperError: Base exception
OryxScraperNetworkError: Network errors
OryxScraperParseError: HTML parsing errors
OryxScraperValidationError: Data validation errors
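
When handling these errors, catch the specific exceptions before the base one. The sketch below defines local stand-in classes so that it runs without the package installed. It assumes the specific errors subclass OryxScraperError, as "Base exception" suggests, which the list above does not state explicitly. `describe_failure` and `failing` are hypothetical helpers.

```python
from typing import Callable


# Stand-ins mirroring the exception names listed above (assumed hierarchy).
class OryxScraperError(Exception):
    """Assumed common base class."""


class OryxScraperNetworkError(OryxScraperError):
    """Stand-in for network failures."""


class OryxScraperParseError(OryxScraperError):
    """Stand-in for HTML parsing failures."""


def describe_failure(action: Callable[[], object]) -> str:
    """Run action and classify the outcome, most specific handlers first."""
    try:
        action()
    except OryxScraperNetworkError:
        return "network"
    except OryxScraperParseError:
        return "parse"
    except OryxScraperError:
        return "other"
    return "ok"


def failing(exc: Exception) -> Callable[[], object]:
    """Build a zero-argument callable that raises exc when called."""
    def action() -> object:
        raise exc
    return action
```

In real code, the stand-in classes would be replaced by the package's own exceptions and `action` by a call such as `scraper.scrape`.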

Development

Setup

# Clone the repository
git clone https://github.com/wat-suite/oryx-wat-scraper.git
cd oryx-wat-scraper

# Install with uv
uv sync --dev

# Install pre-commit hooks
uv run pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=oryx_wat_scraper --cov-report=html

# Run specific test file
pytest tests/test_client.py

Running CI Checks Locally

You can run locally the same checks that CI runs:

# Run all checks
black --check .
ruff check .
mypy oryx_wat_scraper
pytest

Code Quality

# Format code
black .

# Lint code
ruff check .

# Type check
mypy oryx_wat_scraper

Make Commands

make install-dev  # Install with dev dependencies
make test         # Run tests
make lint         # Run linters
make format       # Format code
make type-check   # Type check
make clean        # Clean build artifacts

Releasing

This project uses GitHub Releases to trigger PyPI publishing. The workflow automatically publishes to PyPI when a final (non-pre-release) GitHub Release is created.

Release Workflow

1. Update Version

Update the version in pyproject.toml:

[project]
version = "0.2.0"  # Update to your new version

Follow Semantic Versioning:

MAJOR (1.0.0): Breaking changes
MINOR (0.1.0): New features, backwards compatible
PATCH (0.0.1): Bug fixes, backwards compatible
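
The three bump rules above can be illustrated with a small helper. This is a sketch that handles only plain MAJOR.MINOR.PATCH strings (no pre-release or build suffixes); `bump` is a hypothetical function, not part of the package or its tooling.

```python
def bump(version: str, part: str) -> str:
    """Return the next version after bumping the given part."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Breaking changes: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards-compatible features: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards-compatible bug fixes.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("0.1.0", "minor"))
```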

2. Update Changelog

Update CHANGELOG.md with the changes for this version.

3. Commit and Push Changes

git add pyproject.toml CHANGELOG.md
git commit -m "chore: bump version to 0.2.0"
git push origin main

4. Create a Git Tag

Create a tag matching the version (with or without 'v' prefix):

# Option 1: Tag with 'v' prefix
git tag v0.2.0

# Option 2: Tag without prefix
git tag 0.2.0

# Push the tag (use the same name you created above, e.g. 0.2.0 for Option 2)
git push origin v0.2.0

Important: The tag version must match the version in pyproject.toml exactly (excluding the 'v' prefix if used).
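
The matching rule can be checked locally before pushing a tag. The sketch below shows one way to express it: strip a single leading 'v' and compare the strings exactly. `tag_matches_version` is a hypothetical helper, and the actual check performed by the publishing workflow may be implemented differently.

```python
def tag_matches_version(tag: str, version: str) -> bool:
    """Compare a git tag with the pyproject.toml version, ignoring one leading 'v'."""
    return tag.removeprefix("v") == version


print(tag_matches_version("v0.2.0", "0.2.0"))
```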

5. Create GitHub Release

Go to the GitHub Releases page and click "Draft a new release":

For Pre-Release (Testing):

Tag: Select the tag you just created (e.g., v0.2.0)
Release title: v0.2.0 (or your version)
Description: Copy from CHANGELOG.md or write release notes
☑️ Set as a pre-release: Check this box
Click "Publish release"

Pre-releases are not published to PyPI. Use them for testing before the final release.

For Final Release (Publishing to PyPI):

Tag: Select the tag you just created (e.g., v0.2.0)
Release title: v0.2.0 (or your version)
Description: Copy from CHANGELOG.md or write release notes
☐ Set as a pre-release: Leave this unchecked
Click "Publish release"

The GitHub Actions workflow will:

Verify the tag version matches pyproject.toml
Build the package
Check the package with twine
Publish to PyPI (only for final releases, not pre-releases)

Workflow Summary

1. Update version in pyproject.toml
2. Update CHANGELOG.md
3. Commit and push changes
4. Create and push git tag
5. Create GitHub Release (pre-release or final)
   └─> Pre-release: Testing only, not published to PyPI
   └─> Final release: Automatically published to PyPI

Troubleshooting

Version mismatch error: Ensure the tag version (without 'v' prefix) exactly matches pyproject.toml version
Pre-release not published to PyPI: Pre-releases are intentionally skipped. Create a final release to publish to PyPI
Workflow not triggered: Ensure the release is "Published" (not "Draft") and the tag exists

Based On

This scraper is based on the R script approach from:

scrape_oryx - R script for scraping Oryx data
oryx_data - Processed CSV data repository

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Workflow

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Make your changes
Add tests for new functionality
Ensure all tests pass and code is formatted (black . && ruff check . && pytest)
Commit your changes (following Conventional Commits)
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Commit Message Guidelines

This project follows Conventional Commits. Commit messages should be formatted as:

<type>(<scope>): <subject>

<body>

<footer>

Types:

feat: New feature
fix: Bug fix
docs: Documentation changes
style: Code style changes (formatting, etc.)
refactor: Code refactoring
test: Adding or updating tests
chore: Maintenance tasks

Code Style

This project uses Black for code formatting
Ruff is used for linting
mypy is used for type checking
All code must pass linting and type checking

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

If you encounter any issues or have questions, please open an issue on GitHub.

Changelog

See CHANGELOG.md for a list of changes and version history.

Acknowledgments

Oryx for documenting equipment losses
scarnecchia for the R script and data processing
All contributors who help improve this library

Release history

This version

0.1.0

Jan 14, 2026

Download files

Source Distribution

oryx_wat_scraper-0.1.0.tar.gz (16.2 kB)

Uploaded Jan 14, 2026 Source

Built Distribution

oryx_wat_scraper-0.1.0-py3-none-any.whl (16.6 kB)

Uploaded Jan 14, 2026 Python 3

File details

Details for the file oryx_wat_scraper-0.1.0.tar.gz.

File metadata

Download URL: oryx_wat_scraper-0.1.0.tar.gz
Upload date: Jan 14, 2026
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for oryx_wat_scraper-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f41e51d0c6da7399ac383a6e3624eabc15fbfb88da8d0204b347dd9c218616b4`
MD5	`63bd4ae51ea1cc0c273cd3576f3bc1f9`
BLAKE2b-256	`7aeddd748df4cd14a067bb60609bbbe436caceaddb957d20ee1d82c46230ae0a`

Provenance

The following attestation bundles were made for oryx_wat_scraper-0.1.0.tar.gz:

Publisher: publish.yml on WAT-Suite/oryx-wat-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: oryx_wat_scraper-0.1.0.tar.gz
- Subject digest: f41e51d0c6da7399ac383a6e3624eabc15fbfb88da8d0204b347dd9c218616b4
- Sigstore transparency entry: 821195947
- Sigstore integration time: Jan 14, 2026
Source repository:
- Permalink: WAT-Suite/oryx-wat-scraper@b4751729c17d409f0e9ad35fe34a996b0f2c030b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/WAT-Suite
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b4751729c17d409f0e9ad35fe34a996b0f2c030b
- Trigger Event: release

File details

Details for the file oryx_wat_scraper-0.1.0-py3-none-any.whl.

File metadata

Download URL: oryx_wat_scraper-0.1.0-py3-none-any.whl
Upload date: Jan 14, 2026
Size: 16.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for oryx_wat_scraper-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa7bf187026abb16fab63cec2527296f62c2d0133b5b7765c83fd442dbcc48ac`
MD5	`66ebd36ba4329e82cdd647f3cca263c3`
BLAKE2b-256	`9565b5df0a249c23fdbd510fd3039e7f4bf33f8414c694b5d122d4d104751207`

Provenance

The following attestation bundles were made for oryx_wat_scraper-0.1.0-py3-none-any.whl:

Publisher: publish.yml on WAT-Suite/oryx-wat-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: oryx_wat_scraper-0.1.0-py3-none-any.whl
- Subject digest: aa7bf187026abb16fab63cec2527296f62c2d0133b5b7765c83fd442dbcc48ac
- Sigstore transparency entry: 821195951
- Sigstore integration time: Jan 14, 2026
Source repository:
- Permalink: WAT-Suite/oryx-wat-scraper@b4751729c17d409f0e9ad35fe34a996b0f2c030b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/WAT-Suite
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b4751729c17d409f0e9ad35fe34a996b0f2c030b
- Trigger Event: release
