oryx-wat-scraper
Python scraper for Oryx equipment loss data, matching the R script approach from scrape_oryx.
This package scrapes equipment loss data directly from the Oryx blog post and generates CSV files in the same format as the oryx_data repository.
Features
- ✅ CSV Output: Generates CSV files matching oryx_data format
- ✅ Type-safe: Full type hints with dataclasses
- ✅ Context Manager: Proper resource cleanup
- ✅ Well-tested: Comprehensive test suite
- ✅ Modern Python: Requires Python 3.10+
Installation
pip install oryx-wat-scraper
Or using uv:
uv add oryx-wat-scraper
Or using poetry:
poetry add oryx-wat-scraper
Quick Start
Python API
from oryx_wat_scraper import OryxScraper
# Initialize the scraper
scraper = OryxScraper()
# Scrape data
data = scraper.scrape()
# Generate CSV files (matching oryx_data format)
scraper.scrape_to_csv('outputfiles')
# Or get JSON
json_data = scraper.scrape_to_json('output.json')
# Close when done
scraper.close()
Context Manager
from oryx_wat_scraper import OryxScraper
with OryxScraper() as scraper:
    # Scrape specific countries
    data = scraper.scrape(countries=['russia', 'ukraine'])
    # Generate CSV files
    scraper.scrape_to_csv('outputfiles')
Command Line
# Generate CSV files
oryx-scraper --csv
# Save to JSON
oryx-scraper -o output.json
# Scrape specific countries
oryx-scraper --csv --countries russia ukraine
# Custom output directory
oryx-scraper --csv --output-dir my_output
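To make the way these flags combine a little more concrete, here is a minimal sketch of a parser that accepts the same options. It is not the package's own CLI code: build_parser is a hypothetical name, the `dest` and help strings are invented for illustration, and the default of 'outputfiles' for --output-dir is an assumption borrowed from the scrape_to_csv default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown above (not the package's code)."""
    parser = argparse.ArgumentParser(prog="oryx-scraper")
    parser.add_argument("--csv", action="store_true", help="write CSV files")
    parser.add_argument("-o", dest="output", help="path of the JSON output file")
    parser.add_argument("--countries", nargs="+", help="countries to scrape")
    # Assumption: the CLI default matches the scrape_to_csv default directory.
    parser.add_argument("--output-dir", default="outputfiles", help="CSV output directory")
    return parser


# Equivalent of: oryx-scraper --csv --countries russia ukraine
args = build_parser().parse_args(["--csv", "--countries", "russia", "ukraine"])
print(args.csv, args.countries, args.output_dir)
```

Because --countries takes one or more values, every word after it up to the next flag is collected into a single list.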
Output Formats
CSV Files (matching oryx_data format)
daily_count.csv (columns: country, equipment_type, destroyed, abandoned, captured, damaged, type_total, date_recorded)
country,equipment_type,destroyed,abandoned,captured,damaged,type_total,date_recorded
russia,T-62M,154,5,34,1,194,2024-01-15
ukraine,T-72,45,2,8,0,55,2024-01-15
totals_by_type.csv (columns: country, type, destroyed, abandoned, captured, damaged, total)
country,type,destroyed,abandoned,captured,damaged,total
russia,T-62M,154,5,34,1,194
ukraine,T-72,45,2,8,0,55
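As a minimal sketch of consuming these files, the snippet below parses the two sample daily_count.csv rows shown above with the standard library and sums type_total per country. It reads from an in-memory string so it runs without the package installed. The helper name totals_by_country is hypothetical, and treating type_total as the sum of the four status columns is an assumption that happens to hold for the sample rows.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the daily_count.csv example above.
SAMPLE = """\
country,equipment_type,destroyed,abandoned,captured,damaged,type_total,date_recorded
russia,T-62M,154,5,34,1,194,2024-01-15
ukraine,T-72,45,2,8,0,55,2024-01-15
"""

STATUS_COLUMNS = ("destroyed", "abandoned", "captured", "damaged")


def totals_by_country(csv_text: str) -> dict[str, int]:
    """Sum type_total per country (hypothetical helper, not part of the package)."""
    totals: dict[str, int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: type_total equals the sum of the four status columns.
        status_sum = sum(int(row[col]) for col in STATUS_COLUMNS)
        if status_sum != int(row["type_total"]):
            raise ValueError(f"inconsistent row: {row}")
        totals[row["country"]] += int(row["type_total"])
    return dict(totals)


print(totals_by_country(SAMPLE))
```

To run the same check on real output, pass the text of the generated daily_count.csv instead of SAMPLE.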
JSON Output
{
  "url": "https://www.oryxspioenkop.com/...",
  "date_scraped": "2024-01-15",
  "total_entries": 1000,
  "daily_count": [...],
  "totals_by_type": [...]
}
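A short sketch of reading this JSON back in Python may help when wiring the output into another tool. The payload below is a stand-in: the URL and the two lists are elided in the example above, so they are replaced with obvious placeholders here. load_payload and EXPECTED_KEYS are hypothetical names, and the key check only covers the top-level fields shown above.

```python
import json

# Stand-in payload shaped like the JSON above; the URL and lists are
# placeholders, not real scraper output.
SAMPLE = """
{
  "url": "https://example.invalid/placeholder",
  "date_scraped": "2024-01-15",
  "total_entries": 0,
  "daily_count": [],
  "totals_by_type": []
}
"""

EXPECTED_KEYS = {"url", "date_scraped", "total_entries", "daily_count", "totals_by_type"}


def load_payload(text: str) -> dict:
    """Parse JSON text and check that the documented top-level keys exist."""
    payload = json.loads(text)
    missing = EXPECTED_KEYS - payload.keys()
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    return payload


payload = load_payload(SAMPLE)
print(payload["date_scraped"], len(payload["daily_count"]))
```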
API Reference
OryxScraper
Main scraper class.
Methods
- scrape(countries: List[str] | None = None) -> Dict: Scrape data for specified countries
- scrape_to_csv(output_dir: str = 'outputfiles') -> Dict: Scrape and save to CSV files
- scrape_to_json(output_file: str | None = None, indent: int = 2) -> str: Scrape and return/save as JSON
- close(): Close the HTTP client
Models
- EquipmentEntry: Individual equipment entry with status
- SystemEntry: Individual system entry with status
Exceptions
- OryxScraperError: Base exception
- OryxScraperNetworkError: Network errors
- OryxScraperParseError: HTML parsing errors
- OryxScraperValidationError: Data validation errors
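One way to handle these exceptions is to catch the specific types first and fall back to the base class. The sketch below uses stand-in classes with the same names so that it runs without the package installed. It assumes the three specific errors subclass OryxScraperError, which the "Base exception" wording suggests but does not confirm. describe_failure and failing_fetch are hypothetical helpers.

```python
# Stand-in classes so the sketch runs without the package installed.
# Assumption: the three specific errors subclass OryxScraperError.
class OryxScraperError(Exception):
    pass


class OryxScraperNetworkError(OryxScraperError):
    pass


class OryxScraperParseError(OryxScraperError):
    pass


class OryxScraperValidationError(OryxScraperError):
    pass


def describe_failure(action) -> str:
    """Run action() and map each error type to a short message (hypothetical helper)."""
    try:
        action()
    except OryxScraperNetworkError as exc:
        return f"network problem, worth retrying: {exc}"
    except OryxScraperParseError as exc:
        return f"page layout may have changed: {exc}"
    except OryxScraperError as exc:
        # Catch-all for validation errors and anything else derived from the base.
        return f"scraper error: {exc}"
    return "ok"


def failing_fetch() -> None:
    """Simulate a scrape that fails with a network error."""
    raise OryxScraperNetworkError("timed out")


print(describe_failure(failing_fetch))
```

In real code, the stand-in classes would be replaced by imports from the package, and action would wrap a call such as scraper.scrape().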
Development
Setup
# Clone the repository
git clone https://github.com/wat-suite/oryx-wat-scraper.git
cd oryx-wat-scraper
# Install with uv
uv sync --dev
# Install pre-commit hooks
uv run pre-commit install
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=oryx_wat_scraper --cov-report=html
# Run specific test file
pytest tests/test_client.py
Running CI Checks Locally
You can run locally the same checks that CI runs:
# Run all checks
black --check .
ruff check .
mypy oryx_wat_scraper
pytest
Code Quality
# Format code
black .
# Lint code
ruff check .
# Type check
mypy oryx_wat_scraper
Make Commands
make install-dev # Install with dev dependencies
make test # Run tests
make lint # Run linters
make format # Format code
make type-check # Type check
make clean # Clean build artifacts
Releasing
This project uses GitHub Releases to trigger PyPI publishing. The workflow automatically publishes to PyPI when a final (non-pre-release) GitHub Release is created.
Release Workflow
1. Update Version
Update the version in pyproject.toml:
[project]
version = "0.2.0" # Update to your new version
Follow Semantic Versioning:
- MAJOR (1.0.0): Breaking changes
- MINOR (0.1.0): New features, backwards compatible
- PATCH (0.0.1): Bug fixes, backwards compatible
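The bump rules above can be sketched as a small function. This is an illustration only: bump is a hypothetical helper, not part of this project, and it handles plain MAJOR.MINOR.PATCH strings without pre-release or build suffixes.

```python
def bump(version: str, part: str) -> str:
    """Return the next version after bumping the given part (hypothetical helper)."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Breaking change: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # New backwards-compatible feature: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards-compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("0.1.0", "minor"))
```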
2. Update Changelog
Update CHANGELOG.md with the changes for this version.
3. Commit and Push Changes
git add pyproject.toml CHANGELOG.md
git commit -m "chore: bump version to 0.2.0"
git push origin main
4. Create a Git Tag
Create a tag matching the version, with or without a 'v' prefix (pick one option):
# Option 1: Tag with 'v' prefix
git tag v0.2.0
# Option 2: Tag without prefix
git tag 0.2.0
# Push the tag you created (v0.2.0 here; use 0.2.0 if you chose Option 2)
git push origin v0.2.0
Important: The tag version must match the version in pyproject.toml exactly (excluding the 'v' prefix if used).
5. Create GitHub Release
Go to the GitHub Releases page and click "Draft a new release":
For Pre-Release (Testing):
- Tag: Select the tag you just created (e.g., v0.2.0)
- Release title: v0.2.0 (or your version)
- Description: Copy from CHANGELOG.md or write release notes
- ☑️ Set as a pre-release: Check this box
- Click "Publish release"
Pre-releases are not published to PyPI. Use them for testing before the final release.
For Final Release (Publishing to PyPI):
- Tag: Select the tag you just created (e.g., v0.2.0)
- Release title: v0.2.0 (or your version)
- Description: Copy from CHANGELOG.md or write release notes
- ☐ Set as a pre-release: Leave this unchecked
- Click "Publish release"
The GitHub Actions workflow will:
- Verify the tag version matches pyproject.toml
- Build the package
- Check the package with twine
- Publish to PyPI (only for final releases, not pre-releases)
Workflow Summary
1. Update version in pyproject.toml
2. Update CHANGELOG.md
3. Commit and push changes
4. Create and push git tag
5. Create GitHub Release (pre-release or final)
└─> Pre-release: Testing only, not published to PyPI
└─> Final release: Automatically published to PyPI
Troubleshooting
- Version mismatch error: Ensure the tag version (without 'v' prefix) exactly matches the version in pyproject.toml
- Pre-release not published to PyPI: Pre-releases are intentionally skipped. Create a final release to publish to PyPI
- Workflow not triggered: Ensure the release is "Published" (not "Draft") and the tag exists
Based On
This scraper is based on the R script approach from:
- scrape_oryx - R script for scraping Oryx data
- oryx_data - Processed CSV data repository
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Development Workflow
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Ensure all tests pass and code is formatted (black . && ruff check . && pytest)
- Commit your changes (following Conventional Commits)
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Commit Message Guidelines
This project follows Conventional Commits. Commit messages should be formatted as:
<type>(<scope>): <subject>
<body>
<footer>
Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- style: Code style changes (formatting, etc.)
- refactor: Code refactoring
- test: Adding or updating tests
- chore: Maintenance tasks
Code Style
- This project uses Black for code formatting
- Ruff is used for linting
- mypy is used for type checking
- All code must pass linting and type checking
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
If you encounter any issues or have questions, please open an issue on GitHub.
Changelog
See CHANGELOG.md for a list of changes and version history.
Acknowledgments
- Oryx for documenting equipment losses
- scarnecchia for the R script and data processing
- All contributors who help improve this library