Utilities that integrate advanced scraping knowledge into just one library.
datamarket
datamarket is a Python library of reusable scraping, data ingestion, and integration utilities used across DataMarket projects.
This README explains what the library is, how to run it locally, how to use it, and where deeper documentation lives.
It solves a practical problem: different scrapers and ETL jobs often re-implement the same low-level pieces (HTTP retries, proxy rotation, SQLAlchemy batch writes, cloud storage clients, LLM wrappers). This repository centralizes those capabilities in a single package so projects can stay focused on business logic.
Project Overview
- Primary value: standardized interfaces for data collection, transformation, and delivery.
- Language/runtime: Python ^3.12 (from pyproject.toml).
- Package manager/build: Poetry (pyproject.toml, poetry.lock).
- Testing: pytest-based tests in tests/.
- Lint/format: pre-commit hooks (Ruff + Ruff format) via the pre-commit-config submodule.
High-Level Architecture
Core package code lives in src/datamarket/ and is organized by responsibility:
- src/datamarket/interfaces/: service-facing interfaces (LLM, SQLAlchemy, proxy, AWS, Azure Blob, Drive, FTP, Tinybird, Nominatim, PeerDB).
- src/datamarket/utils/: shared helpers (HTTP client wrapper, config loading, logging, Playwright/Selenium helpers, string normalization, data quality sampler).
- src/datamarket/exceptions/: custom exception types used across request and proxy workflows.
- src/datamarket/params/: static parameter dictionaries/constants (for example, Nominatim enrichment data).
For architecture diagrams and deeper design notes, see docs/2. Architecture Overview.md.
Prerequisites
- Python 3.12.
- pip (for install) and optionally poetry (for dependency/workflow management).
- Optional: Conda if you want to use the bootstrap helper in init.sh.
Installation
To install this library in your Python environment:
pip install datamarket
Environment Setup
Option A: Poetry workflow
poetry install
poetry shell
Option B: Conda bootstrap script
init.sh creates a Conda environment named <package>_env, installs the package in editable mode, initializes submodules, and installs pre-commit hooks.
bash init.sh
Basic Usage
This section shows how to use the package from consumer projects.
- Import interfaces directly from module paths, for example:
  from datamarket.interfaces.llm import LLMInterface
  from datamarket.interfaces.proxy import ProxyInterface
  from datamarket.interfaces.alchemy import AlchemyInterface
- Load INI-style config using datamarket.utils.main.get_config when needed.
- Run end-to-end examples from examples/ for LLM and vision use cases.
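Because some features depend on optional extras, it can help to check which interface modules are importable before wiring them into a consumer project. The following is a minimal sketch using only the standard library. The module paths come from the import examples above; the helper names `is_importable` and `report_availability` are hypothetical and are not part of datamarket.

```python
import importlib.util

# Module paths taken from the import examples in this README.
INTERFACE_MODULES = [
    "datamarket.interfaces.llm",
    "datamarket.interfaces.proxy",
    "datamarket.interfaces.alchemy",
]


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located in the current environment.

    find_spec does not execute the target module itself, but it does import
    its parent packages in order to search them.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or exposes no usable spec.
        return False


def report_availability(module_names):
    """Map each module path to whether it can be located."""
    return {name: is_importable(name) for name in module_names}


if __name__ == "__main__":
    for name, ok in report_availability(INTERFACE_MODULES).items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

A module reported as missing usually means the package or one of its extras is not installed in the active environment.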
Development Workflow
Run examples
python examples/llm_usage_examples.py
python examples/llm_vision_examples.py
Run tests
pytest -v
Lint and format
This repo uses pre-commit hooks defined in pre-commit-config/.pre-commit-config.yaml:
pre-commit run --all-files
Build artifacts
poetry build
Built distributions are output to dist/.
Configuration
This library is configuration-driven. Most interfaces expect either:
- a dict-like object (config["section"]["key"]), or
- a ConfigParser/RawConfigParser object for INI files.
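Both shapes support the same `config["section"]["key"]` lookup, which is why code that reads settings this way works with either. A small standard-library sketch follows. The helper `get_setting` and the placeholder values are illustrative assumptions, not datamarket APIs.

```python
import configparser


def get_setting(config, section, key):
    """Read config[section][key] from a dict-like object or a ConfigParser."""
    return config[section][key]


# Shape 1: a plain nested dict.
dict_config = {"llm": {"provider": "example-provider", "model": "example-model"}}

# Shape 2: a ConfigParser populated from INI text.
ini_config = configparser.ConfigParser()
ini_config.read_string(
    """
[llm]
provider = example-provider
model = example-model
"""
)

if __name__ == "__main__":
    print(get_setting(dict_config, "llm", "model"))
    print(get_setting(ini_config, "llm", "model"))
```

One difference to keep in mind: values read from a ConfigParser are always strings, while a dict can hold any type.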
Common sections used by interfaces include:
- [llm] for LLMInterface (provider, api_key, model).
- [db] for AlchemyInterface and Postgres peer operations.
- [proxy] for ProxyInterface (hosts, optional tor_password).
- [tinybird], [osm], [drive].
- Profile-based sections such as [aws:<profile>], [azure:<profile>], [ftp:<profile>].
- PeerDB-specific sections: [peerdb], [clickhouse], [peerdb-s3].
See the generated wiki pages in docs/ for concrete config and workflow details, especially docs/3. Workflows.md and docs/Deep Dive/Interfaces.md.
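As a rough illustration of the section layout, the sketch below parses a sample INI file with the standard library and looks up profile-based sections. The [llm] and [proxy] key names come from the list above. The `bucket` key, the profile names, all values, and the helpers `load_config`, `profiles`, and `profile_section` are hypothetical; consult the linked docs for the keys each interface actually reads.

```python
import configparser

SAMPLE_INI = """
[llm]
provider = example-provider
api_key = replace-me
model = example-model

[proxy]
hosts = proxy.example.com:8080

[aws:default]
# "bucket" is a hypothetical key name, used for illustration only.
bucket = example-bucket

[aws:archive]
bucket = example-archive-bucket
"""


def load_config(text):
    """Parse INI text; RawConfigParser leaves '%' in values untouched."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    return parser


def profiles(config, prefix):
    """List profile names for sections shaped like [prefix:profile]."""
    marker = prefix + ":"
    return [name[len(marker):] for name in config.sections() if name.startswith(marker)]


def profile_section(config, prefix, profile):
    """Return the section for one profile, e.g. [aws:archive]."""
    return config[f"{prefix}:{profile}"]


if __name__ == "__main__":
    cfg = load_config(SAMPLE_INI)
    print(profiles(cfg, "aws"))
    print(profile_section(cfg, "aws", "archive")["bucket"])
```

RawConfigParser is used here because it skips value interpolation, so credentials containing a percent sign are read verbatim.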
Deployment and Release Notes
- This repository is a library package, not a deployable service.
- Release packaging is supported through Poetry (poetry build) and Twine can be used for publishing.
- CI/CD release automation is not configured in this repository (no .github/workflows/ present).
- See docs/4. ADRs.md for architecture-level release and maintenance trade-offs.
Troubleshooting
- ModuleNotFoundError for optional features: install required extras (for example .[llm], .[pytest], .[boto3]).
- Configuration must contain 'llm' section: include [llm] with api_key before creating LLMInterface.
- No working proxies available: verify [proxy] hosts format (host:port or user:pass@host:port) and network access.
- SQLAlchemy connection errors: verify [db] credentials and engine string.
- Pre-commit command not found: install pre-commit in your active environment.
Contributing (Summary)
- Keep changes scoped and aligned with existing module boundaries in src/datamarket/.
- Add or update tests under tests/ for behavioral changes.
- Run pytest -v and pre-commit run --all-files before opening a PR.
- Keep docs current when interfaces, config keys, or workflows change.
Documentation Map
- Wiki home: docs/Home.md
- Project overview: docs/1. Project Overview.md
- Architecture overview (C4): docs/2. Architecture Overview.md
- Workflows: docs/3. Workflows.md
- Architecture decisions: docs/4. ADRs.md
- Deep dives: docs/Deep Dive/Interfaces.md, docs/Deep Dive/LLM.md, docs/Deep Dive/SQLAlchemy.md, docs/Deep Dive/Utilities.md, docs/Deep Dive/Geo Enrichment.md
- Digital twin artifacts: docs/_twin/inventory.json, docs/_twin/graph.json, docs/_twin/domain-map.md, docs/_twin/patterns.md
Documentation Status
Diataxis type: Reference.
- This README is the entry point and is maintained incrementally from validated repository summaries.
- Current generated references in this run:
  - docs/1. Project Overview.md
  - docs/2. Architecture Overview.md
  - docs/3. Workflows.md
  - docs/4. ADRs.md
- Known unknowns:
  - UNKNOWN: linked pages under the external docs submodule may differ from this local snapshot.
License
GPL-3.0-or-later. See LICENSE.
Sources: README.md (summary_hash: 905c027d111146820a6ea5c807c7b4a0f7094f9b36ab6528f36500a3f5e07520)
File details
Details for the file datamarket-0.10.22.tar.gz.
File metadata
- Download URL: datamarket-0.10.22.tar.gz
- Upload date:
- Size: 98.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 82e370bcd12c1cbec86a76e003942bcbef27aa113bc0d42f79b22f7844092944 |
| MD5 | 0e551d2d5978461c409f5306562a53e7 |
| BLAKE2b-256 | c1781a3b6a719196ac050b5f3e3b6de237ec447e6405d967ffb44e48d046cbab |
File details
Details for the file datamarket-0.10.22-py3-none-any.whl.
File metadata
- Download URL: datamarket-0.10.22-py3-none-any.whl
- Upload date:
- Size: 109.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 33a084a813b915b9bb54790ff622d4aa32c60bd43ddb416f289fad6cc049239b |
| MD5 | 346badfc6bbd4ce0d02269c5d28340ab |
| BLAKE2b-256 | 158e472ef0e7ee34c6a61fd001c22e3e340edde6a80726dd4a020bb4a9566e15 |
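The SHA256 digests listed above can be compared against a locally downloaded file. The following is a minimal sketch using only `hashlib`. The function names are hypothetical, and the file path in the usage block assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 equals the published digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumes the source distribution was downloaded to the working directory.
    published = "82e370bcd12c1cbec86a76e003942bcbef27aa113bc0d42f79b22f7844092944"
    print(matches_published_hash("datamarket-0.10.22.tar.gz", published))
```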