Utilities that integrate advanced scraping knowledge into just one library.

datamarket

datamarket is a Python library of reusable scraping, data ingestion, and integration utilities used across DataMarket projects.

This README explains what the library is, how to run it locally, how to use it, and where deeper documentation lives.

It solves a practical problem: different scrapers and ETL jobs often re-implement the same low-level pieces (HTTP retries, proxy rotation, SQLAlchemy batch writes, cloud storage clients, LLM wrappers). This repository centralizes those capabilities in a single package so projects can stay focused on business logic.

Project Overview

  • Primary value: standardized interfaces for data collection, transformation, and delivery.
  • Language/runtime: Python ^3.12 (from pyproject.toml).
  • Package manager/build: Poetry (pyproject.toml, poetry.lock).
  • Testing: pytest-based tests in tests/.
  • Lint/format: pre-commit hooks (Ruff + Ruff format) via pre-commit-config submodule.

High-Level Architecture

Core package code lives in src/datamarket/ and is organized by responsibility:

  • src/datamarket/interfaces/: service-facing interfaces (LLM, SQLAlchemy, proxy, AWS, Azure Blob, Drive, FTP, Tinybird, Nominatim, PeerDB).
  • src/datamarket/utils/: shared helpers (HTTP client wrapper, config loading, logging, Playwright/Selenium helpers, string normalization, data quality sampler).
  • src/datamarket/exceptions/: custom exception types used across request and proxy workflows.
  • src/datamarket/params/: static parameter dictionaries/constants (for example, Nominatim enrichment data).

For architecture diagrams and deeper design notes, see docs/2. Architecture Overview.md.

Prerequisites

  • Python 3.12 or later within the 3.x series (pyproject.toml declares ^3.12).
  • pip (for install) and optionally poetry (for dependency/workflow management).
  • Optional: Conda if you want to use the bootstrap helper in init.sh.

Installation

To install this library in your Python environment:

pip install datamarket
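
To confirm the install from Python, a minimal sketch using only the standard library is shown here. The helper name installed_version is hypothetical and is not part of datamarket.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "datamarket") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("datamarket is not installed in this environment")
else:
    print(f"datamarket {found} is installed")
```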

Environment Setup

Option A: Poetry workflow

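poetry install
# Poetry 2.x: `poetry shell` requires the poetry-plugin-shell plugin; `poetry run <command>` works without it.
poetry shell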

Option B: Conda bootstrap script

init.sh creates a Conda environment named <package>_env, installs the package in editable mode, initializes submodules, and installs pre-commit hooks.

bash init.sh

Basic Usage

This section shows how to use the package from consumer projects.

  • Import interfaces directly from module paths, for example:
    • from datamarket.interfaces.llm import LLMInterface
    • from datamarket.interfaces.proxy import ProxyInterface
    • from datamarket.interfaces.alchemy import AlchemyInterface
  • Load INI-style config using datamarket.utils.main.get_config when needed.
  • Run end-to-end examples from examples/ for LLM and vision use cases.
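
The import paths above can be exercised with a small defensive loader. This is a sketch rather than the library's own pattern: the module paths and class names come from this README, while load_interface and INTERFACE_MODULES are hypothetical helpers. It only imports the classes and does not construct them, because constructor signatures are not documented here.

```python
import importlib
from typing import Optional

# Module paths and class names as listed in this README.
INTERFACE_MODULES = {
    "LLMInterface": "datamarket.interfaces.llm",
    "ProxyInterface": "datamarket.interfaces.proxy",
    "AlchemyInterface": "datamarket.interfaces.alchemy",
}


def load_interface(name: str) -> Optional[type]:
    """Return the interface class, or None if it cannot be imported."""
    module_path = INTERFACE_MODULES[name]
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # optional extra (or a missing datamarket install) lands here.
        print(f"{name} unavailable ({exc}); an optional extra may be missing")
        return None
    return getattr(module, name, None)


available = {name: load_interface(name) for name in INTERFACE_MODULES}
```

Returning None instead of raising lets a consumer project decide which interfaces are mandatory for its own job.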

Development Workflow

Run examples

python examples/llm_usage_examples.py
python examples/llm_vision_examples.py

Run tests

pytest -v

Lint and format

This repo uses pre-commit hooks defined in pre-commit-config/.pre-commit-config.yaml:

pre-commit run --all-files

Build artifacts

poetry build

Built distributions are output to dist/.

Configuration

This library is configuration-driven. Most interfaces expect either:

  • a dict-like object (config["section"]["key"]), or
  • a ConfigParser/RawConfigParser object for INI files.

Common sections used by interfaces include:

  • [llm] for LLMInterface (provider, api_key, model).
  • [db] for AlchemyInterface and Postgres peer operations.
  • [proxy] for ProxyInterface (hosts, optional tor_password).
  • [tinybird], [osm], [drive].
  • Profile-based sections such as [aws:<profile>], [azure:<profile>], [ftp:<profile>].
  • PeerDB-specific sections: [peerdb], [clickhouse], [peerdb-s3].

See the generated wiki pages in docs/ for concrete config and workflow details, especially docs/3. Workflows.md and docs/Deep Dive/Interfaces.md.
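
As a rough illustration of the two accepted config shapes, the sketch below parses an INI document with the standard-library ConfigParser and also derives a plain dict from it. The [llm] and [proxy] key names are taken from this README; every value is a placeholder, the key inside [aws:default] is hypothetical, and load_config and profile_sections are illustrative helpers rather than datamarket functions.

```python
from configparser import ConfigParser

# Placeholder values only. The [llm] and [proxy] key names come from this
# README; the key inside [aws:default] is hypothetical.
INI_TEXT = """
[llm]
provider = example-provider
api_key = replace-me
model = example-model

[proxy]
hosts = user:pass@proxy.example.com:8080

[aws:default]
example_key = example-value
"""


def load_config(text: str) -> ConfigParser:
    """Parse INI text into a ConfigParser object."""
    parser = ConfigParser()
    parser.read_string(text)
    return parser


def profile_sections(config: ConfigParser, prefix: str) -> dict:
    """Map profile name -> settings for sections named [<prefix>:<profile>]."""
    marker = f"{prefix}:"
    return {
        section[len(marker):]: dict(config[section])
        for section in config.sections()
        if section.startswith(marker)
    }


config = load_config(INI_TEXT)

# The same data as a plain nested dict, for code that expects config["section"]["key"].
as_dict = {section: dict(config[section]) for section in config.sections()}
```

Both objects support config["llm"]["api_key"] style access, which is the access pattern described above.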

Deployment and Release Notes

  • This repository is a library package, not a deployable service.
  • Release packaging is supported through Poetry (poetry build); Twine can be used for publishing.
  • CI/CD release automation is not configured in this repository (no .github/workflows/ present).
  • See docs/4. ADRs.md for architecture-level release and maintenance trade-offs.

Troubleshooting

  • ModuleNotFoundError for optional features: install required extras (for example .[llm], .[pytest], .[boto3]).
  • Configuration must contain 'llm' section: include [llm] with api_key before creating LLMInterface.
  • No working proxies available: verify [proxy] hosts format (host:port or user:pass@host:port) and network access.
  • SQLAlchemy connection errors: verify [db] credentials and engine string.
  • Pre-commit command not found: install pre-commit in your active environment.
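
Two of the checks above can be run before any interface is created. The following is a sketch under stated assumptions: it validates the host:port and user:pass@host:port shapes named in this README with a regular expression, and checks for an [llm] section containing api_key. The helpers parse_proxy_host and missing_llm_settings are hypothetical and do not reproduce the validation datamarket performs internally.

```python
import re
from typing import Optional

# Accepts "host:port" or "user:pass@host:port".
_PROXY_RE = re.compile(
    r"^(?:(?P<user>[^:@\s]+):(?P<password>[^@\s]+)@)?"
    r"(?P<host>[^:@\s]+):(?P<port>\d{1,5})$"
)


def parse_proxy_host(entry: str) -> Optional[dict]:
    """Split a proxy entry into its parts, or return None if it is malformed."""
    match = _PROXY_RE.match(entry.strip())
    if match is None:
        return None
    port = int(match["port"])
    if not 1 <= port <= 65535:
        return None
    return {
        "user": match["user"],
        "password": match["password"],
        "host": match["host"],
        "port": port,
    }


def missing_llm_settings(config) -> list:
    """List what is missing for the [llm] section; works on dicts and ConfigParser."""
    if "llm" not in config:
        return ["[llm] section"]
    return [key for key in ("api_key",) if not config["llm"].get(key)]
```

For example, parse_proxy_host("proxy.example.com") returns None because the port is missing, and missing_llm_settings({}) reports the absent section.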

Contributing (Summary)

  • Keep changes scoped and aligned with existing module boundaries in src/datamarket/.
  • Add or update tests under tests/ for behavioral changes.
  • Run pytest -v and pre-commit run --all-files before opening a PR.
  • Keep docs current when interfaces, config keys, or workflows change.

Documentation Map

  • Wiki home: docs/Home.md
  • Project overview: docs/1. Project Overview.md
  • Architecture overview (C4): docs/2. Architecture Overview.md
  • Workflows: docs/3. Workflows.md
  • Architecture decisions: docs/4. ADRs.md
  • Deep dives: docs/Deep Dive/Interfaces.md, docs/Deep Dive/LLM.md, docs/Deep Dive/SQLAlchemy.md, docs/Deep Dive/Utilities.md, docs/Deep Dive/Geo Enrichment.md
  • Digital twin artifacts: docs/_twin/inventory.json, docs/_twin/graph.json, docs/_twin/domain-map.md, docs/_twin/patterns.md

Documentation Status

Diátaxis type: Reference.

  • This README is the entry point and is maintained incrementally from validated repository summaries.
  • Current generated references in this run:
    • docs/1. Project Overview.md
    • docs/2. Architecture Overview.md
    • docs/3. Workflows.md
    • docs/4. ADRs.md
  • Known unknowns:
    • Linked pages under the external docs submodule may differ from this local snapshot.

License

GPL-3.0-or-later. See LICENSE.

Sources: README.md (summary_hash: 905c027d111146820a6ea5c807c7b4a0f7094f9b36ab6528f36500a3f5e07520)
