Load Wikimedia pageview data into a DuckDB database

pvduck

pvduck is a CLI tool that lets you sample, download, filter, parse, and aggregate Wikimedia pageview dumps. The result is a DuckDB database giving an overview of the most visited pages within a sampled subset of Wikimedia pages.

This is useful if you want a rough overview of pageviews for a specific subset of pages, but lack access to the analytics servers and don't want to host a full HDFS cluster yourself. Instead of storing several GB per day, you can aggregate years' worth of data in a small database holding just the data you need.

The tool was developed for my own weekend projects, where gauging popularity without perfect precision or history is very helpful.

Installation

Recommended: pipx (installed)

Install pvduck in an isolated environment with pipx:

pipx install pvduck

Then use it directly:

pvduck create myproject
pvduck sync myproject

Alternative: uvx (no installation)

Run pvduck without installing using uvx:

uvx pvduck create myproject
uvx pvduck sync myproject

Perfect for trying the tool without commitment!

Alternative: Docker

For containerized environments:

docker run --rm -it \
    -v ~/.config:/root/.config \
    -v ~/.local/share:/root/.local/share \
    vegardege/pvduck:latest \
    create myproject

Note: Use -it for commands that open an editor.

Usage

Run pvduck --help for instructions, and pvduck --install-completion to install auto-completion in your shell.

The tool has eight commands:

| Command | Description |
| --- | --- |
| create <project_name> | Create a new project |
| edit <project_name> | Edit project configuration |
| rm <project_name> | Delete config and database |
| sync <project_name> | Download missing data (if any) and aggregate into the database |
| open <project_name> | Open the project's database in duckdb |
| status <project_name> | See progress status for the project |
| compact <project_name> | Reclaim disk space according to duckdb best practice |
| ls | List all existing projects |

By default, sync runs until it has exhausted the configured date range at the configured sample rate. If you want to run it for a limited time only, use the max_files option to stop after a specific number of files.

Note that syncing can be memory intensive. It operates in chunks of 1 000 000 rows by default, which can be changed with the PVDUCK_CHUNK_SIZE environment variable. Increase the value for faster syncs; decrease it for more memory-efficient (but slower) execution.
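
The trade-off behind the chunk size can be illustrated with a minimal Python sketch. This is not pvduck's actual implementation: only the PVDUCK_CHUNK_SIZE variable and its default come from the text above, while the function names and the (title, views) row shape are assumptions made for illustration.

```python
import os
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator

DEFAULT_CHUNK_SIZE = 1_000_000  # default stated in this README

Row = tuple[str, int]  # hypothetical row shape: (page title, views)


def chunk_size_from_env() -> int:
    """Read PVDUCK_CHUNK_SIZE, falling back to the documented default."""
    return int(os.environ.get("PVDUCK_CHUNK_SIZE", DEFAULT_CHUNK_SIZE))


def chunked(rows: Iterable[Row], size: int) -> Iterator[list[Row]]:
    """Yield lists of at most `size` rows, so only one chunk is held at a time."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk


def aggregate(rows: Iterable[Row], size: int) -> Counter:
    """Sum views per title, one chunk at a time."""
    totals: Counter = Counter()
    for chunk in chunked(rows, size):
        # Hypothetical: a real tool would merge each chunk into its database here.
        for title, views in chunk:
            totals[title] += views
    return totals
```

A larger chunk means fewer merge steps but more rows in memory at once, which is the behaviour the paragraph above describes.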

When you create a new project, you can define the configuration:

| Param | Description |
| --- | --- |
| base_url | Which Wikimedia mirror to use (recommended: use a mirror close to you) |
| sleep_time | How many seconds to wait between each file download |
| start_date | Date of the first dump file to download |
| end_date | Date of the last dump file to download (leave blank to keep extending the range to the current date) |
| sample_rate | Probability of downloading each hourly file in the interval |
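
To make the interaction between start_date, end_date, and sample_rate concrete, here is a minimal Python sketch of how hourly files could be sampled. It is an illustration under stated assumptions, not pvduck's code: the function names are hypothetical, and the URL layout is the one used by the public Wikimedia pageviews dumps, which a mirror may or may not share.

```python
import random
from datetime import date, datetime, timedelta
from typing import Iterator, Optional

# Assumption: the public dumps host; a mirror would replace this value.
BASE_URL = "https://dumps.wikimedia.org/other/pageviews"


def hourly_timestamps(start: date, end: date) -> Iterator[datetime]:
    """Yield every full hour from start (00:00) through end (23:00)."""
    current = datetime(start.year, start.month, start.day)
    stop = datetime(end.year, end.month, end.day) + timedelta(days=1)
    while current < stop:
        yield current
        current += timedelta(hours=1)


def file_url(base_url: str, ts: datetime) -> str:
    """Build the URL of the hourly dump file for the given timestamp."""
    return f"{base_url}/{ts:%Y}/{ts:%Y-%m}/pageviews-{ts:%Y%m%d-%H}0000.gz"


def sample_urls(
    start: date,
    end: date,
    sample_rate: float,
    base_url: str = BASE_URL,
    seed: Optional[int] = None,
) -> list[str]:
    """Keep each hourly file independently with probability sample_rate."""
    rng = random.Random(seed)
    return [
        file_url(base_url, ts)
        for ts in hourly_timestamps(start, end)
        if rng.random() < sample_rate
    ]
```

With a sample_rate of 0.1, for example, roughly 10% of the 24 hourly files per day would be selected, so any view counts derived from them are estimates rather than exact totals.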

In addition, the config file contains filters to reduce the size of the dataset. All filters can be set to blank values, which means no rows are excluded.

| Filter | Type | Description |
| --- | --- | --- |
| line_regex | regex | Regular expression used to filter lines before parsing |
| page_title | regex | Regular expression used to filter page titles after parsing |
| domain_codes | list[str] | List of domain codes to accept |
| min_views | int | Minimum number of views needed to be accepted |
| max_views | int | Maximum number of views allowed |
| languages | list[str] | List of languages to accept |
| domains | list[str] | List of domains to accept |
| mobile | bool | If set, filter on whether the row belongs to a mobile site |
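
The order in which the filters apply (line_regex before parsing, the rest after) can be sketched in Python as follows. This is a simplified illustration, not pvduck's implementation: the Filters class and accept function are hypothetical, the use of re.search is an assumption, and the languages, domains, and mobile filters are left out for brevity. Lines in the hourly pageview dumps consist of four space-separated fields: domain code, page title, view count, and response size.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Filters:
    """Hypothetical container; None means 'do not filter on this field'."""
    line_regex: Optional[str] = None
    page_title: Optional[str] = None
    domain_codes: Optional[list[str]] = None
    min_views: Optional[int] = None
    max_views: Optional[int] = None


def accept(line: str, f: Filters) -> Optional[tuple[str, str, int]]:
    """Return (domain_code, title, views) if the line passes, else None."""
    # Cheap pre-filter on the raw line, before any parsing work is done.
    if f.line_regex and not re.search(f.line_regex, line):
        return None

    parts = line.split(" ")
    if len(parts) != 4 or not parts[2].isdigit():
        return None  # skip malformed lines
    domain_code, title, views = parts[0], parts[1], int(parts[2])

    # Filters that need the parsed fields.
    if f.page_title and not re.search(f.page_title, title):
        return None
    if f.domain_codes and domain_code not in f.domain_codes:
        return None
    if f.min_views is not None and views < f.min_views:
        return None
    if f.max_views is not None and views > f.max_views:
        return None
    return domain_code, title, views
```

Running the raw-line check first means most rejected lines are never split or converted, which is why a selective line_regex can noticeably reduce the work per file.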

[!IMPORTANT]
The sync operation is destructive. It keeps track of which files you have downloaded, but cannot revert any aggregation operations. As a result, only some parameters and filters can be changed without putting the database in an inconsistent state. Notably, you can expand the date range and increase the sample rate without issues.

Test

uv run pytest tests/ --cov=src

Name                       Stmts   Miss  Cover
----------------------------------------------
src/pvduck/__init__.py         0      0   100%
src/pvduck/cli.py            108      0   100%
src/pvduck/config.py          69      0   100%
src/pvduck/db.py              70      0   100%
src/pvduck/project.py         21      0   100%
src/pvduck/stream.py          20      0   100%
src/pvduck/timeseries.py      32      0   100%
src/pvduck/validators.py      22      0   100%
src/pvduck/wikimedia.py        8      0   100%
----------------------------------------------
TOTAL                        350      0   100%

[!CAUTION] Tests are separated into unit tests and integration tests. The integration tests take several minutes to run and download files from the configured server. Don't run the full test suite frivolously.
