Load Wikimedia pageview data into a duckdb database

pvduck

pvduck is a CLI tool that lets you sample, download, filter, parse, and aggregate Wikimedia pageview dumps, creating an overview of the most visited pages on a sampled subset of Wikimedia pages in a duckdb database.

This is useful if you want a rough overview of pageviews for a specific subset of pages, but lack access to the analytics servers and don't want to host a full HDFS cluster yourself. Instead of storing several GB per day, you can aggregate years' worth of data in a small database holding just the data you need.

The tool was developed for my own weekend projects, where gauging popularity without perfect precision or history is very helpful.

Installation

Recommended: pipx (installed)

Install pvduck in an isolated environment with pipx:

pipx install pvduck

Then use it directly:

pvduck create myproject
pvduck sync myproject

Alternative: uvx (no installation)

Run pvduck without installing using uvx:

uvx pvduck create myproject
uvx pvduck sync myproject

Perfect for trying the tool without commitment!

Alternative: Docker

For containerized environments:

docker run --rm -it \
    -v ~/.config:/root/.config \
    -v ~/.local/share:/root/.local/share \
    vegardege/pvduck:latest \
    create myproject

Note: Use -it for commands that open an editor.

Usage

Run `pvduck --help` for instructions, and `pvduck --install-completion` to install auto-completion in your shell.

The tool has eight commands:

| Command | Description |
| --- | --- |
| `create <project_name>` | Create a new project |
| `edit <project_name>` | Edit project configuration |
| `rm <project_name>` | Delete config and database |
| `sync <project_name>` | Download missing data (if any) and aggregate into the database |
| `open <project_name>` | Open the project's database in `duckdb` |
| `status <project_name>` | See progress status for the project |
| `compact <project_name>` | Reclaim disk space according to `duckdb` best practice |
| `ls` | List all existing projects |

By default, `sync` runs until it has exhausted the given date range at the given sample rate. To limit a run, set the `max_files` option to stop after a specific number of files.

Note that syncing can be memory intensive. It operates in chunks of 1 000 000 rows by default, which can be changed with the `PVDUCK_CHUNK_SIZE` environment variable. Increase the value for faster syncs; decrease it for more memory-efficient (but slower) execution.
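
To illustrate the trade-off described above, here is a minimal Python sketch of chunked processing driven by an environment variable. It is not pvduck's actual implementation: the helper names `chunk_size_from_env` and `chunked` are hypothetical, and only the variable name `PVDUCK_CHUNK_SIZE` and the default of 1 000 000 come from the text above.

```python
import os
from itertools import islice
from typing import Iterable, Iterator

DEFAULT_CHUNK_SIZE = 1_000_000  # default stated in the text above


def chunk_size_from_env() -> int:
    """Read PVDUCK_CHUNK_SIZE, falling back to the documented default.

    Hypothetical helper: pvduck may parse and validate the variable differently.
    """
    raw = os.environ.get("PVDUCK_CHUNK_SIZE", "")
    return int(raw) if raw.strip() else DEFAULT_CHUNK_SIZE


def chunked(rows: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield lists of at most `size` rows, so only one chunk is held in memory."""
    if size < 1:
        raise ValueError("chunk size must be at least 1")
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

A smaller chunk size means more, smaller batches: peak memory drops, while the per-batch overhead is paid more often. In a POSIX shell, the variable can be set for a single run by prefixing the command, for example `PVDUCK_CHUNK_SIZE=250000 pvduck sync myproject`.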

When you create a new project, you can define the configuration:

| Param | Description |
| --- | --- |
| `base_url` | Which Wikimedia mirror to use (recommended: use a mirror close to you) |
| `sleep_time` | How many seconds to wait between file downloads |
| `start_date` | Date of the first dump file to download |
| `end_date` | Date of the last dump file to download (leave blank to keep extending to the current date) |
| `sample_rate` | Probability of downloading each hourly file in the interval |
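
As a rough illustration of how `start_date`, `end_date`, and `sample_rate` interact, the sketch below enumerates every hourly file in a date range and keeps each one with the given probability. This is a sketch under assumptions, not pvduck's code: the function names are hypothetical, and the URL layout in `dump_url` follows the layout used for hourly pageview dumps on dumps.wikimedia.org, which pvduck may or may not compose in this way.

```python
import random
from datetime import date, datetime, timedelta
from typing import Optional


def hourly_timestamps(start: date, end: date) -> list[datetime]:
    """All hourly timestamps from start 00:00 through end 23:00, inclusive."""
    first = datetime(start.year, start.month, start.day)
    hours = ((end - start).days + 1) * 24
    return [first + timedelta(hours=h) for h in range(hours)]


def sample_hours(
    start: date, end: date, sample_rate: float, seed: Optional[int] = None
) -> list[datetime]:
    """Keep each hourly file independently with probability `sample_rate`."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [ts for ts in hourly_timestamps(start, end) if rng.random() < sample_rate]


def dump_url(base_url: str, ts: datetime) -> str:
    """Build the URL of one hourly dump (assumed layout, see lead-in)."""
    return (
        f"{base_url.rstrip('/')}/{ts:%Y}/{ts:%Y-%m}/"
        f"pageviews-{ts:%Y%m%d-%H}0000.gz"
    )
```

With this model, the expected number of downloaded files is 24 × (number of days) × `sample_rate`, which gives a quick estimate of how long a full sync will take for a given `sleep_time`.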

In addition, the config file contains filters to reduce the size of the dataset. All filters can be set to blank values, which means no rows are excluded.

| Filter | Type | Description |
| --- | --- | --- |
| `line_regex` | `regex` | Regular expression used to filter lines before parsing |
| `page_title` | `regex` | Regular expression used to filter page titles after parsing |
| `domain_codes` | `list[str]` | List of domain codes to accept |
| `min_views` | `int` | Minimum number of views needed to be accepted |
| `max_views` | `int` | Maximum number of views allowed |
| `languages` | `list[str]` | List of languages to accept |
| `domains` | `list[str]` | List of domains to accept |
| `mobile` | `bool` | If set, filter on whether the row belongs to a mobile site |
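
The sketch below shows one way such filters could be applied to a single line of a pageview dump. It is illustrative only and not taken from pvduck: the `Filters` class and `accept` function are hypothetical, the use of `re.search` (rather than a full match) is an assumption, and the line format of four space-separated fields (domain code, page title, view count, response size) is the one used by Wikimedia's hourly pageview files. The `languages`, `domains`, and `mobile` filters are left out because they depend on how domain codes are decoded.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Filters:
    """Hypothetical container; blank values mean 'exclude nothing'."""

    line_regex: Optional[str] = None
    page_title: Optional[str] = None
    domain_codes: list[str] = field(default_factory=list)
    min_views: Optional[int] = None
    max_views: Optional[int] = None


def accept(line: str, f: Filters) -> Optional[tuple[str, str, int]]:
    """Return (domain_code, page_title, views) if the line passes all filters."""
    # Cheap pre-filter on the raw line, before any parsing.
    if f.line_regex and not re.search(f.line_regex, line):
        return None
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4 or not parts[2].isdigit():
        return None  # malformed line
    domain_code, title, views = parts[0], parts[1], int(parts[2])
    if f.page_title and not re.search(f.page_title, title):
        return None
    if f.domain_codes and domain_code not in f.domain_codes:
        return None
    if f.min_views is not None and views < f.min_views:
        return None
    if f.max_views is not None and views > f.max_views:
        return None
    return domain_code, title, views
```

Running the raw-line regex first is a design choice in this sketch: it lets most unwanted lines be discarded before the comparatively slower split and integer conversion.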

> [!IMPORTANT]
> The `sync` operation is destructive. It keeps track of which files you have downloaded, but cannot revert any aggregation operations. As a result, only some parameters and filters can be changed without putting the database in an inconsistent state. Notably, you can expand the date range and increase the sample rate without issues.
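
The toy model below illustrates why this is the case. It is a deliberately simplified sketch, not pvduck's storage layer: the `Aggregate` class is hypothetical and keeps running totals in memory, along with the set of files already folded in.

```python
from collections import Counter


class Aggregate:
    """Running totals plus a record of which files have been folded in."""

    def __init__(self) -> None:
        self.views = Counter()   # page title -> summed views
        self.seen_files = set()  # names of files already aggregated

    def sync(self, files: dict[str, dict[str, int]], min_views: int = 0) -> int:
        """Fold in every file not seen before; return how many were added."""
        added = 0
        for name, rows in files.items():
            if name in self.seen_files:
                continue  # already aggregated, never re-read
            for title, count in rows.items():
                if count >= min_views:
                    self.views[title] += count
            self.seen_files.add(name)
            added += 1
        return added


# First sync with a strict filter: "Duck" (5 views) is dropped for hour 1.
agg = Aggregate()
agg.sync({"h1": {"Duck": 5, "Goose": 50}}, min_views=10)

# Loosening the filter later does not bring those 5 views back, because
# hour 1 is already marked as seen. Only the new file contributes.
agg.sync({"h1": {"Duck": 5, "Goose": 50}, "h2": {"Duck": 7}}, min_views=0)
print(agg.views["Duck"])  # 7, whereas a fresh sync with min_views=0 gives 12
```

Adding new files, which is what expanding the date range or raising the sample rate does, only ever appends to the totals, so it stays consistent. Changing a filter would require re-reading files that have already been summed into the totals.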

Test

uv run pytest tests/ --cov=src

Name                       Stmts   Miss  Cover
----------------------------------------------
src/pvduck/__init__.py         0      0   100%
src/pvduck/cli.py            108      0   100%
src/pvduck/config.py          69      0   100%
src/pvduck/db.py              70      0   100%
src/pvduck/project.py         21      0   100%
src/pvduck/stream.py          20      0   100%
src/pvduck/timeseries.py      32      0   100%
src/pvduck/validators.py      22      0   100%
src/pvduck/wikimedia.py        8      0   100%
----------------------------------------------
TOTAL                        350      0   100%

> [!CAUTION]
> Tests are separated into unit tests and integration tests. The integration tests take several minutes to run and download files from the configured server. Don't run the full test suite frivolously.

This version

0.1.0

Dec 8, 2025

Download files

Source Distribution

pvduck-0.1.0.tar.gz (12.6 kB)

Uploaded Dec 8, 2025 Source

Built Distribution

pvduck-0.1.0-py3-none-any.whl (16.1 kB)

Uploaded Dec 8, 2025 Python 3

File details

Details for the file pvduck-0.1.0.tar.gz.

File metadata

Download URL: pvduck-0.1.0.tar.gz
Upload date: Dec 8, 2025
Size: 12.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.15 {"installer":{"name":"uv","version":"0.9.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pvduck-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9e0ddead9bfd2c97468c9cfd19f7f73f8e3313c55018c833da837decc1e2fe9d`
MD5	`57989f01e8cdc5f0cb21f0a51b3c643d`
BLAKE2b-256	`ea732d060c9b07f6e1c010b1cba20ddae0ce388f9ad041d1171163a3b32559ef`

File details

Details for the file pvduck-0.1.0-py3-none-any.whl.

File metadata

Download URL: pvduck-0.1.0-py3-none-any.whl
Upload date: Dec 8, 2025
Size: 16.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.15 {"installer":{"name":"uv","version":"0.9.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pvduck-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`058c29584930e816d1466832745366de293fd84b756734e04e1a4ee23d1c1ba7`
MD5	`bf9d763b52ef763c9cb3f70ed2092874`
BLAKE2b-256	`ec3c1a21667883a775d1d060c24681f9d87410063a391f1f8ca78e07c66c48fd`
