Load Wikimedia pageview data to a duckdb
Project description
pvduck
pvduck is a cli tool allowing you to sample, download, filter, parse, and
aggregate Wikimedia pageviews dumps, creating an overview of the most visited
pages on a sampled subset of Wikimedia pages in a duckdb database.
This is useful if you want a rough overview of pageviews for a specific subset of pages, but lack access to the analytics servers and don't want to host a full HDFS cluster yourself. Instead of storing several GB per day, you can aggregate years worth of data in a small database holding just the data you need.
The tool was developed for my own weekend projects, where gauging popularity without perfect precision or history is very helpful.
Installation
Recommended: pipx (installed)
Install pvduck in an isolated environment with pipx:
pipx install pvduck
Then use it directly:
pvduck create myproject
pvduck sync myproject
Alternative: uvx (no installation)
Run pvduck without installing using uvx:
uvx pvduck create myproject
uvx pvduck sync myproject
Perfect for trying the tool without commitment!
Alternative: Docker
For containerized environments:
docker run --rm -it \
-v ~/.config:/root/.config \
-v ~/.local/share:/root/.local/share \
vegardege/pvduck:latest \
create myproject
Note: Use -it for commands that open an editor.
Usage
Call pvduck --help for instructions and pvduck --install-completion to
install auto-completion in your shell, both of which will help.
The tool has six commands:
| Command | Description |
|---|---|
create <project_name> |
Create a new project |
edit <project_name> |
Edit project configuration |
rm <project_name> |
Delete config and database |
sync <project_name> |
Download missing data (if any) and aggregate into the database |
open <project_name> |
Open the project's database in duckdb |
status <project_name> |
See progress status for the project |
compact <project_name> |
Reclaim disk space according to duckdb best practice |
ls |
List all existing projects |
By default, sync will run until it exhausted the date range given with the
given sample rate. If you want to run it for a limited time only, apply the
max_files option to stop after a specific number of files.
Note that syncing can be memory intensive. It operates in chunks of 1 000 000
rows by default, which can be modified with the PVDUCK_CHUNK_SIZE environment
variable. Increase the value for faster syncs, decrease it for more memory
efficient (but slower) execution.
When you create a new project, you can define the configuration:
| Param | Description |
|---|---|
base_url |
Which Wikimedia mirror to use (recommended: use mirror close to you) |
sleep_time |
How many seconds to wait between each file download |
start_date |
Date of the first dump file to download |
end_date |
Date of the last dump file to download (or blank for current date expanding) |
sample_rate |
Probability of downloading each hourly file in the interval |
In addition, the config file contains filters to reduce the size of the dataset. All filters can be set to blank values, which means no rows are excluded.
| Filter | Type | Description |
|---|---|---|
line_regex |
regex |
Regular expression used to filter lines before parsing |
page_title |
regex |
Regular expression used to filter page titles after parsing |
domain_codes |
list[str] |
List of domain codes to accept |
min_views |
int |
Minimum amount of views needed to be accepted |
max_views |
int |
Maximum amount of views allowed |
languages |
list[str] |
List of languages to accept |
domains |
list[str] |
List of domains to accept |
mobile |
bool |
If set, filter on whether the row belongs to a mobile site |
[!IMPORTANT]
Thesyncoperation is destructive. It keeps track of which files you have downloaded, but can not revert any aggregation operations. As a result, only some parameters and filters can be changed without putting the database in an inconsistent state. Notably, you can expand the date range and increase the sample rate without issues.
Test
uv run pytest tests/ --cov=src
Name Stmts Miss Cover
----------------------------------------------
src/pvduck/__init__.py 0 0 100%
src/pvduck/cli.py 108 0 100%
src/pvduck/config.py 69 0 100%
src/pvduck/db.py 70 0 100%
src/pvduck/project.py 21 0 100%
src/pvduck/stream.py 20 0 100%
src/pvduck/timeseries.py 32 0 100%
src/pvduck/validators.py 22 0 100%
src/pvduck/wikimedia.py 8 0 100%
----------------------------------------------
TOTAL 350 0 100%
[!CAUTION] Tests are separated in unit tests and integration tests. The integration tests take several minutes to run and downloads files from the configured server. Don't run the full test suite frivolously.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pvduck-0.1.0.tar.gz.
File metadata
- Download URL: pvduck-0.1.0.tar.gz
- Upload date:
- Size: 12.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.15 {"installer":{"name":"uv","version":"0.9.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9e0ddead9bfd2c97468c9cfd19f7f73f8e3313c55018c833da837decc1e2fe9d
|
|
| MD5 |
57989f01e8cdc5f0cb21f0a51b3c643d
|
|
| BLAKE2b-256 |
ea732d060c9b07f6e1c010b1cba20ddae0ce388f9ad041d1171163a3b32559ef
|
File details
Details for the file pvduck-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pvduck-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.15 {"installer":{"name":"uv","version":"0.9.15","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
058c29584930e816d1466832745366de293fd84b756734e04e1a4ee23d1c1ba7
|
|
| MD5 |
bf9d763b52ef763c9cb3f70ed2092874
|
|
| BLAKE2b-256 |
ec3c1a21667883a775d1d060c24681f9d87410063a391f1f8ca78e07c66c48fd
|