
Stream download, parse, and filter Wikimedia pageviews files


pvstream


pvstream is a Rust library with Python bindings that lets you efficiently stream, parse, and filter pageviews from Wikimedia's hourly dumps.

The library can be used from Rust or Python. In both languages you can choose between an iterator of parsed objects, made available on the fly as the file is downloaded, and a complete Parquet file of parsed and filtered data.

Installation

Rust

Add pvstream to your Cargo.toml:

[dependencies]
pvstream = "0.1"

Or use cargo-add:

cargo add pvstream

Python

Install from PyPI:

pip install pvstream

Building from Source

To build the Python package for your specific hardware:

pip install maturin
git clone https://github.com/vegardege/pvstream
cd pvstream
maturin develop --release

Or build a wheel:

maturin build --release
pip install target/wheels/pvstream-*.whl

Usage

There are four main entry points for this library:

| Function | Input | Output |
| --- | --- | --- |
| stream_from_file | Filename on the local file system | Iterator of parsed row structs |
| stream_from_url | URL of a remotely stored file | Iterator of parsed row structs |
| parquet_from_file | Filename on the local file system | Parquet file of parsed row structs |
| parquet_from_url | URL of a remotely stored file | Parquet file of parsed row structs |

Caution: The _url functions stream the file directly from Wikimedia's servers. Please be kind to the servers and cache the file locally if you plan to read it more than once. Consider using a mirror closer to you; mirrors are listed on wikimedia.org.
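One way to follow this advice is to download each file once and read it from disk afterwards. The sketch below is not part of pvstream: `fetch_once` and the `pageviews-cache` directory are hypothetical names, and it uses only Python's standard library.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_once(url: str, cache_dir: str = "pageviews-cache") -> Path:
    """Return a local copy of `url`, downloading it only if it is not cached.

    Hypothetical helper, not part of pvstream. The cache key is simply the
    file name at the end of the URL path.
    """
    target_dir = Path(cache_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(urlparse(url).path).name

    if not target.exists():
        # Download to a temporary name first, so an interrupted transfer
        # never leaves a truncated file under the final name.
        partial = target.with_name(target.name + ".part")
        urllib.request.urlretrieve(url, str(partial))
        partial.replace(target)

    return target
```

The returned path can then be passed to one of the _file functions, for example `pvstream.stream_from_file(str(path), ...)`, so repeated runs read from disk instead of hitting the servers again.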

They all accept similar filters. The types below are given in Rust notation; in Python, a Regex is passed as a str, a Vec as a list, and a u32 as an int:

| Filter | Type | Description |
| --- | --- | --- |
| line_regex | Option<Regex> | Regular expression used to filter lines before parsing |
| page_title | Option<Regex> | Regular expression used to filter page titles after parsing |
| domain_codes | Option<Vec<String>> | List of domain codes to accept |
| min_views | Option<u32> | Minimum number of views needed to be accepted |
| max_views | Option<u32> | Maximum number of views allowed |
| languages | Option<Vec<String>> | List of languages to accept |
| domains | Option<Vec<String>> | List of domains to accept |
| mobile | Option<bool> | If set, filter on whether the row belongs to a mobile site |

Learn more about the format from Wikimedia's documentation.
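To make the filter semantics more concrete, here is a pure-Python sketch of what some of these filters amount to. It is an illustration only, not pvstream's implementation. It assumes each line has four space-separated fields (domain code, page title, view count, response size), that regular expressions match anywhere in the text, and that the view bounds are inclusive. `Row` and `filter_lines` are hypothetical names.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Sequence


class Row(NamedTuple):
    domain_code: str
    page_title: str
    views: int
    response_size: int


def filter_lines(
    lines: Iterable[str],
    line_regex: Optional[str] = None,
    page_title: Optional[str] = None,
    domain_codes: Optional[Sequence[str]] = None,
    min_views: Optional[int] = None,
    max_views: Optional[int] = None,
) -> Iterator[Row]:
    """Yield parsed rows that pass every filter that is set (illustrative only)."""
    line_re = re.compile(line_regex) if line_regex else None
    title_re = re.compile(page_title) if page_title else None

    for line in lines:
        # line_regex runs on the raw line, before any parsing is done.
        if line_re and not line_re.search(line):
            continue

        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines in this sketch
        try:
            row = Row(parts[0], parts[1], int(parts[2]), int(parts[3]))
        except ValueError:
            continue  # non-numeric counts are treated as malformed

        # The remaining filters run on the parsed fields.
        if domain_codes is not None and row.domain_code not in domain_codes:
            continue
        if title_re and not title_re.search(row.page_title):
            continue
        if min_views is not None and row.views < min_views:
            continue
        if max_views is not None and row.views > max_views:
            continue
        yield row
```

Because line_regex is applied before parsing, it can discard most lines cheaply. The other filters need the parsed fields.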

Example (Rust):

use pvstream::filter::FilterBuilder;
use pvstream::stream_from_file;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path to your pageviews file
    let path = PathBuf::from("pageviews-20240818-080000.gz");

    // View all English mobile sites containing the word 'Rust'
    let filter = FilterBuilder::new()
        .domain_codes(["en.m"])
        .page_title("Rust")
        .build();

    // Stream rows matching the filter
    let rows = stream_from_file(path, &filter)?;

    // Iterate over results
    for row in rows {
        match row {
            Ok(pageview) => println!("{:?}", pageview),
            Err(e) => eprintln!("Error parsing row: {:?}", e),
        }
    }

    Ok(())
}

Example (Python):

import pvstream

rows = pvstream.stream_from_file(
    "pageviews-20240818-080000.gz",
    domain_codes=["en.m"],
    page_title="Rust",
)

for row in rows:
    print(row)
