Stream download, filter, and parse Wikimedia pageviews files

pvstream

pvstream is a Rust library with Python bindings for efficiently streaming, parsing, and filtering pageviews from Wikimedia's hourly dumps.

The library can be used from Rust or Python. In both languages you can choose between an iterator of parsed objects, made available on the fly as the file is downloaded, or a complete Parquet file of parsed and filtered data.
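To illustrate what "on the fly" means here, the following is a minimal sketch of lazy line iteration over a gzip-compressed stream using only the Python standard library. It is not pvstream's implementation; the helper name iter_lines and the sample rows are made up for illustration.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def iter_lines(compressed: BinaryIO) -> Iterator[str]:
    """Lazily yield decoded lines from a gzip-compressed byte stream.

    Hypothetical helper: each line is decompressed only when the
    consumer asks for it, so the whole file never sits in memory.
    """
    with gzip.open(compressed, mode="rt", encoding="utf-8", errors="replace") as text:
        for line in text:
            yield line.rstrip("\n")


# In-memory stand-in for a downloaded dump; the rows are made up.
sample = gzip.compress(b"en Rust 10 0\nen.m Rust 7 0\n")
lines = list(iter_lines(io.BytesIO(sample)))
```

Because the function is a generator, a consumer can stop early without decompressing the rest of the input.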

Installation

To use pvstream in your Rust project, add it to your Cargo.toml:

[dependencies]
pvstream = { git = "https://github.com/vegardege/pvstream" }

To use pvstream in a Python project, run the following in your virtual environment:

pip install maturin
git clone https://github.com/vegardege/pvstream
cd pvstream
maturin develop --release

Or run:

maturin build --release

and then pip install the wheel it writes to target/wheels.

Usage

There are four main entry points for this library:

Function           Input                              Output
stream_from_file   Filename on the local file system  Iterator of parsed row structs
stream_from_url    URL of a remotely stored file      Iterator of parsed row structs
parquet_from_file  Filename on the local file system  Parquet file of parsed row structs
parquet_from_url   URL of a remotely stored file      Parquet file of parsed row structs

[!CAUTION] The _url functions stream the file directly from Wikimedia's servers. Please be kind to the servers and cache the file if you plan to read it more than once. Consider using a mirror closer to you; mirrors are listed on wikimedia.org.
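One way to follow this advice is to download each dump once and reuse the local copy. The sketch below uses only the Python standard library and is not part of pvstream; the names download and cached_file, and the cache layout, are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Callable


def download(url: str, dest: Path) -> None:
    """Copy the response body for url into dest."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


def cached_file(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str, Path], None] = download,
) -> Path:
    """Return a local copy of url, fetching it only if it is not cached yet.

    Hypothetical helper: the cache key is simply the last path segment
    of the URL, which is enough for uniquely named hourly dump files.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        # Write to a temporary name first so that an interrupted
        # download is never mistaken for a complete file.
        partial = target.with_name(target.name + ".part")
        fetch(url, partial)
        partial.replace(target)
    return target
```

The returned path can then be passed to stream_from_file or parquet_from_file instead of calling a _url function repeatedly.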

All four functions accept similar filters. In Python, Regex is a str, Vec is a list, and u32 is an int:

Filter        Type                 Description
line_regex    Option<Regex>        Regular expression used to filter lines before parsing
page_title    Option<Regex>        Regular expression used to filter page titles after parsing
domain_codes  Option<Vec<String>>  List of domain codes to accept
min_views     Option<u32>          Minimum number of views needed to be accepted
max_views     Option<u32>          Maximum number of views allowed
languages     Option<Vec<String>>  List of languages to accept
domains       Option<Vec<String>>  List of domains to accept
mobile        Option<bool>         If set, filter on whether the row belongs to a mobile site

Learn more about the format from Wikimedia's documentation.
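To make the order of operations concrete, here is a rough pure-Python sketch of how a few of these filters could be applied to raw lines. It is an illustration of the semantics described in the table, not pvstream's actual code. It assumes each line holds four space-separated fields (domain code, page title, view count, response size), that regular expressions match anywhere in the text, and that min_views and max_views are inclusive; the names Row and filter_rows are hypothetical.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Sequence


class Row(NamedTuple):
    domain_code: str
    page_title: str
    views: int


def filter_rows(
    lines: Iterable[str],
    *,
    line_regex: Optional[str] = None,
    page_title: Optional[str] = None,
    domain_codes: Optional[Sequence[str]] = None,
    min_views: Optional[int] = None,
    max_views: Optional[int] = None,
) -> Iterator[Row]:
    """Yield parsed rows that pass every filter that is set."""
    line_re = re.compile(line_regex) if line_regex is not None else None
    title_re = re.compile(page_title) if page_title is not None else None
    for line in lines:
        # line_regex runs on the raw text, before any parsing work.
        if line_re is not None and not line_re.search(line):
            continue
        fields = line.split(" ")
        if len(fields) != 4:
            continue  # skip malformed lines in this sketch
        try:
            row = Row(fields[0], fields[1], int(fields[2]))
        except ValueError:
            continue  # view count was not an integer
        # The remaining filters run on the parsed fields.
        if title_re is not None and not title_re.search(row.page_title):
            continue
        if domain_codes is not None and row.domain_code not in domain_codes:
            continue
        if min_views is not None and row.views < min_views:
            continue  # assumed inclusive lower bound
        if max_views is not None and row.views > max_views:
            continue  # assumed inclusive upper bound
        yield row


# Made-up sample lines in the assumed four-field layout.
sample = [
    "en Rust_(programming_language) 120 0",
    "en.m Rust_(programming_language) 80 0",
    "en.m Rust_(video_game) 3 0",
    "de.m Rust 15 0",
    "en.m Python_(programming_language) 95 0",
]
matches = list(filter_rows(sample, domain_codes=["en.m"], page_title="Rust", min_views=10))
```

A line_regex can reject a line cheaply before it is split and converted, which is why it is listed separately from the filters that work on parsed fields.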

Example (Rust):

use pvstream::filter::FilterBuilder;
use pvstream::stream_from_file;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path to your pageviews file
    let path = PathBuf::from("pageviews-20240818-080000.gz");

    // Match rows from the English mobile site whose page title contains "Rust"
    let filter = FilterBuilder::new()
        .domain_codes(["en.m"])
        .page_title("Rust")
        .build();

    // Stream rows matching the filter
    let rows = stream_from_file(path, &filter)?;

    // Iterate over results
    for row in rows {
        match row {
            Ok(pageview) => println!("{:?}", pageview),
            Err(e) => eprintln!("Error parsing row: {:?}", e),
        }
    }

    Ok(())
}

Example (Python):

import pvstream

rows = pvstream.stream_from_file(
    "pageviews-20240818-080000.gz",
    domain_codes=["en.m"],
    page_title="Rust",
)

for row in rows:
    print(row)
