Stream download, filter, and parse Wikimedia pageviews files
pvstream
pvstream is a Rust library with Python bindings that lets you efficiently
stream-download, parse, and filter pageviews from Wikimedia's hourly dumps.
The library can be used from Rust or Python. In both languages you can choose between an iterator of parsed objects, made available on the fly as the file is downloaded, or a complete Parquet file of parsed and filtered data.
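To illustrate what "available on the fly" means, here is a minimal pure-Python sketch of incremental gzip decoding: lines are yielded as soon as enough compressed bytes have arrived, without waiting for the whole file. This is not pvstream's implementation; `iter_lines` is a hypothetical helper, and the sketch assumes the input is a single-member gzip stream of newline-separated UTF-8 text.

```python
import zlib


def iter_lines(chunks):
    """Yield decoded text lines from an iterable of gzip-compressed byte chunks.

    Hypothetical helper for illustration only (not part of pvstream).
    Assumes a single-member gzip stream.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
    pending = b""
    for chunk in chunks:
        pending += decoder.decompress(chunk)
        # Everything before the last newline is complete; keep the rest.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line.decode("utf-8", errors="replace")
    # Emit whatever is left once the stream ends (data without a final newline).
    pending += decoder.flush()
    for line in pending.split(b"\n"):
        if line:
            yield line.decode("utf-8", errors="replace")
```

The `chunks` argument can be any iterable of bytes, for example blocks read from an HTTP response or from a local file opened in binary mode.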
Installation
To use pvstream in your Rust project, add it to your Cargo.toml:
[dependencies]
pvstream = { git = "https://github.com/vegardege/pvstream" }
To use pvstream in a Python project, you can run this in your virtual environment:
pip install maturin
git clone https://github.com/vegardege/pvstream
cd pvstream
maturin develop --release
Or run:
maturin build --release
and pip install from target/wheels.
Usage
There are four main entry points for this library:
| Function | Input | Output |
|---|---|---|
| stream_from_file | Filename on the local file system | Iterator of parsed row structs |
| stream_from_url | URL of a remotely stored file | Iterator of parsed row structs |
| parquet_from_file | Filename on the local file system | Parquet file of parsed row structs |
| parquet_from_url | URL of a remotely stored file | Parquet file of parsed row structs |
[!CAUTION] The _url functions stream the file directly from Wikimedia's servers. Please be kind to the servers and cache the file if you plan to read it more than once. Consider using a mirror closer to you; mirrors are listed on wikimedia.org.
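One way to follow the caching advice is to download each file once and reuse the local copy. The sketch below uses only the Python standard library; `cached_download`, the `pv_cache` directory name, and the injectable `fetch` parameter are illustrative assumptions, not part of pvstream.

```python
import hashlib
import urllib.request
from pathlib import Path


def cached_download(url, cache_dir="pv_cache", fetch=urllib.request.urlretrieve):
    """Return a local path for `url`, downloading it only on the first call.

    Hypothetical helper for illustration only (not part of pvstream).
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    # Prefix with a short hash of the full URL so identical file names
    # served by different mirrors do not collide.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    name = url.rstrip("/").rsplit("/", 1)[-1] or "download"
    target = cache / f"{digest}-{name}"

    if not target.exists():
        # Download under a temporary name first, so an interrupted transfer
        # never leaves a truncated file that looks like a valid cache entry.
        partial = target.with_name(target.name + ".part")
        fetch(url, str(partial))
        partial.replace(target)
    return target
```

The returned path can then be passed (as a string) to stream_from_file or parquet_from_file, so repeated runs read from disk instead of contacting the server again.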
All four functions accept similar filters. In Python, Regex is a str, Vec is a list, and u32 is an int:
| Filter | Type | Description |
|---|---|---|
| line_regex | Option<Regex> | Regular expression used to filter lines before parsing |
| page_title | Option<Regex> | Regular expression used to filter page titles after parsing |
| domain_codes | Option<Vec<String>> | List of domain codes to accept |
| min_views | Option<u32> | Minimum amount of views needed to be accepted |
| max_views | Option<u32> | Maximum amount of views allowed |
| languages | Option<Vec<String>> | List of languages to accept |
| domains | Option<Vec<String>> | List of domains to accept |
| mobile | Option<bool> | If set, filter on whether the row belongs to a mobile site |
Learn more about the format from Wikimedia's documentation.
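To make the filter semantics concrete, here is a rough pure-Python sketch of how a few of these filters could be applied to raw lines. It assumes each line holds four space-separated fields (domain code, page title, view count, response size). The helpers `parse_line`, `matches`, and `filtered_rows` are hypothetical, and the exact matching rules (regex search rather than full match, inclusive bounds) are assumptions rather than documented pvstream behaviour.

```python
import re


def parse_line(line):
    """Parse one raw line into (domain_code, page_title, views), or None if malformed."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    domain_code, title, views, _size = parts
    try:
        return domain_code, title, int(views)
    except ValueError:
        return None


def matches(row, page_title=None, domain_codes=None, min_views=None, max_views=None):
    """Return True if the row passes every filter that is set (None means no filter)."""
    domain_code, title, views = row
    if domain_codes is not None and domain_code not in domain_codes:
        return False
    # Assumption: the pattern may match anywhere in the title.
    if page_title is not None and re.search(page_title, title) is None:
        return False
    # Assumption: both bounds are inclusive.
    if min_views is not None and views < min_views:
        return False
    if max_views is not None and views > max_views:
        return False
    return True


def filtered_rows(lines, line_regex=None, **filters):
    """Lazily yield parsed rows from `lines` that pass all filters."""
    for line in lines:
        # line_regex is checked on the raw text, before any parsing work.
        if line_regex is not None and re.search(line_regex, line) is None:
            continue
        row = parse_line(line)
        if row is not None and matches(row, **filters):
            yield row
```

Checking line_regex before parsing mirrors the table's description of it as a pre-parse filter: a cheap text match can discard most lines before any fields are split or converted.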
Example (Rust):
use pvstream::filter::FilterBuilder;
use pvstream::stream_from_file;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path to your pageviews file
    let path = PathBuf::from("pageviews-20240818-080000.gz");

    // View all English mobile sites containing the word 'Rust'
    let filter = FilterBuilder::new()
        .domain_codes(["en.m"])
        .page_title("Rust")
        .build();

    // Stream rows matching the filter
    let rows = stream_from_file(path, &filter)?;

    // Iterate over results
    for row in rows {
        match row {
            Ok(pageview) => println!("{:?}", pageview),
            Err(e) => eprintln!("Error parsing row: {:?}", e),
        }
    }

    Ok(())
}
Example (python):
import pvstream

rows = pvstream.stream_from_file(
    "pageviews-20240818-080000.gz",
    domain_codes=["en.m"],
    page_title="Rust",
)

for row in rows:
    print(row)
File details
Details for the file pvstream-0.1.0a1.tar.gz.
File metadata
- Download URL: pvstream-0.1.0a1.tar.gz
- Upload date:
- Size: 51.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b66cebc660b1516bd81fc77386f43e1db0a100c235cc2654e0ad48a30869f30d |
| MD5 | ee6a00730a0754214fd722fe83f8fb41 |
| BLAKE2b-256 | 3345ba40f4cf0a30adcebe9e71129f1b335628993f8a33b65a2369c508ac03b4 |
File details
Details for the file pvstream-0.1.0a1-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: pvstream-0.1.0a1-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.9 MB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2c66cf883bab29b60b5018abd26959347f700c6c6aef6050af7b8c532d9e74ac |
| MD5 | ed8d9254bfbd8ccf1230cbfbe2568e3f |
| BLAKE2b-256 | 3e59a95d221e3e2e0d4909f760e9aee81f8b191449c0e7a72009d97b331115ea |