Polona Explorer

Polona Explorer is a Python project for analyzing historical Polish periodicals, focusing on the concept of oil in the Polish press from 1853 to 1918.

Dataset

This project is associated with the dataset "Petroleum and Press -- The Concept of Oil in Polish Periodical Press, 1853--1918" published on Zenodo.

About the Dataset

The dataset consists of Polish-language periodicals mentioning petroleum that were published between 1 January 1853 and 31 December 1918. The periodicals were downloaded from the Polona aggregator in May 2023, in PDF format and together with their metadata, based on a keyword search. The PDF files contain OCR'd digital images of physical documents held in archives and libraries, and all are in the public domain.

To create articles specifically pertaining to petroleum, the dataset was processed by running layout recognition and segmentation, applying OCR, and converting the results to METS/MODS.

Access the Dataset

The full dataset contains ~17,000 zipped newspapers in METS/MODS format, with a total file size of approximately 480 GB. A random sample of 75 zipped Polish weekly newspapers in the public domain is also available.
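Given the size of the full dataset, it can be useful to inspect the zipped newspapers without unpacking them. The following is a minimal sketch using only the Python standard library. It assumes that each archive contains one or more files ending in .xml; the actual archive layout is not described here, and the helper names list_xml_members and iter_archives are hypothetical, not part of polonaexplorer.

```python
import zipfile
from pathlib import Path


def list_xml_members(zip_path):
    """Return the names of XML files inside one archive without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the METS/MODS and OCR files carry an .xml extension.
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".xml")
        )


def iter_archives(data_dir):
    """Yield (archive path, XML member names) for every .zip below data_dir."""
    for zip_path in sorted(Path(data_dir).rglob("*.zip")):
        yield zip_path, list_xml_members(zip_path)
```

Reading only the archive index keeps memory and disk use low, so the same loop can be tried on the 75-newspaper sample before running it on the full download.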

Dataset Features

  • Time Range: 1853-1918
  • Language: Polish
  • Format: METS/MODS with OCR
  • Processing Tools: OCR-D and Eynollah
  • Source: Polona aggregator
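To illustrate what working with the METS/MODS format can look like, here is a small sketch that reads a title and issue date from a METS document with embedded MODS metadata. The namespace URIs are the standard METS and MODS v3 ones; the element paths (titleInfo/title, originInfo/dateIssued) are common MODS usage and are an assumption about this dataset, not a confirmed description of its files. The sample document and the function name read_mods_summary are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for METS and MODS version 3.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Made-up sample record; real files in the dataset may be structured differently.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="DMD1">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:titleInfo><mods:title>Example weekly</mods:title></mods:titleInfo>
          <mods:originInfo><mods:dateIssued>1885-03-14</mods:dateIssued></mods:originInfo>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""


def read_mods_summary(mets_xml):
    """Return the first MODS title and issue date found in a METS document."""
    root = ET.fromstring(mets_xml)
    title = root.findtext(".//mods:mods/mods:titleInfo/mods:title", namespaces=NS)
    date = root.findtext(".//mods:mods/mods:originInfo/mods:dateIssued", namespaces=NS)
    # Either value is None when the corresponding element is absent.
    return {"title": title, "date_issued": date}


summary = read_mods_summary(SAMPLE)
```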

Project Structure

polonaexplorer/
├── src/                 # Source code
├── tests/               # Test files
├── example/             # Example usage
├── docs/                # Documentation
├── README.md            # This file
└── pyproject.toml       # Project configuration

Installation

Install the package in development mode:

pip install -e .

Usage

See the example directory for usage examples.

Basic usage example:

from polonaexplorer.explorer import PolonaExplorer

# Initialize explorer with target words
explorer = PolonaExplorer(
    targetwords=["petrol", "oil"],
    data_path="/path/to/polona/corpus",
    out_path="/path/to/output"
)

# Search for words and extract text
explorer.get_file_stats()
result_path = explorer.generate_dataframe()
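
The idea of searching text for target words can be sketched independently of the package. The snippet below is an illustration only and does not reproduce how PolonaExplorer works internally: it assumes case-insensitive matching of any word that starts with a target word, which is one simple way to catch inflected forms. The function name count_target_words and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def count_target_words(text, targetwords):
    """Count words in text that start with each target word, ignoring case."""
    counts = Counter()
    for word in targetwords:
        # \b anchors the match at a word start; \w* allows inflected endings.
        pattern = re.compile(rf"\b{re.escape(word)}\w*", re.IGNORECASE)
        counts[word] = len(pattern.findall(text))
    return counts


# Invented sample text, used only to demonstrate the function.
sample = "Petroleum prices rose. The oil wells produced more oil."
counts = count_target_words(sample, ["petrol", "oil"])
```

Prefix matching is a deliberate simplification; a lemmatizer would handle inflection more reliably, at the cost of an extra dependency.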

Development

For development setup:

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run coverage
coverage run -m pytest
coverage report

Documentation

Documentation is built using Sphinx:

# Install documentation dependencies
pip install -e ".[docs]"

# Build documentation
make -C docs html

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use this dataset or code in your research, please cite:

Kaye, A., & Vogl, M. (2026). Petroleum and Press -- The Concept of Oil in Polish Periodical Press, 1853--1918 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18591713

Authors

  • Malte Vogl (Max Planck Institute of Geoanthropology)
  • Aleksandra Kaye (Max Planck Institute of Geoanthropology, Independent Researcher)
