Create datasets from WordPress sites

Project description

WPextract - WordPress Site Extractor

WPextract is a tool to create datasets from WordPress sites.

  • Archives posts, pages, tags, categories, media (including files), comments, and users
  • Uses the WordPress API to guarantee 100% accurate and complete content
  • Resolves internal links and media to IDs
  • Automatically parses multilingual sites to create parallel datasets

Quickstart

See the complete documentation for more detailed usage.

  1. Install with pipx
    $ pipx install wpextract
    
  2. Download site data
    $ wpextract download "https://example.org" out_dl
    
  3. Process into a dataset
    $ wpextract extract out_dl out_data
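
The two commands above can also be chained from a script. The following is a minimal sketch, not part of WPextract itself: it assumes `wpextract` is already installed and on the PATH, and the helper names `build_commands` and `run_pipeline` are hypothetical. It only uses the `download` and `extract` invocations shown in the quickstart.

```python
import subprocess


def build_commands(site_url, download_dir, dataset_dir):
    """Return the two quickstart commands as argument lists (hypothetical helper)."""
    return [
        # Step 2 of the quickstart: download site data
        ["wpextract", "download", site_url, download_dir],
        # Step 3 of the quickstart: process the download into a dataset
        ["wpextract", "extract", download_dir, dataset_dir],
    ]


def run_pipeline(site_url, download_dir, dataset_dir):
    """Run download, then extract, stopping at the first failing command."""
    for command in build_commands(site_url, download_dir, dataset_dir):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so extract never runs on a failed download.
        subprocess.run(command, check=True)
```

Calling `run_pipeline("https://example.org", "out_dl", "out_data")` would run the same steps as the quickstart. Passing the arguments as a list avoids shell quoting issues with the site URL.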
    

About WPextract

WPextract was built by Freddy Heppell of the GATE Project at the School of Computer Science, University of Sheffield. It was originally created to scrape mis/disinformation websites for research.

License

Available under the Apache 2.0 license. See LICENSE for more information.

Citing

Note: This software was developed for our EMNLP 2023 paper Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. The code has been updated since the paper was written; for archival purposes, the precise version used for the study is available on Zenodo.

We'd love to hear about your use of our tool; you can email us to let us know. Feel free to create issues or pull requests for new features or bugs.

If you use this tool in published work, please cite our EMNLP paper:

Freddy Heppell, Kalina Bontcheva, and Carolina Scarton. 2023. Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5729–5741, Singapore. Association for Computational Linguistics.

Permanent references to each release of this software are available from Zenodo.

Download files

Source Distribution

wpextract-1.1.1.tar.gz (39.4 kB)

Uploaded Source

Built Distribution

wpextract-1.1.1-py3-none-any.whl (53.8 kB)

Uploaded Python 3

File details

Details for the file wpextract-1.1.1.tar.gz.

File metadata

  • Download URL: wpextract-1.1.1.tar.gz
  • Size: 39.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for wpextract-1.1.1.tar.gz
  • SHA256: 721a7a637a7b5876b73d616fd427444d9dc7a212daab056ed41bfe699560508c
  • MD5: 88aecf56a43e844ca1ce015365f0b500
  • BLAKE2b-256: 0aa6faf893ea9db7d4f076965565420f3640f0daaedc964b10819baabcb09b16

See more details on using hashes here.
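
A downloaded file can be checked against its published SHA256 digest before installing. The sketch below uses only Python's standard `hashlib`; the function names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("wpextract-1.1.1.tar.gz", "721a7a637a7b5876b73d616fd427444d9dc7a212daab056ed41bfe699560508c")` should return `True` for an unmodified copy of the source distribution listed here.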

Provenance

The following attestation bundles were made for wpextract-1.1.1.tar.gz:

Publisher: publish.yml on GateNLP/wpextract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file wpextract-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: wpextract-1.1.1-py3-none-any.whl
  • Size: 53.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for wpextract-1.1.1-py3-none-any.whl
  • SHA256: d7f626d57cf190c5cc40fe4079b6147e54272493efabcc29fb457e011e612fa6
  • MD5: 9e4525893217fd1421880a6cee23d731
  • BLAKE2b-256: 883c30e16c00b6fd0ce2578034d84a1c7ed878c629ad4a936e16105791671a67

Provenance

The following attestation bundles were made for wpextract-1.1.1-py3-none-any.whl:

Publisher: publish.yml on GateNLP/wpextract
