Create datasets from WordPress sites

Project description

WPextract - WordPress Site Extractor

WPextract is a tool to create datasets from WordPress sites.

  • Archives posts, pages, tags, categories, media (including files), comments, and users
  • Uses the WordPress API to guarantee 100% accurate and complete content
  • Resolves internal links and media to IDs
  • Automatically parses multilingual sites to create parallel datasets

Quickstart

See the complete documentation for more detailed usage.

  1. Install with pipx
    $ pipx install wpextract
    
  2. Download site data
    $ wpextract download "https://example.org" out_dl
    
  3. Process into a dataset
    $ wpextract extract out_dl out_data
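
For scripted or repeated runs, the two quickstart commands can be driven from Python. The following is a minimal sketch, assuming `wpextract` is installed and on the `PATH`; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of WPextract. It uses only the `download` and `extract` subcommands with the positional arguments shown above.

```python
import subprocess
from typing import List


def build_commands(site_url: str, download_dir: str, dataset_dir: str) -> List[List[str]]:
    """Return the two quickstart commands as argument lists (hypothetical helper)."""
    return [
        # Step 2 of the quickstart: download site data
        ["wpextract", "download", site_url, download_dir],
        # Step 3 of the quickstart: process the download into a dataset
        ["wpextract", "extract", download_dir, dataset_dir],
    ]


def run_pipeline(site_url: str, download_dir: str, dataset_dir: str) -> None:
    """Run download, then extract, stopping at the first failing command."""
    for command in build_commands(site_url, download_dir, dataset_dir):
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(command, check=True)
```

Calling `run_pipeline("https://example.org", "out_dl", "out_data")` would run the same two commands as steps 2 and 3. Passing the arguments as lists avoids shell quoting problems with URLs.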
    

About WPextract

WPextract was built by Freddy Heppell of the GATE Project at the School of Computer Science, University of Sheffield. It was originally created to scrape mis/disinformation websites for research.

License

Available under the Apache 2.0 license. See LICENSE for more information.

Citing

Note: This software was developed for our EMNLP 2023 paper Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. The code has been updated since the paper was written; for archival purposes, the precise version used for the study is available on Zenodo.

We'd love to hear about your use of our tool; you can email us to let us know. Feel free to create issues or pull requests for new features or bugs.

If you use this tool in published work, please cite our EMNLP paper:

Freddy Heppell, Kalina Bontcheva, and Carolina Scarton. 2023. Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5729–5741, Singapore. Association for Computational Linguistics.

Permanent references to each release of this software are available from Zenodo.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wpextract-1.1.0.tar.gz (40.8 kB)

Uploaded Source

Built Distribution

wpextract-1.1.0-py3-none-any.whl (53.8 kB)

Uploaded Python 3

File details

Details for the file wpextract-1.1.0.tar.gz.

File metadata

  • Download URL: wpextract-1.1.0.tar.gz
  • Upload date:
  • Size: 40.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for wpextract-1.1.0.tar.gz
Algorithm Hash digest
SHA256 19f3bc3d45b46a7e9190cab7a6cd55a56c9924c5cc52b01a31d9648919f55d0b
MD5 28adcee3f735bbfdfcbdc941f86a063e
BLAKE2b-256 b976fae2cfcb8f75226d85f6c6e342718c593e6be1256e617d940f5281357ab3

See more details on using hashes here.

File details

Details for the file wpextract-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: wpextract-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 53.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for wpextract-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 82d988af5aef0bba831fbc7088627e5f4528dd1bdf0c88b6b9d7a36a45bfa5c2
MD5 3457a38d655943d3b73be7bf393ae87e
BLAKE2b-256 999b8ef58ae04d6648518167e58aea31ed98de4c6e1771ea3c40fd952223df11
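
The published digests can be compared against a downloaded file before installing it. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches_published_hash` are hypothetical, and the expected values are the SHA256 digests copied from the listings above.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file listings above
EXPECTED_SHA256 = {
    "wpextract-1.1.0.tar.gz":
        "19f3bc3d45b46a7e9190cab7a6cd55a56c9924c5cc52b01a31d9648919f55d0b",
    "wpextract-1.1.0-py3-none-any.whl":
        "82d988af5aef0bba831fbc7088627e5f4528dd1bdf0c88b6b9d7a36a45bfa5c2",
}


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected
```

For example, `matches_published_hash(Path("wpextract-1.1.0.tar.gz"))` returns `False` if the file has been altered or if its name is not one of the two listed releases.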

