Harvest - A toolkit for extracting posts and post metadata from web forums
Automatically extracting forum posts and their metadata is challenging because forums do not expose their content in a standardized structure. Harvest performs this task reliably for many web forums and offers an easy way to extract their data.
Installation
At the command line:
$ pip install harvest-webforum
If you want to install from the latest sources, you can do:
$ git clone https://github.com/fhgr/harvest.git
$ cd harvest
$ python3 setup.py install
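After either installation route, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of harvest. It assumes the importable module is called `harvest`, as in the usage example in this README.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # 'harvest' is the import name used in this README; the PyPI
    # distribution itself is named 'harvest-webforum'.
    print("harvest importable:", is_installed("harvest"))
```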
Python library
Embedding harvest into your code is easy, as outlined below:
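from urllib.request import urlopen, Request
from harvest import extract_data

# Send a browser-like User-Agent, since some servers reject urllib's default one.
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"
url = "https://forum.videolan.org/viewtopic.php?f=14&t=145604"

# Download the forum page (this assumes the page is UTF-8 encoded).
req = Request(url, headers={'User-Agent': USER_AGENT})
html = urlopen(req).read().decode('utf-8')

# Extract the posts and post metadata from the downloaded HTML.
result = extract_data(html, url)
print(result)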
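The example above decodes every page as UTF-8, which may fail for forums served in another encoding. The sketch below shows one possible way to pick the charset from the HTTP Content-Type header instead, falling back to a lenient UTF-8 decode. The helpers `decode_body` and `fetch_html` are hypothetical names introduced for illustration; they are not part of harvest and rely only on the standard library.

```python
from typing import Optional
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"


def decode_body(raw: bytes, charset: Optional[str]) -> str:
    """Decode a response body, defaulting to UTF-8 when no charset is declared."""
    try:
        return raw.decode(charset or "utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or invalid bytes: decode leniently instead of raising.
        return raw.decode("utf-8", errors="replace")


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and decode it with the charset from its Content-Type header."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=timeout) as resp:
        # get_content_charset() returns None when the header names no charset.
        charset = resp.headers.get_content_charset()
        return decode_body(resp.read(), charset)
```

With these helpers, the download step could be written as `html = fetch_html(url)` before passing `html` and `url` to `extract_data` as shown above.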
WEB-FORUM-52 gold standard
The corpus currently contains gold standard documents from 52 different web forums. These documents are also used by harvest's integration tests.
Publication
- Weichselbraun, Albert, Brasoveanu, Adrian M. P., Waldvogel, Roger and Odoni, Fabian. (2020). “Harvest - An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums”. IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2020), Melbourne, Australia, Accepted 27 October 2020.