Extract corpora from Wikipedia dumps
Wiki-Dump Reader
Extract corpora from wiki-dump.
Install
pip install wiki-dump-reader
Usage
Download the dump file *wiki-*-pages-articles.xml first. Then you can iterate over its articles and get cleaned text from each one:
from wiki_dump_reader import Cleaner, iterate

cleaner = Cleaner()
for title, text in iterate('*wiki-*-pages-articles.xml'):
    text = cleaner.clean_text(text)
    cleaned_text, links = cleaner.build_links(text)
Just ignore links if you don't need them:
cleaned_text, _ = cleaner.build_links(text)
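The loop above can be wrapped into a small helper that writes one cleaned article per line to a plain-text corpus file. This is only a sketch and not part of the package: `write_corpus` and its parameters are hypothetical names, and it relies on nothing beyond the `clean_text` and `build_links` methods shown above. In real use you would pass `iterate('*wiki-*-pages-articles.xml')` as `articles` and `Cleaner()` as `cleaner`.

```python
def write_corpus(articles, cleaner, out_path):
    """Write one cleaned article per line to out_path; return the count written.

    Hypothetical helper: `articles` is any iterable of (title, text) pairs and
    `cleaner` is any object exposing clean_text() and build_links().
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for title, text in articles:
            text = cleaner.clean_text(text)
            cleaned_text, _ = cleaner.build_links(text)  # links are ignored here
            # Collapse all whitespace runs so each article occupies one line.
            line = " ".join(cleaned_text.split())
            if line:  # skip articles that are empty after cleaning
                out.write(line + "\n")
                count += 1
    return count
```

Taking the article iterable and the cleaner as arguments keeps the helper independent of where the dump lives, and makes it easy to try out on a handful of in-memory pairs before running it on a full dump.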
See the examples to get a feel for the output.
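Wikipedia dumps are usually published bzip2-compressed (for example `enwiki-latest-pages-articles.xml.bz2`), while the usage above refers to an `.xml` file. Assuming the reader needs the uncompressed XML, which the file name suggests but this page does not state, the archive can be unpacked with the Python standard library. `decompress_dump` is a hypothetical helper name, not something the package provides.

```python
import bz2
import shutil


def decompress_dump(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path and return dst_path.

    Data is copied in chunks, so the whole dump is never held in memory.
    """
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

The resulting path can then be passed to `iterate` as in the usage above.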
Download files
Source Distribution
wiki-dump-reader-0.0.4.tar.gz (3.4 kB)