Wikipedia Scraper
Parsing and tokenizing text from a Wikipedia dump XML file.
Author
- Name: T.Furukawa
- Email: tfurukawa.mail@gmail.com
Installation
pip install wscraper
Support
language
- japanese (Japanese Wikipedia)
- english (English Wikipedia)
How to Use (Command)
Check Console Commands
Run this command.
wscraper --help
The executable commands will be listed.
Initialize
To start, you have to run this command.
It creates the necessary directories and files.
wscraper initialize
The wscraper root directory is created at $HOME/.wscraper.
To change this path, set the environment variable WSCRAPER_ROOT.
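For example, to keep the root under a custom path (the path below is just an illustration), set the variable before initializing:
export WSCRAPER_ROOT=/data/wscraper
wscraper initialize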
Set Global Parameters
wscraper root set --language japanese --page_chunk 1000
language
- Default language. If you do not set the parameter language for an individual corpus, this default language is used.
page_chunk
- A Wikipedia dump XML file contains a large amount of text spread across many pages. For memory efficiency, analysis splits it into several smaller files (see the sketch below).
See wscraper root set -h.
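As an illustration of the chunking idea only (this sketch and its names are assumptions, not wscraper's actual implementation), a dump can be streamed and written out in chunks of pages:
# Illustrative sketch: split a dump into files of `page_chunk` pages each.
# Hypothetical helper, not wscraper's real code.
import xml.etree.ElementTree as ET

def chunk_pages(dump_path, page_chunk=1000):
    buffer, index = [], 0
    # iterparse streams the XML so the whole dump never sits in memory
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag.endswith("page"):
            buffer.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # free the parsed subtree
            if len(buffer) == page_chunk:
                with open(f"chunk_{index}.xml", "w", encoding="utf-8") as f:
                    f.write("<pages>\n" + "".join(buffer) + "</pages>\n")
                buffer, index = [], index + 1
    if buffer:  # write the remainder
        with open(f"chunk_{index}.xml", "w", encoding="utf-8") as f:
            f.write("<pages>\n" + "".join(buffer) + "</pages>\n")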
Import a Wikipedia XML File
The file wikipedia.xml is assumed to be a dump such as (lang)wiki-(date)-pages-articles-multistream.xml.
wscraper import /path/to/sample.xml
wscraper import /path/to/wikipedia.xml --name my_wp
See wscraper import -h.
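Dumps can be downloaded from dumps.wikimedia.org. The commands below assume the standard dump naming pattern; check the dump site for the current files:
wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles-multistream.xml.bz2
bzip2 -d jawiki-latest-pages-articles-multistream.xml.bz2
wscraper import jawiki-latest-pages-articles-multistream.xml --name my_wp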
Check Wikipedia Resources
You can check the imported Wikipedia corpus resources.
wscraper list
output
Available wikipedia:
- sample
- my_wp
Switch Current Corpus
wscraper switch my_wp
Check the Status of Current Corpus
wscraper status
output
current: my_wp
language [default]: japanese
Set Parameters for Current Corpus
Required parameters should be set for the current corpus.
wscraper set --language english
parameters:
language
Unset Parameters
You can delete parameters by running the following command.
wscraper unset --language
Rename a Corpus
You can rename a corpus from $source to $target.
wscraper rename $source $target
Delete a Corpus
When a corpus (for example, $target) is no longer needed, it can be removed.
wscraper delete $target
How to Use (Python)
Import the iterator classes.
from wscraper.analysis import *
You can iterate over the pages of a corpus like this.
# entry
entry_iterator = EntryIterator()
# You can specify the corpus name and language.
# If no parameter is given, the current Wikipedia corpus is used.
# >>> EntryIterator(name = "sample", language = "japanese")
both_iterator = BothIterator()
redirection_iterator = RedirectionIterator()

for i, b in enumerate(both_iterator):
    print(f"both: {i}: {type(b)}")

for i, e in enumerate(entry_iterator):
    print(f"entry {i}: {e.title} {len(e.mediawiki)}")

for i, r in enumerate(redirection_iterator):
    print(f"redirection: {i}: {r.source} -> {r.target}")
For example, you can feed an iterator to an ML model.
def to_words(x):
    return x.split()

# returns a word list for each iteration
iterator = ArticleIterator(tagger = to_words)

# If you set `type = dict`, you get each record as a dictionary.
# ex: { "title": "ABC...", "article": to_words(article) }
# Iterators:
# ArticleIterator
#   - 1 page / record
#   - dict keys: ["title", "article"]
# ParagraphIterator
#   - one page yields several records (one per paragraph);
#     a heading like "== A ==" delimits paragraphs
#   - dict keys: ["page_title", "paragraph_title", "paragraph"]

# For example, gensim word2vec can consume this iterator.
from gensim.models.word2vec import Word2Vec

Word2Vec(iterator)

# You can concatenate iterators to train an ML model using CombinedIterator,
# where another_iterator is any other iterator instance.
Word2Vec(CombinedIterator(iterator, another_iterator))
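Word2Vec makes several passes over its corpus, so the corpus object must be re-iterable rather than a one-shot generator. Conceptually, CombinedIterator plays the role of a re-iterable itertools.chain; a minimal sketch of that idea (hypothetical class, not wscraper's actual implementation):
# Re-iterable chain over several corpora: each call to __iter__
# starts a fresh pass, as multi-epoch trainers like Word2Vec require.
class ChainedCorpus:
    def __init__(self, *corpora):
        self.corpora = corpora

    def __iter__(self):
        for corpus in self.corpora:
            yield from corpus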
License
The source code is licensed under the MIT License.
Please check the LICENSE file.