
Wikipedia Scraper

Parse and tokenize text from a Wikipedia dump XML file.

Author

Installation

pip install wscraper

Support

Languages

  • Japanese
    • Japanese Wikipedia
  • English
    • English Wikipedia

Usage (Command Line)

Check Console Commands

Run this command:

wscraper --help

The executable commands will be listed.

Initialize

To get started, you must run this command.
It creates the necessary directories and files.

wscraper initialize

The wscraper root directory is created at $HOME/.wscraper.
To change this path, set the environment variable WSCRAPER_ROOT.
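
For example, assuming a POSIX shell (the path below is illustrative), you can point wscraper at a custom root before initializing:

export WSCRAPER_ROOT=/data/wscraper
wscraper initialize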

Set Global Parameters

wscraper root set --language japanese --page_chunk 1000
  • language
    • The default language. If you do not set the language parameter for a corpus, this default is used.
  • page_chunk
    • A Wikipedia dump XML file contains a large amount of text across many pages. For analysis, it is split into several smaller files for memory efficiency.

See wscraper root set -h.

Import a Wikipedia XML File

The file wikipedia.xml is assumed to be a dump like (lang)wiki-(date)-pages-articles-multistream.xml.

wscraper import /path/to/sample.xml
wscraper import /path/to/wikipedia.xml --name my_wp

See wscraper import -h.

Check Wikipedia Resources

You can list the available Wikipedia corpus resources.

wscraper list

output

Available wikipedia:
  - sample
  - my_wp

Switch Current Corpus

wscraper switch my_wp

Check the Status of the Current Corpus

wscraper status

output

current: my_wp

language [default]: japanese

Set Parameters for Current Corpus

Required parameters should be set for the current corpus.

wscraper set --language english

parameters:

  • language

Unset Parameters

You can delete parameters by running the following command.

wscraper unset --language

Rename a Corpus

You can rename a corpus from $source to $target.

wscraper rename $source $target

Delete a Corpus

When a corpus (for example, $target) is no longer needed, it can be removed.

wscraper delete $target
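
Putting the commands above together, a typical first session might look like this (the path and corpus name are illustrative):

wscraper initialize
wscraper root set --language japanese --page_chunk 1000
wscraper import /path/to/wikipedia.xml --name my_wp
wscraper switch my_wp
wscraper status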

Usage (Python)

Import the iterator classes:

from wscraper.analysis import *

You can iterate over the pages of a corpus like this:

# entry
entry_iterator = EntryIterator()
# You can specify the corpus name and language.
# If no parameter is given, the current Wikipedia corpus is used.
# >>> EntryIterator(name = "sample", language = "japanese")
both_iterator = BothIterator()
redirection_iterator = RedirectionIterator()

for i, b in enumerate(both_iterator):
    print(f"both: {i}: {type(b)}")

for i, e in enumerate(entry_iterator):
    print(f"entry {i}: {e.title} {len(e.mediawiki)}")

for i, r in enumerate(redirection_iterator):
    print(f"redirection: {i}: {r.source} -> {r.target}")

For example, you can feed an iterator to an ML model.

def to_words(x):
    return x.split()

# return word list for each iteration
iterator = ArticleIterator(tagger = to_words)
# If you set `type = dict`, you can get records as dictionaries.
# e.g. { "title": "ABC...", "article": to_words(article) }
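# For instance, a minimal sketch assuming the `type` parameter above:
dict_iterator = ArticleIterator(tagger = to_words, type = dict)
for record in dict_iterator:
    print(record["title"], record["article"][:10])
    break  # only show the first record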

# Iterators:
#   ArticleIterator
#     - 1 page / record
#     - dict keys: ["title", "article"]
#   ParagraphIterator
#     - N records / page (one record per paragraph)
#     - A heading like "== A ==" delimits paragraphs.
#     - dict keys: ["page_title", "paragraph_title", "paragraph"]

# For example, gensim's Word2Vec can consume this iterator.

from gensim.models.word2vec import Word2Vec

Word2Vec(iterator)

# You can concatenate iterators to train an ML model using CombinedIterator.
Word2Vec(CombinedIterator(iterator, another_iterator))
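
As a follow-up (a sketch assuming gensim's standard API; the query word is illustrative), you can persist and inspect the trained model:

# Train, save, and query the model with gensim's standard API.
model = Word2Vec(CombinedIterator(iterator, another_iterator))
model.save("wiki_w2v.model")
print(model.wv.most_similar("tokyo", topn = 5))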

License

The source code is licensed under the MIT License.

Please check the LICENSE file.
