
Wikipedia Scraper

Parse and tokenize text from a Wikipedia dump XML file.

Author

Installation

pip install wscraper

Support

Languages

  • Japanese
    • Japanese Wikipedia
  • English
    • English Wikipedia

Usage (Command Line)

Check Console Commands

Run this command:

wscraper --help

The executable commands will be listed.

Initialize

To get started, you must run this command.
It creates the necessary directories and files.

wscraper initialize

The wscraper root directory is created at $HOME/.wscraper.
To change this path, set the environment variable WSCRAPER_ROOT.
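
For example, assuming a POSIX shell (the path below is illustrative), you can point wscraper at a custom root before initializing:

export WSCRAPER_ROOT=/data/wscraper
wscraper initialize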

Set Global Parameters

wscraper root set --language japanese --page_chunk 1000
  • language
    • The default language. If you do not set the language parameter for a corpus, this default is used.
  • page_chunk
    • A Wikipedia dump XML file contains a large amount of text across many pages. For analysis, it is split into several smaller files for memory efficiency.

See wscraper root set -h.

Import a Wikipedia XML File

The file wikipedia.xml is assumed to be a dump like (lang)wiki-(date)-pages-articles-multistream.xml.

wscraper import /path/to/sample.xml
wscraper import /path/to/wikipedia.xml --name my_wp

See wscraper import -h.

Check Wikipedia Resources

You can list the available Wikipedia corpus resources.

wscraper list

output

Available wikipedia:
  - sample
  - my_wp

Switch Current Corpus

wscraper switch my_wp

Check the Status of the Current Corpus

wscraper status

output

current: my_wp

language [default]: japanese

Set Parameters for Current Corpus

Required parameters should be set for the current corpus.

wscraper set --language english

parameters:

  • language

Unset Parameters

You can delete parameters by running the following command.

wscraper unset --language

Rename a Corpus

You can rename a corpus from $source to $target.

wscraper rename $source $target

Delete a Corpus

When a corpus (for example, $target) is no longer needed, it can be removed.

wscraper delete $target
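
Putting the commands above together, a typical first session might look like this (the path and corpus name are illustrative):

wscraper initialize
wscraper root set --language japanese --page_chunk 1000
wscraper import /path/to/wikipedia.xml --name my_wp
wscraper switch my_wp
wscraper status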

Usage (Python)

Import the iterator classes:

from wscraper.analysis import *

You can iterate over the pages of a corpus like this:

# entry
entry_iterator = EntryIterator()
# You can specify the corpus name and language.
# If no parameter is given, the current Wikipedia corpus is used.
# >>> EntryIterator(name = "sample", language = "japanese")
both_iterator = BothIterator()
redirection_iterator = RedirectionIterator()

for i, b in enumerate(both_iterator):
    print(f"both: {i}: {type(b)}")

for i, e in enumerate(entry_iterator):
    print(f"entry {i}: {e.title} {len(e.mediawiki)}")

for i, r in enumerate(redirection_iterator):
    print(f"redirection: {i}: {r.source} -> {r.target}")

For example, you can feed an iterator to an ML model.

def to_words(x):
    return x.split()

# return word list for each iteration
iterator = ArticleIterator(tagger = to_words)
# If you set `type = dict`, you can get records as dictionaries.
# e.g. { "title": "ABC...", "article": to_words(article) }
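# For instance, a minimal sketch assuming the `type` parameter above:
dict_iterator = ArticleIterator(tagger = to_words, type = dict)
for record in dict_iterator:
    print(record["title"], record["article"][:10])
    break  # only show the first record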

# Iterators:
#   ArticleIterator
#     - 1 page / record
#     - dict keys: ["title", "article"]
#   ParagraphIterator
#     - N records / page (one record per paragraph)
#     - A heading like "== A ==" delimits paragraphs.
#     - dict keys: ["page_title", "paragraph_title", "paragraph"]

# For example, gensim's Word2Vec can consume this iterator.

from gensim.models.word2vec import Word2Vec

Word2Vec(iterator)

# You can concatenate iterators to train an ML model using CombinedIterator.
Word2Vec(CombinedIterator(iterator, another_iterator))
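
As a follow-up (a sketch assuming gensim's standard API; the query word is illustrative), you can persist and inspect the trained model:

# Train, save, and query the model with gensim's standard API.
model = Word2Vec(CombinedIterator(iterator, another_iterator))
model.save("wiki_w2v.model")
print(model.wv.most_similar("tokyo", topn = 5))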

License

The source code is licensed under the MIT License.

Please check the LICENSE file.
