A tool for extracting plain text from Wikipedia dumps

Project description

WikiExtractor

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database backup dump, e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for English.

The tool is written in Python and requires Python 3; no additional libraries are needed. Warning: problems have been reported on Windows due to poor support for StringIO in the Python implementation there.
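
For example, after downloading the English dump linked above, a basic extraction can be started with:

python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2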

For further information, see the Wiki.

Wikipedia Cirrus Extractor

cirrus-extractor.py is a version of the script that performs extraction from a Wikipedia Cirrus dump. Cirrus dumps contain text with already expanded templates.

Cirrus dumps are available at https://dumps.wikimedia.org/other/cirrussearch/.

Details

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.

In order to speed up processing:

  • multiprocessing is used for dealing with articles in parallel
  • a cache of parsed templates is kept (useful only for repeated extractions).

Installation

The script may be invoked directly:

python -m wikiextractor.WikiExtractor <Wikipedia dump file>

It can also be installed from PyPI with:

pip install wikiextractor

or locally with:

(sudo) python setup.py install

Installation also provides two scripts for direct invocation:

wikiextractor  	(equivalent to python -m wikiextractor.WikiExtractor)
extractPage		(to extract a single page from a dump)

Usage

Wikiextractor

The script is invoked with a Wikipedia dump file as an argument:

python -m wikiextractor.WikiExtractor <Wikipedia dump file> [--templates <extracted template file>]

The --templates option extracts the templates to a local file, which can be reloaded on later runs to reduce extraction time.

The output is stored in several files of similar size in a given directory. Each file contains several documents in the format described below.
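
For example, a run that writes JSON output to a chosen directory and saves the parsed templates for later reuse might look like this (the output directory and template file names are only illustrative):

python -m wikiextractor.WikiExtractor --json -o extracted --templates templates.txt enwiki-latest-pages-articles.xml.bz2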

usage: wikiextractor [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html] [-l] [-ns ns1,ns2]
			 [--templates TEMPLATES] [--no-templates] [--html-safe HTML_SAFE] [--processes PROCESSES]
			 [-q] [--debug] [-a] [-v]
			 input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

	<doc id="" url="" title="">
	    ...
	    </doc>

If the program is invoked with the --json flag, then each file will
contain several documents formatted as JSON objects, one per line, with
the following structure:

	{"id": "", "revid": "", "url": "", "title": "", "text": "..."}

The program performs template expansion by preprocessing the whole dump and
collecting template definitions.

positional arguments:
  input                 XML wiki dump file

optional arguments:
  -h, --help            show this help message and exit
  --processes PROCESSES
			    Number of processes to use (default 79)

Output:
  -o OUTPUT, --output OUTPUT
			    directory for extracted files (or '-' for dumping to stdout)
  -b n[KMG], --bytes n[KMG]
			    maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip
  --json                write output in json format instead of the default <doc> format

Processing:
  --html                produce HTML output, subsumes --links
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
			    accepted namespaces
  --templates TEMPLATES
			    use or create file containing templates
  --no-templates        Do not expand templates
  --html-safe HTML_SAFE
			    use to produce HTML safe output within <doc>...</doc>

Special:
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -a, --article         analyze a file containing a single article (debug option)
  -v, --version         print program version

Saving templates to a file speeds up extraction on subsequent runs, provided the template definitions have not changed.

The --no-templates option significantly speeds up the extractor by avoiding the cost of expanding MediaWiki templates.
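
The JSON output described above can be consumed with a few lines of Python. The following is a minimal sketch (not part of the package), assuming extraction was run with --json and without --compress, and that the output directory is named extracted; all file and directory names here are illustrative:

import json
import os

def iter_documents(output_dir):
    """Yield one dict per extracted document from a --json extraction.

    Assumes uncompressed output: each file under output_dir holds one
    JSON object per line with keys id, revid, url, title and text.
    """
    for root, _dirs, files in os.walk(output_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        yield json.loads(line)

# Example: print the id and title of every extracted document.
for doc in iter_documents("extracted"):
    print(doc["id"], doc["title"])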

For further information, visit the documentation.

Cirrus Extractor

usage: cirrus-extract.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [-ns ns1,ns2] [-q]
                         [-v]
                         input

Wikipedia Cirrus Extractor:
Extracts and cleans text from a Wikipedia Cirrus dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

	<doc id="" url="" title="" language="" revision="">
        ...
        </doc>

positional arguments:
  input                 Cirrus Json wiki dump file

optional arguments:
  -h, --help            show this help message and exit

Output:
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to
                        stdout)
  -b n[KMG], --bytes n[KMG]
                        maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip

Processing:
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces

Special:
  -q, --quiet           suppress reporting progress info
  -v, --version         print program version
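
For instance, assuming the script is run by path from a source checkout (the output directory name below is just an example), an invocation might be:

python cirrus-extract.py -o cirrus_extracted <Wikipedia Cirrus dump file>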

extractPage

Extract a single page from a Wikipedia dump file.

usage: extractPage [-h] [--id ID] [--template] [-v] input

Wikipedia Page Extractor:
Extracts a single page from a Wikipedia dump file.

positional arguments:
  input          XML wiki dump file

optional arguments:
  -h, --help     show this help message and exit
  --id ID        article number
  --template     template number
  -v, --version  print program version
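
For example, to pull out the article with a given id from a dump (the id value below is only illustrative):

extractPage --id 12 <Wikipedia dump file>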

License

The code is made available under the GNU Affero General Public License v3.0.

Reference

If you find this code useful, please cite it in publications as:

@misc{Wikiextractor2015,
  author = {Giuseppe Attardi},
  title = {WikiExtractor},
  year = {2015},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/attardi/wikiextractor}}
}

Download files

Download the file for your platform.

Source Distribution

wikiextractor-3.0.6.tar.gz (43.3 kB)

Uploaded Source

Built Distribution

wikiextractor-3.0.6-py3-none-any.whl (46.4 kB)

Uploaded Python 3

File details

Details for the file wikiextractor-3.0.6.tar.gz.

File metadata

  • Download URL: wikiextractor-3.0.6.tar.gz
  • Upload date:
  • Size: 43.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for wikiextractor-3.0.6.tar.gz:

  • SHA256: 7041ff57884c16409e9b911ccd0106c218b88cbceabb86dd1443637afa8284de
  • MD5: c8a986cfd99d8dcee239f0acfa818b73
  • BLAKE2b-256: 8e6c21050e72c7e42f606689b1a4cbfbbf65c81e79addc69a9451de450e2c672


File details

Details for the file wikiextractor-3.0.6-py3-none-any.whl.

File metadata

  • Download URL: wikiextractor-3.0.6-py3-none-any.whl
  • Upload date:
  • Size: 46.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for wikiextractor-3.0.6-py3-none-any.whl:

  • SHA256: f4506990bf9c14e3cc92dc85a0a89554fef975cc1b28dcb9a4d31c73a8ef2133
  • MD5: d7ec1ae3849996898c4202331ff9bd1c
  • BLAKE2b-256: 3f37b110fd8a87bab4120ff55061c508361bd1b012d789ff469af8fbd7cc3a38

