Project description

WikipediaMultistreamExtractor

This is a simple application that extracts articles from Wikipedia backup files.

Installation

pip install wikipedia_multistream_extractor

Usage

From the CLI:

 wikipedia_multistream_extractor SRC DEST

As a library:

from wikipedia_multistream_extractor import dump_wiki_pages, extract_wiki_pages


dump_wiki_pages(src, dst, preprocessor=None, writer=None)

# Or

pages = extract_wiki_pages(src) 

You can pass a preprocessor to dump_wiki_pages if you want to transform the data; a sketch follows the list below.

  • Your function should accept two arguments, page and title.
  • It should return the modified page as a str, unless your writer expects something different.
  • If you return None, the page will be skipped by the writer.
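
For example, a minimal preprocessor sketch that skips very short pages and passes everything else through unchanged (the length threshold and the file paths are illustrative placeholders, not part of the library):

from wikipedia_multistream_extractor import dump_wiki_pages


def skip_short_pages(page: str, title: str):
    """Skip pages shorter than 500 characters; return all others unchanged."""
    if len(page) < 500:
        return None  # None tells the writer to skip this page
    return page


# Placeholder paths for the source dump and the output directory.
dump_wiki_pages("enwiki-multistream.xml.bz2", "out/", preprocessor=skip_short_pages)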

You can also pass a writer if the intended output is not XML; a sketch follows the list below.

  • Your function should accept four arguments: page, title, dest_path, and safe_title.
  • If you pass a writer, it is up to you to write the file; no further action is taken.
  • safe_title should be a safe filename to use when writing the file; title is almost certainly not.
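
For example, a minimal writer sketch that stores each page as UTF-8 plain text instead of XML (this assumes dest_path is the destination directory passed to dump_wiki_pages; the .txt layout and file paths are illustrative choices):

from pathlib import Path

from wikipedia_multistream_extractor import dump_wiki_pages


def text_writer(page: str, title: str, dest_path, safe_title: str) -> None:
    """Write the page to <dest_path>/<safe_title>.txt; no other action is taken for us."""
    out_file = Path(dest_path) / f"{safe_title}.txt"
    out_file.write_text(page, encoding="utf-8")


# Placeholder paths for the source dump and the output directory.
dump_wiki_pages("enwiki-multistream.xml.bz2", "out/", writer=text_writer)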

This library was built on Windows using Python 3.9.13.

It should work on other platforms, but it has not been tested on them.

License

The license is MIT, see the LICENSE file for details.

Contributing

PRs are welcome; please include type annotations and docstrings.

Todos

  • Extract a single article based on the index offsets.
  • Extract the index files.
  • Create a UI that displays the index file contents and allows users to select articles.
  • Extract multiple articles at offsets.
  • Use multiprocessing?
  • Write some tests.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikipedia_multistream_extractor-0.0.2.tar.gz (12.2 kB)

Built Distribution

wikipedia_multistream_extractor-0.0.2-py3-none-any.whl

File details

Details for the file wikipedia_multistream_extractor-0.0.2.tar.gz.

File metadata

File hashes

Hashes for wikipedia_multistream_extractor-0.0.2.tar.gz
Algorithm    Hash digest
SHA256       ccda019a114471b83c7f47831bc563345fbe15447aa6d13c9984e126a579e15b
MD5          97c1901dd911c07706ba781a32b82c3d
BLAKE2b-256  42f8fd377f7e6cf18c098beb74672a99ff5e56d60fc87210f9dc4d6cd0050bd8

See more details on using hashes here.
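
If you want to sanity-check a local download against the SHA256 above, a minimal sketch using only the standard library (the file path is a placeholder for wherever the archive was saved):

import hashlib

# Placeholder path to the downloaded sdist; adjust to your download location.
sdist_path = "wikipedia_multistream_extractor-0.0.2.tar.gz"
expected_sha256 = "ccda019a114471b83c7f47831bc563345fbe15447aa6d13c9984e126a579e15b"

with open(sdist_path, "rb") as f:
    actual_sha256 = hashlib.sha256(f.read()).hexdigest()

print("OK" if actual_sha256 == expected_sha256 else "SHA256 mismatch")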

File details

Details for the file wikipedia_multistream_extractor-0.0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for wikipedia_multistream_extractor-0.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       82c127d4c9454dd0d8d1e2256d7ec92108e8097bcc10413966467b7274ad901b
MD5          c03c177694ac2ed58bf99f8a113b9b49
BLAKE2b-256  ad5e384517cd7cbf4b94e12b99d9ed0e9fbf4ecee58aecf87e241084ecad5aa6

See more details on using hashes here.
