WikipediaMultistreamExtractor
This is a simple application that extracts articles from Wikipedia backup files.
Installation
pip install wikipedia_multistream_extractor
Usage
From the CLI:
wikipedia_multistream_extractor SRC DEST
As a library:
from wikipedia_multistream_extractor import dump_wiki_pages, extract_wiki_pages
dump_wiki_pages(src, dst, preprocessor=None, writer=None)
# Or
pages = extract_wiki_pages(src)
You can pass a preprocessor to dump_wiki_pages if you want to transform the data.
- Your function should accept two arguments, page and title.
- It should return the modified page as a str, unless your writer expects something different.
- If you return None, the page will be skipped by the writer.
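As an illustration of the preprocessor contract, here is a minimal sketch. The function name skip_redirects is hypothetical, and it assumes that page arrives as the page's XML in a str and that redirect pages carry a `<redirect ... />` element, as in standard MediaWiki XML exports. Neither assumption is confirmed by this library's documentation.

```python
from typing import Optional


def skip_redirects(page: str, title: str) -> Optional[str]:
    """Hypothetical preprocessor: drop redirect pages, tidy the rest.

    Assumption: `page` is the page XML as a str. `title` is accepted
    because the preprocessor contract passes it, but it is unused here.
    """
    # Returning None asks the writer to skip this page.
    if "<redirect" in page:
        return None
    # Otherwise return the (lightly cleaned) page as a str.
    return page.strip() + "\n"


# Possible usage, with src and dst being your own paths:
# dump_wiki_pages(src, dst, preprocessor=skip_redirects)
```

Returning None for unwanted pages keeps the filtering logic in one place, so the writer only ever sees pages you want to keep.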
You can also pass a writer if the intended output is not XML.
- Your function should accept four arguments: page, title, dest_path and safe_title.
- If you pass a writer, it is up to you to write the file; no further action is taken.
- safe_title should be a safe filename to use when writing the file; title is almost certainly not.
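A custom writer following the four-argument contract might look like the sketch below. The name write_json is hypothetical. It assumes dest_path is the destination directory and that safe_title has no file extension; both are guesses, so check them against what the library actually passes before relying on this.

```python
import json
from pathlib import Path


def write_json(page: str, title: str, dest_path, safe_title: str) -> None:
    """Hypothetical writer: store each page as a small JSON document."""
    # Assumption: dest_path is a directory (str or Path).
    out_dir = Path(dest_path)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Use safe_title for the filename; title may contain characters
    # that are not valid in filenames. The ".json" suffix is our choice.
    out_file = out_dir / f"{safe_title}.json"
    with out_file.open("w", encoding="utf-8") as fh:
        json.dump({"title": title, "page": page}, fh, ensure_ascii=False)


# Possible usage, with src and dst being your own paths:
# dump_wiki_pages(src, dst, writer=write_json)
```

The original title is kept inside the JSON payload, so nothing is lost by using the sanitized name on disk.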
This library was built on Windows using Python 3.9.13.
It should work on other platforms but it has not been tested.
License
The license is MIT, see the LICENSE file for details.
Contributing
PRs are welcome; please include type annotations and docstrings.
Todos
- Extracting a single article, based on the index offsets.
- Extracting the index files.
- Create a UI that displays the index file contents and allows the users to select articles.
- Extracting multiple articles at offsets.
- Use multiprocessing?
- Write some tests.