WikipediaMultistreamExtractor
This is a simple application that extracts articles from Wikipedia backup files.
Installation
pip install wikipedia_multistream_extractor
Usage
From the CLI:
wikipedia_multistream_extractor SRC DEST
As a library:
from wikipedia_multistream_extractor import dump_wiki_pages, extract_wiki_pages
dump_wiki_pages(src, dst, preprocessor=None, writer=None)
# Or
pages = extract_wiki_pages(src)
You can pass a preprocessor to dump_wiki_pages if you want to transform the data.
- Your function should accept two arguments, page and title.
- It should return the modified page as a str, unless your writer expects something different.
- If you return None, the page will be skipped by the writer.
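As an illustration of the rules above, here is a minimal sketch of a preprocessor. The function name, the "Talk:" filter and the comment-stripping step are hypothetical choices, and the sketch assumes that page arrives as a str and that the arguments are passed positionally in the order page, title.

```python
import re
from typing import Optional


def strip_comments(page: str, title: str) -> Optional[str]:
    """Hypothetical preprocessor: skip talk pages, strip HTML comments."""
    # Returning None tells the writer to skip this page.
    if title.startswith("Talk:"):
        return None
    # Remove HTML comments such as <!-- ... --> from the page text.
    return re.sub(r"<!--.*?-->", "", page, flags=re.DOTALL)


# Assumed usage, following the call shown under "As a library":
# dump_wiki_pages(src, dst, preprocessor=strip_comments)
```

Keeping the preprocessor a plain function of (page, title) makes it easy to check on a few sample strings before running it over a full dump.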
You can also pass a writer if the intended output is not XML.
- Your function should accept four arguments: page, title, dest_path and safe_title.
- If you pass a writer, it's up to you to write the file; no further action is taken. safe_title should be a safe filename to use when writing the file; title is almost certainly not.
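A writer along these lines might look like the sketch below. It is not taken from the library: the function name and the .txt layout are hypothetical, and it assumes that dest_path is an output directory and that page is a str.

```python
from pathlib import Path


def write_plain_text(page: str, title: str, dest_path: str, safe_title: str) -> None:
    """Hypothetical writer: store each page as <dest_path>/<safe_title>.txt."""
    out_dir = Path(dest_path)  # assumption: dest_path is a directory
    out_dir.mkdir(parents=True, exist_ok=True)
    # Use safe_title for the filename; keep the real title inside the file.
    out_file = out_dir / f"{safe_title}.txt"
    out_file.write_text(f"{title}\n\n{page}", encoding="utf-8")


# Assumed usage, following the call shown under "As a library":
# dump_wiki_pages(src, dst, writer=write_plain_text)
```

Because the library takes no further action once a writer is supplied, creating the output directory and choosing the encoding are handled inside the function.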
This library was built on Windows using Python 3.9.13.
It should work on other platforms but it has not been tested.
License
The license is MIT; see the LICENSE file for details.
Contributing
PRs are welcome; please include type annotations and docstrings.
Todos
- Extracting a single article, based on the index offsets.
- Extracting the index files.
- Create a UI that displays the index file contents and allows the users to select articles.
- Extracting multiple articles at offsets.
- Use multiprocessing?
- Write some tests.
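For the offset-based items above, the following is a rough sketch of how a single block could be read from a multistream dump using only the standard library. It is not part of this package: read_stream_at is a hypothetical helper, and it assumes the offset comes from the dump's index file, whose lines take the form offset:page_id:title, with each offset marking the start of an independent bz2 stream.

```python
import bz2


def read_stream_at(dump_path: str, offset: int, chunk_size: int = 64 * 1024) -> bytes:
    """Decompress the single bz2 stream that starts `offset` bytes into the file."""
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(dump_path, "rb") as f:
        f.seek(offset)
        # Feed chunks until the decompressor reports the end of this stream.
        while not decompressor.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # file ended before the stream was complete
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts)
```

The returned bytes would hold the XML for every page stored in that stream, so a caller would still need to pick out the wanted article by title or page id.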