Python package for converting xml and epubs to text files
epub conversion
---------------
Create text corpora from epubs and wiki dumps.
This Python package converts epub files and XML wiki dumps to text files, lines, or Python generators.
Usage:
------
### Epub usage
#### Book by book
To convert epubs to text files, first create a converter object:

    from epub_conversion import Converter

    converter = Converter("my_ebooks_folder/")

Then use this converter to concatenate all the text within the ebooks into a single text file:

    converter.convert("my_succinct_text_file.gz")
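
To check the result, the output can be read back with the standard library. The following is a minimal sketch, assuming the file written by `converter.convert` is gzip-compressed UTF-8 text (the `.gz` name suggests this, but it is not confirmed here). `iter_gz_lines` is a hypothetical helper, not part of `epub_conversion`.

```python
import gzip


def iter_gz_lines(path, encoding="utf-8"):
    """Yield lines (without the trailing newline) from a gzip-compressed text file."""
    # "rt" opens the archive in text mode so each item is a decoded str.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Under that assumption, `for line in iter_gz_lines("my_succinct_text_file.gz"): ...` streams the corpus without loading it all into memory.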
#### Line by line
You can also proceed line by line:

    from epub_conversion.utils import open_book, convert_epub_to_lines

    book = open_book("twilight.epub")
    lines = convert_epub_to_lines(book)
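
Working line by line makes it easy to compute corpus statistics without writing a file first. Here is a minimal sketch, assuming `lines` is an iterable of strings. `count_words` is a hypothetical helper, and splitting on whitespace is a deliberate simplification rather than real tokenization.

```python
from collections import Counter


def count_words(lines):
    """Count lowercased, whitespace-separated tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # str.split() with no arguments also discards empty tokens.
        counts.update(line.lower().split())
    return counts
```

Calling `count_words(lines).most_common(10)` would then list the ten most frequent tokens in the book.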
### Wikidump usage
#### Redirections
Suppose you are interested in all redirections in a given Wikipedia dump file that is still compressed. You can access the dump as follows:

    import epub_conversion.wiki_decoder

    wiki = epub_conversion.wiki_decoder.almost_smart_open("enwiki.bz2")
Taking this dump as our **input**, let us now use a generator to collect all pairs of `title` and `redirection title` in this dump:

    redirections = {
        redirect_from: redirect_to
        for redirect_from, redirect_to in epub_conversion.wiki_decoder.get_redirection_list(wiki)
    }
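
A redirect can point at another redirect, so it may be useful to follow the chain to its final title. The sketch below assumes `redirections` is a plain dictionary mapping a source title to its target title. `resolve_redirect` and its `max_hops` parameter are hypothetical and not part of `epub_conversion`.

```python
def resolve_redirect(title, redirections, max_hops=10):
    """Follow redirects from `title` until a title with no redirect is reached.

    Stops on a cycle or after `max_hops` hops and returns the last title reached.
    """
    seen = {title}
    for _ in range(max_hops):
        target = redirections.get(title)
        # Stop when there is no further redirect or when we would loop.
        if target is None or target in seen:
            break
        seen.add(target)
        title = target
    return title
```

The `seen` set and the hop limit guard against redirect loops, which would otherwise make the lookup run forever.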
#### Page text
Suppose you are interested only in the lines within each page's text section:

    for line in epub_conversion.wiki_decoder.convert_wiki_to_lines(wiki):
        process_line(line)
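
`process_line` is not defined in the snippet above and stands for your own function. Since a full dump is large, it can help to try a pipeline on the first few lines only. The sketch below uses `itertools.islice` and assumes `convert_wiki_to_lines` yields strings lazily. `take_lines` is a hypothetical helper.

```python
from itertools import islice


def take_lines(lines, limit=5):
    """Return up to `limit` non-empty, stripped lines from an iterable of lines."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    # islice stops pulling from the source once `limit` items were produced.
    return list(islice(non_empty, limit))
```

Because everything before the final `list` call is lazy, only as much of the input is consumed as is needed to produce `limit` lines.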
See Also:
---------
* [Wikipedia NER](https://github.com/JonathanRaiman/wikipedia_ner): a Python module that uses `epub_conversion` to process Wikipedia dumps and output only the lines that contain page-to-page links, with the link anchor texts extracted and all markup removed.