Render website to ebook to make it easier to read on devices
colusa
Render website to ebook to make it easier to read on devices.
Installation
python3 setup.py install
Usage
Start from scratch
First, generate a configuration file for colusa to work with. colusa has a built-in command that generates a template configuration as a starting point. Run the following command to generate the configuration file:
$ colusa init new_ebook.json
colusa will generate a configuration file like the one below:
{
  "title": "__fill the title__",
  "author": "__fill the author__",
  "version": "v1.0",
  "homepage": "__fill url to home page__",
  "output_dir": "__fill output dir__",
  "urls": []
}
Edit the configuration file and replace each placeholder with valid information.
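As a quick check before moving on, a small script can report which fields still hold template placeholders. This is only a sketch: the helper name find_unfilled_fields is hypothetical and not part of colusa, and it assumes placeholders always start with "__fill", as they do in the template shown above.

```python
import json

# Assumption: unfilled template values start with this prefix, as in the template above.
PLACEHOLDER_PREFIX = "__fill"

# A copy of the generated template, kept inline so the example is self-contained.
TEMPLATE = """
{
  "title": "__fill the title__",
  "author": "__fill the author__",
  "version": "v1.0",
  "homepage": "__fill url to home page__",
  "output_dir": "__fill output dir__",
  "urls": []
}
"""


def find_unfilled_fields(config: dict) -> list:
    """Return the keys whose string values are still template placeholders."""
    return [
        key
        for key, value in config.items()
        if isinstance(value, str) and value.startswith(PLACEHOLDER_PREFIX)
    ]


print(find_unfilled_fields(json.loads(TEMPLATE)))
# prints ['title', 'author', 'homepage', 'output_dir']
```

To check a real file, load it with json.load instead of parsing the inline TEMPLATE string.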
Add content to ebook
After generating the configuration file, add the URLs of the blog posts, articles, or web pages you want to include to the urls field in the configuration file.
Example of a final configuration:
{
  "title": "The Great Mental Models",
  "author": "Farnam Street Media Inc",
  "version": "v1.0",
  "homepage": "https://fs.blog",
  "output_dir": "fsblog",
  "urls": [
    "https://fs.blog/2018/04/first-principles/",
    "https://fs.blog/2016/04/second-order-thinking/",
    "https://fs.blog/2017/06/thought-experiment/",
    "https://fs.blog/2018/05/probabilistic-thinking/",
    "https://fs.blog/2019/12/survivorship-bias/"
  ]
}
Update ebook content
To update the ebook content, add or remove URLs in the urls field. The resulting ebook changes accordingly the next time it is generated.
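Editing the urls list by hand works fine; for longer lists a small script can help. The sketch below is an illustration, not part of colusa: the helpers add_urls and remove_urls are hypothetical names, and they only assume that the configuration is a JSON object with a urls array, as in the examples above.

```python
def add_urls(config: dict, new_urls: list) -> dict:
    """Return a copy of config with new_urls appended, skipping duplicates."""
    urls = list(config.get("urls", []))
    for url in new_urls:
        if url not in urls:  # keep the existing order, avoid repeated chapters
            urls.append(url)
    return {**config, "urls": urls}


def remove_urls(config: dict, unwanted: list) -> dict:
    """Return a copy of config without the URLs listed in unwanted."""
    urls = [url for url in config.get("urls", []) if url not in unwanted]
    return {**config, "urls": urls}


config = {"title": "The Great Mental Models", "urls": ["https://fs.blog/2018/04/first-principles/"]}
config = add_urls(config, ["https://fs.blog/2019/12/survivorship-bias/"])
print(len(config["urls"]))
# prints 2
```

Both helpers return a new dictionary rather than changing the input, so the original configuration stays available until the result is written back with json.dump.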
Generate ebook content
After adding or removing URLs in urls, invoke colusa to regenerate the ebook content. Run the following command in a terminal:
$ colusa generate new_ebook.json
This command makes colusa download the web pages listed in urls, parse them, transform them to asciidoc format, and save them to output_dir. colusa also creates the information necessary for compiling the ebook in later steps.
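If the regeneration step needs to run from a script (for example after editing urls programmatically), the command can be wrapped with Python's subprocess module. This is a minimal sketch that only uses the colusa generate command shown above; the function names build_generate_command and regenerate are hypothetical, and the sketch assumes the colusa executable is on PATH.

```python
import shutil
import subprocess


def build_generate_command(config_path: str) -> list:
    """Build the argument list for the `colusa generate` command shown above."""
    return ["colusa", "generate", config_path]


def regenerate(config_path: str) -> int:
    """Run `colusa generate` and return its exit code."""
    # Assumption: colusa was installed so that its executable is on PATH.
    if shutil.which("colusa") is None:
        raise FileNotFoundError("colusa is not on PATH; install it first")
    completed = subprocess.run(build_generate_command(config_path), check=False)
    return completed.returncode
```

Calling regenerate("new_ebook.json") is then equivalent to typing the command in a terminal; a non-zero return value indicates that the command failed.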
Compile ebook for consuming purpose
Prerequisites
Before generating the ebook, install the asciidoctor tools. Follow the installation guides on the following websites:
- for generating html: https://asciidoctor.org
- for generating epub: https://asciidoctor.org/docs/asciidoctor-epub3/
- for generating pdf: https://asciidoctor.org/docs/asciidoctor-pdf/
Generating ebooks
To help with generating the ebook, colusa also creates a Makefile in the root folder of the ebook. The Makefile has three common targets for generating the ebook in html, epub, and pdf formats.
# to generate html
$ make html
# to generate epub
$ make epub
# to generate pdf
$ make pdf
Generated ebooks will be saved to the ./output folder.
user:output/ $ ls [10:55:55]
total 3056
drwxr-xr-x 10 320B images
-rw-r--r-- 1 699K index.epub
-rw-r--r-- 1 131K index.html
-rw-r--r--@ 1 694K index.pdf
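After running the make targets, a short script can confirm which formats were actually produced. The sketch below is hedged: it assumes each target writes output/index.<ext> under the ebook root, as the listing above suggests, and the name missing_outputs is hypothetical rather than something colusa provides.

```python
from pathlib import Path

# Assumption: file names follow the listing above (index.html, index.epub, index.pdf).
TARGET_FILES = {"html": "index.html", "epub": "index.epub", "pdf": "index.pdf"}


def missing_outputs(ebook_root, targets=("html", "epub", "pdf")) -> list:
    """Return the make targets whose output file is not present yet."""
    output_dir = Path(ebook_root) / "output"
    return [
        target
        for target in targets
        if not (output_dir / TARGET_FILES[target]).is_file()
    ]
```

For example, missing_outputs("fsblog") returns an empty list once make html, make epub, and make pdf have all succeeded in the fsblog folder.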
List of Supported Websites
Currently colusa supports only a limited number of websites. The following list shows all supported websites:
- https://untools.co
- https://unintendedconsequenc.es
- https://blog.acolyer.org
- https://fs.blog
- https://increment.com
- https://slack.engineering
- https://medium.com
- https://www.cs.rutgers.edu/~pxk/
- https://www.preethikasireddy.com
- https://engineering.atspotify.com
- https://truyenfull.vn
- https://avikdas.com
- https://www.infoq.com
Contribution
Contributions are welcome. You can open issues to request support for more websites, open PRs to help with those issues, or contribute in any other way, such as documentation or code.
Changelog
v0.12.0 (2022-07-24)
New
- Add new command to crawl an URL. [Nguyen Huu Hoa]
  Crawl an URL (website) to help generate list of URL, mostly story chapters
Changes
- Add some debug capability. [Nguyen Huu Hoa]
- Add support for new websites. [Nguyen Huu Hoa]
Other
- Merge branch 'main' of github.com:huuhoa/colusa. [Nguyen Huu Hoa]
- Chg(plugins/truyenfull): clean ads content. [Nguyen Huu Hoa]
- Chg(plugins/truyenfull): clean ads content. [Nguyen Huu Hoa]
v0.11.0 (2022-02-17)
Changes
- Support for tangthuvien (#72) [Huu Hoa NGUYEN]
Fix
- Gitchangelog ignore pattern. [Nguyen Huu Hoa]
v0.10.0 (2021-10-16)
Changes
- Update dev requirements. [Nguyen Huu Hoa]
- Improve code coverage (#21) [Huu Hoa NGUYEN]
  Mock up two functions download_image and download_content
  - download_content will return existing cached file, so that we don't have to redownload every time we run the test
  - download_image will just return True, do nothing, so that we don't have to download images
Other
- Add: support for techtarget.com (#32) [Huu Hoa NGUYEN]
  - chg(asciidoc_visitor): support parsing datasrc and data-srcset for img
  - add(web): support for techtarget.com
v0.9.0 (2021-08-26)
New
- Integration tests (#20) [Huu Hoa NGUYEN]
  - chg: add tox.ini for running tox
  - chg(colusa): move colusa source to src folder
- Support parsing site xp123.com (#18) [Huu Hoa NGUYEN]
Other
- Chore: update setup.cfg for version location. [Nguyen Huu Hoa]
- Prepare for next release. [Nguyen Huu Hoa]
- Refactor(etr): Rework on Extractor (#19) [Huu Hoa NGUYEN]
  - refactor(etr): move _parse_yoast from a plugin extract to base Extractor
  - refactor(etr): rename methods for clarification
    - rename internal_init to _find_main_content
    - rename get_author to field author and _parse_author for parsing value
    - rename get_published to field published and _parse_published for parsing value
    - rename get_title to field title and _parse_title for parsing value
    - add _parse_metadata for parsing all related metadata from html
  - refactor(etr): change signature of Extractor._find_main_content
    - _find_main_content now returns bs.Tag instead of setting a value for field main_content. The change makes the purpose of _find_main_content clearer, i.e. it only finds the main content and does not modify anything
    - _parse_metadata will be executed after we found the main content
v0.8.0 (2021-08-22)
New
- Support rendering additional book properties (#16) [Huu Hoa NGUYEN]
  In the book configuration file, add a new array book_properties whose content is a list of strings. Those strings will be rendered as book properties in the master file (index.asciidoc).
  Example:
  "book_properties": [
    "ifdef::backend-pdf[]",
    ":front-cover-image: image:cover.pdf[]",
    ":notitle:",
    "endif::[]",
    "ifdef::backend-epub3[]",
    ":front-cover-image: image:cover.png[]",
    "endif::[]"
  ]
  The above example will instruct the asciidoctor processor to use:
  - cover.pdf as front cover image when generating pdf
  - cover.png as front cover image when generating epub3
Changes
- Render html table as native asciidoc table (#17) [Huu Hoa NGUYEN]
Other
- Prepare for bump version 0.8. [Nguyen Huu Hoa]
- Add(plugins): support for scrumcrazy.wordpress.com. [Nguyen Huu Hoa]
v0.7.0 (2021-08-20)
Changes
- Improve article parsing (#15) [Huu Hoa NGUYEN]
  - add: agilethought support to get article's author
  - add: support website tech.trivago.com
- Metadata rendering (#14) [Huu Hoa NGUYEN]
  render metadata in a more clean way, the format should be
  by **{author}** on {published_date} at {url | domain}
Other
- Chore: refactor project's setup configurations. [Nguyen Huu Hoa]
- Setup codeql-analysis. [Huu Hoa NGUYEN]
- Setup dependabot. [Huu Hoa NGUYEN]
- Dev: update requirements_dev.txt. [Nguyen Huu Hoa]
- Docs: update CHANGELOG. [Nguyen Huu Hoa]
v0.6.0 (2021-08-18)
New
- Support new website https://agilethought.com (#11) [Huu Hoa NGUYEN]
Changes
- Etr: improve metadata rendering for generated articles (#10) [Huu Hoa NGUYEN]
v0.5.1 (2021-08-14)
Changes
- Update requirements for package and dev. [Nguyen Huu Hoa]
v0.5.0 (2021-08-14)
New
- Support new website https://staffeng.com. [Nguyen Huu Hoa]
Changes
- Initial configuration for bumpversion. [Nguyen Huu Hoa]
- Add comments at the beginning of included files. [Nguyen Huu Hoa]
  new version of asciidoctor removes leading and trailing empty lines of included files therefore the beginning of new section in the included files will not be separated as expected. The work around is to add comment line at the very beginning of included files.
- Support for parsing some common webblogs. [Nguyen Huu Hoa]
- Support for parsing hbr.org. [Nguyen Huu Hoa]
- Support parsing content for blog-content and wikipedia. [Nguyen Huu Hoa]
- Support parsing srcset dimension with 'x' specification. [Nguyen Huu Hoa]
- Add suffix to generated file name to prevent name colliding. [Nguyen Huu Hoa]
- Support detecting webpage content inside main tag. [Nguyen Huu Hoa]
- Passthrough the table tag content. [Nguyen Huu Hoa]
- Special treatment for image inside an anchor tag. [Nguyen Huu Hoa]
- Cleanup code to get content class of a website. [Nguyen Huu Hoa]
- Heading level of generated asciidoc. [Nguyen Huu Hoa]
- Add support for parsing new websites. [Nguyen Huu Hoa]
Fix
- Correct config for bumpversion. [Nguyen Huu Hoa]
- Slugify that import non existing unicode from idna. [Nguyen Huu Hoa]
- Get correct image suffix by parsing url first to get only path in URL. [Nguyen Huu Hoa]
Other
- Docs: add some documents. [Nguyen Huu Hoa]
- Add: proper configuration for packaging. [Nguyen Huu Hoa]
  - add bump_version support
v0.4.0 (2020-10-14)
New
- Support new website https://www.infoq.com. [Nguyen Huu Hoa]
- Yaml configuration (#9) [Huu Hoa NGUYEN]
  - feat: support configuration file in YAML format
  - Configuration file format is determined by extension, i.e json or yml
  - chg: add help str to make error report more concise
  - chg: add logs statements in various place
  - chg: dev: add CHANGELOG.md to record changes
Changes
- Cleanup output for content from truyenfull.vn. [Nguyen Huu Hoa]
- Tolerate for some non-comforms htmls. [Nguyen Huu Hoa]
- Add customization options for generated ebook. [Nguyen Huu Hoa]
  - metadata: type bool, default is True. Metadata such as published_date, source url are generated after article (chapter) title if True
  - title_prefix_trim: type string, default is empty. When specified, string value in title_prefix_trim will be removed from article (chapter) title
Other
- Docs: Prepare for release v0.4.0. [Nguyen Huu Hoa]
v0.3.1 (2020-09-26)
Changes
- Update meta information for pypi.org package (#8) [Huu Hoa NGUYEN]
  - Add long description
  - Add classifiers
  - Prepare for release version 0.3.1
v0.3.0 (2020-09-26)
New
- Support book parts. [Nguyen Huu Hoa]
  Render additional information for book parts
- Support website truyenfull.vn, avikdas.com. [Nguyen Huu Hoa]
- Support web engineering.atspotify.com. [Nguyen Huu Hoa]
- Support website www.preethikasireddy.com. [Nguyen Huu Hoa]
- Support website cs.rutgers.edu. [Nguyen Huu Hoa]
- Support website medium.com. [Nguyen Huu Hoa]
- Support website slack.engineering. [Nguyen Huu Hoa]
- Support website increment.com. [Nguyen Huu Hoa]
- Add pdf target to Makefile for generating pdf format. [Nguyen Huu Hoa]
Changes
- Rename project from symphony to colusa (#7) [Huu Hoa NGUYEN]
  The name colusa does not yet exist on pypi.org, so I renamed the project in order to be able to upload it to pypi.org
- Cleanup code. [Nguyen Huu Hoa]
- Cleanup etr.Transform to remove obsolete methods. [Nguyen Huu Hoa]
- Add coloring log for improved experiences in using app. [Nguyen Huu Hoa]
- Support to render code tag in asciidoc_visitor. [Nguyen Huu Hoa]
- Implement Visitor pattern for saving document to asciidoc (#6) [Huu Hoa NGUYEN]
  - implement visitor pattern for writing asciidoc file format
  - implement visit methods for various tags
  - suppress empty text in anchor tag
  - add support for knowledgegraph.today
  - all unknown PageElement are classified to visit_unknown
  - support pre tag
- Update gitignore to exclude PyCharm IDE generated files. [Nguyen Huu Hoa]
- Update usage for symphony. [Nguyen Huu Hoa]
  - to initialize new ebook: symphony init
  - to generate ebook contents: symphony generate
- Add setup.py for easier installation. [Nguyen Huu Hoa]
- Implement plugin architecture for Extractor, Transformer (#4) [Huu Hoa NGUYEN]
- Update requirements.txt. [Nguyen Huu Hoa]
- Ignore empty heading. [Nguyen Huu Hoa]
- Support getting metadata from opengraph and getting main content from microformats - hentry. [Nguyen Huu Hoa]
- Report error when cannot understand a website. [Nguyen Huu Hoa]
- Update format for ul, ol. [Nguyen Huu Hoa]
- Default get_title for Extractor to get information from meta tag og:title or get from header>title. [Nguyen Huu Hoa]
- Support rendering code block. [Nguyen Huu Hoa]
- Don't use Renderer any more, since Transformer is enough to render asciidoc content. [Nguyen Huu Hoa]
- Transform tags: table, pre. [Nguyen Huu Hoa]
- Change encoding when saving content to cache and loading from cache. [Nguyen Huu Hoa]
- Correct rendering table code. [Nguyen Huu Hoa]
- Remove line break in heading tags. [Nguyen Huu Hoa]
- Add more information to README. [Nguyen Huu Hoa]
- Update README.md for Usage. [Nguyen Huu Hoa]
Fix
- Error when checking for existence of paragraph-image in figure class. [Nguyen Huu Hoa]
  The error occurs when the figure tag does not have any classes: node.get('class') returns None, so checking whether a text exists in None throws an exception.
- Remove extra line separators in truyenfull.vn. [Nguyen Huu Hoa]
- Generating new configuration. [Nguyen Huu Hoa]
- Rendering image in asciidoc. [Nguyen Huu Hoa]
- Parsing url and dimension in srcset attribute of img tag. [Nguyen Huu Hoa]
Other
- Refactor: remove classmethods inside Transformer to make it open for extension. [Nguyen Huu Hoa]
- Refactor: move etr register extractor, transformer and their factory functions from etr_factory to etr. [Nguyen Huu Hoa]
  - Cleanup code
  - Make it more clean in using register decorators
- Docs: Update README.md. [Nguyen Huu Hoa]
- Docs: Update README.md for installing tools to generate ebooks. [Nguyen Huu Hoa]
- Tests: add skeleton for unit testing. [Nguyen Huu Hoa]
- Refactor: remove unused params in Transformer. [Nguyen Huu Hoa]
- Refactor: cleanup code. [Nguyen Huu Hoa]
- Refactor: create Symphony class to handle all business logics. [Nguyen Huu Hoa]
  - downloading urls
  - transform and generating ebook contents
  - generating ebook master file
  - generating Makefile
- Refactor: create symphony package and move all etr related to symphony. [Nguyen Huu Hoa]
- Docs: Update README.md. [Nguyen Huu Hoa]
- Wip: new way to transform html content to asciidoc format. [Nguyen Huu Hoa]
- Update README.md (#1) [Anh Le (Andy)]
  Add installation script
- Cleanup ads on fsblog. [Nguyen Huu Hoa]
- Add support for fs.blog. [Nguyen Huu Hoa]
- Refactor code to make it easier to support new websites. [Nguyen Huu Hoa]
- Cleanup code and support rendering article's metadata. [Nguyen Huu Hoa]
- Change string formats to use f-string style. [Nguyen Huu Hoa]
- Add requirements.txt. [Nguyen Huu Hoa]
- Cleanup. [Nguyen Huu Hoa]
- Fix factory creator for unintendedconsequences. [Nguyen Huu Hoa]
- Generate makefile to generate ebook. [Nguyen Huu Hoa]
- Support render pre tag. [Nguyen Huu Hoa]
- Support config file to make it more general in supporting more repetitive works. [Nguyen Huu Hoa]
- Update ignore patterns. [Nguyen Huu Hoa]
- Generalize to accept new webblog. [hoanh]
- Update tool to generate index.asciidoc from template. [hoanh]
- Tools to download unintendedsequences. [hoanh]
- First version. [hoanh]
- Initial commit. [Huu Hoa NGUYEN]