A web type setter

degrotesque — A web type setter.

Introduction

degrotesque beautifies the web.

degrotesque is a Python script. It loads an HTML file from disk, or several files in batch, one after the other, and in each it replaces some commonly used non-typographic characters, such as ", ', and -, with their typographic counterparts to improve the pages' appearance.

E.g.:

"Well - that's not what I had expected."

will become:

“Well — that's not what I had expected.”

I think it looks much better.

The starting and ending quotes have been replaced by “ and ”, respectively, the ' by ' and the - by an —. Of course, the script leaves HTML elements themselves untouched. It keeps the complete format as-is and replaces characters with either their proper HTML entity name or the respective Unicode character.

It is meant to be a relatively reliable post-processing step for web pages before releasing them.

Background

I often write my texts and web pages using a plain editor. As such, the character " is always used for quotes, a dash is always a minus, etc.

I wanted a tool that automatically recognizes which characters should be replaced by their more typographic counterparts and applies the corresponding rules.

I think it’s a pity that major desktop publishing applications do this on the fly, while many web sites, even major ones, still show us plain ASCII characters.

degrotesque does the job quite well. After writing or building my pages, the tool converts them to a prettier and typographically more correct form. The structure and format of the pages are fully preserved. And as said, it works reliably.

If you need any advice, please let me know. The same applies if you know a better way.

Download and Installation

The current version is degrotesque-2.0.6. You may install degrotesque using

python -m pip install degrotesque

You may download a copy or fork the code at degrotesque's github page. Besides, you may download the current release here:

License

degrotesque is licensed under the BSD license.

Documentation

Usage

degrotesque is implemented in Python. It is started on the command line.

The option -i <PATH> / --input <PATH> tells the script which file(s) shall be read; you may name either a file or a folder here. If the option -r / --recursive is set, the given folder will be processed recursively.

The tool processes HTML files, XML files, and their derivatives. The extensions of file types that are processed are given in Appendix A. You may change the extensions of files to process using the -e <EXTENSION>[,<EXTENSION>]* / --extensions <EXTENSION>[,<EXTENSION>]* option.
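The file selection described above can be sketched in a few lines of Python. This is only an illustration, not degrotesque's actual code: the function name collect_files, its parameters, and the shortened extension set are assumptions, and returning a single named file without checking its extension is an assumption as well.

```python
from pathlib import Path

# Assumed subset of the default extensions listed in Appendix A.
DEFAULT_EXTENSIONS = {"html", "htm", "xhtml", "php", "shtml"}


def collect_files(path, recursive=False, extensions=DEFAULT_EXTENSIONS):
    """Return the files under `path` whose extension is in `extensions`."""
    root = Path(path)
    if root.is_file():
        # A single named file is returned as-is (assumption).
        return [root]
    # "**/*" descends into subfolders, "*" stays in the given folder.
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in root.glob(pattern)
        if p.is_file() and p.suffix.lstrip(".").lower() in extensions
    )
```

Calling collect_files("my_folder", recursive=True) would mirror what -i my_folder -r selects under these assumptions.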

The files are read one by one and the replacement of plain ASCII chars by some nicer ones is based upon a chosen set of “actions”. Known and default actions are given in Appendix B. You may select the actions to apply using the -a <ACTION_NAME>[,<ACTION_NAME>]* / --actions <ACTION_NAME>[,<ACTION_NAME>]* option. The default actions are masks, quotes.english, dashes, ellipsis, math, apostrophe, and commercial. By default, HTML entities are inserted. If you would rather have Unicode characters, use the option -u / --unicode.
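To illustrate how named action sets and the entity/Unicode switch could fit together, here is a minimal sketch. It is not the tool's real data structure: the ACTIONS dictionary, the function apply_actions, and the exact patterns and entity names are assumptions chosen for the example.

```python
import re

# Hypothetical action sets: name -> list of (regex, HTML entity, Unicode char).
ACTIONS = {
    "ellipsis": [(r"\.\.\.", "&hellip;", "\u2026")],
    "commercial": [
        (r"\(c\)", "&copy;", "\u00a9"),
        (r"\(r\)", "&reg;", "\u00ae"),
    ],
    "math": [(r"\+/-", "&plusmn;", "\u00b1")],
}


def apply_actions(text, action_names, use_unicode=False):
    """Apply the comma-separated action sets to `text`, in the given order."""
    for name in action_names.split(","):
        for pattern, entity, char in ACTIONS[name]:
            # Insert the entity by default, the character when asked to.
            text = re.sub(pattern, char if use_unicode else entity, text)
    return text
```

The comma-separated string mirrors the form the -a option takes; the use_unicode flag plays the role of -u.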

The files are assumed to be encoded using UTF-8 by default. You may change the encoding using the option -E <ENCODING> / --encoding <ENCODING>.

The script does not change the quotation marks inside HTML tags (e.g. around attribute values), of course. Likewise, the contents of several elements, such as <code> or <pre>, are skipped. You may change the list of elements whose contents shall not be processed using the option -s <ELEMENT_NAME>[,<ELEMENT_NAME>]* / --skip <ELEMENT_NAME>[,<ELEMENT_NAME>]*. The list of elements that are skipped by default is given in Appendix C.
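The idea of leaving tags and the contents of skipped elements alone can be sketched as follows. This is a simplified illustration under stated assumptions, not degrotesque's parser: the regular expressions, the function transform_text_only, and the transform callback are hypothetical, and comments or processing instructions such as <!-- --> and <?php ?> are not handled.

```python
import re

TAG = re.compile(r"(<[^>]*>)")                   # anything between < and >
NAME = re.compile(r"</?\s*([A-Za-z][\w-]*)")     # element name of a tag


def transform_text_only(html, transform, skip=("script", "code", "style", "pre")):
    """Apply `transform` to text outside tags and outside skipped elements."""
    out, open_skipped = [], []
    # Splitting on a capturing group yields text at even, tags at odd indices.
    for i, part in enumerate(TAG.split(html)):
        if i % 2 == 1:
            m = NAME.match(part)
            name = m.group(1).lower() if m else None
            if name in skip and not part.endswith("/>"):
                if part.startswith("</"):
                    if open_skipped and open_skipped[-1] == name:
                        open_skipped.pop()
                else:
                    open_skipped.append(name)
            out.append(part)                     # tags are copied verbatim
        elif open_skipped:
            out.append(part)                     # inside a skipped element
        else:
            out.append(transform(part))
    return "".join(out)
```

Attribute quotes survive because whole tags are copied unchanged; only the text between them is handed to the transform.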

After the actions have been applied to its contents, the file is saved. By default, a backup of the original file is saved under the same name with the suffix “.orig” appended. You may suppress the creation of these backup files using the option -B / --no-backup.
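The read, backup, and save cycle might look roughly like this. It is a sketch under assumptions: rewrite_file and its parameters are hypothetical names, and the transform argument stands in for whatever replacement step is applied.

```python
import shutil
from pathlib import Path


def rewrite_file(path, transform, encoding="utf-8", backup=True):
    """Rewrite `path` in place; optionally keep the original as `<name>.orig`."""
    path = Path(path)
    original = path.read_text(encoding=encoding)
    if backup:
        # "page.html" -> "page.html.orig", written next to the original
        shutil.copyfile(path, path.with_name(path.name + ".orig"))
    path.write_text(transform(original), encoding=encoding)
```

Passing backup=False corresponds to -B / --no-backup, and the encoding argument to -E / --encoding.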

Please note that “masks” is a special action set that disallows the application of some other actions so that, e.g., the dividers in ISBN numbers are not replaced by &ndash;. The masks action set is given in Appendix D.

Options

The script has the following options:

  • --input/-i <PATH>: the file or the folder to process
  • --recursive/-r: Set if the folder, if one is given, shall be processed recursively
  • --no-backup/-B: Set if no backup files shall be generated
  • --unicode/-u: When set, Unicode characters are used instead of HTML entities
  • --extensions/-e <EXTENSION>[,<EXTENSION>]*: The extensions of files that shall be processed
  • --encoding/-E <ENCODING>: The assumed encoding of the files
  • --skip/-s <ELEMENT_NAME>[,<ELEMENT_NAME>]*: Elements whose contents shall not be changed
  • --actions/-a <ACTION_NAME>[,<ACTION_NAME>]*: Name the actions that shall be applied
  • --help: Prints the help screen

Usage Examples

degrotesque -i my_page.html -a quotes.german

Replaces single and double quotes within the file “my_page.html” by their typographic German counterparts.

degrotesque -i my_folder -r --no-backup

Applies the default actions to all files with a matching extension in the folder “my_folder” and all of its subfolders. No backup files are generated.

Application Programming Interface — API

You may also embed degrotesque in your own applications. The usage is straightforward:

import degrotesque
# example input; any HTML string will do
plainHTML = '<p>"Well - that\'s not what I had expected."</p>'
# build the Degrotesque instance with default values
# (named "prettifier" so that it does not shadow the module)
prettifier = degrotesque.Degrotesque()
# apply degrotesque
prettyHTML = prettifier.prettify(plainHTML)

The default values can be replaced using some of the class's methods:

# change the actions to apply (by naming them)
# here: apply french quotes and math symbols
prettifier.setActions("quotes.french,math")
# change the elements whose contents shall be skipped
# here: skip the contents of "code",
#  "script", and "style" elements
prettifier.setToSkip("code,script,style")

You may also consult the degrotesque pydoc code documentation.

Further Documentation

Implementation Notes

  • I tried Genshi, BeautifulSoup, and lxml. All of them failed to keep the code unchanged. So the parser just skips HTML elements and the contents of some special elements, see above. This works in most cases.

Examples / Users

Change Log

degrotesque-2.0.6 (05.02.2023)

  • Patched documentation (return types)
  • Set proper formatting for readthedocs
  • It's not 2.0.4 due to caching by readthedocs

degrotesque-2.0.2 (04.02.2023)

  • Corrected installation and execution as a console script

degrotesque-2.0 (05.01.2023)

  • Changed the license to BSD.
  • Using GitHub Actions for testing on push instead of Travis CI
  • Cleaned up project tree
  • Added mkdocs documentation

Older Versions

Summary

Well, have fun. If you have any comments, ideas, or issues, please submit them to degrotesque's issue tracker on GitHub or drop me an e-mail.

Appendices

Appendix A: Default Extensions

Files with the following extensions are parsed by default:

  • html, htm, xhtml,
  • php, phtml, phtm, php2, php3, php4, php5,
  • asp,
  • jsp, jspx,
  • shtml, shtm, sht, stm,
  • vbhtml,
  • ppthtml,
  • ssp, jhtml

Appendix B: Named Actions

The following action sets are currently implemented.

Please note that the actions are implemented using regular expressions. For better readability, the regular expressions themselves are not shown below; only the visible changes are listed.

Action Name     From Opening String  From Closing String  To Opening String  To Closing String
quotes.english  '                    '
                "                    "
quotes.french   <                    >
                <<                   >>                   «                  »
quotes.german   '                    '
                "                    "
to_quotes       '                    '                    <q>                </q>
                "                    "                    <q>                </q>
                <<                   >>                   <q>                </q>
                <                    >                    <q>                </q>
commercial      (c)                                       ©
                (r)                                       ®
                (tm)
dashes          -
                <NUMBER>-<NUMBER>                         <NUMBER>–<NUMBER>
bullets         *
ellipsis        ...
apostrophe      '                                         '
math            +/-                                       ±
                1/2                                       ½
                1/4                                       ¼
                3/4                                       ¾
                ~
                !=
                <=
                >=
                <NUMBER>*<NUMBER>                         <NUMBER>×<NUMBER>
                <NUMBER>x<NUMBER>                         <NUMBER>×<NUMBER>
                <NUMBER>/<NUMBER>                         <NUMBER>÷<NUMBER>
dagger          **
                *
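
Since the regular expressions themselves are not shown, the following sketch only illustrates how a paired (opening/closing) rule and two number-dependent rules from the table could be written. The patterns and the names QUOTES_ENGLISH_DOUBLE, TIMES, NUMBER_RANGE, and apply_rule are assumptions for this example and may differ from what degrotesque really uses.

```python
import re

# Hypothetical rules: (compiled pattern, replacement with back-references).
# Paired rule: straight double quotes around a phrase become curly quotes.
QUOTES_ENGLISH_DOUBLE = (re.compile(r'"(.*?)"'), "\u201c\\1\u201d")
# Number-dependent rules: the sign is only replaced between two digits.
TIMES = (re.compile(r"(\d)[x*](\d)"), "\\1\u00d7\\2")
NUMBER_RANGE = (re.compile(r"(\d)-(\d)"), "\\1\u2013\\2")


def apply_rule(rule, text):
    """Apply one (pattern, replacement) rule to `text`."""
    pattern, replacement = rule
    return pattern.sub(replacement, text)
```

Capturing the neighbouring digits and writing them back is what keeps an "x" inside a word, or a hyphen between words, untouched.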

Appendix C: Skipped Elements

The contents of the following elements are not processed by default:

  • script
  • code
  • style
  • pre
  • ?
  • ?php
  • %
  • %=
  • %@
  • %--
  • %!
  • !--

Appendix D: Masking Action Set

The “masks” action set masks some patterns to prevent replacements. When a pattern matches, the matching string is kept unchanged. The patterns are listed below. Please note that the numbers in { } braces give the number of consecutive elements.

  • 978-<NUMBER>-<NUMBER>-<NUMBER>-<NUMBER>{1}<NO_NUMBER>: avoid ISBN replacement
  • 979-<NUMBER>-<NUMBER>-<NUMBER>-<NUMBER>{1}<NO_NUMBER>: avoid ISBN replacement
  • <NUMBER>-<NUMBER>-<NUMBER>-<NUMBER>{1}<NO_NUMBER>: avoid ISBN replacement
  • ISSN <NUMBER>{4}-<NUMBER>{4}: avoid ISSN replacement
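
The masking idea can be sketched as follows: first locate the spans that match a mask, then apply a replacement only to the text between them. The regular expressions approximate the patterns listed above and are assumptions, as are the names MASKS, NUMBER_RANGE, and replace_ranges; unlike <NO_NUMBER>, the lookahead used here also accepts the end of the text.

```python
import re

# Assumed approximations of the masks listed in this appendix.
MASKS = re.compile(
    r"97[89]-\d+-\d+-\d+-\d(?!\d)"   # 978-/979- ISBN with dividers
    r"|\d+-\d+-\d+-\d(?!\d)"         # ISBN without the 978/979 prefix
    r"|ISSN \d{4}-\d{4}"             # ISSN
)
NUMBER_RANGE = re.compile(r"(\d)-(\d)")


def replace_ranges(text):
    """Turn digit-digit hyphens into &ndash;, except inside masked spans."""
    out, pos = [], 0
    for m in MASKS.finditer(text):
        # Text before the masked span is processed normally.
        out.append(NUMBER_RANGE.sub(r"\1&ndash;\2", text[pos:m.start()]))
        out.append(m.group(0))       # masked text is kept verbatim
        pos = m.end()
    out.append(NUMBER_RANGE.sub(r"\1&ndash;\2", text[pos:]))
    return "".join(out)
```

With this, the dividers of an ISBN stay plain hyphens while a page range in the same sentence still receives an &ndash;.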

© Daniel Krajzewicz 2020–2023
