Fixes mojibake and other problems with Unicode, after the fact

Maintainers

jbalonso lumitim rspeer

Project description

ftfy: fixes text for you

>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง

The full documentation of ftfy is available at ftfy.readthedocs.org. It covers a lot more than this README.

Testimonials

“My life is livable again!” — @planarrowspace
“A handy piece of magic” — @simonw
“Saved me a large amount of frustrating dev work” — @iancal
“ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.” — Brennan Young
“I have no idea when I’m gonna need this, but I’m definitely bookmarking it.” — /u/ocrow

What it does

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:

>>> import ftfy
>>> ftfy.fix_text('âœ” No problems')
'✔ No problems'

Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.
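As a rough illustration of why the information survives, the sketch below reproduces the mix-up with only the Python standard library. It assumes the wrong codec was Windows-1252; ftfy itself detects the codec heuristically rather than being told which one was used.

```python
# Reproduce the mojibake from the example above, then undo it by hand.
original = "✔ No problems"

# Step 1: the text is correctly encoded as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Step 2: something decodes those bytes with the wrong codec
# (Windows-1252 is an assumption made for this sketch).
garbled = utf8_bytes.decode("windows-1252")

# Step 3: reversing the mistake recovers the original, because every
# byte is still represented by some character in the garbled string.
recovered = garbled.encode("windows-1252").decode("utf-8")

print(garbled)
print(recovered)
```

The manual reversal only works when you already know which codec caused the damage, which is the part ftfy works out for you.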

ftfy can fix multiple layers of mojibake simultaneously:

>>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
"The Mona Lisa doesn't have eyebrows."
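To see how several layers can pile up, here is a minimal sketch that applies the same wrong decoding three times and then peels the layers off again. The helper names `add_layer` and `remove_layer` are made up for this illustration and are not part of ftfy, and Windows-1252 is assumed as the wrong codec at every step.

```python
def add_layer(text: str) -> str:
    """Encode as UTF-8, then wrongly decode as Windows-1252 (one layer of mojibake)."""
    return text.encode("utf-8").decode("windows-1252")


def remove_layer(text: str) -> str:
    """Undo one layer, assuming it was produced by add_layer."""
    return text.encode("windows-1252").decode("utf-8")


# The original word uses a curly apostrophe (U+2019).
original = "doesn\u2019t"

layered = original
for _ in range(3):
    layered = add_layer(layered)

restored = layered
for _ in range(3):
    restored = remove_layer(restored)

print(layered)
print(restored)
```

Unlike this sketch, ftfy also uncurls the apostrophe by default, which is why its output above shows a plain ASCII quote.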

It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:

>>> ftfy.fix_text("l’humanitÃ©")
"l'humanité"

ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:

>>> ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
'à perturber la réflexion'
>>> ftfy.fix_text('Ã perturber la rÃ©flexion')
'à perturber la réflexion'
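The following sketch, using only the standard library, shows why this case needs special handling: once the non-breaking space is lost, simply reversing the wrong decoding no longer works. Windows-1252 is assumed as the wrong codec, and the `replace` call stands in for whatever process collapsed the whitespace.

```python
original = "à perturber la réflexion"

# 'à' is the UTF-8 byte pair C3 A0; read as Windows-1252, the second
# byte becomes U+A0, a non-breaking space.
garbled = original.encode("utf-8").decode("windows-1252")

# Simulate the extra damage: the non-breaking space becomes an ordinary
# space and merges with the space that follows it.
damaged = garbled.replace("\xa0 ", " ")

# A naive reversal now fails, because byte C3 is followed by 0x20,
# which is not a valid UTF-8 continuation byte.
try:
    damaged.encode("windows-1252").decode("utf-8")
    naive_reversal_worked = True
except UnicodeDecodeError:
    naive_reversal_worked = False

print(repr(garbled))
print(repr(damaged))
print(naive_reversal_worked)
```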

ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:

>>> # by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
>>> ftfy.fix_text('P&EACUTE;REZ')
'PÉREZ'
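For comparison, this small sketch shows how the standard library's `html.unescape` treats the same two spellings. It is offered only as a contrast with ftfy's more forgiving behavior, not as a description of how ftfy is implemented.

```python
import html

# The correctly capitalized entity is decoded by the standard library.
correct = html.unescape("P&Eacute;REZ")

# The all-caps spelling is not a named HTML5 entity, so it is left as-is.
miscapitalized = html.unescape("P&EACUTE;REZ")

print(correct)
print(miscapitalized)
```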

These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives -- it should never change correctly-decoded text to something else.

The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.

>>> ftfy.fix_text('IL Y MARQUÉ…')
'IL Y MARQUÉ…'
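The alternative reading mentioned above can be checked by hand with the standard library. This sketch only demonstrates that the re-decoding is technically possible; deciding that it should not be applied is ftfy's job.

```python
text = "IL Y MARQUÉ…"

# 'É' and '…' are the Windows-1252 bytes C9 and 85, which together happen
# to form a valid two-byte UTF-8 sequence for U+0245 (turned V).
reinterpreted = text.encode("windows-1252").decode("utf-8")

print(reinterpreted)
```

Because the re-decoded result is less plausible than the input, leaving the text alone is the safer choice.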

Installing

ftfy is a Python 3 package that can be installed using pip or uv pip:

pip install ftfy

(Or use pip3 install ftfy on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

If you use poetry, you can use ftfy as a dependency in the usual way (such as poetry add ftfy).

Local development

ftfy is developed using uv. You can build a virtual environment with its local dependencies by running uv venv, and test it with uv run pytest.

Who maintains ftfy?

I'm Robyn Speer, also known as Elia Robyn Lake. You can find my projects on GitHub and my posts on my own blog.

Citing ftfy

ftfy has been used as a crucial data processing step in major NLP research.

It's important to give credit appropriately to everyone whose work you build on in research. This includes software, not just high-status contributions such as mathematical models. All I ask when you use ftfy for research is that you cite it.

ftfy has a citable record on Zenodo. A citation of ftfy may look like this:

Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
http://doi.org/10.5281/zenodo.2591652

In BibTeX format, the citation is:

@misc{speer-2019-ftfy,
  author       = {Robyn Speer},
  title        = {ftfy},
  note         = {Version 5.5},
  year         = 2019,
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.2591652},
  url          = {https://doi.org/10.5281/zenodo.2591652}
}

Important license clarifications

If you do not follow ftfy's license, you do not have a license to ftfy.

This sounds obvious and tautological, but there are people who think open source licenses mean that they can just do what they want, especially in the field of generative AI. It's a permissive license but you still have to follow it. The Apache license is the only thing that gives you permission to use and copy ftfy; otherwise, all rights are reserved.

If you use or distribute ftfy, you must follow the terms of the Apache license, including that you must attribute the author of ftfy (Robyn Speer) correctly.

You may not make a derived work of ftfy that obscures its authorship, such as by putting its code in an AI training dataset, including the code in AI training at runtime, or using a generative AI that copies code from such a dataset.

At my discretion, I may notify you of a license violation, and give you a chance to either remedy it or delete all copies of ftfy in your possession.

Project details


Release history

Version	Release date
6.3.1	Oct 26, 2024 (this version)
6.3.0	Oct 10, 2024
6.2.3	Aug 6, 2024
6.2.2	Aug 6, 2024
6.2.1	Aug 6, 2024
6.2.0	Mar 15, 2024
6.1.3	Nov 21, 2023
6.1.1	Feb 9, 2022
6.1.0.post1	Feb 9, 2022
6.1.0	Feb 9, 2022
6.0.3	May 24, 2021
6.0.1	Apr 16, 2021
6.0	Apr 9, 2021
5.9	Feb 11, 2021
5.8	Jul 20, 2020
5.7	Feb 18, 2020
5.6.1	Feb 18, 2020
5.6	Aug 7, 2019
5.5.1	Jan 22, 2019
5.5.0	Sep 6, 2018
5.4.1	Jun 15, 2018
5.4.0	Jun 7, 2018
5.3.0	Jan 26, 2018
5.2.0	Nov 27, 2017
5.1.1	May 15, 2017
5.1	Apr 11, 2017
5.0.2	Mar 24, 2017
5.0.1	Mar 10, 2017
5.0	Mar 9, 2017
4.4.3	May 15, 2017
4.4.2	Mar 24, 2017
4.4.1	Mar 10, 2017
4.4	Mar 9, 2017
4.3.1	Jan 17, 2017
4.2.0	Sep 28, 2016
4.1.1	Apr 13, 2016
4.1.0	Feb 25, 2016
4.0.0	May 11, 2015
3.4.0	Jan 15, 2015
3.3.0	Aug 16, 2014
3.2.0	Jun 27, 2014
3.1.3	May 15, 2014
3.1.2	Jan 29, 2014
3.1.1	Jan 29, 2014
3.1.0	Jan 29, 2014
3.0.5	Nov 1, 2013
3.0.4	Oct 1, 2013
3.0.3	Sep 9, 2013
3.0.2	Sep 4, 2013
3.0.1	Aug 30, 2013
3.0	Aug 26, 2013
2.0.2	Jun 20, 2013
2.0.1	Mar 19, 2013
2.0	Jan 30, 2013
1.0	Aug 24, 2012

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ftfy-6.3.1.tar.gz (308.9 kB)

Uploaded Oct 26, 2024 Source

Built Distribution

ftfy-6.3.1-py3-none-any.whl (44.8 kB)

Uploaded Oct 26, 2024 Python 3

File details

Details for the file ftfy-6.3.1.tar.gz.

File metadata

Download URL: ftfy-6.3.1.tar.gz
Upload date: Oct 26, 2024
Size: 308.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for ftfy-6.3.1.tar.gz
Algorithm	Hash digest
SHA256	`9b3c3d90f84fb267fe64d375a07b7f8912d817cf86009ae134aa03e1819506ec`
MD5	`8951f7ffa3aeb09c8cb77e29321a92c1`
BLAKE2b-256	`a5d38650919bc3c7c6e90ee3fa7fd618bf373cbbe55dff043bd67353dbb20cd8`
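One way to use these digests is to hash a downloaded file locally and compare the result. The sketch below is a minimal example using the standard library; the helper name `sha256_of_file` and the assumption that the archive sits in the current directory are illustrative, not something this page prescribes.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for ftfy-6.3.1.tar.gz.
EXPECTED_SHA256 = "9b3c3d90f84fb267fe64d375a07b7f8912d817cf86009ae134aa03e1819506ec"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed location of the downloaded archive (hypothetical path).
archive = Path("ftfy-6.3.1.tar.gz")
if archive.exists():
    if sha256_of_file(archive) == EXPECTED_SHA256:
        print("SHA256 matches")
    else:
        print("SHA256 MISMATCH")
```

pip can perform a similar check itself when a requirements file pins hashes, so a manual comparison like this is mostly useful for files fetched by other means.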

File details

Details for the file ftfy-6.3.1-py3-none-any.whl.

File metadata

Download URL: ftfy-6.3.1-py3-none-any.whl
Upload date: Oct 26, 2024
Size: 44.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for ftfy-6.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7c70eb532015cd2f9adb53f101fb6c7945988d023a085d127d1573dc49dd0083`
MD5	`281b4d6ad88248f40cd28e125aac7438`
BLAKE2b-256	`ab6e81d47999aebc1b155f81eca4477a616a70f238a2549848c38983f3c22a82`
