
Easily get clean data, direct from Python source

Project description


Usage

data = lines("""
    There was an old woman who lived in a shoe.
    She had so many children, she didn't know what to do;
    She gave them some broth without any bread;
    Then whipped them all soundly and put them to bed.
""")

will result in:

['There was an old woman who lived in a shoe.',
 "She had so many children, she didn't know what to do;",
 'She gave them some broth without any bread;',
 'Then whipped them all soundly and put them to bed.']

Note that the “extra” newlines and leading spaces have been taken care of and discarded.
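For comparison, much of this cleanup can be approximated with the standard library alone, which is the wheel textdata saves you from reinventing. A rough sketch (the helper name clean_lines is hypothetical, and this covers only the default behavior, not textdata's options):

```python
import textwrap

def clean_lines(text):
    """Hypothetical stand-in for textdata.lines with default options:
    remove common indentation, strip trailing space, drop blank lines."""
    dedented = textwrap.dedent(text)
    return [line.rstrip() for line in dedented.splitlines() if line.strip()]

data = clean_lines("""
    There was an old woman who lived in a shoe.
    She had so many children, she didn't know what to do;
""")
```

Even this small sketch shows why a packaged, tested version is nicer than rewriting the same three-step dance in every program.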

Discussion

One often needs to state data in program source. Python, however, needs its lines indented just so. Multi-line strings therefore often have extra spaces and newline characters you didn’t really want. Many developers “fix” this by using Python list literals, but that has its own problems: it’s tedious, more verbose, and often less legible.

The textdata package makes it easy to have clean, nicely-whitespaced data specified in your program, but to get the data without extra whitespace cluttering things up. It’s permissive of the layouts needed to make Python code look and work right, without reflecting those requirements in the resulting data.

Python string methods give easy ways to clean text up, but it’s no joy reinventing that particular wheel every time you need it–especially since many of the details are nitsy, low-level, and a little tricky. textdata is a “just give me the text!” module that replaces a la carte text cleanups with simple, well-tested code that doesn’t lengthen your program or require constant wheel-reinvention.

Text

In addition to lines, textlines works similarly and takes the same parameters, but joins the resulting lines into a unified string:

data = textlines("""
    There was an old woman who lived in a shoe.
    She had so many children, she didn't know what to do;
    She gave them some broth without any bread;
    Then whipped them all soundly and put them to bed.
""")

Yields:

"There was an old woman who lived in a shoe.\nShe ... to bed."
# where the ... abbreviates exactly the characters you'd expect

Note that while textlines returns a single string, it maintains the (useful) newlines. Its result is still line-oriented. If you want to elide the newlines, use textlines(text, join=' ') and the newline characters will be replaced with spaces.
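The effect of the join choice can be sketched with plain str.join (the list below is hand-written for illustration):

```python
# Cleaned lines, as the lines() call above would produce them
cleaned = ['There was an old woman who lived in a shoe.',
           "She had so many children, she didn't know what to do;"]

as_text = '\n'.join(cleaned)  # textlines default: still line-oriented
as_flat = ' '.join(cleaned)   # textlines(text, join=' '): newlines elided
```

The only difference between the two results is the separator character between the lines.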

API Options

Both lines and textlines provide routinely-needed cleanups:

  • remove starting and ending blank lines (which are usually due to Python source formatting)

  • remove blank lines internal to your text block

  • remove common indentation

  • strip leading/trailing spaces other than the common prefix (leading spaces removed by request, trailing by default)

  • strip any comments from the end of lines

  • join lines together with your choice of separator string

lines(text, noblanks=True, dedent=True, lstrip=False, rstrip=True, cstrip=True, join=False)

Returns text as a series of cleaned-up lines.

  • text is the text to be processed.

  • noblanks => all blank lines are eliminated, not just starting and ending ones. (default True).

  • dedent => strip a common prefix (usually whitespace) from each line (default True).

  • lstrip => strip all left (leading) space from each line (default False). Note that lstrip and dedent are mutually exclusive ways of handling leading space.

  • rstrip => strip all right (trailing) space from each line (default True)

  • cstrip => strip comments (from # to the end of each line) (default True)

  • join => either False (do nothing), True (concatenate lines with \n), or a string that will be used to join the resulting lines (default False)

textlines(text, noblanks=True, dedent=True, lstrip=False, rstrip=True, cstrip=True, join='\n')

Does the same helpful cleanups as lines(), but returns result as a single string, with lines separated by newlines (by default) and without a trailing newline.
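The interplay of these options can be sketched in a few lines of stdlib Python. This is an approximation of the documented behavior, not textdata's actual implementation (the name lines_sketch is hypothetical):

```python
import re
import textwrap

def lines_sketch(text, noblanks=True, dedent=True, lstrip=False,
                 rstrip=True, cstrip=True, join=False):
    """Approximation of textdata.lines' documented options."""
    if cstrip:
        text = re.sub(r'#.*', '', text)   # drop '#' comments to end of line
    if dedent and not lstrip:
        text = textwrap.dedent(text)      # remove common leading whitespace
    result = text.splitlines()
    if lstrip:
        result = [ln.lstrip() for ln in result]
    if rstrip:
        result = [ln.rstrip() for ln in result]
    if noblanks:
        result = [ln for ln in result if ln.strip()]
    else:
        # keep internal blanks; trim only leading/trailing blank lines
        while result and not result[0].strip():
            result.pop(0)
        while result and not result[-1].strip():
            result.pop()
    if join is False:
        return result
    sep = '\n' if join is True else join
    return sep.join(result)
```

Note the ordering: comments are stripped before dedenting, so a comment-only line becomes blank and does not disturb the common-prefix calculation.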

Words

Often the data you need to encode is almost, but not quite, a series of words. A list of names, a list of color names–values that are mostly single words, but sometimes have embedded spaces. textdata has you covered:

>>> words(' Billy Bobby "Mr. Smith" "Mrs. Jones"  ')
['Billy', 'Bobby', 'Mr. Smith', 'Mrs. Jones']

Embedded quotes (either single or double) can be used to construct “words” (or phrases) containing whitespace (including tabs and newlines).

words isn’t a full parser, so there are some extreme cases like arbitrarily nested quotations that it can’t handle. It isn’t confused, however, by embedded apostrophes and other common gotchas. For example:

>>> words("don't be blue")
["don't", "be", "blue"]

>>> words(""" "'this'" works '"great"' """)
["'this'", 'works', '"great"']

words is a good choice for situations where you want a compact, friendly, whitespace-delimited data representation–but a few of your entries need more than just str.split().
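One way to approximate this behavior is a single regular expression that prefers a quoted phrase over a bare token at each position. This is a sketch under that assumption (words_sketch is a hypothetical name, and it handles only the simple cases shown above, not everything textdata's words does):

```python
import re

# Try a double-quoted phrase, then a single-quoted phrase, then a bare token.
# A bare token may contain an apostrophe, since \S+ only applies when the
# token doesn't *start* with a quote.
_TOKEN = re.compile(r'"([^"]*)"|\'([^\']*)\'|(\S+)')

def words_sketch(text):
    """Rough stand-in for textdata.words: whitespace split that honors quotes."""
    out = []
    for m in _TOKEN.finditer(text):
        dq, sq, bare = m.groups()
        out.append(dq if dq is not None else sq if sq is not None else bare)
    return out
```

Because the quoted alternatives are tried first only when a token begins with a quote character, "don't" passes through untouched while "Mr. Smith" is kept as one phrase.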

Comments

If you need to embed more than a few lines of immediate data in your program, you may want some comments to explain what’s going on. By default, textdata strips out Python-like comments (from # to end of line). So:

exclude = words("""
    __pycache__ *.pyc *.pyo     # compilation artifacts
    .hg* .git*                  # repository artifacts
    .coverage                   # code tool artifacts
    .DS_Store                   # platform artifacts
""")

Yields:

['__pycache__', '*.pyc', '*.pyo', '.hg*', '.git*',
 '.coverage', '.DS_Store']

You could of course write it out as:

exclude = [
    '__pycache__', '*.pyc', '*.pyo',   # compilation artifacts
    '.hg*', '.git*',                   # repository artifacts
    '.coverage',                       # code tool artifacts
    '.DS_Store'                        # platform artifacts
]

But you’d need more nitsy punctuation, and it’s less compact.

If however you want to capture comments, set cstrip=False (though that is probably more useful with the lines and textlines APIs than for words).
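The comment-stripping step itself reduces to a one-line regex. A sketch of what cstrip=True does (not textdata's actual code; note this naive version would also strip a # inside a quoted word):

```python
import re

def strip_comments(text):
    """Remove '#' through end of line, as cstrip=True does by default."""
    return re.sub(r'#.*', '', text)
```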

Paragraphs

Sometimes you want to collect “paragraphs”–contiguous runs of text lines that are delineated by blank lines. Markdown and RST document formats, for example, use this convention. textdata has a paras routine to extract such paragraphs:

>>> rhyme = """
    Hey diddle diddle,

    The cat and the fiddle,
    The cow jumped over the moon.
    The little dog laughed,
    To see such sport,

    And the dish ran away with the spoon.
"""
>>> paras(rhyme)
[['Hey diddle diddle,'],
 ['The cat and the fiddle,',
  'The cow jumped over the moon.',
  'The little dog laughed,',
  'To see such sport,'],
 ['And the dish ran away with the spoon.']]

Or if you’d like paras, but each paragraph in a single string:

>>> paras(rhyme, join="\n")
['Hey diddle diddle,',
 'The cat and the fiddle,\nThe cow jumped over the moon.\nThe little dog laughed,\nTo see such sport,',
 'And the dish ran away with the spoon.']

Setting join to a space will of course concatenate the lines of each paragraph with a space. This is useful for converting line-oriented paragraphs into single (potentially very long) lines per paragraph, a format handy for cut-and-pasting into many editors, Web text-entry boxes, and email systems.

On the off chance you want to preserve the exact intra-paragraph spacing, setting keep_blanks=True will accomplish that.
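Grouping lines into blank-line-delimited runs is a natural fit for itertools.groupby. A sketch of the paras behavior under that assumption (paras_sketch is a hypothetical name; it simply strips each line rather than offering the full set of options):

```python
from itertools import groupby

def paras_sketch(text, join=None):
    """Rough stand-in for textdata.paras: group cleaned lines into
    paragraphs, splitting on blank lines."""
    stripped = [line.strip() for line in text.splitlines()]
    # bool('') is False, so consecutive blank lines form their own groups,
    # which we discard; the remaining groups are the paragraphs.
    paragraphs = [list(group) for nonblank, group in
                  groupby(stripped, key=bool) if nonblank]
    if join is not None:
        return [join.join(p) for p in paragraphs]
    return paragraphs
```

As with paras itself, passing join="\n" collapses each paragraph into a single string while preserving its internal line breaks.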

Unicode and Encodings

textdata doesn’t have any unique friction with Unicode characters and encodings. That said, any time you use Unicode characters in Python source files, care is warranted–especially in Python 2!

If your text includes Unicode, in Python 2 make sure to mark literal strings with a “u” prefix: u"". You can also do this in Python 3.3 and following. Sadly, there was a dropout of compatibility in early Python 3 releases, making it much harder to maintain a unified source base with them in the mix. (A compatibility function such as six.u from six can help alleviate much–though certainly not all–of the pain.)

It can also be helpful to declare your source encoding: put a specially-formatted comment as the first or second line of the source code:

# -*- coding: <encoding name> -*-

This will usually be # -*- coding: utf-8 -*-, but other encodings are possible. Python 3 defaults to a UTF-8 encoding, but Python 2 assumes ASCII.

Notes

  • Version 1.3 adds a paragraph constructor, paras.

  • Version 1.2 adds comment stripping. Packaging and testing also tweaked.

  • Version 1.1.5 adds the bdist_wheel packaging format.

  • Version 1.1.3 switches from BSD to Apache License 2.0 and integrates tox testing with setup.py.

  • Version 1.1 added the words constructor.

  • Common line prefix is now computed without considering blank lines, so blank lines need not have any indentation on them just to “make things work.”

  • The tricky case where all lines have a common prefix, but it’s not entirely composed of whitespace, is now properly handled. This is useful for lines that are already “quoted,” such as with leading "|" or ">" symbols (common in Markdown and old-school email usage styles).

  • textlines() is somewhat superfluous now that lines() has a join kwarg. But you may prefer it for the implicit indication that it’s turning lines into text.

  • It’s tempting to define a constant such as Dedent that might be the default for the lstrip parameter, instead of having separate dedent and lstrip Booleans. The more I use singleton classes in Python as designated special values, the more useful they seem.

  • Automated multi-version testing managed with pytest and tox. Continuous integration testing with Travis-CI. Packaging linting with pyroma.

    Successfully packaged for, and tested against, all late-model versions of Python: 2.6, 2.7, 3.2, 3.3, 3.4, and 3.5 pre-release (3.5.0b3) as well as PyPy 2.6.0 (based on 2.7.9) and PyPy3 2.4.0 (based on 3.2.5).

  • The author, Jonathan Eunice or @jeunice on Twitter welcomes your comments and suggestions.

Installation

To install or upgrade to the latest version:

pip install -U textdata

To easy_install under a specific Python version (3.3 in this example):

python3.3 -m easy_install --upgrade textdata

(You may need to prefix these with sudo to authorize installation. In environments without super-user privileges, you may want to use pip’s --user option, to install only for a single user, rather than system-wide.)
