Specialized & performant CSV readers, writers and enrichers for Python.

Casanova

If you often read CSV files with Python, you will quickly notice that csv.DictReader, while more comfortable, remains much slower than csv.reader:

# To read a 1.5G CSV file:
csv.reader: 24s
csv.DictReader: 84s
casanova.reader: 25s
csvmonkey: 3s
casanova_monkey.reader: 4s

Casanova is therefore an attempt to match csv.reader performance while keeping a comfortable interface that is still aware of headers.
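The gap between csv.reader and csv.DictReader can be observed with the standard library alone. The snippet below is a minimal sketch on a small in-memory file (the data and the function names are made up for illustration); absolute timings will differ from the figures above and depend on your machine.

```python
import csv
import io
import timeit

# Build a small in-memory CSV (hypothetical data, far smaller than a 1.5G file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "surname"])
for i in range(20_000):
    writer.writerow([f"name{i}", f"surname{i}"])
DATA = buffer.getvalue()


def read_with_reader():
    reader = csv.reader(io.StringIO(DATA))
    headers = next(reader)
    name_pos = headers.index("name")  # resolve the column position once
    return sum(1 for row in reader if row[name_pos])


def read_with_dictreader():
    reader = csv.DictReader(io.StringIO(DATA))
    return sum(1 for row in reader if row["name"])


# Time three full passes with each approach.
for fn in (read_with_reader, read_with_dictreader):
    elapsed = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {elapsed:.3f}s")
```

Resolving the column position once and indexing plain lists is the pattern casanova's reader makes convenient.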

Casanova is thus a good fit for you if you need to:

Stream large CSV files without running out of memory
Enrich those CSV files by outputting a similar file, while adding, filtering and editing cells
Resume said enrichment if your process exited
Do so in a threadsafe fashion, and resume even if your output is not in the same order as the input

Installation

You can install casanova with pip with the following command:

pip install casanova

If you want to be able to use the faster casanova_monkey namespace relying on the fantastic csvmonkey library, you will also need to install it alongside:

pip install csvmonkey
# If this fails, typically on ubuntu, run the following:
sudo apt-get install clang
CC=clang pip install csvmonkey

Alternatively, you can install casanova together with csvmonkey through the monkey extra:

pip install casanova[monkey]

reader

Straightforward CSV reader exposing some information and indices about the given file's headers.

import casanova

with open('./people.csv') as f:

  # Creating a reader
  reader = casanova.reader(f)

  # Getting header information
  reader.fieldnames
  >>> ['name', 'surname']

  reader.pos
  >>> HeadersPositions(name=0, surname=1)

  name_pos = reader.pos.name
  name_pos = reader.pos['name']

  'name' in reader.pos
  >>> True

  # Iterating over the rows
  for row in reader:
    name = row[name_pos] # it's better to cache your pos outside the loop
    name = row[reader.pos.name] # this works, but is slower

  # Interested in a single column?
  for name in reader.cells('name'):
    print(name)

  # Interested in several columns (handy but has a slight perf cost!)
  for name, surname in reader.cells(['name', 'surname']):
    print(name, surname)

  # Also need the current row when iterating on cells?
  for row, (name, surname) in reader.cells(['name', 'surname'], with_rows=True):
    print(row, name, surname)

  # No headers? No problem.
  reader = casanova.reader(f, no_headers=True)

# Note that you can also create a reader from a path
with casanova.reader('./people.csv') as reader:
  pass

# And if you need exotic encodings
with casanova.reader('./people.csv', encoding='latin1') as reader:
  pass

# Readers can also be closed if you want to avoid context managers
reader.close()
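To make the idea of cached header positions concrete, here is a minimal sketch built on the standard library only. It is not casanova's implementation: the read_with_positions helper and the sample data are hypothetical, and the sketch assumes headers are unique, valid Python identifiers.

```python
import csv
import io
from collections import namedtuple

# Hypothetical sample data standing in for ./people.csv
SAMPLE = "name,surname\nJohn,Doe\nMary,Sue\n"


def read_with_positions(f):
    """Return (pos, rows): a namedtuple of header positions and the row iterator."""
    reader = csv.reader(f)
    fieldnames = next(reader)
    # Assumes headers are unique and valid Python identifiers.
    HeadersPositions = namedtuple("HeadersPositions", fieldnames)
    pos = HeadersPositions(*range(len(fieldnames)))
    return pos, reader


pos, rows = read_with_positions(io.StringIO(SAMPLE))
name_pos = pos.name  # looked up once, outside the loop
names = [row[name_pos] for row in rows]
print(pos)    # HeadersPositions(name=0, surname=1)
print(names)  # ['John', 'Mary']
```

Rows stay plain lists, so each access in the loop is a single integer index rather than a dictionary lookup.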

Counting number of rows in a CSV file

To do so quickly, you can use the static count method of casanova.reader.

import casanova

count = casanova.reader.count('./people.csv')

# You can also stop reading the file if you go beyond a number of rows
count = casanova.reader.count('./people.csv', max_rows=100)
>>> None # if the file has more than 100 rows
>>> 34   # else the actual count
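The early-exit behaviour of max_rows can be sketched with the standard library. The count_rows function below is a hypothetical stand-in, not casanova's code, and it assumes the header line is not counted as a row.

```python
import csv
import io


def count_rows(f, max_rows=None):
    """Count data rows (header excluded); return None once max_rows is exceeded."""
    reader = csv.reader(f)
    next(reader, None)  # skip the header row, if any
    count = 0
    for _ in reader:
        count += 1
        if max_rows is not None and count > max_rows:
            return None  # stop early instead of reading the whole file
    return count


sample = "name,surname\n" + "John,Doe\n" * 34
print(count_rows(io.StringIO(sample)))                # 34
print(count_rows(io.StringIO(sample), max_rows=100))  # 34
print(count_rows(io.StringIO(sample), max_rows=10))   # None
```

Stopping as soon as the limit is crossed is what makes the check cheap on very large files.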

casanova_monkey

import casanova_monkey

# NOTE: to rely on csvmonkey you will need to open the file in binary mode (e.g. "rb")!
with open('./people.csv', 'rb') as f:
  reader = casanova_monkey.reader(f)

  # For the lazy, slightly faster version
  reader = casanova_monkey.reader(f, lazy=True)

Arguments

file file|path: file object to read or path to open.
no_headers ?bool [False]: whether your CSV file is headless.
lazy ?bool [False]: only for casanova_monkey, whether to yield csvmonkey's raw lazy-decoding items or cast them to lists for better compatibility.

Attributes

fieldnames list: field names in order.
pos int|namedtuple: header positions object.

enricher

The enricher is basically a smart combination of a csv.reader and a csv.writer. It can be used to transform a given CSV file: you can edit existing cells, add new ones and select which input columns to keep in the output, all while remaining as performant as possible.

What's more, casanova's enrichers are automatically resumable, meaning that if your process exits for whatever reason, it is easy to restart where you left off.

Also, if you need to output lines in an arbitrary order, typically when performing tasks in a multithreaded fashion (e.g. when fetching a large number of web pages), casanova exports a threadsafe version of its enricher. This enricher is also resumable thanks to a data structure you can read about in this blog post.

Resuming typically requires O(n) time, n being the number of lines already done, but only consumes amortized O(1) memory.

import casanova

with open('./people.csv') as f, \
     open('./enriched-people.csv', 'w') as of:
  enricher = casanova.enricher(f, of)

  # The enricher inherits from casanova.reader
  enricher.pos
  >>> HeadersPositions(name=0, surname=1)

  # You can iterate over its rows
  name_pos = enricher.pos.name
  for row in enricher:

    # Editing a cell, so that everyone is called John
    row[name_pos] = 'John'
    enricher.writerow(row)

  # Want to add columns?
  enricher = casanova.enricher(f, of, add=['age', 'hair'])

  for row in enricher:
    enricher.writerow(row, ['34', 'blond'])

  # Want to keep only some columns from input?
  enricher = casanova.enricher(f, of, add=['age'], keep=['surname'])

  for row in enricher:
    enricher.writerow(row, ['45'])

  # You can of course still use #.cells
  for row, name in enricher.cells('name', with_rows=True):
    print(row, name)
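For readers curious about what the add and keep options amount to, the following is a rough standard-library sketch of the same reader-plus-writer idea. The enrich generator and its write callback are hypothetical names chosen for illustration; casanova's actual internals may differ.

```python
import csv
import io


def enrich(input_file, output_file, add=(), keep=None):
    """Yield (row, write) pairs; write(row, additions) emits one output line."""
    reader = csv.reader(input_file)
    writer = csv.writer(output_file, lineterminator="\n")
    headers = next(reader)
    # Positions of the input columns that survive in the output
    if keep is None:
        kept = list(range(len(headers)))
    else:
        kept = [headers.index(column) for column in keep]
    writer.writerow([headers[i] for i in kept] + list(add))

    def write(row, additions=()):
        writer.writerow([row[i] for i in kept] + list(additions))

    for row in reader:
        yield row, write


source = io.StringIO("name,surname\nJohn,Doe\nMary,Sue\n")
target = io.StringIO()
for row, write in enrich(source, target, add=["age"], keep=["surname"]):
    write(row, ["45"])
print(target.getvalue())
```

The kept positions are computed once from the headers, so each written row only costs a list comprehension.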

Arguments

input_file file|str: file object to read or path to open.
output_file file: file object to write.
no_headers ?bool [False]: whether your CSV file is headless.
add ?iterable<str|int>: names of columns to add to output.
keep ?iterable<str|int>: names of columns to keep from input.
resumable ?bool [False]: whether the enricher should be able to resume.
listener ?callable: a function listening to the enricher's events.

Resuming an enricher

import casanova

# NOTE: to be able to resume you will need to open the output file with "a+"
with open('./people.csv') as f, \
     open('./enriched-people.csv', 'a+') as of:

  # This will automatically start where it stopped last time
  enricher = casanova.enricher(f, of, resumable=True)

  for row in enricher:
    row[1] = 'John'
    enricher.writerow(row)

  # You can also listen to events if you need to advance loading bars etc.
  def listener(event, row):
    print(event, row)

  enricher = casanova.enricher(f, of, resumable=True, listener=listener)

  # Want more control over resuming?
  enricher = casanova.enricher(f, of, resumable=True, auto_resume=False)

  # You will then need to call #.resume yourself
  enricher.should_resume
  >>> True

  enricher.resume()

  # Knowing how many lines were already processed
  enricher.already_done_count
  >>> 45
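The principle behind resuming an in-order enrichment can be sketched as follows: count the lines already present in the output, then skip as many lines of the input. This is an illustration under simplifying assumptions (one output line per input line, no truncated last line after a crash), and resume_enrichment is a hypothetical helper rather than part of casanova.

```python
import csv
import io


def resume_enrichment(input_file, output_file, transform):
    """Append transformed rows to output_file, skipping those already written."""
    output_file.seek(0)
    out_reader = csv.reader(output_file)
    has_header = next(out_reader, None) is not None
    already_done = sum(1 for _ in out_reader)  # O(n) time, O(1) memory

    reader = csv.reader(input_file)
    headers = next(reader)
    output_file.seek(0, io.SEEK_END)  # continue writing at the end
    writer = csv.writer(output_file, lineterminator="\n")
    if not has_header:
        writer.writerow(headers)
    for i, row in enumerate(reader):
        if i < already_done:
            continue  # this line was processed by a previous run
        writer.writerow(transform(row))
    return already_done


source = "name,surname\nAda,Lovelace\nAlan,Turing\nGrace,Hopper\n"
# Pretend an earlier run stopped after writing one line.
output = io.StringIO("name,surname\nJohn,Lovelace\n")
done = resume_enrichment(io.StringIO(source), output, lambda row: ["John", row[1]])
print(done)  # 1
print(output.getvalue())
```

With real files, the output would be opened in "a+" mode as in the example above, so that it can be both read and appended to.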

Threadsafe version

To be safely resumable, the threadsafe version needs an index column to be added to the output so that it can make sense of what was already done. Its writerow method is therefore a bit different: it takes an additional argument, the original index of the row you are enriching.

To help you do so, all the enricher's iteration methods yield the index alongside the row.

Note finally that resuming is only possible if one line in the input is meant to produce exactly one line in the output.

import casanova

with open('./people.csv') as f, \
     open('./enriched-people.csv', 'w') as of:

  enricher = casanova.threadsafe_enricher(f, of, add=['age', 'hair'])

  for index, row in enricher:
    enricher.writerow(index, row, ['67', 'blond'])

Threadsafe arguments

index_column ?str [index]: name of the index column.
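The role of the index column can be illustrated with a small standard-library sketch in which worker threads write lines in whatever order they finish. Everything here is an assumption made for illustration: the enrich_concurrently helper, the position of the index column, and the plain set used to remember finished indices (which takes O(n) memory, unlike the structure casanova relies on).

```python
import csv
import io
import threading
from concurrent.futures import ThreadPoolExecutor


def enrich_concurrently(input_file, output_file, work, add=("age",), index_column="index"):
    """Write one output line per input line, skipping indices already present."""
    # Collect the indices written by a previous run, if any.
    output_file.seek(0)
    out_reader = csv.reader(output_file)
    out_headers = next(out_reader, None)
    done = set()
    if out_headers is not None:
        index_pos = out_headers.index(index_column)
        done = {int(row[index_pos]) for row in out_reader}
    output_file.seek(0, io.SEEK_END)

    reader = csv.reader(input_file)
    writer = csv.writer(output_file, lineterminator="\n")
    headers = next(reader)
    if out_headers is None:
        writer.writerow([index_column] + headers + list(add))

    lock = threading.Lock()

    def task(index, row):
        additions = work(row)  # slow part, runs outside the lock
        with lock:  # only one thread writes at a time
            writer.writerow([index] + row + additions)

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(task, index, row)
            for index, row in enumerate(reader)
            if index not in done
        ]
        for future in futures:
            future.result()  # surface exceptions raised in workers


source = "name,surname\nAda,Lovelace\nAlan,Turing\nGrace,Hopper\n"
# Pretend an earlier run only finished the line with index 1.
output = io.StringIO("index,name,surname,age\n1,Alan,Turing,41\n")
enrich_concurrently(io.StringIO(source), output, lambda row: [str(len(row[0]))])
written = sorted(output.getvalue().splitlines()[1:])
print(written)
```

Because each output line carries its original index, the order in which threads finish does not matter when working out what remains to be done.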

casanova_monkey

import casanova_monkey

# NOTE: csvmonkey needs the input file to be opened in binary mode
with open('./people.csv', 'rb') as f, \
     open('./enriched-people.csv', 'w') as of:

  enricher = casanova_monkey.enricher(f, of)
  enricher = casanova_monkey.threadsafe_enricher(f, of)

reverse_reader

casanova's reverse reader lets you read a CSV file backwards while still parsing its headers first. It may look silly, but it is very useful when you need to read the last lines of a CSV file in constant time and memory, for instance when resuming some process.

It is basically identical to casanova.reader except lines will be yielded in reverse.

import casanova

with open('./people.csv', 'rb') as f:
  reader = casanova.reverse_reader(f)

  next(reader)
  >>> ['Mr. Last', 'Line']

# It also comes with a static helper if you only need to read the last cell of a column
last_surname = casanova.reverse_reader.last_cell('./people.csv', 'surname')
>>> 'Line'
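Reading the end of a file without scanning it from the start boils down to seeking backwards in binary mode. The sketch below shows the idea for the last line only; read_last_row is a hypothetical helper, not casanova's implementation, and it assumes UTF-8 content with no quoted cell spanning several lines.

```python
import csv
import os
import tempfile


def read_last_row(path, chunk_size=4096):
    """Return the last CSV row of a file by scanning backwards from its end."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        tail = b""
        # Read fixed-size chunks backwards until the tail holds a full line.
        while position > 0 and b"\n" not in tail.rstrip(b"\r\n"):
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            tail = f.read(step) + tail
    last_line = tail.rstrip(b"\r\n").split(b"\n")[-1].decode("utf-8")
    return next(csv.reader([last_line]))


# Hypothetical sample file standing in for ./people.csv
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("name,surname\nJohn,Doe\nMr. Last,Line\n")
last_row = read_last_row(tmp.name)
os.unlink(tmp.name)
print(last_row)  # ['Mr. Last', 'Line']
```

Only the chunks near the end of the file are read, so the cost does not grow with the file size as long as lines stay reasonably short.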

This version: 0.9.1 (released Nov 18, 2020)
