Skip to main content

A versatile Python package that allows you to execute MongoDB-style queries on CSV files and interact with them like SQL tables.

Project description

csvquery

A versatile Python package that allows you to execute MongoDB-style queries on CSV files and interact with them like SQL tables.

Downloading

pip install csvquery

Package contents

open_csv(str path[, str delimiter = ","])

Produces a Dataset from a CSV file:

from csvquery import open_csv

dataset = open_csv("path/to/file.csv")

get_csv(str url[, str delimiter = ","])

Produces a Dataset from a URL:

from csvquery import get_csv

dataset = get_csv("http://example.com/api/data.csv")

parse_csv(str string[, str delimiter = ","])

Produces a Dataset from a string:

from csvquery import parse_csv

string = "name,age\nJohn,34\nJane,33"
dataset = parse_csv(string)

Operators

Stores all the valid query operator keywords as attributes. Using this class is optional as you can just use the keyword strings instead.

  • equal = "eq"
  • not_equal = "neq"
  • less_than = "lt"
  • greater_than = "gt"
  • less_than_or_equal = "lte"
  • greater_than_or_equal = "gte"
  • inside = "in"
  • _not = "not"
  • _and = "and"
  • _or = "or"

Comparisons

Stores some common comparison operators.

integers

A lambda function to compare integer values.

data.index("age", Comparisons.integers)

floats

A lambda function to compare floating-point values.

data.index("rate", Comparisons.floats)

strings

A lambda function to compare strings alphabetically.

data.index("name", Comparisons.strings)

default

An alias for the floats comparison.

get_date_comparison(str format_string)

Returns a function that compares dates based on the format string. See https://strftime.org/ for a list of all valid date codes.

data.index("date", Comparisons.get_date_comparison("%Y-%m-%d"))

The Dataset object

The Dataset object is similar to an SQL table. It can be obtained with the open_csv, get_csv, and parse_csv methods.

data

A two-dimensional list of the data.

for row in voter_dataset.data:
    print(row[0])
    ...

fields

A list of the dataset's fields, or column names.

for field in voter_dataset.fields:
    print(field)
    ...

index(str field[, func comparison_operation = Comparisons.default])

Sort the rows of data based on the values in a specified field. Sorting the data is optional, but doing so allows you to do binary searches which have a time complexity of just O(log(n)). The comparison_operation argument must be a function that returns True when the first argument is less than the second argument, and False if otherwise. Import Comparisons for some common comparison operations. By default, the comparison_operation is a floating-point comparison.

from csvquery import open_csv, Comparisons

dataset = open_csv("people.csv")
dataset.index("age", Comparisons.integers) # sorts people by ascending age

You can also make your own comparison operator.

dataset.index("age", lambda a, b: a**2 < b**2)

already_indexed(str field[, func comparison_operation = Comparisons.default])

Specifies that the data is already sorted by a certain field, allowing binary searches without re-sorting.

from csvquery import open_csv, Comparisons

dataset = open_csv("people.csv")
dataset.already_indexed("name", Comparisons.strings)

query(dict filter_object)

Returns all rows that match the filter_object as another Dataset.

from csvquery import open_csv

dataset = open_csv("people.csv")
dataset.index("age")

voter_dataset = dataset.query({
    "age": {          # this filter will run as a binary search since we indexed the data by age
        "gte": 18     # the query will only return people who's age is greater than or equal to 18
    },
    "citizenship" {   # this will run after the binary search to filter the narrowed-down data
        "eq": "USA"   # people will only pass this filter if their "citizenship" field is equal to "USA"
    }
})

The general structure of a filter_object is as follows:

{
    "field_1": {
        "operator_1": "value",
        "operator_2": "value",
        "operator_3": {
            "nested_operator": "value"
        },
        ...
        "operator_n": "value_n"
    },
    "field_2": {
        ...
    },
    ...
    "field_3": {
        ...
    }
}

You can also use the csvquery.Operators class instead of operator strings:

from csvquery import Operators

voters_named_john = voter_dataset.query({
    "name": {
        Operators.equal : "John"
    }
})

If you want to use a comparison operator like gt or lte on a column that was not indexed, you need to provide a comparison operator in the filter_object like so:

from csvquery import open_csv, Operations, Comparisons

dataset = open_csv("people.csv")
dataset.index("citizenship") # sorts people by citizenship

voter_dataset = dataset.query({
    "citizenship": { # binary search
        "eq": "USA"
    },
    "age" {  # not a binary search
        "gte": "18"
        "comparison": Comparisons.integers
    }
})

query_one(dict filter_object)

Returns the first row that matches the filter_object as a Dataset:

john_doe = people_dataset.query_one({"phone":"555-123-4567"})

select(list fields)

Returns the a new Dataset object with only the specified fields.

names_and_ages = people.select(["name", "age"])

select_as(dict fields)

Returns the a new Dataset object with only the specified fields, except the fields are renamed according to the fields dictionary.

names_and_ages = people.select_as({
    "first_and_last_name": "name",
    "years_of_oldness": "age"
})

select_unique(str field)

Returns a new Dataset object with only the specified field, and removes any duplicate values so that each value is unique.

names = people.select_unique("name")

add_field(str field[, func derivation = lambda r:""])

Adds another field with the specified name. By default, the field will be filled with blank values.

people.add_field("status")

You can optionally specify a function that takes the data in that row as a dictionary and outputs the new derived value per row.

people.add_field("full_name", lambda row: row["first_name"] + " " + row["last_name"]])

remove_fields(list fields)

Removes the specified fields from the Dataset.

people.remove_fields(["status", "full_name"])

rename_fields(dict fields)

Renames fields according to the fields dictionary argument.

people.rename_fields({
    "first_and_last_name": "name",
    "years_of_oldness": "age"
})

replace(list fields, func function)

Replaces the values in the specified fields list argument using the function argument, which takes the current value as input and outputs the new value.

people.replace(["first name", "last name"], lambda v: v.lower()) # makes all "first name" and "last name" values lower case

replace_derived(list fields, func derivation)

Replaces the values in the specified fields list argument using the function argument, which takes the row as a dictionary as input and outputs the new value.

def birthday_to_age(row):
    bday = datetime.strptime(row["date_of_birth"], "%Y-%m-%d")
    today = datetime.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

people.replace_derived(["age"], birthday_to_age)

join(Dataset other_dataset, list common_fields[, bool remove = True])

Adds fields from other_dataset to this Dataset and matches the rows by referencing the common_fields list argument. The common_fields list must have the field from the current Dataset and the field from the other Dataset in that order. By default, this method will remove the common field after the operation, but you can prevent this by setting the remove argument to False.

locations = open_csv("locations.csv") # has an id field
people = open_csv("people.csv") # has a location_id field that matches people to locations

people.join(locations, ["location_id", "id"])

to_dictionary()

Returns a the data as a dictionary if the Dataset has only one row (as a result of a query_one operation, for example).

john_doe = people.query_one({"phone":"555-123-4567"}) # dataset is one row high
print(john_doe.to_dictionary()["address"])

to_list()

Returns a the data as a list if the Dataset has only one column (as a result of a select operation, for example).

texans = people.query({"state":"TX"}).select("name") # dataset is one column wide
texan_names = texans.to_list()

count([list fields])

If the fields argument is left blank, returns the number of rows in the Dataset.

number_of_people = people.count()

If otherwise, returns the number of rows in which the all of the specified fields are not empty.

number_of_with_jobs = people.count(["job"]) # assuming the "job" field is left blank for unemployed people

sum(str field)

Returns a sum of all the values in that field.

total_net_worth = people.sum("net_worth")

average(str field)

Returns a average of all the values in that field.

average_net_worth = people.average("net_worth")

print_table([list fields])

Outputs your data to the console in a nice table.

voter_dataset.print_table()

You can optionally specify which columns to print.

voter_dataset.print_table(["name", "age"])

save_csv(str filepath[, str delimiter = ","[, fields = ]])

Saves the Dataset to a file. If no fields are specified, all fields will be saved.

voter_dataset.save_csv("output.csv", ";", ["name", "age"])

More examples

SQL equivalent

Classic SQL query

SELECT name, age FROM people
WHERE age >= 18 AND citizenship = "USA";

Python MongoDB-style query

voters = people.query({
    "age": {
        "gte": 18
    },
    "citizenship": "USA"
}).select(["name", "age"])

Printing certain columns

dataset = open_csv("people.csv")
dataset.print_table(dataset.fields[2:5])

Rewriting a CSV file with fewer columns and a different delimiter

dataset = open_csv("people.csv")
dataset.save_csv("people.csv", ";", dataset.fields[2:5])

The "eq" operator is optional in the top level of the dictionary

dataset = csvquery.open_csv("people.csv")
dataset.query({
    "name":"John"
})

Selecting one field

people.select("name") # doesn't need to be a list if it's just one

Chaining

(
open_csv("people.csv")
    .index("age")
    .query({"age":{"gte":18}, "citizenship":"USA"})
    .select(["name", "id"])
    .save_csv("voters.csv", ",")
    .print_table()
)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

csvquery-1.1.3.tar.gz (12.6 kB view hashes)

Uploaded Source

Built Distribution

csvquery-1.1.3-py3-none-any.whl (4.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page