csvquery
A versatile Python package that allows you to execute MongoDB-style queries on CSV files and interact with them like SQL tables.
Downloading
pip install csvquery
Package contents
open_csv(str path[, str delimiter = ","])
Produces a Dataset from a CSV file:
from csvquery import open_csv
dataset = open_csv("path/to/file.csv")
get_csv(str url[, str delimiter = ","])
Produces a Dataset from a URL:
from csvquery import get_csv
dataset = get_csv("http://example.com/api/data.csv")
parse_csv(str string[, str delimiter = ","])
Produces a Dataset from a string:
from csvquery import parse_csv
string = "name,age\nJohn,34\nJane,33"
dataset = parse_csv(string)
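For intuition, here is a rough sketch of what a function like parse_csv has to do internally, using only the standard library (illustrative only, not the package's actual source):

```python
import csv
import io

# Conceptual sketch: split a delimited string into a header row (the fields)
# and the remaining data rows, as the Dataset attributes described below suggest.
string = "name,age\nJohn,34\nJane,33"
rows = list(csv.reader(io.StringIO(string), delimiter=","))
fields, data = rows[0], rows[1:]
print(fields)  # ['name', 'age']
print(data)    # [['John', '34'], ['Jane', '33']]
```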
Operators
Stores all the valid query operator keywords as attributes. Using this class is optional as you can just use the keyword strings instead.
- equal = "eq"
- not_equal = "neq"
- less_than = "lt"
- greater_than = "gt"
- less_than_or_equal = "lte"
- greater_than_or_equal = "gte"
- inside = "in"
- _not = "not"
- _and = "and"
- _or = "or"
Comparisons
Stores some common comparison operators.
integers
A lambda function to compare integer values.
data.index("age", Comparisons.integers)
floats
A lambda function to compare floating-point values.
data.index("rate", Comparisons.floats)
strings
A lambda function to compare strings alphabetically.
data.index("name", Comparisons.strings)
default
An alias for the floats comparison.
get_date_comparison(str format_string)
Returns a function that compares dates based on the format string. See https://strftime.org/ for a list of all valid date codes.
data.index("date", Comparisons.get_date_comparison("%Y-%m-%d"))
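As a rough illustration of how such a comparator could be built (an assumption about the approach, not the package's actual code): strptime parses both strings with the format string, and the resulting datetime objects are compared.

```python
from datetime import datetime

# Hypothetical reimplementation of a format-string-based date comparator.
# It returns a "less than" function: True when date a precedes date b.
def make_date_comparison(format_string):
    return lambda a, b: (datetime.strptime(a, format_string)
                         < datetime.strptime(b, format_string))

earlier = make_date_comparison("%Y-%m-%d")
print(earlier("2020-01-15", "2020-02-01"))  # True
print(earlier("2021-01-01", "2020-12-31"))  # False
```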
The Dataset object
The Dataset object is similar to an SQL table. It can be obtained with the open_csv, get_csv, and parse_csv methods.
data
A two-dimensional list of the data.
for row in voter_dataset.data:
    print(row[0])
fields
A list of the dataset's fields, or column names.
for field in voter_dataset.fields:
    print(field)
index(str field[, func comparison_operation = Comparisons.default])
Sorts the rows of data based on the values in a specified field. Sorting the data is optional, but doing so allows you to perform binary searches, which have a time complexity of just O(log n). The comparison_operation argument must be a function that returns True when the first argument is less than the second argument, and False otherwise. Import Comparisons for some common comparison operations. By default, comparison_operation is a floating-point comparison.
from csvquery import open_csv, Comparisons
dataset = open_csv("people.csv")
dataset.index("age", Comparisons.integers) # sorts people by ascending age
You can also make your own comparison operator.
dataset.index("age", lambda a, b: a**2 < b**2)
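To see why indexing pays off, here is a standalone sketch (not csvquery internals) of the binary search that a sorted column makes possible: the same less-than comparator both sorts the rows and steers the search.

```python
# Lower-bound binary search over rows sorted by one field: returns the index
# of the first row whose value in `key` is not less than `target`. O(log n).
def lower_bound(rows, key, target, less_than):
    lo, hi = 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if less_than(rows[mid][key], target):
            lo = mid + 1
        else:
            hi = mid
    return lo

integers = lambda a, b: int(a) < int(b)  # same idea as Comparisons.integers

# CSV values are strings, so the comparator handles the conversion.
rows = sorted([{"age": "34"}, {"age": "18"}, {"age": "25"}],
              key=lambda r: int(r["age"]))
print(lower_bound(rows, "age", "25", integers))  # 1
```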
already_indexed(str field[, func comparison_operation = Comparisons.default])
Specifies that the data is already sorted by a certain field, allowing binary searches without re-sorting.
from csvquery import open_csv, Comparisons
dataset = open_csv("people.csv")
dataset.already_indexed("name", Comparisons.strings)
query(dict filter_object)
Returns all rows that match the filter_object as another Dataset.
from csvquery import open_csv
dataset = open_csv("people.csv")
dataset.index("age")
voter_dataset = dataset.query({
    "age": { # this filter will run as a binary search since we indexed the data by age
        "gte": 18 # the query will only return people whose age is greater than or equal to 18
    },
    "citizenship": { # this will run after the binary search to filter the narrowed-down data
        "eq": "USA" # people will only pass this filter if their "citizenship" field is equal to "USA"
    }
})
The general structure of a filter_object is as follows:
{
    "field_1": {
        "operator_1": "value",
        "operator_2": "value",
        "operator_3": {
            "nested_operator": "value"
        },
        ...
        "operator_n": "value_n"
    },
    "field_2": {
        ...
    },
    ...
    "field_3": {
        ...
    }
}
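A rough sketch of how a filter_object of this shape could be evaluated against a single row (represented as a dict). The operator strings match the Operators table above; the matches function itself is hypothetical, not the package's source, and it omits the nested not/and/or operators for brevity.

```python
# Evaluate a MongoDB-style filter_object against one row (a dict).
# A bare value at the top level is shorthand for {"eq": value}.
def matches(row, filter_object):
    ops = {
        "eq":  lambda v, t: v == t,
        "neq": lambda v, t: v != t,
        "lt":  lambda v, t: v < t,
        "gt":  lambda v, t: v > t,
        "lte": lambda v, t: v <= t,
        "gte": lambda v, t: v >= t,
        "in":  lambda v, t: v in t,
    }
    for field, condition in filter_object.items():
        if not isinstance(condition, dict):
            condition = {"eq": condition}
        for op, target in condition.items():
            if not ops[op](row[field], target):
                return False
    return True

row = {"age": 21, "citizenship": "USA"}
print(matches(row, {"age": {"gte": 18}, "citizenship": "USA"}))  # True
print(matches(row, {"age": {"lt": 18}}))                         # False
```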
You can also use the csvquery.Operators class instead of operator strings:
from csvquery import Operators
voters_named_john = voter_dataset.query({
    "name": {
        Operators.equal: "John"
    }
})
If you want to use a comparison operator like gt or lte on a column that was not indexed, you need to provide a comparison operator in the filter_object like so:
from csvquery import open_csv, Comparisons

dataset = open_csv("people.csv")
dataset.index("citizenship") # sorts people by citizenship
voter_dataset = dataset.query({
    "citizenship": { # binary search
        "eq": "USA"
    },
    "age": { # not a binary search
        "gte": 18,
        "comparison": Comparisons.integers
    }
})
query_one(dict filter_object)
Returns the first row that matches the filter_object as a Dataset:
john_doe = people_dataset.query_one({"phone":"555-123-4567"})
select(list fields)
Returns a new Dataset object with only the specified fields.
names_and_ages = people.select(["name", "age"])
select_as(dict fields)
Returns a new Dataset object with only the specified fields, except the fields are renamed according to the fields dictionary.
names_and_ages = people.select_as({
    "first_and_last_name": "name",
    "years_of_oldness": "age"
})
select_unique(str field)
Returns a new Dataset object with only the specified field, and removes any duplicate values so that each value is unique.
names = people.select_unique("name")
add_field(str field[, func derivation = lambda r:""])
Adds another field with the specified name. By default, the field will be filled with blank values.
people.add_field("status")
You can optionally specify a function that takes the data in that row as a dictionary and outputs the new derived value per row.
people.add_field("full_name", lambda row: row["first_name"] + " " + row["last_name"])
remove_fields(list fields)
Removes the specified fields from the Dataset.
people.remove_fields(["status", "full_name"])
rename_fields(dict fields)
Renames fields according to the fields dictionary argument.
people.rename_fields({
    "first_and_last_name": "name",
    "years_of_oldness": "age"
})
replace(list fields, func function)
Replaces the values in the specified fields list argument using the function argument, which takes the current value as input and outputs the new value.
people.replace(["first name", "last name"], lambda v: v.lower()) # makes all "first name" and "last name" values lower case
replace_derived(list fields, func derivation)
Replaces the values in the specified fields list argument using the function argument, which takes the row as a dictionary as input and outputs the new value.
from datetime import datetime

def birthday_to_age(row):
    born = datetime.strptime(row["date_of_birth"], "%Y-%m-%d")
    today = datetime.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

people.replace_derived(["age"], birthday_to_age)
join(Dataset other_dataset, list common_fields[, bool remove = True])
Adds fields from other_dataset to this Dataset and matches the rows by referencing the common_fields list argument. The common_fields list must have the field from the current Dataset and the field from the other Dataset in that order. By default, this method will remove the common field after the operation, but you can prevent this by setting the remove argument to False.
locations = open_csv("locations.csv") # has an id field
people = open_csv("people.csv") # has a location_id field that matches people to locations
people.join(locations, ["location_id", "id"])
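Conceptually, the join builds a lookup table keyed on the other dataset's field and merges each matching row into ours. The sketch below mirrors the described behavior (with the default remove=True) on plain dicts; it is not the package's implementation.

```python
# Join two row lists on a pair of key fields, dropping both key fields
# from the result (mirroring the default remove=True behavior).
def join_rows(left, right, left_key, right_key):
    lookup = {row[right_key]: row for row in right}
    joined = []
    for row in left:
        match = dict(lookup.get(row[left_key], {}))
        match.pop(right_key, None)
        merged = {k: v for k, v in row.items() if k != left_key}
        merged.update(match)
        joined.append(merged)
    return joined

people = [{"name": "John", "location_id": 1}]
locations = [{"id": 1, "city": "Austin"}]
print(join_rows(people, locations, "location_id", "id"))
# [{'name': 'John', 'city': 'Austin'}]
```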
to_dictionary()
Returns the data as a dictionary if the Dataset has only one row (as a result of a query_one operation, for example).
john_doe = people.query_one({"phone":"555-123-4567"}) # dataset is one row high
print(john_doe.to_dictionary()["address"])
to_list()
Returns the data as a list if the Dataset has only one column (as a result of a select operation, for example).
texans = people.query({"state":"TX"}).select("name") # dataset is one column wide
texan_names = texans.to_list()
count([list fields])
If the fields argument is left blank, returns the number of rows in the Dataset.
number_of_people = people.count()
Otherwise, returns the number of rows in which all of the specified fields are non-empty.
number_of_employed = people.count(["job"]) # assuming the "job" field is left blank for unemployed people
sum(str field)
Returns a sum of all the values in that field.
total_net_worth = people.sum("net_worth")
average(str field)
Returns the average of all the values in that field.
average_net_worth = people.average("net_worth")
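Since CSV cells are read as strings, sum and average presumably convert each value to a number first; the sketch below shows the arithmetic involved (an assumption about the behavior, not the package's source).

```python
# Summing and averaging a string-valued CSV column requires numeric
# conversion before the arithmetic.
rows = [{"net_worth": "1000"}, {"net_worth": "2500"}, {"net_worth": "4000"}]
total = sum(float(r["net_worth"]) for r in rows)
average = total / len(rows)
print(total)    # 7500.0
print(average)  # 2500.0
```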
print_table([list fields])
Outputs your data to the console in a nice table.
voter_dataset.print_table()
You can optionally specify which columns to print.
voter_dataset.print_table(["name", "age"])
save_csv(str filepath[, str delimiter = ","[, list fields]])
Saves the Dataset to a file. If no fields are specified, all fields will be saved.
voter_dataset.save_csv("output.csv", ";", ["name", "age"])
More examples
SQL equivalent
Classic SQL query
SELECT name, age FROM people
WHERE age >= 18 AND citizenship = "USA";
Python MongoDB-style query
voters = people.query({
    "age": {
        "gte": 18
    },
    "citizenship": "USA"
}).select(["name", "age"])
Printing certain columns
dataset = open_csv("people.csv")
dataset.print_table(dataset.fields[2:5])
Rewriting a CSV file with fewer columns and a different delimiter
dataset = open_csv("people.csv")
dataset.save_csv("people.csv", ";", dataset.fields[2:5])
The "eq" operator is optional in the top level of the dictionary
dataset = csvquery.open_csv("people.csv")
dataset.query({
    "name": "John"
})
Selecting one field
people.select("name") # doesn't need to be a list if it's just one
Chaining
(
    open_csv("people.csv")
    .index("age")
    .query({"age": {"gte": 18}, "citizenship": "USA"})
    .select(["name", "id"])
    .save_csv("voters.csv", ",")
    .print_table()
)
File details
Details for the file csvquery-1.1.3.tar.gz.
File metadata
- Download URL: csvquery-1.1.3.tar.gz
- Upload date:
- Size: 12.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.1 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1a09946bacf9d55ef777c48711de260513ba64ae7f577e3535cbcce4084adc77
MD5 | 234ca86a5f08f0f85f0152ce5adc90c7
BLAKE2b-256 | 842419b56cebab752de251a90f646d3eae8a354771df2dd6a1b62a3616312fa8
File details
Details for the file csvquery-1.1.3-py3-none-any.whl.
File metadata
- Download URL: csvquery-1.1.3-py3-none-any.whl
- Upload date:
- Size: 4.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.1 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4c00cb7e9a817727f79f3af440fd6aa94cb6ee65a33cd3244c9aaf0efaf82d04
MD5 | 4324cb15a365350536c92b6cfc6fbbf7
BLAKE2b-256 | 642ac7ec0e4eca9eca42e7b44e7a33649cea9142bcc0ed49bf9eac1beec62a1a