Convert several file data sources (like typical CSV or Excel) to python objects in a easy manner.
Project description
File data-source
The easiest way to parse CSV and Excel files, even compressed CSV files.
For example, with 2 or 1 line you can read a Excel or CSV file, even if it is a compressed CSV file:
from filedatasource import open_reader, excel2objects
# Read a whole Excel file and obtain a list of objects,
# with the fieldnames as object attributes
with open_reader('data.csv.gz') as reader:
objs = reader.read_objects()
print(objs[0].name)
# Or, even, easier:
objs = excel2objects('data.xlsx')
print(objs[0].name)
# If you do not want to load all the file to memory, you can do
with open_reader('data.csv.gz') as reader:
for obj in reader:
print(obj.name)
Install
pip install filedatasource
If you want to read an Excel file in xlsx format, you need to install the openpyxl module:
pip install openpyxl
If you want to create an Excel file in xlsx format, you need to install the xlsxwriter module:
pip install xlsxwriter
If you want to read an Excel file in old xls format, you need to install the xlrd module:
pip install xlrd
Finally, if you want to write into an old xls format, you need to install the xlwt module:
pip install xlwt
For what kind of file formats it is used
filedatasource is a Python module that allows to extract the data form typical CSV or Excel files with the following format:
field_1,field_2,field_3,...
cell_1_1,cell_1_2,cell_1_3,...
cell_2_1,cell_2_2,cell_3_3,...
...
That means, the first row is used for the column names, and the rest of rows are the value of those columns. For example, with this data.csv file:
name,surname,address
Jhon,Smith,Oxford street
Maria,Ortega,George Washington street
Or an Excel file with the following content:
name | surname | address |
---|---|---|
Jhon | Smith | Oxford street |
Maria | Ortega | George Washington street |
Note: This module does not work with other formats. If you need to parse more complex files, you need to use Panda or native modules as openpyxl or xlsxwriter. This module is thought for parsing or generating typical data sources based on files in data scientists processes.
How to read
The standard way to read a CSV is:
from filedatasource import CsvReader
with CsvReader('data.csv') as reader:
for person in reader:
print(person.name, person.surname, person.address)
If the file name ends with .gz then, CsvReader() will decompress the file during the reading. By default, ExcelReader or CsvReader obtain objects with the fieldnames as attributes.
Note: All the data extracted with CsvReader will be of type str.
Important note
This method is not recommended when the column names are not valid Python identifiers because this method will change the no valid identifiers following these rules:
- If the field name has a character that is not valid as Python identifier, it will be replaced by _.
- If the field name starts with a number, then the character 'n' is added at the beginning.
For example, if you have the following data file in CSV:
1d,D&D,Figures
20,Main dice all
...
The identifier for the object attributes will be n1d, D_D and Figures, respectively. The last field does not change because it is a valid Python identifier.
For this reason in these cases it is better to use the parameter mode, which allows to change the mode to obtain the information in each iteration. This way, you can obtain a dictionary (mode=ReadMode.DICT) or a list (mode=ReadMode.LIST) instead of a Python object. For example, this sentences do the same that the previous code does:
from filedatasource import CsvReader, ReadMode
# To get a dictionary instead an objects in each iteration
with CsvReader('data.csv', mode=ReadMode.DICT) as reader:
for person in reader:
print(person['name'], person['surname'], person['fullname'])
# To get a list of values instead of an object with attributes
with CsvReader('data.csv', mode=ReadMode.LIST) as reader:
for person in reader:
print(person[0], person[1], person[2])
All these operations can be done with Excel files using ExcelReader. For example, to read the data.xlsx with the above content you can do the following:
from filedatasource import ExcelReader
with ExcelReader('data.xslx') as reader:
for person in reader:
print(person.name, person.surname, person.address)
If the file name ends with .xls instead of .xlsx, this class will read it correctly too.
Regardless of using CsvReader or ExcelReader, you can use the tqdm() method to measure the time that you process will take. For example:
from filedatasource import ExcelReader, CsvReader
from tqdm import tqdm
with ExcelReader('data.xslx') as reader:
for person in tqdm(reader, desc='Reading the Excel file'):
print(person.name, person.surname, person.address)
with CsvReader('data.xslx') as reader:
for person in tqdm(reader, desc='Reading the CSV file'):
print(person.name, person.surname, person.address)
However, the CsvReader needs, previously, to pre-read the entire file to calculate the number of rows, and it might take a bit longer. This problem does not occur with the Excel file because it is fully loaded into memory.
In an Excel file you can know the list of sheet names by means of sheets():
from filedatasource import sheets
# It detects the type of the Excel file by the extension of the filename.
sheets('test/Example.xlsx')
sheets('test/Example.xls')
How to write a data file
With filedatasource is very easy to write, both, a CSV file (compressed or not) and an Excel file (xls or xlsx). Only you need to use CsvWriter or ExcelWriter writing the rows by parameters, with a list of values or with a dictionary. For example:
from filedatasource import CsvWriter
# Writing a row by means of parameters, a list of values or a dictionary.
with CsvWriter('data.csv.gz', fieldnames=['a', 'b', 'c']) as writer:
writer.write_row(a=1, b=2, c=3)
writer.write([2, 4, 7])
writer.write({'a': 4, 'b': 8, 'c': 31})
If the file name ends with .gz, it will be compressed automatically. Moreover, you can store an object like this:
from filedatasource import CsvWriter
class Example(object):
def __init__(self, a: int, b: int, c: int):
self.a = a
self.b = b
self.c = c
# Writing with Python objects
with CsvWriter('data.csv.gz', fieldnames=['a', 'b', 'c']) as writer:
writer.write(Example(1, 2, 3))
writer.write(Example(4, 5, 6))
If you in your object class defines properties, you can use it directly as fieldnames parameter. For example:
from filedatasource import CsvWriter
class Employee(object):
@property
def full_name(self) -> str:
return self.name + ' ' + self.surname
@property
def name(self) -> str:
return self.__name
@property
def surname(self) -> str:
return self.__surname
def __init__(self, name: str, surname: str):
self.__name = name
self.__surname = surname
# Writing an object and defining the fieldnames automatically from the object property
with CsvWriter('data.csv', fieldnames=Employee) as writer:
writer.write(Employee('John', 'Smith'))
writer.write(Employee('Maria', 'Ortega'))
By default, if the file 'data.csv' exists, it will be removed. In order to avoid it and add the rows at the end of the file, you can use the mode parameter with the value Mode.APPEND like this:
from filedatasource import CsvWriter, Mode
# Creating a CsvWriter in APPEND mode
with CsvWriter('data.csv', fieldnames=['a', 'b', 'c'], mode=Mode.APPEND) as writer:
writer.write_dicts([
{'a': 10, 'b': 11, 'c': 12},
{'a': 13, 'b': 14, 'c': 15}
])
from filedatasource import ExcelWriter, Mode
# Creating a ExcelWriter in APPEND mode (important note below)
with ExcelWriter('data.xlsx', fieldnames=['a', 'b', 'c'], mode=Mode.APPEND) as writer:
writer.write_dicts([
{'a': 10, 'b': 11, 'c': 12},
{'a': 13, 'b': 14, 'c': 15}
])
Important note: If an Excel file has more than one sheet, the append mode will remove all the rest sheets.
Working with Excel sheets
ExcelReader only works with one Excel sheet at a time. You can select the sheet to read using the parameter sheet using the sheet name or its number (starting by 0). For example, if we have the file Example.xls with two sheets, the first one named Clients, and the second one named Suppliers, you can do the following:
from filedatasource import ExcelReader
# Reading the first sheet of the file 'Example.xls'
with ExcelReader('Example.xls', sheet=0) as reader:
for row in reader:
print(row)
# Reading the second sheet of the file 'Example.xls'
with ExcelReader('Example.xls', sheet=1) as reader:
for row in reader:
print(row)
# Reading the sheet named 'Clients' of the file 'Example.xls'
with ExcelReader('Example.xls', sheet='Clients') as reader:
for row in reader:
print(row)
# Reading the sheet named 'Suppliers' of the file 'Example.xls'
with ExcelReader('Example.xls', sheet='Suppliers') as reader:
for row in reader:
print(row)
Reading and writing whole files
With filedatasource you can read or write a whole data file in two lines using DataWriter.write_lists(), DataWriter.write_dicts(), and DataWriter.write_objects() for writing list of lists, dictionaries and object, respectively; and DataReader.read_lists(), DataReader.read_rows(), and DataReader.read_objects() for getting a list of lists, dictionaries or objects with the all the file data.
For example, writing several rows:
from filedatasource import CsvWriter
with CsvWriter('data.xlsx', fieldnames=['a', 'b', 'c']) as writer:
writer.write_lists([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
writer.write_dicts([
{'a': 10, 'b': 11, 'c': 12},
{'a': 13, 'b': 14, 'c': 15}
])
writer.write_objects([
Example(16, 17, 18),
Example(19, 20, 21),
Example(22, 23, 24)
])
All these methods also work with ExcelWriter.
For reading a whole file, you only need to write the following sentences:
from filedatasource import CsvReader, ExcelReader
# Read a whole compressed CSV file and get a list of lists
with CsvReader('data.csv.gz') as reader:
lists = reader.read_lists()
fieldnames = reader.fieldnames # This line is only necessary if you want to get the fieldnames for read_lists()
print(fieldnames[0], lists[0][0])
# Read a whole CSV file and obtain a list of dictionaries, with the fieldnames as key and the data the values.
with CsvReader('data.csv') as reader:
dicts = reader.read_rows()
print(lists[0]['name'])
# Read a whole Excel file and obtain a list of objects, with the fieldnames as object attributes
with ExcelReader('data.xlsx') as reader:
objs = reader.read_objects()
print(lists[0].name)
Convert from CSV to Excel and vice-versa
With these tools, it is very simply to convert from a CSV to an Excel file and viceversa:
from filedatasource import ExcelWriter, CsvReader
with CsvReader('data.csv') as reader:
with ExcelWriter('data.xlsx', fieldnames=reader.fieldnames) as writer:
for obj in reader:
writer.write(obj)
However, it can be made easier:
from filedatasource import ExcelWriter, CsvReader
with CsvReader('data.csv') as reader:
with ExcelWriter('data.xlsx', fieldnames=reader.fieldnames) as writer:
writer.import_reader(reader)
What's more, it can be even easier:
from filedatasource import convert
convert('data.csv', 'data.xlsx')
Compare the content of two files
You can use equals() to check if two file data sources have the same content independently if they are the same file type or different. However, if one of the files are CSV and the other Excel, and some columns are not strings, the comparison are going to fail. It is better to compare the same type of files or to work only with strings to be sure the comparison is correct.
The syntax os equals() is very easy:
from filedatasource import equals
print(equals('test.xlsx', 'test.csv.gz'))
Reading CSV with field types
In Excel files you can save information maintaining their original data types. For example, if you store a row with different field types, you can retrieve the information with the same vale types. For instance:
from filedatasource import ExcelWriter, ExcelReader
with ExcelWriter('data.xlsx', fieldnames=['a', 'b', 'c', 'd']) as writer:
writer.write(['One', 2, 3.0, True])
with ExcelReader('data.xlsx') as reader:
print(reader.read_list()) # --> print ['One', 2, 3.0, True]
However, a CSV file does not contain any information about field types. And using CsvWriter and CsvReader in the same way you will get the list ['One', '2', '3.0', 'True']. That means, all the values in string format. To solve this, you can use the types parameter as following:
from filedatasource import CsvWriter, CsvReader
with CsvWriter('data.csv', fieldnames=['a', 'b', 'c', 'd']) as writer:
writer.write(['One', 2, 3.0, True])
# Define how the data type of each field using a list
with CsvReader('data.csv', types=[str, int, float, bool]) as reader:
print(reader.read_list()) # -> print ['One', 2, 3.0, True])
# Define how the data type of each field using a dictionary
with CsvReader('data.csv', types={'a': str, 'b': int, 'c': float, 'd': bool}) as reader:
print(reader.read_list()) # -> print ['One', 2, 3.0, True])
This way, you can define the data values in the correct field type. If there are fields without defining, then a string value will be returned for those fields.
Writers with predefined fieldnames
If you are using the same writer several times, you can use a subclass in order to avoid define the fieldnames several times overwriting the fieldnames property. For example:
from filedatasource import CsvWriter, Mode
from typing import List
class TestWriter(CsvWriter):
@property
def fieldnames(self) -> List[str]:
return ['a', 'b', 'c']
with TestWriter('data.csv') as writer:
writer.write_lists([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
with TestWriter(('data.csv', mode=Mode.APPEND) as writer:
writer.write_dicts([
{'a': 10, 'b': 11, 'c': 12},
{'a': 13, 'b': 14, 'c': 15}
])
However, it could be easier using just one line using the functions load() and save(), save_dicts() or save_lists(). This method load and/or save Excel or CSV files based on the file extension.
from filedatasource import load, save, save_dicts, save_lists, ReadMode
# Obtain a list of objects from a CSV file and print the first row.
objs = load('data.csv')
print(objs[0].a, objs[0].b, objs[0].c)
# Obtain a list of dictionaries from a compressed CSV file and print the first row.
dicts = load('data.csv.gz', mode=ReadMode.DICT)
print(dicts[0]['a'], dicts[0]['b'], dicts[0]['c'])
# Obtain a list of lists from an Excel file and print the first row.
lists = load('data.xlsx', mode=ReadMode.LIST)
print(lists[0][0], lists[0][1], lists[0][2])
# Write a list of objects in a XLS file
data_objects = [
Example(16, 17, 18),
Example(19, 20, 21),
Example(22, 23, 24)
]
save('data.xls', data_objects)
# Write a list of dictionaries in a XLS file
data_dict = [
{'a': 10, 'b': 11, 'c': 12},
{'a': 13, 'b': 14, 'c': 15}
]
save_dicts('data.xls', data_dict)
# Write a list of lists in a XLSX file with the column names 'a', 'b' and 'c'
data_list = [
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
]
save_lists('data.xlsx', data_list, fieldnames=['a', 'b', 'c'])
It is possible to use the function load_objs(), load_dicts() and load_lists() instead of the function load() with the parameter mode.
Also, it is possible to use the functions list2csv(), dict2csv(), objects2csv(), list2excel(), dict2excel() and objects2excel() in order to force the format of the output file independently of the file extension:
from filedatasource import list2csv, dict2csv, objects2csv
from filedatasource import list2excel, dict2excel, objects2excel
data_list = [
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
]
data_dict = [
{'a': 10, 'b': 11, 'c': 12},
{'a': 13, 'b': 14, 'c': 15}
]
data_objects = [
Example(16, 17, 18),
Example(19, 20, 21),
Example(22, 23, 24)
]
# Write a list of lists into a CSV file
list2csv('data.csv', data_list, fieldnames=['a', 'b', 'c'])
# Write a list of dicts into a CSV file
dict2csv('data.csv', data_dict, fieldnames=['a', 'b', 'c'])
dict2csv('data.csv', data_dict) # Use the key as fieldnames
# Write a list of objects into a compressed CSV file
objects2csv('data.csv.gz', data_objects, fieldnames=['a', 'b', 'c'])
objects2csv('data.csv.gz', data_objects) # Use object attributes as fieldnames
# Write a list of lists into an Excel file
list2excel('data.xlsx', data_list, fieldnames=['a', 'b', 'c'])
# Write a list of dicts into an Excel file
dict2excel('data.xlsx', data_dict, fieldnames=['a', 'b', 'c'])
# Write a list of objects into a XLS file
objects2excel('data.xls', data_objects, fieldnames=['a', 'b', 'c'])
And, you can also use the equivalent functions csv2list(), csv2dict(), csv2objects(), excel2list(), excel2dict(), and excel2objects(). For example:
from filedatasource import csv2list, csv2dict, excel2objects
# Read a whole compressed CSV file and get a list of lists
lists = csv2list('data.csv.gz')
print(lists[0][0])
# Read a whole CSV file and obtain a list of dictionaries, with the fieldnames as key and the data the values.
dicts = csv2dict('data.csv')
print(lists[0]['name'])
# Read a whole Excel file and obtain a list of objects, with the fieldnames as object attributes
objs = excel2objects('data.xlsx')
print(lists[0].name)
Builders
You have two builders to create readers or writers based on the file extension with default arguments.
from filedatasource import open_reader, open_writer
# Read a CSV
with open_reader('data.csv') as reader:
...
# Read a compressed CSV
with open_reader('data.csv.gz') as reader:
...
# Read the first sheet from an Excel
with open_reader('data.xlsx') as reader:
...
# Write a CSV
with open_writer('data.csv', fieldnames=[...]) as writer:
...
# Write a compressed CSV
with open_writer('data.csv.gz', fieldnames=[...]) as writer:
...
# Write an Excel file creating a default sheet
with open_writer('data.xlsx', fieldnames=[...]) as writer:
...
Using the open_reader() you can use the mode parameter to set how get each row.
from filedatasource import open_reader, ReadMode
# Reading the file and returning objects
with open_reader('data.xlsx') as reader:
for obj in reader:
print(obj.a, obj.b, obj.c)
# Read the file and returning dictionaries
with open_reader('data.csv', mode=ReadMode.DICT) as reader:
for d in reader:
print(d['a'], d['b'], d['c'])
# Read the file and returning lists of values
with open_reader('data.csv.gz', mode=ReadMode.LIST) as reader:
for lst in reader:
print(lst[0], lst[1], lst[2])
Size calculation
It is possible to know how many rows has a data source for both, writer or reader. For example:
from filedatasource import open_reader, open_writer
# How to know the number of rows in a reader with data
with open_reader('data.csv') as reader:
print(len(reader))
# How to know the number of rows in a writer
with open_writer('data.csv', fieldnames=['a', 'b', 'c']) as writer:
print(len(writer)) # Print 0 because the file is opened for write.
writer.write(a=1, b=2, c=3)
writer.write(a=4, b=5, c=6)
print(len(writer)) # Print 2 because has been added two rows.
Note: The file length calculation does not work with readers oppened with a file stream instead of a file path or with a writer opened with append mode and from a file stream.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file filedatasource-1.1.19.tar.gz
.
File metadata
- Download URL: filedatasource-1.1.19.tar.gz
- Upload date:
- Size: 26.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.10
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5edf75a52278d2b33251a0610fcf17a6ac644c2793e9d242dd9501ea44c8f16f |
|
MD5 | 69d0d2a4df3f34e2a39332a4f2881856 |
|
BLAKE2b-256 | f76182094cc96d0954cd768b58b504e973f8235c031b4f80291c06dfbc87740e |