Finds the right CSV separator and excludes bad lines in corrupt CSV files
Example:
# You have probably seen this before, right?
import pandas as pd
pd.read_csv(r"https://github.com/zdavatz/diprela/raw/main/csv/diprela.csv")
Traceback (most recent call last):
File "C:\Users\Gamer\anaconda3\envs\dfdir\lib\site-packages\IPython\core\interactiveshell.py", line 3398, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-8-c5703c9ae399>", line 1, in <cell line: 1>
....
File "pandas\_libs\parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 10
pd.read_csv(r"https://github.com/zdavatz/diprela/raw/main/csv/diprela.csv",on_bad_lines='skip')
# Better, but everything ends up in one column, and many rows are lost (526 rows here versus the 1218 that outguncsv reads below).
Schweizerische Nährwertdatenbank komplett (bearbeitet durch Lesley Grünenfelder) (Stand: 11.02.2023) ;;;;;;;;;;;;;;;;;;;;;;;;;;;Schweizerische Nährwertdatenbank komplett (bearbeitet durch Lesley Grünenfelder) (Stand: 30.01.2023) ;;;;;;;;;;;;;;;;;;;;;
0 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
1 Agar Agar;;Gelier- und Bindemittel ;Verschiede...
2 Agavensirup;Agavendicksaft ;Zucker und Süsssto...
3 Ahornsirup;;Zucker und Süssstoffe;Verschiedene...
4 Älplermagronen;zubereitet;salzige Gerichte ;Ge...
.. ...
521 Zwieback ;;Brot und Backware ;Getreide und Get...
522 Zwieback;Vollkorn;Brot und Backware ;Getreide ...
523 Zwiebel;gedünstet (ohne Zugabe von Fett und Sa...
524 Zwiebel;geröstet (ohne Zugabe von Fett und Sal...
525 Zwiebel;roh;Gemüse frisch ;Gemüse ;pro 100g es...
[526 rows x 1 columns]
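The error above ("Expected 1 fields in line 3, saw 10") most likely appears because the file is semicolon-separated while pandas splits on commas by default, so the first lines look like a single field and a later line that happens to contain commas does not fit. A minimal sketch for inspecting this kind of mismatch with the standard library only; the function name `odd_rows` and the sample text are made up for illustration and are not part of outguncsv:

```python
import csv
import io
from collections import Counter


def odd_rows(text, sep):
    """Return the most common field count and the rows that deviate from it.

    Assumes `text` is non-empty. Row numbers are 1-based.
    """
    counts = [len(row) for row in csv.reader(io.StringIO(text), delimiter=sep)]
    expected = Counter(counts).most_common(1)[0][0]
    odd = [(i, n) for i, n in enumerate(counts, start=1) if n != expected]
    return expected, odd


# Made-up sample: a title row, a header, one good row and one short row.
sample = (
    "Some title\n"
    "Name;Synonym;Category\n"
    "Agar Agar;;Gelier- und Bindemittel\n"
    "broken;row\n"
)
expected, odd = odd_rows(sample, sep=";")
print(expected, odd)  # 3 [(1, 1), (4, 2)]
```

Rows reported as odd are the ones a strict parser would reject or, with `on_bad_lines='skip'`, silently drop.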
# If you have problems reading a CSV file, use:
from outguncsv import read_balky_csv_files
alf3 = r"https://github.com/zdavatz/diprela/raw/main/csv/diprela.csv"
df3 = read_balky_csv_files(
    csvfiles=alf3,
    encoding="utf-8",
    sep=None,
    regexremove=(),
    filepathcolumn="file",
    on_bad_lines="warn",
)
df3
Out[4]:
0 ... file
0 Schweizerische Nährwertdatenbank komplett (bea... ... https://github.com/zdavatz/diprela/raw/main/cs...
1 NaN ... https://github.com/zdavatz/diprela/raw/main/cs...
2 Name ... https://github.com/zdavatz/diprela/raw/main/cs...
3 Agar Agar ... https://github.com/zdavatz/diprela/raw/main/cs...
4 Agavensirup ... https://github.com/zdavatz/diprela/raw/main/cs...
... ... ...
1213 Zwiebel ... https://github.com/zdavatz/diprela/raw/main/cs...
1214 Zwiebelkuchen ... https://github.com/zdavatz/diprela/raw/main/cs...
1215 Zwiebelkuchen ... https://github.com/zdavatz/diprela/raw/main/cs...
1216 Zwiebelkuchen ... https://github.com/zdavatz/diprela/raw/main/cs...
1217 Zwiebelkuchen ... https://github.com/zdavatz/diprela/raw/main/cs...
[1218 rows x 50 columns]
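This description does not say how `sep=None` finds the separator. One plausible heuristic, shown here only as an assumption and not as outguncsv's actual method, is to count each candidate separator per line and prefer the one whose count is the same on the largest number of lines. The name `guess_separator`, the candidate list and the sample are hypothetical:

```python
from collections import Counter

CANDIDATES = (",", ";", "\t", "|")  # assumed candidate list


def guess_separator(text, candidates=CANDIDATES):
    """Pick the candidate whose per-line count is most consistent.

    Returns None when no candidate occurs at all.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    best_sep, best_score = None, (0, 0)
    for sep in candidates:
        counts = Counter(ln.count(sep) for ln in lines)
        counts.pop(0, None)  # lines without this separator carry no signal
        if not counts:
            continue
        per_line, n_lines = counts.most_common(1)[0]
        score = (n_lines, per_line)  # more agreeing lines first, then more fields
        if score > best_score:
            best_sep, best_score = sep, score
    return best_sep


# Made-up sample with a title line that contains no separator.
sample = (
    "Title line without any separator\n"
    "Name;Synonym;Category\n"
    "Agar Agar;;Gelier- und Bindemittel\n"
    "Zwieback;Vollkorn;Brot und Backware\n"
)
print(guess_separator(sample))  # ;
```

A counting heuristic like this ignores quoting, so a quoted field that contains the separator can skew the result.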
from outguncsv import read_balky_csv_files
import glob
alf = glob.glob(
    r"C:\Users\Gamer\Documents\Downloads\anyascii-master\input\tables\*.tsv"
)
df = read_balky_csv_files(
    csvfiles=alf,  # list or string (URL or file path)
    encoding="utf-8",
    sep=None,  # if None, it does its best to find the right separator
    regexremove=(),  # regexes for lines to remove; each must be a bytes pattern, e.g. rb"^\s*#\s+.*$"
    filepathcolumn="file",  # a new column with the file path will be created
    on_bad_lines="skip",  # use either "skip" or "warn"; "error" won't work
    # for additional *args, **kwargs, see https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
)
alf = r"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
df2 = read_balky_csv_files(
    csvfiles=alf,
    encoding="utf-8",
    sep=None,
    regexremove=(),
    filepathcolumn="file",
    on_bad_lines="skip",
)
alf3 = r"https://github.com/zdavatz/diprela/raw/main/csv/diprela.csv"
df3 = read_balky_csv_files(
    csvfiles=alf3,
    encoding="utf-8",
    sep=None,
    regexremove=(),
    filepathcolumn="file",
    on_bad_lines="warn",
)
alf4 = "https://github.com/curran/data/raw/gh-pages/migrants/events.csv"
df4 = read_balky_csv_files(
    csvfiles=alf4,
    encoding="utf-8",
    sep=",",
    regexremove=(),
    filepathcolumn="file",
    on_bad_lines="skip",
)
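The `regexremove` argument takes bytes patterns such as rb"^\s*#\s+.*$" for lines that should be dropped. The sketch below shows what such a filter could look like on raw bytes before parsing. It is an illustration under that assumption, not the package's implementation, and the helper name `drop_matching_lines` is hypothetical:

```python
import re


def drop_matching_lines(raw, patterns):
    """Remove every line of `raw` (bytes) that matches one of the bytes patterns."""
    compiled = [re.compile(p) for p in patterns]
    kept = [
        line
        for line in raw.splitlines(keepends=True)
        if not any(rx.match(line) for rx in compiled)
    ]
    return b"".join(kept)


# Made-up input with two comment lines mixed into the data.
raw = b"# generated file\na;b\n  # another note\n1;2\n"
cleaned = drop_matching_lines(raw, (rb"^\s*#\s+.*$",))
print(cleaned)  # b'a;b\n1;2\n'
```

Working on bytes means the filter can run before the encoding is applied, which is presumably why the patterns must be given as `rb"..."` literals.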
df
Out[3]:
0 1 file
0 𞤢 a C:\Users\Gamer\Documents\Downloads\anyascii-ma...
1 𞤣 d C:\Users\Gamer\Documents\Downloads\anyascii-ma...
2 𞤤 l C:\Users\Gamer\Documents\Downloads\anyascii-ma...
3 𞤥 m C:\Users\Gamer\Documents\Downloads\anyascii-ma...
4 𞤦 b C:\Users\Gamer\Documents\Downloads\anyascii-ma...
.. .. ...
14439 𜾿 - C:\Users\Gamer\Documents\Downloads\anyascii-ma...
14440 𜿀 - C:\Users\Gamer\Documents\Downloads\anyascii-ma...
14441 𜿁 - C:\Users\Gamer\Documents\Downloads\anyascii-ma...
14442 𜿂 - C:\Users\Gamer\Documents\Downloads\anyascii-ma...
14443 𜿃 - C:\Users\Gamer\Documents\Downloads\anyascii-ma...
[14444 rows x 3 columns]
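The output above has positional column names (0, 1) plus a `file` column that records where each row came from. A rough standard-library sketch of that idea, assuming in-memory text instead of real files; `read_many` and the two sample sources are invented for illustration and say nothing about how outguncsv does it internally:

```python
import csv
import io


def read_many(sources, sep, file_column="file"):
    """Parse several texts and tag every row with the name of its source.

    `sources` maps a source name to its text content.
    """
    rows = []
    for name, text in sources.items():
        for record in csv.reader(io.StringIO(text), delimiter=sep):
            row = dict(enumerate(record))  # positional keys 0, 1, ...
            row[file_column] = name
            rows.append(row)
    return rows


# Made-up stand-ins for two small .tsv files.
sources = {
    "latin.tsv": "a\tA\nb\tB\n",
    "greek.tsv": "α\ta\nβ\tb\n",
}
rows = read_many(sources, sep="\t")
print(rows[0])  # {0: 'a', 1: 'A', 'file': 'latin.tsv'}
```

Keeping the source name on every row makes it possible to trace a suspicious value back to the file it came from after everything has been concatenated.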
df2
Out[4]:
0 1 2 ... 10 11 file
0 1 0 3 ... NaN S https://raw.githubusercontent.com/pandas-dev/p...
1 2 1 1 ... C85 C https://raw.githubusercontent.com/pandas-dev/p...
2 3 1 3 ... NaN S https://raw.githubusercontent.com/pandas-dev/p...
3 4 1 1 ... C123 S https://raw.githubusercontent.com/pandas-dev/p...
4 5 0 3 ... NaN S https://raw.githubusercontent.com/pandas-dev/p...
.. ... .. .. ... ... .. ...
886 887 0 2 ... NaN S https://raw.githubusercontent.com/pandas-dev/p...
887 888 1 1 ... B42 S https://raw.githubusercontent.com/pandas-dev/p...
888 889 0 3 ... NaN S https://raw.githubusercontent.com/pandas-dev/p...
889 890 1 1 ... C148 C https://raw.githubusercontent.com/pandas-dev/p...
890 891 0 3 ... NaN Q https://raw.githubusercontent.com/pandas-dev/p...
[891 rows x 13 columns]
df3
Out[5]:
0 ... file
0 Schweizerische Nährwertdatenbank komplett (bea... ... https://github.com/zdavatz/diprela/raw/main/cs...
1 NaN ... https://github.com/zdavatz/diprela/raw/main/cs...
2 Name ... https://github.com/zdavatz/diprela/raw/main/cs...
3 Agar Agar ... https://github.com/zdavatz/diprela/raw/main/cs...
4 Agavensirup ... https://github.com/zdavatz/diprela/raw/main/cs...
... ... ...
1213 Zwiebel ... https://github.com/zdavatz/diprela/raw/main/cs...
1214 Zwiebelkuchen ... https://github.com/zdavatz/diprela/raw/main/cs...
1215 Zwiebelkuchen ... https://github.com/zdavatz/diprela/raw/main/cs...
1216 Zwiebelkuchen ... https://github.com/zdavatz/diprela/raw/main/cs...
1217 Zwiebelkuchen ... https://github.com/zdavatz/diprela/raw/main/cs...
[1218 rows x 50 columns]
df4
Out[6]:
0 ... file
0 NaN ... https://github.com/curran/data/raw/gh-pages/mi...
1 57234.0 ... https://github.com/curran/data/raw/gh-pages/mi...
2 56633.0 ... https://github.com/curran/data/raw/gh-pages/mi...
3 72740.0 ... https://github.com/curran/data/raw/gh-pages/mi...
4 55194.0 ... https://github.com/curran/data/raw/gh-pages/mi...
.. ... ... ...
955 36496.0 ... https://github.com/curran/data/raw/gh-pages/mi...
956 36500.0 ... https://github.com/curran/data/raw/gh-pages/mi...
957 36499.0 ... https://github.com/curran/data/raw/gh-pages/mi...
958 36503.0 ... https://github.com/curran/data/raw/gh-pages/mi...
959 36502.0 ... https://github.com/curran/data/raw/gh-pages/mi...
[960 rows x 22 columns]