
Finds the right CSV separator and excludes bad lines in corrupt CSV files


Example:

# You have probably seen this before, right?

import pandas as pd

pd.read_csv(r"https://github.com/zdavatz/diprela/raw/main/csv/diprela.csv")

Traceback (most recent call last):

  File "C:\Users\Gamer\anaconda3\envs\dfdir\lib\site-packages\IPython\core\interactiveshell.py", line 3398, in run_code

    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-8-c5703c9ae399>", line 1, in <cell line: 1>

  ....

    File "pandas\_libs\parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 10







pd.read_csv(r"https://github.com/zdavatz/diprela/raw/main/csv/diprela.csv",on_bad_lines='skip')

# Better, but everything ends up in one column, and we have lost roughly 700 rows (526 instead of 1218).

    Schweizerische Nährwertdatenbank komplett (bearbeitet durch Lesley Grünenfelder) (Stand: 11.02.2023) ;;;;;;;;;;;;;;;;;;;;;;;;;;;Schweizerische Nährwertdatenbank komplett (bearbeitet durch Lesley Grünenfelder) (Stand: 30.01.2023) ;;;;;;;;;;;;;;;;;;;;;

0     ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;                                                                                                                                                                                                        

1    Agar Agar;;Gelier- und Bindemittel ;Verschiede...                                                                                                                                                                                                        

2    Agavensirup;Agavendicksaft ;Zucker und Süsssto...                                                                                                                                                                                                        

3    Ahornsirup;;Zucker und Süssstoffe;Verschiedene...                                                                                                                                                                                                        

4    Älplermagronen;zubereitet;salzige Gerichte ;Ge...                                                                                                                                                                                                        

..                                                 ...                                                                                                                                                                                                        

521  Zwieback ;;Brot und Backware ;Getreide und Get...                                                                                                                                                                                                        

522  Zwieback;Vollkorn;Brot und Backware ;Getreide ...                                                                                                                                                                                                        

523  Zwiebel;gedünstet (ohne Zugabe von Fett und Sa...                                                                                                                                                                                                        

524  Zwiebel;geröstet (ohne Zugabe von Fett und Sal...                                                                                                                                                                                                        

525  Zwiebel;roh;Gemüse frisch ;Gemüse ;pro 100g es...                                                                                                                                                                                                        

[526 rows x 1 columns]
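
A likely explanation for the missing rows: pandas treats a line as bad when it has more fields than expected, and with the default comma separator the expected count comes from the first line. Any later line that happens to contain commas is then dropped. The sketch below imitates that rule with the standard library only. The sample text and the helper `split_kept_and_skipped` are made up for illustration and are not taken from diprela.csv or from outguncsv.

```python
import csv
import io

# Made-up sample in the style of a semicolon-separated file (not real diprela.csv data).
SAMPLE = (
    "Name;Category;Energy\n"
    "Agar Agar;Gelling agent;160\n"
    "Onion;raw, peeled;28\n"
    "Rusk;Bread, baked goods;1,5\n"
)


def split_kept_and_skipped(text, delimiter=","):
    """Imitate 'skip lines with too many fields' for the given delimiter."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    expected = len(rows[0])  # field count taken from the first line
    kept = [r for r in rows[1:] if len(r) <= expected]
    skipped = [r for r in rows[1:] if len(r) > expected]
    return kept, skipped


# Wrong separator: lines containing commas have "too many" fields and are lost.
kept_comma, skipped_comma = split_kept_and_skipped(SAMPLE, ",")
# Right separator: every line has the same number of fields.
kept_semi, skipped_semi = split_kept_and_skipped(SAMPLE, ";")

print(len(kept_comma), len(skipped_comma))
print(len(kept_semi), len(skipped_semi))
```

With the comma, only the one data line without a comma survives, and it still sits in a single column. With the semicolon, nothing is skipped.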







# If you have problems reading a CSV file, use: 

from outguncsv import read_balky_csv_files

alf3 = r"https://github.com/zdavatz/diprela/raw/main/csv/diprela.csv"

df3 = read_balky_csv_files(

    csvfiles=alf3,

    encoding="utf-8",

    sep=None,

    regexremove=(),

    filepathcolumn="file",

    on_bad_lines="warn",

)





df3

Out[4]: 

                                                      0  ...                                               file

0     Schweizerische Nährwertdatenbank komplett (bea...  ...  https://github.com/zdavatz/diprela/raw/main/cs...

1                                                   NaN  ...  https://github.com/zdavatz/diprela/raw/main/cs...

2                                                  Name  ...  https://github.com/zdavatz/diprela/raw/main/cs...

3                                             Agar Agar  ...  https://github.com/zdavatz/diprela/raw/main/cs...

4                                           Agavensirup  ...  https://github.com/zdavatz/diprela/raw/main/cs...

                                                 ...  ...                                                ...

1213                                            Zwiebel  ...  https://github.com/zdavatz/diprela/raw/main/cs...

1214                                      Zwiebelkuchen  ...  https://github.com/zdavatz/diprela/raw/main/cs...

1215                                      Zwiebelkuchen  ...  https://github.com/zdavatz/diprela/raw/main/cs...

1216                                      Zwiebelkuchen  ...  https://github.com/zdavatz/diprela/raw/main/cs...

1217                                      Zwiebelkuchen  ...  https://github.com/zdavatz/diprela/raw/main/cs...

[1218 rows x 50 columns]
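
With `sep=None` the separator was detected automatically. How outguncsv does this internally is not described here, so the following is only a minimal sketch of one common heuristic, under the assumption that the right separator is the candidate that occurs the same number of times on the most lines. The function `guess_separator`, the candidate list, and the sample text are hypothetical.

```python
from collections import Counter

# Hypothetical candidate list; extend it if your files use something else.
CANDIDATES = (",", ";", "\t", "|")


def guess_separator(text, candidates=CANDIDATES, max_lines=50):
    """Return the candidate whose per-line count agrees on the most lines, or None."""
    lines = [ln for ln in text.splitlines() if ln.strip()][:max_lines]
    best_sep, best_score = None, (0, 0)
    for sep in candidates:
        counts = Counter(ln.count(sep) for ln in lines)
        counts.pop(0, None)  # lines without the candidate say nothing about it
        if not counts:
            continue
        per_line, n_lines = counts.most_common(1)[0]
        score = (n_lines, per_line)  # prefer agreement first, then more fields
        if score > best_score:
            best_sep, best_score = sep, score
    return best_sep


# Made-up sample: a title line, a filler line, a header, and two data lines.
sample = (
    "Title (as of 11.02.2023);;\n"
    ";;\n"
    "Name;Category;Energy\n"
    "Onion;raw, peeled;28\n"
    "Rusk;Bread;1,5\n"
)
detected = guess_separator(sample)
print(repr(detected))
```

The standard library's `csv.Sniffer` is an alternative for well-behaved files. A counting heuristic like this one is easier to inspect when a file has title lines or decimal commas.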

# More examples:

from outguncsv import read_balky_csv_files

import glob

alf = glob.glob(

    r"C:\Users\Gamer\Documents\Downloads\anyascii-master\input\tables\*.tsv"

)

df = read_balky_csv_files(

    csvfiles=alf,  # list or string (URL or file path)

    encoding="utf-8",

    sep=None,  # if None, the function tries to detect the separator itself

    regexremove=(),  # regexes for lines to remove; each must be a bytes pattern, e.g. rb"^\s*#\s+.*$"

    filepathcolumn="file",  # a new column with this name is created, holding the file path

    on_bad_lines="skip",  # use either "skip" or "warn"; "error" won't work

    # for *args, **kwargs see https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

)
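
The `regexremove` comment above says the patterns must be bytes regexes. As a rough illustration of what removing lines by bytes regex can look like before parsing, here is a standard-library sketch. It is an assumption about the general idea, not outguncsv's actual code, and `drop_matching_lines` and the sample bytes are hypothetical.

```python
import re


def drop_matching_lines(raw, patterns):
    """Return the raw bytes without the lines that match any bytes pattern."""
    compiled = [re.compile(p) for p in patterns]
    kept = [
        line
        for line in raw.splitlines(keepends=True)
        if not any(rx.search(line) for rx in compiled)
    ]
    return b"".join(kept)


# Made-up file content with two comment lines.
raw = b"# exported 2023-02-11\nname;value\n  # another comment\nx;1\n"

# Same pattern as in the comment above: lines that start with '#' plus whitespace.
cleaned = drop_matching_lines(raw, (rb"^\s*#\s+.*$",))
print(cleaned)
```

Working on bytes means the filter can run before the encoding is applied, which is presumably why the patterns need the `rb` prefix.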



alf = r"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"

df2 = read_balky_csv_files(

    csvfiles=alf,

    encoding="utf-8",

    sep=None,

    regexremove=(),

    filepathcolumn="file",

    on_bad_lines="skip",

)

alf4 = "https://github.com/curran/data/raw/gh-pages/migrants/events.csv"

df4 = read_balky_csv_files(

    csvfiles=alf4,

    encoding="utf-8",

    sep=",",

    regexremove=(),

    filepathcolumn="file",

    on_bad_lines="skip",

)





df

Out[3]: 

       0  1                                               file

0      𞤢  a  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

1      𞤣  d  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

2      𞤤  l  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

3      𞤥  m  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

4      𞤦  b  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

   .. ..                                                ...

14439  𜾿  -  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

14440  𜿀  -  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

14441  𜿁  -  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

14442  𜿂  -  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

14443  𜿃  -  C:\Users\Gamer\Documents\Downloads\anyascii-ma...

[14444 rows x 3 columns]
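
The last column in this output comes from `filepathcolumn="file"`: every row remembers which file it was read from. The same idea can be sketched with the standard library alone. The helper `read_tsv_files_with_source` and the temporary demo files are hypothetical and only stand in for the anyascii tables used above.

```python
import csv
import glob
import os
import tempfile


def read_tsv_files_with_source(pattern):
    """Read all TSV files matching the glob pattern; append the path to each row."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as fh:
            for fields in csv.reader(fh, delimiter="\t"):
                rows.append(fields + [path])
    return rows


# Demo with two small temporary files instead of real tables.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in (("a.tsv", "x\t1\ny\t2\n"), ("b.tsv", "z\t3\n")):
        with open(os.path.join(tmp, name), "w", encoding="utf-8", newline="") as fh:
            fh.write(body)
    rows = read_tsv_files_with_source(os.path.join(tmp, "*.tsv"))
    sources = [os.path.basename(r[-1]) for r in rows]

print(sources)
```

Keeping the path next to the data makes it possible to trace a suspicious row back to its file after many files have been concatenated.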

df2

Out[4]: 

       0  1  2  ...    10 11                                               file

0      1  0  3  ...   NaN  S  https://raw.githubusercontent.com/pandas-dev/p...

1      2  1  1  ...   C85  C  https://raw.githubusercontent.com/pandas-dev/p...

2      3  1  3  ...   NaN  S  https://raw.githubusercontent.com/pandas-dev/p...

3      4  1  1  ...  C123  S  https://raw.githubusercontent.com/pandas-dev/p...

4      5  0  3  ...   NaN  S  https://raw.githubusercontent.com/pandas-dev/p...

..   ... .. ..  ...   ... ..                                                ...

886  887  0  2  ...   NaN  S  https://raw.githubusercontent.com/pandas-dev/p...

887  888  1  1  ...   B42  S  https://raw.githubusercontent.com/pandas-dev/p...

888  889  0  3  ...   NaN  S  https://raw.githubusercontent.com/pandas-dev/p...

889  890  1  1  ...  C148  C  https://raw.githubusercontent.com/pandas-dev/p...

890  891  0  3  ...   NaN  Q  https://raw.githubusercontent.com/pandas-dev/p...

[891 rows x 13 columns]

df4

Out[6]: 

           0  ...                                               file

0        NaN  ...  https://github.com/curran/data/raw/gh-pages/mi...

1    57234.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

2    56633.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

3    72740.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

4    55194.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

..       ...  ...                                                ...

955  36496.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

956  36500.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

957  36499.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

958  36503.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

959  36502.0  ...  https://github.com/curran/data/raw/gh-pages/mi...

[960 rows x 22 columns]

