This is a Python library for parsing files. It provides tools to read a file easily and efficiently. It is modeled on Linux commands such as grep, sed, cat, head, and tail, and is tested against them.

file utils

Intro

This package reads and parses files in Python. When should you use it? When your file is really big (> 100 000 lines). To parse a file in plain Python, you would typically write:

# The context manager closes the file even if reading fails
with open("my_file", "r") as f:
    buffer: str = f.read()
...

or:

# The context manager closes the file even if reading fails
with open("my_file", "r") as f:
    for line in f.readlines():
        ...

  • With the first one, there is a memory issue because the entire file must be held in a buffer.
  • With the second one, there is a time issue because a loop can be very slow in Python.
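For comparison, plain Python can sidestep both issues for simple tasks by reading the file in fixed-size binary chunks. The sketch below counts newline characters that way, in the spirit of wc -l. The function name and the chunk size are illustrative choices and are not part of this package.

```python
def count_newlines(path: str, chunk_size: int = 1 << 20) -> int:
    """Count newline bytes without loading the whole file or looping per line."""
    total = 0
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total
```

Only one chunk is held in memory at a time, and the per-chunk counting runs in C rather than in a Python-level loop over lines.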

So, this package provides tools to read a file easily and efficiently. It is modeled on Linux tools such as grep, sed, cat, head, and tail, and is tested against them.
The WithEOL class has the same memory problem as the first example. To avoid it, use WithCustomDelims with the "\n" delimiter.
So why keep WithEOL?
WithEOL helps me test the code: it uses a built-in Rust function, and I use it as a reference to compare against WithCustomDelims.

Installation

Python

With PyPI:

pip install file-utils

From source:

maturin develop

Rust

cargo add file_utils

Before-starting

This package supports ASCII and UTF-8 files only. Files in any other encoding will not work.
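If you are not sure about a file's encoding, you can check it before handing the file to the package. The helper below is a minimal sketch using only the standard library; is_utf8 is a hypothetical name and is not provided by this package.

```python
import codecs


def is_utf8(path: str, chunk_size: int = 65536) -> bool:
    """Return True if the file decodes cleanly as UTF-8 (ASCII is a subset)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                # The incremental decoder copes with characters split across chunks.
                decoder.decode(chunk)
        # final=True raises if the file ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```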

Arguments-explanation

  • path: the path to the file
  • remove_empty_string: ignore empty strings, i.e. strings matching "[ ]*"
  • n: the number of lines to get with head/tail
  • n1: the first line to take with between
  • n2: the last line to take with between
  • restrict: if enabled, only the first/last n lines are considered and the regexes are applied to those lines, so fewer than n lines may be returned. If disabled, the first/last n lines that satisfy the regexes are returned.

With regex:

  • regex_keep: list of regex to keep
  • regex_pass: list of regex to pass/ignore
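To make the filtering arguments concrete, here is a pure-Python model of how they could be applied to a list of lines. It is a sketch and not the package's implementation: filter_lines is a hypothetical name, and the use of re.search (match anywhere in the line) is an assumption.

```python
import re


def filter_lines(lines, remove_empty_string=False, regex_keep=(), regex_pass=()):
    """Model of remove_empty_string, regex_keep and regex_pass on a list of lines."""
    keep = [re.compile(p) for p in regex_keep]
    skip = [re.compile(p) for p in regex_pass]
    result = []
    for line in lines:
        # "[ ]*" must match the whole line: empty or made of spaces only.
        if remove_empty_string and re.fullmatch(r"[ ]*", line):
            continue
        # With regex_keep, a line must match at least one pattern (assumption).
        if keep and not any(p.search(line) for p in keep):
            continue
        # With regex_pass, a line matching any pattern is ignored.
        if any(p.search(line) for p in skip):
            continue
        result.append(line)
    return result
```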

WithEOL-python

Example-file

We will use this example file, test.txt.

Output of cat -e test.txt:

[Warning]:Entity not found$
[Error]:Unable to recover data$
[Info]:Segfault$
[Warning]:Indentation$
[Error]:Memory leaks$
[Info]:Entity not found$
[Warning]:Unable to recover data$
  $
[Error]:Segfault$
[Info]:Indentation$
[Warning]:Memory leaks$
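If you want to reproduce the examples, the file can be recreated with a few lines of Python. This is only a convenience sketch: write_example_file is a hypothetical helper, and the content is copied from the cat -e output above, where each "$" marks a "\n".

```python
from pathlib import Path

EXAMPLE_LINES = [
    "[Warning]:Entity not found",
    "[Error]:Unable to recover data",
    "[Info]:Segfault",
    "[Warning]:Indentation",
    "[Error]:Memory leaks",
    "[Info]:Entity not found",
    "[Warning]:Unable to recover data",
    "  ",  # line 8 contains two spaces only
    "[Error]:Segfault",
    "[Info]:Indentation",
    "[Warning]:Memory leaks",
]


def write_example_file(path: str = "test.txt") -> None:
    """Write the example file; every line, including the last, ends with a newline."""
    text = "".join(line + "\n" for line in EXAMPLE_LINES)
    # Writing bytes avoids any platform-specific newline translation.
    Path(path).write_bytes(text.encode("utf-8"))
```

Call write_example_file() once, then use "test.txt" as the path in the examples below.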

Example-simple-head-python

Simple head (can be changed to tail).
Code:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 2 # Number of lines to read

try:
    head: list = file_utils_lib.WithEOL.head(path=path, n=n)
    print(head)
except:
    print("Unable to open/read the file")

Stdout:

['[Warning]:Entity not found', '[Error]:Unable to recover data']
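As a reference point, the same result can be obtained in plain Python with itertools.islice, which stops reading after n lines. This is a sketch for comparison only; head_lines is a hypothetical name and says nothing about how the package implements head.

```python
from itertools import islice


def head_lines(path: str, n: int) -> list:
    """Return the first n lines of a UTF-8 file, without their trailing newline."""
    with open(path, "r", encoding="utf-8") as f:
        # islice pulls at most n lines from the file iterator, then stops.
        return [line.rstrip("\n") for line in islice(f, n)]
```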

Example-simple-tail-python

Code:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 2 # Number of lines to read

try:
    tail: list = file_utils_lib.WithEOL.tail(path=path, n=n)
    print(tail)
except:
    print("Unable to open/read the file")

Stdout:

['[Info]:Indentation', '[Warning]:Memory leaks']
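A plain-Python counterpart for tail can use collections.deque with maxlen, which keeps only the last n lines in memory while the file is scanned once. Again, this is a comparison sketch with a hypothetical name (tail_lines), not the package's code.

```python
from collections import deque


def tail_lines(path: str, n: int) -> list:
    """Return the last n lines of a UTF-8 file, without their trailing newline."""
    with open(path, "r", encoding="utf-8") as f:
        # A bounded deque discards older lines as newer ones arrive.
        last = deque(f, maxlen=n)
    return [line.rstrip("\n") for line in last]
```

Memory use stays bounded by n lines, but the whole file is still read from start to end.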

Example-simple-between-python

Code:

import file_utils_lib

path: str = "my_path_to_file"
n1: int = 2 # First line to read
n2: int = 4 # Last line to read

try:
    between: list = file_utils_lib.WithEOL.between(path=path, n1=n1, n2=n2)
    print(between)
except:
    print("Unable to open/read the file")

Stdout:

['[Error]:Unable to recover data', '[Info]:Segfault', '[Warning]:Indentation']
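The output above shows that n1 and n2 are 1-indexed and both inclusive. A small model on an in-memory list makes the indexing explicit. The edge cases follow the comments in the Python-class section; clamping n1 below 1 is an assumption of this sketch.

```python
def between(lines: list, n1: int, n2: int) -> list:
    """Model of between: lines n1..n2, 1-indexed, both ends inclusive."""
    # n1 > n2, or n1 past the end of the file, gives an empty list.
    if n1 > n2 or n1 > len(lines):
        return []
    # Convert the 1-indexed inclusive range to a Python slice.
    return lines[max(n1, 1) - 1 : n2]


lines = [
    "[Warning]:Entity not found",
    "[Error]:Unable to recover data",
    "[Info]:Segfault",
    "[Warning]:Indentation",
    "[Error]:Memory leaks",
]
print(between(lines, 2, 4))
```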

Example-simple-parse-python

Code:

import file_utils_lib

path: str = "my_path_to_file"

try:
    parse: list = file_utils_lib.WithEOL.parse(path=path)
    print(parse)
except:
    print("Unable to open/read the file")

Stdout:

['[Warning]:Entity not found', '[Error]:Unable to recover data', '[Info]:Segfault', '[Warning]:Indentation', '[Error]:Memory leaks', '[Info]:Entity not found', '[Warning]:Unable to recover data', '  ', '[Error]:Segfault', '[Info]:Indentation', '[Warning]:Memory leaks']

Example-simple-count_lines-python

Code:

import file_utils_lib

path: str = "my_path_to_file"

try:
    count: int = file_utils_lib.WithEOL.count_lines(path=path)
    print(count)
except:
    print("Unable to open/read the file")

Stdout:

11

Example-remove_empty_string-python

With remove_empty_string enabled:
Code:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 4 # Number of lines to read

try:
    tail: list = file_utils_lib.WithEOL.tail(path=path, n=n, remove_empty_string=True)
    print(tail)
except:
    print("Unable to open/read the file")

Stdout:

['[Warning]:Unable to recover data', '[Error]:Segfault', '[Info]:Indentation', '[Warning]:Memory leaks']

With remove_empty_string disabled (default option):
Code:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 4 # Number of lines to read

try:
    tail: list = file_utils_lib.WithEOL.tail(path=path, n=n, remove_empty_string=False)
    print(tail)
except:
    print("Unable to open/read the file")

Stdout:

['  ', '[Error]:Segfault', '[Info]:Indentation', '[Warning]:Memory leaks']
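Comparing the two outputs suggests that empty strings are dropped before the last n lines are selected, since 4 non-empty lines come back when the option is enabled. The model below reproduces both outputs on the last five lines of test.txt. It is inferred from the outputs only, and tail_model is a hypothetical name.

```python
import re


def tail_model(lines: list, n: int, remove_empty_string: bool = False) -> list:
    """Model of tail where empty strings are removed before taking n lines."""
    if remove_empty_string:
        # Drop lines that are empty or contain spaces only ("[ ]*").
        lines = [line for line in lines if not re.fullmatch(r"[ ]*", line)]
    return lines[-n:] if n > 0 else []


last_lines = [
    "[Warning]:Unable to recover data",
    "  ",
    "[Error]:Segfault",
    "[Info]:Indentation",
    "[Warning]:Memory leaks",
]
print(tail_model(last_lines, 4, remove_empty_string=True))
print(tail_model(last_lines, 4, remove_empty_string=False))
```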

Example-regex_keep-python

Code:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 4 # Number of lines to read

try:
    # Raw strings keep the regex backslashes intact
    head: list = file_utils_lib.WithEOL.head(path=path, n=n, remove_empty_string=False, regex_keep=[r"\[Warning\]:*", r"\[Error\]:*"])
    print(head)
except:
    print("Unable to open/read the file")

Stdout:

['[Warning]:Entity not found', '[Error]:Unable to recover data', '[Warning]:Indentation']

Why are there just 3 elements instead of 4? Take a look at the restrict option.

Example-regex_pass-python

Code:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 4 # Number of lines to read

try:
    # Raw strings keep the regex backslashes intact
    head: list = file_utils_lib.WithEOL.head(path=path, n=n, remove_empty_string=False, regex_pass=[r"\[Warning\]:*", r"\[Error\]:*"])
    print(head)
except:
    print("Unable to open/read the file")

Stdout:

['[Info]:Segfault']

Why is there just 1 element instead of 4? Take a look at the restrict option.

Example-restrict-python

With restrict disabled:
Code:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 4 # Number of lines to read

try:
    head: list = file_utils_lib.WithEOL.head(path=path, n=n, remove_empty_string=False, regex_keep=[r"\[Warning\]:*", r"\[Error\]:*"], restrict=False)
    print(head)
except:
    print("Unable to open/read the file")

Stdout:

['[Warning]:Entity not found', '[Error]:Unable to recover data', '[Warning]:Indentation', '[Error]:Memory leaks']

With restrict enabled (default):
Code:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 4 # Number of lines to read

try:
    head: list = file_utils_lib.WithEOL.head(path=path, n=n, remove_empty_string=False, regex_keep=[r"\[Warning\]:*", r"\[Error\]:*"], restrict=True)
    print(head)
except:
    print("Unable to open/read the file")

Stdout:

['[Warning]:Entity not found', '[Error]:Unable to recover data', '[Warning]:Indentation']
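The difference between the two modes comes down to the order of two steps: slice then filter (restrict enabled), or filter then slice (restrict disabled). The model below reproduces both outputs on the first five lines of test.txt. It is inferred from the outputs, head_model is a hypothetical name, and matching with re.search is an assumption.

```python
import re


def head_model(lines: list, n: int, regex_keep: list, restrict: bool = True) -> list:
    """Model of head with regex_keep, showing what restrict changes."""
    patterns = [re.compile(p) for p in regex_keep]

    def kept(line: str) -> bool:
        return any(p.search(line) for p in patterns)

    if restrict:
        # Take the first n lines, then keep the matching ones (may return < n).
        return [line for line in lines[:n] if kept(line)]
    # Keep the matching lines, then take the first n of them.
    return [line for line in lines if kept(line)][:n]


first_lines = [
    "[Warning]:Entity not found",
    "[Error]:Unable to recover data",
    "[Info]:Segfault",
    "[Warning]:Indentation",
    "[Error]:Memory leaks",
]
keep = [r"\[Warning\]:*", r"\[Error\]:*"]
print(head_model(first_lines, 4, keep, restrict=True))
print(head_model(first_lines, 4, keep, restrict=False))
```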

WithCustomDelims-python

How-to-use-it-python

It is like WithEOL, but it takes a list of custom delimiters. For example:

import file_utils_lib

path: str = "my_path_to_file"
n: int = 2 # Number of lines to read

try:
    head: list = file_utils_lib.WithEOL.head(path=path, n=n)
    print(head)
except:
    print("Unable to open/read the file")

Stdout:

['[Warning]:Entity not found', '[Error]:Unable to recover data']

has the same behaviour as

import file_utils_lib

path: str = "my_path_to_file"
n: int = 2 # Number of lines to read

try:
    head: list = file_utils_lib.WithCustomDelims.head(path=path, n=n, delimiter=["\n"])
    print(head)
except:
    print("Unable to open/read the file")

Stdout:

['[Warning]:Entity not found', '[Error]:Unable to recover data']

So, you use it the same way as WithEOL, but with a list of custom delimiters.

What-delim-can-be-used

Any string can be used as a delimiter, for example:

  • ";"
  • "abc"
  • "éà"
  • "::"
  • "小六号"
  • "毫"

With-more-than-one-delimiter

If my file contains:

;À ;la ;;
pêche éèaux moules, @moules, ::小六号moules::Je n'veux小六号 plus ::y 
aller éèmaman小六号

With the delimiters ";", "\n", "éè", "@", "小六号", "::", we get:

import file_utils_lib

path: str = "my_path_to_file"

try:
    parse: list = file_utils_lib.WithCustomDelims.parse(path=path, delimiter=[";", "\n", "éè", "@", "小六号", "::"])
    print(parse)
except:
    print("Unable to open/read the file")

Stdout:

['', 'À ', 'la ', '', '', 'pêche ', 'aux moules, ', 'moules, ', '', 'moules', "Je n'veux", ' plus ', 'y ', 'aller ', 'maman', '']
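The same split can be reproduced on an in-memory string with the standard re module, which is a handy way to check what a delimiter list will produce. This is an illustrative sketch: split_on_delimiters is a hypothetical name, the text is the file content shown above without a final newline, and nothing here reflects how the package does the work internally.

```python
import re


def split_on_delimiters(text: str, delimiters: list) -> list:
    """Split text on any of the given literal delimiters."""
    if not delimiters:
        return [text]
    # Longest first, so a delimiter is never shadowed by a shorter prefix.
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape makes each delimiter match literally.
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, text)


text = ";À ;la ;;\npêche éèaux moules, @moules, ::小六号moules::Je n'veux小六号 plus ::y \naller éèmaman小六号"
print(split_on_delimiters(text, [";", "\n", "éè", "@", "小六号", "::"]))
```

Two adjacent delimiters, or a delimiter at the very start or end of the text, produce an empty string in the result, which is why '' appears several times in the stdout above.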

How-to-use-the-rust-crate?

You must import the library with:

use file_utils_lib::with_custom_delims::WithCustomDelims;

or

use file_utils_lib::with_eol::WithEOL;

Then you can use the same functions as in Python, because they have the same behaviour.
Example:

use file_utils_lib::with_custom_delims::WithCustomDelims;

fn main() {
    let mut delimiters: Vec<String> = Vec::new();
    delimiters.push("\n".to_string());
    let n: usize = 10;
    let res: Vec<String> = WithCustomDelims::head(
        "my path".to_string(),
        n,
        delimiters,
        false,
        Vec::new(),
        Vec::new(),
        true,
        1024,
    );
}

This behaves like the following Python call (which uses n = 2 instead of 10):

import file_utils_lib

path: str = "my_path_to_file"
n: int = 2 # Number of lines to read

try:
    head: list = file_utils_lib.WithEOL.head(path=path, n=n)
    print(head)
except:
    print("Unable to open/read the file")

Python-class

If we translate the Rust into Python, we get:

class WithEOL:
    # head: Read the first n lines
    # if n > (number of lines in the file) => return the whole file
    def head(path: str, n: int,
             remove_empty_string: bool = False,
             regex_keep: list = [],
             regex_pass: list = [],
             restrict: bool = True):
        ...

    # between: Read the lines [n1, n2]
    # if n1 > n2 => return an empty list
    # if n1 > (number of lines in the file) => return an empty list
    def between(path: str, n1: int, n2: int,
                remove_empty_string: bool = False,
                regex_keep: list = [],
                regex_pass: list = [],
                restrict: bool = True):
        ...

    # tail: Read the last n lines
    # if n > (number of lines in the file) => return the whole file
    def tail(path: str, n: int,
             remove_empty_string: bool = False,
             regex_keep: list = [],
             regex_pass: list = [],
             restrict: bool = True):
        ...

    # parse: Read the whole file
    def parse(path: str,
              remove_empty_string: bool = False,
              regex_keep: list = [],
              regex_pass: list = []):
        ...

    # count_lines: Count the number of lines
    def count_lines(path: str,
                    remove_empty_string: bool = False,
                    regex_keep: list = [],
                    regex_pass: list = []):
        ...

class WithCustomDelims:
    # head: Read the first n lines
    # if n > (number of lines in the file) => return the whole file
    def head(path: str, n: int, delimiter: list,
             remove_empty_string: bool = False,
             regex_keep: list = [],
             regex_pass: list = [],
             restrict: bool = True,
             buffer_size: int = 1024):
        ...

    # between: Read the lines [n1, n2]
    # if n1 > n2 => return an empty list
    # if n1 > (number of lines in the file) => return an empty list
    def between(path: str, n1: int, n2: int, delimiter: list,
                remove_empty_string: bool = False,
                regex_keep: list = [],
                regex_pass: list = [],
                restrict: bool = True,
                buffer_size: int = 1024):
        ...

    # tail: Read the last n lines
    # if n > (number of lines in the file) => return the whole file
    def tail(path: str, n: int, delimiter: list,
             remove_empty_string: bool = False,
             regex_keep: list = [],
             regex_pass: list = [],
             restrict: bool = True,
             buffer_size: int = 1024):
        ...

    # parse: Read the whole file
    def parse(path: str, delimiter: list,
              remove_empty_string: bool = False,
              regex_keep: list = [],
              regex_pass: list = [],
              buffer_size: int = 1024):
        ...

    # count_lines: Count the number of lines
    def count_lines(path: str, delimiter: list,
                    remove_empty_string: bool = False,
                    regex_keep: list = [],
                    regex_pass: list = [],
                    buffer_size: int = 1024):
        ...
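The buffer_size argument hints at why WithCustomDelims can avoid holding the whole file in memory: the file can be read in fixed-size chunks and cut on the delimiter as the chunks arrive. The generator below sketches that idea in pure Python for a single delimiter. It is an assumption about the general technique, not a description of the Rust implementation, and iter_delimited is a hypothetical name.

```python
def iter_delimited(path: str, delimiter: str, buffer_size: int = 1024):
    """Yield the pieces of a UTF-8 file separated by one delimiter, chunk by chunk."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    delim = delimiter.encode("utf-8")
    pending = b""  # bytes read so far that are not yet closed by a delimiter
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(buffer_size), b""):
            pending += chunk
            # Everything but the last piece is complete; the last piece may
            # still grow, or end with the first bytes of a split delimiter.
            *complete, pending = pending.split(delim)
            for piece in complete:
                yield piece.decode("utf-8")
    # Whatever remains after EOF is the final piece (possibly empty).
    yield pending.decode("utf-8")
```

Because the split is done on bytes, a delimiter or a multi-byte character that straddles two chunks is handled correctly: the incomplete tail simply waits in pending until the next chunk arrives.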

Rust-Structure

Take a look at https://docs.rs/file_utils/latest/file_utils_lib/

Structure

  • src/: all source files
  • tests/: all Rust tests
  • tests_files/: all files used by the tests
  • tests_python/: a Python test script
