
A data parser for converting newline-delimited logs into tabular format.

Project description

Parse raw logs into tabular format

This package parses newline-delimited logs into tabular format. The user provides a regex, a file path, and column names, and a dataframe is returned.
Depending on the supplied mode (local/spark), either a pandas DataFrame or a Spark DataFrame is returned.

Features

Local mode

  1. Regex matching is parallelized with multiprocessing.
  2. Glob patterns are supported for locating files.
  3. Files are evaluated lazily, so larger-than-memory datasets can be parsed; note, however, that the resulting pandas DataFrame must still fit in memory.
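The combination of glob searching and lazy, line-by-line matching can be sketched with the standard library alone. This is an illustration of the idea, not the package's actual implementation (which additionally spreads the regex matching across processes via multiprocessing):

```python
import glob
import os
import re
import tempfile

def iter_matches(path_pattern, regex):
    """Lazily yield regex capture groups from every line of every file
    matching the glob pattern, streaming files instead of loading them."""
    compiled = re.compile(regex)
    for path in sorted(glob.glob(path_pattern)):
        with open(path) as fh:
            for line in fh:  # iterating a file object reads one line at a time
                m = compiled.match(line)
                if m:
                    yield m.groups()

# Demo with a throwaway log file containing the sample Bind 9 line
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, 'dns.txt'), 'w') as fh:
    fh.write('Feb  5 09:12:11 ns1 named[80090]: '
             'client 192.168.10.12#3261: query: www.server.example IN A\n')

rows = list(iter_matches(
    os.path.join(tmpdir, '*.txt'),
    r'^([A-Z][a-z]{2})\s+(\d+) (\d{2}:\d{2}:\d{2}) (\S+).+client ([^\s#]+)#(\d+)',
))
print(rows)
```

Because `iter_matches` is a generator, nothing is read until rows are consumed, which is what allows larger-than-memory inputs; only the final materialized table must fit in memory.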

Installation

Purely for local usage (No pyspark)

pip install data-parser

Both local and pyspark

pip install data-parser[pyspark]

Usage - Local (Pandas)

from data_parser import DataSource

# Bind 9: Feb  5 09:12:11 ns1 named[80090]: client 192.168.10.12#3261: query: www.server.example IN A
dns = DataSource(
    path='/path/to/dnsdir/*.txt',  # Glob patterns supported
    mode='local'
)

# Pandas dataframe is returned
dns_df = dns.parse(
    regex=r'^([A-Z][a-z]{2})\s+(\d+) (\d{2}:\d{2}:\d{2}) (\S+).+client ([^\s#]+)#(\d+)',
    col_names=['month', 'day', 'time', 'nameserver', 'query_ip', 'port'],
    on_error='raise'
)
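Before pointing the parser at a whole directory, it can help to sanity-check the pattern against one sample line with the standard `re` module (stdlib only, independent of this package). The groups come back in the same order as `col_names`:

```python
import re

# Sample Bind 9 query log line from above
line = ('Feb  5 09:12:11 ns1 named[80090]: '
        'client 192.168.10.12#3261: query: www.server.example IN A')

# Same pattern as in the usage example; the raw string (r'...') keeps
# backslashes intact so they reach the regex engine unmodified
pattern = r'^([A-Z][a-z]{2})\s+(\d+) (\d{2}:\d{2}:\d{2}) (\S+).+client ([^\s#]+)#(\d+)'

groups = re.match(pattern, line).groups()
print(groups)  # ('Feb', '5', '09:12:11', 'ns1', '192.168.10.12', '3261')
```

Each capture group maps positionally onto a column: month, day, time, nameserver, query_ip, port.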

Usage - Spark (Pyspark)

from data_parser import DataSource

# Bind 9: Feb  5 09:12:11 ns1 named[80090]: client 192.168.10.12#3261: query: www.server.example IN A
dns = DataSource(
    path='/path/to/dns/log',
    mode='spark'
)

# Spark dataframe is returned
dns_df = dns.parse(
    regex=r'^([A-Z][a-z]{2})\s+(\d+) (\d{2}:\d{2}:\d{2}) (\S+).+client ([^\s#]+)#(\d+)',
    col_names=['month', 'day', 'time', 'nameserver', 'query_ip', 'port'],
    on_error='raise'
)

Project details


Download files

Download the file for your platform.

Source Distribution

data_parser-0.0.3.tar.gz (4.0 kB)

Uploaded Source

Built Distribution


data_parser-0.0.3-py3-none-any.whl (4.9 kB)

Uploaded Python 3

File details

Details for the file data_parser-0.0.3.tar.gz.

File metadata

  • Download URL: data_parser-0.0.3.tar.gz
  • Upload date:
  • Size: 4.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.5

File hashes

Hashes for data_parser-0.0.3.tar.gz
Algorithm Hash digest
SHA256 603e10d3545be6ebdf26f156c261d68fca83ba8181790b85a6708cc2028481d8
MD5 040b8eae2102962a42a646c2ac881451
BLAKE2b-256 121a8173c02602628e06bf7914d47cd7a6849a15f37ecf56abb3a4db625a904e


File details

Details for the file data_parser-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: data_parser-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 4.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.5

File hashes

Hashes for data_parser-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 0f4936221a50471869d802b6ae100bf2f0c1db957411ec9791d319f5bc12214b
MD5 b8f8aaf938b0d495ed85de43c685f7ed
BLAKE2b-256 fd0ac6860778eb3de09a321dcee3180fb7c75d3349e90195b4c1e430c90b6e13

