Simple file collector - compress/serve/send/anonymize files

filecollector

Service for collecting and processing files (with hooks)

Features

collect files and compress them (on command)
anonymization
run custom scripts on output file / processed files
start/stop simple fileserver (at collect output location)
stream collected files to fluentd (line by line)

Requirements

python 2.7+ / python 3.5+
pip

Installation

pip install filecollector

Usage

filecollector currently has two main components: the collector and the server. The collector collects and anonymizes the files and runs hook scripts on them. The server is only a browser for the collected files.

First, create a YAML configuration file for the collector. This configuration file is the only input that filecollector requires.

Start the collector

filecollector collector start --config filecollector.yaml -p /my/pid/dir

Start the server

filecollector server start --config filecollector.yaml -p /my/pid/dir

Configuration

Simple configuration example

server:
    port: 1999
    folder: "../example/files" 
collector:
    files:
    - path: "example/example*.txt"
      label: "example"
    rules:
    - pattern:  \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}
      replacement: "[REDACTED]"
    processFileScript: example/scripts/process_file.sh
    compress: true
    useFullPath: true
    outputScript: example/scripts/output_file.sh
    processFilesFolderScript: example/scripts/tmp_folder.sh
    deleteProcessedTempFiles: true
    outputLocation: "example/files"

Running simple example:

# start collector 
filecollector collector start --config example/filecollector.yaml -p /my/pid/dir
# start server for browsing
filecollector server start --config example/filecollector.yaml -p /my/pid/dir

Fluentd configuration example

collector:
    files:
    - path: "example/example*.txt"
      label: "txt"
    rules:
    - pattern:  \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}
      replacement: "[REDACTED]"
    compress: false
    useFullPath: true
    deleteProcessedTempFilesOneByOne: true
    outputLocation: "example/files"
    fluentProcessor:
      host: "localhost"
      port: 24224
      tag: example

Fluentd configuration:

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<match example.**>
   @type stdout
</match>

Running fluentd example:

# start fluentd 
fluentd --config example/fluentd.conf
# start collector 
filecollector collector start --config example/fluentd-filecollector.yaml -p /my/pid/dir

Configuration options

`server`

The server block contains the settings for the filecollector server component.

`server.port`

Port that will be used by the filecollector server.

`server.folder`

The folder that is served by the file server.
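To make the role of `server.port` and `server.folder` concrete, here is a minimal sketch of a file browser built on Python's standard library. It is not filecollector's own server code; the helper name `make_file_server` is hypothetical, and the sketch assumes Python 3.7+ (for the `directory` argument and `ThreadingHTTPServer`).

```python
import functools
import http.server


def make_file_server(folder, port):
    """Return an HTTP server that lists and serves the files under `folder`."""
    # SimpleHTTPRequestHandler renders directory listings and serves files;
    # the `directory` keyword pins it to the configured folder.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=folder
    )
    # Bound to localhost only in this sketch; a real deployment may differ.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

With the values from the simple configuration example, `make_file_server("../example/files", 1999).serve_forever()` would expose the collected files at http://127.0.0.1:1999/.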

`collector`

The collector block contains the settings for the filecollector collector component.

`collector.files`

List of files to collect. Each entry has a path and a label, and the path can contain wildcards.
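As an illustration of how wildcard entries could be expanded, the sketch below resolves each configured path with `glob` and keeps the label next to every match. The function name `expand_files` is hypothetical, and whether filecollector itself uses `glob` is an assumption.

```python
import glob
import os


def expand_files(file_entries):
    """Expand each configured path pattern into (path, label) pairs."""
    collected = []
    for entry in file_entries:
        # glob.glob understands shell-style wildcards such as * and ?
        for path in sorted(glob.glob(entry["path"])):
            if os.path.isfile(path):
                collected.append((path, entry["label"]))
    return collected


# Entry copied from the simple configuration example
files = [{"path": "example/example*.txt", "label": "example"}]
print(expand_files(files))
```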

`collector.rules`

List of anonymization rules that are applied to the input files. Each rule has a pattern field (the regular expression to match) and a replacement field (the text that replaces each match).
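The effect of a rule can be tried out with Python's `re` module. This is only a sketch: the helper `anonymize_line` is hypothetical, and it assumes that rules are applied in order with regular-expression substitution. The pattern and replacement are taken from the configuration examples.

```python
import re

# Rule copied from the configuration examples: four groups of four digits,
# separated by any single non-word character (dash, space, ...)
rules = [
    {
        "pattern": r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}",
        "replacement": "[REDACTED]",
    },
]


def anonymize_line(line, rules):
    """Apply every rule to the line, one after another."""
    for rule in rules:
        line = re.sub(rule["pattern"], rule["replacement"], line)
    return line


print(anonymize_line("card: 1234-5678-9012-3456 ok", rules))
# prints: card: [REDACTED] ok
```

Note that a 16-digit number without separators is not matched by this pattern, because `[^\w]` requires a non-word character between the groups.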

`collector.compress`

If enabled, the output folder is compressed at the end of the file collection. Default value is true.

`collector.compressFormat`

Compression format, possible values: zip, tar, gztar, bztar. Default value is zip.
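The four format names are the same ones that Python's `shutil.make_archive` accepts, so the compression step can be sketched as follows. That filecollector delegates to `shutil.make_archive` is an assumption, and `compress_folder` is a hypothetical helper.

```python
import shutil

# Values documented for collector.compressFormat
SUPPORTED_FORMATS = ("zip", "tar", "gztar", "bztar")


def compress_folder(folder, archive_base, compress_format="zip"):
    """Archive the contents of `folder` and return the archive file name.

    `archive_base` is the target path without extension; shutil appends
    .zip, .tar, .tar.gz or .tar.bz2 depending on the format.
    """
    if compress_format not in SUPPORTED_FORMATS:
        raise ValueError("unsupported compressFormat: %s" % compress_format)
    return shutil.make_archive(archive_base, compress_format, root_dir=folder)
```

For example, `compress_folder("example/files", "/tmp/collected", "gztar")` would produce `/tmp/collected.tar.gz`. The archive base should live outside the folder being archived so the archive does not include itself.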

`collector.outputLocation`

Output location (directory), where the processed file(s) will be stored.

`collector.useFullPath`

Use the full path for processed files (inside outputLocation). This is useful when, because of the wildcard patterns, files from different folders share the same base file name. Default value is true.
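The name collision that `useFullPath` avoids can be shown with a small sketch. The exact directory layout filecollector produces inside outputLocation is an assumption here, and `target_path` is a hypothetical helper.

```python
import os


def target_path(source, output_location, use_full_path=True):
    """Compute where a collected file would be stored."""
    if use_full_path:
        # Drop the drive and the leading separator so the whole source path
        # can be nested under output_location
        absolute = os.path.abspath(source)
        relative = os.path.splitdrive(absolute)[1].lstrip(os.sep)
    else:
        # Only the base name is kept, so files from different folders
        # with the same name map to the same target
        relative = os.path.basename(source)
    return os.path.join(output_location, relative)


for src in ("/var/log/a/app.log", "/var/log/b/app.log"):
    print(target_path(src, "example/files"), target_path(src, "example/files", False))
```

With the full path, the two `app.log` files end up in different locations; with only the base name, the second file would overwrite the first.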

`collector.processFileScript`

Script that runs against each processed file. It gets the filename and the label of the processed file.

`collector.processFilesFolderScript`

Script that runs once after the files are collected. It gets the folder name (where the files are processed) as an input.

`collector.preProcessScript`

Script that runs before the files are collected. It gets the folder name (where the files are processed) as an input.

`collector.outputScript`

Script that runs once with the compressed output file name as an input.

`collector.deleteProcessedTempFiles`

After the files are collected and compressed, the collected files are deleted. Disabling this behaviour can be useful if the compress option is disabled. Default value is true.

`collector.deleteProcessedTempFilesOneByOne`

If this option is set, files are deleted right after they are processed (one at a time). That can be useful if compression is disabled and you would like to stream large files to fluentd. Default value is false.
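The benefit of deleting files one at a time is that at most one processed copy sits on disk while streaming. The loop below is a sketch of that idea, not filecollector's implementation; `stream_and_delete` and the `send_line` callback are hypothetical names.

```python
import os


def stream_and_delete(paths, send_line):
    """Hand every line of every file to `send_line`, deleting each file when done."""
    for path in paths:
        with open(path) as handle:
            for line in handle:
                send_line(line.rstrip("\n"))
        # The file is removed as soon as all of its lines were handed off,
        # before the next file is opened
        os.remove(path)
```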

`collector.fluentProcessor`

Fluentd-related section for processing files line by line and streaming the data with the Fluentd forward protocol.

`collector.fluentProcessor.host`

Fluentd host (for forward protocol). Default value: localhost.

`collector.fluentProcessor.port`

Fluentd port (for forward protocol). Default value: 24224.

`collector.fluentProcessor.tag`

Fluentd tag for streaming lines. The generated tag for forward protocol is <collector.fluentProcessor.tag>.<file label for monitored file>.
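The tag rule is simple string concatenation, sketched here with a hypothetical helper `build_tag`. The values come from the Fluentd configuration example (tag `example`, file label `txt`).

```python
def build_tag(base_tag, file_label):
    """Join the configured tag and the file label with a dot."""
    return "{}.{}".format(base_tag, file_label)


print(build_tag("example", "txt"))  # prints: example.txt
```

The resulting tag `example.txt` is matched by the `<match example.**>` block in the Fluentd configuration example.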

`collector.fluentProcessor.messageField`

Each processed line is mapped to this field before the data is sent to Fluentd. Default value: message.

`collector.fluentProcessor.includeTime`

If this is enabled, the current time is included in the fluentd data event (as the time field). Default value: false.
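Taken together, `messageField` and `includeTime` determine the shape of each event. The sketch below builds such an event as a plain dictionary; `build_event` is a hypothetical helper, and the use of epoch seconds for the time value is an assumption, since the format is not specified above.

```python
import time


def build_event(line, message_field="message", include_time=False, now=None):
    """Wrap one processed line in the record that would be sent to Fluentd."""
    event = {message_field: line}
    if include_time:
        # Assumption: the time field holds epoch seconds; `now` allows a
        # fixed value for testing
        event["time"] = int(time.time()) if now is None else now
    return event


print(build_event("card: [REDACTED] ok"))
# prints: {'message': 'card: [REDACTED] ok'}
```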

Contributing

Fork it
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request

Release history

0.2.1 (Jun 25, 2020)
0.2.0 (Jun 19, 2020)
0.1.2 (Jun 15, 2020)
0.1.1 (May 27, 2020)
0.1.0 (May 1, 2020), this version
0.0.3 (Apr 27, 2020)
0.0.1 (Apr 27, 2020)

Download files

Source Distribution

filecollector-0.1.0.tar.gz (13.9 kB), uploaded May 1, 2020

Built Distribution

filecollector-0.1.0-py2.py3-none-any.whl (14.5 kB), uploaded May 1, 2020, Python 2 / Python 3

Hashes for filecollector-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`777e7a53f294774735d2b06bf9e3ece8d6e94044ca402e2389b75ac730553132`
MD5	`18f6f7c7170a40458d75337e7842e946`
BLAKE2b-256	`50a13046d72aa656f3f794e7dd68286aa4d1f4fb3ab512f4a7d364e6f5e3f411`

Hashes for filecollector-0.1.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`92d435ae79687fbbb2d9946589030fcd8622d3ef144af1310fa719dc77346b56`
MD5	`d535a9815b1733e4e35d4f83dec5bf1d`
BLAKE2b-256	`29be46ae02f887e90a9afd6ecd889f08366deeb1406009f5e523c7870ce9ce3c`
