Simple file collector - compress/serve/send/anonymize files
filecollector
Service for collecting and processing files (with hooks)
Features
- collect files and compress them (on command)
- anonymization
- run custom scripts on output file / processed files
- start/stop simple fileserver (at collect output location)
- stream collected files to fluentd (line by line)
Requirements
- python 3.5+
- pip
Installation
pip install filecollector
Usage
It currently has two main components: the collector and the server. The collector collects and anonymizes the files and runs hook scripts on them. The server is only a browser for the collected files.
To start, create a YAML configuration file for the collector. This configuration file is the only input that filecollector requires.
Start the collector
filecollector collector start --config filecollector.yaml -p /my/pid/dir
Start the server
filecollector server start --config filecollector.yaml -p /my/pid/dir
Configuration
Simple configuration example
server:
    port: 1999
    folder: "../example/files"
collector:
    files:
    - path: "example/example*.txt"
      label: "example"
    rules:
    - pattern: \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}
      replacement: "[REDACTED]"
    processFileScript: example/scripts/process_file.sh
    compress: true
    useFullPath: true
    outputScript: example/scripts/output_file.sh
    processFilesFolderScript: example/scripts/tmp_folder.sh
    deleteProcessedTempFiles: true
    outputLocation: "example/files"
Running simple example:
# start collector
filecollector collector start --config example/filecollector.yaml -p /my/pid/dir
# start server for browsing
filecollector server start --config example/filecollector.yaml -p /my/pid/dir
Running with docker:
# collector only
docker run --rm -v /my/path/to/config:/my/path/to/config oleewere/filecollector:latest --config /my/path/to/config/filecollector.yaml
Fluentd configuration example
collector:
    files:
    - path: "example/example*.txt"
      label: "txt"
    rules:
    - pattern: \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}
      replacement: "[REDACTED]"
    compress: false
    useFullPath: true
    deleteProcessedTempFilesOneByOne: true
    outputLocation: "example/files"
    fluentProcessor:
      host: "localhost"
      port: 24224
      tag: example
Fluentd configuration:
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>
<match example.**>
  @type stdout
</match>
Running fluentd example:
# start fluentd
fluentd --config example/fluentd.conf
# start collector
filecollector collector start --config example/fluentd-filecollector.yaml -p /my/pid/dir
Configuration options
server
The server block contains the configuration of the filecollector server component.
server.port
Port that will be used by the filecollector server.
server.folder
The folder that is served by the file server.
collector
The collector block contains the configuration of the filecollector collector component.
collector.files
List of files that need to be collected. The path and label fields are required for each entry.
collector.files.path
Filename or wildcard pattern that selects the files to be collected.
collector.files.label
Used as the first-level folder inside outputLocation for the file collection. E.g. if this value is mylabel, the output goes into <outputLocation>/mylabel.
collector.files.folderPrefix
If set, filecollector puts the files into this folder (inside outputLocation, but before label).
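The folder nesting described for label and folderPrefix can be pictured with a small sketch. The function name target_folder and the example prefix host1 are hypothetical, and the layout is only what the descriptions above suggest, not a confirmed view of filecollector's internals.

```python
from pathlib import PurePosixPath


def target_folder(output_location, label, folder_prefix=None):
    """Return the folder a collected file would land in (assumed layout)."""
    base = PurePosixPath(output_location)
    if folder_prefix:
        # Assumption: the prefix folder sits inside outputLocation, before the label.
        base = base / folder_prefix
    return str(base / label)


print(target_folder("example/files", "mylabel"))
print(target_folder("example/files", "mylabel", folder_prefix="host1"))
```

Without a prefix the result is the `<outputLocation>/mylabel` form mentioned above; with a prefix, the prefix folder is inserted between the two.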
collector.files.useFullPath
Use full path for processed files (inside outputLocation
). Can be useful if because of the wildcard patterns, the base file name are the same for different files from different folders. It overrides the collector.useFullPath
option.
collector.files.excludes
List of path patterns that should be excluded from the path pattern matches. The path option accepts wildcards, so excludes can narrow down what a broad pattern selects. The per-file useFullPath option works like collector.useFullPath, but it overrides the global behavior for that entry (default: false).
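As an illustration of how excludes narrows a wildcard match, here is a minimal sketch using shell-style patterns from the Python standard library. The helper apply_excludes is hypothetical, and the exact matching semantics filecollector applies to excludes are an assumption here.

```python
from fnmatch import fnmatchcase


def apply_excludes(paths, excludes):
    """Drop every path that matches at least one exclude pattern."""
    # Assumption: excludes are shell-style wildcards matched against the whole path.
    return [
        path
        for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in excludes)
    ]


matched = ["logs/app.txt", "logs/app-debug.txt", "logs/db.txt"]
print(apply_excludes(matched, ["*debug*"]))
```

In practice the list of matched paths would come from expanding the path pattern (for example with glob.glob); a fixed list is used here to keep the sketch independent of the file system.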
collector.rules
List of anonymization rules that are run against the file inputs. Each rule has a pattern field for matching and a replacement field for the text substituted on a match.
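To show what a rule such as the one in the configuration examples does to a line, here is a minimal sketch built on Python regular expressions. The function anonymize_line is hypothetical; that filecollector applies rules in order with plain regex substitution is an assumption.

```python
import re

# Same pattern and replacement as in the configuration examples:
# four groups of four digits separated by a single non-word character.
RULES = [
    {
        "pattern": r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}",
        "replacement": "[REDACTED]",
    }
]


def anonymize_line(line, rules):
    """Apply every rule to the line, one after another (assumed behavior)."""
    for rule in rules:
        line = re.sub(rule["pattern"], rule["replacement"], line)
    return line


print(anonymize_line("card 1234-5678-9012-3456 ok", RULES))
```

Because the separator class is `[^\w]`, digit groups joined by a dash or a space are matched, while groups joined by an underscore are left untouched.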
collector.compress
At the end of the file collection, the output folder is compressed. Default value is true.
collector.compressFormat
Compression format. Possible values: zip, tar, gztar, bztar. Default value is zip.
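The four format names coincide with archive formats accepted by Python's shutil.make_archive, so the compression step can be sketched with it. Whether filecollector calls this function internally is an assumption; compress_folder and its parameters are hypothetical names.

```python
import shutil

SUPPORTED_FORMATS = ("zip", "tar", "gztar", "bztar")


def compress_folder(folder, archive_base, compress_format="zip"):
    """Archive the contents of `folder` and return the archive file name."""
    if compress_format not in SUPPORTED_FORMATS:
        raise ValueError("unsupported compressFormat: {}".format(compress_format))
    # make_archive appends the extension itself (.zip, .tar, .tar.gz, .tar.bz2).
    return shutil.make_archive(archive_base, compress_format, root_dir=folder)
```

For example, `compress_folder("example/files", "example/output", "gztar")` would produce `example/output.tar.gz`, provided the folder exists.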
collector.outputLocation
Output location (directory), where the processed file(s) will be stored.
collector.useFullPath
Use full path for processed files (inside outputLocation
). Can be useful if because of the wildcard patterns, the base file name are the same for different files from different folders. Default value is true
.
collector.checkDiskSpace
If this option is set, filecollector checks before file processing, based on the file sizes and the requiredDiskSpaceRatio option, whether there is enough space to copy those files into your working directory. Default value is true.
collector.requiredDiskSpaceRatio
This value is used by the checkDiskSpace option. The required disk space is calculated with this value as a multiplier: e.g. if the total size of the files is 100MB and this setting is 1.6, the processing requires 160MB of free space. Default value is 1.0.
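The check described for checkDiskSpace and requiredDiskSpaceRatio boils down to a multiplication and a comparison. The sketch below is one way to express it; the function names are hypothetical, and reading the free space with shutil.disk_usage is an assumption about how such a check could be implemented.

```python
import math
import shutil


def required_bytes(total_file_bytes, ratio=1.0):
    """Space needed for processing: total file size times the ratio."""
    return math.ceil(total_file_bytes * ratio)


def has_enough_space(total_file_bytes, working_dir=".", ratio=1.0):
    """Compare the required space with the free space of the working directory."""
    free = shutil.disk_usage(working_dir).free
    return free >= required_bytes(total_file_bytes, ratio)


# 100 units of input with a ratio of 1.5 need 150 units of free space.
print(required_bytes(100, 1.5))
```

With the default ratio of 1.0 the check only requires as much free space as the files themselves occupy.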
collector.processFileScript
Script that runs against each processed file. It gets the filename and the label of the processed file.
collector.processFilesFolderScript
Script that runs once after the files are collected. It gets the folder name (where the files are processed) as an input.
collector.preProcessScript
Script that runs before the files are collected. It gets the folder name (where the files are processed) as an input.
collector.outputScript
Script that runs once with the compressed output file name as an input.
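The hook scripts above each receive their inputs from the collector. A minimal sketch of such an invocation follows, assuming the inputs are passed as positional command-line arguments in the order the descriptions list them; run_hook is a hypothetical helper, not part of filecollector.

```python
import subprocess


def run_hook(script, *args):
    """Run a hook script with positional arguments and return its exit code."""
    # Assumption: hooks are plain executables that read their inputs from argv.
    completed = subprocess.run([script] + list(args), check=False)
    return completed.returncode
```

Under that assumption, a processFileScript call would look like `run_hook("example/scripts/process_file.sh", filename, label)`, and an outputScript call like `run_hook("example/scripts/output_file.sh", compressed_file_name)`.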
collector.deleteCompressedFile
Delete the compressed file at the end of the file collection. That can be useful e.g. if an output script uploads the compressed file somewhere and a cleanup is needed afterwards. Default value is false.
collector.deleteProcessedTempFiles
After the files are collected and compressed, the collected files are deleted. It can be useful to disable this behaviour if the compress option is disabled. Default value is true.
collector.deleteProcessedTempFilesOneByOne
If this option is set, files are deleted right after they are processed (one at a time). That can be useful if compression is disabled and you would like to stream large files to fluentd. Default value is false.
collector.fluentProcessor
Fluentd related section for processing files line by line - streaming data by fluentd forward protocol.
collector.fluentProcessor.host
Fluentd host (for forward protocol). Default value: localhost.
collector.fluentProcessor.port
Fluentd port (for forward protocol). Default value: 24224.
collector.fluentProcessor.tag
Fluentd tag for streaming lines. The generated tag for the forward protocol is <collector.fluentProcessor.tag>.<file label for monitored file>.
collector.fluentProcessor.messageField
The processed lines are mapped to this field before the data is sent to Fluentd. Default value: message.
collector.fluentProcessor.includeTime
If this is enabled, the current time is included in the Fluentd data event (as the time field). Default value: false.
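The tag, messageField and includeTime options together determine what each streamed line looks like. The sketch below builds the tag and the event record without sending anything; build_tag and build_record are hypothetical helpers, and the time representation (epoch seconds) and the stripping of the trailing newline are assumptions.

```python
import time


def build_tag(base_tag, label):
    """Combine the configured tag with the file label: <tag>.<label>."""
    return "{}.{}".format(base_tag, label)


def build_record(line, message_field="message", include_time=False):
    """Map one processed line to a Fluentd event record."""
    # Assumption: the trailing newline is not part of the message.
    record = {message_field: line.rstrip("\n")}
    if include_time:
        # Assumption: the time field holds epoch seconds.
        record["time"] = int(time.time())
    return record


print(build_tag("example", "txt"))
print(build_record("first line\n"))
```

With the Fluentd example configuration above, events tagged example.txt are matched by the `<match example.**>` block and written to stdout.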
collector.logger
Logger related configurations for the collector.
collector.logger.level
Logger level that is used during logging.
collector.logger.format
Format of the log events. Default value is %(message)s.
collector.logger.file
File that the logger writes the logging events to. If it is not set, only stdout is used.
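The three logger options map naturally onto Python's logging module, whose format syntax `%(message)s` follows. The sketch below shows one possible wiring; build_logger and the logger name are hypothetical, the INFO fallback level is an assumption (the default level is not documented here), and so is the choice to keep stdout active when a file is set.

```python
import logging
import sys


def build_logger(config, name="filecollector-example"):
    """Create a logger from a dict with optional level, format and file keys."""
    logger = logging.getLogger(name)
    # Assumption: INFO is used when no level is configured.
    logger.setLevel(getattr(logging, str(config.get("level", "INFO")).upper()))
    formatter = logging.Formatter(config.get("format", "%(message)s"))

    handlers = [logging.StreamHandler(sys.stdout)]
    if config.get("file"):
        # Assumption: the file is written in addition to stdout.
        handlers.append(logging.FileHandler(config["file"]))

    logger.handlers = []
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log = build_logger({"level": "debug"})
log.debug("collector started")
```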
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request