Parse log files from an ERDDAP server
erddaplogs
A package for analysing traffic to an ERDDAP server by parsing nginx and apache logs.
Installation
- From PyPI, using pip:
pip install erddaplogs
- From the repo, using pip:
# First, clone the repo:
git clone https://github.com/callumrollo/erddaplogs.git
cd erddaplogs
pip install -r requirements-dev.txt  # install the dependencies
pip install .
Example usage
First, copy the logs locally to a directory you can read and unzip them, e.g.:
rsync /var/log/nginx/* logs
gzip -dfr logs
Next, run erddaplogs:
from erddaplogs.logparse import ErddapLogParser
parser = ErddapLogParser()
parser.load_nginx_logs("example_data/nginx_example_logs/") # replace with the path to your logs
parser.parse_datasets_xml("example_data/datasets.xml") # replace with the path to your xml, or remove this line
parser.filter_non_erddap()
parser.filter_spam()
parser.filter_locales()
parser.filter_user_agents()
parser.filter_common_strings()
parser.get_ip_info()
parser.filter_organisations()
parser.parse_columns()
parser.export_data(output_dir=".") # Put the path to the output dir here. Preferably somewhere your ERDDAP can read
This will read nginx logs from the user-specified directory and write two files, <timestamp>_anonymized_requests.csv
and <timestamp>_aggregated_locations.csv,
containing anonymized requests and aggregated location data respectively.
ErddapLogParser can be run on a static directory of logs or as a cron job, e.g. once per day. If run repeatedly, it will create a new anonymized_requests file
containing only the anonymized requests received since the script was last run. The aggregated_locations
file is updated with the new request locations; only one file with cumulative location totals is retained.
To re-analyze all the input requests, first delete the output files in output_dir,
then re-run.
Optionally, the resulting anonymized data can be shared on your ERDDAP in two datasets, requests
and locations.
To do this, add the contents of the example xml files requests.xml
and locations.xml
from the example_data
directory to your datasets.xml.
Make sure to update the entries for fileDir and institution; the other fields can remain as-is.
You can see what the resulting stats look like on the VOTO ERDDAP server:
- https://erddap.observations.voiceoftheocean.org/erddap/tabledap/requests.html
- https://erddap.observations.voiceoftheocean.org/erddap/tabledap/locations.html
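Once published, these datasets can be queried like any other tabledap dataset. Below is a sketch of building a CSV download URL; the server and dataset ID match the examples above, the constraint shown uses standard ERDDAP tabledap query syntax, and the helper function itself is not part of erddaplogs:

```python
def tabledap_csv_url(server, dataset_id, variables=(), constraints=()):
    """Build an ERDDAP tabledap CSV download URL.

    variables: optional projection (comma-separated in the query);
    constraints: optional constraints, each appended after an '&'.
    """
    url = f"{server.rstrip('/')}/tabledap/{dataset_id}.csv"
    query = ",".join(variables)
    for constraint in constraints:
        query += "&" + constraint
    if query:
        url += "?" + query
    return url
```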
For more analysis options and plots, see the example Jupyter notebook.
Example Jupyter Notebook
You can find an example Jupyter Notebook here. It performs the following steps:
- Read in apache and nginx logs, combine them into one consistent dataframe
- Find the IPs that made the greatest number of requests, and get their info from ip-api.com
- Remove suspected spam/bot requests
- Perform basic analysis to graph number of requests and users over time, most popular datasets/datatypes and geographic distribution of users
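The counting steps above can be sketched with pandas on a toy dataframe. The column names ip and datetime are assumptions for illustration; the schema of the dataframe erddaplogs produces may differ:

```python
import pandas as pd

# Hypothetical mini request log; the real dataframe comes from combining
# the parsed apache and nginx logs.
df = pd.DataFrame(
    {
        "ip": ["1.2.3.4", "1.2.3.4", "5.6.7.8"],
        "datetime": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 11:00", "2024-01-02 09:00"]
        ),
    }
)

# IPs ranked by number of requests made
requests_per_ip = df["ip"].value_counts()

# Requests and unique users per day
daily = df.groupby(df["datetime"].dt.date).agg(
    requests=("ip", "size"), users=("ip", "nunique")
)
```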
A rather out-of-date blog post explaining this notebook in more detail can be found at https://callumrollo.com/weblogparse.html
A note on example data
If you don't have your own ERDDAP logs to hand, you can use the example data in example_data/nginx_example_logs.
This is anonymized data from a production ERDDAP server, erddap.observations.voiceoftheocean.org. The IP addresses have been randomly generated, as have the user agents. All subscription emails have been replaced with fake@example.com.
License
This project is licensed under MIT.
Hashes for erddaplogs-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0a7833334bbe45e56a339034f8c925f926571503fcc9e2101409c1e9d96c7b80
MD5 | 3670339f0a4c6c251013e49bf01e4e20
BLAKE2b-256 | ba74db1de4b66c5219f5ed555c38800051cc10257efce87142d12368afbb1482