Python command line application to extract features from a column of URLs inside a CSV or TSV dataset.
url2features
Status - Functional - Beta Release
This application adds features to a dataset by processing the content of columns that contain URLs.
It will accept a CSV, TSV or XLS file and output an extended version of the dataset with additional columns appended. When run with default settings it will produce a small number of very simple numerical summaries.
Additional feature flags unlock features that are more computationally intensive.
Additional detail available in the documentation
Distribution
Released and distributed via setuptools/PyPI/pip for Python 3.
Resources & Dependencies
For domain registration data we use the Python package whois. However, rather than query that service for every URL, we keep a local dataset of domain registration dates. This data is installed with the package.
The data was initialised with the following script:
python scripts/init_dom_reg_data.py
We then updated the data file using multiple datasets of common domains, plus some specific to the requirements of our project. The update script was run as follows:
python scripts/update_dom_reg_data.py data/top_50_domains.csv Domain
python scripts/update_dom_reg_data.py data/top_50_domains_v2.csv site
Together, these scripts build a local cache of domain registration dates.
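The caching idea can be sketched as follows. This is a minimal illustration and not the package's actual code: the CSV layout (`domain` and `registered` columns), the `RegistrationCache` class and the `lookup` callable are all assumptions. In practice `lookup` would wrap a WHOIS query; here it is a stand-in that returns placeholder dates.

```python
import csv
import io
from typing import Callable, Dict, Optional


class RegistrationCache:
    """Hypothetical local cache of domain registration dates."""

    def __init__(self, lookup: Callable[[str], Optional[str]]):
        self._dates: Dict[str, str] = {}
        self._lookup = lookup  # slow fallback, e.g. a WHOIS query

    def load(self, handle) -> None:
        """Read rows from a CSV with assumed columns: domain, registered."""
        for row in csv.DictReader(handle):
            self._dates[row["domain"].lower()] = row["registered"]

    def get(self, domain: str) -> Optional[str]:
        """Return the cached date, calling the fallback only on a miss."""
        key = domain.lower()
        if key not in self._dates:
            found = self._lookup(key)
            if found is None:
                return None
            self._dates[key] = found  # remember it for next time
        return self._dates[key]


calls = []  # records every time the slow fallback is used


def fake_lookup(domain: str) -> Optional[str]:
    """Stand-in for a remote query; the date is a placeholder, not real data."""
    calls.append(domain)
    return "2002-02-02" if domain == "example.org" else None


cache = RegistrationCache(fake_lookup)
# Placeholder cache contents, not real registration data.
cache.load(io.StringIO("domain,registered\nexample.com,2001-01-01\n"))
```

A domain already in the local file is answered without touching the fallback, and a domain found through the fallback is stored so that later requests for it are also served locally.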
Features
Each type of feature can be unlocked through the use of a specific command line switch:
-simple Default: False. Basic string-derived features.
-protocol Default: False. Features derived from the URL protocol.
-host Default: False. Features describing the host, including domain structure and registration.
-tld Default: False. Information about the top-level domain.
-path Default: False. Features extracted from the path between host and file.
-file Default: False. The file extension and type referenced by the URL.
-params Default: False. The query parameters at the end of the URL.
-dns Default: False. DNS-related information.
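To make the feature groups concrete, the sketch below splits a URL into the parts that the switches above refer to, using only the Python standard library. It is an illustration under assumptions, not the package's implementation: the function name `split_url` and the dictionary keys are hypothetical, and taking the last host label as the TLD is a simplification that ignores multi-part suffixes such as `co.uk`.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit


def split_url(url: str) -> dict:
    """Break a URL into the pieces each feature group would look at."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # URL paths always use forward slashes, so posixpath is the right tool.
    directory, filename = posixpath.split(parts.path)
    extension = posixpath.splitext(filename)[1].lstrip(".")
    return {
        "protocol": parts.scheme,
        "host": host,
        # Simplification: last dot-separated label of the host.
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
        "path": directory,
        "file": filename,
        "extension": extension,
        "params": parse_qs(parts.query),
    }


example = split_url("https://www.example.com/docs/guide/index.html?lang=en&page=2")
```

Each feature set would then compute its columns from one of these pieces, for instance the number of query parameters from `params` or the number of host labels from `host`.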
Usage
You can use this application in several ways.
Runner
Use the runner without installing the application. The following example generates the simple, host, tld, protocol, file and params feature sets on the test data.
./url2features-runner.py -columns=url -simple -host -tld -protocol -file -params data/test.csv > data/output.csv
This will send the time performance profile to STDERR as shown below:
Computation Time Profile for each Feature Set
---------------------------------------------
simple 0:00:00.002620
protocol 0:00:00.001280
host 0:00:00.002900
tld 0:00:00.001703
file 0:00:00.002157
params 0:00:00.002215
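A profile like this can be produced by timing each feature set separately and writing the totals to STDERR, so they do not mix with the CSV on STDOUT. The sketch below assumes hypothetical feature functions and a helper named `profile_feature_sets`; it only illustrates the pattern and is not taken from the package source. The `h:mm:ss.ffffff` layout is Python's default `timedelta` formatting.

```python
import sys
import time
from datetime import timedelta


def profile_feature_sets(feature_sets, urls, stream=None):
    """Run each (name, function) pair over urls and report elapsed time."""
    stream = stream if stream is not None else sys.stderr
    results, timings = {}, {}
    for name, func in feature_sets:
        start = time.perf_counter()
        results[name] = [func(u) for u in urls]
        timings[name] = timedelta(seconds=time.perf_counter() - start)

    title = "Computation Time Profile for each Feature Set"
    print(title, file=stream)
    print("-" * len(title), file=stream)
    for name, elapsed in timings.items():
        print(f"{name} {elapsed}", file=stream)
    return results, timings


# Hypothetical stand-ins for real feature functions.
feature_sets = [
    ("simple", len),
    ("protocol", lambda u: u.split(":", 1)[0]),
]
results, timings = profile_feature_sets(feature_sets, ["https://example.com/a"])
```

Because the report goes to a separate stream, redirecting STDOUT to a file with `>` still captures only the dataset.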
Directory as package
Alternatively, you can invoke the directory as a package:
python -m url2features -columns=url -host -tld data/test.csv > data/output.csv
From Install
Or you can install the package and use the command line application directly. The following command prints the help:
url2features -h
Installation
Installation from the source tree:
python setup.py install
Or via pip from PyPI:
pip install url2features
Now the url2features command is available:
url2features -columns=url -simple data/test.csv > data/output.csv
This takes the input CSV, calculates some simple summary features and produces an output CSV with the features appended as new columns.
For more complicated features see the additional options (outlined above).
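The shape of that transformation can be sketched with the standard `csv` module. The function name and the added column names (`url_length`, `url_digits`) are made up for illustration and are not necessarily what url2features emits; the point is only that every input column is preserved and the new ones are appended on the right.

```python
import csv
import io


def append_simple_features(in_handle, out_handle, column="url"):
    """Copy a CSV, appending two illustrative features for one URL column."""
    reader = csv.DictReader(in_handle)
    new_cols = [f"{column}_length", f"{column}_digits"]
    writer = csv.DictWriter(
        out_handle,
        fieldnames=list(reader.fieldnames or []) + new_cols,
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        value = row[column] or ""
        row[new_cols[0]] = len(value)                         # total characters
        row[new_cols[1]] = sum(ch.isdigit() for ch in value)  # digit characters
        writer.writerow(row)


source = io.StringIO("id,url\n1,https://example.com/page2\n")
target = io.StringIO()
append_simple_features(source, target)
output = target.getvalue()
```

Here `output` holds the original `id` and `url` columns followed by the two new ones, mirroring the input-to-output flow of the command above.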
Acknowledgements
Python package built using the bootstrap cmdline template by jgehrcke
Datasets for calculating features taken from the following sources
- https://github.com/tb0hdan/domains
- https://datahub.io/core/top-level-domain-names#resource-top-level-domain-names_zip
- https://datahub.io/core/top-level-domain-names/r/top-level-domain-names.csv.csv
Some feature ideas taken and adapted from the following:
File details
Details for the file url2features-0.4.0.tar.gz.
File metadata
- Download URL: url2features-0.4.0.tar.gz
- Size: 16.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.25.0 setuptools/51.1.0 requests-toolbelt/0.8.0 tqdm/4.61.2 CPython/3.6.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0f2408aa5cc561cf4ba02b315e906ec1f0ae1e1ca3591546a388c6b80c02cd06
MD5 | a9d2f2faf944166ab5510519b70dde0b
BLAKE2b-256 | c2383ea066efa1acb17b142975b612bc12e696313664714a126284ca64ce31bb