Builds a network from the URLs in a dataset of tweets

Project description

Domain Network

A package that builds a domain network from the URLs mentioned in a dataset of texts. The current version works on tweets; future versions may process other kinds of text.
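
To make the term "netloc" used later in this page concrete, here is a minimal sketch of pulling the network location (domain) out of each URL in a tweet's text. It uses only the Python standard library. The regular expression, the `www.` stripping, and the function name `extract_netlocs` are assumptions for illustration and are not taken from the package itself.

```python
import re
from urllib.parse import urlparse

# Simple URL pattern; an assumption for illustration, not the package's own rule.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_netlocs(text):
    """Return the lower-cased netlocs (domains) of the URLs found in text."""
    netlocs = []
    for url in URL_PATTERN.findall(text):
        netloc = urlparse(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same node (assumption).
        if netloc.startswith("www."):
            netloc = netloc[4:]
        if netloc:
            netlocs.append(netloc)
    return netlocs


tweet = "Read https://www.example.com/story and http://news.example.org/a?id=1"
print(extract_netlocs(tweet))  # ['example.com', 'news.example.org']
```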

Installation

The easiest way to install the domain_network package is to use the following command in a terminal:

pip install domain-network

Usage

To run the module from the command-line interface (CLI), use one of the following:

  • For the whole process starting with raw tweets:
python -m domainNetwork --input_dir ["data/twitterAPI_lang_en/*/*.json"] --conf_dir ["config/sample_config.ini"] --min_edge_weight [20] --min_node_size [20] \
--min_stand_alone_size [50] --urls_file_name ["output/urls.csv"] \
--network_output_file_name ["output/network.csv"] --netloc_output_file_name ["output/netloc.csv"] \
--netloc_origin_output_file_name ["output/netloc_origin.csv"]
  • To build the domain network from a pre-processed file that already includes the extracted netlocs:
python -m domainNetwork --conf_dir ["config/sample_config.ini"] --min_edge_weight [20] --min_node_size [20] \
--min_stand_alone_size [50] --network_only true --urls_file_name ["data/urls.csv"] \
--network_output_file_name ["output/network.csv"] --netloc_output_file_name ["output/netloc.csv"] \
--netloc_origin_output_file_name ["output/netloc_origin.csv"]

Parameters:

--input_dir : Directory of the tweet files (the example above passes a glob pattern).

--conf_dir : File path of the config file. See the Config file section for more details.

--min_edge_weight : Minimum number of users who mentioned both the source and the target of an edge in their tweets.

--min_node_size : Minimum total number of times a web site must be mentioned, for connected nodes.

--min_stand_alone_size : Minimum total number of times a web site must be mentioned, for stand-alone nodes.

--network_only : Set to true to skip URL extraction and use a preprocessed file that already includes the netlocs.

--urls_file_name : File path of the preprocessed tweets with netlocs. It is an output file when starting from raw tweets and an input file when --network_only is set.

--network_output_file_name : File path of the generated network, in .csv format.

--netloc_output_file_name : File path of the list of web sites after filtering, in .csv format.

--netloc_origin_output_file_name : File path of the original (unfiltered) list of web sites, in .csv format.
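
One way to read the three thresholds above is sketched below. It is an interpretation of the parameter descriptions, not the package's actual code: the function name `build_network`, the `(user, netloc)` input format, and the rule that a node without a surviving edge counts as stand-alone are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_network(mentions, min_edge_weight, min_node_size, min_stand_alone_size):
    """mentions: iterable of (user, netloc) pairs, one per URL mention."""
    mentions = list(mentions)
    # Node size = total number of mentions of the netloc.
    node_size = Counter(netloc for _, netloc in mentions)

    # Collect the distinct netlocs each user mentioned.
    by_user = defaultdict(set)
    for user, netloc in mentions:
        by_user[user].add(netloc)

    # Edge weight = number of users who mentioned both endpoints.
    edge_weight = Counter()
    for netlocs in by_user.values():
        for pair in combinations(sorted(netlocs), 2):
            edge_weight[pair] += 1

    # Keep edges that are heavy enough and whose endpoints are large enough.
    edges = {
        pair: weight
        for pair, weight in edge_weight.items()
        if weight >= min_edge_weight
        and all(node_size[n] >= min_node_size for n in pair)
    }
    connected = {n for pair in edges for n in pair}
    # Nodes without a surviving edge must pass the stand-alone threshold.
    stand_alone = {
        n for n, size in node_size.items()
        if n not in connected and size >= min_stand_alone_size
    }
    return edges, connected, stand_alone


sample = [
    ("u1", "a.com"), ("u1", "b.com"),
    ("u2", "a.com"), ("u2", "b.com"),
    ("u3", "c.com"), ("u3", "c.com"), ("u3", "c.com"),
    ("u4", "a.com"), ("u4", "d.com"),
]
edges, connected, stand_alone = build_network(sample, 2, 2, 3)
print(edges)        # {('a.com', 'b.com'): 2}
print(stand_alone)  # {'c.com'}
```

In this toy data, a.com and b.com are linked because two users mentioned both, c.com survives as a stand-alone node with three mentions, and d.com is dropped.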

Output

The main output of this package is network.csv, which lists the source, target, and weight of each edge. The file can be passed to a visualization tool, e.g. networkx in Python.
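
As a rough sketch of that last step, the snippet below reads an edge list and builds a networkx graph from it. It assumes the CSV has a header row with columns named `source`, `target`, and `weight` and that the network is undirected; neither is confirmed by this page, so adjust the names to match the actual file. The helper names `read_edges` and `to_graph` are hypothetical.

```python
import csv


def read_edges(path):
    """Read (source, target, weight) triples from a CSV file with a header row."""
    # Column names are assumed; check them against the generated network.csv.
    with open(path, newline="", encoding="utf-8") as f:
        return [
            (row["source"], row["target"], float(row["weight"]))
            for row in csv.DictReader(f)
        ]


def to_graph(edges):
    """Build an undirected weighted graph (needs the third-party networkx package)."""
    import networkx as nx  # imported lazily so read_edges works without it

    graph = nx.Graph()
    graph.add_weighted_edges_from(edges)
    return graph
```

For example, `to_graph(read_edges("output/network.csv"))` returns a graph whose edges carry a `weight` attribute, which networkx's layout and drawing functions can use.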

Download files

Files for domain-network, version 0.0.9:

  • domain_network-0.0.9-py3-none-any.whl (10.1 kB): wheel, Python version py3
  • domain_network-0.0.9.tar.gz (6.2 kB): source distribution