Skip to main content

Makes a network out of a URLs in a dataset of tweets

Project description

Domain Network

A package to create a domain network of the URLs mentioned in a dataset of texts. In the current version it works for tweets. It may process any kind of text in the future versions. ##Installation The easiest way to install the domain_network package is to use the following command in a terminal:

pip install domain-network

Usage

To run the module using Command Line Interface (CLI) run the following:

  • For the whole process starting with raw tweets:
python -m domainNetwork  --input_dir ["data/twitterAPI_lang_en/*/*.json"] --conf_dir  [‘config/sample_config.ini’] --min_edge_weight [20] --min_node_size [20] \
--min_stand_alone_size [50]   --urls_file_name  ["output/urls.csv"] \
--network_output_file_name  ["output/network.csv"] --netloc_output_file_name ["output/netloc.csv"] \
--netloc_origin_output_file_name  ["output/netloc_origin.csv"] 
  • For making domain network of a pre-processed file which includes extracted netlocs:
python -m domainNetwork  --conf_dir  [‘config/sample_config.ini’] --min_edge_weight [20] --min_node_size [20] \
--min_stand_alone_size [50]  --network_only true  --urls_file_name  ["data/urls.csv"] \
--network_output_file_name  ["output/network.csv"] --netloc_output_file_name ["output/netloc.csv"] \
--netloc_origin_output_file_name  ["output/netloc_origin.csv"] 

Parameters:

--input_dir : Directory of tweet files

--conf_dir : File path of the config file. Read Config file section for more details.

--min_edge_weight : Min number of users that mentioned both source and target of the edge in their tweets.

--min_node_size : Min number of times that a web page is mentioned in total, for connected nodes.

--min_stand_alone_size: Min number of times that a web page is mentioned in total, for stand-alone nodes.

--network_only : If you want to use a preprocessed file which includes the netlocs

--urls_file_name : File path of preprocessed tweets with netlocs. Can be output/input file in the above mentioned situations.

--network_output_file_name: File path of the generated network, in .csv format.

--netloc_output_file_name : File path of the list of web sites, after filtering, in .csv format.

--netloc_origin_output_file_name : File path of the original list of web sites, in .csv format.

Output

The main output of this package is network.csv which includes source, target and the weight. Output file can be given to a visualization tool, e.g. networkx in python for the visualization

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

domain_network-0.0.1.tar.gz (5.9 kB view hashes)

Uploaded Source

Built Distribution

domain_network-0.0.1-py3-none-any.whl (9.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page