Makes a network out of the URLs in a dataset of tweets
Domain Network
A package to create a domain network of the URLs mentioned in a dataset of texts. The current version works for tweets; future versions may process other kinds of text.
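To make the idea of a "domain" concrete, here is a minimal sketch of pulling the network location (netloc) out of each URL in a piece of text. It is not the package's own code: the `extract_netlocs` function and the `URL_PATTERN` regular expression are hypothetical names introduced for illustration, and it assumes that a domain corresponds to the `netloc` field returned by Python's standard `urllib.parse.urlparse`.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern (an assumption, not taken from the package):
# matches http(s) URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_netlocs(text):
    """Return the lower-cased netloc of every http(s) URL found in text."""
    netlocs = []
    for url in URL_PATTERN.findall(text):
        netloc = urlparse(url).netloc.lower()
        if netloc:
            netlocs.append(netloc)
    return netlocs


tweet = "Read https://www.example.com/a?x=1 and http://news.example.org/b"
print(extract_netlocs(tweet))  # ['www.example.com', 'news.example.org']
```

A real tweet dataset would likely need extra steps, such as expanding shortened links, which this sketch leaves out.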
Installation
The easiest way to install the domain_network package is to use the following command in a terminal:
pip install domain-network
Usage
To run the module from the command line interface (CLI), use one of the following commands:
- For the whole process starting with raw tweets:
python -m domainNetwork --input_dir ["data/twitterAPI_lang_en/*/*.json"] --conf_dir ["config/sample_config.ini"] --min_edge_weight [20] --min_node_size [20] \
--min_stand_alone_size [50] --urls_file_name ["output/urls.csv"] \
--network_output_file_name ["output/network.csv"] --netloc_output_file_name ["output/netloc.csv"] \
--netloc_origin_output_file_name ["output/netloc_origin.csv"]
- For building the domain network from a pre-processed file that already includes the extracted netlocs:
python -m domainNetwork --conf_dir ["config/sample_config.ini"] --min_edge_weight [20] --min_node_size [20] \
--min_stand_alone_size [50] --network_only true --urls_file_name ["data/urls.csv"] \
--network_output_file_name ["output/network.csv"] --netloc_output_file_name ["output/netloc.csv"] \
--netloc_origin_output_file_name ["output/netloc_origin.csv"]
Parameters:
--input_dir : Directory of the tweet files.
--conf_dir : File path of the config file. Read the Config file section for more details.
--min_edge_weight : Minimum number of users who mentioned both the source and the target of an edge in their tweets.
--min_node_size : Minimum total number of times a web page must be mentioned, for connected nodes.
--min_stand_alone_size : Minimum total number of times a web page must be mentioned, for stand-alone nodes.
--network_only : Set to true to build the network from a preprocessed file that already includes the netlocs.
--urls_file_name : File path of the preprocessed tweets with netlocs. It is an output file when the whole process is run and an input file when --network_only is used.
--network_output_file_name : File path of the generated network, in .csv format.
--netloc_output_file_name : File path of the list of websites after filtering, in .csv format.
--netloc_origin_output_file_name : File path of the original list of websites, in .csv format.
--selected_users_fp : Specifies the target group of users, i.e. the active users whose domain network we are interested in.
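The three thresholds above can be sketched as follows. This is an illustrative reading of the parameter descriptions, not the package's implementation: the `build_domain_network` function, its input format of `(user, netloc)` pairs, and the exact way the thresholds are combined are all assumptions.

```python
from collections import Counter
from itertools import combinations


def build_domain_network(mentions, min_edge_weight, min_node_size,
                         min_stand_alone_size):
    """Build a filtered co-mention network.

    mentions: iterable of (user, netloc) pairs, one per URL mention
    (a hypothetical input format chosen for this sketch).
    """
    mentions = list(mentions)

    # Node size: total number of times each netloc is mentioned.
    node_size = Counter(netloc for _, netloc in mentions)

    # Netlocs mentioned by each user; a set, so a user counts once per edge.
    by_user = {}
    for user, netloc in mentions:
        by_user.setdefault(user, set()).add(netloc)

    # Edge weight: number of users who mentioned both endpoints.
    edge_weight = Counter()
    for netlocs in by_user.values():
        for pair in combinations(sorted(netlocs), 2):
            edge_weight[pair] += 1

    # Keep edges that are heavy enough and whose endpoints are large enough.
    edges = {
        (a, b): weight
        for (a, b), weight in edge_weight.items()
        if weight >= min_edge_weight
        and node_size[a] >= min_node_size
        and node_size[b] >= min_node_size
    }
    connected = {node for pair in edges for node in pair}

    # Nodes without a surviving edge must pass the stand-alone threshold.
    stand_alone = {
        node for node, size in node_size.items()
        if node not in connected and size >= min_stand_alone_size
    }
    return edges, connected, stand_alone


mentions = [
    ("u1", "a.com"), ("u1", "b.com"),
    ("u2", "a.com"), ("u2", "b.com"),
    ("u3", "c.com"), ("u3", "c.com"), ("u3", "c.com"), ("u3", "a.com"),
]
edges, connected, stand_alone = build_domain_network(
    mentions, min_edge_weight=2, min_node_size=2, min_stand_alone_size=3
)
print(edges)        # {('a.com', 'b.com'): 2}
print(stand_alone)  # {'c.com'}
```

In this toy input, the edge between a.com and c.com is dropped because only one user mentioned both, while c.com survives as a stand-alone node because it was mentioned three times.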
Output
The main output of this package is network.csv, which contains the source, the target and the weight of each edge. The output file can be passed to a visualization tool, e.g. networkx in Python.
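As a rough sketch of that last step, the edge list can be read with the standard `csv` module and handed to networkx. The column names `source`, `target` and `weight`, the presence of a header row, and the choice of an undirected graph are assumptions made for this example; check them against the actual file. The helper names `read_edges` and `to_graph` are hypothetical.

```python
import csv


def read_edges(path):
    """Read (source, target, weight) tuples from a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [
            (row["source"], row["target"], float(row["weight"]))
            for row in csv.DictReader(handle)
        ]


def to_graph(edges):
    """Build an undirected weighted graph; requires networkx to be installed."""
    import networkx as nx  # imported lazily so read_edges works without it

    graph = nx.Graph()
    graph.add_weighted_edges_from(edges)
    return graph
```

With networkx and matplotlib installed, `nx.draw(to_graph(read_edges("output/network.csv")), with_labels=True)` would give a first quick picture of the network.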
File details
Details for the file domain_network-0.1.2.tar.gz.
File metadata
- Download URL: domain_network-0.1.2.tar.gz
- Upload date:
- Size: 6.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 82dd9ffb455e1e2d75ad9e2a13d3b91c2dbf461317b45de7c91d20ba3298dc70
MD5 | ad62dba7f8992780d641342e8187d1be
BLAKE2b-256 | 170f2d05d575a3aee3d99824a84aaf16f06d4b093f224e48239d9538417a927f
File details
Details for the file domain_network-0.1.2-py3-none-any.whl.
File metadata
- Download URL: domain_network-0.1.2-py3-none-any.whl
- Upload date:
- Size: 10.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6bde854b967e8daf52616b14f8f48fe4e4795bd74c4ab64ad47ac385b9649e61
MD5 | 3db3d93fc3c833946a50fff3d1fe2642
BLAKE2b-256 | 0000b55e106b3fdae01b9cf20dbdb114ccf69734fd18506fc6e2b7f78e5796dd