Dynamic item set clustering UI tool: The goal of this package is to split datasets (e.g. words defined by several variables) into subsets that are as comparable as possible.

About Discuit

Discuit is a dynamic item set clustering UI tool (the UI part runs on Streamlit).

Discuit can split datasets (e.g. words defined by several variables) into subsets that are as comparable as possible.

The package takes a CSV file as input and generates a defined number of matched sets for a given number of continuous and categorical variables. One of the categorical variables can be selected to be split exactly evenly across sets. Discuit generates the following output:

  • a CSV file amended with the set membership for each item
  • a TXT file that reports the outcomes of the statistical tests

If you want to use this package and its functionality with a graphical user interface, check the app on Streamlit.

The project setup is documented in project_setup.md.

Installation

Discuit is available on PyPI!

pip install discuit

Or install the latest development version directly from GitHub:

git clone git@github.com:doerte/discuit-project.git
cd discuit-project
python3 -m pip install .

Using Discuit

In the terminal, run Discuit with the following command: discuit "name of input file" [number of desired sets] --columns l/c/n/a/d --runs [desired number of runs]

Example: discuit example/input.csv 2 --columns l a n c n d --runs 3

This will run Discuit with the provided test file and create 2 subsets. The columns in the file are identified as "label", "absolute", "numerical", "categorical", "numerical" and "disregard" (in that order, matching the codes l a n c n d). The program will run 3 times (and create 3 output files).
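
If you prefer to launch Discuit from a Python script, the example command can be assembled as an argument list. This is only a minimal sketch: the helper name build_discuit_command is hypothetical and not part of Discuit. It uses only the positional arguments and the --columns and --runs options described above.

```python
import shlex


def build_discuit_command(input_file, n_sets, columns=None, runs=None):
    """Assemble the argument list for the `discuit` command line (hypothetical helper)."""
    cmd = ["discuit", str(input_file), str(n_sets)]
    if columns:
        # One code per column, e.g. ["l", "a", "n", "c", "n", "d"]
        cmd += ["--columns", *columns]
    if runs is not None:
        cmd += ["--runs", str(runs)]
    return cmd


cmd = build_discuit_command(
    "example/input.csv", 2, columns="l a n c n d".split(), runs=3
)
print(shlex.join(cmd))

# To actually run it (requires Discuit to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Leaving out columns or runs simply omits the corresponding option, in which case Discuit falls back to the behaviour it documents for missing options.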

Required input

The input file needs to be a .csv file with a first line containing headings followed by rows that represent the different items. Each column specifies one variable.
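
To illustrate the expected layout, the sketch below writes a tiny file with one heading row followed by one row per item. The column names and all values are made up for illustration and do not come from the Discuit example file.

```python
import csv
import io

# Hypothetical headings: one column per variable
header = ["item", "frequency", "aoa", "transitivity"]

# Invented example rows: one row per item
rows = [
    ["walk", 5.2, 3.1, "intransitive"],
    ["push", 4.8, 3.9, "transitive"],
    ["sleep", 5.0, 2.7, "intransitive"],
    ["carry", 4.1, 4.4, "transitive"],
]

# Write to an in-memory buffer; use open("input.csv", "w", newline="") for a real file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```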

When launching the script, please specify per column what kind of data the script should expect:

  • (l)abel: a label only, such as the item name or item number. It is not taken into consideration for the split. This can only be assigned once.
  • (n)umerical: a numerical variable, such as frequency or AoA.
  • (c)ategorical: a categorical variable, such as "transitivity" or "accuracy".
  • (a)bsolute: a variable that needs to be divided perfectly evenly between sets. This can only be assigned once.
  • (d)isregard: a column that does not need to be taken into account for the split, but contains other information you have in the same file.
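
The rules for the column codes can be checked before running Discuit. The following is a minimal sketch of such a check, not Discuit's own validation: the function name check_columns and the wording of the messages are assumptions.

```python
from collections import Counter

VALID_CODES = {"l", "n", "c", "a", "d"}
# Codes that, according to the list above, may be assigned only once
ONCE_ONLY = {"l": "label", "a": "absolute"}


def check_columns(codes, n_columns):
    """Return a list of problems found in a per-column type specification."""
    problems = []
    if len(codes) != n_columns:
        problems.append(f"expected {n_columns} codes, got {len(codes)}")
    unknown = sorted(set(codes) - VALID_CODES)
    if unknown:
        problems.append(f"unknown codes: {', '.join(unknown)}")
    counts = Counter(codes)
    for code, name in ONCE_ONLY.items():
        if counts[code] > 1:
            problems.append(f"'{name}' can only be assigned once")
    return problems


print(check_columns(list("lancnd"), 6))
print(check_columns(["l", "l", "n"], 3))
```

An empty list means the specification satisfies the rules stated above.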

The package will try up to 20 times to come up with a good split. If it does not succeed, it will give up and output its last try. You can always run it again; often it will succeed eventually. If not, consider dropping variables.
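
The retry behaviour can be pictured as a bounded loop that keeps the last attempt. This sketch only illustrates that control flow: the random shuffle is a stand-in for Discuit's actual clustering, and split_with_retries and is_good_split are hypothetical names.

```python
import random

MAX_TRIES = 20  # the limit stated above


def split_with_retries(items, n_sets, is_good_split, max_tries=MAX_TRIES, seed=None):
    """Try up to max_tries splits; return (split, tries_used, succeeded).

    The split itself is a plain random partition, used here only as a
    placeholder for the real method.
    """
    rng = random.Random(seed)
    split = None
    for attempt in range(1, max_tries + 1):
        shuffled = list(items)
        rng.shuffle(shuffled)
        # Deal the shuffled items into n_sets subsets
        split = [shuffled[i::n_sets] for i in range(n_sets)]
        if is_good_split(split):
            return split, attempt, True
    # No attempt passed the check: give up and hand back the last try
    return split, max_tries, False


# A check that never passes forces the "give up" path
last_try, tries, ok = split_with_retries(list(range(10)), 2, lambda s: False, seed=0)
print(tries, ok, [len(subset) for subset in last_try])
```

In the real tool the acceptance check would be based on the statistical tests reported in the output file; here it is left to the caller.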

If you run the script without specifying --columns, you will be asked to specify the type of each column. If you don't specify the desired number of runs, it will generate 1 output file.

Missing data

If you choose an "absolute split variable", this variable cannot have missing data. The program will exit if it does. For categorical variables, a dummy category is created that holds the items with missing data. For numerical variables, missing data is replaced with the average of this variable. If you prefer a different approach, please prepare your input file in a way that does not include missing data.
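
The three rules for missing data can be sketched as follows. This is an illustration under stated assumptions rather than Discuit's implementation: missing values are represented as None, the dummy category is labelled "missing", and the function name fill_missing is hypothetical.

```python
from statistics import mean


def fill_missing(values, kind):
    """Apply the missing-data rule for one column.

    values: list in which None marks a missing entry
    kind:   "a" (absolute), "c" (categorical) or "n" (numerical)
    """
    if kind == "a":
        # The absolute split variable must be complete
        if any(v is None for v in values):
            raise ValueError("absolute split variable must not have missing data")
        return list(values)
    if kind == "c":
        # Missing entries go into a dummy category (label is an assumption)
        return ["missing" if v is None else v for v in values]
    if kind == "n":
        # Missing entries are replaced with the average of the observed values
        average = mean(v for v in values if v is not None)
        return [average if v is None else v for v in values]
    raise ValueError(f"unsupported column kind: {kind!r}")


print(fill_missing([1.0, None, 3.0], "n"))
print(fill_missing(["transitive", None], "c"))
```

If you prepare your input file yourself, the same pattern can be adapted to whatever replacement strategy you prefer before passing the file to Discuit.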

Contributing

If you want to contribute to the development of Discuit, have a look at the contribution guidelines.

Credits

This package was created with Cookiecutter and the NLeSC/python-template.
