Text Similarity Index Processor
What is the project intended to solve?
- Resolves technical debt in "Test/Requirement/Issues/Any-text" repositories, where each entry has a unique ID, using Natural Language Processing.
- Puts a continuous duplicate-monitoring system in place that checks any new text added to the "Test/Requirement/Issues/Any-text" bank for duplication.
- Groups similar "Test/Requirement/Issues/Any-text" entries, which reduces their number while the quality quotient remains the same.
- Brings down the cycle time of test execution, as similar tests are identified for merging.
- Reduces repeated requirements; the issues list can likewise be merged or reduced.
Technology stack
Python with a few Python packages
Dependencies
Python 3.7.3
[packages]
pip, mutmut, pytest, xlrd, xlsxwriter, pandas, codecov, pytest-cov, pylint
Installation
pip install similarity-processor
Usage & Configuration
1. How to use the tool from the source code:
From any editor that supports Python (preferably PyCharm: set similarity_processor and text-de-duplication_monitoring as root by right-clicking and selecting the option).
Make sure to set the right Python interpreter and that it lists all the packages mentioned under Dependencies.
Option 1: UI
Execute similarity_ui.py, which opens the UI window where you enter the following options:
- Path to the test/requirement/other document to be analyzed.
- Unique ID column index in the csv/xlsx (0, 1, etc.).
- Steps/Description column indices for content matching (columns of interest in the csv/xlsx, separated by commas, like 1,2,3).
- If a new requirement/test is to be checked against the existing ones, enable the check box and paste the content to be checked into the new text box.
Option 2: commandline
$ python similarity_processor\similarity_cmd.py --h
usage: similarity_cmd.py [-h] [--path --p] [--uniqid --u]
[--colint --c]
Text Similarity Index Processor
optional arguments:
-h, --help show this help message and exit
--path --p the Input file path
--uniqid --u uniq id index(column) of the input file
--colint --c the col of interest
2. How to use the tool after pip install similarity-processor
Option 1: Use only the similarity computation for simple texts, without writing the result to xlsx
>>> from similarity_processor import similarity_core
>>> x = similarity_core.text_to_vector("this is a sample test")
>>> y = similarity_core.text_to_vector("this is a sample")
>>> w = similarity_core.get_cosine(x,y)
>>> print(w)
0.8944271909999159
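The value above is consistent with a bag-of-words cosine similarity: the two sentences share four words and contain five and four distinct words respectively, giving 4 / (sqrt(5) * sqrt(4)), roughly 0.894. The following is a minimal sketch of that computation using only the standard library. It assumes plain word counts as the vector and is not necessarily how similarity_core implements text_to_vector and get_cosine; the names words_to_counts and cosine are hypothetical.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def words_to_counts(text):
    """Hypothetical helper: map each word in the text to its count."""
    return Counter(WORD.findall(text.lower()))


def cosine(vec_a, vec_b):
    """Cosine similarity between two word-count vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


x = words_to_counts("this is a sample test")
y = words_to_counts("this is a sample")
score = cosine(x, y)
print(score)
```

A score of 1.0 means the two texts have identical word distributions, and 0.0 means they share no words.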
Option 2: Generate similarity for a group of texts (test cases, requirements, etc.) present in an xlsx file
>>> from similarity_processor.similarity_io import SimilarityIO
>>> similarity_io_obj = SimilarityIO("TestBank.xlsx", 0, "1,2,3", 0, None)
>>> similarity_io_obj.orchestrate_similarity()
Arguments, in order:
- Path to the input file
- Column index of the unique ID in the xlsx
- Columns of interest in the xlsx
- Whether a new text is being checked against an existing text bank
- If so, the new text
The output is written to the same folder as the input file. The files are:
- A file listing duplicate IDs, if any are found in the unique ID column
- A recommendation file with similarity values
- A merged file with the data from the columns of interest in the xlsx
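To make the three outputs more concrete, here is a rough standard-library sketch of the same idea on in-memory rows: find repeated IDs, merge the columns of interest into one text per row, and score every pair of rows. It is an illustration under assumptions, not the SimilarityIO implementation; the sample rows, the variable names, and the whitespace-based cosine helper are all hypothetical, and no xlsx file is read or written.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(text_a, text_b):
    """Bag-of-words cosine similarity between two strings (assumed metric)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical rows standing in for a sheet: column 0 holds the unique ID.
rows = [
    ["T1", "login", "open page", "enter password"],
    ["T2", "login", "open page", "enter password and submit"],
    ["T3", "report", "open menu", "export file"],
    ["T1", "logout", "open menu", "click logout"],  # repeated ID
]
uniq_col, cols_of_interest = 0, [1, 2, 3]

# 1. IDs that occur more than once in the unique ID column.
id_counts = Counter(row[uniq_col] for row in rows)
duplicate_ids = sorted(uid for uid, n in id_counts.items() if n > 1)

# 2. Merge the columns of interest into one text per row.
merged = [
    (row[uniq_col], " ".join(row[c] for c in cols_of_interest)) for row in rows
]

# 3. Score every pair of rows, highest similarity first.
recommendations = sorted(
    (
        (id_a, id_b, cosine(text_a, text_b))
        for (id_a, text_a), (id_b, text_b) in combinations(merged, 2)
    ),
    key=lambda item: item[2],
    reverse=True,
)

print("duplicate ids:", duplicate_ids)
for id_a, id_b, score in recommendations[:3]:
    print(id_a, id_b, round(score, 2))
```

Comparing every pair grows quadratically with the number of rows, so a sketch like this suits small banks only.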
Option 3: Generate similarity for a group of texts (test cases, requirements, etc.) present in an xlsx file, through the command line
>python -m similarity_processor.similarity_cmd --h
>python -m similarity_processor.similarity_cmd --p "TestBank.xlsx" --u 0 --c "1,2,3"
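The short forms --p, --u, --c and --h are accepted because Python's argparse, by default, lets a long option be abbreviated to any unambiguous prefix. The sketch below illustrates this with a parser that mirrors the documented option names. The parser construction and the parse_columns helper are assumptions for illustration, not the tool's actual source.

```python
import argparse


def parse_columns(value):
    """Hypothetical helper: turn "1,2,3" into [1, 2, 3]."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser():
    """Build a parser with the three documented long options."""
    parser = argparse.ArgumentParser(description="Text Similarity Index Processor")
    parser.add_argument("--path", help="the Input file path")
    parser.add_argument("--uniqid", type=int,
                        help="uniq id index(column) of the input file")
    parser.add_argument("--colint", type=parse_columns,
                        help="the col of interest")
    return parser


# "--p", "--u" and "--c" resolve to --path, --uniqid and --colint
# because each prefix matches exactly one option.
args = build_parser().parse_args(
    ["--p", "TestBank.xlsx", "--u", "0", "--c", "1,2,3"]
)
print(args.path, args.uniqid, args.colint)
```

If a future option shared a prefix with an existing one (for example a second option starting with --p), the abbreviation would become ambiguous and argparse would reject it, so spelling out the full option names in scripts is the safer habit.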
Option 4: Generate similarity for a group of texts (test cases, requirements, etc.) present in an xlsx file, through the UI
>python -m similarity_processor.similarity_ui
- Path to the test/requirement/other document to be analyzed.
- Unique ID column index in the csv/xlsx (0, 1, etc.).
- Steps/Description column indices for content matching (columns of interest in the csv/xlsx, separated by commas, like 1,2,3).
- If a new requirement/test is to be checked against the existing ones, enable the check box and paste the content to be checked into the new text box.
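The "new text" check in the last option can be pictured as ranking every entry of the existing bank by its similarity to the pasted text. The sketch below shows that idea with the standard library only. The bank contents, the 0.8 threshold, and the whitespace-based cosine helper are hypothetical choices for illustration and may differ from what the tool does internally.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Bag-of-words cosine similarity between two strings (assumed metric)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical existing bank, keyed by unique ID.
bank = {
    "R1": "the system shall lock the account after three failed logins",
    "R2": "the system shall export reports as pdf",
    "R3": "the user can reset the password by email",
}
new_text = "the system shall lock the account after three failed login attempts"

# Rank the existing entries by similarity to the new text.
ranked = sorted(
    ((uid, cosine(new_text, text)) for uid, text in bank.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
best_id, best_score = ranked[0]

THRESHOLD = 0.8  # hypothetical cut-off for flagging a likely duplicate
if best_score >= THRESHOLD:
    print(f"likely duplicate of {best_id} (similarity {best_score:.2f})")
else:
    print("no close match found")
```

The right threshold depends on the text bank; a lower value flags more candidates for manual review.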
How to test the software
- To test the tool, navigate to "text_de_duplication_monitoring", which is the root directory.
- To run all the tests, issue:
pytest -v
- To report the pytest results in HTML, issue:
pytest --html=report.html
- To run the tests with coverage:
pytest --cov-report html --cov="similarity_processor"
- pydoc creation:
python -m pydoc -w module_name
- Mutation testing using mutmut:
mutmut --paths-to-mutate "path_to\similarity_processor" run
- pylint execution on the code:
pylint similarity_processor test >"path_to_save_file\pylint.txt"
- jscpd execution on the root folder:
jscpd --min-tokens 20 --reporters "html" --mode "strict" --format "python" --output . .
Limitations
- Input is accepted only via xlsx
- Standalone application, not web enabled
- Users have to export the input to csv/xlsx themselves
- The tool is not yet plugged into TFS, ALM, etc.
Improvements/ Road-map
- Increase the test efficiency based on mutation testing output.
- Make the tool web enabled (using python flask...).
- Create hooks to TFS, ALM, etc. so that the tool can download the tests/requirements/defects and do further processing.
- Enable the tool to do similarity check on code base.
Contact / Getting help
Brijesh Krishnan <brijesh.krishnank@philips.com>
Dattatreya Vellal <dsvellal@philips.com>
License
The MIT License (MIT) Copyright © [2019] Koninklijke Philips N.V, https://www.philips.com
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.