Insight Extractor Package
TakeBlipInsightExtractor Package
Data & Analytics Research
Overview
This document covers the following content:
Intro
The Insight Extractor offers a way to analyze huge volumes of textual data in order to identify, cluster and detail subjects. The project achieves these results by applying a proprietary Named Entity Recognition (NER) algorithm followed by a clustering algorithm. The IE Cloud also lets anyone use this tool without needing substantial computational resources of their own.
The package outputs four types of files:
- Wordcloud: It's an image file containing a wordcloud describing the most frequent subjects in the text. The colours represent the groups of similar subjects.
- Wordtree: It's an HTML file which contains the graphic relationship between the subjects and examples of their use in sentences. It's an interactive graphic where the user can navigate along the tree.
- Hierarchy: It's a JSON file which contains the hierarchical relationship between subjects.
- Table: It's a CSV file containing the following columns:

Message | Entities | Groups | Structured Message
---|---|---|---
sobre cobranca inexistente | [{'value': 'cobrança', 'lowercase_value': 'cobrança', 'postags': 'SUBS', 'type': 'financial'}] | ['cobrança'] | sobre cobrança inexistente
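The Entities and Groups columns hold Python-style list literals, so they need decoding after the CSV is read. The sketch below shows one way to do that with the standard library. It assumes the output table uses '|' as its delimiter, as the example row suggests; `read_table` is a hypothetical helper and is not part of the package.

```python
import ast
import csv
import io

# Sample content mirroring the example row above. A real run would open the
# CSV file produced by the package instead of using an in-memory string.
sample = (
    "Message|Entities|Groups|Structured Message\n"
    "sobre cobranca inexistente|[{'value': 'cobrança', "
    "'lowercase_value': 'cobrança', 'postags': 'SUBS', "
    "'type': 'financial'}]|['cobrança']|sobre cobrança inexistente\n"
)


def read_table(handle, delimiter="|"):
    """Hypothetical helper: parse rows and decode the list-valued columns."""
    rows = []
    # QUOTE_NONE leaves the quote characters inside the cells untouched.
    reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        # literal_eval only accepts Python literals, so it is safer than eval.
        row["Entities"] = ast.literal_eval(row["Entities"])
        row["Groups"] = ast.literal_eval(row["Groups"])
        rows.append(row)
    return rows


rows = read_table(io.StringIO(sample))
for row in rows:
    values = [entity["value"] for entity in row["Entities"]]
    print(row["Message"], values, row["Groups"])
```

A message that itself contains the delimiter character would break this simple parsing, so it may be worth checking the column count of each row on real data.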
Parameters
The following parameters need to be set by the user on the command line:
- embedding_path: path to the embedding model, the file should end with .kv;
- postagging_model_path: path to the postagging model, the file should end with .pkl;
- postagging_label_path: path to the postagging label file, the file should end with .pkl;
- ner_model_path: path to the NER model, the file should end with .pkl;
- ner_label_path: path to the NER label file, the file should end with .pkl;
- file: path to the csv file the user wants to analyze;
- user_email: user's Take Blip email where they want to receive the analysis;
- bot_name: bot ID.
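Since several of these parameters must point to files with a specific extension, a quick check before starting a long run can save time. The following is a minimal sketch of such a check; `find_parameter_problems` and the example paths are hypothetical and are not part of the package.

```python
from pathlib import Path

# Expected file suffix per parameter, taken from the list above.
EXPECTED_SUFFIXES = {
    "embedding_path": ".kv",
    "postagging_model_path": ".pkl",
    "postagging_label_path": ".pkl",
    "ner_model_path": ".pkl",
    "ner_label_path": ".pkl",
}

# Required parameters for which the list above states no suffix rule.
OTHER_REQUIRED = ("file", "user_email", "bot_name")


def find_parameter_problems(params):
    """Hypothetical helper: list missing or wrongly suffixed parameters."""
    problems = []
    for name, suffix in EXPECTED_SUFFIXES.items():
        value = params.get(name)
        if not value:
            problems.append(f"{name} is required")
        elif Path(value).suffix.lower() != suffix:
            problems.append(f"{name} should end with {suffix}, got {value!r}")
    for name in OTHER_REQUIRED:
        if not params.get(name):
            problems.append(f"{name} is required")
    return problems


# Illustrative values only: the NER model has the wrong suffix and
# bot_name is missing.
example = {
    "embedding_path": "models/embedding.kv",
    "postagging_model_path": "models/postagging.pkl",
    "postagging_label_path": "models/postagging_labels.pkl",
    "ner_model_path": "models/ner.bin",
    "ner_label_path": "models/ner_labels.pkl",
    "file": "messages.csv",
    "user_email": "your_email@host.com",
}

problems = find_parameter_problems(example)
for problem in problems:
    print(problem)
```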
The following parameters have default settings, but can be customized by the user:
- node_messages_examples: it is an int representing the number of examples output for each subject in the Wordtree file. The default value is 100;
- similarity_threshold: it is a float representing the similarity threshold between the subject groups. The default value is 0.65. We recommend not modifying this parameter;
- percentage_threshold: it is a float representing the frequency percentile at or above which subjects are kept in the analysis. The default value is 0.9;
- batch_size: it is an int representing the batch size. The default value is 50;
- chunk_size: it is an int representing the chunk size used when uploading files to storage. The default value is 1024;
- separator: it is a str for the csv file delimiter character. The default value is '|'.
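To give an intuition for what a similarity threshold does, the sketch below groups toy subject vectors with a simple greedy rule based on cosine similarity. This is not the package's clustering algorithm, which is not described here. The grouping rule, the `group_subjects` helper and the three-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_subjects(vectors, similarity_threshold=0.65):
    """Greedy grouping (illustrative only): a subject joins the first group
    whose first member is at least `similarity_threshold` similar to it."""
    groups = []  # each group is a list of subject names
    for name, vector in vectors.items():
        for group in groups:
            representative = vectors[group[0]]
            if cosine_similarity(vector, representative) >= similarity_threshold:
                group.append(name)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([name])
    return groups


# Toy embeddings invented for this example; real embeddings come from the
# .kv model and have many more dimensions.
toy_vectors = {
    "cobrança": [0.9, 0.1, 0.0],
    "fatura": [0.8, 0.3, 0.1],
    "entrega": [0.0, 0.2, 0.9],
}

groups = group_subjects(toy_vectors)
strict_groups = group_subjects(toy_vectors, similarity_threshold=0.99)
print(groups)
print(strict_groups)
```

With the default threshold, the two billing-related toy vectors land in the same group. Raising the threshold to 0.99 splits every subject into its own group, which shows how sensitive the grouping can be to this parameter.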
Example of initialization and usage:
- Import main packages;
- Initialize main variables;
- Initialize eventhub logger;
- Initialize Insight Extractor;
- Insight Extractor usage.
An example of the above steps can be found in the Python code below:
- Import main packages
import uuid
from TakeBlipInsightExtractor.insight_extractor import InsightExtractor
from TakeBlipInsightExtractor.outputs.eventhub_log_sender import EventHubLogSender
- Initialize main variables
embedding_path = '*.kv'
postag_model_path = '*.pkl'
postag_label_path = '*.pkl'
ner_model_path = '*.pkl'
ner_label_path = '*.pkl'
user_email = 'your_email@host.com'
bot_name = 'my_bot_for_insight_extractor'
application_name = 'your application'
eventhub_name = '*'
eventhub_connection_string = '*'
file_name = '*'
input_data = '*.csv'
separator = '|'
similarity_threshold = 0.65
node_messages_examples = 100
batch_size = 1024
percentage_threshold = 0.7
- Initialize eventhub logger
correlation_id = str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))
logger = EventHubLogSender(application_name=application_name,
                           user_email=user_email,
                           bot_name=bot_name,
                           file_name=file_name,
                           correlation_id=correlation_id,
                           connection_string=eventhub_connection_string,
                           eventhub_name=eventhub_name)
- Initialize Insight Extractor
insight_extractor = InsightExtractor(input_data,
                                     separator=separator,
                                     similarity_threshold=similarity_threshold,
                                     embedding_path=embedding_path,
                                     postagging_model_path=postag_model_path,
                                     postagging_label_path=postag_label_path,
                                     ner_model_path=ner_model_path,
                                     ner_label_path=ner_label_path,
                                     user_email=user_email,
                                     bot_name=bot_name,
                                     logger=logger)
- Insight Extractor usage
insight_extractor.predict(percentage_threshold=percentage_threshold,
                          node_messages_examples=node_messages_examples,
                          batch_size=batch_size)
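The correlation ID in the example above is built with `uuid.uuid3`, which derives a UUID from a namespace and a name. The same email and bot name therefore always yield the same ID. The short sketch below demonstrates this property; `make_correlation_id` is a hypothetical wrapper around the expression used in the example.

```python
import uuid


def make_correlation_id(user_email, bot_name):
    """Hypothetical wrapper around the expression used in the example above."""
    return str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))


first = make_correlation_id("your_email@host.com", "my_bot_for_insight_extractor")
second = make_correlation_id("your_email@host.com", "my_bot_for_insight_extractor")
other = make_correlation_id("your_email@host.com", "another_bot")

# The ID is stable across calls for the same inputs and changes when the
# bot name changes.
print(first == second)
print(first == other)
```

Because the two strings are concatenated without a separator, different pairs can produce the same name (for instance 'ab' + 'c' and 'a' + 'bc'). If distinct IDs per run are needed, a random `uuid.uuid4()` would be one alternative.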