Insight Extractor Package
TakeBlipInsightExtractor Package
Data & Analytics Research
Overview
This document covers:
- Intro
- Parameters
- Example of initialization and usage
Intro
The Insight Extractor analyzes large volumes of textual data to identify, cluster, and detail subjects. It achieves this by applying a proprietary Named Entity Recognition (NER) algorithm followed by a clustering algorithm. The IE Cloud also lets anyone use the tool without needing substantial computational resources of their own.
The package outputs four types of files:
- Wordcloud: an image file containing a word cloud of the most frequent subjects in the text. The colours represent groups of similar subjects.
- Wordtree: an HTML file showing the relationships between subjects as an interactive tree, together with example sentences in which each subject is used. The user can navigate along the tree.
- Hierarchy: a JSON file containing the hierarchical relationships between subjects.
- Table: a CSV file containing the following columns:
  - Message: the original message;
  - Entities: the entities found in the original message;
  - Groups: the entity groups found;
  - Structured Message: the relevant content (structured message).
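As a minimal sketch of how the Table output might be consumed, the snippet below counts rows per entity group using only the standard library. The column names come from the list above; the sample rows, the '|' delimiter of the output file, and the helper `count_groups` are assumptions made for illustration and are not taken from the package.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the Table output. The real file's
# delimiter and the exact format of each column are assumptions.
SAMPLE = (
    "Message|Entities|Groups|Structured Message\n"
    "I want to pay my bill|bill|payment|pay bill\n"
    "Where is my invoice?|invoice|payment|invoice\n"
    "Cancel my plan|plan|subscription|cancel plan\n"
)


def count_groups(handle, delimiter="|"):
    """Count how many rows fall into each value of the Groups column."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return Counter(row["Groups"] for row in reader)


group_counts = count_groups(io.StringIO(SAMPLE))
print(group_counts.most_common())
```

To run this against a real output file, pass an open file handle instead of the in-memory sample and adjust the delimiter if the file uses a different one.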
Parameters
The following parameters need to be set by the user on the command line:
- embedding_path: path to the embedding model (the file should end with .kv);
- postagging_model_path: path to the POS-tagging model (the file should end with .pkl);
- postagging_label_path: path to the POS-tagging label file (the file should end with .pkl);
- ner_model_path: path to the NER model (the file should end with .pkl);
- ner_label_path: path to the NER label file (the file should end with .pkl);
- file: path to the csv file the user wants to analyze;
- user_email: the user's Take Blip email address, where the analysis will be sent;
- bot_name: bot ID.
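Because each model path is expected to carry a specific extension, it can help to check the paths before starting a long analysis. The following is a sketch under stated assumptions: `find_suffix_problems` is a hypothetical helper, not part of the package, and the example paths are placeholders. The expected extensions are those listed above.

```python
from pathlib import Path

# Expected extensions, taken from the parameter list above.
EXPECTED_SUFFIXES = {
    "embedding_path": ".kv",
    "postagging_model_path": ".pkl",
    "postagging_label_path": ".pkl",
    "ner_model_path": ".pkl",
    "ner_label_path": ".pkl",
    "file": ".csv",
}


def find_suffix_problems(params):
    """Return messages for parameters that are missing or have the wrong extension."""
    problems = []
    for name, suffix in EXPECTED_SUFFIXES.items():
        value = params.get(name)
        if not value:
            problems.append(f"{name}: missing")
        elif Path(value).suffix.lower() != suffix:
            problems.append(f"{name}: expected a {suffix} file, got {value!r}")
    return problems


# Hypothetical paths, for illustration only.
example_params = {
    "embedding_path": "models/embedding.kv",
    "postagging_model_path": "models/postagging.pkl",
    "postagging_label_path": "models/postagging_labels.pkl",
    "ner_model_path": "models/ner.txt",  # wrong extension on purpose
    "file": "data/messages.csv",
    # ner_label_path is left out on purpose
}
problems = find_suffix_problems(example_params)
print(problems)
```

This check only looks at file names; it does not verify that the files exist or that they can be loaded.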
The following parameters have default settings, but can be customized by the user:
- node_messages_examples: an int giving the number of examples output for each subject in the Wordtree file. The default value is 100;
- similarity_threshold: a float giving the similarity threshold between subject groups. The default value is 0.65; we recommend not modifying this parameter;
- percentage_threshold: a float giving the frequency percentile at or above which subjects are kept in the analysis. The default value is 0.9;
- batch_size: an int giving the batch size. The default value is 50;
- chunk_size: an int giving the chunk size used when uploading files to storage. The default value is 1024;
- separator: a str giving the delimiter character of the csv file. The default value is '|'.
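To give an intuition for percentage_threshold, the sketch below applies a simple rank-based percentile cutoff to subject frequencies. This is one possible reading of the parameter and not the package's actual implementation; `keep_frequent_subjects` and the sample counts are hypothetical.

```python
from collections import Counter


def keep_frequent_subjects(subject_counts, percentage_threshold=0.9):
    """Keep subjects whose frequency is at or above the given percentile.

    Uses a rank-based cutoff over the sorted frequency values. This is an
    assumed reading of the parameter, not the package's actual code.
    """
    if not subject_counts:
        return {}
    if not 0.0 <= percentage_threshold <= 1.0:
        raise ValueError("percentage_threshold must be between 0 and 1")
    frequencies = sorted(subject_counts.values())
    # Index of the frequency sitting at the requested percentile.
    cutoff_index = min(int(percentage_threshold * len(frequencies)),
                       len(frequencies) - 1)
    cutoff = frequencies[cutoff_index]
    return {subject: count
            for subject, count in subject_counts.items()
            if count >= cutoff}


# Hypothetical subject frequencies, for illustration only.
counts = Counter({"bill": 50, "invoice": 30, "plan": 10, "card": 5, "app": 2,
                  "login": 2, "chat": 1, "refund": 1, "address": 1, "coupon": 1})
kept_default = keep_frequent_subjects(counts)       # default threshold of 0.9
kept_relaxed = keep_frequent_subjects(counts, 0.7)  # keeps more subjects
print(sorted(kept_default), sorted(kept_relaxed))
```

Under this reading, lowering the threshold keeps more subjects in the analysis, while raising it restricts the output to the most frequent ones.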
Example of initialization and usage:
- Import main packages;
- Initialize main variables;
- Initialize eventhub logger;
- Initialize Insight Extractor;
- Insight Extractor usage.
An example of the above steps can be found in the Python code below:
- Import main packages
import uuid
from TakeBlipInsightExtractor.insight_extractor import InsightExtractor
from TakeBlipInsightExtractor.outputs.eventhub_log_sender import EventHubLogSender
- Initialize main variables
embedding_path = '*.kv'
postag_model_path = '*.pkl'
postag_label_path = '*.pkl'
ner_model_path = '*.pkl'
ner_label_path = '*.pkl'
user_email = 'your_email@host.com'
bot_name = 'my_bot_for_insight_extractor'
application_name = 'your application'
eventhub_name = '*'
eventhub_connection_string = '*'
file_name = '*'
input_data = '*.csv'
separator = '|'
similarity_threshold = 0.65
node_messages_examples = 100
batch_size = 1024
percentage_threshold = 0.7
- Initialize eventhub logger
correlation_id = str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))
logger = EventHubLogSender(application_name=application_name,
                           user_email=user_email,
                           bot_name=bot_name,
                           file_name=file_name,
                           correlation_id=correlation_id,
                           connection_string=eventhub_connection_string,
                           eventhub_name=eventhub_name)
- Initialize Insight Extractor
insight_extractor = InsightExtractor(input_data,
                                     separator=separator,
                                     similarity_threshold=similarity_threshold,
                                     embedding_path=embedding_path,
                                     postagging_model_path=postag_model_path,
                                     postagging_label_path=postag_label_path,
                                     ner_model_path=ner_model_path,
                                     ner_label_path=ner_label_path,
                                     user_email=user_email,
                                     bot_name=bot_name,
                                     logger=logger)
- Insight Extractor usage
insight_extractor.predict(percentage_threshold=percentage_threshold,
                          node_messages_examples=node_messages_examples,
                          batch_size=batch_size)
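A note on the correlation ID used in the example: `uuid.uuid3` builds a name-based UUID, so the same email and bot name always produce the same identifier. The short sketch below demonstrates this with the standard library alone; `make_correlation_id` is a hypothetical wrapper introduced only for illustration.

```python
import uuid


def make_correlation_id(user_email, bot_name):
    """Build a name-based UUID the same way as the example above."""
    return str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))


first = make_correlation_id("your_email@host.com", "my_bot_for_insight_extractor")
second = make_correlation_id("your_email@host.com", "my_bot_for_insight_extractor")
other = make_correlation_id("your_email@host.com", "another_bot")

print(first == second)  # same inputs give the same ID
print(first == other)   # a different bot name gives a different ID
```

One consequence is that separate runs by the same user for the same bot share a correlation ID. If each run needs its own identifier, a random `uuid.uuid4()` would be an alternative.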
Hashes for TakeBlipInsightExtractor-0.0.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | c9b3774b79ee8a1aa612d979e1adc1c3cf7b2faec70245a3d7b633f6c33b819b
MD5 | 6c19f4243bd256637392dd0f31402af4
BLAKE2b-256 | e8cda96b89a497fffbcc69b4eddadda14a595d064b0f60e3d593cd3e835dead3
Hashes for TakeBlipInsightExtractor-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | e7ed0dc924c6e7802d964ab629a17e664819e3df739b731cd70fecfc53d9c08e
MD5 | 2fd61513ea6fc19387df871bfe863fb2
BLAKE2b-256 | 921c8eb51252c0e63391c8f43bf9bb082dea13a3b27b16b879db6287c7262134