
Insight Extractor Package


TakeBlipInsightExtractor Package

Data & Analytics Research

Overview

This document covers the following content:

  • Intro
  • Run
  • Example of initialization and usage

Intro

The Insight Extractor analyzes large volumes of textual data to identify, cluster, and detail subjects. It achieves these results by applying a proprietary Named Entity Recognition (NER) algorithm followed by a clustering algorithm. The IE Cloud also lets anyone use this tool without needing substantial computational resources of their own.
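
The NER and clustering algorithms themselves are proprietary and are not shown in this document. Purely as an illustration of what grouping subjects by similarity can look like, the sketch below greedily groups toy word vectors by cosine similarity. The vectors, the `cosine` and `group_subjects` helpers, and the greedy strategy are all assumptions made for this example, not the package's implementation; the 0.65 threshold mirrors the default `similarity_threshold` listed under Parameters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_subjects(vectors, threshold=0.65):
    """Greedy grouping: a subject joins the first group whose seed vector
    is at least `threshold` similar; otherwise it starts a new group."""
    groups = []  # list of (seed_vector, [subjects])
    for subject, vec in vectors.items():
        for seed, members in groups:
            if cosine(seed, vec) >= threshold:
                members.append(subject)
                break
        else:
            groups.append((vec, [subject]))
    return [members for _, members in groups]


# Toy 3-dimensional "embeddings", invented for this illustration.
toy_vectors = {
    "cobrança": (0.9, 0.1, 0.0),
    "fatura": (0.8, 0.3, 0.1),
    "internet": (0.0, 0.2, 0.9),
}
groups = group_subjects(toy_vectors)
print(groups)
```

With these toy vectors, the two billing-related subjects end up in one group and "internet" in another.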

The package outputs four types of files:

  • Wordcloud: an image file containing a wordcloud of the most frequent subjects in the text. The colours represent groups of similar subjects.

  • Wordtree: an html file showing the relationships between subjects as a graph, together with example sentences in which they are used. The graph is interactive, so the user can navigate along the tree.

  • Hierarchy: It's a json file which contains the hierarchical relationship between subjects.

  • Table: It's a csv file containing the following columns:

      Message                   |  Entities                                                                                    | Groups     | Structured Message
      sobre cobranca inexistente|[{'value': 'cobrança', 'lowercase_value': 'cobrança', 'postags': 'SUBS', 'type': 'financial'}]|['cobrança']|sobre cobrança inexistente
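
The `Entities` and `Groups` columns hold Python-style literals, so they need to be parsed after the csv is read. A minimal sketch, assuming the output table uses `|` as its delimiter and the column names shown above; the sample row is copied from the example, and the in-memory string stands in for the real file:

```python
import ast
import csv
import io

# In-memory stand-in for the Table csv; the row is the one shown above.
sample = (
    "Message|Entities|Groups|Structured Message\n"
    "sobre cobranca inexistente|"
    "[{'value': 'cobrança', 'lowercase_value': 'cobrança', "
    "'postags': 'SUBS', 'type': 'financial'}]|"
    "['cobrança']|sobre cobrança inexistente\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="|")
rows = []
for row in reader:
    # literal_eval safely turns the text into Python lists and dicts.
    row["Entities"] = ast.literal_eval(row["Entities"])
    row["Groups"] = ast.literal_eval(row["Groups"])
    rows.append(row)

first = rows[0]
print(first["Groups"], first["Entities"][0]["type"])
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals and will not execute arbitrary code found in the file.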
    

Parameters

The following parameters need to be set by the user on the command line:

  • embedding_path: path to the embedding model, the file should end with .kv;
  • postagging_model_path: path to the postagging model, the file should end with .pkl;
  • postagging_label_path: path to the postagging label file, the file should end with .pkl;
  • ner_model_path: path to the ner model, the file should end with .pkl;
  • ner_label_path: path to the ner label file, the file should end with .pkl;
  • file: path to the csv file the user wants to analyze;
  • user_email: user's Take Blip email where they want to receive the analysis;
  • bot_name: bot ID.
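
Because each path is expected to have a specific extension, it can help to validate the paths before starting a run. The helper below is hypothetical (`check_model_paths` is not part of the package) and only checks file suffixes, not whether the files exist or load correctly. Requiring `.csv` for `file` is an assumption based on the description above, and the paths are invented for the example.

```python
from pathlib import Path

# Expected suffix per parameter, taken from the list above.
# The ".csv" entry for "file" is an assumption.
EXPECTED_SUFFIXES = {
    "embedding_path": ".kv",
    "postagging_model_path": ".pkl",
    "postagging_label_path": ".pkl",
    "ner_model_path": ".pkl",
    "ner_label_path": ".pkl",
    "file": ".csv",
}


def check_model_paths(paths):
    """Return a list of error messages for missing or mis-suffixed paths."""
    errors = []
    for name, suffix in EXPECTED_SUFFIXES.items():
        value = paths.get(name)
        if value is None:
            errors.append(f"{name} is missing")
        elif Path(value).suffix != suffix:
            errors.append(f"{name} should end with {suffix}: {value}")
    return errors


# Hypothetical paths, used only to exercise the helper.
paths = {
    "embedding_path": "models/embedding.kv",
    "postagging_model_path": "models/postagging_model.pkl",
    "postagging_label_path": "models/postagging_label.pkl",
    "ner_model_path": "models/ner_model.txt",  # wrong suffix on purpose
    "ner_label_path": "models/ner_label.pkl",
    "file": "data/messages.csv",
}
problems = check_model_paths(paths)
print(problems)
```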

The following parameters have default settings, but can be customized by the user:

  • node_messages_examples: an int representing the number of examples output for each subject in the Wordtree file. The default value is 100;
  • similarity_threshold: a float representing the similarity threshold between the subject groups. The default value is 0.65; we recommend not modifying this parameter;
  • percentage_threshold: a float representing the frequency percentile that subjects must reach to be kept in the analysis. The default value is 0.9;
  • batch_size: an int representing the batch size. The default value is 50;
  • chunk_size: an int representing the chunk size used when uploading files to storage. The default value is 1024;
  • separator: a str giving the delimiter character of the csv file. The default value is '|'.
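
This document does not show the command-line entry point, so the parser below is only a sketch of how the parameters and defaults listed above could be declared with `argparse`. The `--flag` spellings, the `build_parser` function, and the sample argument values are assumptions.

```python
import argparse

REQUIRED = (
    "embedding_path",
    "postagging_model_path",
    "postagging_label_path",
    "ner_model_path",
    "ner_label_path",
    "file",
    "user_email",
    "bot_name",
)


def build_parser():
    """Declare the required parameters and the documented defaults."""
    parser = argparse.ArgumentParser(description="Insight Extractor parameters (sketch)")
    for name in REQUIRED:
        parser.add_argument(f"--{name}", required=True)
    parser.add_argument("--node_messages_examples", type=int, default=100)
    parser.add_argument("--similarity_threshold", type=float, default=0.65)
    parser.add_argument("--percentage_threshold", type=float, default=0.9)
    parser.add_argument("--batch_size", type=int, default=50)
    parser.add_argument("--chunk_size", type=int, default=1024)
    parser.add_argument("--separator", type=str, default="|")
    return parser


# Hypothetical values; only --batch_size overrides a default here.
argv = []
for name in REQUIRED:
    argv += [f"--{name}", f"placeholder_{name}"]
argv += ["--batch_size", "200"]

args = build_parser().parse_args(argv)
print(args.batch_size, args.similarity_threshold, args.separator)
```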

Example of initialization and usage:

  1. Import main packages;
  2. Initialize main variables;
  3. Initialize eventhub logger;
  4. Initialize Insight Extractor;
  5. Insight Extractor usage.

An example of the above steps can be found in the Python code below:

  1. Import main packages
import uuid
from TakeBlipInsightExtractor.insight_extractor import InsightExtractor
from TakeBlipInsightExtractor.outputs.eventhub_log_sender import EventHubLogSender
  2. Initialize main variables
embedding_path = '*.kv'
postag_model_path = '*.pkl'
postag_label_path = '*.pkl'
ner_model_path = '*.pkl'
ner_label_path = '*.pkl'

user_email = 'your_email@host.com'
bot_name = 'my_bot_for_insight_extractor'
application_name = 'your application'

eventhub_name = '*'
eventhub_connection_string = '*'

file_name = '*'
input_data = '*.csv'
separator = '|'

similarity_threshold = 0.65
node_messages_examples = 100
batch_size = 1024
percentage_threshold = 0.7
  3. Initialize eventhub logger
correlation_id = str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))
logger = EventHubLogSender(application_name=application_name,
                           user_email=user_email,
                           bot_name=bot_name,
                           file_name=file_name,
                           correlation_id=correlation_id,
                           connection_string=eventhub_connection_string,
                           eventhub_name=eventhub_name)
  4. Initialize Insight Extractor
insight_extractor = InsightExtractor(input_data,
                                     separator=separator,
                                     similarity_threshold=similarity_threshold,
                                     embedding_path=embedding_path,
                                     postagging_model_path=postag_model_path,
                                     postagging_label_path=postag_label_path,
                                     ner_model_path=ner_model_path,
                                     ner_label_path=ner_label_path,
                                     user_email=user_email,
                                     bot_name=bot_name,
                                     logger=logger)
  5. Insight Extractor usage
insight_extractor.predict(percentage_threshold=percentage_threshold,
                          node_messages_examples=node_messages_examples,
                          batch_size=batch_size)
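
One property of the logger setup in this example is worth noting: `uuid.uuid3` produces a name-based UUID, so the same `user_email` and `bot_name` always yield the same `correlation_id`. The short check below demonstrates this with the placeholder values from the example; the `make_correlation_id` wrapper and the `another_bot` name are introduced here only for illustration.

```python
import uuid


def make_correlation_id(user_email, bot_name):
    """Same construction as in the example: a name-based (MD5) UUID."""
    return str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))


first = make_correlation_id('your_email@host.com', 'my_bot_for_insight_extractor')
second = make_correlation_id('your_email@host.com', 'my_bot_for_insight_extractor')
other = make_correlation_id('your_email@host.com', 'another_bot')

print(first == second)  # identical inputs give the identical id
print(first == other)   # a different bot name gives a different id
```

If a distinct identifier per run is needed instead, `uuid.uuid4()` generates a random UUID on each call.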
