


MapIntel


Introduction

MapIntel is a system for acquiring intelligence from vast collections of text data by representing each document as a multidimensional vector that captures its semantics. The system is designed to handle complex natural language queries and provides question-answering functionality. It also allows for visual exploration of the corpus. MapIntel uses a retriever engine that finds the nearest neighbors of the query embedding to identify the most relevant documents. It also leverages the embeddings by projecting them onto two dimensions while preserving the structure of the multidimensional space, which results in a map where semantically related documents form topical clusters that are captured using topic modeling. This map gives a fast overview of the corpus while supporting more detailed exploration and an interactive information encountering process. MapIntel can be used to explore many types of corpora.
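
The retrieval idea described above can be illustrated with a small, self-contained sketch. It is not MapIntel's implementation, which relies on a Haystack retriever backed by OpenSearch. The function names `cosine_similarity` and `nearest_documents`, the toy vectors, and the choice of cosine similarity as the closeness measure are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_documents(query_embedding, doc_embeddings, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), doc_id)
        for doc_id, embedding in doc_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
docs = {
    "sports": [0.9, 0.1, 0.0],
    "politics": [0.1, 0.9, 0.1],
    "football": [0.8, 0.2, 0.1],
}
print(nearest_documents([1.0, 0.0, 0.0], docs))
```

A brute-force scan like this is only practical for small corpora. For large collections, an approximate nearest-neighbor index, such as the one behind the OpenSearch knn_vector field type, avoids comparing the query against every document.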

MapIntel UI screenshot

Installation

For user installation, mapintel is available on PyPI and can be installed via pip:

pip install mapintel

Development installation requires cloning the repository and then using PDM to install the project as well as the main and development dependencies:

git clone https://github.com/NOVA-IMS-Innovation-and-Analytics-Lab/MapIntel.git
cd MapIntel
pdm install

Configuration

MapIntel aims to be a flexible system that can run with any user-provided corpus. To achieve this, it standardizes the data and models, and it expects all services to be deployed on AWS. An example of how to fully set up a MapIntel instance can be found at MapIntel-News. After deploying the required services, create a .env file at the root of the project with the environment variables described below.

AWS credentials

The following environment variable should be included in the .env file:

  • AWS_PROFILE_NAME

The user should have permissions to interact with the services described below.

Data

An OpenSearch database instance should be deployed in AWS with documents contained in an index called document. Each document is expected to have the content, date, embedding, embedding2d and topic fields with the following types:

  • content: text type that contains the main text of the document.
  • date: long type that represents the ordinal format of a date.
  • embedding: knn_vector type that represents the embedding vector of the document.
  • embedding2d: float type that represents the 2D embedding vector of the document.
  • topic: keyword type that assigns a topic label to each document.
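
The field types above could be expressed as an OpenSearch index body along the following lines. This is a sketch under stated assumptions rather than the mapping MapIntel itself creates: the embedding dimension of 768 is a placeholder that has to match the retriever model, the `index.knn` setting is what the OpenSearch k-NN plugin typically needs for knn_vector fields, and the reading of "ordinal format" as Python's `date.toordinal()` is a guess. The names `DOCUMENT_INDEX_BODY`, `to_ordinal` and `make_document` are hypothetical.

```python
from datetime import date

# Assumption: must equal the output size of the retriever's embedding model.
EMBEDDING_DIMENSION = 768

# Index body for the "document" index, following the field types listed above.
DOCUMENT_INDEX_BODY = {
    "settings": {"index": {"knn": True}},  # enables k-NN search on the index
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "date": {"type": "long"},
            "embedding": {"type": "knn_vector", "dimension": EMBEDDING_DIMENSION},
            "embedding2d": {"type": "float"},
            "topic": {"type": "keyword"},
        }
    },
}


def to_ordinal(day):
    """Convert a date to an integer (assumed: proleptic Gregorian ordinal)."""
    return day.toordinal()


def make_document(content, day, embedding, embedding2d, topic):
    """Build a document dictionary with the five expected fields."""
    if len(embedding2d) != 2:
        raise ValueError("embedding2d must contain exactly two values")
    return {
        "content": content,
        "date": to_ordinal(day),
        "embedding": list(embedding),
        "embedding2d": list(embedding2d),
        "topic": topic,
    }


doc = make_document(
    "Example text.", date(2024, 1, 1), [0.0] * EMBEDDING_DIMENSION, [0.3, -1.2], "example"
)
print(sorted(doc))
```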

The relevant environment variables are the following:

  • OPENSEARCH_ENDPOINT: The AWS endpoint of the OpenSearch deployed instance.
  • OPENSEARCH_PORT: The port of the instance.
  • OPENSEARCH_USERNAME: The username.
  • OPENSEARCH_PASSWORD: The password.

Models

MapIntel uses three models trained on the user-provided data. The first is a Haystack retriever model, the second is a model that reduces the embeddings to two dimensions, and the third is a generator model used for question answering. The corresponding environment variables are the following:

  • HAYSTACK_RETRIEVER_MODEL: The value of the parameter embedding_model of the Haystack class EmbeddingRetriever.
  • SAGEMAKER_DIMENSIONALITY_REDUCTIONER_ENDPOINT: The SageMaker endpoint of the deployed dimensionality reduction model.
  • SAGEMAKER_GENERATOR_MODEL_ENDPOINT: The SageMaker endpoint of the deployed generator.
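
Before starting the application, it may help to verify that the .env file defines every variable listed in the Configuration section. The following is a minimal sketch, assuming the file uses plain KEY=VALUE lines. The helpers `parse_env_text` and `missing_variables` and the sample values are hypothetical and are not part of MapIntel.

```python
REQUIRED_VARIABLES = (
    "AWS_PROFILE_NAME",
    "OPENSEARCH_ENDPOINT",
    "OPENSEARCH_PORT",
    "OPENSEARCH_USERNAME",
    "OPENSEARCH_PASSWORD",
    "HAYSTACK_RETRIEVER_MODEL",
    "SAGEMAKER_DIMENSIONALITY_REDUCTIONER_ENDPOINT",
    "SAGEMAKER_GENERATOR_MODEL_ENDPOINT",
)


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARIABLES if not values.get(name)]


# Hypothetical, incomplete .env content used only to exercise the helpers.
sample = """
# AWS credentials
AWS_PROFILE_NAME=default
OPENSEARCH_PORT="443"
"""
print(missing_variables(parse_env_text(sample)))
```

To check a real file, the same helpers can be applied to its text, for example `missing_variables(parse_env_text(open(".env").read()))`.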

Usage

To run the application use the following command:

mapintel

The server then starts and listens for connections at http://localhost:8080. Open this URL in a browser to interact with the MapIntel UI.
