

MapIntel


Introduction

MapIntel is a system for acquiring intelligence from vast collections of text data by representing each document as a multidimensional vector that captures its semantics. It is designed to handle complex natural language queries, provides question-answering functionality, and allows visual exploration of the corpus. MapIntel uses a retriever engine that finds the nearest neighbors of the query embedding to identify the most relevant documents. It also projects the embeddings onto two dimensions while preserving the structure of the multidimensional space, producing a map in which semantically related documents form topical clusters, which are captured using topic modeling. This map is intended to give a fast overview of the corpus while supporting more detailed exploration and an interactive information-encountering process. MapIntel can be used to explore many types of corpora.
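The retrieval step described above can be illustrated with a minimal sketch. It is not MapIntel's implementation: the function names, the toy three-dimensional embeddings and the choice of cosine similarity are assumptions made for illustration, and a deployed instance would delegate the search to the database rather than scan the documents in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_documents(query_embedding, doc_embeddings, k=2):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones come from the retriever model.
docs = {
    "doc_sports": [0.9, 0.1, 0.0],
    "doc_politics": [0.1, 0.9, 0.1],
    "doc_football": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
top = nearest_documents(query, docs, k=2)
print([doc_id for doc_id, _ in top])
```

The two documents whose vectors point in nearly the same direction as the query are ranked first, which is the behavior the retriever relies on to surface relevant documents.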

MapIntel UI screenshot

Installation

For user installation, mapintel is available on PyPI and can be installed via pip:

pip install mapintel

Development installation requires cloning the repository and then using PDM to install the project as well as the main and development dependencies:

git clone https://github.com/NOVA-IMS-Innovation-and-Analytics-Lab/MapIntel.git
cd MapIntel
pdm install

Configuration

MapIntel aims to be a flexible system that can run with any user-provided corpus. To achieve this, it standardizes the data and models, and all services are expected to be deployed on AWS. An example of how to fully set up a MapIntel instance can be found at MapIntel-News. After deploying the required services, create a .env file at the root of the project containing the environment variables described below.

AWS credentials

The following environment variable should be included in the .env file:

  • AWS_PROFILE_NAME

The user should have permissions to interact with the services described below.

Data

An OpenSearch database instance should be deployed on AWS, with the documents stored in an index called document. Each document is expected to have the content, date, embedding, embedding2d and topic fields with the following types:

  • content: text type that contains the main text of the document.
  • date: long type that represents the ordinal format of a date.
  • embedding: knn_vector type that represents the embedding vector of the document.
  • embedding2d: float type that represents the 2D embedding vector of the document.
  • topic: keyword type that assigns a topic label to each document.
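The field types listed above could be expressed as an OpenSearch index body along the following lines. This is a sketch under stated assumptions rather than the project's actual setup code: the helper names and the embedding dimension of 384 are hypothetical (the dimension must match the retriever model in use), and "ordinal format" is assumed here to mean the value returned by Python's date.toordinal().

```python
import datetime

# Hypothetical value; it must match the output size of the retriever model.
EMBEDDING_DIM = 384


def build_index_body(embedding_dim):
    """Index settings and mappings for the fields described in this section."""
    return {
        # k-NN search has to be enabled on the index to use knn_vector fields.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "date": {"type": "long"},
                "embedding": {"type": "knn_vector", "dimension": embedding_dim},
                "embedding2d": {"type": "float"},
                "topic": {"type": "keyword"},
            }
        },
    }


def make_document(content, date, embedding, embedding2d, topic):
    """Build one document, storing the date as an ordinal (assumed convention)."""
    return {
        "content": content,
        "date": date.toordinal(),
        "embedding": list(embedding),
        "embedding2d": list(embedding2d),
        "topic": topic,
    }


index_body = build_index_body(EMBEDDING_DIM)
doc = make_document(
    content="Example text of a document.",
    date=datetime.date(2023, 1, 1),
    embedding=[0.0] * EMBEDDING_DIM,
    embedding2d=[0.25, -1.5],
    topic="example-topic",
)
print(sorted(doc))
```

The resulting dictionaries can be passed to whichever OpenSearch client is used to create the index and add documents; the stored date can be converted back with datetime.date.fromordinal().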

The relevant environment variables are the following:

  • OPENSEARCH_ENDPOINT: The AWS endpoint of the OpenSearch deployed instance.
  • OPENSEARCH_PORT: The port of the instance.
  • OPENSEARCH_USERNAME: The username.
  • OPENSEARCH_PASSWORD: The password.

Models

MapIntel uses three models trained on the user-provided data. The first is a Haystack retriever model, the second reduces the embeddings to two dimensions, and the third is a generator model used for question answering. The corresponding environment variables are the following:

  • HAYSTACK_RETRIEVER_MODEL: The value of the parameter embedding_model of the Haystack class EmbeddingRetriever.
  • SAGEMAKER_DIMENSIONALITY_REDUCTIONER_ENDPOINT: The SageMaker endpoint of the deployed dimensionality reduction model.
  • SAGEMAKER_GENERATOR_MODEL_ENDPOINT: The SageMaker endpoint of the deployed generator.
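Before starting the application, it can be useful to confirm that every variable listed in the Configuration section is set. The following is a small sketch for that purpose; the function name is hypothetical, and the check only inspects a mapping such as os.environ, so it assumes the contents of the .env file have already been loaded into the environment by some other means.

```python
import os

# Variable names taken from the Configuration section.
REQUIRED_VARIABLES = (
    "AWS_PROFILE_NAME",
    "OPENSEARCH_ENDPOINT",
    "OPENSEARCH_PORT",
    "OPENSEARCH_USERNAME",
    "OPENSEARCH_PASSWORD",
    "HAYSTACK_RETRIEVER_MODEL",
    "SAGEMAKER_DIMENSIONALITY_REDUCTIONER_ENDPOINT",
    "SAGEMAKER_GENERATOR_MODEL_ENDPOINT",
)


def missing_variables(environ):
    """Return the required variable names that are unset or empty in environ."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_variables(os.environ)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Taking the mapping as a parameter keeps the check easy to exercise with a plain dictionary instead of the real environment.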

Usage

To run the application use the following command:

mapintel

The server then starts and listens for connections at http://localhost:8080. Open this URL in a browser to interact with the MapIntel UI.

