NarrativeMapper is a text analysis pipeline that uncovers the dominant narratives and emotional tones within online communities.

NarrativeMapper

Overview

Whether you're coding in Python or simply running a single command in your terminal, NarrativeMapper gives you instant insight into the dominant stories behind the noise.

Ever wonder what stories are dominating Reddit, Twitter, or any corner of the internet? NarrativeMapper clusters similar online discussions and uses OpenAI’s GPT to summarize the dominant narratives, tone, and sentiment. Built for researchers, journalists, analysts, and anyone trying to make sense of the chaos.

Extracts dominant narratives from messy text data
Clusters similar posts using embeddings + UMAP + HDBSCAN
Summarizes each cluster with GPT
Analyzes sentiment per narrative
Plug-and-play pipeline: CLI, class-based, or functional

Models used:

Uses OpenAI Embeddings (OpenAI's text-embedding-3-large)
Dimensionality reduction (UMAP)
Density-based clustering (HDBSCAN)
Topic summary + sentiment extraction (OpenAI's Chat Completions API, model gpt-4o-mini + Hugging Face's distilbert-base-uncased-finetuned-sst-2-english)

Installation and Setup

Installation:

Install via PyPI:

pip install NarrativeMapper

Setup:

Create a .env file in your root directory (same folder where your script runs).
Inside the .env file, add your OpenAI API key like this:

OPENAI_API_KEY=your-api-key-here

Before importing narrative_mapper, make sure to load your .env like this:

from dotenv import load_dotenv
load_dotenv()

from narrative_mapper import *

(Make sure to keep your .env file private and add it to your .gitignore if you're using Git.)
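
A small pre-flight check can catch a missing key before any API call is made. The sketch below uses only the standard library; the helper name `require_openai_key` is hypothetical and is not part of NarrativeMapper.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key from `env`, or raise if it is missing.

    Hypothetical helper for illustration; not part of NarrativeMapper.
    """
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to your .env file "
            "and call load_dotenv() before importing narrative_mapper."
        )
    return key
```

Calling `require_openai_key()` right after `load_dotenv()` should fail fast with a readable message instead of an authentication error later in the pipeline.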

How to Use

Option 1: CLI (zero code)

Run NarrativeMapper directly from the terminal:

narrativemapper path/to/your.csv online_group_name

This will:

Load the CSV
Automatically embed, cluster, and summarize the comments
Output a formatted results file in the current directory (output_summary.txt)
Print the summarized narratives and sentiment to the terminal

File output example from this dataset:

Run Timestamp: 2025-04-08 19:12:45
Online Group Name: reddit_space_subreddit

Summary: The core themes of this cluster revolve around the awe and nostalgia associated with the Voyager missions, the challenges of long-term space exploration, and imaginative reflections on extraterrestrial life and science fiction.
Sentiment: NEUTRAL
Comments: 25
---

Summary: The core theme of this cluster revolves around discussions of space, galaxies, the universe's vastness, and the implications of astronomical phenomena on our understanding of life and existence.
Sentiment: NEGATIVE
Comments: 95
---

Summary: The cluster revolves around personal experiences and emotions related to witnessing solar eclipses and auroras, highlighting the awe, excitement, and challenges of viewing these celestial events.
Sentiment: NEUTRAL
Comments: 63
---

Summary: The core theme of this cluster revolves around admiration and appreciation for a stunning astronomical photograph, with many comments expressing curiosity about the techniques used to capture it and requests for high-resolution versions for personal use.
Sentiment: POSITIVE
Comments: 48
---

Summary: The cluster primarily discusses concerns and criticisms regarding Boeing's safety record, management practices, and the implications for NASA's reliance on Boeing for crewed space missions, juxtaposed with support for SpaceX's advancements in space travel.
Sentiment: NEGATIVE
Comments: 53
---

Summary: The cluster discusses concerns about space debris, the irresponsibility of space agencies, particularly China, and the need for international cooperation and regulation to address the growing problem of litter in Earth's orbit.
Sentiment: NEGATIVE
Comments: 20
---
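
If you want to post-process the results file, it can be read back into Python. The parser below is a minimal sketch that assumes the file follows exactly the layout of the example above (two header lines, then Summary/Sentiment/Comments blocks separated by `---`); the function name `parse_summary_file` is hypothetical, and the summary in the sample is shortened.

```python
def parse_summary_file(text):
    """Parse the text of output_summary.txt into (header, clusters).

    Assumes the layout shown in the example output; not part of NarrativeMapper.
    """
    header = {}
    clusters = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "---":
            # A separator closes the cluster block collected so far.
            if current:
                clusters.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps and summaries stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key in ("Summary", "Sentiment", "Comments"):
            current[key.lower()] = int(value) if key == "Comments" else value
        else:
            header[key] = value
    if current:
        clusters.append(current)
    return header, clusters


sample = """Run Timestamp: 2025-04-08 19:12:45
Online Group Name: reddit_space_subreddit

Summary: Concerns about space debris in Earth's orbit.
Sentiment: NEGATIVE
Comments: 20
---
"""

header, clusters = parse_summary_file(sample)
print(header["Online Group Name"], clusters[0]["sentiment"], clusters[0]["comments"])
```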

Note: Make sure you're running the CLI from the same directory as your .env file (unless you have set OPENAI_API_KEY globally in your environment).

Option 2: Class-Based Interface

from dotenv import load_dotenv
load_dotenv()

from narrative_mapper import *
import pandas as pd

file_df = pd.read_csv("file-path")

#initialize NarrativeMapper object
mapper = NarrativeMapper(file_df, "r/antiwork")

#embeds semantic vectors
mapper.load_embeddings(batch_size=100)

#clustering: main UMAP and HDBSCAN parameters, plus kwargs for further customization.
umap_kwargs = {'min_dist': 0.0}
mapper.cluster(n_components=20, n_neighbors=20, min_cluster_size=40, min_samples=15, umap_kwargs=umap_kwargs)

#summarize each cluster's topic and sentiment
mapper.summarize(max_sample_size=500)

#export in your preferred format
summary_dict = mapper.format_to_dict()
text_df = mapper.format_by_text()
cluster_df = mapper.format_by_cluster()

#saving DataFrames to csv
text_df.to_csv("comments_by_cluster.csv", index=False)
cluster_df.to_csv("cluster_summary.csv", index=False)

Option 3: Functional Interface

from dotenv import load_dotenv
load_dotenv()

from narrative_mapper import *
import pandas as pd

file_df = pd.read_csv("file-path")

#manual control over each step:
embeddings = get_embeddings(file_df, batch_size=100)
cluster_df = cluster_embeddings(embeddings, n_components=20, n_neighbors=20, min_cluster_size=40, min_samples=15)
summary_df = summarize_clusters(cluster_df, max_sample_size=500)

#export/format options
summary_dict = format_to_dict(summary_df, online_group_name="r/antiwork")
text_df = format_by_text(summary_df, online_group_name="r/antiwork")
cluster_df = format_by_cluster(summary_df, online_group_name="r/antiwork")

Output Formats

This example is based on 1800 r/antiwork comments from the top 300 posts within the last year (Date of Writing: 2025-04-03).

The three formatter functions return the following:

format_to_dict() returns a dict with the following format:

format_to_dict output example

{
    "online_group_name": "r/antiwork",
    "clusters": [
        {
            "cluster": 0,
            "cluster_summary": "The core theme of this cluster revolves around the frustrations and challenges of the modern job application and interview process, highlighting issues such as discrimination, exploitative practices, and the disconnect between employers and candidates.",
            "sentiment": "NEGATIVE",
            "text_count": 76
        },
        {
            "cluster": 1,
            "cluster_summary": "The core theme of this cluster revolves around the debate over low wages in the fast food industry, the impact of wage increases on business practices and pricing, and the broader implications for workers' livelihoods and economic conditions.",
            "sentiment": "NEGATIVE",
            "text_count": 100
        },
        {
            "cluster": 2,
            "cluster_summary": "The cluster reflects widespread frustration and despair among younger generations regarding economic instability, unaffordable living costs, inadequate healthcare, and the perceived indifference of older generations towards their struggles.",
            "sentiment": "NEGATIVE",
            "text_count": 112
        },
        {
            "cluster": 3,
            "cluster_summary": "The core theme of this cluster revolves around employee dissatisfaction with management practices, workplace exploitation, and the importance of asserting one's rights and boundaries in a toxic work environment.",
            "sentiment": "NEGATIVE",
            "text_count": 464
        },
        {
            "cluster": 4,
            "cluster_summary": "The core theme of this cluster revolves around dissatisfaction with traditional work structures, advocating for reduced work hours, better work-life balance, and criticism of corporate exploitation and the lack of employee rights.",
            "sentiment": "NEGATIVE",
            "text_count": 95
        },
        {
            "cluster": 5,
            "cluster_summary": "The core theme of this cluster revolves around wealth inequality, criticizing the hoarding of wealth by billionaires and the systemic issues that perpetuate economic disparity and exploitation of the working class.",
            "sentiment": "NEGATIVE",
            "text_count": 95
        },
        {
            "cluster": 6,
            "cluster_summary": "The comments express strong criticism of capitalism, highlighting themes of exploitation, corporate greed, and the detrimental impact of billionaires and CEOs on workers and society.",
            "sentiment": "NEGATIVE",
            "text_count": 89
        }
    ]
}
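
Once you have this dict, ordinary Python is enough to rank clusters or export them. The following is a sketch on an abbreviated stand-in for the output above (summaries shortened, only three clusters); `largest_clusters` is a hypothetical helper. It also assumes the values are plain Python types as in the example; if the real dict contains NumPy integers, they would need converting before `json.dumps`.

```python
import json

# Abbreviated stand-in for the dict returned by format_to_dict().
summary_dict = {
    "online_group_name": "r/antiwork",
    "clusters": [
        {"cluster": 0, "cluster_summary": "Job application frustrations.",
         "sentiment": "NEGATIVE", "text_count": 76},
        {"cluster": 3, "cluster_summary": "Dissatisfaction with management.",
         "sentiment": "NEGATIVE", "text_count": 464},
        {"cluster": 1, "cluster_summary": "Fast food wage debate.",
         "sentiment": "NEGATIVE", "text_count": 100},
    ],
}


def largest_clusters(result, top_n=2):
    """Return the top_n clusters ordered by descending text_count."""
    ranked = sorted(result["clusters"], key=lambda c: c["text_count"], reverse=True)
    return ranked[:top_n]


top = largest_clusters(summary_dict)
total = sum(c["text_count"] for c in summary_dict["clusters"])
for c in top:
    share = c["text_count"] / total
    print(f"cluster {c['cluster']}: {c['text_count']} texts ({share:.0%})")

# Serialize for storage or for passing to another tool.
json_text = json.dumps(summary_dict, indent=2)
```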

format_by_cluster() returns a pandas DataFrame with the following columns:

format_by_cluster columns

online_group_name: online group name
cluster: numeric cluster number
cluster_summary: summary of the cluster
text_count: number of sampled text messages in the cluster
aggregated_sentiment: net sentiment, one of 'NEGATIVE', 'POSITIVE', or 'NEUTRAL'
text: the list of text messages that are part of the cluster
all_sentiments: a list of dicts, one per message, each of the form {'label': 'NEGATIVE', 'score': 0.9896971583366394} (sentiment calculated by distilbert-base-uncased-finetuned-sst-2-english).
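
To see how the per-message labels in all_sentiments are distributed within a cluster, you can tally them yourself. This is only an illustrative sketch: the sample values are made up, `label_shares` is a hypothetical helper, and it is not a description of how NarrativeMapper computes aggregated_sentiment.

```python
from collections import Counter

# Hypothetical all_sentiments value for one cluster; scores are made up.
all_sentiments = [
    {"label": "NEGATIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.87},
    {"label": "POSITIVE", "score": 0.65},
]


def label_shares(sentiments):
    """Return the fraction of messages carrying each sentiment label."""
    counts = Counter(item["label"] for item in sentiments)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {label: count / total for label, count in counts.items()}


shares = label_shares(all_sentiments)
```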

format_by_text() returns a pandas DataFrame with the following columns:

format_by_text columns

online_group_name: online group name
cluster: numeric cluster number
cluster_summary: summary of the cluster
text: the sampled text message (one message per row; all sampled messages are returned)
sentiment: a dict holding the sentiment result, of the form {'label': 'NEGATIVE', 'score': 0.9896971583366394} (sentiment calculated by distilbert-base-uncased-finetuned-sst-2-english).
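
One practical detail when this DataFrame is saved with to_csv: a dict-valued column is written as its string representation, so it comes back as text when the file is re-read. The sketch below restores the dicts with `ast.literal_eval`, using the standard csv module on a made-up two-row excerpt; the helper name `load_sentiments` and the sample rows are assumptions.

```python
import ast
import csv
import io

# Hypothetical excerpt of a saved comments_by_cluster.csv; the text values are made up.
csv_text = '''cluster,text,sentiment
0,"example comment A","{'label': 'NEGATIVE', 'score': 0.9896971583366394}"
0,"example comment B","{'label': 'POSITIVE', 'score': 0.75}"
'''


def load_sentiments(handle):
    """Read rows from a CSV handle and turn the sentiment strings back into dicts."""
    rows = []
    for row in csv.DictReader(handle):
        # literal_eval only evaluates Python literals, so it is safer than eval().
        row["sentiment"] = ast.literal_eval(row["sentiment"])
        rows.append(row)
    return rows


rows = load_sentiments(io.StringIO(csv_text))
confident_negative = [
    r["text"]
    for r in rows
    if r["sentiment"]["label"] == "NEGATIVE" and r["sentiment"]["score"] >= 0.9
]
```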

Pipeline Architecture & API Overview

Pipeline:

CSV Text Data → Embeddings → Clustering → Summarization → Formatting

Functions:

#Converts each message into a 3072-dimensional vector using OpenAI's text-embedding-3-large.
get_embeddings(file_df, batch_size=...)

#Clusters the embeddings using UMAP (for reduction) and HDBSCAN (for density-based clustering).
cluster_embeddings(
    embeddings, 
    n_components=..., 
    n_neighbors=..., 
    min_cluster_size=..., 
    min_samples=..., 
    umap_kwargs=..., 
    hdbscan_kwargs=...
    )

#Uses GPT (via Chat Completions) for cluster summaries and Hugging Face for sentiment analysis.
summarize_clusters(clustered_df, max_sample_size=...)

#Returns structured output as a dictionary (ideal for JSON export).
format_to_dict(summary_df, online_group_name=...)

#Returns a DataFrame where each row summarizes a cluster.
format_by_cluster(summary_df, online_group_name=...)

#Returns a DataFrame where each row is an individual comment with its sentiment and cluster label.
format_by_text(summary_df, online_group_name=...)

NarrativeMapper Class

Instance Attributes:

class NarrativeMapper:
    def __init__(self, df, online_group_name: str):
        self.file_df               # DataFrame of csv file
        self.online_group_name     # Name of the online community or data source
        self.embeddings_df         # DataFrame after embedding
        self.cluster_df            # DataFrame after clustering
        self.summary_df            # DataFrame after summarization

Methods:

load_embeddings(batch_size=...)
cluster(
    n_components=..., 
    n_neighbors=..., 
    min_cluster_size=..., 
    min_samples=..., 
    umap_kwargs=..., 
    hdbscan_kwargs=...
    )
summarize(max_sample_size=...)
format_by_text()
format_by_cluster()
format_to_dict()

Parameter Reference

n_components: The number of dimensions UMAP reduces the embedding vectors to. Lower values simplify the data for clustering.
n_neighbors: Influences UMAP’s balance between local and global structure. Higher values emphasize global relationships.
min_cluster_size: In HDBSCAN, the minimum number of points required to form a cluster. Smaller values allow more granular clusters.
min_samples: A density sensitivity parameter in HDBSCAN. Higher values make clustering more conservative.
umap_kwargs: A dict of additional UMAP parameters.
hdbscan_kwargs: A dict of additional HDBSCAN parameters.
batch_size: Number of messages processed per API request, to avoid token limits. Use smaller values for longer messages.
max_sample_size: Maximum number of comments sampled per cluster for summarization.
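
To make batch_size and max_sample_size concrete, here is a standard-library sketch of the two ideas: splitting messages into request-sized batches, and capping how many comments from a cluster are passed on for summarization. The helper names are hypothetical, and the random sampling strategy is an assumption; NarrativeMapper's internals may differ.

```python
import random


def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def sample_cluster(texts, max_sample_size, seed=0):
    """Return at most `max_sample_size` texts, drawn at random without replacement."""
    if len(texts) <= max_sample_size:
        return list(texts)
    # A seeded generator keeps the sample reproducible between runs.
    return random.Random(seed).sample(texts, max_sample_size)


messages = [f"comment {i}" for i in range(250)]

# 250 messages with batch_size=100 would mean three requests.
batch_lengths = [len(batch) for batch in batched(messages, batch_size=100)]

# Only 50 of the 250 comments would be sent for summarization.
sampled = sample_cluster(messages, max_sample_size=50)
```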

Estimated Cost (OpenAI Pricing)

Estimated cost: $0.13 to $0.28 per 1 million tokens.

Example: A CSV containing 1,000 Reddit comments costs approximately $0.01 to process.

The OpenAI text-embedding-3-large model costs approximately $0.13 per 1 million input tokens, so the embedding cost is determined by the total number of tokens in your input messages.

The Chat Completions model used for summarization (gpt-4o-mini) costs $0.15 per 1 million input tokens. The max_sample_size parameter (see Parameter Reference) limits how many comments are passed to gpt-4o-mini for each cluster, which can significantly reduce Chat Completions token usage.

The gpt-4o-mini input prompt (excluding the text) and output summary are both very short (<100 tokens), so their cost contribution is negligible.
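
The arithmetic behind these figures can be sketched as a small estimator. It uses the two per-token prices quoted above and ignores the negligible prompt and output tokens; the function name `estimate_cost` and the figure of 40 tokens per comment are assumptions made for illustration.

```python
EMBEDDING_USD_PER_MILLION = 0.13  # text-embedding-3-large input price quoted above
CHAT_USD_PER_MILLION = 0.15       # gpt-4o-mini input price quoted above


def estimate_cost(total_tokens, sampled_tokens=None):
    """Estimate the OpenAI cost in USD.

    total_tokens: tokens across all input messages (all of them are embedded).
    sampled_tokens: tokens actually sent for summarization; defaults to
    total_tokens, i.e. max_sample_size does not trim anything.
    """
    if sampled_tokens is None:
        sampled_tokens = total_tokens
    embedding = total_tokens / 1_000_000 * EMBEDDING_USD_PER_MILLION
    chat = sampled_tokens / 1_000_000 * CHAT_USD_PER_MILLION
    return embedding + chat


# Assumption: 1,000 comments averaging 40 tokens each.
cost = estimate_cost(1_000 * 40)
print(f"${cost:.4f}")
```

Under that assumption the estimate comes to roughly one cent, in line with the 1,000-comment example above; with every token also sent for summarization, 1 million tokens reaches the upper bound of $0.28.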

Project details

Release history

0.3.4: Apr 12, 2025
0.3.3: Apr 12, 2025
0.3.2: Apr 12, 2025
0.3.1: Apr 11, 2025
0.3.0: Apr 11, 2025
0.2.9: Apr 11, 2025
0.2.8: Apr 10, 2025
0.2.7: Apr 10, 2025
0.2.6: Apr 10, 2025
0.2.5: Apr 10, 2025
0.2.4: Apr 10, 2025
0.2.3: Apr 10, 2025
0.2.2: Apr 9, 2025 (this version)
0.2.1: Apr 8, 2025
0.2.0: Apr 8, 2025
0.1.9: Apr 7, 2025
0.1.8: Apr 7, 2025
0.1.7: Apr 7, 2025
0.1.6: Apr 6, 2025
0.1.5: Apr 5, 2025
0.1.4: Apr 5, 2025
0.1.3: Apr 4, 2025
0.1.2: Apr 4, 2025
0.1.1: Apr 4, 2025
0.1.0: Apr 4, 2025

Download files

Source Distribution

narrativemapper-0.2.2.tar.gz (17.2 kB)

Uploaded Apr 9, 2025 Source

Built Distribution

narrativemapper-0.2.2-py3-none-any.whl (20.6 kB)

Uploaded Apr 9, 2025 Python 3

File details

Details for the file narrativemapper-0.2.2.tar.gz.

File metadata

Download URL: narrativemapper-0.2.2.tar.gz
Upload date: Apr 9, 2025
Size: 17.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.0

File hashes

Hashes for narrativemapper-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`03c71a118e76c344482b8c4d7fb3b1cd2f32eab424df5a91fbbfc9567658d563`
MD5	`5883cd722fcfb2a22fe761e5fbefd9a2`
BLAKE2b-256	`8d7d3bb78f7d27f7180444d3f427a533d35a365a0bcd963a45a938cc177617de`

File details

Details for the file narrativemapper-0.2.2-py3-none-any.whl.

File metadata

Download URL: narrativemapper-0.2.2-py3-none-any.whl
Upload date: Apr 9, 2025
Size: 20.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.0

File hashes

Hashes for narrativemapper-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`27ad01795dc2feb16be39a98f0cacaedef17fa0a8082236556095b38ac190b68`
MD5	`708b430d61db7a3746369e6213af1e3f`
BLAKE2b-256	`1e9caf97542fde7e56fdadcdd9ecfceee7ed758ffabd44844a787c45fbf4d273`
