Process and profile text datasets interactively

Texture: Structured Text Analytics

Texture is a system for exploring and creating structured insights with your text datasets.

  1. Interactive Attribute Profiles: Texture visualizes structured attributes alongside your text data in interactive, cross-filterable charts.
  2. Flexible attribute definitions: Attribute charts can come from different tables and any level of a document such as words, sentences, or documents.
  3. Embedding-based operations: Texture helps you use vector embeddings to search for similar text and summarize your data.

[Image: screenshot of the Texture interface]

Install and run

Install texture with pip:

pip install texture-viz

Then run Texture from a Python script or notebook by passing it a dataframe that contains your text data and attributes.

import texture
texture.run(df)
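
As a minimal sketch of what such a dataframe might look like, the snippet below builds one from a few hand-written rows. The rows and the column names `review`, `rating`, and `product` are made up for illustration, and it is an assumption that Texture's automatic schema detection handles them as shown; only `texture.run(df)` comes from the instructions above.

```python
# Hypothetical sample rows: one free-text column plus two structured attributes.
rows = [
    {"id": 0, "review": "Battery life is excellent.", "rating": 5, "product": "phone"},
    {"id": 1, "review": "The screen cracked within a week.", "rating": 1, "product": "phone"},
    {"id": 2, "review": "Comfortable fit and clear sound.", "rating": 4, "product": "headphones"},
]


def launch(rows):
    """Build a dataframe from the rows and open it in Texture."""
    # Imported lazily so the sample rows can be inspected without the packages installed.
    import pandas as pd
    import texture

    df = pd.DataFrame(rows)
    texture.run(df)  # no schema is passed, so Texture calculates one automatically
```

Calling `launch(rows)` from a script or notebook cell starts the interface on the sample data.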

Texture Configuration

You can optionally pass arguments to texture.run to configure the interface. The configuration options are:

  • data: pd.DataFrame: The dataframe to parse and visualize.
  • schema: a dataset schema describing the columns, types, and tables (calculated automatically if none provided)
  • load_tables: Dict[str, pd.DataFrame]: A dictionary of tables to load into the schema. The key is the table name and the value is the dataframe.
  • create_new_embedding_func: A function that takes a string and returns a vector embedding (see example below)
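
To illustrate the shape of a `create_new_embedding_func` callable (one string in, one fixed-length vector out), here is a dependency-free stand-in. It is a hashed bag-of-words vector, not a learned embedding, so it is only useful for checking that the wiring works. The name `toy_embedding`, the dimension of 16, and the choice to return a plain Python list are all assumptions; Texture may expect a NumPy array or a particular vector size.

```python
import hashlib
import math

DIM = 16  # hypothetical vector size, chosen only for this sketch


def toy_embedding(text: str) -> list[float]:
    """Map a string to a unit-length hashed bag-of-words vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # Hash each token to a stable bucket so the result is deterministic.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize so vectors are comparable; an empty string stays all zeros.
    return [v / norm for v in vec] if norm else vec
```

Such a function would presumably be passed as `create_new_embedding_func=toy_embedding`; for real similarity search, use an actual embedding model instead.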

The interface treats several column names in the main table as reserved:

  • id: A unique identifier for each row.
  • vector: A column containing embeddings for the text data.
  • umap_x and umap_y: Columns containing 2d projections of the embeddings.
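
Before launching, it can help to see which of these reserved columns a table already has. The helpers below are a small sketch for that check; `reserved_column_report` and `has_unique_ids` are hypothetical names, not part of Texture, and whether Texture requires these columns or computes missing ones itself is not stated here.

```python
RESERVED_COLUMNS = ("id", "vector", "umap_x", "umap_y")


def reserved_column_report(columns):
    """Return a dict mapping each reserved column name to whether it is present."""
    present = set(columns)
    return {name: name in present for name in RESERVED_COLUMNS}


def has_unique_ids(ids):
    """Check that every value in the id column occurs exactly once."""
    ids = list(ids)
    return len(ids) == len(set(ids))
```

With a pandas dataframe, these would be called as `reserved_column_report(df.columns)` and `has_unique_ids(df["id"])`.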

Texture provides preprocessing functions to calculate embeddings, projections, and word tables. You can use them to prepare your data before launching the Texture app.

import pandas as pd
import texture
from texture.models import DatasetSchema, Column, DerivedSchema

P = "https://raw.githubusercontent.com/cmudig/Texture/main/examples/vis_papers/"

df_main = pd.read_parquet(P + "1_main.parquet")
df_words = pd.read_parquet(P + "2_words.parquet")
df_authors = pd.read_parquet(P + "3_authors.parquet")
df_keywords = pd.read_parquet(P + "4_keywords.parquet")

load_tables = {
    "main_table": df_main,
    "words_table": df_words,
    "authors_table": df_authors,
    "keywords_table": df_keywords,
}

# Create schema for the dataset that decides how the data will be visualized
schema = DatasetSchema(
    name="main_table",
    columns=[
        Column(name="Title", type="text"),
        Column(name="Abstract", type="text"),
        Column(
            name="word",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=True,
                table_name="words_table",
                derived_from="Abstract",
                derived_how=None,
            ),
        ),
        Column(
            name="pos",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=True,
                table_name="words_table",
                derived_from="Abstract",
                derived_how=None,
            ),
        ),
        Column(
            name="author",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=False,
                table_name="authors_table",
                derived_from=None,
                derived_how=None,
            ),
        ),
        Column(
            name="keyword",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=False,
                table_name="keywords_table",
                derived_from=None,
                derived_how=None,
            ),
        ),
        Column(name="Year", type="number"),
        Column(name="Conference", type="categorical"),
        Column(name="PaperType", type="categorical"),
        Column(name="CitationCount_CrossRef", type="number"),
        Column(name="Award", type="categorical"),
    ],
    primary_key=Column(name="id", type="number"),
    origin="uploaded",
    has_embeddings=True,
    has_projection=True,
)

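_model = None


def get_embedding(value: str):
    # Load the model once and reuse it; reloading on every call is slow
    global _model
    if _model is None:
        import sentence_transformers

        _model = sentence_transformers.SentenceTransformer("all-mpnet-base-v2")

    return _model.encode(value)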

texture.run(
    schema=schema, load_tables=load_tables, create_new_embedding_func=get_embedding
)

Dev install

See DEV.md for dev workflows and setup.
