Process and profile text datasets interactively
Texture: Structured Text Analytics
Texture is a system for exploring and creating structured insights with your text datasets.
- Interactive Attribute Profiles: Texture visualizes structured attributes alongside your text data in interactive, cross-filterable charts.
- Flexible attribute definitions: Attribute charts can draw on different tables and describe any level of a document, such as words, sentences, or whole documents.
- Embedding-based operations: Texture helps you use vector embeddings to search for similar text and summarize your data.
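To make the idea of embedding-based search concrete, here is a minimal sketch of ranking documents by cosine similarity to a query vector. It is not Texture's implementation; the function names `cosine_similarity` and `most_similar` and the toy two-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents whose vectors are closest to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (hypothetical); real embeddings have hundreds of dimensions.
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
nearest = most_similar([1.0, 0.0], doc_vecs, k=2)
```

In practice the vectors would come from an embedding model rather than being written by hand.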
Install and run
Install texture with pip:
pip install texture-viz
Then run it from a Python script or notebook by passing a dataframe that contains your text data and attributes.
import texture
texture.run(df)
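As a rough illustration of what such a dataframe might look like, the sketch below builds a few rows with one text column and two attribute columns before handing them to `texture.run`. The column names `text`, `rating`, and `category`, the sample rows, and the `launch` helper are all hypothetical; only `texture.run(df)` comes from the instructions above.

```python
# Illustrative rows: one free-text column plus structured attributes.
records = [
    {"id": 0, "text": "The battery lasts all day.", "rating": 5, "category": "praise"},
    {"id": 1, "text": "Screen cracked after a week.", "rating": 1, "category": "complaint"},
    {"id": 2, "text": "Does what it says, nothing more.", "rating": 3, "category": "neutral"},
]


def launch(rows):
    """Build a dataframe from the rows and open it in Texture."""
    # Imported here so the snippet can be loaded even before the packages are installed.
    import pandas as pd
    import texture

    df = pd.DataFrame(rows)
    texture.run(df)
```

Calling `launch(records)` in a script or notebook would then start the interface for this small dataset.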
Texture Configuration
You can optionally pass arguments to the run command to configure the interface. Configuration options are:
- data: pd.DataFrame: The dataframe to parse and visualize.
- schema: A dataset schema describing the columns, types, and tables (calculated automatically if none is provided).
- load_tables: Dict[str, pd.DataFrame]: A dictionary of tables to load into the schema. The key is the table name and the value is the dataframe.
- create_new_embedding_func: A function that takes a string and returns a vector embedding (see example below).
There are several reserved column names in the main table that are used in the interface:
- id: A unique identifier for each row.
- vector: A column containing embeddings for the text data.
- umap_x and umap_y: Columns containing 2D projections of the embeddings.
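Before launching, it can be useful to see which of these reserved names a table already carries. The helper below is a small sketch for that purpose; `reserved_column_report` is a hypothetical name and is not part of the Texture API.

```python
RESERVED_COLUMNS = ("id", "vector", "umap_x", "umap_y")


def reserved_column_report(columns):
    """Split the reserved column names into those present in `columns` and those missing."""
    present = [name for name in RESERVED_COLUMNS if name in columns]
    missing = [name for name in RESERVED_COLUMNS if name not in columns]
    return {"present": present, "missing": missing}


# Works with any collection of column names, for example df.columns of a pandas dataframe.
report = reserved_column_report(["id", "Title", "Abstract", "vector"])
```

Here `report["missing"]` lists the projection columns, which signals that the table has embeddings but no 2D projection yet.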
We provide various preprocessing functions to calculate embeddings, projections, and word tables. You can use these functions to preprocess your data before launching the Texture app.
import pandas as pd
import texture
from texture.models import DatasetSchema, Column, DerivedSchema
P = "https://raw.githubusercontent.com/cmudig/Texture/main/examples/vis_papers/"
df_main = pd.read_parquet(P + "1_main.parquet")
df_words = pd.read_parquet(P + "2_words.parquet")
df_authors = pd.read_parquet(P + "3_authors.parquet")
df_keywords = pd.read_parquet(P + "4_keywords.parquet")
load_tables = {
    "main_table": df_main,
    "words_table": df_words,
    "authors_table": df_authors,
    "keywords_table": df_keywords,
}
# Create schema for the dataset that decides how the data will be visualized
schema = DatasetSchema(
    name="main_table",
    columns=[
        Column(name="Title", type="text"),
        Column(name="Abstract", type="text"),
        Column(
            name="word",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=True,
                table_name="words_table",
                derived_from="Abstract",
                derived_how=None,
            ),
        ),
        Column(
            name="pos",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=True,
                table_name="words_table",
                derived_from="Abstract",
                derived_how=None,
            ),
        ),
        Column(
            name="author",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=False,
                table_name="authors_table",
                derived_from=None,
                derived_how=None,
            ),
        ),
        Column(
            name="keyword",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=False,
                table_name="keywords_table",
                derived_from=None,
                derived_how=None,
            ),
        ),
        Column(name="Year", type="number"),
        Column(name="Conference", type="categorical"),
        Column(name="PaperType", type="categorical"),
        Column(name="CitationCount_CrossRef", type="number"),
        Column(name="Award", type="categorical"),
    ],
    primary_key=Column(name="id", type="number"),
    origin="uploaded",
    has_embeddings=True,
    has_projection=True,
)
def get_embedding(value: str):
    import sentence_transformers

    model = sentence_transformers.SentenceTransformer("all-mpnet-base-v2")
    e = model.encode(value)
    return e
texture.run(
    schema=schema, load_tables=load_tables, create_new_embedding_func=get_embedding
)
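An embedding function written this way constructs the model every time it is called, which can be slow if new embeddings are requested often. One possible variation, sketched below, loads the model once and reuses it. The helpers `make_embedding_func` and `load_sentence_transformer` are hypothetical names introduced for this sketch; the only assumption about Texture is the one stated above, that the function takes a string and returns a vector.

```python
from functools import lru_cache


def make_embedding_func(load_model):
    """Wrap a model loader so the model is built on first use and reused afterwards."""

    @lru_cache(maxsize=1)
    def cached_model():
        return load_model()

    def get_embedding(value: str):
        # The model object is expected to expose an encode(text) method.
        return cached_model().encode(value)

    return get_embedding


def load_sentence_transformer():
    # Imported lazily: sentence-transformers is only needed once an embedding is requested.
    import sentence_transformers

    return sentence_transformers.SentenceTransformer("all-mpnet-base-v2")


get_embedding = make_embedding_func(load_sentence_transformer)
```

The resulting `get_embedding` can be passed as `create_new_embedding_func` in the same way as the plain function.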
Dev install
See DEV.md for dev workflows and setup.