Skip to main content

Quote identification, attribution and resolution.

Project description

SaysWho

SaysWho is a Python package for identifying and attributing quotes in text. It uses a combination of logic and grammer to find quotes and their speakers, then uses a coreferencing model to better clarify who is speaking. It's built on Textacy and SpaCy.

Notes

  • Corefencing is an experimental feature not fully integrated into SpaCy, and the current pipeline is built on SpaCy 3.4. I haven't had any problems using it with SpaCy 3.5+, but it takes some finesse to navigate the different versions.

  • SaysWho grew out of a larger project for analyzing newspaper articles from Lexis between ~250 and ~2000 words, and it is optimized to navitage the syntax and common errors particular to that text.

  • The output of this version is kind of open-ended, and possibly not as useful as it could be. HTML viz is coming, but I'm open to any suggestions about how this could be more useful!

Installation

Install and update using pip:

$ pip install sayswho

Install the pre-trained SpaCy coreferencing pipeline.

$ pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl

(Optional) If you want to use the most recent version of SpaCy, you will need to update it here. (see Notes)

$ pip install spacy -U

Download the SpaCy large english model.

$ spacy download en_core_web_lg

A Simple Example

Sample text adapted from here:

Nets Coach Jacque Vaughn was optimistic when discussing Ben Simmons's prospects on NBA TV.

“It’s been great, being able to check in with Ben," Vaughn said, via Nets Daily. “I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward.

"He has an innate ability to impact the basketball game on both ends of the floor. So, we missed that in the Philly series and looking forward to it.”

Simmons arrived in Brooklyn during the 2021-22 season, but did not play that year after a back injury. The 26-year-old would make 42 appearances (33 starts) during a tumult-filled season for Brooklyn.

“He is on the court. No setbacks," Vaughn later told reporters about Simmons' workouts. “We’ll continue to see him improve through the offseason.”

Instantiate SaysWho and run .attribute on target text.

from sayswho import SaysWho

sw = SaysWho(text)

See speaker, cue and content of every quote with .quotes.

print(sw.quotes)
[DQTriple(speaker=[Vaughn], cue=[said], content=“It’s been great, being able to check in with Ben,"),
 DQTriple(speaker=[Vaughn], cue=[said], content=“I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward."),
 DQTriple(speaker=[Vaughn], cue=[told], content=“He is on the court. No setbacks,"),
 DQTriple(speaker=[Vaughn], cue=[told], content=“We’ll continue to see him improve through the offseason.”)]

See resolved entity clusters with .clusters.

print(sw.clusters)
[[Ben Simmons's,
  Ben,
  a healthy Ben Simmons,
  him,
  He,
  Simmons,
  The 26-year-old,
  He,
  Simmons'x,
  him],
 [Nets Coach Jacque Vaughn, Vaughn, I, Vaughn],
 [Nets, The team, our, we],
 [an innate ability to impact the basketball game on both ends of the floor,
  that,
  it],
 [the 2021-22 season, that year],
 [Brooklyn, Brooklyn, We]]

Use .print_clusters() to see unique text in each cluster, easier to read.

sw.print_clusters()
0 {'Ben', 'He', 'The 26-year-old', 'a healthy Ben Simmons', "Simmons'x", "Ben Simmons's", 'Simmons', 'him'}
1 {'I', 'Nets Coach Jacque Vaughn', 'Vaughn'}
2 {'The team', 'our', 'we', 'Nets'}
3 {'it', 'an innate ability to impact the basketball game on both ends of the floor', 'that'}
4 {'that year', 'the 2021-22 season'}
5 {'Brooklyn', 'We'}

Quote/cluster matches are saved to .quote_matches as namedtuples.

for qm in sw.quote_matches:
    print(qm)
QuoteClusterMatch(quote_index=0, cluster_index=1)
QuoteClusterMatch(quote_index=1, cluster_index=1)
QuoteClusterMatch(quote_index=2, cluster_index=1)
QuoteClusterMatch(quote_index=3, cluster_index=1)

Use .expand_match() to view and interpret quote/cluster matches.

sw.expand_match()
QUOTE : 0
 DQTriple(speaker=[Vaughn], cue=[said], content=“It’s been great, being able to check in with Ben,") 

CLUSTER : 1
 ['Nets Coach Jacque Vaughn', 'Vaughn'] 

QUOTE : 1
 DQTriple(speaker=[Vaughn], cue=[said], content=“I look forward to coaching a healthy Ben Simmons. The team is excited to have him healthy, being part of our program and moving forward.") 

CLUSTER : 1
 ['Nets Coach Jacque Vaughn', 'Vaughn'] 

QUOTE : 2
 DQTriple(speaker=[Vaughn], cue=[told], content=“He is on the court. No setbacks,") 

CLUSTER : 1
 ['Nets Coach Jacque Vaughn', 'Vaughn'] 

QUOTE : 3
 DQTriple(speaker=[Vaughn], cue=[told], content=“We’ll continue to see him improve through the offseason.”) 

CLUSTER : 1
 ['Nets Coach Jacque Vaughn', 'Vaughn'] 

Use .render_to_html() to output an HTML file with your text, highlighted quotes, and associated clusters.

sw.render_to_html(article_title="My Article Title")

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sayswho-0.1.4.tar.gz (14.8 kB view hashes)

Uploaded Source

Built Distribution

sayswho-0.1.4-py3-none-any.whl (15.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page