news-deja-vu

Python package for News Déjà Vu

News Déjà Vu is a novel semantic search tool that uses transformer-based large language models and a bi-encoder to find historical news articles that are semantically similar to modern news queries. It first recognizes and masks named entities, so that matching focuses on broader parallels rather than on the specific entities being discussed. A lightweight, contrastively trained bi-encoder then retrieves the historical articles that are most semantically similar to the modern query.
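
To make the masking step concrete, here is a minimal sketch of replacing entity spans with a mask token. It is not the package's implementation: the `mask_entities` helper, the `(start, end, label)` span format, and the `[MASK]` token are assumptions chosen for illustration, and the spans are written by hand rather than produced by an NER model.

```python
from typing import List, Tuple

# Hypothetical span format: (start_offset, end_offset, entity_label).
Span = Tuple[int, int, str]


def mask_entities(text: str, spans: List[Span], mask_token: str = "[MASK]") -> str:
    """Replace each (non-overlapping) entity span in `text` with `mask_token`."""
    pieces = []
    cursor = 0
    for start, end, _label in sorted(spans):
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(mask_token)          # the entity itself is hidden
        cursor = end
    pieces.append(text[cursor:])           # trailing text after the last entity
    return "".join(pieces)


sentence = "Lincoln spoke in Chicago on Tuesday."
spans = [(0, 7, "PER"), (17, 24, "LOC")]  # hand-written spans, for illustration only
masked = mask_entities(sentence, spans)
print(masked)  # [MASK] spoke in [MASK] on Tuesday.
```

After masking, two articles about different people or places can still be embedded close together if the underlying story is similar.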

Example Usage:

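import json

# Assumed import path; adjust to wherever your installed version exposes these helpers
from newsdejavu import download, ner_and_mask, embed, find_nearest_neighbours

ner_model = 'bert-base-NER'
same_story_model = 'dell-research-harvard/same-story'
batch_size = 32  # illustrative value; tune to your hardware
sample_query_sentences = ["A modern news sentence to look up."]  # placeholder query

# Download historic news articles
corpus = download('american stories:1840')
# Perform NER inference and mask the recognized entities
ner_output = ner_and_mask(corpus, ner_model, batch_size=batch_size)
# Embed the masked articles with the bi-encoder
embeddings = embed(ner_output, same_story_model)

# NER inference and masking for the query sentences
query_masked_input = ner_and_mask(sample_query_sentences, ner_model, batch_size=batch_size)
# Embed the masked query sentences
query_embeddings = embed(query_masked_input, same_story_model)

# Search for the closest match in the historical corpus
dist_list, nn_list = find_nearest_neighbours(query_embeddings, embeddings, k=1)

# Output results (assumes `corpus` can be indexed by the returned neighbour indices)
results_dict = {i: {"query": sample_query_sentences[i], "neighbor": corpus[nn_list[i]]} for i in range(len(sample_query_sentences))}
with open('data/test_data/query_results_1840.json', 'w') as f:
    json.dump(results_dict, f, indent=4, default=str)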

Or, in much simpler form:

corpus = download('american stories:1840')
results = search_same_story(sample_query_sentences, corpus, ner_model, same_story_model, k = 1)

The output pairs each query text with its nearest match or matches in the historical corpus.
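
The retrieval step can be pictured as a nearest-neighbour search over embedding vectors. The sketch below assumes cosine similarity as the ranking measure and uses tiny hand-made vectors; the `cosine_similarity` and `nearest_neighbours` helpers are hypothetical and are not the package's `find_nearest_neighbours`, whose internals are not described here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(
    queries: Sequence[Sequence[float]],
    corpus: Sequence[Sequence[float]],
    k: int = 1,
) -> List[List[Tuple[int, float]]]:
    """For each query, return the k corpus indices with the highest similarity."""
    results = []
    for query in queries:
        scored = [(cosine_similarity(query, doc), idx) for idx, doc in enumerate(corpus)]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        results.append([(idx, score) for score, idx in scored[:k]])
    return results


# Toy 2-dimensional "embeddings", for illustration only
corpus_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_embeddings = [[0.9, 0.1], [0.1, 0.8]]

matches = nearest_neighbours(query_embeddings, corpus_embeddings, k=1)
top_indices = [query_matches[0][0] for query_matches in matches]
print(top_indices)  # [0, 1]
```

A brute-force scan like this is fine for small corpora. For a large historical archive, an approximate nearest-neighbour index would normally replace the inner loop.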


