A package to extract historical news sentiments

dataQuest

The code in this repository implements a pipeline to extract specific articles from a large corpus.

Currently, this tool is tailored for the Delpher Kranten corpus, but it can be adapted for other corpora as well.

Articles can be filtered based on individual or multiple features such as title, year, decade, or a set of keywords. To select the most relevant articles, we utilize models such as tf-idf. These models are configurable and extendable.

Getting Started

Clone this repository to your workstation to obtain the examples and Python scripts:

git clone https://github.com/UtrechtUniversity/dataQuest.git

Prerequisites

To install and run this project, you need the following:

- Python [>=3.9, <3.11]

Installation

To run the project, install the dataQuest package that is part of this project:

pip install dataQuest

Built with

These packages are automatically installed in the step above:

Usage

1. Preparation

Data Preparation

Before proceeding, make sure your data is in the expected format: a set of JSON files, each compressed in the .gz format. Each JSON file contains metadata about a newsletter, magazine, etc., as well as a list of article titles and their corresponding bodies. The files may be organized in different folders or sub-folders. Below is a snapshot of the JSON file format:

{
    "newsletter_metadata": {
        "title": "Newspaper title ..",
        "language": "NL",
        "date": "1878-04-29",
        ...
    },
    "articles": {
        "1": {
            "title": "title of article1 ",
            "body": [
                "paragraph 1 ....",
                "paragraph 2...."
            ]
        },
        "2": {
            "title": "title of article2",
            "body": [
                "text..."
            ]
        }
    }
}    
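
To make the format above concrete, here is a minimal sketch of how such files could be read with the Python standard library. It is not part of dataQuest; the function names iter_articles and find_input_files are hypothetical and only illustrate the structure described above.

```python
import gzip
import json
from pathlib import Path


def iter_articles(json_gz_path):
    """Yield (article_id, title, body_text) for each article in one file."""
    # "rt" opens the gzip stream in text mode so json.load can read it.
    with gzip.open(json_gz_path, "rt", encoding="utf-8") as handle:
        document = json.load(handle)
    for article_id, article in document["articles"].items():
        # The body is a list of paragraphs; join them into one string.
        yield article_id, article["title"], "\n".join(article["body"])


def find_input_files(root_dir, pattern="*.gz"):
    """Return all matching files under root_dir, including sub-folders."""
    return sorted(Path(root_dir).rglob(pattern))
```

Because the files may sit in nested folders, the sketch searches recursively rather than listing a single directory.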

In our use case, the harvested KB data is in XML format. We have provided the following script to transform the original data into the expected format.

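from pathlib import Path

from dataQuest.preprocessor.parser import XMLExtractor

# input_dir: folder with the raw XML data
# output_dir: folder for the converted, compressed JSON files
extractor = XMLExtractor(Path(input_dir), Path(output_dir))
extractor.extract_xml_string()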

Navigate to the scripts folder and run:

python3 convert_input_files.py \
   --input_dir path/to/raw/xml/data \
   --output_dir path/to/converted/json/compressed/output

Customize input-file

To add a new corpus to dataQuest:

  • Prepare your input data in the JSON format explained above.
  • Add a new input_file_type to INPUT_FILE_TYPES.
  • Implement a class that inherits from the base class in input_file.py and reads the new data format. For our case study we defined delpher_kranten.py.

2. Filter articles

You can select articles based on a single filter or a combination of filters. Articles can be filtered by title, year, decade, or a set of keywords defined in the config.json file. Logical operators such as AND, OR, and NOT can be used to combine filtering expressions.

In the following example, you select articles that include any of the specified keywords AND were published between 1800 and 1910 AND do not contain advertisements (e.g., "Advertentie").

"filters": [
    {
        "type": "AndFilter",
        "filters": [
            {
                "type": "YearFilter",
                "start_year": 1800,
                "end_year": 1910
            },
            {
                "type": "NotFilter",
                "filter": {
                    "type": "ArticleTitleFilter",
                    "article_title": "Advertentie"
                },
                "level": "article"
            },
            {
                "type": "KeywordsFilter",
                "keywords": ["sustainability", "green"]
            }
        ]
    }
],
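
The way such a nested filter expression combines can be sketched with a small recursive evaluator. This is an illustration only, not dataQuest's implementation: the function matches, the article dictionary keys (year, title, text), the "OrFilter" type name, the inclusive year range, and the case-insensitive substring matching are all assumptions. The "level" key is ignored here.

```python
def matches(filter_spec, article):
    """Return True if the article dict satisfies the filter specification."""
    kind = filter_spec["type"]
    if kind == "AndFilter":
        return all(matches(sub, article) for sub in filter_spec["filters"])
    if kind == "OrFilter":  # assumed name, by analogy with AndFilter
        return any(matches(sub, article) for sub in filter_spec["filters"])
    if kind == "NotFilter":
        return not matches(filter_spec["filter"], article)
    if kind == "YearFilter":
        # Assumption: both ends of the range are inclusive.
        return filter_spec["start_year"] <= article["year"] <= filter_spec["end_year"]
    if kind == "ArticleTitleFilter":
        return filter_spec["article_title"].lower() in article["title"].lower()
    if kind == "KeywordsFilter":
        text = article["text"].lower()
        return any(keyword.lower() in text for keyword in filter_spec["keywords"])
    raise ValueError(f"Unknown filter type: {kind}")


example_filter = {
    "type": "AndFilter",
    "filters": [
        {"type": "YearFilter", "start_year": 1800, "end_year": 1910},
        {
            "type": "NotFilter",
            "filter": {"type": "ArticleTitleFilter", "article_title": "Advertentie"},
        },
        {"type": "KeywordsFilter", "keywords": ["sustainability", "green"]},
    ],
}
```

With this sketch, an 1878 article titled "Advertentie" is rejected by the NotFilter even if its text contains one of the keywords.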

To select the most relevant articles:

  1. Articles are selected based on the filters in the config file.

  2. The selected articles are categorized by a specified period type, such as year or decade. This categorization is needed for the subsequent steps, especially when tf-idf or other models are applied to specific periods.

  3. The most relevant articles for the specified topic (defined by the provided keywords) are selected:

    3.1. Select articles that contain any of the specified keywords in their title.

    3.2. Apply TF-IDF (the default model), which can be extended to other models.
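
To illustrate the idea behind step 3.2, the sketch below scores each article by the cosine similarity between its TF-IDF vector and the TF-IDF vector of the keyword list. It is a simplified, standard-library version written for this explanation; the tokenizer, the smoothed IDF formula, and the function names are assumptions, and dataQuest's own model may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_scores(articles, keywords):
    """Return one cosine similarity per article against the keyword list."""
    docs = [Counter(tokenize(text)) for text in articles]
    n_docs = len(docs)
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    def weigh(counts):
        # Terms that never occur in the corpus get no weight.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = weigh(Counter(tokenize(" ".join(keywords))))
    return [cosine(weigh(doc), query) for doc in docs]
```

Articles that share no term with the keyword list receive a score of 0, so they fall to the bottom of any ranking built on these scores.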

python3 scripts/filter_articles.py \
    --input-dir "path/to/converted/json/compressed/" \
    --output-dir "output/" \
    --input-type "delpher_kranten" \
    --glob "*.gz" \
    --period-type "decade"

In our case:

  • The input data consists of compressed JSON files with the .gz extension.
  • The input type is "delpher_kranten".
  • Selected articles are categorized by decade.
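
As a rough sketch of what categorizing by period means, the helper below maps a date string in the format of the metadata example ("1878-04-29") to its year or decade and groups articles accordingly. The function names and the "date" key on each article are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def period_of(date, period_type="decade"):
    """Map an ISO date string such as "1878-04-29" to its year or decade."""
    year = int(date[:4])
    if period_type == "year":
        return year
    if period_type == "decade":
        # 1878 -> 1870, 1900 -> 1900
        return year - year % 10
    raise ValueError(f"Unknown period type: {period_type}")


def group_by_period(articles, period_type="decade"):
    """Group article dicts (each with a "date" key) by period."""
    groups = defaultdict(list)
    for article in articles:
        groups[period_of(article["date"], period_type)].append(article)
    return dict(groups)
```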

Output

The output consists of a .csv file for each period, such as one file per decade. Each file contains the file_path and article_id of the filtered articles, along with an additional column, selected, which indicates the articles labeled as the most relevant by the model (e.g., TF-IDF).
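
A per-period output file could be inspected along these lines. The column names file_path, article_id, and selected come from the description above; how the selected column encodes its flag is not specified here, so the set of accepted values in the sketch is an assumption, as is the function name.

```python
import csv


def read_selected(csv_path, truthy=("1", "true", "yes")):
    """Return (file_path, article_id) pairs for rows flagged as selected."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [
            (row["file_path"], row["article_id"])
            for row in csv.DictReader(handle)
            # Assumption: the flag is stored as one of the truthy strings.
            if row["selected"].strip().lower() in truthy
        ]
```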

There are different strategies for selecting the final articles. You should specify one of the following criteria in config.py:

  • Percentage: Select a percentage of articles with the highest scores.

  • Maximum Number: Specify the maximum number of articles to select based on their scores.

  • Threshold: Set a threshold for the cosine similarity value between the embeddings of the keyword list and each article.

"article_selector":
  {
    "type": "percentage",
    "value": "30"
  },

OR

"article_selector":
  {
    "type": "threshold",
    "value": "0.02"
  },

OR

"article_selector":
  {
    "type": "num_articles",
    "value": "200"
  },
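
The three strategies can be sketched as a single function that takes a list of scores and one article_selector entry. This is an illustrative reading of the options, not dataQuest's code: the function name, rounding the percentage up, and treating the threshold as inclusive are assumptions.

```python
import math


def select_articles(scores, selector):
    """Return the indices of the selected articles, best score first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kind = selector["type"]
    value = float(selector["value"])  # values are strings in the config
    if kind == "percentage":
        # Assumption: round up so a non-empty input selects at least one article.
        keep = math.ceil(len(ranked) * value / 100)
        return ranked[:keep]
    if kind == "num_articles":
        return ranked[: int(value)]
    if kind == "threshold":
        # Assumption: scores equal to the threshold are kept.
        return [i for i in ranked if scores[i] >= value]
    raise ValueError(f"Unknown selector type: {kind}")
```

Note that "num_articles" is an upper bound: if fewer articles are available than requested, all of them are returned.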

3. Generate output

As the final step of the pipeline, the text of the selected articles is saved in a .csv file, which can be used for manual labeling. The user can choose the unit of the output text: individual paragraphs, the full text, or segments of a fixed number of sentences. This option is set in config.py.

"output_unit": "paragraph"

OR

"output_unit": "full_text"

OR

"output_unit": "segmented_text",
"sentences_per_segment": 10

python3 scripts/generate_output.py \
    --input-dir "output/output_timestamped/" \
    --output-dir "output/output_results/" \
    --glob "*.csv"
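
To show what the "segmented_text" option with "sentences_per_segment" amounts to, here is a minimal sketch that splits a text into segments of a fixed number of sentences. The sentence splitter is deliberately naive (it breaks after ".", "!" or "?" followed by whitespace) and the function name is hypothetical; dataQuest may segment text differently.

```python
import re


def segment_text(text, sentences_per_segment=10):
    """Split text into segments of at most sentences_per_segment sentences."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[start:start + sentences_per_segment])
        for start in range(0, len(sentences), sentences_per_segment)
    ]
```

The last segment may contain fewer sentences than the others when the sentence count is not a multiple of the segment size.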

About the Project

Date: February 2024

Researcher(s):

Pim Huijnen (p.huijnen@uu.nl)

Research Software Engineer(s):

License

The code in this project is released under the MIT license.

Contributing

Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

To contribute:

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contact

Pim Huijnen - p.huijnen@uu.nl

Project Link: https://github.com/UtrechtUniversity/dataQuest
