A Python library containing document processing functions
Congreso Utils
Description:
A Python package designed to streamline the analysis of legislative documents from the Congreso de los Diputados (Spain). It is intended to help users work with the database at https://doi.org/10.5281/zenodo.11195944, created by the same authors. The database contains 16 JSON files, each holding all the Congress and Senate records of the corresponding term. The files are named after the term they represent (C, I, II, III, ..., XV). With this package you'll be able to:
- Load JSON data effortlessly from Zenodo
- Explore and filter documents using diverse criteria
- Analyze document content with text processing techniques
- Generate informative statistics and visualizations
Term Selection and Data Loading:
Loading Data:
The first step is to load the JSON data with the load_jsons function. Pass a list containing the Roman numerals of the terms you are interested in:
- from congreso import congreso as c  # after installing this library
- terms = ["XV", "XIV"]
- t = c.load_jsons(terms)
Use functions with term input:
fields = c.get_all_fields(t["XV"])
print(fields)
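To make the shape of the data concrete, here is a minimal sketch that does not call the library. It assumes, based on the t["XV"] access above, that load_jsons returns a mapping from term name to a list of document dictionaries. The sample documents, their values, and the helper unique_fields are hypothetical stand-ins for illustration, not the package's own implementation of get_all_fields.

```python
# Hypothetical stand-in for the result of load_jsons(["XV", "XIV"]):
# a dict keyed by term name, each value a list of document dicts.
# Only the field names "encabezado", "texto" and "ndia" appear in this
# README; every value below is made up.
loaded = {
    "XV": [
        {"encabezado": "BOCG", "texto": "Proyecto de ley de presupuestos."},
        {"encabezado": "DS", "texto": "Se abre la sesión.", "ndia": 3},
    ],
    "XIV": [
        {"encabezado": "DS", "texto": "Se levanta la sesión."},
    ],
}


def unique_fields(term: list[dict]) -> list[str]:
    """Collect every key that appears in at least one document of a term."""
    fields = set()
    for doc in term:
        fields.update(doc.keys())
    return sorted(fields)


# Iterate over the loaded terms and report size and available fields.
for name, docs in loaded.items():
    print(name, len(docs), unique_fields(docs))
```

Documents within one term need not share the same keys, which is why the sketch takes the union of keys rather than reading them from the first document.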
Function Usage:
- num_docs_term(term): Retrieves the number of documents for a specific term (e.g., num_docs_term(t["XV"])).
- get_all_fields(term): Returns a list of all unique fields present in the documents for a term.
- get_docs_by_date(term, date): Filters documents for a term based on a specific date (YYYYMMDD format).
- get_documents_interval_dates(term, start_date, end_date): Filters documents for a term within a date range (YYYYMMDD format).
- key_word_search(word, term): Finds documents for a term that contain a particular keyword within the "texto" field.
- count_docs_with_aperance(word, term): Counts the number of documents for a term that contain a specified word within the "texto" field.
- mentions_per_doc(word, term): Calculates the frequency of a phrase (sequence of words) within each document of a term's document list.
- display_field_values(term, field): Analyzes the values of a particular field for a term, returning a DataFrame showing unique values and their corresponding document counts.
- filter_field_by_value(term, field, value): Filters documents for a term based on a specific field and value.
- visualize_ndia(term): Analyzes the "ndia" field to show document counts per day.
- productive_days_percentage(term): Calculates the percentage of days with documents and the total number of documents.
- docs_per_day(term): Calculates the average number of documents produced per day.
- filter_encabezado(term: list[dict]): Filters documents by the "encabezado" field, which takes only two values: "BOCG" and "DS". Useful for focused searches.
- add_texto_length(term: list[dict]): Adds a new field ("texto_length") to each document, containing the length of the text in the "texto" field. Facilitates text analysis based on length.
- docs_filtered_by_lenght(term: list[dict], upper_threshold = 1000000, lower_threshold = 0): Filters documents by the length of the text in the "texto" field. Useful for analyzing shorter or longer documents.
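The field-based and text-based helpers above can be pictured with a short sketch on plain Python data. This is not the package's code: the helper names below are hypothetical, the sample documents are made up, and several behaviors are assumptions (case-sensitive substring matching, inclusive length bounds, and returning copies instead of modifying documents in place). The real functions may differ on any of these points, and display_field_values returns a DataFrame, whereas this sketch uses a Counter.

```python
from collections import Counter

# Made-up documents; only the field names "encabezado" and "texto"
# come from the function list above.
sample_term = [
    {"encabezado": "DS", "texto": "La ley de aguas y la ley de costas."},
    {"encabezado": "BOCG", "texto": "Presupuestos generales."},
    {"encabezado": "DS", "texto": "Se abre la sesión."},
]


def field_value_counts(term: list[dict], field: str) -> Counter:
    """Count how many documents carry each value of `field`."""
    return Counter(doc[field] for doc in term if field in doc)


def docs_with_value(term: list[dict], field: str, value) -> list[dict]:
    """Keep documents whose `field` equals `value`."""
    return [doc for doc in term if doc.get(field) == value]


def docs_containing(word: str, term: list[dict]) -> list[dict]:
    """Keep documents whose "texto" contains `word` (assumed case-sensitive)."""
    return [doc for doc in term if word in doc.get("texto", "")]


def phrase_counts(phrase: str, term: list[dict]) -> list[int]:
    """Non-overlapping occurrences of `phrase` in each document's "texto"."""
    return [doc.get("texto", "").count(phrase) for doc in term]


def with_texto_length(term: list[dict]) -> list[dict]:
    """Return copies of the documents with an added "texto_length" field."""
    return [{**doc, "texto_length": len(doc.get("texto", ""))} for doc in term]


def docs_within_length(term: list[dict], upper_threshold: int = 1_000_000,
                       lower_threshold: int = 0) -> list[dict]:
    """Keep documents whose text length lies in the (assumed inclusive) range."""
    return [doc for doc in with_texto_length(term)
            if lower_threshold <= doc["texto_length"] <= upper_threshold]


print(field_value_counts(sample_term, "encabezado"))
print(len(docs_containing("ley", sample_term)))
print(phrase_counts("ley", sample_term))
print(docs_within_length(sample_term, upper_threshold=30, lower_threshold=20))
```

Because every helper takes and returns a list of dictionaries, the steps chain naturally: for example, filter by "encabezado" first and then apply the length filter to the result.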
License:
This package is distributed under the MIT License (see LICENSE file for details).