
ContentCoder

A generalized implementation of a dictionary-based content coder.

ContentCoder is a Python-based text analysis tool that enables users to process and analyze text using custom linguistic dictionaries. It is inspired by tools like LIWC (Linguistic Inquiry and Word Count) and provides robust methods for tokenization, text analysis, and frequency calculations.

Note: Approximately 98% of this README was generated by ChatGPT — it may not be entirely accurate, but at a quick glance, it looks pretty spot-on.

Features

  • Custom Dictionary-Based Analysis
  • Support for LIWC-style dictionaries (2007 & 2022 formats)
  • Efficient text tokenization
  • Wildcard and abbreviation handling
  • Punctuation and big word analysis
  • Dictionary export in multiple formats (JSON, CSV, Poster format, etc.)
  • High-performance wildcard matching with memory optimization

Installation

Ensure you have Python 3.9+ installed. ContentCoder is pure Python and has no third-party dependencies.

pip install contentcoder

Folder Structure

src/contentcoder/
├── __init__.py
├── ContentCoder.py
├── ContentCodingDictionary.py
├── happiestfuntokenizing.py
└── create_export_dir.py

Quick Start

1. Import the ContentCoder class

from contentcoder.ContentCoder import ContentCoder

2. Initialize the Analyzer

cc = ContentCoder(dicFilename='path/to/dictionary.dic', fileEncoding='utf-8-sig')

3. Analyze a Text Sample

text = "An abrupt sound startled him. Off to the right he heard it, and his ears, expert in such matters, could not be mistaken. Again he heard the sound, and again. Somewhere, off in the blackness, someone had fired a gun three times."
results = cc.Analyze(text, relativeFreq=True, dropPunct=True, retainCaptures=False, returnTokens=True, wildcardMem=True)
print(results)

Expected output:

{
  "WC": 23,
  "Dic": 5.4,
  "BigWords": 6.0,
  "Numbers": 3.0,
  "AllPunct": 0.0,
  "Period": 3.0,
  "Comma": 0.0,
  "QMark": 0.0,
  "Exclam": 0.0,
  "Apostro": 0.0
}
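If you don't have a LIWC dictionary on hand, you can write a toy .dic file to experiment with. This is a sketch: the category names and words are made up, and the layout is assumed to follow the standard LIWC-2007 convention (a header block between "%" lines mapping category numbers to names, then one word or wildcard pattern per line with its category numbers).

```python
# Toy dictionary in the assumed LIWC-2007 .dic layout.
# Header between "%" lines: category number <TAB> category name.
# Body: word (wildcards allowed) <TAB> category number(s).
dic_text = """%
1\tposemo
2\tnegemo
%
happ*\t1
joy\t1
sad\t2
angr*\t2
"""

with open("toy.dic", "w", encoding="utf-8-sig") as f:
    f.write(dic_text)
```

You could then initialize the analyzer with `ContentCoder(dicFilename='toy.dic', fileEncoding='utf-8-sig')`.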

Main Functions & Usage

1. Analyze(text, **options)

Analyzes a given text and returns a dictionary of results.

Parameters:

  • text (str): The text to analyze.
  • relativeFreq (bool): If True, returns relative frequencies. Otherwise, raw frequencies.
  • dropPunct (bool): If True, punctuation is removed before processing.
  • retainCaptures (bool): If True, captures and stores wildcard-matched words.
  • returnTokens (bool): If True, returns tokenized text.
  • wildcardMem (bool): If True, speeds up wildcard processing by storing past matches.

Example Usage:

result = cc.Analyze("Hello world! This is a test sentence.", returnTokens=True)

2. GetResultsHeader()

Returns a list of all available output categories.

Example Usage:

print(cc.GetResultsHeader())

Expected output:

["WC", "Dic", "BigWords", "Numbers", "AllPunct", "Period", "Comma", "QMark", "Exclam", "Apostro"]

3. GetResultsArray(resultsDICT, rounding=4)

Formats the results of Analyze() into a CSV-friendly list.

Example Usage:

text = "The government plays an important role."
result = cc.Analyze(text)
csv_row = cc.GetResultsArray(result)
print(csv_row)

Expected output:

[6, 4.3, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
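Since GetResultsArray() returns values in the same order as GetResultsHeader(), the two can be zipped into a labelled record. The header and row literals below are taken from the examples in this README, so the snippet runs stand-alone:

```python
# Pair the category names from GetResultsHeader() with one row of
# values from GetResultsArray() to get a labelled record.
header = ["WC", "Dic", "BigWords", "Numbers", "AllPunct",
          "Period", "Comma", "QMark", "Exclam", "Apostro"]
row = [6, 4.3, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

record = dict(zip(header, row))
print(record["Dic"])  # 4.3
```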

4. ExportCaptures(filename, fileEncoding='utf-8-sig', wildcardsOnly=False, fullset=True)

Exports wildcard-captured words and their frequencies to a CSV file.

Example Usage:

cc.ExportCaptures("captured_words.csv")

5. ExportDict2022Format(dicOutFilename, fileEncoding, **options)

Exports the loaded dictionary in LIWC-22 format.

Example Usage:

cc.dict.ExportDict2022Format("dictionary_2022.dicx")

6. UpdateCategories(dicTerm, newCategories)

Updates the categories associated with a dictionary term.

Example Usage:

cc.dict.UpdateCategories(dicTerm="happiness", newCategories={"positive_emotion": 1.0, "joy": 0.5})

Example: Processing a Large CSV File with tqdm

This script reads a large CSV file and scores the text in each row's "comment_text" column, writing one output row per comment.

import csv
from tqdm import tqdm
from contentcoder.ContentCoder import ContentCoder

cc = ContentCoder(dicFilename='dictionary.dic', fileEncoding='utf-8-sig')

with open("Comments.csv", "r", encoding="utf-8-sig") as csvfile, \
     open("Output.csv", "w", encoding="utf-8-sig", newline="") as csvfile_out:

    reader = csv.DictReader(csvfile)
    writer = csv.writer(csvfile_out)
    writer.writerow(["id"] + cc.GetResultsHeader())

    for row in tqdm(reader, desc="Processing", unit=" comments"):
        row_id = row["id"]
        text = row["comment_text"]
        result = cc.Analyze(text)
        csv_row = cc.GetResultsArray(result)
        writer.writerow([row_id] + csv_row)

print("Finished!")
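Real-world CSVs often contain missing or blank text cells, which would make `row["comment_text"]` raise or feed empty strings to Analyze(). A small guard helps; this is a hypothetical helper, not part of ContentCoder:

```python
# Hypothetical helper (not part of ContentCoder): return a row's text,
# or None when the cell is missing or blank, so the caller can skip it.
def safe_text(row, column="comment_text"):
    value = (row.get(column) or "").strip()
    return value or None

print(safe_text({"comment_text": "  hello  "}))  # hello
print(safe_text({"comment_text": ""}))           # None
```

In the loop above, you could write `text = safe_text(row)` and `continue` whenever it returns None.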

License

MIT License © 2021
