ContentCoder
A generalized implementation of a dictionary-based content coder.
ContentCoder is a Python-based text analysis tool that enables users to process and analyze text using custom linguistic dictionaries. It is inspired by tools like LIWC (Linguistic Inquiry and Word Count) and provides robust methods for tokenization, text analysis, and frequency calculations.
🔥 Features
- Custom Dictionary-Based Analysis
- Support for LIWC-style dictionaries (2007 & 2022 formats)
- Efficient text tokenization
- Wildcard and abbreviation handling
- Punctuation and big word analysis
- Dictionary export in multiple formats (JSON, CSV, Poster format, etc.)
- High-performance wildcard matching with memory optimization
🚀 Installation
Make sure you have Python 3.9+ installed. Clone this repository and install dependencies:
git clone https://github.com/your-repo/ContentCoder.git
cd ContentCoder
pip install -r requirements.txt
📁 Folder Structure
src/contentcoder/
│── __init__.py
│── ContentCoder.py
│── ContentCodingDictionary.py
│── happiestfuntokenizing.py
│── create_export_dir.py
📌 Quick Start
1. Import the ContentCoder class
from contentcoder.ContentCoder import ContentCoder
2. Initialize the Analyzer
cc = ContentCoder(dicFilename='path/to/dictionary.dic', fileEncoding='utf-8-sig')
3. Analyze a Text Sample
text = "Libraries are crucial to our society."
results = cc.Analyze(text, relativeFreq=True, dropPunct=True, retainCaptures=True, returnTokens=False, wildcardMem=True)
print(results)
Expected output:
{
    "WC": 6,
    "Dic": 4.5,
    "BigWords": 2.0,
    "Numbers": 0.0,
    "AllPunct": 0.0,
    "Period": 0.0,
    "Comma": 0.0,
    "QMark": 0.0,
    "Exclam": 0.0,
    "Apostro": 0.0,
    "Libraries": 1.0,
    "crucial": 1.0,
    "society": 1.0
}
📖 Main Functions & Usage
1️⃣ Analyze(text, **options)
Analyzes a given text and returns a dictionary of results.
Parameters:
- inputText (str): The text to analyze.
- relativeFreq (bool): If True, returns relative frequencies; otherwise, raw frequencies.
- dropPunct (bool): If True, punctuation is removed before processing.
- retainCaptures (bool): If True, captures and stores wildcard-matched words.
- returnTokens (bool): If True, returns the tokenized text.
- wildcardMem (bool): If True, speeds up wildcard processing by storing past matches.
Example Usage:
result = cc.Analyze("Hello world! This is a test sentence.", returnTokens=True, relativeFreq=True)
2️⃣ GetResultsHeader()
Returns a list of all available output categories.
Example Usage:
print(cc.GetResultsHeader())
Expected output:
["WC", "Dic", "BigWords", "Numbers", "AllPunct", "Period", "Comma", "QMark", "Exclam", "Apostro"]
3️⃣ GetResultsArray(resultsDICT, rounding=4)
Formats the results of Analyze() into a CSV-friendly list.
Example Usage:
text = "The government plays an important role."
result = cc.Analyze(text)
csv_row = cc.GetResultsArray(result)
print(csv_row)
Expected output:
[6, 4.3, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
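Together, GetResultsHeader() and GetResultsArray() make it straightforward to build a results spreadsheet. Below is a minimal, self-contained sketch of the CSV-writing side; the header and row values shown are placeholders standing in for what the analyzer would return:

```python
import csv

def write_results_csv(path, header, rows):
    # Write one header row, then one row of scores per analyzed text.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["text_id"] + header)
        for text_id, row in rows:
            writer.writerow([text_id] + row)

# With ContentCoder, the header and rows would come from the analyzer, e.g.:
#   header = cc.GetResultsHeader()
#   rows = [(i, cc.GetResultsArray(cc.Analyze(t))) for i, t in enumerate(texts)]
write_results_csv("scores.csv", ["WC", "Dic"], [(0, [6, 4.3]), (1, [8, 12.5])])
```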
4️⃣ ExportCaptures(filename, fileEncoding='utf-8-sig', wildcardsOnly=False, fullset=True)
Exports wildcard-captured words and their frequencies to a CSV file.
Example Usage:
cc.ExportCaptures("captured_words.csv")
5️⃣ ExportDict2007Format(dicOutFilename, fileEncoding, separateDicts=False, separateDictsFolder=None)
Exports the loaded dictionary in LIWC-2007 format.
Example Usage:
cc.dict.ExportDict2007Format("dictionary_2007.dic")
6️⃣ ExportDict2022Format(dicOutFilename, fileEncoding, **options)
Exports the loaded dictionary in LIWC-22 format.
Example Usage:
cc.dict.ExportDict2022Format("dictionary_2022.dicx")
7️⃣ ExportDictJSON(filename, fileEncoding, indent=4)
Exports the dictionary mapping to a JSON file.
Example Usage:
cc.dict.ExportDictJSON("dictionary.json")
8️⃣ UpdateCategories(dicTerm, newCategories)
Updates the categories associated with a dictionary term.
Example Usage:
cc.dict.UpdateCategories(dicTerm="happiness", newCategories={"positive_emotion": 1.0, "joy": 0.5})
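Conceptually, an update merges the new category weights into whatever the term already has. The plain-Python sketch below illustrates that merge semantics on a bare dictionary; it is an illustration of the idea, not ContentCoder's actual internals:

```python
def update_categories(lexicon, term, new_categories):
    # lexicon maps term -> {category: weight}; merge in the new weights,
    # overwriting any existing weight for the same category.
    lexicon.setdefault(term, {}).update(new_categories)
    return lexicon

lex = {"happiness": {"positive_emotion": 1.0}}
update_categories(lex, "happiness", {"joy": 0.5})
# lex["happiness"] now holds both category weights
```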
🔄 Example: Processing a Large CSV File with tqdm
This script reads a large CSV file and processes each text in the "body" column.
import csv
from tqdm import tqdm
from contentcoder.ContentCoder import ContentCoder
cc = ContentCoder(dicFilename='dictionary.dic', fileEncoding='utf-8-sig')
# Count data rows first so tqdm can display a progress bar with a total.
with open("Comments.csv", "r", encoding="utf-8-sig") as f:
    total_lines = sum(1 for _ in f) - 1  # subtract the header row

with open("Comments.csv", "r", encoding="utf-8-sig") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in tqdm(reader, total=total_lines, desc="Processing", unit=" comments"):
        text = row["body"]
        result = cc.Analyze(text)
⚡ Performance Optimizations
- Uses wildcard caching to speed up regex evaluations.
- Tokenization is optimized for handling social media text.
- Processes large datasets efficiently using streaming CSV reads.
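The wildcard-caching idea can be sketched in a few lines: compile each dictionary pattern once, and remember per-token results so repeated tokens never touch the regex engine again. This is a self-contained illustration of the technique, not ContentCoder's actual implementation:

```python
import fnmatch
import re

class WildcardCache:
    """Sketch of wildcard caching: compile each LIWC-style pattern
    (e.g. "hope*") into a regex once, and memoize matches per token."""

    def __init__(self, patterns):
        # fnmatch.translate turns a shell-style wildcard into an anchored regex.
        self._regexes = {p: re.compile(fnmatch.translate(p)) for p in patterns}
        self._seen = {}  # token -> list of patterns that match it

    def matches(self, token):
        if token not in self._seen:
            self._seen[token] = [p for p, rx in self._regexes.items()
                                 if rx.match(token)]
        return self._seen[token]

cache = WildcardCache(["hope*", "sad"])
cache.matches("hopeful")  # ["hope*"] -- computed via regex once
cache.matches("hopeful")  # served from the cache on every later call
```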
📜 Dictionary Formats Supported
- LIWC-2007 (.dic)
- LIWC-22 (.dicx, .csv)
- JSON Exports
- Custom Hierarchical Category Mapping
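For reference, a LIWC-2007 .dic file pairs a category header block (delimited by `%` lines) with tab-separated word entries, where trailing `*` marks a wildcard. The category names and words below are invented for illustration:

```
%
1	posemo
2	negemo
%
happy	1
hope*	1
sad	2
```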
🤝 Contributing
Pull requests are welcome! If you find bugs or have feature requests, open an issue.
📄 License
MIT License © 2021
📝 Acknowledgments
Developed by Ryan L. Boyd, Ph.D.
For academic and research purposes. Or, you know, whatever.
File details
Details for the file contentcoder-1.0.1.tar.gz.
File metadata
- Download URL: contentcoder-1.0.1.tar.gz
- Upload date:
- Size: 23.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7b8a5869c45045fadbfc4a148ad86daf3f1758db6e1b6cda3f25803e4ec694a3` |
| MD5 | `bcf2054e02c33914deda58fa819052d2` |
| BLAKE2b-256 | `c453791a74847647948778a949b44f68d93d10a4b54308444c1203a1ecbac7be` |
File details
Details for the file contentcoder-1.0.1-py3-none-any.whl.
File metadata
- Download URL: contentcoder-1.0.1-py3-none-any.whl
- Upload date:
- Size: 22.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `de6b5446fc849d1d9d8840840b72d4aa99317b8326fd2977af0b16fc942fc2d1` |
| MD5 | `b893937197bbd75e848d8358b950414d` |
| BLAKE2b-256 | `5e2b6e79affa5231162b34a66642b225b0f3e1b423ea347d18804b86235f37af` |