A python module that extract information (language, profession...) from X[ex-Twitter] profile bios
Project description
Twitter Profile Predictor
Description
This package is designed to predict the characteristics of a Twitter profile based on bios.
Predicted characteristics :
- language (language model detection)
- professions
- declared statuses
- gender
- age
- topics
Methodology
Language is identified using language models.
All the other characteristics are estimated by identifying keywords and bi-words in the bios.
Installation
To install this package run the command:
pip install --upgrade twitter-profile-predictor
WARNING
This package use cld2-cffi, you may have problem installing cld2-cffi while installing twitter-profile-predictor. If it's the case please consider installing cld2-cffi in your environment before and then install twitter-profile-predictor.
Example of use
import twitter_profile_predictor as tpp
# Load the module
extractor = tpp.bios_analyzer("je suis president du monde et journaliste chez sciencespo")
# Tokenization
print('tokenization')
extractor.full_tokenize()
# The tokens are saved in the attribute full_tokens
print(extractor.full_tokens)
# Profession identification
print("profession extraction")
# The professions are saved in the attribute professions
print(extractor.get_professions())
You can find a tutorial with examples here.
Documentation
Our methodology is based on the analyisis of X (ex-Twitter) bios (also called description). Around 48% of all X users have a bios accessible on their profile[@culottaPredictingTwitterUser2016]. Our method have been developped on a specific subset of 30.000 users followers of french members of parliament. For this reason we recommended it for french datasets
Language
The first analysis we made was to estimate the language of each user.
Giving the difficulty of the task on very small texts we use 3 different
available python language detector :
langid.classify,
langdetect.detect and
cld3.get_language, and we then get
the right language by majority vote or return False if no language
matches with another.
Language in our dataset
| Language | Number of users |
|---|---|
| French | 17956 |
| English | 2449 |
| Spanish | 132 |
| Italian | 71 |
| German | 62 |
| Arabic | 43 |
| Portuguese | 28 |
| Catalan | 24 |
| Others | 181 |
| Unrecognized | 1141 |
Table: Main languages used in the description of users on Twitter
We can see in the table above the main languages used by the users. The great majority (92.4%) of the descriptions are in French or English. Some users are non-French speakers. These users are here because they participate actively in the French political debate, following French MPs and being connected to the rest of the digital public sphere.
Code
You can call language using :
import twitter_profile_predictor as tpp
bios = "example of twitter bios"
# direct function
tpp.lang(bios, mainfrench = False)
# bios analyser module
extractor = tpp.bios_analyzer(tpp)
extractor.get_lang()
extractor.language
The language returned as string with ISO 639-1 Language Codes (Two-letter codes):
- English: "en"
- French: "fr"
- Spanish: "es"
- German: "de"
- Chinese: "zh"
- Japanese: "ja"
- Russian: "ru"
- Arabic: "ar"
- Hindi: "hi"
- Portuguese: "pt"
- ecc...
Profession
Estimating Twitter users' occupation from profile information is a common task in computational social sciences, particularly in attempts to assess socioeconomic status (SES) [@ghazouaniAssessingSocioeconomicStatus2019]. SES is considered a set of material, cultural, and social capital, clearly related to the profession.
The easiest way to estimate users' occupation is to use the auto-declared occupation in the user's profile description. According to [@preotiuc-pietroAnalysisUserOccupational2015], 20% of active users declare their occupational status in their description.
A common technique is to automatically search for professional occupations in the description and link them to standardized classifications such as the nomenclature des Professions et Catégories Socioprofessionnelles (CSP2020) for France, the Standard Occupation Classification (SOC2020) [@SOC2020Office] for the UK or USA, or the International Standard Classification of Occupations (ISCO-08) [@heMethodEstimatingIndividual2023; @sloanWhoTweetsDeriving2015].
There are four main difficulties in this task for assessing SES [@sloanWhoTweetsDeriving2015]:
- Distinguishing main occupations among hobbies or secondary occupations[^1].
- Accessing representativity on Twitter among different classes.
- Ensuring that the use of given professional keywords is actually auto-declarative.
- Wrong and/or biased auto-declared professions.
We tackle these problems by replacing automatic detection of profession with human-annotated professions, focusing mainly on the most common declared occupations.
Here's our method:
- Tokenize the description, removing punctuation and stop words (in English and French).
- Extract the most common words and bi-words.
- Recognize words (or bi-words) referring to auto-declaration of an occupation (e.g., journalist, professor, city councilor...).
- Link the declared occupation with standard classifications (CSP2020, SOC2020, ISCO-08).
- Check (by human verification) if the use of the keyword in bios actually refers to auto-declaration of professions and remove keywords with accuracy under 80%.
- Highlight the classes and sub-classes relevant for our analysis.
This technique not only allows us to obtain higher accuracy due to human verification but mainly allows us to identify more users by adapting the method to our specific dataset.
Here's how we tackle the challenges cited before:
- In this work, we don't try to predict SES but only to estimate types of occupation to link them with online sharing behaviors. We allow users to label multiple professions for users who declare more than one.
- Representativity is not a problem because we do not aim to make real-world scale statistics but to understand how the profession of users influences the recommendation algorithm specifically on Twitter.
- Statistical studies present 30% of misclassification by automatic keyword detection. We rely on human verification and annotation, considering status words apart from professions, and checking the use of keywords with accuracy over 80%[^2].
- We acknowledge that what we measure are only "declared professions." We concentrate on the most used words and bi-words, ensuring statistical significance of the result and reducing the risk of misclassification.
Because our dataset is 80% French, we choose to analyze the professions collected using the French standard classification CSP2020. Table 1 shows the most important categories identified in our dataset. We notice an over-representation of highly educated positions (Cadres et professions intellectuelles supérieures) from information, politics, and business, which has already been observed and is mainly due to the fact that professionals from those categories have a higher probability of stating their professional titles on their description.
| Professional category | Classification CSP2020 (French) | Users | Example of professions |
|---|---|---|---|
| Information, arts, and entertainment professions | Professions de l’information, de l’art et des spectacles (3500) | 2315 | Journalist, producer... |
| Business, IT, and administration professionals | Cadres et professions intermédiaires administratives, commerciales et techniques des entreprises (3700, 3800, 4600) | 2160 | Developer, CEO... |
| Elected officers and political representatives | Cadres administratifs et techniques de la fonction publique (3300) | 1192 | Mayor, minister... |
| Professors and higher scientific professions | Professeurs et professions scientifiques supérieures (3400) | 636 | Professor, historian... |
| Other | Recognized professions in other CSP categories | 530 | Nurse, architect... |
[^1]: For example, "Engineer and gardener on the weekend." Is that an engineer or a hunter? "Actor, producer, film director, and writer." What is the main profession?
[^2]: For instance, we notice that plural occupation keywords (e.g., "parliamentarians") most often don't refer to auto-declaration of profession.
Code
You can get the list of professions and keywords by calling :
tbb.get_pro_kewords()
Declared Status
In addition to occupations, Twitter users often use words indicating their status. We used the same method as for occupations, examining the most frequent words and bi-words, and identified keywords indicating a status associated with the user. We found at least one status for 31% of the users. Here are the different statuses we have identified.
Professional Status:
These keywords provide information about users' occupations but indicate a status or position rather than a sector or profession. Here are the different identified statuses: advisor, director, president, founder, enthusiast, leader, assistant, manager, student, executive, delegate, alumni, candidate, vice president, professional, specialist, independent, elected, PhD, expert, important member, secretary-general, ambassador, collaborator, administrator, coordinator, doctor, board member, deputy director, apprentice, master, technician, worker. Identifying these statuses allows us to avoid misclassification problems related to professions.
Type of Actor:
On Twitter, many accounts do not belong to individual users but to organizations. To distinguish one from the other, we identified a series of keywords characterizing the type of actor owning the account. List: association, official account, personal account, university, media, committee, federation, agency, startup, company, union, foundation, city hall, think tank. Thanks to this status, we can distinguish official and personal accounts. These types of keywords are less precise than others for determining account ownership, but we judge them reasonable to use.
Affiliation to a Group:
These statuses indicate a declaration of affiliation to a political or cultural group. The statuses do not specify which affiliation it is, only if it exists. List: Activist, member, citizen, supporter, fan, volunteer.
Degree and University:
In France, certain universities and degrees can impact identity, indicating membership in a social group. Among the most used words, we recognized some of these universities. List: Sciences Po, Sorbonne, Mines.
Code
You can get the list of statuses and keywords by calling :
tbb.get_status_kewords()
Age
The age of users is delicate because we do not have verification methods to assess our approach. However, here we are not attempting to estimate the age of all users but only that of a few users. So, we choose to rely on some keywords that allow us to associate users with a notion of youth or old age. Here are the identified keywords.
Old: retraité, retraitée, senior, vieux, ex, exjournaliste, exprésident.
Young: étudiant, student, etudiant, étudiante, etudiante, jeune, junior.
We obtain an age for 2.2% of the users.
Code
You can get the list of age keywords by calling :
tbb.get_age_kewords()
Gender
The last demographic information we try to estimate is gender. In our dataset, very few users declare their pronouns, so it is not feasible to use this technique to estimate the gender of users. However, French (the predominant language in our dataset) is a gendered language, allowing us to identify the gender of the author (e.g., "je suis infirmier" indicates a male subject, while "je suis infirmière" indicates a female subject).
We then face two problems: (1) there is no model in French that can determine with certainty whether a word is feminine and whether it refers to the subject of the sentence; (2) if we identify feminine subjects, how can we ensure that they are indeed self-declarative and refer to a person of feminine gender rather than another feminine entity (company, association, etc.). To address these issues, we have chosen to use the previously identified keywords to determine the status and occupation. Indeed, we have already taken care to determine that these keywords are self-referential, and in the case of professional keywords, we know they refer to an individual rather than an entity.
So we review the self-declarative keywords, and when possible, we note if these keywords are specifically masculine or specifically feminine. Here are some examples of identified keywords.
Masculine: président, conseiller, directeur, chef, passionné, consultant, citoyen, délégué...
Feminine: conseillère, directrice, ingénieure, présidente, experte, citoyenne, passionnée...
We notice that this method introduces a bias, as we can identify the gender only of users declaring an occupation or status. Additionally, we observe that in French, the masculine is considered neutral, so for certain professions, we can only identify the feminine (e.g., "docteure" is feminine, but "docteur" can be used for both male and female). Therefore, we have a slight selection bias towards women. Finally, given the unbalanced population, we expect to identify more men than women.
However, because of the big size of our sample, we judge that we can still identify gender effects in recommender systems using this method.
We obtain a gender for 20.5% of the users (14.5% men, 6% women).
Code
You can get the list of gender keywords by calling :
tbb.get_gender_kewords()
Topics
As we are interested in the recommendation mechanism, it is sometimes valuable to know the interests of users in our dataset (e.g., politics, digital, communication, etc.), as these interests could strongly influence the consumed content. Using the same technique as for professions, we have identified and classified keywords referring to specific topics in users' bios. These words are not intended to indicate the discipline, occupation, or profile of the user but simply an affinity with a topic.
Here are the identified topics: politics, digital, communication, culture, public sector, innovation, law, education, tech, sciences, health, business, research, marketing, development, music, management, economy, environment/biodiversity, ecology, security, entrepreneurship, data, history, finance, press, energy, justice, agriculture, recruitment, social media, strategy, climate, real estate, construction, tourism, sustainable development, literature, mobility, diplomacy, agri-food, design, philosophy, sociology, engineering.
Code
You can get the list of topics and keywords by calling :
tbb.get_topic_kewords()
Bibliography
[@culottaPredictingTwitterUser2016]: Culotta, A., Ravi, N. K., & Cutler, J. (2016). Predicting Twitter User Demographics Using Distant Supervision from Website Traffic Data. Journal of Artificial Intelligence Research, 55, 389-408. doi:10.1613/jair.4935.
[@ghazouaniAssessingSocioeconomicStatus2019]: Ghazouani, D., Lancieri, L., Ounelli, H., & Jebari, C. (2019). Assessing Socioeconomic Status of Twitter Users: A Survey. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 388-398. doi:10.26615/978-954-452-056-4_046.
[@preotiuc-pietroAnalysisUserOccupational2015]: Preoţiuc-Pietro, D., Lampos, V., & Aletras, N. (2015). An Analysis of the User Occupational Class through Twitter Content. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1754-1764. doi:10.3115/v1/P15-1169.
[@SOC2020Office]: SOC 2020 - Office for National Statistics.
[@sloanWhoTweetsDeriving2015]: Sloan, L., Morgan, J., Burnap, P., & Williams, M. (2015). Who Tweets? Deriving the Demographic Characteristics of Age, Occupation and Social Class from Twitter User Meta-Data. PLOS ONE, 10(3), e0115545. doi:10.1371/journal.pone.0115545.
Contact
For any problem you can contact me at : tim.faverjon@sciencespo.fr
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file twitter_profile_predictor-0.2.2.tar.gz.
File metadata
- Download URL: twitter_profile_predictor-0.2.2.tar.gz
- Upload date:
- Size: 3.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
858dee202d421b7c3e9fb400bff8d8fada47473e16c7cf0b90b109d5353e4034
|
|
| MD5 |
f12acfbcb42e4bffc77f1eb07161b489
|
|
| BLAKE2b-256 |
d5e7af53ff0f5809a75107e2be877c67a02add8c6bcda5959771a5c0409d057b
|
File details
Details for the file twitter_profile_predictor-0.2.2-py3-none-any.whl.
File metadata
- Download URL: twitter_profile_predictor-0.2.2-py3-none-any.whl
- Upload date:
- Size: 6.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0344a7a2f0b8d372b3c8570878f8a341c312f6431437b6f4418cf01f81420fe3
|
|
| MD5 |
7cec030c7e7644bcaac535e364df2dbd
|
|
| BLAKE2b-256 |
368331f72a72bbc51868b20e2091627d1406d6fc2645b1d92ee33def954a9254
|