Location based social network (LBSN) data structure format & transfer tool

Project description

LBSNTransform

A Python package that uses the common location based social network (LBSN) data structure concept (ProtoBuf) to import, transform and export Social Media data from networks such as Twitter and Flickr.

Motivation

The goal is to provide a common interface for handling Social Media data, without custom adjustment to the myriad API endpoints available. As an example, consider the ProtoBuf spec "Post", which can be a Tweet on Twitter, a photo shared on Flickr, or a post on Reddit. This tool is based on a 4-facet conceptual framework for LBSN, introduced in a paper by Dunkel et al. (2018). In addition, the GDPR directly requires Social Media Network operators to allow users to transfer accounts and data between services. While there are attempts by Google, Facebook etc. (see the data-transfer-project), this is currently not possible. With this structure concept, a primary motivation is to systematically characterize LBSN data aspects in a common scheme that enables privacy-by-design for connected software, data handling and database design.
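
The idea of a shared "Post" facet can be illustrated with a minimal sketch. Note that the field and function names below are illustrative only, not the actual lbsnstructure ProtoBuf spec:

```python
from dataclasses import dataclass

@dataclass
class Post:
    # Minimal common "Post" facet; the real lbsnstructure uses ProtoBuf messages
    origin_id: int       # e.g. 2 = Flickr, 3 = Twitter
    guid: str            # network-local id of the record
    body: str            # tweet text, photo description, ...
    lat: float = None
    lng: float = None

def post_from_tweet(tweet: dict) -> Post:
    """Map a (simplified) Twitter JSON dict to the common Post."""
    coords = (tweet.get("coordinates") or {}).get("coordinates", [None, None])
    return Post(origin_id=3, guid=tweet["id_str"], body=tweet["text"],
                lng=coords[0], lat=coords[1])

def post_from_flickr_photo(photo: dict) -> Post:
    """Map a (simplified) Flickr photo dict to the same common Post."""
    return Post(origin_id=2, guid=photo["id"], body=photo.get("title", ""),
                lat=photo.get("latitude"), lng=photo.get("longitude"))
```

Once both networks arrive in the same structure, downstream code (export, anonymization, database storage) no longer needs to know which network a record came from.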

Description

This tool enables data import from a Postgres database, JSON, or CSV, and export to CSV, LBSN ProtoBuf, or an LBSN-prepared Postgres database. The tool maps Social Media endpoints (e.g. Twitter tweets) to a common LBSN Interchange Structure format in ProtoBuf. The package can also be imported into other Python projects with import lbsntransform for on-the-fly conversion.

Quick Start

You can install the newest version with all its dependencies directly from the Git Repository:

pip install --upgrade git+git://gitlab.vgiscience.de:lbsn/lbsntransform.git

or install latest release using pip:

pip install lbsntransform

.. for non-developers, another option is to simply download the latest build and run it with custom args,
e.g. with the following command line args:

lbsntransform.exe --Origin 3 --LocalInput --LocalFileType '*.json' --transferlimit 1000 --CSVOutput

.. with the above input args, the tool will:

  • read local json from /01_Input/
  • and store lbsn records as CSV and ProtoBuf in /02_Output/

A full list of possible input args is available with lbsntransform --help:

usage: lbsntransform [-h] [-sO ORIGIN] [-lI] [-lT LOCALFILETYPE]
                     [-iP INPUTPATH] [-iS] [-pO DBPASSWORD_OUTPUT]
                     [-uO DBUSER_OUTPUT] [-aO DBSERVERADDRESSOUTPUT]
                     [-nO DBNAMEOUTPUT] [-pI DBPASSWORD_INPUT]
                     [-uI DBUSER_INPUT] [-aI DBSERVERADDRESSINPUT]
                     [-nI DBNAMEINPUT] [-t TRANSFERLIMIT] [-tC TRANSFERCOUNT]
                     [-nR NUMBEROFRECORDSTOFETCH] [-tR] [-rR] [-iG]
                     [-rS STARTWITHDBROWNUMBER] [-rE ENDWITHDBROWNUMBER]
                     [-d DEBUGMODE] [-gL GEOCODELOCATIONS]
                     [-igS IGNOREINPUTSOURCELIST] [-iT INPUTTYPE] [-mR] [-CSV]
                     [-CSVal] [-CSVdelim CSVDELIMITER] [-rL]
                     [-sF SKIPUNTILFILE] [-mGA MINGEOACCURACY]

optional arguments:
  -h, --help            show this help message and exit
  -sO ORIGIN, --Origin ORIGIN
                        Type of input source. Defaults to 3: Twitter (1 -
                        Instagram, 2 - Flickr, 3 - Twitter)

Local Input:
  -lI, --LocalInput     Process local json or csv
  -lT LOCALFILETYPE, --LocalFileType LOCALFILETYPE
                        If localread, specify filetype (json, csv etc.)
  -iP INPUTPATH, --InputPath INPUTPATH
                        Optionally provide path to input folder, otherwise
                        ./Input/ will be used. You can also provide a web-url
                        starting with http
  -iS, --isStackedJson  Typical form is [{json1},{json2}], if is_stacked_json
                        is True: will process stacked jsons in the form of
                        {json1}{json2} (no comma)

DB Output:
  -pO DBPASSWORD_OUTPUT, --dbPassword_Output DBPASSWORD_OUTPUT
  -uO DBUSER_OUTPUT, --dbUser_Output DBUSER_OUTPUT
                        Default: example-user-name2
  -aO DBSERVERADDRESSOUTPUT, --dbServeraddressOutput DBSERVERADDRESSOUTPUT
                        e.g. 111.11.11.11 . Optionally add port to use, e.g.
                        111.11.11.11:5432. 5432 is the default port
  -nO DBNAMEOUTPUT, --dbNameOutput DBNAMEOUTPUT
                        e.g.: test_db

DB Input:
  -pI DBPASSWORD_INPUT, --dbPassword_Input DBPASSWORD_INPUT
  -uI DBUSER_INPUT, --dbUser_Input DBUSER_INPUT
                        Default: example-user-name
  -aI DBSERVERADDRESSINPUT, --dbServeraddressInput DBSERVERADDRESSINPUT
                        e.g. 111.11.11.11. Optionally add port to use, e.g.
                        111.11.11.11:5432. 5432 is the default port
  -nI DBNAMEINPUT, --dbNameInput DBNAMEINPUT
                        e.g.: test_db

Additional settings:
  -t TRANSFERLIMIT, --transferlimit TRANSFERLIMIT
  -tC TRANSFERCOUNT, --transferCount TRANSFERCOUNT
                        Default to 50k: After how many parsed records should
                        the result be transferred to the DB. Larger values
                        improve speed, because duplicate check happens in
                        Python and not in Postgres Coalesce; larger values are
                        heavier on memory.
  -nR NUMBEROFRECORDSTOFETCH, --numberOfRecordsToFetch NUMBEROFRECORDSTOFETCH
  -tR, --disableTransferReactions
  -rR, --disableReactionPostReferencing
                        Enable this option in args to prevent empty posts
                        stored due to Foreign Key Exists Requirement 0 = Save
                        Original Tweets of Retweets in "posts"; 1 = do not
                        store Original Tweets of Retweets; !Not implemented: 2
                        = Store Original Tweets of Retweets as
                        "post_reactions"
  -iG, --ignoreNonGeotagged
  -rS STARTWITHDBROWNUMBER, --startWithDBRowNumber STARTWITHDBROWNUMBER
  -rE ENDWITHDBROWNUMBER, --endWithDBRowNumber ENDWITHDBROWNUMBER
  -d DEBUGMODE, --debugMode DEBUGMODE
                        Needs to be implemented.
  -gL GEOCODELOCATIONS, --geocodeLocations GEOCODELOCATIONS
                        Defaults to None. Provide path to CSV file with
                        location geocodes (CSV Structure: lat, lng, name)
  -igS IGNOREINPUTSOURCELIST, --ignoreInputSourceList IGNOREINPUTSOURCELIST
                        Provide a list of input_source types that will be
                        ignored (e.g. to ignore certain bots etc.)
  -iT INPUTTYPE, --inputType INPUTTYPE
                        Input type, e.g. "post", "profile", "friendslist",
                        "followerslist" etc.
  -mR, --mapFullRelations
                        Defaults to False. Set to true to map full relations,
                        e.g. many-to-many relationships such as user_follows,
                        user_friend, user_mentions etc. are mapped in a
                        separate table
  -CSV, --CSVOutput     Set to True to Output all Submit values to CSV
  -CSVal, --CSVallowLinebreaks
                        If set to False, will not remove in-text line breaks
                        (\n or \r) in output CSVs
  -CSVdelim CSVDELIMITER, --CSVdelimiter CSVDELIMITER
                        Provide CSV delimiter to use. Default is comma (,).
                        Note: to pass tab, use variable substitution ($"\t")
  -rL, --recursiveLoad  Process Input Directories recursively (depth: 2)
  -sF SKIPUNTILFILE, --skipUntilFile SKIPUNTILFILE
                        If local input, skip all files until file with name x
                        appears (default: start immediately)
  -mGA MINGEOACCURACY, --minGeoAccuracy MINGEOACCURACY
                        Set to "latlng", "place", or "city" to limit output
                        based on min geoaccuracy
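
The --transferCount behaviour described in the help text above (buffer parsed records in Python, deduplicate there, then submit in one batch) can be sketched roughly as follows. This is a simplification for illustration; the batch-append below stands in for the actual database submit:

```python
def process_records(records, transfer_count=50000):
    """Collect parsed records and flush them in batches of transfer_count.

    Deduplication happens in Python (cheap dict lookup) instead of
    per-row in Postgres, which is why larger values improve speed but
    are heavier on memory.
    """
    batches = []
    buffer = {}  # guid -> record; later duplicates overwrite earlier ones
    for record in records:
        buffer[record["guid"]] = record
        if len(buffer) >= transfer_count:
            # in the real tool, this is where the batch goes to the DB
            batches.append(list(buffer.values()))
            buffer.clear()
    if buffer:  # flush the remainder
        batches.append(list(buffer.values()))
    return batches
```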

Built With

  • lbsnstructure - A common language-independent and cross-network social media data scheme
  • protobuf - Google's data interchange format
  • psycopg2 - Python-PostgreSQL Database Adapter
  • ppygis3 - A PPyGIS port for Python
  • shapely - Geometric objects processing in Python
  • emoji - Emoji handling in Python

Contributing

Field mapping from and to ProtoBuf for the different Social Media sites is provided in the classes field_mapping_xxx.py.
As an example, the mapping of the Twitter JSON structure is given (see the class FieldMappingTwitter). This class may be used
as a template to extend functionality to other networks such as Flickr or Foursquare.
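
A new mapping class would follow the same pattern as FieldMappingTwitter. The sketch below is hypothetical: the class name, method name, and field names are illustrative and do not reflect the package's actual API:

```python
class FieldMappingFlickr:
    """Hypothetical mapping sketch: Flickr API dict -> common lbsn record dict."""

    ORIGIN_ID = 2  # 2 = Flickr in the tool's --Origin numbering

    def parse_json_record(self, record: dict) -> dict:
        # extract only the fields the common structure knows about,
        # dropping everything network-specific
        return {
            "origin_id": self.ORIGIN_ID,
            "post_guid": record["id"],
            "post_body": record.get("description", ""),
            "latitude": record.get("latitude"),
            "longitude": record.get("longitude"),
        }
```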

For development & testing, make a local clone of this repository

git clone git@gitlab.vgiscience.de:lbsn/lbsntransform.git

.. and create the package in develop mode to symlink the folder to your
Python's site-packages folder with:

python setup.py develop

(use python setup.py develop --uninstall to uninstall tool in develop mode)

Now you can run the tool in your shell with (Origin 3 = Twitter):

lbsntransform --Origin 3 --LocalInput --LocalFileType '*.json' --transferlimit 1000 --CSVOutput

..or import the package to other python projects with:

import lbsntransform

Versioning, Changelog and Download

For the releases available, see the tags on this repository. The latest Windows build available for download is 0.1.4. For all other systems, use cx_freeze to build an executable:

python cx_setup.py build

The versioning (major.minor.patch) is automated using python-semantic-release. Commit messages that follow the Angular Commit Message Conventions are automatically interpreted, followed by version bumps if necessary. Examples:

  • fix: hotfix for bug xy will result in a patch version bump
  • feat: feature for processing xy will result in a minor version bump
  • perf(cluster): faster generation of alpha shapes with a footer of BREAKING CHANGE: Easy buffer option removed. will result in a major version bump

Some types used in this project:

feat: A new feature
fix: A bug fix
docs: Documentation only changes
style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
refactor: A code change that neither fixes a bug nor adds a feature
perf: A code change that improves performance
test: Adding missing or correcting existing tests
chore: Changes to the build process or auxiliary tools and libraries such as documentation generation

Except for features (feat) and fixes (fix), no version bump is made.
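
The bump rules above can be sketched as a small function. This is a simplification of what python-semantic-release actually does (which inspects the full commit history, not a single message):

```python
import re

def bump_version(version: str, commit_message: str) -> str:
    """Apply Angular-style commit conventions to a major.minor.patch version."""
    major, minor, patch = (int(p) for p in version.split("."))
    if "BREAKING CHANGE" in commit_message:
        return f"{major + 1}.0.0"          # major bump
    # commit type prefix, with optional scope: "feat:", "perf(cluster):", ...
    match = re.match(r"(\w+)(\(.+\))?:", commit_message)
    if match and match.group(1) == "feat":
        return f"{major}.{minor + 1}.0"    # minor bump
    if match and match.group(1) == "fix":
        return f"{major}.{minor}.{patch + 1}"  # patch bump
    return version  # docs, style, chore, ... -> no bump
```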

Authors

  • Alexander Dunkel - Initial work

See also the list of contributors.

License

This project is licensed under the GNU GPLv3 or any later version - see the LICENSE.md file for details.

