Data extraction and analysis tool

Project description

GitData

Data Wrangling for Everyone.

GitData is an easy-to-use, fast, scalable, distributed data extraction system whose commands let you gather, manage, and query data in a wide variety of ways.

Concepts

GitData stores data as facts.

Facts are triples of the form (subject, predicate, object), where the subject is typically an entity, the predicate is typically an attribute of that entity, and the object is the value of the attribute. When the attribute represents a relationship between entities, the object is another entity.
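
As a rough illustration of the triple model, here is a minimal Python sketch. The Fact type, the attributes_of helper, and the sample entities are hypothetical and are not part of GitData's API; they only show how plain attributes and relationships between entities fit the same (subject, predicate, object) shape.

```python
from typing import Any, NamedTuple


class Fact(NamedTuple):
    """Hypothetical (subject, predicate, object) triple; not GitData's own type."""
    subject: str
    predicate: str
    obj: Any  # the "object" of the triple; named obj to avoid shadowing the builtin


facts = [
    Fact("alice", "age", 34),            # attribute: the object is a plain value
    Fact("alice", "works_for", "acme"),  # relationship: the object is another entity
    Fact("acme", "city", "Toronto"),
]


def attributes_of(entity, facts):
    """Collect every predicate/object pair recorded for one subject."""
    return {f.predicate: f.obj for f in facts if f.subject == entity}


print(attributes_of("alice", facts))
```

Because "acme" appears both as an object and as a subject, following the works_for fact leads to a second entity with facts of its own.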

Commands

GitData shares many commands and concepts with the Git revision control system, with some important differences that make it ideal for working with data.

Data repositories

Data repositories are where GitData stores the data it is managing. That data is typically pulled in from other data sources and is stored in the data repository for quick access.

gitdata init   # initialize a new data repository
gitdata status # show repository status

Remotes

Remotes are named connections that you set up within your data repository to make it easier to access data from external sources, such as the internet, somewhere on your network, or a local disk. When you add a remote, you give it a name that can then be used to refer to that remote from within the repository.

To see the remotes for a data repository, run the gitdata remote command, which lists the names of the remotes. To see the URLs that the remotes correspond to, use the -v flag to produce a verbose listing.

gitdata remote      # list remotes
gitdata remote -v   # list remotes with their URLs

Adding Remotes

To add a remote so that you can refer to it by a short name, use gitdata remote add <shortname> <url>.

Removing Remotes

You can remove a remote from your project by using the gitdata remote rm <shortname> command.
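
Conceptually, a set of remotes behaves like a mapping from short names to URLs. The Python sketch below illustrates only that idea; the function names, the sample names, and the example.com URLs are assumptions made for illustration, and nothing here reflects how GitData stores remotes internally.

```python
# Hypothetical in-memory model of remotes: short name -> URL.
remotes = {}


def remote_add(name, url):
    """Register a URL under a short name, refusing duplicates."""
    if name in remotes:
        raise ValueError(f"remote {name!r} already exists")
    remotes[name] = url


def remote_rm(name):
    """Forget a remote; raises KeyError if the name is unknown."""
    del remotes[name]


def remote_list(verbose=False):
    """Return remote names, or name and URL pairs when verbose is set."""
    return [f"{n}\t{u}" if verbose else n for n, u in sorted(remotes.items())]


remote_add("census", "https://example.com/census.csv")
remote_add("weather", "https://example.com/weather.json")
remote_rm("weather")
print(remote_list(verbose=True))
```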

Showing

Data repositories are a collection of entities containing facts. To view any entity within the repository, use the gitdata show <name> command, where <name> is the name of the entity. For example, if you've stored a remote in your repository, you can see the details of that remote by using the show command.

Fetch

The gitdata fetch command copies facts from somewhere else into your GitData repository. The location being fetched from can be a remote, or anywhere else you can reach from your computer. The fetched facts are placed in a temporary holding area, which lets you work with them without committing to making them a permanent part of your repository.

To fetch, simply run gitdata fetch <location>, where <location> is either a remote that you've already added to your repository or any other location, such as a URL or a local file.

When you run fetch, it reads the data in whatever form it is in and digests it into facts, ready for you to work with alongside any other data in your repository. If you decide to keep the facts, use the gitdata add and gitdata commit commands to add them to your data repository.
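
To make the idea of digesting data into facts more concrete, here is a small Python sketch that turns CSV rows into triples and then moves them from a holding area into a permanent list. It is an illustration under stated assumptions, not GitData's implementation: the digest_csv function, the choice of the name column as the subject, and the staged and repository lists are all hypothetical.

```python
import csv
import io


def digest_csv(text, key):
    """Turn CSV rows into (subject, predicate, object) triples.

    The column named by `key` supplies the subject; every other
    column becomes one fact about that subject. Values stay strings.
    """
    facts = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = row[key]
        for column, value in row.items():
            if column != key:
                facts.append((subject, column, value))
    return facts


sample = "name,city,age\nalice,Toronto,34\nbob,Victoria,29\n"

staged = digest_csv(sample, key="name")  # stands in for the temporary holding area
repository = []                          # stands in for the permanent facts

# Keeping the facts is loosely analogous to the add and commit step.
repository.extend(staged)
staged = []

for fact in repository:
    print(fact)
```

Each CSV row yields one fact per non-key column, so two rows with two attribute columns produce four facts.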

Download files

Source Distribution

gitdata-cli-0.2.0.tar.gz (13.5 kB)

Uploaded Source

Built Distribution

gitdata_cli-0.2.0-py3-none-any.whl (18.7 kB)

Uploaded Python 3
