Optimus is the missing framework for cleaning and pre-processing data in a distributed fashion.
A live notebook server for testing Optimus can be launched on Binder or Colab.
Installation (pip):
In your terminal, run: pip install pyoptimus
Requirements
- Python>=3.7
Examples
The "10 minutes to Optimus" notebook covers the basics you need to start working.
The Examples section has notebooks on specific topics: data cleaning, data munging, profiling, data enrichment, and how to create ML and DL models.
Also check the Cheat Sheet.
Feedback
Feedback is what drives Optimus's future, so please take a couple of minutes to help shape the Optimus roadmap: http://bit.ly/optimus_survey
If you have a suggestion or feature request, open an issue at https://github.com/hi-primus/optimus/issues
Start Optimus
Start Optimus using "pandas", "dask", "cudf" or "dask_cudf".
from optimus import Optimus
op = Optimus("pandas")
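If the same script has to run on machines with different libraries installed, the engine name can be chosen at runtime. The following is a minimal sketch, not part of Optimus: `available_engines` and `pick_engine` are hypothetical helpers, and the preference order (GPU and distributed engines first) is an assumption you may want to change.

```python
import importlib.util

# Engine names from the list above, paired with the module each one relies on.
# The preference order is an assumption of this sketch, not an Optimus recommendation.
ENGINE_MODULES = [
    ("dask_cudf", "dask_cudf"),
    ("cudf", "cudf"),
    ("dask", "dask"),
    ("pandas", "pandas"),
]


def _module_installed(module):
    """Return True if the top-level module can be found on this machine."""
    return importlib.util.find_spec(module) is not None


def available_engines(is_installed=_module_installed):
    """Return the engine names whose backing module is installed, in preference order."""
    return [engine for engine, module in ENGINE_MODULES if is_installed(module)]


def pick_engine(is_installed=_module_installed, default="pandas"):
    """Return the first available engine name, or the default if none is found."""
    engines = available_engines(is_installed)
    return engines[0] if engines else default


# Usage (requires pyoptimus to be installed):
# from optimus import Optimus
# op = Optimus(pick_engine())
```

The `is_installed` parameter exists only so the selection logic can be exercised without installing every backend.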
Loading data
Optimus can load CSV, JSON, Parquet, Avro and Excel data from a local file or a URL.
# csv
df = op.load.csv("../examples/data/foo.csv")
# json
df = op.load.json("../examples/data/foo.json")
# using a url
df = op.load.json("https://raw.githubusercontent.com/hi-primus/optimus/develop-21.8/examples/data/foo.json")
# parquet
df = op.load.parquet("../examples/data/foo.parquet")
# ...or anything else
df = op.load.file("../examples/data/titanic3.xls")
You can also load data from Oracle, Redshift, MySQL and PostgreSQL databases.
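When file paths arrive at runtime, one way to choose between the loaders shown above is to dispatch on the file extension. This is a rough sketch under stated assumptions: `loader_name` and `load_any` are hypothetical helpers, only the `csv`, `json`, `parquet` and `file` loaders from the examples above are used, and anything unrecognised falls back to `op.load.file`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions mapped to the op.load methods used in the examples above.
LOADERS = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def loader_name(source):
    """Return the name of the op.load method to try for a local path or URL."""
    path = urlparse(source).path  # drops the scheme, host and query string of a URL
    suffix = PurePosixPath(path).suffix.lower()
    return LOADERS.get(suffix, "file")  # fall back to the generic loader


def load_any(op, source):
    """Load `source` with the matching loader of an Optimus instance `op`."""
    return getattr(op.load, loader_name(source))(source)


# Usage (requires an Optimus instance):
# df = load_any(op, "../examples/data/foo.csv")
```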
Saving Data
# csv
df.save.csv("data/foo.csv")
# json
df.save.json("data/foo.json")
# parquet
df.save.parquet("data/foo.parquet")
You can also save data to Oracle, Redshift, MySQL and PostgreSQL databases.
Create dataframes
You can also create a dataframe from scratch:
df = op.create.dataframe({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 3, 5, 7],
    'C': [2, 4, 6, None],
    'D': ['1980/04/10', '1980/04/10', '1980/04/10', '1980/04/10']
})
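The example above passes a column-oriented dict: one key per column, one list of values per key. If your data starts out as a list of row records instead, it has to be pivoted into that shape first. A minimal sketch, assuming missing fields should become `None`; `rows_to_columns` is a hypothetical helper, not an Optimus function.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    # First pass: collect every column name, keeping first-seen order.
    for row in rows:
        for key in row:
            columns.setdefault(key, [])
    # Second pass: fill each column, using None where a row lacks the field.
    for row in rows:
        for key, values in columns.items():
            values.append(row.get(key))
    return columns


# Usage (requires an Optimus instance):
# df = op.create.dataframe(rows_to_columns([
#     {"A": "a", "B": 1},
#     {"A": "b", "B": 3},
# ]))
```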
With display, you can show your data along with extra information such as column number, column data type and marked white spaces.
display(df)
Cleaning and Processing
Optimus was created to make data cleaning a breeze. The API was designed to be easy for newcomers and familiar to people who come from Pandas.
Optimus expands the standard DataFrame functionality by adding .rows and .cols accessors.
For example, you can load data from a URL, transform it and apply some predefined cleaning functions:
new_df = df\
    .rows.sort("rank", "desc")\
    .cols.lower(["names", "function"])\
    .cols.date_format("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("date arrival", "dd-MM-YYYY", output_cols="from arrival")\
    .cols.normalize_chars("names")\
    .cols.remove_special_chars("names")\
    .rows.drop(df["rank"]>8)\
    .cols.rename("*", str.lower)\
    .cols.trim("*")\
    .cols.unnest("japanese name", output_cols="other names")\
    .cols.unnest("last position seen", separator=",", output_cols="pos")\
    .cols.drop(["last position seen", "japanese name", "date arrival", "cybertronian", "nulltype"])
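To make the intent of a few of these steps concrete, here is a plain-Python sketch that applies the sort, lower, row-drop, rename and trim steps to a list of dicts. It only illustrates the assumed semantics and is not how Optimus implements them; `clean_records` and the sample records are made up for this example.

```python
LOWERCASE_COLUMNS = ("names", "function")  # columns passed to cols.lower above


def clean_records(records):
    """Apply a subset of the chained cleaning steps to a list of row dicts."""
    # rows.sort("rank", "desc")
    ordered = sorted(records, key=lambda record: record["rank"], reverse=True)
    cleaned = []
    for record in ordered:
        # rows.drop(df["rank"] > 8): skip rows whose rank is above 8
        if record["rank"] > 8:
            continue
        row = {}
        for key, value in record.items():
            if isinstance(value, str):
                value = value.strip()  # cols.trim("*")
                if key in LOWERCASE_COLUMNS:
                    value = value.lower()  # cols.lower(["names", "function"])
            row[key.lower()] = value  # cols.rename("*", str.lower)
        cleaned.append(row)
    return cleaned


# Hypothetical sample data, not taken from the Optimus example files.
sample = [
    {"names": " Optimus", "function": "Leader", "rank": 10},
    {"names": "bumblebee ", "function": "Espionage", "rank": 7},
    {"names": "Ironhide", "function": "Security", "rank": 8},
]
result = clean_records(sample)
print(result)
```

The chained Optimus version expresses the same idea declaratively, and the selected engine decides how the work is executed.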
Troubleshooting
If you run into issues, see our Troubleshooting Guide.
Contributing to Optimus
Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contribution, including:
- Documentation updates, enhancements, designs, or bugfixes.
- Spelling or grammar fixes.
- README.md corrections or redesigns.
- Adding unit or functional tests.
- Triaging GitHub issues -- especially determining whether an issue still persists or is reproducible.
- Blogging, speaking about, or creating tutorials about Optimus and its many features.
- Helping others on our official chats.
Backers and Sponsors
Become a backer or a sponsor and get your image on our README on GitHub with a link to your site.