# Pset 1
## Objectives
In this problem set you will:
- Use Pipenv to build and manage your application
- Further explore CI/CD, pulling data from AWS S3 and performing an analysis
- Use decorators and context managers to implement an atomic write and pset submission
- Dip your toes into hashing and parquet, which will be important in many of our problem sets
## Table of Contents

- Before you begin...
- Problems
## Before you begin...
Add your Build and Code Climate badges to the top of this README, using the markdown template for your master branch. If you are not using Travis, replace with a build badge for the appropriate service.
### Document code and read documentation
For some problems we have provided starter code. Please look carefully at the docstrings and follow all input and output specifications.
For other problems, we might ask you to create new functions; please document them using docstrings! Documentation factors into the "python quality" portion of your grade.
### Docker shortcut
See `drun_app`:

```bash
docker-compose build
./drun_app python  # Equivalent to docker-compose run app python
```
### Pipenv
This pset will require dependencies. Rather than using a `requirements.txt`, we will use pipenv to give us a pure, repeatable application environment.
#### Installation
If you are using the Docker environment, you should be good to go. Mac/Windows users should install pipenv into their main python environment as instructed. If you need a new python environment, you can use a base conda installation. See Canvas for instructions.
```bash
pipenv install --dev
pipenv run python some_python_file
```
If you get a `TypeError`, see this issue.
#### Usage
Rather than `python some_file.py`, you should run `pipenv run python some_file.py` or `pipenv shell`, etc.
Never `pip install` anything! Instead, you should `pipenv install pandas` or `pipenv install --dev pytest`. Use `--dev` if your app only needs the dependency for development, not to actually do its job.
You should avoid installing "IDE apps" like ipython or black into your Pipenv environment - if you don't need it during CI or for the code to work, it should not be an application dependency.
Pycharm works great with pipenv
Be sure to commit any changes to your Pipfile and Pipfile.lock!
#### Debugging pipenv installs
If a pipenv install fails, it can be hard to see what the root cause is. Try these tips:
- Ensure you don't have a bad package in your Pipfile (pipenv sometimes adds them even if the install fails). You can manually delete lines for packages you don't want.
- Try verbose mode, with `pipenv install -v ...`.
- Try manually installing with `pip install ...` or `pip install -v ...`. This will often show you the 'real' error. If it works, don't forget to `pipenv install ...` to ensure the dep is tracked appropriately.
#### Pipenv inside docker
Because of the way docker freezes the operating system, installing a new package within docker is a two-step process:
```bash
docker-compose build
# Now I want a new thing
./drun_app pipenv install pandas  # Updates Pipfile, but does not rebuild the image
# Running ./drun_app python -c "import pandas" will fail!
# Rebuild
docker-compose build
./drun_app python -c "import pandas"  # Now this works
```
### Credentials and data
Git should be for code, not data, so we've created an S3 bucket for problem set file distribution. For this problem set, we've uploaded a data set of your answers to the "experience demographics" quiz that you should have completed in the first week. In order to access the data in S3, we need to install and configure `awscli` both for running the code locally and running our tests in Travis.
You should have created an IAM key in your AWS account. DO NOT SHARE THE SECRET KEY WITH ANYONE. It gives anyone access to the S3 bucket. It must not be committed to your code.
For more reference on security, see Travis Best Practices and Removing Sensitive Data.
#### Using `awscli`
AWS provides a CLI tool that helps interact with the many different services they offer.
##### Installation (via pipenv)
We have already installed `awscli` into your pipenv. It is available within the pipenv shell and the docker container via the same mechanisms:

```bash
pipenv run aws --help
./drun_app aws --help
```
#### Configure `awscli` locally

Now configure `awscli` to access the S3 bucket. You have a few options for how to do so:
- Make a `.env` file for your pset
- Globally configure your client (not recommended)
- Manually manage environment variables
Note that, regardless of the option, your code does not change - the code assumes it is running in a pre-configured environment. In fact, your code will likely be configured differently on your machine vs on Travis.
Personally, for local development, I find using named profiles (eg, `AWS_PROFILE=csci_student`) the easiest, since you can configure once on your system and then just indicate the profile for each pset. You will likely need to provide the raw credentials in CI/CD.
##### Make a `.env` (pronounced "dotenv") file

Create a `.env` file that looks something like this:
```
AWS_PROFILE=XXX
AWS_ACCESS_KEY_ID=XXXX
AWS_SECRET_ACCESS_KEY=XXX
OTHER_ENV_VAR=...
```
The `.env` file should be in your root project directory, the same directory as this README. DO NOT commit this file to the repo (it's already in your `.gitignore`).
Both docker and pipenv will automatically inject these variables into your environment! Whenever you need new env variables, add them to a dotenv. You DO NOT need to install any additional modules (notably `python-dotenv`) in order for this to work.
See env refs for docker and pipenv for more details.
##### Caveats
Beware some small 'gotchas' with dotenv files:
- Pycharm does not load them directly, even if using a pipenv interpreter. You can install plugins to accomplish this, or manually add the env variables to Pycharm's run configuration. Try not to add them to the pytest run config!
- If using `pipenv shell` or a docker shell, changes to the dotenv will not be loaded until you exit and restart the shell.
##### Global config
According to the documentation, the easiest way is:
```
aws configure
AWS Access Key ID [None]: ACCESS_KEY
AWS Secret Access Key [None]: SECRET_KEY
Default region name [None]:
Default output format [None]:
```
However, this is not generally recommended if you want to use a special AWS key just for this class. It is more secure to create a dedicated IAM user/key with no permissions except those needed for this course.
There are other, more complicated, configurations outlined in the documentation. Feel free to use a solution using environment variables, a credentials file, a profile, etc.
#### Copy the data locally
Read the `Makefile` and `.travis.yml` to see how to copy the data locally.
Note that we are using a Requester Pays bucket. You are responsible for usage charges.
You should now have a new folder called `data` in your root directory with the data we'll use for this problem set. You can find more details breaking down this command at the S3 Documentation Site.
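For reference, the copy should look something like the sketch below. The bucket and prefix here are placeholders; use the real values from the `Makefile`:

```bash
# Placeholder bucket/prefix - see the Makefile for the real ones
pipenv run aws s3 cp s3://<PSET_BUCKET>/pset_1/ data/ --recursive --request-payer requester
```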
#### Set the Travis environment variables
This is Advanced Python for Data Science, so of course we also want to automate our tests and pset using CI/CD. Unfortunately, we can't upload our `.env` or run `aws configure` and interactively enter the credentials for the Travis builds, so we have to configure Travis to use the access credentials without compromising the credentials in our git repo.
We've provided a working `.travis.yml` configuration that only requires the AWS credentials when running on the master branch, but you will still need to complete the final step of adding the variables for your specific pset repository.
To add environment variables to your Travis environment, you can use one of the following options:
- Navigating to the settings, eg https://travis-ci.com/csci-e-29/YOUR_PSET_REPO/settings
- The Travis CLI
- Encrypting into the `.travis.yml` as instructed here
Preferably, you should only make your 'prod' credentials available on your master branch: Travis Settings
You can choose the method you think is most appropriate. Our only requirement is that THE KEYS SHOULD NOT BE COMMITTED TO YOUR REPO IN PLAIN TEXT ANYWHERE.
For more information, check out the Travis Documentation on Environment Variables
IMPORTANT: If you find yourself getting stuck or running into issues, please post on Piazza and ask for help. We've provided most of the instructions necessary for this step and do not want you spinning your wheels too long just trying to download the data.
## Problems
### Canvas helpers
Note the module `pset_1/canvas.py`. As you complete this problem set, think about anything you do that is a generalized utility for Canvas (eg, something you might want to reuse in your next psets). Write those functions in this module for now.
You should stick with using `canvasapi` (unless you find something better!). Use the python code to help you understand the API!
Example functions you may want to write here:
- Get an active canvas course by name (`-> canvasapi.course.Course`); a sketch follows this list. Note the many helper functions on the `Course` class.
- Get an assignment or quiz by name (`canvasapi.assignment.Assignment` and `canvasapi.quiz.Quiz`). Note that every quiz is also an assignment, with `is_quiz=True` and `quiz_id` properties.
- Take a quiz or submit an assignment (at least the generalizable parts).
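As an illustration of the first helper, here is a minimal sketch. The function name and exact-match logic are just one possible design, and the `enrollment_state` filter is an assumption about how you want to define "active":

```python
from canvasapi import Canvas
from canvasapi.course import Course


def get_course(canvas: Canvas, name: str) -> Course:
    """Return the first active course whose name matches exactly."""
    for course in canvas.get_courses(enrollment_state="active"):
        # Some course objects may not expose a name (eg, restricted access)
        if getattr(course, "name", None) == name:
            return course
    raise KeyError("No active course named {!r}".format(name))
```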
### Hashed strings
It can be extremely useful to hash a string or other data for various reasons - to distribute/partition it, to anonymize it, or otherwise conceal the content.
#### Implement a standardized string hash
Use `sha256` as the backbone algorithm from `hashlib`. Hashlib is part of the Python standard library, so you do not have to install it in order to import it into your modules.
A salt is a prefix that may be added to increase the randomness or otherwise change the outcome. It may be a `str` or `bytes` string, or empty.
Implement it in `hash_str.py`, where the return value is the `.digest()` of the hash, as a `bytes` array:
```python
from typing import Union


def hash_str(some_val: Union[str, bytes], salt: Union[str, bytes] = "") -> bytes:
    """Converts strings to hash digest

    See: https://en.wikipedia.org/wiki/Salt_(cryptography)

    :param some_val: thing to hash, can be str or bytes
    :param salt: Add randomness to the hashing, can be str or bytes
    :return: sha256 hash digest of some_val with salt, type bytes
    """
```
Note you will need to encode string values into bytes (use `.encode()`!).

As an example, `hash_str('world!', salt='hello, ').hex()[:6] == '68e656'`.
Note that if we ever ask you for a bytes value in Canvas, the expectation is the hexadecimal representation as illustrated above.
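A minimal sketch of one possible implementation, encoding both arguments and prefixing the salt before hashing:

```python
import hashlib
from typing import Union


def hash_str(some_val: Union[str, bytes], salt: Union[str, bytes] = "") -> bytes:
    """Hash some_val with an optional salt prefix, returning the raw digest."""
    if isinstance(salt, str):
        salt = salt.encode()
    if isinstance(some_val, str):
        some_val = some_val.encode()
    return hashlib.sha256(salt + some_val).digest()


assert hash_str("world!", salt="hello, ").hex()[:6] == "68e656"
```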
#### Salt and Pepper
Note, however, that hashing isn't very secure without a secure salt. We can take raw `bytes` to get something with more entropy than standard text provides.
Let's designate an environment variable, `CSCI_SALT`, which will contain hex-encoded bytes. Be careful here: `'68e656'.encode()` does not treat the input as hexadecimal!
Implement the function `pset_1.hash_str.get_csci_salt`, which pulls and decodes an environment variable. In Canvas, you will be given a random salt taken from random.org for real security.
Additionally, we will add a per-course pepper, which we will define as the UUID of the current canvas course. You will fetch this from the Canvas API. Implement the function `get_csci_pepper`. This will be added to the salt to form the final 'salt' used for secure hashing.
(Note, the author is not a master cryptographer! We may be abusing the terms 'salt' and 'pepper')
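A sketch of both helpers. The `bytes.fromhex` decoding is the key point; the pepper helper assumes the `canvasapi` course object exposes the course UUID as a `uuid` attribute, which you should verify against the API response for your course:

```python
import os


def get_csci_salt() -> bytes:
    """Return the hex-decoded salt from the CSCI_SALT env variable."""
    return bytes.fromhex(os.environ["CSCI_SALT"])


def get_csci_pepper(course) -> bytes:
    """Return the Canvas course UUID as bytes, to combine with the salt.

    Assumption: course is a canvasapi course object with a `uuid` attribute.
    """
    return course.uuid.encode()
```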
### Atomic writes
Use the module `pset_1.io`. We will implement an atomic writer.
Atomic writes are used to ensure we never have an incomplete file output. Basically, they perform the operations:
- Create a temporary file which is unique (possibly involving a random file name)
- Allow the code to take its sweet time writing to the file
- Rename the file to the target destination name.
If the target and temporary file are on the same filesystem, the rename operation is atomic - that is, it can only completely succeed or fail entirely, and you can never be left with a bad file state.
See notes in Luigi and the Thanksgiving Bug
#### Implement an atomic write
Start with the following in `io.py`:
```python
import os
from contextlib import contextmanager
from typing import ContextManager, Union


@contextmanager
def atomic_write(file: Union[str, os.PathLike], mode: str = "w", as_file: bool = True, **kwargs) -> ContextManager:
    """Write a file atomically

    :param file: str or :class:`os.PathLike` target to write
    :param bool as_file: if True, the yielded object is a :class:`File`
        (eg, what you get with `open(...)`). Otherwise, it will be the
        temporary file path string
    :param kwargs: anything else needed to open the file

    :raises: FileExistsError if target exists

    Example::

        with atomic_write("hello.txt") as f:
            f.write("world!")

    """
    ...
```
Key considerations:
- You can use tempfile, write to the same directory as the target, or both. What are the tradeoffs? Add code comments for anything critical
- Ensure the file does not already exist before you begin writing
- Ensure the file is deleted if the writing code fails
- Ensure the temporary file has the same extension(s) as the target. This is important for any code that may infer something from the path (for example, extensions can have multiple periods such as `.tar.gz`).
- If the writing code fails and you try again, the temp file should be new - you don't want the context to reopen the same temp file.
- Others?
Ensure these considerations are reflected in your unit tests!
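One possible sketch that addresses several of these considerations, writing the temp file into the target's directory so the final rename stays on one filesystem. This is not the only valid design:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(file, mode="w", as_file=True, **kwargs):
    target = os.fspath(file)
    if os.path.exists(target):
        raise FileExistsError(target)
    dirname, basename = os.path.split(target)
    # Preserve the full extension chain, eg '.tar.gz' rather than just '.gz'
    suffix = basename[basename.index("."):] if "." in basename else ""
    # mkstemp gives a fresh, unique temp file on every attempt
    fd, tmp = tempfile.mkstemp(suffix=suffix, dir=dirname or ".")
    os.close(fd)  # we reopen the path ourselves (or yield it directly)
    try:
        if as_file:
            with open(tmp, mode, **kwargs) as f:
                yield f
        else:
            yield tmp
        os.replace(tmp, target)  # atomic rename on the same filesystem
    finally:
        if os.path.exists(tmp):  # writing failed: clean up the temp file
            os.remove(tmp)
```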
Every file written in this class must be written atomically, via this function or otherwise.
### Parquet
Excel is a very poor file format compared to modern column stores. In `pset_1/__main__.py`, use Parquet via Pandas to transform the provided excel file into a better format.
The user id is already hashed for you in this file. However, you should set the index to the user ID and sort by that value before writing.
The new file should keep the same name, but with the extension changed to `.parquet`. The new file should otherwise keep the same path as the original file.
Ensure you use your atomic write (consider using `as_file=False` with `.to_parquet`). If the file exists already, you do not need to write it out again.
In `__main__.py`, read back just the hashed id column and print it. DO NOT read the entire data set!
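A sketch of the conversion. The source file name and the hashed id column name here are assumptions; use the actual file from the data download and inspect its columns:

```python
import os

import pandas as pd

from pset_1.io import atomic_write

source = "data/hashed_demographics.xlsx"  # hypothetical file name
target = os.path.splitext(source)[0] + ".parquet"

if not os.path.exists(target):
    # Assumption: the hashed user id column is named 'hashed_id'
    df = pd.read_excel(source).set_index("hashed_id").sort_index()
    with atomic_write(target, as_file=False) as tmp:
        df.to_parquet(tmp)

# Read back only the hashed ids (stored as the index), not the whole file;
# with the pyarrow engine, the pandas metadata restores the index columns
print(pd.read_parquet(target, columns=[]).index)
```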
### Your main script and submission context manager
Implement top level execution in `pset_1/__main__.py` to show your work, answer the submission quiz, and submit the assignment to Canvas via CI/CD. It can be invoked with `python -m pset_1`.
Take a look at the `submit.py` from Pset 0. You must submit the same metadata along with your submission. Copy over any relevant code into one or more modules in this pset. Think about how you can improve or generalize the code.
In addition to any other improvements you make, you should rewrite the submission (`try...finally`) using a context manager. Think about what args you should pass to enter the context, and what context object you might want to yield:
```python
with pset_submission(...) as ...:
    ...
```
Ensure the assignment is NOT submitted if the quiz submission fails!
We will use this construct for all future psets, so try to maximize the amount of generic pset work that gets done in the context manager!
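A rough sketch of the shape such a context manager might take. The yielded dict and the two callables are hypothetical stand-ins for your own design and your `canvas.py` helpers:

```python
from contextlib import contextmanager


@contextmanager
def pset_submission(submit_quiz, submit_assignment):
    """Collect answers in the managed block; submit only on clean exit.

    If the block (or the quiz submission) raises, the exception propagates
    from the yield and the assignment is never submitted.
    """
    answers = {}
    yield answers  # caller records quiz answers/metadata into this dict
    submit_quiz(answers)        # raises on failure...
    submit_assignment(answers)  # ...so this is skipped if the quiz failed
```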