Abstraction layer for Living Standards Measurement Survey data

* Streaming dvc files
A =dvc pull= will download dvc-tracked files to your local
repository. But this may not be the best way to proceed! In
particular, =dvc= offers a Python API that lets you "stream" or
cache files, keeping your working repository free of big data
files.

To illustrate,
#+begin_src python
import dvc.api
import pandas as pd

with dvc.api.open('BigRemoteFile.dta', mode='rb') as dta:
    df = pd.read_stata(dta)
#+end_src
This will result in a =pandas.DataFrame= in RAM while using no
additional disk space (though, depending on what's being used as
the dvc store, the file may actually be cached in =.dvc/cache=;
this cache can be cleared with =dvc gc=).

** Pulling dvc files
If you need the actual file instead of a "stream", you can instead
"pull" the dvc files, using
#+begin_src sh
dvc pull
#+end_src
which will copy the files from the remote dvc data store into your
working repository.

* Adding New Data
** Additional S3 Credentials
Write access to the remote S3 store requires additional credentials; contact =ligon@berkeley.edu= to obtain these.
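
Once you have credentials, you can attach them to the dvc remote
without committing them to =git=. A sketch, assuming the remote is
named =lsms= (check =dvc remote list= for the actual name, and
substitute the key values you were issued):
#+begin_src sh
# --local writes to .dvc/config.local, which git does not track,
# so the secrets never end up in the repository history.
dvc remote modify --local lsms access_key_id 'YOUR_KEY_ID'
dvc remote modify --local lsms secret_access_key 'YOUR_SECRET_KEY'
#+end_src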

** Procedure to Add Data
To add a new LSMS-style survey to the repo, follow these steps.
Here we give the example of adding a 2015-16 survey from Uganda,
obtained from
https://microdata.worldbank.org/index.php/catalog/3460. The same
steps should work for you /mutatis mutandis/:

1. Create a directory corresponding to the country or area; e.g.,
#+begin_src sh
mkdir Uganda
#+end_src
2. Create a /sub/-directory indicating the time period for the
survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16
#+end_src
3. Create a =Documentation= sub-directory for each survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16/Documentation
#+end_src
In this directory include the following files:
- SOURCE :: A text file giving both a url (if available) and
citation information for the dataset.
- LICENSE :: A text file containing a description of the license
or other terms under which you've obtained the data.
4. Add other documentation useful for understanding the data to the
=Documentation= sub-directory.

5. Add all the contents of the =Documentation= folder to the =git= repo;
e.g.,
#+begin_src sh
cd ./Uganda/2015-16/Documentation
git add .
git commit -m"Add Uganda 2015-16 documentation to repo."
git push
#+end_src

6. Create a =Data= sub-directory for each survey; e.g.,
#+begin_src sh
mkdir Uganda/2015-16/Data
#+end_src

7. Obtain a copy of the data you're interested in, perhaps as a zip
file or other archive. Store this in some temporary place, and
unzip (or whatever) the files into the relevant Country/Year/Data
directory, taking care to preserve any useful directory structure
in the archive. E.g.,
#+begin_src sh
cd Uganda/2015-16/Data && unzip -j /tmp/UGA_2015_UNPS_v01_M_STATA8.zip
#+end_src
(Note that =unzip -j= discards the archive's internal paths; omit
=-j= if you want to keep them.)
8. Add the data you've unarchived to =dvc=, then add the /pointers/
(i.e., the files with a =.dvc= extension) to =git=. For the Uganda
case we assume that all the relevant data comes in the form of
=stata= *.dta files, since this is what we downloaded from the
World Bank. For example, from within =Uganda/2015-16/Data=,
#+begin_src sh
dvc add *.dta
git add *.dvc .gitignore
git commit -m"Add Uganda/2015-16/Data/*.dta files to dvc store."
git pull && git push
#+end_src
9. Push the data files to the dvc store. Make sure you have a good
internet connection! Then a simple
#+begin_src sh
dvc push
#+end_src
will copy the data to the remote data store. NB: If this is the
first time you've done this for this repository, you may first
need to jump through some simple hoops to authenticate with the
remote store (e.g., Google Drive).
10. With the files pushed to the dvc store, you won't need them
locally anymore, so from the repository root you can do something
like
#+begin_src sh
rm Uganda/2015-16/Data/*.dta
#+end_src
or (if you have a more complex directory structure) perhaps
#+begin_src sh
find Uganda/2015-16/Data -name '*.dta' -exec rm {} \;
#+end_src
