Python Preloaded - Bundle Python executable with preloaded modules

Python Preloaded

Project repo: https://github.com/albertz/python-preloaded

Problem:

CPython startup, including the import of big libraries such as PyTorch or TensorFlow, is too slow. On slow file systems, I have seen startup times of 10-20 seconds with such imports.

Very simple idea:

Keep the state of CPython right after the big libraries have been imported and make it available instantly when needed. After loading that state, we can go on to run any arbitrary Python script (for example via runpy).
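To illustrate the runpy part of the idea, here is a minimal sketch (not the project's actual code). The helper name run_script_in_preloaded_state is hypothetical, and json merely stands in for an expensive import such as tensorflow. It assumes the interpreter has already paid the import cost and then runs an arbitrary script as __main__ in that same process.

```python
import runpy
import sys

# Stand-in for the expensive import (tensorflow, torch, ...). The cost is paid
# once, in the process that is kept around.
import json  # noqa: F401


def run_script_in_preloaded_state(script_path, argv=()):
    """Run script_path as __main__ inside the current, already warm interpreter.

    Hypothetical helper for illustration. Returns the script's globals.
    """
    old_argv = sys.argv
    # Make the script see itself as argv[0], like "python script.py args...".
    sys.argv = [script_path, *argv]
    try:
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        sys.argv = old_argv
```

Modules that are already present in sys.modules are not imported again, so an "import tensorflow" inside the script returns almost immediately.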

Installation

pip install preloaded

Now you should be able to run py-preloaded-bundle-fork-server.py. For usage, see the example below.

Method 1: Fork server

Start CPython and import the libraries. Then keep the process running as a fork server. Whenever a new instance is needed, we make a fork (os.fork) and apply logic similar to reptyr. Some technical details are here.

This solution is very portable across Unix systems. So far I have tested it on Linux and MacOSX, but it should run on most other Unixes as well.
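The core of the fork-server idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the function name handle_request and the wire format (the script path sent as plain bytes, the exit code sent back as text) are made up for this sketch, the reptyr-like PTY handover is left out entirely, and the parent simply blocks until the child is done.

```python
import os
import runpy
import socket
import sys


def handle_request(conn):
    """Fork a child that runs the requested script; return its exit code.

    conn is a connected Unix socket. The client is assumed to send the
    script path as UTF-8 bytes (hypothetical protocol for this sketch).
    """
    script_path = conn.recv(4096).decode()
    sys.stdout.flush()  # avoid duplicating buffered output in the child
    pid = os.fork()
    if pid == 0:
        # Child: all modules imported by the server are already in memory.
        code = 0
        try:
            sys.argv = [script_path]
            runpy.run_path(script_path, run_name="__main__")
        except SystemExit as exc:
            if exc.code is None:
                code = 0
            elif isinstance(exc.code, int):
                code = exc.code
            else:
                code = 1
        except BaseException:
            code = 1
        finally:
            sys.stdout.flush()
            os._exit(code)  # never return into the server loop
    # Parent (the server): wait for the child and report its exit code.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        exit_code = os.WEXITSTATUS(status)
    else:
        exit_code = -os.WTERMSIG(status)
    conn.sendall(str(exit_code).encode())
    return exit_code
```

A real server would accept connections in a loop on a Unix domain socket and would not block on each child. The sketch only shows why the child starts quickly: it is a copy of a process that has already done the imports.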

Example

Create the starter script python-tf.bin:

$ py-preloaded-bundle-fork-server.py tensorflow -o python-tf.bin

This starter script is meant to be a drop-in replacement for python itself.

For testing, there is demo-import-tensorflow.py, with only the following content:

import tensorflow as tf
print("TF:", tf.__version__)

Now try to run it directly, and measure the time:

$ time python3 demo-import-tensorflow.py
TF: 2.3.0

________________________________________________________
Executed in    8.31 secs    fish           external
   usr time    3.39 secs  278.00 micros    3.39 secs
   sys time    0.67 secs   83.00 micros    0.67 secs

This is on a slow filesystem, NFS specifically, and already after the files are cached (I ran the same command immediately before). Without the cache, the startup time is even over 14 seconds.

The starter script has not been run yet, so the first start is just as slow:

$ time ./python-tf.bin demo-import-tensorflow.py
Existing socket but can not connect: [Errno 111] Connection refused
Import module: tensorflow
TF: 2.3.0

________________________________________________________
Executed in    8.35 secs    fish           external
   usr time    3.19 secs  768.00 micros    3.19 secs
   sys time    0.72 secs  228.00 micros    0.72 secs

The fork server is now running in the background. It is not tied to demo-import-tensorflow.py and could now run any other script as well. However, we continue the demo with the same script:

$ time ./python-tf.bin demo-import-tensorflow.py
Existing socket, connected
Open new PTY
Send PTY fd to server
Wait for server to be ready
Entering PTY proxy loop
TF: 2.3.0

________________________________________________________
Executed in  261.56 millis    fish           external
   usr time   64.24 millis  542.00 micros   63.70 millis
   sys time   33.59 millis  163.00 micros   33.43 millis

As you can see, startup now takes about 0.26 seconds instead of about 8.3 seconds. It is just as fast when executed at a later time, when the files are no longer cached.
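The log line "Send PTY fd to server" above suggests that a file descriptor is handed from the client to the server over the Unix socket. A generic way to do that in Python 3.9+ is socket.send_fds and socket.recv_fds. The sketch below is an assumption about the general technique, not a copy of the project's code, and the helper names send_fd and recv_fd are hypothetical.

```python
import os
import socket


def send_fd(sock, fd):
    """Send one open file descriptor over a connected Unix domain socket."""
    # At least one byte of regular data must accompany the ancillary data.
    socket.send_fds(sock, [b"F"], [fd])


def recv_fd(sock):
    """Receive one file descriptor; returns a new fd valid in this process."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]
```

In the fork-server setting, the client would open a PTY (for example with pty.openpty()), send one end to the server, and the forked child would use that descriptor as its stdin, stdout and stderr.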

Interactively test the starter script environment:

$ ./python-tf.bin -m IPython

Method 2: Process pool

We always keep a pool (e.g. N=10 instances) of CPython processes with the libraries preloaded alive in the background, and when we need a new instance, we just pick one from the pool.

This shares a lot of logic with the fork server. The main difference is that we use subprocess.Popen instead of os.fork.
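Since this method is not implemented in the project, the following is only a sketch of how such a pool could look. All names (WarmPool, WORKER_CODE) and the protocol (the script path sent as one line on the worker's stdin) are assumptions for illustration, and json again stands in for the expensive import.

```python
import subprocess
import sys
from collections import deque

# Code run by every worker: do the expensive import, then block until a
# script path arrives on stdin. An empty line means "just exit".
WORKER_CODE = """
import sys, runpy
import json  # stand-in for the expensive import
script = sys.stdin.readline().strip()
if script:
    sys.argv = [script]
    runpy.run_path(script, run_name="__main__")
"""


class WarmPool:
    """Keep `size` preloaded interpreters idle and hand them out on demand."""

    def __init__(self, size=2):
        self.idle = deque(self._spawn() for _ in range(size))

    def _spawn(self):
        return subprocess.Popen(
            [sys.executable, "-c", WORKER_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def run(self, script_path):
        """Run script_path in a warm worker; return (exit code, stdout)."""
        worker = self.idle.popleft()
        self.idle.append(self._spawn())  # refill the pool right away
        out, _ = worker.communicate(script_path + "\n")
        return worker.returncode, out

    def close(self):
        """Tell all idle workers to exit."""
        for worker in self.idle:
            worker.communicate("\n")
        self.idle.clear()
```

Unlike the fork server, each worker pays the import cost itself, but it does so in the background before it is needed.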

(Currently not implemented)

Method 3: Program checkpoint on disk

Use a checkpointing tool (CRIU) to store the state of CPython right after we imported the libraries. Later we can load this checkpoint, which is very fast.
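As a rough illustration of what the dump and restore steps could look like, the sketch below only builds the criu command lines and does not execute them. The helper names criu_dump_cmd and criu_restore_cmd and the images directory are hypothetical; the options used (--tree, --images-dir, --shell-job) are standard criu options, but whether they are sufficient for a given Python process is an assumption.

```python
import shlex


def criu_dump_cmd(pid, images_dir):
    """Command line to checkpoint the process tree rooted at pid."""
    return [
        "criu", "dump",
        "--tree", str(pid),          # root of the process tree to dump
        "--images-dir", images_dir,  # where the checkpoint images go
        "--shell-job",               # process was started from a terminal
    ]


def criu_restore_cmd(images_dir):
    """Command line to restore a checkpoint from images_dir."""
    return ["criu", "restore", "--images-dir", images_dir, "--shell-job"]


if __name__ == "__main__":
    # Print the commands instead of running them.
    print(shlex.join(criu_dump_cmd(12345, "/tmp/py-checkpoint")))
    print(shlex.join(criu_restore_cmd("/tmp/py-checkpoint")))
```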

CRIU currently needs root access for dump/restore. However, there is ongoing work to support a non-root option in https://github.com/checkpoint-restore/criu/pull/1930.

Or maybe DMTCP is a better alternative to CRIU?

(Currently incomplete)
