image

Few models may have a 120 - 200 Second, one time boot up time.

We have a huge catalogue of models. To make good use of resources, we only run the models that are actually being used. When a model hasn't been used for a little while, we turn it off.

When you make a request to run a prediction on a model, you'll get a fast response if the model is "warm" (already running), and a slower response if the model is "cold" (starting up). Machine learning models are often very large and resource intensive, and we have to fetch and load several gigabytes of code for some models. In some cases this process can take several minutes.

Cold boots can also happen when there's a big spike in demand. We autoscale by running multiple copies of a model on different machines, but the model can take a while to become ready.

For popular public models, cold boots are uncommon because the model is kept "warm" from all the activity. For less-frequently used models, cold boots are more frequent.