[HWORKS-2662] Extensive improvements to serving docs #556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
54 commits
- 85cf573 javierdlrm: [HWORKS-2662] Extensive improvements to serving docs
- 373132d javierdlrm: Fix lint
- 2601ae0 javierdlrm: Fix lint
- 68637d0 javierdlrm: Fix lint
- 8b363c6 javierdlrm: Add info about init params and feature logging
- 503a3e8 javierdlrm: Update docs/concepts/mlops/serving.md
- 5712ae5 javierdlrm: Update docs/concepts/mlops/serving.md
- d81e4e4 javierdlrm: Update docs/concepts/mlops/serving.md
- 8a354de javierdlrm: Update docs/concepts/mlops/serving.md
- debf029 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- ec289c3 javierdlrm: Update docs/user_guides/mlops/serving/deployment.md
- 207cb7d javierdlrm: Update docs/user_guides/mlops/serving/deployment.md
- 773abf1 javierdlrm: Update docs/user_guides/mlops/serving/deployment.md
- 8c2856e javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- a9638dc javierdlrm: Update docs/user_guides/mlops/serving/troubleshooting.md
- 9d9cbf6 javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 0fe80c0 javierdlrm: Update docs/user_guides/mlops/serving/resources.md
- 90fa7f1 javierdlrm: Update docs/user_guides/mlops/serving/deployment.md
- 60a5562 javierdlrm: Address comments
- ecd0d73 javierdlrm: Fix trailing space
- 10bfd8d javierdlrm: Update docs/concepts/mlops/serving.md
- 5d72a72 javierdlrm: Update docs/user_guides/mlops/serving/external-access.md
- c8fb64e javierdlrm: Update docs/user_guides/mlops/serving/resources.md
- 3a1176d javierdlrm: Update docs/user_guides/mlops/serving/resources.md
- 281a6ab javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 666ec3e javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- bd36445 javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
- 019c2fe javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- e5e0e41 javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 6fd88e9 javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
- b706493 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- b4e1262 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- c6b2c63 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- 2e35129 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- b2a6499 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- 6d5fd68 javierdlrm: Update docs/user_guides/projects/python-deployment/rest-api.md
- 26de22c javierdlrm: Update docs/user_guides/projects/python-deployment/rest-api.md
- 0c567b1 javierdlrm: Update docs/user_guides/projects/python-deployment/troubleshooting.md
- 50f9b68 javierdlrm: Update docs/user_guides/projects/python-deployment/rest-api.md
- 29e95cc javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- 769c89d javierdlrm: Update docs/user_guides/projects/python-deployment/rest-api.md
- 81439d9 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- f9083a0 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- b35c0a4 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- 048aac6 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- 6011e45 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- 550f50c javierdlrm: Update docs/user_guides/mlops/serving/rest-api.md
- 1b5dc34 javierdlrm: Update docs/user_guides/mlops/serving/rest-api.md
- 8aebbeb javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 9a03246 javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 1f0ed83 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- d3cccd0 javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
- 92613f4 javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
- 60fa9dc javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
Binary files added:

- docs/assets/images/guides/mlops/serving/deployment_adv_form_scaling.png (+94.4 KB)
- docs/assets/images/guides/mlops/serving/deployment_simple_form_py_endp_env.png (+97.2 KB)
```diff
@@ -1,32 +1,41 @@
-In Hopsworks, you can easily deploy models from the model registry in KServe or in Docker containers (for Hopsworks Community).
-KServe is the defacto open-source framework for model serving on Kubernetes.
-You can deploy models in either programs, using the HSML library, or in the UI.
+In Hopsworks, you can easily deploy models from the model registry using [KServe](https://kserve.github.io/website/latest/), the standard open-source framework for model serving on Kubernetes.
+You can deploy models programmatically using [`Model.deploy`][hsml.model.Model.deploy] or via the UI.
 A KServe model deployment can include the following components:

-**`Transformer`**
+**`Predictor (KServe component)`**

-: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client.
+: A predictor runs a model server (Python, TensorFlow Serving, or vLLM) that loads a trained model, handles inference requests and returns predictions.

-**`Predictor`**
+**`Transformer (KServe component)`**

-: A predictor is a ML model in a Python object that takes a feature vector as input and returns a prediction as output.
+: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client.
+Not available for vLLM deployments.

 **`Inference Logger`**

 : Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model.
+Not available for vLLM deployments.

 **`Inference Batcher`**

 : Inference requests can be batched to improve throughput (at the cost of slightly higher latency).

 **`Istio Model Endpoint`**

-: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key.
+: You can publish a model over REST(HTTP) or gRPC using a Hopsworks API key, accessible via **path-based routing** through Istio.
 API keys have scopes to ensure the principle of least privilege access control to resources managed by Hopsworks.
+For more details on path-based routing of requests through Istio, see [REST API Guide](../../user_guides/mlops/serving/rest-api.md).

+!!! warning "Host-based routing"
+    The Istio Model Endpoint supports host-based routing for inference requests; however, this approach is considered legacy.
+    Path-based routing is recommended for new deployments.

 Models deployed on KServe in Hopsworks can be easily integrated with the Hopsworks Feature Store using either a Transformer or Predictor Python script, that builds the predictor's input feature vector using the application input and pre-computed features from the Feature Store.

 <img src="../../../assets/images/concepts/mlops/kserve.svg">

 !!! info "Model Serving Guide"
     More information can be found in the [Model Serving guide](../../user_guides/mlops/serving/index.md).

+!!! tip "Python deployments"
+    For deploying Python scripts without a model artifact, see the [Python Deployments](../../user_guides/projects/python-deployment/python-deployment.md) page.
```
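The Predictor/Transformer split in the diff above can be illustrated with a minimal, framework-agnostic sketch. This is plain Python, not the KServe or Hopsworks API; all class and method names here are hypothetical stand-ins for the request flow (preprocess, predict, postprocess):

```python
class Transformer:
    """Illustrative pre/post-processing component (hypothetical, not the KServe API)."""

    def preprocess(self, inputs):
        # e.g. coerce raw request fields into the feature vector the model expects
        return {"instances": [[float(x) for x in row] for row in inputs["instances"]]}

    def postprocess(self, outputs):
        # e.g. reshape raw model outputs before returning them to the client
        return {"predictions": outputs}


class Predictor:
    """Illustrative model server: loads a trained model and answers inference requests."""

    def __init__(self, model):
        self.model = model  # stand-in for a model loaded from the registry

    def predict(self, payload):
        return [self.model(row) for row in payload["instances"]]


# Request flow: client -> preprocess -> predict -> postprocess -> client
transformer, predictor = Transformer(), Predictor(model=sum)
payload = transformer.preprocess({"instances": [["1", "2"], ["3", "4"]]})
result = transformer.postprocess(predictor.predict(payload))
print(result)  # {'predictions': [3.0, 7.0]}
```

In a real KServe deployment the transformer and predictor run as separate containers, which is why they can be scaled and configured independently.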
@@ -0,0 +1,154 @@ (new file)
# How To Configure Scaling For A Deployment

## Introduction

This guide explains how to set up **autoscaling** for model deployments using either the [web UI](#web-ui) or the [Python API](#code).

Deployments use the [Knative Pod Autoscaler (KPA)](https://knative.dev/docs/serving/autoscaling/) to automatically scale the number of replicas based on traffic.
Autoscaling enables the deployment to use resources more efficiently by growing and shrinking the allocated resources according to actual, real-time usage.

See [Scale metrics](#scale-metrics) and [Scaling parameters](#scaling-parameters) for details on the available scaling options.
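At its core, the autoscaler's sizing decision can be sketched as "enough replicas that each one stays at or below its per-replica target, clamped to the configured bounds". The function below is a simplification for illustration only (it ignores window averaging and panic mode, which the real KPA uses):

```python
import math

def desired_replicas(observed_metric: float, target: float,
                     min_instances: int, max_instances: int) -> int:
    """Simplified KPA-style sizing: replicas needed so each stays
    at or below the per-replica target (RPS or concurrency),
    clamped to [min_instances, max_instances]."""
    raw = math.ceil(observed_metric / target)
    return max(min_instances, min(max_instances, raw))

# 450 RPS against a per-replica target of 100 needs 5 replicas
print(desired_replicas(observed_metric=450, target=100,
                       min_instances=1, max_instances=5))  # 5

# with min_instances=0 and no traffic, the deployment scales to zero
print(desired_replicas(observed_metric=0, target=100,
                       min_instances=0, max_instances=5))  # 0
```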
## Web UI

### Step 1: Create new deployment

If you have at least one model already trained and saved in the Model Registry, navigate to the deployments page by clicking on the `Deployments` tab in the navigation menu on the left.

<p align="center">
  <figure>
    <img src="../../../../assets/images/guides/mlops/serving/deployments_tab_sidebar.png" alt="Deployments navigation tab">
    <figcaption>Deployments navigation tab</figcaption>
  </figure>
</p>

Once on the deployments page, you can create a new deployment by either clicking on `New deployment` (if there are no existing deployments) or on `Create new deployment` in the top-right corner.
Both options will open the deployment creation form.
### Step 2: Go to advanced options

A simplified creation form will appear, including the most common deployment fields from all available configurations.
Autoscaling is part of the advanced options of a deployment.
To navigate to the advanced creation form, click on `Advanced options`.

<p align="center">
  <figure>
    <img style="max-width: 55%; margin: 0 auto" src="../../../../assets/images/guides/mlops/serving/deployment_simple_form_adv_options.png" alt="Advanced options">
    <figcaption>Advanced options. Go to the advanced deployment creation form</figcaption>
  </figure>
</p>
### Step 3: Configure autoscaling

In the `Autoscaling` section of the advanced form, you can configure the scaling parameters for the predictor and/or the transformer (if available).
You can set the scale metric, target value, minimum and maximum instances, as well as the panic and stable window parameters.

<p align="center">
  <figure>
    <img src="../../../../assets/images/guides/mlops/serving/deployment_adv_form_scaling.png" alt="Autoscaling configuration for the predictor and transformer components">
    <figcaption>Autoscaling configuration for the predictor and transformer</figcaption>
  </figure>
</p>

Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model.
## Code

### Step 1: Connect to Hopsworks

=== "Python"

    ```python
    import hopsworks

    project = hopsworks.login()

    # get Hopsworks Model Registry handle
    mr = project.get_model_registry()

    # get Hopsworks Model Serving handle
    ms = project.get_model_serving()
    ```
### Step 2: Define the predictor scaling configuration

You can use the [`PredictorScalingConfig`][hsml.scaling_config.PredictorScalingConfig] class to configure the scaling options according to your preferences.
Default values for scaling metrics and parameters are listed in the [Scale metrics](#scale-metrics) and [Scaling parameters](#scaling-parameters) sections below.

=== "Python"

    ```python
    from hsml.scaling_config import PredictorScalingConfig

    predictor_scaling = PredictorScalingConfig(
        min_instances=1, max_instances=5, scale_metric="RPS", target=100
    )
    ```
### Step 3 (Optional): Define the transformer scaling configuration

If a transformer script is also provided, you can use the [`TransformerScalingConfig`][hsml.scaling_config.TransformerScalingConfig] class to configure the scaling options according to your preferences.
Default values for scaling metrics and parameters are listed in the [Scale metrics](#scale-metrics) and [Scaling parameters](#scaling-parameters) sections below.

=== "Python"

    ```python
    from hsml.scaling_config import TransformerScalingConfig

    transformer_scaling = TransformerScalingConfig(
        min_instances=1, max_instances=3, scale_metric="CONCURRENCY", target=50
    )
    ```
### Step 4: Create a deployment with the scaling configuration

=== "Python"

    ```python
    my_model = mr.get_model("my_model", version=1)

    # optional
    my_transformer = ms.create_transformer(
        script_file="Resources/my_transformer.py",
        scaling_configuration=transformer_scaling
    )

    my_deployment = my_model.deploy(
        scaling_configuration=predictor_scaling,
        # optional:
        transformer=my_transformer
    )
    ```
### API Reference

[`PredictorScalingConfig`][hsml.scaling_config.PredictorScalingConfig]

[`TransformerScalingConfig`][hsml.scaling_config.TransformerScalingConfig]
## Scale metrics

The autoscaler supports two metrics to determine when to scale.
See [Knative autoscaling metrics](https://knative.dev/docs/serving/autoscaling/autoscaling-metrics/) for more details.

| Scale Metric | Default Target | Description                     |
| ------------ | -------------- | ------------------------------- |
| RPS          | 200            | Requests per second per replica |
| CONCURRENCY  | 100            | Concurrent requests per replica |
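The two metrics are related through request latency: by Little's law, the average number of in-flight requests per replica is roughly RPS times average latency in seconds, which can help when choosing a target. A small illustrative check (not part of the Hopsworks API):

```python
def expected_concurrency(rps: float, avg_latency_seconds: float) -> float:
    """Little's law: average in-flight requests = arrival rate * time in system."""
    return rps * avg_latency_seconds

# At the default RPS target of 200, a model with 500 ms average latency
# keeps about 100 requests in flight per replica, which is exactly the
# default CONCURRENCY target.
print(expected_concurrency(rps=200, avg_latency_seconds=0.5))  # 100.0
```

For slower models, a CONCURRENCY target tends to track saturation more directly than RPS, since latency is already factored in.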
## Scaling parameters

The following parameters can be used to fine-tune the autoscaling behavior.
See [scale bounds](https://knative.dev/docs/serving/autoscaling/scale-bounds/), [autoscaling concepts](https://knative.dev/docs/serving/autoscaling/autoscaling-concepts/) and [scale-to-zero](https://knative.dev/docs/serving/autoscaling/scale-to-zero/) in the Knative documentation for more details.

| Parameter                     | Default | Range  | Description                                 |
| ----------------------------- | ------- | ------ | ------------------------------------------- |
| `minInstances`                | —       | ≥ 0    | Minimum replicas (0 enables scale-to-zero)  |
| `maxInstances`                | —       | ≥ 1    | Maximum replicas (cannot be less than min)  |
| `panicWindowPercentage`       | 10.0    | 1–100  | Panic window as percentage of stable window |
| `stableWindowSeconds`         | 60      | 6–3600 | Stable window duration in seconds           |
| `panicThresholdPercentage`    | 200.0   | > 0    | Traffic threshold to trigger panic mode     |
| `scaleToZeroRetentionSeconds` | 0       | ≥ 0    | Time to retain pods before scaling to zero  |

!!! note "Cluster-level constraints"
    ==Administrators== can set cluster-wide limits on the maximum and minimum number of instances. When the minimum is set to 0, scale-to-zero is enforced for all deployments.
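To make the panic parameters concrete: the panic window is a short slice of the stable window (10% of 60 s is 6 s by default), and panic mode engages when traffic over that short window exceeds `panicThresholdPercentage` of the current capacity. The function below is a deliberately simplified sketch of that trigger condition, not Knative or Hopsworks code:

```python
def panic_mode_triggered(panic_window_rps: float, current_replicas: int,
                         target_rps: float,
                         panic_threshold_pct: float = 200.0) -> bool:
    """Simplified panic trigger: traffic averaged over the short panic
    window exceeds panic_threshold_pct percent of current capacity
    (replicas * per-replica target)."""
    capacity = max(current_replicas, 1) * target_rps
    return panic_window_rps >= capacity * (panic_threshold_pct / 100.0)

# 2 replicas at a 200 RPS target nominally absorb 400 RPS; with the
# default 200% threshold, panic mode starts at a sustained 800 RPS.
print(panic_mode_triggered(panic_window_rps=850, current_replicas=2,
                           target_rps=200))  # True
print(panic_mode_triggered(panic_window_rps=500, current_replicas=2,
                           target_rps=200))  # False
```

While in panic mode, the autoscaler scales up aggressively and does not scale down until traffic drops back within the stable window's bounds.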