[HWORKS-2662] Extensive improvements to serving docs #556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
54 commits
- 85cf573 javierdlrm: [HWORKS-2662] Extensive improvements to serving docs
- 373132d javierdlrm: Fix lint
- 2601ae0 javierdlrm: Fix lint
- 68637d0 javierdlrm: Fix lint
- 8b363c6 javierdlrm: Add info about init params and feature logging
- 503a3e8 javierdlrm: Update docs/concepts/mlops/serving.md
- 5712ae5 javierdlrm: Update docs/concepts/mlops/serving.md
- d81e4e4 javierdlrm: Update docs/concepts/mlops/serving.md
- 8a354de javierdlrm: Update docs/concepts/mlops/serving.md
- debf029 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- ec289c3 javierdlrm: Update docs/user_guides/mlops/serving/deployment.md
- 207cb7d javierdlrm: Update docs/user_guides/mlops/serving/deployment.md
- 773abf1 javierdlrm: Update docs/user_guides/mlops/serving/deployment.md
- 8c2856e javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- a9638dc javierdlrm: Update docs/user_guides/mlops/serving/troubleshooting.md
- 9d9cbf6 javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 0fe80c0 javierdlrm: Update docs/user_guides/mlops/serving/resources.md
- 90fa7f1 javierdlrm: Update docs/user_guides/mlops/serving/deployment.md
- 60a5562 javierdlrm: Address comments
- ecd0d73 javierdlrm: Fix trailing space
- 10bfd8d javierdlrm: Update docs/concepts/mlops/serving.md
- 5d72a72 javierdlrm: Update docs/user_guides/mlops/serving/external-access.md
- c8fb64e javierdlrm: Update docs/user_guides/mlops/serving/resources.md
- 3a1176d javierdlrm: Update docs/user_guides/mlops/serving/resources.md
- 281a6ab javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 666ec3e javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- bd36445 javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
- 019c2fe javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- e5e0e41 javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 6fd88e9 javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
- b706493 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- b4e1262 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- c6b2c63 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- 2e35129 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- b2a6499 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- 6d5fd68 javierdlrm: Update docs/user_guides/projects/python-deployment/rest-api.md
- 26de22c javierdlrm: Update docs/user_guides/projects/python-deployment/rest-api.md
- 0c567b1 javierdlrm: Update docs/user_guides/projects/python-deployment/troubleshooting.md
- 50f9b68 javierdlrm: Update docs/user_guides/projects/python-deployment/rest-api.md
- 29e95cc javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- 769c89d javierdlrm: Update docs/user_guides/projects/python-deployment/rest-api.md
- 81439d9 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- f9083a0 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- b35c0a4 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- 048aac6 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- 6011e45 javierdlrm: Update docs/user_guides/mlops/serving/autoscaling.md
- 550f50c javierdlrm: Update docs/user_guides/mlops/serving/rest-api.md
- 1b5dc34 javierdlrm: Update docs/user_guides/mlops/serving/rest-api.md
- 8aebbeb javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 9a03246 javierdlrm: Update docs/user_guides/mlops/serving/scheduling.md
- 1f0ed83 javierdlrm: Update docs/user_guides/projects/python-deployment/python-deployment.md
- d3cccd0 javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
- 92613f4 javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
- 60fa9dc javierdlrm: Update docs/user_guides/mlops/serving/transformer.md
Binary files added:

- docs/assets/images/guides/mlops/serving/deployment_adv_form_scaling.png (+94.4 KB)
- docs/assets/images/guides/mlops/serving/deployment_simple_form_py_endp_env.png (+97.2 KB)
```diff
@@ -1,32 +1,41 @@
-In Hopsworks, you can easily deploy models from the model registry in KServe or in Docker containers (for Hopsworks Community).
-KServe is the defacto open-source framework for model serving on Kubernetes.
-You can deploy models in either programs, using the HSML library, or in the UI.
+In Hopsworks, you can easily deploy models from the model registry using [KServe](https://kserve.github.io/website/latest/), the standard open-source framework for model serving on Kubernetes.
+You can deploy models programmatically using [`Model.deploy`][hsml.model.Model.deploy] or via the UI.
 A KServe model deployment can include the following components:

-**`Transformer`**
+**`Predictor (KServe component)`**

-: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client.
+: A predictor runs a model server (Python, TensorFlow Serving, or vLLM) that loads a trained model, handles inference requests and returns predictions.

-**`Predictor`**
+**`Transformer (KServe component)`**

-: A predictor is a ML model in a Python object that takes a feature vector as input and returns a prediction as output.
+: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client.
+Not available for vLLM deployments.

 **`Inference Logger`**

 : Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model.
+Not available for vLLM deployments.

 **`Inference Batcher`**

 : Inference requests can be batched to improve throughput (at the cost of slightly higher latency).

 **`Istio Model Endpoint`**

-: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key.
+: You can publish a model over REST(HTTP) or gRPC using a Hopsworks API key, accessible via **path-based routing** through Istio.
 API keys have scopes to ensure the principle of least privilege access control to resources managed by Hopsworks.
+For more details on path-based routing of requests through Istio, see [REST API Guide](../../user_guides/mlops/serving/rest-api.md).

+!!! warning "Host-based routing"
+    The Istio Model Endpoint supports host-based routing for inference requests; however, this approach is considered legacy.
+    Path-based routing is recommended for new deployments.

 Models deployed on KServe in Hopsworks can be easily integrated with the Hopsworks Feature Store using either a Transformer or Predictor Python script, that builds the predictor's input feature vector using the application input and pre-computed features from the Feature Store.

 <img src="../../../assets/images/concepts/mlops/kserve.svg">

 !!! info "Model Serving Guide"
     More information can be found in the [Model Serving guide](../../user_guides/mlops/serving/index.md).

+!!! tip "Python deployments"
+    For deploying Python scripts without a model artifact, see the [Python Deployments](../../user_guides/projects/python-deployment/python-deployment.md) page.
```
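The Predictor/Transformer split in the diff above can be illustrated with a minimal, framework-agnostic sketch. This is plain Python, not the KServe or Hopsworks API; all class and method names here are hypothetical stand-ins for the request flow (preprocess, predict, postprocess):

```python
class Transformer:
    """Illustrative pre/post-processing component (hypothetical, not the KServe API)."""

    def preprocess(self, inputs):
        # e.g. coerce raw request fields into the feature vector the model expects
        return {"instances": [[float(x) for x in row] for row in inputs["instances"]]}

    def postprocess(self, outputs):
        # e.g. reshape raw model outputs before returning them to the client
        return {"predictions": outputs}


class Predictor:
    """Illustrative model server: loads a trained model and answers inference requests."""

    def __init__(self, model):
        self.model = model  # stand-in for a model loaded from the registry

    def predict(self, payload):
        return [self.model(row) for row in payload["instances"]]


# Request flow: client -> preprocess -> predict -> postprocess -> client
transformer, predictor = Transformer(), Predictor(model=sum)
payload = transformer.preprocess({"instances": [["1", "2"], ["3", "4"]]})
result = transformer.postprocess(predictor.predict(payload))
print(result)  # {'predictions': [3.0, 7.0]}
```

In a real KServe deployment the transformer and predictor run as separate containers, which is why they can be scaled and configured independently.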
@@ -0,0 +1,154 @@ (new file)
# How To Configure Scaling For A Deployment

## Introduction

This guide explains how to set up **autoscaling** for model deployments using either the [web UI](#web-ui) or the [Python API](#code).

Deployments use the [Knative Pod Autoscaler (KPA)](https://knative.dev/docs/serving/autoscaling/) to automatically scale the number of replicas based on traffic.
Autoscaling enables the deployment to use resources more efficiently by growing and shrinking the allocated resources according to actual, real-time usage.

See [Scale metrics](#scale-metrics) and [Scaling parameters](#scaling-parameters) for details on the available scaling options.
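At its core, the autoscaler's sizing decision can be sketched as "enough replicas that each one stays at or below its per-replica target, clamped to the configured bounds". The function below is a simplification for illustration only (it ignores window averaging and panic mode, which the real KPA uses):

```python
import math

def desired_replicas(observed_metric: float, target: float,
                     min_instances: int, max_instances: int) -> int:
    """Simplified KPA-style sizing: replicas needed so each stays
    at or below the per-replica target (RPS or concurrency),
    clamped to [min_instances, max_instances]."""
    raw = math.ceil(observed_metric / target)
    return max(min_instances, min(max_instances, raw))

# 450 RPS against a per-replica target of 100 needs 5 replicas
print(desired_replicas(observed_metric=450, target=100,
                       min_instances=1, max_instances=5))  # 5

# with min_instances=0 and no traffic, the deployment scales to zero
print(desired_replicas(observed_metric=0, target=100,
                       min_instances=0, max_instances=5))  # 0
```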
## Web UI

### Step 1: Create new deployment

If you have at least one model already trained and saved in the Model Registry, navigate to the deployments page by clicking on the `Deployments` tab in the navigation menu on the left.

<p align="center">
  <figure>
    <img src="../../../../assets/images/guides/mlops/serving/deployments_tab_sidebar.png" alt="Deployments navigation tab">
    <figcaption>Deployments navigation tab</figcaption>
  </figure>
</p>

Once on the deployments page, you can create a new deployment by either clicking on `New deployment` (if there are no existing deployments) or on `Create new deployment` in the top-right corner.
Both options will open the deployment creation form.
### Step 2: Go to advanced options

A simplified creation form will appear, including the most common deployment fields from all available configurations.
Autoscaling is part of the advanced options of a deployment.
To navigate to the advanced creation form, click on `Advanced options`.

<p align="center">
  <figure>
    <img style="max-width: 55%; margin: 0 auto" src="../../../../assets/images/guides/mlops/serving/deployment_simple_form_adv_options.png" alt="Advanced options">
    <figcaption>Advanced options. Go to the advanced deployment creation form</figcaption>
  </figure>
</p>
### Step 3: Configure autoscaling

In the `Autoscaling` section of the advanced form, you can configure the scaling parameters for the predictor and/or the transformer (if available).
You can set the scale metric, target value, minimum and maximum instances, as well as the panic and stable window parameters.

<p align="center">
  <figure>
    <img src="../../../../assets/images/guides/mlops/serving/deployment_adv_form_scaling.png" alt="Autoscaling configuration for the predictor and transformer components">
    <figcaption>Autoscaling configuration for the predictor and transformer</figcaption>
  </figure>
</p>

Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model.
## Code

### Step 1: Connect to Hopsworks

=== "Python"

    ```python
    import hopsworks

    project = hopsworks.login()

    # get Hopsworks Model Registry handle
    mr = project.get_model_registry()

    # get Hopsworks Model Serving handle
    ms = project.get_model_serving()
    ```
### Step 2: Define the predictor scaling configuration

You can use the [`PredictorScalingConfig`][hsml.scaling_config.PredictorScalingConfig] class to configure the scaling options according to your preferences.
Default values for scaling metrics and parameters are listed in the [Scale metrics](#scale-metrics) and [Scaling parameters](#scaling-parameters) sections below.

=== "Python"

    ```python
    from hsml.scaling_config import PredictorScalingConfig

    predictor_scaling = PredictorScalingConfig(
        min_instances=1, max_instances=5, scale_metric="RPS", target=100
    )
    ```
### Step 3 (Optional): Define the transformer scaling configuration

If a transformer script is also provided, you can use the [`TransformerScalingConfig`][hsml.scaling_config.TransformerScalingConfig] class to configure the scaling options according to your preferences.
Default values for scaling metrics and parameters are listed in the [Scale metrics](#scale-metrics) and [Scaling parameters](#scaling-parameters) sections below.

=== "Python"

    ```python
    from hsml.scaling_config import TransformerScalingConfig

    transformer_scaling = TransformerScalingConfig(
        min_instances=1, max_instances=3, scale_metric="CONCURRENCY", target=50
    )
    ```
### Step 4: Create a deployment with the scaling configuration

=== "Python"

    ```python
    my_model = mr.get_model("my_model", version=1)

    # optional
    my_transformer = ms.create_transformer(
        script_file="Resources/my_transformer.py",
        scaling_configuration=transformer_scaling
    )

    my_deployment = my_model.deploy(
        scaling_configuration=predictor_scaling,
        # optional:
        transformer=my_transformer
    )
    ```
### API Reference

[`PredictorScalingConfig`][hsml.scaling_config.PredictorScalingConfig]

[`TransformerScalingConfig`][hsml.scaling_config.TransformerScalingConfig]
## Scale metrics

The autoscaler supports two metrics to determine when to scale.
See [Knative autoscaling metrics](https://knative.dev/docs/serving/autoscaling/autoscaling-metrics/) for more details.

| Scale Metric | Default Target | Description                     |
| ------------ | -------------- | ------------------------------- |
| RPS          | 200            | Requests per second per replica |
| CONCURRENCY  | 100            | Concurrent requests per replica |
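The two metrics are related through request latency: by Little's law, the average number of in-flight requests per replica is roughly RPS times average latency in seconds, which can help when choosing a target. A small illustrative check (not part of the Hopsworks API):

```python
def expected_concurrency(rps: float, avg_latency_seconds: float) -> float:
    """Little's law: average in-flight requests = arrival rate * time in system."""
    return rps * avg_latency_seconds

# At the default RPS target of 200, a model with 500 ms average latency
# keeps about 100 requests in flight per replica, which is exactly the
# default CONCURRENCY target.
print(expected_concurrency(rps=200, avg_latency_seconds=0.5))  # 100.0
```

For slower models, a CONCURRENCY target tends to track saturation more directly than RPS, since latency is already factored in.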
## Scaling parameters

The following parameters can be used to fine-tune the autoscaling behavior.
See [scale bounds](https://knative.dev/docs/serving/autoscaling/scale-bounds/), [autoscaling concepts](https://knative.dev/docs/serving/autoscaling/autoscaling-concepts/) and [scale-to-zero](https://knative.dev/docs/serving/autoscaling/scale-to-zero/) in the Knative documentation for more details.

| Parameter                     | Default | Range  | Description                                 |
| ----------------------------- | ------- | ------ | ------------------------------------------- |
| `minInstances`                | —       | ≥ 0    | Minimum replicas (0 enables scale-to-zero)  |
| `maxInstances`                | —       | ≥ 1    | Maximum replicas (cannot be less than min)  |
| `panicWindowPercentage`       | 10.0    | 1–100  | Panic window as percentage of stable window |
| `stableWindowSeconds`         | 60      | 6–3600 | Stable window duration in seconds           |
| `panicThresholdPercentage`    | 200.0   | > 0    | Traffic threshold to trigger panic mode     |
| `scaleToZeroRetentionSeconds` | 0       | ≥ 0    | Time to retain pods before scaling to zero  |

!!! note "Cluster-level constraints"
    ==Administrators== can set cluster-wide limits on the maximum and minimum number of instances. When the minimum is set to 0, scale-to-zero is enforced for all deployments.
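To make the panic parameters concrete: the panic window is a short slice of the stable window (10% of 60 s is 6 s by default), and panic mode engages when traffic over that short window exceeds `panicThresholdPercentage` of the current capacity. The function below is a deliberately simplified sketch of that trigger condition, not Knative or Hopsworks code:

```python
def panic_mode_triggered(panic_window_rps: float, current_replicas: int,
                         target_rps: float,
                         panic_threshold_pct: float = 200.0) -> bool:
    """Simplified panic trigger: traffic averaged over the short panic
    window exceeds panic_threshold_pct percent of current capacity
    (replicas * per-replica target)."""
    capacity = max(current_replicas, 1) * target_rps
    return panic_window_rps >= capacity * (panic_threshold_pct / 100.0)

# 2 replicas at a 200 RPS target nominally absorb 400 RPS; with the
# default 200% threshold, panic mode starts at a sustained 800 RPS.
print(panic_mode_triggered(panic_window_rps=850, current_replicas=2,
                           target_rps=200))  # True
print(panic_mode_triggered(panic_window_rps=500, current_replicas=2,
                           target_rps=200))  # False
```

While in panic mode, the autoscaler scales up aggressively and does not scale down until traffic drops back within the stable window's bounds.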