diff --git a/docs/assets/images/guides/mlops/serving/deployment_adv_form_scaling.png b/docs/assets/images/guides/mlops/serving/deployment_adv_form_scaling.png new file mode 100644 index 0000000000..22cc64ae16 Binary files /dev/null and b/docs/assets/images/guides/mlops/serving/deployment_adv_form_scaling.png differ diff --git a/docs/assets/images/guides/mlops/serving/deployment_endpoints.png b/docs/assets/images/guides/mlops/serving/deployment_endpoints.png deleted file mode 100644 index c97a835987..0000000000 Binary files a/docs/assets/images/guides/mlops/serving/deployment_endpoints.png and /dev/null differ diff --git a/docs/assets/images/guides/mlops/serving/deployment_simple_form_py_endp_env.png b/docs/assets/images/guides/mlops/serving/deployment_simple_form_py_endp_env.png new file mode 100644 index 0000000000..285ebb899b Binary files /dev/null and b/docs/assets/images/guides/mlops/serving/deployment_simple_form_py_endp_env.png differ diff --git a/docs/concepts/hopsworks.md b/docs/concepts/hopsworks.md index fffc95efbc..fdeb6c1fba 100644 --- a/docs/concepts/hopsworks.md +++ b/docs/concepts/hopsworks.md @@ -32,4 +32,4 @@ Data can be also be securely shared between projects. ## Data Science Platform You can develop feature engineering, model training and inference pipelines in Hopsworks. -There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, many bundled modular project python environments for managing python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow. +There is support for version control (GitHub, GitLab, BitBucket), Jupyter notebooks, a shared distributed file system, many bundled modular project Python environments for managing Python dependencies without needing to write Dockerfiles, jobs (Python, Spark, Flink), and workflow orchestration with Airflow. 
diff --git a/docs/concepts/mlops/serving.md b/docs/concepts/mlops/serving.md index 37c3d4c358..be8168c033 100644 --- a/docs/concepts/mlops/serving.md +++ b/docs/concepts/mlops/serving.md @@ -1,19 +1,20 @@ -In Hopsworks, you can easily deploy models from the model registry in KServe or in Docker containers (for Hopsworks Community). -KServe is the defacto open-source framework for model serving on Kubernetes. -You can deploy models in either programs, using the HSML library, or in the UI. +In Hopsworks, you can easily deploy models from the model registry using [KServe](https://kserve.github.io/website/latest/), the standard open-source framework for model serving on Kubernetes. +You can deploy models programmatically using [`Model.deploy`][hsml.model.Model.deploy] or via the UI. A KServe model deployment can include the following components: -**`Transformer`** +**`Predictor (KServe component)`** -: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client. +: A predictor runs a model server (Python, TensorFlow Serving, or vLLM) that loads a trained model, handles inference requests and returns predictions. -**`Predictor`** +**`Transformer (KServe component)`** -: A predictor is a ML model in a Python object that takes a feature vector as input and returns a prediction as output. +: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client. + Not available for vLLM deployments. **`Inference Logger`** : Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model. + Not available for vLLM deployments. 
**`Inference Batcher`** @@ -21,8 +22,13 @@ A KServe model deployment can include the following components: **`Istio Model Endpoint`** -: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key. +: You can publish a model over REST(HTTP) or gRPC using a Hopsworks API key, accessible via **path-based routing** through Istio. API keys have scopes to ensure the principle of least privilege access control to resources managed by Hopsworks. + For more details on path-based routing of requests through Istio, see [REST API Guide](../../user_guides/mlops/serving/rest-api.md). + + !!! warning "Host-based routing" + The Istio Model Endpoint supports host-based routing for inference requests; however, this approach is considered legacy. + Path-based routing is recommended for new deployments. Models deployed on KServe in Hopsworks can be easily integrated with the Hopsworks Feature Store using either a Transformer or Predictor Python script, that builds the predictor's input feature vector using the application input and pre-computed features from the Feature Store. @@ -30,3 +36,6 @@ Models deployed on KServe in Hopsworks can be easily integrated with the Hopswor !!! info "Model Serving Guide" More information can be found in the [Model Serving guide](../../user_guides/mlops/serving/index.md). + +!!! tip "Python deployments" + For deploying Python scripts without a model artifact, see the [Python Deployments](../../user_guides/projects/python-deployment/python-deployment.md) page. diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index ba806925ed..41a07d1e97 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -25,7 +25,7 @@ This is a quick-start of the Hopsworks Feature Store; using a fraud use case we ### Batch -This is a batch use case variant of the fraud tutorial, it will give you a high level view on how to use our python APIs and the UI to navigate the feature groups. 
+This is a batch use case variant of the fraud tutorial, it will give you a high level view on how to use our Python APIs and the UI to navigate the feature groups. | Notebooks | | --- | diff --git a/docs/user_guides/fs/feature_view/feature-vectors.md b/docs/user_guides/fs/feature_view/feature-vectors.md index c16778a532..8f47b012e2 100644 --- a/docs/user_guides/fs/feature_view/feature-vectors.md +++ b/docs/user_guides/fs/feature_view/feature-vectors.md @@ -239,7 +239,6 @@ However, you can retrieve the untransformed feature vectors without applying mod entry=[{"id": 1}, {"id": 2}], transform=False ) - ``` ## Retrieving feature vector without on-demand features @@ -258,7 +257,6 @@ To achieve this, set the parameters `transform` and `on_demand_features` to `Fa entry=[{"id": 1}, {"id": 2}], transform=False, on_demand_features=False ) - ``` ## Passing Context Variables to Transformation Functions @@ -274,7 +272,6 @@ After [defining a transformation function using a context variable](../transform entry=[{"pk1": 1}], transformation_context={"context_parameter": 10} ) - ``` ## Choose the right Client diff --git a/docs/user_guides/fs/feature_view/helper-columns.md b/docs/user_guides/fs/feature_view/helper-columns.md index cbb22f305e..cb394e2cf9 100644 --- a/docs/user_guides/fs/feature_view/helper-columns.md +++ b/docs/user_guides/fs/feature_view/helper-columns.md @@ -41,7 +41,6 @@ for computing the [on-demand feature](../../../concepts/fs/feature_group/on_dema inference_helper_columns=["expiry_date"], ) - ``` ### Inference Data Retrieval @@ -88,7 +87,6 @@ However, they can be optionally fetched with inference or training data. ] ] - ``` #### Online inference @@ -129,7 +127,6 @@ However, they can be optionally fetched with inference or training data. 
passed_features={"days_valid": days_valid}, ) - ``` ## Training Helper columns @@ -156,7 +153,6 @@ For example one might want to use feature like `category` of the purchased produ training_helper_columns=["category"], ) - ``` ### Training Data Retrieval @@ -190,7 +186,6 @@ However, they can be optionally fetched. training_dataset_version=1, training_helper_columns=True ) - ``` !!! note diff --git a/docs/user_guides/fs/feature_view/model-dependent-transformations.md b/docs/user_guides/fs/feature_view/model-dependent-transformations.md index 66ebfe518a..cf1cbd142c 100644 --- a/docs/user_guides/fs/feature_view/model-dependent-transformations.md +++ b/docs/user_guides/fs/feature_view/model-dependent-transformations.md @@ -55,7 +55,6 @@ Additionally, Hopsworks also allows users to specify custom names for transforme transformation_functions=[add_two, add_one_multiple], ) - ``` ### Specifying input features @@ -77,7 +76,6 @@ The features to be used by a model-dependent transformation function can be spec ], ) - ``` ### Using built-in transformations @@ -106,7 +104,6 @@ The only difference is that they can either be retrieved from the Hopsworks or i ], ) - ``` To attach built-in transformation functions from the `hopsworks` module they can be directly imported into the code from `hopsworks.builtin_transformations`. @@ -134,7 +131,6 @@ To attach built-in transformation functions from the `hopsworks` module they can ], ) - ``` ## Using Model Dependent Transformations @@ -160,7 +156,6 @@ Model-dependent transformation functions can also be manually applied to a featu # Apply Model Dependent transformations encoded_feature_vector = fv.transform(feature_vector) - ``` ### Retrieving untransformed feature vector and batch inference data @@ -185,5 +180,4 @@ To achieve this, set the `transform` parameter to False. # Fetching untransformed batch data. 
untransformed_batch_data = feature_view.get_batch_data(transform=False) - ``` diff --git a/docs/user_guides/fs/feature_view/training-data.md b/docs/user_guides/fs/feature_view/training-data.md index d57d452603..de3c3c68fb 100644 --- a/docs/user_guides/fs/feature_view/training-data.md +++ b/docs/user_guides/fs/feature_view/training-data.md @@ -154,7 +154,6 @@ Once you have [defined a transformation function using a context variable](../tr transformation_context={"context_parameter": 10}, ) - ``` ## Read training data with primary key(s) and event time diff --git a/docs/user_guides/mlops/serving/api-protocol.md b/docs/user_guides/mlops/serving/api-protocol.md index b5f11e8989..cb42e5b848 100644 --- a/docs/user_guides/mlops/serving/api-protocol.md +++ b/docs/user_guides/mlops/serving/api-protocol.md @@ -3,13 +3,9 @@ ## Introduction Hopsworks supports both REST and gRPC as API protocols for sending inference requests to model deployments. -While REST API protocol is supported in all types of model deployments, support for gRPC is only available for models served with [KServe](predictor.md#serving-tool). +While REST API protocol is supported in all types of model deployments, gRPC is currently supported for **Python model deployments** only. -!!! warning - At the moment, the gRPC API protocol is only supported for **Python model deployments** (e.g., scikit-learn, xgboost). - Support for Tensorflow model deployments is coming soon. - -## GUI +## Web UI ### Step 1: Create a new deployment @@ -40,17 +36,7 @@ To navigate to the advanced creation form, click on `Advanced options`. ### Step 3: Select the API protocol -Enabling gRPC as the API protocol for a model deployment requires KServe as the serving platform for the deployment. -Make sure that KServe is enabled by activating the corresponding checkbox. - -

-  [figure: Enable KServe in the advanced deployment form]

- -Then, you can select the API protocol to be enabled in your model deployment. +You can select the API protocol to be enabled in your model deployment in the advanced deployment form.
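Whichever protocol you select, clients must shape their requests accordingly. As a rough, non-authoritative sketch, a REST inference request follows the standard KServe v1 layout; the model name and input values below are placeholders, and the host, port, and Hopsworks API-key header are cluster-specific and omitted:

```python
import json

# Hypothetical helper: the KServe v1 REST path layout is standard, but the
# host, port, and authentication header are cluster-specific and omitted.
def build_rest_predict_request(model_name: str, instances: list) -> tuple:
    """Return the URL path and JSON body for a v1 :predict call."""
    path = f"/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return path, body

path, body = build_rest_predict_request("mymodel", [[1.0, 2.0, 3.0]])
```

gRPC deployments instead use the KServe gRPC protocol, whose request messages are generated from the protocol's protobuf definitions rather than assembled as JSON.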

@@ -102,7 +88,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_deployment = ms.create_deployment(my_predictor) my_deployment.save() - ``` ### API Reference diff --git a/docs/user_guides/mlops/serving/autoscaling.md b/docs/user_guides/mlops/serving/autoscaling.md new file mode 100644 index 0000000000..28cad9b653 --- /dev/null +++ b/docs/user_guides/mlops/serving/autoscaling.md @@ -0,0 +1,154 @@ +# How To Configure Scaling For A Deployment + +## Introduction + +This guide explains how to set up **autoscaling** for model deployments using either the [web UI](#web-ui) or the [Python API](#code). + +Deployments use [Knative Pod Autoscaler (KPA)](https://knative.dev/docs/serving/autoscaling/) to automatically scale the number of replicas based on traffic. +Autoscaling enables the deployment to use resources more efficiently, by growing and shrinking the allocated resources according to its actual, real-time usage. + +See [Scale metrics](#scale-metrics) and [Scaling parameters](#scaling-parameters) for details on the available scaling options. + +## Web UI + +### Step 1: Create new deployment + +If you have at least one model already trained and saved in the Model Registry, navigate to the deployments page by clicking on the `Deployments` tab on the navigation menu on the left. + +

+  [figure: Deployments navigation tab]

+
+Once in the deployments page, you can create a new deployment by either clicking on `New deployment` (if there are no existing deployments) or on `Create new deployment` in the top-right corner.
+Both options will open the deployment creation form.
+
+### Step 2: Go to advanced options
+
+A simplified creation form will appear including the most common deployment fields from all available configurations.
+Autoscaling is part of the advanced options of a deployment.
+To navigate to the advanced creation form, click on `Advanced options`.
+

+  [figure: Advanced options. Go to the advanced deployment creation form]

+ +### Step 3: Configure autoscaling + +In the `Autoscaling` section of the advanced form, you can configure the scaling parameters for the predictor and/or the transformer (if available). +You can set the scale metric, target value, minimum and maximum instances, as well as the panic and stable window parameters. + +

+  [figure: Autoscaling configuration for the predictor and transformer]

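The scale metric and target relate to the replica count roughly as follows. This is a simplified, non-authoritative sketch of Knative's sizing logic; the real autoscaler averages the metric over the stable window and switches to the panic window under sudden load spikes:

```python
import math

def desired_replicas(observed_metric: float, target: float,
                     min_instances: int, max_instances: int) -> int:
    """Simplified KPA sizing sketch: the observed metric (total RPS or
    total concurrent requests across the deployment) divided by the
    per-replica target, clamped to the configured instance bounds."""
    wanted = math.ceil(observed_metric / target)
    return max(min_instances, min(wanted, max_instances))

# e.g. 450 RPS observed with a per-replica target of 100 RPS
desired_replicas(450, 100, min_instances=1, max_instances=5)
```

Setting `min_instances` to 0 is what allows the deployment to scale to zero when no traffic is observed.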
+
+Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model.
+
+## Code
+
+### Step 1: Connect to Hopsworks
+
+=== "Python"
+
+    ```python
+    import hopsworks
+
+    project = hopsworks.login()
+
+    # get Hopsworks Model Registry handle
+    mr = project.get_model_registry()
+
+    # get Hopsworks Model Serving handle
+    ms = project.get_model_serving()
+    ```
+
+### Step 2: Define the predictor scaling configuration
+
+You can use the [`PredictorScalingConfig`][hsml.scaling_config.PredictorScalingConfig] class to configure the scaling options according to your preferences.
+Default values for scaling metrics and parameters are listed in the [Scale metrics](#scale-metrics) and [Scaling parameters](#scaling-parameters) sections below.
+ +=== "Python" + + ```python + from hsml.scaling_config import TransformerScalingConfig + + transformer_scaling = TransformerScalingConfig( + min_instances=1, max_instances=3, scale_metric="CONCURRENCY", target=50 + ) + ``` + +### Step 4: Create a deployment with the scaling configuration + +=== "Python" + + ```python + my_model = mr.get_model("my_model", version=1) + + # optional + my_transformer = ms.create_transformer( + script_file="Resources/my_transformer.py", + scaling_configuration=transformer_scaling + ) + + my_deployment = my_model.deploy( + scaling_configuration=predictor_scaling, + # optional: + transformer=my_transformer + ) + ``` + +### API Reference + +[`PredictorScalingConfig`][hsml.scaling_config.PredictorScalingConfig] + +[`TransformerScalingConfig`][hsml.scaling_config.TransformerScalingConfig] + +## Scale metrics + +The autoscaler supports two metrics to determine when to scale. +See [Knative autoscaling metrics](https://knative.dev/docs/serving/autoscaling/autoscaling-metrics/) for more details. + +| Scale Metric | Default Target | Description | +| ------------ | -------------- | ------------------------------- | +| RPS | 200 | Requests per second per replica | +| CONCURRENCY | 100 | Concurrent requests per replica | + +## Scaling parameters + +The following parameters can be used to fine-tune the autoscaling behavior. +See [scale bounds](https://knative.dev/docs/serving/autoscaling/scale-bounds/), [autoscaling concepts](https://knative.dev/docs/serving/autoscaling/autoscaling-concepts/) and [scale-to-zero](https://knative.dev/docs/serving/autoscaling/scale-to-zero/) in the Knative documentation for more details. 
+ +| Parameter | Default | Range | Description | +| ----------------------------- | ------- | ------ | ------------------------------------------- | +| `minInstances` | — | ≥ 0 | Minimum replicas (0 enables scale-to-zero) | +| `maxInstances` | — | ≥ 1 | Maximum replicas (cannot be less than min) | +| `panicWindowPercentage` | 10.0 | 1–100 | Panic window as percentage of stable window | +| `stableWindowSeconds` | 60 | 6–3600 | Stable window duration in seconds | +| `panicThresholdPercentage` | 200.0 | > 0 | Traffic threshold to trigger panic mode | +| `scaleToZeroRetentionSeconds` | 0 | ≥ 0 | Time to retain pods before scaling to zero | + +!!! note "Cluster-level constraints" + ==Administrators== can set cluster-wide limits on the maximum and minimum number of instances. When the minimum is set to 0, scale-to-zero is enforced for all deployments. diff --git a/docs/user_guides/mlops/serving/deployment-state.md b/docs/user_guides/mlops/serving/deployment-state.md index 7bc60e5815..022bfec415 100644 --- a/docs/user_guides/mlops/serving/deployment-state.md +++ b/docs/user_guides/mlops/serving/deployment-state.md @@ -17,7 +17,7 @@ The following is the state transition diagram for deployments. States are composed of a [status](#deployment-status) and a [condition](#deployment-conditions). While a status represents a high-level view of the state, conditions contain more detailed information closely related to infrastructure terms. -## GUI +## Web UI ### Step 1: Inspect deployment status @@ -62,7 +62,7 @@ Additionally, you can find the nº of instances currently running by scrolling d !!! info "Scale-to-zero capabilities" If scale-to-zero capabilities are enabled, you can see how the nº of instances of a running deployment goes to zero and the status changes to `idle`. 
- To enable scale-to-zero in a deployment, see [Resource Allocation Guide](resources.md) + To enable scale-to-zero in a deployment, see [Resources Guide](resources.md) ## Code @@ -86,7 +86,6 @@ Additionally, you can find the nº of instances currently running by scrolling d ```python deployment = ms.get_deployment("mydeployment") - ``` ### Step 3: Inspect deployment state @@ -98,7 +97,6 @@ Additionally, you can find the nº of instances currently running by scrolling d state.describe() - ``` ### Step 4: Check nº of running instances @@ -112,7 +110,6 @@ Additionally, you can find the nº of instances currently running by scrolling d # nº of transformer instances deployment.transformer.resources.describe() - ``` ### API Reference @@ -125,18 +122,25 @@ Additionally, you can find the nº of instances currently running by scrolling d The status of a deployment is a high-level description of its current state. -??? info "Show deployment status" +??? info "Deployment statuses" + + | Status | Description | + | -------- | -------------------------------------------------------------------------------------------------------------------------------------------- | + | CREATING | Deployment artifacts are being prepared | + | CREATED | Deployment has never been started | + | STARTING | Deployment is starting | + | RUNNING | Deployment is ready and running. Predictions are served without additional latencies. | + | IDLE | Deployment is ready but scaled to zero or has no active replicas. Higher latencies (cold-start) are expected on the first inference request. | + | FAILED | Terminal state. The deployment has encountered an unrecoverable error. More details can be found in the status condition. 
| + | UPDATING | Deployment is applying updates to the running instances | + | STOPPING | Deployment is stopping | + | STOPPED | Deployment has been stopped | + +## How States Are Determined + +Deployment state is determined from multiple sources: the database state (whether the deployment has been deployed and its revision), KServe InferenceService conditions, pod presence (available replicas for predictor and transformer), and the artifact filesystem (whether the deployment artifact files are ready). - | Status | Description | - | -------- | ------------------------------------------------------------------------------------------------------------------------ | - | CREATED | Deployment has never been started | - | STARTING | Deployment is starting | - | RUNNING | Deployment is ready and running. Predictions are served without additional latencies. | - | IDLE | Deployment is ready, but idle. Higher latencies (i.e., cold-start) are expected in the first incoming inference requests | - | FAILED | Deployment is in a failed state, which can be due to multiple reasons. More details can be found in the condition | - | UPDATING | Deployment is applying updates to the running instances | - | STOPPING | Deployment is stopping | - | STOPPED | Deployment has been stopped | +A revision ID and deployment version are used to distinguish between STARTING (first generation) and UPDATING (subsequent changes to a running deployment). ## Deployment conditions @@ -147,7 +151,7 @@ Status conditions contain three pieces of information: type, status and reason. While the type describes the purpose of the condition, the status represents its progress. Additionally, a reason field is provided with a more descriptive message of the status. -??? info "Show deployment conditions" +??? 
info "Deployment conditions" | Type | Status | Description | | ----------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | diff --git a/docs/user_guides/mlops/serving/deployment.md b/docs/user_guides/mlops/serving/deployment.md index c514a0fd8f..de0a5b1819 100644 --- a/docs/user_guides/mlops/serving/deployment.md +++ b/docs/user_guides/mlops/serving/deployment.md @@ -8,12 +8,13 @@ description: Documentation on how to deployment Machine Learning (ML) models and In this guide, you will learn how to create a new deployment for a trained model. -!!! warning - This guide assumes that a model has already been trained and saved into the Model Registry. - To learn how to create a model in the Model Registry, see [Model Registry Guide](../registry/index.md#exporting-a-model) +!!! info + This guide covers model deployments, which require a model saved in the Model Registry. + To learn how to create a model in the Model Registry, see [Model Registry Guide](../registry/index.md#exporting-a-model). + For Python deployments (running a Python script without a model artifact), see [Python Deployments](../../projects/python-deployment/python-deployment.md). -Deployments are used to unify the different components involved in making one or more trained models online and accessible to compute predictions on demand. -For each deployment, there are four concepts to consider: +Model deployments are used to unify the different components involved in making one or more trained models online and accessible to compute predictions on demand. +For each model deployment, there are four concepts to understand: !!! info "" 1. [Model files](#model-files) @@ -21,7 +22,7 @@ For each deployment, there are four concepts to consider: 3. [Predictor](#predictor) 4. 
[Transformer](#transformer) -## GUI +## Web UI ### Step 1: Create a deployment @@ -42,29 +43,17 @@ Both options will open the deployment creation form. A simplified creation form will appear including the most common deployment fields from all available configurations. We provide default values for the rest of the fields, adjusted to the type of deployment you want to create. -In the simplified form, select the model framework used to train your model. -Then, select the model you want to deploy from the list of available models under `pick a model`. - -After selecting the model, the rest of fields are filled automatically. -We pick the last model version and model artifact version available in the Model Registry. -Moreover, we infer the deployment name from the model name. - -!!! notice "Deployment name validation rules" - A valid deployment name can only contain characters a-z, A-Z and 0-9. - -!!! info "Predictor script for Python models" - For Python models, you must select a custom [predictor script](#predictor) that loads and runs the trained model by clicking on `From project` or `Upload new file`, to choose an existing script in the project file system or upload a new script, respectively. - -If you prefer, change the name of the deployment, model version or [artifact version](#artifact-files). -Then, click on `Create new deployment` to create the deployment for your model. +In the simplified form, choose the model server that will be used to serve your model.

-  [figure: Select the model framework]
+  [figure: Select the model server]

+Then, select the model you want to deploy from the list of available models under `pick a model`. +

  [figure: Select the model]

+After selecting the model, select a model version and give your model deployment a name. + +!!! info "Deployment name validation rules" + A valid deployment name can only contain characters a-z, A-Z and 0-9. + +!!! info "Predictor script for Python models" + For Python models, you must select a custom [predictor script](#predictor) that loads and runs the trained model by clicking on `From project` or `Upload new file`, to choose an existing script in the project file system or upload a new script, respectively. + +!!! info "Server configuration file for vLLM" + For vLLM deployments, a server configuration file is required. + See the [Predictor Guide](predictor.md#server-configuration-file) for more details. + +Lastly, click on `Create new deployment` to create the deployment for your model. + ### Step 3 (Optional): Advanced configuration Optionally, you can access and adjust other parameters of the deployment configuration by clicking on `Advanced options`. @@ -82,28 +85,12 @@ Optionally, you can access and adjust other parameters of the deployment configu
  [figure: Advanced options. Go to the advanced deployment creation form]

- -You will be redirected to a full-page deployment creation form where you can see all the default configuration values we selected for your deployment and adjust them according to your use case. -Apart from the aforementioned simplified configuration, in this form you can setup the following components: - -!!! info "Deployment advanced options" - 1. [Predictor](#predictor) - 2. [Transformer](#transformer) - 3. [Inference logger](predictor.md#inference-logger) - 4. [Inference batcher](predictor.md#inference-batcher) - 5. [Resources](predictor.md#resources) - 6. [API protocol](predictor.md#api-protocol) +You will be redirected to a full-page deployment creation form, where you can review all default configuration values and customize them to fit your requirements. +In addition to the basic settings, this form allows you to further configure the [Predictor](#predictor) and [Transformer](#transformer) KServe components of your model deployment. Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model. -### Step 4: (Kueue enabled) Select a Queue - -If the cluster is installed with Kueue enabled, you will need to select a queue in which the deployment should run. -This can be done from `Advance configuration -> Scheduler section`. - -![Default queue for job](../../../assets/images/guides/project/scheduler/job_queue.png) - -### Step 5: Deployment creation +### Step 4: Deployment creation Wait for the deployment creation process to finish. @@ -114,7 +101,7 @@ Wait for the deployment creation process to finish.

-### Step 6: Deployment overview +### Step 5: Deployment overview Once the deployment is created, you will be redirected to the list of all your existing deployments in the project. You can use the filters on the top of the page to easily locate your new deployment. @@ -150,45 +137,33 @@ After that, click on the new deployment to access the overview page. mr = project.get_model_registry() ``` -### Step 2: Create deployment +### Step 2: Retrieve your trained model -Retrieve the trained model you want to deploy. +Retrieve the trained model you want to deploy using the Model Registry handle. === "Python" ```python my_model = mr.get_model("my_model", version=1) - ``` -#### Option A: Using the model object +### Step 3: Deploy your trained model + +Create a deployment for your model by calling `.deploy()` on the model metadata object. +This will create a deployment for your model with default values. === "Python" ```python my_deployment = my_model.deploy() - + # optionally, start your model deployment + my_deployment.start() ``` -#### Option B: Using the Model Serving handle - -=== "Python" - - ```python - # get Hopsworks Model Serving handle - ms = project.get_model_serving() - - my_predictor = ms.create_predictor(my_model) - my_deployment = my_predictor.deploy() - - # or - my_deployment = ms.create_deployment(my_predictor) - my_deployment.save() - - - ``` +!!! info "Predictor script and server configuration file" + You can provide a predictor script and a server configuration file directly in the `.deploy()` method using the `script_file` and `config_file` parameters. See the [Predictor Guide](predictor.md) for more details. ### API Reference @@ -203,15 +178,12 @@ Inside a model deployment, the local path to the model files is stored in the `M Moreover, you can explore the model files under the `/Models///Files` directory using the File Browser. !!! warning - All files under `/Models` are managed by Hopsworks. 
- Changes to model files cannot be reverted and can have an impact on existing model deployments. + All files under `/Models` and `/Deployments` are managed by Hopsworks. + Manual changes to these files cannot be reverted and can have an impact on existing model deployments. ## Artifact Files -Artifact files are files involved in the correct startup and running of the model deployment. -The most important files are the **predictor** and **transformer scripts**. -The former is used to load and run the model for making predictions. -The latter is typically used to apply transformations on the model inputs at inference time before making predictions. +Artifact files are essential for the proper initialization and operation of a model deployment. The most critical artifact files are the **predictor** and **transformer scripts**. The predictor script loads the trained model and handles prediction requests, while the transformer script applies any necessary input transformations before inference. Predictor and transformer scripts run on separate components and, therefore, scale independently of each other. !!! tip @@ -220,40 +192,26 @@ Predictor and transformer scripts run on separate components and, therefore, sca Additionally, artifact files can also contain a **server configuration file** that helps detach configuration used within the model deployment from the model server or the implementation of the predictor and transformer scripts. Inside a model deployment, the local path to the configuration file is stored in the `CONFIG_FILE_PATH` environment variable (see [environment variables](../serving/predictor.md#environment-variables)). -Every model deployment runs a specific version of the artifact files, commonly referred to as artifact version. ==One or more model deployments can use the same artifact version== (i.e., same predictor and transformer scripts). -Artifact versions are unique for the same model version. 
- -When a new deployment is created, a new artifact version is generated in two cases: - -- the artifact version in the predictor is set to `CREATE` (see [Artifact Version](./predictor.md#environment-variables)) -- no model artifact with the same files has been created before. +Each deployment tracks its artifact files through a ==deployment version== — an integer (1, 2, 3...) that is incremented whenever the artifact content changes (e.g., updating a predictor script or configuration file). Inside a model deployment, the local path to the artifact files is stored in the `ARTIFACT_FILES_PATH` environment variable (see [environment variables](../serving/predictor.md#environment-variables)). -Moreover, you can explore the artifact files under the `/Models///Artifacts/` directory using the File Browser. !!! warning - All files under `/Models` are managed by Hopsworks. - Changes to artifact files cannot be reverted and can have an impact on existing model deployments. + All files under `/Models` and `/Deployments` are managed by Hopsworks. + Manual changes to these files cannot be reverted and can have an impact on existing model deployments. -!!! tip "Additional files" - Currently, the artifact files can only include predictor and transformer scripts, and a configuration file. - Support for additional files (e.g., other resources) is coming soon. +!!! tip "vLLM omni mode" + For vLLM deployments, the server configuration file supports a `#HOPSWORKS omni: true` directive to enable omni mode. ## Predictor -Predictors are responsible for running the model server that loads the trained model, listens to inference requests and returns prediction results. -To learn more about predictors, see the [Predictor Guide](predictor.md) - -!!! note - Only one predictor is supported in a deployment. - -!!! 
info
-    Model artifacts are assigned an incremental version number, being `0` the version reserved for model artifacts that do not contain predictor or transformer scripts (i.e., shared artifacts containing only the model files).
+Predictors are responsible for running the model server that loads the trained model, handles inference requests and returns prediction results.
+To learn more about predictors, see the [Predictor (KServe) Guide](predictor.md).

 ## Transformer

 Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model.
-To learn more about transformers, see the [Transformer Guide](transformer.md).
+To learn more about transformers, see the [Transformer (KServe) Guide](transformer.md).

 !!! warning
-    Transformers are only supported in KServe deployments.
+    Transformers are not available for vLLM deployments.
diff --git a/docs/user_guides/mlops/serving/external-access.md b/docs/user_guides/mlops/serving/external-access.md
index a02c8c2500..3de74a3941 100644
--- a/docs/user_guides/mlops/serving/external-access.md
+++ b/docs/user_guides/mlops/serving/external-access.md
@@ -6,7 +6,7 @@ description: Documentation on how to configure external access to a model deploy

 ## Introduction

-Hopsworks supports role-based access control (RBAC) for project members within a project, where a project ML assets can only be accessed by Hopsworks users that are members of that project (See [governance](../../../concepts/projects/governance.md)).
+Hopsworks supports **role-based access control (RBAC)** for project members within a project, where a project's ML assets can only be accessed by Hopsworks users that are members of that project (see [governance](../../../concepts/projects/governance.md)).
However, there are cases where you might want to grant ==external users== with access to specific model deployments without them having to register into Hopsworks or to join the project which will give them access to all project ML assets. For these cases, Hopsworks supports fine-grained access control to model deployments based on ==user groups== managed by an external Identity Provider. @@ -15,7 +15,7 @@ For these cases, Hopsworks supports fine-grained access control to model deploym Hopsworks can be configured to use different types of authentication methods including OAuth2, LDAP and Kerberos. See the [Authentication Methods Guide](../../../setup_installation/admin/auth.md) for more information. -## GUI (for Hopsworks users) +## Web UI (for Hopsworks users) ### Step 1: Navigate to a model deployment @@ -64,7 +64,7 @@ After that, click on the `save` button to persist the changes.

-## GUI (for external users) +## Web UI (for external users) ### Step 1: Login with the external identity provider @@ -105,7 +105,7 @@ Inference requests to model deployments are authenticated and authorized based o You can create API keys to authenticate your inference requests by clicking on the `Create API Key` button. !!! info "Authorization header" - API keys are set in the `Authorization` header following the format `ApiKey ` + API keys are set in the `authorization` header following the format `ApiKey `
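As an illustration of the expected header format, the following sketch builds (but does not send) an authenticated inference request using Python's standard library. The URL, deployment name, and key value are placeholders, not real values:

```python
# Sketch: constructing an authenticated inference request.
# The API key and URL below are placeholders for illustration only.
import json
import urllib.request

api_key = "MY_API_KEY"  # placeholder: an API key created in the UI
headers = {
    "Authorization": f"ApiKey {api_key}",  # HTTP header names are case-insensitive
    "Content-Type": "application/json",
}
payload = {"instances": [[1.0, 2.0, 3.0]]}

req = urllib.request.Request(
    "https://hopsworks.example.com/v1/models/mydeployment:predict",  # placeholder URL
    data=json.dumps(payload).encode("utf-8"),
    headers=headers,
    method="POST",
)
# urllib.request.urlopen(req) would send the request; omitted here
```

The exact URI path depends on the deployment type, as described in the steps below.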

@@ -116,12 +116,11 @@ You can create API keys to authenticate your inference requests by clicking on t ### Step 4: Send inference requests -Depending on the type of model deployment, the URI of the model server can differ (e.g., `/chat/completions` for LLM deployments or `/predict` for traditional model deployments). -You can find the corresponding URI on every model deployment card. +The URI path for sending inference requests depends on the type of model deployment. +For example, LLM deployments typically use `/chat/completions`, while traditional model deployments use `/predict`. +You can find the exact URI path for each deployment on its model deployment card. -In addition to the `Authorization` header containing the API key, the `Host` header needs to be set according to the model deployment where the inference requests are sent to. -This header is used by the ingress to route the inference requests to the corresponding model deployment. -You can find the `Host` header value in the model deployment card. +For detailed instructions on constructing requests and handling authentication, refer to the [REST API Guide](rest-api.md). !!! tip "Code snippets" For clients sending inference requests using libraries similar to curl or OpenAI API-compatible libraries (e.g., LangChain), you can find code snippet examples by clicking on the `Curl >_` and `LangChain >_` buttons. diff --git a/docs/user_guides/mlops/serving/index.md b/docs/user_guides/mlops/serving/index.md index bc248e01f3..2b0eda9853 100644 --- a/docs/user_guides/mlops/serving/index.md +++ b/docs/user_guides/mlops/serving/index.md @@ -3,19 +3,19 @@ ## Deployment Assuming you have already created a model in the [Model Registry](../registry/index.md), a deployment can now be created to prepare a model artifact for this model and make it accessible for running predictions behind a REST or gRPC endpoint. -Follow the [Deployment Creation Guide](deployment.md) to create a Deployment for your model. 
-### Predictor +Refer to the [Deployment Creation Guide](deployment.md) for step-by-step instructions on creating a deployment for your model. For details on monitoring the status and lifecycle of an existing deployment, see the [Deployment State Guide](deployment-state.md). -Predictors are responsible for running a model server that loads a trained model, handles inference requests and returns predictions, see the [Predictor Guide](predictor.md). +!!! tip "Python deployments" + If you want to deploy a Python script without a model artifact, see the [Python Deployments](../../projects/python-deployment/python-deployment.md) page. -### Transformer +### Predictor (KServe component) -Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model, see the [Transformer Guide](transformer.md). +Predictors are responsible for running a model server that loads a trained model, handles inference requests and returns predictions, see the [Predictor Guide](predictor.md). -### Resource Allocation +### Transformer (KServe component) -Configure the resources to be allocated for predictor and transformer in a model deployment, see the [Resource Allocation Guide](resources.md). +Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model, see the [Transformer Guide](transformer.md). ### Inference Batcher @@ -25,9 +25,28 @@ Configure the predictor to batch inference requests, see the [Inference Batcher Configure the predictor to log inference requests and predictions, see the [Inference Logger Guide](inference-logger.md). +### Resources + +Configure the resources to be allocated for predictor and transformer in a model deployment, see the [Resources Guide](resources.md). 
+ +### Autoscaling + +Configure autoscaling for your model deployment, including scale-to-zero, scale metrics and scaling parameters, see the [Autoscaling Guide](autoscaling.md). + +### Scheduling + +!!! info "Kueue is required" + This feature requires Kueue to be enabled in your cluster. If Kueue is not available, queue and topology options will not be accessible. + +Configure scheduling for your model deployment using Kueue queues, see the [Scheduling Guide](scheduling.md). + +### API Protocol + +Choose between REST and gRPC API protocols for your model deployment, see the [API Protocol Guide](api-protocol.md). + ### REST API -Send inference requests to deployed models using REST API, see the [Rest API Guide](rest-api.md). +Send inference requests to deployed models using REST API, see the [REST API Guide](rest-api.md). ### Troubleshooting diff --git a/docs/user_guides/mlops/serving/inference-batcher.md b/docs/user_guides/mlops/serving/inference-batcher.md index d978b900d9..298137ed08 100644 --- a/docs/user_guides/mlops/serving/inference-batcher.md +++ b/docs/user_guides/mlops/serving/inference-batcher.md @@ -3,10 +3,11 @@ ## Introduction Inference batching can be enabled to increase inference request throughput at the cost of higher latencies. -The configuration of the inference batcher depends on the serving tool and the model server used in the deployment. -See the [compatibility matrix](#compatibility-matrix). +The configuration of the inference batcher depends on the model server used in the deployment. -## GUI +!!! warning "Inference batching is not supported for vLLM deployments." + +## Web UI ### Step 1: Create new deployment @@ -48,6 +49,9 @@ To enable inference batching, click on the `Request batching` checkbox. If your deployment uses KServe, you can optionally set three additional parameters for the inference batcher: maximum batch size, maximum latency (ms) and timeout (s). +!!! 
note "Timeout parameter" + The `timeout` parameter sets the request timeout in seconds for the inference batcher. If a batch is not filled within this time, the available requests are sent as a partial batch. + Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model. ## Code @@ -91,30 +95,10 @@ Once you are done with the changes, click on `Create new deployment` at the bott ```python my_model = mr.get_model("my_model", version=1) - my_predictor = ms.create_predictor(my_model, inference_batcher=my_batcher) - my_predictor.deploy() - - # or - - my_deployment = ms.create_deployment(my_predictor) - my_deployment.save() - + my_model.deploy(inference_batcher=my_batcher) ``` ### API Reference [`InferenceBatcher`][hsml.inference_batcher.InferenceBatcher] - -## Compatibility matrix - -??? info "Show supported inference batcher configuration" - - | Serving tool | Model server | Inference batching | Fine-grained configuration | - | ------------ | ------------------ | ------------------ | ------- | - | Docker | Flask | ❌ | - | - | | TensorFlow Serving | ✅ | ❌ | - | Kubernetes | Flask | ❌ | - | - | | TensorFlow Serving | ✅ | ❌ | - | KServe | Flask | ✅ | ✅ | - | | TensorFlow Serving | ✅ | ✅ | diff --git a/docs/user_guides/mlops/serving/inference-logger.md b/docs/user_guides/mlops/serving/inference-logger.md index 64012324b7..091f6ff10d 100644 --- a/docs/user_guides/mlops/serving/inference-logger.md +++ b/docs/user_guides/mlops/serving/inference-logger.md @@ -6,9 +6,21 @@ Once a model is deployed and starts making predictions as inference requests arr Hopsworks supports logging both inference requests and predictions as events to a Kafka topic for analysis. -!!! warning "Topic schemas vary depending on the serving tool. See [below](#topic-schema)" +!!! warning "Inference logging is not supported for vLLM deployments." -## GUI +!!! 
info "Logging modes" + Three logging modes are available: + + | Mode | Logger Mode | Description | + | ------------ | ----------- | --------------------------- | + | ALL | `all` | Log both inputs and outputs | + | PREDICTIONS | `response` | Log model outputs only | + | MODEL_INPUTS | `request` | Log model inputs only | + +!!! note "Kafka topic requirements" + The Kafka topic must use the `inferenceschema` subject. Schema v4+ is required for KServe topics. + +## Web UI ### Step 1: Create new deployment @@ -105,14 +117,7 @@ Once you are done with the changes, click on `Create new deployment` at the bott ```python my_model = mr.get_model("my_model", version=1) - my_predictor = ms.create_predictor(my_model, inference_logger=my_logger) - my_predictor.deploy() - - # or - - my_deployment = ms.create_deployment(my_predictor) - my_deployment.save() - + my_model.deploy(inference_logger=my_logger) ``` @@ -122,47 +127,23 @@ Once you are done with the changes, click on `Create new deployment` at the bott ## Topic schema -The schema of Kafka events varies depending on the serving tool. -In KServe deployments, model inputs and predictions are logged in separate events, but sharing the same `requestId` field. -In non-KServe deployments, the same event contains both the model input and prediction related to the same inference request. - -??? 
example "Show kafka topic schemas" - - === "KServe" - - ``` json - { - "fields": [ - { "name": "servingId", "type": "int" }, - { "name": "modelName", "type": "string" }, - { "name": "modelVersion", "type": "int" }, - { "name": "requestTimestamp", "type": "long" }, - { "name": "responseHttpCode", "type": "int" }, - { "name": "inferenceId", "type": "string" }, - { "name": "messageType", "type": "string" }, - { "name": "payload", "type": "string" } - ], - "name": "inferencelog", - "type": "record" - } - ``` - - === "Docker / Kubernetes" - - ``` json - { - "fields": [ - { "name": "modelId", "type": "int" }, - { "name": "modelName", "type": "string" }, - { "name": "modelVersion", "type": "int" }, - { "name": "requestTimestamp", "type": "long" }, - { "name": "responseHttpCode", "type": "int" }, - { "name": "inferenceRequest", "type": "string" }, - { "name": "inferenceResponse", "type": "string" }, - { "name": "modelServer", "type": "string" }, - { "name": "servingTool", "type": "string" } - ], - "name": "inferencelog", - "type": "record" - } - ``` +Model inputs and predictions are logged in separate events, sharing the same `requestId` field. + +!!! example "Kafka topic schema" + + ``` json + { + "fields": [ + { "name": "servingId", "type": "int" }, + { "name": "modelName", "type": "string" }, + { "name": "modelVersion", "type": "int" }, + { "name": "requestTimestamp", "type": "long" }, + { "name": "responseHttpCode", "type": "int" }, + { "name": "inferenceId", "type": "string" }, + { "name": "messageType", "type": "string" }, + { "name": "payload", "type": "string" } + ], + "name": "inferencelog", + "type": "record" + } + ``` diff --git a/docs/user_guides/mlops/serving/predictor.md b/docs/user_guides/mlops/serving/predictor.md index 4470ffac75..bb13b9c171 100644 --- a/docs/user_guides/mlops/serving/predictor.md +++ b/docs/user_guides/mlops/serving/predictor.md @@ -14,22 +14,23 @@ In this guide, you will learn how to configure a predictor for a trained model. 
Predictors are the main component of deployments. They are responsible for running a model server that loads a trained model, handles inference requests and returns predictions. -They can be configured to use different model servers, serving tools, log specific inference data or scale differently. -In each predictor, you can configure the following components: +They can be configured to use different model servers, different resources or scale differently. +In each predictor, you can decide the following configuration: !!! info "" 1. [Model server](#model-server) - 2. [Serving tool](#serving-tool) - 3. [User-provided script](#user-provided-script) - 4. [Server configuration file](#server-configuration-file) - 5. [Python environments](#python-environments) - 6. [Transformer](#transformer) - 7. [Inference Logger](#inference-logger) - 8. [Inference Batcher](#inference-batcher) - 9. [Resources](#resources) - 10. [API protocol](#api-protocol) - -## GUI + 2. [Predictor script](#predictor-script) + 3. [Server configuration file](#server-configuration-file) + 4. [Python environments](#python-environments) + 5. [Transformer script](#transformer-script) + 6. [Inference Logger](#inference-logger) + 7. [Inference Batcher](#inference-batcher) + 8. [Resources](#resources) + 9. [Autoscaling](#autoscaling) + 10. [Scheduling](#scheduling) + 11. [API protocol](#api-protocol) + +## Web UI ### Step 1: Create new deployment @@ -51,11 +52,11 @@ A simplified creation form will appear, including the most common deployment fie The first step is to choose a ==backend== for your model deployment. The backend will filter the models shown below according to the framework that the model was registered with in the model registry. -For example if you registered the model as a TensorFlow model using `ModelRegistry.tensorflow.create_model(...)` you select `Tensorflow Serving` in the dropdown. 
+For example, if you registered the model as a TensorFlow model using `ModelRegistry.tensorflow.create_model(...)`, you select `TensorFlow Serving` in the dropdown.
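The filtering of models by backend can be sketched as a simple lookup; the framework keys below are illustrative assumptions, not the exact registry identifiers:

```python
# Illustrative mapping between registered model frameworks and the backend
# shown in the dropdown; the framework keys are assumptions for illustration.
BACKEND_BY_FRAMEWORK = {
    "python": "Python",                  # scikit-learn, XGBoost, PyTorch, ...
    "sklearn": "Python",
    "tensorflow": "TensorFlow Serving",  # Keras / TensorFlow models
    "llm": "vLLM",                       # large language models
}

def backend_for(framework: str) -> str:
    """Return the model server matching a registered framework."""
    return BACKEND_BY_FRAMEWORK.get(framework.lower(), "Python")
```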

- Select the model framework + Select the model server
Select the backend

@@ -69,12 +70,13 @@ All models compatible with the selected backend will be listed in the model drop


-Moreover, you can optionally select a predictor script (see [Step 3 (Optional): Select a predictor script](#step-3-optional-select-a-predictor-script)), enable KServe (see [Step 4 (Optional): Enable KServe](#step-6-optional-enable-kserve)) or change other advanced configuration (see [Step 5 (Optional): Other advanced options](#step-7-optional-other-advanced-options)).
-Otherwise, click on `Create new deployment` to create the deployment for your model.
+After selecting a model from the dropdown, you can optionally choose a predictor script, modify the predictor environment, add a configuration file, or adjust other advanced settings as described in the optional steps below.
+
+Otherwise, click on `Create new deployment` to create the deployment for your model with default values.

 ### Step 3 (Optional): Select a predictor script

-For python models, if you want to use your own [predictor script](#step-2-optional-implement-a-predictor-script) click on `From project` and navigate through the file system to find it, or click on `Upload new file` to upload a predictor script now.
+For Python models, to select a [predictor script](#predictor-script), click on `From project` and navigate through the file system to find it, or click on `Upload new file` to upload a predictor script now.


@@ -85,12 +87,12 @@ For python models, if you want to use your own [predictor script](#step-2-option

 ### Step 4 (Optional): Change predictor environment

-If you are using a predictor script it is also required to configure the inference environment for the predictor.
+If you are using a predictor script, it is required to select an inference environment for the predictor.
 This environment needs to have all the necessary dependencies installed to run your predictor script.

-By default, we provide a set of environments like `tensorflow-inference-pipeline`, `torch-inference-pipeline` and `pandas-inference-pipeline` that serves this purpose for common machine learning frameworks.
+Hopsworks provides a collection of built-in environments, such as `minimal-inference-pipeline`, `pandas-inference-pipeline` or `torch-inference-pipeline`, with different sets of libraries pre-installed. By default, the `pandas-inference-pipeline` Python environment is used.

-To create your own it is recommended to [clone](../../projects/python/python_env_clone.md) the `minimal-inference-pipeline` and install additional dependencies for your use-case.
+To create your own, it is recommended to [clone](../../projects/python/python_env_clone.md) the `pandas-inference-pipeline` environment and install additional dependencies for your use-case.
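Since a deployment fails at startup when the selected environment lacks a library the predictor script imports, a small guard can surface the problem with a clear message. This is a generic sketch, not a Hopsworks API:

```python
# Generic sketch: fail fast with a descriptive error if the selected
# inference environment is missing libraries the predictor script needs.
import importlib.util

def check_dependencies(modules):
    """Raise an error listing any modules that are not installed."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    if missing:
        raise RuntimeError(
            "Inference environment is missing: " + ", ".join(missing)
            + ". Clone the environment and install them before deploying."
        )

# e.g., at the top of Predictor.__init__:
check_dependencies(["json", "os"])  # standard-library modules always pass
```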

@@ -101,13 +103,12 @@ To create your own it is recommended to [clone](../../projects/python/python_env ### Step 5 (Optional): Select a configuration file -!!! note - Only available for LLM deployments. - You can select a configuration file to be added to the [artifact files](deployment.md#artifact-files). -If a predictor script is provided, this configuration file will be available inside the model deployment at the local path stored in the `CONFIG_FILE_PATH` environment variable. -If a predictor script is **not** provided, this configuration file will be directly passed to the vLLM server. -You can find all configuration parameters supported by the vLLM server in the [vLLM documentation](https://docs.vllm.ai/en/v0.7.1/serving/openai_compatible_server.html). +In Python model deployments, this configuration file will be available inside the model deployment at the local path stored in the `CONFIG_FILE_PATH` environment variable. In vLLM deployments, this configuration file will be directly passed to the vLLM server. +You can find all configuration parameters supported by the vLLM server in the [vLLM documentation](https://docs.vllm.ai/en/v0.10.2/cli/serve.html). + +!!! info + Configuration files are required for vLLM deployments as they are used to define the configuration for the vLLM server.
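In a Python model deployment, the predictor script can read the configuration file from the path stored in `CONFIG_FILE_PATH` in whatever format it was written. The sketch below assumes JSON purely for illustration:

```python
# Sketch: reading the server configuration file inside a predictor script.
# JSON is an assumption for illustration; any format works for Python deployments.
import json
import os

def load_config():
    """Load the deployment configuration file, if one was attached."""
    config_path = os.environ.get("CONFIG_FILE_PATH")
    if not config_path or not os.path.exists(config_path):
        return {}  # no configuration file provided for this deployment
    with open(config_path) as f:
        return json.load(f)

# e.g., inside Predictor.__init__: self.config = load_config()
```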

@@ -116,10 +117,9 @@ You can find all configuration parameters supported by the vLLM server in the [v

-### Step 6 (Optional): Enable KServe +### Step 6 (Optional): Other advanced options -Other configuration such as the serving tool, is part of the advanced options of a deployment. -To navigate to the advanced creation form, click on `Advanced options`. +To access the advanced deployment configuration, click on `Advanced options`.

@@ -128,25 +128,16 @@ To navigate to the advanced creation form, click on `Advanced options`.

-Here, you change the [serving tool](#serving-tool) for your deployment by enabling or disabling the KServe checkbox. +Here, you can further change the default values of the predictor: -

-

- KServe in advanced deployment form -
KServe checkbox in the advanced deployment form
-
-

- -### Step 7 (Optional): Other advanced options - -Additionally, you can adjust the default values of the rest of components: - -!!! info "Predictor components" - 1. [Transformer](#transformer) +!!! info "Predictor configuration" + 1. [Transformer](#transformer-script) 2. [Inference logger](#inference-logger) 3. [Inference batcher](#inference-batcher) 4. [Resources](#resources) - 5. [API protocol](#api-protocol) + 5. [Autoscaling](#autoscaling) + 6. [Scheduling](#scheduling) + 7. [API protocol](#api-protocol) Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model. @@ -168,104 +159,110 @@ Once you are done with the changes, click on `Create new deployment` at the bott ms = project.get_model_serving() ``` -### Step 2 (Optional): Implement a predictor script +### Step 2.1 (Optional): Implement a predictor script + +For Python model deployments, you need implement a predictor script that loads and serve your model. === "Predictor" ``` python class Predictor: - def __init__(self): - """Initialization code goes here""" - # Model files can be found at os.environ["MODEL_FILES_PATH"] - # self.model = ... # load your model - - def predict(self, inputs): - """Serve predictions using the trained model""" - # Use the model to make predictions - # return self.model.predict(inputs) + def __init__(self): + """Initialization code goes here""" + # Optional __init__ params: project, deployment, model, async_logger + # Model files can be found at os.environ["MODEL_FILES_PATH"] + # self.model = ... # load your model + + def predict(self, inputs): + """Serve predictions using the trained model""" + # Use the model to make predictions + # return self.model.predict(inputs) ``` -=== "Async Predictor" +=== "Predictor with Feature Logging" ``` python class Predictor: - def __init__(self): - """Initialization code goes here""" - # Model files can be found at os.environ["MODEL_FILES_PATH"] - # self.model = ... 
# load your model - - async def predict(self, inputs): - """Asynchronously serve predictions using the trained model""" - # Perform async operations that required - # result = await some_async_preprocessing(inputs) - - # Use the model to make predictions - # return self.model.predict(result) + def __init__(self, async_logger, model, project): + """Initializes the serving state, reads a trained model""" + # Get feature view attached to model + ## self.model = model + ## self.feature_view = model.get_feature_view() + + # Initialize feature view with async feature logger + ## self.feature_view.init_feature_logger(feature_logger=async_logger) + + def predict(self, inputs): + """Serves a prediction request usign a trained model""" + # Extract serving keys and request parameters from inputs + ## serving_keys = ... + ## request_parameters = ... + + # Fetch feature vector with logging metadata + ## vector = self.feature_view.get_feature_vector(serving_keys, + ## request_parameters=request_parameters, + ## logging_data=True) + + # Make predictions + ## predictions = model.predict(vector) + + # Log Predictions + ## self.feature_view.log(vector, + ## predictions=predictions, + ## model = self.model) + + # Predictions + ## return predictions ``` -=== "Predictor (vLLM deployments only)" +=== "Async Predictor" ``` python - import os - from vllm import **version**, AsyncEngineArgs, AsyncLLMEngine - from typing import Iterable, AsyncIterator, Union, Optional - from kserve.protocol.rest.openai import ( - CompletionRequest, - ChatPrompt, - ChatCompletionRequestMessage, - ) - from kserve.protocol.rest.openai.types import Completion - from kserve.protocol.rest.openai.types.openapi import ChatCompletionTool - - class Predictor(): - - def __init__(self): - """ Initialization code goes here""" - - # (optional) if any, access the configuration file via os.environ["CONFIG_FILE_PATH"] - config = ... 
- - print("Starting vLLM backend...") - engine_args = AsyncEngineArgs( - model=os.environ["MODEL_FILES_PATH"], - **config - ) - - # "self.vllm_engine" is required as the local variable with the vllm engine handler - self.vllm_engine = AsyncLLMEngine.from_engine_args(engine_args) - - # - # NOTE: Default implementations of the apply_chat_template and create_completion methods are already provided. - # If needed, you can override these methods as shown below - # - - #def apply_chat_template( - # self, - # messages: Iterable[ChatCompletionRequestMessage], - # chat_template: Optional[str] = None, - # tools: Optional[list[ChatCompletionTool]] = None, - #) -> ChatPrompt: - # """Converts a prompt or list of messages into a single templated prompt string""" - - # prompt = ... # apply chat template on the message to build the prompt - # return ChatPrompt(prompt=prompt) - - #async def create_completion( - # self, request: CompletionRequest - #) -> Union[Completion, AsyncIterator[Completion]]: - # """Generate responses using the vLLM engine""" - # - # generators = self.vllm_engine.generate(...) - # - # # Completion: used for returning a single answer (batch) - # # AsyncIterator[Completion]: used for returning a stream of answers - # return ... + class Predictor: + def __init__(self): + """Initialization code goes here""" + # Optional __init__ params: project, deployment, model, async_logger + # Model files can be found at os.environ["MODEL_FILES_PATH"] + # self.model = ... # load your model + + async def predict(self, inputs): + """Asynchronously serve predictions using the trained model""" + # Perform async operations that required + # result = await some_async_preprocessing(inputs) + + # Use the model to make predictions + # return self.model.predict(result) ``` +!!! 
tip "Optional `__init__` parameters" + The `__init__` method supports optional parameters that are automatically injected at runtime: + + | Parameter | Class | Description | + | -------------- | -------------------- | ------------------------------------------------------ | + | `project` | `Project` | Hopsworks project handle | + | `deployment` | `Deployment` | Current model deployment handle | + | `model` | `Model` | Model handle | + | `async_logger` | `AsyncFeatureLogger` | Async feature logger for logging features to Hopsworks | + + You can add any combination of these parameters to your `__init__` method: + + ```python + class Predictor: + def __init__(self, project, model): + # Access the project and model directly + self.project = project + self.model = model + ``` + +!!! tip "Feature logging" + The `async_logger` parameter enables asynchronous logging of features and predictions from your predictor script via the Feature View API. This is useful for debugging, monitoring, and auditing the data your models use in production. Logged features are periodically materialized to the offline feature store, and can be retrieved, filtered, and managed through the feature view. + + See the [Feature and Prediction Logging](../../fs/feature_view/feature_logging.md) guide for details on enabling logging, retrieving logs, and managing the log lifecycle. + !!! info "Jupyter magic" In a jupyter notebook, you can add `%%writefile my_predictor.py` at the top of the cell to save it as a local file. -### Step 3 (Optional): Upload the script to your project +### Step 2.2 (Optional): Upload the script to your project !!! info "You can also use the UI to upload your predictor script. 
See [above](#step-3-optional-select-a-predictor-script)" @@ -279,172 +276,181 @@ Once you are done with the changes, click on `Create new deployment` at the bott "/Projects", project.name, uploaded_file_path ) - ``` -### Step 4: Define predictor +### Step 3: Pass predictor configuration to model deployment + +You can customize the default predictor settings when creating a model deployment. === "Python" ```python my_model = mr.get_model("my_model", version=1) - my_predictor = ms.create_predictor( - my_model, - # optional + my_deployment = my_model.deploy( + # predictor configuration model_server="PYTHON", - serving_tool="KSERVE", script_file=predictor_script_path, ) - - ``` -### Step 5: Create a deployment with the predictor +### API Reference -=== "Python" +[`Predictor`][hsml.predictor.Predictor] - ```python - my_deployment = my_predictor.deploy() +## Model Server - # or - my_deployment = ms.create_deployment(my_predictor) - my_deployment.save() +Hopsworks Model Serving supports deploying models with a Python model server for python-based models (scikit-learn, XGBoost , pytorch...), TensorFlow Serving for TensorFlow / Keras models and vLLM for Large Language Models (LLMs). +!!! info "Supported model servers" - ``` + | Model Server | Backend | ML Models and Frameworks | + | -------------------- | --------------------------------------------- | ------------------------------------------------------------------------------------------------ | + | Python | Any `*-inference-pipeline` Python environment | Python-based (scikit-learn, XGBoost , pytorch...) 
| + | KServe sklearnserver | Sklearn built-in KServe runtime | Scikit-learn, XGBoost | + | TensorFlow Serving | TensorFlow Serving runtime | Keras, TensorFlow | + | vLLM | vLLM openai-compatible server | vLLM-supported models (see [list](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html)) | -### API Reference +Each model server has specific requirements and supports different types of model artifacts, file formats, and configuration options. When deploying a model, ensure that your model files and configuration align with the expectations of the selected server. -[`Predictor`][hsml.predictor.Predictor] +!!! info "Model artifact requirements" -## Model Server + | Model server | Model files | + | -------------------- | --------------------------------------------------------- | + | Python | Any model file format | + | KServe sklearnserver | Files with extensions `.joblib`, `.pkl`, `.pickle` | + | TensorFlow Serving | Model artifact needs `variables/` and `.pb` file | + | vLLM | Model files supported by vLLM engine (e.g., .safetensors) | -Hopsworks Model Serving supports deploying models with a Flask server for python-based models, TensorFlow Serving for TensorFlow / Keras models and vLLM for Large Language Models (LLMs). -Today, you can deploy PyTorch models as python-based models. +All deployments use [KServe](https://kserve.github.io/website/latest/) as the serving platform, providing autoscaling (including scale-to-zero), fine-grained resource allocation, inference logging, inference batching, and transformers. -??? info "Show supported model servers" +## Predictor script - | Model Server | Supported | ML Models and Frameworks | - | ------------------ | --------- | ----------------------------------------------------------------------------------------------- | - | Flask | ✅ | python-based (scikit-learn, xgboost, pytorch...) 
| - | TensorFlow Serving | ✅ | keras, tensorflow | - | TorchServe | ❌ | pytorch | - | vLLM | ✅ | vLLM-supported models (see [list](https://docs.vllm.ai/en/v0.7.1/models/supported_models.html)) | +For **Python model deployments** ==only==, you can provide a custom Python script—called a predictor script—to load your model and serve predictions. This script is included in the [artifact files](../serving/deployment.md#artifact-files) of the deployment. The script must follow a specific template, as shown in [Step 2](#step-21-optional-implement-a-predictor-script). -## Serving tool +## Server configuration file -In Hopsworks, model servers are deployed on Kubernetes. -There are two options for deploying models on Kubernetes: using [KServe](https://kserve.github.io/website/latest/) inference services or Kubernetes built-in deployments. ==KServe is the recommended way to deploy models in Hopsworks==. +For **Python model deployments**, you can provide a server configuration file to separate deployment-specific settings from the logic in your predictor or transformer scripts. This approach allows you to update configuration parameters without modifying the code. Within the deployment, the configuration file is accessible at the path specified by the `CONFIG_FILE_PATH` environment variable (see [environment variables](#environment-variables)). -The following is a comparative table showing the features supported by each of them. +For **vLLM deployments**, the server configuration file is ==required== and is used to configure the vLLM server. For example, you can use this configuration file to specify the chat template or LoRA modules to be loaded by the vLLM server. See all available parameters in the [official documentation](https://docs.vllm.ai/en/v0.10.2/serving/openai_compatible_server.html). -??? info "Show serving tools comparison" +!!! 
warning "Configuration file format" + The configuration file can be of any format, except in **vLLM deployments** for which a YAML file (`.yml`/`.yaml`) is ==required==. + When a predictor script is provided, any format is allowed, since the script can load and parse it as needed. - | Feature / requirement | Kubernetes (enterprise) | KServe (enterprise) | - | ----------------------------------------------------- | ----------------------- | ------------------------- | - | Autoscaling (scale-out) | ✅ | ✅ | - | Resource allocation | ➖ min. resources | ✅ min / max. resources | - | Inference logging | ➖ simple | ✅ fine-grained | - | Inference batching | ➖ partially | ✅ | - | Scale-to-zero | ❌ | ✅ after 30s of inactivity | - | Transformers | ❌ | ✅ | - | Low-latency predictions | ❌ | ✅ | - | Multiple models | ❌ | ➖ (python-based) | - | User-provided predictor required
(python-only) | ✅ | ❌ | +## Environment variables -## User-provided script +A number of different environment variables is available in the predictor to ease its implementation. -Depending on the model server and serving platform used in the model deployment, you can (or need) to provide your own python script to load the model and make predictions. -This script is referred to as **predictor script**, and is included in the [artifact files](../serving/deployment.md#artifact-files) of the model deployment. +!!! tip "Available environment variables" -The predictor script needs to implement a given template depending on the model server of the model deployment. -See the templates in [Step 2](#step-2-optional-implement-a-predictor-script). + === "Deployment" -??? info "Show supported user-provided predictors" + These variables are available in all deployments. - | Serving tool | Model server | User-provided predictor script | - | ------------ | ------------------ | ---------------------------------------------------- | - | Kubernetes | Flask server | ✅ (required) | - | | TensorFlow Serving | ❌ | - | KServe | Fast API | ✅ (only required for artifacts with multiple models) | - | | TensorFlow Serving | ❌ | - | | vLLM | ✅ (optional) | + | Name | Description | + | --------------------- | -------------------------------- | + | `DEPLOYMENT_NAME` | Name of the current deployment | + | `DEPLOYMENT_VERSION` | Version of the deployment | + | `ARTIFACT_FILES_PATH` | Local path to the artifact files | -### Server configuration file + === "Predictor" -Depending on the model server, a **server configuration file** can be selected to help detach configuration used within the model deployment from the model server or the implementation of the predictor and transformer scripts. -In other words, by modifying the configuration file of an existing model deployment you can adjust its settings without making changes to the predictor or transformer scripts. 
-Inside a model deployment, the local path to the configuration file is stored in the `CONFIG_FILE_PATH` environment variable (see [environment variables](#environment-variables)). + These variables are set for predictor components. -!!! warning "Configuration file format" - The configuration file can be of any format, except in vLLM deployments **without a predictor script** for which a YAML file is ==required==. + | Name | Description | + | ------------------ | -------------------------------------------------- | + | `SCRIPT_PATH` | Full path to the predictor script | + | `SCRIPT_NAME` | Prefixed filename of the predictor script | + | `CONFIG_FILE_PATH` | Local path to the configuration file (if provided) | + | `IS_PREDICTOR` | Set to `true` for predictor components | -!!! note "Passing arguments to vLLM via configuration file" - For vLLM deployments **without a predictor script**, the server configuration file is ==required== and it is used to configure the vLLM server. - For example, you can use this configuration file to specify the chat template or LoRA modules to be loaded by the vLLM server. - See all available parameters in the [official documentation](https://docs.vllm.ai/en/v0.7.1/serving/openai_compatible_server.html#command-line-arguments-for-the-server). + === "Model" -### Environment variables + | Name | Description | + | ------------------ | ---------------------------------------------------------------- | + | `MODEL_FILES_PATH` | Local path to the model files (`/var/lib/hopsworks/model_files`) | + | `MODEL_NAME` | Name of the model being served by the current deployment | + | `MODEL_VERSION` | Version of the model being served by the current deployment | -A number of different environment variables is available in the predictor to ease its implementation. + === "Others" -??? info "Show environment variables" + These variables are available in all deployments. 
- | Name | Description | - | ------------------- | -------------------------------------------------------------------- | - | MODEL_FILES_PATH | Local path to the model files | - | ARTIFACT_FILES_PATH | Local path to the artifact files | - | CONFIG_FILE_PATH | Local path to the configuration file | - | DEPLOYMENT_NAME | Name of the current deployment | - | MODEL_NAME | Name of the model being served by the current deployment | - | MODEL_VERSION | Version of the model being served by the current deployment | - | ARTIFACT_VERSION | Version of the model artifact being served by the current deployment | + | Name | Description | + | ------------------------ | -------------------------------------------------- | + | `REST_ENDPOINT` | Hopsworks REST API endpoint | + | `HOPSWORKS_PROJECT_ID` | ID of the project | + | `HOPSWORKS_PROJECT_NAME` | Name of the project | + | `HOPSWORKS_PUBLIC_HOST` | Hopsworks public hostname | + | `API_KEY` | API key for authenticating with Hopsworks services | + | `PROJECT_ID` | Project ID (for Feature Store access) | + | `PROJECT_NAME` | Project name (for Feature Store access) | + | `SECRETS_DIR` | Path to secrets directory (`/keys`) | + | `MATERIAL_DIRECTORY` | Path to TLS certificates (`/certs`) | + | `REQUESTS_VERIFY` | SSL verification setting | ## Python environments -Depending on the model server and serving tool used in the model deployment, you can select the Python environment where the predictor and transformer scripts will run. +Based on the model server used in the model deployment, you can select the Python environment where the predictor and transformer scripts will run. To create a new Python environment see [Python Environments](../../projects/python/python_env_overview.md). -??? info "Show supported Python environments" +!!! 
info "Supported Python environments" - | Serving tool | Model server | Editable | Predictor | Transformer | - | ------------ | ------------------ | -------- | ------------------------------------------ | ------------------------------ | - | Kubernetes | Flask server | ❌ | `pandas-inference-pipeline` only | ❌ | - | | TensorFlow Serving | ❌ | (official) tensorflow serving image | ❌ | - | KServe | Fast API | ✅ | any `inference-pipeline` image | any `inference-pipeline` image | - | | TensorFlow Serving | ✅ | (official) tensorflow serving image | any `inference-pipeline` image | - | | vLLM | ✅ | `vllm-inference-pipeline` or `vllm-openai` | any `inference-pipeline` image | + | Model server | Predictor | Transformer | + | -------------------- | -------------------------------- | -------------------------------- | + | Python | any `*-inference-pipeline` image | any `*-inference-pipeline` image | + | KServe sklearnserver | `sklearnserver` | any `*-inference-pipeline` image | + | TensorFlow Serving | `tensorflow/serving` | any `*-inference-pipeline` image | + | vLLM | `vllm-openai` | Not supported | !!! note - The selected Python environment is used for both predictor and transformer. - Support for selecting a different Python environment for the predictor and transformer is coming soon. + For **Python model deployments**, the same Python environment is used for both predictor and transformer. -## Transformer +## Transformer script -Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model. -To learn more about transformers, see the [Transformer Guide](transformer.md). +Transformer scripts are Python scripts used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model. +To learn more about transformers, see the [Transformer (KServe) Guide](transformer.md). !!! note - Transformers are only supported in KServe deployments. 
+ Transformer scripts are ==not== supported in **vLLM deployments**. ## Inference logger Inference loggers are deployment components that log inference requests into a Kafka topic for later analysis. + To learn about the different logging modes, see the [Inference Logger Guide](inference-logger.md). ## Inference batcher Inference batchers are deployment components that apply batching to the incoming inference requests for a better throughput-latency trade-off. + To learn about the different configurations available for the inference batcher, see the [Inference Batcher Guide](inference-batcher.md). ## Resources Resources include the number of replicas for the deployment as well as the resources (i.e., memory, CPU, GPU) to be allocated per replica. + To learn about the different combinations available, see the [Resources Guide](resources.md). +## Autoscaling + +Deployments use Knative Pod Autoscaler (KPA) to automatically scale the number of replicas based on traffic, including scale-to-zero. + +To learn about the different autoscaling parameters, see the [Autoscaling Guide](autoscaling.md). + +## Scheduling + +!!! info "Kueue is required" + This feature requires Kueue to be enabled in your cluster. If Kueue is not available, queue and topology options will not be accessible. + +If the cluster has Kueue enabled, you can select a queue for your deployment from the advanced configuration. Queues control resource allocation and scheduling priority across the cluster. + +For full details on scheduling configuration, see the [Scheduling Guide](scheduling.md). + ## API protocol -Hopsworks supports both REST and gRPC as the API protocols to send inference requests to model deployments. +Depending on the model server, Hopsworks supports both REST and gRPC as the API protocols to send inference requests to model deployments. In general, you use gRPC when you need lower latency inference requests.
+ +To learn more about the REST and gRPC API protocols for model deployments, see the [API Protocol Guide](api-protocol.md). diff --git a/docs/user_guides/mlops/serving/resources.md index 2ba350d1a9..2be10d59eb 100644 --- a/docs/user_guides/mlops/serving/resources.md +++ b/docs/user_guides/mlops/serving/resources.md @@ -6,11 +6,23 @@ description: Documentation on how to allocate resources to a model deployment ## Introduction -Depending on the serving tool used to deploy a trained model, resource allocation can be configured at different levels. -While deployments on Docker containers only support a fixed number of resources (CPU and memory), using Kubernetes or KServe allows a better exploitation of the resources available in the platform, by enabling you to specify how many CPUs, GPUs, and memory are allocated to a deployment. -See the [compatibility matrix](#compatibility-matrix). +Resource allocation can be configured ==per component== (predictor and transformer) in a deployment, allowing you to specify how many CPUs and GPUs, and how much memory, are allocated. +For each component, you can set minimum (requests) and maximum (limits) resources, as well as the number of instances. -## GUI +??? info "Resource defaults" + + | Field | Default Request | Default Limit | Validation | + | ------------------ | --------------- | -------------- | -------------------------------------------------- | + | CPU (cores) | 0.2 | -1 (unlimited) | Request cannot exceed limit (unless -1, unlimited) | + | Memory (MB) | 32 | -1 (unlimited) | Request cannot exceed limit (unless -1, unlimited) | + | GPUs | 0 | 0 | Request must equal limit | + | Shared Memory (MB) | 128 | — | — | + +!!! tip "Automatic downscale of inactive instances" + Setting the number of instances to **0** for a component (predictor or transformer) enables **scale-to-zero**.
+ This means that all instances of the component will automatically scale down to zero after a default period of inactivity of 30 seconds. + +## Web UI ### Step 1: Create new deployment @@ -39,13 +51,13 @@ To navigate to the advanced creation form, click on `Advanced options`.

-### Step 3: Configure resource allocation +### Step 3: Configure resources In the `Resource allocation` section of the form, you can optionally set the resources to be allocated to the predictor and/or the transformer (if available). Moreover, you can choose the minimum number of replicas for each of these components. -??? note "Scale-to-zero capabilities" - Deployments with KServe enabled can scale to zero by choosing `0` as the number of instances. +!!! note "Scale-to-zero capabilities" + Set the number of instances to **0** to enable scale-to-zero on the component.

@@ -103,7 +115,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott num_instances=2, requests=minimum_res, limits=maximum_res ) - ``` ### Step 4: Create a deployment with the resource configuration @@ -126,22 +137,13 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_deployment = ms.create_deployment(my_predictor) my_deployment.save() - ``` ### API Reference [`Resources`][hsml.resources.Resources] -## Compatibility matrix - -??? info "Show supported resource allocation configuration" +## Autoscaling - | Serving tool | Component | Resources | - | ------------ | ----------- | --------------------------- | - | Docker | Predictor | Fixed | - | | Transformer | ❌ | - | Kubernetes | Predictor | Minimum resources | - | | Transformer | ❌ | - | KServe | Predictor | Minimum / maximum resources | - | | Transformer | Minimum / maximum resources | +Deployments can be configured to automatically scale the number of replicas based on traffic. +To learn about the different autoscaling parameters, see the [Autoscaling Guide](autoscaling.md). diff --git a/docs/user_guides/mlops/serving/rest-api.md b/docs/user_guides/mlops/serving/rest-api.md index d7e99de1e1..bc3726add2 100644 --- a/docs/user_guides/mlops/serving/rest-api.md +++ b/docs/user_guides/mlops/serving/rest-api.md @@ -2,51 +2,110 @@ ## Introduction -Hopsworks provides model serving capabilities by leveraging [KServe](https://kserve.github.io/website/) as the model serving platform and [Istio](https://istio.io/) as the ingress gateway to the model deployments. +Hopsworks provides ==model serving capabilities== by leveraging [KServe](https://kserve.github.io/website/) as the model serving platform and [Istio](https://istio.io/) as the ingress gateway to the model deployments. This document explains how to interact with a model deployment via REST API. -## Base URL +!!! 
tip "Tutorials" + End-to-end examples are available in the [hopsworks-tutorials](https://github.com/logicalclocks/hopsworks-tutorials/tree/master) repository. -Deployed models are accessible through the Istio ingress gateway. -The URL to interact with a model deployment is provided on the model deployment page in the Hopsworks UI. +## Sending Inference Requests through Istio Ingress -The URL follows the format `http:///`, where `RESOURCE_PATH` depends on the [`Predictor.model_server`][hsml.predictor.Predictor.model_server] (e.g., vLLM, TensorFlow Serving, SKLearn ModelServer). +The full inference URL is constructed by combining a base path with a model server-specific suffix. +See [URL Paths](#url-paths) for the complete URL format and examples. -

-

- Endpoints -
Deployment Endpoints
-
-

- -## Authentication +### Authentication All requests must include an API Key for authentication. -You can create an API by following this [guide](../../projects/api_key/create_api_key.md). +You can create an API key by following this [guide](../../projects/api_key/create_api_key.md). -Include the key in the Authorization header: +Include the key in the `authorization` header: ```text -Authorization: ApiKey +authorization: ApiKey ``` -## Headers +### Headers + +| Header | Description | Example Value | +| --------------- | ----------------------------------- | ----------------------- | +| `authorization` | API key for authentication. | `ApiKey ` | +| `content-type` | Request payload type (always JSON). | `application/json` | + +## URL Paths + +Deployed models are accessible through the ==Istio ingress gateway== using **path-based** routing. +The full URL is constructed by combining the base path with a model server-specific suffix. +This URL is also provided on the model deployment page in the Hopsworks UI. + +!!! example "" + **`/`** + +Where `server-specific_suffix` depends on the model server type (see [ML Inference Paths](#ml-inference) or [OpenAI-compatible Paths](#openai-compatible)). + +### Base URL + +The base URL is composed of the **Istio ingress gateway IP**, the **project name**, and the **deployment name**. + +!!! example "" + **`https:///v1//`** + +!!! warning "Host-based routing (legacy)" + Prior to path-based routing, requests were routed using a `Host` header matching the model deployment hostname, and **`https://`** as base url. + + ``` + Host: .. + ``` + + Each model deployment gets its own Knative-generated hostname, and routing depends on the `Host` header matching Istio ingress gateway rules. + + Path-based routing (described above) is the preferred method for external access. + +### ML Inference + +For model deployments using Python, KServe sklearnserver, or TensorFlow Serving, the URL follows the KServe V1 inference protocol. + +!!! 
info "Supported verbs and path format" + | Model Server | Supported Verbs | Path Format | + | -------------------- | -------------------------------- | ------------------------------------ | + | Python | `predict` | `/v1/models/:` | + | KServe sklearnserver | `predict` | `/v1/models/:` | + | TensorFlow Serving | `predict`, `classify`, `regress` | `/v1/models/:` | + +!!! tip "Hopsworks Python API" + + ML inference urls can be retrieved using the `Deployment` class. + + ```python + # Returns: https:///v1///v1/models/:predict + inference_url = deployment.get_inference_url() + ``` + +### OpenAI-compatible + +==vLLM deployments== provide an OpenAI API-compatible endpoint at `/v1/`, allowing you to send any standard OpenAI API request to the vLLM server. + +!!! example "e.g., Chat Completions endpoint" + **`/v1/chat/completions`** + +Refer to the official [vLLM OpenAI-compatible server documentation](https://docs.vllm.ai/en/v0.10.2/serving/openai_compatible_server.html) for details about the available APIs. + +!!! tip "Hopsworks Python API" + + OpenAI-compatible urls can be retrieved using the `Deployment` class. -| Header | Description | Example Value | -| --------------- | ------------------------------------------- | ------------------------------------ | -| `Host` | Model’s hostname, provided in Hopsworks UI. | `fraud.test.hopsworks.ai` | -| `Authorization` | API key for authentication. | `ApiKey ` | -| `Content-Type` | Request payload type (always JSON). | `application/json` | + ```python + # Returns: https:///v1///v1 + # Append /chat/completions or /completions for specific endpoints + openai_url = deployment.get_openai_url() + ``` ## Request Format -The request format depends on the model sever being used. +The request format depends on the model server being used. -For predictive inference (i.e., for Tensorflow or SkLearn or Python Serving). -The request must be sent as a JSON object containing an `inputs` or `instances` field. 
+For predictive inference (TensorFlow, sklearn, or Python model server), the request must be sent as a JSON object containing an `inputs` or `instances` field. See [more information on the request format](https://kserve.github.io/website/docs/concepts/architecture/data-plane/v1-protocol#request-format). -An example for this is given below. !!! example "REST API example for Predictive Inference (Tensorflow or SkLearn or Python Serving)" === "Python" @@ -56,41 +115,71 @@ An example for this is given below. data = {"inputs": [[4641025220953719, 4920355418495856]]} - headers = { - "Host": "fraud.test.hopsworks.ai", - "Authorization": "ApiKey 8kDOlnRlJU4kiV1Y.RmFNJY3XKAUSqmJZ03kbUbXKMQSHveSBgMIGT84qrM5qXMjLib7hdlfGeg8fBQZp", - "Content-Type": "application/json", - } + headers = {"authorization": "ApiKey ", "content-type": "application/json"} response = requests.post( - "http://10.87.42.108/v1/models/fraud:predict", headers=headers, json=data + "https:///v1/my_project/fraud/v1/models/fraud:predict", + headers=headers, + json=data, ) print(response.json()) - - ``` === "Curl" ```bash - curl -X POST "http://10.87.42.108/v1/models/fraud:predict" \ - -H "Host: fraud.test.hopsworks.ai" \ - -H "Authorization: ApiKey 8kDOlnRlJU4kiV1Y.RmFNJY3XKAUSqmJZ03kbUbXKMQSHveSBgMIGT84qrM5qXMjLib7hdlfGeg8fBQZp" \ - -H "Content-Type: application/json" \ + curl -X POST "https:///v1/my_project/fraud/v1/models/fraud:predict" \ + -H "authorization: ApiKey " \ + -H "content-type: application/json" \ -d '{ "inputs": [ - [ - 4641025220953719, - 4920355418495856 - ] + [4641025220953719, 4920355418495856] ] }' ``` -For generative inference (i.e vLLM) the response follows the [OpenAI specification](https://platform.openai.com/docs/api-reference/chat/create). +For generative inference (vLLM), the request follows the [OpenAI specification](https://docs.vllm.ai/en/v0.10.2/serving/openai_compatible_server.html) supported by the vLLM OpenAI-compatible server. + +!!! 
example "vLLM chat completions" + === "Python" + + ```python + import requests + + data = { + "model": "my-llm", + "messages": [{"role": "user", "content": "Hello, how are you?"}], + } + + headers = {"authorization": "ApiKey ", "content-type": "application/json"} + + response = requests.post( + "https:///v1/my_project/my-llm/v1/chat/completions", + headers=headers, + json=data, + ) + print(response.json()) + ``` + + === "Curl" + + ```bash + curl -X POST "https:///v1/my_project/my-llm/v1/chat/completions" \ + -H "authorization: ApiKey " \ + -H "content-type: application/json" \ + -d '{ + "model": "my-llm", + "messages": [ + {"role": "user", "content": "Hello, how are you?"} + ] + }' + ``` + +## CORS + +The Istio EnvoyFilter handles CORS preflight (`OPTIONS`) requests automatically. Allowed origins can be configured via `istio.envoyFilter.corsAllowedOrigins` in the Helm chart configuration. ## Response The model returns predictions in a JSON object. The response depends on the model server implementation. -You can find more information regarding specific model servers in the [Kserve documentation](https://kserve.github.io/website/docs/intro). diff --git a/docs/user_guides/mlops/serving/scheduling.md b/docs/user_guides/mlops/serving/scheduling.md new file mode 100644 index 0000000000..074a1e6148 --- /dev/null +++ b/docs/user_guides/mlops/serving/scheduling.md @@ -0,0 +1,101 @@ +--- +description: Documentation on how to configure scheduling options for a model deployment +--- + +# How To Configure Scheduling For A Model Deployment + +## Introduction + +Scheduling configuration determines how and where your model deployment pods are placed in the Kubernetes cluster. 
+Hopsworks supports Kubernetes scheduler abstractions such as [node affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity), [anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity), and [priority classes](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/), as well as advanced scheduling with [Kueue queues](https://kueue.sigs.k8s.io/docs/concepts/local_queue/) and [topologies](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/). + +!!! tip "Scheduling available for all workloads" + In addition to model deployments, all scheduling options are also available for jobs, Jupyter notebooks, and Python deployments. + +## Web UI + +### Step 1: Create new deployment + +If you have at least one model already trained and saved in the Model Registry, navigate to the deployments page by clicking on the `Deployments` tab on the navigation menu on the left. + +

+

+ Deployments navigation tab +
Deployments navigation tab
+
+

+ +Once in the deployments page, you can create a new deployment by either clicking on `New deployment` (if there are no existing deployments) or on `Create new deployment` in the top-right corner. +Both options will open the deployment creation form. + +### Step 2: Go to advanced options + +A simplified creation form will appear including the most common deployment fields from all available configurations. +Scheduling is part of the advanced options of a deployment. +To navigate to the advanced creation form, click on `Advanced options`. + +

+

+ Advance options +
Advanced options. Go to advanced deployment creation form
+
+

+ +### Step 3: Configure scheduling + +In the advanced creation form, go to the **Scheduler** section to set up scheduling options for your deployment. +Here, you can specify [affinity, anti-affinity, and priority classes](#affinity-anti-affinity-and-priority-classes) to control how your deployment pods are scheduled within the cluster. + +

+

+ Affinity and Priority Classes +
Configure affinity and priority classes for the model deployment
+
+

+ +If Kueue is ==enabled==, you can also select a [queue and topology](#queues-and-topologies) for your deployment. + +

+

+ Select a queue for the deployment +
Select a queue for the model deployment
+
+

+ +

+

+ Select a topology unit for the deployment +
Select a topology unit for the model deployment
+
+

+ +Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model. + +## Affinity, Anti-Affinity, and Priority Classes + +You can configure [node affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity), [anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity), and [priority classes](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) to control pod placement and scheduling priority for your deployment. + +- **Affinity**: Constrains which nodes the deployment pods can run on based on node labels (e.g., GPU nodes, specific zones). +- **Anti-Affinity**: Prevents pods from running on nodes with specific labels. +- **Priority Class**: Determines the scheduling and eviction priority of pods. + Higher priority pods are scheduled first and can preempt lower priority pods. + +## Queues and Topologies + +!!! warning "Kueue is required" + This feature requires Kueue to be enabled in your cluster. + If Kueue is not available, queue and topology options will not be accessible. + +If the cluster has Kueue enabled, you can select a queue for your deployment. +[Queues](https://kueue.sigs.k8s.io/docs/concepts/local_queue/) control resource allocation and scheduling priority across the cluster. +Administrators define quotas on how many resources a queue can use, and queues can be grouped in cohorts to borrow resources from each other. + +You can also select a [topology](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/) unit to control how deployment pods are co-located. +For example, you can require all pods to run on the same host to minimize network latency. 
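+
+The affinity, anti-affinity, and priority class options above correspond to standard Kubernetes pod-spec fields. The fragment below is purely illustrative — the label keys, label values, and priority class name are hypothetical, and Hopsworks generates the actual spec for you from the form:
+
+```yaml
+# Illustrative only: roughly what the scheduling options translate to in a pod spec.
+spec:
+  priorityClassName: high-priority        # hypothetical priority class
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:
+        nodeSelectorTerms:
+          - matchExpressions:
+              - key: accelerator          # affinity: only nodes labeled as GPU nodes
+                operator: In
+                values: ["nvidia-gpu"]
+              - key: zone                 # anti-affinity: keep pods out of a given zone
+                operator: NotIn
+                values: ["zone-b"]
+```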
+ +## Learn more + +For detailed documentation on scheduling abstractions and cluster-level configuration, see the following guides: + +- [Scheduler](../../projects/scheduling/kube_scheduler.md) — Affinity, anti-affinity, priority classes, and project-level defaults +- [Kueue Details](../../projects/scheduling/kueue_details.md) — Queues, cohorts, topologies, and resource flavors diff --git a/docs/user_guides/mlops/serving/transformer.md b/docs/user_guides/mlops/serving/transformer.md index 9abf279d59..463880a5c5 100644 --- a/docs/user_guides/mlops/serving/transformer.md +++ b/docs/user_guides/mlops/serving/transformer.md @@ -6,23 +6,29 @@ description: Documentation on how to configure a KServe transformer for a model ## Introduction -In this guide, you will learn how to configure a transformer in a deployment. +In this guide, you will learn how to configure a transformer script in a model deployment. -Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model. -They run on a built-in Flask server provided by Hopsworks and require a user-provided python script implementing the [Transformer class](#step-2-implement-transformer-script). +Transformer scripts are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model. +They are user-provided Python scripts (`.py` or `.ipynb`) implementing the [Transformer class](#step-2-implement-transformer-script). -???+ warning - Transformers are only supported in deployments using KServe as serving tool. +!!! info "Transformer scripts are not supported in vLLM deployments." -A transformer has two configurable components: +!!! tip "Independent scaling" + The transformer has independent resources and autoscaling configuration from the predictor. + This allows you to scale the pre/post-processing separately from the model inference. 
+ +A transformer has the following configurable components: !!! info "" - 1. [User-provided script](#step-2-implement-transformer-script) - 5. [Resources](#resources) + 1. [Transformer script](#transformer-script) + 2. [Resources](#resources) + 3. [Autoscaling](#autoscaling) + 4. [Python environments](#python-environments) + 5. [Environment variables](#environment-variables) See examples of transformer scripts in the serving [example notebooks](https://github.com/logicalclocks/hops-examples/blob/master/notebooks/ml/serving). -## GUI +## Web UI ### Step 1: Create new deployment @@ -53,17 +59,7 @@ To navigate to the advanced creation form, click on `Advanced options`. ### Step 3: Select a transformer script -Transformers require KServe as the serving platform for the deployment. -Make sure that KServe is enabled for this deployment by activating the corresponding checkbox. - -

-  (figure: "Enable KServe in the advanced deployment form")

- -Then, if the transformer script is already located in Hopsworks, click on `From project` and navigate through the file system to find your script. +If the transformer script is already located in Hopsworks, click on `From project` and navigate through the file system to find your script. Otherwise, you can click on `Upload new file` to upload the transformer script now.

@@ -73,15 +69,12 @@ Otherwise, you can click on `Upload new file` to upload the transformer script n

-After selecting the transformer script, you can optionally configure resource allocation for your transformer (see [Step 4](#step-4-optional-configure-resource-allocation)).
+After selecting the transformer script, you can optionally configure resources and autoscaling for your transformer (see [Step 4](#step-4-optional-other-advanced-options)).
 Otherwise, click on `Create new deployment` to create the deployment for your model.
 
-### Step 4 (Optional): Configure resource allocation
-
-At the end of the page, you can configure the resources to be allocated for the transformer, as well as the minimum and maximum number of replicas to be deployed.
+### Step 4 (Optional): Other advanced options
 
-??? note "Scale-to-zero capabilities"
-    Deployments with KServe enabled can scale to zero by choosing `0` as the number of instances.
+On this page, you can also configure the [resources](resources.md) to be allocated for the transformer, as well as the [autoscaling](autoscaling.md) parameters to control how the transformer scales based on traffic.

@@ -106,8 +99,8 @@ Once you are done with the changes, click on `Create new deployment` at the bott # get Dataset API instance dataset_api = project.get_dataset_api() - # get Hopsworks Model Serving handle - ms = project.get_model_serving() + # get Hopsworks Model Registry handle + mr = project.get_model_registry() ``` ### Step 2: Implement transformer script @@ -118,6 +111,7 @@ Once you are done with the changes, click on `Create new deployment` at the bott class Transformer: def __init__(self): """Initialization code goes here""" + # Optional __init__ params: project, deployment, model, async_logger pass def preprocess(self, inputs): @@ -129,6 +123,26 @@ Once you are done with the changes, click on `Create new deployment` at the bott return outputs ``` +!!! tip "Optional `__init__` parameters" + The `__init__` method supports optional parameters that are automatically injected at runtime: + + | Parameter | Class | Description | + | -------------- | -------------------- | ------------------------------------------------------ | + | `project` | `Project` | Hopsworks project handle | + | `deployment` | `Deployment` | Current model deployment handle | + | `model` | `Model` | Model handle | + | `async_logger` | `AsyncFeatureLogger` | Async feature logger for logging features to Hopsworks | + + You can add any combination of these parameters to your `__init__` method: + + ```python + class Transformer: + def __init__(self, project, model): + # Access the project and model directly + self.project = project + self.model_metadata = model + ``` + !!! info "Jupyter magic" In a jupyter notebook, you can add `%%writefile my_transformer.py` at the top of the cell to save it as a local file. 
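To make the skeleton above concrete, here is a minimal sketch of a transformer script. The normalization constants and the `{"instances": [...]}` payload shape are illustrative assumptions for this example, not requirements imposed by Hopsworks:

```python
# my_transformer.py: a minimal, illustrative transformer sketch.

class Transformer:
    def __init__(self):
        # Hypothetical normalization constants; in practice they could be
        # loaded from the artifact files or computed from training data.
        self.mean = 10.0
        self.std = 2.0

    def preprocess(self, inputs):
        # Scale each feature value before it reaches the predictor.
        instances = inputs["instances"]
        scaled = [[(x - self.mean) / self.std for x in row] for row in instances]
        return {"instances": scaled}

    def postprocess(self, outputs):
        # Round raw model outputs before returning them to the client.
        return [round(o, 4) for o in outputs]
```
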
@@ -146,7 +160,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott "/Projects", project.name, uploaded_file_path ) - ``` ### Step 4: Define a transformer @@ -162,43 +175,111 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_transformer = Transformer(script_file) - ``` ### Step 5: Create a deployment with the transformer +Use the `transformer` parameter to set the transformer configuration when creating the model deployment. + === "Python" ```python - my_predictor = ms.create_predictor(transformer=my_transformer) - my_deployment = my_predictor.deploy() - - # or - my_deployment = ms.create_deployment(my_predictor, transformer=my_transformer) - my_deployment.save() - + my_model = mr.get_model("my_model", version=1) + my_deployment = my_model.deploy( + transformer=my_transformer + ) ``` ### API Reference [`Transformer`][hsml.transformer.Transformer] +## Transformer script + +A transformer script is a custom Python script to apply pre/post-processing on the model inputs and outputs. +This script is included in the [artifact files](../serving/deployment.md#artifact-files) of the deployment. +The script must implement the `Transformer` class, as shown in [Step 2](#step-2-implement-transformer-script). + +!!! info "Transformer scripts are not supported in vLLM deployments." + ## Resources Resources include the number of replicas for the deployment as well as the resources (i.e., memory, CPU, GPU) to be allocated per replica. + To learn about the different combinations available, see the [Resources Guide](resources.md). +## Autoscaling + +The transformer has independent autoscaling from the predictor. +Deployments use Knative Pod Autoscaler (KPA) to automatically scale the number of replicas based on traffic, including scale-to-zero. + +To learn about the different autoscaling parameters, see the [Autoscaling Guide](autoscaling.md). 
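A transformer script can also make use of the environment variables that Hopsworks sets in the container (documented in the Environment variables section of this page). A hedged sketch, assuming only that `DEPLOYMENT_NAME` and `ARTIFACT_FILES_PATH` are present at runtime:

```python
import os

class Transformer:
    def __init__(self):
        # Read deployment metadata injected by Hopsworks; fall back to
        # defaults so the script also runs outside a deployment.
        self.deployment_name = os.environ.get("DEPLOYMENT_NAME", "unknown")
        self.artifact_dir = os.environ.get("ARTIFACT_FILES_PATH", ".")

    def preprocess(self, inputs):
        return inputs

    def postprocess(self, outputs):
        # Tag responses with the deployment that produced them.
        return {"deployment": self.deployment_name, "predictions": outputs}
```
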
+
 ## Environment variables
 
-A number of different environment variables is available in the transformer to ease its implementation.
+A number of environment variables are available in the transformer to ease its implementation.
 
-??? info "Show environment variables"
+!!! tip "Available environment variables"
+
+    === "Deployment"
+
+        These variables are available in all deployments.
+
+        | Name                  | Description                      |
+        | --------------------- | -------------------------------- |
+        | `DEPLOYMENT_NAME`     | Name of the current deployment   |
+        | `DEPLOYMENT_VERSION`  | Version of the deployment        |
+        | `ARTIFACT_FILES_PATH` | Local path to the artifact files |
+
+    === "Transformer"
+
+        These variables are set for transformer components.
+
+        | Name               | Description                                        |
+        | ------------------ | -------------------------------------------------- |
+        | `SCRIPT_PATH`      | Full path to the transformer script                |
+        | `SCRIPT_NAME`      | Prefixed filename of the transformer script        |
+        | `CONFIG_FILE_PATH` | Local path to the configuration file (if provided) |
+        | `IS_TRANSFORMER`   | Set to `true` for transformer components           |
+
+    === "Model"
+
+        | Name            | Description                                                 |
+        | --------------- | ----------------------------------------------------------- |
+        | `MODEL_NAME`    | Name of the model being served by the current deployment    |
+        | `MODEL_VERSION` | Version of the model being served by the current deployment |
+
+    === "Others"
+
+        These variables are available in all deployments.
+ + | Name | Description | + | ------------------------ | -------------------------------------------------- | + | `REST_ENDPOINT` | Hopsworks REST API endpoint | + | `HOPSWORKS_PROJECT_ID` | ID of the project | + | `HOPSWORKS_PROJECT_NAME` | Name of the project | + | `HOPSWORKS_PUBLIC_HOST` | Hopsworks public hostname | + | `API_KEY` | API key for authenticating with Hopsworks services | + | `PROJECT_ID` | Project ID (for Feature Store access) | + | `PROJECT_NAME` | Project name (for Feature Store access) | + | `SECRETS_DIR` | Path to secrets directory (`/keys`) | + | `MATERIAL_DIRECTORY` | Path to TLS certificates (`/certs`) | + | `REQUESTS_VERIFY` | SSL verification setting | + +## Python environments + +Transformer scripts always run on `*-inference-pipeline` Python environments. +To create a new Python environment see [Python Environments](../../projects/python/python_env_overview.md). + +!!! note + For **Python model deployments**, the same Python environment is used for both predictor and transformer. + +!!! 
info "Supported Python environments" - | Name | Description | - | ------------------- | -------------------------------------------------------------------- | - | ARTIFACT_FILES_PATH | Local path to the model artifact files | - | DEPLOYMENT_NAME | Name of the current deployment | - | MODEL_NAME | Name of the model being served by the current deployment | - | MODEL_VERSION | Version of the model being served by the current deployment | - | ARTIFACT_VERSION | Version of the model artifact being served by the current deployment | + | Model server | Predictor | Transformer | + | -------------------- | -------------------------------- | -------------------------------- | + | Python | any `*-inference-pipeline` image | any `*-inference-pipeline` image | + | KServe sklearnserver | `sklearnserver` | any `*-inference-pipeline` image | + | TensorFlow Serving | `tensorflow/serving` | any `*-inference-pipeline` image | + | vLLM | `vllm-openai` | Not supported | diff --git a/docs/user_guides/mlops/serving/troubleshooting.md b/docs/user_guides/mlops/serving/troubleshooting.md index f02d942ab8..53d92de23d 100644 --- a/docs/user_guides/mlops/serving/troubleshooting.md +++ b/docs/user_guides/mlops/serving/troubleshooting.md @@ -9,10 +9,15 @@ description: Documentation on how to troubleshoot a model deployment In this guide, you will learn how to troubleshoot a deployment that is having issues to serve a trained model. But before that, it is important to understand how [deployment states](deployment-state.md) are defined and the possible transitions between conditions. +Before a deployment starts, it goes through a `CREATING` phase where deployment artifacts are prepared. When a deployment is starting, it follows an ordered sequence of [states](deployment-state.md#deployment-conditions) before becoming ready for serving predictions. Similarly, it follows an ordered sequence of states when being stopped, although with fewer steps. -## GUI +!!! 
warning "`FAILED` is a terminal state" + If a deployment reaches the `FAILED` state, it cannot recover on its own. + You must stop and restart the deployment to attempt recovery. + +## Web UI ### Step 1: Inspect deployment status @@ -102,7 +107,7 @@ To access the OpenSearch Dashboards, click on the `See logs` button at the top o Once in the OpenSearch Dashboards, you can search for keywords, apply multiple filters and sort the records by timestamp. -??? info "Show available filters" +??? info "Available filters" | Filter | Description | | -------------- | -------------------------------------------------------------------------------------------------------- | @@ -135,7 +140,6 @@ Once in the OpenSearch Dashboards, you can search for keywords, apply multiple f ```python deployment = ms.get_deployment("mydeployment") - ``` ### Step 3: Get current deployment's predictor state @@ -147,7 +151,6 @@ Once in the OpenSearch Dashboards, you can search for keywords, apply multiple f state.describe() - ``` ### Step 4: Explore transient logs @@ -157,7 +160,6 @@ Once in the OpenSearch Dashboards, you can search for keywords, apply multiple f ```python deployment.get_logs(component="predictor|transformer", tail=10) - ``` ### API Reference diff --git a/docs/user_guides/projects/project/create_project.md b/docs/user_guides/projects/project/create_project.md index 6e92994372..d218effddf 100644 --- a/docs/user_guides/projects/project/create_project.md +++ b/docs/user_guides/projects/project/create_project.md @@ -8,7 +8,7 @@ In this guide, you will learn how to create a new project. A valid project name can only contain characters a-z, A-Z, 0-9 and special characters ‘_’ and ‘.’ but not ‘__’ (double underscore). There is also a number of [reserved project names](#reserved-project-names) that can not be used. 
-## GUI
+## Web UI
 
 ### Step 1: Create a project
 
diff --git a/docs/user_guides/projects/python-deployment/python-deployment.md b/docs/user_guides/projects/python-deployment/python-deployment.md
new file mode 100644
index 0000000000..f6b23a2d7d
--- /dev/null
+++ b/docs/user_guides/projects/python-deployment/python-deployment.md
@@ -0,0 +1,233 @@
+---
+description: Documentation on how to create Python deployments
+---
+
+# Python Deployment
+
+## Introduction
+
+Python deployments allow you to deploy a Python script as a service without requiring a model artifact in the Model Registry.
+This is useful for custom inference pipelines, feature view deployments, or any Python-based program that needs to be served behind an HTTP endpoint.
+
+!!! warning "Incoming requests are directed to port 8080"
+    Python deployments run your script directly on port 8080.
+    Therefore, make sure your implementation listens on port 8080 to handle incoming requests.
+
+!!! info "gRPC protocol not supported"
+
+!!! tip "Use your favourite HTTP server"
+    There are no constraints on the framework or library used: you can use Flask, FastAPI, or any other HTTP server.
+
+In each Python deployment, you can configure the following:
+
+!!! info ""
+    1. [Python environments](#python-environments)
+    2. [Resources](#resources)
+    3. [Autoscaling](#autoscaling)
+    4. [Scheduling](#scheduling)
+
+## Web UI
+
+### Step 1: Create new deployment
+
+Navigate to the deployments page by clicking on the `Deployments` tab on the navigation menu on the left.
+

+  (figure: "Deployments navigation tab")

+
+Then, click on `New Python deployment`.
+
+### Step 2: Configure the deployment
+
+Choose a name for your Python deployment.
+Then, provide the script for your Python program by clicking on `From project` or `Upload new file`.
+
+### Step 3 (Optional): Change Python environment
+
+Python deployments run the scripts in one of the [Python Environments](../../projects/python/python_env_overview.md) available in your project.
+This environment must have all the necessary dependencies for your Python program.
+
+Hopsworks provides a collection of built-in environments like `minimal-inference-pipeline`, `pandas-inference-pipeline` or `torch-inference-pipeline` with different sets of libraries pre-installed.
+By default, the `pandas-inference-pipeline` Python environment is used in Python deployments.
+
+To create your own environment, it is recommended to [clone](../../projects/python/python_env_clone.md) the `minimal-inference-pipeline` or `pandas-inference-pipeline` environment and install additional dependencies needed for your Python program.
+

+  (figure: "Select an environment for the Python program")

+ +### Step 4 (Optional): Advanced configuration + +Click on `Advanced options` to configure your Python deployment further, including: + +!!! info "" + 1. [Resources](#resources) + 2. [Autoscaling](#autoscaling) + 3. [Scheduling](#scheduling) + +Once you are done with the changes, click on `Create new Python deployment` at the bottom of the page to create the Python deployment. + +## Code + +### Step 1: Connect to Hopsworks + +=== "Python" + + ```python + import hopsworks + + project = hopsworks.login() + + # get Hopsworks Model Serving handle + ms = project.get_model_serving() + ``` + +### Step 2: Implement a Python script + +=== "Python" + + ```python + import uvicorn + from fastapi import FastAPI + + app = FastAPI() + + + @app.get("/ping") + async def ping(): + return {"status": "ready"} + + + @app.post("/echo") + async def echo(data: dict): + return data + + + if __name__ == "__main__": + uvicorn.run(app, host="0.0.0.0", port=8080) + ``` + +!!! info "Jupyter magic" + In a jupyter notebook, you can add `%%writefile python_server.py` at the top of the cell to save it as a local file. + +### Step 3: Upload the script to your project + +=== "Python" + + ```python + import os + + dataset_api = project.get_dataset_api() + + uploaded_file_path = dataset_api.upload("python_server.py", "Resources", overwrite=True) + script_path = os.path.join("/Projects", project.name, uploaded_file_path) + ``` + +### Step 4: Create a deployment + +=== "Python" + + ```python + py_server = ms.create_endpoint( + name="pyserver", + script_file=script_path + ) + py_deployment = py_server.deploy() + ``` + +### Step 5: Send requests + +=== "Python" + + ```python + import requests + + url = py_deployment.get_endpoint_url() + + response = requests.post(f"{url}/echo", json={"key": "value"}) + print(response.json()) + ``` + +## Environment variables + +A number of different environment variables is available in the Python deployment to ease its implementation. + +!!! 
tip "Available environment variables" + + === "Deployment" + + These variables are available in all deployments. + + | Name | Description | + | --------------------- | -------------------------------- | + | `DEPLOYMENT_NAME` | Name of the current deployment | + | `DEPLOYMENT_VERSION` | Version of the deployment | + | `ARTIFACT_FILES_PATH` | Local path to the artifact files | + + === "Python deployment" + + These variables are specific to Python deployments. + + | Name | Description | + | ------------------ | -------------------------------------------------- | + | `SCRIPT_PATH` | Full path to the Python script | + | `SCRIPT_NAME` | Prefixed filename of the Python script | + | `CONFIG_FILE_PATH` | Local path to the configuration file (if provided) | + + === "Others" + + These variables are available in all deployments. + + | Name | Description | + | ------------------------ | -------------------------------------------------- | + | `REST_ENDPOINT` | Hopsworks REST API endpoint | + | `HOPSWORKS_PROJECT_ID` | ID of the project | + | `HOPSWORKS_PROJECT_NAME` | Name of the project | + | `HOPSWORKS_PUBLIC_HOST` | Hopsworks public hostname | + | `API_KEY` | API key for authenticating with Hopsworks services | + | `PROJECT_ID` | Project ID (for Feature Store access) | + | `PROJECT_NAME` | Project name (for Feature Store access) | + | `SECRETS_DIR` | Path to secrets directory (`/keys`) | + | `MATERIAL_DIRECTORY` | Path to TLS certificates (`/certs`) | + | `REQUESTS_VERIFY` | SSL verification setting | + +## Python environments + +Python deployments run in one of the `*-inference-pipeline` Python environments available in your project. +Hopsworks provides built-in environments like `minimal-inference-pipeline`, `pandas-inference-pipeline` or `torch-inference-pipeline` with different sets of libraries pre-installed. +By default, the `pandas-inference-pipeline` environment is used. 
+ +To create your own environment, it is recommended to [clone](../../projects/python/python_env_clone.md) the `minimal-inference-pipeline` or `pandas-inference-pipeline` environment and install additional dependencies needed for your Python program. +To learn more about Python environments, see [Python Environments](../../projects/python/python_env_overview.md). + +## Resources + +Configure CPU, memory, and GPU allocation for your Python deployment. +Each deployment component has separate request and limit values. + +For full details on resource configuration, see the [Resources Guide](../../mlops/serving/resources.md). + +## Autoscaling + +Deployments use **Knative Pod Autoscaler (KPA)** to automatically scale the number of replicas based on traffic. +You can configure the minimum and maximum number of instances as well as the scale metric (requests per second or concurrency). + +For full details on autoscaling parameters, see the [Autoscaling Guide](../../mlops/serving/autoscaling.md). + +## Scheduling + +!!! info "Kueue is required" + This feature requires Kueue to be enabled in your cluster. + If Kueue is not available, queue and topology options will not be accessible. + +If the cluster has Kueue enabled, you can select a queue for your deployment from the advanced configuration. +Queues control resource allocation and scheduling priority across the cluster. + +For full details on scheduling configuration, see the [Scheduling Guide](../../mlops/serving/scheduling.md). 
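The FastAPI script shown in the Code section is only one option; since any HTTP server listening on port 8080 works, a dependency-free sketch using only the Python standard library could look like the following (the `/ping` and `/echo` routes mirror the earlier example, and are assumptions of this sketch, not required routes):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def _send_json(self, payload: bytes):
        self.send_response(200)
        self.send_header("content-type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        if self.path == "/ping":
            self._send_json(json.dumps({"status": "ready"}).encode())
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        # Echo the request body back to the client.
        length = int(self.headers.get("content-length", 0))
        self._send_json(self.rfile.read(length))


def run(port=8080):
    # Python deployments receive incoming traffic on port 8080.
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

When used as a deployment script, the file would end with a call to `run()`.
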
diff --git a/docs/user_guides/projects/python-deployment/rest-api.md b/docs/user_guides/projects/python-deployment/rest-api.md
new file mode 100644
index 0000000000..9ab0d004b4
--- /dev/null
+++ b/docs/user_guides/projects/python-deployment/rest-api.md
@@ -0,0 +1,109 @@
+---
+description: Documentation on how to interact with a Python deployment via REST API
+---
+
+# Python Deployment REST API
+
+## Introduction
+
+Python deployments are accessible via REST API through the [Istio](https://istio.io/) ingress gateway.
+
+This document explains how to send requests to a Python deployment.
+
+!!! tip "Tutorials"
+    End-to-end examples are available in the [hopsworks-tutorials](https://github.com/logicalclocks/hopsworks-tutorials/tree/master) repository.
+
+## Sending Requests through Istio Ingress
+
+The full URL path is constructed by combining a base path with a resource path available in the Python server.
+See [URL Paths](#url-paths) for the complete URL format and examples.
+
+### Authentication
+
+All requests must include an API Key for authentication.
+You can create an API key by following this [guide](../../projects/api_key/create_api_key.md).
+
+Include the key in the `authorization` header:
+
+```text
+authorization: ApiKey <api_key_value>
+```
+
+### Headers
+
+| Header          | Description                 | Example Value            |
+| --------------- | --------------------------- | ------------------------ |
+| `authorization` | API key for authentication. | `ApiKey <api_key_value>` |
+| `content-type`  | Request payload type.       | `application/json`       |
+
+## URL Paths
+
+Python deployments are accessible through the ==Istio ingress gateway== using **path-based** routing.
+The full URL is constructed by combining the base URL with the paths defined in your Python server.
+
+!!! example ""
+    **`<base_url>/<resource_path>`**
+
+Where `<resource_path>` depends entirely on the routes defined in your Python server implementation (e.g., `/echo`, `/predict`, `/health`).
+
+### Base URL
+
+The base URL is composed of the **Istio ingress gateway IP**, the **project name**, and the **deployment name**.
+
+!!! example ""
+    **`https://<istio_gateway_ip>/v1/<project_name>/<deployment_name>`**
+
+!!! warning "Host-based routing (legacy)"
+    Prior to path-based routing, requests were routed using a `Host` header matching the deployment hostname, and **`https://<istio_gateway_ip>`** as base URL.
+
+    ```
+    Host: <deployment_name>.<project_namespace>.<domain>
+    ```
+
+    Each deployment gets its own Knative-generated hostname, and routing depends on the `Host` header matching Istio ingress gateway rules.
+
+    Path-based routing (described above) is the preferred method for external access.
+
+!!! tip "Hopsworks Python API"
+
+    The endpoint URL can be retrieved using the `Deployment` class.
+
+    ```python
+    # Returns: https://<istio_gateway_ip>/v1/<project_name>/<deployment_name>
+    endpoint_url = deployment.get_endpoint_url()
+    ```
+
+## Request Format
+
+The request format depends entirely on your Python server implementation.
+There are no framework or protocol constraints: your server defines the expected HTTP methods, paths, and payload format.
+
+!!! example "REST API example"
+    === "Python"
+
+        ```python
+        import requests
+
+        url = deployment.get_endpoint_url()
+
+        response = requests.post(f"{url}/echo", json={"key": "value"})
+        print(response.json())
+        ```
+
+    === "Curl"
+
+        ```bash
+        curl -X POST "https://<istio_gateway_ip>/v1/my_project/pyserver/echo" \
+             -H "authorization: ApiKey <api_key_value>" \
+             -H "content-type: application/json" \
+             -d '{"key": "value"}'
+        ```
+
+## CORS
+
+The Istio EnvoyFilter handles CORS preflight (`OPTIONS`) requests automatically.
+Allowed origins can be configured via `istio.envoyFilter.corsAllowedOrigins` in the Helm chart configuration.
+
+## Response
+
+The response format depends on your Python server implementation.
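The URL composition described above can be sketched in a few lines. The `build_request` helper, the gateway IP, and the project and deployment names below are placeholders for illustration only:

```python
def build_request(istio_gateway_ip, project_name, deployment_name,
                  resource_path, api_key):
    """Compose the path-based URL and required headers for a Python deployment."""
    url = f"https://{istio_gateway_ip}/v1/{project_name}/{deployment_name}{resource_path}"
    headers = {
        "authorization": f"ApiKey {api_key}",  # API key authentication
        "content-type": "application/json",    # JSON request payload
    }
    return url, headers


# Placeholder values for illustration only.
url, headers = build_request("10.0.0.1", "my_project", "pyserver", "/echo", "<api_key_value>")
```
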
diff --git a/docs/user_guides/projects/python-deployment/troubleshooting.md b/docs/user_guides/projects/python-deployment/troubleshooting.md new file mode 100644 index 0000000000..06af2c46b4 --- /dev/null +++ b/docs/user_guides/projects/python-deployment/troubleshooting.md @@ -0,0 +1,165 @@ +--- +description: Documentation on how to troubleshoot a Python deployment +--- + +# How To Troubleshoot A Python Deployment + +## Introduction + +In this guide, you will learn how to troubleshoot a deployment that is having issues running. +But before that, it is important to understand how [deployment states](../../mlops/serving/deployment-state.md) are defined and the possible transitions between conditions. + +Before a deployment starts, it goes through a `CREATING` phase where deployment artifacts are prepared. +When a deployment is starting, it follows an ordered sequence of [states](../../mlops/serving/deployment-state.md#deployment-conditions) before becoming ready for handling requests. +Similarly, it follows an ordered sequence of states when being stopped, although with fewer steps. + +!!! warning "`FAILED` is a terminal state" + If a deployment reaches the `FAILED` state, it cannot recover on its own. + You must stop and restart the deployment to attempt recovery. + +## Web UI + +### Step 1: Inspect deployment status + +If you have at least one deployment already created, navigate to the deployments page by clicking on the `Deployments` tab on the navigation menu on the left. + +

+  (figure: "Deployments navigation tab")

+
+Once in the deployments page, find the deployment you want to inspect.
+Next to the action buttons, you can find an indicator showing the current status of the deployment.
+For a more descriptive representation, this indicator changes its color based on the status.
+
+To inspect the condition of the deployment, click on the name of the deployment to open the deployment overview page.
+
+### Step 2: Inspect condition
+
+At the top of the page, you can find the same status indicator mentioned in the previous step.
+Below it, a one-line message is shown with a more detailed description of the deployment status.
+This message is built using the current status [condition](../../mlops/serving/deployment-state.md#deployment-conditions) of the deployment.
+
+Oftentimes, the status and the one-line description are enough to understand the current state of a deployment.
+For instance, when the cluster lacks enough allocatable resources to meet the deployment requirements, a meaningful error message will be shown with the root cause.
+

+  (figure: "Condition of a deployment that cannot be scheduled")

+
+However, when the deployment fails to start, further details might be needed depending on the source of failure.
+For example, failures in the initialization or starting steps will show a less relevant message.
+In those cases, you can explore the deployment logs in search of the cause of the problem.
+

+  (figure: "Condition of a deployment that fails to start")

+
+### Step 3: Explore transient logs
+
+Each deployment is composed of several components depending on its configuration.
+Transient logs refer to component-specific logs that are directly retrieved from the component itself.
+Therefore, these logs can only be retrieved as long as the deployment components are reachable.
+
+!!! info ""
+    Transient logs are informative and fast to retrieve, facilitating the troubleshooting of deployment components at a glance.
+
+Transient logs are convenient when access to the most recent logs of a deployment is needed.
+
+!!! info
+    When a deployment is in idle state, there are no components running (i.e., scaled to zero) and, thus, no transient logs are available.
+
+!!! note
+    In the current version of Hopsworks, transient logs can only be accessed using the Hopsworks Machine Learning Python library.
+    See [an example](#step-4-explore-transient-logs).
+
+### Step 4: Explore historical logs
+
+Transient logs are continuously collected and stored in OpenSearch, where they become historical logs accessible using the integrated OpenSearch Dashboards.
+Therefore, historical logs contain the same information as transient logs.
+However, there might be cases where transient logs could not be collected in time for a specific component and, thus, not included in the historical logs.
+
+!!! info ""
+    Historical logs are persisted transient logs that can be queried, filtered and sorted using OpenSearch Dashboards, facilitating a more sophisticated exploration of past records.
+
+Historical logs are convenient when a deployment fails occasionally, either at runtime or without a clear reason.
+In this case, narrowing the inspection of component-specific logs at a concrete point in time and searching for keywords can be helpful.
+
+To access the OpenSearch Dashboards, click on the `See logs` button at the top of the deployment overview page.
+

+  (figure: "Access to historical logs of a deployment")

+ +!!! note + In case you are not familiar with the interface, you may find the [official documentation](https://opensearch.org/docs/latest/dashboards/index/) useful. + +Once in the OpenSearch Dashboards, you can search for keywords, apply multiple filters and sort the records by timestamp. + +??? info "Available filters" + + | Filter | Description | + | -------------- | ---------------------------------------- | + | component | Name of the deployment component | + | container_name | Name of the container within a component | + | serving_name | Name of the deployment | + | timestamp | Timestamp when the record was reported | + +## Code + +### Step 1: Connect to Hopsworks + +=== "Python" + + ```python + import hopsworks + + project = hopsworks.login() + + # get Hopsworks Model Serving handle + ms = project.get_model_serving() + ``` + +### Step 2: Retrieve an existing deployment + +=== "Python" + + ```python + deployment = ms.get_deployment("mydeployment") + + ``` + +### Step 3: Get current deployment state + +=== "Python" + + ```python + state = deployment.get_state() + + state.describe() + + ``` + +### Step 4: Explore transient logs + +=== "Python" + + ```python + deployment.get_logs(tail=10) + + ``` + +### API Reference + +[`Deployment`][hsml.deployment.Deployment] diff --git a/mkdocs.yml b/mkdocs.yml index 1367a588a6..26765d1eeb 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -191,6 +191,10 @@ nav: - Create API Key: user_guides/projects/api_key/create_api_key.md - AWS IAM Roles: user_guides/projects/iam_role/iam_role_chaining.md - Query Engine: user_guides/projects/trino/query_engine.md + - Python Deployment: + - Deployment Creation: user_guides/projects/python-deployment/python-deployment.md + - REST API: user_guides/projects/python-deployment/rest-api.md + - Troubleshooting: user_guides/projects/python-deployment/troubleshooting.md - MLOps: - user_guides/mlops/index.md - Model Registry: @@ -206,15 +210,17 @@ nav: - Model Evaluation Images: 
user_guides/mlops/registry/model_evaluation_images.md - Model Serving: - user_guides/mlops/serving/index.md - - Deployment: - - Deployment creation: user_guides/mlops/serving/deployment.md - - Deployment state: user_guides/mlops/serving/deployment-state.md - - Predictor: user_guides/mlops/serving/predictor.md - - Transformer: user_guides/mlops/serving/transformer.md - - Resource Allocation: user_guides/mlops/serving/resources.md - - Inference Logger: user_guides/mlops/serving/inference-logger.md - - Inference Batcher: user_guides/mlops/serving/inference-batcher.md - - API Protocol: user_guides/mlops/serving/api-protocol.md + - Model Deployment: + - Deployment Creation: user_guides/mlops/serving/deployment.md + - Deployment State: user_guides/mlops/serving/deployment-state.md + - Predictor (KServe): user_guides/mlops/serving/predictor.md + - Transformer (KServe): user_guides/mlops/serving/transformer.md + - Inference Logger: user_guides/mlops/serving/inference-logger.md + - Inference Batcher: user_guides/mlops/serving/inference-batcher.md + - Resources: user_guides/mlops/serving/resources.md + - Autoscaling: user_guides/mlops/serving/autoscaling.md + - Scheduling: user_guides/mlops/serving/scheduling.md + - API Protocol: user_guides/mlops/serving/api-protocol.md - REST API: user_guides/mlops/serving/rest-api.md - Troubleshooting: user_guides/mlops/serving/troubleshooting.md - External Access: user_guides/mlops/serving/external-access.md