As an OpenShift cluster administrator, you can manage the following Open Data Hub resources:
-
Users and groups
-
Custom workbench images
-
Applications that show in the dashboard
-
Custom deployment resources that are related to the Open Data Hub Operator, for example, CPU and memory limits and requests
-
Accelerators
-
Workload resources with Kueue
-
Workload metrics
-
External OIDC identity providers
-
Data backup
-
Monitoring and observability
-
Logs and audit records
Managing users and groups
Users with administrator access to OpenShift Container Platform can add, modify, and remove user permissions for Open Data Hub.
Overview of user types and permissions
Table 1 describes the Open Data Hub user types.
| User Type | Permissions |
|---|---|
| Users | Machine learning operations (MLOps) engineers and data scientists can access and use individual components of Open Data Hub, such as workbenches and data science pipelines. |
| Administrators | In addition to the actions permitted to users, administrators can perform administrative actions, such as managing users and groups, custom workbench images, and the applications that appear on the dashboard. |
By default, all OpenShift users have access to Open Data Hub. In addition, users in the OpenShift administrator group (cluster admins) automatically have administrator access in Open Data Hub.
Optionally, if you want to restrict access to your Open Data Hub deployment to specific users or groups, you can create user groups for users and administrators.
If you decide to restrict access, and you already have groups defined in your configured identity provider, you can add these groups to your Open Data Hub deployment. If you decide to use groups without adding these groups from an identity provider, you must create the groups in OpenShift Container Platform and then add users to them.
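For reference, if you create the groups directly in OpenShift Container Platform rather than syncing them from an identity provider, a group is a simple resource that you can apply with the OpenShift CLI. The following is a minimal sketch; the group name and user names are placeholders:
apiVersion: user.openshift.io/v1
kind: Group
metadata:
  name: odh-users            # example Open Data Hub user group
users:
  - data-scientist-1         # example OpenShift user names
  - data-scientist-2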
There are some operations relevant to Open Data Hub that require the cluster-admin role. Those operations include:
-
Adding users to the Open Data Hub user and administrator groups, if you are using groups.
-
Removing users from the Open Data Hub user and administrator groups, if you are using groups.
-
Managing custom environment and storage configuration for users in OpenShift Container Platform, such as Jupyter notebook resources, ConfigMaps, and persistent volume claims (PVCs).
|
Important
|
Although users of Open Data Hub and its components are authenticated through OpenShift, session management is separate from authentication. Logging out of OpenShift Container Platform or Open Data Hub does not affect a logged-in Jupyter session running on those platforms. As a result, when a user’s permissions change, that user must log out of all current sessions for the changes to take effect. |
Viewing Open Data Hub users
If you have defined Open Data Hub user groups, you can view the users that belong to these groups.
-
The Open Data Hub user group, administrator group, or both exist.
-
You have the
cluster-admin role in OpenShift Container Platform.
-
You have configured a supported identity provider for OpenShift Container Platform.
-
In the OpenShift Container Platform web console, click User Management → Groups.
-
Click the name of the group containing the users that you want to view.
-
For administrative users, click the name of your administrator group, for example,
odh-admins.
-
For normal users, click the name of your user group, for example,
odh-users. The Group details page for the group is displayed.
-
-
In the Users section for the relevant group, you can view the users who have permission to access Open Data Hub.
Adding users to Open Data Hub user groups
By default, all OpenShift users have access to Open Data Hub.
Optionally, you can restrict user access to your Open Data Hub instance by defining user groups. You must grant users permission to access Open Data Hub by adding user accounts to the Open Data Hub user group, administrator group, or both. You can either use the default group name, or specify a group name that already exists in your identity provider.
The user group provides the user with access to product components in the Open Data Hub dashboard, such as data science pipelines, and associated services, such as Jupyter. By default, users in the user group have access to data science pipeline applications within data science projects that they created.
The administrator group provides the user with access to developer and administrator functions in the Open Data Hub dashboard, such as data science pipelines, and associated services, such as Jupyter. Users in the administrator group can configure data science pipeline applications in the Open Data Hub dashboard for any data science project.
If you restrict access by using user groups, users that are not in the Open Data Hub user group or administrator group cannot view the dashboard and use associated services, such as Jupyter. They are also unable to access the Cluster settings page.
|
Important
|
If you are using LDAP as your identity provider, you need to configure LDAP syncing to OpenShift Container Platform. For more information, see Syncing LDAP groups. |
Follow the steps in this section to add users to your Open Data Hub administrator and user groups.
Note: You can add users in Open Data Hub but you must manage the user lists in the OpenShift Container Platform web console.
-
You have configured a supported identity provider for OpenShift Container Platform.
-
You are assigned the
cluster-admin role in OpenShift Container Platform.
-
You have defined an administrator group and user group for Open Data Hub.
-
In the OpenShift Container Platform web console, click User Management → Groups.
-
Click the name of the group you want to add users to.
-
For administrative users, click the administrator group, for example,
odh-admins.
-
For normal users, click the user group, for example,
odh-users. The Group details page for that group opens.
-
-
Click Actions → Add Users.
The Add Users dialog opens.
-
In the Users field, enter the relevant user name to add to the group.
-
Click Save.
-
Click the Details tab for each group and confirm that the Users section contains the user names that you added.
Selecting Open Data Hub administrator and user groups
By default, all users authenticated in OpenShift can access Open Data Hub.
Also by default, users with cluster-admin permissions are Open Data Hub administrators. A cluster admin is a superuser who can perform any action in any project in the OpenShift cluster. When the cluster-admin role is bound to a user with a local binding, that user has full control over quota and every action on every resource in the project.
After a cluster admin user defines additional administrator and user groups in OpenShift, you can add those groups to Open Data Hub by selecting them in the Open Data Hub dashboard.
-
You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
-
The groups that you want to select as administrator and user groups for Open Data Hub already exist in OpenShift Container Platform. For more information, see Managing users and groups.
-
From the Open Data Hub dashboard, click Settings → User management.
-
Select your Open Data Hub administrator groups: Under Data science administrator groups, click the text box and select an OpenShift group. Repeat this process to define multiple administrator groups.
-
Select your Open Data Hub user groups: Under Data science user groups, click the text box and select an OpenShift group. Repeat this process to define multiple user groups.
Important: The system:authenticated setting allows all users authenticated in OpenShift to access Open Data Hub.
-
Click Save changes.
-
Administrator users can successfully log in to Open Data Hub and have access to the Settings navigation menu.
-
Non-administrator users can successfully log in to Open Data Hub. They can also access and use individual components, such as projects and workbenches.
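If you prefer to manage this selection as code instead of in the dashboard, the chosen groups are stored in the dashboard configuration. The following is a minimal sketch, assuming that your OdhDashboardConfig version exposes a groupsConfig section with adminGroups and allowedGroups fields; verify the field names against the odh-dashboard-config instance in your cluster before applying:
apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
  name: odh-dashboard-config
  namespace: opendatahub
spec:
  groupsConfig:
    adminGroups: odh-admins                 # comma-separated list of administrator groups
    allowedGroups: 'system:authenticated'   # or a specific user group, such as odh-users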
Deleting users
About deleting users and their resources
If you have administrator access to OpenShift Container Platform, you can revoke a user’s access to workbenches and delete the user’s resources from Open Data Hub. Before you delete a user from Open Data Hub, it is good practice to back up the data on your persistent volume claims (PVCs).
Deleting a user and the user’s resources involves the following tasks:
-
Stop workbenches owned by the user.
-
Revoke user access to workbenches.
-
Remove the user from the allowed group in your OpenShift identity provider.
-
After you delete a user, delete their associated configuration files from OpenShift Container Platform.
Stopping basic workbenches owned by other users
Open Data Hub administrators can stop basic workbenches that are owned by other users to reduce resource consumption on the cluster, or as part of removing a user and their resources from the cluster.
-
You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
-
You have launched the Start basic workbench application, as described in Starting a basic workbench.
-
The workbench that you want to stop is running.
-
On the page that opens when you launch a basic workbench, click the Administration tab.
-
Stop one or more servers.
-
If you want to stop one or more specific servers, perform the following actions:
-
In the Users section, locate the user that the workbench belongs to.
-
To stop the workbench, perform one of the following actions:
-
Click the action menu (⋮) beside the relevant user and select Stop server.
-
Click View server beside the relevant user and then click Stop workbench.
The Stop server dialog box opens.
-
-
Click Stop server.
-
-
If you want to stop all workbenches, perform the following actions:
-
Click the Stop all workbenches button.
-
Click OK to confirm stopping all servers.
-
-
-
The Stop server link beside each server changes to a Start workbench link when the workbench has stopped.
Revoking user access to basic workbenches
You can revoke a user’s access to basic workbenches by removing the user from the Open Data Hub user groups that define access to Open Data Hub. When you remove a user from the user groups, the user is prevented from accessing the Open Data Hub dashboard and from using associated services that consume resources in your cluster.
|
Important
|
Follow these steps only if you have implemented Open Data Hub user groups to restrict access to Open Data Hub. To completely remove a user from Open Data Hub, you must remove them from the allowed group in your OpenShift identity provider. |
-
You have stopped any workbenches owned by the user you want to delete.
-
You are using Open Data Hub user groups, and the user is part of the user group, administrator group, or both.
-
In the OpenShift Container Platform web console, click User Management → Groups.
-
Click the name of the group that you want to remove the user from.
-
For administrative users, click the name of your administrator group, for example,
odh-admins.
-
For non-administrator users, click the name of your user group, for example,
odh-users.
The Group details page for the group is displayed.
-
-
In the Users section on the Details tab, locate the user that you want to remove.
-
Click the action menu (⋮) beside the user that you want to remove and click Remove user.
-
In the Users section on the Details tab of the Group details page, confirm that the user that you removed is not visible. In Workloads → Pods, select the default workbench project (
opendatahub or your custom workbench namespace), and ensure that there is no workbench pod for this user. If you see a pod named jupyter-nb-<username>-* for the user that you have removed, delete that pod to ensure that the deleted user is not consuming resources on the cluster.
-
In the Open Data Hub dashboard, check the list of data science projects. Delete any projects that belong to the user.
Backing up storage data
It is a best practice to back up the data on your persistent volume claims (PVCs) regularly.
Backing up your data is particularly important before you delete a user and before you uninstall Open Data Hub, as all PVCs are deleted when Open Data Hub is uninstalled.
For more information about backing up PVCs for your cluster platform, see OADP Application backup and restore in the OpenShift Container Platform documentation.
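For illustration only, a PVC backup with OADP is typically expressed as a Backup custom resource. The following minimal sketch assumes that the OADP Operator is installed in the openshift-adp namespace and that a default backup storage location is configured; the resource name and namespaces are examples:
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: workbench-pvc-backup        # example backup name
  namespace: openshift-adp          # namespace where the OADP Operator runs
spec:
  includedNamespaces:
    - opendatahub                   # or your custom workbench namespace
  includedResources:
    - persistentvolumeclaims
    - persistentvolumes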
Cleaning up after deleting users
After you remove a user’s access to Open Data Hub, you must also delete the configuration files for the user from OpenShift Container Platform. Red Hat recommends that you back up the user’s data before removing their configuration files.
-
(Optional) If you want to completely remove the user’s access to Open Data Hub, you have removed their credentials from your identity provider.
-
You have logged in to the OpenShift Container Platform web console as a user with the
cluster-admin role.
-
Delete the user’s persistent volume claim (PVC).
-
Click Storage → PersistentVolumeClaims.
-
If it is not already selected, select the default workbench project (
opendatahub or your custom workbench namespace) from the project list.
-
Locate the
jupyter-nb-<username> PVC. Replace
<username> with the relevant user name.
-
Click the action menu (⋮) and select Delete PersistentVolumeClaim from the list.
The Delete PersistentVolumeClaim dialog opens.
-
Inspect the dialog and confirm that you are deleting the correct PVC.
-
Click Delete.
-
-
Delete the user’s ConfigMap.
-
Click Workloads → ConfigMaps.
-
If it is not already selected, select the default workbench project (
opendatahub or your custom workbench namespace) from the project list.
-
Locate the
jupyterhub-singleuser-profile-<username> ConfigMap. Replace
<username> with the relevant user name.
-
Click the action menu (⋮) and select Delete ConfigMap from the list.
The Delete ConfigMap dialog opens.
-
Inspect the dialog and confirm that you are deleting the correct ConfigMap.
-
Click Delete.
-
-
The user cannot access Open Data Hub and sees an "Access permission needed" message if they try.
-
The user’s single-user profile, persistent volume claim (PVC), and ConfigMap are not visible in OpenShift Container Platform.
Creating custom workbench images
Open Data Hub includes a selection of default workbench images that a data scientist can select when they create or edit a workbench.
In addition, you can import a custom workbench image, for example, if you want to add libraries that data scientists often use, or if your data scientists require a specific version of a library that is different from the version provided in a default image. Custom workbench images are also useful if your data scientists require operating system packages or applications because they cannot install them directly in their running environment (data scientist users do not have root access, which is needed for those operations).
A custom workbench image is simply a container image. You build one as you would build any standard container image, by using a Containerfile (or Dockerfile). You start from an existing image (the FROM instruction), and then add your required elements.
You have the following options for creating a custom workbench image:
-
Start from one of the default images, as described in Creating a custom image from a default Open Data Hub image.
-
Create your own image by following the guidelines for making it compatible with Open Data Hub, as described in Creating a custom image from your own image.
For more information about creating images, see the resources listed in Creating a custom image from your own image.
Creating a custom image from a default Open Data Hub image
After Open Data Hub is installed on a cluster, you can find the default workbench images in the OpenShift console, under Builds → ImageStreams for the redhat-ods-applications project.
You can create a custom image by adding OS packages or applications to a default Open Data Hub image.
-
You know which default image you want to use as the base for your custom image.
Important: If you want to create a custom Elyra-compatible image, the base image must be an Open Data Hub image that contains the Elyra extension.
-
You have
cluster-admin access to the OpenShift console for the cluster where Open Data Hub is installed.
-
Obtain the location of the default image that you want to use as the base for your custom image.
-
In the OpenShift console, select Builds → ImageStreams.
-
Select the redhat-ods-applications project.
-
From the list of installed imagestreams, click the name of the image that you want to use as the base for your custom image. For example, click pytorch.
-
On the ImageStream details page, click YAML.
-
In the
spec:tags section, find the tag for the version of the image that you want to use. The location of the original image is shown in the tag’s
from:name section, for example:
name: 'quay.io/modh/odh-pytorch-notebook@sha256:b68e0192abf7d…'
-
Copy this location for use in your custom image.
-
-
Create a standard Containerfile or Dockerfile.
-
For the
FROM instruction, specify the base image location that you copied in Step 1, for example:
FROM quay.io/modh/odh-pytorch-notebook@sha256:b68e0…
-
Optional: Install OS packages:
-
Switch to
USER 0 (USER 0 is required to install OS packages).
Install the packages.
-
Switch back to
USER 1001. The following example creates a custom workbench image that adds Java to the default PyTorch image:
FROM quay.io/modh/odh-pytorch-notebook@sha256:b68e0…
USER 0
RUN INSTALL_PKGS="java-11-openjdk java-11-openjdk-devel" && \
    dnf install -y --setopt=tsflags=nodocs $INSTALL_PKGS && \
    dnf -y clean all --enablerepo=*
USER 1001
-
-
Optional: Add Python packages:
-
Specify
USER 1001.
-
Copy the
requirements.txt file.
-
Install the packages.
The following example installs packages from the
requirements.txt file in the default PyTorch image:
FROM quay.io/modh/odh-pytorch-notebook@sha256:b68e0…
USER 1001
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt
-
-
Build the image file. For example, you can use
podman build locally where the image file is located and then push the image to a registry that is accessible to Open Data Hub:
$ podman build -t my-registry/my-custom-image:0.0.1 .
$ podman push my-registry/my-custom-image:0.0.1
Alternatively, you can use the image build capabilities of OpenShift by creating a BuildConfig.
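For example, the following is a minimal BuildConfig sketch that builds a Containerfile from a Git repository and pushes the result to an ImageStream tag; the repository URL, names, and namespace are placeholders, not values defined by Open Data Hub:
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: my-custom-workbench              # example name
  namespace: opendatahub                 # example namespace
spec:
  source:
    type: Git
    git:
      uri: https://example.com/my-org/my-custom-workbench.git   # placeholder repository
  strategy:
    type: Docker
    dockerStrategy:
      dockerfilePath: Containerfile
  output:
    to:
      kind: ImageStreamTag
      name: my-custom-workbench:0.0.1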
Creating a custom image from your own image
You can build your own custom image. However, you must make sure that your image is compatible with OpenShift and Open Data Hub.
-
General Container image guidelines section in the OpenShift Container Platform Images documentation.
-
Red Hat Universal Base Image: https://catalog.redhat.com/software/base-images
-
Red Hat Ecosystem Catalog: https://catalog.redhat.com/
Basic guidelines for creating your own workbench image
The following basic guidelines provide information to consider when you build your own custom workbench image.
Designing your image to run with USER 1001
In OpenShift, your container will run with a random UID and a GID of 0. Make sure that your image is compatible with these user and group requirements, especially if you need write access to directories. Best practice is to design your image to run with USER 1001.
Avoid placing artifacts in $HOME
The persistent volume attached to the workbench is mounted at /opt/app-root/src, which is also the location of $HOME. Therefore, do not put any files or other resources directly in $HOME, because they are not visible after the workbench is deployed (and the persistent volume is mounted).
Specifying the API endpoint
OpenShift readiness and liveness probes will query the /api endpoint. For a Jupyter IDE, this is the default endpoint. For other IDEs, you must implement the /api endpoint.
Advanced guidelines for creating your own workbench image
The following guidelines provide information to consider when you build your own custom workbench image.
Minimizing image size
A workbench image uses a "layered" file system. Every time you use a COPY or a RUN command in your workbench image file, a new layer is created. Artifacts are not deleted. When you remove an artifact, for example, a file, it is "masked" in the next layer. Therefore, consider the following guidelines when you create your workbench image file.
-
Avoid using the
dnf update command.
-
If you start from an image that is constantly updated, such as
ubi9/python-39 from the Red Hat Catalog, you might not need to use the dnf update command. This command fetches new metadata, updates files that might not have impact, and increases the workbench image size.
-
Point to a newer version of your base image rather than performing a
dnf update on an older version.
-
-
Group
RUN commands. Chain your commands by adding && \ at the end of each line.
-
If you must compile code (such as a library or an application) to include in your custom image, implement multi-stage builds so that you avoid including the build artifacts in your final image. That is, compile the library or application in an intermediate image and then copy the result to your final image, leaving behind build artifacts that you do not want included.
Setting access to files and directories
-
Set the ownership of files and folders to
1001:0 (user "default", group "0"), for example:
COPY --chown=1001:0 os-packages.txt ./
On OpenShift, every container is in a standard namespace (unless you modify security). The container runs with a user that has a random user ID (uid) and with a group ID (gid) of
0. Therefore, all folders that you want to write to, and all the files that you want to (temporarily) modify, must be accessible by the user that has the random user ID (uid). Alternatively, you can set access for any user, as shown in the following example:
COPY --chmod=775 os-packages.txt ./
-
Build your image with
/opt/app-root/src as the default location for the data that you want persisted, for example:
WORKDIR /opt/app-root/src
When a user launches a workbench from the Open Data Hub Applications → Enabled page, the personal volume of the user is mounted in the user’s HOME directory (
/opt/app-root/src). Because this location is not configurable, when you build your custom image, you must specify this default location for persisted data.
-
Fix permissions to support PIP (the package manager for Python packages) in OpenShift environments. Add the following command to your custom image (if needed, change
python3.11 to the Python version that you are using):
chmod -R g+w /opt/app-root/lib/python3.11/site-packages && \
    fix-permissions /opt/app-root -P
-
A service within your workbench image must answer at
${NB_PREFIX}/api, otherwise the OpenShift liveness/readiness probes fail and delete the pod for the workbench image. The
NB_PREFIX environment variable specifies the URL path where the container is expected to be listening. The following is an example of an Nginx configuration:
location = ${NB_PREFIX}/api {
  return 302 /healthz;
  access_log off;
}
For idle culling to work, the
${NB_PREFIX}/api/kernels URL must return a specifically formatted JSON payload. The following is an example of an Nginx configuration:
location = ${NB_PREFIX}/api/kernels {
  return 302 $custom_scheme://$http_host/api/kernels/;
  access_log off;
}
location ${NB_PREFIX}/api/kernels/ {
  return 302 $custom_scheme://$http_host/api/kernels/;
  access_log off;
}
location /api/kernels/ {
  index access.cgi;
  fastcgi_index access.cgi;
  gzip off;
  access_log off;
}
The returned JSON payload should be:
{"id":"rstudio","name":"rstudio","last_activity":(time in ISO8601 format),"execution_state":"busy","connections": 1}
Enabling CodeReady Builder (CRB) and Extra Packages for Enterprise Linux (EPEL)
CRB and EPEL are repositories that provide packages that are absent from a standard Red Hat Enterprise Linux (RHEL) or Universal Base Image (UBI) installation. They are required to install some software, for example, RStudio.
On UBI9 images, CRB is enabled by default. To enable EPEL on UBI9-based images, run the following command:
RUN yum install -y https://download.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
To enable CRB and EPEL on CentOS Stream 9-based images, run the following command:
RUN yum install -y yum-utils && \
yum-config-manager --enable crb && \
yum install -y https://download.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
Adding Elyra compatibility
Support for data science pipelines V2 (provided with the odh-elyra package) is available in Open Data Hub version 2.9 and later. Previous versions of Open Data Hub support data science pipelines V1 (provided with the elyra package).
If you want your custom image to support data science pipelines V2, you must address the following requirements:
-
Include the
odh-elyra package (not the elyra package) to add support for data science pipelines V2, for example:
USER 1001
RUN pip install odh-elyra
-
If you want to include the data science pipeline configuration automatically, as a runtime configuration, add an annotation when you import a custom workbench image.
Enabling custom images in Open Data Hub
By default, all Open Data Hub administrators can import custom workbench images by selecting Settings → Environment setup → Workbench images in the Open Data Hub dashboard.
If the Settings → Environment setup → Workbench images option is not available, check the following settings, depending on which navigation element does not appear in the dashboard:
-
The Settings menu does not appear in the Open Data Hub navigation bar.
The visibility of the Open Data Hub dashboard Settings menu is determined by your user permissions. By default, the Settings menu is available to Open Data Hub administration users (users that are members of the
odh-admins group). Users with the OpenShift cluster-admin role are automatically added to the odh-admins group and are granted administrator access in Open Data Hub. For more information about user permissions, see Managing users and groups.
-
The Workbench images menu item does not appear under the Settings menu.
The visibility of the Workbench images menu item is controlled in the dashboard configuration, by the value of the
dashboardConfig: disableBYONImageStream option. It is set to false (the Workbench images menu item is visible) by default. You need Open Data Hub administrator permissions to edit the dashboard configuration.
For more information about setting dashboard configuration options, see Customizing the dashboard.
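For reference, the option lives in the odh-dashboard-config instance. The following is a minimal sketch of the relevant fragment; only the field discussed here is shown:
spec:
  dashboardConfig:
    disableBYONImageStream: false   # false (the default) shows the Workbench images menu item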
Importing a custom workbench image
You can import custom workbench images that cater to your Open Data Hub project’s specific requirements. From the Workbench images page, you can enable or disable a previously imported workbench image and create an accelerator profile or a hardware profile as a recommended accelerator for existing workbench images.
You must import a custom workbench image so that your Open Data Hub users (data scientists) can access it when they create a project workbench.
-
You have logged in to Open Data Hub as a user with Open Data Hub administrator privileges.
-
Your custom image exists in an image registry that is accessible to Open Data Hub.
-
The Settings → Environment setup → Workbench images dashboard navigation menu item is enabled, as described in Enabling custom images in Open Data Hub.
-
If you want to associate an accelerator with the custom image that you want to import, you know the accelerator’s identifier, which is the unique string that identifies the hardware accelerator. You must also have enabled GPU support, including installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
-
From the Open Data Hub dashboard, click Settings → Environment setup → Workbench images.
The Workbench images page opens. Previously imported images are displayed. To enable or disable a previously imported image, on the row containing the relevant image, click the toggle in the Enable column.
-
Optional: If you want to associate an accelerator and you have not already created an accelerator profile or a hardware profile, click Create profile on the row containing the image and complete the relevant fields. If the image does not contain an accelerator identifier, you must manually configure one before creating an associated accelerator profile or a hardware profile.
-
Click Import new image. Alternatively, if no previously imported images were found, click Import image.
The Import workbench image dialog opens.
-
In the Image location field, enter the URL of the repository containing the image. For example:
quay.io/my-repo/my-image:tag, quay.io/my-repo/my-image@sha256:xxxxxxxxxxxxx, or docker.io/my-repo/my-image:tag.
-
In the Name field, enter an appropriate name for the image.
-
Optional: In the Description field, enter a description for the image.
-
Optional: From the Accelerator identifier list, select an identifier to set its accelerator as recommended with the image. If the image contains only one accelerator identifier, the identifier name displays by default.
-
Optional: Add software to the image. After the import has completed, the software is added to the image’s metadata and displayed on the workbench creation page.
-
Click the Software tab.
-
Click the Add software button.
-
Click Edit.
-
Enter the Software name.
-
Enter the software Version.
-
Click Confirm to confirm your entry.
-
To add additional software, click Add software, complete the relevant fields, and confirm your entry.
-
-
Optional: Add packages to the image. After the import has completed, the packages are added to the image’s metadata and displayed on the workbench creation page.
-
Click the Packages tab.
-
Click the Add package button.
-
Click Edit.
-
Enter the Package name. For example, if you want to include data science pipeline V2 automatically, as a runtime configuration, type
odh-elyra.
-
Enter the package Version. For example, type
3.16.7.
-
Click Confirm to confirm your entry.
-
To add an additional package, click Add package, complete the relevant fields, and confirm your entry.
-
-
Click Import.
-
The image that you imported is displayed in the table on the Workbench images page.
-
Your custom image is available for selection when a user creates a workbench.
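For reference, an imported custom image is stored as an ImageStream in the Open Data Hub application namespace. The following is a minimal sketch of what such a resource might look like; the label and annotation shown are assumptions based on the fields referenced elsewhere in this document, so inspect the YAML of an image that you imported for the authoritative form:
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: my-custom-image
  namespace: opendatahub
  labels:
    opendatahub.io/notebook-image: 'true'                  # assumption: marks the image as a workbench image
  annotations:
    opendatahub.io/notebook-image-name: My Custom Image    # display name annotation referenced elsewhere in this document
spec:
  tags:
    - name: '0.0.1'
      from:
        kind: DockerImage
        name: quay.io/my-repo/my-image:tag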
Managing applications that show in the dashboard
Adding an application to the dashboard
If you have installed an application in your OpenShift Container Platform cluster, an Open Data Hub administrator can add a tile for that application to the Open Data Hub dashboard (the Applications → Enabled page) to make it accessible for Open Data Hub users.
-
You have Open Data Hub administrator privileges.
-
The
spec.dashboardConfig.enablement dashboard configuration option is set to true (the default). For more information about setting dashboard configuration options, see Customizing the dashboard.
-
Log in to the OpenShift Container Platform console as an Open Data Hub administrator.
-
In the Administrator perspective, click Home → API Explorer.
-
In the search bar, enter
OdhApplication to filter by kind.
-
Click the
OdhApplication custom resource (CR) to open the resource details page.
-
From the Project list, select the Open Data Hub application namespace; the default is
opendatahub.
-
Click the Instances tab.
-
Click Create OdhApplication.
-
On the Create OdhApplication page, copy the following code and paste it into the YAML editor.
apiVersion: dashboard.opendatahub.io/v1
kind: OdhApplication
metadata:
  name: examplename
  namespace: opendatahub
  labels:
    app: odh-dashboard
    app.kubernetes.io/part-of: odh-dashboard
spec:
  enable:
    validationConfigMap: examplename-enable
  img: >-
    <svg width="24" height="25" viewBox="0 0 24 25" fill="none" xmlns="http://www.w3.org/2000/svg">
      <path d="path data" fill="#ee0000"/>
    </svg>
  getStartedLink: 'https://example.org/docs/quickstart.html'
  route: exampleroutename
  routeNamespace: examplenamespace
  displayName: Example Name
  kfdefApplications: []
  support: third party support
  csvName: ''
  provider: example
  docsLink: 'https://example.org/docs/index.html'
  quickStart: ''
  getStartedMarkDown: >-
    # Example

    Enter text for the information panel.
  description: >-
    Enter summary text for the tile.
  category: Self-managed | Partner managed | Red Hat managed
-
Modify the parameters in the code for your application.
Tip: To see example YAML files, click Home → API Explorer, select OdhApplication, click the Instances tab, select an instance, and then click the YAML tab.
-
Click Create. The application details page opens.
-
Log in to Open Data Hub.
-
In the left menu, click Applications → Explore.
-
Locate the new tile for your application and click it.
-
In the information pane for the application, click Enable.
-
In the left menu of the Open Data Hub dashboard, click Applications → Enabled and verify that your application is available.
Preventing users from adding applications to the dashboard
By default, Open Data Hub administrators can add applications to the Open Data Hub dashboard Applications → Enabled page.
As an Open Data Hub administrator, you can disable the ability for Open Data Hub administrators to add applications to the dashboard.
Note: The Start basic workbench tile is enabled by default. To disable it, see Hiding the default basic workbench application.
-
You have Open Data Hub administrator privileges.
-
Log in to the OpenShift Container Platform console as an Open Data Hub administrator.
-
Open the dashboard configuration file:
-
In the Administrator perspective, click Home → API Explorer.
-
In the search bar, enter
OdhDashboardConfig to filter by kind.
-
Click the
OdhDashboardConfig custom resource (CR) to open the resource details page.
-
From the Project list, select the Open Data Hub application namespace; the default is
opendatahub.
-
Click the Instances tab.
-
Click the
odh-dashboard-config instance to open the details page.
-
Click the YAML tab.
-
-
In the
spec.dashboardConfig section, set the value of enablement to false to disable the ability for dashboard users to add applications to the dashboard.
-
Click Save to apply your changes and then click Reload to make sure that your changes are synced to the cluster.
-
Open the Open Data Hub dashboard Applications → Enabled page.
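For reference, a minimal sketch of the edited fragment of the odh-dashboard-config instance; only the field discussed in this procedure is shown:
spec:
  dashboardConfig:
    enablement: false   # prevents dashboard users from adding applications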
Disabling applications connected to Open Data Hub
You can disable applications and components so that they do not appear on the Open Data Hub dashboard when you no longer want to use them, for example, when data scientists no longer use an application or when the application license expires.
Disabling unused applications allows your data scientists to manually remove these application tiles from their Open Data Hub dashboard so that they can focus on the applications that they are most likely to use.
-
You have logged in to the OpenShift Container Platform web console.
-
You are part of the
cluster-admins user group in OpenShift Container Platform.
-
You have installed or configured the service on your OpenShift Container Platform cluster.
-
The application or component that you want to disable is enabled and visible on the Enabled page.
-
In the OpenShift Container Platform web console, switch to the Administrator perspective.
-
Switch to the
odh project.
-
Click Operators → Installed Operators.
-
Click on the Operator that you want to uninstall. You can enter a keyword into the Filter by name field to help you find the Operator faster.
-
Delete any Operator resources or instances by using the tabs in the Operator interface.
During installation, some Operators require the administrator to create resources or start process instances using tabs in the Operator interface. These must be deleted before the Operator can uninstall correctly.
-
On the Operator Details page, click the Actions drop-down menu and select Uninstall Operator.
An Uninstall Operator? dialog box is displayed.
-
Select Uninstall to uninstall the Operator, Operator deployments, and pods. After this is complete, the Operator stops running and no longer receives updates.
|
Important
|
Removing an Operator does not remove any custom resource definitions or managed resources for the Operator. Custom resource definitions and managed resources still exist and must be cleaned up manually. Any applications deployed by your Operator and any configured off-cluster resources continue to run and must be cleaned up manually. |
-
The Operator is uninstalled from its target clusters.
-
The Operator is no longer displayed on the Installed Operators page.
-
The disabled application is no longer available for your data scientists to use, and is marked as
Disabled on the Enabled page of the Open Data Hub dashboard. This action might take a few minutes to occur following the removal of the Operator.
Showing or hiding information about available applications
You can view a list of available applications in the Exploring applications page of the Open Data Hub dashboard. By default, the following information is provided for each application:
-
Any independent software vendor (ISV) application is indicated with a label on the tile:
Red Hat-managed, Partner managed, or Self-managed. As an Open Data Hub administrator, you can hide or show the labels. For example, if you are running a self-managed environment, you might want to show all available applications regardless of the support level.
-
When a user clicks on an application, an information panel is displayed and provides more information about the application, including links to quick starts or detailed documentation. You can disable or enable the appearance of application information panels.
-
You have Open Data Hub administrator privileges.
-
Log in to the OpenShift Container Platform console as an Open Data Hub administrator.
-
Open the dashboard configuration file:
-
In the Administrator perspective, click Home → API Explorer.
-
In the search bar, enter
OdhDashboardConfig to filter by kind.
-
Click the
OdhDashboardConfig custom resource (CR) to open the resource details page.
-
From the Project list, select the Open Data Hub application namespace; the default is
opendatahub.
-
Click the Instances tab.
-
Click the
odh-dashboard-config instance to open the details page.
-
Click the YAML tab.
-
-
In the
spec.dashboardConfig section, set either or both of the following options:
-
disableInfo: Set to true to hide the application information panel. Set to false (the default) to show the application information panel.
-
disableISVBadges: Set to true to hide the support-level label. Set to false (the default) to show the support-level label.
-
-
Click Save to apply your changes and then click Reload to make sure that your changes are synced to the cluster.
Log in to Open Data Hub and verify that your dashboard configurations apply.
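For reference, a minimal sketch of the edited fragment of the odh-dashboard-config instance; only the fields discussed in this procedure are shown:
spec:
  dashboardConfig:
    disableInfo: false        # false (the default) shows the application information panel
    disableISVBadges: false   # false (the default) shows the support-level label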
Hiding the default basic workbench application
The Open Data Hub dashboard includes Start basic workbench as an enabled application by default.
To hide the Start basic workbench tile so that it is no longer included in the list of applications on the Applications → Enabled page, edit the dashboard configuration file.
-
You have Open Data Hub administrator privileges.
-
Log in to the OpenShift Container Platform console as an Open Data Hub administrator.
-
Open the dashboard configuration file:
-
In the Administrator perspective, click Home → API Explorer.
-
In the search bar, enter
OdhDashboardConfig to filter by kind.
-
Click the
OdhDashboardConfig custom resource (CR) to open the resource details page.
-
From the Project list, select the Open Data Hub application namespace; the default is
opendatahub.
-
Click the Instances tab.
-
Click the
odh-dashboard-config instance to open the details page.
-
Click the YAML tab.
-
-
In the
spec.notebookController section, set the value of enabled to false to remove the Start basic workbench tile from the list of applications on the Applications → Enabled page.
-
Click Save to apply your changes and then click Reload to make sure that your changes are synced to the cluster.
In the Open Data Hub dashboard, click Applications → Enabled. The list of applications no longer includes the Start basic workbench tile.
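For reference, a minimal sketch of the edited fragment of the odh-dashboard-config instance; only the field discussed in this procedure is shown:
spec:
  notebookController:
    enabled: false   # hides the Start basic workbench tile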
Creating project-scoped resources
Open Data Hub users can access global resources in all Open Data Hub projects. However, they can access project-scoped resources only within projects that they have permissions to access.
As a cluster administrator, you can create the following types of project-scoped resources in any Open Data Hub project:
-
Workbench images
-
Hardware profiles
-
Accelerator profiles
-
Model-serving runtimes for KServe
All resource names must be unique within a project.
|
Note
|
A user with access permissions to a project can create project-scoped resources for that project, as described in Creating project-scoped resources for your project. |
-
You can access the OpenShift Container Platform console as a cluster administrator.
-
You have set the
disableProjectScoped dashboard configuration option to false, as described in Customizing the dashboard.
-
Log in to the OpenShift Container Platform console as a cluster administrator.
-
Copy the YAML code to create the resource.
You can get the YAML code from a trusted source, such as an existing resource, a Git repository, or documentation.
For example, you can copy the YAML code from an existing resource, as follows:
-
In the Administrator perspective, click Home → Search.
-
From the Project list, select the appropriate project.
To limit the search to global Open Data Hub resources only, select the
opendatahub project.
-
In the Resources list, search for the relevant resource type:
-
For workbench images, search for
ImageStream.
-
For hardware profiles, search for
HardwareProfile.
-
For accelerator profiles, search for
AcceleratorProfile.
-
For serving runtimes, search for
Template. From the resulting list, find the templates that have the objects.kind specification set to ServingRuntime.
-
-
Select a resource, and then click the YAML tab.
-
Copy the YAML content, and then click Cancel.
-
-
From the Project list, select the target project name.
-
From the toolbar, click the + icon to open the Import YAML page.
-
Paste the relevant YAML content into the code area.
-
Edit the
metadata.namespace value to specify the name of the target project.
-
If necessary, edit the
metadata.name value to ensure that the resource name is unique within the specified project.
-
Optional: Edit the resource name that is displayed in the Open Data Hub console:
-
For workbench images, edit the
metadata.annotations.opendatahub.io/notebook-image-name value.
-
For hardware profiles and accelerator profiles, edit the
spec.displayName value.
-
For serving runtimes, edit the
objects.metadata.annotations.openshift.io/display-name value.
-
-
Click Create.
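For example, a project-scoped accelerator profile created by this procedure might look similar to the following minimal sketch; the project name, display name, and identifier are placeholders, and you should verify the field names against an existing AcceleratorProfile resource in your cluster:
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: small-gpu                       # must be unique within the target project
  namespace: my-data-science-project    # the target project instead of opendatahub
spec:
  displayName: Small GPU                # name shown in the Open Data Hub console
  enabled: true
  identifier: nvidia.com/gpu            # resource name exposed by the device plugin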
-
Log in to the Open Data Hub console as a regular user.
-
Verify that the project-scoped resource is shown in the specified project:
-
For workbench images, hardware profiles, and accelerator profiles, see Creating a workbench.
-
For serving runtimes, see Deploying models on the single-model serving platform.
-
Allocating additional resources to Open Data Hub users
As a cluster administrator, you can allocate additional resources to a cluster to support compute-intensive data science work. This support includes increasing the number of nodes in the cluster and changing the cluster’s allocated machine pool.
For more information about allocating additional resources to an OpenShift Container Platform cluster, see Manually scaling a compute machine set.
Customizing component deployment resources
Overview of component resource customization
You can customize deployment resources that are related to the Open Data Hub Operator, for example, CPU and memory limits and requests. For resource customizations to persist without being overwritten by the Operator, the opendatahub.io/managed: true annotation must not be present in the YAML file for the component deployment. This annotation is absent by default.
The following table shows the deployment names for each component in the opendatahub namespace:
| Component | Deployment names |
|---|---|
| CodeFlare | codeflare-operator-manager |
| KServe | |
| TrustyAI | trustyai-service-operator-controller-manager |
| Ray | kuberay-operator |
| Kueue | kueue-controller-manager |
| Workbenches | |
| Dashboard | odh-dashboard |
| Model serving | |
| Model registry | model-registry-operator-controller-manager |
| Data science pipelines | data-science-pipelines-operator-controller-manager |
| Training Operator | kubeflow-training-operator |
Customizing component resources
You can customize component deployment resources by updating the .spec.template.spec.containers.resources section of the YAML file for the component deployment.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
Log in to the OpenShift Container Platform console as a cluster administrator.
-
In the Administrator perspective, click Workloads → Deployments.
-
From the Project drop-down list, select
opendatahub. -
In the Name column, click the name of the deployment for the component that you want to customize resources for.
Note: For more information about the deployment names for each component, see Overview of component resource customization.
-
On the Deployment details page that is displayed, click the YAML tab.
-
Find the
.spec.template.spec.containers.resources section.
-
Update the value of the resource that you want to customize. For example, to update the memory limit to 500Mi, make the following change:
containers:
  - resources:
      limits:
        cpu: '2'
        memory: 500Mi
      requests:
        cpu: '1'
        memory: 1Gi
-
Click Save.
-
Click Reload.
-
Log in to Open Data Hub and verify that your resource changes apply.
Disabling component resource customization
You can disable customization of component deployment resources, and restore default values, by adding the opendatahub.io/managed: true annotation to the YAML file for the component deployment.
|
Important
|
Do not manually remove the opendatahub.io/managed annotation or set it to false. To remove the annotation from a deployment, use the steps described in Re-enabling component resource customization. |
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
Log in to the OpenShift Container Platform console as a cluster administrator.
-
In the Administrator perspective, click Workloads → Deployments.
-
From the Project drop-down list, select
opendatahub.
-
In the Name column, click the name of the deployment for the component to which you want to add the annotation.
Note: For more information about the deployment names for each component, see Overview of component resource customization.
-
On the Deployment details page that opens, click the YAML tab.
-
Find the
metadata.annotations: section.
-
Add the
opendatahub.io/managed: true annotation:
metadata:
  annotations:
    opendatahub.io/managed: true
-
Click Save.
-
Click Reload.
-
The
opendatahub.io/managed: true annotation is displayed in the YAML file for the component deployment.
Re-enabling component resource customization
You can re-enable customization of component deployment resources after manually disabling it.
|
Important
|
Do not manually remove the opendatahub.io/managed annotation or set it to false. To remove the annotation from a deployment, use the following steps to delete the deployment. The controller pod for the deployment automatically redeploys with the default settings. |
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
Log in to the OpenShift Container Platform console as a cluster administrator.
-
In the Administrator perspective, click Workloads → Deployments.
-
From the Project drop-down list, select
opendatahub.
-
In the Name column, click the name of the deployment for the component for which you want to remove the annotation.
-
Click the Options menu (⋮).
-
Click Delete Deployment.
-
The controller pod for the deployment automatically redeploys with the default settings.
Enabling accelerators
Enabling NVIDIA GPUs
Before you can use NVIDIA GPUs in Open Data Hub, you must install the NVIDIA GPU Operator.
-
You have logged in to your OpenShift Container Platform cluster.
-
You have the
cluster-admin role in your OpenShift Container Platform cluster.
-
You have installed an NVIDIA GPU and confirmed that it is detected in your environment.
-
To enable GPU support on an OpenShift cluster, follow the instructions here: NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
Important: After you install the Node Feature Discovery (NFD) Operator, you must create an instance of NodeFeatureDiscovery. In addition, after you install the NVIDIA GPU Operator, you must create a ClusterPolicy and populate it with default values.
-
Delete the migration-gpu-status ConfigMap.
-
In the OpenShift Container Platform web console, switch to the Administrator perspective.
-
Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate ConfigMap.
-
Search for the migration-gpu-status ConfigMap.
-
Click the action menu (⋮) and select Delete ConfigMap from the list.
The Delete ConfigMap dialog opens.
-
Inspect the dialog and confirm that you are deleting the correct ConfigMap.
-
Click Delete.
-
-
Restart the dashboard replicaset.
-
In the OpenShift Container Platform web console, switch to the Administrator perspective.
-
Click Workloads → Deployments.
-
Set the Project to All Projects or
opendatahub to ensure you can see the appropriate deployment.
-
Search for the rhods-dashboard deployment.
-
Click the action menu (⋮) and select Restart Rollout from the list.
-
Wait until the Status column indicates that all pods in the rollout have fully restarted.
-
-
The reset migration-gpu-status instance is present on the Instances tab on the
AcceleratorProfile custom resource definition (CRD) details page.
-
From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:
-
NVIDIA GPU
-
Node Feature Discovery (NFD)
-
Kernel Module Management (KMM)
-
-
The GPU is correctly detected a few minutes after full installation of the Node Feature Discovery (NFD) and NVIDIA GPU Operators. The OpenShift CLI (
oc) displays the appropriate output for the GPU worker node. For example:
# Expected output when the GPU is detected properly
oc describe node <node name>
...
Capacity:
  cpu:                4
  ephemeral-storage:  313981932Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16076568Ki
  nvidia.com/gpu:     1
  pods:               250
Allocatable:
  cpu:                3920m
  ephemeral-storage:  288292006229
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12828440Ki
  nvidia.com/gpu:     1
  pods:               250
After installing the NVIDIA GPU Operator, create a hardware profile as described in Working with accelerators.
|
Important
|
By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift Container Platform. For more information about setting dashboard configuration options, see Customizing the dashboard. |
Intel Gaudi AI Accelerator integration
To accelerate your high-performance deep learning models, you can integrate Intel Gaudi AI accelerators into Open Data Hub. This integration enables your data scientists to use Gaudi libraries and software associated with Intel Gaudi AI accelerators through custom-configured workbench instances.
Intel Gaudi AI accelerators offer optimized performance for deep learning workloads, with the latest Gaudi 3 devices providing significant improvements in training speed and energy efficiency. These accelerators are suitable for enterprises running machine learning and AI applications on Open Data Hub.
Before you can enable Intel Gaudi AI accelerators in Open Data Hub, you must complete the following steps:
-
Install the latest version of the Intel Gaudi Base Operator from OperatorHub.
-
Create and configure a custom workbench image for Intel Gaudi AI accelerators. A prebuilt workbench image for Gaudi accelerators is not included in Open Data Hub.
-
Manually define and configure an accelerator profile or a hardware profile for each Intel Gaudi AI device in your environment.
Important: By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the
disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift Container Platform. For more information about setting dashboard configuration options, see Customizing the dashboard.
Red Hat supports Intel Gaudi devices up to Intel Gaudi 3. The Intel Gaudi 3 accelerators, in particular, offer the following benefits:
-
Improved training throughput: Reduce the time required to train large models by using advanced tensor processing cores and increased memory bandwidth.
-
Energy efficiency: Lower power consumption while maintaining high performance, reducing operational costs for large-scale deployments.
-
Scalable architecture: Scale across multiple nodes for distributed training configurations.
To use Intel Gaudi AI accelerators in an Amazon EC2 DL1 instance, your OpenShift platform must support EC2 DL1 instances. You can use Intel Gaudi AI accelerators in workbench instances or model serving after you enable the accelerators, create a custom workbench image, and configure the accelerator profile or the hardware profile.
To identify the Intel Gaudi AI accelerators present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.
|
Important
|
The presence of Intel Gaudi AI accelerators in your deployment, as indicated by the lspci utility, does not mean that the accelerators are ready to use. Before you can use them, you must complete the steps described in Enabling Intel Gaudi AI accelerators. |
Enabling Intel Gaudi AI accelerators
Before you can use Intel Gaudi AI accelerators in Open Data Hub, you must install the required dependencies, deploy the Intel Gaudi Base Operator, and configure the environment.
-
You have logged in to OpenShift Container Platform.
-
You have the
cluster-admin role in OpenShift Container Platform.
-
You have installed your Intel Gaudi accelerator and confirmed that it is detected in your environment.
-
Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:
-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
Install the latest version of the Intel Gaudi Base Operator, as described in Intel Gaudi Base Operator OpenShift installation.
-
By default, OpenShift Container Platform sets a per-pod PID limit of 4096. If your workload requires more processing power, such as when you use multiple Gaudi accelerators or when using vLLM with Ray, you must manually increase the per-pod PID limit to avoid
Resource temporarily unavailable errors. These errors occur due to PID exhaustion. Red Hat recommends setting this limit to 32768, although values over 20000 are sufficient.
-
Run the following command to label the node:
oc label node <node_name> custom-kubelet=set-pod-pid-limit-kubelet
-
Optional: To prevent workload distribution on the affected node, you can mark the node as unschedulable and then drain it in preparation for maintenance. For more information, see Understanding how to evacuate pods on nodes.
-
Create a
custom-kubelet-pidslimit.yaml KubeletConfig resource file:
oc create -f custom-kubelet-pidslimit.yaml
-
Populate the file with the following YAML code. Set the
podPidsLimit value to 32768:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-pidslimit
spec:
  kubeletConfig:
    podPidsLimit: 32768
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-pod-pid-limit-kubelet
-
Apply the configuration:
oc apply -f custom-kubelet-pidslimit.yaml
This operation causes the node to reboot. For more information, see Understanding node rebooting.
-
Optional: If you previously marked the node as unschedulable, you can allow scheduling again after the node reboots.
-
-
Create a custom workbench image for Intel Gaudi AI accelerators, as described in Creating custom workbench images.
-
After installing the Intel Gaudi Base Operator, create an accelerator profile, as described in Working with accelerator profiles.
Important: By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the
disableHardwareProfilesvalue tofalsein theOdhDashboardConfigcustom resource (CR) in OpenShift Container Platform. For more information about setting dashboard configuration options, see Customizing the dashboard.
From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:
-
Intel Gaudi Base Operator
-
Node Feature Discovery (NFD)
-
Kernel Module Management (KMM)
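Optionally, you can confirm the installed Operators from the command line. The following check is only a sketch; the grep pattern is illustrative and might need adjusting to match the Operator package names in your cluster:
oc get csv -A | grep -Ei 'gaudi|nfd|kernel-module'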
AMD GPU Integration
You can use AMD GPUs with Open Data Hub to accelerate AI and machine learning (ML) workloads. AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently.
Integrating AMD GPUs with Open Data Hub involves the following components:
-
ROCm workbench images: Use the ROCm workbench images to streamline AI/ML workflows on AMD GPUs. These images include libraries and frameworks optimized with the AMD ROCm platform, enabling high-performance workloads for PyTorch and TensorFlow. The pre-configured images reduce setup time and provide an optimized environment for GPU-accelerated development and experimentation.
-
AMD GPU Operator: The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads.
Verifying AMD GPU availability on your cluster
Before you proceed with the AMD GPU Operator installation process, you can verify the presence of an AMD GPU device on a node within your OpenShift Container Platform cluster. You can use commands such as lspci or oc to confirm hardware and resource availability.
-
You have administrative access to the OpenShift Container Platform cluster.
-
You have a running OpenShift Container Platform cluster with a node equipped with an AMD GPU.
-
You have access to the OpenShift CLI (
oc) and terminal access to the node.
-
Use the OpenShift CLI (
oc) to verify if GPU resources are allocatable:-
List all nodes in the cluster to identify the node with an AMD GPU:
oc get nodes
-
Note the name of the node where you expect the AMD GPU to be present.
-
Describe the node to check its resource allocation:
oc describe node <node_name>
-
In the output, locate the Capacity and Allocatable sections and confirm that
amd.com/gpu is listed. For example:
Capacity:
  amd.com/gpu:  1
Allocatable:
  amd.com/gpu:  1
-
-
Check for the AMD GPU device using the
lspcicommand:-
Log in to the node:
oc debug node/<node_name>
chroot /host
-
Run the
lspci command and search for the supported AMD device in your deployment. For example:
lspci | grep -E "MI210|MI250|MI300"
-
Verify that the output includes one of the AMD GPU models. For example:
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD] Instinct MI210
-
-
Optional: Use the
rocminfo command if the ROCm stack is installed on the node:
rocminfo
-
Confirm that the ROCm tool outputs details about the AMD GPU, such as compute units, memory, and driver status.
-
-
The
oc describe node <node_name>command listsamd.com/gpuunder Capacity and Allocatable. -
The
lspcicommand output identifies an AMD GPU as a PCI device matching one of the specified models (for example, MI210, MI250, MI300). -
Optional: The
rocminfotool provides detailed GPU information, confirming driver and hardware configuration.
Enabling AMD GPUs
Before you can use AMD GPUs in Open Data Hub, you must install the required dependencies, deploy the AMD GPU Operator, and configure the environment.
-
You have logged in to OpenShift Container Platform.
-
You have the
cluster-adminrole in OpenShift Container Platform. -
You have installed your AMD GPU and confirmed that it is detected in your environment.
-
Your OpenShift Container Platform environment supports instance types with AMD GPUs if you are running on Amazon Web Services (AWS).
-
Install the latest version of the AMD GPU Operator, as described in Install AMD GPU Operator on OpenShift.
-
After installing the AMD GPU Operator, configure the AMD drivers required by the Operator as described in the documentation: Configure AMD drivers for the GPU Operator.
NoteAlternatively, you can install the AMD GPU Operator from the Red Hat Catalog. For more information, see Install AMD GPU Operator from Red Hat Catalog.
-
After installing the AMD GPU Operator, create an accelerator profile, as described in Working with accelerator profiles.
Important: By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the
disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift Container Platform. For more information about setting dashboard configuration options, see Customizing the dashboard.
From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:
-
AMD GPU Operator
-
Node Feature Discovery (NFD)
-
Kernel Module Management (KMM)
|
Note
|
Ensure that you follow all the steps for proper driver installation and configuration. Incorrect installation or configuration may prevent the AMD GPUs from being recognized or functioning properly. |
Managing workloads with Kueue
As a cluster administrator, you can manage AI and machine learning workloads at scale by integrating the Red Hat build of Kueue with Open Data Hub. This integration provides capabilities for quota management, resource allocation, and prioritized job scheduling.
|
Important
|
The embedded Kueue component for managing distributed workloads is deprecated. Kueue is now provided through Red Hat build of Kueue, which is installed and managed by the Red Hat build of Kueue Operator. You cannot install both the embedded Kueue and the Red Hat build of Kueue Operator on the same cluster because this creates conflicting controllers that manage the same resources. Open Data Hub does not automatically migrate existing workloads. To ensure your workloads continue using queue management after upgrading, cluster administrators must manually migrate from the embedded Kueue to the Red Hat build of Kueue Operator. For more information, see Migrating to the Red Hat build of Kueue Operator. |
Overview of managing workloads with Kueue
You can use Kueue in Open Data Hub to manage AI and machine learning workloads at scale. Kueue controls how cluster resources are allocated and shared through hierarchical quota management, dynamic resource allocation, and prioritized job scheduling. These capabilities help prevent cluster contention, ensure fair access across teams, and optimize the use of heterogeneous compute resources, such as hardware accelerators.
Kueue lets you schedule diverse workloads, including distributed training jobs (RayJob, RayCluster, PyTorchJob), workbenches (Notebook), and model serving (InferenceService). Kueue validation and queue enforcement apply only to workloads in namespaces with the kueue.openshift.io/managed=true label.
Using Kueue in Open Data Hub provides these benefits:
-
Prevents resource conflicts and prioritizes workload processing
-
Manages quotas across teams and projects
-
Ensures consistent scheduling for all workload types
-
Maximizes GPU and other specialized hardware utilization
Kueue management states
You configure how Open Data Hub interacts with Kueue by setting the managementState in the DataScienceCluster object.
Unmanaged-
This state is supported for using Kueue with Open Data Hub. In the
Unmanaged state, Open Data Hub integrates with an existing Kueue installation managed by the Red Hat build of Kueue Operator. You must have the Red Hat build of Kueue Operator installed and running on the cluster. When you enable
Unmanaged mode, the Open Data Hub Operator creates a default Kueue custom resource (CR) if one does not already exist. This prompts the Red Hat build of Kueue Operator to activate Kueue on the cluster.
Managed-
This state is deprecated. Previously, Open Data Hub deployed and managed an embedded Kueue distribution.
Managed mode is not compatible with the Red Hat build of Kueue Operator. If both are installed, Open Data Hub stops reconciliation to avoid conflicts. You must migrate any environment using the Managed state to the Unmanaged state to ensure continued support.
Removed-
This state disables Kueue in Open Data Hub. If the state was previously
Managed, Open Data Hub uninstalls the embedded Kueue distribution. If the state was previously Unmanaged, Open Data Hub stops checking for the external Kueue integration but does not uninstall the Red Hat build of Kueue Operator. An empty managementState value also functions as Removed.
Queue enforcement for projects
To ensure workloads do not bypass the queuing system, a validating webhook automatically enforces queuing rules on any project that is enabled for Kueue management. You enable a project for Kueue management by applying the kueue.openshift.io/managed=true label to the project namespace.
|
Note
|
This validating webhook enforcement method replaces the Validating Admission Policy that was used with the deprecated embedded Kueue component. The system also supports the legacy namespace label from the embedded Kueue component for backward compatibility. |
After a project is enabled for Kueue management, the webhook requires that any new or updated workload has the kueue.x-k8s.io/queue-name label. If this label is missing, the webhook prevents the workload from being created or updated.
Open Data Hub creates a default, cluster-scoped ClusterQueue (if one does not already exist) and a namespace-scoped LocalQueue for that namespace (if one does not already exist). These default resources are created with the opendatahub.io/managed=false annotation, so they are not managed after creation. Cluster administrators can change or delete them.
The webhook enforces this rule on the create and update operations for the following resource types:
-
InferenceService -
Notebook -
PyTorchJob -
RayCluster -
RayJob
|
Note
|
You can apply hardware profiles to other workload types, but the validation webhook enforces the kueue.x-k8s.io/queue-name label requirement only for the resource types listed above. |
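For example, a distributed training job submitted to a Kueue-managed project must carry the queue-name label, as in the following minimal sketch. The project name, job name, and container image are placeholders, and the queue name default is the predefined name of the default local queue that Open Data Hub creates; replace it if your project uses a different local queue:
oc apply -n <project-namespace> -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-training-job
  labels:
    kueue.x-k8s.io/queue-name: default
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: <training-image>
            command: ["python", "train.py"]
EOF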
Restrictions for managing workloads with Kueue
When you use Kueue to manage workloads in Open Data Hub, the following restrictions apply:
-
Namespaces must be labeled with
kueue.openshift.io/managed=trueto enable Kueue validation and queue enforcement. -
All workloads that you create from the Open Data Hub dashboard, such as workbenches and model servers, must use a hardware profile that specifies a local queue.
-
When you specify a local queue in a hardware profile, Open Data Hub automatically applies the corresponding
kueue.x-k8s.io/queue-namelabel to workloads that use that profile. -
You cannot use hardware profiles that contain node selectors or tolerations for node placement. To direct workloads to specific nodes, use a hardware profile that specifies a local queue that is associated with a queue configured with the appropriate resource flavors.
-
You cannot use accelerator profiles with Kueue. You must migrate any existing accelerator profiles to hardware profiles.
-
Because workbenches are not suspendable workloads, you can only assign them to a local queue that is associated with a non-preemptive cluster queue. The default cluster queue that Open Data Hub creates is non-preemptive.
Kueue workflow
Managing workloads with Kueue in Open Data Hub involves tasks for OpenShift Container Platform cluster administrators, Open Data Hub administrators, and machine learning (ML) engineers or data scientists:
Cluster administrator
Installs and configures Kueue:
-
Installs the Red Hat build of Kueue Operator on the cluster, as described in the Red Hat build of Kueue documentation.
-
Activates the Kueue integration by setting the
managementState to Unmanaged in the DataScienceCluster custom resource, as described in Configuring workload management with Kueue. -
Configures quotas to optimize resource allocation for user workloads, as described in the Red Hat build of Kueue documentation.
-
Enables Kueue in the dashboard by setting
disableKueue to false in the OdhDashboardConfig custom resource, as described in Enabling Kueue in the dashboard.
Note: When Kueue is enabled in the dashboard, Open Data Hub automatically enables Kueue management for all new projects created from the dashboard. For existing projects, or for projects created by using the OpenShift CLI (
oc), you must enable Kueue management manually by applying the kueue.openshift.io/managed=true label to the project namespace.
Open Data Hub administrator
Prepares the Open Data Hub environment:
-
Creates Kueue-enabled hardware profiles so that users can submit workloads from the Open Data Hub dashboard, as described in Working with hardware profiles.
ML Engineer or data scientist
Submits workloads to the queuing system:
-
For workloads created from the Open Data Hub dashboard, such as workbenches and model servers, selects a Kueue-enabled hardware profile during creation.
-
For workloads created by using a command-line interface or an SDK, such as distributed training jobs, adds the
kueue.x-k8s.io/queue-name label to the workload’s YAML manifest and sets its value to the target LocalQueue name.
Configuring workload management with Kueue
To use workload queuing in Open Data Hub, install the Red Hat build of Kueue Operator and activate the Kueue integration in Open Data Hub.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You are using OpenShift Container Platform 4.18 or later.
-
You have installed and configured the cert-manager Operator for Red Hat OpenShift for your cluster.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
In a terminal window, log in to the OpenShift CLI (
oc) as shown in the following example:$ oc login <openshift_cluster_url> -u <admin_username> -p <password> -
Install the Red Hat build of Kueue Operator on your OpenShift Container Platform cluster as described in the Red Hat build of Kueue documentation.
-
Activate the Kueue integration. You can use the predefined names for the default cluster queue and default local queue, or specify custom names.
-
To use the predefined queue names (
default), run the following command. Replace <operator-namespace> with your operator namespace. The default operator namespace is openshift-operators.
$ oc patch datasciencecluster default-dsc --type='merge' -p '{"spec":{"components":{"kueue":{"managementState":"Unmanaged"}}}}' -n <operator-namespace>
-
To specify custom queue names, run the following command. Replace
<example-cluster-queue> and <example-local-queue> with your custom queue names, and replace <operator-namespace> with your operator namespace. The default operator namespace is openshift-operators.
$ oc patch datasciencecluster default-dsc --type='merge' -p '{"spec":{"components":{"kueue":{"managementState":"Unmanaged","defaultClusterQueueName":"<example-cluster-queue>","defaultLocalQueueName":"<example-local-queue>"}}}}' -n <operator-namespace>
-
-
Verify that the Red Hat build of Kueue pods are running:
$ oc get pods -n openshift-kueue-operator
You should see output similar to the following example:
kueue-controller-manager-d9fc745df-ph77w     1/1   Running
openshift-kueue-operator-69cfbf45cf-lwtpm    1/1   Running
-
Verify that the default
ClusterQueue was created:
$ oc get clusterqueues
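If the default cluster queue exists, the output lists it. The following output is only illustrative; the column headings depend on your Kueue version:
NAME      COHORT   PENDING WORKLOADS
default            0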
-
Configure quotas by creating and modifying
ResourceFlavor, ClusterQueue, and LocalQueue objects. For details, see the Red Hat build of Kueue documentation. -
Enable Kueue in the dashboard so that users can select Kueue-enabled options when creating workloads. When you enable Kueue, you also enable Kueue management for all new projects created from the dashboard. See Enabling Kueue in the dashboard.
-
Cluster administrators and Open Data Hub administrators can create hardware profiles so that users can submit workloads from the Open Data Hub dashboard. See Working with hardware profiles.
Enabling Kueue in the dashboard
Enable Kueue in the Open Data Hub dashboard so that users can select Kueue-enabled options when creating workloads.
When you enable Kueue in the dashboard, Open Data Hub automatically enables Kueue management for all new projects created from the dashboard. For these projects, Open Data Hub applies the kueue.openshift.io/managed=true label to the namespace and creates a LocalQueue object if one does not already exist. The LocalQueue object is created with the opendatahub.io/managed=false annotation, so it is not managed after creation. Cluster administrators can modify or delete it as needed. A validating webhook then enforces that any new or updated workload resource in a Kueue-enabled project includes the kueue.x-k8s.io/queue-name label.
|
Note
|
For existing projects, or for projects created by using the OpenShift CLI (oc), you must enable Kueue management manually by applying the kueue.openshift.io/managed=true label to the project namespace.
|
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You are using OpenShift Container Platform 4.18 or later.
-
You have installed and activated the Red Hat build of Kueue Operator, as described in Configuring workload management with Kueue.
-
You have configured quotas, as described in the Red Hat build of Kueue documentation.
-
In a terminal window, log in to the OpenShift CLI (
oc) as shown in the following example:$ oc login <openshift_cluster_url> -u <admin_username> -p <password> -
Update the
odh-dashboard-config custom resource in the Open Data Hub applications namespace. Replace <applications-namespace> with your Open Data Hub applications namespace. The default is opendatahub.
$ oc patch odhdashboardconfig odh-dashboard-config \
  -n <applications-namespace> \
  --type merge \
  -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":false,"disableKueue":false}}}'
-
From the Open Data Hub dashboard, create a new project.
-
Verify that the project namespace is labeled for Kueue management:
$ oc get ns <project-namespace> -o jsonpath='{.metadata.labels.kueue\.openshift\.io/managed}{"\n"}'
The output should be
true. -
Confirm that a default
LocalQueue exists for the project namespace:
$ oc get localqueues -n <project-namespace>
Create a test workload (for example, a
Notebook) and verify that it includes the kueue.x-k8s.io/queue-name label.
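For example, the following command prints the queue-name label on a workbench's Notebook resource; the workbench name and project namespace are placeholders:
oc get notebook <workbench-name> -n <project-namespace> -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/queue-name}{"\n"}'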
-
Cluster administrators and Open Data Hub administrators can create hardware profiles so that users can submit workloads from the Open Data Hub dashboard. See Working with hardware profiles.
Troubleshooting common problems with Kueue
If your users are experiencing errors in Open Data Hub relating to Kueue workloads, read this section to understand what could be causing the problem, and how to resolve the problem.
A user receives a "failed to call webhook" error message for Kueue
After the user runs the cluster.apply() command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
The Kueue pod might not be running.
-
In the OpenShift Container Platform console, select the user’s project from the Project list.
-
Click Workloads → Pods.
-
Verify that the Kueue pod is running. If necessary, restart the Kueue pod.
-
Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:
{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
A user receives a "Default Local Queue … not found" error message
After the user runs the cluster.apply() command, the following error is shown:
Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
No default local queue is defined, and a local queue is not specified in the cluster configuration.
-
Check whether a local queue exists in the user’s project, as follows:
-
In the OpenShift Container Platform console, select the user’s project from the Project list.
-
Click Home → Search, and from the Resources list, select LocalQueue.
-
If no local queues are found, create a local queue.
-
Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
-
-
Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
A user receives a "local_queue provided does not exist" error message
After the user runs the cluster.apply() command, the following error is shown:
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.
-
In the OpenShift Container Platform console, select the user’s project from the Project list.
-
Click Search, and from the Resources list, select LocalQueue.
-
Resolve the problem in one of the following ways:
-
If no local queues are found, create a local queue.
-
If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the
namespacevalue in the cluster configuration matches their project name.
-
-
Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
-
The pod provisioned by Kueue is terminated before the image is pulled
Kueue waits for a period of time for all of the workload pods to become provisioned and running before marking the workload as ready. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.
-
In the OpenShift Container Platform console, select the user’s project from the Project list.
-
Click Workloads → Pods.
-
Click the user’s pod name to open the pod details page.
-
Click the Events tab, and review the pod events to check whether the image pull completed successfully.
If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:
-
Add an
OnFailure restart policy for resources that are managed by Kueue. -
Configure a custom timeout for the
waitForPodsReady property in the Kueue custom resource (CR). The CR is installed in the openshift-kueue-operator namespace by the Red Hat build of Kueue Operator.
For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.
Migrating to the Red Hat build of Kueue Operator
The embedded Kueue component for managing distributed workloads is deprecated.
Open Data Hub now uses the Red Hat build of Kueue Operator to provide enhanced workload scheduling for distributed training, workbench, and model serving workloads.
Check if your environment is using the embedded Kueue component by verifying the spec.components.kueue.managementState field in the DataScienceCluster custom resource. If the field is set to Managed, you must migrate to the Red Hat build of Kueue Operator before upgrading Open Data Hub to avoid controller conflicts and ensure continued support for queue-based workloads.
Open Data Hub does not automatically migrate workloads, and you cannot install both the embedded Kueue and the Red Hat build of Kueue Operator on the same cluster.
-
Your environment is currently using the embedded Kueue component. That is, the
spec.components.kueue.managementState field in the DataScienceCluster custom resource is set to Managed.
Note: If spec.components.kueue.managementState is set to Removed or Unmanaged, skip this migration. -
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You are using OpenShift Container Platform 4.18 or later.
-
You have installed and configured the cert-manager Operator for Red Hat OpenShift for your cluster.
-
Optional: When you migrate from the embedded Kueue to Red Hat build of Kueue, the Open Data Hub Operator automatically moves your existing Kueue configuration from the
kueue-manager-config ConfigMap to the Kueue custom resource (CR).
If you want to keep the
kueue-manager-config ConfigMap for reference, run the following command. Replace <applications-namespace> with your Open Data Hub applications namespace. The default namespace is opendatahub.
$ oc annotate configmap kueue-manager-config -n <applications-namespace> opendatahub.io/managed=false
-
Log in to the OpenShift Container Platform web console as a cluster administrator.
-
Uninstall the embedded Kueue component to avoid potential configuration conflicts.
NoteIf you need to keep workloads running without interruption, you can skip this step. However, skipping it is not recommended because it might cause temporary configuration issues during the Open Data Hub upgrade.
-
In the web console, click Operators → Installed Operators and then click the Open Data Hub Operator.
-
Click the Data Science Cluster tab.
-
Click the default-dsc object.
-
Click the YAML tab.
-
Set
spec.components.kueue.managementState to Removed as shown:
spec:
  components:
    kueue:
      managementState: Removed
Click Save.
-
Wait for the Open Data Hub Operator to reconcile, and then verify that the embedded Kueue was removed:
-
On the Details tab of the
default-dsc object, check that the KueueReady condition has a Status of False and a Reason of Removed. -
Go to Workloads → Deployments, select the project where Open Data Hub is installed (for example,
redhat-ods-applications), and confirm that Kueue-related deployments (for example, kueue-controller-manager) are no longer present.
-
-
-
Install the Red Hat build of Kueue Operator on your OpenShift Container Platform cluster:
-
Follow the steps to install the Red Hat build of Kueue Operator, as described in the Red Hat build of Kueue documentation.
-
Go to Operators → Installed Operators and confirm that the Red Hat build of Kueue Operator is listed with Status as Succeeded.
-
-
Activate the Red Hat build of Kueue Operator in Open Data Hub:
-
In the web console, click Operators → Installed Operators and then click the Open Data Hub Operator.
-
Click the Data Science Cluster tab.
-
Click the default-dsc object.
-
Click the YAML tab.
-
Set
spec.components.kueue.managementState to Unmanaged. You can either use the predefined names (default) for the default cluster queue and default local queue, or specify custom names, as shown in the following examples.
-
To use the predefined queue names, apply the following configuration:
spec:
  components:
    kueue:
      managementState: Unmanaged
-
To specify custom queue names, apply the following configuration, replacing
<example-cluster-queue> and <example-local-queue> with your custom values:
spec:
  components:
    kueue:
      managementState: Unmanaged
      defaultClusterQueueName: <example-cluster-queue>
      defaultLocalQueueName: <example-local-queue>
-
-
Click Save.
-
-
Enable Kueue management for existing projects by applying the
kueue.openshift.io/managed=true label to each project namespace:
$ oc label namespace <project-namespace> kueue.openshift.io/managed=true --overwrite
Replace
<project-namespace> with the name of your project.
Note: Kueue validation and queue enforcement apply only to workloads in namespaces labeled with
kueue.openshift.io/managed=true.
-
Verify that the embedded Kueue component was removed.
-
Verify that the
DataScienceCluster resource shows a healthy Unmanaged status for Kueue. -
Verify that existing workloads in the queue continue to be processed by the Red Hat build of Kueue controllers. Submit a new test workload to confirm functionality.
-
Configure quotas by creating and modifying
ResourceFlavor, ClusterQueue, and LocalQueue objects. For details, see the Red Hat build of Kueue documentation. -
Enable Kueue in the dashboard so that users can select Kueue-enabled options when creating workloads. When enabled, Kueue management is automatically applied to all new projects created from the dashboard. See Enabling Kueue in the dashboard.
-
Cluster administrators and Open Data Hub administrators can create hardware profiles so that users can submit workloads from the Open Data Hub dashboard. See Working with hardware profiles.
Managing distributed workloads
In Open Data Hub, distributed workloads like PyTorchJob, RayJob, and RayCluster are created and managed by their respective workload operators. Kueue provides queueing and admission control and integrates with these operators to decide when workloads can run based on cluster-wide quotas.
You can perform advanced configuration for your distributed workloads environment, such as changing the default behavior of the CodeFlare Operator or setting up a cluster for RDMA.
Configuring quota management for distributed workloads
Configure quotas for distributed workloads by creating Kueue resources. Quotas ensure that you can share resources between several data science projects.
-
You have logged in to OpenShift Container Platform with the
cluster-adminrole. -
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
You have installed and activated the Red Hat build of Kueue Operator as described in Configuring workload management with Kueue.
-
You have installed the required distributed workloads components as described in Installing the distributed workloads components.
-
You have created a data science project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the Standard Data Science workbench. For information about how to create a project, see Creating a data science project.
-
You have sufficient resources. In addition to the base Open Data Hub resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
-
The resources are physically available in the cluster. For more information about Kueue resources, see the Red Hat build of Kueue documentation.
-
If you want to use graphics processing units (GPUs), you have enabled GPU support. This process includes installing the Node Feature Discovery Operator and the relevant GPU Operator. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation for NVIDIA GPUs and AMD GPU Operator on Red Hat OpenShift Container Platform in the AMD documentation for AMD GPUs.
-
In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI (
oc) as shown in the following example:$ oc login <openshift_cluster_url> -u <admin_username> -p <password> -
Verify that a resource flavor exists or create a custom one, as follows:
-
Check whether a
ResourceFlavoralready exists:$ oc get resourceflavors -
If a
ResourceFlavoralready exists and you need to modify it, edit it in place:$ oc edit resourceflavor <existing_resourceflavor_name> -
If a
ResourceFlavor does not exist or you want a custom one, create a file called default_flavor.yaml and populate it with the following content:
Empty Kueue resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: <example_resource_flavor>
For more examples, see Example Kueue resource configurations.
-
Perform one of the following actions:
-
If you are modifying the existing resource flavor, save the changes.
-
If you are creating a new resource flavor, apply the configuration to create the
ResourceFlavorobject:$ oc apply -f default_flavor.yaml
-
-
-
Verify that a default cluster queue exists or create a custom one, as follows:
NoteOpen Data Hub automatically created a default cluster queue when the Kueue integration was activated. You can verify and modify the default cluster queue, or create a custom one.
-
Check whether a
ClusterQueuealready exists:$ oc get clusterqueues -
If a
ClusterQueuealready exists and you need to modify it (for example, to change the resources), edit it in place:$ oc edit clusterqueue <existing_clusterqueue_name> -
If a
ClusterQueue does not exist or you want a custom one, create a file called cluster_queue.yaml and populate it with the following content:
Example cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: <example_cluster_queue>
spec:
  namespaceSelector: {}  # (1)
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]  # (2)
    flavors:
    - name: "<resource_flavor_name>"  # (3)
      resources:  # (4)
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 5
Defines which namespaces can use the resources governed by this cluster queue. An empty
namespaceSelectoras shown in the example means that all namespaces can use these resources. -
Defines the resource types governed by the cluster queue. This example
ClusterQueue object governs CPU, memory, and GPU resources. If you use AMD GPUs, replace nvidia.com/gpu with amd.com/gpu in the example code. -
Defines the resource flavor that is applied to the resource types listed. In this example, the <resource_flavor_name> resource flavor is applied to CPU, memory, and GPU resources.
-
Defines the resource requirements for admitting jobs. The cluster queue will start a distributed workload only if the total required resources are within these quota limits.
-
-
Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. If you use AMD GPUs, replace
nvidia.com/gpu with amd.com/gpu in the example code. For more examples, see Example Kueue resource configurations.
You must specify a quota for each resource that the user can request, even if the requested value is 0, by updating the
spec.resourceGroups section as follows:
-
Include the resource name in the
coveredResourceslist. -
Specify the resource
name and nominalQuota in the flavors.resources section, even if the nominalQuota value is 0.
-
-
Perform one of the following actions:
-
If you are modifying the existing cluster queue, save the changes.
-
If you are creating a new cluster queue, apply the configuration to create the
ClusterQueueobject:$ oc apply -f cluster_queue.yaml
-
-
-
Verify that a local queue that points to your cluster queue exists for your project namespace, or create a custom one, as follows:
NoteIf Kueue is enabled in the Open Data Hub dashboard, new projects created from the dashboard are automatically configured for Kueue management. In those namespaces, a default local queue might already exist. You can verify and modify the local queue, or create a custom one.
-
Check whether a
LocalQueuealready exists for your project namespace:$ oc get localqueues -n <project_namespace> -
If a
LocalQueuealready exists and you need to modify it (for example, to point to a differentClusterQueue), edit it in place:$ oc edit localqueue <existing_localqueue_name> -n <project_namespace> -
If a
LocalQueue does not exist or you want a custom one, create a file called local_queue.yaml and populate it with the following content:
Example local queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: <example_local_queue>
  namespace: <project_namespace>
spec:
  clusterQueue: <cluster_queue_name>
Replace the
name, namespace, and clusterQueue values accordingly. -
Perform one of the following actions:
-
If you are modifying an existing local queue, save the changes.
-
If you are creating a new local queue, apply the configuration to create the
LocalQueueobject:$ oc apply -f local_queue.yaml
-
-
Check the status of the local queue in a project, as follows:
$ oc get localqueues -n <project_namespace>
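The output lists each local queue and the cluster queue that it points to. The following output is only illustrative; the column headings depend on your Kueue version:
NAME      CLUSTERQUEUE   PENDING WORKLOADS   ADMITTED WORKLOADS
default   default        0                   0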
Example Kueue resource configurations for distributed workloads
You can use these example configurations as a starting point for creating Kueue resources to manage your distributed training workloads.
These examples show how to configure Kueue resource flavors and cluster queues for common distributed training scenarios.
NVIDIA GPUs without shared cohort
NVIDIA RTX A400 GPU resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: "a400node"
spec:
nodeLabels:
instance-type: nvidia-a400-node
tolerations:
- key: "HasGPU"
operator: "Exists"
effect: "NoSchedule"
NVIDIA RTX A1000 GPU resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: "a1000node"
spec:
nodeLabels:
instance-type: nvidia-a1000-node
tolerations:
- key: "HasGPU"
operator: "Exists"
effect: "NoSchedule"
NVIDIA RTX A400 GPU cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "a400queue"
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
flavors:
- name: "a400node"
resources:
- name: "cpu"
nominalQuota: 16
- name: "memory"
nominalQuota: 64Gi
- name: "nvidia.com/gpu"
nominalQuota: 2
NVIDIA RTX A1000 GPU cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "a1000queue"
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
flavors:
- name: "a1000node"
resources:
- name: "cpu"
nominalQuota: 16
- name: "memory"
nominalQuota: 64Gi
- name: "nvidia.com/gpu"
nominalQuota: 2
NVIDIA GPUs and AMD GPUs without shared cohort
AMD GPU resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: "amd-node"
spec:
nodeLabels:
instance-type: amd-node
tolerations:
- key: "HasGPU"
operator: "Exists"
effect: "NoSchedule"
NVIDIA GPU resource flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: "nvidia-node"
spec:
nodeLabels:
instance-type: nvidia-node
tolerations:
- key: "HasGPU"
operator: "Exists"
effect: "NoSchedule"
AMD GPU cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "team-a-amd-queue"
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory", "amd.com/gpu"]
flavors:
- name: "amd-node"
resources:
- name: "cpu"
nominalQuota: 16
- name: "memory"
nominalQuota: 64Gi
- name: "amd.com/gpu"
nominalQuota: 2
NVIDIA GPU cluster queue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "team-a-nvidia-queue"
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
flavors:
- name: "nvidia-node"
resources:
- name: "cpu"
nominalQuota: 16
- name: "memory"
nominalQuota: 64Gi
- name: "nvidia.com/gpu"
nominalQuota: 2
-
Resource Flavor in the Kueue documentation
-
Cluster Queue in the Kueue documentation
Configuring the CodeFlare Operator
If you want to change the default configuration of the CodeFlare Operator for distributed workloads in Open Data Hub, you can edit the associated config map.
-
You have logged in to OpenShift Container Platform with the
cluster-adminrole. -
You have installed the required distributed workloads components as described in Installing the distributed workloads components.
-
In the OpenShift Container Platform console, click Workloads → ConfigMaps.
-
From the Project list, select odh.
-
Search for the codeflare-operator-config config map, and click the config map name to open the ConfigMap details page.
-
Click the YAML tab to show the config map specifications.
-
In the
data:config.yaml:kuberay section, you can edit the following entries:
- ingressDomain
-
This configuration option is null (
ingressDomain: "") by default. Do not change this option unless the Ingress Controller is not running on OpenShift. Open Data Hub uses this value to generate the dashboard and client routes for every Ray Cluster, as shown in the following examples:
Example dashboard and client routes
ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>
ray-client-<clustername>-<namespace>.<your.ingress.domain>
- mTLSEnabled
-
This configuration option is enabled (
mTLSEnabled: true) by default. When this option is enabled, the Ray Cluster pods create certificates that are used for mutual Transport Layer Security (mTLS), a form of mutual authentication, between Ray Cluster nodes. When this option is enabled, Ray clients cannot connect to the Ray head node unless they download the generated certificates from the ca-secret-<cluster_name> secret, generate the necessary certificates for mTLS communication, and then set the required Ray environment variables. Users must then re-initialize the Ray clients to apply the changes. The CodeFlare SDK provides the following functions to simplify the authentication process for Ray clients:
Example Ray client authentication code
from codeflare_sdk import generate_cert
generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
generate_cert.export_env(cluster.config.name, cluster.config.namespace)
ray.init(cluster.cluster_uri())
- rayDashboardOAuthEnabled
-
This configuration option is enabled (
rayDashboardOAuthEnabled: true) by default. When this option is enabled, Open Data Hub places an OpenShift OAuth proxy in front of the Ray Cluster head node. Users must then authenticate by using their OpenShift cluster login credentials when accessing the Ray Dashboard through the browser. If users want to access the Ray Dashboard in another way (for example, by using the Ray JobSubmissionClient class), they must set an authorization header as part of their request, as shown in the following example:
Example authorization header
{Authorization: "Bearer <your-openshift-token>"}
-
To save your changes, click Save.
-
To apply your changes, delete the pod:
-
Click Workloads → Pods.
-
Find the codeflare-operator-manager-<pod-id> pod.
-
Click the options menu (⋮) for that pod, and then click Delete Pod. The pod restarts with your changes applied.
-
Check the status of the codeflare-operator-manager pod, as follows:
-
In the OpenShift Container Platform console, click Workloads → Deployments.
-
Search for the codeflare-operator-manager deployment, and then click the deployment name to open the deployment details page.
-
Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and then click the Logs tab.
Configuring a cluster for RDMA
NVIDIA GPUDirect RDMA uses Remote Direct Memory Access (RDMA) to provide direct GPU interconnect. To configure a cluster for RDMA, a cluster administrator must install and configure several Operators.
-
You can access an OpenShift cluster as a cluster administrator.
-
Your cluster has multiple worker nodes with supported NVIDIA GPUs, and can access a compatible NVIDIA accelerated networking platform.
-
You have installed Open Data Hub with the required distributed training components as described in Installing the distributed workloads components.
-
You have configured the distributed training resources as described in Managing distributed workloads.
-
Log in to the OpenShift Console as a cluster administrator.
-
Enable NVIDIA GPU support in Open Data Hub.
This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator. For more information, see NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
Note: After the NVIDIA GPU Operator is installed, ensure that
rdma is set to enabled in your ClusterPolicy custom resource instance.
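For example, a patch similar to the following sketch enables the setting. The ClusterPolicy name gpu-cluster-policy is the name that the NVIDIA GPU Operator console workflow typically creates; confirm the actual name in your cluster with oc get clusterpolicy before patching:
oc patch clusterpolicy gpu-cluster-policy --type merge -p '{"spec": {"driver": {"rdma": {"enabled": true}}}}'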
To simplify the management of NVIDIA networking resources, install and configure the NVIDIA Network Operator, as follows:
-
Install the NVIDIA Network Operator, as described in Adding Operators to a cluster in the OpenShift documentation.
-
Configure the NVIDIA Network Operator, as described in the deployment examples in the Network Operator Application Notes in the NVIDIA documentation.
-
-
Optional: To use Single Root I/O Virtualization (SR-IOV) deployment modes, complete the following steps:
-
Install the SR-IOV Network Operator, as described in the Installing the SR-IOV Network Operator section in the OpenShift documentation.
-
Configure the SR-IOV Network Operator, as described in the Configuring the SR-IOV Network Operator section in the OpenShift documentation.
-
-
Use the Machine Configuration Operator to increase the limit of pinned memory for non-root users in the container engine (CRI-O) configuration, as follows:
-
In the OpenShift Console, in the Administrator perspective, click Compute → MachineConfigs.
-
Click Create MachineConfig.
-
Replace the placeholder text with the following content:
Example machine configuration
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 02-worker-container-runtime
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          inline: |
            [crio.runtime]
            default_ulimits = [
              "memlock=-1:-1"
            ]
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/10-custom
Edit the
default_ulimitsentry to specify an appropriate value for your configuration. For more information about default limits, see the Set default ulimits on CRIO Using machine config Knowledgebase solution. -
Click Create.
-
Restart the worker nodes to apply the machine configuration.
This configuration enables non-root users to run the training job with RDMA in the most restrictive OpenShift default security context.
-
-
Verify that the Operators are installed correctly, as follows:
-
In the OpenShift Console, in the Administrator perspective, click Workloads → Pods.
-
Select your project from the Project list.
-
Verify that a pod is running for each of the newly installed Operators.
-
-
Verify that RDMA is being used, as follows:
-
Edit the
PyTorchJob resource to set the NCCL_DEBUG environment variable to INFO, as shown in the following example:
Setting the NCCL debug level to INFO
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - "your container command"
    env:
    - name: NCCL_SOCKET_IFNAME
      value: "net1"
    - name: NCCL_IB_HCA
      value: "mlx5_1"
    - name: NCCL_DEBUG
      value: "INFO"
Run the PyTorch job.
-
Check that the pod logs include an entry similar to the following text:
Example pod log entry
NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]
-
Troubleshooting common problems with distributed workloads for administrators
If your users are experiencing errors in Open Data Hub relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve the problem.
A user’s Ray cluster is in a suspended state
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
The user’s Ray cluster head pod or worker pods remain in a suspended state.
Check the status of the Workload resource that is created with the RayCluster resource.
The status.conditions.message field provides the reason for the suspended state, as shown in the following example:
status:
conditions:
- lastTransitionTime: '2024-05-29T13:05:09Z'
message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
-
Check whether the resource flavor is created, as follows:
-
In the OpenShift Container Platform console, select the user’s project from the Project list.
-
Click Home → Search, and from the Resources list, select ResourceFlavor.
-
If necessary, create the resource flavor.
-
-
Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.
-
If necessary, increase the resource quota.
For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.
A user’s Ray cluster is in a failed state
The user might have insufficient resources.
The user’s Ray cluster head pod or worker pods are not running.
When a Ray cluster is created, it initially enters a failed state.
This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.
If the failed state persists, complete the following steps:
-
In the OpenShift Container Platform console, select the user’s project from the Project list.
-
Click Workloads → Pods.
-
Click the user’s pod name to open the pod details page.
-
Click the Events tab, and review the pod events to identify the cause of the problem.
-
Check the status of the
Workload resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.
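For example, you can inspect the Kueue Workload resources in the user's project with commands similar to the following; the namespace and workload names are placeholders:
oc get workloads.kueue.x-k8s.io -n <project_namespace>
oc describe workload.kueue.x-k8s.io <workload_name> -n <project_namespace>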
A user receives a "failed to call webhook" error message for the CodeFlare Operator
After the user runs the cluster.apply() command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
The CodeFlare Operator pod might not be running.
-
In the OpenShift Container Platform console, select the user’s project from the Project list.
-
Click Workloads → Pods.
-
Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.
-
Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example:
INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
A user’s Ray cluster does not start
After the user runs the cluster.apply() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready.
No pods are created.
Check the status of the Workload resource that is created with the RayCluster resource.
The status.conditions.message field provides the reason for remaining in the Starting state.
Similarly, check the status.conditions.message field for the RayCluster resource.
-
In the OpenShift Container Platform console, select the user’s project from the Project list.
-
Click Workloads → Pods.
-
Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
-
Review the logs for the KubeRay pod to identify errors.
A user cannot create a Ray cluster or submit jobs
After the user runs the cluster.apply() command, an error similar to the following text is shown:
RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.
-
Advise the user to identify and specify the correct OpenShift login credentials as follows:
-
In the OpenShift Container Platform console header, click your username and click Copy login command.
-
In the new tab that opens, log in as the user whose credentials you want to use.
-
Click Display Token.
-
From the Log in with this token section, copy the
token and server values. -
Specify the copied
token and server values in your notebook code as follows:
auth = TokenAuthentication(
    token = "<token>",
    server = "<server>",
    skip_tls=False
)
auth.login()
-
-
Verify that the user has the correct permissions and is part of the
odh-users group.
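For example, the following command lists the members of the default Open Data Hub user group; adjust the group name if your deployment uses a different one:
oc get group odh-users -o jsonpath='{.users}{"\n"}'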
Configuring a central authentication service for an external OIDC identity provider
The built-in OpenShift OAuth server supports integration with various identity providers. However, it has limitations in direct OpenID Connect (OIDC) configurations on Red Hat OpenShift Service on AWS (ROSA) and on-premises OpenShift Container Platform (OCP) 4.20+ clusters. The internal oauth service can be disabled or removed, which breaks dependencies like oauth-proxy sidecar containers.
You can configure an external OIDC identity provider directly with Open Data Hub by configuring a centralized Gateway API. The Gateway API configuration provides a secure, scalable, and manageable authentication solution because it centralizes the authentication logic and decouples it from individual backend services.
|
Important
|
OpenID Connect (OIDC) configuration is currently available in Red Hat OpenShift AI 3.0 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope. |
About centralized authentication Gateway API
A Gateway API with centralized authentication centralizes ingress for all services behind a single domain, providing the following advanced capabilities:
-
Centralized authentication: A single authentication service requiring only one client ID and secret from the external OIDC Identity Provider (IDP).
-
Simplified backend services: Backend services assume all incoming traffic is authenticated and contains necessary user headers.
-
Authorization handling: Services still handle authorization at the service or pod level using sidecars like
kube-rbac-proxy. -
Encrypted Communication: Traffic from the gateway to the backend services is fully encrypted with Transport Layer Security (TLS).
The Gateway API is implemented via an Istio Gateway on OpenShift Container Platform (OCP) 4.19 and later. Since the Istio Gateway is built on the Envoy Proxy, it provides access to powerful Envoy-specific Custom Resource Definitions (CRDs), such as EnvoyFilter. The opendatahub-operator manages the deployment of kube-auth-proxy. The Operator then configures the Istio Gateway to use this service via an EnvoyFilter Custom Resource (CR).
For more information about supported OpenID Connect (OIDC) identity providers, see the OCP documentation on Direct authentication identity providers.
Security considerations
-
Secret Management: Store OIDC client secrets securely and rotate them regularly.
-
Network Policies: Consider implementing network policies to restrict access to the authentication proxy.
-
TLS Configuration: Ensure all OIDC communication uses Transport Layer Security (TLS).
-
Token Validation: While
kube-auth-proxyvalidates tokens, ensure your OIDC provider is configured with appropriate token lifetimes. -
Audit Logging: Enable audit logging for authentication events.
Configuring OpenID Connect (OIDC) authentication for Gateway API
As an Open Data Hub administrator, you can configure OpenID Connect (OIDC) authentication for the Gateway API by using parameters from your external OIDC identity provider.
|
Important
|
You must configure OpenShift Container Platform for direct authentication with an external OIDC identity provider before you configure the Open Data Hub Gateway; otherwise, the Gateway does not function properly. |
-
You have configured the OpenShift Container Platform for direct authentication with an external OIDC identity provider.
-
To configure OpenShift for direct authentication, follow the appropriate OCP documentation: Enabling direct authentication with an external OIDC identity provider.
-
To configure OpenShift for direct authentication using ROSA, follow the appropriate Red Hat OpenShift Service on AWS documentation: Creating an OpenID Connect Configuration.
-
|
Note
|
You must configure OpenShift for direct authentication using the same OIDC provider that the Gateway will use. |
-
You have successfully installed and deployed Open Data Hub.
-
You have deployed the DataScienceCluster (DSC) and DSCInitialization. For more information, see Installing and deploying OpenShift.
-
You have deployed the OpenShift AI Operator in the
rhods-operatornamespace. -
You have enabled Gateway API support on OCP 4.19+ with Istio Gateway.
-
You have the following external authentication provider details:
-
Issuer URL
-
Client ID
-
Client Secret
-
Realm name (for Keycloak)
-
-
You have cluster administrator access, which is required to create Secrets and configure
GatewayConfig.
-
For detailed step-by-step instructions, troubleshooting, and field definitions, refer to the OpenShift Container Platform documentation on Configuring an external OIDC identity provider.
-
In the OpenShift CLI (
oc), verify the OpenShift authentication type by running the following command:oc get authentication.config/cluster -o jsonpath='{.spec.type}'If the authentication is successful, you will see the following output:
OIDC -
Verify that your OIDC provider is configured as expected by running the following command:
oc get authentication.config/cluster -o jsonpath='{.spec.oidcProviders[*].name}'
If the OIDC configuration is successful, the output shows your provider name (for example,
keycloak). -
Verify that the
kube-apiserverhas rolled out changes as expected.oc get co kube-apiserverIf success is indicated, the expected output should look like the following example:
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE kube-apiserver 4.14.9 True False False 1d
NoteThe rollout can take 20 minutes or more. Wait until all nodes have the new revision before proceeding. You can proceed to Gateway configuration steps when oc get authentication.config/clustershowstype: OIDC,oc get co kube-apiservershows the authentication rollout is complete, and you can successfully authenticate to OpenShift using OIDC credentials. -
Define the following environment variables. You must replace the placeholder values with the actual details from your OIDC Identity Provider (IDP):
# Replace with your actual values KEYCLOAK_DOMAIN="keycloak.example.com" KEYCLOAK_REALM="your-realm" KEYCLOAK_CLIENT_ID="your-client-id" KEYCLOAK_CLIENT_SECRET="your-client-secret" -
Create the client secret in the
openshift-ingressnamespace:oc create secret generic keycloak-client-secret \ --from-literal=clientSecret=$KEYCLOAK_CLIENT_SECRET \ -n openshift-ingress -
Update the
GatewayConfigcustom resource to enable OIDC authentication by patching it with the Secret reference and OIDC details:oc patch gatewayconfig default-gateway --type='merge' -p='{ "spec": { "oidc": { "issuerURL": "https://'$KEYCLOAK_DOMAIN'/realms/'$KEYCLOAK_REALM'", "clientID": "'$KEYCLOAK_CLIENT_ID'", "clientSecretRef": { "name": "keycloak-client-secret", "key": "clientSecret" } } } }' -
Verify that the client secret has been created and that the
GatewayConfigshows the correct OIDC configuration:oc get secret keycloak-client-secret -n openshift-ingress oc get gatewayconfig default-gateway -o jsonpath='{.spec.oidc}'Expected output for secret and
GatewayConfigshould look like the following example:# Expected output (for secret) NAME TYPE DATA AGE keycloak-client-secret Opaque 1 2m # Expected output (for GatewayConfig) {"clientID":"your-client-id","clientSecretRef":{"key":"clientSecret","name":"keycloak-client-secret"},"issuerURL":"https://keycloak.example.com/realms/your-realm"}
-
After you configure the Gateway for your identity provider, verify that you can access the OpenShift console.
-
Access the Gateway by opening the console link returned by the following command:
$ oc get consolelink -
Login with your OIDC credentials and verify the following:
-
You are redirected to the OIDC provider login page. A successful authentication redirects back to the Gateway.
-
The Open Data Hub components, such as the dashboard and notebooks, are accessible.
-
-
-
Check the
GatewayConfigstatus to verify that the OIDC configuration was successfully provisioned:oc get gatewayconfig default-gateway -o yamlThe expected output is the full YAML configuration of the
GatewayConfigresource, showing the OIDC configuration details underspec.oidcand confirming successful deployment by displaying both theReadyandProvisioningSucceededconditions with astatus: "True"value. -
Verify the
kube-auth-proxydeployment is running successfully in theopenshift-ingressnamespace:oc get deployment kube-auth-proxy -n openshift-ingressThe expected output should look like the following example:
NAME READY UP-TO-DATE AVAILABLE AGE kube-auth-proxy 1/1 1 1 5m -
Check the status and accessibility of the
data-science-gateway:oc get gateway data-science-gateway -n openshift-ingressThe expected output should look like the following example:
NAME CLASS ADDRESS PROGRAMMED AGE data-science-gateway data-science-gateway-class aa87f5da7f0c748d5aa63b4916604108-107643684.us-east-1.elb.amazonaws.com True 5m -
Test the OpenID Connect (OIDC) discovery endpoint by running the following command:
curl -s https://your-keycloak-domain/realms/your-realm/.well-known/openid-configurationThe expected output is a JSON object containing the OIDC configuration endpoints (
issuer,authorization_endpoint,token_endpoint, etc.) that confirm the OIDC provider is publicly discoverable.
After the external OIDC provider is configured and authentication is working, a cluster administrator must configure authorization by mapping external Identity Provider (IDP) groups to specific OpenShift ClusterRoles to grant access to projects and resources.
-
Create a ClusterRole that grants users read and list access to OpenShift projects in the console, and apply the manifest to the cluster as shown in the sketch after this procedure:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: odh-projects-read
rules:
- apiGroups: ["project.openshift.io"]
  resources: ["projects"]
  verbs: ["get","list"]
Bind the odh-projects-read ClusterRole to your Identity Provider (IDP) group (for example, odh-users):
oc adm policy add-cluster-role-to-group odh-projects-read odh-users -
Grant the ability to create and manage new projects by assigning the built-in self-provisioner ClusterRole to your group:
oc adm policy add-cluster-role-to-group self-provisioner odh-users
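The following commands are a minimal sketch of applying the ClusterRole manifest from the first step and spot-checking the resulting access. The file name odh-projects-read.yaml and the <username> placeholder are examples:
# Apply the ClusterRole manifest
oc apply -f odh-projects-read.yaml

# Confirm the group bindings created by the previous steps
oc get clusterrolebinding -o wide | grep odh-users

# Spot-check that a member of the odh-users group can list projects
oc auth can-i list projects --as=<username> --as-group=odh-users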
Troubleshooting common problems with Gateway API configuration
If your users are experiencing errors in Open Data Hub relating to Gateway API configuration, read this section to understand what might be causing the problem and how to resolve it.
The GatewayConfig status shows as not ready
While setting up OIDC, the GatewayConfig status shows as not ready. You see error messages about missing OIDC configuration, and the GatewayConfig resource shows its status as Ready: False.
-
Check
GatewayConfigstatus by running the following command.oc get gatewayconfig default-gateway -o yaml -
Check for specific error messages by running the following command.
oc describe gatewayconfig default-gatewayThe expected output confirms the
GatewayConfig resource has been successfully provisioned by showing the OIDC configuration details under Spec.Oidc and displaying both the Ready and ProvisioningSucceeded status conditions with a status of True.
Verify that the OIDC configuration is correct by running the following command.
oc get gatewayconfig default-gateway -o jsonpath='{.spec.oidc}'Expected output should look like the following example.
{"clientID":"your-client-id","clientSecretRef":{"key":"clientSecret","name":"keycloak-client-secret"},"issuerURL":"https://keycloak.example.com/realms/your-realm"}
-
Verify the OIDC secret exists and is correct by running the following command.
oc get secret keycloak-client-secret -n openshift-ingress -
Check OIDC issuer URL accessibility by running the following command.
curl -I https://your-keycloak-domain/realms/your-realm/.well-known/openid-configurationThe expected output confirms the OIDC issuer URL is accessible by returning the HTTP status code
HTTP/2 200 and the correct content-type: application/json header.
Ensure that the client Secret is correct.
Authentication proxy fails to start
The authentication proxy component fails to start after deploying kube-auth-proxy. The associated Pods are in a failing state, showing statuses such as CrashLoopBackOff or Pending, and the kube-auth-proxy Deployment is not ready.
-
Check the
kube-auth-proxydeployment status by running the following command.oc get deployment kube-auth-proxy -n openshift-ingressThe expected output confirms that the deployment is successfully provisioned, showing
1/1under theREADYcolumn. -
Check the Pod logs by running the following command.
oc logs -l app=kube-auth-proxy -n openshift-ingressThe expected output confirms that the OAuth2 Proxy is configured and starting on the specified ports.
# Expected output time="2024-01-15T10:30:00Z" level=info msg="OAuth2 Proxy configured" time="2024-01-15T10:30:00Z" level=info msg="OAuth2 Proxy starting on :4180" time="2024-01-15T10:30:00Z" level=info msg="OAuth2 Proxy starting on :8443" -
Check the Pod events for errors by running the following command.
oc describe pod -l app=kube-auth-proxy -n openshift-ingressThe expected output should look like the following example.
# Expected output Name: kube-auth-proxy-7d4f8b9c6-xyz12 Namespace: openshift-ingress Status: Running Containers: kube-auth-proxy: State: Running Ready: True Restart Count: 0 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 5m default-scheduler Successfully assigned openshift-ingress/kube-auth-proxy-7d4f8b9c6-xyz12 to worker-node-1
-
Verify the authentication secret contains the correct client secret by running the following command.
oc get secret kube-auth-proxy-creds -n openshift-ingress -o yamlThe expected output should contain the keys
OAUTH2_PROXY_CLIENT_SECRET,OAUTH2_PROXY_COOKIE_SECRET, andOAUTH2_PROXY_CLIENT_ID. -
Check if the OIDC issuer URL is accessible from the cluster by running the following command.
curl -I https://your-keycloak-domain/realms/your-realm/.well-known/openid-configurationThe expected output should return the HTTP status code
HTTP/2 200. -
Ensure that the client ID exists in your OIDC provider.
The Gateway is inaccessible
After configuring OIDC, you cannot access the Gateway URL: https://data-science-gateway.$CLUSTER_DOMAIN. Attempts to access the URL return 502 (Bad Gateway) or 503 (Service Unavailable) errors, indicating a networking failure that prevents external access or traffic routing to the service endpoint.
-
Check the Gateway status of
data-science-gatewayby running the following command.oc get gateway data-science-gateway -n openshift-ingressThe expected output shows
PROGRAMMEDcolumn asTrue, and a valid address is listed under theADDRESScolumn. -
Check the
HTTPRoutestatus by running the following command.oc get httproute -n openshift-ingressThe expected output shows that the
oauth-callback-routeis present. -
Check the
EnvoyFilterby running the following command.oc get envoyfilter -n openshift-ingressThe expected output shows that the
authn-filteris present. -
Check the
kube-auth-proxyService by running the following command.oc get service kube-auth-proxy -n openshift-ingressThe expected output shows that the Service and correct ports are present like the following example.
# Expected output NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kube-auth-proxy ClusterIP 172.30.31.69 <none> 8443/TCP,9000/TCP 41h
-
Verify the Gateway has a valid address by running the following command.
oc get gateway data-science-gateway -o jsonpath='{.status.addresses}'The expected output shows a valid IP address or hostname.
-
Check if the
HTTPRouteis properly configured by running the following command.oc describe httproute oauth-callback-route -n openshift-ingressThe expected output confirms proper parent references and backend services.
-
Ensure the
EnvoyFilter is applied correctly by running the following command.
oc describe envoyfilter authn-filter -n openshift-ingress
The expected output confirms the proper configuration for
kube-auth-proxy.
The OIDC authentication fails
The OIDC authentication fails and you are unable to log in through the Gateway. You also experience symptoms such as redirect loops or explicit authentication errors after attempting to log in.
-
Check the
kube-auth-proxylogs for specific error messages by running the following command.oc logs -l app=kube-auth-proxy -n openshift-ingressThe expected output confirms that the OAuth2 Proxy is configured and starting on the specified ports.
-
Verify the OIDC configuration in the
kube-auth-proxySecret by running the following command.oc get secret kube-auth-proxy-creds -n openshift-ingress -o yamlThe expected output shows that the Secret contains the keys
OAUTH2_PROXY_CLIENT_ID,OAUTH2_PROXY_CLIENT_SECRET, andOAUTH2_PROXY_COOKIE_SECRET. The output should look like the following example.# Expected output apiVersion: v1 kind: Secret metadata: name: kube-auth-proxy-creds namespace: openshift-ingress type: Opaque data: OAUTH2_PROXY_CLIENT_ID: b2RoLWNsaWVudA== # base64 encoded "data-science" OAUTH2_PROXY_CLIENT_SECRET: <base64-encoded-secret> OAUTH2_PROXY_COOKIE_SECRET: <base64-encoded-cookie-secret> -
Test the OIDC discovery endpoint by running the following command.
curl -s https://your-keycloak-domain/realms/your-realm/.well-known/openid-configuration | jq .The expected output returns the complete JSON configuration, including valid endpoints for
issuer,authorization_endpoint, andtoken_endpoint.
-
Log in to the OIDC provider (for example, Keycloak) and verify that the redirect URI registered for the client matches the expected endpoint on the Gateway:
https://data-science-gateway.$CLUSTER_DOMAIN/oauth2/callback. Mismatches are a frequent cause of redirect loops.
Check that the client secret stored in the cluster matches the value configured in your OIDC provider by running the following command.
oc get secret keycloak-client-secret -n openshift-ingress -o jsonpath='{.data.clientSecret}' | base64 -d
The expected output matches the client secret configured in your OIDC provider.
-
Ensure that the issuer URL is accessible and correct by running the following command.
curl -I https://your-keycloak-domain/realms/your-realm/.well-known/openid-configurationThe expected output returns
HTTP/2 200.
The dashboard is not accessible after authentication
After successfully authenticating with OIDC, you experience an authorization failure that prevents access to the dashboard. The failure results in 403 Forbidden errors while accessing the dashboard.
-
Check the
odh-dashboardDeployment status by running the following command.oc get deployment odh-dashboard -n redhat-ods-applicationsThe expected outcome confirms that the Pods are running, similar to the following example.
NAME READY UP-TO-DATE AVAILABLE AGE odh-dashboard 2/2 2 2 41h -
Check the dashboard logs for any authorization errors by running the following command.
oc logs -l app=odh-dashboard -n redhat-ods-applicationsIn the expected output, the logs confirm the Dashboard is running and ready to serve requests.
-
Verify the user permissions by running the following command.
oc auth can-i get projects --as=your-usernameThe expected output confirms that the user has the required access.
-
If the user lacks the required permissions, grant cluster-level RBAC permissions by running the following command.
oc adm policy add-cluster-role-to-user view your-usernameThe expected output confirms that the view cluster role has been added to the user.
-
Verify that the
odh-dashboardHTTPRoute is properly configured with correct parent references (linking it to the Gateway) by running the following command.oc get httproute odh-dashboard -n redhat-ods-applications -o yamlThe expected output shows proper parent references to the Gateway.
-
Check whether the user belongs to the expected groups, which might have the roles required by the dashboard bound to them.
oc get user your-username -o yamlThe expected output confirms that the user is in the
odh-users group.
Backing up data
Backing up Open Data Hub involves various components, including the OpenShift Container Platform cluster and storage data.
Backing up storage data
It is a best practice to back up the data on your persistent volume claims (PVCs) regularly.
Backing up your data is particularly important before you delete a user and before you uninstall Open Data Hub, as all PVCs are deleted when Open Data Hub is uninstalled.
For more information about backing up PVCs for your cluster platform, see OADP Application backup and restore in the OpenShift Container Platform documentation.
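For example, to review which PVCs exist before you plan a backup, you can list them with the OpenShift CLI. The following commands are a minimal sketch; the <project-name> placeholder is an example:
# List all persistent volume claims in the cluster, with their bound volumes and sizes
oc get pvc --all-namespaces

# Or limit the list to a single data science project
oc get pvc -n <project-name>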
Backing up your cluster
If you plan to upgrade or uninstall Open Data Hub on your cluster, back up your cluster data so that you can restore it later if needed.
For more information, see Backup and restore in the OpenShift Container Platform documentation.
Managing observability
Open Data Hub provides centralized platform observability: an integrated, out-of-the-box solution for monitoring the health and performance of your Open Data Hub instance and user workloads.
This centralized solution includes a dedicated, pre-configured observability stack, featuring the OpenTelemetry Collector (OTC) for standardized data ingestion, Prometheus for metrics, and the Red Hat build of Tempo for distributed tracing. This architecture enables a common set of health metrics and alerts for Open Data Hub components and offers mechanisms to integrate with your existing external observability tools.
Enabling the observability stack
The observability stack collects and correlates metrics, traces, and alerts for Open Data Hub so that you can monitor, troubleshoot, and optimize Open Data Hub components. A cluster administrator must explicitly enable this capability in the DSCInitialization (DSCI) custom resource.
Once enabled, you can perform the following actions:
-
Accelerate troubleshooting by viewing metrics, traces, and alerts for Open Data Hub components in one place.
-
Maintain platform stability by monitoring health and resource usage and receiving alerts for critical issues.
-
Integrate with existing tools by exporting telemetry to third-party observability solutions through the Red Hat build of OpenTelemetry.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have installed Open Data Hub.
-
You have installed the following Operators, which provide the components of the observability stack:
-
Cluster Observability Operator: Deploys and manages Prometheus and Alertmanager for metrics and alerts.
-
Tempo Operator: Provides the Tempo backend for distributed tracing.
-
Red Hat build of OpenTelemetry: Deploys the OpenTelemetry Collector for collecting and exporting telemetry data.
-
-
Log in to the OpenShift Container Platform web console as a cluster administrator.
-
In the OpenShift Container Platform console, click Operators → Installed Operators.
-
Search for the Open Data Hub Operator, and then click the Operator name to open the Operator details page.
-
Click the DSCInitialization tab.
-
Click the default instance name (for example, default-dsci) to open the instance details page.
-
Click the YAML tab to show the instance specifications.
-
In the
spec.monitoringsection, set the value of themanagementStatefield toManaged, and configure metrics, alerting, and tracing settings as shown in the following example:Example monitoring configuration# ... spec: monitoring: managementState: Managed # Required: Enables and manages the observability stack namespace: opendatahub # Required: Namespace where monitoring components are deployed alerting: {} # Alertmanager configuration, uses default settings if empty metrics: # Prometheus configuration for metrics collection replicas: 1 # Optional: Number of Prometheus instances resources: # CPU and memory requests and limits for Prometheus pods cpulimit: 500m # Optional: Maximum CPU allocation in millicores cpurequest: 100m # Optional: Minimum CPU allocation in millicores memorylimit: 512Mi # Optional: Maximum memory allocation in mebibytes memoryrequest: 256Mi # Optional: Minimum memory allocation in mebibytes storage: # Storage configuration for metrics data size: 5Gi # Required: Storage size for Prometheus data retention: 90d # Required: Retention period for metrics data in days exporters: {} # External metrics exporters traces: # Tempo backend for distributed tracing sampleRatio: '0.1' # Optional: Portion of traces to sample, expressed as a decimal storage: # Storage configuration for trace data backend: pv # Required: Storage backend for Tempo traces (pv, s3, or gcs) retention: 2160h # Optional: Retention period for trace data in hours exporters: {} # External traces exporters # ... -
Click Save to apply your changes.
Verify that the observability stack components are running in the configured namespace:
-
In the OpenShift Container Platform web console, click Workloads → Pods.
-
From the project list, select opendatahub.
-
Confirm that there are running pods for your configuration. The following pods indicate that the observability stack is active:
alertmanager-data-science-monitoringstack-# 2/2 Running 0 1m data-science-collector-collector-# 1/1 Running 0 1m prometheus-data-science-monitoringstack-# 2/2 Running 0 1m tempo-data-science-tempomonolithic-# 1/1 Running 0 1m thanos-querier-data-science-thanos-querier-# 2/2 Running 0 1m
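As a CLI alternative to the console check above, you can list the same pods with the OpenShift CLI. The following command is a sketch that assumes the default opendatahub monitoring namespace:
# List the observability stack pods in the monitoring namespace
oc get pods -n opendatahub | grep -E 'alertmanager|collector|prometheus|tempo|thanos'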
Collecting metrics from user workloads
After a cluster administrator enables the observability stack in your cluster, metric collection becomes available but is not automatically active for all deployed workloads. The monitoring system relies on a specific label to identify which pods Prometheus should scrape for metrics.
To include a workload, such as a user-created workbench, training job, or inference service, in the centralized observability stack, add the label monitoring.opendatahub.io/scrape=true to the pod template in the workload’s deployment configuration.
This ensures that all pods created by the deployment include the label and are automatically scraped by Prometheus.
|
Note
|
Apply the monitoring.opendatahub.io/scrape: 'true' label to the pod template (spec.template.metadata.labels), not to individual pods, so that pods created by future rollouts keep the label. |
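If you prefer the CLI to the console steps that follow, you can add the label with a single patch. The following command is a minimal sketch; the Deployment name example-workbench and project example-project are placeholders:
# Add the scrape label to the pod template so that new pods inherit it
oc patch deployment example-workbench -n example-project --type=merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"monitoring.opendatahub.io/scrape":"true"}}}}}'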
-
A cluster administrator has enabled the observability stack as described in Enabling the observability stack.
-
You have Open Data Hub administrator privileges or you are the project owner.
-
You have deployed a workload that exposes a
/metricsendpoint, such as a workbench server or model service pod. -
You have access to the project where the workload is running.
-
Log in to the OpenShift Container Platform web console as a cluster administrator or project owner.
-
Click Workloads → Deployments.
-
In the Project list at the top of the page, select the project where your workload is deployed.
-
Identify the deployment that you want to collect metrics from and click its name.
-
On the Deployment details page, click the YAML tab.
-
In the YAML editor, add the required label under the
spec.template.metadata.labelssection, as shown in the following example:apiVersion: apps/v1 kind: Deployment metadata: name: <example_name> namespace: <example_namespace> spec: template: metadata: labels: monitoring.opendatahub.io/scrape: 'true' # ... -
Click Save.
OpenShift automatically rolls out a new ReplicaSet and pods with the updated label. When the new pods start, the observability stack begins scraping their metrics.
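To confirm from the command line that the rollout has completed and the new pods carry the label, you can run the following sketch with placeholder names:
# Wait for the new ReplicaSet to finish rolling out
oc rollout status deployment/<deployment-name> -n <project-name>

# Confirm that the new pods carry the scrape label
oc get pods -n <project-name> -l monitoring.opendatahub.io/scrape=true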
Verify that metrics are being collected by accessing the Prometheus instance deployed by Open Data Hub.
-
Access Prometheus by using a route:
-
In the OpenShift Container Platform web console, click Networking → Routes.
-
From the project list, select opendatahub.
-
Locate the route associated with the Prometheus service, such as
data-science-prometheus-route. -
Click the Location URL to open the Prometheus web console.
-
-
Alternatively, access Prometheus locally by using port forwarding:
-
List the Prometheus pods:
$ oc get pods -n opendatahub -l prometheus=data-science-monitoringstack -
Start port forwarding:
$ oc port-forward <prometheus-pod-name> 9090:9090 -n opendatahub
In a web browser, open the following URL:
http://localhost:9090
-
-
In the Prometheus web console, search for a metric exposed by your workload.
If the label is applied correctly and the workload exposes metrics, the metrics appear in the Prometheus instance deployed by Open Data Hub.
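As an optional check, you can query the Prometheus HTTP API directly through the port forward described above. The following command is a sketch that assumes a workload running in a hypothetical example-project namespace and that jq is installed:
# Query Prometheus for the scrape targets reported from the workload's namespace
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{namespace="example-project"}' | jq .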
Exporting metrics to external observability tools
You can export Open Data Hub operational metrics to an external observability platform, such as Grafana, Prometheus, or any OpenTelemetry-compatible backend. This allows you to visualize and monitor Open Data Hub metrics alongside data from other systems in your existing observability environment.
Metrics export is configured in the DSCInitialization (DSCI) custom resource by populating the .spec.monitoring.metrics.exporters field.
When you define one or more exporters in this field, the OpenTelemetry Collector (OTC) deployed by Open Data Hub automatically updates its configuration to include each exporter in its metrics pipeline. If this field is empty or undefined, metrics are collected only by the in-cluster Prometheus instance that is deployed with Open Data Hub.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
The observability stack is enabled as described in Enabling the observability stack.
-
The external observability platform can receive metrics through a supported export protocol.
-
You know the URL of your external metrics receiver endpoint.
-
Log in to the OpenShift Container Platform web console as a cluster administrator.
-
Click Operators → Installed Operators.
-
Select the Open Data Hub Operator from the list.
-
Click the DSCInitialization tab.
-
Click the default DSCI instance, for example, default-dsci, to open its details page.
-
Click the YAML tab.
-
In the
spec.monitoring.metricssection, add anexporterslist that defines the external receiver configuration, as shown in the following example:spec: monitoring: metrics: exporters: - name: <external_exporter_name> type: <type> endpoint: https://example-otlp-receiver.example.com:4317-
name: A unique, descriptive name for the exporter configuration. Do not use reserved names such as prometheus or otlp/tempo.
type: The protocol used for export, for example:-
otlp: For OpenTelemetry-compatible backends using gRPC or HTTP. -
prometheusremotewrite: For Prometheus-compatible systems that use the remote write protocol.
-
-
endpoint: The full URL of your external metrics receiver. For OTLP, endpoints typically use port 4317 (gRPC) or 4318 (HTTP). For Prometheus remote write, endpoints typically end with /api/v1/write. For example:
otlp:https://example-otlp-receiver.example.com:4317(gRPC) orhttps://example-otlp-receiver.example.com:4318(HTTP) -
prometheusremotewrite:https://example-prometheus-remote.example.com/api/v1/write
-
-
-
Click Save.
The OpenTelemetry Collector automatically reloads its configuration and begins forwarding metrics to the specified external endpoint.
-
Verify that the OpenTelemetry Collector pods restart and apply the new configuration:
$ oc get pods -n opendatahubThe
data-science-collector-collector-*pods should restart and display a Running status. -
In your external observability platform, verify that new metrics from Open Data Hub appear in the metrics list or dashboard.
|
Note
|
If you remove the exporters field or leave it empty, the OpenTelemetry Collector stops forwarding metrics to external endpoints, and metrics are collected only by the in-cluster Prometheus instance deployed with Open Data Hub. |
Viewing traces in external tracing platforms
When tracing is enabled in the DSCInitialization (DSCI) custom resource, Open Data Hub deploys the Red Hat build of Tempo as the tracing backend and the Red Hat build of OpenTelemetry Collector (OTC) to receive and route trace data.
To view and analyze traces outside of Open Data Hub, complete the following tasks:
-
Configure your instrumented applications to send traces to the OpenTelemetry Collector.
-
Connect your preferred visualization tool, such as Grafana or Jaeger, to the Tempo Query API.
-
A cluster administrator has enabled tracing as part of the observability stack in the DSCI configuration.
-
You have access to the monitoring namespace, for example
opendatahub. -
You have network access or cluster administrator privileges to create a route or port forward from the cluster.
-
Your application is instrumented with an OpenTelemetry SDK or library to generate and export trace data.
-
Find the OpenTelemetry Collector endpoint.
The OpenTelemetry Collector receives trace data from instrumented applications by using the OpenTelemetry Protocol (OTLP).
-
In the OpenShift Container Platform web console, navigate to Networking → Services.
-
In the Project list, select the monitoring namespace, for example,
opendatahub. -
Locate the Service named
data-science-collectoror a similar name associated with the OpenTelemetry Collector. -
Use the Service name or ClusterIP as the OTLP endpoint in your application configuration.
Your application must export traces to one of the following ports on the collector service:
-
gRPC:
4317 -
HTTP:
4318Example environment variable:
OTEL_EXPORTER_OTLP_ENDPOINT=http://data-science-collector.opendatahub.svc.cluster.local:4318NoteSee the Red Hat build of OpenTelemetry documentation for details about configuring application instrumentation.
-
-
-
Connect your visualization tool to the Tempo query service.
You can use a visualization tool, such as Grafana or Jaeger, to query and display traces from the Red Hat build of Tempo deployed by Open Data Hub.
-
In the OpenShift Container Platform web console, navigate to Networking → Services.
-
In the Project list, select the monitoring namespace, for example,
opendatahub. -
Locate the Service named
tempo-queryortempo-query-frontend. -
To make the service accessible to external tools, a cluster administrator must perform one of the following actions:
-
Create a route: Expose the Tempo Query service externally by creating an OpenShift route.
-
Use port forwarding: Temporarily forward a local port to the Tempo Query service by using the OpenShift CLI (
oc):$ oc port-forward svc/tempo-query-frontend 3200:3200 -n opendatahubAfter the port is forwarded, connect your visualization tool to the Tempo Query API endpoint, for example:
http://localhost:3200NoteSee the Tempo Operator documentation for details about connecting to Tempo.
-
-
-
Confirm that your instrumented application is generating and exporting trace data.
-
Verify that the OpenTelemetry Collector pod is running in the monitoring namespace:
$ oc get pods -n opendatahub | grep collectorThe
data-science-collector-collector-*pod should display a Running status. -
Access your visualization tool and confirm that new traces appear in the trace list or search view.
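If you used the port-forward option in the previous procedure, you can also query the Tempo search API directly to confirm that traces are being stored. The following command is a minimal sketch and assumes jq is installed:
# List recently stored traces through the forwarded Tempo query port
curl -s 'http://localhost:3200/api/search?limit=5' | jq .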
Accessing built-in alerts
The centralized observability stack deploys a Prometheus Alertmanager instance that provides a common set of built-in alerts for Open Data Hub components. These alerts monitor critical platform conditions, such as operator downtime, crashlooping pods, and unresponsive services.
By default, the Alertmanager is internal to the cluster and is not exposed through a route.
You can access the Alertmanager web interface locally by using the OpenShift CLI (oc).
-
You have Open Data Hub administrator privileges.
-
The observability stack is enabled as described in Enabling the observability stack.
-
You know the monitoring namespace, for example
opendatahub. -
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
In a terminal window, log in to the OpenShift CLI (
oc) as a cluster administrator:$ oc login https://api.198.51.100.10:6443 -
Verify that the Alertmanager pods are running in the monitoring namespace:
$ oc get pods -n opendatahub | grep alertmanagerExample output:
alertmanager-data-science-monitoringstack-0 2/2 Running 0 2h alertmanager-data-science-monitoringstack-1 2/2 Running 0 2h -
Confirm that a ClusterIP service exposes the Alertmanager web interface on port 9093:
$ oc get svc -n opendatahub | grep alertmanagerExample output:
data-science-monitoringstack-alertmanager ClusterIP 198.51.100.5 <none> 9093/TCP -
Start a local port forward to the Alertmanager service:
$ oc port-forward svc/data-science-monitoringstack-alertmanager 9093:9093 -n opendatahub -
In a web browser, open the following URL to access the Alertmanager web interface:
http://localhost:9093
-
Confirm that the Alertmanager web interface opens at
http://localhost:9093and displays active alerts for Open Data Hub components.
Viewing logs and audit records
As a cluster administrator, you can use the Open Data Hub Operator logger to monitor and troubleshoot issues. You can also use OpenShift Container Platform audit records to review a history of changes made to the Open Data Hub Operator configuration.
Configuring the Open Data Hub Operator logger
You can change the log level for Open Data Hub Operator components by setting the .spec.devFlags.logmode flag of the DSCInitialization (DSCI) custom resource at runtime. If you do not set a logmode value, the logger uses the INFO log level by default.
The log level that you set with .spec.devFlags.logmode applies to all components, not just those in a Managed state.
The following table shows the available log levels:
| Log level | Stacktrace level | Verbosity | Output | Timestamp type |
|---|---|---|---|---|
| devel or development | WARN | INFO | Console | Epoch timestamps |
| "" (no value set) | ERROR | INFO | JSON | Human-readable timestamps |
| prod or production | ERROR | INFO | JSON | Human-readable timestamps |
Logs that are set to devel or development generate in a plain text console format.
Logs that are set to prod, production, or which do not have a level set generate in a JSON format.
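To check which log level is currently configured before you change it, you can read the field from the default DSCI instance. The following command is a sketch that uses the dsci short name, which also appears later in this section; empty output means no value is set and the default production format is used:
# Show the current logmode value
oc get dsci default-dsci -o jsonpath='{.spec.devFlags.logmode}'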
-
You have administrator access to the
DSCInitializationresources in the OpenShift Container Platform cluster. -
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
Log in to the OpenShift Container Platform as a cluster administrator.
-
Click Operators → Installed Operators and then click the Open Data Hub Operator.
-
Click the DSC Initialization tab.
-
Click the default-dsci object.
-
Click the YAML tab.
-
In the
specsection, update the.spec.devFlags.logmodeflag with the log level that you want to set.apiVersion: dscinitialization.opendatahub.io/v1 kind: DSCInitialization metadata: name: default-dsci spec: devFlags: logmode: development -
Click Save.
You can also configure the log level from the OpenShift CLI (oc) by using the following command with the logmode value set to the log level that you want.
oc patch dsci default-dsci -p '{"spec":{"devFlags":{"logmode":"development"}}}' --type=merge
-
If you set the component log level to
develordevelopment, logs generate more frequently and include logs atWARNlevel and above. -
If you set the component log level to
prodorproduction, or do not set a log level, logs generate less frequently and include logs atERRORlevel or above.
Viewing the Open Data Hub Operator logs
-
Log in to the OpenShift CLI (
oc). -
Run the following command to stream logs from all Operator pods:
for pod in $(oc get pods -l name=opendatahub-operator -n openshift-operators -o name); do oc logs -f "$pod" -n openshift-operators & doneThe Operator pod logs open in your terminal.
TipPress Ctrl+Cto stop viewing. To fully stop all log streams, runkill $(jobs -p).
Viewing audit records
Cluster administrators can use OpenShift Container Platform auditing to see changes made to the Open Data Hub Operator configuration by reviewing modifications to the DataScienceCluster (DSC) and DSCInitialization (DSCI) custom resources. Audit logging is enabled by default in standard OpenShift Container Platform cluster configurations. For more information, see Viewing audit logs in the OpenShift Container Platform documentation.
The following example shows how to use the OpenShift Container Platform audit logs to see the history of changes made (by users) to the DSC and DSCI custom resources.
-
You have cluster administrator privileges for your OpenShift Container Platform cluster.
-
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:-
Installing the OpenShift CLI for OpenShift Container Platform
-
Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
-
-
In a terminal window, if you are not already logged in to your OpenShift Container Platform cluster as a cluster administrator, log in to the OpenShift Container Platform CLI as shown in the following example:
$ oc login <openshift_cluster_url> -u <admin_username> -p <password> -
To access the full content of the changed custom resources, set the OpenShift Container Platform audit log policy to
WriteRequestBodiesor a more comprehensive profile. For more information, see Configuring the audit log policy. -
Fetch the audit log files that are available for the relevant control plane nodes. For example:
oc adm node-logs --role=master --path=kube-apiserver/ \ | awk '{ print $1 }' | sort -u \ | while read node ; do oc adm node-logs $node --path=kube-apiserver/audit.log < /dev/null done \ | grep opendatahub > /tmp/kube-apiserver-audit-opendatahub.log -
Search the files for the DSC and DSCI custom resources. For example:
jq 'select((.objectRef.apiGroup == "dscinitialization.opendatahub.io" or .objectRef.apiGroup == "datasciencecluster.opendatahub.io") and .user.username != "system:serviceaccount:openshift-operators:redhat-ods-operator-controller-manager" and .verb != "get" and .verb != "watch" and .verb != "list")' < /tmp/kube-apiserver-audit-opendatahub.log
-
The commands return relevant log entries.
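To summarize the matching audit entries, you can run a second jq filter over the same file that prints one line per change. The following command is a minimal sketch that uses standard Kubernetes audit event fields; add the service-account exclusion from the previous step if you only want user-initiated changes:
# Print one line per change: timestamp, user, verb, and the affected resource
jq -r 'select((.objectRef.apiGroup == "dscinitialization.opendatahub.io" or .objectRef.apiGroup == "datasciencecluster.opendatahub.io") and .verb != "get" and .verb != "watch" and .verb != "list") | "\(.requestReceivedTimestamp) \(.user.username) \(.verb) \(.objectRef.resource)/\(.objectRef.name)"' < /tmp/kube-apiserver-audit-opendatahub.log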