Hulleman.io

Day 3 - Troubleshooting Lab

Fix the following items in existing namespaces:

Broken NetworkPolicy

In the namespace connection-<teamname> an existing app is running. The app can’t reach the database. Please fix the network policy so that the app can access the database.

Troubleshooting

oc get pods

NAME                          READY   STATUS    RESTARTS   AGE
postgres-75bb9f47fb-dtzkc     1/1     Running   0          46m
simple-app-685b4d4d56-mfzdg   1/1     Running   0          46m

oc logs simple-app-685b4d4d56-mfzdg

[2026-02-12 07:50:31 +0000] [1] [INFO] Starting gunicorn 25.0.1
[2026-02-12 07:50:31 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2026-02-12 07:50:31 +0000] [1] [INFO] Using worker: sync
[2026-02-12 07:50:31 +0000] [2] [INFO] Booting worker with pid: 2
2026-02-12 08:27:49,604 [INFO] Attempting database connection: host=postgres db=appdb user=appuser
2026-02-12 08:27:54,619 [ERROR] Database connection failed: connection to server at "postgres" (172.231.95.105), port 5432 failed: timeout expired

oc get networkpolicies

NAME                POD-SELECTOR     AGE
allow-app-ingress   app=simple-app   21m
deny-all            <none>           21m
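The deny-all policy explains the timeout: an empty pod selector applies it to every pod in the namespace, and with no ingress rules listed, all inbound traffic is dropped. A typical manifest looks like this (a sketch; the one in your namespace may differ slightly):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}   # empty selector: the policy applies to every pod in the namespace
  policyTypes:
  - Ingress         # no ingress rules are listed, so all ingress is denied
```

Once a pod is selected by any policy, only explicitly allowed traffic gets through, which is why allow-app-ingress must be correct.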

oc get networkpolicy allow-app-ingress -o yaml

spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-ingress
    ports:
    - port: 5000 # wrong: the database listens on 5432
      protocol: TCP
  podSelector:
    matchLabels:
      app: simple-app # wrong: the policy must select the database pods, so this should be app: postgres
  policyTypes:
  - Ingress

Fixing the NetworkPolicy

oc edit networkpolicy allow-app-ingress

spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-ingress
    ports:
    - port: 5432
      protocol: TCP
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
  - Ingress
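To verify the fix, test TCP connectivity from the app pod. The /dev/tcp redirection requires bash inside the image, which is an assumption here:

```shell
# Try to open a TCP connection from the app pod to the database service;
# exit code 0 means port 5432 on "postgres" is reachable.
oc exec deploy/simple-app -- timeout 5 bash -c 'echo > /dev/tcp/postgres/5432' \
  && echo "postgres:5432 reachable" \
  || echo "postgres:5432 still blocked"
```

If the connection still times out, check whether the policy's from clause actually matches the app's pods: a namespaceSelector for openshift-ingress alone does not admit traffic from pods inside the application namespace.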

Failing Route

In the namespace secure-<teamname> the application is not served correctly. Please fix the route so that the application can be reached.

Troubleshooting

oc get pods

NAME                          READY   STATUS    RESTARTS   AGE
nginx-dual-67b845f895-qjmbp   1/1     Running   0          84m

oc get routes

NAME         HOST/PORT                                                                PATH   SERVICES     PORT   TERMINATION     WILDCARD
nginx-dual   nginx-dual-secure-wessel.apps.cluster-tqq9g.dynamic.redhatworkshops.io          nginx-dual   443    edge/Redirect   None

oc get route nginx-dual -o yaml

spec:
  host: nginx-dual-secure-wessel.apps.cluster-tqq9g.dynamic.redhatworkshops.io
  tls:
    insecureEdgeTerminationPolicy: Redirect
    termination: edge # This should be reencrypt to send the traffic from the router to the pod over TLS
  to:
    kind: Service
    name: nginx-dual
    weight: 100
  wildcardPolicy: None
status:
  ingress:
  - conditions:
    - lastTransitionTime: "2026-02-12T07:50:06Z"
      status: "True"
      type: Admitted
    host: nginx-dual-secure-wessel.apps.cluster-tqq9g.dynamic.redhatworkshops.io
    routerCanonicalHostname: router-default.apps.cluster-tqq9g.dynamic.redhatworkshops.io
    routerName: default
    wildcardPolicy: None

Fixing the Route

oc edit route nginx-dual

spec:
  host: nginx-dual-secure-wessel.apps.cluster-tqq9g.dynamic.redhatworkshops.io
  tls:
    insecureEdgeTerminationPolicy: Redirect
    termination: reencrypt 
  to:
    kind: Service
    name: nginx-dual
    weight: 100
  wildcardPolicy: None
status:
  ingress:
  - conditions:
    - lastTransitionTime: "2026-02-12T07:50:06Z"
      status: "True"
      type: Admitted
    host: nginx-dual-secure-wessel.apps.cluster-tqq9g.dynamic.redhatworkshops.io
    routerCanonicalHostname: router-default.apps.cluster-tqq9g.dynamic.redhatworkshops.io
    routerName: default
    wildcardPolicy: None
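One caveat with reencrypt: the router opens a new TLS connection to the pod and must trust the certificate the pod presents. If the service uses an OpenShift service serving certificate, the router trusts it automatically; for any other certificate the route needs a destinationCACertificate. A sketch (the PEM block is a placeholder):

```yaml
spec:
  tls:
    termination: reencrypt
    insecureEdgeTerminationPolicy: Redirect
    # Only required when the pod's certificate is NOT signed by the
    # cluster's service CA (e.g. a custom or self-signed certificate):
    destinationCACertificate: |
      -----BEGIN CERTIFICATE-----
      ...CA certificate that signed the pod's serving certificate...
      -----END CERTIFICATE-----
```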

Pod restarting

In the namespace restart-<teamname> the application works but restarts unexpectedly. Please fix the deployment and make sure the pod runs without problems.

oc get pods

NAME                     READY   STATUS             RESTARTS         AGE
nginx-585cc48c5b-bp6zj   0/1     CrashLoopBackOff   39 (4m57s ago)   130m

oc logs nginx-585cc48c5b-bp6zj

10.232.0.2 - - [12/Feb/2026:10:01:01 +0000] "GET /broken HTTP/1.1" 404 153 "-" "kube-probe/1.33" "-"
2026/02/12 10:01:01 [error] 24#24: *2 open() "/usr/share/nginx/html/broken" failed (2: No such file or directory), client: 10.232.0.2, server: localhost, request: "GET /broken HTTP/1.1", host: "10.232.0.190:8080"
2026/02/12 10:01:11 [error] 26#26: *3 open() "/usr/share/nginx/html/broken" failed (2: No such file or directory), client: 10.232.0.2, server: localhost, request: "GET /broken HTTP/1.1", host: "10.232.0.190:8080"
10.232.0.2 - - [12/Feb/2026:10:01:11 +0000] "GET /broken HTTP/1.1" 404 153 "-" "kube-probe/1.33" "-"

oc describe pod nginx-585cc48c5b-bp6zj

Events:
  Type     Reason          Age                     From               Message
  ----     ------          ----                    ----               -------
  Normal   Scheduled       132m                    default-scheduler  Successfully assigned restart-wessel/nginx-585cc48c5b-bp6zj to control-plane-cluster-tqq9g-1
  Normal   AddedInterface  132m                    multus             Add eth0 [10.232.0.190/23] from ovn-kubernetes
  Normal   Started         126m (x6 over 131m)     kubelet            Started container nginx
  Normal   Created         71m (x21 over 131m)     kubelet            Created container: nginx
  Normal   Killing         6m54s (x40 over 131m)   kubelet            Container nginx failed liveness probe, will be restarted
  Warning  BackOff         6m28s (x419 over 126m)  kubelet            Back-off restarting failed container nginx in pod nginx-585cc48c5b-bp6zj_restart-wessel(f703044e-083b-4932-a8cd-68dba2159133)
  Normal   Pulled          100s (x41 over 132m)    kubelet            Container image "nginxinc/nginx-unprivileged:stable" already present on machine
  Warning  Unhealthy       74s (x121 over 131m)    kubelet            Liveness probe failed: HTTP probe failed with statuscode: 404

Fixing the Deployment

oc edit deployment nginx

    spec:
      containers:
      - image: nginxinc/nginx-unprivileged:stable
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: / # was /broken, which returns 404; / is served, so the probe now succeeds
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: nginx
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          limits:
            cpu: 10m
            memory: 32Mi
          requests:
            cpu: 10m
            memory: 32Mi
        terminationMessagePath: /dev/termination-log
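After correcting the probe path, the rollout and the probe endpoint can be checked directly. These commands run against the live cluster; the assumption that curl is available in the nginx image may not hold for every tag:

```shell
# Wait for the fixed pods to become ready.
oc rollout status deployment/nginx

# Confirm the probe endpoint now answers 200 from inside the container.
oc exec deploy/nginx -- curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/
```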

Deployment missing replicas

The application team can’t figure out why their pods in the namespace busy-<teamname> are not running. Can you help them without changing anything about the deployment itself?

oc get deploy

NAME   READY   UP-TO-DATE   AVAILABLE   AGE
app    5/8     5            5           159m

oc get pods

NAME                   READY   STATUS    RESTARTS      AGE
app-5b469c454d-9g8j5   1/1     Running   2 (38m ago)   159m
app-5b469c454d-9q895   1/1     Running   2 (38m ago)   159m
app-5b469c454d-mb4mk   1/1     Running   2 (38m ago)   159m
app-5b469c454d-txnrk   1/1     Running   2 (38m ago)   159m
app-5b469c454d-z65wk   1/1     Running   2 (38m ago)   159m

oc get deploy app -o yaml

status:
  availableReplicas: 5
  conditions:
  - lastTransitionTime: "2026-02-12T07:49:59Z"
    lastUpdateTime: "2026-02-12T07:49:59Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2026-02-12T07:50:00Z"
    lastUpdateTime: "2026-02-12T07:50:00Z"
    message: 'pods "app-5b469c454d-b8j68" is forbidden: exceeded quota: tiny-exact-quota, 
      requested: limits.cpu=10m,limits.memory=10Mi,requests.cpu=10m,requests.memory=10Mi,
      used: limits.cpu=50m,limits.memory=50Mi,requests.cpu=50m,requests.memory=50Mi,
      limited: limits.cpu=50m,limits.memory=50Mi,requests.cpu=50m,requests.memory=50Mi'
    reason: FailedCreate # the namespace quota is exceeded
    status: "True"
    type: ReplicaFailure
  - lastTransitionTime: "2026-02-12T10:01:26Z"
    lastUpdateTime: "2026-02-12T10:01:26Z"
    message: ReplicaSet "app-5b469c454d" has timed out progressing. #extending the replicaset timeout will not create new pods.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 5
  replicas: 5
  unavailableReplicas: 3
  updatedReplicas: 5

Fixing the resource quota

oc get resourcequota

NAME               REQUEST                                                         LIMIT                                           AGE
tiny-exact-quota   pods: 8/10, requests.cpu: 80m/50m, requests.memory: 80Mi/50Mi   limits.cpu: 80m/50m, limits.memory: 80Mi/50Mi   3h2m # the deployment needs 80m but the quota hard limit is only 50m

oc edit resourcequota tiny-exact-quota

spec:
  hard:
    limits.cpu: 100m # raise the hard limits above the 80m the deployment needs so all 8 replicas fit
    limits.memory: 100Mi
    pods: "10"
    requests.cpu: 100m
    requests.memory: 100Mi
status:
  hard:
    limits.cpu: 100m
    limits.memory: 100Mi
    pods: "10"
    requests.cpu: 100m
    requests.memory: 100Mi
  used:
    limits.cpu: 50m
    limits.memory: 50Mi
    pods: "5"
    requests.cpu: 50m
    requests.memory: 50Mi

oc scale deployment app --replicas=8 # ensure the deployment is set to 8 replicas so the controller creates the missing pods

oc get pods

NAME                   READY   STATUS    RESTARTS        AGE
app-5b469c454d-6lgmh   1/1     Running   0               28s
app-5b469c454d-hwrml   1/1     Running   0               22m
app-5b469c454d-k2xd6   1/1     Running   0               22m
app-5b469c454d-s66lc   1/1     Running   0               28s
app-5b469c454d-txnrk   1/1     Running   3 (4m34s ago)   3h5m
app-5b469c454d-wqbtb   1/1     Running   0               22m
app-5b469c454d-xvpf2   1/1     Running   0               28s
app-5b469c454d-z65wk   1/1     Running   3 (4m41s ago)   3h5m

SCC validation

In the namespace deploy-<teamname> a deployment created a replicaset, but there are no pods. Without changing the manifest, get the pod running. Don’t grant more access than necessary.

oc get deploy

NAME            READY   UP-TO-DATE   AVAILABLE   AGE
root-required   0/1     0            0           6h

oc get replicaset

NAME                      DESIRED   CURRENT   READY   AGE
root-required-bd8b44979   1         0         0       7m30s

oc get replicaset root-required-bd8b44979 -o yaml

    spec:
      containers:
      - command:
        - sh
        - -c
        - id && sleep infinity
        image: docker.io/library/busybox:latest
        imagePullPolicy: Always
        name: root-required
        resources:
          limits:
            cpu: 20m
            memory: 20Mi
          requests:
            cpu: 20m
            memory: 20Mi
        securityContext:
          allowPrivilegeEscalation: false
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: root-sa # the service account in use must be allowed to run as root (via an SCC)
      serviceAccountName: root-sa
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastTransitionTime: "2026-02-12T13:44:25Z" # the message below shows a UID range mismatch; find an SCC that allows UID 0
    message: 'pods "root-required-bd8b44979-" is forbidden: unable to validate against
      any security context constraint: [provider "anyuid": Forbidden: not usable by
      user or serviceaccount, provider restricted-v2: .containers[0].runAsUser: Invalid
      value: 0: must be in the ranges: [1000940000, 1000949999], provider "restricted-v3":
      Forbidden: not usable by user or serviceaccount, provider "restricted": Forbidden:
      not usable by user or serviceaccount, provider "nested-container": Forbidden:
      not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not
      usable by user or serviceaccount, provider "nonroot": Forbidden: not usable
      by user or serviceaccount, provider "noobaa-core": Forbidden: not usable by
      user or serviceaccount, provider "noobaa-endpoint": Forbidden: not usable by
      user or serviceaccount, provider "noobaa": Forbidden: not usable by user or
      serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or
      serviceaccount, provider "hostmount-anyuid-v2": Forbidden: not usable by user
      or serviceaccount, provider "machine-api-termination-handler": Forbidden: not
      usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not
      usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable
      by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user
      or serviceaccount, provider "insights-runtime-extractor-scc": Forbidden: not
      usable by user or serviceaccount, provider "rook-ceph": Forbidden: not usable
      by user or serviceaccount, provider "rook-ceph-csi": Forbidden: not usable by
      user or serviceaccount, provider "node-exporter": Forbidden: not usable by user
      or serviceaccount, provider "ceph-csi-op-scc": Forbidden: not usable by user
      or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 1
  replicas: 0

oc get scc

NAME                              PRIV    CAPS                   SELINUX     RUNASUSER          FSGROUP     SUPGROUP    PRIORITY     READONLYROOTFS   VOLUMES
anyuid                            false   <no value>             MustRunAs   RunAsAny           RunAsAny    RunAsAny    10           false            ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]
restricted-v2                     false   ["NET_BIND_SERVICE"]   MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <no value>   false            ["configMap","csi","downwardAPI","emptyDir","ephemeral","persistentVolumeClaim","projected","secret"]

The output above is limited to the relevant SCCs; anyuid is the closest match to restricted-v2 that still allows UID 0, since its RUNASUSER strategy is set to RunAsAny.

Fixing the SCC

oc adm policy add-scc-to-user anyuid -z root-sa
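The controller retries pod creation on its own, so the pod should appear shortly after the grant. Which SCC admitted it is recorded in a pod annotation; a quick check (cluster commands, pod names will differ):

```shell
# Nudge the controller instead of waiting for the retry backoff (optional).
oc rollout restart deployment/root-required

# The admitting SCC is recorded as an annotation on the pod.
oc get pods
oc get pods -o yaml | grep 'openshift.io/scc'   # expect: openshift.io/scc: anyuid
```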

Find the pod using root

In the namespace forbidden-<teamname> one of the pods is running with root access. Remove this deployment and prevent other pods from running with root in the future.

Certificate Trust Failure

The OpenShift route to the service in the namespace mtls-<teamname> does not serve the application. Fix the issue using annotations only.

A lot has gone wrong

In the namespace statuspage-<teamname> there are several issues. Fix them without changing any RBAC or SCCs!

Use the GUI (web console) for the NetworkPolicies.