Monitor CI pod events without CloudWatch

If your Kubernetes cluster doesn't use a monitoring product such as CloudWatch, you can deploy a DaemonSet that reads CI pod event logs and uploads them to a storage bucket. This solution helps with:

  • Root cause analysis of resource issues such as OOM (Out of Memory) kills
  • Understanding why CI pods are evicted
  • Debugging with persisted logs for older pipeline executions

Prerequisites

Before implementing this monitoring solution, ensure you have:

  • A storage bucket configured (GCP Cloud Storage, AWS S3, or Azure Blob Storage)
  • Kubernetes service account with appropriate read permissions for pods and events
  • Write access to your storage bucket from the Kubernetes cluster
  • kubectl access to deploy resources to your cluster
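
To confirm these prerequisites before deploying, you can run a couple of quick checks. This is a minimal sketch: it uses gsutil for the bucket check (GCP), so substitute the equivalent aws or az command for other providers, and replace your-bucket-name with your actual bucket:

# Confirm kubectl can reach the cluster
kubectl cluster-info

# Confirm you can write to the bucket (GCP example; reads the test payload from stdin)
echo "test" | gsutil cp - gs://your-bucket-name/connectivity-test.txt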

Implementation

The monitoring solution consists of three Kubernetes resources that work together to capture and persist CI pod events:

1. ServiceAccount with RBAC permissions

Create a ServiceAccount that can read pods and events across all namespaces:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-pod-monitor-sa
  namespace: harness-delegate-ng
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ci-pod-monitor-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ci-pod-monitor-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ci-pod-monitor-role
subjects:
  - kind: ServiceAccount
    name: ci-pod-monitor-sa
    namespace: harness-delegate-ng
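
After applying these resources, you can confirm the ServiceAccount has the intended read access with kubectl's built-in impersonation check; both commands should print "yes":

kubectl auth can-i list pods --all-namespaces \
  --as=system:serviceaccount:harness-delegate-ng:ci-pod-monitor-sa
kubectl auth can-i watch events --all-namespaces \
  --as=system:serviceaccount:harness-delegate-ng:ci-pod-monitor-sa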

2. ConfigMap for cloud provider settings

Configure your cloud storage provider and bucket details:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cloud-config
  namespace: harness-delegate-ng
data:
  provider: "gcp" # Options: gcp, aws, azure
  bucket: "your-bucket-name" # Replace with your actual bucket name
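
If you prefer not to edit YAML, the same ConfigMap can be created imperatively. A sketch, assuming an AWS bucket named my-ci-logs (a hypothetical name):

kubectl create configmap cloud-config \
  --namespace harness-delegate-ng \
  --from-literal=provider=aws \
  --from-literal=bucket=my-ci-logs \
  --dry-run=client -o yaml | kubectl apply -f -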

3. ConfigMap with monitoring script

The monitoring script runs four parallel functions to capture comprehensive CI pod information:

apiVersion: v1
kind: ConfigMap
metadata:
  name: build-logger-script
  namespace: harness-delegate-ng
data:
  build-logger.sh: |
    #!/bin/bash

    FILTER="harnessci" # All Harness CI pods start with "harnessci"
    OUTPUT_DIR="/logs"
    TOUCHED="/tmp/logged_pods.txt"
    CLOUD_SYNC_INTERVAL=60

    mkdir -p "$OUTPUT_DIR"
    touch "$TOUCHED"

    # Function 1: Upload logs to cloud bucket every CLOUD_SYNC_INTERVAL seconds
    sync_logs() {
      while true; do
        echo "[SYNC] Uploading logs to cloud storage..."
        if [ "$CLOUD_PROVIDER" = "gcp" ]; then
          gsutil -m cp "$OUTPUT_DIR"/* "gs://$CLOUD_BUCKET/"
        elif [ "$CLOUD_PROVIDER" = "aws" ]; then
          aws s3 sync "$OUTPUT_DIR" "s3://$CLOUD_BUCKET/"
        elif [ "$CLOUD_PROVIDER" = "azure" ]; then
          az storage blob upload-batch --source "$OUTPUT_DIR" --destination "$CLOUD_BUCKET"
        fi
        sleep "$CLOUD_SYNC_INTERVAL"
      done
    }

    # Function 2: Stream logs from pods matching FILTER
    stream_pod_logs() {
      while true; do
        kubectl get pods --all-namespaces -o json |
          jq -r --arg filter "$FILTER" '.items[] | select(.metadata.name | startswith($filter)) | [.metadata.namespace, .metadata.name] | @tsv' |
          while IFS=$'\t' read -r namespace pod; do
            # Skip pods already being streamed (-F -x: exact whole-line match)
            if grep -qFx "$namespace/$pod" "$TOUCHED"; then
              continue
            fi
            echo "[$(date)] Starting log stream for $namespace/$pod"
            echo "$namespace/$pod" >> "$TOUCHED"
            kubectl logs -n "$namespace" "$pod" -f > "$OUTPUT_DIR/${namespace}_${pod}.log" 2>&1 &
          done
        sleep 5
      done
    }

    # Function 3: Collect cluster-wide Kubernetes events every 30 seconds
    collect_events() {
      while true; do
        timestamp=$(date +%Y%m%d-%H%M%S)
        kubectl get events --all-namespaces > "$OUTPUT_DIR/events_$timestamp.txt" 2>&1
        sleep 30
      done
    }

    # Function 4: Capture kubectl describe pod output for matching pods
    log_pod_description() {
      while true; do
        kubectl get pods --all-namespaces -o json |
          jq -r --arg filter "$FILTER" '.items[] | select(.metadata.name | startswith($filter)) | [.metadata.namespace, .metadata.name] | @tsv' |
          while IFS=$'\t' read -r namespace pod; do
            desc_file="$OUTPUT_DIR/${namespace}_${pod}_describe.txt"
            if [ ! -f "$desc_file" ]; then
              echo "[$(date)] Capturing pod description for $namespace/$pod"
              kubectl describe pod -n "$namespace" "$pod" > "$desc_file" 2>&1
            fi
          done
        sleep 60
      done
    }

    # Run the first three collectors in the background; stream_pod_logs keeps the container alive
    sync_logs &
    collect_events &
    log_pod_description &
    stream_pod_logs
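
To preview which pods the script will pick up before deploying it, you can run its pod-selection pipeline manually from any machine with kubectl and jq. A sketch, assuming the default "harnessci" prefix:

kubectl get pods --all-namespaces -o json |
  jq -r --arg filter "harnessci" '.items[] | select(.metadata.name | startswith($filter)) | [.metadata.namespace, .metadata.name] | @tsv'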

4. DaemonSet deployment

Deploy the DaemonSet to run the monitoring script on every node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ci-pod-monitor
  namespace: harness-delegate-ng
spec:
  selector:
    matchLabels:
      app: ci-pod-monitor
  template:
    metadata:
      labels:
        app: ci-pod-monitor
    spec:
      serviceAccountName: ci-pod-monitor-sa
      containers:
        - name: logger
          image: google/cloud-sdk:latest
          command: ["/bin/bash", "-c"]
          args:
            - |
              apt-get update && \
              DEBIAN_FRONTEND=noninteractive apt-get install -y jq && \
              /script/build-logger.sh
          env:
            - name: CLOUD_PROVIDER
              valueFrom:
                configMapKeyRef:
                  name: cloud-config
                  key: provider
            - name: CLOUD_BUCKET
              valueFrom:
                configMapKeyRef:
                  name: cloud-config
                  key: bucket
          volumeMounts:
            - name: script
              mountPath: /script
            - name: log-output
              mountPath: /logs
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "200m"
      volumes:
        - name: script
          configMap:
            name: build-logger-script
            defaultMode: 0755
        - name: log-output
          emptyDir: {}

Deployment steps

  1. Save all the YAML configurations above into a single file named ci-monitoring.yaml

  2. Update the cloud provider configuration in the ConfigMap:

    • Set provider to your cloud provider (gcp, aws, or azure)
    • Set bucket to your actual bucket name
  3. Apply the configuration to your cluster:

kubectl apply -f ci-monitoring.yaml

  4. Verify the DaemonSet is running:

kubectl get daemonset -n harness-delegate-ng ci-pod-monitor
kubectl get pods -n harness-delegate-ng -l app=ci-pod-monitor

What gets captured

The monitoring solution captures four types of data:

  1. Pod logs: Continuous streaming of logs from all CI pods (pods starting with "harnessci")
  2. Kubernetes events: Cluster-wide events captured every 30 seconds, including:
    • Pod scheduling events
    • Container state changes
    • Resource allocation issues
    • Pod evictions
  3. Pod descriptions: Complete kubectl describe output for each CI pod, including:
    • Resource requests and limits
    • Node allocation
    • Container statuses
    • Recent events
  4. Timestamps: All captured data includes timestamps for correlation
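
To browse what has been uploaded so far, list the bucket contents. A GCP example; use the equivalent aws s3 ls or az storage blob list for other providers, and replace your-bucket-name with your actual bucket:

gsutil ls gs://your-bucket-name/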

Debugging common issues

Identifying OOM kills

Look for these patterns in the pod description files:

Last State: Terminated
Reason: OOMKilled
Exit Code: 137

In the events files, search for:

Killing container with id docker://... : Need to kill Pod
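
If you have copied the bucket contents to a local directory, a grep across the pod description files surfaces OOM-killed pods quickly. A sketch, assuming the files were downloaded to ./ci-logs (a hypothetical path):

# List describe files that record an OOM kill
grep -l "OOMKilled" ./ci-logs/*_describe.txt

# Show the termination details with surrounding context
grep -B 1 -A 2 "OOMKilled" ./ci-logs/*_describe.txt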

Understanding pod evictions

Check the events files for eviction reasons:

Evicted: The node was low on resource: memory
Evicted: The node was low on resource: ephemeral-storage

Resource pressure indicators

In pod descriptions, look for:

Status: Failed
Reason: Evicted
Message: The node had condition: [DiskPressure, MemoryPressure]
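
Similarly, evictions and node pressure conditions can be pulled out of the event snapshots and pod descriptions in one pass, again assuming the hypothetical ./ci-logs download directory:

# Eviction events with their reasons
grep -h "Evicted" ./ci-logs/events_*.txt

# Pods that failed due to node pressure conditions
grep -l "DiskPressure\|MemoryPressure" ./ci-logs/*_describe.txt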

Customization options

Adjust collection intervals

Modify these variables in the script ConfigMap:

  • CLOUD_SYNC_INTERVAL: How often to upload logs to cloud storage (default: 60 seconds)
  • Event collection interval: Change sleep 30 in collect_events() function
  • Pod description interval: Change sleep 60 in log_pod_description() function

Filter different pod patterns

Change the FILTER variable to monitor different pod prefixes:

FILTER="your-pod-prefix"  # Monitor pods starting with "your-pod-prefix"

Configure AWS S3 or Azure Blob Storage access

For AWS S3, ensure the nodes have appropriate IAM roles or update the script to include AWS credentials.
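
If your nodes don't have IAM roles, one option is to store credentials in a Secret and inject them into the DaemonSet as environment variables. A sketch, assuming a Secret named aws-credentials (a hypothetical name) whose keys match the standard AWS CLI variables:

# Create the Secret (replace the placeholder values)
kubectl create secret generic aws-credentials \
  --namespace harness-delegate-ng \
  --from-literal=AWS_ACCESS_KEY_ID=<your-access-key> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<your-secret-key>

# Expose every key in the Secret as an environment variable on the DaemonSet
kubectl set env daemonset/ci-pod-monitor \
  --namespace harness-delegate-ng \
  --from=secret/aws-credentials

Note that the google/cloud-sdk image used by the DaemonSet ships gsutil but not the aws CLI, so you would also need to install awscli in the container's startup args.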

For Azure Blob Storage, update the sync command in the script:

elif [ "$CLOUD_PROVIDER" = "azure" ]; then
  az storage blob upload-batch --source "$OUTPUT_DIR" --destination "$CLOUD_BUCKET" --account-name "$AZURE_STORAGE_ACCOUNT"
fi
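
The $AZURE_STORAGE_ACCOUNT variable referenced here is not set by the DaemonSet above; one way to supply it is:

kubectl set env daemonset/ci-pod-monitor \
  --namespace harness-delegate-ng \
  AZURE_STORAGE_ACCOUNT=<your-storage-account>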

Monitoring and maintenance

Check DaemonSet health

kubectl get daemonset -n harness-delegate-ng ci-pod-monitor
kubectl logs -n harness-delegate-ng -l app=ci-pod-monitor --tail=50

View collected data

Check your storage bucket for:

  • Log files: namespace_podname.log
  • Event snapshots: events_YYYYMMDD-HHMMSS.txt
  • Pod descriptions: namespace_podname_describe.txt

Clean up old data

Implement a lifecycle policy on your storage bucket to automatically delete old logs after a retention period (e.g., 30 days).
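
For example, on GCP Cloud Storage a 30-day deletion rule can be applied with gsutil; AWS S3 and Azure Blob Storage offer equivalent lifecycle management. A sketch, assuming a policy file named lifecycle.json (a hypothetical name):

cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
EOF

gsutil lifecycle set lifecycle.json gs://your-bucket-name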

Troubleshooting the monitoring solution

If the DaemonSet pods are not running:

  1. Check pod status:

kubectl describe pods -n harness-delegate-ng -l app=ci-pod-monitor

  2. Verify ServiceAccount permissions:

kubectl auth can-i get pods --all-namespaces --as=system:serviceaccount:harness-delegate-ng:ci-pod-monitor-sa

  3. Check that the ConfigMap is properly mounted:

kubectl exec -n harness-delegate-ng -it <pod-name> -- ls /script/

  4. Review container logs for errors:

kubectl logs -n harness-delegate-ng <pod-name>
