openshift/configuration-anomaly-detection

Configuration Anomaly Detection

About

Configuration Anomaly Detection (CAD) is responsible for reducing manual SRE effort by pre-investigating alerts, detecting cluster anomalies and sending relevant communications to the cluster owner.

Overview

CAD consists of:

  • a Tekton deployment including a custom Tekton interceptor
  • the cadctl command line tool, which implements alert remediations and pre-investigations

Workflow

  1. PagerDuty webhooks trigger Configuration-Anomaly-Detection when a PagerDuty incident is created.
  2. The webhook routes to a Tekton EventListener.
  3. Received webhooks are filtered by a Tekton Interceptor, which validates the webhook against the X-PagerDuty-Signature header and uses the payload to determine whether cadctl implements a handler function for the alert. If no handler is implemented, the alert is forwarded directly to a human SRE.
  4. If cadctl implements a handler for the received payload/alert, a Tekton PipelineRun is started.
  5. The pipeline runs cadctl, which determines the handler function itself based on the payload.

Figure: CAD Overview
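
The signature validation in step 3 can be sketched as follows. This is a minimal illustration, not CAD's interceptor code: it assumes the X-PagerDuty-Signature header carries one or more comma-separated entries of the form v1=<hex digest>, where the digest is the HMAC-SHA256 of the raw request body. The names validSignature and sign, and the secret and payload values, are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// validSignature reports whether any "v1=<hex>" entry in the header value
// matches the HMAC-SHA256 of body computed with secret.
// Hypothetical helper; the header format is an assumption stated above.
func validSignature(secret string, body []byte, header string) bool {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	expected := mac.Sum(nil)

	for _, entry := range strings.Split(header, ",") {
		entry = strings.TrimSpace(entry)
		if !strings.HasPrefix(entry, "v1=") {
			continue // ignore entries with an unknown version prefix
		}
		got, err := hex.DecodeString(strings.TrimPrefix(entry, "v1="))
		if err != nil {
			continue // not valid hex, cannot match
		}
		// hmac.Equal compares in constant time.
		if hmac.Equal(got, expected) {
			return true
		}
	}
	return false
}

// sign builds a header value for the example; a real sender would do this.
func sign(secret string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	return "v1=" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	body := []byte(`{"event":{"event_type":"incident.triggered"}}`)
	header := sign("example-secret", body)

	fmt.Println(validSignature("example-secret", body, header)) // true
	fmt.Println(validSignature("wrong-secret", body, header))   // false
}
```

The digest must be computed over the raw request bytes; re-serializing the parsed JSON before hashing would usually change the bytes and break the comparison.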

Contributing

Building

For build targets, see make help.

Adding a new investigation

CAD investigations are triggered by PagerDuty webhooks. Currently, CAD supports the following two formats of webhooks:

  • WebhookV3
  • EventOrchestrationWebhook

CAD identifies the required investigation based on the incident and its payload. Because PagerDuty webhooks cannot be scoped more finely than per service, CAD itself filters out the alerts it should investigate. For more information, please refer to https://support.pagerduty.com/docs/webhooks.

To add a new alert investigation:

  • Run make bootstrap-investigation to generate boilerplate code in pkg/investigations. This creates the corresponding folder and .go file, and also appends the investigation to the availableInvestigations interface in registry.go.
  • investigation.Resources contains initialized clients for the cluster's AWS environment, OCM, and more. See Integrations.
  • Add test objects or scripts used to recreate the alert symptoms to the pkg/investigations/$INVESTIGATION_NAME/testing/ directory for future use. Be sure to clearly document the testing procedure under the Testing section of the investigation-specific README.md file.
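
To illustrate the idea of a registry that maps an incoming alert to an investigation, here is a rough sketch. It is not the code generated by make bootstrap-investigation, and it does not reflect the real contents of registry.go: the Investigation interface, the titleMatch type, the available slice, and findInvestigation are all hypothetical names chosen for this example, as is the rule of matching on the alert title.

```go
package main

import (
	"fmt"
	"strings"
)

// Investigation is a hypothetical interface, not CAD's actual type.
type Investigation interface {
	Name() string
	// ShouldInvestigate reports whether this investigation handles the alert.
	ShouldInvestigate(alertTitle string) bool
}

// titleMatch is a hypothetical investigation that matches on a substring
// of the alert title.
type titleMatch struct {
	name   string
	substr string
}

func (t titleMatch) Name() string { return t.name }

func (t titleMatch) ShouldInvestigate(alertTitle string) bool {
	return strings.Contains(alertTitle, t.substr)
}

// available stands in for a registry of implemented investigations.
var available = []Investigation{
	titleMatch{name: "clusterhasgonemissing", substr: "ClusterHasGoneMissing"},
}

// findInvestigation returns the first registered investigation that
// claims the alert, or false if none does.
func findInvestigation(alertTitle string) (Investigation, bool) {
	for _, inv := range available {
		if inv.ShouldInvestigate(alertTitle) {
			return inv, true
		}
	}
	return nil, false
}

func main() {
	titles := []string{"ClusterHasGoneMissing CRITICAL (1)", "SomeUnhandledAlert"}
	for _, title := range titles {
		if inv, ok := findInvestigation(title); ok {
			fmt.Printf("%q -> run investigation %s\n", title, inv.Name())
		} else {
			fmt.Printf("%q -> no handler, forward to a human SRE\n", title)
		}
	}
}
```

In a design like this, adding an investigation means appending one entry to the registry, which is the kind of step the bootstrap target automates.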

Integrations

Note: When writing an investigation, you can use these integrations right away. They are initialized for you and passed to the investigation via investigation.Resources.

  • AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
    • See pkg/aws
  • PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
    • See pkg/pagerduty
  • OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
    • See pkg/ocm
    • If permissions to query an OCM resource are missing, add them to the Configuration-Anomaly-Detection role in uhc-account-manager
  • osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.
  • k8sclient -- Interacting with the cluster's kube-api
    • Requires RBAC definitions for your investigation to be added to metadata.yaml

Testing locally

Requires an existing cluster.

  1. Create a test incident and payload file for your cluster

    ./test/generate_incident.sh <alertname> <clusterid>
  2. Export the required env variables from vault

    Note: For details on these variables, see Required ENV variables.

    source test/set_stage_env.sh
    
  3. Build the cadctl binary

    make build

  4. Run cadctl with the payload file created by test/generate_incident.sh

    ./bin/cadctl investigate --payload-path payload

Logging levels

CAD supports the following log levels: debug, info, warn, error, fatal, panic. The log level is determined through a hierarchy: the CLI flag log-level is checked first, and if it is not set, the optional environment variable LOG_LEVEL is used. If neither is set, the log level defaults to info.
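
The precedence described above can be sketched like this. It is a minimal illustration under stated assumptions rather than cadctl's actual code: resolveLogLevel is a hypothetical function, the flag value is passed in as a plain string (empty meaning "not set"), and the environment lookup is injected so the logic can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// resolveLogLevel applies the precedence: flag value, then LOG_LEVEL,
// then the default "info". Hypothetical helper for illustration.
func resolveLogLevel(flagValue string, lookupEnv func(string) (string, bool)) string {
	if flagValue != "" {
		return flagValue
	}
	if v, ok := lookupEnv("LOG_LEVEL"); ok && v != "" {
		return v
	}
	return "info"
}

func main() {
	// With no flag given, the result depends on the LOG_LEVEL variable
	// of the current process.
	fmt.Println(resolveLogLevel("", os.LookupEnv))

	// A flag value always wins over the environment.
	fmt.Println(resolveLogLevel("debug", os.LookupEnv)) // debug
}
```

This sketch does not validate the level against the supported list; a real implementation would likely reject unknown values.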

Documentation

Investigations

Every alert managed by CAD corresponds to an investigation, representing the executed code associated with the alert.

Investigation-specific documentation can be found in the corresponding investigation folder, e.g. for ClusterHasGoneMissing.

Integrations

  • AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
  • PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
  • OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
  • osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.

Templates

  • Update-Template -- Updating configuration-anomaly-detection-template.Template.yaml.
  • OpenShift -- Used by app-interface to deploy the CAD resources on a target cluster.

Dashboards

Grafana dashboard configmaps are stored in the Dashboards directory. See app-interface for further documentation on dashboards.

Deployment

  • Tekton -- Installation/configuration of Tekton and triggering pipeline runs.
  • Skip Webhooks -- Skipping the EventListener and creating the PipelineRun directly.
  • Namespace -- Allowing the code to ignore the namespace.

Boilerplate

PipelinePruner

Required ENV variables

Note: For local execution, these can be exported from vault with source test/set_stage_env.sh

  • CAD_OCM_CLIENT_ID: refers to the OCM client ID used by CAD to initialize the OCM client
  • CAD_OCM_CLIENT_SECRET: refers to the OCM client secret used by CAD to initialize the OCM client
  • CAD_OCM_URL: refers to the OCM URL used by CAD to initialize the OCM client
  • CAD_PD_EMAIL: refers to the email for a login via mail/pw credentials
  • CAD_PD_PW: refers to the password for a login via mail/pw credentials
  • CAD_PD_TOKEN: refers to the generated private access token for token-based authentication
  • CAD_PD_USERNAME: refers to the username of CAD on PagerDuty
  • CAD_SILENT_POLICY: refers to the silent policy CAD should use if the incident shall be silent
  • PD_SIGNATURE: refers to the PagerDuty webhook signature (HMAC+SHA256)
  • X_SECRET_TOKEN: refers to our custom Secret Token for authenticating against our pipeline
  • CAD_PROMETHEUS_PUSHGATEWAY: refers to the URL CAD will push metrics to
  • BACKPLANE_URL: refers to the backplane URL to use
  • BACKPLANE_INITIAL_ARN: refers to the initial ARN used for the isolated backplane jumprole flow
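
A quick pre-flight check for the variables listed above might look like the following. This is only a sketch for local use, not part of cadctl: missingEnv and requiredEnv are hypothetical names, and treating an empty value as missing is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv mirrors the list of required variables in this README.
var requiredEnv = []string{
	"CAD_OCM_CLIENT_ID",
	"CAD_OCM_CLIENT_SECRET",
	"CAD_OCM_URL",
	"CAD_PD_EMAIL",
	"CAD_PD_PW",
	"CAD_PD_TOKEN",
	"CAD_PD_USERNAME",
	"CAD_SILENT_POLICY",
	"PD_SIGNATURE",
	"X_SECRET_TOKEN",
	"CAD_PROMETHEUS_PUSHGATEWAY",
	"BACKPLANE_URL",
	"BACKPLANE_INITIAL_ARN",
}

// missingEnv returns the names that are unset or empty, in input order.
// The lookup function is injected so the check can be tested in isolation.
func missingEnv(names []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range names {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	missing := missingEnv(requiredEnv, os.LookupEnv)
	if len(missing) > 0 {
		fmt.Printf("missing required environment variables: %v\n", missing)
		return
	}
	fmt.Println("all required environment variables are set")
}
```

Running a check like this after source test/set_stage_env.sh reports every missing variable at once instead of failing on the first one.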

Optional ENV variables

  • BACKPLANE_PROXY: refers to the proxy CAD uses for the isolated backplane access flow.

Note: BACKPLANE_PROXY is required for local development, as the backplane API is only accessible through the proxy.

  • CAD_EXPERIMENTAL_ENABLED: enables experimental investigations when set to true, see mapping.go

For Red Hat employees, these environment variables can be found in the SRE-P vault.

  • LOG_LEVEL: refers to the CAD log level, if not set, the default is info. See