How to set up a watcher to detect CDR Connection Failure errors in Kibana (Part 1/2)

Objectives of this blog post:

Describe the steps to follow to configure and create a simple watcher that detects a condition and sends an email when triggered. This document focuses on a real use case, the monitoring of CDR Connection Failure errors, which can serve as an example and a base for other applications.

Watch the video

Step 1: Get an Elastic Stack subscription license

Watcher is part of the Elastic Stack subscription license, so a suitable subscription is needed before setting up any alerting.

Step 2: Set up notifications

When the watcher triggers, you can choose to add an action that sends a notification that an event was fired. You can then specify the message you want to send, but first you need to set up the notification channels.

– Email notification:

Documentation for email notification settings

Documentation for different email profiles

To configure the SMTP server you want to use and the email account, go to the “Dev Tools” tab in your Kibana page, and input the following request:

PUT _cluster/settings
{
  "persistent": {
    "xpack.notification.email": {
      "account": {
        "NAME_OF_ACCOUNT": {
          "smtp": {
            "host": "SMTP_HOST",
            "port": "SMTP_PORT"
          }
        }
      }
    }
  }
}

If authentication is needed, you can also specify an smtp.user setting and an smtp.secure_password (the latter is a secure setting).
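As an illustration (the account name and user value are placeholders), the SMTP user can be added through the same cluster settings API:

```
PUT _cluster/settings
{
  "persistent": {
    "xpack.notification.email.account.NAME_OF_ACCOUNT.smtp.user": "SMTP_USER"
  }
}
```

The smtp.secure_password, being a secure setting, cannot be set through the cluster settings API; it has to be added to the Elasticsearch keystore on each node, e.g. with bin/elasticsearch-keystore add xpack.notification.email.account.NAME_OF_ACCOUNT.smtp.secure_password.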

– Slack notification:

Documentation for Slack notification settings

Documentation for Slack webhooks

You need to create a webhook for your Slack workspace. Follow the steps described in the link above (administrator privileges are required). Once that is done, go to the “Dev Tools” tab and input the following request:

PUT _cluster/settings
{
  "persistent": {
    "xpack.notification.slack": {
      "account": {
        "ACCOUNT_NAME": {
          "secure_url": "SLACK_WEBHOOK_URL",
          "message_defaults": {
            "from": "Kibana Watch",
            "to": "DESTINATION",
            "icon": "http://example.com/images/watcher-icon.jpg",
            "attachment": {
              "fallback": "X-Pack Notification",
              "color": "#36a64f",
              "title": "X-Pack Notification",
              "title_link": "https://www.elastic.co/guide/en/x-pack/current/index.html",
              "text": "One of your watches generated this notification.",
              "mrkdwn_in": "pretext, text"
            }
          }
        }
      }
    }
  }
}

Step 3: Create the watcher

Go to the Management > Elasticsearch > Watcher tab, and create a new watcher.

You can use the simple “Create Threshold Alert” option, and choose the action you want to link to the alert.

You can also use the “Create Advanced Watch” option, which lets you customize your watcher in more detail.

Recommendation:

The editing pane that the Kibana page provides for the watcher is small and inconvenient. It is highly recommended to edit the watch in an external tool that can display JSON files properly, preferably an editor that manages indentation and syntax coloring for better readability, such as a text editor (e.g. Notepad++) or an IDE (e.g. Visual Studio). You can then edit the file with ease and paste it back into the Kibana window to save it.

Here is a complete example of a watcher that checks for CDR Connection Failure errors every 30 seconds, detects the first time the error occurs, and sends an email with the time it was detected:

{
  "trigger": {
    "schedule": {
      "interval": "30s"
    }
  },
  "input": {
    "chain": {
      "inputs": [
        {
          "first": {
            "search": {
              "request": {
                "search_type": "query_then_fetch",
                "indices": [
                  "cmsalarmstates-*"
                ],
                "types": [],
                "body": {
                  "query": {
                    "bool": {
                      "must": {
                        "match": {
                          "cmsalarm.type": "cdrConnectionFailure"
                        }
                      },
                      "filter": {
                        "bool": {
                          "must": {
                            "range": {
                              "timestamp": {
                                "gte": "now-1m",
                                "lte": "now-30s"
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        },
        {
          "second": {
            "search": {
              "request": {
                "search_type": "query_then_fetch",
                "indices": [
                  "cmsalarmstates-*"
                ],
                "types": [],
                "body": {
                  "query": {
                    "bool": {
                      "must": {
                        "match": {
                          "cmsalarm.type": "cdrConnectionFailure"
                        }
                      },
                      "filter": {
                        "bool": {
                          "must": {
                            "range": {
                              "timestamp": {
                                "gte": "now-90s",
                                "lte": "now-60s"
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      ]
    }
  },
  "condition": {
    "always": {}
  },
  "actions": {
    "send_email": {
      "condition": {
        "script": {
          "source": "return ctx.payload.first.hits.total > 0 && ctx.payload.first.hits.total < 30 && ctx.payload.second.hits.total != 30",
          "lang": "painless"
        }
      },
      "email": {
        "profile": "standard",
        "from": "watchtest@test.com",
        "to": [
          "test@vqcomms.com"
        ],
        "subject": "Watcher Notification",
        "body": {
          "text": "CDR Connection Failure : ({{ctx.execution_time}})"
        }
      }
    }
  }
}

In this case, the CMS server produces an error if the CDR receiver cannot be reached. This happens every second until the problem is resolved. We want to set up a watcher that triggers on the first error but doesn’t keep sending notifications after the first one.

In order to do so, this watcher uses chained searches:

  • The first search covers the time range [now-60s ; now-30s]; the 30 s offset makes sure no late-arriving logs are missed. We verify that at least one error is found, but fewer than 30, which is the maximum number of errors over that period (1 error/s). This avoids triggering the action repeatedly after the first detection if the error keeps being logged regularly.
  • The second search covers the time range of the previous watcher iteration, i.e. 30 s earlier: [now-90s ; now-60s]. It verifies whether the error was already present at the previous check. If it was, it has already been handled and no new notification needs to be sent.

Here is the Painless script that defines the condition deciding whether a notification email is sent:

return ctx.payload.first.hits.total > 0 && ctx.payload.first.hits.total < 30 && ctx.payload.second.hits.total != 30
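To make the logic concrete, here is a small stand-alone sketch (not part of the watch itself; the function name and parameters are illustrative) that mirrors this condition in Python:

```python
# Hypothetical Python mirror of the watcher's Painless condition.
# first_hits  = error hits in the current window  [now-60s ; now-30s]
# second_hits = error hits in the previous window [now-90s ; now-60s]
# With 1 error/s, a fully saturated 30 s window contains 30 hits.

def should_notify(first_hits: int, second_hits: int) -> bool:
    """Fire only when the error has just appeared: some hits in the
    current window, the current window is not yet saturated, and the
    previous window was not saturated either."""
    return first_hits > 0 and first_hits < 30 and second_hits != 30

# Error just started: a few hits now, none in the previous window -> notify
print(should_notify(12, 0))    # True
# Error ongoing: both windows saturated -> already handled, stay silent
print(should_notify(30, 30))   # False
# No error at all -> nothing to report
print(should_notify(0, 0))     # False
```

Running the three example calls shows that the watch fires exactly once, when the error first appears, and then stays silent while the error persists.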

In the second part of this documentation, we will see how to add a link to a dashboard covering the 10-minute window around the triggering event.

Read part 2 of this blog