A little trick to express mysql read-only clearly in grafana

Overview

I mentioned about PMM(Percona Monitoring and Management), actually I will talk about using Prometheus query and Grafana with little trick. It’s how to express read-only ON/OFF clearly in grafana. 🙂

Definition

This topic is in the server notation according to the READ-ONLY setting value of MySQL, but in my opinion this is also a topic that separates the master/slave. Of course, you can also run a service with a different topology of MySQL, such as (M)-(M/S)-(S) as a body-dining structure. I think that the simplest configuration is the most powerful, when you prepare for a variety of disabilities with personal thought. In my position, READ-ONLY is defined as slave, a guy activated ON.

  • Master (READ_ONLY: OFF)

    An instance for subject of data change
    Master recognition for (M/S) with middle chaining like (M)-(M/S)-(S)

  • Slave (READ_ONLY: OFF)

    Instances that data cannot be changed

A query may differ depending on the individual’s definition, but I want to have representation of the “READ_ONLY is based on ON/OFF server Classification” or “all instance criteria.”

Grafana Variable

In MySQL, READ-ONLY means only ON/OFF, but here we have added a value of 2 to mean the entire server.

  • var_read_only : 0 (READ-ONLY  OFF)
  • var_read_only : 1 (READ-ONLY  ON)
  • var_read_only : 2 (ALL)

How to add a new Variable from Grafana is to be added in the Variables tab of the Setting of the dashboard, and the custom type as shown below 2, 0, 1. The reason why start with 2. Just try to specify the default value.

grafana-read-only-variable1

However, the problem is that if you look at the Variable select box created, the notation is shown as 2, 0, 1 and is not intuitive. 2-> ALL, 0-> OFF, 1-> ON to be represented in the Select box, it would be okay to be used? (Of course, no matter to use.)

In the UI setting of Grafana, it seems to be impossible (yet)… This requires a little manipulation and re-Import after the Export of Grafana. Look up “Share Dashboard” > “Export” tab and open it up as a JSON file, the “text” part is also set to something like value, as shown below.

{
  "allValue": null,
  "current": {
    "text": "2", <== here
    "value": "2"
  },
  "hide": 0,
  "includeAll": false,
  "label": "Read-only",
  "multi": false,
  "name": "read_only",
  "options": [
    {
      "selected": true,
      "text": "2",
      "value": "2"
    },
    {
      "selected": false,
      "text": "0", <== here
      "value": "0"
    },
    {
      "selected": false,
      "text": "1", <== here
      "value": "1"
    }
  ],
  "query": "2,0,1",
  "type": "custom"
}

The above text will be replaced with an intuitive phrase as below, and you can Import the Dashboard again.

{
  "allValue": null,
  "current": {
    "text": "ALL", <== here
    "value": "2"
  },
  "hide": 0,
  "includeAll": false,
  "label": "Read-only",
  "multi": false,
  "name": "read_only",
  "options": [
    {
      "selected": true,
      "text": "ALL", <== here
      "value": "2"
    },
    {
      "selected": false,
      "text": "OFF", <== here
      "value": "0"
    },
    {
      "selected": false,
      "text": "ON", <== here
      "value": "1"
    }
  ],
  "query": "2,0,1",
  "type": "custom"
}

Finally, it is possible to make intuitive phrases on the Grafana screen. It is not easy to control and anyway the first query setting is completed!

grafana-read-only-variable2

Promethues Query

At Grafana, we’ve created the basis to deliver whatever you want, and this time we’ll write a Prometheus query to look at the data. In this place, I will not explain the Prometheus query every one, please visit the official site manual once Prometheus, read the query syntax. (Here)
My concern is the grouping result of the READ-ONLY ON/OFF. In the preceding definition, if the READ_ONLY value is 0, the master, the READ_ONLY is 1, the slave, and 2 means the entire server. Based on this data from Grafana, we will select and take only the data that we want.

First, try extracting the current system load with a Prometheus query to get a maximum value of one minute for max_over_time, as shown below. This is the result corresponding to the entire instance.

max_over_time(node_load1[1m])

Add a condition for the READ-ONLY status value to look up the results for READ-ONLY on/OFF. Here, the “on” entry means to process data operations based on which data, if it is a SQL reference, it’s probably the JOIN condition.

max_over_time(node_load1[1m]) 
  and on (instance) (mysql_global_variables_read_only == $read_only)

But there’s one problem here. If the READ-ONLY value exists, it will query queries without a big problem, but if you look up master/slave at the same time, like a entire server, the query does not work normally.

So, here is the another trick for it. If the result of the three values is equal to 1 or more than 2, then it seems to be resolved? And if we tie the “OR” condition, we can do the first arithmetic process, and if we do not meet, we will write a query to READ-ONLY part? The metric that “exists unconditionally” in each instance is ?

You can use the value “up {job = ‘ Linux ‘}”. The result of this value is 0 or 1, and you can assume that the Exporter exists unconditionally. In other words, if the $read _oly is 0 and 1, it will be negative, and you can create a query as shown below that it will only have 1 or more positive values.

max_over_time(node_load1[1m]) 
  and on (instance) (
      up{job='linux'} <= $read_only^2-3 
      or mysql_global_variables_read_only == $read_only
  )

If this is the case, the first 2 is filtered in the first up condition and does not refer to the mysql_global_variables_read_only portion of the back. (Or does not have the preceding result) However, if the READ_ONLY filtering condition enters zero or one operation, the up result does not exist, so you can check the Mysql_global_variables_read_only value and finally extract the data with desired.

Result

I configured the VM locally as a test and set up the MySQL server with the following structure, and then I tried to float the PMM exporter. Set the OFF for read only because of node01 and node03 are the role of the main for data change.

node01 : Master
  ㄴ node02 : Slave
node03:  Master(Single)

Case1. ALL Instance

All servers are displayed, including Pmm-server instances.

pmm_result_readonly_all

Case2. READ-ONLY: OFF

As intended, only instances where the READ-ONLY is OFF are shown in the graph. Note that only MySQL instance data with the READ-ONLY property is indicate, so the Pmm-server instance will fall out of the target list.
pmm_result_readonly_off

Case3. READ-ONLY: ON

Finally, The READ-ONLY only shows the slaves that are ON. Here is the same with no pmm-server.
pmm_result_readonly_on

Conclusion

Here is the what I did
Intuitively notation for READ-ONLY with trick for dashboard Import in Grafana
Create Prometheus tips and tricks to filter by two steps (ALL, READ-ONLY ON/OFF)

It can be very simple and easy for master hand, but if you have to get resource status on system at a glace with dozens of master/slave mixed environments, the sorting according to the rolls can be a big deal.

However, since the criteria of this notation depends on the database READ-ONLY, another notation is necessary if the DB is not running. I think it is a good idea to solve this problem through the recording function in Prometheus.

Good luck~ 🙂

Introduction of pmm-ruled

I personally believe that reliable service starts with monitoring and ends with tuning.

PMM is a very useful monitoring tool. I believe super stable service is mostly coming from Service Monitoring.

Anyway, when I first saw PMM, I was very impressing because I was thirsty for customizing of Database monitoring solution. However, while deeply analyzing the structure and function of the PMM, I found out that there are some features of motoring solution that need to have additional features.

Why Monitoring Feature is so important?

The monitoring system aims to show the current indicators or status of database properly. In this respect, I would like give PMM with good score.

However, it is also a very important monitoring feature to send alerting to the administrator when the problem occurs (or before).

The problem was that there was no effective solution with PMM, and I needed that my banking service should be running without any further problems.

PMM Alerting

As you know, there are two ways you can send Alerting in PMM.

The first is to take advantage of Grafana’s Alerting feature.

You can control from UI interface and it’s intuitive.

https://www.percona.com/blog/2017/01/23/mysql-and-mongodb-alerting-with-pmm-and-grafana/
https://www.percona.com/blog/2017/02/02/pmm-alerting-with-grafana-working-with-templated-dashboards/

Unfortunately, Grafana does not support “Template Variable” properly, so I should set the number of alerting for each metric. This is a tired and not a suitable solution for the expansion of hundreds of nodes.

The second is to use “alert rule” of prometheus.

Though not intuitive, it is a way to specify multiple node thresholds at once.

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

However, there’s no way to set dynamic thresold for each node, only supports a fixed.
Even though I want to specify different thresholds for the purpose of the server(ex, production/development), there is no way to configure the alert rule without complicated alerting rule configuration in prometheus.

PMM-ruled

I do need a “dynamic rule manager”, so I made this with a funny idea.

Grafana alerting which I said before that was difficult to apply, so I decided the second method, prometheus alerting. As I mentioned a little while ago, promethus is a time series database. In order to perform data operations, the data to be computed and must exist in the same time range.

threshold-rule-metric

In other words, in order to configure the threshold value of each server differently, this threshold value is also periodically taken from the prometheus as some kind of metrics.

PMM-ruled is the Alerting threshold exporter.

It is not enough. There must be a principal to manage these alerting rules. If you want to set A node and B node need to receive alerting message when CPU usage is over 10%, 20% in each, this information also must be stored and managed somewhere else.

From the perspective of monitoring for dozens and hundreds of nodes with a single PMM, I did not want to manage the file thresholds one by one manually, one by one. (In fact it is impossible to handle by myself.)
Alert thresholds must be managed with “group” or “instance” in each. And I would like to have the ability to skip alerting as well.

alert-thresolds

If an instance level rule exists, it is given priority for that instance. If there is no instance level rule, priority is given to the group rule. If neither instance nor group has a specified threshold, use the threshold specified in alert rule for that node. The threshold values are classified into layers, like the image above

PMM-ruled is the alerting rule managing API server.

And if the alert rule was defined as a fixed value it would be shown as below:

groups:
- name: alert_rules
  rules:
  - alert: mysql_agent_up
    expr:  up{job="mysql"} == 0
    for: 10s
    labels: 
      level: critical
    annotations: 
      summary: MySQL Agent Down
      description: MySQL Agent Down

The thresholds can be calculated as a single metric and managed dynamically on a server basis as shown below.

groups:
- name: alert_rules
  rules:
  - alert: mysql_agent_up
    expr:  up{job="mysql"} == on (instance) group_left (level, name, gname) (alert_rule_threshold{name="mysql_agent_up", level="critical"})
    for: 10s
    labels: 
      level: critical
    annotations: 
      summary: MySQL Agent Down
      description: MySQL Agent Down

PMM-ruled is the alerting rule file generator.

In addition to this, it has more features such as prometheus recording rule management, snapshot rule management which inserts current state value into MySQL. (to show current node status with just single table.)

The figure below is showing the configuration of the whole pmm-ruled.
pmm-ruled-overview

Interesting?

I made this feature in April, 2018, and am still monitoring for dozens of mysql nodes with only single PMM server from that time. I’m currently collaborating with Percona after contributing the entire pmm-ruled source code.

You can start this project from percona-lab repository.

https://github.com/percona-lab/pmm-ruled

I would like to keep the original source, so I forked into a separate project.

https://github.com/gywndi/pmm-ruled

I will talk about how to port from pmm-ruled to pmm in the next time.

Thanks.