The purpose of this ticket is to make whatever changes are necessary so that we can use self-monitoring for diagnostics and troubleshooting. By self-monitoring I mean collecting and storing metrics about Hawkular Metrics itself, including the Cassandra driver. Cassandra metrics will be handled separately (see HWKMETRICS-448).
We will store internal metrics in the same Cassandra cluster that is used to store user metrics. There are some downsides to doing this, namely how we store and retrieve internal metrics when Hawkular Metrics and/or Cassandra is having problems. In the worst case scenario for retrieving metrics, where Cassandra is down for an extended period, we could tar up the data directory and load it into another cluster.
We can also explore more advanced setups, like having a separate Cassandra cluster in another data center (in Cassandra terms a DC is just a logical separation of nodes and can be a single node) that does not accept user requests. It could be dedicated to handling requests for internal metrics and possibly batch processing jobs. This setup also provides failover in the event that the primary cluster in the first DC goes down. Multi-DC support is beyond the scope of this ticket, but I wanted to point it out because this ticket forms the basis for the work involved.
Two of the big things that need to be sorted out are naming, to make sure we avoid conflicts, and metadata, which will be in the form of tags so that we can query the metrics. The metrics will live under the admin tenant, which means that they will not be accessible to any user. Metric ids will be of the form:
We may want to include the port with the hostname.
namespace will be org.hawkular.metrics for internal metrics. For Cassandra driver metrics, it would be com.datastax.driver. For Hawkular Alerts, it would be org.hawkular.alerts. From these examples we can see that the namespace is basically a top-level package name. I am proposing that metric names be in CamelCase.
Note that metric names must be unique within a namespace.
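The naming rules above (CamelCase metric names, unique within a namespace) could be enforced with something like the following. This is only a sketch: the class InternalMetricNames and its register method are hypothetical and do not exist in Hawkular Metrics, and the exact CamelCase rule (an uppercase first letter followed by letters or digits) is my assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Hawkular Metrics.
final class InternalMetricNames {

    // Assumed CamelCase rule: an uppercase first letter, then letters or digits only.
    private static final Pattern CAMEL_CASE = Pattern.compile("[A-Z][A-Za-z0-9]*");

    // Metric names seen so far, grouped by namespace (e.g. org.hawkular.metrics).
    private final Map<String, Set<String>> namesByNamespace = new HashMap<>();

    /**
     * Records a metric name under a namespace.
     *
     * @return true if the name was new in that namespace, false if it was already taken
     * @throws IllegalArgumentException if the name does not match the assumed CamelCase rule
     */
    boolean register(String namespace, String metricName) {
        if (!CAMEL_CASE.matcher(metricName).matches()) {
            throw new IllegalArgumentException("Metric name is not CamelCase: " + metricName);
        }
        return namesByNamespace
                .computeIfAbsent(namespace, key -> new HashSet<>())
                .add(metricName);
    }
}
```

With this, registering DataPointsInserted twice under org.hawkular.metrics is reported as a conflict, while the same name under com.datastax.driver is accepted because uniqueness is only required per namespace.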
I am proposing a standard set of tags for each metric, which are described next.
The hostname tag is pretty self-explanatory. It can be used to limit metrics to a particular instance of Hawkular Metrics.
The namespace tag could be used, for example, to restrict results to Cassandra driver metrics.
Examples of the scope tag might include REST or Core. Metrics having a scope of REST would include those for the REST API. Core would include metrics in the core Java API, primarily in MetricsService.
A metric like DataPointsInserted, which measures the throughput of writing data points, might have a type of Ingestion or Write. A metric like MetricTagsQueryLatency might have a type of Query or Read. The combination of scope and type will allow us to drill down more easily into particular types of metrics.
These are the standard tags I am proposing for our internal metrics. Additional tags can be supplied. For REST endpoint metrics we might have method=POST.
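To make the tag proposal more concrete, here is a rough sketch of how the standard tags might be assembled and then used to drill down by scope and type. Everything here is hypothetical: the class InternalMetricTags, its methods, and the literal tag keys hostname, namespace, scope and type are my assumptions for illustration, not an existing Hawkular Metrics API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Hawkular Metrics.
final class InternalMetricTags {

    private InternalMetricTags() {
    }

    // Builds the four proposed standard tags. The tag keys are assumed names.
    static Map<String, String> standardTags(String hostname, String namespace,
                                            String scope, String type) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("hostname", hostname);
        tags.put("namespace", namespace);
        tags.put("scope", scope);
        tags.put("type", type);
        return tags;
    }

    // Adds extra tags (e.g. method=POST) without letting them overwrite a standard tag.
    static Map<String, String> withAdditionalTags(Map<String, String> standard,
                                                  Map<String, String> additional) {
        Map<String, String> tags = new LinkedHashMap<>(standard);
        for (Map.Entry<String, String> entry : additional.entrySet()) {
            tags.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return tags;
    }

    // True if the metric's tags have exactly the given scope and type.
    static boolean matches(Map<String, String> tags, String scope, String type) {
        return scope.equals(tags.get("scope")) && type.equals(tags.get("type"));
    }
}
```

For example, a DataPointsInserted metric for the REST API might carry scope=REST and type=Write plus the additional tag method=POST, and a query for scope=REST, type=Write would then pick it up while skipping Core or Read metrics.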
We will make live values accessible via the REST API. That work will probably be tracked under a separate ticket though. We may also want to expose our metrics via JMX, but again, that effort would be tracked under a separate ticket.