Although the code paths are a bit different, this issue applied to the endpoints both for creating metrics and for updating metric tags. We store metric tags in two tables - metrics_idx and metrics_tags_idx. In metrics_idx the tags are stored in a single column as a map, so inserting tags into metrics_idx is an atomic operation: either all of the tags are successfully inserted, or none of them are. Things are different with metrics_tags_idx. We partition by tag name, which means each tag is stored in a separate partition. Inserting tags into metrics_tags_idx is therefore not an atomic operation.
We have had situations where a client creates a metric or adds tags, and then one of the writes to Cassandra fails. There is currently no retry logic, and we immediately abort processing the request. This can leave us in an inconsistent state. Suppose we have the following tags - x:1, y:2, z:3. The write to metrics_idx succeeds, but the second of the per-tag writes to metrics_tags_idx fails. We wind up with all of the tags in metrics_idx but only a subset of them in metrics_tags_idx.
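The failure mode above can be sketched with an in-memory model of the two tables. This is purely illustrative (the real code is server-side writes to Cassandra, not Python dicts); only the table names come from the description above, everything else is hypothetical.

```python
# Model the two index tables described above as in-memory dicts.
metrics_idx = {}       # metric -> {tag: value}; written atomically as one map column
metrics_tags_idx = {}  # (tag_name, metric) -> value; one partition per tag name

def create_metric(metric, tags, fail_on_tag=None):
    # Write to metrics_idx: atomic, all tags land together.
    metrics_idx[metric] = dict(tags)
    # Writes to metrics_tags_idx: one per tag, each can fail independently.
    for name, value in tags.items():
        if name == fail_on_tag:
            raise IOError("write timeout on partition " + name)
        metrics_tags_idx[(name, metric)] = value

try:
    # Simulate the y:2 write failing; the request is aborted with no retry.
    create_metric("request_count", {"x": "1", "y": "2", "z": "3"}, fail_on_tag="y")
except IOError:
    pass

print(sorted(metrics_idx["request_count"]))    # all three tags are in metrics_idx
print(sorted(t for t, m in metrics_tags_idx))  # only a subset is in metrics_tags_idx
```

Running this leaves x, y, and z in metrics_idx but only x in metrics_tags_idx, which is exactly the inconsistency described above.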
All of the writes are idempotent by design, so clients can simply retry the request if it fails. In terms of performance this is probably the best option for getting to a consistent state with respect to tags... in theory.
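Continuing the same illustrative model, this is why idempotency makes a blind client retry safe: Cassandra inserts are upserts, so replaying the whole request just overwrites identical rows. The retry loop here is a hypothetical client, not actual Heapster or Hawkular code.

```python
metrics_idx = {}
metrics_tags_idx = {}

def write_tags(metric, tags, transient_failures):
    # Upsert semantics: rewriting the same rows is harmless.
    metrics_idx[metric] = dict(tags)
    for name, value in tags.items():
        if name in transient_failures:
            transient_failures.remove(name)  # fail once, then succeed
            raise IOError("transient failure writing tag " + name)
        metrics_tags_idx[(name, metric)] = value

tags = {"x": "1", "y": "2", "z": "3"}
transient_failures = {"y"}

for attempt in range(2):       # client simply replays the whole request
    try:
        write_tags("request_count", tags, transient_failures)
        break
    except IOError:
        continue

print(len(metrics_tags_idx))   # all three per-tag rows present after the retry
```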
In OpenShift, Heapster is the client writing tags. Heapster is not stateful, so at startup it queries Hawkular Metrics to get the set of already persisted metrics. It queries metrics_idx for this. Metrics that Heapster knows about locally and that are not in metrics_idx will have their tags written to Hawkular Metrics. Herein lies the problem. If we have tags that have been successfully written to metrics_idx but not to metrics_tags_idx, then Heapster will not resend the tags for those metrics, and we are stuck in an inconsistent state.
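The startup reconciliation described above can be reduced to a membership check, which shows why the inconsistency is never repaired. Again an illustrative sketch, not the actual Heapster sink logic.

```python
# Server state after the partial failure: full tag map in metrics_idx,
# but only the x tag made it into metrics_tags_idx.
metrics_idx = {"request_count": {"x": "1", "y": "2", "z": "3"}}
metrics_tags_idx = {("x", "request_count"): "1"}

# Metrics Heapster knows about locally.
local_metrics = {"request_count": {"x": "1", "y": "2", "z": "3"}}

# Heapster treats presence in metrics_idx as "tags already persisted",
# so it only resends metrics that are absent from metrics_idx.
to_resend = [m for m in local_metrics if m not in metrics_idx]
print(to_resend)  # empty: nothing is resent, metrics_tags_idx stays inconsistent
```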
There are at least a couple of ways that we can handle this situation. We can do it on the server side using logged batches. The downside here is the reduced performance due to the overhead of logged batches. The batch request is first written to a durable batch log table before the writes within the batch statement are attempted. The coordinator node has to wait for acks from all nodes being written to, which increases the risk of timeouts at the coordinator. I have done some preliminary benchmarking with batch vs non-batch updates. Results can be viewed here. The numbers for batched vs non-batched do not look bad; however, I would expect them to get worse as the number of tags increases. The upside to using logged batches is that it is handled completely on the server side. All of the writes may not happen at once (and in fact there is no isolation with logged batches), but Cassandra will make sure that they all do succeed eventually.
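For concreteness, the logged-batch option would group the metrics_idx update and all of the per-tag metrics_tags_idx updates into one BEGIN BATCH ... APPLY BATCH statement. The sketch below just assembles that CQL as a string; the column names are assumptions for illustration, not the actual Hawkular schema.

```python
def build_logged_batch(metric, tags):
    """Build a logged-batch CQL statement (BEGIN BATCH defaults to logged).

    Column names (metric, tags, tname, tvalue) are hypothetical.
    """
    stmts = [
        # Python's dict repr happens to match CQL map-literal syntax here.
        "INSERT INTO metrics_idx (metric, tags) VALUES ('%s', %s)" % (metric, tags)
    ]
    for name, value in sorted(tags.items()):
        stmts.append(
            "INSERT INTO metrics_tags_idx (tname, tvalue, metric) "
            "VALUES ('%s', '%s', '%s')" % (name, value, metric)
        )
    return "BEGIN BATCH\n  " + ";\n  ".join(stmts) + ";\nAPPLY BATCH;"

cql = build_logged_batch("request_count", {"x": "1", "y": "2", "z": "3"})
print(cql)
```

All four inserts then go to the batch log together, at the cost of the extra batch-log write and coordinator coordination described above.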
Another option would be to handle this on the client side. To do this we would need to make a change on the server side and only write the tags to metrics_idx after all of the writes to metrics_tags_idx have succeeded. If we perform the writes this way, then the metrics that Heapster gets back from querying metrics_idx are guaranteed to have all of their tags in metrics_tags_idx. The upside is that this is a more scalable and performant solution than logged batches. There are a couple of downsides. First and most obvious, the burden is placed on the client. We will need to make sure HOSA has similar behavior. Secondly, there might be a longer delay in getting tags in sync than there would be with logged batches. I say that because I am not sure how often Heapster sends tags to be stored.
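Reordering the writes changes the earlier failure sketch as follows: the metrics_idx write moves to the end, so a partial failure leaves the metric out of metrics_idx and Heapster's startup check will resend it. Same illustrative model as before.

```python
metrics_idx = {}
metrics_tags_idx = {}

def create_metric(metric, tags, fail_on_tag=None):
    # 1. Write every per-tag partition first; abort on any failure.
    for name, value in tags.items():
        if name == fail_on_tag:
            raise IOError("write timeout on partition " + name)
        metrics_tags_idx[(name, metric)] = value
    # 2. Only after all tag writes succeed, record the metric in metrics_idx.
    metrics_idx[metric] = dict(tags)

try:
    create_metric("request_count", {"x": "1", "y": "2", "z": "3"}, fail_on_tag="y")
except IOError:
    pass

# The metric never reached metrics_idx, so Heapster will resend its tags.
print("request_count" in metrics_idx)
```

With this ordering, presence in metrics_idx really does imply that all tags are in metrics_tags_idx, which is the invariant Heapster's startup check relies on.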
We also need to add retry logic on the server side for whichever approach we take.
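A minimal sketch of what that server-side retry might look like, assuming capped exponential backoff; the attempt count and delays are arbitrary illustrative choices, not a proposal for specific values.

```python
import time

def with_retries(write, max_attempts=3, base_delay=0.01, max_delay=1.0):
    """Retry a write on transient failure with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write()
        except IOError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure to the caller
            time.sleep(min(base_delay * 2 ** (attempt - 1), max_delay))

# Hypothetical write that fails twice, then succeeds.
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient Cassandra timeout")
    return "ok"

print(with_retries(flaky_write))
```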
Michael Burman I would like to get your thoughts on this since you maintain the Heapster sink.