infinispan > Exception based eviction
Yesterday

William Burns: not the same thing as partition DEGRADED
William Burns: if a read was null for an owner it would have to check all other owners
William Burns: writes would be fine
William Burns: assuming you don't get the ContainerFullException again
William Burns: it would stay in that state until another node comes up and relieves the memory pressure, using the ask-all-owners state transfer
William Burns: this is a lot of work, lol
William Burns: if we had all this, EXCEPTION based should work pretty well, with the caveat that if you are in this DEGRADED-like state you have to add a new node, otherwise you can lose data if you have another failure
Galder: You're likely to have many failures
Galder: Cos writes will be coming in
William Burns: that is fine, they would throw exceptions
William Burns: just like normal
Galder: So, what of all of this is in place and what is missing?
Galder: As it stands, I'm going with EXCEPTION and making the cache transactional
William Burns: we don't have this intermediate state thing or the special state transfer
Galder: And without it you cannot add a new node?
William Burns: so with EXCEPTION everything should work fine, just don't take a node down when you are full of memory, lol
Galder: Or you cannot restart a node with more memory?
William Burns: no, adding is fine
William Burns: the problem is taking down a node when you are full of memory
Galder: Ok, that's something that can be documented
Galder: As in, the best practice in this scenario is...
William Burns: gotcha
William Burns: so restarting and adding more memory would be problematic if the cache already has a node with full memory
Galder: We should definitely improve that
William Burns: so
William Burns: the key to look at is
William Burns: I added a JMX attribute a while back for the minimum number of nodes
William Burns: this should take all of this into account
William Burns: let me find it again
Galder: Ok
Galder: gtg, talk tomorrow, let's see what @Dan has to say :)
William Burns: ISPN-6879
William Burns: so if that guy returns how many nodes you currently have - don't shut one down :D
William Burns: it will check current eviction sizes and estimate, based on how much memory it has and num owners, how many nodes you have to keep up to not start losing data due to eviction
Tristan: 129 messages since I last checked ?
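[Editor's note] A rough illustration of the kind of estimate the ISPN-6879 attribute is described as doing above (eviction sizes, per-node memory, num owners). The class and method names below are hypothetical, not the real Infinispan API; only the inputs come from the discussion.

```java
// Hypothetical sketch of the minimum-node estimate described for ISPN-6879.
// With numOwners copies of every entry, the cluster must provide at least
// totalDataBytes * numOwners bytes of capacity, so the node count must satisfy
// nodes * perNodeCapacityBytes >= totalDataBytes * numOwners.
public final class MinNodesEstimate {

   public static int requiredMinimumNodes(long totalDataBytes,
                                          long perNodeCapacityBytes,
                                          int numOwners) {
      long requiredCapacity = totalDataBytes * numOwners;
      // Round up: a partial node's worth of data still needs a whole node.
      long nodes = (requiredCapacity + perNodeCapacityBytes - 1) / perNodeCapacityBytes;
      return (int) Math.max(nodes, numOwners);
   }

   public static void main(String[] args) {
      // 6 GiB of unique data, 4 GiB max size per node, 2 owners per entry:
      // 12 GiB of copies / 4 GiB per node => keep at least 3 nodes up.
      System.out.println(requiredMinimumNodes(6L << 30, 4L << 30, 2));
   }
}
```

If the attribute returns the same number of nodes as you currently have, shutting one down risks data loss, which is the "don't shut one down" rule Will states above.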
William Burns: lots of thinking being done :slight_smile:
William Burns: @Tristan I did however come up with a way for eviction caches to be more consistent when a node goes down while memory is already full
William Burns: note it isn't specific to EXCEPTION caches, normal eviction would also be helped by it
William Burns: don't know if we think "fixing" that would be helpful or not
Dan: "the other option is state transfer has to ask all owners for segments with EXCEPTION based" @William Burns I wouldn't want to do that, in a replicated cache it would mean a joiner would receive more or less the same entries from all the existing cluster members
William Burns: we would only do that for segments that were in this degraded state
William Burns: but yes, I agree that you would possibly receive a lot of entries multiple times
William Burns: actually
William Burns: sorry, REPL would be fine
William Burns: it couldn't get into this state
William Burns: since REPL doesn't require state transfer on a node dying
William Burns: or leaving*
Dan: "basically any entry that throws a ContainerFullException would cause its segment and any pending segments to be in some sort of DEGRADED like state" would this degraded-like state be kept only on the node where the insert/update failed, or would it be broadcast to the entire cluster? because on the node where the insert/update failed, all the segments it owns are in the same situation
William Burns: I would assume it is only required on each node individually
William Burns: since each node may have a different subset of segments it couldn't write
Dan: wouldn't it be just like keeping a flag on each node which says "I threw at least one ContainerFullException"?
William Burns: you could, but more than likely you will only have a small subset of segments like this at a time
Dan: if the other nodes don't know that the segment is full, they'll keep trying to write to it
Dan: I don't see why it would be a subset of segments, once one node's memory is full you can't write to any segment on that node
William Burns: and this state doesn't affect writes
William Burns: writes would behave the same way always
William Burns: it is just reads
William Burns: that would have a different behavior
William Burns: let me explain the use case again
Dan: please do, I don't think I understood
William Burns: let's say you have nodes A, B, C which are all "full"
William Burns: C goes down
William Burns: C owned segments 1-4
William Burns: A owns 1-2 and B owns 3-4
William Burns: A will try to transfer 1-2 to B
William Burns: but they all fail due to ContainerFullException
William Burns: and same for A when it tries to get 3-4 from B
William Burns: @Dan with me so far?
Dan: I wouldn't say A tries to transfer segments to B, because B is asking for them, but yeah
William Burns: I was shorthanding it
William Burns: since there is a lot to write :slight_smile:
William Burns: so now A has segments 3-4 in this degraded state
William Burns: since it may have some or none of the entries
William Burns: but it is still an owner
William Burns: if a read comes to A for a key in segment 3
William Burns: it would have to ask all other owners for the value if it doesn't have it
William Burns: same for B but segments 1-2
William Burns: writes would work exactly the same
William Burns: the cache could continue operation and be consistent
William Burns: if it loses a node at this point, data loss is guaranteed though
Dan: one question
Dan: if B tries to write a value to segment 3 and it doesn't fit on B, does segment 3 become degraded on B as well?
William Burns: no
William Burns: it isn't written to A or B
William Burns: it just returns a ContainerFullException
Dan: what if it fits in A but not in B?
William Burns: it still has to throw the exception
William Burns: so we can guarantee consistency
Dan: sure, but now we have an inconsistency
William Burns: ?
Dan: the key is on A, but not on B
William Burns: which key, sorry?
William Burns: B has all entries for segments 3-4
William Burns: it is just missing ones in 1-2
Dan: the key that we're trying to write to segment 3 after it got into this degraded mode
William Burns: so like I said, that key is never written to either node
William Burns: it throws an exception
William Burns: this is how EXCEPTION works right now
Dan: why would it not be written on A, if it fits in A's memory?
William Burns: because it can't be written into B
Dan: aren't the checks independent?
William Burns: when you have EXCEPTION based eviction
William Burns: you never perform a write from the user that doesn't fit into all owners
William Burns: a state transfer write is the only exception to this
Dan: ah ok, so the cache must still be transactional
William Burns: since it only writes to one owner at a time
William Burns: yes
Dan: I thought you wanted to make non-transactional caches work with EXCEPTION
William Burns: for this use case, yes
William Burns: I haven't thought about normal eviction yet
William Burns: no
William Burns: I want to make EXCEPTION based eviction work better
William Burns: if you have full memory and a node goes down
Dan: ok, so how does that work now?
William Burns: right now you get inconsistencies tbh
William Burns: :D
Dan: state transfer just skips the entries that don't fit?
William Burns: since the write is blocked during state transfer
William Burns: which means you have an entry on one owner that might not exist on another
Dan: blocked? so state transfer doesn't finish?
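[Editor's note] A sketch of the degraded-like read path Will describes earlier in this exchange: if a node owns a segment whose incoming state transfer hit ContainerFullException, a local miss cannot be trusted, so the read falls back to the other owners. The types and calls below are illustrative stand-ins, not the real Infinispan internals.

```java
import java.util.List;
import java.util.Set;

// Stand-in for a remote GET to another owner.
interface RemoteReader {
   Object readFrom(String node, Object key);
}

final class DegradedSegmentReads {
   private final Set<Integer> degradedSegments;   // segments this node could not fully receive
   private final RemoteReader remote;

   DegradedSegmentReads(Set<Integer> degradedSegments, RemoteReader remote) {
      this.degradedSegments = degradedSegments;
      this.remote = remote;
   }

   Object read(Object key, int segment, Object localValue, List<String> otherOwners) {
      if (localValue != null || !degradedSegments.contains(segment)) {
         return localValue;   // normal path: the local value is authoritative
      }
      // Degraded-like segment and a local miss: ask every other owner
      // before concluding that the key does not exist.
      for (String owner : otherOwners) {
         Object value = remote.readFrom(owner, key);
         if (value != null) {
            return value;
         }
      }
      return null;
   }
}
```

Writes are untouched, matching the discussion: user writes still either fit into all owners or throw ContainerFullException.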
William Burns: I believe it throws an exception
William Burns: I don't know where we catch it
Dan: maybe never, and state transfer hangs :)
William Burns: at worst case :slight_smile:
William Burns: let's assume for now it doesn't
Dan: ok, let's assume those entries are just skipped
William Burns: in that case you have some owners that have the value and some that don't
William Burns: and adding a new node doesn't fix this
Dan: right
William Burns: so I was trying to think of a way we could "live" with this temporarily
William Burns: until a new node comes back up
Dan: yeah, it makes sense now
William Burns: but I worry this is probably too much work for a feature we haven't used yet, lol
Dan: TBH I have no idea what's the use case that sparked this whole topic either :)
William Burns: the use case is a shared memory service
William Burns: don't ask me for more details than that :P
Dan: aha
William Burns: @Tristan might be able to illuminate it more
Dan: the idea of having owners that don't really have the entries reminds me a lot of scattered cache, where you have non-owners that do have the entries
Dan: BTW, when C leaves, besides its segments 1-4 being distributed between A and B, you could also have segment 5 moving from A to B and segment 6 moving from B to A
William Burns: well, I was assuming that the segments other than 1-4 were already owned by both A and B
William Burns: so like 5-8 and 9-12 were already both owned by A and B
Dan: right, I needed a more complex setup :) if you have nodes A, B, C, D with 1=AB, 2=AC, 3=AD, 4=BC, 5=BD, then when C dies you could get 1=AB, 2=DB, 3=BA, 4=BA, 5=BD, although you'd first have a union write CH, so 1=AB, 2=ADB, 3=ADB, 4=BA, 5=BD
William Burns: yes, you mean when you have 4 nodes
William Burns: it could change so that 2 non-owners both become owners
William Burns: in that case I guess we would be screwed :slight_smile:
Dan: and also segment 3 moves from D to B even though D hasn't left
Dan: both are possible
William Burns: that should be fine though
William Burns: as B would get the exception and be marked as degraded-like
William Burns: A would still have the value
Dan: good point, so it's only a problem if none of the owners in the pre-leave CH is an owner in the rebalanced post-leave CH
William Burns: the problem is if a segment loses all the owners it had previously
Dan: I guess we could always block the rebalance
William Burns: yes, precisely
William Burns: we could, but we only want to do that if a node will run out of memory
Dan: yeah, I wouldn't want to block the rebalance every time :)
Dan: but we could say that if a node can't apply state for a segment, it should never become a read owner of that segment
William Burns: this all sounds a bit complex though :D
William Burns: how would we know if we can apply all of the entries from a given segment without trying first? :D
William Burns: guess we need to keep track of size by segment
William Burns: haha
Dan: I like it more than asking all the owners for state on a join :D
William Burns: okay, well I defer to you re: the location of owners
William Burns: so it sounds like what you are proposing is that the coordinator would figure out the suggested CH
William Burns: ask the nodes if they have room for the given segments
Dan: it's like making a segment degraded-like, once you do it for one segment then all the segments that are being moved to that node will have the same fate
William Burns: and only apply the CH based on which can be applied?
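[Editor's note] A sketch of the condition Dan and Will converge on above: a rebalance is only dangerous for a segment if none of its pre-leave owners remains an owner in the rebalanced post-leave CH. A coordinator-side check along these lines could decide whether the rebalance may proceed or should be blocked. The types are stand-ins, not Infinispan's real ConsistentHash API.

```java
import java.util.List;
import java.util.Map;

final class OwnerOverlapCheck {

   /**
    * @param preLeaveOwners  segment -> owners before the node left
    * @param postLeaveOwners segment -> owners in the proposed rebalanced CH
    * @return true if every segment keeps at least one of its previous owners,
    *         i.e. no segment relies entirely on state transfer that might fail
    *         with ContainerFullException
    */
   static boolean everySegmentKeepsAnOwner(Map<Integer, List<String>> preLeaveOwners,
                                           Map<Integer, List<String>> postLeaveOwners) {
      for (Map.Entry<Integer, List<String>> entry : preLeaveOwners.entrySet()) {
         List<String> newOwners = postLeaveOwners.get(entry.getKey());
         boolean overlap = entry.getValue().stream().anyMatch(newOwners::contains);
         if (!overlap) {
            return false;   // e.g. Dan's 2=AC -> 2=DB case: consider blocking the rebalance
         }
      }
      return true;
   }
}
```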
Dan: no, not quite
William Burns: or just don't even apply a new CH
William Burns: until we can fit everything?
Dan: the coordinator already doesn't move from read_old_write_all to read_new_write_all until all the nodes confirm that they applied the new segments
Dan: what we'd need to do is, instead of ignoring ContainerFullExceptions, nodes would send the coordinator a list of segments that they couldn't apply
Dan: and the coordinator would install a new CH, not sure in what phase, with the segments that were successfully applied
William Burns: gotcha
Dan: of course, it would be nicer if we knew the size of a segment ahead of time
William Burns: yeah, we can fix that
Dan: so we wouldn't even request it from the previous owner when we're full
William Burns: we already talked about having segment-specific sizing for EXCEPTION eviction
William Burns: that is not hard to add
Dan: ok, that should work better
William Burns: we would know exactly how many bytes it would be
William Burns: disregarding concurrent updates
Dan: if we had that, I guess we could tell whether we'd get a problem like this before we even start the rebalance, and change the cluster health state to require another pod instead of starting the rebalance
William Burns: yeah, I am thinking that might be the best solution
William Burns: avoids a lot of heartache
Dan: :+1:
William Burns: not sure how that translates when not in OpenShift though, heh
William Burns: guess we just set some administrator thing
William Burns: in JMX
Dan: yeah, we need a "call administrator" attribute in JMX :)
Tristan: Ok, now I have 200 messages to read
William Burns: gotta keep up man!
William Burns: :wink:

infinispan > Exception based eviction
Today

Galder: Caught up with the discussion. To summarise things a bit, early on I proposed ISPN-9690 but it doesn't seem easy to do:
- Suggested options included some probabilistic way for an owner to know when it would topple the others. It'd require tracking key/value sizes.
- Use a cache with a passivated store: when memory is full, store the data on disk, and on shutdown passivate everything. This was deemed a risky approach. Passivation can work well, but under stress scenarios such as being about to run out of memory it could break, e.g. passivation takes longer than the pod shutdown timeout. It'd also require that when a non-owner receives a request and its memory is full, it still stores the entry in the local persistent store, skipping in-memory.
- We ended up agreeing that although transactions are a penalty, it's a more predictable penalty than the hiccups that result from passivation.
^ Anything else to add?
The discussion then moved on to what a user should do if the EXCEPTION strategy kicks in. Two options:
- Give the nodes more memory, restarting one at a time. In the current state of things, this would break because memory usage increases as a result of rebalancing. E.g. when going from 3 to 2 nodes, the data space required on the remaining nodes would increase. @William Burns came up with an idea to avoid this, maybe he can summarise it :)
- Add more nodes. Although the existing nodes would have to rebalance and would need memory to handle the rebalancing, their overall memory consumption should eventually drop as data gets rebalanced to the new nodes. This should be the best practice.
^ Anything else to add? @William Burns @Dan @Tristan
Tristan: Aside from basic protection within Infinispan itself, in the context of OpenShift the operator should be the thing that ensures that the cluster can survive a downscaling.
Dan: I forgot to say this yesterday, but I disagree a bit with Will on passivation: it can be much faster than regular persistence when the distribution of key reads/writes is skewed towards some keys, even when the data container is full. I'm also not sure about the need to write out in-memory data to disk on shutdown; when the node restarts it will have to request state from the other nodes anyway, and the local data might be stale. One thing is certain though, performance depends on the access pattern a lot more than with regular persistence.
Dan: About option 1, giving the nodes more memory: Will and I more or less concluded that we could track how much memory each segment uses, and then we could know before starting a rebalance whether it's going to lead to ContainerFullExceptions on any node. We could then block the rebalance and wait for the pod where the configuration changed to restart.
Dan: If an operator is doing the restart, it could just block rebalancing beforehand and we'd be fine (assuming no unintended crashes)
William Burns: yeah @Dan I mentioned it is faster if you are modifying/removing in-memory contents, which I assume is what you are referring to with it being skewed towards some keys
William Burns: @Galder talking with Dan more, we found it should be easier to just block the rebalance as he mentioned, until a new node can come to replace the missing one, instead of the read consistency thing I had mentioned
Dan: yeah, if the access pattern follows a uniform distribution then you're pretty much going to have a store write on every cache write, just like non-passivation persistence, but if it's a Poisson distribution you could save quite a bit on writes
William Burns: yes, agreed
William Burns: tbh the worst position for passivation is a read of store contents :D
William Burns: not only does it read the store, it also then has to remove that entry from the store and then possibly add a different entry
William Burns: so a store read, a store write and a store remove for 1 get operation :D
William Burns: and actually a write to a key that is in the store is the same thing
William Burns: that is why I mentioned the perf would be more consistent, and that passivation is really good at certain use cases is all
William Burns: as a user I would think more consistent performance is the best default myself
Galder: "Aside from basic protection within Infinispan itself, in the context of OpenShift the operator should be the thing that ensures that the cluster can survive a downscaling." We're not really talking about downscaling here, but more about nodes running out of memory.
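[Editor's note] A sketch of the pre-rebalance check Dan describes above: with a known per-segment size, the coordinator could estimate whether the segments a node would gain in the new CH fit under that node's max size before starting the rebalance. All names here are illustrative, not real Infinispan classes.

```java
import java.util.Map;
import java.util.Set;

final class RebalanceFitCheck {

   /**
    * @param segmentSizes       approximate bytes currently used by each segment
    * @param currentlyUsedBytes bytes the receiving node already uses
    * @param maxBytes           the node's configured max size (the EXCEPTION threshold)
    * @param segmentsToReceive  segments the proposed CH would move onto this node
    * @return true if the rebalance can proceed for this node, false if it should
    *         be blocked and the cluster health flagged (e.g. "needs another pod")
    */
   static boolean rebalanceFits(Map<Integer, Long> segmentSizes,
                                long currentlyUsedBytes,
                                long maxBytes,
                                Set<Integer> segmentsToReceive) {
      long incoming = 0;
      for (int segment : segmentsToReceive) {
         incoming += segmentSizes.getOrDefault(segment, 0L);
      }
      // Disregards concurrent updates, as noted in the discussion:
      // this is an estimate, not a guarantee.
      return currentlyUsedBytes + incoming <= maxBytes;
   }
}
```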
William Burns: well, if they tune the size parameter correctly that shouldn't happen
Galder: lol
William Burns: :slight_smile:
Galder: You'll probably only know the correct size when you run out of memory ;)
Galder: Not saying everyone will do that, but a lot will
William Burns: well, that is why you test stuff before putting it in prod
William Burns: :D
Galder: rofl
William Burns: what, it is true
Galder: You should go spend some time with wolf and you'll see how much testing before prod happens
Galder: ;)
William Burns: a lot of times customers use a non-critical system live to test stuff like that out
William Burns: which is similar
William Burns: at least in my experience before RH
William Burns: but anyway, the fix would help both cases - assuming only 1 node ran out of memory
Galder: "About option 1, giving the nodes more memory: Will and I more or less concluded that we could track how much memory each segment uses, and then we could know before starting a rebalance whether it's going to lead to ContainerFullExceptions on any node. We could then block the rebalance and wait for the pod where the configuration changed to restart." What about before a normal put? That's the case we were trying to solve with a transactional cache: the fact that a put could be partially applied as a result of one or all backup owners running out of memory.
Galder: @William Burns Btw, I was thinking about this further: as long as the owner node writes the key, does it really matter what happens on the backups?
William Burns: yeah, so we would need both the transactional cache and the new rebalancing logic
William Burns: you lose redundancy, and reads from the backups will be inconsistent
Galder: Lose redundancy: one of your nodes is running out, you have to do something: say you give the node(s) more memory and restart; as long as this is done before numOwners nodes go down, you're fine...
William Burns: yeah, but what if the node you restarted was the primary in this case
William Burns: you just lost the entry
Galder: Reads from the backups will be inconsistent: Hot Rod sends reads to owners...
Galder: If the node was primary and was running out of memory, you'd not have done any writes at all?
Galder: I mean, you're running out of memory, so it's expected that the primary owner can't write it?
Galder: I mean: the CH only cares about what the primary owners contain, doesn't it?
Galder: I mean when they have to rebalance the data
William Burns: @Dan would have to answer that
Galder: The problem would be if, after restarting this node, the rebalance would lead the new node to have the v-1 version of the data
William Burns: why don't you explain the scenario, and what if you had 2 nodes run out of memory at approximately the same time?
Galder: With 3 nodes and numOwners=2, you're f*
Galder: Unless you have persistent storage for it
William Burns: what do you mean by run out of memory?
William Burns: do you mean the pod crashing?
William Burns: rather, being killed
Galder: off-heap max memory size, EvictionStrategy.EXCEPTION
William Burns: or are you talking about the max size?
William Burns: k
Galder: max size
William Burns: so I think our discussion earlier around sizing was flawed then :D
William Burns: cause when you said out of memory I was thinking the pod was killed due to OOM
William Burns: not reaching the eviction size
Galder: With EvictionStrategy.EXCEPTION we're trying to make it nicer than OOM
Galder: Max size being reached won't make the node unavailable from a Kube perspective AFAIK
William Burns: no
Galder: So at least it's not going and restarting, so you have more margin...
William Burns: so I don't see why we are fucked at that point?
Galder: Puts will still fail
William Burns: if we have the new rebalance logic that Dan mentioned
William Burns: we would be able to safely restart a node with more memory
Galder: True
Galder: My main worry was about data consistency
Galder: The problem would be if, after restarting this node, the rebalance would lead the new node to have the v-1 version of the data
William Burns: I didn't see that message
William Burns: I don't see how that is possible with the rebalance logic
Galder: No worries
Galder: "I don't see how that is possible with the rebalance logic" In that case, are transactions really needed?
Galder: The only problem is if old versions stay around
William Burns: I would think they would still be needed
Dan: "I mean: the CH only cares about what the primary owners contain, doesn't it?" @Galder the CH contains both the primary and the backup owners of each segment in dist mode, only the primary in repl/scattered mode
Galder: So the primary owner gets a put and broadcasts it to 4 backup owners, 1 of which fails due to OOME... on rebalance the failed node gets the primary owner's version, which is OK, but the other backup owners should have correct data?
William Burns: also keep in mind a backup can be promoted to primary during a CH change
Dan: @Galder depends on what triggered the rebalance
Galder: Ok, let's assume that the other backup owners have had the correct data
Galder: The only danger I see is if the node that has reached max size would somehow become primary!
William Burns: if you have a pending rebalance it could
William Burns: or if you have more than 1 node hit the max
Dan: Agreed, the primary change only happens in the 3rd rebalance phase, read_new_write_all, but you could already be in the middle of a rebalance when you get the ContainerFullException
William Burns: now I agree that it should work just fine without tx like 99% of the time, but that 1 time :frown:
Dan: Actually it's much more likely to happen in the middle of a rebalance, because that's when the cluster has to hold numOwners + X copies of the segments
Galder: Yeah...
William Burns: yeah, during a rebalance you may have to hold 50% extra elements temporarily :frown:
Galder: 50% extra?
Galder: Wow
Dan: Maybe that's something else we need to rethink, especially in the context of exception eviction
William Burns: Dan would know which percentages, I guess it could even be double temporarily?
Galder: Could there be a way for a node that has had a ContainerFullException not to become a primary owner until restart?
Galder: Just thinking out loud...
Dan: Depends on numOwners and the number of nodes in the cluster, with numOwners=1 you could have an extra copy of each segment if you're unlucky
William Burns: :confused:
Galder: For now, going with transactions with EXCEPTION
Galder: I've got the PR ready
Galder: @Dan @William Burns Can you add extra JIRAs for related topics to ISPN-9690?
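[Editor's note] A minimal sketch of the setup Galder says he is going with here: a distributed, transactional cache with a bounded off-heap container and EXCEPTION eviction, so writes that would exceed the max size throw instead of evicting. Method names follow the Infinispan 9.4-era programmatic API and may differ in other versions; the size value is just an example.

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.cache.StorageType;
import org.infinispan.eviction.EvictionStrategy;
import org.infinispan.eviction.EvictionType;
import org.infinispan.transaction.TransactionMode;

public final class ExceptionEvictionConfig {

   static Configuration build() {
      return new ConfigurationBuilder()
            .clustering()
               .cacheMode(CacheMode.DIST_SYNC)
            .memory()
               .storageType(StorageType.OFF_HEAP)
               .evictionType(EvictionType.MEMORY)           // size is interpreted as bytes
               .evictionStrategy(EvictionStrategy.EXCEPTION) // throw instead of evicting
               .size(1L << 30)                               // example: 1 GiB per node
            .transaction()
               .transactionMode(TransactionMode.TRANSACTIONAL)
            .build();
   }
}
```

As discussed above, the transactional mode is what lets EXCEPTION refuse a user write unless it fits into all owners, rather than applying it on some owners and failing on others.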
Galder: I'll try to summarise this last discussion in the JIRA
William Burns: think you put the wrong JIRA #
Galder: Ups, yeah
Galder: ISPN-9690
William Burns: but it sounds like we need 1. segment size approximations when eviction is enabled 2. eviction-aware topology/state transfer
William Burns: @Dan do you think you can handle the latter JIRA, since you know more details there
Dan: @William Burns do we need this for any eviction, or just for exception eviction?
William Burns: I would think we would want this for regular eviction too, right?
William Burns: at least when a store isn't present
Dan: IDK, with regular eviction without persistence you don't usually care about data that doesn't fit
William Burns: yeah, that is how I have always thought about it - I guess only do it for EXCEPTION then
William Burns: I just thought it would be pretty much identical code, methinks
William Burns: but let's not enable it :slight_smile:
Dan: the code would be the same for caches without eviction, or for caches with eviction+persistence :D
William Burns: hrmm, not sure what you mean there - but I will change mine to be only for EXCEPTION - saves me work :slight_smile:
Dan: if by "eviction-aware topology/state transfer" you mean blocking the rebalance when the write CH segments would hold too much data to fit into the data container, then the code to check whether the DC is full and to block the rebalance is the same regardless of how eviction and persistence are configured
William Burns: yes, that is what I was talking about
William Burns: are there more pieces we need?
Dan: no, what I meant is that we already have to check that there's no persistence, so adding a check that the eviction mode is EXCEPTION makes no difference
William Burns: yeah, exactly
William Burns: but from our talk I will only add exposing the segment sizes when EXCEPTION is configured
William Burns: so you will have to add the check
William Burns: :slight_smile:
William Burns: less code to worry about
Dan: assuming I'm the one who will implement this :)
William Burns: :slight_smile:
Galder: This is not urgent btw
William Burns: when would we want it though?
Galder: ES.EXCEPTION is an optional param for the cache service
William Burns: after we prove that EXCEPTION is something we want to do/support? :D
Galder: I'd wait and see how many users end up going for that choice in the cache service
Galder: We added the option as a result of discussions with Markito in the OpenShift breakout
William Burns: k, sounds good
William Burns: I won't worry about it after this discussion unless you or someone else prods us/me about it :slight_smile:
Galder: Yeah, that's the right approach
Galder: Good to have the discussion though... I'll copy-paste the entire discussion into a file and add it to the JIRA
William Burns: k cool
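[Editor's note] A sketch of item 1 from Will's list above, "segment size approximations when eviction is enabled": keep a running byte count per segment as entries are written and removed, so other components (the ISPN-6879 estimate, the pre-rebalance fit check) can read it cheaply. The class is illustrative only and ignores the details of real off-heap entry sizing.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class SegmentSizeTracker {
   private final AtomicLongArray bytesPerSegment;

   SegmentSizeTracker(int numSegments) {
      this.bytesPerSegment = new AtomicLongArray(numSegments);
   }

   /** Called when an entry is added or replaced; the delta may be negative on replace. */
   void onWrite(int segment, long sizeDeltaBytes) {
      bytesPerSegment.addAndGet(segment, sizeDeltaBytes);
   }

   /** Called when an entry is removed or expired. */
   void onRemove(int segment, long entrySizeBytes) {
      bytesPerSegment.addAndGet(segment, -entrySizeBytes);
   }

   /** Approximate bytes held by one segment; disregards concurrent updates. */
   long segmentBytes(int segment) {
      return bytesPerSegment.get(segment);
   }
}
```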