infinispan > Exception based eviction
Yesterday

William Burns: not the same thing as partition DEGRADED
William Burns: if a read was null for an owner it would have to check all other owners
William Burns: writes would be fine
William Burns: assuming you don't get the ContainerFullException again
William Burns: it would stay in that state until another node comes up and relieves the memory pressure, using the ask-all-owners state transfer
William Burns: this is a lot of work, lol
William Burns: if we had all this, EXCEPTION based should work pretty well, with the caveat that if you are in this DEGRADED-like state you have to add a new node, otherwise you can lose data if you have another failure
Galder: You're likely to have many failures
Galder: Cos writes will be coming in
William Burns: that is fine, they would throw exceptions
William Burns: just like normal
Galder: So, what of all of this is in place and what is missing?
Galder: As it stands, I'm going with EXCEPTION and making the cache transactional
William Burns: we don't have this intermediate state thing or the special state transfer
Galder: And without it you cannot add a new node?
William Burns: so with EXCEPTION everything should work fine, just don't take a node down when you are full of memory, lol
Galder: Or you cannot restart a node with more memory?
William Burns: no, adding is fine
William Burns: the problem is taking down a node when you are full of memory
Galder: Ok, that's something that can be documented
Galder: As in, the best practice in this scenario is...
William Burns: gotcha
William Burns: so restarting and adding more memory would be problematic if the cache already has a node with full memory
Galder: We should definitely improve that
William Burns: so
William Burns: the key to look at is
William Burns: I added a JMX attribute a while back for the minimum number of nodes
William Burns: this should take all of this into account
William Burns: let me find it again
Galder: Ok
Galder: gtg, talk tomorrow, let's see what @Dan has to say :)
William Burns: ISPN-6879
William Burns: so if that guy returns how many nodes you currently have - don't shut one down :D
William Burns: it will check current eviction sizes and estimate, based on how much memory it has and num owners, how many nodes you have to keep up to not start losing data due to eviction
Tristan: 129 messages since I last checked ?
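[Editor's note] A rough illustration of the kind of estimate the ISPN-6879 attribute is described as doing above (eviction sizes, per-node memory, num owners). The class and method names below are hypothetical, not the real Infinispan API; only the inputs come from the discussion.

```java
// Hypothetical sketch of the minimum-node estimate described for ISPN-6879.
// With numOwners copies of every entry, the cluster must provide at least
// totalDataBytes * numOwners bytes of capacity, so the node count must satisfy
// nodes * perNodeCapacityBytes >= totalDataBytes * numOwners.
public final class MinNodesEstimate {

   public static int requiredMinimumNodes(long totalDataBytes,
                                          long perNodeCapacityBytes,
                                          int numOwners) {
      long requiredCapacity = totalDataBytes * numOwners;
      // Round up: a partial node's worth of data still needs a whole node.
      long nodes = (requiredCapacity + perNodeCapacityBytes - 1) / perNodeCapacityBytes;
      return (int) Math.max(nodes, numOwners);
   }

   public static void main(String[] args) {
      // 6 GiB of unique data, 4 GiB max size per node, 2 owners per entry:
      // 12 GiB of copies / 4 GiB per node => keep at least 3 nodes up.
      System.out.println(requiredMinimumNodes(6L << 30, 4L << 30, 2));
   }
}
```

If the attribute returns the same number of nodes as you currently have, shutting one down risks data loss, which is the "don't shut one down" rule Will states above.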
William Burns: lots of thinking being done :slight_smile:
William Burns: @Tristan I did however come up with a way for eviction caches to be more consistent when a node goes down while memory is already full
William Burns: note it isn't specific to EXCEPTION caches, normal eviction would also be helped by it
William Burns: don't know if we think "fixing" that would be helpful or not
Dan: "the other option is state transfer has to ask all owners for segments with EXCEPTION based" @William Burns I wouldn't want to do that, in a replicated cache it would mean a joiner would receive more or less the same entries from all the existing cluster members
William Burns: we would only do that for segments that were in this degraded state
William Burns: but yes, I agree that you would possibly receive a lot of entries multiple times
William Burns: actually
William Burns: sorry, REPL would be fine
William Burns: it couldn't get into this state
William Burns: since REPL doesn't require state transfer on a node dying
William Burns: or leaving*
Dan: "basically any entry that throws a ContainerFullException would cause its segment and any pending segments to be in some sort of DEGRADED like state" would this degraded-like state be kept only on the node where the insert/update failed, or would it be broadcast to the entire cluster? because on the node where the insert/update failed, all the segments it owns are in the same situation
William Burns: I would assume it is only required on each node individually
William Burns: since each node may have a different subset of segments it couldn't write
Dan: wouldn't it be just like keeping a flag on each node which says "I threw at least one ContainerFullException"?
William Burns: you could, but more than likely you will only have a small subset of segments like this at a time
Dan: if the other nodes don't know that the segment is full, they'll keep trying to write to it
Dan: I don't see why it would be a subset of segments, once one node's memory is full you can't write to any segment on that node
William Burns: and this state doesn't affect writes
William Burns: writes would behave the same way always
William Burns: it is just reads
William Burns: that would have a different behavior
William Burns: let me explain the use case again
Dan: please do, I don't think I understood
William Burns: let's say you have nodes A, B, C which are all "full"
William Burns: C goes down
William Burns: C owned segments 1-4
William Burns: A owns 1-2 and B owns 3-4
William Burns: A will try to transfer 1-2 to B
William Burns: but they all fail due to ContainerFullException
William Burns: and same for A when it tries to get 3-4 from B
William Burns: @Dan with me so far?
Dan: I wouldn't say A tries to transfer segments to B, because B is asking for them, but yeah
William Burns: I was shorthanding it
William Burns: since there is a lot to write :slight_smile:
William Burns: so now A has segments 3-4 in this degraded state
William Burns: since it may have some or none of the entries
William Burns: but it is still an owner
William Burns: if a read comes to A for a key in segment 3
William Burns: it would have to ask all other owners for the value if it doesn't have it
William Burns: same for B but segments 1-2
William Burns: writes would work exactly the same
William Burns: the cache could continue operation and be consistent
William Burns: if it loses a node at this point, data loss is guaranteed though
Dan: one question
Dan: if B tries to write a value to segment 3 and it doesn't fit on B, does segment 3 become degraded on B as well?
William Burns: no
William Burns: it isn't written to A or B
William Burns: it just returns a ContainerFullException
Dan: what if it fits in A but not in B?
William Burns: it still has to throw the exception
William Burns: so we can guarantee consistency
Dan: sure, but now we have an inconsistency
William Burns: ?
Dan: the key is on A, but not on B
William Burns: which key, sorry?
William Burns: B has all entries for segments 3-4
William Burns: it is just missing ones in 1-2
Dan: the key that we're trying to write to segment 3 after it got into this degraded mode
William Burns: so like I said, that key is never written to either node
William Burns: it throws an exception
William Burns: this is how EXCEPTION works right now
Dan: why would it not be written on A, if it fits in A's memory?
William Burns: because it can't be written into B
Dan: aren't the checks independent?
William Burns: when you have EXCEPTION based eviction
William Burns: you never perform a write from the user that doesn't fit into all owners
William Burns: a state transfer write is the only exception to this
Dan: ah ok, so the cache must still be transactional
William Burns: since it only writes to one owner at a time
William Burns: yes
Dan: I thought you wanted to make non-transactional caches work with EXCEPTION
William Burns: for this use case, yes
William Burns: I haven't thought about normal eviction yet
William Burns: no
William Burns: I want to make EXCEPTION based eviction work better
William Burns: if you have full memory and a node goes down
Dan: ok, so how does that work now?
William Burns: right now you get inconsistencies tbh
William Burns: :D
Dan: state transfer just skips the entries that don't fit?
William Burns: since the write is blocked during state transfer
William Burns: which means you have an entry on one owner that might not exist on another
Dan: blocked? so state transfer doesn't finish?
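[Editor's note] A sketch of the degraded-like read path Will describes earlier in this exchange: if a node owns a segment whose incoming state transfer hit ContainerFullException, a local miss cannot be trusted, so the read falls back to the other owners. The types and calls below are illustrative stand-ins, not the real Infinispan internals.

```java
import java.util.List;
import java.util.Set;

// Stand-in for a remote GET to another owner.
interface RemoteReader {
   Object readFrom(String node, Object key);
}

final class DegradedSegmentReads {
   private final Set<Integer> degradedSegments;   // segments this node could not fully receive
   private final RemoteReader remote;

   DegradedSegmentReads(Set<Integer> degradedSegments, RemoteReader remote) {
      this.degradedSegments = degradedSegments;
      this.remote = remote;
   }

   Object read(Object key, int segment, Object localValue, List<String> otherOwners) {
      if (localValue != null || !degradedSegments.contains(segment)) {
         return localValue;   // normal path: the local value is authoritative
      }
      // Degraded-like segment and a local miss: ask every other owner
      // before concluding that the key does not exist.
      for (String owner : otherOwners) {
         Object value = remote.readFrom(owner, key);
         if (value != null) {
            return value;
         }
      }
      return null;
   }
}
```

Writes are untouched, matching the discussion: user writes still either fit into all owners or throw ContainerFullException.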
William Burns: I believe it throws an exception
William Burns: I don't know where we catch it
Dan: maybe never, and state transfer hangs :)
William Burns: at worst case :slight_smile:
William Burns: let's assume for now it doesn't
Dan: ok, let's assume those entries are just skipped
William Burns: in that case you have some owners that have the value and some that don't
William Burns: and adding a new node doesn't fix this
Dan: right
William Burns: so I was trying to think of a way we could "live" with this temporarily
William Burns: until a new node comes back up
Dan: yeah, it makes sense now
William Burns: but I worry this is probably too much work for a feature we haven't used yet, lol
Dan: TBH I have no idea what's the use case that sparked this whole topic either :)
William Burns: the use case is a shared memory service
William Burns: don't ask me for more details than that :P
Dan: aha
William Burns: @Tristan might be able to illuminate it more
Dan: the idea of having owners that don't really have the entries reminds me a lot of scattered cache, where you have non-owners that do have the entries
Dan: BTW, when C leaves, besides its segments 1-4 being distributed between A and B, you could also have segment 5 moving from A to B and segment 6 moving from B to A
William Burns: well, I was assuming that the segments other than 1-4 were already owned by both A and B
William Burns: so like 5-8 and 9-12 were already both owned by A and B
Dan: right, I needed a more complex setup :) if you have nodes A, B, C, D with 1=AB, 2=AC, 3=AD, 4=BC, 5=BD, then when C dies you could get 1=AB, 2=DB, 3=BA, 4=BA, 5=BD, although you'd first have a union write CH, so 1=AB, 2=ADB, 3=ADB, 4=BA, 5=BD
William Burns: yes, you mean when you have 4 nodes
William Burns: it could change so that 2 non-owners both become owners
William Burns: in that case I guess we would be screwed :slight_smile:
Dan: and also segment 3 moves from D to B even though D hasn't left
Dan: both are possible
William Burns: that should be fine though
William Burns: as B would get the exception and be marked as degraded-like
William Burns: A would still have the value
Dan: good point, so it's only a problem if none of the owners in the pre-leave CH is an owner in the rebalanced post-leave CH
William Burns: the problem is if a segment loses all the owners it had previously
Dan: I guess we could always block the rebalance
William Burns: yes, precisely
William Burns: we could, but we only want to do that if a node will run out of memory
Dan: yeah, I wouldn't want to block the rebalance every time :)
Dan: but we could say that if a node can't apply state for a segment, it should never become a read owner of that segment
William Burns: this all sounds a bit complex though :D
William Burns: how would we know if we can apply all of the entries from a given segment without trying first? :D
William Burns: guess we need to keep track of size by segment
William Burns: haha
Dan: I like it more than asking all the owners for state on a join :D
William Burns: okay, well I defer to you re: the location of owners
William Burns: so it sounds like what you are proposing is that the coordinator would figure out the suggested CH
William Burns: ask the nodes if they have room for the given segments
Dan: it's like making a segment degraded-like, once you do it for one segment then all the segments that are being moved to that node will have the same fate
William Burns: and only apply the CH based on which can be applied?
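[Editor's note] A sketch of the condition Dan and Will converge on above: a rebalance is only dangerous for a segment if none of its pre-leave owners remains an owner in the rebalanced post-leave CH. A coordinator-side check along these lines could decide whether the rebalance may proceed or should be blocked. The types are stand-ins, not Infinispan's real ConsistentHash API.

```java
import java.util.List;
import java.util.Map;

final class OwnerOverlapCheck {

   /**
    * @param preLeaveOwners  segment -> owners before the node left
    * @param postLeaveOwners segment -> owners in the proposed rebalanced CH
    * @return true if every segment keeps at least one of its previous owners,
    *         i.e. no segment relies entirely on state transfer that might fail
    *         with ContainerFullException
    */
   static boolean everySegmentKeepsAnOwner(Map<Integer, List<String>> preLeaveOwners,
                                           Map<Integer, List<String>> postLeaveOwners) {
      for (Map.Entry<Integer, List<String>> entry : preLeaveOwners.entrySet()) {
         List<String> newOwners = postLeaveOwners.get(entry.getKey());
         boolean overlap = entry.getValue().stream().anyMatch(newOwners::contains);
         if (!overlap) {
            return false;   // e.g. Dan's 2=AC -> 2=DB case: consider blocking the rebalance
         }
      }
      return true;
   }
}
```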
Dan: no, not quite
William Burns: or just don't even apply a new CH
William Burns: until we can fit everything?
Dan: the coordinator already doesn't move from read_old_write_all to read_new_write_all until all the nodes confirm that they applied the new segments
Dan: what we'd need to do is, instead of ignoring ContainerFullExceptions, nodes would send the coordinator a list of segments that they couldn't apply
Dan: and the coordinator would install a new CH, not sure in what phase, with the segments that were successfully applied
William Burns: gotcha
Dan: of course, it would be nicer if we knew the size of a segment ahead of time
William Burns: yeah, we can fix that
Dan: so we wouldn't even request it from the previous owner when we're full
William Burns: we already talked about having segment-specific sizing for EXCEPTION eviction
William Burns: that is not hard to add
Dan: ok, that should work better
William Burns: we would know exactly how many bytes it would be
William Burns: disregarding concurrent updates
Dan: if we had that, I guess we could tell whether we'd get a problem like this before we even start the rebalance, and change the cluster health state to require another pod instead of starting the rebalance
William Burns: yeah, I am thinking that might be the best solution
William Burns: avoids a lot of heartache
Dan: :+1:
William Burns: not sure how that translates when not in OpenShift though, heh
William Burns: guess we just set some administrator thing
William Burns: in JMX
Dan: yeah, we need a "call administrator" attribute in JMX :)
Tristan: Ok, now I have 200 messages to read
William Burns: gotta keep up man!
William Burns: :wink:

infinispan > Exception based eviction
Today

Galder: Caught up with the discussion. To summarise things a bit, early on I proposed ISPN-9690 but it doesn't seem easy to do:
- Suggested options included some probabilistic way for an owner to know when it would topple the others. It'd require tracking key/value sizes.
- Use a cache with a passivated store: when memory is full, store the data on disk, and on shutdown passivate everything. This was deemed a risky approach. Passivation can work well, but under stress scenarios such as being about to run out of memory it could break, e.g. passivation takes longer than the pod shutdown timeout. It'd also require that when a non-owner receives a request and its memory is full, it still stores the entry in the local persistent store, skipping in-memory.
- We ended up agreeing that although transactions are a penalty, it's a more predictable penalty than the hiccups that result from passivation.
^ Anything else to add?
The discussion then moved on to what a user should do if the EXCEPTION strategy kicks in. Two options:
- Give the nodes more memory, restarting one at a time. In the current state of things, this would break because memory usage increases as a result of rebalancing. E.g. when going from 3 to 2 nodes, the data space required on the remaining nodes would increase. @William Burns came up with an idea to avoid this, maybe he can summarise it :)
- Add more nodes. Although the existing nodes would have to rebalance and would need memory to handle the rebalancing, their overall memory consumption should eventually drop as data gets rebalanced to the new nodes. This should be the best practice.
^ Anything else to add? @William Burns @Dan @Tristan
Tristan: Aside from basic protection within Infinispan itself, in the context of OpenShift the operator should be the thing that ensures that the cluster can survive a downscaling.
Dan: I forgot to say this yesterday, but I disagree a bit with Will on passivation: it can be much faster than regular persistence when the distribution of key reads/writes is skewed towards some keys, even when the data container is full. I'm also not sure about the need to write out in-memory data to disk on shutdown; when the node restarts it will have to request state from the other nodes anyway, and the local data might be stale. One thing is certain though, performance depends on the access pattern a lot more than with regular persistence.
Dan: About option 1, giving the nodes more memory: Will and I more or less concluded that we could track how much memory each segment uses, and then we could know before starting a rebalance whether it's going to lead to ContainerFullExceptions on any node. We could then block the rebalance and wait for the pod where the configuration changed to restart.
Dan: If an operator is doing the restart, it could just block rebalancing beforehand and we'd be fine (assuming no unintended crashes)
William Burns: yeah @Dan I mentioned it is faster if you are modifying/removing in-memory contents, which I assume is what you are referring to with it being skewed towards some keys
William Burns: @Galder talking with Dan more, we found it should be easier to just block the rebalance as he mentioned, until a new node can come to replace the missing one, instead of the read consistency thing I had mentioned
Dan: yeah, if the access pattern follows a uniform distribution then you're pretty much going to have a store write on every cache write, just like non-passivation persistence, but if it's a Poisson distribution you could save quite a bit on writes
William Burns: yes, agreed
William Burns: tbh the worst position for passivation is a read of store contents :D
William Burns: not only does it read the store, it also then has to remove that entry from the store and then possibly add a different entry
William Burns: so a store read, a store write and a store remove for 1 get operation :D
William Burns: and actually a write to a key that is in the store is the same thing
William Burns: that is why I mentioned the perf would be more consistent, and that passivation is really good at certain use cases is all
William Burns: as a user I would think more consistent performance is the best default myself
Galder: "Aside from basic protection within Infinispan itself, in the context of OpenShift the operator should be the thing that ensures that the cluster can survive a downscaling." We're not really talking about downscaling here, but more about nodes running out of memory.
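[Editor's note] A sketch of the pre-rebalance check Dan describes above: with a known per-segment size, the coordinator could estimate whether the segments a node would gain in the new CH fit under that node's max size before starting the rebalance. All names here are illustrative, not real Infinispan classes.

```java
import java.util.Map;
import java.util.Set;

final class RebalanceFitCheck {

   /**
    * @param segmentSizes       approximate bytes currently used by each segment
    * @param currentlyUsedBytes bytes the receiving node already uses
    * @param maxBytes           the node's configured max size (the EXCEPTION threshold)
    * @param segmentsToReceive  segments the proposed CH would move onto this node
    * @return true if the rebalance can proceed for this node, false if it should
    *         be blocked and the cluster health flagged (e.g. "needs another pod")
    */
   static boolean rebalanceFits(Map<Integer, Long> segmentSizes,
                                long currentlyUsedBytes,
                                long maxBytes,
                                Set<Integer> segmentsToReceive) {
      long incoming = 0;
      for (int segment : segmentsToReceive) {
         incoming += segmentSizes.getOrDefault(segment, 0L);
      }
      // Disregards concurrent updates, as noted in the discussion:
      // this is an estimate, not a guarantee.
      return currentlyUsedBytes + incoming <= maxBytes;
   }
}
```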
William Burns: well, if they tune the size parameter correctly that shouldn't happen
Galder: lol
William Burns: :slight_smile:
Galder: You'll probably only know the correct size when you run out of memory ;)
Galder: Not saying everyone will do that, but a lot will
William Burns: well, that is why you test stuff before putting it in prod
William Burns: :D
Galder: rofl
William Burns: what, it is true
Galder: You should go spend some time with wolf and you'll see how much testing before prod happens
Galder: ;)
William Burns: a lot of times customers use a non-critical system live to test stuff like that out
William Burns: which is similar
William Burns: at least in my experience before RH
William Burns: but anyway, the fix would help both cases - assuming only 1 node ran out of memory
Galder: "About option 1, giving the nodes more memory: Will and I more or less concluded that we could track how much memory each segment uses, and then we could know before starting a rebalance whether it's going to lead to ContainerFullExceptions on any node. We could then block the rebalance and wait for the pod where the configuration changed to restart." What about before a normal put? That's the case we were trying to solve with a transactional cache: the fact that a put could be partially applied as a result of one or all backup owners running out of memory.
Galder: @William Burns Btw, I was thinking about this further: as long as the owner node writes the key, does it really matter what happens on the backups?
William Burns: yeah, so we would need both the transactional cache and the new rebalancing logic
William Burns: you lose redundancy, and reads from the backups will be inconsistent
Galder: Lose redundancy: one of your nodes is running out, you have to do something: say you give the node(s) more memory and restart; as long as this is done before numOwners nodes go down, you're fine...
William Burns: yeah, but what if the node you restarted was the primary in this case
William Burns: you just lost the entry
Galder: Reads from the backups will be inconsistent: Hot Rod sends reads to owners...
Galder: If the node was primary and was running out of memory, you'd not have done any writes at all?
Galder: I mean, you're running out of memory, so it's expected that the primary owner can't write it?
Galder: I mean: the CH only cares about what the primary owners contain, doesn't it?
Galder: I mean when they have to rebalance the data
William Burns: @Dan would have to answer that
Galder: The problem would be if, after restarting this node, the rebalance would lead the new node to have the v-1 version of the data
William Burns: why don't you explain the scenario, and what if you had 2 nodes run out of memory at approximately the same time?
Galder: With 3 nodes and numOwners=2, you're f*
Galder: Unless you have persistent storage for it
William Burns: what do you mean by run out of memory?
William Burns: do you mean the pod crashing?
William Burns: rather, being killed
Galder: off-heap max memory size, EvictionStrategy.EXCEPTION
William Burns: or are you talking about the max size?
William Burns: k
Galder: max size
William Burns: so I think our discussion earlier around sizing was flawed then :D
William Burns: cause when you said out of memory I was thinking the pod was killed due to OOM
William Burns: not reaching the eviction size
Galder: With EvictionStrategy.EXCEPTION we're trying to make it nicer than OOM
Galder: Max size being reached won't make the node unavailable from a Kube perspective AFAIK
William Burns: no
Galder: So at least it's not going and restarting, so you have more margin...
William Burns: so I don't see why we are fucked at that point?
Galder: Puts will still fail
William Burns: if we have the new rebalance logic that Dan mentioned
William Burns: we would be able to safely restart a node with more memory
Galder: True
Galder: My main worry was about data consistency
Galder: The problem would be if, after restarting this node, the rebalance would lead the new node to have the v-1 version of the data
William Burns: I didn't see that message
William Burns: I don't see how that is possible with the rebalance logic
Galder: No worries
Galder: "I don't see how that is possible with the rebalance logic" In that case, are transactions really needed?
Galder: The only problem is if old versions stay around
William Burns: I would think they would still be needed
Dan: "I mean: the CH only cares about what the primary owners contain, doesn't it?" @Galder the CH contains both the primary and the backup owners of each segment in dist mode, only the primary in repl/scattered mode
Galder: So the primary owner gets a put and broadcasts it to 4 backup owners, 1 of which fails due to OOME... on rebalance the failed node gets the primary owner's version, which is OK, but the other backup owners should have correct data?
William Burns: also keep in mind a backup can be promoted to primary during a CH change
Dan: @Galder depends on what triggered the rebalance
Galder: Ok, let's assume that the other backup owners have had the correct data
Galder: The only danger I see is if the node that has reached max size would somehow become primary!
William Burns: if you have a pending rebalance it could
William Burns: or if you have more than 1 node hit the max
Dan: Agreed, the primary change only happens in the 3rd rebalance phase, read_new_write_all, but you could already be in the middle of a rebalance when you get the ContainerFullException
William Burns: now I agree that it should work just fine without tx like 99% of the time, but that 1 time :frown:
Dan: Actually it's much more likely to happen in the middle of a rebalance, because that's when the cluster has to hold numOwners + X copies of the segments
Galder: Yeah...
William Burns: yeah, during a rebalance you may have to hold 50% extra elements temporarily :frown:
Galder: 50% extra?
Galder: Wow
Dan: Maybe that's something else we need to rethink, especially in the context of exception eviction
William Burns: Dan would know which percentages, I guess it could even be double temporarily?
Galder: Could there be a way for a node that has had a ContainerFullException not to become a primary owner until restart?
Galder: Just thinking out loud...
Dan: Depends on numOwners and the number of nodes in the cluster, with numOwners=1 you could have an extra copy of each segment if you're unlucky
William Burns: :confused:
Galder: For now, going with transactions with EXCEPTION
Galder: I've got the PR ready
Galder: @Dan @William Burns Can you add extra JIRAs for related topics to ISPN-9690?
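[Editor's note] A minimal sketch of the setup Galder says he is going with here: a distributed, transactional cache with a bounded off-heap container and EXCEPTION eviction, so writes that would exceed the max size throw instead of evicting. Method names follow the Infinispan 9.4-era programmatic API and may differ in other versions; the size value is just an example.

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.cache.StorageType;
import org.infinispan.eviction.EvictionStrategy;
import org.infinispan.eviction.EvictionType;
import org.infinispan.transaction.TransactionMode;

public final class ExceptionEvictionConfig {

   static Configuration build() {
      return new ConfigurationBuilder()
            .clustering()
               .cacheMode(CacheMode.DIST_SYNC)
            .memory()
               .storageType(StorageType.OFF_HEAP)
               .evictionType(EvictionType.MEMORY)           // size is interpreted as bytes
               .evictionStrategy(EvictionStrategy.EXCEPTION) // throw instead of evicting
               .size(1L << 30)                               // example: 1 GiB per node
            .transaction()
               .transactionMode(TransactionMode.TRANSACTIONAL)
            .build();
   }
}
```

As discussed above, the transactional mode is what lets EXCEPTION refuse a user write unless it fits into all owners, rather than applying it on some owners and failing on others.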
Galder: I'll try to summarise this last discussion in the JIRA
William Burns: think you put the wrong JIRA #
Galder: Ups, yeah
Galder: ISPN-9690
William Burns: but it sounds like we need 1. segment size approximations when eviction is enabled 2. eviction-aware topology/state transfer
William Burns: @Dan do you think you can handle the latter JIRA, since you know more details there
Dan: @William Burns do we need this for any eviction, or just for exception eviction?
William Burns: I would think we would want this for regular eviction too, right?
William Burns: at least when a store isn't present
Dan: IDK, with regular eviction without persistence you don't usually care about data that doesn't fit
William Burns: yeah, that is how I have always thought about it - I guess only do it for EXCEPTION then
William Burns: I just thought it would be pretty much identical code, methinks
William Burns: but let's not enable it :slight_smile:
Dan: the code would be the same for caches without eviction, or for caches with eviction+persistence :D
William Burns: hrmm, not sure what you mean there - but I will change mine to be only for EXCEPTION - saves me work :slight_smile:
Dan: if by "eviction-aware topology/state transfer" you mean blocking the rebalance when the write CH segments would hold too much data to fit into the data container, then the code to check whether the DC is full and to block the rebalance is the same regardless of how eviction and persistence are configured
William Burns: yes, that is what I was talking about
William Burns: are there more pieces we need?
Dan: no, what I meant is that we already have to check that there's no persistence, so adding a check that the eviction mode is EXCEPTION makes no difference
William Burns: yeah, exactly
William Burns: but from our talk I will only add exposing the segment sizes when EXCEPTION is configured
William Burns: so you will have to add the check
William Burns: :slight_smile:
William Burns: less code to worry about
Dan: assuming I'm the one who will implement this :)
William Burns: :slight_smile:
Galder: This is not urgent btw
William Burns: when would we want it though?
Galder: ES.EXCEPTION is an optional param for the cache service
William Burns: after we prove that EXCEPTION is something we want to do/support? :D
Galder: I'd wait and see how many users end up going for that choice in the cache service
Galder: We added the option as a result of discussions with Markito in the OpenShift breakout
William Burns: k, sounds good
William Burns: I won't worry about it after this discussion unless you or someone else prods us/me about it :slight_smile:
Galder: Yeah, that's the right approach
Galder: Good to have the discussion though... I'll copy-paste the entire discussion into a file and add it to the JIRA
William Burns: k cool
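[Editor's note] A sketch of item 1 from Will's list above, "segment size approximations when eviction is enabled": keep a running byte count per segment as entries are written and removed, so other components (the ISPN-6879 estimate, the pre-rebalance fit check) can read it cheaply. The class is illustrative only and ignores the details of real off-heap entry sizing.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class SegmentSizeTracker {
   private final AtomicLongArray bytesPerSegment;

   SegmentSizeTracker(int numSegments) {
      this.bytesPerSegment = new AtomicLongArray(numSegments);
   }

   /** Called when an entry is added or replaced; the delta may be negative on replace. */
   void onWrite(int segment, long sizeDeltaBytes) {
      bytesPerSegment.addAndGet(segment, sizeDeltaBytes);
   }

   /** Called when an entry is removed or expired. */
   void onRemove(int segment, long entrySizeBytes) {
      bytesPerSegment.addAndGet(segment, -entrySizeBytes);
   }

   /** Approximate bytes held by one segment; disregards concurrent updates. */
   long segmentBytes(int segment) {
      return bytesPerSegment.get(segment);
   }
}
```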