Bug
Resolution: Done
Major
8.2.0.Final
The following data race, shown as an example sequence of events, can cause unrecoverable errors during indexing:
[node1] cache.put(key) // key maps to segment 48, owned by node1
[node1] starts shard 48
[node1] acquires lock on shard 48
[node1] starts writing to the index
[node1] topology change notification received, lock released on shard 48
[node1] lock reacquired (still writing to the index)
[node1] commit on shard 48
[node1] shard still locked
[node2] cache.put(key) // node2 now owns segment 48
[node2] starts shard 48
[node2] tries to acquire the lock on shard 48
[node2] fails (lock still held by node1)
During topology changes, the ShardIndexManager currently relies on a listener that closes the IndexWriter on all nodes whenever ownership changes, so that the lock is released and can be reacquired by the new owner (one segment maps to one shard).
Since writing to a shard can take some time, the listener can be triggered in the middle of an index operation. In that case, closing the index writer releases the lock only for a very short time: the in-flight operation immediately reacquires it, and the lock is never released again.
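The interleaving can be replayed deterministically in a single thread to make the failure easier to see. The following is only a minimal sketch: `ShardLock` and `ShardLockRace` are hypothetical names invented for illustration, and the lock is a plain owner reference rather than the real index lock or IndexWriter used by ShardIndexManager.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative model only; none of these types exist in Infinispan or Lucene.
final class ShardLockRace {

    // Hypothetical stand-in for the per-shard index lock.
    static final class ShardLock {
        private final AtomicReference<String> owner = new AtomicReference<>();

        // Succeeds if the lock is free or already held by the same node.
        boolean tryAcquire(String node) {
            return owner.compareAndSet(null, node) || node.equals(owner.get());
        }

        // Releases the lock only if the given node currently holds it.
        void release(String node) {
            owner.compareAndSet(node, null);
        }

        String owner() {
            return owner.get();
        }
    }

    // Replays the event sequence from the report and logs the outcome.
    static List<String> replay() {
        ShardLock shard48 = new ShardLock();
        List<String> log = new ArrayList<>();

        shard48.tryAcquire("node1"); // node1 starts writing to shard 48
        shard48.release("node1");    // topology listener closes the writer
        shard48.tryAcquire("node1"); // in-flight write reacquires the lock
        log.add("after commit, owner=" + shard48.owner());

        // node2 now owns segment 48 and tries to open the same shard.
        boolean acquired = shard48.tryAcquire("node2");
        log.add("node2 acquired=" + acquired);
        return log;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

In this model the release triggered by the listener has no lasting effect, because the write that is still in progress on node1 takes the lock back before node2 asks for it. node2's attempt therefore fails even though it is the new owner of the segment.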