
PXC – Incremental State transfers in detail


IST Basics

State transfers in Galera remain a mystery to most people.  Incremental State transfers (as opposed to full State Snapshot transfers) are used under the following conditions:

  • The Joiner node reports a valid Galera GTID to the cluster
  • The Donor node selected contains all the transactions the Joiner needs to catch up to the rest of the cluster in its Gcache
  • The Donor node can establish a TCP connection to the Joiner on port 4568 (by default)
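The first two conditions above boil down to a simple check: is the Joiner's GTID from the same cluster history, and does the Donor's Gcache still hold every write-set the Joiner is missing? Here is a small illustrative sketch of that decision in Python — my own simplification, not Galera's actual code (the TCP-connectivity condition is left out):

```python
# Illustrative sketch (not Galera internals): the donor can serve IST only
# when the joiner's GTID comes from the same cluster history and every
# missing write-set is still present in the donor's Gcache.

def can_use_ist(joiner_uuid, joiner_seqno,
                donor_uuid, gcache_first_seqno, gcache_last_seqno):
    """Return True when an incremental transfer is possible."""
    if joiner_seqno < 0 or joiner_uuid != donor_uuid:
        return False  # unknown or foreign GTID -> full SST required
    # The joiner needs every write-set after its seqno, starting at
    # joiner_seqno + 1; all of them must still be cached on the donor.
    return gcache_first_seqno <= joiner_seqno + 1 <= gcache_last_seqno + 1

# A joiner at seqno 900 against a Gcache holding 500..1000 can use IST:
print(can_use_ist("uuid-a", 900, "uuid-a", 500, 1000))   # True
# A joiner that fell behind the cache window (seqno 300) cannot:
print(can_use_ist("uuid-a", 300, "uuid-a", 500, 1000))   # False
```

If any condition fails, the Joiner falls back to a full SST instead.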

IST states

Galera has many internal node states related to Joiner nodes.  They currently are:

  1. Joining
  2. Joining: preparing for State Transfer
  3. Joining: requested State Transfer
  4. Joining: receiving State Transfer
  5. Joining: State Transfer request failed
  6. Joining: State Transfer failed
  7. Joined

I don’t claim any special knowledge of most of these states apart from what their titles indicate.  Many of these states occur very briefly, and it is unlikely you’ll ever actually see them in a node’s wsrep_local_state_comment.

During IST, however, I have observed that the following states have the potential to take a long while:

Joining: receiving State Transfer

During this state, transactions are streamed to the Joiner’s wsrep_local_recv_queue.  You can connect to the node at this time and poll its state.  If you do, you’ll easily see the inbound queue increasing (usually quickly) but no writesets being ‘received’ (read: applied).  It’s not clear to me whether there is a reason transaction apply couldn’t start during this stream, but currently it does not.

The further behind the Joiner is, the longer this can take.  Here’s some output from the latest release of myq-tools showing wsrep stats:

[root@node2 ~]# myq_status wsrep
mycluster / node2 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:40 P   4  2 J:Rc 0.4ms    0   0b   0    1 197b  4k   0ns   0   0   0  367k    0  94%
14:04:41 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  5k   0ns   0   0   0  368k    0  93%
14:04:42 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  6k   0ns   0   0   0  371k    0  92%
14:04:43 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  7k   0ns   0   0   0  373k    0  92%
14:04:44 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  8k   0ns   0   0   0  376k    0  92%
14:04:45 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 10k   0ns   0   0   0  379k    0  92%
14:04:46 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 11k   0ns   0   0   0  382k    0  92%
14:04:47 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 12k   0ns   0   0   0  386k    0  91%
14:04:48 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 13k   0ns   0   0   0  390k    0  91%
14:04:49 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 14k   0ns   0   0   0  394k    0  91%
14:04:50 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 15k   0ns   0   0   0  397k    0  91%
14:04:51 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 16k   0ns   0   0   0  401k    0  91%
14:04:52 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 18k   0ns   0   0   0  404k    0  91%
14:04:53 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 19k   0ns   0   0   0  407k    0  91%
14:04:54 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 20k   0ns   0   0   0  411k    0  91%

The node is in ‘J:Rc’ (Joining: Receiving) state and we can see the Inbound queue growing (wsrep_local_recv_queue). Otherwise this node is not sending or receiving transactions.
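If you want to watch this without myq-tools, you can poll the same counter yourself. The sketch below is hypothetical: `fetch_status` stands in for whatever client call you use to run SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue' and return its value as an integer — it is not a real API.

```python
import time

# Hypothetical monitoring sketch: repeatedly sample the inbound queue depth
# and record how much it changed between samples.

def watch_recv_queue(fetch_status, interval=1.0, samples=3):
    """Collect (sample, queue_depth, delta) readings of the inbound queue."""
    readings, prev = [], None
    for i in range(samples):
        depth = fetch_status("wsrep_local_recv_queue")
        readings.append((i, depth, None if prev is None else depth - prev))
        prev = depth
        if i < samples - 1:
            time.sleep(interval)
    return readings

# Fake fetcher mimicking the growing queue in the myq_status output above:
fake_values = iter([4000, 5000, 6000])
readings = watch_recv_queue(lambda name: next(fake_values), interval=0)
print(readings)   # [(0, 4000, None), (1, 5000, 1000), (2, 6000, 1000)]
```

A steadily positive delta while the node reports ‘Joining: receiving State Transfer’ is exactly the pattern shown above: writesets arriving but not yet applied.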

Joining

Once all the requested transactions are copied over, the Joiner flips to the ‘Joining’ state, during which it starts applying the transactions as quickly as the wsrep_slave_threads can go.  For example:

Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:55 P   4  2 Jing 0.6ms    0   0b   0 2243 3.7M 19k   0ns   0   0   0  2236  288  91%
14:04:56 P   4  2 Jing 0.5ms    0   0b   0 4317 7.0M 16k   0ns   0   0   0  6520  199  92%
14:04:57 P   4  2 Jing 0.5ms    0   0b   0 4641 7.5M 12k   0ns   0   0   0 11126  393  92%
14:04:58 P   4  2 Jing 0.4ms    0   0b   0 4485 7.2M  9k   0ns   0   0   0 15575  200  93%
14:04:59 P   4  2 Jing 0.5ms    0   0b   0 4564 7.4M  5k   0ns   0   0   0 20102  112  93%

Notice that the Inbound msgs (wsrep_received) start increasing rapidly and the queue decreases accordingly.

Joined

14:05:00 P   4  2 Jned 0.5ms    0   0b   0 4631 7.5M  2k   0ns   0   0   0 24692   96  94%

Towards the end the node briefly switches to the ‘Joined’ state, though that is a fast state in this case. ‘Joining’ and ‘Joined’ are similar states; the difference (I believe) is that:

  • ‘Joining’ is applying transactions acquired via the IST
  • ‘Joined’ is applying transactions that have queued up via standard Galera replication since the IST (i.e., everything has been happening on the cluster since the IST)

Flow control during Joining/Joined states

The Codership documentation says something interesting about ‘Joined’ (from experimentation, I believe the ‘Joining’ state behaves the same here):

Nodes in this state can apply write-sets. Flow Control here ensures that the node can eventually catch up with the cluster. It specifically ensures that its write-set cache never grows. Because of this, the cluster wide replication rate remains limited by the rate at which a node in this state can apply write-sets. Since applying write-sets is usually several times faster than processing a transaction, nodes in this state hardly ever effect cluster performance.

What this essentially means is that a Joiner’s wsrep_local_recv_queue is allowed to shrink but NEVER GROW during an IST catchup.  Growth will trigger flow control, but why would it grow?  Writes on other cluster nodes must still be replicated to our Joiner and added to the queue.
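The arithmetic behind that throttling is simple, and a toy model makes it concrete. This is my own illustration of the effect, not Galera internals: if the queue may shrink but never grow, replication cluster-wide cannot run faster than the Joiner can apply.

```python
# Toy model (assumptions mine, not Galera code): when the joiner applies
# slower than the cluster writes, flow control must pause replication for
# the difference so the joiner's recv queue never grows.

def flow_control_pause_fraction(cluster_write_rate, joiner_apply_rate):
    """Fraction of time the cluster is paused to cap the joiner's queue."""
    if joiner_apply_rate >= cluster_write_rate:
        return 0.0  # the joiner keeps up; no pauses needed
    # Replication is effectively throttled down to the joiner's apply rate:
    return 1.0 - joiner_apply_rate / cluster_write_rate

# Cluster writing 5000 trx/s against a cold joiner applying 2000 trx/s:
print(flow_control_pause_fraction(5000, 2000))   # 0.6
```

In that example the whole cluster spends roughly 60% of its time paused — which is why a slow Joiner is felt by your application.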

If the Joiner’s apply rate is less than the rate of writes coming from Cluster replication, flow control will be applied to slow down Cluster replication (read: your application writes).  As far as I can tell, there is no way to tune this or turn it off.  The Codership manual continues here:

The one occasion when nodes in the JOINED state do effect cluster performance is at the very beginning, when the buffer pool on the node in question is empty.

Essentially a Joiner node with a cold cache can really hurt performance on your cluster.  This can possibly be improved by:

  • Better IO and other resources available to the Joiner for a quicker cache warmup.  A prime example would be flash over conventional storage.
  • Buffer pool preloading
  • More Galera apply threads
  • etc.
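Two of those mitigations are plain configuration. The fragment below is illustrative only — the values are examples and should be tuned for your hardware and workload:

```ini
# Illustrative my.cnf fragment -- example values, not recommendations.
[mysqld]
# More parallel Galera appliers (often set near the number of CPU cores):
wsrep_slave_threads = 16

# Dump the buffer pool contents at shutdown and reload them at startup,
# so a restarted Joiner comes back with a warm cache (MySQL 5.6+):
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup  = ON
```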

Synced

From what I can tell, the ‘Joined’ state ends when the wsrep_local_recv_queue drops below the node’s configured flow control limit.  At that point it changes to ‘Synced’ and the node behaves more normally (with respect to flow control).
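That flow control limit is not simply gcs.fc_limit: the Galera documentation describes the effective trigger as gcs.fc_limit scaled by the square root of the cluster size, unless gcs.fc_master_slave is enabled. The helper below is my own illustration of that documented formula:

```python
import math

# Sketch of the documented effective flow-control trigger: unless
# gcs.fc_master_slave=YES, Galera scales gcs.fc_limit by the square root
# of the cluster size before comparing it to the recv queue depth.

def effective_fc_limit(fc_limit=16, cluster_size=3, master_slave=False):
    """Approximate queue depth at which flow control kicks in."""
    if master_slave:
        return fc_limit
    return int(fc_limit * math.sqrt(cluster_size))

# With the defaults on the 4-node cluster shown in the output above:
print(effective_fc_limit(16, 4))   # 32
```

So on this 4-node cluster the queue only needs to drain to a few dozen writesets before the node can flip to ‘Synced’.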

Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:05:01 P   4  2 Sync 0.5ms    0   0b   0 3092 5.0M   0   0ns   0   0   0 27748  150  94%
14:05:02 P   4  2 Sync 0.5ms    0   0b   0 1067 1.7M   0   0ns   0   0   0 28804  450  93%
14:05:03 P   4  2 Sync 0.5ms    0   0b   0 1164 1.9M   0   0ns   0   0   0 29954   67  92%
14:05:04 P   4  2 Sync 0.5ms    0   0b   0 1166 1.9M   0   0ns   0   0   0 31107  280  92%
14:05:05 P   4  2 Sync 0.5ms    0   0b   0 1160 1.9M   0   0ns   0   0   0 32258  606  91%
14:05:06 P   4  2 Sync 0.5ms    0   0b   0 1154 1.9M   0   0ns   0   0   0 33401  389  90%
14:05:07 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   1   0ns   0   0   0 34534  297  90%
14:05:08 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   0   0ns   0   0   0 35667  122  89%
14:05:09 P   4  2 Sync 0.5ms    0   0b   0 1121 1.8M   0   0ns   0   0   0 36778  617  88%

Conclusion

You may not notice these states during IST if you aren’t watching the Joiner closely, but if your IST is taking a long while, the output above should make it easy to understand what is happening.

The post PXC – Incremental State transfers in detail appeared first on Percona Data Performance Blog.


Viewing all articles
Browse latest Browse all 81

Latest Images

Trending Articles





Latest Images