Blue Series 3-way pairings stop working occasionally (possibly after power outage)

I spoke to an engineer about this last night and will discuss more when I get a chance. Some of these issues seem related to how the zigbee network is functioning, but I would expect things to get sorted out without too much intervention. Here was his response:


If a Zigbee repeater (router) goes offline, two devices that are bound to each other but rely on that repeater for signal forwarding will lose communication.

Here is the detailed technical explanation:

1. The Nature of Binding

In Zigbee, “Binding” establishes only a logical association. It tells the network: “When Device A changes state, send the message to Device B.”

  • The binding table is stored in the coordinator or the source device.
  • Binding does not create a direct physical link, nor does it bypass the network routing mechanism.
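
For illustration, creating that logical association in Zigbee2MQTT looks roughly like this (a minimal sketch; the friendly names and broker address are placeholders, not this network's real names):

```python
# Minimal sketch: creating a device-to-device binding through
# Zigbee2MQTT's bridge API. "kitchen_remote"/"kitchen_load" and the
# broker host are hypothetical placeholders.
import json
import paho.mqtt.publish as publish

publish.single(
    "zigbee2mqtt/bridge/request/device/bind",
    payload=json.dumps({"from": "kitchen_remote", "to": "kitchen_load"}),
    hostname="localhost",  # assumed MQTT broker
)
# This only writes a logical entry in the source device's binding table;
# it says nothing about how packets physically reach the target.
```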

2. Communication Relies on Routing

Zigbee is a Mesh Network. Data packets must travel via a physical path from the source to the destination.

  • Scenario: Device A and Device B are too far apart to talk directly and must rely on Repeater C.
    • Path: Device A → Repeater C → Device B.
  • When Repeater C goes offline:
    • The physical link is broken.
    • Although the “binding relationship” between A and B still exists logically, the data packet cannot find the next hop.
    • The network layer will attempt Route Discovery to find a new path. If no alternative path exists (i.e., no other routers can bridge the gap), the route discovery fails, and communication stops.
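
As a toy illustration, here is plain breadth-first search standing in for Zigbee's AODV-style discovery. It is not the real algorithm, but the outcome is the same: the binding can remain while delivery fails.

```python
from collections import deque

def find_route(links, src, dst):
    """Toy stand-in for Zigbee route discovery: search the current
    radio-link graph for any path. No physical path means no delivery,
    regardless of what the binding table says."""
    frontier, seen = deque([[src]]), {src}
    while frontier:
        path = frontier.popleft()
        if path[-1] == dst:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # route discovery failed

# A and B are out of direct range; C (and maybe D) bridge the gap.
links = {"A": ["C", "D"], "C": ["A", "B"], "D": ["A", "B"], "B": ["C", "D"]}
print(find_route(links, "A", "B"))  # ['A', 'C', 'B']

del links["C"]; links["A"].remove("C"); links["B"].remove("C")
print(find_route(links, "A", "B"))  # ['A', 'D', 'B'], self-healing via D

del links["D"]; links["A"].remove("D"); links["B"].remove("D")
print(find_route(links, "A", "B"))  # None: binding intact, but no path
```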

3. Exceptions (When they might still work)

Communication might persist only in these specific cases:

  • Direct Range: Device A and Device B are actually close enough to communicate directly. If the repeater fails, the Zigbee stack may automatically detect the direct link and switch to it.
  • Redundant Paths: There are other active Zigbee routers in the network that can form an alternative path (e.g., Device A → Repeater D → Device B). Zigbee’s self-healing capability will automatically reroute traffic through the new path.

Summary

  • Binding ≠ Direct Connection.
  • A repeater failure means a broken physical path.
  • Without an alternative route, communication will fail, even if the binding entry remains in the table.

Troubleshooting Steps:

  1. Check the power supply or status of the offline repeater.
  2. Restart other routers in the network to trigger a network re-routing (self-healing).
  3. If possible, move the two bound devices closer to test if they can communicate directly.

The coordinator can also act as a repeater in the path between bound devices.


Thanks for all the technical detail! That’s quite helpful, though it only makes me even more confused by this issue. Aren’t all of the Inovelli devices acting as repeaters? Also, most of the “3-way” pairs of switches are in line of sight of their partner, or at most around a corner, so in theory there shouldn’t be any situation where a repeater is needed, except maybe in the case of the garage.

The situation seems to occur after a complete power failure, where all of the devices, repeaters, and the coordinator go offline and come back online at the same time. But once power is restored, everything does come back online; there aren’t any switches that don’t get their power restored in this event, so there shouldn’t be a situation where one critical link in the chain stays down.

There are also so many repeater devices that I can’t imagine a situation where a link could be fully severed, except maybe for the garage, and even that is probably close enough to multiple other switches to find another route. If this issue was only happening for one or two devices, I would for sure be trying to figure out if there was a signal issue someplace. But it’s affecting every device.

In terms of routing and mesh coverage, it just seems like this should actually be a best case situation. So what could cause routing to fail in such a way?

Surely the main issue is that every device has lost power, and now the mesh has to re-establish itself completely. But for some reason it’s basically failing to do so? Do the devices remember their routing across power losses? Is the whole network being rebuilt from scratch?

@EricM_Inovelli @rohan

We have performed the following tests:

  • One of the switch pairs was still messed up from before; we checked that we COULD control it from the coordinator.
  • We proceeded to turn off one lighting circuit; the rest stayed on and so did the coordinator. The problem occurred.
  • We used the air gaps to disconnect just a few switches at a time, and restored them together. The problem occurred.
  • We used the air gap to disable just one pair, then restored them together. After multiple tests sometimes the problem occurred and sometimes it did not.
  • We tested air gapping either just the slave/remote or just the master/load and in either case the problem did not occur.
  • We air gapped two master switches and not the slaves, and after restoring them both together the problem occurred.
  • We air gapped two slave switches and not the masters, and after restoring them together the problem occurred.
  • We air gapped two master switches, then restored them one at a time. This resulted in a slightly different issue: one switch came on without issue, while the other flickered on and off multiple extra times.
  • We air gapped two pairs of switches, then restored them all at once. We then tried to toggle them from Z2M. Toggling the master switches from Z2M worked reliably. Toggling the slave switches from Z2M worked most of the time, but in a couple of instances the light flickered on and off or didn’t respond the first time. The coordinator is for sure able to keep controlling the master/load switches even when the slave/remote switches are not able to.
  • We air gapped two pairs, restored them, and waited 3 minutes. The problem still occurred.
  • We replaced the coordinator with an MG24-based one (which was not trivial :grimacing:). This did not seem to make any material difference to the main problem, though we might have slightly lower latency overall.

We also observed that in some cases, when two switches aren’t communicating properly, the response is very delayed rather than absent entirely, or the effects of multiple attempted toggles accumulate and all fire at once, causing the light to rapidly flicker on and off a few times. We were not able to get reliable or consistent results, other than that in the conditions noted above something would go wrong with the communication. But the most common outcome was absolutely no response at all for a few attempted presses of a switch before it finally started working again.

We have also noticed that despite being enabled on all of the switches, the two-tap-up-for-full-brightness feature isn’t working. No clue if that’s at all related, but I figured I would mention it for the sake of completeness.

Anyway, what we can conclude from these tests is:

  • The coordinator seems to have nothing to do with the problem.
  • The problem will occur when more than two switches lose and regain power at the same time (such as during a blackout or a breaker trip).
  • It doesn’t appear to matter which of the three types of switch are involved.
  • They will not automatically fix themselves over time; the only thing that fixes the problem state is toggling the switches until they start working again.
  • This happens despite switches having line of sight and no extreme distances between pairs.
  • Toggling the slave switches from Home Assistant as an automation might be able to “fix” the problem (see the sketch after this list), but this would be a really annoying band-aid solution since it would entail all the lights flickering on and off repeatedly whenever power was restored. Plus, this obviously should not be needed.
  • The problem does impact the one non-Inovelli device on the network (Lutron Aurora) the same way. (We don’t have any non-Inovelli load controlling devices to test, just this one remote).
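
For reference, the band-aid automation mentioned in the list above would look roughly like this as a sketch against Zigbee2MQTT's set topic; the switch names and broker address are placeholders:

```python
# Band-aid sketch (not recommended, per above): after power is restored,
# toggle each slave/remote twice so the generated traffic forces route
# repair. Friendly names and broker host are hypothetical placeholders.
import json
import time
import paho.mqtt.publish as publish

SLAVE_SWITCHES = ["hall_remote", "garage_remote"]  # placeholder names

for name in SLAVE_SWITCHES:
    for _ in range(2):  # two toggles, so each light ends where it started
        publish.single(
            f"zigbee2mqtt/{name}/set",
            payload=json.dumps({"state": "TOGGLE"}),
            hostname="localhost",  # assumed MQTT broker
        )
        time.sleep(1)
```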

I find it somewhat interesting that the Aurora dimmer is also temporarily losing the ability to interact with the Inovelli switches. That’s a battery powered device, so it never loses power regardless of the test. But when the device it is bound to loses power, the bindings get all confused just as they do with all the other devices.

It’s also interesting that resetting just two devices usually didn’t cause a problem (but sometimes still did) but resetting more than that basically always did. This implies that the mere fact that multiple devices are joining the mesh at the same time is sufficient to cause all of the joining devices to fail at proper routing.

Given all I have observed, it does seem like routing is the problem, and that when multiple Inovelli devices reset at the same time they somehow throw all the routing off.

Is it possible to disable the routing functionality on an Inovelli switch? Perhaps if there were fewer routers it would be less prone to confusing the whole network?

Are there other (non-Inovelli) zigbee devices in place there that can help as repeaters in the overall zigbee mesh?

I intend no shade to Inovelli here since I fully acknowledge that this all could very well just be unique to my own setup, but for whatever reason, none of my Blues are great repeaters (or maybe I should just say “not popular”). I don’t rely on them to be any part of my zigbee mesh’s backbone.

I only have one binding setup in place amongst my 9 Blues, so I realize I’m not anywhere close to the overall scale y’all are at there.

There are 69 Inovelli devices and one non-repeater device from Lutron. So there are no non-Inovelli repeaters in the mesh at this point.

When you say they aren’t great repeaters, what do you mean? What issues are they causing?

There are no issues for me, I just notice (using some zigbee mapping tools available in Hubitat) that my Blues aren’t used much as repeaters. They do some repeating, but other nearby devices are just noticeably much more active in the mesh’s routing.

What you’re running into isn’t a loss of binding, but a temporary routing issue in the Zigbee mesh after multiple devices reboot at the same time.

Zigbee devices don’t necessarily talk directly to each other; even when bound, they still rely on the mesh network to deliver messages.

When several devices power off and back on together (like during a breaker trip), the network can come back in a partially “unsettled” state where some device-to-device paths aren’t immediately rebuilt.

The reason pressing the switch fixes it is that those button presses generate traffic that forces the network to rebuild the route between the devices. Once that path is re-established, everything works normally again.

Waiting doesn’t resolve it because Zigbee only repairs routes when traffic is sent; it doesn’t continuously fix all paths in the background.

Since your coordinator can still control the switches, we know the devices are connected to the network; the issue is specifically with device-to-device communication paths recovering after power restoration.

An interesting part of what you are seeing is that even a device that is bound to another must rely on the mesh to communicate with it. Zigbee binding is only a logical mapping; the actual packet still has to traverse the mesh using normal routing. Zigbee routing is AODV-style and depends on neighbor/route tables, link status, and route discovery/repair. APS/NWK retries can also cause delayed delivery, which matches the “nothing happens, then several toggles fire at once” symptom.
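
As a toy illustration of that last symptom (hypothetical queueing logic, not the actual firmware behavior):

```python
# Toy model: while the route is down, each press just queues another
# retry instead of being delivered. Once a route exists again, the
# queued toggles flush back-to-back, which looks like the light rapidly
# flickering on and off. Real APS retries are bounded; this is a sketch.
pending = []
route_up = False

def deliver(n):
    print(f"toggle #{n} delivered")

def press_button(n):
    if route_up:
        deliver(n)
    else:
        pending.append(n)  # delivery failed; retry is queued

for n in range(1, 5):  # user presses the paddle four times; nothing happens
    press_button(n)

route_up = True         # route discovery finally succeeds
for n in pending:       # queued retries all fire at once
    deliver(n)
pending.clear()
```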

You may already have things set up this way, but if not, I would recommend using group binding where possible, especially if you have a lot of power outages. Group binding sends commands across the network via broadcasts, so there are no device-specific routes that can be invalid or stale. The communication is still device-to-device, but one device doesn’t need to know the exact route to the other device on the network.
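
For reference, setting that up through Zigbee2MQTT’s MQTT API would look roughly like this; a sketch where the group and device names are placeholders:

```python
# Sketch: put both switches in one Z2M group and bind each switch to
# the group instead of to the other switch. "3way_kitchen" and the
# switch names are hypothetical placeholders.
import json
import paho.mqtt.publish as publish

def z2m(request, payload):
    publish.single(f"zigbee2mqtt/bridge/request/{request}",
                   payload=json.dumps(payload),
                   hostname="localhost")  # assumed MQTT broker

z2m("group/add", {"friendly_name": "3way_kitchen"})
for switch in ("kitchen_master", "kitchen_remote"):
    z2m("group/members/add", {"group": "3way_kitchen", "device": switch})
    z2m("device/bind", {"from": switch, "to": "3way_kitchen"})
```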


Group bindings caused a different sort of problem when we tried them. I mentioned this a few days ago, but basically, when we used groups the bindings all completely broke whenever the coordinator wasn’t online, which led to abandoning that approach. Also, there are only two instances where more than two devices are bound to each other, and in both cases it’s still only 3 devices.

I’ve also heard that broadcasts can be problematic because they generate a lot of network activity?

Anyway, I think the technical explanation you gave makes sense: the switches still know what devices they are bound to and still send out messages to those devices, but the messages get lost in transit or end up taking an extremely suboptimal path. And it makes sense why multiple devices joining the network at once could have an impact on this.

What I don’t get is why it’s this bad given just how many alternative routes should exist, and the fact that in almost every instance the shortest path between any of the bound devices should actually just be direct since they’re in line of sight with each other. That is, it really sounds like the network is being built in a very unstable way for some reason when it should be rock solid given what devices are involved and where.

And then there’s the question of fixing it. I guess we can continue trying to figure out the group binding thing; I’m still not sure what the problem was there, but if groups should work without the coordinator, it would be nice to get that happening. But it also sounds like it should work how it’s set up now.

Is there something akin to a ping I can do that would force the network to repair routes without doing anything visible/annoying? An automation to send out pings and heal the mesh seems like a potential band-aid at least.
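
Something like the sketch below is what I have in mind: reading state via Z2M’s get topic to generate traffic without visibly changing anything. Though since that only exercises the coordinator-to-device path (which we know still works), I’m guessing it might not touch the broken device-to-device routes. Names and broker are placeholders:

```python
# Sketch of a "quiet ping": publish a state read to Zigbee2MQTT's /get
# topic so traffic flows without toggling anything visible. Caveat:
# this goes coordinator-to-device, so it may not repair the broken
# device-to-device paths. Friendly names are hypothetical placeholders.
import json
import paho.mqtt.publish as publish

for name in ["kitchen_master", "kitchen_remote"]:  # placeholder names
    publish.single(
        f"zigbee2mqtt/{name}/get",
        payload=json.dumps({"state": ""}),
        hostname="localhost",  # assumed MQTT broker
    )
```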

But it still seems like we’re missing something here, I don’t see a bunch of people jumping in saying “oh yeah that’s normal”, so I’m left wondering what is abnormal about this deployment?

I guess I’m at a little bit of a loss here. The binding instructions explicitly tell you to use individual bindings if you have 2 devices, and group if you have more than 2. Yet the only thing even close to a suggestion of what to do that has been posted in this thread is to try to set everything up as groups.

Are there really no other troubleshooting steps or suggestions for what to do from here? Sure, we can try group bindings. But I’m not sure why they would work differently now than when we tried them initially.

So what if that doesn’t fix it? What’s the next step from there? It would be nice to have a few things to try each time we get together to troubleshoot.

I’m left wondering the same thing. I’ve not seen any deployments here of comparable or larger size where users are mentioning this behavior. I’ve not tried cutting power to my own house intentionally here but I definitely have situations where I turn off specific circuits and I cannot reproduce this at all.

I’ve got 70 devices in my Zigbee mesh, but only about 30 of them are Inovelli devices. Maybe it’s something to do with the fact that this network is nearly 100% Inovelli?

You may be at a point where you need to run a zigbee sniffer: Sniff Zigbee traffic | Zigbee2MQTT and try to understand what’s going on from the packet captures. Hopefully you still have the second Zigbee dongle to use with a laptop to do that.


I will discuss with our firmware guys whether there is some method we can use to re-establish routes better in situations like this, but this is usually handled by the SDK and we don’t usually modify it. Maybe there is a setting that can be tweaked? I can also discuss the situation with him further. In the meantime, I recommend that you give group binding another shot, since multicast is more forgiving in the scenarios you are describing. It is possible that it won’t be perfect, since it still relies on the mesh being intact enough for the devices involved to receive the messages. Can you test this:

Switch A (EP1) - Group 1
Switch B (EP1) - Group 1

Switch A Bind EP2 > Group 1
Switch B Bind EP2 > Group 1

If you disconnect the coordinator and this breaks, then that means the coordinator is being used as part of the route for the broadcast message. For example:

Switch A > coordinator (used as router) > Switch B
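
If it helps, here is roughly how that test could be driven through Z2M’s MQTT API. This is a sketch: the names are placeholders, and the “/1” and “/2” endpoint-suffix syntax is an assumption you may need to adjust to how your devices expose their endpoints:

```python
# Sketch of the test above via Zigbee2MQTT's bridge API. "group_1",
# "switch_a"/"switch_b", and the "/1" and "/2" endpoint suffixes are
# assumptions; adjust to how EP1/EP2 show up for your devices in Z2M.
import json
import paho.mqtt.publish as publish

def z2m(request, payload):
    publish.single(f"zigbee2mqtt/bridge/request/{request}",
                   payload=json.dumps(payload),
                   hostname="localhost")  # assumed MQTT broker

z2m("group/add", {"friendly_name": "group_1"})
for switch in ("switch_a", "switch_b"):
    # EP1 joins the group; EP2 is bound to the group.
    z2m("group/members/add", {"group": "group_1", "device": f"{switch}/1"})
    z2m("device/bind", {"from": f"{switch}/2", "to": "group_1"})
# Then: disconnect the coordinator and test switch-to-switch control.
```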

@rohan @EricM_Inovelli

I went ahead and gave group binding another shot. I have 2 switches and 2 hue bulbs on the same group. I made sure to follow the tutorial for group bindings exactly before posting here.

When the coordinator is turned off, and everything else is left the same (71 switches and 2 bulbs on the zigbee network), it maybe functions 5-10% of the time.

I then went around and turned off every other light switch, either by killing a breaker or pulling the air gap. One switch did not fully turn off when the air gap was pulled: it stopped controlling the light that is wired to it, but it would not fully shut off. So at that point there were only 3 Inovelli switches and 2 hue bulbs on the zigbee network (with no coordinator).

Each time I killed an additional group of switches, I would go back and test the “group” in question. It improved slightly each time, until in the end case of only 3 switches and 2 bulbs on the network with no coordinator, it was working about 40% of the time. For example, it would turn on and off 5 times in a row, then not work 4 times in a row. Sometimes it would be 3 and 6, sometimes 4 and 4… It varied but never worked every time. I probably toggled it on and off 300 times before I turned everything back on.

Once I went ahead and connected the coordinator again, it functioned every time.

Physically, these switches and bulbs are nowhere near the coordinator. The switches and bulbs are at the back of the house on the main floor, and the coordinator is at the front of the house in the basement. Even if the one switch that wouldn’t turn off (which is physically right next to one of the switches in the group) was handling all of the traffic for the group when everything else was off, it should be able to handle 2 switches and 2 bulbs. There’s no reason for it not to work in that configuration.

Interestingly, I also went ahead and set up a scene for watching movies. It turns off every light in the great room/kitchen/foyer area. Without fail, that scene will time out before it finishes turning off all 20 or so lights in the area. It will turn off like 5 at a time, then wait 2 seconds, then turn off another 5, then wait 2 seconds, etc., until there are a handful left on at the end. I have to run the scene twice to get everything to turn off.

Oh, and one more thing that may or may not be a clue. The built-in double tap feature often does not work. I cannot figure out what makes it work and what makes it not work. It usually just doesn’t do anything. Then I’ll go to show @zeel how it doesn’t work, and it works. Today it was back to not working.

bump

We noticed a new problem where many of the entities for the lights were showing an “unknown” value in HA; I think it might have been like that since changing the coordinator. The process for swapping that was kind of tricky and not technically supported by Z2M, so I’m not surprised that it caused some additional issues… at least I hope that’s why they started reporting as unknown.

Anyway, it seems like hitting “reconfigure” or “interview” on each one fixed this (sometimes reconfigure failed, but interview worked?) and now I’ve got all the entities back. I really doubt this has anything to do with the main issue but it seemed worth mentioning. We haven’t had time to test bindings since doing that, I doubt it will make a difference but we will definitely try it.

Is this controlling each bulb individually, or is it controlling the group? If it’s the group, it should be every single light at once (no popcorn effect).

I think we’ve already reached the point where we can’t help without clear traffic sniffer logs during each of the tests that you are doing. See my previous note about it here.

We need to know what’s being transmitted on the network when the failures happen and a traffic sniffer is basically the only way to do that. Those logs can help @EricM_Inovelli and the firmware engineer to figure out what’s happening.

Yeah, I remember having to go through that process (hitting reconfigure on each device) when I migrated as well a couple of months ago.


Alright, I’ve ordered the sniffer. Hopefully it tells us something useful.

I had a thought/question. I can see how a given router going offline would impact the network; packets that should have passed through it would need to be re-routed. But why does this issue happen after a device is powered back on? If I air gapped half the switches, tested the ones that were still on, and those were broken, it would make plenty of sense. But if I don’t press any buttons until after power is restored to everything, why would the routes be messed up? Shouldn’t every router retain its routing data even when it loses power? And if no packets are being sent, wouldn’t the routes stay as they were rather than being rebuilt?

It seems like the mesh shouldn’t break like this unless part of it was offline and part was online. But if everything goes offline and comes back up, then it should all just restore to what it was doing before, right? Why would the routes be broken between devices if those devices were working before the power loss and never tried to communicate until they were all back online?

I wonder if some devices are trying to send updates for things like power usage, temperature, humidity, or presence (if you have those) before everything powers back on, and those sends fail and try to repath, and that triggers a cascade.

For this to be the case, I would expect a couple of outcomes.

1: When any switch loses its route through the other switches to the master switch it’s bound to, it remains disconnected until I manually press the buttons. Meaning, in tests where I turn off half of the switches, the other half should be broken until I manually press the switch a few times.

2: Those updates for temp, presence, lux, humidity, power usage, etc. (of which there are always at least 10 per second given the quantity of switches I have) would trigger the route to rebuild the same way that manually toggling the switch does.

Neither of those things is the case, though. Switches that weren’t themselves powered off never enter the state where they can’t talk to a switch they’re bound to and need half a dozen manual toggles to recover.

And switches that were powered off and back on can still be talked to by Home Assistant without manual toggling, meaning they only lost the route to each other, not to Home Assistant.

Now for #2, maybe those communications are the only thing that causes the route to Home Assistant to be rebuilt. And because there is never a communication sent between the switches bound together, that route never rebuilds. This, I think, is actually a poor choice if it’s the case, because it means slaves never reflect the state of the master after power loss. And that has been my experience: the slaves never have the light bar matching the master after power loss. And in my case, with the problems I am having, it means I have to go do the 6 toggles on every slave to get them to sync up.