I didn’t really catch this at first, but I’m not sure I understand what you are asking.
Are you asking if my scene turns off every switch and bulb at once? Or are you asking if I have my scene set up like:
“Movie” Scene:
-Family room bulb 1
-Family room bulb 2
-Family room lamp switch 1
-Family room lamp switch 2
-Family room ceiling switch
-etc
vs
Family Room Lamp Group:
-Family room bulb 1
-Family room bulb 2
-Family room lamp switch 1
-Family room lamp switch 2
“Movie” Scene:
-Family Room Lamp Group
-Family room ceiling switch
-etc
I have it set up the second way. I guess I never checked specifically if the lamps and switches turned off separately, but I don’t think they do. I just think I have 20 different lights that all need to turn off at once, and in the time that a scene is allowed to try to activate, they do not all succeed at turning off. In fact, I just realized I have a screen recording of this from the home assistant app.
You can see how it turns off the lights 6 or so at a time. And then by the time it gets to the end, it stopped trying to turn them off.
Is there something specific you would like for me to test? I got some preflashed sniffers for a good price on amazon and I’m going to try to sit down tonight and do some testing. But it occurred to me that I don’t exactly know what tests to run or how exactly to sniff the traffic.
I think I’m going to start by turning off all of the other switches to cut down the traffic, but if you would rather I not do that because that isn’t a normal test scenario I don’t have to.
I also need to try to figure out why one of the switches I have doesn’t have a functioning air gap. Pulling the air gap only disconnects the ceiling light from the switch; it doesn’t power off the switch. I found one thread somewhere that suggested incorrect wiring could cause this, but also that sometimes there are bad air gaps. If I were to turn off every switch not in this one group, that switch presently won’t turn off. It’s on the same circuit as the group, so I can’t independently kill it from the breaker panel.
I think he is asking if the devices are in a zigbee group and the group is being turned off or if the automation just turns off each device individually. I’d be interested to see if there is anything in the Home Assistant logs or the Z2M logs if you can reproduce what is in the video.
Regarding the device restart / route issues, the key thing to understand is that Zigbee doesn’t “restore” its routing after a power loss.
When all devices lose power, the network doesn’t come back with the same routing paths it had before. Instead, each device essentially starts fresh and rebuilds its view of the network.
Importantly, Zigbee does not proactively rebuild routes between devices. Routes are created only when one device actually tries to send a message to another.
So even if everything is back online, there may not yet be a valid path between two switches until one of them tries to communicate. The first few attempts can fail or be delayed while the network figures out how to route the message again. Once that path is established, everything works normally.
That’s why pressing the switch a few times resolves the issue—it forces the network to rebuild the route between those devices.
Devices often:
-prioritize / quickly rebuild routes to the coordinator (which is why control from the hub works after the switch restart)
-or use different routing behavior
Device-to-device paths are made on demand so that is why they require a few clicks sometimes after a restart. I am going to check to see if there is a way to force a device to “ping” non-coordinator devices in its binding table but I think this would be outside the way the Zigbee SDK operates normally.
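To make the on-demand behavior described above concrete, here is a toy Python model. It is only a sketch under stated assumptions: ToyNode, discovery_attempts, and the device names are made up for illustration, and the model does not reflect how the Zigbee stack or the switch firmware actually implements route discovery. It only shows why the first few presses after a power cycle can fail while later ones succeed.

```python
class ToyNode:
    """Toy stand-in for a Zigbee router. Hypothetical, not real stack behavior."""

    def __init__(self, name, discovery_attempts=2):
        self.name = name
        # Assumption: number of sends that are lost before a route exists.
        self.discovery_attempts = discovery_attempts
        self.routes = set()   # destinations with a known route
        self.pending = {}     # destination -> sends still needed before a route exists

    def power_cycle(self):
        # Routes are not restored after power loss: the table starts empty.
        self.routes.clear()
        self.pending.clear()

    def send(self, dest):
        """Return True if delivered, False if lost while the route is still being built."""
        if dest in self.routes:
            return True
        remaining = self.pending.get(dest, self.discovery_attempts) - 1
        if remaining <= 0:
            # Route is now established; the NEXT send will succeed.
            self.routes.add(dest)
            self.pending.pop(dest, None)
        else:
            self.pending[dest] = remaining
        return False


hall = ToyNode("hall switch", discovery_attempts=2)
presses = [hall.send("stairs switch") for _ in range(4)]
print(presses)
```

In this toy model the first presses are lost and every press after that is delivered, until the next power_cycle() empties the table again. Nothing here builds the route ahead of time, which is the point of the explanation above.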
EDIT: To clarify, this is me turning on all of the lights included in the “movie” scene, and then running the scene. 3 lights failed to turn off, while home assistant reports 4 lights failed to turn off (all 4 still show as on right now).
Does this tell you guys anything? I did not manage to capture those exact logs. I did capture different ones that had the same “many to one route failure”. This log in the screengrab left 3 lights on when I ran the routine. I went for a screengrab to capture the pop ups explaining that delivery had failed. Interestingly, the 3 lights that are left on do not include “dining room plugs”. The switch believes it is off (light bar is off), though home assistant does still show that it’s on.
Also, it’s different lights that fail to turn off every time. One time it’s kitchen lights 1 and 3 plus under cabinet lights. The next time it’s great room lamps (both switches and the 2 hue bulbs in the group). The time after that it’s hallway, kitchen light 2, kitchen light 3.
It’s also not like the bindings after a power cycle where if I toggle it a few times it finds a route to the paired switch. This never improves no matter how many times I run the scene.
In the process of shutting down all of my lights from mains power, I happened across this in the logs. To me, this shows that even if “ping” isn’t really a thing in zigbee, and z2m is just using that word in the log message, there seems to be “something” it can do to check the status of lights.
To me, this explains why after a power outage, the switches all work the first time I try them from home assistant, but not from the switches themselves. Because the route is already being automatically fixed after a couple of “pings” from z2m. But since the switches themselves don’t do this between switches in a binding, those don’t fix themselves until I go to each switch and toggle them half a dozen times.
I have spent the whole day so far troubleshooting this stuff. I get this same “Many to one route error” in the logs any time there’s an issue.
My issues do not always remain the same. I spent 90 minutes trying to get back to the state where the groups would fail without the coordinator. Couldn’t do it. They worked fine every time. Also, the individual bindings didn’t fail to the same degree that I have been used to with power cycling. Maybe 1/3 of the switches failed.
So then my next idea was to unbind some individual bindings and re-bind them as groups. So I changed the other 2 sets of bound switches (3 in one set, 2 in the other) to group bindings, making sure to remove the individual bindings in the process. They do not work every time under any condition. I tried:
-Bringing them online without the coordinator online
-Bringing them online with the coordinator online
-changing nothing after the groups were bound
The group with 3 switches especially just fails 20-50% of the time. It’s always a “many to one route failure”. The other groups (one with 2 switches, another with 2 switches and 2 smart bulbs) can also be made to fail and do also go out of sync on the dimming, but not as often as the group with 3 inovelli switches.
Also, when bringing them online one time, I happened to catch this “Route Error Source Route Failure”
Device 54057 is one of the slaves in the group with 3 switches
This is particularly annoying because it means that groups randomly either behave totally fine, or worse than individual bindings. I’m having problems with the 3 switch group that I never had when I had it set up as individual. Instead of needing to toggle it half a dozen times after a power cycle, I now cannot send more than one physical input per second, or it just fails. It seems to work every time if I leave it alone, walk up to it, and hit it once. But if I stand there toggling it on and off, as soon as my inputs get faster than 1 per second, it stops working half of the time. And this isn’t just a problem for not being able to manually flicker the lights at the switch. Multi taps are affected, as well as dimming. In the middle of ramping brightness, the switches will go out of sync. The master will keep ramping up until it recognizes that I let go of the slave a few seconds later.
It’s as if one inovelli device is trying to handle the traffic of every single device on the zigbee network. And it’s already so overloaded with all of the stuff that the switches report back to home assistant all of the time, that it can’t handle more than one extra input per second on top of that.
I will try to figure out the sniffer in a bit and go through all of this again.
I went to use the zigbee sniffer 2 weeks ago and found that I needed a special device to flash it (I thought I had bought pre-flashed sniffers). Took a bit for that to come in. I probably will have some time tonight to do some more troubleshooting.
Is there anything specific you guys want me to capture? Should I leave all of the switches powered while trying to capture this traffic? Or should I focus on trying to capture single groups?
Also, @zeel and I would like some clarity about the setup of these. It says in the setup instructions that we should use individual bindings for pairs of switches (3 way), but groups for anything beyond that (3 way with smart bulbs, 4 way, anything with more than 2 devices needing bound together). But my takeaway from this thread so far has been that we should do everything as groups. Even if it’s just a pair of switches. Which is it?
Still would like some clarity on why it has been suggested that everything should be done with group bindings. Both of you said it in this thread, and the documentation does not suggest this should be done.
Regardless, I went ahead and converted every binding to group instead of individual. I figured maybe there’s some weirdness causing issues with individual bindings, and it just goes away if everything is done with groups. This was not the case. Groups still have issues where commands are not sent through the network fast enough, and the master and slave switches get out of sync. It’s rare that I dim a light and both switches that control it have the same dimming level on the light bar. Double tap brightness also almost never works on both ends (just the switch I am pressing). I would guess that these things succeed 30% of the time or less.
One thing I tried specifically to try to learn how slow the network is through the switches is to try to turn multiple lights on at once from one end of their 3 ways. I have 2 banks of 3 switches side by side. If I try to turn all 3 on at once, with no other commands sent for several minutes beforehand, at least one of the slave switches fails to update (or at least one of the master switches fails to toggle its load). Below is the set of errors I get when I try to turn 3 lights on at once. I have not seen the “failed to register group” error before, and it is not an error that happens with every group. Other groups have identical problems without that error.
Everything continues to work perfectly from home assistant, and in fact even with a delay of 0, it’s significantly faster to turn lights on from home assistant than from the master switch itself. I even hooked up a couple of lutron caseta remote switches (because there are no zigbee battery switches that look like a standard rocker) and it’s faster to turn lights on from those with an automation than it is from the inovelli switch that the load is hooked up to. It’s something like 100ms from home assistant vs 450ms from the switch itself.
I still would also like to know if you guys want me to sniff traffic with my whole zigbee network intact or to shut down everything but one group or what.
I know next to nothing about Zigbee (my experience, such as it is, is with Matter). However, googling that error, the results seem to always point at the Zigbee coordinator (most times Sonoff). No idea if that helps or not.
We already changed the coordinator to a recommended one, and the things I am having issues with are supposed to work the same whether the coordinator is online or offline.
I take your point; however, the route errors are being reported by the coordinator. They imply you have a multicast issue. The exact same error is being reported by users with different switches etc.
Also dumb question 2 for the day, what hardware are you running HA on?
Last thought, I understand wanting 3 and 4 ways to work if the coordinator is down. However in my case I also set up a “backup automation” in HA in case the matter binding does not work first time. That significantly reduces the possibility of users seeing an issue if the binding fails.
If I had any ability to control how the zigbee network sets itself up, I could probably try to do something to fix these many to one route issues. But the network just sets itself up wrong, and can’t be tweaked manually.
I’m running on a pi 5 8GB
I asked earlier in the thread if there’s some method to make the switches send a command back and forth to re-establish the binding after a power outage, and nobody was sure. Something akin to a “ping”. I shouldn’t have to do that, but if I could just make one of the binding methods work without problems, bandaid or no, I wouldn’t have to keep coming here and posting in hopes that I get some suggestions or beta firmware to test. I don’t really want to try to get home assistant to interpret which command I am trying to send physically from the switch and send it again from home assistant to try to stabilize things. The zigbee network seemingly already can’t handle the commands being sent through it as it is. If I do that I will double the number of commands being sent through it. This will basically ensure problems every time I press more than one switch in a row. The only reliable way to make the switches work more than 50% of the time is to only send 1 command per second (which means double tap and dimming often fail)
As it stands I have $5000 of switches that I installed that are significantly less reliable than normal light switches. The smart features are really nice but they just do not function well as actual light switches. Every person that comes into the house I have to explain the ways in which the light switches don’t work right so that they know what to do if they press a switch and nothing happens.
I’m not suggesting that. Just set up an automation that mimics the binding. That way the automation will set up the route for you after a power outage the first time the switch is actuated. I really do not think this will impact network performance that much unless your switches are turning on and off multiple times a minute.
It is unfortunate that (AFAIK) no equivalent of distributed border routers and TREL seems to exist in the Zigbee world.
One last thought for today (promise) is that I’ve seen noticeable improvements in the performance of my Thread/Matter network by simply increasing the horsepower of the platform that runs HA. I currently use an older 16GB, 4 core, 8 thread PC that cost me ~$150 and has about 2.5x the single core speed of the pi5 and about 3x the multicore performance.
After a few extended detours, I have captured some zigbee traffic with a sniffer. I am unable to upload it directly because it isn’t a file format the forums support. The issue was I needed the CC debugger which took time to come in, and then the specific flashing cable for the CC2531 took more time once I realized I needed that too. Somehow I had missed that the first time.
I made sure while capturing this traffic to cause some failures by toggling 3 switches at a time that are bound to other switches. I think at least one of the bound switches didn’t receive a command every single time I did it.
I didn’t do anything to my network before capturing. As far as I’m aware, the network has remained powered for weeks at this point without interruption.
If there’s anything special @rohan or @EricM_Inovelli would like me to do now that I have the sniffer working, please let me know. I would really like to get the issues I am having resolved.
EDIT: Also, it may be important to say that I used wireshark. I’m not sure if all of the various methods of capturing can read the logs from other methods or not.
It occurs to me you will need my network key. I’m not sure if that’s something I should post publicly or not, so if you guys DM me I can give it to you.
In the meantime, I can give you some highlights, knowing full well there is much more information in the actual capture file.
Now I don’t fully understand what I’m looking at, but it seems to be basically what we’ve been guessing all along: however the switches set themselves up in a network, just 2 commands near the same time can be too much for the network to handle. And honestly sometimes a basic toggle doesn’t work either, but that doesn’t fail 50% of the time like 2 toggles at once or dimming does. So the commands go out, it throws a many to one route error, the command never reaches the target, and it times out. At least now with these sniffer logs I can see the timeout happening.
Part of what I don’t get that’s happening is, sometimes the sniffer logs show a single command 500 times (though the data is different for each of those 500 times). But other times, it only shows the commands being sent once.
See in the first picture below how the source is the same for all instances of that onoff command?
But in the second picture, it only shows each of the 3 sources sending the onoff command once in a row. It then repeats them just a couple of times each before the many to one route failure, vs the dozens and dozens of times per command that it shows in other places.
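One way to put numbers on the "500 times vs once" observation is to export the packet list from Wireshark as CSV and count how many frames each source sent for a given command. The Python below is a minimal sketch, not a definitive analysis tool. It assumes the export uses Wireshark's default packet list columns (No., Time, Source, Destination, Protocol, Length, Info); the function name count_frames_by_source and the sample rows (addresses, sequence numbers, Info text) are made up for illustration and are not taken from the real capture.

```python
import csv
import io
from collections import Counter


def count_frames_by_source(csv_text, info_contains):
    """Count frames per Source whose Info column contains the given text (case-insensitive)."""
    needle = info_contains.lower()
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if needle in row["Info"].lower():
            counts[row["Source"]] += 1
    return counts


# Made-up rows in the default column layout, only to exercise the function.
SAMPLE_CSV = "\n".join([
    '"No.","Time","Source","Destination","Protocol","Length","Info"',
    '"1","0.000","0x1a2b","0x3c4d","ZigBee HA","48","ZCL OnOff: Toggle, Seq: 10"',
    '"2","0.004","0x1a2b","0x3c4d","ZigBee HA","48","ZCL OnOff: Toggle, Seq: 10"',
    '"3","0.009","0x3c4d","0x1a2b","ZigBee HA","48","ZCL OnOff: Toggle, Seq: 22"',
    '"4","0.015","0x5e6f","0xfffc","ZigBee","53","Link Status"',
    '"5","0.020","0x1a2b","0x3c4d","ZigBee HA","48","ZCL OnOff: Toggle, Seq: 10"',
])

counts = count_frames_by_source(SAMPLE_CSV, "onoff")
for source, n in counts.most_common():
    print(source, n)
```

Running this over the real export, for a window around one failed press, would show at a glance which source addresses account for the repeated onoff frames and which only appear once.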
Wireshark captures should work. The encryption key is definitely needed (and shouldn’t be provided publicly). You can start a DM with @EricM_Inovelli and me and share it with us that way.
That said, we’re very close to the limit of my knowledge here. I can poke around the capture a little bit with wireshark and an AI assistant but the best feedback for what’s actually happening here will come from Eric and the firmware engineers.
Bumping this again. If there’s someone else I need to be asking besides @EricM_Inovelli, I am happy to talk to whoever and provide anything and everything that I can.
But I’m not just going to let these threads lock and go away after spending so much on these switches. And I have had no indication that anything has been or will be looked at. I appreciate that things can’t happen overnight, but any indication that things are being looked at would be wonderful.
I would be willing to set up any requested test scenario to help narrow down this problem, or help find a solution. I am quite good at troubleshooting and bug testing.
I’ve even been trying to bandaid things by reducing the number of messages going to the zigbee network (such as power monitoring). Nothing I have tried has made any difference. I reduced the number of payloads being published from about 15-20 per second down to 3-5 per second, and it has made no difference on the functionality of the switches. Commands do not make it from one switch to another in groups as they should.
A thought you might not like (but I have all my bindings setup this way) is to program a “backup automation” that does the same as the binding. That way if the binding does not initialize correctly for the first toggle or two after a power outage the automation will take care of things and the result is transparent to users.
BTW I have not seen the same issue as you. I’m just a “belt, braces and piece of string” kind of chap (as they say in England). My logic is that if I have a backup way to avoid a perceived failure, I’m going to use it.
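For anyone wondering what the decision inside such a “backup automation” looks like, here is a rough Python sketch of the logic only. It is not Home Assistant configuration and not anything from the Inovelli docs. SwitchEvent, devices_needing_backup, and the device names are hypothetical, and it assumes the hub already knows which devices are bound together and what state each one last reported.

```python
from dataclasses import dataclass


@dataclass
class SwitchEvent:
    source: str     # switch that was physically pressed
    new_state: str  # "on" or "off"


def devices_needing_backup(event, reported_states, bound_to):
    """Return bound devices whose last reported state does not match the pressed switch.

    reported_states: device name -> "on"/"off" as last seen by the hub
    bound_to:        switch name -> list of devices it is bound to
    """
    targets = bound_to.get(event.source, [])
    return [name for name in targets if reported_states.get(name) != event.new_state]


# Hypothetical 3-switch group: the binding reached one partner but not the other.
bound_to = {"hall switch": ["stairs switch", "landing switch"]}
reported = {"stairs switch": "on", "landing switch": "off"}
event = SwitchEvent(source="hall switch", new_state="on")

print(devices_needing_backup(event, reported, bound_to))
```

Only the devices that missed the binding get a second command from the hub, so in the normal case where the binding works the backup sends nothing extra onto the network.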