Issues with Blue 2-1 After Replacements

I’m also curious about this, but hopefully we can figure it out on z2m as well. I personally have not seen this issue on my network, but I am using Hubitat.

@bgreet When the slowdown occurs, are the devices responsive physically? Can you go up to the device and press a button and have things instantly respond? Or do the devices seem to slow down as well?

I’ll PM you with some other questions and we’ll dig a little more.

All my switches are smartbulb enabled. If I disable smartbulb they are fully responsive. They are also fully responsive at the switch for on/off with smartbulb enabled with binding, though the hit rate is only 75–80% on the first try (the switch shows that it is off/on or dimmed, but the command does not get sent to the bulbs).

I have very similar issues to yours in a similar setup: ~30 blue switches, Z2M, all controlling Hue bulbs. I, however, split my network into two to reduce the number of devices in each. That cut waaay down on the random errors and improved general stability, but did not fix binding.

The most frustrating are the binding issues that you note in your (2). I played around with sniffing and documented a bit here, but still need to follow up on that after the holidays.

I wonder if anyone running blue switches bound to Hue lights at a similar scale is actually having success? From what I’m seeing in my linked thread above, the network flooding appears to be due to the large number of devices trying to talk over each other, with the Blues sending a whole lot of broadcast messages.

Also, @bgreet – I see your direct message about setting up a sniffer; I’ll try to get around to replying later today.

edit: I only have about 30 blue switches, not 50… I don’t know why I wrote that at first :upside_down_face:

If you haven’t already, check out changing the reporting interval in Z2M (see the separate post on the topic regarding flooding). It was a game changer for my network, including binding. Let me know if that works. I did the same things initially (i.e. splitting the network, setting the exposure reports to 0), but what made things work was changing the reporting interval. Good luck!
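
In case it helps, here is roughly what that change looks like through the Z2M bridge API (a minimal sketch in Python with paho-mqtt; the friendly name, broker host, and exact interval values are placeholders for your own setup, and if I’m remembering the bridge topic right, this is the same thing the frontend’s Reporting tab does):

```python
# Minimal sketch: bump the OnOff minimum reporting interval via
# Zigbee2MQTT's configure_reporting bridge request. Device name,
# broker host, and interval values below are placeholders.
import json
import paho.mqtt.publish as publish

payload = {
    "id": "Kitchen switch",         # hypothetical friendly name
    "endpoint": 1,
    "cluster": "genOnOff",
    "attribute": "onOff",
    "minimum_report_interval": 15,  # seconds; the value discussed below
    "maximum_report_interval": 3600,
    "reportable_change": 0,
}

publish.single(
    "zigbee2mqtt/bridge/request/device/configure_reporting",
    json.dumps(payload),
    hostname="localhost",           # your MQTT broker
)
```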

Hey @terrence.bentley – how many Hue bulbs do you have? I was able to get 13 bound to a single switch and had that running for a while on my test setup. Granted, I was using ZHA, but I did get it set up on Z2M at least long enough to make a tutorial video.

@Eric_Inovelli I have 30 Hue on one network (with ~18 blue series controlling them) and 25 Hue on the other (w/ ~10 blue series).

The most I have bound to a single switch is a group of 6, but I haven’t noticed any correlation in my testing between the number bound to a single switch and bind reliability. When it works, it works BEAUTIFULLY – immediate response. The broken behavior tends to happen when pressing multiple switches quickly one right after another, e.g. turning off all of the basement lights from a 3-gang bank of switches in sequence: I can pretty reliably get at least one of the 3 bindings to fail when doing this. I can also reliably get a binding to fail by quickly toggling a single switch on and off a few times within a few seconds.

I’m not a zigbee expert by any means, but I can’t help but be suspicious of all the broadcast messages the switch sends to every node in the network when the physical switch is pressed. Note that I can toggle my GE zigbee switch bound to a Hue bulb on and off as fast as I can press it and the binding stays reliable (it sends only ~4 messages per tap vs. the 60+ messages when tapping the Blue series).

Again, I can’t stress enough zigbee is new to me and my “conclusions” about the causes of my poor reliability are pure speculation based on the evidence I’m seeing through my sniffing dongle: I could be way off base here. :slightly_smiling_face:

Makes sense – I think you’re right regarding traffic. I know Eric M is working on something for Z2M right now relating to the energy monitoring, to minimize the traffic.

I skimmed back through this thread, so apologies if you’ve answered this already, but have you turned off (set to 0) parameters 18, 19, and 20 (all the power-monitoring ones)? This may help while we work on a solution for Z2M.

Yes – those are all disabled, and I have verified they aren’t causing any traffic when sniffing the network.

It’s these OnOff broadcast messages that appear to be the culprit:

Those highlighted messages are all the OnOff attribute messages being broadcast across the network to every node (note that these have nothing to do with the energy monitoring – they are specifically OnOff and/or LevelCtrl messages).

Is it Z2M that would be responsible for these OnOff broadcast messages when interacting with the switch, or is it the firmware of the switch itself? I would have assumed the latter, but I don’t have an easy way to test the message behavior in ZHA at the moment.

Interesting, thanks for documenting this – I’ll unfortunately have to defer to @EricM_Inovelli on this one :confused:

From that trace, it looks like it tried to send a response to the coordinator (19963, sequence 29) but perhaps didn’t get a response. Then it starts broadcasting seq 29 to all nodes every 2.5 ms or so, perhaps in a panic mode of trying to get that packet back to the coordinator using an alternate route?

A trace of where that node is either source or destination may help figure that out.
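
Something like this pyshark sketch could pull that out of a capture (pyshark just wraps Wireshark’s tshark; the short address below is a stand-in I made up, so substitute whatever address the trace actually shows for that node):

```python
# Sketch: filter a capture down to one node's traffic with pyshark.
# NODE is a hypothetical 16-bit network address; replace it with the
# switch's actual short address from the trace.
import pyshark

NODE = "0x4dfb"  # placeholder address
cap = pyshark.FileCapture(
    "capture.pcapng",
    display_filter=f"zbee_nwk.src == {NODE} || zbee_nwk.dst == {NODE}",
)
for pkt in cap:
    # Print timestamp and NWK source -> destination for each packet
    print(pkt.sniff_time, pkt.zbee_nwk.src, "->", pkt.zbee_nwk.dst)
cap.close()
```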

Ah, you may be on to something, @coreystup!

Here are two downloadable packet dumps from Wireshark (you’ll obviously need the Wireshark application to open these): one for the traffic on the Inovelli Blue switch and one for the traffic on the GE/Jasco zigbee switch (acting somewhat as a control). Both experiments were run by binding each switch to the same single Hue bulb, pressing either up or down on the respective paddle, and monitoring the traffic. Both dumps start with the initial packet being sent from the switch to the bulb and end when the traffic tails off. The GE switch generates a total of 34 packets (none of them broadcast) over 0.1 seconds, while the Inovelli generates 213 packets (133 of them broadcast) over 1.8 seconds.
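
For anyone who wants to reproduce the tally without clicking through Wireshark, this is roughly how I’d count it with pyshark (the file names are stand-ins for the two dumps above; zigbee NWK broadcast destinations live in the 0xfff8–0xffff range):

```python
# Sketch: count total vs. broadcast Zigbee NWK packets in each dump.
# File names are placeholders for the two captures described above.
import pyshark

def tally(path):
    total = broadcast = 0
    cap = pyshark.FileCapture(path, display_filter="zbee_nwk")
    for pkt in cap:
        total += 1
        # Wireshark renders the NWK destination in hex, e.g. "0xffff";
        # 0xfff8-0xffff is the Zigbee broadcast address range.
        if int(pkt.zbee_nwk.dst, 16) >= 0xFFF8:
            broadcast += 1
    cap.close()
    return total, broadcast

for name in ("ge_switch.pcapng", "inovelli_switch.pcapng"):
    total, broadcast = tally(name)
    print(f"{name}: {total} packets, {broadcast} broadcast")
```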

I’m honestly unsure exactly how to interpret all of these packets, but I can spot some differences between the GE and Inovelli packets that may be worth noting:

Starting with the GE:


We can see it send a packet (two of the same, actually – redundancy?) to the coordinator at .04s to tell it the new state of the switch (Off), with flags “acknowledge request” = True and “Disable Default Response” = False.

Then, at .11s:

We can see the coordinator sending back a “ZCL: Default Response” of “Success”, presumably letting the switch know that it received the packet and all is good.

Now, moving to the Inovelli:


We can see the same type of reporting packet sent to the coordinator at .05s with the flag “acknowledge request” = True, but (in contrast to the GE) with “Disable Default Response” = True as well. Is this telling the coordinator that it’s expecting an acknowledgement, but at the same time telling it not to send a response? Looking through the rest of the packets, we never see the coordinator send back a “ZCL: Default Response” = “Success” message to the Inovelli switch (it does send a few things to it that look like route-discovery traffic, but never a “Success” message as we saw with the GE switch).
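
If I’m reading the ZCL spec right, the APS “acknowledge request” flag and the ZCL Default Response are separate mechanisms at different layers, so that combination may be less contradictory than it looks. The “Disable Default Response” bit is bit 4 of the ZCL frame-control byte; here’s a little decoding sketch (the example byte values are illustrative, not pulled from these traces):

```python
# Decode the ZCL frame-control byte per the ZCL spec bit layout.
# Bit 4 is the "Disable Default Response" flag seen in the traces.
def decode_zcl_frame_control(fc: int) -> dict:
    return {
        "frame_type": fc & 0b00000011,            # 0 = global, 1 = cluster-specific
        "manufacturer_specific": bool(fc & 0b00000100),
        "direction": bool(fc & 0b00001000),       # set = server to client
        "disable_default_response": bool(fc & 0b00010000),
    }

# Illustrative values only: 0x08 leaves the flag clear (coordinator
# replies with a Default Response), 0x18 sets it (response suppressed).
print(decode_zcl_frame_control(0x08))
print(decode_zcl_frame_control(0x18))
```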

The only odd thing about the “Inovelli panics when it doesn’t get a response from the coordinator” hypothesis is that the Inovelli only “waits” ~0.04s between sending the packet to the coordinator and starting its flood of broadcast messages, whereas the GE switch doesn’t even receive its response from the coordinator until about 0.06s after sending its packet.

Anyhow – is this helpful? Again, the technical aspects of zigbee are new to me, so I’m grasping at straws a bit here, but I’m just attempting to provide enough info to help troubleshoot this for the folks who are having these issues.

And just to note: these two dumps are not anomalies; this is the consistent traffic behavior I’ve always seen on both the GE switch and all of my Inovelli switches. I really don’t think this has anything to do with a poor mesh or communication issues: the switches and the Hue bulb in this experiment are all within ~6 ft of the coordinator and of each other, and none of my installed switches are from the bad batches.

edit: oh, and these are the zigbee keys I have configured in Wireshark for my networks, to be able to see the fully decrypted packets:

“f0:8e:97:53:9c:05:ca:c5:f5:70:0c:28:93:ec:f8:dc”
“5A:69:67:42:65:65:41:6C:6C:69:61:6E:63:65:30:39”
“a38895d224d436924aba8cd7da4f4d9d”
“01030507090b0d0f00020406080a0c0d”

Can you share a screenshot of your reporting tab? Like this one?

Definitely – here it is for the switch I have bound to that single Hue bulb:

Edit: not sure if it’s worth noting, but the Hue bulb that I did the direct binding to is not on this switch’s circuit, so it is not impacting power reporting. All of the Hue lights on this switch’s actual circuit remained off during my testing.

Can you change the onoff reporting min rep interval to 15?

Good thought there – I hadn’t tried that yet. I just tested, and I’m still seeing the same pattern with the onOff min rep interval set to 15:

I won’t upload the full trace unless it would be useful, but you can see my highlight on the initial bind “Off” command (1), then the OnOff attribute being sent to the coordinator as seq 69 (2), and then the flood of broadcast messages starting for that same seq 69 (3) and running off the bottom of the window: it appears to be one message per node in my network, which I guess makes sense for a broadcast message.

(and thank you, of course, for all the help on this, @EricM_Inovelli! I realized I’ve just been shooting out messages without saying that)

I posed this unack’ed broadcast-flood situation as a puzzle to one of the community zigbee enthusiasts (Tony on the Hubitat forum). He says he’s just a user trying to make sense of the documentation as best he can, but perhaps this can help figure out the pattern. His response:

Hi… if the question is regarding why the Inovelli apparently broadcasts at roughly 2 ms intervals (possibly in response to a missing acknowledgement), that’s a head-scratcher. I’m in no position to say for sure, but I’d expect the timeout for a message (at the application level) to be a multiple of 50 ms, since EMBER_APSC_MAX_ACK_WAIT_HOPS_MULTIPLIER_MS is 50 ms (SiLabs defines this as the “per hop delay” used to determine the APS ACK timeout value). That’s also consistent with the observed GE traces showing no issues, with a response arriving within 60 ms.

But that’s for an APS ACK. There are of course retries happening at lower protocol levels and those timeouts would be different…

A while back (in response to a forum post where a figure of 30 seconds was posited as a retry timeout – that seemed kind of long) I did some digging to see how Zigbee timeouts and retries were handled (it’s complicated!), and my takeaway was:


Accounting for retries at the application, network, and media-access layers, the scenario would play out like this:

a: APS > NWK (application sends message expecting APS ack)
b: NWK > MAC (check for channel clear to transmit: if nothing else is transmitting, do it; if the channel is busy, repeat up to 5 times, waiting up to 7 backoff periods of 320 µs each)
c: MAC transmits packet, waits for ACK; if received within 864 µs, DONE (no wait if broadcast)
If NO ACK, repeat (c:) up to 3 times; if still NO ACK, MAC reports failure to the NWK layer (so 4 transmits so far)

NWK layer waits up to 48 ms, then retries (b:); still no ACK, repeat (b:) until 250 ms have elapsed, then report failure to the APS layer (at this point, at least 8 retries have been done)

APS layer waits the specified interval (a multiple of EMBER_APSC_MAX_ACK_WAIT_HOPS_MULTIPLIER_MS), then tells the NWK layer to retry at (a:)

If still no ACK (after at least 16 retries so far), the APS layer retries all of the above starting at (a:) TWO MORE TIMES, resulting in up to 48 transmissions for a single unacknowledged packet…


In any event (at least in the Ember stack), the minimum application-layer timeout, allowing for only a single hop, would be 50 ms. Retries more frequent than that would seem to be originating at the network or MAC layer.

From reading the thread you linked, it does seem like the observation about the difference between GE and Inovelli re: default-response enablement is significant. But that’s an application-layer difference… and the repetitive sequence-29 broadcasts don’t seem to be waiting for APS ACK timeout intervals.
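
Trying to follow his arithmetic, the worst case seems to multiply out like this (a back-of-envelope sketch; the exact grouping of the retries is my reading of his description, not measured behavior):

```python
# Back-of-envelope version of Tony's worst-case retry math (my reading
# of his description; Ember-stack defaults as he cites them).
mac_transmits_per_nwk_attempt = 1 + 3  # original send + up to 3 MAC retries
nwk_attempts_per_aps_send     = 2      # NWK keeps retrying until ~250 ms elapse
aps_sends                     = 2 * 3  # one APS retry, all of it done 3x total

total = (mac_transmits_per_nwk_attempt
         * nwk_attempts_per_aps_send
         * aps_sends)
print(total)  # 48 transmissions for one unacknowledged packet
```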

Wow, very interesting read – big thanks to you and Tony! It certainly makes me realize how far in over my head I am trying to interpret any of this myself :upside_down_face: Still having fun investigating, anyhow.

@mbbush wondering if we could get your thoughts?

@bgreet I’m flattered you mentioned me specifically. I also consider myself “just a user who reads documentation in detail”. I’ll try to find some time to read through this thread in the next few days and give an opinion, but no promises.

Our engineers have identified an issue that they believe is causing the excess communication. The next firmware release should help with this. I believe it is coming next week.
