Zigbee failures after 2.14 firmware upgrade

dallingham · April 24, 2023, 4:17am

Over the past several days, my switches have upgraded from 2.08 to 2.14, and I have added 6 additional Blue 2-1 switches. Since this time, my Zigbee network has become unreliable. Messages seem to be getting lost. Looking at the logs, I am starting to get messages like:

[139689718647728] Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>

Simple automations, such as turning a light on in response to motion, are failing. In some cases where the automation turns on two lights, frequently only one or neither of them turns on.

Up until the upgrade and the addition of the 6 additional switches, my network was rock solid.

Any suggestions on how to debug this? I’m running Home Assistant 2023.4.4 with ZHA on a HUSBZB-1 stick. I have 42 devices on the Zigbee network, 30 of which are Blue 2-1 switches.

pfak · April 24, 2023, 2:53pm

I’m having the same problem. Check out my topic:

dallingham · April 24, 2023, 3:16pm

My problem sounds similar, but I have not had any problems pairing the Blues with my network. Once they are paired, though, they seem to randomly drop commands. Looking at the logs, I see that I’m getting a Python traceback with the following error:

asyncio.exceptions.TimeoutError

I assume that means that a command was sent to the switch and it did not respond.

My best guess is either that the 2.14 firmware is having a problem and missing messages, or that my network is being overwhelmed. I have doubts that the network is the problem, because my network is fairly small and consists mostly of routers (30 Blues and about 5 Sonoff plugs) that are evenly distributed across the house.

Unfortunately, my Zigbee skills are a bit limited and I don’t know how to debug the issue.

pfak · April 24, 2023, 3:23pm

My original issue is as you described; however, I got some Blues set up on a test bench on a separate network and noticed problems with pairing them as well.

I don’t know if the two issues are related, but I certainly have the Blues dropping children without notification, causing command failures.

dallingham · April 24, 2023, 7:50pm

I’ve used Node-RED to alter the automation for one of the motion sensors. On motion, I attempt to turn on two lights. After sending the turn_on, I check their states. If they are still off, I delay (250ms for one, 500ms for the other) and try again. This has increased the success rate, but it still does not always work, since sometimes the second turn_on fails too.
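For anyone who wants to try the same send, check, delay, retry idea outside Node-RED, here is a minimal Python sketch of the logic. It is not the actual flow: `turn_on_with_retry`, `FlakyLight`, and the `max_attempts` parameter are names made up for illustration, and the "light" is a simulated stand-in that silently drops its first few commands. In a real setup the two callables would wrap whatever sends the turn_on and reads back the state.

```python
import time
from typing import Callable


def turn_on_with_retry(
    send_turn_on: Callable[[], None],
    is_on: Callable[[], bool],
    retry_delay_s: float,
    max_attempts: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send turn_on, verify the state, and retry after a delay if still off.

    Returns the attempt number that succeeded, or 0 if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send_turn_on()
        except Exception:
            # Assumption: a delivery error is handled like a silent drop;
            # the state check below decides whether to retry.
            pass
        if is_on():
            return attempt
        if attempt < max_attempts:
            sleep(retry_delay_s)
    return 0


class FlakyLight:
    """Simulated switch that ignores its first `drops` commands (illustration only)."""

    def __init__(self, drops: int) -> None:
        self.drops = drops
        self.on = False

    def turn_on(self) -> None:
        if self.drops > 0:
            self.drops -= 1  # command "lost" on the network
            return
        self.on = True


if __name__ == "__main__":
    # One retry with a 250 ms delay, matching the delay used for the first light.
    light = FlakyLight(drops=1)
    attempt = turn_on_with_retry(light.turn_on, lambda: light.on, retry_delay_s=0.25)
    print(f"succeeded on attempt {attempt}" if attempt else "all attempts failed")
```

With `max_attempts=2` this mirrors the single retry described above. Raising it trades extra latency for a better chance of getting through, but it only masks the dropped messages rather than fixing them.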

Interestingly, the turn_off never seems to fail on either switch.

dallingham · April 24, 2023, 11:49pm

The problem seems to be even more basic than that. I have several switches in the basement that have a high failure rate with a remote turn_on from the Home Assistant UI. Some of these will fail a dozen times in a row before they will succeed. Other switches nearby are rock solid.

I get messages like this in the log file:

2023-04-24 16:39:32.107 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received messageSentHandler: [<EmberOutgoingMessageType.OUTGOING_DIRECT: 0>, 44649, EmberApsFrame(profileId=260, clusterId=6, sourceEndpoint=1, destinationEndpoint=1, options=<EmberApsOption.APS_OPTION_NONE: 0>, groupId=0, sequence=102), 118, <EmberStatus.DELIVERY_FAILED: 102>, b'']
2023-04-24 16:39:32.107 DEBUG (MainThread) [bellows.zigbee.application] Received messageSentHandler frame with [<EmberOutgoingMessageType.OUTGOING_DIRECT: 0>, 44649, EmberApsFrame(profileId=260, clusterId=6, sourceEndpoint=1, destinationEndpoint=1, options=<EmberApsOption.APS_OPTION_NONE: 0>, groupId=0, sequence=102), 118, <EmberStatus.DELIVERY_FAILED: 102>, b'']
2023-04-24 16:39:32.108 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xAE69:1:0x0006]: command failed: 'on' args: '()' kwargs '{}' exception: 'Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'
2023-04-24 16:39:32.108 DEBUG (MainThread) [homeassistant.components.zha.entity] light.rec_room_overhead_light_2: starting transitioning timer for 1.25
2023-04-24 16:39:32.108 DEBUG (MainThread) [homeassistant.components.zha.entity] light.rec_room_overhead_light_2: turned on: {'on_off': DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')}
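To see whether the failures cluster on particular switches, the "command failed" lines can be tallied per network address. Below is a rough Python sketch that does this. The regular expression is written only against the ZHA debug line format shown above and may not match other versions, and `count_delivery_failures` is a made-up helper name. The decimal 44649 in the bellows frames and 0xAE69 in the ZHA line are the same device address in two notations.

```python
import re
from collections import Counter
from typing import Iterable

# Matches ZHA debug lines such as:
#   [0xAE69:1:0x0006]: command failed: 'on' ... DELIVERY_FAILED ...
FAILED = re.compile(
    r"\[(?P<nwk>0x[0-9A-Fa-f]{4}):(?P<endpoint>\d+):(?P<cluster>0x[0-9A-Fa-f]{4})\]: "
    r"command failed: '(?P<command>[^']+)'.*DELIVERY_FAILED"
)


def count_delivery_failures(lines: Iterable[str]) -> Counter:
    """Count DELIVERY_FAILED command failures per (network address, command)."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[(match["nwk"], match["command"])] += 1
    return counts


# Two lines copied from the log excerpt above; only the first is a failure line.
sample = [
    "2023-04-24 16:39:32.108 DEBUG (MainThread) "
    "[homeassistant.components.zha.core.channels.base] [0xAE69:1:0x0006]: "
    "command failed: 'on' args: '()' kwargs '{}' exception: "
    "'Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'",
    "2023-04-24 16:39:32.108 DEBUG (MainThread) "
    "[homeassistant.components.zha.entity] light.rec_room_overhead_light_2: "
    "starting transitioning timer for 1.25",
]

if __name__ == "__main__":
    for (nwk, command), n in count_delivery_failures(sample).most_common():
        # int(nwk, 16) gives the decimal form that appears in the bellows frames.
        print(f"{nwk} ({int(nwk, 16)}): {n} failed '{command}' command(s)")
```

Feeding it the lines of a full log file (with debug logging enabled for ZHA) instead of `sample` would give a per-switch failure count, which can then be compared against each switch's location or batch.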

I have an even higher failure rate when using the zha.issue_zigbee_cluster_command service to try to set the LED sidebar. A few days ago this was working reliably, but it is now failing with the same DeliveryError, even on switches that respond reliably to remote turn_on/turn_off commands.

pfak · April 25, 2023, 2:47pm

@dallingham It’s because the Inovelli switches are dropping their children but not notifying the network.

dallingham · April 25, 2023, 2:50pm

I’m curious why this isn’t being more widely reported.

Is there some way I can track this with ZHA? Is it just specific switches that drop their children, or is it across the board?

Is this a recent problem? I didn’t seem to have any problems with 2.08.

pfak · April 25, 2023, 3:08pm

dallingham wrote: “Is this a recent problem? I didn’t seem to have any problems with 2.08.”

I don’t know. I just got the switches and they were on 2.08 and immediately OTAed to 2.14. I installed 20 of them and they’ve absolutely destroyed my network performance and reliability.

Eric_Inovelli · April 25, 2023, 3:10pm

@EricM_Inovelli – can we start a PM or something with these guys to troubleshoot?

Edit: @pfak and @dallingham – what is the date code at the top left of your switches? There should be a four-digit code underneath the faceplate.

I haven’t seen/heard any reports outside of a few here and there, so I’m curious to get to the bottom of this for you guys.

dallingham · April 25, 2023, 3:11pm

My network was very reliable. I had around 24 installed without issues. I got my new batch and installed them as 2.14 was rolling out, and that is where everything fell apart. I’m not sure if it is the increased network size or 2.14.

dallingham · April 25, 2023, 3:21pm

I’ll have to go around and pull off the face plates and check. I have a mix of the three batches. I have the original shipments, which contained some bad switches (some of which I reworked) and then the replacements. I’ve had these installed for several months without any issues. I just received 10 more from the latest batch, and that is when my problems started. I’ve installed 6 of the new batch this week, and that is when everything updated to 2.14. The network has become unstable during the past week.

So there are several variables here. Added more switches, increasing the size of the network. Switches from a new batch. Firmware upgrade. Any (or a combo of several) of these changes could be the root of the problem.

Eventually, as switch availability increases, I want to go back and replace all the reworked switches with new switches. I reworked and installed those because I had to have something in place for a remodel that was underway. However, all the reworked switches have been working well, and I haven’t had any problems until this week.

pfak · April 25, 2023, 3:27pm

@Eric_Inovelli Can you map IEEE to a date code? Otherwise I’m going to have to remove a lot of covers.

The units only stay in pairing mode for ~30 seconds. Is this normal? They all take 10-15 attempts to pair, but then have a >100~200 LQI once paired. I do not experience this issue with any of the other Zigbee devices on my network (Hue, Leviton, Sengled, …).
All of these switches were purchased from AARtech in March/April of this year.
The unit I’ve set up on my test bench to experiment with has a 2212 date code.
I have observed that one of the units has an IEEE address in the “bad batch”, but it has a fine LQI.
Probably unrelated: I’ve had to RMA two switches so far out of the 25 I’ve purchased due to improperly manufactured lugs.

pfak · April 25, 2023, 3:36pm

dallingham wrote: “My network was very reliable. I had around 24 installed without issues. I got my new batch and installed them as 2.14 was rolling out, and that is where everything fell apart. I’m not sure if it is the increased network size or 2.14.”

I’m having problems pairing new switches out of the box with 2.08, so I am not sure if it’s a hardware issue or a firmware issue. Unfortunately, all my switches shipped with 2.08, and despite my best efforts I have been unable to flash an older firmware. (I tried to hack out the Zigbee2MQTT provider for Inovelli, replace it with ZigbeeOTA, and then force an older firmware using an index definition, but I just get “Image invalid” from the switch.)

I don’t know if these are separate issues, or the same issue. I did not have problems originally pairing them to the network, or at least not that I remember.

In my case I have set up a separate test Zigbee network to experiment.

dallingham · April 25, 2023, 3:36pm

My switches pair without any issues, and I’ve never had an issue with them dropping out of pairing mode before they get interviewed and added. They usually have an LQI of > 250. Some occasionally drop briefly to as low as 60, but there doesn’t seem to be any rhyme or reason to it. Sometimes they will drop for a few seconds, sometimes for a couple of minutes. Then they return to more stable values.

I’m using the ZHA Network card to monitor the LQI and RSSI in real time. Otherwise, I never would have noticed the occasional drop in LQI.

harjms · April 25, 2023, 3:41pm

Curious on this one. Pics?

pfak · April 25, 2023, 3:44pm

harjms wrote: “Curious on this one. Pics?”

One that I disassembled just had no thread. Sorry for the leg in the photo

harjms · April 25, 2023, 3:46pm

Hmmm, was it identified when you were torquing down the screws onto the wire? I had a couple from the first batch that had this issue, but I didn’t bother opening them up since they were being replaced.

pfak · April 25, 2023, 3:48pm

Yes. I torque down by hand using the back stabs and then do the wiggle test to make sure the wire is secure. This is when I identified the issue with 2 switches.

I have a few more new in box I should probably test before my 30-day RMA window with AARtech expires.

Eric_Inovelli · April 25, 2023, 4:51pm

Yeah I can certainly do that. Basically, do any of them have 943469xxxxxxxx or 385B44xxxxxxxx at the beginning? If not, then we can skip this step.

EDIT: I didn’t read your other post until now – you can disregard as it appears you know about the recall

Hmmm… no they should time out after like 1.5-2 minutes I think? I’m going off memory here, so I could be wrong, but I know it’s longer than 30 seconds.

If you could look at the date code on that one, that would be helpful. If it falls in the bad batch date code, that may explain that one – but definitely not the others

This is very perplexing – on the surface, everything appears to be ok, so I’m not sure what the issue is.

Let me start a PM thread with yours specifically so we can troubleshoot further as I think it may be different than OP’s problem and, admittedly, it’s above my paygrade so I’ll need to bring in the other Eric so we can look at logs.