Matter network instability with white switches

I have about 50 white dimmers installed in my house with about 15 more to come. My matter network has been pretty stable for the last couple of weeks. Occasionally a switch would drop off the network for 5 or 10 minutes, but it would always return. No big deal overall.

But 3 or 4 days ago my network became very unstable for no apparent reason. I didn’t install any new switches. I don’t think I made any Home Assistant updates. Now at least once a day my network will basically die. Switches start dropping off and rejoining in quick succession and the network never comes fully back up. I have to walk around my house manually resetting a dozen switches to get them back online. No good!

Looking at the matter log there are lots of errors. Like so:

2024-11-19 13:59:54.240 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 4
2024-11-19 13:59:54.252 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 12786i with Node: <0000000000000016, 1>
2024-11-19 13:59:54.254 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 12787i with Node: <0000000000000013, 1>
2024-11-19 13:59:54.256 (Dummy-2) CHIP_ERROR [chip.native.DMG] Subscription Liveness timeout with SubscriptionID = 0xe975f608, Peer = 01:0000000000000010
2024-11-19 13:59:54.259 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 12778i with Node: <0000000000000015, 1>
2024-11-19 13:59:54.261 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 12781i with Node: <000000000000002A, 1>
2024-11-19 13:59:54.262 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 12788i with Node: <000000000000000E, 1>
2024-11-19 13:59:54.264 (Dummy-2) CHIP_ERROR [chip.native.DMG] Subscription Liveness timeout with SubscriptionID = 0xb20a868f, Peer = 01:000000000000001A
2024-11-19 13:59:54.270 (Dummy-2) CHIP_ERROR [chip.native.DMG] Subscription Liveness timeout with SubscriptionID = 0x3cd4949d, Peer = 01:0000000000000035
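When a log is full of timeouts like these, it can help to tally them per node to see whether a handful of switches account for most of the errors or whether the whole network is affected. The following is a minimal sketch in Python, assuming the log format matches the excerpt above; the function name and the two regular expressions are illustrative guesses based only on these sample lines, not anything provided by the Matter server.

```python
import re
from collections import Counter

# A few lines copied from the log excerpt above.
SAMPLE_LOG = """\
2024-11-19 13:59:54.240 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 4
2024-11-19 13:59:54.252 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 12786i with Node: <0000000000000016, 1>
2024-11-19 13:59:54.254 (Dummy-2) CHIP_ERROR [chip.native.DMG] Time out! failed to receive report data from Exchange: 12787i with Node: <0000000000000013, 1>
2024-11-19 13:59:54.256 (Dummy-2) CHIP_ERROR [chip.native.DMG] Subscription Liveness timeout with SubscriptionID = 0xe975f608, Peer = 01:0000000000000010
"""

# Assumption: node IDs show up either as "Node: <HEX, n>" or "Peer = nn:HEX",
# as they do in the sample lines. Other error flavors may use other layouts.
NODE_PATTERNS = [
    re.compile(r"Node: <([0-9A-Fa-f]{16}), \d+>"),
    re.compile(r"Peer = [0-9A-Fa-f]+:([0-9A-Fa-f]{16})"),
]


def count_errors_by_node(log_text):
    """Return a Counter mapping node ID (as an int) to CHIP_ERROR line count."""
    counts = Counter()
    for line in log_text.splitlines():
        if "CHIP_ERROR" not in line:
            continue
        for pattern in NODE_PATTERNS:
            match = pattern.search(line)
            if match:
                # The log prints node IDs in hex; store them as integers.
                counts[int(match.group(1), 16)] += 1
                break
    return counts


if __name__ == "__main__":
    for node_id, count in count_errors_by_node(SAMPLE_LOG).most_common():
        print(f"node {node_id} (0x{node_id:X}): {count} error(s)")
```

Lines without a recognizable node ID, such as the CASESession timeout, are skipped. If the errors are spread evenly across nearly every node, that points more toward a problem at the border router or radio than toward individual switches.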

I’m also seeing an occasional python stack trace:

2024-11-19 14:00:40.968 (MainThread) ERROR [aiohttp.server] Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohttp/web_protocol.py", line 477, in _handle_request
    resp = await request_handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohttp/web_app.py", line 559, in _handle
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/server.py", line 82, in _handle_ws
    return await connection.handle_client()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/client_handler.py", line 80, in handle_client
    await wsock.prepare(request)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/web_ws.py", line 204, in prepare
    payload_writer = await super().prepare(request)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohttp/web_response.py", line 426, in prepare
    return await self._start(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohttp/web_response.py", line 434, in _start
    await self._write_headers()
  File "/usr/local/lib/python3.11/site-packages/aiohttp/web_response.py", line 513, in _write_headers
    await writer.write_headers(status_line, self._headers)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_writer.py", line 131, in write_headers
    self._write(buf)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_writer.py", line 76, in _write
    raise ClientConnectionResetError("Cannot write to closing transport")
aiohttp.client_exceptions.ClientConnectionResetError: Cannot write to closing transport

I have a very vanilla setup. Inovelli switches are the only thing on my network. I’ve got a Home Assistant Green and a Connect ZBT-1.

Any suggestions on how I might proceed here? I knew getting into this I was embarking on a bit of an adventure, but at some point I will try my wife’s patience. :-p

Thanks!

-harryh

PS: I just posted a sample of the errors I see in the log. There are various other flavors of errors. I can definitely post more if that’s helpful to anyone.

Answering my own question a little bit, I restarted the matter server and, at least for the last few hours, that seems to have helped a lot.

I’m not sure why this would be especially different than restarting the whole home assistant server, but it seems to have been. Just been a few hours so far though.

I personally have noticed more issues with the Home Assistant Matter server than with the Apple Matter network.

Even though Home Assistant uses the Apple TV for relaying and all switches are on v1.0.5, some switches take longer to come online in Home Assistant and still randomly drop here and there, even though they are rock solid in Apple Home.

I’m hoping when matter 1.4 and thread 1.4 get adopted a lot of these issues should stop when all devices become part of the same fabric vs every one creating their own fabric to run.

Can’t comment on what’s going on with your instance but just sharing my woes with home assistant

Welp. After restarting the matter server everything was fine for about 8 hours and then the matter network completely failed again. Every switch went offline. Lights were randomly turning on and off in my house.

Oh well.

I can’t really afford to wait for matter 1.4 for things to work with at least some degree of reliability. Not sure what I’m gonna do at this point.

This feels like a Thread mesh problem to me rather than Matter. A couple of thoughts:

  • Is your HA Connect ZBT-1 affected by the hardware problem? See "Home Assistant Connect ZBT-1 issue and replacement" on the Home Assistant site. I’ve seen discussions from people where a device that was working can degrade from the overheating and eventually become unstable. If you need an alternative quickly, the SMLIGHT SLZB-07 and Sonoff ZB Dongle-E work great for me.
  • Are you using the ZBT-1 in multiprotocol mode? Consensus appears to be forming that this is not a good idea. If you are using both Zigbee and Thread it seems to be quite a bit more reliable if you use separate adapters. I did the migration away from multiprotocol mode myself a while ago and am glad I did.
  • If you are using the ZBT-1 in multiprotocol mode then this doesn’t apply: Have you restarted the OTBR (OpenThread Border Router) addon? When you are running in non-multiprotocol mode you should have a separate OpenThread Border Router addon that you can restart. It has its own logs that might give insight as well.
  • Also non-multiprotocol: you can turn on the OTBR web GUI and look around. The OTBR Thread device topology map is unreliable and almost useless, but you might be lucky and get some insight.

Anyway, to me, this feels like comms is breaking down and that would be a Thread issue rather than Matter. The ZBT-1 would be my prime suspect, especially if in multiprotocol mode.

Also, when I was using multiprotocol mode I found that a full hardware power cycle solved problems that other lesser restarts didn’t. (I was using multiprotocol on the HA Yellow built-in radio, YMMV)

I’m sorry this is happening to you. Spouse approval factor is a HUGE deal.

Addendum:

The SLZB-07 isn’t so easy to get quickly but the vendor directly provides flashing tools to install the thread radio software. Also, the SLZB-07 has hardware flow control enabled and this can be important for reliability.

The ZB Dongle-E will need a 3rd party flashing tool to move it from its own firmware to thread for the first time. Under the covers it’s the same tool but the process is a little more convenient for the smlight device.

Once they are running thread firmware then HA’s OTBR addon will manage the firmware updates itself. Both are viable alternatives if your ZBT-1 is at all suspect. Use whatever you can get the quickest.

Great thoughts Peter. Thx!

I agree that this is very likely an underlying Thread issue. I’m just seeing it in the Matter layer due to the problem below.

  • I’m not using the ZBT-1 in multiprotocol mode.
  • I didn’t know about the hardware issue with the ZBT-1. I will look into whether this is impacting me. Great idea.
  • I’ll also poke around in the OTBR gui to see if there is anything useful there.

I’m also wondering if I should install a 2nd Thread border router in my home. I’ve got two floors and I’m wondering if the switches on the floor where I don’t have a TBR are having a hard time communicating back over the mesh network to my Home Assistant box.

I don’t really want a smart speaker like an Alexa or Apple Home in the mix. Anyone have any good suggestions for some kind of stand alone TBR I could use here?

-harryh

I have recently been upgrading my home in the same way and have about 25 White switches installed so far. I have an HA Yellow and a HomePod as my TBRs. I see the same behavior you mention: once in a while a switch will go offline in HA, but eventually it comes back on its own. I am just ignoring it and hoping it will become more stable as Thread support improves in HA.

The reason I have a HomePod in the mix is that my Nanoleaf Downlights in my kids’ rooms are better supported there, and I can easily share the device from the HomePod into HA to control it from HA. Twice over the last week or so all my Thread devices were offline when I woke up, and I found that my HomePod was causing the problem, so I moved it from upstairs to the computer closet with my HA. That seems to have helped.

Now that I have Thread throughout the house via switches, I don’t think it’s necessary to have TBRs throughout the house, as the switches act as extenders for the Thread network. I know this isn’t really specific to your situation, but I thought I’d share my setup in case it helps you decide how to move forward. If you do get another TBR, I would recommend a HomePod as it’s extremely easy to integrate into your existing HA Thread network.

There is some really good information about router selection and network forming on the OpenThread site (Node Roles and Types | OpenThread) - it doesn’t work the way many people expect it to work. In particular, note the part about how router-eligible devices (such as your White 2-1 switches) automatically make themselves a router if needed - especially to heal a network partition. The mesh will try to maintain 16-23 routers at all times, with a maximum of 32.

Adding a second border router makes things more complicated for Home Assistant and wouldn’t help the mesh anyway. A TBR’s function is to bridge thread packets onto the wifi or ethernet network. While it might also contribute to mesh connectivity, it would only do so just like any other powered thread device. What wouldn’t help is that it creates a multipath route for the Linux kernel to the IPv6 prefix on the Thread mesh and the way this is implemented causes a bunch of non-determinism for HA that you could probably do without.

Oh interesting. Very helpful. Like I said, one of my concerns is that the path from the switches furthest away from my Home Assistant / TBR is pretty far. I thought by putting a 2nd TBR on the 2nd floor of my home it might make things a bit easier. But maybe that wouldn’t be so helpful. I do have a few more switches to install on that floor. Maybe once I do that the strength of the mesh upstairs will be better and I’ll see fewer problems in that area.

The one thing about Thread (and Zigbee) that makes me a bit uncomfortable is that the router selection process seems like it is a little naive. It doesn’t seem to solve for maximum coverage if you have a non-uniform radio environment. “Good enough” is good enough, it doesn’t aim for optimal.

The openthread guides talk about the process. As it says, the mesh has a goal of 16-23 routers. If a joining device detects that there are 16+ routers then it won’t automatically become a router - UNLESS it detects that it can see a node that the mesh can’t see. In that case it will make itself a router. If that pushes the mesh over 23, then another router will drop out (if appropriate).
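To make the rule above concrete, here is a toy model in Python of the promotion and demotion decisions as described in this post. It is a deliberate simplification and not OpenThread’s actual code: the real stack also considers things like link quality and random delays, and the function and constant names here are made up for illustration.

```python
# Thresholds as described in the post: the mesh aims for 16-23 routers
# and never allows more than 32.
ROUTER_UPGRADE_THRESHOLD = 16
ROUTER_DOWNGRADE_THRESHOLD = 23
MAX_ROUTERS = 32


def should_become_router(active_routers, sees_unreachable_node):
    """Decide whether a router-eligible device promotes itself (toy model).

    active_routers: number of routers the device currently sees in the mesh.
    sees_unreachable_node: True if the device can hear a node that the
        existing routers cannot reach.
    """
    if active_routers >= MAX_ROUTERS:
        return False  # hard cap, no room for another router
    if active_routers < ROUTER_UPGRADE_THRESHOLD:
        return True  # mesh is below its target, so promote
    # At or above the target: only promote to cover a node nobody else reaches.
    return sees_unreachable_node


def should_shed_router(active_routers):
    """True when the mesh is over its upper target and a router may step down."""
    return active_routers > ROUTER_DOWNGRADE_THRESHOLD


if __name__ == "__main__":
    print(should_become_router(10, False))  # below target
    print(should_become_router(18, False))  # target met, nothing new to cover
    print(should_become_router(18, True))   # target met, but fills a gap
    print(should_shed_router(24))           # over the upper target
```

The point of the sketch is that a device sitting in a well-covered spot has no reason to promote itself once the mesh has 16 routers, even if promoting it would shorten the path for other devices.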

It is easy to imagine a scenario where you have a cluster of well-connected routers upstairs, and a cluster downstairs, and only one single device bridging between the two. As far as the mesh is concerned there is sufficient end-to-end connectivity, even though promoting a few more devices that could see both clusters might be more optimal.
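The scenario above can be checked mechanically if you have a list of which routers can hear each other. Here is a small Python sketch that finds single points of failure by removing each node in turn and testing whether the rest of the mesh is still connected. The topology is entirely hypothetical (node names like "stairs" are invented), and the sketch does not cover how to get real neighbor data out of a Thread network.

```python
from collections import deque


def build_links(edges):
    """Turn a list of (a, b) pairs into a symmetric adjacency map."""
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def is_connected(links, nodes):
    """Breadth-first search restricted to `nodes`; True if all are reachable."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, ()):
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == nodes


def single_points_of_failure(links):
    """Return nodes whose removal would split the remaining mesh."""
    all_nodes = set(links)
    return sorted(
        node for node in all_nodes
        if not is_connected(links, all_nodes - {node})
    )


# Hypothetical mesh: a tight cluster upstairs, a tight cluster downstairs,
# and one device on the stairs that is the only link between them.
EDGES = [
    ("up1", "up2"), ("up1", "up3"), ("up2", "up3"),
    ("down1", "down2"), ("down1", "down3"), ("down2", "down3"),
    ("stairs", "up1"), ("stairs", "up2"),
    ("stairs", "down1"), ("stairs", "down2"),
]

if __name__ == "__main__":
    print(single_points_of_failure(build_links(EDGES)))
```

In this made-up layout the only node reported is the one on the stairs: the mesh counts as fully connected, yet everything between floors depends on that one device.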

An extra TBR wouldn’t help in the just-one bridging device case as it wouldn’t meet the promotion criteria to be a thread router.

What it could do is bridge two disconnected partitions via wifi/ethernet. Apple does leverage this with their HomePods. Home Assistant does not do this sort of thing. It merely sets a multipath route in the Linux kernel and leaves it at that. At best, the kernel’s packet forwarding would pick one of the two TBRs at random when sending the packet and that’s it. If it picked the wrong one, then too bad. It could just as easily send packets for nearby devices to the far-away TBR instead, and that would be no help at all.

The naming is unfortunate. It would be natural to assume that a TBR would improve the mesh connectivity but that is not its purpose.

What we really need is a good Thread node/link explorer. Like what Zigbee2MQTT has. Being able to see a reliable visual map of which nodes can see what devices/routers/etc and their link quality would be priceless. And while here, we could use Z2M’s bindings editor for Matter as well.

Are you using multi-admin? I’ve seen a few reports of matter network congestion issues when multi-admin is enabled on networks around your size.

Nope, no multi-admin. I have the most vanilla possible setup: Home Assistant Green, ZBT-1, bunch of inovelli switches. That’s literally it.

How many border routers do you have? I don’t have 50 Thread devices, but I’ve got 20 or so Thread bulbs. And once I went up to my current 3 border routers, a lot of my issues went away.

Just updating my own thread here: After a Matter Server restart I’m back to having a pretty rock solid network again. I really can’t say why, but my problems seem to have disappeared for the moment. Kinda weird, but nice!

Did you install today’s HAOS Matter server update yet???