The RF24 Core Library: Major Bug Fix
Recent Bug Fixes Affecting Auto-Ack/Pipe 0 and RF24 Core Lib w/Dynamic Payloads
So after all this time developing and maintaining the RF24 core library, I found yet another bug affecting the Auto-Acknowledgement functionality of the radios. An issue had long ago been identified and fixed regarding pipe 0, where the assigned reading address would be overwritten when transmitting, since the radios use pipe 0 exclusively for transmission. This affects the RX_ADDR_P0 register.
Somehow, we never realized that writing the receiving address back to the radio when switching from TX to RX mode would interfere with the reception of acknowledgement packets in TX mode, since that write overwrites the RX_ADDR_P0 register on the switch to RX. Now the radio library caches BOTH the RX & TX addresses for RX_ADDR_P0, and writes the appropriate one when switching between modes as required.
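To illustrate the idea, here is a minimal sketch of the caching scheme. This is not the actual RF24 source; it is a toy model (names like `Nrf24Model` and `cachedRxPipe0` are my own) showing how keeping both addresses cached lets the driver restore the right one into the single RX_ADDR_P0 register on every mode switch:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy model of the fix: both the pipe-0 reading address and the writing
// address are cached, and the correct one is restored into RX_ADDR_P0
// whenever the radio switches between RX and TX modes.
struct Nrf24Model {
    uint8_t RX_ADDR_P0[5];      // the one hardware register both modes need
    uint8_t cachedRxPipe0[5];   // user's pipe-0 reading address
    uint8_t cachedTxAddr[5];    // TX address (pipe 0 must match it for auto-ack)
    bool listening = false;

    void openReadingPipe0(const uint8_t* addr) {
        std::memcpy(cachedRxPipe0, addr, 5);
        if (listening) std::memcpy(RX_ADDR_P0, addr, 5);
    }
    void openWritingPipe(const uint8_t* addr) {
        std::memcpy(cachedTxAddr, addr, 5);
        if (!listening) std::memcpy(RX_ADDR_P0, addr, 5);
    }
    void startListening() {
        // TX -> RX: restore the user's pipe-0 reading address
        listening = true;
        std::memcpy(RX_ADDR_P0, cachedRxPipe0, 5);
    }
    void stopListening() {
        // RX -> TX: pipe 0 must hold the TX address so ACK packets arrive
        listening = false;
        std::memcpy(RX_ADDR_P0, cachedTxAddr, 5);
    }
};
```

The point is that neither mode switch can permanently clobber the other mode's address, which is exactly what was happening before the fix.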
This enables full functionality of the radios on all pipes, since previously, auto-ack would not work properly on pipe 0 in some cases. The changes have been committed to the source code and will be included in the next release, after v1.4.11.
There is a minor impact to throughput, but after careful consideration, these changes were included to fully enable the radios' capabilities. We are working on a more efficient resolution.
This bug was mainly discovered through my work on the nrf_to_nrf driver for nRF52x radios, which I already had caching both the TX & RX addresses; I realized the RF24 driver didn't do that.
***
I also found a bug that appeared to affect SPI functionality on Linux devices, but it turns out it affects all devices using the RF24 core driver with Dynamic Payloads. This includes the entire RF24 Comm Stack.
At first I thought it affected SPI and assumed I was getting bad data, but nothing I adjusted helped. Eventually I narrowed it down to the available() function consistently returning true. Once that was identified, I discovered it had to do with 0-length payloads, and that calling radio.read() has no effect when this happens: the RX buffers need to be flushed. So that's what is being implemented in the RF24 core layer. When using Dynamic Payloads, if the reported payload size is either greater than 32 or 0, the payload is corrupt and the buffers need to be flushed, so the RF24 layer will now do that for both cases.
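The check itself can be sketched in a few lines. This is an illustrative stand-in, not the actual RF24 code (the function name and the `flushedOut` flag are mine); it captures the rule: with Dynamic Payloads, a reported width of 0 or more than 32 bytes means corruption, and reading the payload won't clear it, so the RX FIFO gets flushed instead:

```cpp
#include <cstdint>

// Illustrative sketch of the corrupt-payload check. The nRF24L01 caps
// payloads at 32 bytes, so a reported dynamic width of 0 or >32 can only
// mean a corrupt payload sitting in the RX FIFO.
uint8_t getDynamicPayloadSizeChecked(uint8_t reportedWidth, bool& flushedOut) {
    flushedOut = false;
    if (reportedWidth == 0 || reportedWidth > 32) {
        // In the real driver, a flush_rx command would be issued to the
        // radio here, since R_RX_PAYLOAD cannot clear a corrupt payload.
        flushedOut = true;
        return 0;  // the core layer now reports 0 for a corrupt payload
    }
    return reportedWidth;  // a sane width passes through unchanged
}
```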
After all this time and searching for the problem, it was one line of code that had to be changed.
I'd included failure handling in all of the RF24Gateway examples due to this bug, which would intermittently cause the radios to become unresponsive, requiring them to be restarted/reconfigured. I searched high and low for a long time to find it, but at one point I thought it came down to an issue with the network layer not being able to process information fast enough.
The current approach is for the update() function in RF24Network to return a new system type, indicating there has been corruption and the RX buffers have been flushed. The core RF24 layer will simply return 0 for the Dynamic Payload Length.
This is being patched; the fix will be available in the source code very soon and included in the next release. The Linux installer downloads directly from the source code.
RF24Gateway now displays a count of corrupted payloads
With these changes, I've begun keeping track of how often this happens. On faster devices like the RP2040 or Raspberry Pi, it seems to be more prevalent, and of course it happens far more on a device that is doing more reception than transmission. In the picture above, the network has detected and flushed 32 corrupt payloads over a short period of time. This was how I was able to replicate the bug: by utilizing RF24Gateway as a testing tool and hammering it with data from another RPi and from an Arduino. Over time, the bug and its workings became clearer, so I was able to narrow it down to the network.update() function in RF24Network, and then further down the stack to the getDynamicPayloadSize() function of the RF24 core library.
I am also now logging the data on Arduino devices, via MQTT and Node-RED. I'm testing on an Arduino Nano, Due, and RP2040 to see just how often this affects slower devices. As of writing this, no data is available yet, but I am running the tests long-term, so data will come in eventually, and I will report back here on this blog post.
Update: The issue affected the RP2040 doing standard communication in my "production" environment after a few days. I've filed a ticket with Nordic in hopes of identifying whether this is a known issue, a new issue, or something else:
https://devzone.nordicsemi.com/f/nordic-q-a/121237/nrf24l01-radio-r_rx_pl_wid-returns-0