HARD FAULT during xTaskResumeAll after ending a DFU session and disabling the Softdevice

Hi all,

I am working on an nRF52840 chip with SDK 17.0.2.

Our application runs smoothly, and we want to add the capability to upgrade another nRF52840 chip using the Nordic DFU service.

In order to do that, we turn off our application radio, suspend most of our FreeRTOS tasks, and then call vTaskSuspendAll to suspend the scheduler.

Then we enable the Softdevice (as part of ble_stack_init) and send the image to the remote nrf52840.

This works well.

When we finish the DFU process we call nrf_sdh_disable_request and wait until we know that the Softdevice is disabled.

Then we resume our tasks and try to resume the scheduler by calling xTaskResumeAll().

The problem is that we get the following hard fault:

<error> hardfault: HARD FAULT at 0x00029350
<error> hardfault: R0: 0x00000A85 R1: 0x08F38168 R2: 0x00684088 R3: 0x0000000B
<error> hardfault: R12: 0x2000FE40 LR: 0x0002AEB7 PSR: 0x21000200
<error> hardfault: Cause: Data bus error (return address in the stack frame is not related to the instruction that caused the error).
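
For reference, the cause string in this log appears to correspond to what the Cortex-M4 calls an imprecise data bus error (the IMPRECISERR flag, bit 10 of the CFSR register at 0xE000ED28). Below is a minimal sketch of decoding the bus-fault flags from a captured CFSR value. The function name `bus_fault_cause` and the idea of passing in a captured value are illustrative assumptions, not part of the SDK's hardfault library.

```c
#include <stdint.h>

/* Bus-fault status flags inside the Cortex-M4 CFSR register (0xE000ED28). */
#define CFSR_IBUSERR     (1UL << 8)   /* instruction bus error */
#define CFSR_PRECISERR   (1UL << 9)   /* precise data bus error */
#define CFSR_IMPRECISERR (1UL << 10)  /* imprecise data bus error */
#define CFSR_UNSTKERR    (1UL << 11)  /* fault while unstacking on exception return */
#define CFSR_STKERR      (1UL << 12)  /* fault while stacking on exception entry */

/* Hypothetical helper: returns a short description of the first bus-fault
 * flag found in a captured CFSR value. */
const char *bus_fault_cause(uint32_t cfsr)
{
    if (cfsr & CFSR_IBUSERR)     return "instruction bus error";
    if (cfsr & CFSR_PRECISERR)   return "precise data bus error";
    if (cfsr & CFSR_IMPRECISERR) return "imprecise data bus error";
    if (cfsr & CFSR_UNSTKERR)    return "unstacking bus error";
    if (cfsr & CFSR_STKERR)      return "stacking bus error";
    return "no bus fault flagged";
}
```

With an imprecise error, the stacked PC (0x00029350 here) is not necessarily the instruction that performed the bad access. One common debugging option on Cortex-M4 is to temporarily set the DISDEFWBUF bit in the ACTLR register so that buffered writes fault precisely, at the cost of performance.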

The call stack is (screenshot not preserved):

What am I doing wrong?

Thanks in advance for any assistance,

Rafalino

Parents
  • Then it is possible that we are focusing too much on memory corruption. If this is not memory corruption caused by a stack overflow, then the application is somehow passing the wrong timer ID.

    You can add a small code snippet like the one below in prvProcessReceivedCommands, just before uxListRemove:

    if (pxTimer->pvTimerID == (void *)0x1E200000)
    {
        static volatile uint32_t counter = 0;
        counter++;     // <-- Put a breakpoint at this line
    }

    Compile your code, flash it, and start it in the debugger. Put a breakpoint on the "counter++" line and run the application in an attempt to trigger the hard fault.

    The debugger should halt at the breakpoint, and the call stack should then let you browse through the functions that led to it. Try to understand how this value ended up being passed as pvTimerID.

    Please note that there is a known bug in the libuarte library when using FreeRTOS: the macros used to initialize libuarte instances initialize the app_timer_freertos instances incorrectly (after an incompatible cast). Please check whether you are affected by this.
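
The breakpoint trap above catches one specific bad value. A slightly more general sketch flags any timer ID that points outside RAM. It assumes the nRF52840's 256 KB of RAM mapped at 0x20000000, and it assumes that every timer ID in the application is meant to be a pointer to a RAM object. `ptr_in_ram` is a hypothetical helper, not a FreeRTOS or SDK function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed nRF52840 RAM layout: 256 KB starting at 0x20000000. */
#define RAM_START 0x20000000UL
#define RAM_SIZE  (256UL * 1024UL)

/* Hypothetical helper: true if addr lies inside the assumed RAM region. */
bool ptr_in_ram(uintptr_t addr)
{
    return addr >= RAM_START && addr < (RAM_START + RAM_SIZE);
}
```

Under those assumptions, the trap condition could become `!ptr_in_ram((uintptr_t)pxTimer->pvTimerID)`, which would also catch corrupted values other than 0x1E200000.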

Children
  • Hello Susheel,

    I have added the code that you suggested but could not see much in the call stack:

    As for the UARTE, we do use UARTE and I would like to add the workaround that you suggested, but I could not fully understand where to set p_xLibUarteCOM0_app_timer_data->end_val = 0x0ULL;

    Can you clarify?

  • The issue is not with using the UARTE but with using libuarte. If you are not using libuarte, then this is not the same issue.

    But if you are using libuarte, then there is a bug in nrf_libuarte: a macro that initializes the nrf_libuarte_async_t structure, in particular p_app_timer_t, does so incorrectly when using FreeRTOS (since app_timer_freertos.c casts the timer differently than app_timer.c). The dirty hack was to set .end_val in the struct below to 0.

    static app_timer_t xLibUarteCOM0_app_timer_data =
    {
    .end_val = 0xFFFFFFFFFFFFFFFFULL,
    };

    This is just to make the FreeRTOS app_timer initialization pass; the structures are then properly initialized later.

    If none of the above works, then I suspect this could be related to tickless mode (have you enabled it in your FreeRTOSConfig.h file?).

    If the chip is waking up from sleep at the moment you resume all tasks, then it might somehow be related to the port-specific changes we made to the tick interrupt handling. Can you disable tickless mode and see whether you can still reproduce it?

    So far, based on your investigation:

    1) You confirmed that no stack overflows are caught when enabling stack overflow checks and increasing the timer stack.

    2) You applied, or will apply, the workaround given in the link if you are using the nrf_libuarte library.

    3) Try to see whether disabling tickless mode has the same effect.

  • Hey Susheel,

    We are using nrf_libuarte, but I uninitialized that interface and disabled my UART task before enabling the Softdevice, so I don't think this is the cause.

    As for disabling tickless mode: we use configUSE_TICKLESS_IDLE = 0 by default.

    What I see in my testing is the following. I suspend all my application tasks, suspend the scheduler (via vTaskSuspendAll), enable the Softdevice, and start BLE scanning. If I then power off the remote chip before a BLE connection is actually established, so that the connection fails, and afterwards disable the Softdevice and resume the scheduler, I do NOT see the hard fault and I am able to resume my application tasks.

    It feels as if there is an internal connection timeout in the Softdevice that is triggered once there is a connection, and that we somehow don't clean up when we disable the Softdevice.

    Does that make sense?

    Rafalino

  • Hmm, interesting observation.

    When we finish the DFU process we call nrf_sdh_disable_request and wait until we know that the Softdevice is disabled.

    Can you show me how you are waiting to know that the softdevice is disabled?

    In the FreeRTOS examples we deliver with our SDK, the softdevice events are pulled in a task called "softdevice_task".
    I assume you are not using this task, since you suspend all tasks just before enabling the softdevice. That also makes me assume that all softdevice and BLE activity is handled with bare-metal-style event handling.

    Rafalino said:

    It feels as if there is an internal connection timeout in the Softdevice that is triggered once there is a connection, and that we somehow don't clean up when we disable the Softdevice.

    Does that make sense?

    It seems like, for some reason, a longer delay is needed than the wait in your application before the softdevice is really disabled. Are you using XTAL HFCLK and/or LFCLK? Or are you only using the internal RC for the LFCLK? We have noticed that if you use internal RC clocks and disable the softdevice in the midst of an internal clock calibration, then the softdevice disable function might return normally but take longer than expected to actually disable the softdevice, since it is busy calibrating the internal clocks before actually servicing the disable request.

  • Can you show me how you are waiting to know that the softdevice is disabled?

    This is the part of disabling the SD:

    err_code = nrf_sdh_disable_request();
    APP_ERROR_CHECK(err_code);

    /* sd_is_enabled is changed to false once we know that the SD is disabled (in the brg_dfu_state_obs handler) */
    while (sd_is_enabled)
    {
        NRF_LOG_PROCESS();
        wlt_utils_watchdog_feed();
    }

    if (!xTaskResumeAll())
    {
        taskYIELD();
    }

    And here is the event handler:

    static void brg_dfu_state_obs(nrf_sdh_state_evt_t state, void * p_context)
    {
        UNUSED_PARAMETER(p_context);

        NRF_LOG_INFO("%s state = %d", __FUNCTION__, state);
        if (state == NRF_SDH_EVT_STATE_DISABLED)
        {
            sd_is_enabled = false;
        }
    }
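
The wait loop above never exits if the disabled event does not arrive, and the flag it polls is written from the state observer, so it should be declared `volatile`. Below is a host-runnable sketch of a bounded wait with the SDK calls stubbed out. `wait_until_sd_disabled`, `on_sd_disabled`, `poll_hook_t` and `fake_poll_hook` are hypothetical names; the hook stands in for `NRF_LOG_PROCESS()` and the watchdog feed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Written from the state observer, read from the waiting loop. */
static volatile bool sd_is_enabled = true;

/* Work to do on every poll (log processing, watchdog feed, ...). */
typedef void (*poll_hook_t)(void);

/* To be called from the state observer when the DISABLED state is reported. */
void on_sd_disabled(void)
{
    sd_is_enabled = false;
}

/* Polls until the flag clears or the poll budget runs out.
 * Returns true if the disabled state was observed. */
bool wait_until_sd_disabled(uint32_t max_polls, poll_hook_t hook)
{
    while (sd_is_enabled && max_polls-- > 0)
    {
        if (hook != NULL)
        {
            hook();
        }
    }
    return !sd_is_enabled;
}

/* Host-side stand-in: pretend the DISABLED event arrives on the third poll. */
static uint32_t fake_polls = 0;

void fake_poll_hook(void)
{
    if (++fake_polls == 3)
    {
        on_sd_disabled();
    }
}
```

If the wait returns false on the target, the application can log that fact instead of spinning until the watchdog resets it. This does not address a disable that is reported before it has fully taken effect.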

    Are you using XTAL HFCLK and/or LFCLK?

    We are calling the following in our application before we even start the Softdevice:

    nrf_drv_clock_hfclk_request(NULL); 
    nrf_drv_clock_lfclk_request(NULL);

    Any recommendations?
