HARD FAULT during xTaskResumeAll after ending a DFU session and disabling the Softdevice

Hi all,

I am working on an nRF52840 chip with SDK 17.0.2.

Our application runs smoothly, and we want to add the capability of upgrading another nRF52840 chip using the Nordic DFU service.

In order to do that, we turn off our application RADIO, suspend most of our FreeRTOS tasks, and then call vTaskSuspendAll() to suspend the scheduler.

Then we enable the Softdevice (as part of ble_stack_init) and send the image to the remote nrf52840.

This works well.

When we finish the DFU process we call nrf_sdh_disable_request and wait until we know that the Softdevice is disabled.

Then we resume our tasks and try to resume the scheduler by calling xTaskResumeAll().

The problem is that we get the following hard fault:

<error> hardfault: HARD FAULT at 0x00029350
<error> hardfault: R0: 0x00000A85 R1: 0x08F38168 R2: 0x00684088 R3: 0x0000000B
<error> hardfault: R12: 0x2000FE40 LR: 0x0002AEB7 PSR: 0x21000200
<error> hardfault: Cause: Data bus error (return address in the stack frame is not related to the instruction that caused the error).

The call stack is: (screenshot not preserved)

What am I doing wrong?

Thanks in advance for any assistance,

Rafalino

  • Then it is possible that we are looking in the wrong direction with memory corruption. If this is not memory corruption caused by a stack overflow, then the application is somehow passing the wrong timer ID.

    You can add a small code snippet like the one below in prvProcessReceivedCommands, just before uxListRemove:

    if (pxTimer->pvTimerID == (void *)0x1E200000)
    {
        static volatile uint32_t counter = 0;
        counter++;     // <-- Put a breakpoint at this line
    }

    Compile your code, flash it, and start it in the debugger. Put a breakpoint on the "counter++" line and run the application in an attempt to trigger the hardfault.

    The debugger should halt at the breakpoint, and the call stack should then let you browse through the functions that led to this breakpoint. Try to understand the context in which this value was passed to pvTimerID.
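
    If the exact bad value changes between runs, a broader trap may be easier to hit. The following is a minimal sketch, assuming that every valid timer handle lives in the nRF52840's 256 KB of data RAM starting at 0x20000000. addr_in_ram is a hypothetical helper written for this post; it is not part of FreeRTOS or the SDK.

```c
#include <stdbool.h>
#include <stdint.h>

/* nRF52840 data RAM: 256 KB starting at 0x20000000. */
#define RAM_BASE 0x20000000u
#define RAM_SIZE 0x00040000u

/* Hypothetical helper: true if addr falls inside data RAM. */
bool addr_in_ram(uintptr_t addr)
{
    return (addr >= RAM_BASE) && (addr < (RAM_BASE + RAM_SIZE));
}
```

    In prvProcessReceivedCommands the condition could then become `if (!addr_in_ram((uintptr_t)pxTimer))`, with the breakpoint still on the counter++ line. The value 0x1E200000 from the snippet above fails this check, while a normal RAM address such as the R12 value 0x2000FE40 passes.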

    Please note that there is a known bug in the libuarte library when using FreeRTOS: the macros used to initialize libuarte instances initialize the app_timer_freertos instances incorrectly (after an incompatible cast). Please check whether you are affected by this.

  • Hello Susheel,

    I have added the code that you suggested but could not see much in the call stack.

    As for the UARTE, we do use UARTE, and I would like to add the workaround that you suggested, but I could not fully understand where to set p_xLibUarteCOM0_app_timer_data->end_val = 0x0ULL;

    Can you clarify?

  • Can you show me how you are waiting to know that the softdevice is disabled?

    This is the part of disabling the SD:

    err_code = nrf_sdh_disable_request();
    APP_ERROR_CHECK(err_code);

    /* sd_is_enabled is set to false once we know that the SD is disabled (in the brg_dfu_state_obs handler) */
    while (sd_is_enabled)
    {
        NRF_LOG_PROCESS();
        wlt_utils_watchdog_feed();
    }

    if (!xTaskResumeAll())
    {
        taskYIELD();
    }

    And here is the event handler : 

    static void brg_dfu_state_obs(nrf_sdh_state_evt_t state, void * p_context)
    {
        UNUSED_PARAMETER(p_context);

        NRF_LOG_INFO("%s state = %d", __FUNCTION__, state);
        if (state == NRF_SDH_EVT_STATE_DISABLED)
        {
            sd_is_enabled = false;
        }
    }
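
    The declaration of sd_is_enabled is not shown above. A minimal sketch of the same wait, assuming the flag is declared volatile (so the loop re-reads it on every pass) and adding an upper bound so the loop cannot spin forever. wait_for_sd_disabled, idle_hook_t and demo_idle are hypothetical names; on the target the idle hook would stand in for NRF_LOG_PROCESS() and the watchdog feed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cleared by the state observer; volatile so the loop re-reads it each pass. */
static volatile bool sd_is_enabled = true;

/* Hypothetical hook standing in for log processing and the watchdog feed. */
typedef void (*idle_hook_t)(void);

/* Poll the flag at most max_polls times.
 * Returns true if the flag was cleared, false if the bound was reached. */
bool wait_for_sd_disabled(uint32_t max_polls, idle_hook_t idle)
{
    while (sd_is_enabled)
    {
        if (max_polls-- == 0u)
        {
            return false;
        }
        if (idle != NULL)
        {
            idle();
        }
    }
    return true;
}

/* Host-side stand-in: pretends the observer fires during the third poll. */
static uint32_t polls_seen;
static void demo_idle(void)
{
    if (++polls_seen == 3u)
    {
        sd_is_enabled = false;
    }
}
```

    A false return value would show that the DISABLED state event never arrived, which is a different failure from a crash in xTaskResumeAll.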

    Are you using XTAL HFCLK and/or LFCLK?

    We are calling the following in our application before we even start the Softdevice:

    nrf_drv_clock_hfclk_request(NULL); 
    nrf_drv_clock_lfclk_request(NULL);

    Any recommendations?

  • What kind of clock the softdevice thinks it is using depends on the first parameter you give to sd_softdevice_enable. If you are using nrf_sdh_enable_request, then the LFCLK the softdevice uses depends on the configuration you set with NRF_SDH_CLOCK_LF_SRC in the sdk_config.h file. If this is set to 0, then you are making the softdevice think that it is using the internal RC, which makes me think that the calibration might

    The while loop waiting on sd_is_enabled will still not help if the issue is the extra delay needed for clock calibration (this is only possible if you have set NRF_SDH_CLOCK_LF_SRC to 0).

    If none of that is relevant, then try adding about 80-100 ms of extra delay just before xTaskResumeAll to check whether this issue is related to clock calibration, because the clock calibration should not take more than 100 ms in the worst case before it disables the softdevice.

  • Hello Susheel,

    NRF_SDH_CLOCK_LF_SRC is indeed set to 0.

    I have added a delay of 200 ms before xTaskResumeAll, but it didn't help.

    /* sd_is_enabled is set to false once we know that the SD is disabled (in the brg_dfu_state_obs handler) */
    while (sd_is_enabled)
    {
        NRF_LOG_PROCESS();
        wlt_utils_watchdog_feed();
    }

    nrf_delay_ms(200);

    if (!xTaskResumeAll())
    {
        taskYIELD();
    }

    Any more suggestions?

  • If you already have the XTAL LFCLK, why use RC? You can change NRF_SDH_CLOCK_LF_SRC to 1.
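
    For reference, a sketch of what the related sdk_config.h entries might look like with a crystal. Only NRF_SDH_CLOCK_LF_SRC comes from this thread; the calibration and accuracy values are the ones commonly paired with a crystal and are assumptions here, so the accuracy in particular should be matched to the crystal on your board.

```c
/* sdk_config.h (fragment) */

/* LF clock source: 0 = RC, 1 = XTAL, 2 = Synth */
#define NRF_SDH_CLOCK_LF_SRC 1

/* RC calibration intervals; only meaningful with the RC source, so 0 for XTAL */
#define NRF_SDH_CLOCK_LF_RC_CTIV 0
#define NRF_SDH_CLOCK_LF_RC_TEMP_CTIV 0

/* Clock accuracy: 7 = 20 ppm (assumption; check your crystal's datasheet) */
#define NRF_SDH_CLOCK_LF_ACCURACY 7
```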

    If this does not fix it, then I am starting to believe that the problem is probably not related to the softdevice. We need to focus back on the tasks at hand. You already confirmed that it is not a stack overflow.

    If we can trust the hardfault stack frame, it suggests that the fault is in the timer task. So start by commenting out the creation of timers one by one until xTaskResumeAll no longer crashes. If all the timer creation code is commented out, no application timers are left, and you can still trigger the hardfault, then we can forget about the timer task and assume that the hardfault stack frame holds no useful information for debugging.

    In that case, move on to commenting out the creation of tasks one by one until xTaskResumeAll no longer causes this hardfault. We would then at least know which task's resumption is causing it.
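
    Instead of commenting code in and out by hand, the bisection can be driven by a bit mask. This is only a sketch: startup_item_t, create_selected and the two stub creators are hypothetical names, and in a real project each create function would wrap one existing xTimerCreate() or app_timer_create() call.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical creation hook: returns true if the object was created. */
typedef bool (*create_fn_t)(void);

typedef struct
{
    const char *name;    /* label for logging which item was skipped */
    create_fn_t create;  /* wraps one timer (or task) creation call */
} startup_item_t;

/* Create only the items whose bit is set in enable_mask (bit i -> items[i]).
 * Returns the number of items actually created. */
size_t create_selected(const startup_item_t *items, size_t count, uint32_t enable_mask)
{
    size_t created = 0;

    for (size_t i = 0; (i < count) && (i < 32u); i++)
    {
        bool enabled = (enable_mask & (UINT32_C(1) << i)) != 0u;
        if (enabled && items[i].create())
        {
            created++;
        }
    }
    return created;
}

/* Stubs standing in for the real timer creation wrappers. */
static unsigned m_created_bits;
static bool create_timer_a(void) { m_created_bits |= 1u; return true; }
static bool create_timer_b(void) { m_created_bits |= 2u; return true; }

static const startup_item_t m_timers[] = {
    { "timer_a", create_timer_a },
    { "timer_b", create_timer_b },
};
```

    Calling create_selected(m_timers, 2, 0x2) would create only timer_b. Clearing one bit per run narrows down which timer has to exist for the fault to appear, and the same table can be reused for task creation.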

  • Hello Susheel,

    I put a breakpoint in timers.c inside xTimerGenericCommand, at the point where we call

     xReturn = xQueueSendToBackFromISR( xTimerQueue, &xMessage, pxHigherPriorityTaskWoken );

    Then I was able to look at the call stack.

    My findings are:

    We call nrf_ble_scan_start, and when we get BLE_GAP_EVT_DISCONNECTED we call nrf_ble_scan_start again (should we call some cleanup function first?).

    Then, what I see from the call stack is:

    • ble_conn_params.c has an on_disconnect event handler.
    • conn_handle is 0 in my case, and from
      ble_conn_params_instance_t * p_instance  = instance_get(conn_handle);
      we get a p_instance, but it has everything set to 0, including the timer ID.
    • So although (p_instance != NULL) is true, we get a garbage p_instance->timer_id, and then app_timer_stop(p_instance->timer_id) leads to the hard fault.

    My workaround, adding && p_instance->timer_id to the condition, seems to work, but I don't think this is the proper solution.

    What do you say?

    static void on_disconnect(ble_evt_t const * p_ble_evt)
    {
        ret_code_t err_code;
        uint16_t   conn_handle = p_ble_evt->evt.gap_evt.conn_handle;
        ble_conn_params_instance_t * p_instance = instance_get(conn_handle);

        if (p_instance != NULL && p_instance->timer_id) /* Workaround */
        {
            // Stop timer if running
            err_code = app_timer_stop(p_instance->timer_id);
            if (err_code != NRF_SUCCESS)
            {
                send_error_evt(err_code);
            }

            instance_free(p_instance);
        }
    }
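
    To illustrate why the NULL check alone does not catch this case, here is a small host-side model. It rests on two assumptions that are not confirmed in this thread: that the instance table starts out zero-initialized, and that a free slot is normally marked with an invalid connection handle only after the module has been initialized. All names below are simplified stand-ins, not the SDK's own code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONN_HANDLE_INVALID 0xFFFFu  /* stand-in for BLE_CONN_HANDLE_INVALID */
#define INSTANCE_COUNT      2u

typedef struct
{
    uint16_t conn_handle;
    void    *timer_id;
} instance_t;

/* Never initialized in this model, so every slot has conn_handle == 0. */
static instance_t m_instances[INSTANCE_COUNT];
static unsigned   m_stop_calls;

/* Return the slot whose handle matches, or NULL if there is none. */
instance_t *instance_get(uint16_t conn_handle)
{
    for (size_t i = 0; i < INSTANCE_COUNT; i++)
    {
        if (m_instances[i].conn_handle == conn_handle)
        {
            return &m_instances[i];
        }
    }
    return NULL;
}

/* Stand-in for app_timer_stop(); only counts how often it is called. */
void timer_stop(void *timer_id)
{
    (void)timer_id;
    m_stop_calls++;
}

/* Guarded disconnect handling: returns true only if a timer was stopped. */
bool on_disconnect_model(uint16_t conn_handle)
{
    instance_t *p_instance = instance_get(conn_handle);

    if ((p_instance == NULL) || (p_instance->timer_id == NULL))
    {
        return false;  /* no slot, or the slot was never set up */
    }
    timer_stop(p_instance->timer_id);
    p_instance->conn_handle = CONN_HANDLE_INVALID;
    return true;
}
```

    In this model instance_get(0) returns a non-NULL slot because connection handle 0 matches the zeroed entry, and only the extra timer_id check prevents the stop call. If the real module behaves the same way, the guard hides the symptom, and the question becomes whether the module was initialized before the disconnect event arrived.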
