HARD FAULT during xTaskResumeAll after ending a DFU session and disabling the Softdevice

Hi all,

I am working on nrf52840 chip with SDK 17.0.2.

Our application runs smoothly, and we want to add the capability of upgrading another nrf52840 chip using the Nordic DFU service.

In order to do that we turn off our application RADIO, suspend most of our FreeRTOS tasks, and then call vTaskSuspendAll to suspend the scheduler.

Then we enable the Softdevice (as part of ble_stack_init) and send the image to the remote nrf52840.

This works well.

When we finish the DFU process we call nrf_sdh_disable_request and wait until we know that the Softdevice is disabled.

Then we resume our tasks and try to resume the scheduler by calling xTaskResumeAll();

The problem is that we get the following hard fault:

<error> hardfault: HARD FAULT at 0x00029350
<error> hardfault: R0: 0x00000A85 R1: 0x08F38168 R2: 0x00684088 R3: 0x0000000B
<error> hardfault: R12: 0x2000FE40 LR: 0x0002AEB7 PSR: 0x21000200
<error> hardfault: Cause: Data bus error (return address in the stack frame is not related to the instruction that caused the error).

The call stack is: (screenshot not preserved)

What am I doing wrong?

Thanks in advance for any assistance,

Rafalino

  • Then it is possible that we are focusing too much on memory corruption. If this is not memory corruption caused by a stack overflow, then the application is somehow passing the wrong timer ID.

    You can add a small code snippet like the one below in prvProcessReceivedCommands, just before uxListRemove:

    if (pxTimer->pvTimerID == (void *)0x1E200000)
    {
        static volatile uint32_t counter = 0;
        counter++;     // <-- Put a breakpoint at this line
    }

    Compile your code, flash and start the code in the debugger. Put the breakpoint at the "counter++" and run the application in an attempt to trigger the hardfault. 

    The debugger should halt at the breakpoint and now your function call stack should allow you to browse through the functions that lead to this breakpoint. Try to understand the context of how this value has been passed to pvTimerID.

    Please note that there is a known bug in the libuarte library when using FreeRTOS: the macros used to initialize libuarte instances initialize the app_timer_freertos instances incorrectly (after an incompatible cast). Please check if you are affected by this.

  • Hello Susheel,

    I have added the code that you suggested but could not see much on the call stack.

    As for the UARTE, we do use UARTE and I would like to add the workaround that you suggested, but I could not fully understand where to set p_xLibUarteCOM0_app_timer_data->end_val = 0x0ULL;

    Can you clarify?

  • Hey Susheel,

    We are using nrf_libuarte, but I have uninitialized that interface and disabled my UART task before enabling the Softdevice, so I don't think that this is the case.

    As for disabling tickless idle: we are using configUSE_TICKLESS_IDLE = 0 by default.

    What I see in my testing is the following. If I suspend all my application tasks, suspend the scheduler (via vTaskSuspendAll), enable the Softdevice and start BLE scanning, then power off the remote chip before a BLE connection is actually established so that the connection fails, and finally disable the Softdevice and resume the scheduler, I do NOT see the hard fault and I am able to resume my application tasks.

    It feels as if there is an internal connection timeout in the Softdevice that is triggered once there is a connection, and that we somehow don't clean it up when we disable the Softdevice.

    Does that make sense?

    Rafalino

  • Hmm, interesting observation.

    When we finish the DFU process we call nrf_sdh_disable_request and wait until we know that the Softdevice is disabled.

    Can you show me how you are waiting to know that the softdevice is disabled?

    In the FreeRTOS examples we deliver with our SDK, the softdevice events are pulled in a task called "softdevice_task".
    I am assuming that you are not using this task, since you suspend all tasks just before enabling the softdevice. That also makes me assume that all the softdevice and BLE activity is handled with bare-metal style event handling.

    Rafalino said:

    It feels as if there is an internal connection timeout at the Softdevice that is triggered once there is a connection that we somehow don't clean when disable the Softdevice

    Does that make sense ? 

    It seems that, for some reason, a longer delay is needed than the wait you have in your application to know that the softdevice is disabled. Are you using XTAL for the HFCLK and/or LFCLK, or are you only using the internal RC for the LFCLK? We have noticed that if you are using the internal RC clocks and disable the softdevice in the midst of an internal clock calibration, the softdevice disable function might return normally but take longer than expected to actually disable the softdevice, since it is busy calibrating the internal clocks before actually servicing the disable request.

  • Can you show me how you are waiting to know that the softdevice is disabled?

    This is the part of disabling the SD:

    err_code = nrf_sdh_disable_request();
    APP_ERROR_CHECK(err_code);

    /* sd_is_enabled is changed to false once we know that the SD is disabled (on brg_dfu_state_obs handler)*/
    while (sd_is_enabled)
    {
        NRF_LOG_PROCESS();
        wlt_utils_watchdog_feed();
    }

    if (!xTaskResumeAll())
    {
        taskYIELD();
    }

    And here is the event handler:

    static void brg_dfu_state_obs(nrf_sdh_state_evt_t state, void * p_context)
    {
        UNUSED_PARAMETER(p_context);

        NRF_LOG_INFO("%s state = %d", __FUNCTION__, state);
        if (state == NRF_SDH_EVT_STATE_DISABLED)
        {
            sd_is_enabled = false;
        }
    }
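
    For reference, the wait-on-a-flag pattern above can be sketched on a host machine as follows. This is only a simulation under stated assumptions: every demo_ name, the DEMO_EVT_STATE_DISABLED value, and the poll budget are hypothetical and not part of the SDK. It shows two things that may be worth checking in the real code: the flag is declared volatile because it is written from a handler and read in a loop, and the loop gives up after a bounded number of polls instead of spinning forever.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical event value for this sketch; not the SDK enum. */
    #define DEMO_EVT_STATE_DISABLED 3

    /* volatile: written by the observer, read by the polling loop. */
    static volatile bool m_sd_is_enabled = true;

    /* Simulation only: polls remaining before the "disabled" event arrives.
       0 means the event never arrives. */
    static uint32_t m_polls_until_disabled;

    /* Mirrors the state observer: clears the flag on the disabled event. */
    static void demo_state_observer(int state)
    {
        if (state == DEMO_EVT_STATE_DISABLED)
        {
            m_sd_is_enabled = false;
        }
    }

    /* Simulation only: stands in for whatever delivers state events. */
    static void demo_poll_events(void)
    {
        if (m_polls_until_disabled > 0 && --m_polls_until_disabled == 0)
        {
            demo_state_observer(DEMO_EVT_STATE_DISABLED);
        }
    }

    /* Re-arms the simulation. */
    void demo_reset(uint32_t polls_until_disabled)
    {
        m_sd_is_enabled        = true;
        m_polls_until_disabled = polls_until_disabled;
    }

    /* Returns true if the flag was cleared within max_polls iterations. */
    bool demo_wait_for_disable(uint32_t max_polls)
    {
        while (m_sd_is_enabled && max_polls > 0)
        {
            demo_poll_events();
            max_polls--;
        }
        return !m_sd_is_enabled;
    }
    ```

    In the real application the loop body would keep calling NRF_LOG_PROCESS() and feeding the watchdog as above; the bounded budget only makes a missing event visible instead of hanging silently.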

    Are you using XTAL HFCLK and/or LFCLK?

    We call the following in our application before we even start the Softdevice:

    nrf_drv_clock_hfclk_request(NULL);
    nrf_drv_clock_lfclk_request(NULL);

    Any recommendation?

  • What kind of clock the softdevice thinks it is using depends on the first parameter you give to sd_softdevice_enable. If you are using nrf_sdh_enable_request, then the LFCLK the softdevice uses depends on the configuration you set with NRF_SDH_CLOCK_LF_SRC in the sdk_config.h file. If this is set to 0, then you are making the softdevice think that it is using the internal RC, which makes me think that the calibration might be involved.

    The while loop waiting on sd_is_enabled will still not help if the issue is the extra delay needed for clock calibration (which is only possible if you have set NRF_SDH_CLOCK_LF_SRC to 0).

    If none of that is relevant, then try adding about 80-100 ms of extra delay just before xTaskResumeAll to check whether this issue is related to clock calibration or not, because the clock calibration should not take more than 100 ms in the worst case before the softdevice is disabled.

  • Hello Susheel,

    NRF_SDH_CLOCK_LF_SRC is indeed set to 0.

    I have added a delay of 200 ms before xTaskResumeAll, but it didn't help.

    /* sd_is_enabled is changed to false once we know that the SD is disabled (on brg_dfu_state_obs handler)*/
    while (sd_is_enabled)
    {
        NRF_LOG_PROCESS();
        wlt_utils_watchdog_feed();
    }

    nrf_delay_ms(200);

    if (!xTaskResumeAll())
    {
        taskYIELD();
    }

    Any more suggestions?

  • If you already have the XTAL LFCLK, why use RC? You can change NRF_SDH_CLOCK_LF_SRC to 1.
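
    As a rough sketch, the relevant sdk_config.h entries for an external 32.768 kHz crystal might look like the fragment below. The calibration intervals are set to 0 because they only apply to the RC source. The accuracy value is an assumption (7 corresponds to 20 ppm in the SDK's list) and has to match the crystal actually fitted on your board.

    ```c
    /* sdk_config.h fragment (sketch, assuming an external LF crystal). */
    #ifndef NRF_SDH_CLOCK_LF_SRC
    #define NRF_SDH_CLOCK_LF_SRC 1           /* 0 = RC, 1 = XTAL, 2 = Synth */
    #endif

    #ifndef NRF_SDH_CLOCK_LF_RC_CTIV
    #define NRF_SDH_CLOCK_LF_RC_CTIV 0       /* RC calibration interval, unused with XTAL */
    #endif

    #ifndef NRF_SDH_CLOCK_LF_RC_TEMP_CTIV
    #define NRF_SDH_CLOCK_LF_RC_TEMP_CTIV 0  /* RC temperature calibration, unused with XTAL */
    #endif

    #ifndef NRF_SDH_CLOCK_LF_ACCURACY
    #define NRF_SDH_CLOCK_LF_ACCURACY 7      /* assumed 20 ppm crystal; check your part */
    #endif
    ```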

    If this does not fix it, then I am starting to believe that it is probably not related to the softdevice.
    We need to focus back on the tasks at hand. You already confirmed that it is not a stack overflow.

    If we can trust the hardfault stack frame, it suggests that the timer task is involved. So start by commenting out the creation of timers one by one until xTaskResumeAll no longer crashes. If all the timer creation code is commented out, no application timers are left, and you can still trigger the hardfault, then we can forget about the timer task and assume that the hardfault stack frame holds no useful debugging information.

    In that case, move on to commenting out the creation of tasks one by one until xTaskResumeAll no longer causes this hardfault. We would then at least know which task's resumption is causing it.

  • Hello Susheel,

    I have put a breakpoint in timers.c inside xTimerGenericCommand, at the point where we call

     xReturn = xQueueSendToBackFromISR( xTimerQueue, &xMessage, pxHigherPriorityTaskWoken );

    Then I was able to look at the call stack.

    My findings are:

    We perform nrf_ble_scan_start, and when it returns with BLE_GAP_EVT_DISCONNECTED we call nrf_ble_scan_start again (should we call some cleanup function first?).

    Then, what I see from the call stack is:

    • ble_conn_params.c has an on_disconnect event handler.
    • conn_handle is 0 in my case, and from
      ble_conn_params_instance_t * p_instance = instance_get(conn_handle);
      we get a p_instance that has everything set to 0, including the timer ID.
    • So although (p_instance != NULL) is true, we get a garbage p_instance->timer_id, and app_timer_stop(p_instance->timer_id) then leads to the hard fault.

    My workaround, && p_instance->timer_id, seems to work, but I don't think this is the proper solution.

    What do you say?

    static void on_disconnect(ble_evt_t const * p_ble_evt)
    {
        ret_code_t err_code;
        uint16_t   conn_handle = p_ble_evt->evt.gap_evt.conn_handle;
        ble_conn_params_instance_t * p_instance = instance_get(conn_handle);

        if (p_instance != NULL && p_instance->timer_id) /* Workaround */
        {
            // Stop timer if running
            err_code = app_timer_stop(p_instance->timer_id);
            if (err_code != NRF_SUCCESS)
            {
                send_error_evt(err_code);
            }

            instance_free(p_instance);
        }
    }

  • Rafalino said:
         my workaround && p_instance->timer_id   seems to work but I don't think this is the proper

    Good find. Why do you think your workaround is not proper?

    I think your workaround is entirely appropriate: it checks that the timer_id is valid before attempting to use that timer's API through the ID.
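
    To illustrate the idea outside the SDK, here is a minimal host-side sketch of the same guard. All demo_ names and types are hypothetical stand-ins, not the real ble_conn_params or app_timer definitions. It only shows that a zero-initialized instance is skipped while an instance holding a real timer ID is stopped.

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical stand-in for the timer ID type (a pointer in this sketch). */
    typedef void * demo_timer_id_t;

    /* Hypothetical stand-in for a connection parameters instance. */
    typedef struct
    {
        demo_timer_id_t timer_id;   /* NULL when no timer was ever assigned */
        int             stop_calls; /* counts how often the timer was stopped */
    } demo_instance_t;

    /* Stand-in for the timer stop call; it only records that it was invoked. */
    static void demo_timer_stop(demo_instance_t * p_instance)
    {
        p_instance->stop_calls++;
    }

    /* Returns true only when a valid timer ID was present and the timer
       was stopped; a NULL instance or a zeroed timer ID is ignored. */
    bool demo_on_disconnect(demo_instance_t * p_instance)
    {
        if (p_instance == NULL || p_instance->timer_id == NULL)
        {
            return false;
        }

        demo_timer_stop(p_instance);
        return true;
    }
    ```

    The guard does not explain why an instance with all fields zeroed is returned for conn_handle 0 in the first place, so that may still be worth tracing separately.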
