
Drain TriggerRecords from the DataWriter input queue at end-run #476

Merged
bieryAtFnal merged 3 commits into develop from kbiery/DataWriterProcessAllEventsAtStop
Feb 11, 2026
Conversation

@bieryAtFnal (Collaborator)

Description

Back in December, Michal reported that DFO errors were seen in data-taking runs at NP02 at end-run ("stop") time. The errors were along the lines of the following:

dfo-01: TriggerDecision 310418 didn’t complete within timeout in run 40999

We believe that these errors were the result of trying to write more data than the configured number of storage disks could handle. (The system was clogged, and the DFO didn't wait very long for the backup to clear - just 100 msec.)

While investigating these messages, we noticed a further problem: TriggerRecords can get lost when this happens.

The changes in this PR attempt to avoid that problem by reading all TriggerRecords in the DataWriter input queue before stopping at end-run time.

This change is coordinated with the following ones:

Those changes need to be merged before, or at the same time as, this one.

Here are suggested instructions for demonstrating the problem and validating the fix:

DATE_PREFIX=`date '+%d%b'`
TIME_SUFFIX=`date '+%H%M'`

#echo ""
#echo -e "\U1F535 \U2705 First, we're going to demonstrate some lost TriggerRecords and how a code change can fix that problem. \U2705 \U1F535"
#echo ""
#echo ""
#sleep 3

source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest
dbt-create -n NFD_DEV_260209_A9 ${DATE_PREFIX}FDDevTest_${TIME_SUFFIX}
cd ${DATE_PREFIX}FDDevTest_${TIME_SUFFIX}/sourcecode

git clone https://github.com/DUNE-DAQ/daqsystemtest.git -b develop
git clone https://github.com/DUNE-DAQ/dfmodules.git -b develop
git clone https://github.com/DUNE-DAQ/fdreadoutlibs.git -b develop
git clone https://github.com/DUNE-DAQ/fdreadoutmodules.git -b develop
git clone https://github.com/DUNE-DAQ/trigger.git -b develop
git clone https://github.com/DUNE-DAQ/hsilibs.git -b develop
cd ..

sed -i 's/m_file_handle->write(tr);/usleep(1250000);\n    m_file_handle->write(tr);/' sourcecode/dfmodules/plugins/HDF5DataStore.hpp

sed -i 's,<attr name="busy_threshold" type="s32" val="4"/>,<attr name="busy_threshold" type="s32" val="8"/>,' sourcecode/daqsystemtest/config/daqsystemtest/moduleconfs.data.xml 
sed -i 's,<attr name="free_threshold" type="s32" val="3"/>,<attr name="free_threshold" type="s32" val="7"/>,' sourcecode/daqsystemtest/config/daqsystemtest/moduleconfs.data.xml 

dbt-workarea-env
dbt-build -j 12
dbt-workarea-env

daqconf_set_connectivity_service_port local-1x1-config config/daqsystemtest/example-configs.data.xml
daqconf_set_rc_controller_port local-1x1-config config/daqsystemtest/example-configs.data.xml

export TRACE_FILE=$DBT_AREA_ROOT/log/${USER}_dunedaq.trace
daqconf_set_session_env_var local-1x1-config config/daqsystemtest/example-configs.data.xml TRACE_FILE $TRACE_FILE

mkdir -p rundir
cd rundir

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config ${USER}-local-test boot wait 2 conf wait 2 start --run-number 101 wait 3 enable-triggers wait 10 disable-triggers wait 2 drain-dataflow wait 2 stop-trigger-sources stop scrap terminate

egrep -i 'error|warning' log*.txt

echo ""
echo -e "\U1F535 \U2705 Note that the DFO complains about TriggerDecisions that didn't complete in time (in the logfile messages shown above). \U2705 \U1F535"
echo ""
echo ""
sleep 3

HDF5LIBS_TestDumpRecord test_raw_run000101_*.hdf5 | grep trigger_number

echo ""
echo -e "\U1F535 \U2705 Note that one (or more) of the TriggerRecords mentioned in the DFO errors is missing from the raw data file (as shown above). \U2705 \U1F535"
echo ""
echo ""
sleep 3

cd ../sourcecode
git clone https://github.com/DUNE-DAQ/iomanager.git -b eflumerf/AddDataPending
git clone https://github.com/DUNE-DAQ/ipm.git -b eflumerf/AddDataPending

cd dfmodules
git stash
git checkout kbiery/DataWriterProcessAllEventsAtStop
git stash pop
cd ../../

dbt-workarea-env
dbt-build -j 12
dbt-workarea-env
cd rundir

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config ${USER}-local-test boot wait 2 conf wait 2 start --run-number 102 wait 3 enable-triggers wait 10 disable-triggers wait 2 drain-dataflow wait 2 stop-trigger-sources stop scrap terminate

egrep -i 'error|warning' log*.txt

echo ""
echo -e "\U1F535 \U2705 Note that, with the modified code in 3 repos, the DFO still complains about TriggerDecisions that didn't complete in time (in the logfile messages shown above). \U2705 \U1F535"
echo ""
echo ""
sleep 3

HDF5LIBS_TestDumpRecord test_raw_run000102_*.hdf5 | grep trigger_number

echo ""
echo -e "\U1F535 \U2705 However, with the modified code, there are no longer any missing TriggerRecords in the raw data file (as shown above). \U2705 \U1F535"
echo ""
echo ""

For reference, there are a few notes on this topic in the agenda of the 21-Jan-2026 Dataflow WG meeting.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Testing checklist

  • Unit tests pass (e.g. dbt-build --unittest)
  • Minimal system quicktest passes (pytest -s minimal_system_quick_test.py)
  • Full set of integration tests pass (daqsystemtest_integtest_bundle.sh)

Further checks

  • Code is commented where needed, particularly in hard-to-understand areas

@eflumerf (Member) left a comment
Following Kurt's excellent test procedure, I saw the expected failure and fix.

@bieryAtFnal bieryAtFnal merged commit 7c6a506 into develop Feb 11, 2026
3 of 4 checks passed
@bieryAtFnal bieryAtFnal deleted the kbiery/DataWriterProcessAllEventsAtStop branch February 11, 2026 02:13