
Drain TriggerRecords from the DataWriter input queue at end-run #476

Merged
bieryAtFnal merged 3 commits into develop from kbiery/DataWriterProcessAllEventsAtStop
Feb 11, 2026
Conversation

@bieryAtFnal (Collaborator)

Description

Back in December, Michal reported that DFO errors were seen in data-taking runs at NP02 at end-run ("stop") time. The errors were along the lines of the following:

dfo-01: TriggerDecision 310418 didn’t complete within timeout in run 40999

We believe that these errors were the result of trying to write more data than the configured number of storage disks could handle. (The system was clogged, and the DFO didn't wait very long for the backup to clear - just 100 msec.)

While investigating these messages, we noticed a further problem: TriggerRecords can get lost when this happens.

The changes in this PR attempt to avoid that problem by reading all TriggerRecords in the DataWriter input queue before stopping at end-run time.

This change is coordinated with the following ones:

Those changes need to be merged before, or at the same time as, this one.

Here are suggested instructions for demonstrating the problem and validating the fix:

DATE_PREFIX=`date '+%d%b'`
TIME_SUFFIX=`date '+%H%M'`

#echo ""
#echo -e "\U1F535 \U2705 First, we're going to demonstrate some lost TriggerRecords and how a code change can fix that problem. \U2705 \U1F535"
#echo ""
#echo ""
#sleep 3

source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest
dbt-create -n NFD_DEV_260209_A9 ${DATE_PREFIX}FDDevTest_${TIME_SUFFIX}
cd ${DATE_PREFIX}FDDevTest_${TIME_SUFFIX}/sourcecode

git clone https://github.com/DUNE-DAQ/daqsystemtest.git -b develop
git clone https://github.com/DUNE-DAQ/dfmodules.git -b develop
git clone https://github.com/DUNE-DAQ/fdreadoutlibs.git -b develop
git clone https://github.com/DUNE-DAQ/fdreadoutmodules.git -b develop
git clone https://github.com/DUNE-DAQ/trigger.git -b develop
git clone https://github.com/DUNE-DAQ/hsilibs.git -b develop
cd ..

sed -i 's/m_file_handle->write(tr);/usleep(1250000);\n    m_file_handle->write(tr);/' sourcecode/dfmodules/plugins/HDF5DataStore.hpp

sed -i 's,<attr name="busy_threshold" type="s32" val="4"/>,<attr name="busy_threshold" type="s32" val="8"/>,' sourcecode/daqsystemtest/config/daqsystemtest/moduleconfs.data.xml 
sed -i 's,<attr name="free_threshold" type="s32" val="3"/>,<attr name="free_threshold" type="s32" val="7"/>,' sourcecode/daqsystemtest/config/daqsystemtest/moduleconfs.data.xml 

dbt-workarea-env
dbt-build -j 12
dbt-workarea-env

daqconf_set_connectivity_service_port local-1x1-config config/daqsystemtest/example-configs.data.xml
daqconf_set_rc_controller_port local-1x1-config config/daqsystemtest/example-configs.data.xml

export TRACE_FILE=$DBT_AREA_ROOT/log/${USER}_dunedaq.trace
daqconf_set_session_env_var local-1x1-config config/daqsystemtest/example-configs.data.xml TRACE_FILE $TRACE_FILE

mkdir -p rundir
cd rundir

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config ${USER}-local-test boot wait 2 conf wait 2 start --run-number 101 wait 3 enable-triggers wait 10 disable-triggers wait 2 drain-dataflow wait 2 stop-trigger-sources stop scrap terminate

egrep -i 'error|warning' log*.txt

echo ""
echo -e "\U1F535 \U2705 Note that the DFO complains about TriggerDecisions that didn't complete in time (in the logfile messages shown above). \U2705 \U1F535"
echo ""
echo ""
sleep 3

HDF5LIBS_TestDumpRecord test_raw_run000101_*.hdf5 | grep trigger_number

echo ""
echo -e "\U1F535 \U2705 Note that one (or more) of the TriggerRecords mentioned in the DFO errors is missing from the raw data file (as shown above). \U2705 \U1F535"
echo ""
echo ""
sleep 3

cd ../sourcecode
git clone https://github.com/DUNE-DAQ/iomanager.git -b eflumerf/AddDataPending
git clone https://github.com/DUNE-DAQ/ipm.git -b eflumerf/AddDataPending

cd dfmodules
git stash
git checkout kbiery/DataWriterProcessAllEventsAtStop
git stash pop
cd ../../

dbt-workarea-env
dbt-build -j 12
dbt-workarea-env
cd rundir

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config ${USER}-local-test boot wait 2 conf wait 2 start --run-number 102 wait 3 enable-triggers wait 10 disable-triggers wait 2 drain-dataflow wait 2 stop-trigger-sources stop scrap terminate

egrep -i 'error|warning' log*.txt

echo ""
echo -e "\U1F535 \U2705 Note that, with the modified code in 3 repos, the DFO still complains about TriggerDecisions that didn't complete in time (in the logfile messages shown above). \U2705 \U1F535"
echo ""
echo ""
sleep 3

HDF5LIBS_TestDumpRecord test_raw_run000102_*.hdf5 | grep trigger_number

echo ""
echo -e "\U1F535 \U2705 However, with the modified code, there are no longer any missing TriggerRecords in the raw data file (as shown above). \U2705 \U1F535"
echo ""
echo ""

For reference, there are a few notes on this topic in the agenda of the 21-Jan-2026 Dataflow WG meeting.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Testing checklist

  • Unit tests pass (e.g. dbt-build --unittest)
  • Minimal system quicktest passes (pytest -s minimal_system_quick_test.py)
  • Full set of integration tests pass (daqsystemtest_integtest_bundle.sh)

Further checks

  • Code is commented where needed, particularly in hard-to-understand areas

@eflumerf (Member) left a comment
Following Kurt's excellent test procedure, I saw the expected failure and fix.

@bieryAtFnal bieryAtFnal merged commit 7c6a506 into develop Feb 11, 2026
3 of 4 checks passed
@bieryAtFnal bieryAtFnal deleted the kbiery/DataWriterProcessAllEventsAtStop branch February 11, 2026 02:13