Cannot match official YOLOv9 benchmark scores on COCO validation set #231

@cosminlovin7

Description

Hello, I have tried to run the validation task on YOLOv9-s, and I am wondering whether I am doing something wrong, because I could not replicate the officially reported benchmark scores. Could you please provide some guidance?

Steps to reproduce:

  1. I performed a clean installation of this YOLOv9 implementation.
  2. I ran the validation task on the YOLOv9-s model with the default configuration (specifically the dataset settings in coco.yaml and the task settings in validation.yaml).

Tests performed:

  1. When using the following configuration in validation.yaml:
  • batch_size: 32
  • nms.min_confidence: 0.0001
  • nms.min_iou: 0.7
  • nms.max_bbox: 1000

I get metrics that are reasonably close to the reported ones:

  • AP@.5=0.6188273429870605
  • AP@.5:.95=0.4580431282520294

They seem very close to the originally reported AP@.5=0.634 and AP@.5:.95=0.468 for the YOLOv9-s model in the original GitHub repository: https://github.com/WongKinYiu/yolov9

  2. However, the original implementation reports these metrics at min_confidence=0.001, NOT 0.0001 as this implementation uses by default. When re-running with nms.min_confidence=0.001, I get:
  • AP@.5=0.1537942737340927
  • AP@.5:.95=0.11888031661510468

These are significantly lower than the values reported in the official repository.
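For clarity, this is how I understand these NMS parameters to interact; the postprocess helper below is my own illustration with torchvision, not the actual code of this repository:

```python
import torch
from torchvision.ops import batched_nms

def postprocess(boxes, scores, labels, min_confidence=0.001, min_iou=0.7, max_bbox=1000):
    """Illustration only: confidence filter, then class-aware NMS, then cap the detections."""
    keep = scores >= min_confidence                     # drop low-confidence predictions first
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = batched_nms(boxes, scores, labels, iou_threshold=min_iou)  # class-aware NMS
    keep = keep[:max_bbox]                              # keep at most max_bbox detections per image
    return boxes[keep], scores[keep], labels[keep]
```

With a correct pipeline, raising min_confidence from 0.0001 to 0.001 should only remove near-zero-score boxes and change AP marginally, which is why the large drop I observe looks wrong to me.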

  3. Another strange thing happens when I change the batch size to 8 or 16. For example, with the following configuration in validation.yaml:
  • batch_size: 16
  • nms.min_confidence: 0.0001 (using the exact default reported by this repository)
  • nms.min_iou: 0.7
  • nms.max_bbox: 1000

I get the following metrics:

  • AP@.5=0.4457399845123291
  • AP@.5:.95=0.33201929926872253

And when using batch_size=8, I get the following metrics:

  • AP@.5=0.2107953280210495
  • AP@.5:.95=0.16253872215747833

I honestly do not understand why reducing the batch size would influence the final AP@.5 and AP@.5:.95 on the same dataset with the same model. The only explanation I can think of is a bug, perhaps in the NMS implementation or somewhere nearby, that affects the reported metrics when the batch size varies. If I did something wrong while performing these tests, please correct me.
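If detections were accumulated over the entire validation set before computing AP, the batch size should not affect the result at all. Here is a toy example of my own (using plain precision instead of AP, just to show the principle) of how averaging a metric per batch diverges from computing it over the pooled detections:

```python
import numpy as np

# Toy data: per-image detection outcomes (1 = true positive, 0 = false positive) for 8 images.
per_image = [np.array(x) for x in ([1, 1, 1], [1], [0], [1, 1], [0, 0], [1], [1, 1, 1, 1], [0])]

# Global accumulation: pool every detection across the dataset, then compute the metric once.
global_precision = np.concatenate(per_image).mean()                    # 11/15 ~ 0.733

# Per-batch averaging with batch_size=2: compute the metric per batch, then average the batches.
batches = [np.concatenate(per_image[i:i + 2]) for i in range(0, len(per_image), 2)]
batched_precision = np.mean([b.mean() for b in batches])               # (1.0 + 2/3 + 1/3 + 0.8) / 4 = 0.7

print(global_precision, batched_precision)  # the two disagree, and the second depends on batch_size
```

If the evaluation here does something analogous per batch (or pads/truncates detections per batch before matching), that could explain why the numbers move with batch_size.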

P.S.: I want to mention that the tests were performed using a single GPU.
P.S. 2: I have also noticed that the number of parameters in these models is higher than reported in the official paper (see the snippet after the list for how I counted them):

  • YOLOv9-t: 3.7 M; Official YOLOv9-t: 2.0M
  • YOLOv9-s: 9.8 M; Official YOLOv9-s: 7.1M
  • YOLOv9-m: 32.9 M; Official YOLOv9-m: 20.0M
  • YOLOv9-c: 51.2 M; Official YOLOv9-c: 25.3M
  • YOLOv9-e: Not available
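The counts above come from a straightforward parameter sum; a minimal sketch, assuming the instantiated model is an ordinary torch.nn.Module (the model-loading call itself depends on this repository's API, so I omit it):

```python
import torch

def count_parameters(model: torch.nn.Module) -> tuple[float, float]:
    """Return (total, trainable) parameter counts in millions."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total / 1e6, trainable / 1e6

# Usage (model construction omitted, since it depends on this repository):
# total_m, trainable_m = count_parameters(model)
# print(f"{total_m:.1f}M total, {trainable_m:.1f}M trainable")
```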

Could you please explain the discrepancy? Thanks!

P.S. 3: I have also noticed that several augmentation techniques used by the original paper seem to be missing from this repository (the HSV jitter I mean is sketched after this list), such as:

  • HSV saturation augmentation
  • HSV value augmentation
  • scale augmentation
  • copy & paste augmentation

The augmentations used in the original implementation are listed in the Appendix section of the paper.
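For reference, this is roughly the kind of HSV saturation/value jitter I mean; it is a sketch in the spirit of the original implementation's augment_hsv, not code from either repository, and the gain values are only illustrative:

```python
import cv2
import numpy as np

def augment_hsv(img_bgr: np.ndarray, s_gain: float = 0.7, v_gain: float = 0.4) -> np.ndarray:
    """Sketch of HSV saturation/value jitter: multiply S and V by random gains around 1."""
    r = 1 + np.random.uniform(-1, 1, size=2) * np.array([s_gain, v_gain])
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * r[0], 0, 255)   # saturation channel
    hsv[..., 2] = np.clip(hsv[..., 2] * r[1], 0, 255)   # value (brightness) channel
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```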

I have also not found a "close_mosaic_epochs" option in this implementation, which would disable Mosaic augmentation during the last training epochs (see the sketch below).
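To be explicit about what I mean by close_mosaic: in the original repository, Mosaic is turned off for the final N epochs of training. A minimal sketch of the idea (my own illustration; the epoch counts are just example values):

```python
def mosaic_enabled(epoch: int, total_epochs: int, close_mosaic_epochs: int) -> bool:
    """Sketch of close_mosaic_epochs: Mosaic stays on except during the final N epochs."""
    return epoch < total_epochs - close_mosaic_epochs

# Example: with 500 training epochs and close_mosaic_epochs=15, Mosaic is used for epochs 0..484 only.
print([mosaic_enabled(e, 500, 15) for e in (0, 484, 485, 499)])  # [True, True, False, False]
```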

Could you please clarify whether this implementation actually matches the performance of the original YOLOv9, or whether it is a work-in-progress implementation that will be improved further?
Thank you.
