Error message "Bad memory unallocation" when cores exceed NUM_THREADS #5639

@sitsofe

Description


When running the Hugging Face THUDM/CogVideoX-2b model with PyTorch 2.10.0 on an AArch64 system with more than 128 cores, PyTorch's built-in OpenBLAS (apparently v0.3.30) prints "Bad memory unallocation!" warnings:

OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata.
To avoid this warning, please rebuild your copy of OpenBLAS with a larger NUM_THREADS setting
or set the environment variable OPENBLAS_NUM_THREADS to 128 or lower
[...]
BLAS : Bad memory unallocation! :  768  0xf52c94000000
BLAS : Bad memory unallocation! :  768  0xf52c78000000
BLAS : Bad memory unallocation! :  768  0xf52c80000000
BLAS : Bad memory unallocation! :  768  0xf52c8e000000
BLAS : Bad memory unallocation! :  768  0xf52c8a000000
BLAS : Bad memory unallocation! :  768  0xf52c88000000
[...]

The problem has also been reproduced with OpenBLAS v0.2.20-7494-g986ba2949. All warnings go away if the thread count is restricted with OMP_NUM_THREADS=128, or if OpenBLAS is recompiled with NUM_THREADS=256.
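As a quick way to check whether a given machine is likely to hit the precompiled NUM_THREADS limit without downloading the model, the following stdlib-only sketch (a hypothetical helper, not part of the original report) compares the effective thread count against the 128-thread build limit mentioned in the warning text:

```python
import os

# The OpenBLAS bundled with PyTorch wheels appears to be built with
# NUM_THREADS=128 (an assumption based on the warning text above).
PRECOMPILED_NUM_THREADS = 128

def threads_exceed_limit(limit=PRECOMPILED_NUM_THREADS):
    """Return True if the effective thread count exceeds `limit`.

    OMP_NUM_THREADS (if set) caps the OpenMP thread pool; otherwise
    OpenBLAS typically falls back to the number of visible CPUs.
    """
    env = os.environ.get("OMP_NUM_THREADS")
    threads = int(env) if env else (os.cpu_count() or 1)
    return threads > limit

if __name__ == "__main__":
    if threads_exceed_limit():
        print("Likely affected: thread count exceeds precompiled NUM_THREADS")
    else:
        print("Thread count within precompiled NUM_THREADS")
```

On the affected machine above, `OMP_NUM_THREADS=140` (or an unset variable with >128 cores) would make this report "likely affected".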

The following script reproduces the problem (note that it downloads over 14GiB of model data before it runs):

#!/usr/bin/env python3
#
# apt update
# apt install python3-pip python3-venv
# python3 -m venv venv
# . venv/bin/activate
# pip install accelerate diffusers protobuf sentencepiece tiktoken torch transformers
# # WARNING: This script needs to download over 14GiB of model data!
# OMP_NUM_THREADS=140 python3 openblas-bad-memory-unallocation.py
#
# If you want to use a custom OpenBLAS:
# git clone https://github.com/OpenMathLib/OpenBLAS.git
# cd OpenBLAS
# git describe
# v0.2.20-7494-g986ba2949
# make -j 8 NUM_THREADS=128 USE_OPENMP=1 NO_SHARED=0 DYNAMIC_ARCH=1 TARGET=ARMV8 CFLAGS=-O3 BUILD_BFLOAT16=1
# cd ../
# OMP_NUM_THREADS=140 LD_PRELOAD=OpenBLAS/libopenblas.so python3 openblas-bad-memory-unallocation.py

import torch
from diffusers import CogVideoXPipeline

def main():
    # From https://huggingface.co/zai-org/CogVideoX-2b
    model_name = 'THUDM/CogVideoX-2b'
    
    # Create main processing pipeline
    print("Creating pipeline...")
    pipe = CogVideoXPipeline.from_pretrained(
        model_name,
        torch_dtype=torch.float32,
    ).to('cpu')
    
    # Configure pipeline
    print("Configuring pipeline...")
    generator = torch.Generator().manual_seed(42)
    
    # Warmup
    print("Warmup...")
    # The default model (CogVideoX-2b) specifically says it only works with 720x480
    pipe(prompt="warmup", generator=generator, num_inference_steps=1, num_frames=1)

    print("Finished!")

if __name__ == '__main__':
    main()
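For anyone who wants to poke at the threading path without the 14GiB download, here is a much smaller sketch that just hammers the BLAS threaded GEMM routines with large float32 matmuls. It is an assumption that this exercises the same code path: NumPy may link a different BLAS than PyTorch's bundled OpenBLAS, so it may or may not trigger the same warnings, but it can be useful for bisecting with an LD_PRELOADed custom build:

```python
import numpy as np

def stress_blas(n=2048, iters=4):
    """Run a chain of large float32 matmuls to exercise threaded GEMM.

    Returns a checksum of the final product so the work can't be
    optimized away. Matrix size and iteration count are arbitrary.
    """
    rng = np.random.default_rng(42)
    a = rng.standard_normal((n, n), dtype=np.float32)
    b = rng.standard_normal((n, n), dtype=np.float32)
    c = a @ b
    for _ in range(iters - 1):
        c = c @ b
    return float(c.sum())

if __name__ == "__main__":
    # Run with e.g. OMP_NUM_THREADS=140 and watch stderr for
    # "Bad memory unallocation" warnings.
    print("checksum:", stress_blas())
```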
