- https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html
-
If the model is very large, it may require model sharding because the Qualcomm DSP is a 32-bit system with a 4 GB size limit.
- For example, Llama 3 8B must be sharded into 4 parts, but ExecuTorch still packages them into a single PTE file.
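The sharding idea above can be sketched in a few lines: because the 32-bit DSP caps each compiled graph at 4 GB, a model's layers must be grouped into sub-graphs that each fit under the limit. This is an illustrative sketch only, not the ExecuTorch implementation; the function name `shard_layers` and the per-layer size estimate are assumptions for the example.

```python
# Minimal sketch of model sharding for a 32-bit DSP with a 4 GB limit.
# shard_layers is a hypothetical helper, not part of the ExecuTorch API.

def shard_layers(layer_sizes, limit_bytes):
    """Greedily group consecutive layers so each shard stays under limit_bytes."""
    shards, current, current_size = [], [], 0
    for i, size in enumerate(layer_sizes):
        if current and current_size + size > limit_bytes:
            shards.append(current)       # current shard is full; start a new one
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        shards.append(current)
    return shards

# Rough illustration: 32 transformer blocks at ~0.5 GB each (fp16) → 4 shards
GB = 1 << 30
sizes = [GB // 2] * 32
shards = shard_layers(sizes, 4 * GB)
print(len(shards))  # → 4
```

Each shard is compiled as its own QNN graph, but ExecuTorch packages all of them into the single PTE file mentioned above.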
Passes or transformations
- See Preprocessing for the definition.
Quantization
- The QNN backend currently supports exporting to these data types: fp32; int4/int8 with PTQ; int4 with SpinQuant (Llama 3 only).
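To make the PTQ option above concrete, here is a hedged sketch of the core idea behind int8 post-training quantization: derive a per-tensor scale from the observed value range, then round floats to integers. This illustrates the math only; the QNN backend performs quantization through PyTorch's quantization flow, not code like this, and the function names here are invented for the example.

```python
# Illustrative symmetric per-tensor int8 PTQ (not the QNN backend's code).

def quantize_int8(values):
    """Return (scale, int8 values) using a symmetric per-tensor scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return scale, q

def dequantize(scale, q):
    """Map int8 values back to approximate floats."""
    return [scale * v for v in q]

weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, q = quantize_int8(weights)
approx = dequantize(scale, q)
# Round-trip error is bounded by half a quantization step (scale / 2).
assert all(abs(w - a) <= scale / 2 for w, a in zip(weights, approx))
```

int4 halves the bit width again (16 levels), which is why accuracy-preserving schemes such as SpinQuant are needed at that precision.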