- https://pytorch.org/executorch/main/llm/getting-started.html
- https://github.com/pytorch/torchchat/blob/main/runner/run.cpp
- (Prerequisites) Export the model to .pte following torch.export() ⇒ Edge Compilation
- It hosts the library components used in a C++ LLM runner:
- stats.h reports runtime statistics such as token counts and latency.
- TextPrefiller feeds the prompt tokens to the model in one pass (the prefill phase).
- TextDecoderRunner executes the decoder model for token-by-token generation.
- With the components above, an actual runner can be built for a single model or a family of models; a simplified sketch of the loop follows this list.
- An example lives in /executorch/examples/models/llama/runner, where a C++ runner runs Llama 2, 3, 3.1, and other models that share the same architecture.
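To make the flow concrete, below is a minimal sketch of the prefill-then-decode loop those components implement, plus the token-count and latency numbers stats.h is concerned with. The prefill() and decode_step() functions are simplified stand-ins for TextPrefiller and TextDecoderRunner, not the actual ExecuTorch API, and the token IDs are made up.

// Hypothetical sketch: prefill() and decode_step() are simplified
// stand-ins for TextPrefiller and TextDecoderRunner, not the real API.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Stand-in for TextPrefiller: consume the whole prompt in one pass
// (filling the KV cache) and return the first generated token.
uint64_t prefill(const std::vector<uint64_t>& prompt_tokens) {
  return prompt_tokens.empty() ? 0 : prompt_tokens.back() + 1; // dummy logic
}

// Stand-in for TextDecoderRunner: run one decode step on the previous token.
uint64_t decode_step(uint64_t prev_token, size_t pos) {
  return prev_token + pos; // dummy logic
}

int main() {
  const std::vector<uint64_t> prompt = {1, 15043, 3186}; // made-up token IDs
  const size_t max_new_tokens = 8;
  const uint64_t eos_token = 2;

  const auto start = std::chrono::steady_clock::now();

  // Prefill phase: one batched pass over the prompt.
  uint64_t token = prefill(prompt);
  std::vector<uint64_t> generated = {token};

  // Decode phase: one token per step until EOS or the budget runs out.
  for (size_t pos = prompt.size(); generated.size() < max_new_tokens; ++pos) {
    token = decode_step(token, pos);
    if (token == eos_token) break;
    generated.push_back(token);
  }

  // The kind of numbers stats.h tracks: token counts and latency.
  const auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start).count();
  std::cout << "prompt tokens: " << prompt.size()
            << ", generated tokens: " << generated.size()
            << ", latency: " << elapsed_ms << " ms\n";
  return 0;
}

The split matters because prefill processes the whole prompt in one batched forward pass, while decode runs one token per step against the cached KV state.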
Running an ExecuTorch Model Using the Module Extension in C++
- https://pytorch.org/executorch/stable/extension-module.html
- A Module exposes a forward() method; inputs can be passed as Tensors created with the tensor extension (from_blob()), as in the snippet below.
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace ::executorch::extension;

// Create a Module.
Module module("/path/to/model.pte");

// Wrap the input data with a Tensor.
float input[1 * 3 * 256 * 256];
auto tensor = from_blob(input, {1, 3, 256, 256});

// Perform an inference.
const auto result = module.forward(tensor);

// Check for success or failure.
if (result.ok()) {
  // Retrieve the output data.
  const auto output = result->at(0).toTensor().const_data_ptr<float>();
}
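Beyond forward(), the extension-module docs linked above also cover explicit control over loading and named-method execution; otherwise a Module loads the program and method lazily on first use. A minimal sketch, assuming the load(), load_method(), and execute() APIs from that page (the model path and shapes are placeholders):

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

#include <vector>

using namespace ::executorch::extension;

void run_with_eager_loading() {
  Module module("/path/to/model.pte"); // placeholder path

  // Load the program eagerly instead of lazily on the first call.
  if (module.load() != ::executorch::runtime::Error::Ok) {
    return; // handle the load failure
  }

  // Load a named method up front; "forward" is the usual entry point.
  if (module.load_method("forward") != ::executorch::runtime::Error::Ok) {
    return; // handle the method-load failure
  }

  // execute() is the general form of forward(): the method is named explicitly.
  std::vector<float> input(1 * 3 * 256 * 256);
  auto tensor = from_blob(input.data(), {1, 3, 256, 256});
  const auto result = module.execute("forward", tensor);
  if (result.ok()) {
    const auto output = result->at(0).toTensor().const_data_ptr<float>();
    (void)output; // use the output data
  }
}

Eager loading moves the one-time program and method setup cost out of the first inference call, which helps when first-token latency matters.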