- https://pytorch.org/executorch/main/llm/getting-started.html
- https://github.com/pytorch/torchchat/blob/main/runner/run.cpp
- (Prerequisites) Export the model to .pte following torch.export() ⇒ Edge Compilation
- It hosts the library components used in a C++ LLM runner:
- stats.h reports runtime statistics such as token counts and latency.
- TextPrefiller feeds the prompt tokens to the model in one pass (the prefill phase).
- TextDecoderRunner executes the decoder model for token-by-token generation.
- With the components above, an actual runner can be built for a single model or a family of models; a simplified sketch of the loop follows this list.
- An example lives in /executorch/examples/models/llama/runner, where a C++ runner runs Llama 2, 3, 3.1, and other models that share the same architecture.
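To make the flow concrete, below is a minimal sketch of the prefill-then-decode loop those components implement, plus the token-count and latency numbers stats.h is concerned with. The prefill() and decode_step() functions are simplified stand-ins for TextPrefiller and TextDecoderRunner, not the actual ExecuTorch API, and the token IDs are made up.

// Hypothetical sketch: prefill() and decode_step() are simplified
// stand-ins for TextPrefiller and TextDecoderRunner, not the real API.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

// Stand-in for TextPrefiller: consume the whole prompt in one pass
// (filling the KV cache) and return the first generated token.
uint64_t prefill(const std::vector<uint64_t>& prompt_tokens) {
  return prompt_tokens.empty() ? 0 : prompt_tokens.back() + 1; // dummy logic
}

// Stand-in for TextDecoderRunner: run one decode step on the previous token.
uint64_t decode_step(uint64_t prev_token, size_t pos) {
  return prev_token + pos; // dummy logic
}

int main() {
  const std::vector<uint64_t> prompt = {1, 15043, 3186}; // made-up token IDs
  const size_t max_new_tokens = 8;
  const uint64_t eos_token = 2;

  const auto start = std::chrono::steady_clock::now();

  // Prefill phase: one batched pass over the prompt.
  uint64_t token = prefill(prompt);
  std::vector<uint64_t> generated = {token};

  // Decode phase: one token per step until EOS or the budget runs out.
  for (size_t pos = prompt.size(); generated.size() < max_new_tokens; ++pos) {
    token = decode_step(token, pos);
    if (token == eos_token) break;
    generated.push_back(token);
  }

  // The kind of numbers stats.h tracks: token counts and latency.
  const auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start).count();
  std::cout << "prompt tokens: " << prompt.size()
            << ", generated tokens: " << generated.size()
            << ", latency: " << elapsed_ms << " ms\n";
  return 0;
}

The split matters because prefill processes the whole prompt in one batched forward pass, while decode runs one token per step against the cached KV state.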
Running an ExecuTorch Model Using the Module Extension in C++
- https://pytorch.org/executorch/stable/extension-module.html
- A Module exposes a forward() method; inputs can be passed as Tensors created with the tensor extension (from_blob()), as in the snippet below.
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace ::executorch::extension;

// Create a Module.
Module module("/path/to/model.pte");

// Wrap the input data with a Tensor.
float input[1 * 3 * 256 * 256];
auto tensor = from_blob(input, {1, 3, 256, 256});

// Perform an inference.
const auto result = module.forward(tensor);

// Check for success or failure.
if (result.ok()) {
  // Retrieve the output data.
  const auto output = result->at(0).toTensor().const_data_ptr<float>();
}
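Beyond forward(), the extension-module docs linked above also cover explicit control over loading and named-method execution; otherwise a Module loads the program and method lazily on first use. A minimal sketch, assuming the load(), load_method(), and execute() APIs from that page (the model path and shapes are placeholders):

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

#include <vector>

using namespace ::executorch::extension;

void run_with_eager_loading() {
  Module module("/path/to/model.pte"); // placeholder path

  // Load the program eagerly instead of lazily on the first call.
  if (module.load() != ::executorch::runtime::Error::Ok) {
    return; // handle the load failure
  }

  // Load a named method up front; "forward" is the usual entry point.
  if (module.load_method("forward") != ::executorch::runtime::Error::Ok) {
    return; // handle the method-load failure
  }

  // execute() is the general form of forward(): the method is named explicitly.
  std::vector<float> input(1 * 3 * 256 * 256);
  auto tensor = from_blob(input.data(), {1, 3, 256, 256});
  const auto result = module.execute("forward", tensor);
  if (result.ok()) {
    const auto output = result->at(0).toTensor().const_data_ptr<float>();
    (void)output; // use the output data
  }
}

Eager loading moves the one-time program and method setup cost out of the first inference call, which helps when first-token latency matters.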