Puma aims to be a lightweight, high-performance inference engine for heterogeneous devices. It is currently under active development.
Run make build to build the puma binary. Run ./puma help to list all available commands. For example, running ./puma version prints the binary version.
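The steps above can be collected into a short session. This is a sketch based only on the commands mentioned here; it assumes the repository provides a Makefile with a build target that produces a puma binary in the current directory.

```shell
# Build the puma binary (assumes a Makefile with a `build` target)
make build

# List all available commands
./puma help

# Example: print the binary version
./puma version
```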
llama.cpp is used as the default backend for quick prototyping; a custom backend will be implemented in the future.