Cortex Module
Inference_Engine // WebGPU_Accelerated
ᚠ ᛫ ᛟ ᛫ ᚱ ᛫ ᛒ ᛫ ᛟ ᛫ ᚲ
The Cortex module is the “Voice” of the AI. It handles local text generation using quantized Small Language Models (SLMs) running directly in the client’s browser via WebGPU.
No cloud. No latency. Pure thought.
Features
- Local Inference: Runs entirely on the user’s device (Edge AI).
- WebGPU Acceleration: Utilizes the GPU via `@mlc-ai/web-llm` for near-native performance.
- Offline Capable: Once the model is cached, it works without an internet connection.
- Privacy First: No prompt data leaves the user’s device.
Supported Models
Model presets range from lightweight options such as `smollm2-135m` (~200 MB of VRAM) to larger ones such as `llama3-8b` (~6 GB); see Performance Tips below.
Usage
Initialization
Initialize the Cortex engine. This triggers the model download if not cached.
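The Cortex wrapper's own API isn't reproduced on this page, so the sketch below initializes the underlying `@mlc-ai/web-llm` engine directly; the `initCortex` name, the chosen model ID, and the progress logging are illustrative assumptions.

```ts
import { CreateMLCEngine, type MLCEngine } from "@mlc-ai/web-llm";

// The model ID is an assumption; any preset from web-llm's prebuilt model list works.
const MODEL_ID = "SmolLM2-135M-Instruct-q0f16-MLC";

export async function initCortex(): Promise<MLCEngine> {
  // CreateMLCEngine downloads the weights on first run (later runs hit the
  // browser Cache API) and compiles the WebGPU shaders for the model.
  return CreateMLCEngine(MODEL_ID, {
    initProgressCallback: (report) => {
      // report.progress is a 0..1 fraction, report.text a human-readable status.
      console.log(`[cortex] ${Math.round(report.progress * 100)}% - ${report.text}`);
    },
  });
}
```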
Generation
Generate text or dialogue.
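A minimal sketch of one-shot generation through web-llm's OpenAI-style `chat.completions.create` call, assuming `engine` is the instance returned by the initialization sketch above; the system prompt and sampling parameters are placeholders.

```ts
import type { MLCEngine } from "@mlc-ai/web-llm";

// One-shot generation; prompt wording and sampling settings are illustrative.
async function generate(engine: MLCEngine, prompt: string): Promise<string> {
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "You are a terse retro terminal AI." },
      { role: "user", content: prompt },
    ],
    temperature: 0.7,
    max_tokens: 128,
  });
  return reply.choices[0].message.content ?? "";
}
```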
Streaming
Stream the response character-by-character for a retro terminal effect.
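A sketch of streaming with `stream: true`, which yields response deltas as they are produced; the character-by-character `print` callback and the typewriter delay are assumptions added for the retro terminal effect, not part of web-llm.

```ts
import type { MLCEngine } from "@mlc-ai/web-llm";

// Streams the reply and hands it to `print` one character at a time.
async function streamToTerminal(
  engine: MLCEngine,
  prompt: string,
  print: (ch: string) => void,
): Promise<void> {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true, // yields an async iterable of deltas instead of one reply
  });
  for await (const chunk of chunks) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    for (const ch of delta) {
      print(ch); // e.g. append to a <pre> element
      await new Promise((resolve) => setTimeout(resolve, 15)); // typewriter pacing
    }
  }
}
```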
Performance Tips
- Warmup: The first generation might be slower due to shader compilation.
- Caching: The model weights are stored via the browser Cache API, so subsequent loads skip the download and start much faster.
- VRAM: Ensure the user's device has enough VRAM: `smollm2-135m` requires ~200 MB, while `llama3-8b` needs ~6 GB.
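WebGPU offers no direct VRAM query, but a pre-flight check can at least confirm that an adapter is available before a large model is loaded. The sketch below is one assumption about how that check might look (it expects `@webgpu/types` for the `navigator.gpu` typings).

```ts
// Rough pre-flight check before loading a large model. WebGPU does not expose
// VRAM size directly, so this only confirms an adapter exists and logs its
// buffer limit as a coarse proxy.
async function canRunWebGPU(): Promise<boolean> {
  if (!("gpu" in navigator)) return false; // browser lacks WebGPU entirely
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return false; // no usable GPU adapter found
  console.log("maxBufferSize:", adapter.limits.maxBufferSize);
  return true;
}
```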
ᚠ ᛫ ᛟ ᛫ ᚱ ᛫ ᛒ ᛫ ᛟ ᛫ ᚲ
