Cortex (Inference)


Cortex Module

Infèrence_Éngine // WébGPU_Accélerated

ᚠ ᛫ ᛟ ᛫ ᚱ ᛫ ᛒ ᛫ ᛟ ᛫ ᚲ

The Cortex module is the “Voice” of the AI. It handles local text generation using quantized Small Language Models (SLMs) running directly in the client’s browser via WebGPU.

No cloud. No latency. Pure thought.


Features

  • Local Inference: Runs entirely on the user’s device (Edge AI).
  • WebGPU Acceleration: Utilizes the GPU via @mlc-ai/web-llm for near-native performance (a support check is sketched after this list).
  • Offline Capable: Once the model is cached, it works without internet.
  • Privacy First: No prompt data leaves the user’s device.
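
Because inference runs entirely on WebGPU, it can help to feature-detect support before creating the engine. Below is a minimal sketch: `navigator.gpu` is the standard WebGPU entry point, while the fallback branch is purely illustrative and not part of the library.

```typescript
import { createCortex } from 'forbocai';

// Feature-detect WebGPU: navigator.gpu is the standard entry point and is
// undefined in browsers without WebGPU. (Cast avoids needing @webgpu/types.)
async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as any).gpu;
  if (!gpu) return false;
  // requestAdapter() resolves to null when no suitable GPU is available.
  return (await gpu.requestAdapter()) !== null;
}

if (await hasWebGPU()) {
  const cortex = createCortex({ model: 'smollm2-135m', gpu: true });
  await cortex.init();
} else {
  // Illustrative fallback: warn and skip GPU inference.
  console.warn('WebGPU not available; Cortex cannot run GPU-accelerated here.');
}
```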

Supported Models

| Model | Parameters | Quantization | Size | Use Case |
| --- | --- | --- | --- | --- |
| smollm2-135m | 135M | q4f16_1 | ~100MB | Low-end devices, simple dialogue |
| llama3-8b | 8B | q4f16_1 | ~4.5GB | High-end desktops, complex reasoning |
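
Which model to load is best decided per device. A rough selection sketch follows; the threshold and the use of `navigator.deviceMemory` (a coarse, Chromium-only RAM hint) are illustrative assumptions, not part of the library:

```typescript
import { createCortex } from 'forbocai';

// navigator.deviceMemory reports approximate RAM in GB (Chromium only);
// fall back to a conservative guess elsewhere.
const deviceMemoryGB = (navigator as any).deviceMemory ?? 4;

// Assumed rule of thumb: only load the 8B model on machines with plenty
// of memory (it needs roughly ~6GB of VRAM once loaded).
const model = deviceMemoryGB >= 16 ? 'llama3-8b' : 'smollm2-135m';

const cortex = createCortex({ model, gpu: true });
await cortex.init();
```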

Usage

Initialization

Initialize the Cortex engine. This triggers the model download if not cached.

```typescript
import { createCortex } from 'forbocai';

// Initialize with a lightweight model
const cortex = createCortex({
  model: 'smollm2-135m',
  gpu: true
});

// Load the model (async); this downloads the weights on first run
await cortex.init();

// The engine's status is internal/private; a resolved init() means Cortex is ready
console.log("Cortex Ready");
```

Generation

Generate text or dialogue.

```typescript
const response = await cortex.complete("Hello, how are you?", {
  temperature: 0.7,
  maxTokens: 100
});

console.log(response);
```
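
The `temperature` and `maxTokens` options are the main generation knobs: lower temperatures keep the model on its most likely tokens, higher ones allow more varied output. A quick comparison sketch (prompts and values are illustrative):

```typescript
// Low temperature: focused, near-deterministic replies
const factual = await cortex.complete("List the planets of the solar system.", {
  temperature: 0.1,
  maxTokens: 100
});

// High temperature: looser, more creative replies
const creative = await cortex.complete("Describe a neon-lit city at dawn.", {
  temperature: 0.9,
  maxTokens: 120
});

console.log(factual, creative);
```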

Streaming

Stream the response character-by-character for a retro terminal effect.

```typescript
const stream = cortex.completeStream("Tell me a story about a cyber-knight.");

for await (const chunk of stream) {
  // Append each chunk to the page as it arrives
  // (process.stdout is not available in the browser)
  document.body.append(chunk);
}
```
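
Depending on the engine, chunks from `completeStream` may arrive as multi-character tokens rather than single characters. For a true character-by-character typewriter feel, one option is to split each chunk and pace the output yourself; a small sketch (the 20ms delay is an arbitrary choice):

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const stream = cortex.completeStream("Tell me a story about a cyber-knight.");

for await (const chunk of stream) {
  // Split each streamed chunk into characters and pace them
  // for the retro terminal / typewriter effect.
  for (const char of chunk) {
    document.body.append(char);
    await sleep(20); // arbitrary per-character delay
  }
}
```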

Performance Tips

  1. Warmup: The first generation may be slower due to shader compilation (a warmup sketch follows this list).
  2. Caching: The model is stored via the browser Cache API, so subsequent loads skip the download.
  3. VRAM: Ensure the user has enough VRAM: smollm2-135m requires ~200MB, while llama3-8b needs ~6GB.
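
One way to hide the shader-compilation cost from tip 1 is to run a tiny throwaway generation right after `init()`, before the user's first real prompt. A minimal warmup sketch (the prompt and token limit are arbitrary):

```typescript
// Warm up the engine: the first completion triggers shader compilation,
// so run a tiny generation up front and discard the result.
await cortex.init();
await cortex.complete("ok", { maxTokens: 1 });

// Later generations skip the compilation hit.
const reply = await cortex.complete("Hello, how are you?", { maxTokens: 100 });
console.log(reply);
```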

ᚠ ᛫ ᛟ ᛫ ᚱ ᛫ ᛒ ᛫ ᛟ ᛫ ᚲ