Cortex (Inference)


Cortex Module

Infèrence_Éngine // WébGPU_Accélerated

ᚠ ᛫ ᛟ ᛫ ᚱ ᛫ ᛒ ᛫ ᛟ ᛫ ᚲ

The Cortex module is the “Voice” of the AI. It handles local text generation using quantized Small Language Models (SLMs) running directly in the client’s browser via WebGPU.

No cloud. No latency. Pure thought.


Features

  • Local Inference: Runs entirely on the user’s device (Edge AI).
  • WebGPU Acceleration: Utilizes the GPU via @mlc-ai/web-llm for near-native performance (a support check is sketched after this list).
  • Offline Capable: Once the model is cached, it works without internet.
  • Privacy First: No prompt data leaves the user’s device.
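
Because inference runs entirely on WebGPU, it can help to feature-detect support before creating the engine. Below is a minimal sketch: `navigator.gpu` is the standard WebGPU entry point, while the fallback branch is purely illustrative and not part of the library.

```typescript
import { createCortex } from 'forbocai';

// Feature-detect WebGPU: navigator.gpu is the standard entry point and is
// undefined in browsers without WebGPU. (Cast avoids needing @webgpu/types.)
async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as any).gpu;
  if (!gpu) return false;
  // requestAdapter() resolves to null when no suitable GPU is available.
  return (await gpu.requestAdapter()) !== null;
}

if (await hasWebGPU()) {
  const cortex = createCortex({ model: 'smollm2-135m', gpu: true });
  await cortex.init();
} else {
  // Illustrative fallback: warn and skip GPU inference.
  console.warn('WebGPU not available; Cortex cannot run GPU-accelerated here.');
}
```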

Supported Models

| Model | Parameters | Quantization | Size | Use Case |
| --- | --- | --- | --- | --- |
| smollm2-135m | 135M | q4f16_1 | ~100MB | Low-end devices, simple dialogue |
| llama3-8b | 8B | q4f16_1 | ~4.5GB | High-end desktops, complex reasoning |
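
Which model to load is best decided per device. A rough selection sketch follows; the threshold and the use of `navigator.deviceMemory` (a coarse, Chromium-only RAM hint) are illustrative assumptions, not part of the library:

```typescript
import { createCortex } from 'forbocai';

// navigator.deviceMemory reports approximate RAM in GB (Chromium only);
// fall back to a conservative guess elsewhere.
const deviceMemoryGB = (navigator as any).deviceMemory ?? 4;

// Assumed rule of thumb: only load the 8B model on machines with plenty
// of memory (it needs roughly ~6GB of VRAM once loaded).
const model = deviceMemoryGB >= 16 ? 'llama3-8b' : 'smollm2-135m';

const cortex = createCortex({ model, gpu: true });
await cortex.init();
```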

Usage

Initialization

Initialize the Cortex engine. This triggers the model download if not cached.

```typescript
import { createCortex } from 'forbocai';

// Initialize with a lightweight model
const cortex = createCortex({
  model: 'smollm2-135m',
  gpu: true
});

// Load the model (async); this downloads the weights on first run
await cortex.init();

// The engine's status is internal/private; a resolved init() means Cortex is ready
console.log("Cortex Ready");
```

Generation

Generate text or dialogue.

```typescript
const response = await cortex.complete("Hello, how are you?", {
  temperature: 0.7,
  maxTokens: 100
});

console.log(response);
```
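
The `temperature` and `maxTokens` options are the main generation knobs: lower temperatures keep the model on its most likely tokens, higher ones allow more varied output. A quick comparison sketch (prompts and values are illustrative):

```typescript
// Low temperature: focused, near-deterministic replies
const factual = await cortex.complete("List the planets of the solar system.", {
  temperature: 0.1,
  maxTokens: 100
});

// High temperature: looser, more creative replies
const creative = await cortex.complete("Describe a neon-lit city at dawn.", {
  temperature: 0.9,
  maxTokens: 120
});

console.log(factual, creative);
```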

Streaming

Stream the response character-by-character for a retro terminal effect.

```typescript
const stream = cortex.completeStream("Tell me a story about a cyber-knight.");

for await (const chunk of stream) {
  // Append each chunk to the page as it arrives
  // (process.stdout is not available in the browser)
  document.body.append(chunk);
}
```
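
Depending on the engine, chunks from `completeStream` may arrive as multi-character tokens rather than single characters. For a true character-by-character typewriter feel, one option is to split each chunk and pace the output yourself; a small sketch (the 20ms delay is an arbitrary choice):

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const stream = cortex.completeStream("Tell me a story about a cyber-knight.");

for await (const chunk of stream) {
  // Split each streamed chunk into characters and pace them
  // for the retro terminal / typewriter effect.
  for (const char of chunk) {
    document.body.append(char);
    await sleep(20); // arbitrary per-character delay
  }
}
```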

Performance Tips

  1. Warmup: The first generation may be slower due to shader compilation (a warmup sketch follows this list).
  2. Caching: The model is stored via the browser Cache API, so subsequent loads skip the download.
  3. VRAM: Ensure the user has enough VRAM: smollm2-135m requires ~200MB, while llama3-8b needs ~6GB.
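
One way to hide the shader-compilation cost from tip 1 is to run a tiny throwaway generation right after `init()`, before the user's first real prompt. A minimal warmup sketch (the prompt and token limit are arbitrary):

```typescript
// Warm up the engine: the first completion triggers shader compilation,
// so run a tiny generation up front and discard the result.
await cortex.init();
await cortex.complete("ok", { maxTokens: 1 });

// Later generations skip the compilation hit.
const reply = await cortex.complete("Hello, how are you?", { maxTokens: 100 });
console.log(reply);
```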

ᚠ ᛫ ᛟ ᛫ ᚱ ᛫ ᛒ ᛫ ᛟ ᛫ ᚲ