# Cortex (Inference)

> Run local Small Language Models (SLMs) in the browser

# Cortex Module

`Infèrence_Éngine // WébGPU_Accélerated`

**ᚠ ᛫ ᛟ ᛫ ᚱ ᛫ ᛒ ᛫ ᛟ ᛫ ᚲ**

The Cortex module is the "Voice" of the AI. It handles local text generation using quantized Small Language Models (SLMs) running directly in the client's browser via WebGPU.

> *N̷o̴ ̷c̶l̵o̷u̴d̸.̴ ̷N̵o̶ ̸l̷a̴t̸e̸n̴c̵y̷.̸ ̷P̴u̴r̵e̴ ̶t̵h̷o̵u̸g̸h̸t̵.̷*

***

## Features

* **Local Inference**: Runs entirely on the user's device (Edge AI).
* **WebGPU Acceleration**: Uses the GPU via `@mlc-ai/web-llm` for near-native performance.
* **Offline Capable**: Once the model is cached, it works without an internet connection.
* **Privacy First**: No prompt data leaves the user's device.

***

## Supported Models

| Model          | Parameters | Quantization | Size    | Use Case                             |
| -------------- | ---------- | ------------ | ------- | ------------------------------------ |
| `smollm2-135m` | 135M       | q4f16\_1     | \~100MB | Low-end devices, simple dialogue     |
| `llama3-8b`    | 8B         | q4f16\_1     | \~4.5GB | High-end desktops, complex reasoning |

***

## Usage

### Initialization

Initialize the Cortex engine. This triggers the model download if the model is not already cached.

```typescript
import { createCortex } from 'forbocai';

// Initialize with a lightweight model
const cortex = createCortex({
  model: 'smollm2-135m',
  gpu: true
});

// Load the model (async: downloads and compiles on first run)
await cortex.init();

// init() resolves once the engine is ready; the internal status field is not exposed
console.log("Cortex Ready");
```

### Generation

Generate text or dialogue.

```typescript
const response = await cortex.complete("Hello, how are you?", {
  temperature: 0.7,
  maxTokens: 100
});

console.log(response);
```

### Streaming

Stream the response chunk by chunk for a retro terminal effect. The example below writes to a Node stream; a browser-side rendering sketch appears further below.

```typescript
const stream = cortex.completeStream("Tell me a story about a cyber-knight.");

for await (const chunk of stream) {
  process.stdout.write(chunk);
}
```

***

## Performance Tips

1. **Warmup**: The first generation may be slower due to shader compilation.
2. **Caching**: The model is stored via the browser Cache API, so subsequent loads skip the download.
3. **VRAM**: Ensure the user has enough VRAM: `smollm2-135m` requires \~200MB, while `llama3-8b` needs \~6GB (a WebGPU feature-detection sketch appears further below).

***

**ᚠ ᛫ ᛟ ᛫ ᚱ ᛫ ᛒ ᛫ ᛟ ᛫ ᚲ**
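The streaming example above uses `process.stdout.write`, which only exists in Node. Since Cortex targets the browser, here is a minimal rendering sketch for the same stream. It assumes `cortex` is the instance from the Initialization example, and the `#terminal` element is a hypothetical DOM node in your page, not part of the library.

```typescript
// Hypothetical DOM target — any element on the page works.
const terminal = document.querySelector<HTMLElement>('#terminal');

const stream = cortex.completeStream("Tell me a story about a cyber-knight.");

for await (const chunk of stream) {
  // Append each chunk as it arrives for the retro terminal effect.
  if (terminal) terminal.textContent += chunk;
}
```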
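The VRAM tip above assumes WebGPU is available at all. A minimal feature-detection sketch follows; it uses the standard `navigator.gpu.requestAdapter()` probe, while the `gpu: false` fallback option is an assumption mirroring the `gpu: true` flag shown in the Initialization example, not a documented part of `createCortex`.

```typescript
import { createCortex } from 'forbocai';

// Sketch: choose a safe configuration based on WebGPU availability.
async function createCortexForDevice() {
  // Standard WebGPU feature detection; resolves to null when no adapter exists.
  const gpu = (navigator as { gpu?: { requestAdapter(): Promise<unknown> } }).gpu;
  const adapter = gpu ? await gpu.requestAdapter() : null;

  return createCortex({
    // Default to the ~100MB model; only opt into 'llama3-8b' on machines
    // with ~6GB of free VRAM (see the Performance Tips).
    model: 'smollm2-135m',
    gpu: adapter != null // assumed flag: skip the GPU path when no adapter is found
  });
}

const cortex = await createCortexForDevice();
await cortex.init();
```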