Chakor — Custom LLM

Home Projects Infrastructure Chakor — Custom LLM

2026 AI / Infrastructure Live Demo ↗

Overview

Chakor is a fully custom large language model project — the model, the training pipeline, and the web interface are all built from scratch. Unlike projects that wrap existing models (Llama, Mistral, GPT-J, etc.), the weights here were trained from random initialization on curated data I assembled myself. The result is a ~7 billion parameter transformer that runs 24/7 on my own server, accessible at ai.tyfsadik.org.

The frontend, Chakor, is a full-stack Next.js 15 application I designed and developed independently — featuring authentication, persistent conversation history, an admin dashboard, and integrated web search. No third-party AI APIs, no pre-built chat UIs, no off-the-shelf model weights. Everything from the attention kernels to the login page is mine.

The Model

The architecture is a decoder-only transformer, implemented in PyTorch, following the design patterns established by modern open LLMs but built entirely from my own codebase. Below are the key architectural parameters:

Parameters: ~7 billion (custom-trained from scratch)
Layers: 32 transformer decoder layers
Attention heads: 32
Hidden size: 4096
Architecture: Decoder-only transformer (GPT-style autoregressive)
Implementation: PyTorch — every layer, loss function, and training loop written by hand

Training

Training a 7B model from random initialization requires a massive amount of data and sustained compute. The training corpus was assembled from a mix of web text, code, and instruction-style data — totalling over 100 billion tokens. I ran distributed training across a multi-GPU NVIDIA setup over a period of several days to weeks depending on the training phase.

Dataset: 100B+ tokens — web text, code, instruction data
Hardware: Multi-GPU setup (NVIDIA GPUs), distributed training
Training duration: Several days to weeks across phases
Final training loss: ~1.8–2.5 range
Perplexity: ~8–15 depending on evaluation set

After pretraining, I performed instruction tuning (supervised fine-tuning on instruction/response pairs) to improve the model's ability to follow prompts and maintain coherent multi-turn conversations — the same technique used to turn a raw pretrained model into a useful chat assistant.

Inference

After training, the PyTorch weights were converted to GGUF format for efficient inference with llama.cpp. GGUF enables quantized, low-overhead inference on both CPU and GPU without the full PyTorch runtime. The model runs continuously on my server with custom prompt formatting and context window management.

Inference engine: llama.cpp
Weight format: GGUF (converted from PyTorch checkpoint)
Optimizations: Quantization for low-latency CPU/GPU inference
Availability: 24/7 on self-hosted server

Chakor — The Web Interface

Chakor is the name of both the model and the web application I built to serve it. Rather than dropping in an off-the-shelf chat UI, I built the entire frontend and backend from the ground up to fit exactly what I needed.

Stack

Frontend: Next.js 15, React, Tailwind CSS
Backend: Node.js with Next.js API routes
Inference backend: llama.cpp (custom REST integration)
Reverse proxy: Nginx with WebSocket and SSE support
SSL: Let's Encrypt / Certbot with automated renewal

Features

Authentication system — custom-built user registration, login, and session management
Persistent conversations — full conversation history saved and retrievable per user
Admin dashboard — server monitoring, model status, user management
Web search integration — search APIs surfaced inline during conversations
Streaming responses — real-time token streaming via SSE (server-sent events)
Custom prompt formatting — tailored to the model's instruction-tuned format

System Architecture

graph LR U["User Browser"] NG["Nginx :443"] LE["Lets Encrypt"] FE["Next.js Frontend"] API["API Routes"] AUTH["Auth System"] LC["llama.cpp :8080"] WG[("GGUF Weights ~7B")] SR["Web Search API"] LE -- renews --> NG U -->|HTTPS| NG NG -->|proxy_pass| FE FE -->|requests| API API -->|session| AUTH API -->|POST completion| LC API -->|search| SR LC --- WG LC -->|SSE stream| API API -->|stream| FE FE -->|render| U style U fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0 style NG fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0 style LE fill:#181818,stroke:#444,color:#888 style FE fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0 style API fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0 style AUTH fill:#181818,stroke:#444,color:#888 style LC fill:#1a1a2e,stroke:#00ff88,color:#e0e0e0 style WG fill:#181818,stroke:#444,color:#888 style SR fill:#1a1a2e,stroke:#00d4ff,color:#e0e0e0

Request Flow

sequenceDiagram participant U as User Browser participant NG as Nginx participant FE as Next.js participant API as API Routes participant LC as llama.cpp participant SR as Search API U->>NG: HTTPS POST chat message NG->>FE: proxy_pass FE->>API: message and conversation history API->>SR: web search query SR-->>API: search results JSON API->>LC: POST /completion with prompt and context LC-->>API: streaming token response SSE API-->>FE: streamed output FE-->>U: real-time rendered response

Build Process

Design the Model Architecture

I designed a decoder-only transformer in PyTorch: 32 layers, 32 attention heads, 4096 hidden dimensions — resulting in ~7B parameters. Every component was written by hand: the multi-head self-attention, feed-forward layers, RMSNorm, rotary positional embeddings, and the causal attention mask.

# Simplified example of the core attention block
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=4096, n_heads=32):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.head_dim).transpose(1,2)
        k = self.k(x).view(B, T, self.n_heads, self.head_dim).transpose(1,2)
        v = self.v(x).view(B, T, self.n_heads, self.head_dim).transpose(1,2)
        attn = (q @ k.transpose(-2,-1)) / (self.head_dim ** 0.5)
        if mask is not None:
            attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        return (attn @ v).transpose(1,2).reshape(B, T, C)

Assemble and Preprocess the Training Corpus

I curated a 100B+ token dataset from web text, open-source code repositories, and instruction-style data. This involved writing data pipeline scripts to download, clean, tokenize, and shard the data into efficient binary format for streaming during training.

# Tokenize and shard dataset to binary format
python prepare_data.py \
  --sources web_text code instruction \
  --tokenizer custom_bpe \
  --output_dir ./data/shards/ \
  --shard_size 500M

# Verify token count
python count_tokens.py ./data/shards/
# Output: 103.4B tokens across 207 shards

Pretraining (Distributed, Multi-GPU)

Training was run in a distributed setup across multiple NVIDIA GPUs using PyTorch DDP (DistributedDataParallel). The training loop ran for several days to weeks, with checkpoints saved periodically. Final training loss settled in the 1.8–2.5 range.

# Launch distributed training
torchrun --nproc_per_node=NUM_GPUS train.py \
  --model_config configs/7b.json \
  --data_dir ./data/shards/ \
  --batch_size 512 \
  --lr 3e-4 \
  --warmup_steps 2000 \
  --max_steps 500000 \
  --checkpoint_dir ./checkpoints/ \
  --log_interval 100

Instruction Tuning (SFT)

After pretraining, I fine-tuned the model on instruction/response pairs using supervised fine-tuning (SFT). This stage teaches the model to follow prompts, maintain context across multi-turn conversations, and produce useful, structured responses rather than raw text completion.

# Instruction tuning on curated chat data
python sft_train.py \
  --base_checkpoint ./checkpoints/pretrain_final.pt \
  --data instruction_data.jsonl \
  --epochs 3 \
  --lr 1e-5 \
  --output ./checkpoints/sft_final.pt

Convert to GGUF for llama.cpp

The final PyTorch checkpoint was converted to GGUF format using llama.cpp's conversion tooling. GGUF enables efficient quantized inference without the full PyTorch runtime, making it practical to run the 7B model continuously on server hardware.

# Convert PyTorch checkpoint to GGUF
python convert-hf-to-gguf.py ./checkpoints/sft_final/ \
  --outtype q4_K_M \
  --outfile chakor-7b-q4_K_M.gguf

# Start llama.cpp server
./llama-server \
  -m ./chakor-7b-q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096 \
  --n-predict 1024

Build Chakor (Next.js Full-Stack App)

I built the Chakor web app from scratch using Next.js 15 and React. The backend API routes handle authentication, session management, conversation persistence, and proxying requests to the llama.cpp inference server. The frontend streams responses in real time using SSE.

# Core API route: stream completion from llama.cpp
// app/api/chat/route.ts
export async function POST(req: Request) {
  const { messages, conversationId } = await req.json();
  const prompt = formatPrompt(messages); // custom prompt formatter

  const llamaRes = await fetch('http://localhost:8080/completion', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, stream: true, n_predict: 1024 }),
  });

  // Stream tokens directly back to the client
  return new Response(llamaRes.body, {
    headers: { 'Content-Type': 'text/event-stream' },
  });
}

Deploy with Nginx + Let's Encrypt

The Next.js app is proxied through Nginx with WebSocket and SSE support enabled. Let's Encrypt provides the SSL certificate, with automatic renewal via a systemd timer.

server {
    listen 443 ssl;
    server_name ai.tyfsadik.org;

    location / {
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

# Provision certificate
certbot --nginx -d ai.tyfsadik.org

Challenges & Solutions

Training stability at 7B scale: Early training runs diverged due to learning rate spikes. Resolved by implementing a cosine LR schedule with a linear warmup phase (2000 steps), gradient clipping at 1.0, and careful weight initialization following the GPT-NeoX paper.
Data quality at 100B+ tokens: Raw web data contains enormous amounts of boilerplate, duplicate content, and low-quality text that degrades model outputs. I wrote deduplication and quality filtering scripts to remove near-duplicates, perplexity-filter low-quality documents, and up-sample higher-quality sources like code and curated text.
GGUF conversion from custom checkpoint: llama.cpp's conversion scripts expect HuggingFace-compatible checkpoint formats. Since my model isn't derived from an HF model, I had to write a compatibility shim that maps my PyTorch state dict keys to the expected HF format before running the GGUF converter.
SSE streaming through Next.js API routes: Next.js API routes buffer responses by default, which breaks streaming. Resolved by using the Edge Runtime for the chat API route and returning a raw ReadableStream piped directly from the llama.cpp server.
Context management for multi-turn chat: llama.cpp processes a flat prompt string, not structured messages. I built a custom prompt formatter that assembles the full conversation history into the model's instruction format and truncates it intelligently when the context window is approaching capacity.
Nginx timeout on long generations: Long outputs caused Nginx to drop connections mid-stream. Fixed with proxy_read_timeout 300s and proxy_send_timeout 300s in the location block.

What I Learned

Transformer architecture internals — attention, positional encoding, normalization, and how architectural choices affect training dynamics at scale
Distributed training with PyTorch DDP — gradient synchronization, checkpoint management, and debugging divergence across multiple GPUs
Large-scale data pipeline engineering — deduplication, quality filtering, tokenization, and efficient sharded binary datasets
The practical difference between pretraining (learning from raw text) and instruction tuning (learning to follow prompts)
GGUF format internals and the weight conversion pipeline from PyTorch to llama.cpp
Full-stack Next.js 15 development — App Router, API routes, Edge Runtime, and SSE streaming
Nginx configuration for WebSocket and SSE proxying with long-lived connections
The full end-to-end lifecycle of a production LLM: architecture → training → conversion → serving → UI

AI LLM PyTorch Transformer llama.cpp GGUF Next.js Docker Nginx Linux Distributed Training Privacy

Chakor — Custom LLM from Scratch