Chakor — Custom LLM from Scratch
A ~7B parameter transformer model trained from the ground up, served via llama.cpp with a fully custom web interface — ai.tyfsadik.org
Overview
Chakor is a fully custom large language model project — the model, the training pipeline, and the web interface are all built from scratch. Unlike projects that wrap existing models (Llama, Mistral, GPT-J, etc.), the weights here were trained from random initialization on curated data I assembled myself. The result is a ~7 billion parameter transformer that runs 24/7 on my own server, accessible at ai.tyfsadik.org.
The frontend, Chakor, is a full-stack Next.js 15 application I designed and developed independently — featuring authentication, persistent conversation history, an admin dashboard, and integrated web search. No third-party AI APIs, no pre-built chat UIs, no off-the-shelf model weights. Everything from the attention kernels to the login page is mine.
The Model
The architecture is a decoder-only transformer, implemented in PyTorch, following the design patterns established by modern open LLMs but built entirely from my own codebase. Below are the key architectural parameters:
- Parameters: ~7 billion (custom-trained from scratch)
- Layers: 32 transformer decoder layers
- Attention heads: 32
- Hidden size: 4096
- Architecture: Decoder-only transformer (GPT-style autoregressive)
- Implementation: PyTorch — every layer, loss function, and training loop written by hand
Training
Training a 7B model from random initialization requires a massive amount of data and sustained compute. The training corpus was assembled from a mix of web text, code, and instruction-style data — totalling over 100 billion tokens. I ran distributed training across a multi-GPU NVIDIA setup over a period of several days to weeks depending on the training phase.
- Dataset: 100B+ tokens — web text, code, instruction data
- Hardware: Multi-GPU setup (NVIDIA GPUs), distributed training
- Training duration: Several days to weeks across phases
- Final training loss: ~1.8–2.5 range
- Perplexity: ~8–15 depending on evaluation set
After pretraining, I performed instruction tuning (supervised fine-tuning on instruction/response pairs) to improve the model's ability to follow prompts and maintain coherent multi-turn conversations — the same technique used to turn a raw pretrained model into a useful chat assistant.
Inference
After training, the PyTorch weights were converted to GGUF format for efficient inference with llama.cpp. GGUF enables quantized, low-overhead inference on both CPU and GPU without the full PyTorch runtime. The model runs continuously on my server with custom prompt formatting and context window management.
- Inference engine: llama.cpp
- Weight format: GGUF (converted from PyTorch checkpoint)
- Optimizations: Quantization for low-latency CPU/GPU inference
- Availability: 24/7 on self-hosted server
Chakor — The Web Interface
Chakor is the name of both the model and the web application I built to serve it. Rather than dropping in an off-the-shelf chat UI, I built the entire frontend and backend from the ground up to fit exactly what I needed.
Stack
- Frontend: Next.js 15, React, Tailwind CSS
- Backend: Node.js with Next.js API routes
- Inference backend: llama.cpp (custom REST integration)
- Reverse proxy: Nginx with WebSocket and SSE support
- SSL: Let's Encrypt / Certbot with automated renewal
Features
- Authentication system — custom-built user registration, login, and session management
- Persistent conversations — full conversation history saved and retrievable per user
- Admin dashboard — server monitoring, model status, user management
- Web search integration — search APIs surfaced inline during conversations
- Streaming responses — real-time token streaming via SSE (server-sent events)
- Custom prompt formatting — tailored to the model's instruction-tuned format
System Architecture
Request Flow
Build Process
Design the Model Architecture
I designed a decoder-only transformer in PyTorch: 32 layers, 32 attention heads, 4096 hidden dimensions — resulting in ~7B parameters. Every component was written by hand: the multi-head self-attention, feed-forward layers, RMSNorm, rotary positional embeddings, and the causal attention mask.
# Simplified example of the core attention block
class MultiHeadAttention(nn.Module):
def __init__(self, d_model=4096, n_heads=32):
super().__init__()
self.n_heads = n_heads
self.head_dim = d_model // n_heads
self.q = nn.Linear(d_model, d_model, bias=False)
self.k = nn.Linear(d_model, d_model, bias=False)
self.v = nn.Linear(d_model, d_model, bias=False)
self.out = nn.Linear(d_model, d_model, bias=False)
def forward(self, x, mask=None):
B, T, C = x.shape
q = self.q(x).view(B, T, self.n_heads, self.head_dim).transpose(1,2)
k = self.k(x).view(B, T, self.n_heads, self.head_dim).transpose(1,2)
v = self.v(x).view(B, T, self.n_heads, self.head_dim).transpose(1,2)
attn = (q @ k.transpose(-2,-1)) / (self.head_dim ** 0.5)
if mask is not None:
attn = attn.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(attn, dim=-1)
return (attn @ v).transpose(1,2).reshape(B, T, C)
Assemble and Preprocess the Training Corpus
I curated a 100B+ token dataset from web text, open-source code repositories, and instruction-style data. This involved writing data pipeline scripts to download, clean, tokenize, and shard the data into efficient binary format for streaming during training.
# Tokenize and shard dataset to binary format
python prepare_data.py \
--sources web_text code instruction \
--tokenizer custom_bpe \
--output_dir ./data/shards/ \
--shard_size 500M
# Verify token count
python count_tokens.py ./data/shards/
# Output: 103.4B tokens across 207 shards
Pretraining (Distributed, Multi-GPU)
Training was run in a distributed setup across multiple NVIDIA GPUs using PyTorch DDP (DistributedDataParallel). The training loop ran for several days to weeks, with checkpoints saved periodically. Final training loss settled in the 1.8–2.5 range.
# Launch distributed training
torchrun --nproc_per_node=NUM_GPUS train.py \
--model_config configs/7b.json \
--data_dir ./data/shards/ \
--batch_size 512 \
--lr 3e-4 \
--warmup_steps 2000 \
--max_steps 500000 \
--checkpoint_dir ./checkpoints/ \
--log_interval 100
Instruction Tuning (SFT)
After pretraining, I fine-tuned the model on instruction/response pairs using supervised fine-tuning (SFT). This stage teaches the model to follow prompts, maintain context across multi-turn conversations, and produce useful, structured responses rather than raw text completion.
# Instruction tuning on curated chat data
python sft_train.py \
--base_checkpoint ./checkpoints/pretrain_final.pt \
--data instruction_data.jsonl \
--epochs 3 \
--lr 1e-5 \
--output ./checkpoints/sft_final.pt
Convert to GGUF for llama.cpp
The final PyTorch checkpoint was converted to GGUF format using llama.cpp's conversion tooling. GGUF enables efficient quantized inference without the full PyTorch runtime, making it practical to run the 7B model continuously on server hardware.
# Convert PyTorch checkpoint to GGUF
python convert-hf-to-gguf.py ./checkpoints/sft_final/ \
--outtype q4_K_M \
--outfile chakor-7b-q4_K_M.gguf
# Start llama.cpp server
./llama-server \
-m ./chakor-7b-q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 4096 \
--n-predict 1024
Build Chakor (Next.js Full-Stack App)
I built the Chakor web app from scratch using Next.js 15 and React. The backend API routes handle authentication, session management, conversation persistence, and proxying requests to the llama.cpp inference server. The frontend streams responses in real time using SSE.
# Core API route: stream completion from llama.cpp
// app/api/chat/route.ts
export async function POST(req: Request) {
const { messages, conversationId } = await req.json();
const prompt = formatPrompt(messages); // custom prompt formatter
const llamaRes = await fetch('http://localhost:8080/completion', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ prompt, stream: true, n_predict: 1024 }),
});
// Stream tokens directly back to the client
return new Response(llamaRes.body, {
headers: { 'Content-Type': 'text/event-stream' },
});
}
Deploy with Nginx + Let's Encrypt
The Next.js app is proxied through Nginx with WebSocket and SSE support enabled. Let's Encrypt provides the SSL certificate, with automatic renewal via a systemd timer.
server {
listen 443 ssl;
server_name ai.tyfsadik.org;
location / {
proxy_pass http://localhost:3000;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
}
}
# Provision certificate
certbot --nginx -d ai.tyfsadik.org
Challenges & Solutions
- Training stability at 7B scale: Early training runs diverged due to learning rate spikes. Resolved by implementing a cosine LR schedule with a linear warmup phase (2000 steps), gradient clipping at 1.0, and careful weight initialization following the GPT-NeoX paper.
- Data quality at 100B+ tokens: Raw web data contains enormous amounts of boilerplate, duplicate content, and low-quality text that degrades model outputs. I wrote deduplication and quality filtering scripts to remove near-duplicates, perplexity-filter low-quality documents, and up-sample higher-quality sources like code and curated text.
- GGUF conversion from custom checkpoint: llama.cpp's conversion scripts expect HuggingFace-compatible checkpoint formats. Since my model isn't derived from an HF model, I had to write a compatibility shim that maps my PyTorch state dict keys to the expected HF format before running the GGUF converter.
-
SSE streaming through Next.js API routes: Next.js API routes buffer responses
by default, which breaks streaming. Resolved by using the Edge Runtime for the chat API route
and returning a raw
ReadableStreampiped directly from the llama.cpp server. - Context management for multi-turn chat: llama.cpp processes a flat prompt string, not structured messages. I built a custom prompt formatter that assembles the full conversation history into the model's instruction format and truncates it intelligently when the context window is approaching capacity.
-
Nginx timeout on long generations: Long outputs caused Nginx to drop connections
mid-stream. Fixed with
proxy_read_timeout 300sandproxy_send_timeout 300sin the location block.
What I Learned
- Transformer architecture internals — attention, positional encoding, normalization, and how architectural choices affect training dynamics at scale
- Distributed training with PyTorch DDP — gradient synchronization, checkpoint management, and debugging divergence across multiple GPUs
- Large-scale data pipeline engineering — deduplication, quality filtering, tokenization, and efficient sharded binary datasets
- The practical difference between pretraining (learning from raw text) and instruction tuning (learning to follow prompts)
- GGUF format internals and the weight conversion pipeline from PyTorch to llama.cpp
- Full-stack Next.js 15 development — App Router, API routes, Edge Runtime, and SSE streaming
- Nginx configuration for WebSocket and SSE proxying with long-lived connections
- The full end-to-end lifecycle of a production LLM: architecture → training → conversion → serving → UI