Microgpt
(karpathy.github.io)
1919 points
by: tambourine_man
3 days ago
323 comments
teleforce
3 days ago
[ - ]
Someone has modified microgpt to build a tiny GPT that generates Korean first names, and created a web page that visualizes the entire process [1].
Users can interactively explore the microgpt pipeline end to end, from tokenization through inference.
[1] English GPT lab:
https://ko-microgpt.vercel.app/
camkego
2 days ago
[ - ]
I have no affiliation with the website, but it's pretty neat if you are learning LLM internals. It explains: Tokenization, Embedding, Attention, Loss & Gradient, Training, Inference, and comparison to "Real GPT".
Pretty nifty, even if you are not interested in the Korean language.
sprobertson
2 days ago
[ - ]
This kind of thing is pretty easy to do with a much leaner model https://docs.pytorch.org/tutorials/intermediate/char_rnn_gen...
love2read
2 days ago
[ - ]
By "modified" this person of course means that they swapped out the list of X0,000 names from English to Korean names. That is seemingly the only change.
The attached website is a fully ai-generated "visualization" based on the original blog post with little added.
sahildeepreel
2 days ago
[ - ]
so impressive!
verma7
3 days ago
[ - ]
I wrote a C++ translation of it: https://github.com/verma7/microgpt/blob/main/microgpt.cc
2x the number of lines of code (~400L), 10x the speed
The hard part was figuring out how to represent the Value class in C++ (ended up using shared_ptrs).
WithinReason
3 days ago
[ - ]
I made an explicit reverse pass (no autodiff), it was 8x faster in Python
red_hare
3 days ago
[ - ]
This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html
tomjakubowski
2 days ago
[ - ]
I believe that Backbone's annotated source is generated with Docco, another project from the creator of CoffeeScript.
https://ashkenas.com/docco/
It's really neat. I wish I published more of my code this way.
ashish01
3 days ago
[ - ]
That is a really beautiful literate program. Seeing one after a long time. Here is an Opus-generated version of this code - https://ashish01.github.io/microgpt.html
subset
3 days ago
[ - ]
Andrej Karpathy has a walkthrough blog post here: https://karpathy.github.io/2026/02/12/microgpt/
altcognito
3 days ago
[ - ]
ask a high end LLM to do it
subset
3 days ago
[ - ]
I had good fun transliterating it to Rust as a learning experience (https://github.com/stochastical/microgpt-rs). The trickiest part was working out how to represent the autograd graph data structure with Rust types. I'm finalising some small tweaks to make it run in the browser via WebAssembly and then I'll write it up for my blog :) Andrej's code is really quite poetic, I love how much it packs into such a concise program
amelius
3 days ago
[ - ]
Storing the partial derivatives into the weights structure is quite the hack, to be honest. But everybody seems to do it like that.
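To make the pattern concrete: here is a minimal micrograd-style sketch (a toy illustration, not Karpathy's exact code) where each scalar parameter object carries a `.grad` slot right next to its `.data`, and the backward pass writes partial derivatives straight into it:

```python
# Each scalar parameter carries its own gradient slot, so the backward
# pass writes partial derivatives directly into the same structure that
# holds the weights -- the "hack" described above.
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0          # d(loss)/d(self), filled in by backward()
        self._backward = lambda: None
        self._prev = set(children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # chain rule: accumulate into the .grad fields of the inputs
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological sort, then propagate gradients from the output back
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b
loss.backward()
# a.grad == 3.0 and b.grad == 2.0: derivatives live next to the weights
```

The upside of the hack is locality: the optimizer update (`p.data -= lr * p.grad`) only ever needs one object per parameter.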
geokon
3 days ago
[ - ]
> What’s the deal with “hallucinations”? The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data.
Extremely naive question.. but could LLM output be tagged with some kind of confidence score? Like, if I'm asking an LLM some question, does it have an internal metric for how confident it is in its output? LLM outputs rarely seem to be of the form "I'm not really sure, but maybe this XXX" - but I always felt this is baked into the model somehow
andy12_
3 days ago
[ - ]
The model could report the confidence of its output distribution, but it isn't necessarily calibrated (that is, even if it tells you that it's 70% confident, it doesn't mean that it is right 70% of the time). Famously, pre-trained base models are calibrated, but they stop being calibrated when they are post-trained to be instruction-following chatbots [1].
Edit: There is also some other work that points out that chat models might not be calibrated at the token level, but might be calibrated at the concept level [2]. Which means that if you sample many answers and group them by semantic similarity, the grouping is also calibrated. The problem is that generating many answers and grouping them is more costly.
[1] https://arxiv.org/pdf/2303.08774 Figure 8
[2] https://arxiv.org/pdf/2511.04869 Figure 1.
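To make "calibrated" concrete, here is a small sketch of how calibration is usually checked: bucket predictions by the model's stated confidence and compare each bucket's average confidence to its empirical accuracy. The (confidence, correct) pairs here are synthetic, not from any real model:

```python
# Bucket (confidence, correct) pairs and compare stated confidence to
# empirical accuracy per bucket; a calibrated model matches them.
def calibration_buckets(preds, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    report = []
    for b in bins:
        if b:  # skip empty buckets
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            report.append((round(avg_conf, 2), round(accuracy, 2)))
    return report

# A calibrated model at 70% confidence is right ~70% of the time:
synthetic = [(0.7, True)] * 7 + [(0.7, False)] * 3
print(calibration_buckets(synthetic))  # [(0.7, 0.7)]
```

For a miscalibrated chat model, the second number in each pair drifts away from the first, which is exactly what the GPT-4 report's Figure 8 shows after RLHF.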
chongli
2 days ago
[ - ]
Having a confidence score isn't as useful as it seems unless you (the user) know a lot about the contents of the training set.
Think of traditional statistics. Suppose I said "80% of those sampled preferred apples to oranges, and my 95% confidence interval is within +/- 2% of that" but then I didn't tell you anything about how I collected the sample. Maybe I was talking to people at an apple pie festival? Who knows! Without more information on the sampling method, it's hard to make any kind of useful claim about a population.
This is why I remain so pessimistic about LLMs as a source of knowledge. Imagine you had a person who was raised from birth in a completely isolated lab environment and taught only how to read books, including the dictionary. They would know how all the words in those books relate to each other but know nothing of how that relates to the world. They could read the line "the killer drew his gun and aimed it at the victim" but what would they really know of it if they'd never seen a gun?
DavidSJ
3 days ago
[ - ]
Yes, the actual LLM returns a probability distribution, which gets sampled to produce output tokens.
[Edit: but to be clear, for a pretrained model this probability means "what's my estimate of the conditional probability of this token occurring in the pretraining dataset?", not "how likely is this statement to be true?" And for a post-trained model, the probability really has no simple interpretation other than "this is the probability that I will output this token in this situation".]
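The sampling step being described looks roughly like this (the logits below are made up, not any particular model's output): the model emits one logit per vocabulary token, softmax turns them into a probability distribution, and the next token is drawn from that distribution.

```python
import math, random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    # draw an index from the categorical distribution
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

vocab = ["a", "n", "t", "o", "<end>"]
logits = [2.0, 1.0, 0.5, 0.2, -1.0]        # hypothetical model output
probs = softmax(logits)
rng = random.Random(0)
token = vocab[sample(probs, rng)]
# Note: nothing in this step knows or cares whether the sampled
# continuation is factually true -- it is just a draw from probs.
```

Lowering the temperature sharpens the distribution toward the highest-logit token; raising it flattens the distribution toward uniform.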
Lionga
3 days ago
[ - ]
The LLM has an internal "confidence score" but that has NOTHING to do with how correct the answer is, only with how often the same words came together in training data.
E.g. getting two r's in strawberry could very well have a very high "confidence score", while a random but rare correct fact might very well have a very low one.
In short: LLMs have no concept of truth, nor even a desire to produce it
Otterly99
2 days ago
[ - ]
There is this paper that proposed data compression as a way to judge the ability of a LLM to "understand" things correctly, training on older texts and trying to predict more recent articles:
https://ar5iv.labs.arxiv.org/html//2402.00861
podnami
3 days ago
[ - ]
I would assume this varies from case to case, such as:
- How aligned has it been to "know" that something is true (eg ethical constraints)
- Statistical significance, and just being able to corroborate one alternative in its training data more strongly than another
- If it’s a web search related query, is the statement from original sources vs synthesised from say third party sources
But I’m just a layman and could be totally off here.
jorvi
3 days ago
[ - ]
> I'm not really sure, but maybe this XXX
You never see this in the response but you do in the reasoning.
danlitt
2 days ago
[ - ]
Can it generate one? Sure. But it won't mean anything, since you don't know (and nobody knows) the "true" distribution.
kuberwastaken
3 days ago
[ - ]
I'm half shocked this wasn't on HN before? Haha I built PicoGPT as a minified fork with <35 lines of JS and another in python
And it's small enough to run from a QR code :) https://kuber.studio/picogpt/
You can quite literally train a micro LLM from your phone's browser
growingswe
3 days ago
[ - ]
Great stuff! I wrote an interactive blogpost that walks through the code and visualizes it: https://growingswe.com/blog/microgpt
O4epegb
2 days ago
[ - ]
> By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset.
All 4 are in the dataset, btw
joenot443
3 days ago
[ - ]
This is awesome! Normally I'm pretty critical of LLM-assisted-blogging, but this one's a real winner.
evntdrvn
3 days ago
[ - ]
You should totally submit that to HN as an article, if you haven't already.
spinningslate
3 days ago
[ - ]
That’s beautifully done, thanks for posting. As helpful again to an ML novice like me as Karpathy’s original.
hei-lima
3 days ago
[ - ]
Great!
evntdrvn
3 days ago
[ - ]
really nice, thanks
la_fayette
3 days ago
[ - ]
This guy is so amazing! With his video and the code base I really have the feeling I understand gradient descent, back propagation, chain rule etc. Reading math only just confuses me, together with the code it makes it so clear! It feels like a lifetime achievement for me :-)
etothet
3 days ago
[ - ]
Even if you have some basic understanding of how LLMs work, I highly recommend Karpathy’s intro to LLMs videos on YouTube.
- https://m.youtube.com/watch?v=7xTGNNLPyMI - https://m.youtube.com/watch?v=EWvNQjAaOHw
znnajdla
3 days ago
[ - ]
Super useful exercise. My gut tells me that someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value, and then training LLMs won’t just be for billion dollar companies. Imagine, for example, a hyper-focused model for a specific programming framework (e.g. Laravel, Django, NextJS) trained only on open-source repositories and documentation and carefully optimized with a specialized harness for one task only: writing code for that framework (perhaps in tandem with a commodity frontier model). Could a single programmer or a small team on a household budget afford to train a model that works better/faster than OpenAI/Anthropic/DeepSeek for specialized tasks? My gut tells me this is possible; and I have a feeling that this will become mainstream, and then custom model training becomes the new “software development”.
rramadass
3 days ago
[ - ]
C++ version - https://github.com/Charbel199/microgpt.cpp?tab=readme-ov-fil...
Rust version - https://github.com/mplekh/rust-microgpt
freakynit
3 days ago
[ - ]
Is there something similar for diffusion models? By the way, this is incredibly useful for learning in depth the core of LLM's.
fulafel
3 days ago
[ - ]
This could make an interesting language shootout benchmark.
0xbadcafebee
3 days ago
[ - ]
Since this post is about art, I'll embed here my favorite LLM art: the IOCCC 2024 prize winner in bot talk, from Adrian Cable (https://www.ioccc.org/2024/cable1/index.html), minus the stdlib headers:
ruszki
3 days ago
[ - ]
> [p for mat in state_dict.values() for row in mat for p in row]
I'm so happy not to be seeing Python list comprehensions nowadays.
I don't know why they couldn't go with something like this:
[state_dict.values() for mat for row for p]
or in more difficult cases
[state_dict.values() for mat to mat*2 for row for p to p/2]
I know, I know, different times, but still.
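For anyone decoding the quoted one-liner: the for-clauses desugar to nested loops in the same left-to-right order. A toy `state_dict` (parameter name mapped to a matrix, i.e. a list of rows) stands in for the real one:

```python
# Stand-in for the real state_dict: name -> matrix (list of rows)
state_dict = {"wte": [[0.1, 0.2], [0.3, 0.4]],
              "lm_head": [[0.5, 0.6]]}

# The one-liner quoted above:
flat = [p for mat in state_dict.values() for row in mat for p in row]

# Its explicit-loop equivalent -- clauses nest left to right:
flat_loop = []
for mat in state_dict.values():
    for row in mat:
        for p in row:
            flat_loop.append(p)

assert flat == flat_loop == [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

Whether the comprehension or the loop reads better is taste, but they are exactly equivalent.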
easygenes
2 days ago
[ - ]
Inspiring. Definitely got nerd sniped by this. Now you can train it in under a second on one CPU core with no dependencies: https://github.com/Entrpi/eemicrogpt
Detailed optimizing journey in the readme too.
ThrowawayTestr
3 days ago
[ - ]
This is like those websites that implement an entire retro console in the browser.
astroanax
3 days ago
[ - ]
I feel it's wrong to call it microgpt, since it's smaller than nanogpt, so maybe picogpt would have been a better name? nice project tho
colonCapitalDee
3 days ago
[ - ]
Beautiful work
MattyRad
3 days ago
[ - ]
Hoenikker had been experimenting with melting and re-freezing ice-nine in the kitchen of his Cape Cod home.
Beautiful, perhaps like ice-nine is beautiful.
vadimf
3 days ago
[ - ]
I’m 100% sure the future consists of many models running on device. LLMs will be the mobile apps of the future (or a different architecture, but still intelligence).
sieste
3 days ago
[ - ]
The typos are interesting ("vocavulary", "inmput") - One of the godfathers of LLMs clearly does not use an LLM to improve his writing, and he doesn't even bother to use a simple spell checker.
smj-edison
2 days ago
[ - ]
Somewhat unrelated, but the generated names are surprisingly good! They're certainly more sane than appending -eigh to make a unique name.
coolThingsFirst
3 days ago
[ - ]
Incredibly fascinating. One thing is that it still seems very conceptual. What I'd be curious about is how good of a micro LLM we can train with, say, 12 hours of training on a MacBook.
jonjacky
2 days ago
[ - ]
I wonder if such a small GPT exhibits plagiarism. Are some of the generated names the same as names in the input data?
WithinReason
3 days ago
[ - ]
Previously:
https://news.ycombinator.com/item?id=47000263
borplk
3 days ago
[ - ]
Can anyone mention how you can "save the state" so it doesn't have to train from scratch on every run?
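One common answer, assuming the parameters live in a state_dict-style mapping of names to nested lists of floats (the kind of layout plain-Python implementations like this tend to use): serialize the mapping after training, then reload it on later runs and skip training. A sketch:

```python
import json

def save_checkpoint(state_dict, path):
    # nested lists of floats serialize directly to JSON
    with open(path, "w") as f:
        json.dump(state_dict, f)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

state_dict = {"wte": [[0.1, 0.2], [0.3, 0.4]]}  # stand-in for trained weights
save_checkpoint(state_dict, "model.json")
restored = load_checkpoint(save_path := "model.json")
assert restored == state_dict  # floats round-trip exactly through json
```

`pickle` works too and handles more types, but JSON keeps the checkpoint human-readable, which fits the spirit of a dependency-free teaching implementation.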
retube
3 days ago
[ - ]
Can you train this on say Wikipedia and have it generate semi-sensible responses?
stuckkeys
3 days ago
[ - ]
That web interface that someone commented in your github was flawless.
spopejoy
2 days ago
[ - ]
Sorry to RFELI5 but ... I thought a "token" was a word? The example is of names and the output is new improvised names, implying that a character is a token? Or do all LLMs operate at character level?
Also is there some minima of training data? E.g. if you just trained on "True" "False" I assume it would be .5 Bernoulli? What is the minimum to see "interesting" results I guess.
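On the first question: a token is whatever unit the tokenizer chooses, and a name generator like this one works at the character level. A sketch of that kind of tokenizer (toy name list; the `<end>` marker is an illustrative choice, not necessarily the symbol the original code uses):

```python
# Character-level tokenization: the vocabulary is the set of characters
# seen in the training names, plus a marker that terminates a name.
names = ["anna", "anton", "karai"]          # stand-in training set
chars = sorted(set("".join(names)))
stoi = {ch: i for i, ch in enumerate(chars)}
stoi["<end>"] = len(stoi)
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    return [stoi[ch] for ch in name] + [stoi["<end>"]]

def decode(tokens):
    return "".join(itos[t] for t in tokens if itos[t] != "<end>")

ids = encode("anna")
assert decode(ids) == "anna"
```

Big LLMs typically use subword (BPE-style) tokens, which sit between characters and words, so neither "token = word" nor "token = character" is universally true.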
jimbokun
3 days ago
[ - ]
It’s pretty staggering that a core algorithm simple enough to be expressed in 200 lines of Python can apparently be scaled up to achieve AGI.
Yes with some extra tricks and tweaks. But the core ideas are all here.
chenster
2 days ago
[ - ]
The best ML learning for dummies.
hmcamp
1 day ago
[ - ]
Cool
lynx97
2 days ago
[ - ]
Question: Can this be modified to score a "document"? I'd basically like to pass it a name, and get a score (0..1) on how realistic the model "thinks" the document is? This would be extremely helpful for a project of mine.
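In principle yes: run the model over the name one token at a time, take the probability it assigned to each actual next token, and average the log-probabilities. A sketch of the idea, where `next_token_probs` is a hypothetical stand-in for a trained model's forward pass (here just a fixed toy distribution):

```python
import math

def next_token_probs(context):
    # toy stand-in for a model: mildly prefers 'a' and 'n' regardless
    # of context; a real model would condition on the context
    return {"a": 0.4, "n": 0.3, "t": 0.2, "o": 0.1}

def score(name):
    logp = 0.0
    for i, ch in enumerate(name):
        probs = next_token_probs(name[:i])
        logp += math.log(probs.get(ch, 1e-9))  # tiny floor for unseen chars
    avg = logp / len(name)
    # geometric mean of per-token probabilities: a value in (0, 1]
    return math.exp(avg)

print(score("anna"))   # plausible characters -> higher score
print(score("zzzz"))   # characters the model never assigns mass -> near 0
```

This is the per-token likelihood (the exponential of negative perplexity per token), so it rewards "statistically plausible" rather than "real", but for names drawn from the same distribution as the training set that is usually what you want.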
hoppp
2 days ago
[ - ]
Was that code generated by claude?
bytesandbits
3 days ago
[ - ]
sensei karpathy has done it again
huqedato
3 days ago
[ - ]
Looking for alternative in Julia.
dhruv3006
3 days ago
[ - ]
Karpathy with another gem!
geon
3 days ago
[ - ]
Is there a similarly simple implementation with tensorflow?
I tried building a tiny model last weekend, but it was very difficult to find any articles that weren’t broken ai slop.
mold_aid
3 days ago
[ - ]
"art" project?
shevy-java
3 days ago
[ - ]
Microslop is alive!
ViktorRay
3 days ago
[ - ]
Which license is being used for this?
kelvinjps10
3 days ago
[ - ]
Why are there multiple comments talking about 1000 C lines, bots?
tithos
3 days ago
[ - ]
What is the prime use case
with
3 days ago
[ - ]
"everything else is just efficiency" is a nice line but the efficiency is the hard part. the core of a search engine is also trivial, rank documents by relevance. google's moat was making it work at scale. same applies here.
profsummergig
3 days ago
[ - ]
If anyone knows of a way to use this code on a consumer grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.