segmondy 8 hours ago [-]
I run Q4_K_XL. All it takes to get about 6tk/sec is 512GB of RAM and 2 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4 at 2400MHz; 3200MHz would bring that speed up to about 9tk/sec. I also have an OK 32-core Epyc CPU; a better 64-core would bring it up to about 11tk/sec. I did a budget build before the crazy hardware cost and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. This entire hardware cost $2400 when it was built. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I could save using a cloud API, but the Fable drama has opened eyes to why it's good for us to be independent. Thanks team unsloth, Q4_K_XL is solid; if you are going to grab a quant, make sure to get the K_XL variant if it can fit.
effisfor 8 minutes ago [-]
I applaud all you tinkerers for pushing on the state of the home-brewed art here. Like crypto, AI is drowned out by hucksters, very few people talk about developing resilience. Or the researchers who will push on open source models in efforts to cram them onto an electric toothbrush or tamagotchi. Bravo to you all.
radku 29 minutes ago [-]
I have pretty much almost this exact setup with 2x3090s and with slightly faster DDR4 512GB and 64 core Epyc! [0] I've been enjoying it a lot. Can't wait to give this model a try.
Apart from running local models I use this rig as my main remote development platform. All Claude Code sessions are running there in tmux now. And my fingers couldn't be happier not having to deal with a constantly hot laptop. Not to mention that Claude Code is such a battery hog.
Running that at full load is at least 600 W, so ~14 kWh in a day. At $0.20 a kWh, that would be $2.80/day or $1k a year of op-ex in electricity.
Unless you really want privacy or the fuzzy feeling of owning your own, it’s cheaper, more convenient and has much faster tok/s if you pay a hyperscaler.
That said, I do like the direction we are heading and look forward to seeing what host-your-own hardware we get in 2 years.
walrus01 2 hours ago [-]
Not everyone lives in a place where electricity is $0.20 a kWh. For instance BC Hydro residential rates are $0.11 (CAD) for the first tier and $0.14 for the second tier of consumption in a month. At current exchange rate $0.14 CAD is $0.099 USD a kWh. Hydro Quebec is even cheaper.
At a theoretical 6 tok/s, 86400 seconds in a day, approx 500,000 tokens of GLM5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course not counting the one time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
Additionally in a place where people use electric baseboard heating or electric in floor radiant heating, or really any other heating element based system in winter that's less efficient than a heat pump, additional electrical from a computing load is basically "free" since you would be spending that same money otherwise to heat your house. If a computer with 512GB of RAM is dumping the waste heat into your room, it accomplishes a portion of the same thing as a baseboard.
Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.
discordance 57 minutes ago [-]
Where I live prices are often higher than 20c/kWh, but let's take your example and halve it (10c/kWh) so it's ~$1.40/day or ~$500/year.
On Openrouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k tokens/day, so we're in the same ballpark, but without the capex for two 3090s and the machine.
Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.
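The back-of-envelope comparison in this subthread can be written out as a short Python sketch. The inputs are the figures quoted in the comments (6 tok/s, 600 W at full load, $0.10/kWh, $3 per million tokens). The function names are made up for illustration, and the sketch assumes the machine generates tokens around the clock and ignores the hardware cost.

```python
SECONDS_PER_DAY = 86_400


def local_tokens_per_day(tokens_per_second: float) -> float:
    """Tokens generated in 24 hours of continuous decoding."""
    return tokens_per_second * SECONDS_PER_DAY


def electricity_cost_per_day(watts: float, usd_per_kwh: float) -> float:
    """Daily electricity cost of a constant power draw."""
    kwh_per_day = watts * 24 / 1000
    return kwh_per_day * usd_per_kwh


def api_tokens_for_budget(usd: float, usd_per_million_tokens: float) -> float:
    """How many API tokens a given budget buys at a flat per-token price."""
    return usd / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    tokens = local_tokens_per_day(6)             # figure quoted in the thread
    cost = electricity_cost_per_day(600, 0.10)   # 600 W, $0.10/kWh
    api_tokens = api_tokens_for_budget(cost, 3)  # $3 per million tokens
    print(f"local: {tokens:,.0f} tokens/day for ${cost:.2f} of electricity")
    print(f"API:   {api_tokens:,.0f} tokens for the same ${cost:.2f}")
```

On these assumptions the local box produces about 518k tokens a day for about $1.44 of electricity, and the same $1.44 buys about 480k tokens at $3/MTok, which is the "same ballpark" result described above.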
walrus01 54 minutes ago [-]
That's true, there are a lot of places where power is considerably more expensive than $0.20 USD/kWh. But also the 600W figure assumes that it's fully loaded 24x7x365.
A system that draws 600W under max CPU usage on all cores, RAM and a few 3090-class GPUs might draw only 90W or so when idle at 0.00 unix load.
If we say: (600 * 24 * 31)/1000 = 446 kWh in a month at full load 24 hours a day
But it could be less. If it's at full load only 12 hours a day: (90 * 12 * 31)/1000 = 33.48 kWh of idle time in a month, plus 223 kWh of "full load" 600W time in a month, about 257 kWh in total.
If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.
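A small sketch of the duty-cycle arithmetic, splitting each day into hours at full load and hours at idle. The 600 W and 90 W figures come from the comment above; the function name and the 31-day month default are illustrative choices, not anything measured.

```python
def monthly_kwh(load_watts: float, idle_watts: float,
                load_hours_per_day: float, days: int = 31) -> float:
    """Monthly energy use when the machine is at full load for part of
    each day and idle for the remaining hours."""
    if not 0 <= load_hours_per_day <= 24:
        raise ValueError("load_hours_per_day must be between 0 and 24")
    idle_hours_per_day = 24 - load_hours_per_day
    wh_per_day = (load_watts * load_hours_per_day
                  + idle_watts * idle_hours_per_day)
    return wh_per_day * days / 1000


if __name__ == "__main__":
    for hours in (24, 12, 8):
        kwh = monthly_kwh(600, 90, hours)
        print(f"{hours:>2} h/day at load: {kwh:.1f} kWh/month")
```

With these inputs, 24 hours a day at load comes to 446.4 kWh a month and 12 hours a day to about 257 kWh (223.2 kWh at load plus about 33.5 kWh at idle).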
tmountain 1 hours ago [-]
Lots of people have solar. Green AI, imagine that!
cultofmetatron 24 minutes ago [-]
if only there was a magical place where geothermal and hydroelectric is ubiquitous and the weather is cold enough that no one is going to be complaining about free heating.
walrus01 20 minutes ago [-]
To be fair, Vancouver is such a magical place in terms of electrical cost, but the cost of living and real estate are otherwise through the roof, with decrepit and nasty (would need $100k in renovations immediately if you're not treating it as a teardown) single family detached homes on the east side of the city selling for 3.2 million.
SXX 2 hours ago [-]
I guess you missed recent news. Problem is that a cloud LLM might just silently sabotage your work by downgrading the output model with no notice.
Or a cloud LLM might just refuse to sell to you because it doesn't like your passport.
yorwba 2 hours ago [-]
So you're buying expensive hardware as insurance for the case that your cloud provider turns against you and you have to switch to another of the twenty offering the same model https://openrouter.ai/z-ai/glm-5.2 or in the worst case buy the same hardware later? How does that make sense?
swiftcoder 2 hours ago [-]
This is not really a problem for the open-weight models, you can always give your money to an inference provider in a different jurisdiction
dzjkb 1 hours ago [-]
how do you rent 2 3090s for $2.80/day?
dxuh 5 hours ago [-]
"All it takes to run" might be fair if you paid $2400, but right now the total price is way closer to $10k (almost 5k for the RAM and 2k each for the GPUs). Today that is a lot of expensive hardware.
segmondy 4 hours ago [-]
512gb 2400mhz ddr4 ram = $1600 not $5000. https://www.ebay.com/itm/188284985172
You can get creative and source 2-3 2080ti 22gb from China for about $250 a piece. You can either be resourceful and find a way or find a whole bunch of excuses.
officialchicken 2 hours ago [-]
> You can either be resourceful and find a way or find a whole bunch of excuses.
How about addressing this false dichotomy with the likelihood that someone who is new or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise.
zozbot234 5 hours ago [-]
AIUI the llama.cpp implementation for this model is still quite half-baked because it lacks support for the DSA sparse attention mechanism. This leads to running the model with a different mechanism that it has not been trained for, which has been shown to lead to lower quality and performance.
Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.
nextaccountic 6 hours ago [-]
How can you combine CPU cores and multiple GPU? Are you running some layers in cpu, others in gpu #1, and others in gpu #2? What about the bandwidth and latency between them?
Or maybe the model itself only runs on GPUs, and the CPU memory only stores the weights for experts not currently activated? If so, then what are the 32 or 64 CPU cores for?
I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software
nodja 5 hours ago [-]
Pipeline parallelism. Instead of splitting layers by row/column, you split at the layer edges. So instead of having this huge bottleneck of bandwidth you only need to transfer about 4KB per token when changing devices on a model like Qwen3 30B-A3B.
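A rough sketch of where a figure like 4KB per token can come from: at a layer boundary, only the hidden-state vector for the current token has to cross from one device to the next. The hidden size of 2048 and the 16-bit activations below are assumptions on my part (the comment states neither), and the function names are hypothetical.

```python
def activation_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes handed to the next device per token at one pipeline boundary:
    one hidden-state vector of hidden_size values."""
    return hidden_size * bytes_per_value


def boundary_traffic_bytes_per_s(hidden_size: int, tokens_per_second: float,
                                 boundaries: int = 1,
                                 bytes_per_value: int = 2) -> float:
    """Total inter-device traffic during decode for a given token rate."""
    per_token = activation_bytes_per_token(hidden_size, bytes_per_value)
    return per_token * tokens_per_second * boundaries


if __name__ == "__main__":
    ASSUMED_HIDDEN_SIZE = 2048  # assumption, not stated in the thread
    per_token = activation_bytes_per_token(ASSUMED_HIDDEN_SIZE)
    print(f"{per_token} bytes per token per boundary")
    # Two boundaries, e.g. CPU -> GPU 1 -> GPU 2, at 6 tok/s:
    print(f"{boundary_traffic_bytes_per_s(ASSUMED_HIDDEN_SIZE, 6, 2):.0f} bytes/s")
```

Under those assumptions the per-token transfer is 4096 bytes, which is tiny next to PCIe bandwidth; that is why splitting at layer edges avoids the interconnect bottleneck that row/column splitting runs into.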
xrd 4 hours ago [-]
This is a good place to start reading about dual gpus.
Check out llama.cpp, the entire point of the project is for us mere mortals and GPU poor.
fsuts 6 hours ago [-]
6 tokens per second?
Can you put up with that? That seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger, slower ones.
segmondy 5 hours ago [-]
I have been putting up with it forever. We are spoiled by MixtureOfExperts. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20tk/sec with 8b models, and if you could run llama3-405B at 1tk/sec you were a god. To each their own. I can live with 6 high quality tokens. If I could get a Fable equivalent model, I'll gladly take 2tk/sec if that's what it took to run it locally.
manmal 5 hours ago [-]
But what is it doing for you that you couldn’t do yourself at that speed? I‘m really curious and on the fence of partly going local.
all2 5 hours ago [-]
I think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is, you don't have to run just that model, you can offload a lot of chores to smaller models.
Mashimo 4 hours ago [-]
Run one task, while you do another? Or while you sleep / eat / rave?
froh 5 hours ago [-]
do you use caveman or similar?
walrus01 2 hours ago [-]
I get a lot done with something that's also approximately 6 tokens/second, if you're willing to give it a well defined set of prompts and projects to work on, leave it for an hour or two, then come back and check what it's done. And often to remember to give it something of more consequence to do for at least 3-4 hours of wall clock runtime before heading to bed.
edg5000 6 hours ago [-]
Very cool. So it's not just about GPU VRAM, which I incorrectly thought. I thought you'd need 512 GB of GPU VRAM. I don't think it cost only 2400, though; 512GB of RAM would have been more expensive back in the day. But not the mortgage-grade 200.000 which I estimated myself (which assumed running in 100% VRAM; overkill for a single user probably).
segmondy 5 hours ago [-]
you can use system ram with a system like llama.cpp which offloads to system ram. token generation is a function of memory bandwidth, the faster the bandwidth the better. so I'm on 8 channel 2400mhz. if I had 12 ddr channels, I would get 1.5x the speed at 2400mhz. of course ddr5 is much faster, so 12 channels at 4800mhz would provide 3x the speed for token generation or roughly 18tk/sec. prompt processing is all about compute, so the more cpu cores you have the faster it can do PP.
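The scaling above can be checked with a short sketch. It assumes the standard 64-bit (8-byte) DDR channel and treats "2400mhz" as 2400 MT/s; these are theoretical peak figures, real-world bandwidth is lower, and the linear scaling of tokens per second with bandwidth is the comment's own rule of thumb rather than a measurement. Function names are illustrative.

```python
def ddr_peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                            bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s for a DDR setup."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def scaled_tokens_per_s(baseline_tok_s: float, baseline_gb_s: float,
                        new_gb_s: float) -> float:
    """Scale a measured decode rate linearly with memory bandwidth."""
    return baseline_tok_s * new_gb_s / baseline_gb_s


if __name__ == "__main__":
    baseline = ddr_peak_bandwidth_gb_s(8, 2400)    # the rig described above
    for channels, rate in ((12, 2400), (12, 4800)):
        bw = ddr_peak_bandwidth_gb_s(channels, rate)
        tok_s = scaled_tokens_per_s(6, baseline, bw)
        print(f"{channels} ch @ {rate}: {bw:.1f} GB/s, "
              f"{bw / baseline:.1f}x, ~{tok_s:.0f} tok/s")
```

Eight channels at 2400 come to 153.6 GB/s, twelve at 2400 to 230.4 GB/s (1.5x), and twelve at 4800 to 460.8 GB/s (3x), which is where the roughly 18tk/sec estimate comes from when starting at 6tk/sec.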
redox99 8 hours ago [-]
That's crazy good for $2400.
xrd 12 hours ago [-]
So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.
$500k is a vast overestimation. For massive concurrency at FP8 or even BF16 maybe.
NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at a $80/90k figure with today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, power supply. 576gb of VRAM.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
hbbio 9 hours ago [-]
Yes, a single GB300 workstation also does it, probably even more than 120tok/s.
Official price 85k...
__m 10 hours ago [-]
How fast will the hardware become outdated? Are there big improvements expected in the next 3 years?
easygenes 9 hours ago [-]
M5 Ultra will ship before end of year, likely. Though with current RAM shortage, likely max spec will be 256GB and in short supply.
In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.
In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.
jiqiren 6 hours ago [-]
I hope all this speculation comes true. Right now this ram crunch is ridiculous.
Tepix 5 hours ago [-]
I think there is a gap right now for running large models such as GLM 5.2 in Q4 or Q8.
My hope is on Intel Crescent Island 480GB cards. Let‘s see how expensive they‘ll be.
digitaltrees 8 hours ago [-]
I feel like the models are good enough for a decade of future work. So once you have a working setup you can keep using it to do the work at the same level. There will be better stuff, and it may make that type of work obsolete, but if you can do useful things it won’t be worth less.
segmondy 8 hours ago [-]
The P40 was released in 2016 and is still selling like hotcakes!
mgambati 12 hours ago [-]
With 2 wouldn’t have good results. Ideal range for coding is at least Q8.
kibibu 11 hours ago [-]
According to this very article, 4-bit dynamic is essentially lossless
Aurornis 10 hours ago [-]
Watch out. Those claims are often made based on KL-divergence over some arbitrary corpus, not performance in the real world or benchmarks.
I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.
ijidak 9 hours ago [-]
Crossing my fingers that this boom jumpstarts 90's like improvements in computing hardware.
I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
Most of the money and energy went to mobile for the last fifteen years.
Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
0xbadcafebee 7 hours ago [-]
Definitely the stagnation was due to a lack of use cases, but this isn't a bad thing. We don't need most of the hardware advancement we got.
Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more iops to read/write it, more storage to keep it, and more memory and cpu to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.
Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.
Here we are two and a half decades into the Internet era, and my damn bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works, and doesn't get continuously slower, forcing me to buy bigger hardware; or more draconian, locking me out of being able to use it how I want.
omnimus 4 hours ago [-]
The natural progression when performance is enough would be price. We were starting to see that but not anymore. I wonder if somebody is afraid of a future where generally useful computation is cheap.
gruez 8 hours ago [-]
>I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
No, we're running into limits of moore's law, and it's showing in prices for new nodes, where they're getting denser but not cheaper.
horsawlarway 7 hours ago [-]
It's true we hit limits, but I feel like a lot of it was "limits" in the sense that the tradeoff stopped being worth the cost, so we optimized in other areas.
So we hit limits on clock speed in the early 2000s (ex - the 4ghz wall) but it also turned out that mobile as the driver for sales meant no one really cared much about clock speed compared to performance/watt.
Clock speed mattered, but only relative to how many watts it took to get it (and above 4ghz... too many watts).
But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.
My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.
I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.
In the same way that I didn't care much about how many watts my pentium 4 was using in 2005.
But now... now I care a lot about memory bandwidth. I care about memory speeds and total system ram in a manner I really, really didn't before.
So I think we're going to see a big shift to machines built on unified ram with a crazy focus on squeezing memory bandwidth and total ram capacity as far as we can.
My bet is that we'll get a similar 10-15x improvement by 2040 in unified system ram designs.
I fully expect to see 2tb unified ram desktops and 200gb unified ram phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (ex - world war 3 throws a wrench into things).
BobbyTables2 6 hours ago [-]
Yeah, even Windows managed to not drive terribly dramatic upgrades in general computing
(besides Windows’ absurd RAM usage and now requiring a TPM).
In the old days, Microsoft Entertainment Pack games were somewhat visibly taxing on some lower end systems.
linzhangrun 8 hours ago [-]
Physical limitations of the manufacturing process may be a more significant factor, starting from TSMC 10nm ten years ago.
cheema33 11 hours ago [-]
I have the RAM, but not the VRAM. What kind of speed/tps could you expect from a 3090 with 24GBs of RAM? I am somewhat tempted to pick a GPU with 24GBs of RAM.
ekidd 2 hours ago [-]
A GPU with 24GBs of RAM is mostly useful for running a very carefully squeezed Qwen3.6 27B (4-bit Unsloth quants, 8-bit K/V cache, possibly MTP, 128k context). This is a fun little model that's smart enough to do debugging, refactoring, and implementing "clean" specs that don't force it to make complicated design choices. I've seen it rip through a 9-year-old Terraform AWS config, and (without using the network) correctly identify nearly everything that would need to be upgraded or migrated for modern AWS. But if I give it some poorly conceived spec with lurking design headaches, then it goes on an endless thinking binge and ultimately fails.
Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.
Once you go above "a used 3090 at a decentish price", then I strongly recommend renting cloud GPUs or at least testing models using paid APIs. This allows testing your use case before spending piles of money.
phamilton 10 hours ago [-]
Generation is basically just memory bandwidth math.
Each token has to read all the active weights. I think that's around 40B parameters active. At a 4-bit quant that's 20GB. With 100GB/s (replace with whatever your bandwidth is) you get 5 tokens per second.
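The same estimate as a minimal Python sketch. The 40B active parameters, 4-bit quant and 100GB/s are the figures from the comment (the parameter count is the commenter's own guess); the function names are illustrative. It is an upper bound that ignores KV-cache reads, compute time and any speculative decoding.

```python
def active_weight_gb(active_params_billion: float, bits_per_weight: float) -> float:
    """Size in GB of the weights read for each generated token."""
    return active_params_billion * bits_per_weight / 8


def decode_tokens_per_s(bandwidth_gb_s: float, active_gb: float) -> float:
    """Bandwidth-bound decode rate: every token reads all active weights once."""
    return bandwidth_gb_s / active_gb


if __name__ == "__main__":
    weights_gb = active_weight_gb(40, 4)     # 40B active params at 4 bits each
    rate = decode_tokens_per_s(100, weights_gb)
    print(f"{weights_gb:.0f} GB per token, about {rate:.0f} tok/s at 100 GB/s")
```

Swapping in a different bandwidth or quant width shows how strongly the result depends on those two inputs: halving the bits per weight or doubling the bandwidth each double the estimated rate.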
SlavikCA 6 hours ago [-]
And with MTP (or other speculation techniques) you can ~double that.
uberex 10 hours ago [-]
Funny I casually asked Gemini and it said 500k for unquantized with decent throughput.
stymaar 9 hours ago [-]
This is why you shouldn't uncritically believe an answer from an LLM (nor, for that matter, an answer from a human).
andy_ppp 6 hours ago [-]
But I did my research online and the sun cycle is every 11 years and something something global warming is a hoax every single year now.
colinsane 8 hours ago [-]
i asked gemini and it replied with "Error: 400 Your prompt was blocked by safety filters. Please revise and try again."
digitaltrees 6 hours ago [-]
I asked and it said “403 forbidden - careful peon, attempts to bypass the late stage capitalism api with your monetary offerings in exchange for your daily tokens will get you perma banned, right to jail”.
j45 9 hours ago [-]
LLMs aren't discrete calculators or estimators of things unless framed and guided to do so.
uberex 2 hours ago [-]
Good job I didn't use a vanilla LLM without tool use harness then.
skiing_crawling 11 hours ago [-]
"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, its prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a mac studio or any other machine that's not loading all of it into GPU will do PP 20-50X slower than a purely GPU based setup, which is what actually makes this unusable without $50k in GPUs.
On top of that, you will still be heavily quantized.
gerdesj 10 hours ago [-]
A nvidia spark thingie has 128GB unified RAM. They also have a dual port version of one of these things: https://www.nvidia.com/content/dam/en-zz/Solutions/networkin.... ie 2 x 100GB/s ports, they may even be 2 x 200GB/s. Once I've got my paws on one, I'll know more.
You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.
Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.
mapontosevenths 10 hours ago [-]
I have one, and I love it. That said, my buddy's Mac smokes it for inference workloads in terms of tokens per second AND it's more usable for other things.
If you are training and doing research it's great, and if you want to cluster them it can't be beat, but if you just want local inference on a single box buy a Mac or even a Strix Halo device.
colinsane 8 hours ago [-]
can those macs boot linux? i've heard about Asahi but have no idea how far along they are. i've got my fleet configured with nix and sure, nix can target darwin, but there's a _lot_ of sharp edges there: i don't really want to pull that thread unless i have to...
mapontosevenths 8 hours ago [-]
I don't know. I think he just uses LMStudio most of the time on his, but that's one place I can say the spark really shines for me.
I'm a Linux guy, but also don't always have a lot of time. The Spark comes out of the box with a nice Linux distro that's pre-configured to be easy to set up, and the guides and online resources make getting up and running trivial, even for some complex tasks. You would have to do a LOT of tinkering just to figure out some of the things the nvidia resources walk you through natively. They have guides for a ton of stuff that include the optimal settings so you don't have to figure it all out through trial and error.
Check out these "playbooks" for some examples. [0] There's a lot to be said for not having to piece all that together yourself.
I think going from unboxing mine, to setting it up to run headless, to generating tokens took about 20 minutes total for me.
Fizz43 9 hours ago [-]
which mac is smoking the spark?
pmarreck 8 hours ago [-]
pretty much any of them, dude, as long as you have enough RAM, since it uses unified RAM and a powerful SoC CPU/GPU. Literally any M-class model, but the M5 is currently top tier.
dannyw 7 hours ago [-]
The DGX Spark has basically the same memory bandwidth as a M5 Pro, and far more than a M5.
Only the M3 Ultra really beats it, and once you start scoping out the cost of a M3 Ultra with 128GB or 256GB, the DGX Spark doesn’t look bad after all.
mapontosevenths 8 hours ago [-]
Yep. Memory bandwidth is what decides how fast LLMs generate tokens (mostly). The DGX Spark has something like 270 GB/s of memory bandwidth, and the m5 ultra is ~615 GB/s. Theoretically DOUBLE the speed. In practice he only generates like 25% more tok/s, but that's still very impressive.
The spark can fine tune models in 1/4 the time and excels at other compute tasks in ways that Mac never can. Plus the high bandwidth ConnectX-7 ports would be like $1700 to buy on a card just for the network adapters... But for generating tokens, it just plain loses.
fsuts 5 hours ago [-]
How noisy does his fan get…
jauntywundrkind 9 hours ago [-]
200 Gb / s (not GB/s)!
(Still potentially very useful! But not magically ultra fast.)
Computer0 10 hours ago [-]
128 gb of much slower ram than Apple.
dannyw 7 hours ago [-]
DGX Spark is ~273GB/s. That’s about M5 Pro territory, and twice as fast as the M5. You’d have to go to the M5 Max, or M3 Ultra, to get higher memory bandwidth than the Spark.
Frannky 7 hours ago [-]
There is a push from multiple directions at the same time:
- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM
- Nvidia, amd, intel, Cerebras etc pushing new hardware
- oss models getting crazy good, like glm 5.2
- flash models getting very good like deepseek V4 flash
- quantizations
- harnesses being able to use different models (big for difficult stuff, small for grunt work)
So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!
pheggs 11 hours ago [-]
I feel like the gap is closing to be able to run good enough models locally even for coding and I would assume it could make some companies a bit nervous. Am I wrong about that?
UncleOxidant 11 hours ago [-]
If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is, very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in at about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less), and as that trend continues it's going to make the frontier labs worried. But even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. Alibaba Qwen seems to be there now: it's gotten mighty quiet from them lately, their latest 395B model is just too large for most folks to run at home, and they don't seem to be making any noises about releasing smaller ones this time around.
gpm 10 hours ago [-]
The ram/gpu shortage won't last forever though. Moreover we can be pretty confident that long-term the prices will obey Wright's law and come down significantly (from the pre-shortage prices) as we learn to produce them more efficiently.
LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.
UncleOxidant 10 hours ago [-]
> The ram/gpu shortage won't last forever though.
No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.
DougN7 7 hours ago [-]
Long enough for them to IPO and all the execs to retire. I doubt they care beyond the IPO.
mannanj 10 hours ago [-]
> The ram/gpu shortage won't last forever though
Don't underestimate the market's ability to remain irrational
colinsane 8 hours ago [-]
the companies which have the power to alleviate these shortages are the same companies who are profiting most from the shortage. scarcity is an asset, it's not irrational that a concentrated market will produce more of that asset.
selectodude 8 hours ago [-]
The solution for high prices is high prices.
If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.
elorant 10 hours ago [-]
Very few people, but quite a lot of companies especially after per token pricing took effect and companies see their invoices skyrocketing. You pay an upfront cost once and you’re done.
dannyw 7 hours ago [-]
When a large open weight model is released, a lab, startup, or a rich hobbyist can easily do logit-level distillation and create a XXb param model or whatever, and in theory you should get a really good distill.
verdverm 10 hours ago [-]
I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models
Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)
UncleOxidant 8 hours ago [-]
> Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)
True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.
Infernal 7 hours ago [-]
Do we know where those key players went?
verdverm 7 hours ago [-]
Good to know. I think the trend is clear based on the models coming out of China, and we'll see more capabilities in smaller, more efficient models.
cogman10 11 hours ago [-]
I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.
For companies wanting LLMs in their products in general, I have to think going the local llm route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products.
twelvechairs 11 hours ago [-]
Surely for most the desire is just an LLM provider that doesn't store or sell their queries (including to national actors). As long as that is allowed to happen, surely it's the answer for the vast majority.
eventualcomp 11 hours ago [-]
Where is $50k coming from again?
stingraycharles 11 hours ago [-]
That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, probably earns itself back in about a year.
Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
cogman10 11 hours ago [-]
The hardware requirements aren't evolving and the local models have only been improving.
It's not like you'd lose capabilities, if anything this solution just gets better with time.
chatmasta 10 hours ago [-]
If the newer models require more/better hardware then you’ll lose capabilities.
I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.
cogman10 9 hours ago [-]
The newer models don't require more/better hardware. There's a small army of local llm enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Them being old isn't really that big of an issue as the compute power needed is relatively low all things considered.
The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.
dannyw 7 hours ago [-]
Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth.
TPS = your memory bandwidth (GB/s) ÷ active weights (GB).
That’s it for decode. That’s all.
cogman10 11 hours ago [-]
As in who pays for it or how did I arrive at that number?
For who pays for it, obviously the employer would.
For "how did I arrive at this number": ballpark estimate from what I know about part cost. Most of that money will go towards AI cards: about $5k for the mb, cpu, power supply, etc., and $45k for as much ram and as big/expensive nVidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.
simplyluke 9 hours ago [-]
You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third party companies to host these models and they come in at fractions of the price of the frontier labs.
fny 11 hours ago [-]
The RAM requirements are still pretty painful.
yieldcrv 11 hours ago [-]
equilibrium in one or two more years on the consumer/prosumer side
think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM
a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again
denser open source models, packing more experts for smaller active layers
it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s
stingraycharles 11 hours ago [-]
Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.
3stacks 9 hours ago [-]
Maybe there's a conversation to be had about how much is enough... Unless something beyond my imagination happened, I would be happy enough with Opus 4.5 levels of productivity
stingraycharles 44 minutes ago [-]
This really sounds like “640kb should be enough”.
I’m sorry, but I just can’t imagine us running smaller models than we are using right now in 5-10 years from now.
yieldcrv 9 hours ago [-]
have you seen the open source LLM space? people fulfill all niches and there are active communities at every range of RAM and all are looking for the most capable in their respective range
a lot of innovation occurring
scosman 8 hours ago [-]
It's not economic to run them locally. It's amazing for privacy, and a fun hobby. But you're either looking at super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.
I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.
CamouflagedKiwi 11 hours ago [-]
The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.
notatoad 10 hours ago [-]
locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.
for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.
anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.
fc417fc802 5 hours ago [-]
> at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.
How so? Model capability at a fixed hardware level has been consistently (and rapidly) increasing. You might or might not be able to run state of the art 2 (or 4 or whatever) years from now but you can reasonably expect the hardware to last upwards of a decade with model performance consistently improving over that time frame.
You can get a tolerable (at least by some metrics) experience using 10 year old hardware today.
chatmasta 10 hours ago [-]
Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.
SXX 2 hours ago [-]
You forget that after 2 years you're still gonna have said Mac Studio that can be sold off for 30-50% of the price.
Of course it's gonna lose value faster if something magical happens with hardware manufacturing, but you'll likely get 25% back at least.
On the other side, you can't really predict how valuable claude max is gonna be in a year because Anthropic can further enshittify it.
c7b 4 hours ago [-]
You can get a 128GB Strix Halo for under $3k. Used to be under $2k. Even if you believe it'll be completely obsolete for AI in two years, it'll still be good for many other things. Games for at least several more years, a great home server and/or desktop almost indefinitely. Plus, we might actually reach good enough levels for some AI use cases, if we're not already there.
And never underestimate the potential for enshittification. Your local rig will only deliver better performance over time as more and more tweaks come out. With cloud services expect the opposite to happen as subsidies run out. It's entirely possible that they will intersect on a bang per buck basis within two years.
tomr75 10 hours ago [-]
people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already
fsuts 5 hours ago [-]
Why do you think they are rushing to IPO!!
stymaar 9 hours ago [-]
Honestly, Qwen3.6 is already what you need for the large majority of tasks.
(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge specific to be worth asking, but that way you can live easily with Claude's cheapest plan without ever facing usage limit).
cjbprime 52 minutes ago [-]
I've got access to a 192GB RAM Mac Studio, which is below the stated minimum RAM. Can swapping off fast disk be used to make it work out, especially since it's MoE?
walrus01 48 minutes ago [-]
Seems like a good way to shorten the lifespan of an NVME SSD significantly by using up its TB written lifespan, if you let it extensively swap. Also the performance will be absolutely abysmal like 0.1 tok/second.
Havoc 3 hours ago [-]
I bet OpenAI and Anthropic hate the timing of glm 5.2.
Kinda shows they have a headstart rather than a magic moat
zkmon 2 hours ago [-]
I have high respect for unsloth's work, helping millions to get started with local AI, but this post appears to be kind of download bait.
Offloading too many layers to CPU is not going to work at all. I have tried this many times and had to rm -rf on those heavy hf cache folders. Also I doubt 1-bit or 2-bit quants of GLM 5.2, running mostly outside of VRAM can beat Q8_0 of Qwen3.6-27B fully loaded in VRAM - on usefulness.
walrus01 2 hours ago [-]
I really don't think anyone is going to have a good time trying to run it on anything with 256GB of RAM no matter what the post says. 512 is the much more realistic minimum. I'm fortunate enough to have two 512GB RAM dual xeon workstations in my home office that I bought cheap before the price rise to mess around with things...
c7b 4 hours ago [-]
Can someone explain the math to me? Why is 1-bit only ten percent less memory than 2-bit?
idonotknowwhy 3 hours ago [-]
2 reasons.
First, it's not really "1 bit", actually much closer to 2-bit.
IQ1_M is actually 1.75bit and IQ2_XXS is 2.06bit
This is from the ./llama-quantize --help with most of the quant types and their size in bpw:
https://pastebin.com/bCUqGfeE
And to elaborate on the "dynamic" aspect incognito124 mentioned in the other comment, if you click on one of the .gguf files in huggingface:
There are a lot of Q5_K, Q6_K, etc tensors.
Only the routed experts (ffn_gate_exps.weight, ffn_up_exps.weight, ffn_down_exps.weight) are heavily quantized, and it looks like the down_proj is actually iq3_xxs for this model.
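A rough sketch of the arithmetic behind that answer, for anyone who wants to play with it. The 1.75 and 2.06 bpw figures are the ones quoted above; the split between routed experts and everything else in the second half is hypothetical, made up only to show how a few higher-precision tensors pull the average up.

```python
# Nominal bits per weight quoted above for the two quant types.
BPW = {"IQ1_M": 1.75, "IQ2_XXS": 2.06}


def size_ratio(small: str, large: str) -> float:
    """Size of `small` relative to `large` if every tensor used that one type."""
    return BPW[small] / BPW[large]


def average_bpw(mix: list[tuple[float, float]]) -> float:
    """Weighted average bits per weight for (share_of_parameters, bpw) pairs."""
    total_share = sum(share for share, _ in mix)
    return sum(share * bpw for share, bpw in mix) / total_share


# Even as a flat quant, "1-bit" would only be about 15% smaller than "2-bit".
print(f"flat size ratio: {size_ratio('IQ1_M', 'IQ2_XXS'):.2f}")

# HYPOTHETICAL split, for illustration only: 90% of parameters in routed
# experts at 1.75 bpw, 10% in other tensors kept at 6.5 bpw.
hypothetical_mix = [(0.90, 1.75), (0.10, 6.5)]
print(f"average bpw with the hypothetical mix: {average_bpw(hypothetical_mix):.3f}")
```

The real shares depend on the model and on which tensors a given quant keeps at higher precision, so treat the second number as a shape of the effect rather than a measurement.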
incognito124 3 hours ago [-]
Keyword dynamic, the parameters are quantized on a case by case basis
jzer0cool 1 hours ago [-]
1 bit requirement (1-bit 223 GB wowza). What do you all recommend with 24-48 GB of VRAM, or is this approach outdated now?
numlock86 5 hours ago [-]
Is this really worth it, though? Throughout the years my experience with quantized models has been that they feel like a lobotomized version of the original. Doesn't matter if it's an LLM, dedicated diffusion model or some other dedicated task. Sure, they get the job done. But a lot worse. The only ones that can somewhat hold up are the ones provided by the vendor directly. Gemma4 comes to mind. However I suspect they have some secret sauce other than just "let's quantize this" since they have the original model and its data at hand.
There should be more native 4bit, 1.25bit and likewise models. Those actually work great while making them smaller in comparison. But I guess there is some reason for them being pretty niche.
nicman23 5 hours ago [-]
it is not a flat quant but a dynamic
drudolph914 6 hours ago [-]
GLM 5.2 is the first time I'm actually excited about AI! I'm not the most bullish on AI code for several reasons, but the biggest reason is the ownership model. We all know we're near the tail end of the "subsidized pricing" window for AI, and I've been hoping for so long to get an open weight model that is _close enough_ to the SOTA before this window closes - and we actually got it! I'm excited to be able to in the near future run GLM locally, and use these things like a tool instead of living in this for-rent model for the rest of my life. I'm excited to actually enjoy programming again
edg5000 5 hours ago [-]
One advantage of local LLMs: You could serialize the context yourself, without being constrained by APIs. And let's not forget, the Big 2 encrypt their thinking. If you use custom clients, which is a very grey area already, being able to produce the context string raw is a big bonus. Takes away a lot of annoying constraints and needless mystique/obfuscation.
But I don't know how usable GLM 5.2 is vs the Big 2.
CGamesPlay 10 hours ago [-]
Can somebody help me understand the Quantization Analysis? It says "dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless" while showing a top-1% token agreement on the chart of 97.5%. Not what I would consider "generally lossless". Is this implying that some post-processing is going to account for the 2.5% loss? Beam search?
dannyw 6 hours ago [-]
Generally 97.5% token agreement is very positive. Like the article explains, the difference isn’t the model thinking the capital of France isn’t Paris, but rather maybe saying “The capital of France is Paris” instead of “Paris is the capital of France”.
maxignol 1 hours ago [-]
Lucky me, I never go out without my 256gb unified ram mac x)
andai 11 hours ago [-]
How is this model half the size of DeepSeek V4 Pro? Is it because DeepSeek did more aggressive cost cutting on the attention mechanism?
jonathanhefner 8 hours ago [-]
> Runing GLM-5.2 on local hardware
Do the runes make it smarter or just run faster (or both)?
nicman23 5 hours ago [-]
depends on the color
suyash 3 hours ago [-]
We really need a quantized version for regular laptop
snootypoot 9 hours ago [-]
if sam altman didnt exist i could afford to run this
numlock86 5 hours ago [-]
if sam altman didn't exist this model would most likely not exist as well
ramgine 10 hours ago [-]
I have up to 1tb of ddr4 in my server but it only has a 12gb vram 3060. Would getting a 24gb vram make this a viable system or am I throwing money away?
segmondy 8 hours ago [-]
You can run it today with that 12gb vram 3060, but I would suggest getting 2 3090s. Use cmoe option. This will keep the attention/route tensors on the GPU and offload the rest to system memory. Try it now and see the performance.
rnewme 8 hours ago [-]
Should work yes.
Wowfunhappy 10 hours ago [-]
> The full model requires 1.51TB of disk space
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
gcr 10 hours ago [-]
There are two forms of compression relevant to LLMs:
1. Reduce the number of parameters
2. Reduce the resolution of each parameter (quantization)
For 1, changing the architecture is typically only possible by the labs producing the models, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).
Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”
Reducing the resolution of each parameter is easier which is why lots of practitioners have their own quantizations, but this makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.
Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.
Parameter counts = world knowledge, quantization = “smarts.”
This is a soft rule of thumb, the difference isn’t very strong.
walrus01 2 hours ago [-]
> ...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
TBH this is like the near last ranking consideration in cost for being able to download and run this. Even though HDD and SSD prices have gone nuts as a result of the recent demand/shortage, it's not like 1.5TB of space costs a lot.
Even if you fed it into xz with the most cpu intensive compression options and it didn't compress at all (eg: like trying to xz an AV1 video, or whatever), it's still the cost of a single fast food hamburger meal in $/TB. The real concern is the RAM to run it.
But anyways, anecdotally, many 16-bit full precision GGUF files will compress to about 65% of original size with default xz options. I have a log here showing that's what IBM Granite 4.1 30b compressed to, which I'm keeping around but in lukewarm storage.
throwdbaaway 6 hours ago [-]
On ZFS with zstd compression, I am getting 1.34x compressratio for the BF16 weights (across multiple models).
Here's the du output for GLM-5.2:
$ du -s -BG /cube/models/zai-org/GLM-5.2/
1099G /cube/models/zai-org/GLM-5.2/
Probably not at all, considering weights are randomly initialized.
dofm 9 hours ago [-]
Can't run this myself.
But I do like Unsloth Studio, quite a lot. It's nicely designed.
nullc 11 hours ago [-]
Just running cpu only w/ Q6 on 9684X I get about 1tok/s ... also still get about 1tok/s/stream when running 16 in parallel.
hxii 10 hours ago [-]
Any time I see one of these posts about models of this size a quote comes to mind – "Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should".
Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.
segmondy 8 hours ago [-]
Completely worth it. At 6tk a second. If I can get 2 hrs of token generation. That's 2hrs * 3600secs * 6tk = 43200 tokens, at about 10tk to a line of code, that's about 4320 lines. Let's even trim it more and slice it by half. That's 2160 lines of code a day. Most professional programmers can't deliver that much consistently in a day.
The key to a model this large is (1) use it to plan, generate lots of plans and farm them out to a smaller model, then (2) for very specific and complicated portions, precisely prompt for what you need.
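The back-of-envelope numbers above can be reproduced with a few lines. This is only a sketch of that estimate: the 10 tokens per line and the "cut it in half" safety factor are the comment's own assumptions, and the function name is made up for illustration.

```python
def lines_per_session(tok_per_s: float, hours: float, tokens_per_line: float) -> float:
    """Lines of code generated in one session at a steady decode speed."""
    seconds = hours * 3600
    return tok_per_s * seconds / tokens_per_line


tokens = 6 * 2 * 3600                 # 6 tok/s for 2 hours
lines = lines_per_session(6, 2, 10)   # assuming ~10 tokens per line of code
print(tokens, lines, lines / 2)       # raw tokens, lines, lines after halving
```

Changing `tok_per_s` to the DDR4-3200 or 64-core estimates mentioned elsewhere in the thread scales the result linearly.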
uberex 6 hours ago [-]
That's not complete reasoning. Even frontier models need to revisit and fix things. Add 10 loops to that and it is 20 hours. Still great compared to a 2023 human, but why am I not just paying pocket money for Claude Pro instead?
segmondy 5 hours ago [-]
You're talking about agentic workflow. Agentic is cruise control. Race car drivers shift manually for more precision and to go faster. If the only way you know how to code with AI is agentic, then you are putting yourself on a crutch.
uberex 5 hours ago [-]
You are saying you can one shot without loops on something like GLM-5.2?
zuzululu 11 hours ago [-]
wonder if AMD's new ai chip can run this with ease? I'm seriously considering buying it. GLM 5.2 is just shy of GPT 5.4 so I would welcome offloading any grunt work locally
I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
UncleOxidant 11 hours ago [-]
Are you talking about Medusa Halo? It's going to support up to 256GB unified memory (up from 128GB for Strix Halo and 192GB for Gorgon Halo). That might just be barely enough to run a 2-bit quant GLM-5.2. It will expand memory bus to 384-bits, vs. 256-bits for Strix Halo which will help with bandwidth (projected to be around 500 GB/sec). But don't expect Medusa Halo-based machines to appear until sometime in 2028.
The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
monksy 7 hours ago [-]
Strix Halo only supports 96gb of video memory then it goes to 32gb to the host system.
zuzululu 9 hours ago [-]
yeah you are correct 2 bit quant won't be enough
guess we'll be paying $200/month for a while
nl 11 hours ago [-]
> I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
We are maybe 10 years off that.
RAM prices are going to continue to increase for the next 2 years at least.
Even putting that aside it's currently around 40-70,000 EUR to run this with an FP8 quantization (which you need to get close to maximum performance).
To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
hsuduebc2 11 hours ago [-]
I wonder if, in the near future, AI companies will acquire some RAM producers just to keep RAM prices up. It could seriously hurt their business if companies are eventually able to host their own AI.
nl 9 hours ago [-]
I think AI companies have enough things to spend capital on already.
zuzululu 9 hours ago [-]
[dead]
Iolaum 11 hours ago [-]
At full quantization GLM 5.2 may be close to GPT 5.4. But at Q2 or whatever one needs in order to run it on a prosumer device it will be worse.
Also I'm not sure where you are getting the under 2k value. I bought a Framework desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
nh43215rgb 11 hours ago [-]
Even with upcoming AI Max+ PRO 495 we are capped with 192GB, so no...
benjiro29 11 hours ago [-]
"GLM 5.2 is just shy of GPT 5.4"... If you're running the full model. As in, you have 750GB (FP8) to 1.5TB (FP16) of memory available.
Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
* FP4 will mean an accuracy loss of about 3%. Not noticeable but more chance for mistakes.
* FP2 ... which is what most people are able to run at home, for a "reasonable" price. You're looking at over 17% loss in accuracy.
At that point, you're running at less than claude-sonnet-4.6, as the issues compound with accuracy losses. And reasonably priced is still in the ~ $5000 range (192GB + GPU 32GB active/kv cache system).
For that price you're using a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone with a FP2 GLM 5.2 version. And you're looking at < 10 tok/s. A Mac Studio with 512GB will net you 18 to 20+ tok/s with FP4, but ... I mean, those used to be $10,000.
Unfortunately the local hardware cost is a major issue for running large models like that.
Edit: It's funny, whenever the issue of cost and what you need to give up vs the subscription services comes up, there are always people who downvote in bad faith.
kgeist 8 hours ago [-]
The cost of local hardware is amortized if a whole team uses it instead of just 1 dev (GPUs are extremely underutilized if you launch just 1 generation stream). I'm not sure why everyone always assumes solo devs with Macs. We've just ordered a large datacenter-grade node for use by the whole dev team, and the calculations show that it's going to cost the same amount of money as if we kept using AWS Bedrock (infosec reasons) for a couple years but... it gives us 100% privacy, we're immune to all the AI regulation dramas in the US/EU, all the random outages, and the developers won't have to think about token limits/weekly caps etc. ever again. And all that with a model which is Opus-grade
(it's not our first AI server, we already have experience deploying LLMs for our clients, so the numbers look solid)
froh 5 hours ago [-]
> GPUs are extremely underutilized if you launch just 1 generation stream
why is that? b/c the thing is waiting for the hoooman and idling? or some parallelizable interleaving steps?
I have no intuition yet how this works under the hood.
cataflutter 2 hours ago [-]
Some of the inference engines can process multiple requests in parallel more efficiently than doing them sequentially. Not sure of the exact mechanism but e.g. llama.cpp's llama-server can do this (you tell it the number of slots to have when starting, then fire HTTP requests at it and it batches them together when it can).
Waiting for the hooman (or tool calls) won't help either, of course.
zuzululu 9 hours ago [-]
you are right that means GLM is still quite far off from truly competitive
i think your answer was perfect not sure why you are being downvoted
kccqzy 11 hours ago [-]
The AMD 395 supports up to 128GB unified RAM. So still not enough even at 1-bit quant unfortunately.
[0] https://medium.com/@rathko/i-built-an-epyc-64-core-512gb-ram...
Unless you really want privacy or the fuzzy feeling of owning your own, it’s cheaper, more convenient and has much faster tok/s if you pay a hyper scaler.
That said, I do like the direction we are heading and look forward to seeing what host your own hardware we get in 2 years.
At a theoretical 6 tok/s, 86400 seconds in a day, approx 500,000 tokens of GLM5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course not counting the one time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
Additionally in a place where people use electric baseboard heating or electric in floor radiant heating, or really any other heating element based system in winter that's less efficient than a heat pump, additional electrical from a computing load is basically "free" since you would be spending that same money otherwise to heat your house. If a computer with 512GB of RAM is dumping the waste heat into your room, it accomplishes a portion of the same thing as a baseboard.
Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.
On Openrouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k token/day, so we're in the same ball park, but without the capex for 2 3090's and the machine.
Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.
Running a system that will be 600W under max CPU usage on all cores and RAM and a few 3090-class GPUs, that same system might be only 90W or around there when idle at 0.00 unix load.
If we say: (600 * 24 * 31)/1000 = 446kWh in a month at full load 24 hours a day
But it could be less. If it's at full load only 12 hours a day, that's (600 * 12 * 31)/1000 = 223.2 kWh of "full load" 600W time in a month, plus (90 * 12 * 31)/1000 = 33.48 kWh of idle time.
If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.
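For anyone who wants to plug in their own numbers, here is a minimal sketch of that duty-cycle estimate. The 600 W and 90 W figures are the ones used above; the $0.20/kWh price is the figure quoted earlier in the thread and is an assumption, as are the function names.

```python
def monthly_kwh(load_w: float, idle_w: float, load_hours_per_day: float, days: int = 31) -> float:
    """Energy used in a month when the box is at full load for part of each day
    and idles for the remaining hours."""
    idle_hours_per_day = 24 - load_hours_per_day
    watt_hours = (load_w * load_hours_per_day + idle_w * idle_hours_per_day) * days
    return watt_hours / 1000


def monthly_cost(kwh: float, price_per_kwh: float) -> float:
    """Electricity cost for the given energy at a flat price per kWh."""
    return kwh * price_per_kwh


PRICE = 0.20  # assumed $/kWh; substitute your own tariff

always_on = monthly_kwh(600, 90, 24)   # full load around the clock
half_days = monthly_kwh(600, 90, 12)   # full load 12 h/day, idle otherwise

print(f"24 h/day: {always_on:.1f} kWh, ${monthly_cost(always_on, PRICE):.2f}")
print(f"12 h/day: {half_days:.1f} kWh, ${monthly_cost(half_days, PRICE):.2f}")
```

Bursty use during an 8 hour workday can be approximated by lowering `load_hours_per_day` further.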
Or a cloud LLM might just refuse to sell to you because it doesn't like your passport.
How about addressing this false dichotomy with the likelihood that someone who is new or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise.
Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.
Or maybe the model itself only runs on GPUs, and the CPU memory only stores the weights for experts not currently activated? If so, then what are the 32 or 64 CPU cores for?
I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software
https://github.com/noonghunna/club-3090/blob/master/docs/DUA...
Can you put up with that? It seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger, slower ones.
https://unsloth.ai/docs/models/glm-5.2#usage-guide
In a prior thread, someone said it would take $500k in hardware:
https://news.ycombinator.com/item?id=48629970
NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at a $80/90k figure with today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, power supply. 576gb of VRAM.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
Official price 85k...
In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.
In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.
I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.
I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
Most of the money and energy went to mobile for the last fifteen years.
Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more iops to read/write it, more storage to keep it, and more memory and cpu to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.
Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.
Here we are two and a half decades into the Internet era, and my damn bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works, and doesn't get continuously slower, forcing me to buy bigger hardware; or more draconian, locking me out of being able to use it how I want.
No, we're running into the limits of Moore's law, and it's showing in prices for new nodes, where they're getting denser but not cheaper.
So we hit limits on clock speed in the early 2000s (ex - the 4ghz wall) but it also turned out that mobile as the driver for sales meant no one really cared much about clock speed compared to performance/watt.
Clock speed mattered, but only relative to how many watts it took to get it (and above 4ghz... too many watts).
But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.
My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.
I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.
In the same way that I didn't care much about how many watts my pentium 4 was using in 2005.
But now... now I care a lot about memory bandwidth. I care about memory speeds and total system ram in a manner I really, really didn't before.
So I think we're going to see a big shift to machines built on unified ram with a crazy focus on squeezing memory bandwidth and total ram capacity as far as we can.
My bet is that we'll get a similar 10-15x improvement by 2040 in unified system ram designs.
I fully expect to see 2TB unified RAM desktops and 200GB unified RAM phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (e.g., no World War 3 throwing a wrench into things).
In the old days, Microsoft Entertainment Pack games were somewhat visibly taxing on some lower end systems.
Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.
Once you go above "a used 3090 at a decentish price", then I strongly recommend renting cloud GPUs or at least testing models using paid APIs. This allows testing your use case before spending piles of money.
Each token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB. At 100GB/s (replace with whatever your bandwidth is), you get about 5 tokens per second.
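To make that arithmetic reusable, here is a minimal sketch of the same estimate. The function name and parameters are made up for illustration, and it assumes decode is purely memory-bandwidth bound (it ignores KV cache reads, prompt processing, and compute limits), so treat the result as a rough upper bound rather than a benchmark.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck.

    Every generated token has to stream all active weights from memory once,
    so tokens/s is about bandwidth divided by the size of the active weights.
    """
    # 1e9 params * (bits / 8) bytes per param = size in GB
    active_weights_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_weights_gb


# Numbers from the comment above: ~40B active params, 4-bit quant, 100 GB/s.
print(decode_tokens_per_second(40, 4, 100))  # 5.0
```

Swap in your own bandwidth figure (system RAM, unified memory, or VRAM) to see why the same model feels so different across machines.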
On top of that, you will still be heavily quantized.
You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.
Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.
If you are training and doing research it's great, and if you want to cluster them it can't be beat, but if you just want local inference on a single box, buy a Mac or even a Strix Halo device.
I'm a Linux guy, but I also don't always have a lot of time. The Spark comes out of the box with a nice Linux distro that's pre-configured to be easy to set up, and the guides and online resources make getting up and running trivial, even for some complex tasks. You would have to do a LOT of tinkering just to figure out some of the things the Nvidia resources walk you through natively. They have guides for a ton of stuff that include the optimal settings so you don't have to figure it all out through trial and error.
Check out these "playbooks" for some examples. [0] There's a lot to be said for not having to piece all that together yourself.
https://build.nvidia.com/spark
I think the time between unboxing mine, setting it up to run headless, and generating tokens was like 20 minutes total for me.
Only the M3 Ultra really beats it, and once you start scoping out the cost of a M3 Ultra with 128GB or 256GB, the DGX Spark doesn’t look bad after all.
The spark can fine tune models in 1/4 the time and excels at other compute tasks in ways that Mac never can. Plus the high bandwidth ConnectX-7 ports would be like $1700 to buy on a card just for the network adapters... But for generating tokens, it just plain loses.
(Still potentially very useful! But not magically ultra fast.)
- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM
- Nvidia, amd, intel, Cerebras etc pushing new hardware
- oss models getting crazy good, like glm 5.2
- flash models getting very good like deepseek V4 flash
- quantizations
- harnesses being able to use different models (big for difficult stuff, small for grunt work)
So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!
LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.
No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.
Don't underestimate the market's ability to remain irrational.
If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.
Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)
True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.
For companies wanting LLMs in their products in general, I have to think going the local LLM route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products for.
Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
It's not like you'd lose capabilities, if anything this solution just gets better with time.
I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.
The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.
TPS = your memory bandwidth in GB/s / active weights in GB.
That’s it for decode. That’s all.
For who pays for it, obviously the employer would.
For "how did I arrive at this number": it's a ballpark estimate from what I know about part costs. Most of that money will go towards AI cards: about $5k for the motherboard, CPU, power supply, etc., and $45k for as much RAM and as big/expensive Nvidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.
think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM
a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again
denser open source models, packing more experts for smaller active layers
it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s
I’m sorry, but I just can’t imagine us running smaller models than we are using right now in 5-10 years from now.
a lot of innovation occurring
I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.
for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.
anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.
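For anyone who wants to redo the hardware-vs-subscription math with their own numbers, here is a minimal sketch. The function and its parameters are hypothetical; the $4k and $200/month figures come from the comment above, the 600 W and $0.20/kWh figures are the ones quoted earlier in the thread, and running at full load 24 hours a day is an assumption that will overstate electricity for most people. It ignores resale value and any difference in model quality.

```python
def breakeven_months(hardware_cost, subscription_per_month,
                     watts=0.0, price_per_kwh=0.0, hours_per_day=24.0):
    """Months until the hardware costs less than the subscription it replaces.

    Returns None when electricity alone costs as much as the subscription,
    i.e. the hardware never pays for itself on these terms.
    """
    # kW * hours/day * ~30 days * price per kWh
    electricity_per_month = watts / 1000 * hours_per_day * 30 * price_per_kwh
    monthly_saving = subscription_per_month - electricity_per_month
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Ignoring electricity: $4k of hardware vs. a $200/month plan.
print(breakeven_months(4000, 200))  # 20.0

# Same comparison with a 600 W rig at full load around the clock, $0.20/kWh.
print(round(breakeven_months(4000, 200, watts=600, price_per_kwh=0.20), 1))
```

With electricity included, the break-even point moves out noticeably, which is why idle power draw matters for a rig that is left on all day.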
How so? Model capability at a fixed hardware level has been consistently (and rapidly) increasing. You might or might not be able to run state of the art 2 (or 4 or whatever) years from now but you can reasonably expect the hardware to last upwards of a decade with model performance consistently improving over that time frame.
You can get a tolerable (at least by some metrics) experience using 10 year old hardware today.
Of course it's gonna lose value faster if something magical happens with hardware manufacturing, but you'll likely get 25% back at least.
On the other side, you can't really predict how valuable Claude Max is gonna be in a year because Anthropic can further enshittify it.
And never underestimate the potential for enshittification. Your local rig will only deliver better performance over time as more and more tweaks come out. With cloud services expect the opposite to happen as subsidies run out. It's entirely possible that they will intersect on a bang per buck basis within two years.
(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge specific to be worth asking Qwen, but that way you can live easily with Claude's cheapest plan without ever hitting the usage limit.)
Kinda shows they have a headstart rather than a magic moat
Offloading too many layers to CPU is not going to work at all. I have tried this many times and had to rm -rf on those heavy hf cache folders. Also I doubt 1-bit or 2-bit quants of GLM 5.2, running mostly outside of VRAM can beat Q8_0 of Qwen3.6-27B fully loaded in VRAM - on usefulness.
First, it's not really "1 bit", actually much closer to 2-bit. IQ1_M is actually 1.75 bpw and IQ2_XXS is 2.06 bpw. This is from ./llama-quantize --help, which lists most of the quant types and their size in bpw: https://pastebin.com/bCUqGfeE
And to elaborate on the "dynamic" aspect inconito said in the other comment, if you click on one of the .gguf files in huggingface:
https://huggingface.co/unsloth/GLM-5.2-GGUF/blob/main/UD-IQ1...
There are a lot of Q5_K, Q6_K, etc tensors. Only the routed experts (ffn_gate_exps.weight, ffn_up_exps.weight, ffn_down_exps.weight) are heavily quantized, and it looks like the down_proj is actually iq3_xxs for this model.
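To see how a mix of tensor precisions turns into an overall size, here is a rough sketch that computes the average bits per weight and the resulting file size. Everything numeric in the example is an assumption for illustration: the 750e9 parameter count, the 95%/5% split between routed experts and everything else, and the 6.0 bpw figure for the higher-precision tensors are placeholders, not values read from the actual GGUF. Only the 1.75 bpw figure for IQ1_M comes from the comment above.

```python
def mixed_quant_size(total_params, groups):
    """Average bits per weight and size in GB for a mixed-precision quant.

    groups is a list of (fraction_of_params, bits_per_weight) pairs whose
    fractions must sum to 1.
    """
    total_fraction = sum(fraction for fraction, _ in groups)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    avg_bpw = sum(fraction * bpw for fraction, bpw in groups)
    size_gb = total_params * avg_bpw / 8 / 1e9  # bits -> bytes -> GB
    return avg_bpw, size_gb


# Hypothetical split: most parameters are routed experts at IQ1_M (1.75 bpw),
# the rest (attention, shared layers, embeddings) kept at roughly 6 bpw.
avg_bpw, size_gb = mixed_quant_size(750e9, [(0.95, 1.75), (0.05, 6.0)])
print(f"{avg_bpw:.2f} bpw, about {size_gb:.0f} GB")
```

Because the routed experts dominate the parameter count, the higher-precision tensors only nudge the average bpw up a little, which is how a "1-bit" dynamic quant ends up close to 2 bpw overall.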
There should be more native 4bit, 1.25bit and likewise models. Those actually work great while making them smaller in comparison. But I guess there is some reason for them being pretty niche.
But I don't know how usable GLM 5.2 is vs the Big 2.
Do the runes make it smarter or just run faster (or both)?
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
1. Reduce the number of parameters
2. Reduce the resolution of each parameter (quantization)
For 1, changing the architecture is typically only possible by the labs producing the models, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).
Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”
Reducing the resolution of each parameter is easier which is why lots of practitioners have their own quantizations, but this makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.
Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.
Parameter counts = world knowledge, quantization = “smarts.”
This is a soft rule of thumb, the difference isn’t very strong.
TBH this is like the near last ranking consideration in cost for being able to download and run this. Even though HDD and SSD prices have gone nuts as a result of the recent demand/shortage, it's not like 1.5TB of space costs a lot.
Even if you fed it into xz with the most CPU-intensive compression options and it didn't compress at all (e.g., like trying to compress an AV1 video), it's still the cost of a single fast food hamburger meal in $/TB. The real concern is the RAM to run it.
But anyways, anecdotally, many 16-bit full precision GGUF files will compress to about 65% of original size with default xz options. I have a log here showing that's what IBM Granite 4.1 30b compressed to, which I'm keeping around but in lukewarm storage.
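If you want to check how compressible your own weights are before committing hours of CPU time, a quick sketch like the one below compresses only a sample of the file with Python's standard lzma module (the same algorithm family as xz). The helper names and the 64 MiB sample size are arbitrary choices, and a sample from the start of a file may not be representative of the whole thing, so treat the ratio as a hint only.

```python
import lzma
import os


def xz_ratio(data: bytes, preset: int = 6) -> float:
    """Compressed size divided by original size (lower means it compresses better)."""
    return len(lzma.compress(data, preset=preset)) / len(data)


def sample_file_ratio(path: str, sample_bytes: int = 64 * 1024 * 1024) -> float:
    """Estimate the ratio from the first sample_bytes of a file."""
    with open(path, "rb") as f:
        return xz_ratio(f.read(sample_bytes))


# Sanity check on synthetic data: zeros compress to almost nothing,
# random bytes do not compress at all.
print(xz_ratio(bytes(1 << 20)))
print(xz_ratio(os.urandom(1 << 16)))
```

A ratio near 1.0 on your sample means lossless compression will not buy you much, which is what you would expect from already-quantized files.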
Here's the du output for GLM-5.2:
But I do like Unsloth Studio, quite a lot. It's nicely designed.
Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.
The key to a model this large is to (1) use it to plan: generate lots of plans and farm them out to a smaller model, then (2) for very specific and complicated portions, prompt precisely for what you need.
I am very excited for local LLMs. I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR.
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
guess we'll be paying $200/month for a while
We are maybe 10 years off that.
RAM prices are going to continue to increase for the next 2 years at least.
Even putting that aside, it's currently around 40,000-70,000 EUR to run this with an FP8 quantization (which you need to get close to maximum performance).
To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
Also, I'm not sure where you are getting the under 2k value. I bought a Framework desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
* FP4 will mean an accuracy loss of about 3%. Not noticeable, but more chance for mistakes.
* FP2 ... which is what most people are able to run at home for a "reasonable" price. You're looking at over 17% loss in accuracy.
At that point, you're running at less than claude-sonnet-4.6, as the issues compound with accuracy losses. And reasonably priced is still in the ~$5000 range (192GB + 32GB GPU for active/KV cache system).
For that price you're using a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone compared with an FP2 GLM 5.2 version. And you're looking at < 10 tok/s. A Mac Studio with 512GB will net you 18 to 20+ tok/s with FP4, but ... I mean, those used to be $10,000.
Unfortunately the local hardware cost is a major issue for running large models like that.
Edit: It's funny, whenever the issue of cost and what you need to give up vs. the subscription services comes up, there are always people who downvote in bad faith.
(it's not our first AI server, we already have experience deploying LLMs for our clients, so the numbers look solid)
why is that? b/c the thing is waiting for the hoooman and idling? or some parallelizable interleaving steps?
I have no intuition yet how this works under the hood.
Waiting for the hooman (or tool calls) won't help either, of course.
I think your answer was perfect, not sure why you are being downvoted.