Our Guest Sebastian Raschka Discusses
LLMs in 2026: What’s Real, What’s Hype, and What’s Coming Next
Is AI actually going to replace developers? Or is the hype getting ahead of reality?
On this episode of Digital Disruption, we’re joined by Sebastian Raschka, AI research engineer and author.
Sebastian Raschka has over a decade of experience in artificial intelligence and machine learning. His work bridges academia and industry, serving as a Senior Engineer at Lightning AI and as a faculty member at the University of Wisconsin–Madison. He is the author of Build a Large Language Model (from Scratch) and is widely recognized for his practical, code-driven approach to AI education and research. His expertise lies in large language model (LLM) research, transformer architectures, reinforcement learning, and the development of high-performance AI systems, with a strong focus on real-world implementation.
Sebastian Raschka joins Geoff Nielson to unpack the real state of LLMs in 2026. As an LLM research engineer, Sebastian bridges deep technical expertise with practical, real-world AI implementation. In this conversation, he cuts through AI hype to focus on what’s actually achievable with modern LLMs, reasoning models, reinforcement learning, and inference scaling and where the limitations still exist. Sebastian explains why most companies should not build a large language model from scratch but also why understanding the fundamentals may be one of the most important investments technology leaders can make.
This conversation breaks down:
- Why coding is currently the strongest LLM use case.
- Why “reasoning” models still fail simple tasks like counting letters in “strawberry.”
- The reality behind Math Olympiad gold-level AI claims.
- The true cost of training large models (millions in GPU compute).
- The privacy risks of uploading proprietary data into APIs.
- How enterprises should think about fine-tuning vs. API-based prompting.
- Why benchmarks and leaderboards can be misleading.
00;00;00;27 - 00;00;21;28
Geoff Nielson
Hey everyone! I'm super excited to be sitting down with Sebastian Raschka. He's the author of the Build a Large Language Model (From Scratch) and Build a Large Reasoning Model (From Scratch) books and video series. As an LLM research engineer, he bridges academic teaching at the University of Wisconsin-Madison with the hyper-practical, working for the AI development platform Lightning AI.
00;00;22;00 - 00;00;41;24
Geoff Nielson
But what I love about Sebastian is that he has zero appetite for AI hype, and can dive right into what you can actually do with these tools. I want to ask him what impact AI is having on coders. Is it really going to make them obsolete? What real capabilities can we expect AI to develop from here and who should be actually building an LLM from scratch?
00;00;41;26 - 00;00;45;28
Geoff Nielson
Let's find out.
00;00;46;01 - 00;01;09;17
Geoff Nielson
Sebastian, thanks so much for joining today. For those who don't know, Sebastian is the author of the Build a Large Language Model (From Scratch) book as well as a video series on YouTube. Sebastian really gets deep into the technical aspects of LLMs and how we can actually, you know, create these and build our own, and get beyond the hype we sometimes hear about AI. But just before we get into, you know, who should be doing that, what that looks like,
00;01;09;18 - 00;01;23;09
Geoff Nielson
I wanted to zoom out a little bit. And, Sebastian, maybe you can tell me a little bit about, you know, in your view, the state of LLMs in 2026. How are the capabilities advancing? What do you see on the horizon for this technology?
00;01;23;11 - 00;01;53;27
Sebastian Raschka
Yeah. First, thanks for inviting me on the podcast to talk about LLMs. It's one of my favorite topics, so I think we will have a lot of fun in this episode. But yeah, you began with a very broad question here, the state of LLMs in 2026. I would say 2025 was particularly interesting because at the beginning of 2025 there was DeepSeek, and then this new paradigm, and we can maybe get into this in more detail later.
00;01;53;27 - 00;02;17;06
Sebastian Raschka
But the reinforcement learning with verifiable rewards, which is a technique to develop the reasoning capabilities in LLMs. Reasoning is also in quotation marks here; it's a broad topic. The reasoning in LLMs, I would say we shouldn't take it too literally, like how humans reason, but it is a set of techniques that make an LLM better at solving complex tasks.
00;02;17;09 - 00;02;45;00
Sebastian Raschka
And 2025 was pretty much dominated by this idea of developing these so-called reasoning, sometimes called thinking, models. So everyone, from OpenAI to Google, Claude, Grok, or the open-weight LLMs, they all now have different variants of their LLMs: the regular instruct variant, and then the thinking variant. We can maybe also talk a bit later about the trade-offs here.
00;02;45;02 - 00;03;06;04
Sebastian Raschka
But so I think this was 2025, and we will see this continue in 2026, because these techniques are still relatively new, and people are currently, I would say, in the first iteration of these techniques, where, okay, this works, and now let's hone in on that and make it even better.
00;03;06;06 - 00;03;28;08
Sebastian Raschka
Add some tips and tricks to that and really exploit that type of mechanism. So we will see more of that. At the same time, I would say a lot of progress came from the inference scaling side. So there are basically two paradigms: one is the training, and then there is the usage, the inference.
00;03;28;11 - 00;03;54;01
Sebastian Raschka
And there's also a trade-off. You can spend a lot of money in training, which is very expensive, and then you get that one model, and it gets maybe used for a few months, and then it gets replaced by the next model, or you may also update it. But yeah, training is very expensive, and longer training, per the scaling laws, training on more data, gives you better models, essentially better performance.
00;03;54;03 - 00;04;15;01
Sebastian Raschka
It is expensive, though. So you can also spend money on extra compute, and extra compute usually costs extra money because you need more resources during inference, that means after the training, when you use the model. So you can say, okay, let's say I have a user, and the user has a query.
00;04;15;01 - 00;04;34;11
Sebastian Raschka
Instead of just giving the first answer, you can have maybe three answers. If it's a math question, you have the LLM, with different settings, running three times, and then you take the highest-scoring one or the majority vote. But that will be three times as expensive. So that's one version of inference scaling.
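The best-of-three, majority-vote idea Sebastian sketches here can be written out in a few lines. The `sample_model` stand-in below is hypothetical; a real setup would call an LLM API several times with different sampling settings:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common answer among several sampled generations."""
    return Counter(answers).most_common(1)[0][0]

def sample_model(query, n=3):
    """Hypothetical stand-in for sampling an LLM n times; a real
    implementation would call a model API with, e.g., temperature > 0."""
    return ["42", "41", "42"][:n]

answers = sample_model("What is 6 * 7?")
print(majority_vote(answers))  # → 42
```

Running the model three times costs roughly three times as much, which is exactly the training-versus-inference trade-off described above.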
00;04;34;11 - 00;05;08;25
Sebastian Raschka
There are other techniques, for example generating longer outputs, and that sometimes also helps to think through a problem, again in quotation marks, coming back to the reasoning. So, to answer your question before we go into the technical details: 2026, I would say, is still on that trajectory where we can make a lot of progress in training, maybe not so much the pre-training, because that's not where the low-hanging fruit is anymore, but more the reinforcement learning for the reasoning capabilities.
00;05;08;28 - 00;05;31;10
Sebastian Raschka
And at the same time, more clever inference scaling techniques, more of that. So these two things, and I think this is going to continue. I don't see anything surprising on the horizon right now. But it is always like that in AI: if someone has a new architecture that is much better than the current set of tools, that's like a trillion-dollar idea.
00;05;31;10 - 00;05;53;24
Sebastian Raschka
So if someone had something like that, it wouldn't be something they had shared already. It's always going to be a surprise if something like that lands. But I don't see any indicators, anything that gives me the confidence that there will be something really, really different.
00;05;53;24 - 00;05;57;05
Sebastian Raschka
It will be more like honing in on these things basically. Yeah.
00;05;57;08 - 00;06;17;13
Geoff Nielson
I think that's really interesting, and so much of your outlook, Sebastian, comes back to this notion of reasoning or inference, and, you know, using the word thinking in quotation marks. But the reason this is so interesting to me is it sounds like there's not going to be, in your view, a huge, you know, step up in 2026.
00;06;17;13 - 00;06;45;11
Geoff Nielson
From what we see right now, we're going to continue to see performance improvements. But the reason I find the reasoning piece so interesting is so much of what you hear in the media, and so much of the noise coming out of, you know, Silicon Valley and from CEOs in any industry, is these sort of grandiose statements about what they will use AI for, or what AI can do, and how transformative it is.
00;06;45;14 - 00;07;09;14
Geoff Nielson
You know, I'm curious, in your mind, if you can share a little bit more about what are some of the better use cases for this, and maybe debunk some of the things that this technology is still not going to be able to do at the end of 2026. And I think specifically, when you mention reasoning, you know, I call it the strawberry problem.
00;07;09;16 - 00;07;25;25
Geoff Nielson
And you may have heard of this example, but, you know, the issue right now is that if you ask a lot of the leading LLMs how many R's there are in strawberry, they can't accurately answer that question. They say, oh, there are two R's, you know, one in straw and one in berry, which is obviously not right.
00;07;26;02 - 00;07;37;01
Geoff Nielson
And it sort of lays bare some of the reasoning limitations here. So, what will this be good for at the end of the year? What will it still not be good for at the end of the year?
00;07;37;03 - 00;08;01;27
Sebastian Raschka
Yeah. Well, it's also a broad question, not a super broad question, but I wanted to preface this with another point, to add on to my previous answer. With the reasoning capabilities, it's also a spectrum. What I'm trying to say is that the same LLM with, let's say, different inference settings can have different capabilities.
00;08;01;27 - 00;08;23;14
Sebastian Raschka
So basically, coming back to the strawberry problem, you can have an LLM and use it in the high-power reasoning mode, but it might make mistakes on simple tasks like that. It's almost, it's called overthinking. And again, I don't want to say they think like humans, but it is also something humans suffer from.
00;08;23;16 - 00;08;39;04
Sebastian Raschka
Like, you know, I know a lot of things; I'm a researcher, I can do a lot of things. But don't ask me at 11 p.m. a simple math question, like 21 times, I don't know, 11 or something like that, and I will give you a wrong number, maybe because I'm tired.
00;08;39;04 - 00;09;00;06
Sebastian Raschka
My brain doesn't work anymore, or in other ways I sometimes make really dumb, stupid mistakes, but it doesn't mean I can't do these other things. And I think for LLMs that's also true, where with counting the R's in strawberry, it's almost like you're not evaluating it on the real use case you care about.
00;09;00;06 - 00;09;25;27
Sebastian Raschka
I would say this is something where you would not use the LLM for that. And actually, I think one of the big drivers of progress in 2025, and going into 2026, is tool use: the LLMs can use tools, so they don't have to do everything from memory, like counting the R's in strawberry, instead of the LLM trying to do that itself.
00;09;25;27 - 00;09;48;18
Sebastian Raschka
And I think the limitation there is usually around the tokenization, basically how letters get grouped into tokens. But instead of having the LLM tell you based on its own internal compute, the LLM can do a tool call. It could use a Python interpreter: read this as a string and then just use string finding, like finding the letter in the string.
00;09;48;18 - 00;10;20;14
Sebastian Raschka
And then it gives you the accurate answer. So I think we also have to think about it like this, that there are different modes of how we use it, and your mileage may vary across problems. Coming back again to 2025: there was OpenAI, Google, and some others that participated in these Math Olympiad type competitions, and they got really, really good results.
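The tool call Sebastian describes for the strawberry case boils down to a one-line string operation once the model hands the problem to a Python interpreter:

```python
# Instead of answering from memory (where tokenization gets in the way),
# the LLM can emit a tool call that runs ordinary string counting:
word = "strawberry"
print(word.count("r"))  # → 3
```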
00;10;20;16 - 00;10;42;28
Sebastian Raschka
I think what they call gold-level performance. But at least for OpenAI, they didn't use a model that is publicly available; they used some custom version, and this usually also involves inference scaling. The same is true, for example, for DeepSeekMath version two. They had a paper, and they reached that same level.
00;10;42;28 - 00;11;09;16
Sebastian Raschka
And they cranked up the self-refinement steps, where the LLM evaluates its own outputs and refines those, and has multiple tries at each problem, and it boosted performance significantly. So what I'm also trying to say here is that it depends a bit on how you use it. A small LLM can be really good at being efficient and cheap at a certain problem.
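The self-refinement loop mentioned here, multiple tries per problem with the model grading its own outputs, can be sketched as follows. Both `generate` and `score` are hypothetical stand-ins for LLM calls, not any specific system's API:

```python
def self_refine(problem, generate, score, max_tries=3):
    """Try up to `max_tries` answers, feeding the previous attempt back
    in, and keep the highest-scoring one (best-of-n with refinement)."""
    best_answer, best_score = None, float("-inf")
    previous = None
    for _ in range(max_tries):
        answer = generate(problem, previous)  # stand-in for an LLM call
        s = score(problem, answer)            # stand-in for self-evaluation
        if s > best_score:
            best_answer, best_score = answer, s
        previous = answer
    return best_answer

# Toy demonstration: successive attempts get closer to the target value 10.
attempts = iter([4, 7, 10])
result = self_refine("toy problem",
                     generate=lambda p, prev: next(attempts),
                     score=lambda p, a: -abs(a - 10))
print(result)  # → 10
```

Each extra try is another full generation, so this is again trading inference compute for accuracy.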
00;11;09;16 - 00;11;35;21
Sebastian Raschka
But it's not going to solve all your tasks. You can specialize it a little more for complicated things, but then it might fail at another task. And right now, when you go to chatgpt.com, for example, this is a general-purpose model, and it has some modes, this auto mode, thinking versus non-thinking, deciding which is the right one. But it is still trying to be, you know, a jack of all trades, in the sense that some people use it for summarizing emails.
00;11;35;21 - 00;12;01;18
Sebastian Raschka
Some people ask medical questions, some people use it for coding, and so on. It does a little bit of everything, but it's not super specialized. And I think, with LLMs right now, the biggest use case, and utility-wise the most promising one, is coding. They're very good at coding.
00;12;01;20 - 00;12;17;16
Sebastian Raschka
And we'll see how well it performs at other things. But yeah, I think I'm diverging here. There was the question of, you know, at the end of 2026, what are some of the tasks. I would say coding, for sure. It's maybe the boring answer, but it's a text problem.
00;12;17;16 - 00;12;23;22
Sebastian Raschka
And it's pretty approachable for LLMs. So yeah.
00;12;23;25 - 00;12;51;08
Geoff Nielson
If you work in it, Infotech Research Group is a name you need to know no matter what your needs are. Infotech has you covered. AI strategy covered. Disaster recovery covered. Vendor negotiation covered. Infotech supports you with the best practice, research and a team of analysts standing by ready to help you tackle your toughest challenges. Check it out at the link below and don't forget to like and subscribe!
00;12;51;10 - 00;13;15;19
Geoff Nielson
I completely agree with you. From everything I've seen, this seems to be one of the low-hanging-fruit areas where it seems like we can dramatically improve productivity with developers. You had a quote I came across in your blog where you said that you still write most of the code you care about by yourself, without AI. Is that still true?
00;13;15;19 - 00;13;27;29
Geoff Nielson
And, you know, what do you think the implications are in terms of how developers and development teams use LLMs or don't use LLMs for what they're trying to accomplish?
00;13;28;02 - 00;13;51;04
Sebastian Raschka
This is to a large extent still true. But I also use LLMs for coding, I would say in different ways. Like, I can do Python coding; I'm a scientific coder. I can use PyTorch, Python, some other languages too, for scientific computing.
00;13;51;04 - 00;14;12;25
Sebastian Raschka
But I am not really a web designer, and I can't really build apps. You know, if you ask me how to build an iPhone or macOS app, I have no idea; I've never done that before. But for myself, I've automated things all my life, even 15, 20 years ago. I usually write scripts that do something to help me, like rename files, these types of things.
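A minimal sketch of the kind of helper script he means, here renaming images in a folder to a numbered sequence (the `.png` pattern and the prefix are made up for illustration):

```python
from pathlib import Path

def rename_with_prefix(folder, prefix):
    """Rename every .png in `folder` to prefix-001.png, prefix-002.png, ...
    in sorted filename order."""
    for i, f in enumerate(sorted(Path(folder).glob("*.png")), start=1):
        f.rename(f.with_name(f"{prefix}-{i:03d}{f.suffix}"))
```

Called as `rename_with_prefix("photos", "vacation")`, it would turn `a.png` and `b.png` into `vacation-001.png` and `vacation-002.png`.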
00;14;12;28 - 00;14;30;03
Sebastian Raschka
For example, for my blog posts, I have a workflow where I have all these images in one PDF, and then I usually export it, and I have a script that crops it automatically and converts it to a different file format. But it's still a bit tedious; I have to go to the location of my script, type the commands, and everything.
00;14;30;03 - 00;14;49;26
Sebastian Raschka
So I thought, okay, I can just make my life easier here and develop a macOS app, a native app where I can just drag and drop the file in, and it performs the cropping and the conversion and everything for me. And that is something I use. And sure, it took a few hours, maybe one or two hours, to get it right the way I wanted it, but it's something I would not have been able to do otherwise.
00;14;49;26 - 00;15;04;17
Sebastian Raschka
It's kind of like magic: I now have a native macOS app that does that for me, and I can do that for a lot of things, everyday things on my computer. For that, I would just use the LLM.
00;15;04;18 - 00;15;18;27
Sebastian Raschka
I don't really care about how it does it. I mean, I can see, okay, it works, and if it doesn't work, I can give it some more prompts. I have no idea how SwiftUI works. I could probably figure it out, and I always wanted to learn it, but I don't have time to learn it.
00;15;18;27 - 00;15;49;16
Sebastian Raschka
I have so many other things that are more important each day, because it's not my main job, essentially. But for things I care about, let's say my research experiments, there I usually write most of the code myself, just to think through the problem and through the shortcomings, to get an idea. I have a pretty good idea of what I want to do, but there are cases where I make mistakes.
00;15;49;19 - 00;16;16;00
Sebastian Raschka
And I usually use an LLM to get a second opinion, like, hey, does this look okay? It's something where, back in the day, I had a colleague, or you do a PR on GitHub open-source projects with other people chiming in. And this is like another layer before that: before you share it with other people, you do a sanity check with LLMs to make your work better, like a proofreader, a second pair of eyes, in that sense.
00;16;16;00 - 00;16;36;20
Sebastian Raschka
It's really good at that. Sometimes it also has suggestions to make things more stable. Sometimes, also, I have multiple experiments with, let's say, different settings, and I've written some code here and then fixed something here, and I now want to apply it to all the other scripts, all the other experiments. And then I would use an LLM and say, hey, look at this.
00;16;36;20 - 00;17;05;22
Sebastian Raschka
What I've done here, now basically copy that over to all the other ones. I could do that manually, but sometimes it's tedious, and this is something I can easily review, because I know the code and I can see what changes it made. Then, oh, that looks okay; I just check: okay, next, next, next. So in that sense, I do use LLMs for my coding workflow, but depending on the context. I don't want it to do everything for me, because then I have no idea what's going on anymore.
00;17;05;22 - 00;17;13;05
Sebastian Raschka
I want to do it in a more controlled type of way for the work I care about. That was basically what the quote was about.
00;17;13;05 - 00;17;33;07
Geoff Nielson
Yeah, it makes complete sense to me. And the reason I bring it up is, you know, in the context of some alarmist narratives out there that you may have heard, the most extreme one basically saying that, well, computer science, or development as a whole discipline, is going to be obsolete because we won't need to hire these people anymore.
00;17;33;07 - 00;17;59;11
Geoff Nielson
Everybody can vibe code, and the machines can do it by themselves. And there's a more balanced version of that as well, which just says developer productivity is going to be so radically changed that one developer can have the same throughput as maybe ten developers could a year ago. Do you see that as holding any water in the next, you know, two years?
00;17;59;11 - 00;18;02;15
Geoff Nielson
Or is that just sort of fanciful thinking?
00;18;02;18 - 00;18;27;02
Sebastian Raschka
I think there is a kernel of truth in that, in the sense that it is true, I noticed it myself with the use cases I just described: it just goes faster if I tell it, let's say, apply my patch that I have here to the other files, or something like that. So in that sense, 100%. Also, for example, for my own website, I added a dark mode button that otherwise would have taken me weeks or months.
00;18;27;02 - 00;18;47;13
Sebastian Raschka
I mean, I had it on my to-do list for years, and I never got to it because I knew it was going to take a lot of work. And then I did that in one day, basically. So in that sense, yeah, it is true, it kind of makes things faster. But it's still work, you know. Even with this macOS app I described, it still took me a few hours to do that.
00;18;47;15 - 00;19;08;12
Sebastian Raschka
And it's a very basic app. So what I'm trying to say is, it's not making people who develop code or design and build apps obsolete, because it's still work. Maybe one day you can say, okay, build x, y, z, and it will build a version of that, but I don't even think that's true.
00;19;08;12 - 00;19;31;01
Sebastian Raschka
But usually the first version is not the final version. So there are iterations: you have to test it, you have to use it, you have to tweak it, and that is still going to be work. I think what will change, what I'm hoping, is that with LLMs, apps like that, or websites, get better than they used to be.
00;19;31;04 - 00;19;54;17
Sebastian Raschka
I'm hoping people use LLMs to improve what they would build otherwise, not to just produce more low-quality work. That's what I'm hoping for the future, but everything is still work. I also noticed that for my own experiments, it's not just the code; it's also running the experiments, doing the comparisons, thinking of additional things to compare.
00;19;54;19 - 00;20;32;16
Sebastian Raschka
And that's all still work. I think the same is true, like, there are some people on the internet saying software is basically free now, free in the sense that an LLM can do it, and there's no value in, let's say, open-source projects anymore. And I don't think that's true, because I would always take something that has been developed and tested over many years over something that the LLM gives me as a one-shot solution. I think the best of both worlds is if people use LLMs to improve things that are already there and build new
00;20;32;16 - 00;20;47;12
Sebastian Raschka
things, but then iterate over them and make them better than they would be otherwise, adding more tests, making them more robust, patching bugs, that type of thing. And it's still going to be a lot of work to do all these things, even if you have LLMs. You know.
00;20;47;15 - 00;21;14;13
Geoff Nielson
That makes complete sense to me. And it feels even more reasonable given the conversation we were having earlier, about the fact that if we really want to push any of these LLMs to their limits, that probably won't be done through the generalizable, you know, standard ChatGPT model. It's looking at some of these unique models for unique use cases, really.
00;21;14;13 - 00;21;41;03
Geoff Nielson
And as soon as you're starting to build out some of those, you need someone who actually understands what they're doing to be able to set those up appropriately. And so, I mean, first of all, I want to feed that back to you, because that was something I took away from our earlier conversation: it sounds like just trusting a singular LLM to be able to push the frontier in every given area is going to be less effective than having more specific ones for specific tasks.
00;21;41;08 - 00;21;43;16
Geoff Nielson
Is that fair?
00;21;43;18 - 00;22;04;09
Sebastian Raschka
Yeah, that is a good characterization. And it comes back to an old problem in deep learning, deep learning being the field of training artificial neural networks, because an LLM is essentially, at the end of the day, a deep neural network. And the one problem is basically, if you train it on one thing, it will forget other things.
00;22;04;09 - 00;22;24;19
Sebastian Raschka
Again, LLMs learn and work differently from humans, but it's the same for us, right? I mean, if I just solve math problems every day, I will get really good at math. If I then don't do math for a few years and do something else, you forget things, you know, because you get new information.
00;22;24;19 - 00;22;42;03
Sebastian Raschka
You don't, let's say, hone in on your skills anymore. And the same is true for LLMs. So when people develop or pre-train LLMs, they are very careful about what goes into the pre-training mix, and also in which order. And then once you have that base model, you usually fine-tune it.
00;22;42;03 - 00;23;01;12
Sebastian Raschka
And then, in addition, you often also have domain-specific fine-tuning. There's an older paper, I think it was the Llama paper, which had a nice graphic on that, how they develop the model in these different stages. And usually the last stage at the end is more like what you really care about.
00;23;01;12 - 00;23;20;24
Sebastian Raschka
I mean, if you want to develop, for example, a coding LLM, you always have to have the coding data already in the pre-training; you have to carry it through. But at the end you will have a specific phase where you just fine-tune it on coding problems, and then it will probably get worse at math or Spanish or something like that.
00;23;20;27 - 00;23;33;25
Sebastian Raschka
And that's the trade-off. So you still start off with a generalist LLM, but then you specialize it, and that trades off other skills, so it becomes worse at other things, basically. So yeah, that's basically how it works.
00;23;33;27 - 00;23;53;11
Geoff Nielson
Right. And I can wrap my head around that. And it makes the case really nicely for building some of these more specialized models. So I'm curious, Sebastian: we've got the proprietary generalized models sort of on one side of the spectrum, and completely building your own LLM on the other side of the spectrum.
00;23;53;11 - 00;24;18;11
Geoff Nielson
And then in the middle, and feel free to reject my characterization here, we've got sort of the custom GPTs, or finding ways to customize some of the proprietary models. When, in your mind, does it make sense to be customizing a proprietary model versus building your own? And who are the types of people, and what are the use cases, that make the most sense when we start talking about building your own?
00;24;18;13 - 00;24;41;03
Sebastian Raschka
Yeah. So there are different levels of that. You can essentially start from scratch, pre-training and fine-tuning one model yourself. That's the most work, and I would not recommend anyone doing that, unless you are a company whose goal it is to build LLMs, essentially, or you are a big company with a lot of money.
00;24;41;06 - 00;25;01;24
Sebastian Raschka
And you really want to do something specialized. I remember, it's a few years ago now, I think Bloomberg pre-trained a model from scratch, focusing on their news headlines and writing news articles, something like that. At that scale, maybe it makes sense, but it's going to cost millions of dollars, so it's not cheap.
00;25;01;26 - 00;25;27;05
Sebastian Raschka
And I think what we are going to see is big fields, like finance and law. There, I think it does make sense, because lawyers can't just use ChatGPT, for data privacy reasons and other reasons. So if people get together in such a field, and there's a lot of money in that field,
00;25;27;05 - 00;25;45;11
Sebastian Raschka
they could spend tens of millions, hundreds of millions of dollars to develop an LLM, a base model for law-type things. But then it's this general law model, and you still maybe have to fine-tune it on the internal data at your company, something like that. So yeah, one thing would be completely from scratch.
00;25;45;14 - 00;26;11;04
Sebastian Raschka
And like I said, I would not recommend that, because it's very expensive, unless you are one of a select few players who might want to do that. The second variant would be taking an existing pre-trained model and then specializing it. And I think that is more feasible, because there are a lot of LLMs out there in all different sizes, like all the open-weight models, for example the DeepSeek models.
00;26;11;04 - 00;26;28;26
Sebastian Raschka
Qwen 3 is a very popular model, or even from OpenAI, the gpt-oss model, an open-weight model. It's free to use, and you can then fine-tune it. But it's, again, not trivial, so you will still end up spending tens to hundreds of thousands of dollars if you want to have a really good model there, depending on the size.
00;26;28;26 - 00;26;51;10
Sebastian Raschka
So it might vary, but as a hobbyist, I think that's totally out of reach, unless you are really, really passionate about something. And the problem is, if you want to build something really competitive like that, you have to have a big user base, a customer, or a use case for it, because it's going to cost money, and it will also at some point become obsolete.
00;26;51;10 - 00;27;12;23
Sebastian Raschka
For example, I don't know when Qwen 4 comes out, but let's say I today use a Qwen 3-based model, which is from summer 2025, and I spend a lot of money and time to make it really good. I don't know when Qwen 4 or some other model will be much, much better. And then my model is completely obsolete, and I have to start over again.
00;27;12;26 - 00;27;33;11
Sebastian Raschka
But, yeah, it does still make sense for certain things. So that's taking a base model and fine-tuning it. And the third use case would be taking an existing model and essentially just customizing it with the prompt. And I think that's what a lot of people do. They use an API, where they don't even host the model.
00;27;33;11 - 00;27;57;10
Sebastian Raschka
And then with a prompt you can steer it in a certain way. It's not going to be perfect, but for a lot of use cases you get a lot of bang for the buck, because you don't have to train your own LLM, basically. But again, there are limitations. For example, if you use an API, you have restrictions: you can't use your private data, or at least you shouldn't, because that data leaves your hands.
00;27;57;10 - 00;28;18;24
Sebastian Raschka
There were a few instances in the news in 2026 where prominent people did that and data leaked. So I think, yeah, it really, really depends on what your goal is. But if I, for example, today need something to translate, I don't know, articles from one language into another, I wouldn't go out there and train my own LLM.
00;28;18;24 - 00;28;26;29
Sebastian Raschka
I would just pick one of the popular ones, like ChatGPT or Gemini, use a prompt for that, and see how far it gets me, basically. Yeah.
00;28;27;07 - 00;28;48;03
Geoff Nielson
Right. And I'm, you know, chuckling a little bit, and I have to ask, I want to come back to something you said, which is that you're sort of steering people away from building an LLM from scratch, which I'm chuckling at because, you know, it's something that you've obviously invested a lot of time teaching.
00;28;48;03 - 00;29;10;05
Geoff Nielson
And so for the people that you're teaching this to, is it mostly people who are interested in learning this either as hobbyists, or because they're interested in learning the fundamentals of how this works, so that when they use LLMs in a more commercial setting, maybe something more proprietary, they understand the basic building blocks?
00;29;10;05 - 00;29;11;29
Geoff Nielson
Who's typically interested in this?
00;29;12;02 - 00;29;33;08
Sebastian Raschka
Yeah. You bring up a good point, because it sounds paradoxical. On the one hand, I'm building these things from scratch, and then I tell people not to do it themselves. So I would say I'm not trying to steer people away from it. It's more that I want to set the expectations right. You know, just like you said: who are the people who should be doing it?
00;29;33;10 - 00;30;02;15
Sebastian Raschka
And I also know from my experience how much work it is to build something from scratch that is actually better than what's out there. So there were some readers, for example, who stumbled upon my book, and they are not coders, they're from different fields. And the expectation is: oh, I read the book and I will be able to, let's say, there was actually a case with language translation, build something that translates documents for me better than ChatGPT.
00;30;02;22 - 00;30;24;01
Sebastian Raschka
And that's not going to happen, basically, unless you spend, I mean, LLMs are also usually not trained by a single person; they're trained by a large team. And so to answer your question, we can use an analogy. For example, let's say you are interested in cars, and you want to learn how they work, you know, you're just passionate about cars.
00;30;24;01 - 00;30;46;13
Sebastian Raschka
You want to understand how cars work, how the motor works, the steering, everything. But you would not build a Ferrari. You know, it would be very expensive as a single person. I mean, it's not even in reach. You would need a team, you would need a factory, design documents, a lot of time, and millions or hundreds of millions of dollars to develop a Ferrari.
00;30;46;15 - 00;31;03;27
Sebastian Raschka
Instead, you would maybe build, you know, a simpler, I don't know, car that resembles a car from the 1980s or something like that, something you can build in your garage. But building that car, you will understand how the Ferrari works. A Ferrari is essentially just a fancy version of that. And the book works in a similar way.
00;31;03;27 - 00;31;24;12
Sebastian Raschka
The goal is kind of to understand how things work. It's for people who maybe want to build these large types of models professionally, education-wise, because right now, I mean, how would you learn that? You know, how would you get hired at a company? You usually have to show that you have some skills already.
00;31;24;12 - 00;31;42;29
Sebastian Raschka
And so you have to start somewhere, and this could be an entry point to learn how to build these LLMs. But it's also for people who don't even want to build an LLM. They just want to understand what the limitations are. Why does an LLM struggle with counting the number of letters in "strawberry"? What goes into it? How does the whole workflow work?
00;31;43;02 - 00;32;04;00
Sebastian Raschka
And one way would be to explain everything conceptually. You can explain everything in words: yeah, the LLM takes text, tokenizes it, converts it into numbers, then they go in and there's some computation. But that's all very vague and could be misunderstood, because there are a lot of details you gloss over.
00;32;04;06 - 00;32;20;24
Sebastian Raschka
And I think the best way to really see how it works is, yeah, by actually doing it, by going through the actual steps. And these steps aren't made up: at the end you have a working LLM. And so you know, okay, this is actually working. These are not fantasy explanations. It's really concrete.
00;32;20;27 - 00;32;32;02
Sebastian Raschka
It works. And that is essentially also part of the goal. At the end you can develop your own LLM. But then the disclaimer is: it is a lot of work to get something that is really competitive, basically.
00;32;32;03 - 00;32;49;17
Geoff Nielson
All right. Because setting up the LLM in some ways is not even the hard part. There's the training, there's the pre-training, there's the data set. All of that is what separates the car you made in your garage from the Ferrari. Right?
00;32;49;20 - 00;33;16;07
Sebastian Raschka
Yeah. And that's also a good point, the data set you mentioned. In the book, I'm only using public-domain data from Project Gutenberg, like a simple example book that is public domain, hundreds of years old, because of copyright concerns. But if you want to build a real, big LLM, you need trillions of tokens; you need terabytes of data, basically.
00;33;16;07 - 00;33;35;04
Sebastian Raschka
And that would be impossible for a person who buys the book to do, because you would have to buy all the hard drives; you would have to rent hundreds of GPUs and everything. And that would be really, yeah, not feasible. So in that case, the book is also, data-wise, focusing only on a very small data set.
00;33;35;04 - 00;33;58;24
Sebastian Raschka
And the model will then learn how to write text that is similar to this book, basically. But it's not going to be, you know, your next ChatGPT, because that's, I mean, not impossible, but for a single person it's kind of not feasible. You would spend a lot of money, a lot of time. And so the goal is really helping, explaining things.
00;33;58;27 - 00;33;59;14
Sebastian Raschka
Yeah.
00;33;59;17 - 00;34;28;16
Geoff Nielson
Right. Let me take this in maybe a slightly different direction, but the data-set conversation got my wheels turning a little bit. One of the challenges for a lot of organizations trying to do this themselves is basically marrying the capabilities of an LLM with the quality of the data they currently have sitting in their enterprise applications.
00;34;28;18 - 00;34;57;23
Geoff Nielson
And, you know, I think it's easy when we talk about a book, for example: you ingest a book and you kind of tokenize the words and all the characters and all of that. When we think about data that's structured in a database, or the unstructured data that maybe people have around that, or metadata, how capable are LLMs right now at making sense of that?
00;34;57;23 - 00;35;26;19
Geoff Nielson
And is that something that you see changing in the next year or two, or is that an inherent limitation of the structure of the LLM? And basically the reason I'm asking is: for organizations that are worried about the quality of their data, is that a problem that's being misunderstood? Is it going to go away soon, or is it just an inherent limit of this model that they're going to have to come to terms with sooner or later?
00;35;26;22 - 00;35;35;05
Sebastian Raschka
So let me just try to rephrase, to see if I understood your question correctly. The limitation is working with your personal data, like data you have?
00;35;35;08 - 00;35;40;24
Geoff Nielson
That's right, with organizational proprietary data. Call it customer data or employee data or something like that.
00;35;40;29 - 00;36;05;12
Sebastian Raschka
You know, so there are different ways you can work with that data. I mean, it is a limitation, but there are so many tricks now that this becomes more feasible. And the limitation was usually that you can only fit so much into the context of LLMs. And see, this is where a book like the from-scratch book would come in handy, because you would see or understand: what is the context?
00;36;05;12 - 00;36;25;22
Sebastian Raschka
How does the LLM process the data? And there's a limitation to the size. If I crank that up, it's going to be really expensive. And at some point I exceed it; I can't just put everything into the context. I have to be smart about it. So traditionally, I mean, in an ideal world you would put everything into the context, but that doesn't work, because it's too expensive.
00;36;25;22 - 00;36;54;03
Sebastian Raschka
So people developed something called RAG, which stands for retrieval-augmented generation. And that's like an application layer around the LLM, where you take the documents you have, the ones you care about, and you chunk them up and put them into a database. And then when you query the LLM, the query is turned into a vector embedding, let's say a compressed version of your query.
00;36;54;05 - 00;37;22;17
Sebastian Raschka
And then, with something simple like, let's say, a dot product, you look for similarity to the chunks in your database, and then you retrieve the most similar chunk and you hope that it's going to be relevant. So it's essentially like a smart lookup. But you could also, in simple terms, think about it like this: you have a query, you chunk up your document into smaller parts, and then you go through them iteratively and try to find the most similar one.
00;37;22;17 - 00;37;47;03
Sebastian Raschka
And then the LLM can use that to answer the question. So for example, let's say you are in a law firm and you ask, what was the case in 1983 where X, Y, Z happened? You try to pull that out, and then the LLM can use it as part of the answer, basically. It's not perfect, because you chunk up the document and it's not the full context.
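The retrieval step Sebastian describes can be sketched in a few lines. The chunk names and three-number embedding vectors here are made up for illustration; a real RAG system would get its embeddings from a trained embedding model and store them in a vector database:

```python
# Toy RAG-style retrieval: chunks with made-up 3-number embedding
# vectors (in practice these come from an embedding model and live
# in a vector database, not in a dict).
chunks = {
    "case-1983": [0.9, 0.1, 0.0],
    "case-1991": [0.1, 0.8, 0.2],
    "case-2004": [0.0, 0.2, 0.9],
}

def dot(a, b):
    """Plain dot product as the similarity score."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, k=1):
    """Return the ids of the k chunks most similar to the query."""
    scores = {cid: dot(query_vec, vec) for cid, vec in chunks.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# A query embedding that happens to point toward the 1983 case.
print(retrieve([1.0, 0.0, 0.1]))  # ['case-1983']
```

The retrieved chunk text would then be pasted into the LLM's prompt alongside the question, which is the "smart lookup" he mentions.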
00;37;47;03 - 00;38;06;22
Sebastian Raschka
You always have these little chunks. But one of the, I wouldn't say breakthroughs, because it's more like a continuous development, but part of the progress we've seen in 2025 was that longer contexts are now supported. Context sizes are longer. So we are now, well, it really depends.
00;38;06;22 - 00;38;24;29
Sebastian Raschka
But there are even open-weight models, like Nvidia's Nemotron, that can do up to 1 million tokens. Of course it's going to be more expensive; you need more GPU power for that. But I think even, you know, the regular online versions can do 100,000 to 200,000 tokens. And I think that's about the size.
00;38;24;29 - 00;38;44;18
Sebastian Raschka
I mean, I might be wrong, but I think that's about the size of one of the Harry Potter books, the first one or something. It's a long context. And so for many people this is actually sufficient. So you don't need any fancy application around the LLM to process that; you can just put it in there.
00;38;44;21 - 00;39;14;28
Sebastian Raschka
And then there is the so-called needle-in-a-haystack problem. I mean, there are multiple problems; there's also something related to attention sinks, where the LLM kind of focuses more on the beginning of the text that you put in. But the needle-in-a-haystack problem is, when people develop these long-context LLMs, let's say you have a question about some factual data you want to retrieve.
00;39;14;28 - 00;39;32;03
Sebastian Raschka
It's kind of buried in these 100,000 words, and the LLM should find it. And the longer the context, the harder it is for the LLM to answer correctly, because there's a lot of noise; it sometimes gets distracted. But, I mean, it's similar for us humans: the more stuff you throw at us, the more complicated it is to figure it out.
00;39;32;05 - 00;39;58;08
Sebastian Raschka
And so that's where, I would say, companies made a lot of progress last year to make that better. So it works most of the time. And I would recommend for most people: just try that first, instead of building something. It's also how I approach problems. You do the simplest thing first, you write down what performance you get, and then you try to iterate and tweak it and try other things and see if they're better.
00;39;58;16 - 00;40;16;29
Sebastian Raschka
But before using the most complicated thing, always try the obvious, simple thing first. Maybe that gets you most of the way there, and then you can iterate later and see if it's worth the effort, if it's worth spending three months building something around it that does maybe 1% better.
00;40;17;01 - 00;40;44;26
Sebastian Raschka
One limitation, though, is in the case you described: you may not want to do that with ChatGPT, because, well, your data will be online. I think at the beginning of 2026 there was a case where, based on the news I read, a government employee uploaded some sensitive documents that got leaked. I think they found out because it appeared in answers to other people, and it was, like, confidential AI-security-type data.
00;40;44;28 - 00;41;05;03
Sebastian Raschka
And so, yeah, as far as I know, I mean, I don't want to say anything that's wrong, but I think ChatGPT, for example, does use your data for training the models. You know, I don't think they specifically single out specific data and publicize it or something, but it's implicitly used for the training.
00;41;05;03 - 00;41;26;09
Sebastian Raschka
They try to anonymize everything, but, well, if you upload it to ChatGPT, you have to be aware it might become part of the training data. And so you can't do that with everything. There are laws in certain fields where you can't just, you know, upload patient data, sensitive data. But then again, you can use a local LLM that runs locally, for example.
00;41;26;12 - 00;41;45;12
Sebastian Raschka
I mean, most LLMs that run locally support up to 160,000 tokens or so, which is, again, like the Harry Potter book. It's a lot of data. And there are special ones, like Nemotron from Nvidia, with 1 million tokens. So there's always something you can do locally that gets you most of the way there.
00;41;45;15 - 00;42;18;12
Sebastian Raschka
And one more thing, if I may, before you interject. There's a paper that I find really interesting, from the beginning of 2026. Let me see, I think it was called "Recursive Language Models", that's the title of the paper. What they do is kind of a clever trick. They have the query and they want to answer, let's say, a question, similar to a RAG setup, where you have a lot of data to process: a whole document base, a whole folder, let's say a lot of databases, that can't fit into the context.
00;42;18;15 - 00;42;47;23
Sebastian Raschka
And so what they do is, instead of letting the LLM do everything, they put the input into a string in Python, in a coding environment, and then let the LLM come up with ways to chunk it up into subproblems. So for example, let's say your problem is: summarize all the chapters in this gigantic book or something.
00;42;47;26 - 00;43;10;27
Sebastian Raschka
You have 12 chapters. Instead of feeding the whole book into the LLM and trying to get the summaries for all 12 chapters at once, what the LLM might decide to do is: oh, maybe I can just take one chapter each and process it in 12 different parallel, let's say, execution loops, and then I pull together the summaries from all 12 chapters and write an overall summary, or something like that.
00;43;10;27 - 00;43;28;05
Sebastian Raschka
So it's just chunking it up. It's not really rocket science; it's really just using a coding environment where the LLM can use a tool to chunk everything up itself. But that gets you most of the way there. And they did that also with ChatGPT. It doesn't have to be a local LLM; you can do it with APIs, with tool calls.
00;43;28;05 - 00;43;42;29
Sebastian Raschka
And so there are, I would say, a lot of workarounds in the field, where you have limitations in the LLM itself, but you solve those limitations by doing clever tricks in the surrounding API layer, or the application, basically.
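The divide-and-conquer pattern from that paper can be sketched roughly as follows. The `llm_summarize` function here is a placeholder for a real model or API call (it just truncates text so the control flow is visible without an LLM), and the map-then-combine structure is a simplification of what the paper does with a full coding environment:

```python
# Divide-and-conquer over a document too large for one context window.
# `llm_summarize` stands in for a real model call; it truncates to the
# first 40 characters purely so the example runs without an LLM.
def llm_summarize(text: str) -> str:
    return text[:40]

def summarize_book(chapters: list[str]) -> str:
    # Each chapter fits into the context on its own, so summarize the
    # chapters independently (a real system could run these subcalls
    # in parallel).
    partials = [llm_summarize(ch) for ch in chapters]
    # The short partial summaries are then combined in one final call.
    return llm_summarize("\n".join(partials))
```

The point is that the splitting logic lives in ordinary code around the model, not inside the model itself, which is the "clever trick in the surrounding layer" Sebastian describes.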
00;43;43;01 - 00;43;49;24
Geoff Nielson
And does that fall under the bucket in your mind of, you know, reasoning enhancements, or is that something else?
00;43;49;27 - 00;44;10;14
Sebastian Raschka
I would say it's something else. It can be reasoning-related, if it's a problem that requires good reasoning capabilities. But if I had to group it into something, it's more like general inference scaling, for example. Although, I mean, it's not even inference scaling in the sense that it makes things more expensive.
00;44;10;14 - 00;44;35;24
Sebastian Raschka
You are just chunking it up, basically, into separate subcalls. But like you said, it could be related to reasoning if your query is a reasoning query. Here, though, I think it's also a bit tricky, because let's say your task is to solve a math proof, something really complicated where you have a lot of sequential steps, where you derive all the individual intermediate steps.
00;44;35;27 - 00;44;54;28
Sebastian Raschka
I think that would not be a good case for the method, because the method runs things in parallel, like subcalls that are kind of independent of each other, and reasoning models usually benefit from this so-called chain of thought, where they think through a problem, which is more sequential.
00;44;55;00 - 00;45;29;23
Geoff Nielson
Right. And the reason I'm asking, and it probably gets a little bit outside of your field of expertise, Sebastian, but I'll ask anyway: when we think about other AI or automation applications that start to get outside of LLMs and the transformer model, but maybe have some overlap, you know, I'm thinking about agentic AI, or some of these AI use cases where these different kinds of AI systems work with each other to understand what the outcome of a task should be.
00;45;29;23 - 00;45;45;14
Geoff Nielson
And actually, you know, orchestrate to get it done. It seems like there's some overlap in terms of the processing gears, being able to understand what's more likely and chunk it up. Does that come into play, or is that completely off base?
00;45;45;16 - 00;46;05;05
Sebastian Raschka
I do think it's a very good point. I do think it's kind of related, in the sense of: understand the input, process it, and present it in a way that can then be processed further. In this particular case, instead of interacting with other agents, it's kind of interacting with itself. But it could be other agents, too.
00;46;05;05 - 00;46;30;00
Sebastian Raschka
I mean, it's basically: how can I, you know, divide and conquer my problem here? In this case it calls itself on the chunking problem, but it could just as well be delegating to other types of models. And I would say that is one of the biggest progress drivers in recent months: yeah, the tool calling, using different tools for different things.
00;46;30;00 - 00;46;57;27
Sebastian Raschka
Rather than the model trying to do everything itself. And I think that's also where a bit of, let's say, the magic comes in when you use something like Gemini or ChatGPT or Claude. It's not just the LLM. I mean, it's just a hypothesis, but I do think if you take something like DeepSeek or some other open-weight model that is really good, and you put that into whatever framework they have behind the scenes at, let's say, Gemini or ChatGPT,
00;46;57;29 - 00;47;23;23
Sebastian Raschka
I think it would be almost identically or similarly good. So what I'm trying to say is, I don't think the LLM itself is necessarily the differentiating factor anymore. They're all kind of similarly good. But what is really important is how you format things, how you deal with the context and the history, the previous conversation, yeah.
00;47;23;26 - 00;47;46;14
Sebastian Raschka
Like, how you process the previous back and forth, how you use tools. That's the stuff where a lot of work goes into making it really robust. I mean, I don't know exactly, because it's proprietary, how they process the input. Let's say I sometimes type something and make a typo, sometimes even in a relatively technical term.
00;47;46;14 - 00;48;03;18
Sebastian Raschka
I have a typo in my prompt, but often I don't even care about fixing it, because I know, okay, it's dealing with that already. So instead of deleting and fixing that word, I just leave the typo in there. And I can see it, based on the response, because often the response involves repeating part of the prompt.
00;48;03;25 - 00;48;37;15
Sebastian Raschka
I can see it fixed the spelling of my word. So I don't know if it's necessarily the LLM itself, because it has all the different spellings as tokens or subtokens, or whether there's even some preprocessing there, like fixing obvious typos, because that enhances the performance of the LLM. Because then, instead of having to have a huge vocabulary of subtokens for all the different ways someone can misspell a certain word, you can just have a simple dictionary fix and make it easier for the LLM itself.
00;48;37;15 - 00;48;58;04
Sebastian Raschka
So I think there's a lot of magic like that happening behind the scenes to improve the performance. And I think that's why you see that the performance of something like ChatGPT or Gemini is better than something you would run locally, where most of the tools that run locally kind of run barebones, without much stuff around them, basically.
00;48;58;06 - 00;49;26;17
Geoff Nielson
And, you know, I'm glad you brought that up, because it still feels like there's so much tied to the quality and the clarity of the prompt, and there are some cosmetic fixes it can do for you. But I'm curious about your thoughts on this, Sebastian, around performance benchmarking. Because if we're talking about variable output based on the clarity of what you're asking for, a lot of performance benchmarks, they seem like they're pretty clear.
00;49;26;17 - 00;49;48;24
Geoff Nielson
They're trying to do something quite clear. And, you know, to bring back a point you made earlier, it also feels like some of these organizations are not just using the publicly available models, or there's some inference scaling happening behind the scenes that's not really visible. You know, how much weight do you put in performance benchmarks at all right now?
00;49;48;24 - 00;49;55;27
Geoff Nielson
And how much should people be looking at those as a measure of the capability of these tools going forward?
00;49;55;29 - 00;50;17;08
Sebastian Raschka
Yeah, that's a good question. I think that's one of the biggest problems in the field: how to evaluate models, let's say, fairly. Well, there are different types of benchmarks. Off the top of my head, I would say there are three or four. So let's see if I can come up with what I have in mind.
00;50;17;10 - 00;50;36;03
Sebastian Raschka
So the first one is basically the classic MMLU-style, multiple-choice benchmark. That one is basically, you know, like a trivia question, almost a "Who Wants to Be a Millionaire" type of question: there's a question, and then there are answers A, B, and C, and the model has to select one of these answers.
00;50;36;06 - 00;50;54;02
Sebastian Raschka
And people usually use that to test knowledge. So does the LLM know about world knowledge and math and basic things? But it's ultimately not how you use an LLM. You would never give it all the solutions and say, pick A, B, or C; you would ask free-form, for example.
00;50;54;05 - 00;51;12;12
Sebastian Raschka
But then free-form is really hard to evaluate programmatically, because there are different ways to express it. You can have it as a word, as a sentence. And so that's why they do A, B, C, and D, multiple choice. But that has a lot of limitations. And so, yeah, I think it mostly serves as a minimum threshold.
00;51;12;12 - 00;51;38;05
Sebastian Raschka
Like, I think a model should have a minimum score on these benchmarks to be okay. But at some point, if it's passing a minimal threshold, I don't think it matters if it's 90 or 95%. Basically, we also have to keep in mind that some people run these benchmarks with and without tool use. And OpenAI's gpt-oss model, the open-weight model, had a nice chart about that.
00;51;38;07 - 00;51;58;12
Sebastian Raschka
When you have a model and you allow it to use tools, it gets much better performance than without. For example, if you ask a model who won, let's say, the Soccer World Cup in 1998, it can try to remember it, but it can also use a tool and look it up on the internet, let's say on the official website.
00;51;58;14 - 00;52;18;27
Sebastian Raschka
And then you increase the accuracy on these types of things. So that is one type of benchmark. And, well, it is like a minimum threshold: the model should be able to do these things and answer correctly. But it doesn't really tell you how the LLM performs when I actually use it and prompt it in different ways, because models can also be sensitive to the prompt format.
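Scoring a multiple-choice benchmark like this is trivial, which is part of its appeal; the predictions and answer key below are invented for illustration:

```python
# Multiple-choice benchmark scoring (MMLU-style): the model picks one
# of A/B/C/D per question, and accuracy is the fraction it got right.
def mc_accuracy(predictions: list[str], answer_key: list[str]) -> float:
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

print(mc_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]))  # 0.75
```

That simplicity is exactly why these benchmarks are popular, and also why, as Sebastian notes, they say little about free-form use.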
00;52;19;04 - 00;52;43;19
Sebastian Raschka
Another way is these so-called leaderboards, where they have a website and you can use different LLMs with the same prompt, and then you compare the answers and you say, oh, I prefer this answer over that one. So that actually sounds more related to what we would care about. Let's say I have Gemini and ChatGPT side by side, and they answer the question.
00;52;43;19 - 00;53;01;00
Sebastian Raschka
And I say, oh, I actually prefer this answer. And if you do that a lot, with a lot of people, with a lot of pairwise comparisons, you can use a statistical model, like a Bradley-Terry model, to convert it into a ranking, into numbers like one, two, three, four, five, six, so you can say, oh, this one is on top.
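Turning pairwise votes into a ranking can be sketched with a simple Elo-style update. The model names, K-factor, and votes below are invented; real leaderboards typically fit a Bradley-Terry-style model over all votes jointly rather than updating vote by vote:

```python
# Minimal Elo-style rating from pairwise preference votes: each
# "I prefer A over B" vote nudges the winner's rating up and the
# loser's down, yielding a numeric ranking.
K = 32  # update step size (an assumed, conventional value)

def expected(r_a: float, r_b: float) -> float:
    """Predicted probability that the first model wins the vote."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

ratings = {"model-A": 1000.0, "model-B": 1000.0}
for _ in range(10):  # ten simulated votes preferring A over B
    update(ratings, "model-A", "model-B")
print(sorted(ratings, key=ratings.get, reverse=True))  # model-A first
```

The output ranking is exactly the "one, two, three" ordering Sebastian describes, and it inherits all the style-preference biases he goes on to discuss.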
00;53;01;03 - 00;53;31;02
Sebastian Raschka
But this also has limitations, because people prefer a certain style. People are sensitive to the style of the answer, not necessarily the correctness. Because if I ask an LLM a question, I usually don't know the answer, like if I have a challenging math problem; otherwise I wouldn't ask, really, right? And so then you get different answers, and you say, oh, I prefer this one because it's maybe explained more nicely, or I like the language better, but that doesn't mean it's more correct.
00;53;31;02 - 00;53;52;18
Sebastian Raschka
So with leaderboards, it's kind of sensitive to a certain style. There was an incident, like the thing with Llama 4 in the summer, where, I mean, I don't know the full story, because I'm not affiliated with the companies and I don't know what happened behind the scenes, but what was reported was that they used a different model on the leaderboards, and it got really high leaderboard scores.
00;53;52;18 - 00;54;10;09
Sebastian Raschka
But in reality it was not a good model, because there's no substance behind it. It's more like glamour: it looks better than it really is behind the scenes when you actually use it on hard tasks. So it's always a bit challenging.
00;54;10;10 - 00;54;45;18
Sebastian Raschka
Another way to evaluate models would be verification. For example, you can use math or code, something you can verify. Let's say math, and you have the correct answer; it's a numeric answer, and there are tools where you can compare it to the reference answer. Usually how it works is you tell the LLM: hey, I have this problem, explain it, write out whatever intermediate steps you need, and then put the final answer in a box, like an answer box.
00;54;45;18 - 00;55;02;12
Sebastian Raschka
It's usually the LaTeX \boxed{} format. And then you can programmatically retrieve this answer and compare it to the reference answer. You can use a calculator to say, oh, these numbers are the same, or they're different. And there are different tools, like, you know, Wolfram Alpha, or SymPy.
00;55;02;12 - 00;55;21;24
Sebastian Raschka
It's an open-source library for Python where you can symbolically compare solutions. And that is really, I mean, this is accurate, right? If the model follows the instructions and puts the answer in the box, I can really compare with a lot of certainty. There may be some parsing errors, but, like, 99% of the time you have a fair comparison.
00;55;21;26 - 00;55;43;26
Sebastian Raschka
But the problem is, it's kind of limited to math, or to code where you can check that the code compiles. It's harder to evaluate the whole answer; you can only evaluate the final answer, not the intermediate steps. And I think I said three, so one more comes to mind: another way to evaluate is using LLM judges.
00;55;43;29 - 00;56;08;12
Sebastian Raschka
It's basically using another LLM, and you provide a rubric: evaluate whether the answer is correct, whether the intermediate steps make sense. You give it different criteria and say, given these criteria, evaluate this answer against that reference answer, and give me the quality of the answer on a scale between 0 and 10, where ten is best.
00;56;08;14 - 00;56;30;20
Sebastian Raschka
And then you can do that across examples and average the numbers and say, oh, this model gets 8.5 on average, this model gets 9.5 on average. That's a numeric way you can evaluate free-form answers. But then again, there's always a catch. The catch is that you're using a different LLM for that, and it might not always evaluate things correctly.
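A minimal LLM-as-a-judge harness might look like the sketch below. `call_llm` is a hypothetical stand-in for whatever chat-completions client you use, stubbed out here so the example runs:

```python
import statistics

RUBRIC = """Rate the answer from 0 to 10 (10 is best) for:
- correctness of the final answer
- whether the intermediate steps make sense
Reply with only the number."""

def judge_answer(question, answer, reference, call_llm):
    # call_llm(prompt) -> str is a placeholder for a real LLM API call.
    prompt = (f"{RUBRIC}\n\nQuestion: {question}\n"
              f"Candidate answer: {answer}\nReference answer: {reference}")
    return float(call_llm(prompt))

def average_score(examples, call_llm):
    # Averaging per-example scores gives the "8.5 on average" style number.
    return statistics.mean(
        judge_answer(q, a, ref, call_llm) for q, a, ref in examples
    )

# Stubbed judge for demonstration; a real judge LLM replaces this lambda.
stub_judge = lambda prompt: "8"
print(average_score([("2+2?", "4", "4"), ("3*3?", "9", "9")], stub_judge))  # 8.0
```

The weakness Sebastian names lives entirely inside `call_llm`: the judge model's own biases flow straight into the average.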
00;56;30;22 - 00;56;52;26
Sebastian Raschka
It might have a bias towards a certain answer style. So long story short, each of these methods to evaluate LLMs has its shortcomings. The best way is to look at all of them together in context, not just a single one, and try to see what the weaknesses and strengths of the different LLMs are on these benchmarks.
00;56;52;26 - 00;57;09;27
Sebastian Raschka
And that's really hard. In the end, they all look kind of similar when you see releases. Really, you have to use a model yourself and see what works for you and what doesn't. It's really hard; getting a fair comparison is one of the biggest problems right now.
00;57;09;27 - 00;57;11;10
Sebastian Raschka
Basically.
00;57;11;13 - 00;57;33;17
Geoff Nielson
That makes sense. And it's interesting, because at the frontier there are these small differences: okay, do I prefer this style slightly more than that other style, versus is it getting the answer fundamentally wrong? How do you best answer that question?
00;57;33;19 - 00;58;00;05
Geoff Nielson
I am curious, Sebastian. You've been doing this for a while, and I have to imagine the interest in this from non-technical people has increased in the last couple of years. What do you personally see as some of the biggest misconceptions people have about LLMs, how they function, and how they can get value and use out of them?
00;58;00;07 - 00;58;24;09
Sebastian Raschka
It's a good question, actually. Off the top of my head, I don't think there's one huge misconception; it comes down more to expectations. Like we mentioned earlier, it's really hard and very expensive to train LLMs. I think the misconception is, oh, I just have to do X, Y, Z, and I can do that on a weekend, basically.
00;58;24;09 - 00;58;49;02
Sebastian Raschka
One person can understand everything on a basic level; that's not a problem. But I think the challenge, compared to previous problems in the field, is that you need a whole team to do it. You need an expert in GPU infrastructure. You need the researcher who implements the core architecture.
00;58;49;05 - 00;59;09;15
Sebastian Raschka
You need to run experiments. It's not usually something someone can do by themselves or on a weekend. Understanding it, yes, but doing the whole thing is a lot of work. Looking, for example, at... I mean, most people don't share these types of details, but I think it was the Llama 2 or Llama 3 paper.
00;59;09;17 - 00;59;30;13
Sebastian Raschka
They had a very nice section on what it took to train that model back then. I forget the nitty-gritty numbers, but they reported, okay, we trained it on so many thousands of GPUs, but we had so-and-so many failures each day, because hardware failed, like GPU crashes.
00;59;30;13 - 00;59;50;08
Sebastian Raschka
And then you might lose your whole model. So you have to checkpoint it, or you have to build in robustness so that when one GPU fails, it doesn't crash your whole million-dollar run, right? There's a lot of that, and it usually requires a whole team, not a single person, because you have to monitor everything all the time.
00;59;50;08 - 01;00;09;14
Sebastian Raschka
And yeah, you can have notifications set up and everything, but it's a full-time job for a lot of people to develop an LLM. So I think that's the thing that has changed compared to machine learning or deep neural network training before. And that's also why you don't see that many academic LLMs anymore.
01;00;09;14 - 01;00;29;06
Sebastian Raschka
There are a few. But back in the day, with convolutional networks and image classifiers, everyone was able to do that by themselves in a small lab, because university labs are usually a handful of people. But now you have to have a huge budget, a whole team, a lot of time, a lot of expertise.
01;00;29;06 - 01;00;36;27
Sebastian Raschka
It's a lot of things. And that's why this is now mostly restricted to companies that have the resources to do it.
01;00;36;29 - 01;01;06;21
Geoff Nielson
So I'm going to play that back to you, and let me know if I'm getting this right. It sounds like, just based on how much this field has evolved, how many resources the biggest players have, the amount of staff required, the amount of compute, the cost, and the raw data required to do this, that if you're a small shop or a hobbyist, it is valuable to learn how to build an LLM from scratch.
01;01;06;23 - 01;01;21;27
Geoff Nielson
That's something that will help you personally and professionally, but it's probably not something, in most cases, that you're going to then implement and try to do at scale, because it's just not as practical as it may have been 5 or 10 years ago. Is that fair?
01;01;22;00 - 01;01;56;19
Sebastian Raschka
Yes, that is fair, with a caveat. So I agree with you on the pre-training side. But for fine-tuning, if you use an open-weight LLM that is already out there and you build on top of that, then it becomes easier. Just to put in some numbers, taking the DeepSeek example again, because it's such a popular model: the V3 and R1 releases that came out in December 2024 and January 2025 had some nice numbers in there.
01;01;56;22 - 01;02;20;14
Sebastian Raschka
If you would rent the GPUs they used, they put in a price of, I think, $2 per hour. It would have cost about $5 million to train the DeepSeek-V3 model. And this is not including any cost for staff, like salaries, or the rent for the building to have people there.
01;02;20;17 - 01;02;38;15
Sebastian Raschka
And it also doesn't include the failed runs, because when you do something like that, you will fail a lot of times until you find the right configuration; there's a lot of trial and error. But if you only take the solution that worked and run it, the pure GPU cost would be around $5 million, which is a lot of money.
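The arithmetic behind that figure is straightforward; the DeepSeek-V3 technical report cites roughly 2.788 million H800 GPU-hours for the full training run at an assumed $2 per GPU-hour:

```python
# Pure rental-cost estimate for the DeepSeek-V3 training run,
# excluding salaries, infrastructure, and failed experiments.
gpu_hours = 2_788_000          # ~2.788M H800 GPU-hours (technical report)
price_per_gpu_hour = 2.0       # assumed rental price in USD
total_cost = gpu_hours * price_per_gpu_hour
print(f"${total_cost:,.0f}")   # $5,576,000
```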
01;02;38;17 - 01;03;07;02
Sebastian Raschka
But then if you look at the fine-tuning, the reasoning training they did, it was more on the order of, I forget exactly, but like $100,000 to $200,000, something like that. Much, much lower, and much more approachable. And this is for a 671-billion-parameter model, a very, very big model. Now, if you go down in size, from 600-something billion to, let's say, 20 billion, you could probably do something really good with a few thousand dollars.
01;03;07;04 - 01;03;27;21
Sebastian Raschka
And then it again would make sense. But again, it's not going to be something you do on a weekend. It usually requires a few weeks or a few months to get really good results. But once you are confident and can do it, you can later swap out the LLM and repeat the procedure. So once you've learned the workflow, it's actually easy.
01;03;27;24 - 01;03;47;22
Sebastian Raschka
Yeah, once you get going, it's kind of easy to apply your skills to other LLMs. And what I wanted to say is, yeah, you can do interesting things. And there are also APIs. Again, I'm not affiliated with any of these companies, but there is, let me think... Thinking Machines.
01;03;47;22 - 01;04;19;18
Sebastian Raschka
It's a company from one of the OpenAI co-founders that has an API where you can fine-tune and customize LLMs, so you don't have to have the GPUs yourself. Like with a ChatGPT API, you can use the API they have, but for fine-tuning: you give it the data and the settings and then run it on cloud machines, without having to worry about managing GPU failures and that type of thing.
01;04;19;23 - 01;04;38;06
Sebastian Raschka
But again, I think here it really, really helps to learn how to build it from scratch first, to understand what you're doing with these different settings. For example, in my Build a Reasoning Model (From Scratch) book, I have a chapter on reinforcement learning with verifiable rewards, which is essentially the DeepSeek-R1 reinforcement learning approach.
01;04;38;09 - 01;05;01;01
Sebastian Raschka
And I'm only running it on a dataset related to MATH-500. Sorry, MATH-500 is a test set that is popular for benchmarking; I use the corresponding training set, which has 12,000 math problems that don't overlap with it, and I'm training on that. Just to give you some numbers: there are about 12,000 problems.
01;05;01;01 - 01;05;20;29
Sebastian Raschka
It takes about a day to train for 500 steps on one GPU. So if you would do all 12,000, it would be about 20 days or more just to train it, and it's a small model. But even doing a few steps, you understand: what are the final rewards, what is happening there, what am I comparing against?
01;05;20;29 - 01;05;41;15
Sebastian Raschka
What are the reference answers? What is it checking? What are the different settings? There are things like the number of rollouts and the batch size. There are different settings in GRPO, like the epsilon clip ratio. There are a lot of little tweaks and knobs, and by building it from scratch, you know what all of these things mean.
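A toy version of the verifiable-reward setup he teaches: score each rollout against the reference answer, then normalize rewards within the group of rollouts for the same prompt, which is the GRPO-style trick. The rollout strings here are invented for illustration:

```python
import re
import statistics

def verifiable_reward(completion, reference):
    # 1.0 if the final \boxed{...} answer matches the reference, else 0.0.
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if matches and matches[-1].strip() == reference.strip() else 0.0

def group_advantages(rewards):
    # GRPO-style: center and scale rewards within one group of rollouts,
    # so better-than-average samples are reinforced and worse ones penalized.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]

rollouts = ["... so \\boxed{42}", "... so \\boxed{41}",
            "... so \\boxed{42}", "no boxed answer"]
rewards = [verifiable_reward(c, "42") for c in rollouts]
print(rewards)                    # [1.0, 0.0, 1.0, 0.0]
print(group_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```

Knobs like the number of rollouts per prompt are visible here directly: `rollouts` is one group, and its size is one of the settings he mentions.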
01;05;41;18 - 01;06;02;26
Sebastian Raschka
And then once you've understood, oh, that's what I'm doing, that's what's happening here, you can, for example, go to an API and say, oh, I'm actually confident I know what this setting is, I've used it before, I understand it's just a knob I have to tune, I'll try this setting and that setting. It helps a lot to build from scratch to get that intuition before you have a production system.
01;06;02;26 - 01;06;03;26
Sebastian Raschka
Yeah.
01;06;03;29 - 01;06;32;10
Geoff Nielson
That's great. So as we start to wind down the conversation, I did want to ask you, Sebastian: what's your takeaway advice for technology leaders who may be interested in learning more about LLMs, or in ensuring their teams better understand LLMs? What are the main takeaways you would give them in terms of how they should move forward in this space?
01;06;32;13 - 01;06;56;25
Sebastian Raschka
Yeah. Well, shameless plug here: I would say doing some lightweight coding of an LLM from scratch. I mean, not from scratch-scratch without any template, but, for example, with my book or something similar that guides you through it. And if you're comfortable with Python and PyTorch, this is something you could technically do on a weekend, or maybe 4 or 5 days.
01;06;56;27 - 01;07;20;09
Sebastian Raschka
And that gives you the foundation. There's a lot of jargon out there: mixture of experts, different attention mechanisms, grouped-query attention, multi-head latent attention. But it's all derived from the original GPT model. So once you understand the core building blocks, it kind of demystifies all the other things, because they're all built on top of that, or flavors of it.
01;07;20;09 - 01;07;46;23
Sebastian Raschka
And I think it is important. That's what I also like to do: build a foundation of something, because then it's always easier to look something up and see how it evolves from there, compared to starting with, okay, I have no idea how anything works, what is this mixture of experts? And that's a good example, because some prominent people actually asked me for advice on mixture of experts.
01;07;46;23 - 01;08;04;21
Sebastian Raschka
Would that be something worth investing in, more like a big-picture investment, as a person who, let's say, is not building LLMs? And the misconception would be: oh, mixture of experts, I can train different LLMs and then combine them together. I can have a math LLM, a Spanish LLM.
01;08;04;21 - 01;08;32;04
Sebastian Raschka
And then I don't have to train everything together. That sounds very plausible, but it's not how mixture of experts works; it's a very different thing. And I think building the foundation helps you demystify these misconceptions. A mixture of experts is essentially a module in the LLM, the feed-forward layer.
01;08;32;04 - 01;08;50;06
Sebastian Raschka
It's basically a weight matrix, like a classic multi-layer perceptron. And once you see that connection, you know you can't just swap anything in there; it has to be trained end to end. And the experts are also more implicit: you can't say, okay, this one is just doing math and this one is doing Spanish.
01;08;50;06 - 01;09;14;15
Sebastian Raschka
It's more like, maybe this expert is stronger at math because it gets more activated when there's a math problem, but it's not a discrete distinction; it's more fuzzy. And I think these are things that are really hard to understand without looking at the fundamental architecture building blocks. It's hard; even right now, I'm trying to explain it.
01;09;14;17 - 01;09;34;00
Sebastian Raschka
I can only explain it in a big-picture sense, but the big picture doesn't capture it. I think that's maybe my message: there are a lot of visualizations and big pictures out there that are not incorrect, but they're very fuzzy and vague, and they can lead to misunderstandings because they don't show the full picture.
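To make the "fuzzy experts" point concrete, here is a toy top-k router over stand-in expert functions. It illustrates the mechanism only: real MoE layers route each token inside every feed-forward block, and the experts and router are trained jointly, which is why you cannot bolt on an independently trained "math LLM" as an expert:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_feedforward(token, experts, router_scores, top_k=2):
    # The router picks the top-k experts for this token and mixes their
    # outputs by renormalized weights -- a soft, fuzzy blend, not a
    # discrete "this expert does math" assignment.
    ranked = sorted(range(len(experts)), key=lambda i: -router_scores[i])[:top_k]
    weights = softmax([router_scores[i] for i in ranked])
    return sum(w * experts[i](token) for w, i in zip(weights, ranked))

# Toy "experts": each stands in for a feed-forward weight matrix.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
output = moe_feedforward(3.0, experts, router_scores=[2.0, 1.0, -1.0])
print(output)  # a weighted mix of the two highest-scoring experts
```

Because the output is always a weighted mixture, no single expert ever "owns" a capability, which is exactly the fuzziness described above.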
01;09;34;02 - 01;09;56;19
Sebastian Raschka
They try to do the big picture. And really, you don't have to learn every nitty-gritty detail, like GPU optimizations; that's a detour if you don't actually train models yourself. You don't necessarily have to understand NF4, the 4-bit floating-point format, and how it's implemented. But with the foundation, you would understand, okay,
01;09;56;19 - 01;10;11;02
Sebastian Raschka
4-bit precision is less than 16-bit, so in that sense it's cheaper, but it's also an approximation, because you can't store as much information. It's about understanding and appreciating these types of nuances, I would say. Yeah.
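The precision trade-off can be seen in a few lines: quantizing to fewer bits means a coarser grid and a larger round-trip error. This is a generic uniform-quantization illustration, not the actual NF4 scheme, which uses non-uniform levels:

```python
def quantize_roundtrip(x, bits, max_abs=1.0):
    # Map x onto one of 2**bits uniform levels in [-max_abs, max_abs],
    # then back: fewer bits -> fewer levels -> bigger approximation error.
    levels = 2 ** bits - 1
    step = 2 * max_abs / levels
    return round((x + max_abs) / step) * step - max_abs

weight = 0.2718
for bits in (16, 8, 4):
    error = abs(quantize_roundtrip(weight, bits) - weight)
    print(f"{bits:2d}-bit error: {error:.6f}")
```

Running this shows the error growing by orders of magnitude as the bit width drops, which is the cheaper-but-approximate nuance being described.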
01;10;11;04 - 01;10;19;22
Geoff Nielson
Right. And there's just no substitute, to your point, for getting your hands a little bit dirty and seeing how it actually operates in an environment.
01;10;19;22 - 01;10;39;03
Sebastian Raschka
Yeah. Because the code doesn't lie; it's the truth. It's not hand-wavy, it's really concrete. And this way you don't have these types of, I wouldn't say knowledge gaps, because even if you build something from scratch, you don't build everything in all directions from scratch.
01;10;39;03 - 01;11;05;07
Sebastian Raschka
You focus on the core, but the core itself is the true core; it's not a vague concept anymore. So it's almost self-verifying. It's like with math equations: there are certain formulas you can just use and memorize, or you can derive them from first principles.
01;11;05;10 - 01;11;17;27
Sebastian Raschka
And if you derive a formula from first principles, you see... I mean, you don't have to do it all the time, you just do it once, but then you know, okay, this is actually rooted in something, these are the parts, and this is why it is the way it is, because you derived it that way.
01;11;17;27 - 01;11;30;04
Sebastian Raschka
It's not just a fantasy formula that someone came up with; there's a reasoning, a process behind it, basically. And I think that answers a lot of questions that people would have.
01;11;30;06 - 01;11;48;25
Geoff Nielson
Wow. And it feels like this is a space where there's so much misinformation, so much hype, and so much opportunity for people to misunderstand how it works, that there's a lot of benefit to being able to just actually go to the source and see it for yourself.
01;11;48;27 - 01;12;09;13
Sebastian Raschka
Yeah, yeah. One more thing that just came to mind when you were talking about fundamental misunderstandings of certain things: for example, there was this model, I think it's very cool research, called the hierarchical reasoning model. And there was also something called the Tiny Recursive Model. That came out last year, in 2025.
01;12;09;13 - 01;12;32;22
Sebastian Raschka
And I think they even won on the ARC benchmark; ARC is a logic-puzzle type of benchmark. It got a lot of hype, it was huge in the media and everywhere, basically. But I think it would have helped if people built something like that, or followed the paper, because it is a really, really cool model, but it's not an LLM.
01;12;32;22 - 01;12;51;18
Sebastian Raschka
They compared it to ChatGPT, but it's a tiny transformer model, and it only works on one particular task. You can train it to do, let's say, Sudoku, or you can train it on these ARC benchmark puzzles, but you can't say, translate my sentence from Spanish to English, because it can only do that one thing.
01;12;51;20 - 01;13;10;19
Sebastian Raschka
And I think here it would help if people understood a bit what the architecture is and how it's trained, because then you see these limitations. It doesn't mean it's a bad model; it's actually pretty impressive for its size. But it wouldn't be fair to compare it to an LLM, because an LLM can do a lot of things.
01;13;10;19 - 01;13;28;04
Sebastian Raschka
An LLM is a general model, whereas this is very, very specific. And these are things where, if you understand the fundamentals, you can kind of escape this type of hype. You can say, oh, this is clearly just a news headline trying to get attention with something. And there's a lot of that.
01;13;28;04 - 01;13;49;07
Sebastian Raschka
Like you said, there's a lot of this type of hype, where it sounds good and gets clicks and everything. But if it sounds too good to be true, it's often too good to be true. There's usually some catch, and it's easier to find that catch if you know the fundamentals, essentially.
01;13;50;00 - 01;14;01;08
Geoff Nielson
Well said. Sebastian, I wanted to say a big thank you for coming on the show today. It's been a really interesting and insightful conversation, and I was excited by how deep you could take us on some of these topics.
01;14;01;11 - 01;14;16;06
Sebastian Raschka
Yeah, thanks so much for inviting me. I had a lot of fun talking about all these technical things. You know, what I do for work is also what I do for a hobby. So yeah, thanks for inviting me here. It was fun.
01;14;16;09 - 01;14;41;20
Geoff Nielson
If you work in IT, Infotech Research Group is a name you need to know. No matter what your needs are, Infotech has you covered. AI strategy? Covered. Disaster recovery? Covered. Vendor negotiation? Covered. Infotech supports you with best-practice research and a team of analysts standing by, ready to help you tackle your toughest challenges. Check it out at the link below, and don't forget to like and subscribe!