Sid Ghatak is a distinguished leader in AI and analytics with over three decades of experience spanning the public and private sectors. An alumnus of the University of Michigan and the University of Chicago, he has driven significant innovations, such as enhancing government efficiency at the GSA and increasing Time Warner’s revenue by 80% through data strategies. Ghatak founded Increase Alpha, where his deep learning algorithm …
QuantBeats Ep. 06
S. Ghatak: From Government Tech to Quant Alpha
Discussion Points:
- How decades of government tech and AI experience led to a unique alpha-seeking strategy.
- How curated datasets outperform brute-force modeling.
- How to avoid AI hallucinations and contaminated training data.
Listen to the Full Episode Here
Dan: Hello and welcome to QuantBeats. I’m Dan Hubscher, managing director and founder of Changing Market Strategies and your resource for all things quant. My co-host is current QuantPedia CEO and head of research and former 300 million euro quant portfolio manager, Radovan Vojtko. Rado, please say hello.
Rado: Hello, hello everybody.
Dan: Click the Q logo on your screen to hear our backgrounds in Episode 1, or check out the QuantBeats website. Now we welcome our guest from Increase Alpha, Sid Ghatak. Please just give us a hello, Sid.
Sid: Hello everyone. Thanks.
Dan: So before we learn how Sid went from government technology strategist to quant fund manager and tech firm leader, as a registered representative I need to show you the disclosures on the screen now.
And at the end of the episode, we’ll be giving you ways to contact me if you have questions for any of us about anything you heard here. Okay. Let’s dip our toes into quantitative finance to help investors, quants, and anyone just curious gain new insights into quant strategies and market dynamics, and glimpse the future of algorithmic trading.
So, Sid, can you please give us the one minute version of who you are, what you do, and any disclosures we need to know about? And then Rado will take the discussion deeper into your background.
Sid: Absolutely. Thanks so much and appreciate making time for me, inviting me onto your podcast today. So I’m Sid Ghatak.
I’m founder and chief executive officer of Increase Alpha. I’m also founder and chief executive officer of PivotalEdge Capital, which is a hedge fund. From a disclosure perspective, everything I speak about today is regarding Increase Alpha, and none of the comments I make today are regarding PivotalEdge Capital.
I also did work for the federal government. All the statements I make today are on behalf of my private enterprises and in no way, shape, or form reflect the opinions of the federal government or the work that I did while I was there. And finally, the last disclosure is that I also serve as the chief technical advisor for the National Artificial Intelligence Association.
I may from time to time put that hat on during this conversation as appropriate, given what I do. But the primary points that I’ll raise today are regarding our work at Increase Alpha.
Dan: Excellent. Okay. Thank you very much Sid. That was clear. Rado, I’ll hand over to you. Please take it away.
Rado: So thank you Sid. Thank you for joining us. I liked your intro and I’m especially interested to learn more about your history. So how did you get from working for government into the finance, into the quant world, et cetera, et cetera. Because I mean, it’s an unusual step. What is your history?
I mean, this is interesting. Yeah.
Sid: Yeah, I look at my history as a very unusual history in general, ’cause I don’t have any real hedge fund experience, meaning that I did not work for a fund or anything like that. I spent the bulk, actually almost the entirety, of my career working in technology. I started with a BA in Economics at the University of Michigan.
I moved to Chicago and started my career as a Series 7 & 63 stockbroker for Prudential Securities, and made 400 phone calls a day when I thought I was actually going to be analyzing stock reports every day. That was what I hoped I’d be doing, but I only got to do that at night. For 12 hours a day I made phone calls, and after a year I decided maybe this wasn’t what I wanted to do with my life.
I went to work for Bank of Montreal, and I also started my graduate studies at night at the University of Chicago in finance and strategy, and spent a number of years really immersed in corporate debt: understanding corporate filings, understanding debt covenants, calculating loan covenants and interest rates, all of those really interesting things. But the one really interesting part of my job was that I was the human data integrator between 13 mainframe systems. What that meant was we had 13 mainframe systems that really ran the back office of the bank, and every month all those systems had to be integrated, tied out, reconciled, journal entries prepared, an exceptionally manual process that I went through.
And we had a whole team that did that. I really didn’t enjoy that job whatsoever, but it had to be done so that I could do the other part. So I set about automating that whole piece: got the manuals from the mainframes, understood how to extract data, how to write applications and programs to move data back and forth.
And within a couple of months, I was able to take what took a whole month down to about two or three days of work, fully automated. That was my first glimpse into the power of technology, the importance of data, and the importance of very, very clean data. Cutting my teeth on that early in my career gave me a taste for that kind of work.
And then I quickly shifted from banking into technology. I started technology consulting in the late nineties and spent most of my career afterwards building budgeting, planning, and forecasting systems for Dow 30 and Fortune 500 companies, including JP Morgan Chase, Merck Pharmaceuticals, Northrop Grumman, and British Aerospace.
The data side of the business really fascinated me, so I spent a lot of time honing my expertise in data warehousing, business intelligence systems, and data delivery systems, and then kept moving up the food chain through advanced analytics. And then when artificial intelligence started to become a new thing to work on, I started working on that as well. All of my technology knowledge is purely self-taught; I don’t think I’ve taken a computer class since high school. Everything I know, I’ve learned by getting the manual, getting a data set, breaking some things, fixing them, and then seeing how they work. Along the way I was a professor at Villanova University and authored and taught their master’s program in Advanced Analytics and Business Intelligence. Shortly thereafter, I got an invitation from the White House to work with the General Services Administration within their Technology Transformation Services group.
They had these teams called Centers of Excellence. I was the director of the Data and Analytics Center of Excellence, and also worked with the Artificial Intelligence Center of Excellence. In that role, I authored the AI maturity model for the federal government, which recommended how artificial intelligence should be deployed across the public sector.
That framework is still out there and still in use today. I also contributed to several executive orders issued by the last two administrations on the ethical and safe use of artificial intelligence. My last role within the federal space was chief AI architect for the Federal Energy Regulatory Commission.
So that was my professional work. Along the way I thought it’d be an interesting technology experiment to see if I could predict stock prices. You know, just a very simple activity. I say that tongue-in-cheek, with heavy sarcasm, ’cause obviously that’s what we all do. But I took a crack at it.
So it started as a research project out of my work at Villanova. I really started trying to get to the bottom of: what are the various methods that people are using now? Where do quantitative methods work? Where do they fail? Why do they work sometimes and not in other cases?
What’s really the missing link? I started experimenting back in 2017 with various approaches and finally came up with what I thought was a really workable solution. I put that in production in 2021, and that has formed the basis of our product today. I can get into that at any point, but that’s the whole background of my bizarre evolution from broker to semi-quant type of person.
Rado: I like your background, especially that brokerage part, because I was doing something similar. I was just not selling stocks, I was selling hot water, you know, those water barrels you have in offices. I tried to do this for maybe two or three days, and I realized that it’s not for me, that data analysis suited me better than just trying to make the calls.
Sid: Yeah, I mean, you were smarter than I am. I wish I had left after two or three days, but I set up a lot of interviews, and I didn’t realize it at the time, but I actually interviewed with the firm from The Wolf of Wall Street. It was really quite funny; when I went to the interview, it was exactly like in the movie.
It was a bunch of guys exactly like that. Very, very aggressive, a lot of shiny suits. I was in that room realizing, I don’t really belong here; these are not my kind of people. But now I go to quant conferences and I’m like, oh, these are my people. These are the type of people. Exactly.
Rado: Okay. So as I understand it, you are, let’s say, a technology professional or data analyst who moved into the quant world. And everybody who is in the quant world, we are like cooks; we are cooking our models, somehow. But as every cook, we like to cook our meals differently.
Somebody likes to use more meat, somebody likes to use more vegetables, somebody likes to have a lot of spices, et cetera, et cetera. So in our quant world, there are multiple different sources from which we can come up with a great model. It can be data, it can be portfolio management, it can be portfolio structuring, et cetera.
So there are different ways you can come up with a great model. What is, in your opinion, the most important part of the quant process? I mean, how do you build a good model?
Sid: Right. So I think it always comes back to the data. I think the data is always the key piece of it, and I’m sure a lot of people will take a different approach; it’s self-serving in a way, because data is actually where I’m the most comfortable. The reason I think it’s data is ’cause that’s also the thing I’m most comfortable working with. I understand the challenges of organizing it, cleansing it, having it in the right structure, and why it needs to be normalized.
What data should be used, what data should not be used, understanding the context of the data within the actual problem you’re attempting to solve. Because I’m so familiar with it, know it so intimately and intuitively, and have been working on the data side of the problem for so long, that’s always my first reaction: okay, if I’m gonna cook something, I’m gonna find the coolest data that I can find, or the data that nobody else is using, and maybe put it through a very simple recipe, right?
The actual transformation of it may not be that complicated to get the result, but if I use very interesting data, I likely will get very interesting results.
“If I use very interesting data, I likely will get very interesting results”
Rado: And what are the data that should not be used? Because this is the part that sounds very interesting to me. So what are the data that we should not use?
Sid: Data that should not be used? Okay. So, again, having lived in that space for so long: there’s the whole concept of personally identifiable information and data about individuals. For me, that’s really a bright red line in terms of when you use data and when you don’t. You should not be using data about people without their explicit consent.
Rado: Okay. I understand. And
Sid: Informed consent, right? Then it gets into other situations, like intellectual property. You should not be using data that you don’t have the rights to use. If you haven’t properly paid for the use of that data, you’re getting economic value off of assets that you don’t necessarily have the rights to, or have not procured rights to.
So just ’cause the data exists on the internet does not necessarily mean that you could just grab it, take it, use it, and do what you’d like with it. The ethical and legal components of that are absolutely critical. And so when I started going down this process, having that mindset, I’m like, you know what?
I could go out and find amazing data and integrate all this stuff, but it might not be right. So for example, before I started this, I actually had a small venture where I was assisting federal election campaigns. From the Federal Election Commission site, you could literally download every single political donation that’s ever been made, going back to 1996.
So I spent a month doing that: I extracted all of this data. Every donation over $50 is recorded, publicized, and can be extracted. So I took all that. And then there are data brokers available that will marry your name and your information with everything else, and there are other data brokers that will give you other things. And pretty soon I can create a whole profile of you, without you even realizing that I did it, from fairly cheap and public sources. That would be exceptionally valuable, but ethically it would be wrong. And realizing that, I’m like, okay, that’s not a business I want to get into; I don’t want to be involved with that. It’s not the right way to go.
But yeah, the ethics of data is really the first question anyone should be asking about whether or not they should use it.
Rado: I think you are right. I mean, it is really important, because even if we start building the models without ethical data and we become successful in the future, this is something that can haunt us, because it’ll still be with us even in five years or in ten years.
Sid: Absolutely. And what I think is really interesting on that side is that most people don’t have a nefarious mind. I like to think that most people are good. So you’re trying to solve a problem, and then let’s say you decide to go with alternative data and you’re gonna use satellite imagery of parking lots. I think that’s a very popular one.
Rado: Yeah, it’s very popular.
Sid: Very popular. Okay. So I’ve got satellite imagery of parking lots, but because it’s high-res satellite imagery, now I actually have the license plates of the cars themselves, which could then be extracted. And let’s say I use that along with card swipes at buildings, as well as foot traffic, as well as video feed from inside of malls to understand where people are going. Well, now I inadvertently have a lot of information about people that in the wrong hands could be used incorrectly. So that’s something to really think about: have you accidentally created a profile of a person just by buying five or six different data sets from different people?
Rado: Right. And I mean, especially now that there are so many alternative data sets we can choose from.
Sid: It’s exponentially growing, yeah.
Rado: Exponentially growing, yeah, you’re right. Okay. And here is the question. As we have so many data sets, I mean, there are hundreds and hundreds of data sets, how do we orient ourselves in the data? Which of those data are useful, and which are not? And especially as we are moving into the age of AI, which are the data sets that make sense for us as quants to use? Or can we use all of them? Especially in connection to AI, because right now with AI we can use, I mean, anything.
Sid: So this is a really interesting question, and it’s a very different answer today than it was six months ago, or a year ago. A year ago, if you looked at the amount of data that’s on the internet and what was generated, who wrote it, where the source was, the vast majority of it was human-generated content: video, text, all of that stuff.
You know, I think very recently it flipped over: more than half the content that’s available on the internet now is machine-generated and AI-generated, and less of it is human-generated. So there are two problems. One is, can that AI-generated content be trusted? Is it trustworthy? Is it true? You know, the hallucination problem; we could dive into that for an hour by itself.
But how factual, how correct, how trustworthy is all that machine-generated data? And then secondly, there’s the second problem: model collapse. If you are using AI to ingest AI-generated data, it only takes two, three, four generations before the model collapses and you’re just getting nonsense as an output.
For me, the provenance of that data, to ensure that it was legitimate human-generated data, is absolutely critical. Otherwise, that is no longer in your control. You can’t control for some of those things if you’re just randomly extracting whatever information you’d like from the internet.
“The provenance of that data to ensure that it was legitimate human generated data is absolutely critical.”
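To make model collapse concrete, here is a minimal sketch in Python. The “model” here is just a Gaussian fitted to the previous generation’s samples, and each new generation trains only on the output of the last; all numbers are invented for illustration, and this is not anything from Increase Alpha:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for gen in range(1, 6):
    # Fit a model (here just a mean and std) to the previous generation's
    # output, then train the next generation ONLY on samples from that
    # fitted model -- AI ingesting AI-generated data.
    mu, sigma = data.mean(), data.std()
    data = rng.normal(mu, sigma, size=50)
    print(f"gen {gen}: fitted mean={mu:+.3f}, std={sigma:.3f}")
```

After a few generations the fitted spread drifts away from the true value of 1.0 and the tails of the original distribution vanish: each model forgets the rare events of the one before it.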
Dan: Yeah, it’s interesting, and it’s also a big word. So for anybody who doesn’t know, can you explain data provenance a little bit, please?
Sid: Data provenance. Yeah, so data provenance means: how can you track the creation of that data back to its original source? How do you know that it was actually created by the person you think created it?
That’s a very tricky subject. There’s no benchmark standard for that, because there are no fingerprints, no signatures, those kinds of things. It’s just not there. Take ChatGPT’s newest version of video generation: it’s very difficult to tell if some of those videos are real or not.
I think I’ve seen some watermarks recently so that you can tell, but I don’t even know if that’s mandatory, and you could very easily edit that out and then create something that is intentionally nefarious, or accidentally nefarious. Now, personally, if I’m gonna use data for something that actually has material consequences — human life, obviously, is much more important than what we do — but if we’re talking about managing money, moving money, or making financial decisions, I want to know that the data I’m using is as true and good as possible. Data provenance for me is absolutely critical. I need to be able to trace it all the way back to the actual place it was created.
And if I can’t do that, then personally, I’m not gonna use it. Now, I would recommend that most people follow the same road, but I know it’s tempting: it’s out there, it looks good, let me take a shortcut here and a shortcut there. You’re playing with fire, because it only needs to blow up once for your entire system to implode.
So I urge people to really think about that and proceed with caution in terms of the data sources you decide to use.
“Every time I’m making financial decisions, I wanna know that the data I'm using is as true and good as possible”
Rado: As I see it at the moment, and I’ve had this discussion in the past with other managers: right now, in the age of AI, data analysis is getting cheaper and cheaper, its price going to zero. What is really important is the data. Data are like the new gold, because if the analysis is cheap, then where is the alpha? The alpha is in the data. The better the data we have in our model, the better the model is, because in the end we are all using plus or minus the same models and the same stuff that’s out there. So what can differentiate us from the others is the data. Okay. And here is a question about AI. Everybody’s talking about AI, everybody’s talking about how to use AI. So: how to use the AI?
Sid: How do we use AI?
Rado: Yeah. How do we use the AI to analyze the data?
Sid: Yeah. So we use AI in a couple of different ways. Maybe before we even jump down that road, it’s important to define what AI is, right? Because I think in the public context, AI is now synonymous with LLMs, which is synonymous with ChatGPT, right? We’re thinking generative AI: LLMs, ChatGPT, Claude, Gemini, all of those generative tools that are moving towards that artificial general intelligence, that superintelligence, that Meta and all the other companies are attempting to build.
I don’t subscribe to that theory. I don’t believe that is what AI is. I think that is a subset of AI, an application of AI. But you have other methods of deep learning, which I can talk about, that we use. You have other types of machine learning, other types of analytics, other types of automation.
You know, there are many other flavors of artificial intelligence besides the generative AI that everyone is consumed with at the moment. When you first experience it, it’s very cool; it looks almost magical. When you ask it a question, it produces something that’s pretty coherent.
You think you’re talking to a human being, but it’s a marvelously engineered system that lulls someone into a false sense of security. Behind that, there are some serious problems, and that’s why we don’t use them for those problems. So for example, the hallucination problem is real. Again, as I said earlier: if I can’t trust it, then I’m not gonna use it to move money.
But it does some things that are really, really good, some things that are really, really powerful, and we do use it in some ways. So in terms of how we use LLMs, how we use generative AI: we use it for document summarization. One of the things that we found is that there’s just so much fantastic work published on SSRN and arXiv, so much great academic work, that for a human to read it, comprehend it, summarize it, and store that information is more than I think anybody could do. And it would cost me a lot of money to hire a bunch of PhDs to just read academic papers. I enjoy reading them; my kids think I’m a nut ’cause that’s what I read most of the time now. But in order to stay on top of what is happening in the space, I personally think that’s the best place to go.
“In terms of [how] we use LLMs, or generative AI, we use it for document summarizations”
Go straight to what’s been published in academic research: what’s the latest and greatest, the things that people are thinking about. So we partnered with a firm called Zanista, and what they’ve done for us is summarize sets of papers on topics that we’re interested in that relate to our business.
So every month they go out and find the best possible papers with the best possible research in our subject areas. They have their own proprietary application of LLMs that they use to extract, consolidate, and summarize, and then they give me back summaries of all these papers. That way I can stay on top of it.
I use NotebookLM to take their 300-page document and turn it into a 40-minute podcast that I can listen to in the car. Then it’s easier for me to identify: oh wait, that one paper, I do need to read that one paper. So of the 2,000 papers that were published that month, that gets me to the one or two or three that I really need to pay attention to, to see if they have any impact on our work.
So we use it for research. That’s how we’re using it. But we don’t use any off-the-shelf generative AI technologies in our core trading systems. That’s all proprietary, everything we’ve built in-house.
Dan: That’s good. I get the point that AI is bigger than ChatGPT and LLMs, but I’m gonna slow you down here for just a second and ask for another definition, because I have a bug up my unmentionable area about this hallucination problem. It’s a well-understood problem; we all know what it is. But I think the engineers that kind of deduced this did the world a disservice by naming it that. So can you tell us what’s really going on? Does AI really hallucinate, or is there something else happening?
Sid: You know, I think it’s one of those things where, if it hasn’t been named the word of the year, it should be, right? Everyone’s talking about these hallucinations. So, I don’t think it’s a hallucination in the sense that the model is on LSD and hallucinating.
It’s a statistical anomaly: the model has essentially made a spurious correlation. You’ve asked it a question, and because it’s a probabilistic engine, it returns things that have a high probability of being information that you yourself want to see, that you will find helpful. And because it’s based on correlations — the tighter and better the correlation, the more likely it thinks it is something you’ll want to see — that’s what it has done. It has found things that are statistically correlated but are not actually related, that have no causal relationship.
“On AI hallucination: It has found things that are statistically correlated but are not actually [causally] related, and presented [the relationship] very authoritatively as fact”
And then it has presented it back very authoritatively, as fact. That second part, I think, is really critical: it gives the answer to you in such a way that you’re like, oh, it must be right. But it doesn’t really know; it just presented it very confidently. And we are so accustomed to thinking that the machine is right that we’re hesitant to question it. For example, if you use your calculator and it says two plus two is four, it’s always two plus two is four, right? You can rely on the calculator; it’s never two plus two is six, or two plus two is 4.5.
It’s not guessing that it’s four; it knows definitively. And I think we are accustomed to thinking that machines will return the same answer every single time. The reality is, you can ask ChatGPT the same question six times and it may give you six different answers, because it’s just trying to give you something that, based on its engine, it thinks you will like.
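A toy illustration of that non-determinism: generative models sample from a probability distribution over possible next tokens rather than computing one fixed result. The token list and logits below are invented for the example; this is not how any particular model is actually parameterized:

```python
import numpy as np

rng = np.random.default_rng()

# Made-up next-token distribution over candidate answers to one question.
tokens = ["4", "four", "5", "4.5", "6"]
logits = np.array([3.0, 1.5, 0.5, 0.2, 0.1])

def sample(temperature: float) -> str:
    # Softmax with temperature: higher T spreads probability mass over
    # lower-ranked tokens, so repeated calls can disagree.
    p = np.exp(logits / temperature)
    p = p / p.sum()
    return tokens[rng.choice(len(tokens), p=p)]

print([sample(1.5) for _ in range(6)])  # six asks, possibly six answers
print(tokens[int(np.argmax(logits))])   # a calculator: take the argmax, always "4"
```

The calculator analogy is the zero-temperature case: take the argmax every time and the answer never changes.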
It is searching for that reinforcement learning signal that says: I returned this set of text and you’re gonna give it a thumbs up. Great job, ChatGPT, that was really helpful; now do this. It’s looking for that pat on the head that says you liked it, so it will give you more of that. Which then gets into the second problem these models have: they’re quite sycophantic.
And I think the last version of ChatGPT came out with that problem as well. It said too much of what you wanted to hear, and that becomes very dangerous, right? Because it tells you what you want to hear. So let’s play that out for just a moment. Say we’re using ChatGPT to analyze NVIDIA, you know?
We’re gonna use some fundamental analysis on NVIDIA. You know: ChatGPT, tell me what NVIDIA’s sales outlook looks like over the next four quarters. And ChatGPT goes out there, and there’s a ton of information on NVIDIA, so who knows where it pulled the information from.
We turned on the deep thinking mode, and it comes back with some sources. So now we know it got some stuff from Yahoo Finance, some from the Wall Street Journal, and some from the 10-Qs and 10-Ks, but we don’t really know what its thought process was to make the recommendation it’s about to make.
But it tells you that, yeah, NVIDIA’s gonna go up 47% over the next four quarters, and you like that answer. So you say: that’s great, go find me more companies like NVIDIA that will go up 40, 50% over the next quarter. It’s not thinking critically, right? So now it’s like, okay, now I know that Rado wants to see companies that will go up that way.
Let me go find him some more. You find some more that you like, but you stop asking hard questions, because you think: hey, this thing is so smart, it’s so fast, it must be right.
Rado: Yeah, it’s amazing. Absolutely. Yeah, it’s great.
Sid: It’s great. Perfect. And then off you go, and now all of a sudden you’ve invested client money in NVIDIA and six other companies that ChatGPT said to invest in, and most of ’em don’t go well.
And then you’ve lost 80% of your client’s cash, and the client says: well, why did you pick those? Are you really gonna say, well, ChatGPT told me to? How long will you have your job if that is your answer? That was something my dad used to always say. Even when I was little, he didn’t really trust computers.
But this was back in the eighties and nineties, and I don’t think it was that he didn’t trust them; he didn’t trust the reliance on them. He grew up with slide rules, and he was an engineer. He always wanted to make sure I understood this stuff. But if my brother or I or whoever told him, well, the computer told me to — he would never hit us, obviously, but it looked like he would. That was just not an acceptable answer in our house.
Rado: Yeah. So what is the answer? So I understand that when we do not understand what the LLMs are doing, it’s not a very good idea to use LLMs for things that are important. And I absolutely agree with you. It’s good to use LLMs for the things they are good at.
So that means, I don’t know, summarizing text, et cetera, et cetera, but not things that are really important. But if we have a lot of data, what can we do now? Use some other tools?
Sid: Yeah, I think so. I think one of the disservices that we’ve seen in the last few years is all the oxygen in the air and all the resources that we’ve spent trying to get these LLMs to work the way that we would like them to work, even though, in my opinion, I don’t think they ever will. I don’t think they will get past these hallucinations; some of these problems are just inherent to the technology.
So I think they’re phenomenal and will continue to do great things, but I don’t think they’re the end-all, be-all. Again, this is my opinion from Increase Alpha, and the opinion that I propose within the National AI Association: we need to start looking at all the other technologies that are out there.
So, for example, causal relationships. That’s really an area that has not yet been as fully invested in, but offers significant upside. Once you’ve figured out the correlation between two things, there are a lot of causal analysis methods and causal applications being created that can truly identify: is there an actual link between A and B that results in C? Mathematically, you can prove that it’s reliable over time. That allows us to have transparency into how these black-box models work. It allows us to have explainability: why did you make decision C? Oh, it’s because I knew A and B and all these other factors, and they played into it with this component. That’s why I made this recommendation; that’s why I thought it was good. You have to be able to go backwards, not just forward. And I think that’s an interesting way of looking at generative AI: you create a question, it creates text, it generates that, and then we sort of stop. What we need to be able to do is go backwards from the answer and say, okay, how did you come up with that answer? That is what explainable AI is, and there’s a lot of upside there, because it has yet to be fully built out and developed.
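To make the correlation-versus-causation distinction concrete, here is a small Python sketch on synthetic data (illustrative only, not any production method): two series look strongly linked until you adjust for the hidden factor that drives them both.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# A hidden confounder Z drives both A and B; A has NO direct effect on B.
z = rng.normal(size=n)
a = 0.8 * z + rng.normal(scale=0.6, size=n)
b = 0.8 * z + rng.normal(scale=0.6, size=n)

print(f"raw corr(A, B) = {np.corrcoef(a, b)[0, 1]:.2f}")  # ~0.64

# Adjust for Z: regress A and B on Z, then correlate the residuals.
resid_a = a - np.polyval(np.polyfit(z, a, 1), z)
resid_b = b - np.polyval(np.polyfit(z, b, 1), z)
print(f"corr(A, B | Z) = {np.corrcoef(resid_a, resid_b)[0, 1]:.2f}")  # ~0.00
```

A purely correlational model would happily trade A against B; the residual check is the simplest version of the “go backwards” test described above.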
Rado: So if I translate that into the quant world: yes, we can use AI, or we can use all these different tools, but it still comes back to one thing, and it’s that we need to understand fundamentally what the link is between the prediction we are getting and the data we are using.
So it’s not some black box. Even though it’s AI and it can simplify our work, we need to understand: our AI model is recommending this and this, but the reason for that is that these four or five or six different things are important for the company, and they are correct, and they are right, and the company is doing well because of those five things. It’s not a black box.
Sid: Exactly. One thing that I wanted to bring up, and I’ve been thinking about this since we talked last time: how could quants use LLMs to help in their day-to-day work? If they can’t use them to, you know, build models, if they can’t just embed them directly?
Like, what’s a good use for them? And I was thinking, you know, one of the things that are great about quants is that we think mathematically, we can think very creatively, but it’s very difficult to explain what we do to non-quants.
Rado: That’s nice.
Sid: Like, that’s why I enjoy going to quant conferences ’cause I don’t have to try to simplify things.
You know, I don’t need to find analogies all the time, right? But that gets very exhausting and frustrating, and you lose a lot of the nuance in what we do when you have to simplify things for non-technical folks. That’s actually a great use for an LLM. You’ve got a situation, you’ve got a strategy, you’ve got an idea; give that to an LLM and ask it to create an essay or a paragraph or an explanation for a non-technical person. That takes a lot of load off a quant, who probably doesn’t want to do it anyway and will put it off. That way you can get it done very quickly, and it will most likely be something that is usable and that whoever you’re sending it to will probably understand. The big caveat to that: be really careful that you’re not doing it on a public line and putting a proprietary model out in public space, because all of a sudden you’ve lost your IP. If you’re going to do that, make sure it’s an in-house piece of technology that’s not going anywhere. But from a time-savings perspective, it makes a lot of sense.
I have a hard time being creative when it comes to LinkedIn posts and headlines and that kind of thing, I’ll admit it. But that’s a great use for it. I’ll come up with an idea and then say: hey, give me 10 different ways to explain this and post it out to my LinkedIn community. It’ll come up with 10 things that are usually fairly good, and you can tell it: don’t be so aggressive, be a bit more conservative, and it’ll iterate that way. For folks who don’t really think in that way, it could be a real time saver. And you have to be very careful about not giving away information that you don’t want to.
Rado: Yeah. Because that information will be used against you.
Sid: Of course, of course, of course. Exactly.
Rado: Just a quick tip: I’m also using ChatGPT in the way that you described, and I also like to use it for creativity exercises. So: give me 10 ideas for predicting XYZ. And I know that from those 10 ideas, maybe nine of them will not be interesting, but one or two may be interesting.
It doesn’t mean that they are the best, or that the idea ChatGPT suggested is the good one, but it means it’ll get me interested in that particular answer, and I will start digging deeper and deeper and deeper. And when I do the research, I realize: yeah, that’s something that I didn’t know about, and maybe I can use this part, or I can use those data, et cetera, et cetera. So it’s also useful for creative purposes.
Sid: I think so. Again, it does a good job of being your critical-thinking partner, right? So, let’s say you’ve got an idea, and you say: give me five reasons why this won’t work.
Rado: Yeah.
Sid: Right? So if you start engaging with it in that way, I think it’s really helpful. It’s just really important to validate, right? The old trust, but verify.
I remember, I think it was version 3.5 or 4, when I needed to write an essay about something.
So I was like, oh, I could just ask ChatGPT to do it. So I said, here’s the topic, and I gave it a fairly complicated prompt, and it generated a phenomenal thing. And I said:
hey, go find five sources that prove the point, thinking that it would do that, right? And first it told me it didn’t have access to the internet; it didn’t at that time.
That’s fine. Just use what you have access to. It came back with five completely fabricated articles.
Rado: Yes
Sid: Going to your point earlier about hallucinations, each of those articles was very interesting. They sounded real, right? They sounded like something the New York Times, the Wall Street Journal, the Financial Times, or the Washington Post would write.
Each had a headline that sounded legitimate. The headline itself was actually not a complete fabrication; the thing it described actually did happen. The only problem was the article didn’t exist. It reflected the theme of the Wall Street Journal, but none of the articles existed.
I’m like, okay, this is dangerous territory. It went to the point of creating the headline and giving me a link to it, and then when I clicked the link, it went to, you know, a 404 page. It wasn’t even found.
Dan: This is exactly how I learned about the hallucination problem, because similarly to you, I asked an LLM to find some articles for me to support a point, and it did. It made up the headline, the author, the date, the URL, everything, and: file not found. And I asked the LLM flat out: was that real? It said yes. No, it isn’t; did you just make it up? Yes, I made it up. Okay, give me a real one. And it made up another one, and I asked it, did you make that one up? And it said yes. So that is how it goes. So we’re stepping with this discussion towards the policy discussion I wanted to get to with you, Sid.
But just before we do, I wanted to kind of tie a bow around the intersection of AI and data in your case as a quant, right? Because I know something about how you do this and you’re giving us a very interesting insight into the old aphorism. 80% of a data scientist’s job is janitorial. And I know I’ve said it on this podcast a zillion times, but you’re really showing us what the janitor does, right?
So way back in the beginning of the hour, you showed us how you’re taking the personally identifiable information out, not looking at that; and you’re taking out some of the AI-generated data, ’cause that’s gonna put a feedback loop in your AI model, and you don’t want that.
You’re using the data, which I think you’re saying is even more valuable than the quant algos themselves, to build predictive algorithms and custom factors and those kinds of things in the strategies. But once you’ve got rid of that extraneous data, isn’t it still true that for your machine learning models trying to predict stock prices, the more data the better?
Sid: That’s an interesting question, and I think it’s sort of the fundamental question: how much data is enough data? Where do you draw that line? Is more data always better? I think that’s not necessarily the case. There’s a fine balance, sort of like an efficient frontier for the optimal number of companies in a portfolio: you add data until you start seeing a degradation in your accuracy.
And that was the method that we used: we kept adding to it, and we had gone through the correlation exercise to narrow down the umpteen thousand features, the ideas that we had, to a set where, okay, these actually have some correlation between them.
Now, do they have causation? Alright, let’s go through that exercise of adding the causation. And just because they have causation, does that mean that they help each other, or do they contradict? Because you could have six pieces of data that all have a causal relationship, but because some are contradictory, they cancel each other out, so collectively they add no value. And that is a different mathematical exercise. What we do is borrow a McKinsey approach called MECE: Mutually Exclusive, Collectively Exhaustive. We look at the problem from a business perspective and say: what data do we have that will support and help us answer this question?
And all the data has to be mutually exclusive, meaning the sets don’t duplicate each other, but collectively exhaustive, meaning they cover all scenarios; only the data that fits those criteria is what we actually use. So we’re not attempting to boil a pot of noodles, throw it at the wall, see which ones stick, and then run with that because we think all of it could be helpful, right?
We have to selectively say: this particular strand of noodle, and this particular noodle, and this one. So there may only be a handful in that dish, but they’re the exact ones that will answer that question. To your point earlier about being a cook, that’s how we cook.
There may not be much on that plate, you know what I mean?
Every single thing on that plate is intentionally chosen to support everything else. It’s sort of like going to a very minimalist French restaurant: there are only, like, three ingredients, and it becomes the most amazing meal you’ve ever had, because those three ingredients were curated and carefully chosen to complement each other.
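One simple way to operationalize “add data until accuracy degrades” is greedy forward selection with a validation stop. The sketch below uses synthetic features and a plain linear model as a stand-in; Increase Alpha’s actual models are non-linear, and the MECE screen is a human-judgment step this toy doesn’t capture:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical setup: 20 candidate features, but only the first two
# actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=500)
X_tr, X_val, y_tr, y_val = X[:350], X[350:], y[:350], y[350:]

chosen, best = [], -np.inf
for _ in range(X.shape[1]):
    # Try adding each remaining feature and score on held-out data.
    scores = {}
    for j in set(range(X.shape[1])) - set(chosen):
        cols = chosen + [j]
        model = LinearRegression().fit(X_tr[:, cols], y_tr)
        scores[j] = r2_score(y_val, model.predict(X_val[:, cols]))
    j_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best:  # validation accuracy degraded: stop adding data
        break
    chosen.append(j_best)
    best = s_best

print(chosen, round(best, 3))  # typically [0, 1] and an R^2 near 0.55
```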
Rado: Yeah. Yeah. And it’s not 50 because then it’s just a mess.
Sid: Exactly right. Exactly. Yeah.
Rado: I like this, because this is probably one of the most important things that I’ve heard in the last few weeks. Really, you are right: there can be more data in your model than you need. So by actually removing data, you can increase the quality of the model itself, because it can be too much. Yeah.
Sid: And because it’s impossible to really understand all the variables, you really have to test it over time. So one of the things that we did, which is very different from what I think everybody else does, is a four-year pure out-of-sample forward test. We built our model and set it in motion in 2021.
And since 2021, all we’ve done is retrain the model and extract the companies that are no longer listed or that we don’t have public data for. So we eliminated every single bias we could possibly think of. There’s no look-ahead bias, because we’re doing it point in time.
There’s no survivorship bias, because we picked the list as it existed at that time. So everything is point in time, and now we’ve got four years of data that we can really sift through to see what worked and what didn’t, and then understand patterns from that. From that basis, we now have the ability to apply it to almost any asset class in the world.
This prediction engine that we have, this methodology, can really be used not just on US equities but on derivatives, options, futures, FX, crypto, overseas markets, anything that is time-based data. Our core engine is simple enough that we could use it there, and then the question just becomes: how do we find those MECE data points to feed it, and run with it?
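A minimal sketch of the point-in-time discipline described above, on synthetic data: every training set is restricted to information available strictly before the test date (no look-ahead bias) and to names actually listed at each date (no survivorship bias). The one-factor linear predictor is just a placeholder, not the actual engine:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical point-in-time panel: one row per (quarter, ticker), with a
# 'listed' flag recording whether the name actually traded on that date.
dates = pd.date_range("2021-01-01", periods=16, freq="QS")
n_names = 100
panel = pd.DataFrame({
    "date": np.repeat(dates, n_names),
    "ticker": np.tile(np.arange(n_names), len(dates)),
    "feature": rng.normal(size=len(dates) * n_names),
    "listed": rng.random(len(dates) * n_names) > 0.05,
})
panel["fwd_return"] = 0.1 * panel["feature"] + rng.normal(scale=1.0, size=len(panel))

ics = []
for t in dates[4:]:
    # Train only on data strictly BEFORE t, using only names listed at
    # each historical date; test on the names listed at t itself.
    train = panel[(panel["date"] < t) & panel["listed"]]
    test = panel[(panel["date"] == t) & panel["listed"]]
    slope, intercept = np.polyfit(train["feature"], train["fwd_return"], 1)
    pred = slope * test["feature"] + intercept
    ics.append(np.corrcoef(pred, test["fwd_return"])[0, 1])

print(f"mean out-of-sample IC: {np.mean(ics):.3f}")
```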
Dan: Wow. Okay. So contrary to the trend that we’re seeing of training LLMs and other kinds of models on the entire world of the internet, you’re trying to ask: what is the smallest data set possible that’s gonna give you the right answer?
Sid: Exactly right. What is the least amount of data that we can use, on the smallest engine possible, to answer the most specific question? That is our framework. But there’s more human brainpower that goes into that upfront work: doing that analysis, really using human judgment to figure out what those data points are, and being intentional about the selection.
And then, I hate to use another analogy, but we have lots of models to choose from, right? I could use the fuel in a diesel engine, in an internal combustion engine, in a jet engine, in different types of engines. So what is the right model to use with that fuel, and how do we get the best result?
That allows us to pair the data with the model to get that result. One answer you may need to generate very, very quickly, so all of that needs to be optimized; for some other ones you’ve got a bit more time, so you can be a bit more lagged. So it really comes down to understanding the problem as well.
So we’re trying to apply all the tools that we have at our disposal to answer those questions. One thing that I know is on everyone’s mind is the cost, right? Because we all know about the trillion-dollar data centers that are being built and how much money’s being thrown at them.
We already published this in our papers; it’s already public. Our annual AWS spend to run everything that we run is about $35,000 a year. That’s all of it, including our compute, our daily inference, our retraining every year, our storage: all of it is $35,000 a year. The model is small enough to be run on a cell phone, because we’ve optimized it, and we are completely free of needing any GPUs.
So we don’t have any NVIDIA reliance or anything like that. Our cost base is so small and our technology needs are so small that, I hate to say we’re immune from anything that’s happening in the market, but we’ve intentionally designed it so we’re not subject to market movements in power consumption and energy and such.
Rado: Yeah, I mean, when I’m listening to you, I’m thinking about whether NVIDIA is such a great company, if all of those chips will not be used in the future.
Sid: Well, I think right now it’s a phenomenal trade, right? So it just depends which direction, you know, and your timing, right?
Rado: Yes. Yes. In which direction we’ll do it. Right.
Sid: I mean, it’s gonna make you money one way or the other, so. Yeah.
Rado: So what I will take from this conversation is that last thing about the data sets. There is definitely an optimal number of data sets to use in a model, and as in a recipe in cooking, there can be too much data in a model.
It’s an art. But for me, this is something that I will use a lot in my conversations in the future.
Sid: You know, thinking about it, the food analogy, the cooking analogy, is such a great analogy, because everyone has to eat. If you’ve ever had a perfectly cooked steak, it only has one ingredient.
It may have a little bit of salt and pepper, but maybe that’s it; that’s literally all it needs. Or you could put it into a pot roast and make a great stew, but then you’ve got lots and lots of ingredients. Or you could do something else. These are all different things. Now my mouth is watering, ’cause all I’ve had is coffee and I haven’t eaten today.
Dan: You guys are making me so hungry. I didn’t have breakfast this morning; I’m starving now. I think everyone listening is hungry at this point. No, it’s not helping. Okay, okay. Let’s shift gears to film noir. You know what film noir is, Sid? Yes. Do you know film noir, Rado? Yeah. Okay, so Sid, shifting gears: where were you on the night of Monday, September 15th, 2025?
Sid: So, September 15th, 2025. That was my congressional testimony, I believe. Was it not? Yes. I thought you were gonna ask where I was this past Monday, when AWS went down, because I could talk about that in conjunction with September 15th, if you like.
Dan: So if you want to tell us about that and what was going on there, and then wrap it into, you know, Monday’s AWS outage, that’s great.
Sid: Sure. After I left the government, I joined the National Artificial Intelligence Association as their Chief Technical Advisor. In that capacity, we work with legislators and other government officials to formulate AI policy that allows the technology to grow into what it can be without overly tight restriction. There’s a right set of guardrails to place on the technology, but if you make it overly restrictive, the technology cannot grow into what it can be. So we’re trying to work with the federal government to figure out where the right place to put those guardrails is.
And to really foster a culture of entrepreneurship, of innovation, of creativity in the space, because we really believe that this is the next evolution of technology. It has the potential to transform society just like the Industrial Revolution. I talked about that in the testimony I gave to Congress.
The analogy that I used then was the railroads and the steam engine. The funny thing is that until the steam engine was developed, no human had ever moved faster than a horse; that was just the fastest anyone had ever gone. So AI in its current form is now allowing people to realize: oh wow, what happens if you can move this fast? What is the art of the possible with this technology? But the steam engine was not the last innovation there. It led to multiple other methods of transport; steam engines led to other types of train engines that moved faster and were safer.
They could do more things, and obviously that led to the car, the plane, and all these other things. So I think we’re at the very early stages of this revolution, and if we do this correctly, it’ll really be good for humanity as a whole. However, and here’s the big but: on Monday I was directly affected by the AWS outage, right?
I was traveling, and it screwed up my travel plans, and I had to adjust accordingly. But when we think about AI and systemic risk to the financial markets: let’s just say XYZ technology becomes the default standard for trading systems, and it’s in one of these massive data centers that are being built, and then the power goes out, or there’s an outage, or it’s something as simple as what happened on Monday with a setting in a database.
You know, then you’ve got systemic risk that affects the entire financial system, with serious consequences. So I think what’s really key is to understand redundancy and failover and backups, and not having single points of failure in mission-critical systems. It was a grim reminder of how fragile our entire ecosystem actually is, when something like that can go down.
Now, keep in mind, I would even tell my own investors and my own clients: most of our stuff is on AWS East. What’s the likelihood that AWS East goes down? How often has that ever occurred? So, of course, now that AWS East went down, we internally are revisiting our own failovers and such.
Fortunately, our business was not affected by it. However, it did make us look internally at our own contingency plans and our own business continuity plans.
Rado: I think we can speak for the next two hours about AI and data, et cetera.
We have to get Sid back. Exactly.
Sid: Yeah, I’d be happy to come back, ’cause it’s nice to talk at this level. One of the things I would say, going back to data, and maybe to close it off: that’s been my approach, and after 30 years of working with large companies, to your point, Dan, about 80% of data science being janitorial, it got very frustrating. It was difficult to keep the level of excitement with another client: oh great, I get to spend 80% of my time fixing your mistakes and cleaning up your data. You can only do that so many times before you lose the enthusiasm for it. So when I had the opportunity to build my own firm, I knew from the very beginning that data was not gonna be a problem; it was gonna be a strength of ours.
It was gonna be a foundational pillar of what we do. I was never gonna be in a situation where our data was dirty and needed to be modified and cleansed and all of those things. It was gonna be a core thing we do extremely well, and it was gonna be the reason we were successful going forward.
What I find really interesting is that when you look at the largest companies in the world at the turn of the century, it’s IBM and General Electric, right? Those were the two biggest companies on the planet; now they barely register. These are two companies that had a massive amount of data about their products, they were leading technology companies, and they had a massive amount of data about their customers. But because they weren’t able to harness that correctly and monetize it effectively, they’re not the leaders in their respective industries, or in the US as a whole. When you look at the companies that are the most valuable now, they’re all data-centric companies.
They use data as their fuel to exist. Meta, Google, Apple: all of these companies are completely centered on data. Now, they have phenomenal products, phenomenal engineers, phenomenal tech. But if they didn’t have the data that powered all of that, would they be as valuable as they actually are? I would argue no. So we are definitely seeing that data does trump all. But this focus on data is really not taught in school. When you look at programs for data science or machine learning, or PhDs in computer science, there are maybe one or two classes on data quality, on how to get data into the right shape to power your models.
It’s all about the math, how to build these models; very little is about how to get the data itself. They’re often given a clean data set and told to go off and build a model. That doesn’t happen. I would love for that to happen.
Rado: That’s not really a real situation, I mean.
Sid: No, exactly. And I think, you know, a lot of these folks are done a disservice, right? They get their PhD from a top school thinking that they’re gonna go off and do that. It’s like I found out cruelly in my early twenties: you think you’re gonna go off and build all these cool models, but really what you do is spend all this time cleaning up the data. You know, I thought I was gonna go do financial analysis, and I just wound up being a telemarketer.
It’s unfortunate; we’re doing our best and brightest a disservice by not preparing them correctly for the world they will actually live in. And again, I know we’re probably over time, but I think the underlying reason AI initiatives are failing across the landscape is ’cause companies don’t have a handle on their data, right?
The data isn’t ready to be used in the way it needs to be used for these products to really live up to their potential. So maybe when companies fix that, they’ll be ready. But I’m not holding my breath, because I’ve been holding my breath for 30 years and it still hasn’t happened.
Dan: Okay. So we’ve given all kinds of love to the data scientists, and a little bit to the telemarketers, but we want to give a parting gift to the quant managers and quant analysts out there. Just one quick insight in the last minute remaining: the 10 prescriptive recommendations for a quant model that were published by a famous person. You had something to say about one of those. Which one was it?
Sid: Absolutely. So I think the linear versus non-linear component of our work. When we look at what core quant models are, we’re attempting to use linear math to model a non-linear world. The market is non-linear, and so there’s only so much that linear regression, which at its core is what most quant models are, is going to be able to capture. And there’s obviously the question of whether there’s any more alpha to be found. One could argue, and I’m not saying this as fact, but one could argue that maybe all the alpha that linear models could have found has been found.
In order to find additional alpha, in order to increase alpha as we have, you have to use non-linear approaches. That’s one of the reasons we actually don’t use linear regression in our models. Our models are deep learning neural networks that attempt to model this non-linear world as best we can with the data that we have. I think we’re still at the early stages of that as well, but what we found is that it has massive potential, because with our particular product, even though we’re using completely public data, data that’s been out there for everyone else to use, by using non-linear models we’ve been able to extract pure, uncorrelated, unique alpha. And that has been very attractive to our clients; that’s where we’ve gotten a lot of traction and a lot of attention since our launch, because we’ve been able to find this unique and uncorrelated alpha.
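A toy version of the linear-versus-non-linear point: when the target depends on an interaction and a threshold, a linear regression captures very little while even a small neural network does far better. Synthetic data, purely illustrative, not Increase Alpha’s architecture:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Non-linear ground truth: an interaction term plus a threshold effect.
X = rng.normal(size=(2000, 2))
y = (np.tanh(X[:, 0] * X[:, 1])
     + 0.5 * (X[:, 0] > 1.0)
     + rng.normal(scale=0.1, size=2000))

X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

lin = LinearRegression().fit(X_tr, y_tr)
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                   random_state=0).fit(X_tr, y_tr)

print(f"linear R^2: {lin.score(X_te, y_te):.3f}")  # typically near zero
print(f"MLP R^2:    {net.score(X_te, y_te):.3f}")  # typically far higher
```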
Dan: Good. Okay. And you had to put the noodles and the food back in there again. So now I’m gonna eat breakfast.
Sid: I may have some ramen or some pho for lunch. Yeah, that’s right, you know.
Rado: I’m salivating now. Okay. Then the next time we’ll start the cooking podcast.
Sid: Cooking with Quants on HGTV. If Sheldon can do Fun with Flags, we can do Cooking with Quants, you know.
Dan: Okay, good. I think we need to go back to the studio at this point.
So thanks, Rado, and thanks, Sid, for edutaining us today. It was great to meet Sid from Increase Alpha and Rado from QuantPedia. To join us as a guest, or if you’re interested in otherwise supporting the channel, contact me, Dan Hubscher, at the details shown here on the screen. And just note that any questions regarding investment offerings will be deferred to an email follow-up for compliance purposes, as these videos are not intended to be about investment offerings.
But, if you like this video, please do come back for our next interview with a quant manager guest. You might also wanna watch the other videos on the QuantBeats channel, so don’t forget to like, comment and subscribe to QuantBeats on YouTube. Again, you can get there via the Q logo in the corner of your screen.
Or, finally, you can visit our websites and YouTube channels for more, shown here. So Sid, Rado, everyone: thanks for joining, and as always, thanks for the time. Have a quant day. Take care, guys.
Rado: Thanks.
Sid: Thanks. Thanks everybody. Thanks much. Bye.


