We recently hosted a fireside chat on safe and efficient AI with notable Stanford CS PhD researchers Dan Fu and Eric Mitchell. The conversation covered various aspects of AI technology, including the innovations that Dan and Eric have pioneered in their respective fields.
Dan is co-inventor of FlashAttention. He’s working on improving efficiency and increasing the context length in Large Language Models (LLMs). His experience in developing groundbreaking AI technologies allows him to provide profound insights into the future capabilities of LLMs. During the event, Dan discussed the implications of his work on enabling new generative AI use cases, as well as brand new techniques for efficient training.
Eric’s work focuses on AI safety and responsibility. He is the co-author of DetectGPT, a tool capable of differentiating between AI-generated and human-written text. DetectGPT has recently gained press attention for its approach to the growing concerns about AI-generated content. Eric shared his thoughts on the potential impact of DetectGPT and similar tools, and discussed the need for safe AI technologies as the field expands.
During the discussion, we touched on practical applications of generative AI and the outlook for open source vs. proprietary LLMs, as well as the prospect of AGI, its ethical ramifications, cybersecurity implications, and the broader societal effects of these emerging technologies.
For those who couldn’t attend in person, we are excited to recap the high points today (answers are edited and summarized for length):
Aparna: Can you tell us a bit about yourselves and your motivation for working in AI?
Dan: I focus on making foundation models faster to train and run, and I’m interested in increasing sequence length to allow for more context in the input data. The goal is to not be limited to a specific number of tokens, so you can feed in as much data and context as you’d like and use that to teach the model what you want it to do. I’ve been interested in machine learning for a long time and have been at Stanford for five years now. It’s a thrilling time to work in this field.
Eric: I’m a fourth-year PhD student at Stanford, and I got into AI because of my fascination with the subjective human experience. I’ve taken a winding road in AI, starting with neuroscience, 3D reconstruction, robotics, and computer vision before being drawn to the development of large language models. These large language models are really powerful engines, and we’re just starting to build our first cars that can drive pretty well. But we haven’t built the seatbelts, the antilock brakes, or the other safety and quality-of-life technologies around them. That’s what I’m interested in.
Aparna: What major breakthroughs have led to the recent emergence of powerful generative AI capabilities? And where do you think the barriers are to the current approach?
Dan: That’s a really great question. There has been a seismic shift in the way machine learning (ML) is done in the past three to four years. The old way was to break a problem into small parts, train models to solve one problem at a time, and then use those building blocks to assemble a system. With foundation models, we took the opposite approach. We trained a model to predict the next word in a given text, and these models can now do all sorts of things, like write code, answer questions, and even write some of my emails. It’s remarkable how such a simple objective, scaled up, produces such capable models. Advances in GPUs and training systems have also allowed us to scale up and achieve some incredible things.
I think one of the barriers is the technical challenge of providing sufficient context to the models, especially when dealing with personal information like emails. Another barrier is making these models more open and accessible, so that anyone can see what goes into them and how they were trained, the same way anybody can look at a Kubernetes stack and see exactly what’s happening under the hood, or open up the Linux kernel and figure out what is running underneath. Those are frontiers I hope we push on pretty quickly. This would enable better trust and understanding of the models.
Eric: I agree with Dan’s points. Additionally, a challenge we’re facing is the need to solve specific problems with more general models. However, we’ve found that large scale self-supervised training can be effective in tackling these specific problems. For example, the transformer architecture has been helpful in representing knowledge efficiently and improving upon it. In general, the ability to do large scale self-supervised learning on just a ton of data has been key to the recent progress.
Furthermore, we need a way to explain our intent to the model in a way that it can correctly interpret and follow it. This is where the human preference component comes in. We need to be able to specify our preferences to the model, so that it can draw upon its knowledge and skills in a way that is useful for us. This is a qualitative shift in how these models interact with society, and we are only scratching the surface.
Aparna: I’d like to go a little bit deeper technically. Dan, could you explain how your work with attention has made it possible to train these large generative AI models?
Dan: Sure, I can give a brief overview of how attention works at a high level. You have these language models, and when you give one a sentence, the attention mechanism compares every word in that sentence to every other word in that sentence. If you have a databases background, it’s kind of like a self-join, where you have a table that is your sentence, and you join it to itself. This is behind some of the amazing abilities we’ve seen in generative AI. However, the way attention used to be computed was quite inefficient: you materialize the comparison between every word and every other word, which puts a hard limit on the context of the models. In practice, the maximum context length was around 2,000 tokens, which is what could fit in memory on an A100 GPU.
If you look at databases and how they do joins, they don’t write down all the pairwise comparisons at once; they do it block by block. About a year ago, we developed an approach called FlashAttention, which reduces the memory footprint by doing the comparisons block by block. This enabled longer context lengths, allowing us to feed in a whole essay instead of just a page of text at a time. We’ve been really humbled by the very rapid adoption. It’s in PyTorch 2.0. GPT-4, for example, has a context length of 8K, with an option for 32K.
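To make the block-by-block idea concrete, here is a minimal PyTorch sketch. This is not the real FlashAttention, which is a fused CUDA kernel that also tiles the queries and avoids writing intermediates to GPU memory; it only illustrates the contrast Dan describes: naive attention materializes the full N x N score matrix, while a blockwise version streams over key/value blocks with an online softmax, so extra memory grows with the block size rather than with the sequence length squared.

```python
# Minimal sketch of the idea behind FlashAttention (not the real fused kernel):
# naive attention materializes the full N x N score matrix, while the blockwise
# version streams over key/value blocks with an online softmax.
import torch

def naive_attention(q, k, v):
    # q, k, v: (N, d). Materializes the full (N, N) score matrix.
    scores = q @ k.T / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def blockwise_attention(q, k, v, block=128):
    # Same result, computed one key/value block at a time (online softmax).
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale                       # (N, block) scores for this block
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale previous accumulators
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
assert torch.allclose(naive_attention(q, k, v), blockwise_attention(q, k, v), atol=1e-4)
```

The assert checks that the two versions agree; the payoff of the real kernel is that the big score matrix never has to exist in GPU memory at all.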
Aparna: That’s really interesting. So, with longer context lengths, what kinds of use cases could it enable?
Dan: The dream is to have a model that can take all the text ever written and use it as context. However, there’s still a fundamental limitation to attention, because even with a reduced memory footprint, you’re still comparing every word to every other word. If you think about how language works, that’s not really how we process it. I’m sure you can’t remember every word I’ve said in the past few minutes; I can’t even remember the words I was saying. That led us to ask: are there alternatives to attention that don’t fundamentally scale quadratically? We’ve been working on models called Hungry Hungry Hippos, and we have a new one called Hyena, where we try to make the context length a lot longer. These models may have the potential to go up to hundreds of thousands of words, or even millions. And if you can do that, it changes the paradigm of what you can do with these models.
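For readers who want the gist of how these models sidestep the quadratic cost: Hungry Hungry Hippos (H3) and Hyena are built around long convolutions over the whole sequence, which can be evaluated in O(N log N) with the FFT. The sketch below shows only that primitive, under simplifying assumptions; the full architectures add gating and learned implicit filters, which are omitted here.

```python
# Sketch of the subquadratic primitive behind models like H3/Hyena: a causal
# long convolution over the whole sequence, computed in O(N log N) via the FFT
# instead of attention's O(N^2) pairwise comparisons.
import torch

def causal_long_conv(x, kernel):
    # x: (N, d) input sequence; kernel: (N, d), one filter per channel,
    # as long as the sequence itself.
    n = x.shape[0]
    # Zero-pad to 2N so the circular FFT convolution becomes a linear (causal) one.
    fft_size = 2 * n
    x_f = torch.fft.rfft(x, n=fft_size, dim=0)
    k_f = torch.fft.rfft(kernel, n=fft_size, dim=0)
    y = torch.fft.irfft(x_f * k_f, n=fft_size, dim=0)
    return y[:n]  # keep only the causal part

x = torch.randn(4096, 64)
kernel = torch.randn(4096, 64) * 0.01
print(causal_long_conv(x, kernel).shape)  # torch.Size([4096, 64])
```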
Longer context lengths enable more complex tasks such as summarization, question answering, and machine translation. They also allow for more efficient training on large datasets by making better use of parallelism across GPUs. And if you have a context length of a million words, you could take your whole training set, feed it in as input, and have an embodied AI, say, a particular agent that behaves in a personalized way when responding to emails or talking to clients.
Longer context can also be particularly useful in modalities like images, where it means higher resolution. For example, in medical imaging, where we are looking for very small features, downsampling the image may cause loss of fine detail. In the case of self-driving cars, longer context means the ability to detect objects that are further away and at a higher resolution. Overall, longer context can help us unlock new capabilities and improve the accuracy of our models.
Aparna: How do you see the role of language models evolving in the future?
Dan: I think we’re just scratching the surface of what language models can do, and there are so many different ways that they can be applied. One of the things that’s really exciting to me is the potential for language models to help us better understand human language and communication. There’s so much nuance and complexity to how we use language, and I think language models can help us unpack some of that and get a better understanding of how we communicate with each other. And of course, there are also lots of practical applications for language models, like chatbots, customer service, and more.
Personally, I’m very excited to see where small models can go. We’re starting to see models that have been trained much longer than we used to train them, like 7 billion or 13 billion parameter models that, with some engineering, people have been able to run on a laptop. When you give people access to these models in a way that is not super expensive to run, you start to see crazy applications come out. I think it’s really just the beginning.
Eric: It has been an interesting kind of phase change just going from GPT-3 to GPT-4. I don’t know how much people have played with these models side by side, or whether people have seen Sebastien Bubeck’s somewhat infamous “First Contact” talk, where he goes through some interesting examples. One thing that’s weird about where the models are now is that the pace of progress used to be slower than the time it took to understand the capabilities of the technology, but recently that has felt inverted. I would be surprised to see this slow down in the near future, and I think it changes the dynamic in research.
Most machine learning research is quantitative, focused on building models, evaluating them on datasets, and getting higher scores. However, Sebastien’s talk is interesting because it evaluates models qualitatively with no numbers, which feels less rigorous but has more credibility due to Sebastien’s rigorous research background. The talk includes impressive examples, such as a model drawing a unicorn or writing 500 lines of code for a 3D game. One fascinating example is the model coaching people in an interpersonal conflict, providing direct and actionable advice that is useful in real-life situations. A big caveat is that current outputs from GPT-4 are much worse than the examples given in the talk. Sebastien’s implication or claim is that aligning the model to follow human intent better reduces its capabilities. This creates a tough conflict between economic incentives and what’s useful for society. It’s unclear what people will do when faced with this conflict.
Aparna: Do you think there will be ethical concerns that arise as language models become more sophisticated?
Eric: Yeah, I think there are also going to be questions around ownership and control of these models. Right now, a lot of the biggest language models are owned by big tech companies, and there’s a risk that they could become monopolies or be used in ways that are harmful to consumers. So we need to be thinking carefully about how we regulate and govern these models, and make sure that they’re being used in a responsible and ethical way.
One of the big challenges is going to be figuring out how to make language models more robust and reliable. Right now, these models are very good at generating plausible-sounding text, but they can still make mistakes and generate misleading or incorrect information. So I think there’s a lot of work to be done in terms of improving the accuracy and reliability of these models, and making sure that they’re not spreading misinformation or bias.
Aparna: Given your PhD research Eric, what are the main areas that warrant concern for AI safety and responsibility?
Eric: In summary, there are three categories of issues related to AI ethics. The first category includes concrete near-term problems that many in the AI ethics community are already working on, such as unreliable and biased models that may dilute collective knowledge. The second category is a middle-term economic alignment problem, where incentives in industry may not be aligned with making models that are safer or more useful for society. The third and longest-term category involves high-stakes decisions made by very capable models, which could be used by bad actors to do harm or may not align with human values and intentions. While some may dismiss the risks associated with these issues, they are worthy of serious consideration.
My research is focused on developing auxiliary technologies to complement the large models that are already being widely deployed. I am specifically working on model editing, pre-training models in safer ways, and developing detection systems for AI-generated text. The aim is to give practitioners and regulators more tools to use large language models safely. However, measuring the capabilities of AI systems is challenging, and my team is working on building a comprehensive public benchmark for detection systems to help better assess their performance.
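To give a flavor of how detection can work, here is a rough sketch of the perturbation-discrepancy criterion that DetectGPT is built on, not Eric’s implementation: model-generated text tends to sit near a local maximum of the scoring model’s log-probability, so small rewrites lower its score more than they do for human-written text. The `perturb` helper below is a crude word-dropping stand-in for the mask-filling model (T5) the paper actually uses.

```python
# Rough sketch of the perturbation-discrepancy idea behind DetectGPT:
# model-generated text tends to lie near a local maximum of the scoring
# model's log-likelihood, so small rewrites lower its log-probability more
# than they would for human-written text.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_likelihood(text: str) -> float:
    # Average per-token log-probability of `text` under the scoring model.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

def perturb(text: str, n: int = 10, drop_prob: float = 0.1) -> list[str]:
    # Crude stand-in for the paper's T5 mask-filling perturbations:
    # randomly drop a small fraction of words to get nearby rewrites.
    variants = []
    for _ in range(n):
        words = [w for w in text.split() if random.random() > drop_prob]
        variants.append(" ".join(words) if words else text)
    return variants

def perturbation_discrepancy(text: str) -> float:
    # Large positive values suggest the text came from the scoring model.
    perturbed = [log_likelihood(p) for p in perturb(text)]
    return log_likelihood(text) - sum(perturbed) / len(perturbed)

print(perturbation_discrepancy("The quick brown fox jumps over the lazy dog."))
```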
Aparna: I’m excited about the prospect of having evaluation standards and companies building tooling around them. Do you think there’ll be regulation?
Eric: In my opinion, we can learn from the financial crisis that auditors may not always work in practice, but a system to hold large AI systems to sensible standards would be very useful. Currently, there are questions about what capabilities we can expect from AI systems and what technologies we have to measure their capabilities. As a researcher, I believe that more work needs to be done to give regulators the tools they need to make rules about the use of AI systems. Right now, we have limited abilities to understand why an AI model made a certain prediction or how well it may perform in a given scenario. If regulators want to require certain things from AI model developers, they need to be able to answer these questions. However, currently, no one can answer these questions, so maybe the only way to ensure public safety is to prohibit the release of AI models until we can answer them.
Aparna: Stanford has been a strong contributor to open source and we’ve seen progress with open models like Alpaca, Dolly, and Red Pajama. What are the advantages and disadvantages of open sourcing large language models?
Dan: As an open source advocate and a researcher involved in the Red Pajama release, I believe making these large language models open source can help people better understand their capabilities and risks. The release of the 1 trillion token dataset allowed us to question what goes into these models and what happens if we change their training data. Open sourcing these models and datasets can help with understanding their inner workings and building on them. This is crucial for responsible use of these models.
The effort behind Red Pajama is to recreate powerful language models in an open manner by collecting pre-training data from the internet along with human interaction data. The goal is to release a completely open model that is auditable at every step of the process. Small models trained on a lot of text can become surprisingly powerful, as seen in the 7 billion parameter models that can fit on a laptop. The LLaMA model from Meta (Facebook) is not completely open, as it requires filling out a request form and comes with a restrictive license.
Eric: The open source topic is really interesting. I think many people have heard about the letter calling for a pause on AI research. Open source is great, and it’s why OpenAI relies on it a lot. However, a few weeks ago, a bug in an open source framework they were using caused some pretty shocking privacy violations for people who use ChatGPT, where you could see other people’s chat histories. In some sense, I think the cat is already out of the bag on the open source question. The pre-training phase is where a lot of the effort goes into these models, and we already have quite a few really large pre-trained models out there. So even if we paused right now and said no more big pre-trained models can be released, there’s already enough out there for anyone who is worried about it to worry a lot.
Aparna: So with these smaller models running on laptops and on mobile and edge devices, what new use cases will open up?
Dan: Sure, I think it’s amazing that our phones have become so powerful over the past decade. If I could have a language model running on my phone that functions as well as the GPT models we have today, and can assist me in a conversational way, that would be awesome.
Eric: I think it’s exciting and cool from a privacy perspective to have these models running locally. They can be really powerful mental health professionals for people, and I believe these models can be meaningful companions to people as well. Loneliness sucks, and the COVID years have made this very clear to a lot of people. These are the types of interactions that these models are best suited for. They understand what we’re saying, they can respond intelligently, and they can ask us questions that are meaningfully useful.
From this perspective, having them locally to do these types of things can be really powerful. Obviously, there’s a significant dual-use risk with these models, and we’ve tried to do some work to partially mitigate these things. But that’s just research right now. There are already very real and powerful models out there.
I think it’s great and exciting, and I’d be lying if I said I couldn’t foresee ways this could be problematic. But the cat is out of the bag, and I believe we will see some really cool and positive technologies come out of it.
Aparna: My final question is about Auto GPT, a new framework that uses GPT to coordinate and orchestrate a set of agents to achieve a given goal. This autonomous system builds upon the idea of using specialized models for specific tasks, but some even argue that this approach could lead towards AGI. Do you believe this technology is real and revolutionary?
Eric: Yes, Auto GPT is a real framework that uses large language models to critique themselves and improve their performance. This idea is powerful because it suggests that models can improve themselves without the need for constant human feedback. However, Auto GPT is not yet advanced enough to replace human jobs as it can still get stuck in loops and encounter situations where it doesn’t know what to do. It’s also not trustworthy enough to handle tasks that require a high level of complexity and verification. While the ideas behind Auto GPT are promising, it’s not a revolutionary technology in and of itself and doesn’t massively improve the capabilities of GPT.
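For context on what such a framework actually does, here is a minimal sketch of the propose, critique, and revise loop that Auto GPT popularized. This is not Auto GPT’s code; the `llm` argument is a placeholder for whatever chat-completion call you use. The failure mode Eric mentions shows up when the model never judges its own work complete and the loop simply runs until the step limit.

```python
# Minimal sketch (not AutoGPT itself) of a self-critique loop: the model
# proposes an answer toward a goal, critiques its own output, and revises,
# without a human in the loop at each step.
from typing import Callable

def self_critique_loop(goal: str, llm: Callable[[str], str], max_steps: int = 5) -> str:
    draft = llm(f"Goal: {goal}\nPropose a plan or answer.")
    for _ in range(max_steps):
        critique = llm(f"Goal: {goal}\nDraft:\n{draft}\n"
                       "Critique this draft. Reply DONE if it fully achieves the goal.")
        if "DONE" in critique:
            break  # the model judges its own work complete; loops happen when it never does
        draft = llm(f"Goal: {goal}\nDraft:\n{draft}\nCritique:\n{critique}\n"
                    "Rewrite the draft to address the critique.")
    return draft
```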
Dan: So, I was thinking about what you said earlier about the generative AI revolution and how it’s similar to the internet boom in 2000. But I see it more like electricity, it’s everywhere and we take it for granted. It’s enabled us to do things we couldn’t before, but it has also displaced some jobs. For example, we don’t have lamplighters or people who manually wash clothes anymore. However, just like how people in the early 20th century imagined a future where everything would be automated with electricity, we still have jobs for the moment. It’s hard to predict all the impacts AI will have, but it will certainly change the types of jobs people are hired for. I think it’ll become more integrated into our daily lives and introduce new challenges, just like how electrical engineering is a field today. Maybe we’ll see the emergence of foundation model engineering. That’s just my two cents on AGI – I’m not sure if it’ll be fully realized or just a tool to enhance AI capabilities.
Eric: I think the employment question is always brought up in discussions about AI, but it’s not clear that these models can replace anyone’s job right now or in the near future. They are good for augmenting people, but not at tasks they’re not already qualified for. It’s not a drop-in replacement for humans. I don’t think we’ll see mass unemployment, like with the electricity revolution. The internet analogy is similar, in that it was thought to make people more productive, but it turned out to be a distraction tool as well. Generative AI may not have a net positive impact on productivity in the near term, but it will certainly entertain us.