Revealing Biology’s Hidden Patterns: Wistar’s Dr. Noam Auslander on the Power and Potential of Machine Learning
Noam Auslander, Ph.D., is an assistant professor in the Molecular and Cellular Oncogenesis Program at the Ellen and Ronald Caplan Cancer Center. She focuses on developing machine learning methods to understand the factors driving cancer development and to identify patterns that can improve cancer diagnosis and treatment.
“If you define your problem correctly, and you have enough data, you have the ability to learn something very complex that you cannot see with your eyes.”
How would you explain the difference between artificial intelligence and machine learning to somebody who is not a scientist?
Artificial intelligence is a more general term. Any software that imitates the human learning system is artificial intelligence. If you build a robot, and that robot does nothing but respond to your requests, that’s artificial intelligence. Machine learning is a field of study contained within artificial intelligence that involves creating sets of algorithms that can be used to learn a particular task, independent of receiving instructions from humans.
As your field has advanced, how much of that advancement has been a matter of increased computing power versus improved methods?
It’s both of those things combined. Increased computing power has allowed algorithms created 15 or 20 years ago to suddenly become very efficient, very good. These older neural networks had architectures that consumed too much computing power at the time, but once we had the GPUs, they started to work much better. And then based on that there has been an explosion of new research. The algorithms have evolved even more, making them much, much better.
What role do you see for machine-learning models in biomedical data analysis and research?
Our models can extract more information and identify more patterns in data than humans could on their own. Right now, people are building models that will do things like predict clinical outcomes, predict biological factors, and understand more about biology. I think that’s very promising, because if you define your problem correctly, and you have enough data, you have the ability to learn something very complex that you cannot see with your eyes. But still, it requires a person who understands the data, understands what they are doing, and understands how to use the model correctly.
How do you develop models that can be used to generate meaningful insights about real-world data?
We first need to understand the question or problem we’re trying to address, and we need to understand the data well enough to represent it correctly in the algorithm. This usually means talking with the clinicians or the biologists to understand what they’re trying to do. We also need to understand how we define a good performance. Is the goal to build a test that can be used in the lab or in the clinic? Or are we trying to learn something new in biology? All of these factors go into designing the model.
What makes some data sets better suited to a machine learning approach than others?
In general, the more data we have, the more amenable the problem is to these methods, especially if it’s good, clean data. But there are also scenarios where you can take a model that’s been trained for one thing and apply it to another task. A good example is imaging data, like radiology. You can take a pre-trained model for imaging that has already looked at a lot of data. And instead of training the entire architecture, you can train only a part of it to recognize the specific thing you are trying to recognize. You’re using technology that has already learned from other problems for which you had much more data, and this makes it much, much easier.
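For readers who want to see the idea of training only part of a model, here is a minimal sketch in plain Python. It is a toy illustration and not Dr. Auslander's method: the "pre-trained" layer is a hand-written stand-in with made-up weights (`FROZEN_WEIGHTS`), the six-sample dataset is invented, and all function names are hypothetical. Only the small output layer (the "head") is trained; the frozen layer is never updated.

```python
import math

# Assumption: these fixed numbers stand in for a layer that was already
# trained elsewhere on a much larger dataset. They are made up.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(x):
    """Frozen layer: turns a raw input into features; its weights never change."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=500, lr=0.5):
    """Train only the output layer (logistic regression) on the frozen features."""
    feats = [extract_features(x) for x in samples]
    head, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(h * fi for h, fi in zip(head, f)) + bias)
            err = p - y  # gradient of the log loss with respect to the score
            head = [h - lr * err * fi for h, fi in zip(head, f)]
            bias -= lr * err
    return head, bias

def predict(x, head, bias):
    f = extract_features(x)
    score = sum(h * fi for h, fi in zip(head, f)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

# Invented small task: label is 1 when the first value exceeds the second.
samples = [[2.0, 0.0], [1.5, 0.5], [1.0, -1.0], [0.0, 2.0], [0.5, 1.5], [-1.0, 1.0]]
labels = [1, 1, 1, 0, 0, 0]

head, bias = train_head(samples, labels)
predictions = [predict(x, head, bias) for x in samples]
print("predictions:", predictions)
print("labels:     ", labels)
```

Because the frozen layer already produces useful features, the head has only three numbers to learn, which is why a handful of examples can be enough in this toy setting.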
What’s the biggest frustration you encounter when developing and training models?
It’s almost always not enough data. That can lead to overfitting, which means the model stays too close to the training data set and can’t begin to generalize and make the predictions that allow it to work independently. Or sometimes the data is too complex, we can’t trust it, it’s not annotated correctly, or there are clinical variables that are recorded differently by different clinicians. Those kinds of things make it very difficult for us.
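Overfitting can be made concrete with a small sketch in plain Python. Everything here is an assumption made for illustration: the data are synthetic, four training labels are deliberately flipped to mimic annotation errors, and the two models (a nearest-neighbor "memorizer" and a single learned threshold) are toy stand-ins rather than anything used in Dr. Auslander's lab.

```python
def true_label(x):
    return 1 if x > 0.5 else 0

# 20 evenly spaced training points; four labels are flipped on purpose
# to imitate annotation mistakes (an assumption of this sketch).
FLIPPED = {2, 6, 13, 17}
train_x = [i / 20 + 0.025 for i in range(20)]
train_y = [1 - true_label(x) if i in FLIPPED else true_label(x)
           for i, x in enumerate(train_x)]

# 100 held-out points with correct labels.
test_x = [(j + 0.5) / 100 for j in range(100)]
test_y = [true_label(x) for x in test_x]

def memorizer(x):
    """1-nearest-neighbor: copies the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def fit_threshold(xs, ys):
    """Learn a single cut-off that makes the fewest training errors."""
    candidates = [k / 20 for k in range(21)]
    def errors(t):
        return sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
    return min(candidates, key=errors)

threshold = fit_threshold(train_x, train_y)

def simple_model(x):
    return 1 if x > threshold else 0

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

for name, model in [("memorizer", memorizer), ("threshold", simple_model)]:
    print(name,
          "train:", accuracy(model, train_x, train_y),
          "test:", accuracy(model, test_x, test_y))
```

The memorizer reproduces every training label, mistakes included, so it looks perfect on the training set and does worse on the held-out points. The one-parameter threshold cannot fit the flipped labels, which costs it training accuracy but lets it generalize better.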
How do you keep up with all the changes in your field?
The area of machine learning is moving very fast, so we have to keep track of a lot of literature and a lot of new technology. It’s impossible to follow everything that happened even in the last year — if you’re two to five years behind, that’s pretty good. At the same time, it’s a very interdisciplinary field, so for every project we do, we have to keep up with the research in at least two different disciplines. So, in a way, we are keeping up with at least twice as much as what normal researchers do.
What do you think is the most fun or interesting thing about what you do?
It’s always fun and interesting to work in an area that’s changing so fast — you can be the first to do a lot of things. If you think of an important problem or question, you can be the person to address it. And because there is so much data being generated, we can make real biological discoveries, find out completely new things, without relying on a lab. We can use data that’s already out there and find out something that’s completely new.
The type of work you do requires a lot of creativity and problem solving. When you feel stuck on a problem, how do you get your creativity flowing again to look at the problem in a new way?
When I get stuck on a problem, like part of an algorithm not working, I leave it for a while. I’m a runner, so sometimes I’ll go for a run, and while I’m running I’ll have better ideas come to me. I think it’s always good to stop looking at the problem. Leave it for a while, then come back and take a fresh look.
For more information, email comm-marketing@wistar.org