Bridging the Gap Between Physical Numerical Simulations and Machine Learning: Introducing The Well

Community Article Published December 2, 2024

Useful links: GitHub / NeurIPS 2024 Paper / Website / ArXiv.

As a machine learning researcher, it is nowadays hard to ignore the sheer scale of datasets and models required to tackle complex problems. As an example, FineWeb, a highly curated dataset used in the training of billion-parameters LLMs, weighs in at 44TB. This distilled version of the internet is the result of years of intense research by the NLP community, supported by open-source efforts like Hugging Face’s Datasets library.

In contrast, scientific data presents unique challenges — it is harder to gather, filter, and interpret. While anyone can assess the coherence of generated text, evaluating the plausibility of, say, a protein sequence or a turbulent astrophysical process often requires deep domain expertise. This complexity necessitates close collaboration between subject-matter experts and machine learning researchers, adding layers of difficulty to an already intricate process.

Despite these difficulties, datasets for modeling physical dynamics are expanding. While fluid-dynamics simulations have gained traction as common benchmarks, they’ve been addressing only a limited range of physics or offering a sparse number of high-resolution snapshots. Additionally, the size and complexity of individual samples often constrain their broader utility. These limitations underscore the need for new datasets tailored and scaled for modern machine learning use. This led us to create The Well, a unified collection of diverse physical processes, readily usable to train neural network surrogates at scale.

What is The Well?

The Well comprises 16 datasets totaling over 15TB, with individual sizes ranging from 6.9GB to 5.1TB. All data is provided on uniform spatial grids sampled at constant time intervals and formatted in the HDF5 format for simplicity, accessibility, and compatibility with scientific workflows. To facilitate usage, we also provide a PyTorch interface for a seamless integration with machine learning models.

We collaborated closely with domain experts to generate and curate datasets representing complex physical phenomena and standardized them into a unified format. This approach ensures that datasets are self-sufficient, easily shareable, and ready for direct application to machine learning models, eliminating preprocessing overhead. By prioritizing usability, we allow researchers to focus on the true challenge: predicting the physics.

Opportunities for the Numerical Simulation Community

Through conversations with experts in numerical simulations, we identified a significant communication gap between their field and the machine learning community. This disconnect, inflated by the hype surrounding AI, often leads to skepticism about what machine learning can truly accomplish. With The Well, we aim to make a first step toward bridging this gap, by offering a platform that encourages collaboration while providing challenging datasets that represent advanced and, sometimes, poorly understood physical processes.

Some of these simulations are among the most advanced in the world for their respective phenomena, requiring millions of CPU hours and highlighting the need for efficient surrogate models. Predicting these processes is analogous to video forecasting, but introduces unique challenges: on one end, the evolution of data relies on well-defined but complex physical laws, while on the other hand, the data itself can be more complicated to handle (e.g., featuring many channels, or maintaining high precision during training and inference).

Machine learning, in our view, should be seen as complementary to numerical simulations, and not as a replacement. For example, it can help scientific research by providing quick approximations of physical behaviors, such as estimating the growth rate of a phenomenon or predicting its stationary state. This allows researchers to allocate computational resources more effectively and accelerate scientific discovery.

Opportunities for the Machine Learning Community

Beyond advancing scientific research, The Well offers unique challenges for the machine learning community. Unlike natural images and videos, the spatial frequencies and dynamics of our simulations differ significantly, providing a new set of benchmarks for innovating in computer vision. Additionally, our datasets introduce problems rarely encountered in standard vision tasks, such as:

Generalization to unseen physics: Can a model trained on a subset of the collection generalize effectively to unseen physics?
Knowledge transfer across resolutions: Can a model trained at one resolution generalize effectively to higher resolutions or dimensionality for the same physics?
Temporal variation: How can models handle data sampled at varying time intervals while maintaining predictive accuracy?
Physical parameter generalization: Can a model trained on a subset of physical parameters predict simulations of unseen parameter values?

We hope these tasks will push the boundaries of what modern ML models can achieve, resulting in innovation in computer vision and generative modeling. You can learn more about these challenges in Appendix D of the paper.

Tap into the Well

With The Well, we look forward to sparking a dialogue between two communities that rarely interact. Numerical simulations pose significant challenges for ML researchers, while ML models offer the potential to accelerate simulation-based research. Imagine predicting the evolution of a neutron star in a few seconds using a pre-trained ML model or using insights from simulation data to drive breakthroughs in generative modeling. By connecting these fields, we aim to accelerate discovery and innovation in both domains. We can't wait to see what you will achieve with the Well!

Upvote