Challenging the "bigger is better" paradigm with data pruning and strategic sampling approaches
Imagine trying to find a handful of exquisite seashells among millions scattered across an endless beach. That's the challenge facing materials scientists today, as massive computational databases containing millions of material properties threaten to overwhelm researchers with sheer volume rather than empower them with genuine insights. For years, the prevailing mantra has been "bigger is better": more data points must inevitably lead to better machine learning models and faster discovery of revolutionary materials [1].
But what if we've been thinking about this all wrong? What if, hidden within these enormous datasets, lies profound redundancy that actually slows progress rather than accelerating it? This is the compelling question that researchers like Jason Hattrick-Simpers and his colleagues have been exploring, with findings that fundamentally challenge how we approach materials informatics.
Their work reveals that sometimes, less truly is more—especially when it comes to training the next generation of materials discovery algorithms.
The materials science community has invested tremendous effort in compiling massive databases through high-throughput calculations and experiments. Projects like the Materials Project, JARVIS, and OQMD now contain millions of entries: formation energies, band gaps, bulk moduli, and other critical properties for known and hypothetical materials [1]. The recently released Open Catalyst datasets pushed this further, with over 260 million DFT data points for catalyst modeling [1].
Training models on these datasets requires immense computational resources: one recent project consumed over 16,000 GPU days, equivalent to nearly 44 years of continuous computation on a single GPU [1]. Such staggering requirements put advanced modeling out of reach for most researchers and institutions [1].
Surprisingly, extensive analysis has revealed a significant degree of redundancy across multiple large datasets for various material properties [1].
| Database Name | Content Focus | Scale | Applications |
|---|---|---|---|
| Materials Project (MP) | Energy and band gap data for millions of crystal structures | Millions of entries | Materials design, property prediction |
| JARVIS | Diverse material properties via DFT | Millions of entries | Materials discovery, AI development |
| OQMD | DFT-calculated thermodynamic and structural properties | Millions of entries | Materials screening, design |
| Open Catalyst | DFT data for adsorbate-catalyst surface systems | Over 260 million data points | Catalyst development, reaction simulation |
Groundbreaking research published in Nature Communications has delivered a startling conclusion: up to 95% of the data in some materials datasets can be safely removed from machine learning training with minimal impact on prediction performance for typical, in-distribution cases [1]. This revelation forces us to reconsider not just how we build machine learning models, but how we collect and curate materials data in the first place.
The key insight is that our current datasets often over-represent certain material types while leaving others underexplored. Think of it like a food critic who only samples pizza from 100 different chain restaurants: they'll become excellent at judging pizza, but know nothing about sushi, curry, or salad. Similarly, our models become experts on over-represented materials but fail dramatically when encountering truly novel compositions or structures [1].
The redundancy problem isn't just about computational efficiency; it has serious implications for scientific robustness. Models trained on redundant datasets show severe performance degradation when faced with out-of-distribution samples, precisely the novel materials that discovery campaigns aim to identify [1].
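To make the out-of-distribution idea concrete, here is a minimal sketch of one common way to build such a test split: hold out every compound containing a chosen element, so the model is judged on chemistry it has never seen. The dataframe layout, the `formula` column, and the `ood_split` helper are illustrative assumptions rather than the published study's protocol.

```python
# Minimal sketch (assumed setup): build an out-of-distribution test split by
# holding out every compound containing a chosen element. The column name
# "formula" is an illustrative assumption, not from the cited study.
import pandas as pd

def ood_split(df: pd.DataFrame, holdout_element: str = "Bi"):
    """Return (in-distribution training set, out-of-distribution test set)."""
    # Match the element symbol only when not followed by a lowercase letter,
    # so a holdout of "B" does not accidentally match "Bi", "Ba", etc.
    pattern = rf"{holdout_element}(?![a-z])"
    contains = df["formula"].str.contains(pattern, regex=True)
    return df[~contains], df[contains]

# Example usage, assuming a dataframe with "formula" and property columns:
# train_df, ood_df = ood_split(materials_df, holdout_element="Bi")
```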
To demonstrate data redundancy systematically, the researchers designed an elegant experiment centered on a pruning algorithm that progressively removes data points from training sets while monitoring model performance [1]. In outline, the procedure trains a model on the current training set, ranks the training points by how much they appear to contribute to predictive accuracy, discards the least informative ones, retrains, and repeats while tracking error on a held-out test set.
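As a rough illustration of that loop, the sketch below trains a random forest on progressively smaller subsets, at each stage dropping the points the current model already fits most easily. It is a simplified stand-in under assumed inputs (NumPy arrays `X` and `y` of featurized materials and a target property), not the exact algorithm from the study.

```python
# Hedged sketch of a progressive data-pruning loop. X and y are assumed to be
# NumPy arrays of featurized materials and a target property (e.g. formation
# energy); this is not the exact pruning strategy from the cited paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def prune_and_evaluate(X, y, keep_fractions=(1.0, 0.5, 0.2, 0.1, 0.05)):
    """Train on progressively smaller training subsets and report test RMSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    kept = np.arange(len(X_tr))            # indices of training points still kept
    results = {}
    for frac in keep_fractions:
        n_keep = max(1, int(frac * len(X_tr)))
        subset = kept[:n_keep]
        model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
        model.fit(X_tr[subset], y_tr[subset])
        results[frac] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
        # Re-rank the kept points by current training error so the "easiest"
        # (most redundant) points are the first to be dropped next round.
        errors = np.abs(model.predict(X_tr[kept]) - y_tr[kept])
        kept = kept[np.argsort(-errors)]
    return results  # mapping of kept fraction -> test RMSE
```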
The researchers applied this methodology across multiple material properties (formation energy, band gap, and bulk modulus) using three different machine learning approaches: conventional random forests, XGBoost, and the state-of-the-art graph neural network ALIGNN [1].
The findings were striking across all model architectures and material properties. For formation energy prediction, the random forest and XGBoost models trained on just 20% of carefully selected data performed as well as models trained on 90% and 70% of randomly selected data, respectively [1].
| Model Type | Dataset | Data Needed for Comparable Performance (% of full training set) | Performance Change with 80% Data Reduction |
|---|---|---|---|
| Random Forest | JARVIS18 | 13% | <6% RMSE increase |
| XGBoost | MP18 | 20-30% | 10-15% RMSE increase |
| ALIGNN | OQMD14 | 30% | 15-45% RMSE increase |
Perhaps most importantly, the research demonstrated that uncertainty-based active learning algorithms (similar to the approaches Jason Hattrick-Simpers has highlighted for materials synthesis) can construct much smaller but equally informative datasets [1, 8]. These algorithms strategically select data points that maximize information gain, rather than relying on random sampling or exhaustive enumeration.
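Here is a minimal sketch of that idea, using the spread of a random forest's individual trees as the uncertainty signal and querying the candidates the model is least sure about. The pool arrays and the `active_learning_rounds` function are assumptions for illustration; the cited work may use different models and acquisition rules.

```python
# Hedged sketch of uncertainty-driven active learning. X_pool and y_pool are
# assumed NumPy arrays of candidate materials; the acquisition rule
# (tree-to-tree prediction spread) is one common choice, not necessarily
# the one used in the cited studies.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def active_learning_rounds(X_pool, y_pool, n_start=100, n_query=50, n_rounds=10):
    """Grow a small training set by repeatedly querying the most uncertain points."""
    rng = np.random.default_rng(seed=0)
    labeled = list(rng.choice(len(X_pool), size=n_start, replace=False))
    for _ in range(n_rounds):
        model = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)
        model.fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        # Uncertainty for each candidate = std. dev. of the per-tree predictions
        tree_preds = np.stack([tree.predict(X_pool[unlabeled]) for tree in model.estimators_])
        uncertainty = tree_preds.std(axis=0)
        # Query the points the current model is least sure about
        queries = unlabeled[np.argsort(-uncertainty)[:n_query]]
        labeled.extend(queries.tolist())
    return labeled  # indices of the compact, information-rich training set
```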
The move toward more efficient machine learning in materials science is supported by a growing collection of electronic resources and computational tools. These resources help researchers navigate the complex landscape of data management, reagent selection, and computational analysis.
| Type | Primary Function | Application |
|---|---|---|
| Research intelligence platform | Uses machine learning to decode published data and present figures with insights | Reduces time and uncertainty in planning materials and methods |
| Resource database | Provides up-to-date product information, reviews, and new technologies | Helps scientists stay current with available research tools |
| Search engine | Generates price comparisons across multiple vendors | Enables cost-effective procurement of lab supplies |
| Lab management platform | Manages inventory, order requests, and supply tracking | Streamlines lab operations and purchasing |
These tools represent the practical infrastructure supporting the shift toward more thoughtful, efficient materials research—complementing the theoretical advances in data pruning and active learning.
The groundbreaking work on data redundancy in materials science, including research highlighted by Jason Hattrick-Simpers, marks a critical turning point in how we approach computational materials discovery. By moving beyond the "bigger is better" mentality, the field can focus on information richness rather than narrowly emphasizing data volume [1].
Smaller, well-curated datasets dramatically reduce computational barriers, making advanced materials modeling accessible to more researchers [1].
Models trained on informative rather than redundant data show better performance on novel, out-of-distribution materials, precisely what we need for genuine discovery [1].
As Jason Hattrick-Simpers has noted regarding materials synthesis, the future lies in "kernel-learning-assisted synthesis condition" approaches that provide "dynamic guidance" for creating novel materials [8]. This fusion of human expertise and intelligent algorithms represents the next frontier, where machines help us not just with data analysis, but with deciding which data is worth collecting in the first place.
The path forward isn't about collecting less data overall, but about collecting smarter data—strategically exploring the materials space to maximize information gain while minimizing redundancy. In this new paradigm, materials scientists can spend less time managing overwhelming datasets and more time making genuine discoveries that address pressing global challenges—from sustainable energy to advanced computing and beyond.