Challenging the "bigger is better" paradigm with data pruning and strategic sampling approaches
Imagine trying to find a handful of exquisite seashells among millions scattered across an endless beach. That's the challenge facing materials scientists today, as massive computational databases containing millions of material properties threaten to overwhelm researchers with sheer volume rather than empower them with genuine insights. For years, the prevailing mantra has been "bigger is better": more data points must inevitably lead to better machine learning models and faster discovery of revolutionary materials [1].
But what if we've been thinking about this all wrong? What if, hidden within these enormous datasets, lies profound redundancy that actually slows progress rather than accelerating it? This is the compelling question that researchers like Jason Hattrick-Simpers and his colleagues have been exploring, with findings that fundamentally challenge how we approach materials informatics.
Their work reveals that sometimes, less truly is more—especially when it comes to training the next generation of materials discovery algorithms.
The materials science community has invested tremendous effort in compiling massive databases through high-throughput calculations and experiments. Projects like the Materials Project, JARVIS, and OQMD now contain millions of entries: formation energies, band gaps, bulk moduli, and other critical properties for known and hypothetical materials [1]. The recently released Open Catalyst datasets pushed this further, with over 260 million DFT data points for catalyst modeling [1].
Training models on these datasets requires immense computational resources: one recent project consumed over 16,000 GPU days, equivalent to nearly 44 years of continuous computation on a single GPU [1]. Such staggering requirements put advanced modeling out of reach for most researchers and institutions [1].
Surprisingly, extensive analysis has revealed a significant degree of redundancy across multiple large datasets for various material properties [1].
| Database Name | Content Focus | Scale | Applications |
|---|---|---|---|
| Materials Project (MP) | Energy and band gap data for millions of crystal structures | Millions of entries | Materials design, property prediction |
| JARVIS | Diverse material properties via DFT | Millions of entries | Materials discovery, AI development |
| OQMD | DFT-calculated thermodynamic and structural properties | Millions of entries | Materials screening, design |
| Open Catalyst | DFT data for adsorbate-catalyst surface systems | Over 260 million data points | Catalyst development, reaction simulation |
Groundbreaking research published in Nature Communications has delivered a startling conclusion: up to 95% of the data in some materials datasets can be safely removed from machine learning training with minimal impact on prediction performance for typical, in-distribution cases [1]. This revelation forces us to reconsider not just how we build machine learning models, but how we collect and curate materials data in the first place.
The key insight is that our current datasets often over-represent certain material types while leaving others underexplored. Think of it like a food critic who only samples pizza from 100 different chain restaurants: they'll become excellent at judging pizza, but know nothing about sushi, curry, or salad. Similarly, our models become experts on over-represented materials but fail dramatically when encountering truly novel compositions or structures [1].
The redundancy problem isn't just about computational efficiency; it has serious implications for scientific robustness. Models trained on redundant datasets show severe performance degradation when faced with out-of-distribution samples, precisely the novel materials that discovery campaigns aim to identify [1].
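To make the out-of-distribution idea concrete, here is a minimal sketch of one common way to build such a test split: hold out every compound containing a chosen element, so the model is judged on chemistry it has never seen. The dataframe layout, the `formula` column, and the `ood_split` helper are illustrative assumptions rather than the published study's protocol.

```python
# Minimal sketch (assumed setup): build an out-of-distribution test split by
# holding out every compound containing a chosen element. The column name
# "formula" is an illustrative assumption, not from the cited study.
import pandas as pd

def ood_split(df: pd.DataFrame, holdout_element: str = "Bi"):
    """Return (in-distribution training set, out-of-distribution test set)."""
    # Match the element symbol only when not followed by a lowercase letter,
    # so a holdout of "B" does not accidentally match "Bi", "Ba", etc.
    pattern = rf"{holdout_element}(?![a-z])"
    contains = df["formula"].str.contains(pattern, regex=True)
    return df[~contains], df[contains]

# Example usage, assuming a dataframe with "formula" and property columns:
# train_df, ood_df = ood_split(materials_df, holdout_element="Bi")
```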
To demonstrate data redundancy systematically, the researchers designed an elegant experiment centered on a pruning algorithm that progressively removes data points from training sets while monitoring model performance [1]. In outline, the procedure trains a model on the current training set, ranks the training points by how much they appear to contribute to predictive accuracy, discards the least informative ones, retrains, and repeats while tracking error on a held-out test set.
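As a rough illustration of that loop, the sketch below trains a random forest on progressively smaller subsets, at each stage dropping the points the current model already fits most easily. It is a simplified stand-in under assumed inputs (NumPy arrays `X` and `y` of featurized materials and a target property), not the exact algorithm from the study.

```python
# Hedged sketch of a progressive data-pruning loop. X and y are assumed to be
# NumPy arrays of featurized materials and a target property (e.g. formation
# energy); this is not the exact pruning strategy from the cited paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def prune_and_evaluate(X, y, keep_fractions=(1.0, 0.5, 0.2, 0.1, 0.05)):
    """Train on progressively smaller training subsets and report test RMSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    kept = np.arange(len(X_tr))            # indices of training points still kept
    results = {}
    for frac in keep_fractions:
        n_keep = max(1, int(frac * len(X_tr)))
        subset = kept[:n_keep]
        model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
        model.fit(X_tr[subset], y_tr[subset])
        results[frac] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
        # Re-rank the kept points by current training error so the "easiest"
        # (most redundant) points are the first to be dropped next round.
        errors = np.abs(model.predict(X_tr[kept]) - y_tr[kept])
        kept = kept[np.argsort(-errors)]
    return results  # mapping of kept fraction -> test RMSE
```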
The researchers applied this methodology across multiple material properties (formation energy, band gap, and bulk modulus) using three different machine learning approaches: conventional random forests, XGBoost, and the state-of-the-art graph neural network ALIGNN [1].
The findings were striking across all model architectures and material properties. For formation energy prediction, the random forest and XGBoost models trained on just 20% of carefully selected data performed as well as models trained on 90% and 70% of randomly selected data, respectively [1].
| Model Type | Dataset | Data Needed for Comparable Performance (% of full training set) | Performance Change with 80% Data Reduction |
|---|---|---|---|
| Random Forest | JARVIS18 | 13% | <6% RMSE increase |
| XGBoost | MP18 | 20-30% | 10-15% RMSE increase |
| ALIGNN | OQMD14 | 30% | 15-45% RMSE increase |
Perhaps most importantly, the research demonstrated that uncertainty-based active learning algorithms (similar to the approaches Jason Hattrick-Simpers has highlighted for materials synthesis) can construct much smaller but equally informative datasets [1, 8]. These algorithms strategically select data points that maximize information gain, rather than relying on random sampling or exhaustive enumeration.
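Here is a minimal sketch of that idea, using the spread of a random forest's individual trees as the uncertainty signal and querying the candidates the model is least sure about. The pool arrays and the `active_learning_rounds` function are assumptions for illustration; the cited work may use different models and acquisition rules.

```python
# Hedged sketch of uncertainty-driven active learning. X_pool and y_pool are
# assumed NumPy arrays of candidate materials; the acquisition rule
# (tree-to-tree prediction spread) is one common choice, not necessarily
# the one used in the cited studies.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def active_learning_rounds(X_pool, y_pool, n_start=100, n_query=50, n_rounds=10):
    """Grow a small training set by repeatedly querying the most uncertain points."""
    rng = np.random.default_rng(seed=0)
    labeled = list(rng.choice(len(X_pool), size=n_start, replace=False))
    for _ in range(n_rounds):
        model = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)
        model.fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        # Uncertainty for each candidate = std. dev. of the per-tree predictions
        tree_preds = np.stack([tree.predict(X_pool[unlabeled]) for tree in model.estimators_])
        uncertainty = tree_preds.std(axis=0)
        # Query the points the current model is least sure about
        queries = unlabeled[np.argsort(-uncertainty)[:n_query]]
        labeled.extend(queries.tolist())
    return labeled  # indices of the compact, information-rich training set
```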
The move toward more efficient machine learning in materials science is supported by a growing collection of electronic resources and computational tools. These resources help researchers navigate the complex landscape of data management, reagent selection, and computational analysis.
| Type | Primary Function | Application |
|---|---|---|
| Research intelligence platform | Uses machine learning to decode published data and present figures with insights | Reduces time and uncertainty in planning materials and methods |
| Resource database | Provides up-to-date product information, reviews, and new technologies | Helps scientists stay current with available research tools |
| Search engine | Generates price comparisons across multiple vendors | Enables cost-effective procurement of lab supplies |
| Lab management platform | Manages inventory, order requests, and supply tracking | Streamlines lab operations and purchasing |
These tools represent the practical infrastructure supporting the shift toward more thoughtful, efficient materials research—complementing the theoretical advances in data pruning and active learning.
The groundbreaking work on data redundancy in materials science, including research highlighted by Jason Hattrick-Simpers, marks a critical turning point in how we approach computational materials discovery. By moving beyond the "bigger is better" mentality, the field can focus on information richness rather than narrowly emphasizing data volume [1].
Smaller, well-curated datasets dramatically reduce computational barriers, making advanced materials modeling accessible to more researchers [1].
Models trained on informative rather than redundant data show better performance on novel, out-of-distribution materials, precisely what we need for genuine discovery [1].
As Jason Hattrick-Simpers has noted regarding materials synthesis, the future lies in "kernel-learning-assisted synthesis condition" approaches that provide "dynamic guidance" for creating novel materials [8]. This fusion of human expertise and intelligent algorithms represents the next frontier, where machines help us not just with data analysis, but with deciding which data is worth collecting in the first place.
The path forward isn't about collecting less data overall, but about collecting smarter data—strategically exploring the materials space to maximize information gain while minimizing redundancy. In this new paradigm, materials scientists can spend less time managing overwhelming datasets and more time making genuine discoveries that address pressing global challenges—from sustainable energy to advanced computing and beyond.