New Anvil datasets available to accelerate discovery driven by artificial intelligence

Anvil, one of Purdue University’s most powerful supercomputers, is undergoing an upgrade to its data repositories in order to provide researchers with easy access to large artificial intelligence (AI) datasets. These AI datasets are hosted on the system and will enable scientific breakthroughs and a faster time-to-discovery using AI and machine learning techniques.

Datasets are invaluable for research, but this holds especially true for AI research. AI modeling and machine learning typically rely on immediate access to massive collections of data, whether the end goal is to train a new model or to use an existing one for more specific research. Domain-specific datasets accommodate a researcher’s needs via a single package. However, these packets of data have a downside.

The problem inherent in datasets is that they contain exactly what a researcher wants—a colossal amount of data. The sheer volume of information contained within datasets entails an extraordinary storage footprint, long transfer time, and impact on the machine’s memory. Even researchers utilizing HPC resources can be waylaid by the effort of obtaining the datasets they need and ensuring they are located where their system can use them. The Anvil team at the Rosen Center for Advanced Computing (RCAC) decided to help researchers bypass this issue by amassing datasets into a data repository that is pre-downloaded and ready for use on an HPC system, backed by a fast underlying network. Now, anyone with access to Anvil can use these datasets immediately in their work, saving them both time and hassle on their projects, and accelerating research.

“Making popular datasets natively available on Anvil fundamentally changes how researchers work,” said Haniye Kashgarani. “Datasets with very large numbers of files are hosted in Anvil Object Storage and are also made available in optimized formats such as SquashFS and LMDB on the Anvil file system. This immediate, high-performance access allows scientists to fully leverage HPC and AI workflows without the overhead of data transfers, storage constraints, or redundant downloads. As demand grows, additional widely used datasets can be added to the platform upon request.”

The most recent additions to Anvil’s data repositories are its AI datasets. This collection covers computer vision, PhysicalAI, and robotics, and supports tasks such as detection, segmentation, tracking, control, reinforcement learning, and large-scale model pretraining and evaluation across domains, including everyday objects, smart spaces, and embodied PhysicalAI. There are currently nine datasets in the collection, with more to come. The new AI datasets will enable scientists on Anvil to leverage machine learning techniques and quickly develop AI models that can be embedded into physical systems such as robots or drones, without needing to download and manage the data themselves.

In addition to the new AI datasets, Anvil hosts dataset collections for geospatial, hydrological, meteorological, covariates, igenomes, and GeoAI research. In total, these collections amount to over 215 TB of data. RCAC’s efforts to centralize and host these valuable datasets on Anvil make the data more easily discoverable, accessible, and usable for scientists throughout the nation.

“One of our goals with Anvil is to push the limits of scientific discovery,” said Arman Pazouki, Director of Scientific Applications at RCAC and co-PI on the Anvil project. “Hosting these datasets on Anvil makes research more efficient, allowing our users to focus on conducting science instead of on data management. As a result, researchers will be able to harness the power of AI and machine learning easier than ever before, expediting the rate at which scientific breakthroughs are possible. This is just one small step in how Anvil is helping to reshape the world of research and expand access on a national scale.”

Researchers who would like to use Anvil’s datasets can learn more here: Anvil Dataset Documentation

To learn more about High-Performance Computing and how it can help you, please visit our “Why HPC?” page.

Anvil is one of Purdue University’s most powerful supercomputers, providing researchers from diverse backgrounds with advanced computing capabilities. Built through a $10 million system acquisition grant from the National Science Foundation (NSF), Anvil supports scientific discovery by providing resources through the NSF’s Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS), a program that serves tens of thousands of researchers across the United States. Anvil also supports advanced artificial intelligence research as an official resource provider of the National Artificial Intelligence Research Resource (NAIRR) Pilot.

Researchers may request access to Anvil via the ACCESS allocations process or through the NAIRR allocations process. More information about Anvil is available on Purdue’s Anvil website. Anyone with questions should contact anvil@purdue.edu. Anvil is funded under NSF award No. 2005632.

Written by: Jonathan Poole, poole43@purdue.edu

Originally posted: