Protected Data Archive User Guide

The Protected Data Archive system is a large, long-term, multi-tiered file caching and storage system that uses both online disk and robotic tape drives. It is intended for storing sensitive or restricted datasets.

 

Protected Data Archive Overview

The Protected Data Archive (PDA) uses a Spectra Logic TFinity robotic tape library with a capacity of over 200 PB.

Hardware overview
Storage Subsystem    Current Capacity    Hardware
Disk Cache           Over 800 TB         2 Seagate 5U84 storage arrays
Long-Term Storage    Over 200 PB         LTO-9 robotic tape library

All files stored on the Protected Data Archive reside on at least two separate storage devices:

  • Two copies are kept permanently on tape.

Both primary and secondary copies of larger files reside on separate tape cartridges in the robotic tape library. After a period of inactivity, HPSS will migrate files from disk cache to tape.

The PDA writes two copies of every file either to two tapes, or to disk and a tape, to protect against medium errors. Unfortunately, HPSS does not automatically switch to the alternate copy when it has trouble accessing the primary. If it seems to be taking an extraordinary amount of time to retrieve a file (hours), please contact support. We can then investigate why it is taking so long. If it is an error on the primary copy, we will instruct the PDA to switch to the alternate copy as the primary and recreate a new alternate copy.

Sensitive and Restricted Data on the Protected Data Archive

The Protected Data Archive is suitable for sensitive and restricted datasets. Example datasets that have been reviewed and approved include the NIH Database of Genotypes and Phenotypes (dbGaP), licensed datasets such as the UK Biobank, and deidentified human genomic data. The PDA is not approved for export-controlled data subject to ITAR, or for CUI.

Protected Data Archive Storage Quota

There is currently no quota on PDA disk use. Archive users receive a monthly email report showing their current PDA usage.

Files belonging to deleted accounts are retained after the accounts have been terminated, but are accessible only by special request. The files will be kept for no more than ten years or for the usable life of the media on which they are stored, whichever comes first.

Protected Data Archive File Recovery

Data on the PDA is not backed up elsewhere in a traditional sense. New and modified files in the disk cache are migrated to tape within 30 minutes, and the PDA maintains two copies of every file on different media to protect against media failures, but there is no backup protecting against accidental deletions.

If you remove or overwrite a file on the PDA, it is gone. You cannot request to have it retrieved.

Accounts on the Protected Data Archive

Obtaining an Account

Any Purdue PI requiring large-scale archival storage of sensitive or restricted data may obtain access to the Protected Data Archive. 

Research groups are assigned a group data storage space within the PDA. Faculty should request a Data Depot trial to create a shared PDA space for their research group.

The PDA sets no limits on the amount of data or the number of files that you may store. However, there are several restrictions on the nature of files you may store:

  • Many small files: The PDA is a tape archive and works best with a few large files. Large sets of small files should be compressed into archives with utilities such as htar. Other technical limitations are detailed on the Fortress FAQs.
  • Backups of individual or departmental computers: The PDA is intended to be a research data store for sensitive and restricted data, not a personal or enterprise backup solution.
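
As a rough illustration of the "few large files" advice, the sketch below bundles a directory of small files into a single compressed archive on your own machine before it is transferred. It uses Python's standard tarfile module as a stand-in; it is not htar, and the function name and paths are hypothetical examples rather than anything prescribed by the PDA.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir: str, archive_path: str) -> int:
    """Pack every regular file under source_dir into one gzip-compressed tar.

    Returns the number of files added. Write the archive OUTSIDE of
    source_dir, otherwise the archive would try to include itself.
    """
    source = Path(source_dir).resolve()
    count = 0
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store names relative to the parent directory so the archive
                # unpacks into one top-level folder named after source_dir.
                arcname = path.relative_to(source.parent).as_posix()
                tar.add(path, arcname=arcname)
                count += 1
    return count


if __name__ == "__main__":
    # Hypothetical example paths; adjust to your own data layout.
    example = Path("many_small_files")
    if example.is_dir():
        added = bundle_directory(str(example), "many_small_files.tar.gz")
        print(f"Bundled {added} files")
```

The resulting single archive file can then be transferred in place of thousands of individual files.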

Storing more than 50 TB of data will incur a cost recovery charge for tape media.
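
To get a feel for where a planned upload sits relative to the 50 TB figure, a small sketch like the following can total the size of a local directory. It assumes 1 TB means 10**12 bytes (the guide does not say whether the threshold is decimal or binary) and counts only the logical size of your files; how the archive actually accounts for usage is not specified here, so treat the result as an estimate.

```python
import os

# Assumption: decimal terabytes. The guide does not define the unit precisely.
COST_RECOVERY_THRESHOLD_BYTES = 50 * 10**12


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


def exceeds_threshold(current_usage_bytes: int, new_bytes: int) -> bool:
    """Return True if existing usage plus the new data passes the threshold."""
    return current_usage_bytes + new_bytes > COST_RECOVERY_THRESHOLD_BYTES
```

The current usage value would come from your monthly usage report; the helper names above are illustrative only.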

Login & Keytabs

It is not possible to log in directly to the PDA via SSH or SCP. Access to the PDA is via Globus High Assurance.

 

File Storage and Transfer

Learn more about file storage and transfer for the Protected Data Archive.

Your home directory on the PDA is the default directory in which your archive files are stored.

On the PDA, your home directory will appear as /home/myusername, but this is not the same directory as your home directory on any other Purdue IT systems. 

The following link will take you to more information about transferring files in and out of the PDA.

Globus High Assurance

Globus, previously known as Globus Online, is a powerful and easy-to-use file transfer service for transferring files virtually anywhere. It works within RCAC's various research storage systems; it connects between RCAC and remote research sites running Globus; and it connects research systems to personal systems. You may use Globus to connect to your Protected Data Archive storage space. The Protected Data Archive uses Globus' high assurance subscription, which provides additional controls and security options so that Globus can be used for datasets that contain protected data, in compliance with the data's security requirements.

Since Globus is web-based, it works on any operating system that is connected to the internet. The Globus Personal client is available on Windows, Linux, and Mac OS X. It is primarily used as a graphical means of transfer but it can also be used over the command line.

Globus Web:

  • Navigate to http://transfer.rcac.purdue.edu
  • Click "Proceed" to log in with your Purdue Career Account.
  • On your first login it will ask to make a connection to a Globus account. Accept the conditions.
  • Now you are at the main screen. Click "File Transfer", which will bring you to a two-panel interface (if you only see one panel, you can use the selector in the top-right corner to switch the view).
  • Select one collection and file path on one side as the source, and a second collection on the other side as the destination. A collection can be one of several Purdue endpoints, an endpoint at another university, or even your personal computer (see the Globus Personal Client setup section below).

The collection providing access to the PDA is: 

  • PDA: "Purdue Protected Data HPSS Archive". A search for "PDA" should provide appropriate matches to choose from.

From here, select a file or folder on either side of the two-pane window, and then use the arrows in the top-middle of the interface to instruct Globus to move files from one side to the other. You can transfer files in either direction. You will receive an email once the transfer is completed.

Globus Personal Client setup:

Globus Connect Personal is a small software tool you can install to make your own computer a Globus endpoint on its own. It is useful if you need to transfer files via Globus to and from your computer directly.

  • On the "Collections" page from earlier, click "Get Globus Connect Personal", or download the version for your operating system from the Globus Connect Personal page.
  • Name this particular personal system and follow the setup prompts to create your Globus Connect Personal endpoint.
  • Your personal system is now available as a collection within the Globus transfer interface.

Globus Command Line:

Globus supports a command line interface, allowing advanced automation of your transfers. The recommended tool is the standalone Globus CLI application (the globus command).
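
As a minimal sketch of what an automated transfer with the globus command might look like, the Python helper below only builds the command lines and prints them; it does not run anything or contact Globus. The subcommands shown (globus login, globus endpoint search, globus transfer, globus task wait) are part of the Globus CLI, but the collection IDs, paths, label, and task ID are hypothetical placeholders that you would replace with your own values.

```python
import shlex
from typing import List

# Hypothetical placeholders: look up real collection UUIDs with
# `globus endpoint search` before running any transfer.
PDA_COLLECTION = "PDA-COLLECTION-UUID"
LOCAL_COLLECTION = "MY-COMPUTER-COLLECTION-UUID"


def login_command() -> List[str]:
    """Command that starts the interactive Globus CLI login."""
    return ["globus", "login"]


def search_command(text: str) -> List[str]:
    """Command that searches for collections matching the given text."""
    return ["globus", "endpoint", "search", text]


def transfer_command(src_collection: str, src_path: str,
                     dst_collection: str, dst_path: str,
                     label: str, recursive: bool = False) -> List[str]:
    """Command that submits a transfer between two collections."""
    cmd = [
        "globus", "transfer",
        f"{src_collection}:{src_path}",
        f"{dst_collection}:{dst_path}",
        "--label", label,
    ]
    if recursive:
        # Needed when the source path is a directory rather than a file.
        cmd.append("--recursive")
    return cmd


def wait_command(task_id: str) -> List[str]:
    """Command that blocks until the given transfer task finishes."""
    return ["globus", "task", "wait", task_id]


if __name__ == "__main__":
    commands = [
        login_command(),
        search_command("PDA"),
        transfer_command(LOCAL_COLLECTION, "/data/results.tar.gz",
                         PDA_COLLECTION, "/home/myusername/results.tar.gz",
                         label="results upload"),
        wait_command("TASK-ID-FROM-TRANSFER-OUTPUT"),
    ]
    for cmd in commands:
        # shlex.join quotes arguments so each line can be pasted into a shell.
        print(shlex.join(cmd))
```

Building the argument lists separately from running them makes it easy to review exactly what will be submitted before handing the commands to a shell or to subprocess.run.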

Sharing Data with Outside Collaborators

Globus allows convenient sharing of data with outside collaborators. Data can be shared with collaborators' personal computers or directly with many other computing resources at other institutions. See the Globus documentation on how to share data.

For links to more information, please see the Globus Support page and the RCAC Globus presentation.
