Data Structure
The IDS is based on a hierarchical structure that is used to organise data and, in particular, metadata, that is, data about the data (more information can be found here).
The four levels of the IDS are:
- Project
- Experiment
- Dataset
- (Data)File
The first letters of these form the acronym PEDD. The following diagram shows the hierarchy and examples per level:
The data files, which represent the raw data, sit only at the core of this model.
The data files are the most granular level of the IDS. The outer layers represent metadata: files are grouped into Datasets, which are in turn grouped into Experiments, and Experiments are grouped into Projects. Metadata entries (Properties) are inherited down the hierarchy.
Metadata should be captured or added at the level where it covers the most commonality. For example, if everything in an Experiment was carried out at the same temperature, the temperature only needs to be entered once at the Experiment level, and the underlying Datasets and (Data)Files inherit it. Following this principle and making use of inheritance minimises effort and reduces the risk of errors from repeatedly typing in the same information.
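The inheritance principle can be illustrated with a minimal Python sketch. This is not the IDS implementation: the `Node` class, the `effective_metadata` method, and all names and values are hypothetical, and the rule that a value set on a lower level overrides an inherited one is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """One level of the hierarchy: Project, Experiment, Dataset or (Data)File."""
    level: str
    name: str
    metadata: dict[str, Any] = field(default_factory=dict)
    parent: Optional["Node"] = None

    def effective_metadata(self) -> dict[str, Any]:
        """Merge metadata from the top of the hierarchy down to this node.

        Assumption for this sketch: a value set closer to the node
        overrides an inherited value with the same key.
        """
        inherited = self.parent.effective_metadata() if self.parent else {}
        return {**inherited, **self.metadata}


# Hypothetical example data: the temperature is entered once, on the Experiment.
project = Node("Project", "Alloy study", {"funder": "Example Fund"})
experiment = Node("Experiment", "Sample A", {"temperature_c": 21}, parent=project)
dataset = Node("Dataset", "Microscopy", {"instrument": "Microscope 1"}, parent=experiment)
data_file = Node("DataFile", "image_001.tif", parent=dataset)

print(data_file.effective_metadata())
```

The file carries no metadata of its own in this example, yet its effective metadata contains the funder, the temperature, and the instrument, each entered exactly once at the level where it applies.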
At each level of the hierarchy, including the individual file level, there are mandatory metadata fields that you can use to describe your data. You can also associate a custom metadata schema with each level, which allows you to record any relevant domain-specific observations and variables. The Instrument Data Service Search functionality allows you to filter for data based on metadata.
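To illustrate why consistent metadata makes filtering useful, here is a small Python sketch that filters in-memory records by metadata values. It does not use or represent the IDS search interface; the `filter_by_metadata` function, the field names, and the records are all hypothetical.

```python
from typing import Any

# Hypothetical file records with their (already inherited) metadata.
records: list[dict[str, Any]] = [
    {"name": "image_001.tif", "instrument": "Microscope 1", "temperature_c": 21},
    {"name": "image_002.tif", "instrument": "Microscope 1", "temperature_c": 37},
    {"name": "scan_001.raw", "instrument": "Spectrometer 2", "temperature_c": 21},
]


def filter_by_metadata(items: list[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Return the items whose metadata matches every given key/value pair."""
    return [
        item
        for item in items
        if all(item.get(key) == value for key, value in criteria.items())
    ]


matches = filter_by_metadata(records, instrument="Microscope 1", temperature_c=21)
print([m["name"] for m in matches])
```

A filter like this only finds data reliably if the same field names and value conventions are used throughout, which is one reason to decide on a metadata schema before collecting data.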
How to use the IDS data structure
The three top levels of the IDS hierarchy reflect metadata. When planning your structure, you should anticipate which parts of a given research project you will need to restrict access to, as this may also influence how you structure your data.
What should I use as an identifier for my Project?
- Usually, there should be exactly one Project, corresponding to the research project or unit of research activity that you are collecting data for.
- An externally assigned identifier such as a Research Activity Identifier (RAiD), if available.
- Your institution's project code for the project, if available.
- The project name or another unique designation.
What should I use as an identifier for my Experiment?
- Usually, an Experiment is based on a single study sample, or it aggregates all Datasets/(Data)Files that relate to a specific research property.
- A code for the experiment, if your research group has a coding system for individual experiments.
- The experiment name or another unique designation.
What should I use as an identifier for my Dataset?
- Usually, a Dataset is based on a single instrument, or it aggregates all (Data)Files that relate to a specific research property, for example the run conditions.
- A code for the dataset, if your research group has a coding system for individual datasets.
- The dataset name or another unique designation.
- The identifier must be unique across all Datasets. There may be cases where you want to use the dataset name as the identifier, and there are similarly named Datasets. For example, you may have two experiments that use the same microscopy instrument, and the acquired images are stored in a Dataset in each Experiment, both named "Microscopy". To keep the identifiers unique, you can prefix the two identifiers with the identifier of the Experiment the dataset falls under, to distinguish between them.
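The prefixing approach for Dataset identifiers can be sketched in Python as follows. The helper names, the `-` separator, and the experiment identifiers `EXP-001` and `EXP-002` are hypothetical choices for illustration; the IDS does not prescribe a particular format here.

```python
def dataset_identifier(experiment_id: str, dataset_name: str, sep: str = "-") -> str:
    """Build a Dataset identifier by prefixing the name with its Experiment's identifier."""
    return f"{experiment_id}{sep}{dataset_name}"


def find_duplicates(identifiers: list[str]) -> set[str]:
    """Return every identifier that occurs more than once."""
    seen: set[str] = set()
    duplicates: set[str] = set()
    for identifier in identifiers:
        if identifier in seen:
            duplicates.add(identifier)
        seen.add(identifier)
    return duplicates


# Two Experiments each contain a Dataset named "Microscopy".
plain = ["Microscopy", "Microscopy"]
prefixed = [
    dataset_identifier("EXP-001", "Microscopy"),
    dataset_identifier("EXP-002", "Microscopy"),
]

print(find_duplicates(plain))     # the bare names clash
print(find_duplicates(prefixed))  # the prefixed identifiers do not
```

Running a duplicate check like this over a planned structure before ingesting data is a cheap way to catch identifier clashes early.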
What should I use as an identifier for my Files?
- Usually, these names should not be altered directly; they are generated by the instrument or software that creates the data files.
TODO: Decide if we want to put a copy of [this] here. (https://uoa-eresearch.github.io/cer-documentation/Research%20Data/Data%20Storage/IDS/Wizard/Tutorial/02-decide-structure/#sample-data-structure-plan)