Data Archiving Formats Supported by Luxbio.net
Luxbio.net supports several data archiving formats, with a focus on open, documented standards that protect long-term data integrity, accessibility, and interoperability. The core formats are the Annotated Data Bundle (ADB), which was developed by Luxbio but is published as an open specification, and two widely adopted community standards: ISA-Tab and ISA-JSON, the tab-delimited and JSON serializations of the Investigation-Study-Assay (ISA) model. This multi-format support is central to the platform's mission of providing a robust, standards-compliant repository for multi-omics and other complex biomedical data. For a detailed technical breakdown, the official specifications are published on the luxbio.net platform.
The flagship format for data submission and archiving on the platform is the Annotated Data Bundle (ADB). Think of an ADB as a highly organized digital suitcase for your entire research project. It’s not just a folder of files; it’s a structured container that bundles the raw data files with their critical metadata in a single, manageable unit. The ADB specification uses a standardized directory structure and a machine-readable manifest file (typically in JSON or YAML format) that describes every component within the bundle. This manifest acts as a detailed inventory, listing each data file, its format, its role in the experiment, and its relationship to other files. For example, a single ADB for a transcriptomics study might contain the raw sequencing files (FASTQ), the processed count matrices (TSV), and the associated sample metadata, all linked together. This approach directly tackles the common problem of data and metadata becoming separated, which is a major risk to long-term usability.
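To make the manifest idea concrete, here is a minimal Python sketch that assembles a small bundle and writes a JSON inventory for it. The field names (`path`, `format`, `role`, `derived_from`), the `bundle_version` value, and the directory layout are assumptions made for illustration; they are not taken from the ADB specification, which remains the authoritative reference.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir: Path, roles: dict) -> dict:
    """Describe each listed file in bundle_dir.

    roles maps a relative path to (format, role, derived_from).
    The manifest layout is hypothetical, not the real ADB schema.
    """
    entries = []
    for rel_path, (fmt, role, derived_from) in sorted(roles.items()):
        file_path = bundle_dir / rel_path
        entries.append({
            "path": rel_path,
            "format": fmt,
            "role": role,
            "derived_from": derived_from,  # links a file to its input
            "size_bytes": file_path.stat().st_size,
            "sha256": sha256_of(file_path),
        })
    return {"bundle_version": "0.1-example", "files": entries}


def make_example_bundle(root: Path) -> dict:
    """Create a toy transcriptomics bundle and write manifest.json."""
    (root / "raw").mkdir()
    (root / "processed").mkdir()
    (root / "raw" / "sample1.fastq").write_text("@read1\nACGT\n+\nIIII\n")
    (root / "processed" / "counts.tsv").write_text("gene\tsample1\nG1\t12\n")
    (root / "samples.tsv").write_text("sample\ttissue\nsample1\tliver\n")
    roles = {
        "raw/sample1.fastq": ("FASTQ", "raw reads", None),
        "processed/counts.tsv": ("TSV", "count matrix", "raw/sample1.fastq"),
        "samples.tsv": ("TSV", "sample metadata", None),
    }
    manifest = build_manifest(root, roles)
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(json.dumps(make_example_bundle(Path(tmp)), indent=2))
```

Recording a checksum and a `derived_from` link for each file is one way to keep data and metadata verifiably tied together, which is the separation problem the bundle is meant to address.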
Complementing the ADB, Luxbio.net integrates closely with the ISA framework, an internationally recognized metadata standard for describing experimental workflows. The platform supports both the tab-delimited ISA-Tab format and the more structured ISA-JSON. The ISA model is particularly useful for complex, multi-omics work in which a single investigation comprises multiple studies (e.g., separate patient cohorts), and each study includes multiple assay types (e.g., genomics, proteomics, metabolomics). By supporting ISA, Luxbio.net lets researchers capture the full context of their work: the overall project and its goals (the “Investigation”), the subjects, biological material, and experimental design (the “Study”), and the specific analytical measurements taken (the “Assay”). This level of detail is not just for human readability; because the relationships between samples, protocols, and data files are explicitly defined, the data can also be reused computationally.
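The nesting of investigation, study, and assay can be pictured with a few plain Python dataclasses. This is a simplified illustration of the hierarchy only: the class and field names are chosen for readability and do not reproduce the ISA-JSON schema or the `isatools` object model, and the identifiers are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assay:
    measurement_type: str  # e.g. "transcription profiling"
    technology_type: str   # e.g. "nucleotide sequencing"
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    identifier: str
    description: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    identifier: str
    title: str
    studies: List[Study] = field(default_factory=list)

    def data_files_by_measurement(self) -> Dict[str, List[str]]:
        """Group every data file by the measurement type of its assay."""
        grouped: Dict[str, List[str]] = {}
        for study in self.studies:
            for assay in study.assays:
                grouped.setdefault(assay.measurement_type, []).extend(
                    assay.data_files
                )
        return grouped


# Hypothetical example: one investigation, two cohorts, three assays.
investigation = Investigation(
    identifier="INV-1",
    title="Liver disease multi-omics",
    studies=[
        Study("STU-1", "Cohort A", assays=[
            Assay("transcription profiling", "nucleotide sequencing",
                  ["a_rna.fastq"]),
            Assay("protein expression profiling", "mass spectrometry",
                  ["a_prot.mzML"]),
        ]),
        Study("STU-2", "Cohort B", assays=[
            Assay("transcription profiling", "nucleotide sequencing",
                  ["b_rna.fastq"]),
        ]),
    ],
)

if __name__ == "__main__":
    for measurement, files in investigation.data_files_by_measurement().items():
        print(measurement, files)
```

Because every data file hangs off an assay, which hangs off a study, a query such as "all transcriptomics files across cohorts" becomes a simple traversal rather than a manual search through folders.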
The support for these high-level container formats is built upon a foundation of support for the actual data files themselves. Luxbio.net accepts a wide array of file formats commonly used in life sciences research. The table below provides a non-exhaustive list of supported primary data formats, categorized by data type.
| Data Type | Supported File Formats | Typical Use Case & Notes |
|---|---|---|
| Genomics / Sequencing | FASTQ, FASTA, BAM, CRAM, VCF, GFF, GTF | Raw reads, aligned sequences, variant calls, and genome annotations. BAM/CRAM files are recommended over SAM for efficient storage. |
| Transcriptomics | TSV, CSV, MTX (Matrix Market), H5AD (AnnData) | Gene expression count matrices, often accompanied by feature and barcode metadata. H5AD is increasingly popular for single-cell RNA-seq data. |
| Proteomics / Metabolomics | mzML, mzXML, mzIdentML, mzTab | Open formats for mass spectrometry spectra and identification or quantification results. mzML, mzIdentML, and mzTab are HUPO-PSI standards; mzXML is an older open format that mzML was designed to supersede. |
| Microscopy / Imaging | TIFF, OME-TIFF, CZI (with metadata extraction) | OME-TIFF is strongly preferred as it embeds rich metadata within the image file itself, facilitating better data management. |
| General Data & Metadata | JSON, YAML, XML, TSV, CSV | Used for configuration files, sample sheets, protocol descriptions, and the manifest files within an ADB or ISA structure. |
This format support is not arbitrary; it’s driven by a core set of archiving principles that prioritize the future utility of the data. The first principle is openness. Whenever possible, Luxbio.net advocates for non-proprietary, well-documented formats. Using an open format like mzML for mass spectrometry data, as opposed to a vendor-specific binary format, ensures that the data can be read and processed by a variety of software tools, both now and in the future. This mitigates the risk of format obsolescence. The second principle is self-containment. This is where the ADB and ISA formats truly shine. By bundling data with its metadata, the archived bundle remains interpretable even if the original submission system or database is no longer available. A researcher downloading an ADB ten years from now should have all the information needed to understand and re-analyze the data within the bundle itself.
The technical implementation of the archiving process matters just as much. When data is submitted to Luxbio.net, it undergoes a series of automated checks. The first is file format validation, which confirms that the submitted files are not corrupt and conform to the standard for their declared format. For example, the system might validate that a VCF file has the correct header structure or that a JSON manifest is syntactically correct. The platform also validates metadata against the required schema for the chosen archiving format (e.g., the ADB specification or the ISA model), which ensures that the necessary descriptive information is present and correctly linked to the data files. This validation at the point of ingestion is a proactive measure against archiving “junk data”, meaning data that is incomplete or so poorly described as to be unusable.
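The two checks mentioned as examples can be sketched in a few lines of Python. This is an illustrative approximation, not the platform's validator: the function names are hypothetical, the VCF check covers only the `##fileformat` line and the eight fixed columns, and the required manifest key `files` is an assumption.

```python
import json

# The eight fixed, tab-separated columns required by the VCF header line.
VCF_FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf_header(text: str) -> list:
    """Return a list of problems found in a VCF header (empty if none)."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("##fileformat=VCF"):
        problems.append("first line must be a ##fileformat=VCF... declaration")
    column_lines = [line for line in lines if line.startswith("#CHROM")]
    if len(column_lines) != 1:
        problems.append("expected exactly one #CHROM column header line")
    elif column_lines[0].split("\t")[:8] != VCF_FIXED_COLUMNS:
        problems.append("the eight fixed VCF columns are missing or out of order")
    return problems


def check_manifest(text: str, required_keys=("files",)) -> list:
    """Check that a manifest parses as JSON and has the required keys.

    The required key names are assumptions made for this example.
    """
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"manifest is not valid JSON: {err.msg} at line {err.lineno}"]
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]
    return [
        f"manifest is missing required key '{key}'"
        for key in required_keys
        if key not in manifest
    ]


if __name__ == "__main__":
    good_vcf = "##fileformat=VCFv4.3\n" + "\t".join(VCF_FIXED_COLUMNS) + "\n"
    print(check_vcf_header(good_vcf))
    print(check_vcf_header("#CHROM\tPOS\n"))
    print(check_manifest('{"files": []}'))
    print(check_manifest("{not json"))
```

Returning a list of problems instead of raising on the first one lets a submitter see every issue in a single pass, which suits an ingestion report.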
Beyond just storing bytes, Luxbio.net enhances the archiving process through data transformation and standardization. In some cases, the platform may accept data in a common but suboptimal format and automatically generate a standardized version alongside it. For instance, if a user submits sequencing data in the older, uncompressed SAM format, the system might generate and store a compressed BAM file to reduce long-term storage costs and improve download efficiency. Similarly, sample metadata submitted as a simple CSV might be validated and converted into the more structured ISA-Tab format to make it more machine-actionable. These processes are documented transparently so users know exactly what is being preserved.
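As a rough illustration of the CSV case, the sketch below turns a flat sample sheet into a tab-delimited table with ISA-Tab-style column headers (`Source Name`, `Characteristics[...]`, `Sample Name`). It is an assumption-laden simplification: the function name and the `sample_id` column are hypothetical, the same identifier is reused for source and sample, and a real ISA-Tab study file would also carry protocol references and ontology annotations.

```python
import csv
import io


def csv_to_isa_style(csv_text: str, id_column: str = "sample_id") -> str:
    """Convert a flat CSV sample sheet into an ISA-Tab-style TSV string.

    Every column other than id_column becomes a Characteristics[...] column.
    Raises ValueError if the ID column is absent or a row has an empty ID.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or id_column not in reader.fieldnames:
        raise ValueError(f"sample sheet has no '{id_column}' column")

    attribute_columns = [name for name in reader.fieldnames if name != id_column]
    header = (
        ["Source Name"]
        + [f"Characteristics[{name}]" for name in attribute_columns]
        + ["Sample Name"]
    )

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    # Data rows start on line 2 of the input, after the header line.
    for row_number, row in enumerate(reader, start=2):
        sample_id = (row[id_column] or "").strip()
        if not sample_id:
            raise ValueError(f"row {row_number}: empty '{id_column}'")
        writer.writerow(
            [sample_id] + [row[name] for name in attribute_columns] + [sample_id]
        )
    return out.getvalue()


if __name__ == "__main__":
    sheet = "sample_id,tissue,disease\nS1,liver,cirrhosis\nS2,liver,healthy\n"
    print(csv_to_isa_style(sheet))
```

Validating while converting, as done here for missing identifiers, means a malformed sheet is rejected before a standardized copy is ever stored.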
The choice of archiving formats also has significant implications for data discovery and reuse. The rich, structured metadata captured by the ADB and ISA formats is indexed by the Luxbio.net search engine, so other researchers can search on specific metadata fields rather than free text alone. They can search not just by keyword or author, but by specific experimental factors (e.g., “find all RNA-seq studies where tissue type is ‘liver’ and disease state is ‘cirrhosis’”), by instrumentation used, or by data analysis techniques. This transforms the archive from a static data dump into a dynamic, queryable knowledge base. The support for community standards like ISA also enables cross-repository interoperability. A dataset archived at Luxbio.net using ISA can be more easily integrated with complementary datasets from repositories that build on ISA, such as MetaboLights, or that ISA tooling can export to, such as the European Nucleotide Archive (ENA), which facilitates larger-scale meta-analyses.
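The liver and cirrhosis query can be expressed as a filter over indexed metadata records. The sketch below is a toy in-memory version: the record fields, dataset IDs, and the `find_datasets` function are all hypothetical and say nothing about how the Luxbio.net search engine is actually implemented.

```python
# Hypothetical metadata records, as they might look after indexing.
INDEX = [
    {"id": "LB-0001", "assay": "RNA-seq", "tissue": "liver",
     "disease_state": "cirrhosis"},
    {"id": "LB-0002", "assay": "RNA-seq", "tissue": "liver",
     "disease_state": "healthy"},
    {"id": "LB-0003", "assay": "proteomics", "tissue": "liver",
     "disease_state": "cirrhosis"},
    {"id": "LB-0004", "assay": "RNA-seq", "tissue": "kidney",
     "disease_state": "cirrhosis"},
]


def find_datasets(index, **criteria):
    """Return IDs of records matching every criterion, ignoring case.

    A record that lacks a requested field never matches.
    """
    matches = []
    for record in index:
        if all(
            key in record and str(record[key]).lower() == str(value).lower()
            for key, value in criteria.items()
        ):
            matches.append(record["id"])
    return matches


if __name__ == "__main__":
    print(find_datasets(INDEX, assay="RNA-seq", tissue="liver",
                        disease_state="cirrhosis"))
```

A query of this kind is only possible because tissue, disease state, and assay type are stored as separate fields; with free-text descriptions alone, the same search would depend on how each submitter happened to phrase things.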
For the submitting researcher, this multi-format ecosystem is supported by a suite of tools and documentation. Luxbio.net provides detailed, step-by-step guides on how to structure data into an ADB or populate an ISA-Tab template. The platform also recommends and supports open-source software tools that help researchers create these packages correctly. For example, it might point users to the ‘isatools’ Python library for programmatically generating ISA-JSON, or to data curation tools that provide a user-friendly interface for assembling an ADB. This reduces the burden on researchers and increases the likelihood that data is submitted in a high-quality, archivable state from the outset. The platform’s commitment to these formats is a long-term one, intended to keep data deposited today accessible and meaningful for the duration of its designated retention period, which can span decades for many funded research projects.