If you have data describing new sORFs, we would be happy to integrate them into our database. However, processing data with MetamORF is computationally heavy and the integration of new data cannot be fully automated, so please contact us to discuss your submission.
Please note that:
- Data should specifically be describing short open reading frames (sORFs)
- Both computational predictions and experimental data will be considered for inclusion
- Data describing canonical open reading frames will not be considered for inclusion
- Data sources already included (see the data sources section of the documentation), in particular those that are part of the sORFs.org database (and already inserted in MetamORF), will not be considered for inclusion
Here are some additional guidelines regarding the preferred format, to facilitate the integration of your data into MetamORF. Feel free to contact us if you have any questions.
To integrate your data into MetamORF, we need the following essential information:
- The name(s) of the species in which you identified sORFs
- The method(s) you used to identify the sORFs (computational prediction, ribosome profiling, mass spectrometry etc.)
- The exact version of the genome annotation in which you provide us the coordinates
- For each set of coordinates you provide, the coordinate system in which they are expressed (e.g. 1-based system). If you are unsure about this, we strongly advise reading this article published on the UCSC blog, which provides detailed information regarding the coordinate counting systems.
- The publication we should cite for your data (DOI, PMID, URL etc.)
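To illustrate why the coordinate system matters, here is a minimal sketch of the conversion between the two conventions most commonly encountered: 0-based, half-open intervals (as used in BED files) and 1-based, fully closed intervals (as used in GFF/GTF files). The function names are hypothetical and are not part of MetamORF.

```python
from typing import Tuple


def zero_based_half_open_to_one_based_closed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval [start, end) to 1-based, closed [start, end]."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    # Only the start shifts; the end value is numerically identical in both systems.
    return start + 1, end


def one_based_closed_to_zero_based_half_open(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, closed interval [start, end] to 0-based, half-open [start, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def interval_length(start: int, end: int, one_based_closed: bool) -> int:
    """Number of nucleotides covered by the interval in the given coordinate system."""
    return end - start + 1 if one_based_closed else end - start
```

The same genomic region has the same length in both systems, which is a convenient sanity check before submitting coordinates.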
For each ORF in the dataset, the following information is expected to be provided:
- A unique identifier that specifically identifies the ORF in your dataset
- The chromosome or scaffold on which the ORF is located
- The strand on which the ORF is located
- The absolute genomic coordinates of the start position of the ORF
- The absolute genomic coordinates of the stop position of the ORF
- The absolute genomic coordinates of each start and end position of the exons constituting the ORF (when it is spliced)
- The length of the ORF (in both nucleotides and amino acids)
- The start codon sequence of the ORF
- The nucleic acid sequence of the ORF
- The amino acid sequence of the ORF (including its first amino acid)
- The identifier or name of the transcript on which the ORF is located (if the method used to identify the ORF provides such information)
- The strand of the transcript on which the ORF is located
- The absolute genomic coordinates of the start position of the transcript on which the ORF is located
- The absolute genomic coordinates of the end position of the transcript on which the ORF is located
- The absolute genomic coordinates of the start and stop position of the CDS of the transcript on which the ORF is located (if this is a coding transcript)
- The biotype of the transcript on which the ORF is located
- At least one of the identifiers (Ensembl and/or NCBI and/or HGNC etc.), names or symbols of the gene on which the ORF is located
- The biological context in which the ORF has been identified (cell line, cell type, tissue, organ, pathological condition etc.)
- The presence of a Kozak context near the start codon
- The ORF categories to which the ORF belongs, if you annotated the ORFs (such as upstream, downstream etc.)
- The ORF score of the ORF (if computed)
- The PhyloCSF score of the ORF (if computed)
- The PhastCons score of the ORF (if computed)
- The FLOSS score and FLOSS category of the ORF (if computed)
All elements marked with a are mandatory.
All elements marked with a will be computed by our algorithm if you do not provide them.
All elements marked with a will be computed by our algorithm if you do not provide them but do provide a transcript identifier.
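Several of the attributes listed above are redundant with one another (length, sequence, start codon, exon coordinates), so a quick consistency check before submission can catch coordinate or sequence errors. The sketch below is only an illustration: the field names (`nt_length`, `nt_sequence`, `start_codon`, `exons`) are hypothetical, and exon coordinates are assumed to be 1-based and fully closed.

```python
from typing import Dict, List


def check_orf_record(record: Dict) -> List[str]:
    """Return a list of human-readable inconsistencies found in one ORF record."""
    problems = []
    sequence = record["nt_sequence"].upper()

    # The declared nucleotide length should equal the length of the sequence.
    if record["nt_length"] != len(sequence):
        problems.append("nt_length does not match nt_sequence")

    # The declared start codon should be the first codon of the sequence.
    if not sequence.startswith(record["start_codon"].upper()):
        problems.append("start_codon is not the first codon of nt_sequence")

    # For spliced ORFs, exon lengths (1-based, closed) should sum to the ORF length.
    exons = record.get("exons")
    if exons:
        exon_total = sum(end - start + 1 for start, end in exons)
        if exon_total != len(sequence):
            problems.append("exon lengths do not sum to nt_sequence length")

    return problems
```

An empty list means that no inconsistency was detected among the fields checked; it does not validate the record against the genome annotation.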
Please note that comma-separated and tab-separated files in which each line describes one single ORF are preferred. FASTA and BED-like formats (with additional columns) are allowed for submission when clear metadata about the content of each attribute are provided.
If you provide information about ORFs identified:
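As an illustration of the preferred layout, the following sketch writes a tab-separated file with a header row and one ORF per line, using the Python standard library. The column names are hypothetical examples and not a schema required by MetamORF; only a subset of the attributes listed above is shown.

```python
import csv
import io
from typing import Dict, Iterable

# Hypothetical column names, chosen for this example only.
COLUMNS = ["orf_id", "chromosome", "strand", "start", "stop", "species"]


def orfs_to_tsv(orfs: Iterable[Dict]) -> str:
    """Serialize ORF records as tab-separated text: a header, then one ORF per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for orf in orfs:
        # A record with a key absent from COLUMNS raises ValueError,
        # which helps keep the header and the rows consistent.
        writer.writerow(orf)
    return buffer.getvalue()
```

Writing the text returned by `orfs_to_tsv` to a file gives a document in which the header row serves as the metadata describing each attribute.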
- in several original studies (i.e. several original publications), one file is expected for each dataset
- in several species, one file may be provided for each species. Otherwise, an attribute stating clearly the species in which the ORF has been identified is expected in the file.
- in several biological contexts or conditions, one file may be provided for each context. Otherwise, an attribute stating clearly the cell type or condition in which the ORF has been identified is expected in the file.
- for data coming from third-party databases, all information may be provided in one single file. In this case, the unique ORF identifiers of your database are expected in the file. Please also provide us with the URL schema that allows users to be redirected from MetamORF entries to the corresponding entries of your database.
Please feel free to contact us if you have any questions regarding the submission of your dataset for inclusion in MetamORF.
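For third-party databases, a URL schema can be described as a template in which the ORF identifier is substituted. The sketch below shows one possible way to express it; the `example.org` address, the `{orf_id}` placeholder and the function name are hypothetical and only serve to illustrate the idea.

```python
from urllib.parse import quote

# Hypothetical URL schema; replace it with the pattern used by your own database.
URL_SCHEMA = "https://example.org/orf/{orf_id}"


def entry_url(orf_id: str, schema: str = URL_SCHEMA) -> str:
    """Build the URL of a database entry from its unique ORF identifier."""
    # Percent-encode the identifier so that spaces or slashes cannot break the URL.
    return schema.format(orf_id=quote(orf_id, safe=""))
```

Providing the schema together with the identifiers in your file is enough to rebuild the link to every entry.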