Version 50 (modified by mah79, 5 years ago)

--

Draft flat file formats for phenotype annotations

Three use cases described so far:

  • Import into chado
    • from curation tool
    • from user-submitted genome-wide phenotype screen data (i.e. bulk submissions)
  • Export for data sharing

Whether we can use a single format for all three, or any two, of the above use cases is not completely settled. Column order is arbitrary. Header content TBD.

Drafts for multiple formats

User submissions - single file

Column | Contents | Example | Cardinality
1 | Gene systematic ID | SPBC11B10.09 | 1,>1 (See 1)
2 | FYPO ID | FYPO:0000000 | 1
3 | Allele description | G146D (See 2) | 1,>1 (See 1)
4 | Expression | endogenous (See 3) | 1,>1 (See 1)
5 | Parental strain | 972 h- (See 4) | 1
6 | Strain name (background) | SP286 (See 5) | 0,1
7 | Genotype description | h- ura4-D18 leu1-32 ade6-M210 (See 6) | 0,1
8 | Gene name | cdc2 | 0,1,>1 (See 7)
9 | Allele name | cdc2-1w | 0,1,>1 (See 7)
10 | Allele synonym | cdc2-1W (See 1) | 0,1,>1
11 | Allele type | amino acid substitution | 1 (See 8)
12 | Evidence | ECO:0000000 (See 9) | 1
13 | Condition | PECO:0000004 (See 10) | 1,>1
14 | Penetrance | %, range%, or small CV ('high', 'medium', or 'low', or corresponding FYPO_EXT ID) | 0,1
15 | Expressivity | integer + units, allowing ranges, or small CV ('high' (synonym: strong), 'medium', or 'low' (synonym: weak), or corresponding FYPO_EXT ID) | 0,1
16 | Extension | annotation_extension=assayed_using(PomBase:SPBC582.03) | 0,1,>1
17 | Reference | PMID:23697806 | 1
18 | taxon | 4896 | 1
19 | Date | 20120101 | 1
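
A minimal sketch of how a row in the single-file user submission format might be read and checked against the cardinalities above. The draft does not fix a delimiter or header, so the tab delimiter, the absence of a header line, and the key names in COLUMNS are all assumptions made for illustration. Only "is a required column empty" is checked here; multi-valued columns are left as raw strings.

```python
import csv
import io

# Keys follow the column order of the draft table; the key names themselves
# and the tab delimiter are assumptions, not part of the draft.
COLUMNS = [
    "gene_systematic_id", "fypo_id", "allele_description", "expression",
    "parental_strain", "strain_name", "genotype_description", "gene_name",
    "allele_name", "allele_synonym", "allele_type", "evidence", "condition",
    "penetrance", "expressivity", "extension", "reference", "taxon", "date",
]

# Columns whose cardinality in the table is "1" or "1,>1" (entry required).
REQUIRED = {
    "gene_systematic_id", "fypo_id", "allele_description", "expression",
    "parental_strain", "allele_type", "evidence", "condition",
    "reference", "taxon", "date",
}


def parse_submission(text):
    """Yield (record, missing_required_columns) for each row of the file."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for row in reader:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
        record = dict(zip(COLUMNS, (cell.strip() for cell in row)))
        missing = sorted(name for name in REQUIRED if not record[name])
        yield record, missing


# One row built from the example values in the table (optional columns
# 14-16 left empty).
EXAMPLE_ROW = "\t".join([
    "SPBC11B10.09", "FYPO:0000000", "G146D", "endogenous", "972 h-", "SP286",
    "h- ura4-D18 leu1-32 ade6-M210", "cdc2", "cdc2-1w", "cdc2-1W",
    "amino acid substitution", "ECO:0000000", "PECO:0000004", "", "",
    "", "PMID:23697806", "4896", "20120101",
])

for record, missing in parse_submission(EXAMPLE_ROW):
    print(record["gene_systematic_id"], record["fypo_id"], missing)
```

Returning the list of missing columns, rather than raising on the first one, is a design choice that would let a bulk upload report all problems in a row at once.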

Internal use and output -- separate entity and annotation files

  1. Entities: alleles
Contents | Example | Cardinality
Database | PomBase | 1
Allele systematic ID | SPBC11B10.09-01 | 1
Allele name | cdc2-1w (See 7) | 0,1
Allele synonym | cdc2-1W | 0,1
Allele description | G146D | 1
Gene systematic ID | SPBC11B10.09 | 1

Note: allele descriptions stored in the database may end up requiring one or more separate tables, and if so, they may be hard to represent in one column of a flat file.
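
The table footnotes describe the allele systematic ID as derived from the gene systematic ID and assigned consecutively. A rough sketch of that rule follows; the class name is hypothetical, and the two-digit zero-padded suffix is an assumption inferred only from the example value SPBC11B10.09-01.

```python
from collections import defaultdict


class AlleleIdAssigner:
    """Hand out allele systematic IDs of the form <gene systematic ID>-<NN>.

    Hypothetical helper: the draft only says the IDs are derived from the
    gene systematic ID and assigned consecutively.
    """

    def __init__(self):
        # Next suffix to use for each gene, starting at 1.
        self._next = defaultdict(lambda: 1)

    def assign(self, gene_systematic_id):
        number = self._next[gene_systematic_id]
        self._next[gene_systematic_id] = number + 1
        # Two-digit padding is assumed from the example "SPBC11B10.09-01".
        return f"{gene_systematic_id}-{number:02d}"


assigner = AlleleIdAssigner()
first = assigner.assign("SPBC11B10.09")
second = assigner.assign("SPBC11B10.09")
print(first, second)
```

In practice the counter would need to be seeded from the IDs already stored in the database rather than starting at 1.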

  2. Entities: genotypes
Contents | Example | Cardinality
Database | PomBase | 1
Genotype ID | 123456 | 1
Genotype name/description | h- ura4-D18 leu1-32 (See 7) | 1
Allele systematic ID | SPBC11B10.09-01 (See 11) | 1,>1
Strain background | 972 h- (OR strain ID, if we house them in the db) | 1

Note: include diploid, and heterozygous vs. homozygous at locus/loci of interest, in description where applicable.

  3. Entities: strains
Contents | Example | Cardinality
Database | PomBase | 1
Strain ID | 123456 | 1
Strain name | 972 h- | 1
Strain description | (some text, I guess) | 0,1

Questions:

  • Split or lump the entities?
  • Have a file, or entries in a "lumped" file, for strains? If no, don't need file format No. 3.
  4. Annotations
Contents | Example | Cardinality
Database | PomBase | 1
Genotype ID | 123456 | 1
Expression | endogenous (See 3; also see Outstanding Issues) | 1
FYPO ID | FYPO:0000000 | 1
Evidence | ECO:0000000 (See 9) | 1
Reference | PMID:3502942 | 1
Condition | at high temperature (See 10) | 0,1
Penetrance | %, range%, or small CV ('high', 'medium', or 'low', or corresponding FYPO_EXT ID) | 0,1
Expressivity | integer + units, allowing ranges, or small CV ('high' (synonym: strong), 'medium', or 'low' (synonym: weak), or corresponding FYPO_EXT ID) | 0,1
Extension | annotation_extension=assayed_using(PomBase:SPBC582.03) | 0,1,>1
Date | 20120101 | 1
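
In the separate-file layout, an annotation points at a genotype by ID, and a genotype lists one or more allele systematic IDs. The sketch below shows one way those links could be followed once the files are loaded. The class and field names are assumptions for illustration, only a subset of the columns is modelled, and strains are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allele:
    systematic_id: str
    description: str
    gene_systematic_id: str
    name: str = ""  # allele name is optional (cardinality 0,1)


@dataclass(frozen=True)
class Genotype:
    genotype_id: str
    description: str
    allele_ids: tuple  # one or more allele systematic IDs
    strain_background: str


@dataclass(frozen=True)
class Annotation:
    genotype_id: str
    fypo_id: str
    evidence: str
    reference: str
    expression: str
    date: str


def genes_for_annotation(annotation, genotypes, alleles):
    """Follow annotation -> genotype -> alleles and return the gene IDs."""
    genotype = genotypes[annotation.genotype_id]
    return [alleles[allele_id].gene_systematic_id
            for allele_id in genotype.allele_ids]


# Lookup tables keyed by ID, filled with the example values from the tables.
alleles = {
    "SPBC11B10.09-01": Allele("SPBC11B10.09-01", "G146D", "SPBC11B10.09", "cdc2-1w"),
}
genotypes = {
    "123456": Genotype("123456", "h- ura4-D18 leu1-32", ("SPBC11B10.09-01",), "972 h-"),
}
annotation = Annotation("123456", "FYPO:0000000", "ECO:0000000",
                        "PMID:3502942", "endogenous", "20120101")

print(genes_for_annotation(annotation, genotypes, alleles))
```

Note that expression sits on the annotation here, as in the draft table, so the sketch inherits the open question of which allele the expression value refers to when a genotype has more than one.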

Table footnotes:

  1. Cardinality is one for single genes, >1 for multiple genes (double mutants, triple mutants, etc.). Also see Outstanding Issues.
  2. May include 'deletion' or residue description (see DescribingResidues); may also have SO ID; see Outstanding Issues.
  3. Allowable values: 'overexpression', 'knockdown', 'endogenous', 'null', 'not specified'. Deletions should always have 'null' expression. (Added in 2013-09-09 update)
  4. This column is for the ancestral lab background, usually 972 h- or the isogenic h+ or h90. An entry in this column is mandatory, but "unknown" will be an allowable value, in case submitters don't actually know the background. In a substantial majority of cases, the parental background will be 972 h-, so these will be the defaults in the curation tool. Users will be able to change h- to h90 (968) or h+ (975) upon entering allele details, and can contact curators if they have anything non-standard. Status as at 2012-01-25 curator meeting: After discussing options -- e.g. gathering selectable marker or other background details -- we have decided to use the defaults noted above, and try to capture "nutritional" or other marker alleles only when they are "of interest" in an experiment (e.g. when having particular markers present makes a difference to a phenotype, or when a phenotype is only relevant in one mating type). Also see "older notes" below. Update 2013-09-09: added bit about allowing "unknown"; clarified split between this and new column 3, i.e. that this is for the ancestral lab background, usually 972 h- or the isogenic h+ or h90.
  5. Use this column for a lab's in-house name/ID/designation for the background strain (i.e. the derivative of the parental strain that has selectable marker alleles etc.). (Split from column 2 in 2013-09-09 update)
  6. List of alleles present in the background strain, such as selectable marker alleles (don't include the "allele(s) of interest"). If no name is used for an allele in the literature (should be rare for background alleles), can use allele systematic ID in this column.
  7. If no name is used for an allele in the literature, can use allele systematic ID in this column. Cardinality is one for single genes, >1 for multiple genes (double mutants, triple mutants, etc.). Also see Outstanding Issues.
  8. Supported allele types: 'wild type', 'deletion', 'mutation of single amino acid residue', 'mutation of multiple amino acid residues', 'partial deletion, amino acid', 'partial deletion, nucleotide', 'mutation of a single nucleotide', 'nonsense mutation', 'other', 'unknown'.
  9. ECO subset (note: waiting for ECO additions)
  10. Phenotype Experimental Conditions Ontology (PECO) ID(s)
  11. Derived from gene systematic ID; assigned consecutively
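
The controlled values listed in the footnotes for expression and allele type, plus the rule that deletions should always have 'null' expression, lend themselves to a simple check. The following is a sketch only: the function name is hypothetical, and exact string matching (case-sensitive, no synonyms) is an assumption.

```python
# Allowable expression values, as listed in the footnotes.
EXPRESSION_VALUES = {
    "overexpression", "knockdown", "endogenous", "null", "not specified",
}

# Supported allele types, as listed in the footnotes.
ALLELE_TYPES = {
    "wild type", "deletion", "mutation of single amino acid residue",
    "mutation of multiple amino acid residues", "partial deletion, amino acid",
    "partial deletion, nucleotide", "mutation of a single nucleotide",
    "nonsense mutation", "other", "unknown",
}


def check_allele(allele_type, expression):
    """Return a list of problems found (empty if the pair is acceptable)."""
    problems = []
    if allele_type not in ALLELE_TYPES:
        problems.append(f"unsupported allele type: {allele_type!r}")
    if expression not in EXPRESSION_VALUES:
        problems.append(f"unsupported expression value: {expression!r}")
    if allele_type == "deletion" and expression != "null":
        problems.append("deletions should always have 'null' expression")
    return problems


print(check_allele("deletion", "null"))
print(check_allele("deletion", "endogenous"))
```

A check this strict would flag the example allele type in the user-submission table, 'amino acid substitution', because that string is not in the supported list; the draft would need to settle on one wording.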

++-- old numbering --++

  1. Names of alleles (as in note 3) present in the background genotype
  2. An entry in this column is mandatory, but "unknown" will be an allowable value, in case submitters don't actually know the background. In a substantial majority of cases, the parental background will be 972 h-, so these will be the defaults in the curation tool. Users will be able to change h- to h90 (968) or h+ (975) upon entering allele details, and can contact curators if they have anything non-standard. Status as at 2012-01-25 curator meeting: After discussing options -- e.g. gathering selectable marker or other background details -- we have decided to use the defaults noted above, and try to capture "nutritional" or other marker alleles only when they are "of interest" in an experiment (e.g. when having particular markers present makes a difference to a phenotype, or when a phenotype is only relevant in one mating type). Also see "older notes" below.
  3. If no name used in literature, can use allele systematic ID in this column. Cardinality is one for single genes, >1 for multiple genes (double mutants, triple mutants, etc.). Also see Outstanding Issues.
  4. May include 'deletion', or residue description (see DescribingResidues); may also have SO ID; see Outstanding Issues.
  5. If the user wants to submit an alternative name for an allele
  6. ECO subset (note: waiting for ECO additions)
  7. Phenotype Experimental Conditions Ontology (PECO) ID
  8. Derived from gene systematic ID (assigned consecutively)

++-- end old numbering --++

Older notes on strain background

  • Based on a conversation with Juan Mata, maybe we should just have i) parental strain (will almost always be h90/972 h-) plus mating type switching status (h+/-) and selectable markers or other background info. Can be background strain or maybe a lab's designation? (val: I would avoid lab designations if possible; it will be too much.) Also see Outstanding Issues.
    • People often don't know the background strain. They are mostly from the same parental strain originally, but people don't know the history, so this definitely needs to be optional.
    • Should a mutation which does not directly affect the results of the experiment be included in the background? This should not preclude people capturing asynthetic genetic interactions (i.e. the gene A phenotype is the same as the gene B phenotype, so we conclude that they are acting in the same pathway; there is no real genetic interaction, i.e. change, but here you are studying the effect of both genes).

Outstanding issues

  • Does column order matter? (especially as I've put a couple of not-mandatory-for-user-subs columns first)
  • Q: Should multiple entries be separated by commas or pipes (or other delimiter)?
  • For double/triple/etc. mutants, we need to be able to combine entries in gene ID, gene name, allele name, allele description and allele synonym (e.g. gene IDs and allele names or descriptions) to formulate the genotype, so multiple entries in those columns must appear in the same order.
  • At present (i.e. as of 2013-09-09) we're only supporting bulk submissions of single-gene phenotype data. User version of documentation: http://www.pombase.org/submit-data/phenotype-data-bulk-upload-format
  • We will sometimes have to capture (input and output) complex allele descriptions.
    • For output, we'll have allele IDs, so what goes in the description column is just for convenience and can be incomplete if necessary
    • For user submissions, what should we collect? Not realistic to expect SO ids ...
  • How to handle expression in multiple-file scenario? Have put it with the annotation format for now, but that doesn't really capture which allele(s) the expression refers to. Don't think it belongs in the allele table, though, because expression isn't inherent in allele features (except for complete deletions). Maybe in genotype file somehow? (2013-09-09)
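
The ordering requirement for double/triple mutants can be illustrated with a short sketch: if each multi-valued column is split on the same delimiter, the Nth entry of every column describes the Nth allele. Since the delimiter is still an open question above, it is a parameter here, and the pipe default is an arbitrary choice. The function name and dictionary keys are hypothetical, and the sketch assumes all four columns are filled in for every allele.

```python
MULTI_VALUED = (
    "gene_systematic_id", "gene_name", "allele_name", "allele_description",
)


def split_alleles(record, delimiter="|"):
    """Pair up the Nth entry of each multi-valued column into one allele."""
    split = {
        name: [value.strip() for value in record[name].split(delimiter)]
        for name in MULTI_VALUED
    }
    # Every column must carry the same number of entries, in the same order.
    if len({len(values) for values in split.values()}) != 1:
        raise ValueError("multi-valued columns have different numbers of entries")
    per_allele = zip(*(split[name] for name in MULTI_VALUED))
    return [dict(zip(MULTI_VALUED, entry)) for entry in per_allele]


# Illustrative double mutant combining example values used on this page.
record = {
    "gene_systematic_id": "SPBC11B10.09|SPCPB16A4.03c",
    "gene_name": "cdc2|ade10",
    "allele_name": "cdc2-1w|ade10-1",
    "allele_description": "G146D|deletion",
}

for allele in split_alleles(record):
    print(allele["gene_systematic_id"], allele["allele_name"], allele["allele_description"])
```

Rejecting rows whose columns have different entry counts is the only consistency check possible from position alone; a swapped order with matching counts would go undetected, which is one argument for carrying allele IDs in the output files.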

Additional questions:

  • Should the file(s) include information about changes (but not info in allele description)? We think we should include selectable markers to be used in genotype description...this will get complicated
  • Will this be used in construction of genotype? If so, how?
  • What happens with legacy data where we don't have this info without checking?
  • Syntax?
  • Should the format specification indicate which columns are essential for curation tool --> chado import? May not be exactly the same columns as we'd need for a bulk user submission, given that the tool will assign some things (e.g. allele systematic ID, genotype ID) if they're not provided by the user.

Draft for single format

From early Jan. 2012; superseded by the above, and saved merely for the record as of 2012-01-23.

Column | Contents | Example | Input cardinality | Output cardinality | Mandatory for user submission?
1 | Database | PomBase | 0,1 | 1 | No
2 | Genotype ID | 123456 | 0,1 | 1 | No
3 | Genotype name/description | 972h-,ade10-1 | 0,1 | 1 | No
4 | Strain background | 972 h- (See 4) | 1 | 1 | Yes
5 | Gene systematic ID | SPCPB16A4.03c | 1,>1 (See 1) | 1,>1 | Yes
6 | Gene name | ade10 | 0,1,>1 (See 7) | 1,>1 | No
7 | Allele systematic ID | SPCPB16A4.03c-01 (See 11) | 0,1,>1 (See 1) | 1,>1 | No
8 | Allele name | ade10-1 | 0,1,>1 (See 7) | 1,>1 | No
9 | Allele description | deletion (See 2) | 1,>1 (See 1) | 1,>1 | Yes
10 | Allele synonym | ade101 (See 7) | 0,1,>1 | 0,1,>1 | No
11 | FYPO ID | FYPO:0000000 | 1 | 1 | Yes
12 | Evidence | ECO:0000000 (See 9) | 1 | 1 | Yes
13 | Condition | at high temperature (See 10) | 0,1 | 0,1 | No
14 | Penetrance | %, range%, or small CV ('high', 'medium', or 'low') | 0,1 | 0,1 | No
15 | Expressivity | integer + units, allowing ranges, or small CV ('high' (synonym: strong), 'medium', or 'low' (synonym: weak)) | 0,1 | 0,1 | No
16 | Date | 20120101 | 1 | 1 | Yes

Note: footnotes same as in tables for newer drafts above (updated 2013-09-09)

Note: forgot to include reference in this draft!!