PomBase Planning Meeting 03 July 2013

Cambridge - Meeting Room B, Sanger Building

Present: Steve, Paul, Dan, Mark, Midori, Val, Antonia
Remote: Kim


Update cycle and automated update pipeline

Dan explained the difference between the various tests they run to make sure that pages on the website behave and look as they should. These are:

  • Standard health checks – Mark runs these continuously by going to different pages and checking them manually.
  • Selenium testing (Dan) – Dan has written an automated test that runs a script against the website. It currently covers the gene page, simple and advanced search, plus the existing speed test. A good basic test is up and running.
  • Manual smoke testing – Dan does this manually and has completed a first-draft test sheet.

Hosting large scale datasets

An FTP directory has been created to support incoming large-scale datasets. Dan can provide the single username and password required for uploading to . We need to document how to submit data to it (acceptable data types for each file format, etc.). There is a web form intended for large-scale data submissions, but it hasn't been used yet. Midori reckons that the instructions already online don’t need much changing, but we might want to make the link more prominent. Val thinks we should reword the link to something including “HTP..”

Paul said that he is reluctant to ask for data until some form of “validation software” is in place. This software would check that submitted files contain data compatible with what we can accept and display properly. It can be very simple to begin with. Val is keen to get going with collecting data and points out that locating the data ourselves would be a full-time job. If the community tells us where the data is, what format it is in, and the reference, that will already make our job a lot easier.

Users should be able to upload and visualize their data privately.

Researchers doing transcriptome experiments should submit their read data to a public archive (EMBL/GenBank, ENA or ArrayExpress). Once this is done, we can display it using the accessions. The problem is that submitters often don’t supply the alignment, but we could produce these ourselves (Dan points out that we haven’t done this for pombe yet).

So for each submission we need either an accession number (RNA-seq data) or a file. The accession number can be from ENA, GEO or ArrayExpress. Acceptable file formats include GFF3, BAM and VCF. We also require the genome assembly date and the PMID. If there is an accession number, the file should already have been validated.

Q: Should we allow users to submit RNA-seq files, or ‘force’ users to submit their RNA-seq data elsewhere to get an accession number?
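
The requirements above could be checked by a very simple first-pass validator, in the spirit of the “validation software” Paul asked for. The Python below is a minimal sketch only: the function name and the field names (accession, file_name, assembly_date, pmid) are assumptions made for illustration and do not describe an existing PomBase form or API. It checks file names only, not file contents.

```python
# Hypothetical first-pass check for a large-scale data submission.
# Field names are illustrative; they do not come from an existing PomBase form.

ACCEPTED_EXTENSIONS = (".gff3", ".bam", ".vcf")  # formats named in the minutes


def check_submission(record):
    """Return a list of problems found in a submission record (a dict)."""
    problems = []
    accession = str(record.get("accession") or "").strip()
    file_name = str(record.get("file_name") or "").strip()

    # Each submission needs an accession number or a file.
    if not accession and not file_name:
        problems.append("need an accession number or a file")

    # A submitted file must be in one of the accepted formats.
    if file_name and not file_name.lower().endswith(ACCEPTED_EXTENSIONS):
        problems.append("file must be GFF3, BAM or VCF")

    # Assembly date and PMID are required in every case.
    if not str(record.get("assembly_date") or "").strip():
        problems.append("missing genome assembly date")
    if not str(record.get("pmid") or "").strip().isdigit():
        problems.append("missing or non-numeric PMID")

    return problems
```

An empty list means the record passed these basic checks; anything else can be shown back to the submitter as a to-do list.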

  • sub-AI: Short term solution = Change form from ‘submit data’ to something that lets people point us to the data + describes the data types at the source (genome version…). Make this in more of an ‘advert’ style format. Rename the link. (Midori) (addendum: link rename DONE on dev -mah, 20130704; rest is in

Feedback pombe 2013

  • PomBase gene page feedback

We had feedback that the gene pages still load too slowly. Apparently it is worse in the US than here. Paul reckons we can still make them load faster. Another option is to look at external hosting services (e.g. Amazon).

Many people miss the ‘others’ link. The numbers themselves can be pre-cached so as not to slow down the site further.

We also discussed providing a short description for each gene, with a bit more meat in it than the product description. Antonia has started this. The idea is a single sentence of roughly 15-25 words.

We decided to change the gene page order to GO MF, then BP, then CC, then phenotype.

We have had interest in restriction enzyme site mapping. Paul said that these can be calculated if he knows what the sites are. Val said she could check Gbrowse. Midori checked and emailed: “REBASE is still alive, well, and being updated. They have an ftp site where we can download various files. This page has links: I suspect Format #8 might be of particular interest.”
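
As Paul noted, site positions can be calculated once the recognition sequences are known. A minimal Python sketch follows, assuming plain unambiguous recognition sequences (no degenerate bases) and a hypothetical function name. EcoRI (GAATTC) appears in the usage note only as a familiar example; a real track would presumably take its sites from the REBASE files.

```python
def find_sites(sequence, recognition_site):
    """Return 1-based start positions of every occurrence of a site.

    Overlapping occurrences are reported. Only the given strand is
    searched, which is sufficient for palindromic sites such as GAATTC.
    """
    sequence = sequence.upper()
    site = recognition_site.upper()
    if not site:
        return []
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)  # convert to 1-based coordinates
        start = sequence.find(site, start + 1)
    return positions
```

For example, find_sites("AAGAATTCTTGAATTC", "GAATTC") returns [3, 11]. Non-palindromic sites would also need a search on the reverse complement.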

AI: Chado future long term flag, reinstate pre-cached ‘others’ column that links to page where all genes are shown. (
AI: Change gene page order to GO MF BP CC phenotype
AI: Antonia to get going with the list of gene captions
AI: Kim to write a loader for the flatfile of gene captions
AI: restriction enzyme mapping included as a track.

Frequent errors displayed on gene pages

Dan: Eugene can sort this out after the next Ensembl release. Currently it is on the same server as 1000 Genomes. There is another server that can be used.
AI: Fix the Drupal errors (

Graphs are misleading without all relationships

Val: it is a problem that we don’t show the same relationships as other databases. Mark says that we load all relationships but restrict the display to is_a and part_of. Paul thinks that ‘what is displayed’ should be configurable. Midori points out that the release generates ‘everything’ and ‘simplified’ versions. OLS is the simplest or a simple version. Mark says that OLS is not the one we load but he is not sure what version we use.
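
Paul's suggestion that ‘what is displayed’ be configurable could look something like the sketch below: all relationships stay loaded, and a configurable set of relation types decides which edges reach the graph. The edge tuples and the function name are hypothetical; is_a and part_of are the defaults only because those are the relations currently displayed.

```python
DEFAULT_DISPLAYED = frozenset({"is_a", "part_of"})


def edges_to_display(edges, displayed=DEFAULT_DISPLAYED):
    """Keep only edges whose relation type is configured for display.

    Each edge is a (child_term, relation, parent_term) tuple.
    """
    return [edge for edge in edges if edge[1] in displayed]
```

Widening the display is then a configuration change, e.g. passing displayed={"is_a", "part_of", "regulates"}, rather than a change to what is loaded.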

  • AI: relationships for graph generation
    • AI: Mark to tell Midori which version we load.
    • AI: Midori to tell Mark which one we should use.
    • AI: curators to document what relations we want to show
    • AI: Mark looks into making the graphs show them

All actions rolled into

Keeping front page news current

Dev to live should sync daily at 10.00. Midori points out that that did not happen this morning.

AI: Mark to check with Eugene that syncs happen on the days that they should happen.

AI: Change text on front page from about pombase (Mark)
AI: community curation launch (menu links) and

Update pipeline and ftp site

Val wanted to know if we can get stats for how many times particular files are downloaded. Paul says that the information is stored, so it can be done at some point. He wants to leave it until some time after GeneDB has been decommissioned.

AI: future: Generate stats for how many times particular files are downloaded from the FTP site. Wait until some time after GeneDB decommissioning.

Val also wanted to know why the ontology wasn’t synced with the annotations that went live. Mark said that the term name was missing because it wasn’t loaded into the Ensembl genome DB. He reckons he may have loaded from an older file by mistake, as he normally runs all updates at the same time.

Ensembl supports the GFF3 file format, not GTF, and we should document for our users what this format contains. From this Friday onwards, GFF3 files will be generated as part of the Ensembl release, and PomBase can use those. We can also tell our geeky users that they can use the REST resource to bulk download data.
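
For the FAQ, it may help to show what a single GFF3 line holds. GFF3 feature lines have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with the attributes written as semicolon-separated key=value pairs. A minimal Python sketch follows; the function name is hypothetical, and URL-escaped characters in values are not decoded.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment/directive lines
    (those starting with '#').
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    feature = dict(zip(GFF3_COLUMNS, fields))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9: semicolon-separated key=value pairs; "." means no attributes.
    attributes = {}
    if feature["attributes"] != ".":
        for pair in feature["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = value
    feature["attributes"] = attributes
    return feature
```

Given a made-up line such as "I<TAB>example<TAB>gene<TAB>100<TAB>200<TAB>.<TAB>+<TAB>.<TAB>ID=gene1;Name=abc1", the result has type "gene", start 100, end 200 and attributes {"ID": "gene1", "Name": "abc1"}.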

AI: Make FAQ item re what information the GFF3 files contain. (
AI: Make FAQ item for how to use the RESTful interface to bulk download data (Dan or Mark
AI: Val to make Jira ticket to remind Mark to use “this file” (the one generated by Ensembl?)
AI: Make custom sequence download FAQ of how to do this in BioMart (Antonia and will pass to Midori). DONE


  • AI: Mark needs to know when new SO qualifiers (like TR box) are added (open jira item for Kim to generate a list) (
  • AI: Remove ':pep' from identifiers in cDNA FASTA file (Kim)
  • AI:? Discussed moving to a schema similar to that used for transcripts, so that proteins are named systematic_ID.1, .2 etc. to match alternative transcripts and are differentiated only by "type". Any changes related to this are postponed until after 2013, which gives us time to decide whether this is what we really want to do (Kim did not sound keen).
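
The ':pep' removal could be done with a small filter over the FASTA header lines. A minimal Python sketch follows, assuming the suffix sits at the end of the identifier, taken here to be the header text up to the first space; the function name and the identifiers in the usage note are hypothetical.

```python
def strip_pep_suffix(fasta_lines):
    """Yield FASTA lines with a trailing ':pep' removed from each identifier.

    Only header lines (starting with '>') are changed; any description
    after the identifier and all sequence lines pass through unchanged.
    Lines are yielded without their trailing newline.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            identifier, sep, description = line[1:].partition(" ")
            if identifier.endswith(":pep"):
                identifier = identifier[:-len(":pep")]
            line = ">" + identifier + sep + description
        yield line
```

For instance, a header ">geneX.1:pep some description" becomes ">geneX.1 some description", while ">geneY.1" is left alone.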


Paul: compara version does not exist.
Dan: we also need to capture the GO file.
Val: won’t be necessary, done with datestamp.
Midori: Should update GAF from GO subversion repository, which has date-stamps and svn revision numbers. We could use either, but date might be friendlier to users. (There's usually more than one GO svn commit per Chado version, but that shouldn't pose a problem for us as long as we track which date or revision we're using.)
Val: what is INSDC version?
Dan: it is an assembly database. “GCA_0000..2” There is also another assembly version.
Val: aren’t there more than two?
Dan: number 1 has no mitochondrial genes. Number 2 has mitochondrial genes.
Paul: the current assembly files are from 2007.
Val to find out about anything prior to this. Val can then provide the numbers that predate the current versioning system. Val also wants to know what “gene build” refers to; Dan to send her the details.
Make JIRA tracker item. We start from v34. Can people access historical pages? We went live ‘officially’ with 29. No sequence changes since 2007.

AI: Generate a web page at update time populated with the following info:

There will be a major version number for the community, which will follow Chado version numbers (34 etc.). For each version we will record the following on an automatically generated PomBase documentation page:

  • INSDC assembly (currently 2)
  • gene build 2.1 etc
  • annotation version will be the Chado version
  • Ensembl software version
  • GO GAF file date
  • InterPro release
  • pombe cerevisiae ortholog table version etc

This means:

  • Users can report the Chado version for any analysis, and the other components will be traceable.
  • Users can easily check which components other than the functional annotation (which will change every release) have changed, to see whether a particular type of analysis is affected. For example, an analysis that depends only on gene structures will need to be updated less frequently.
  • For legacy data, users can continue to use the file date stamp.
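
One way to picture the automatically generated page is as a small per-release record that is rendered to text at update time and can be compared with the previous release. The Python below is only a sketch: the class, function and field names are assumptions, and any values passed in would come from the update pipeline. Only the list of components is taken from the list above.

```python
from dataclasses import dataclass, asdict


@dataclass
class ReleaseManifest:
    chado_version: int            # major version number / annotation version
    insdc_assembly: str
    gene_build: str
    ensembl_software: str
    go_gaf_date: str
    interpro_release: str
    ortholog_table_version: str

    def render(self):
        """Return the manifest as plain text, one component per line."""
        rows = [
            ("PomBase version (Chado)", self.chado_version),
            ("INSDC assembly", self.insdc_assembly),
            ("Gene build", self.gene_build),
            ("Ensembl software version", self.ensembl_software),
            ("GO GAF file date", self.go_gaf_date),
            ("InterPro release", self.interpro_release),
            ("Ortholog table version", self.ortholog_table_version),
        ]
        return "\n".join("%s: %s" % row for row in rows)


def changed_components(old, new):
    """Return the field names whose values differ between two releases."""
    old_values, new_values = asdict(old), asdict(new)
    return [name for name in old_values if old_values[name] != new_values[name]]
```

A user whose analysis depends only on gene structures could then look at changed_components(previous, current) and rerun the analysis only if gene_build or insdc_assembly appears in the result.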

AI: Val to find assembly files pre-dating 2007.


  • ETA/effort estimate for reciprocal interaction annotations (PB-873)?

AI: Kim to check whether these are reciprocally represented in Chado. Revisit PB-873 at next planning meeting.
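
Checking whether interactions are reciprocally represented amounts to looking for pairs stored in one direction only. A minimal Python sketch follows, assuming the interactions can be exported as (gene_a, gene_b) pairs; the function name and the gene identifiers in the usage note are hypothetical, and no particular Chado table or query is implied.

```python
def missing_reciprocals(interactions):
    """Return the sorted (b, a) pairs that are absent although (a, b) is present.

    Self-interactions (a, a) are their own reciprocal and are ignored.
    """
    present = set(interactions)
    missing = {(b, a) for (a, b) in present if a != b and (b, a) not in present}
    return sorted(missing)
```

For the pairs [("g1", "g2"), ("g2", "g1"), ("g1", "g3")], the result is [("g3", "g1")], i.e. the one annotation that would need to be added.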


Do people delete invitation-to-curate emails because they look automatically generated? The email might also be too long; Antonia had a comment on this.
AI: check with Kim whether we can make them look as though they come from a person rather than being semi-automated.

Other general issues

  • transcript type of TR box. Mark: I have to add it to a list of features that defines which side of the fence each one falls on: gene/transcript/translation, simple feature, or something else that is used elsewhere but does not need to be considered a gene or simple feature (for example, chromosomes are treated as a special case).
    • This sounds as though it could be easily automated
      • note promoter should be part of gene, not transcript

Q- AI:? Already covered above


  • Community curation

Co-curation is starting with Sara Mole's lab next month, then other London labs and Warwick for starters. Antonia to follow up with Sara Mole's lab to see how well co-curation works.

AI: future: Antonia to follow up on co-curation

Action items

Action Items carried forward from previous meetings

  • AI Check all required files are present and correct on ftp site (Val) Will be done for next meeting
  • AI Curators to draw up a script for the curation video. What steps do we want to show on the video? Discuss at next curator meeting. (Part Done)
  • AI Mark to continue work on automated release pipeline with regression testing. Plan is to have a pipeline ready for the next meeting (Mark)

New Action Items

  • AI: future: Antonia to follow up on co-curation

Postponed Action Items

  • Demo of curation tool (postponed)
  • Mark to liaise with Giulietta to help with making a Pombe community curation video (postponed until curators have script)
  • Make a validation system that lets people submit data in a format already made usable to us (Paul?)
  • Dan and Mark to see which files downloadable from PomBase have previously been generated by Ensembl. These should be easy to create automatic updates for (In progress?)
  • making Artemis applet available
  • Need to document how users can visualize their data privately (who?)
  • Make a validation system that checks the data that people submit so that it conforms to what we accept (Paul?)
  • Sort out display for gene captions/summaries on gene pages (Mark)
  • Paul to see to it that the restriction enzyme mappings get calculated (if REBASE doesn’t work out) and/or get included as a track.
  • Antonia to follow up on co-curation once Sara is back from holiday (Antonia)
Last modified on Jul 5, 2013, 11:14:36 AM