Sunday, September 6, 2009

Adventures in the galaxy Pt.2

Update 09092009

The project on github had to be updated because the user-group association was not created.

Introduction

This is part two of the series on creating an importer for galaxy. Part one showed the basic setup for checkout and development, and how to bootstrap the galaxy system. This part introduces an implementation of an importer.

Importer test drive

Check out a branch of galaxy with the importer included and take it for a test drive.

hg clone http://bitbucket.org/ido/galaxy-central-importer galaxy-central-importer
cd galaxy-central-importer
hg branches #hg seems to checkout default branch
hg update -C importer
sh setup.sh 
./run.sh #end with ctrl-c after startup completes (serving on ....)
./importer.sh #starts the importer wait for "done importing"
./run.sh

Log in as u:name1@ p:123456 and browse the library hierarchy (yes, the user name really ends with @). The importer creates one library for each group and, within this library, a folder for each person. Subfolders are created for each sequenced sample (a unique combination of flowcell and lane). Each analysis of a sequenced sample is represented by files within that subfolder. Left-click on a dataset and select "Edit...". Unfortunately, it is possible to edit some attributes (name, info message, data type) that I don't think should be editable, but fortunately it is also possible to change the "Notes" text field. The text field comes from a form that is automatically associated with each dataset. There are also metadata attributes giving experimental information, but these seem to be visible only when editing (I think it's a bug in galaxy).
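The layout described above can be sketched as a small function that maps one analysis result to its place in the library hierarchy. This is only an illustration: the record fields (group, person, flowcell, lane, filename) and the naming scheme for the sample folder are assumptions, not the names used in the importer.

```python
from collections import namedtuple

# Hypothetical record for one analysis result; the field names are assumptions.
Analysis = namedtuple("Analysis", "group person flowcell lane filename")


def library_path(analysis):
    """Return (library, person folder, sample folder, dataset name)."""
    # One sequenced sample is a unique combination of flowcell and lane.
    sample_folder = "%s_lane%d" % (analysis.flowcell, analysis.lane)
    return (analysis.group, analysis.person, sample_folder, analysis.filename)


def build_tree(analyses):
    """Group dataset names into a nested dict: library -> person -> sample."""
    tree = {}
    for a in analyses:
        library, person, sample, dataset = library_path(a)
        samples = tree.setdefault(library, {}).setdefault(person, {})
        samples.setdefault(sample, []).append(dataset)
    return tree
```

Two analyses of the same flowcell and lane end up as two files in the same sample folder, which matches the hierarchy you see when browsing the libraries.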

Logging in as u:name2@ p:123456 or u:tamir@ p:123456 shows galaxy's RBAC in effect: only group2 libraries are visible for name2@, but all libraries are visible for tamir@, who is a member of the group bioinformatics.

The galaxy database folder is still small, because only the paths of the files were imported and their metadata was set by the importer.

Building blocks

The scripts for the importer are in the importer subfolder.

file               purpose
importerCommon.py  utility functions
importerModel.py   simple generic object model for NGS and its results
importer.py        real worker
sampleIO.py        functions for sample.xml parsing
sampleIOTest.py    unit test for sample xml parsing
users_lib.py       functions for user and group management
import_lib.py      functions for creation of libraries, folders, datasets etc...
__init__.py        standard python
samples.xml        example xml file containing the information to import

The sample* files are specific to our setup; the rest is rather generic. They encode the converter from the persistence solution (RDBMS, CSV files etc...) to the simple object model that is represented in importerModel. In this case the data is encoded in one XML file.

The interesting functions are in importer_lib.py and users_lib.py. They represent the basic building blocks, and you can combine them to create a completely different layout for your importer. Galaxy still organizes library information in the traditional and slightly outdated filesystem hierarchy metaphor, but alternatives, e.g. tagging, could be possible.

RBAC

This is an important point. The importer also changes the RBAC setup of the standard galaxy distribution. It does this in three files: lib/galaxy/model/mapping.py, where it defines an additional implementation of GalaxyRBACAgent called IMPGalaxyRBACAgent; lib/galaxy/security/__init__.py, where the IMPGalaxyRBACAgent is initialized; and lib/galaxy/web/controllers/admin.py, where a check for role intersection is taken out.

Why is this necessary?

It's not necessary for you, and you can create an importer without these modifications. Galaxy security requires, for the "access" permission, that a user has all the roles that are associated with the item. For our requirements this would have led to ad-hoc roles, so it was modified.
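To make the difference concrete, here is a minimal sketch of the two access rules using plain sets. The first function follows the stock rule as described above. The second is an assumption about the relaxed rule (one shared role is enough); it is not copied from IMPGalaxyRBACAgent, and both function names are made up for this illustration.

```python
def can_access_all_roles(user_roles, item_roles):
    """Stock rule as described above: the user needs every role on the item."""
    return set(item_roles) <= set(user_roles)


def can_access_any_role(user_roles, item_roles):
    """Assumed relaxed rule: one shared role is enough.

    An item without any roles is treated as public here (an assumption).
    """
    item_roles = set(item_roles)
    return not item_roles or bool(item_roles & set(user_roles))
```

With the stock rule, a library carrying both a group role and a bioinformatics role would be hidden from plain group members unless an extra ad-hoc role were created for it. With the relaxed rule, either role grants access.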

Idempotent

The importer as it is should be idempotent. Another run on the same data should not change anything in the database besides file paths. We move analysed files (and their directories) to another filesystem after they are checked for quality and found good. Only the paths of moved files should be updated, no new item should be created, so the analysis history associated with the file should still be valid. This is done by encoding the invariant part of the path to the file as a hash and putting it into the description field. Database insert functions also check for existence of an item before they create it (*_if_not_exists). This is a useful hack so we don't have to keep track of updates. But new items will of course be inserted into the db.
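The idea can be sketched in a few lines. This is not the importer's code: a plain dict stands in for the database, and the assumption that the last two path components (sample directory and file name) survive a move is mine. The function names are hypothetical as well.

```python
import hashlib


def path_key(path, invariant_parts=2):
    """Hash the trailing path components that survive a move.

    Assumption: the last `invariant_parts` components (for example the
    sample directory and the file name) stay the same when files are moved.
    """
    invariant = "/".join(path.rstrip("/").split("/")[-invariant_parts:])
    return hashlib.sha1(invariant.encode("utf-8")).hexdigest()


def import_dataset(db, path):
    """Create the entry if its key is unknown, otherwise only update the path.

    `db` is a dict standing in for the database: key -> current file path.
    Returns True when a new entry was created.
    """
    key = path_key(path)
    created = key not in db
    db[key] = path
    return created
```

Running the import a second time after the files have moved to another filesystem updates the stored path but creates nothing new, so anything that refers to the existing entry stays valid.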

Metadata in db

I struggled a lot with this: I got it working initially, then forgot how (which is why I blog now), being too lazy to read the docs again. Galaxy allows some metadata in a jsonified string (dict) that is saved as a blob in the librarydatasetdatasetassociation (ldda) metadata field. These fields are declared in the datatype. I declared my own datatypes for the file types in this sample. The galaxy developers already have some definitions for solexafastq etc., so YMMV. To initialize more than the default datatypes, importer.py passes the current directory to set_datatypes_registry.
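As a rough picture of what "jsonified dict stored as a blob" means, the following sketch serializes a metadata dict to a string and restores it. It uses only the standard json module and does not touch galaxy's own metadata classes; the field names flowcell and lane are example values, not the fields declared in eland.py.

```python
import json


def pack_metadata(fields):
    """Serialize a metadata dict to a JSON string (the stored blob)."""
    return json.dumps(fields, sort_keys=True)


def unpack_metadata(blob):
    """Restore the metadata dict from the stored JSON string."""
    return json.loads(blob)


# Hypothetical experimental metadata for one dataset.
blob = pack_metadata({"flowcell": "FC1", "lane": 1})
```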

Files affected

file                               purpose
lib/galaxy/datatypes/registry.py   import eland
lib/galaxy/datatypes/eland.py      declaration of fields etc...
datatypes_conf.xml                 declaration of file types

There are also metadata files, whose purpose I don't understand yet.

Summary

I hope this series enables you to create a customized importer, either by taking galaxy-central-importer as a template or by starting from scratch. Please ask questions in the comments of the blog. This makes it easier for other readers to judge the usefulness of this approach and the problems with it. Feedback is always appreciated.

Monday, August 31, 2009

Mappable Map

Back from vacation. This three-liner creates a mappable map of the genome. It takes about a day on our new 16-core machine, which I wanted to stress a little bit. splitter is of course from EMBOSS; bowtie is used for quick exact matching.

BOWTIE="/projects/solexadst/bin/bowtie-0.10.1/bowtie"
BOWTIEINDEX="/projects/solexasrc/genomes/mouse/ncbi37_mm9/bowtie/mm9_bowtie"
FASTADIR="/projects/solexasrc/genomes/mouse/ncbi37_mm9/genome/fasta/"
for fasta in "${FASTADIR}"*.fa;
do splitter "$fasta" raw::stdout -size 36 -overlap 35 -auto | "$BOWTIE" -r -v 0 -m 1 -p 16 "$BOWTIEINDEX" - | cut -f 1 | sort -n | awk -v fa="$(basename "$fasta" .fa)" 'BEGIN{chr="chr" substr(fa,2)}{ if(NR==1){firstpos=$1;lastpos=$1};if($1 > lastpos + 1){ print chr "\t" firstpos "\t" lastpos; firstpos = $1 } lastpos = $1 }END{ print chr "\t" firstpos  "\t" lastpos}' > "$(basename "$fasta" .fa).bed";
done
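The awk stage at the end of the pipeline is the part that turns a sorted list of positions into intervals. For readers who find the one-liner hard to parse, here is the same merging step as a small Python function. The function name is made up for this sketch, and unlike the awk version it returns nothing for empty input instead of printing an empty interval.

```python
def merge_positions(positions, chrom):
    """Collapse sorted integer positions into (chrom, first, last) runs.

    A new interval starts whenever a position is more than one past the
    previous position, which mirrors the `$1 > lastpos + 1` test in awk.
    """
    intervals = []
    first = last = None
    for pos in positions:
        if first is None:
            first = last = pos
        elif pos > last + 1:
            intervals.append((chrom, first, last))
            first = last = pos
        else:
            last = pos
    if first is not None:
        intervals.append((chrom, first, last))
    return intervals
```

Each returned tuple corresponds to one tab-separated line of the .bed file written by the shell loop.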

Sunday, August 16, 2009

Adventures in the galaxy Pt.1

Introduction

Over a series of blog posts called "adventures in the galaxy", I will show how to create a customized importer of data into galaxy. In case you don't know what galaxy is: it's a web frontend for bioinformatics programs. The nice thing about galaxy is that it remembers the complete history of a dataset. This includes all the transformations together with their parameters, and all the datasets it has been combined with.

The built-in ability to import data lacked some features that we needed: association with the metadata of the experiments, and presentation of the data in a structured way. We also want the data to stay in its place. The main motivation for this blog is to have some documentation for my coworkers. It's my first attempt at blogging, so please be kind.

Basic setup

The developer site of galaxy seems to have moved to bitbucket (galaxy-central). So if you want to submit bug reports or check the issues, go there.

To develop, clone the repository, create a new branch, hack around and commit your changes. Then switch to the default branch, pull from upstream, switch back, and merge the updated default branch into your own branch.

hg clone http://bitbucket.org/galaxy/galaxy-central galaxy_blog
cd galaxy_blog
hg branch importer #create branch
hg commit -m "created branch importer"
#...hack around
hg commit -m "useful changes"
hg update -C default #switch branch
hg pull
hg update -C importer
hg merge default #now you have to resolve the merge conflicts

To run galaxy, first run setup.sh, then edit the generated universe_wsgi.ini to add an administrative user. Whenever you want to run galaxy, just start run.sh. You can back up the database by copying database/universe.sqlite. After erasing the database, it will be recreated the next time you run run.sh. A nice tool for development with sqlite is sqliteman3. For production you will probably switch to a different RDBMS.

Initialization

Galaxy does a lot during startup, especially connecting to the db and setting up the ORM. The importer does this in the same way as the scripts in the scripts directory. A relatively complete version is available on github as a gist.

startup shell script

The startup shell script calls the actual python script.
#!/bin/sh
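# change to the directory containing this script (quoted so paths with spaces work)
cd "$(dirname "$0")"
# pass all command line arguments through unchanged
python -ES ./importer/importer.py universe_wsgi.ini "$@"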

python importer

The python script (which I don't include here; get it from github) first does some pythonpath mangling. From line 72 on, it parses the configuration file (universe_wsgi.ini) and returns an ImportData object. Now our importer can use the python object model that was created by the galaxy developers.

As an example, we create a user in the database and associate her with a role. To create an object, just instantiate it and call flush(). You also have to create the association table object.

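import hashlib

def createUser(app, username, password):
   """creates a user and a private role ("r" + username) and associates them in the database"""
   # hashlib.sha1 replaces the deprecated sha module and yields the same hex digest
   password = hashlib.sha1( password ).hexdigest()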
   user = app.model.User(username, password)
   user.flush()
   role = app.model.Role( name="r" + username, description="private role of " + username, type=app.model.Role.types.PRIVATE )
   role.flush()
   userRoleAssociation = app.model.UserRoleAssociation(user, role)
   userRoleAssociation.flush()

We call createUser from the __main__ function, and we wrap everything in a transaction. You have to comment out the raising of the exception on line 8 in order to save the user.

def __main__():
   app = parseConfig()
   username = "name" + "1"
   session = app.model.session
   transaction = session.begin()
   try:
      createUser(app, username, "123456")
      raise Exception('test for rollback', 'test for rollback')
      transaction.commit()
   except Exception:
      transaction.rollback()
      print("error: rollback")
   user = app.model.User.filter_by(email=username).first()
   if not user:
      print("user not created")
   else:
      print("user created: " + user.email + " " + " ".join(map(lambda r: r.name, user.all_roles())))
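If you want to play with the commit/rollback pattern without a galaxy checkout, the same structure can be tried against an in-memory SQLite database. This sketch uses only the standard sqlite3 module, not galaxy's model or session objects; the users table and the create_user function are made up for the illustration.

```python
import sqlite3


def create_user(conn, email, fail=False):
    """Insert a user and commit; roll back if anything fails before the commit."""
    try:
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        if fail:
            # plays the role of the test exception in __main__ above
            raise RuntimeError("test for rollback")
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        return False


# In-memory database with a single hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.commit()
```

Calling create_user(conn, "name1", fail=True) leaves the table empty, while calling it without fail stores the row, which is the same behaviour you get from commenting out the raise in __main__.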

Next

From here on it's rather simple, but if you don't want to experiment with the galaxy object model yourself yet, the next part of the series will show you how to create libraries, folders and datasets, and how to associate everything with nice forms so that authorized users can store comments for each dataset.

It would be very nice of you to give me feedback: was it useful, wordy, superfluous etc.?