Sunday, 6 September 2009

Adventures in the galaxy Pt.2

Update 09092009

The project on github had to be updated because the user-group association was not being created.

Introduction

This is part two of the series on creating an importer for galaxy. Part one showed the basic setup for checking out, developing and bootstrapping the galaxy system. This part introduces an implementation of an importer.

Importer test drive

Check out a branch of galaxy that includes the importer and take it for a test drive.

hg clone http://bitbucket.org/ido/galaxy-central-importer galaxy-central-importer
cd galaxy-central-importer
hg branches # hg seems to check out the default branch
hg update -C importer
sh setup.sh
./run.sh # end with ctrl-c after startup completes (serving on ....)
./importer.sh # starts the importer; wait for "done importing"
./run.sh

Log in as u:name1@ p:123456 (yes, the user name really ends with @) and browse the library hierarchy. The importer creates one library for each group and, within this library, a folder for each person. Subfolders are created for each sequenced sample (a unique combination of flowcell and lane). Each analysis of these sequenced samples is represented by files within a subfolder.

Left click on a dataset and select "Edit...". Unfortunately it is possible to edit some attributes (name, info message, data type) which I don't think should be editable, but fortunately it is also possible to change the "Notes" text field. The text field comes from a form which is automatically associated with each dataset. There are also metadata attributes giving experimental information, but these seem to be visible only when editing (I think it's a bug in galaxy).
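The library layout described above (library per group, folder per person, subfolder per flowcell/lane combination) can be sketched in a few lines of Python. This is only an illustration of the nesting, not the importer's code: the Analysis record, its field names and the "FC1_lane1" naming of the sample subfolder are all assumptions of mine.

```python
from collections import namedtuple

# Hypothetical record; the field names are assumptions, not the importer's model.
Analysis = namedtuple("Analysis", "group person flowcell lane filename")


def library_path(a):
    """Return (library, folder, subfolder, dataset) for one analysis file."""
    # A sequenced sample is identified by flowcell + lane (naming is assumed).
    sample = "{}_lane{}".format(a.flowcell, a.lane)
    return (a.group, a.person, sample, a.filename)


def build_tree(analyses):
    """Nest files as library -> person folder -> sample subfolder -> files."""
    tree = {}
    for a in analyses:
        lib, person, sample, name = library_path(a)
        files = tree.setdefault(lib, {}).setdefault(person, {}).setdefault(sample, [])
        files.append(name)
    return tree
```

Two analyses of the same flowcell and lane end up as two files in the same subfolder, which is the behaviour you see when browsing the libraries.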

Logging in as u:name2@ p:123456 or u:tamir@ p:123456 shows galaxy's RBAC in effect: only the group2 libraries are visible for name2@, but all libraries are visible for tamir@, who is a member of the bioinformatics group.

The galaxy database folder is still small, because only the paths of the files were imported and their metadata was set by the importer.

Building blocks

The scripts for the importer are in the importer subfolder.

file               purpose
importerCommon.py  utility functions
importerModel.py   simple generic object model for NGS and its results
importer.py        real worker
sampleIO.py        functions for sample.xml parsing
sampleIOTest.py    unit test for sample xml parsing
users_lib.py       functions for user and group management
import_lib.py      functions for creation of libraries, folders, datasets etc...
__init__.py        standard python
samples.xml        example xml file containing the information to import

The sample* files are specific to our setup; the rest is rather generic. They implement the conversion from the persistence solution (RDBMS, CSV files etc...) to the simple object model that is represented in importerModel. In this case the data is encoded in one XML file.
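To give an idea of what such a converter looks like, here is a minimal sketch that parses an XML file into a simple object model. The XML layout, the Sample record and the parse_samples function are invented for this illustration; the real samples.xml and sampleIO.py in the repository will look different.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical object model; importerModel.py defines its own classes.
Sample = namedtuple("Sample", "group person flowcell lane files")

# Hypothetical XML layout, made up for this example.
EXAMPLE_XML = """
<samples>
  <sample group="group1" person="name1" flowcell="FC1" lane="1">
    <file path="/data/FC1/lane1/s_1_export.txt"/>
  </sample>
</samples>
"""


def parse_samples(xml_text):
    """Convert the XML text into a list of Sample records."""
    root = ET.fromstring(xml_text.strip())
    samples = []
    for node in root.findall("sample"):
        files = [f.get("path") for f in node.findall("file")]
        samples.append(Sample(
            group=node.get("group"),
            person=node.get("person"),
            flowcell=node.get("flowcell"),
            lane=int(node.get("lane")),
            files=files,
        ))
    return samples
```

If your data lives in an RDBMS or in CSV files, only this conversion step has to be replaced; everything downstream works on the object model.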

The interesting functions are in importer_lib.py and users_lib.py. They are the basic building blocks, and you can combine them to create a completely different layout for your importer. Galaxy still organizes library information in the traditional and slightly outdated filesystem hierarchy metaphor, but alternatives such as tagging would be possible.

RBAC

This is an important point. The importer also changes the RBAC setup of the standard galaxy distribution. It does this in three files: lib/galaxy/model/mapping.py, where it defines an additional implementation of GalaxyRBACAgent called IMPGalaxyRBACAgent; lib/galaxy/security/__init__.py, where the IMPGalaxyRBACAgent is initialized; and lib/galaxy/web/controllers/admin.py, where a check for roles intersection is taken out.

Why is this necessary?

It's not necessary for you, and you can create an importer without these modifications. For the "access" permission, galaxy security requires a user to have all the roles that are associated with the item. For our requirements this would have forced us to create ad-hoc roles, so it was modified.
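The difference between the two rules is easiest to see side by side. The sketch below is not the code of GalaxyRBACAgent or IMPGalaxyRBACAgent; both function names are hypothetical, and the relaxed rule (any shared role is enough) is my assumption of what such a modification could look like.

```python
def can_access_strict(user_roles, item_roles):
    """Rule as described above: the user must hold every role on the item."""
    return set(item_roles) <= set(user_roles)


def can_access_relaxed(user_roles, item_roles):
    """Assumed relaxation: one shared role is enough to grant access."""
    item_roles = set(item_roles)
    # An item without any role is treated as public here (an assumption).
    return not item_roles or bool(item_roles & set(user_roles))
```

With a library that carries both a group2 role and a bioinformatics role, the strict rule only admits users who hold both roles. Under the relaxed rule a group2 member and a bioinformatics member can each see it, without creating an extra role for every combination.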

Idempotent

The importer as it is should be idempotent. Another run on the same data should not change anything in the database besides file paths. We move analysed files (and their directories) to another filesystem after they have passed the quality check. Only the paths of moved files should be updated and no new item should be created, so the analysis history associated with the file stays valid. This is done by encoding the invariant part of the file path as a hash and putting it into the description field. Database insert functions also check for the existence of an item before they create it (*_if_not_exists). This is a useful hack so we don't have to keep track of updates. New items will of course still be inserted into the db.
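A minimal sketch of this idea, with a plain dict standing in for the database. Which part of the path is invariant depends on your setup; here I simply assume it is the last two components (lane directory and file name). The names path_key and import_file are made up for the example and are not the importer's functions.

```python
import hashlib
import os


def path_key(path, invariant_parts=2):
    """Hash the trailing path components that survive a move (assumed: two)."""
    parts = os.path.normpath(path).split(os.sep)
    invariant = "/".join(parts[-invariant_parts:])
    return hashlib.sha1(invariant.encode("utf-8")).hexdigest()


def import_file(store, path):
    """Create the record if its key is new, otherwise only update the path."""
    key = path_key(path)
    record = store.get(key)
    if record is None:
        # First import: the hash goes into the description field.
        record = store[key] = {"description": key, "path": path}
    else:
        # Re-import after a move: same item, new location.
        record["path"] = path
    return record
```

Importing /incoming/FC1/lane1/s_1_export.txt and later /archive/2009/lane1/s_1_export.txt touches the same record twice, so anything attached to that record survives the move.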

Metadata in db

I struggled a lot with this: I got it working initially, then forgot how (which is why I blog now) and was too lazy to read the docs again. Galaxy allows some metadata in a jsonified string (dict) that is saved as a blob in the librarydatasetdatasetassociation (ldda) metadata field. These fields are declared in the datatype. I declared my own datatypes for the file types in this sample. The galaxy developers already have some definitions for solexafastq etc., so YMMV. To initialize more than the default datatypes, importer.py passes the current directory to set_datatypes_registry.
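The "jsonified dict in a blob" part is nothing more than the round trip below. This is a generic sketch using Python's json module, not galaxy's own metadata code; the helper names and the flowcell/lane fields are assumptions for the example.

```python
import json


def pack_metadata(fields):
    """Serialise a metadata dict to UTF-8 JSON bytes for a blob column."""
    return json.dumps(fields, sort_keys=True).encode("utf-8")


def unpack_metadata(blob):
    """Turn the stored bytes back into a dict."""
    return json.loads(blob.decode("utf-8"))
```

Only values that JSON can represent (strings, numbers, lists, dicts, booleans, None) survive this round trip unchanged, which is worth keeping in mind when you declare fields in your own datatype.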

Files affected
lib/galaxy/datatypes/registry.py  import eland
lib/galaxy/datatypes/eland.py     declaration of fields etc...
datatypes_conf.xml                declaration of file types

There are also metadata files, whose purpose I don't understand yet.

Summary

I hope this series enables you to create a customized importer, either by taking galaxy-central-importer as a template or by starting from scratch. Please ask questions in the comments of the blog. This makes it easier for other readers to judge the usefulness of this approach and its problems. Feedback is always appreciated.