NOTE: SAPAC has become eResearch SA. The information on this website is now outdated. Please go to the new eResearch SA website.
All wheat and barley photographs have been obtained from the BarleyBase Barley Photo Gallery.
Grass Microarray Database Project: Joy Raison, SAPAC

Background

Abiotic stresses, such as temperature, waterlogging, drought, salinity and mineral deficiencies or toxicities, are a major cause of yield and quality loss in cereal crops. To develop varieties with resistance or increased tolerance to such stresses, scientists at the Australian Centre for Plant Functional Genomics (ACPFG) are trying to determine which genes are activated or repressed in different varieties under different stress conditions, and the function of these genes in the metabolism of the plants.

Microarray technology has been developed to simultaneously detect the expression levels of thousands of genes in a sample of biological material. Microarray experiments are performed to determine differential gene expression in multiple biological samples. The biological material could come from any organism, at any developmental stage, be from any tissue of the organism, and have undergone any treatment or stress regime. The design of each experiment determines the biological material to be used and any treatments that will be applied to it.

To support genomic research, the results of microarray experiments can now be lodged at different web sites and made available to the public. To take advantage of the information that these publicly available microarray datasets may hold about cereals under different abiotic stresses, the South Australian Partnership for Advanced Computing (SAPAC), in conjunction with the ACPFG, is developing a database to store microarray experiment related data on any species in the Poaceae (grass) family.

Overview

The aim of the Grass Microarray Database project is to collect, manage, and provide access to microarray experiment data and related information for grass species in a single format. Figure 1 provides an overview of the different components of the project.
The data to be stored in the database include:
As well as publicly available data, data from microarray experiments performed at the ACPFG and sequence annotations derived at the ACPFG will be included in the database. Three user groups are expected to access the data in the database, namely biologists, bioinformaticians, and those working on other projects, such as the Data Warehousing project at the University of Queensland. Interfaces will be developed for each of the user groups to access the information they require.

To keep the database up to date, new datasets will need to be added as they become publicly available. As better information becomes available, for example in the annotations of genetic sequences, the database will also need to be updated with the most recent information.

The Problems Faced

The main problems arose from the data coming from different sites and from using the different standards developed for microarray data. The database contains many different types of data with a complex set of relationships.

The MIAME standard specifies the minimum information that should be provided for a microarray experiment to be interpreted. The MAGE-OM provides an object-oriented model of the microarray and gene expression information. The MAGE-ML DTD is a document type definition for exchanging microarray and gene expression information in a common XML format. The MAGEstk is an API providing Java classes for the microarray and gene expression information space.

Where possible, previously developed standards for microarray data were used in designing the database schema and for transferring data. The MIAME standard was not detailed enough to determine the information space required for developing the schema. Although both provided an object-oriented view of the same information space, the MAGE-OM and MAGEstk were not always consistent in their definitions of the data and its relationships.
To complicate matters, two different MAGE-ML DTD definitions were found, both purporting to be the same version.

Each site provides different information, and in different formats. Even when the same type of information is provided in the same format, different sites interpret differently how the data should be placed within that format. For some sites, the formatting of the data was not defined, requiring some guesswork to determine the data and their relationships. Furthermore, inconsistencies were detected within datasets, between datasets from the same source, and between datasets from different sources when referencing other common information, for example the type of microarray chip the experiment was performed on. It is evident that the database will require a lot of data cleaning before it will be useful for researchers to compare data from different experiments.

The Approach Taken

A custom database was designed and built to accommodate the required microarray experiment related information. The MAGE-ML DTD and MAGEstk were reasonably consistent and so were used as the basis for the database schema. This required determining the information space from a hierarchical definition (the MAGE-ML DTD) and an object-oriented definition (MAGEstk) and mapping it into a relational database schema. Where inconsistencies were found, the more general of the definitions was used in the database, as that would accommodate both scenarios.

The data were downloaded from the different public sites, and software has been developed to parse the downloaded data and populate the database. Custom-designed interfaces are planned so that each type of user can access the data of interest to them. Figure 2 illustrates the flow of data from the source sites to the interfaces through which they will be accessible to the database users.
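As a rough illustration of mapping a hierarchical definition onto relational tables, the sketch below loads a small nested XML fragment into two tables linked by a foreign key. It is written in Python for brevity rather than the Java used by the project, and the element names, attribute names, identifiers and table layout are simplified assumptions, not the actual MAGE-ML DTD or the project's schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified XML; NOT the real MAGE-ML DTD.
SAMPLE_XML = """
<Experiments>
  <Experiment identifier="E-1" name="Salt stress time course">
    <BioAssay identifier="BA-1" name="root 0h"/>
    <BioAssay identifier="BA-2" name="root 24h"/>
  </Experiment>
  <Experiment identifier="E-2" name="Drought">
    <BioAssay identifier="BA-3" name="leaf control"/>
  </Experiment>
</Experiments>
"""


def create_schema(conn):
    """Create one table per element type; nesting becomes a foreign key."""
    conn.executescript(
        """
        CREATE TABLE experiment (
            identifier TEXT PRIMARY KEY,
            name       TEXT
        );
        CREATE TABLE bioassay (
            identifier    TEXT PRIMARY KEY,
            name          TEXT,
            experiment_id TEXT NOT NULL REFERENCES experiment(identifier)
        );
        """
    )


def load(conn, xml_text):
    """Walk the XML tree and insert each parent row, then its child rows."""
    root = ET.fromstring(xml_text.strip())
    for exp in root.iter("Experiment"):
        conn.execute(
            "INSERT INTO experiment (identifier, name) VALUES (?, ?)",
            (exp.get("identifier"), exp.get("name")),
        )
        for assay in exp.findall("BioAssay"):
            conn.execute(
                "INSERT INTO bioassay (identifier, name, experiment_id) "
                "VALUES (?, ?, ?)",
                (assay.get("identifier"), assay.get("name"), exp.get("identifier")),
            )
    conn.commit()


def assays_per_experiment(conn):
    """Return {experiment identifier: number of linked bioassays}."""
    rows = conn.execute(
        "SELECT e.identifier, COUNT(b.identifier) "
        "FROM experiment e LEFT JOIN bioassay b ON b.experiment_id = e.identifier "
        "GROUP BY e.identifier ORDER BY e.identifier"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    load(connection, SAMPLE_XML)
    print(assays_per_experiment(connection))
```

The containment in the hierarchy is carried by the `experiment_id` column, so a query that compares experiments becomes an ordinary join rather than a tree walk.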
A set of Java classes has been developed to read the datasets downloaded from each site, parse the differently formatted datasets and insert the data into the database.

The ploidy data for angiosperms were downloaded from the Royal Botanic Gardens, Kew in HTML. A reader was developed to parse these data and insert them into the database.

The taxonomy data for the Poaceae family are downloaded from the National Center for Biotechnology Information (NCBI) Taxonomy site as a set of comma separated files. Programs have been written to regularly download the data and update the database with any changes made to the relationships in this family.

Affymetrix microarray chip information is downloaded from the Affymetrix site in MAGE-ML format. MAGE-ML is also used to format the experiments from the BarleyBase and ArrayExpress sites. However, the BarleyBase MAGE-ML is provided in eight to eleven separate files, whereas the ArrayExpress experiments and chips are each provided in a single MAGE-ML formatted file. Consequently, separate readers have been developed to manage the downloaded data files from each site. A single set of MAGE-ML parsers has been developed to parse any MAGE-ML formatted file. The BarleyBase and ArrayExpress readers both use these parsers to parse their datasets. The parsers create MAGEstk objects and pass them to the database broker for insertion into the database.

Microarray experiment data from the Gene Expression Omnibus (GEO) site are downloaded in "soft" format. A reader is being developed to parse this information for each experiment and populate the database with it.

Future Work

To complete the insertion of the data into the database, the "soft" parser has to be completed, and the taxonomy and ploidy data areas need to be incorporated into the database. The data to be provided by the ACPFG have still to be obtained. Methods to implement the automatic updating of the database still need to be investigated and developed.
Significant data cleaning is expected to be needed before the data can be usefully compared across experiments. The interfaces for each of the user groups still need to be defined and developed. There may be a need to create data warehouses from the database to improve the querying efficiency of the interfaces for the different user groups.
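To give a feel for what the outstanding "soft" parser involves, here is a minimal sketch in Python (not the project's Java reader). It assumes the commonly published layout of GEO "soft" files: lines beginning with `^` start a new entity, lines beginning with `!` carry attributes, lines beginning with `#` describe table columns, and the remaining tab-separated lines between the table begin and end markers are data rows. The accession, attribute values and probe names in the sample are made up for illustration.

```python
# Made-up sample in "soft" layout; the accession and values are hypothetical.
SAMPLE_SOFT = """\
^SAMPLE = GSM0000001
!Sample_title = hypothetical barley root, salt stress
!Sample_organism_ch1 = Hordeum vulgare
#ID_REF = probe set identifier
#VALUE = signal intensity
!sample_table_begin
ID_REF\tVALUE
probe_a\t12.5
probe_b\t7.25
!sample_table_end
"""


def parse_soft(lines):
    """Parse "soft"-style lines into a list of entity dictionaries."""
    entities = []
    current = None
    in_table = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("^"):
            # "^TYPE = identifier" opens a new entity.
            kind, _, ident = line[1:].partition("=")
            current = {
                "type": kind.strip(),
                "id": ident.strip(),
                "attributes": {},
                "columns": {},
                "table": [],
            }
            entities.append(current)
            in_table = False
        elif current is None:
            # Ignore anything that appears before the first entity line.
            continue
        elif line.startswith("!"):
            key, _, value = line[1:].partition("=")
            key = key.strip()
            if key.lower().endswith("_table_begin"):
                in_table = True
            elif key.lower().endswith("_table_end"):
                in_table = False
            else:
                # An attribute may repeat, so values are collected in a list.
                current["attributes"].setdefault(key, []).append(value.strip())
        elif line.startswith("#"):
            name, _, description = line[1:].partition("=")
            current["columns"][name.strip()] = description.strip()
        elif in_table:
            # Table rows (including the header row) are tab separated.
            current["table"].append(line.split("\t"))
    return entities


if __name__ == "__main__":
    for entity in parse_soft(SAMPLE_SOFT.splitlines()):
        print(entity["type"], entity["id"], len(entity["table"]) - 1, "data rows")
```

A real reader would still need to map these dictionaries onto the database schema and cope with the per-site inconsistencies described earlier; the sketch only covers the line-level parsing.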