The 1000 Genomes Project has released the "world's largest set of data on human genetic variation" to the Amazon cloud.
The dataset is 200 terabytes and includes the genomes of more than 2,600 people from 26 populations around the world.
While it is well known that these samples are impossible to truly de-identify, rigorous methods were put in place to ensure all participants were fully aware of the privacy implications. It will be interesting to see how the 1000 Genomes Project will affect the variability and reproducibility of existing genome-wide studies.
Is "the cloud" the appropriate place for whole genome scans of human DNA?
Which is more important: analysis efficiency or access control?
Have we entered the era of health information altruists, is the practical risk overstated, or is this simply too good to ignore?
Stay tuned.
A special issue of Science features the importance of reproducibility, emphasizing the increasing role of computer science and large datasets.
Author and journal editor Roger Peng makes a strong case for open source in science:
"Given the barriers to reproducible research, it is tempting to wait for a comprehensive solution to arrive. However, even incremental steps would be a vast improvement over the current situation. To this end, I propose the following steps (in order of increasing impact and cost) that individuals and the scientific community can take. First, anyone doing any computing in their research should publish their code. It does not have to be clean or beautiful, it just needs to be available. Even without the corresponding data, code can be very informative and can be used to check for problems as well as quickly translate ideas. Journal editors and reviewers should demand this so that it becomes routine. ... The next step would be to publish a cleaned-up version of the code along with the data sets in a durable nonproprietary format. This will involve some additional cost because not everyone will have the resources to publish data. Some fields such as genomics have already created data repositories, but there is not yet a general solution.
Last, the scientific community can pool its collective resources to create a DataMed Central and CodeMed Central, analogous to PubMed Central for all data, metadata, and code to be stored and linked with each other and with corresponding publications. Such an effort would probably need government coordination and support, but each would serve as a single gateway that would guide researchers to field-specific data and code repositories. Existing repositories could continue to be used and would interface with the gateway, whereas fields without existing infrastructure would be given access to these resources. The ultimate goal would be to provide a single place to which people in all fields could turn to make their work reproducible."
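Peng's first two steps, publishing the analysis code and releasing data in a "durable nonproprietary format", can be sketched minimally. The example below is illustrative only: the population labels and counts are hypothetical, not actual 1000 Genomes results. The point is that plain-text CSV, written with standard-library tools, remains readable by any software decades later, unlike a binary, tool-specific format.

```python
import csv

# Hypothetical analysis results: variant counts per population.
# (Labels follow 1000 Genomes-style population codes; the numbers
# are made up for illustration.)
results = [
    ("GBR", 1204),
    ("YRI", 1583),
    ("CHB", 1177),
]

# Durable, nonproprietary output: a plain-text CSV with a header row,
# readable by any spreadsheet, statistics package, or text editor.
with open("variant_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["population", "variant_count"])
    writer.writerows(results)
```

Publishing this script alongside `variant_counts.csv` is the "incremental step" Peng describes: the code need not be beautiful, it just needs to be available next to the data it produced.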