The core question is of course:
“One interesting trend is NoSQL offerings that now include SQL interfaces. Hadoop and Cassandra do this. I attended a talk at the recent Cassandra Summit 2012 where the main developer of CQL (Cassandra’s SQL interface) discussed the motivations for using SQL on Cassandra. In a nutshell he said SQL (including language bindings/APIs) is robust, well thought-out, and does what a lot of people want, so why reinvent something else?”
Robert Hodges (2012). Source: http://www.dbms2.com/2012/08/08/hcatalog-yes-it-matters/
I will continue to update this with interesting tidbits as I battle to answer this question…
Hive & Impala
Apache, Hortonworks and Cloudera have emerged as the major vendors of production-ready Hadoop stacks, addressing the critical performance issues that plagued a young Hadoop ecosystem. Naturally they have “improved” Hive with Stinger (Hortonworks) and Impala (Cloudera). In particular I like Matt Brandwein’s comments that Impala is suited to interactive SQL and Hive is suited to batch SQL.
No surprise here: Facebook is leading the charge with data warehousing and analytics on the Hadoop stack. Of the many case studies, the lessons of LinkedIn and Facebook seem to be at the forefront. Some interesting reads:
- Borthakur, D., Gray, J., Sarma, J. S., Muthukkaruppan, K., Spiegelberg, N., Kuang, H., … & Aiyer, A. (2011, June). Apache Hadoop goes realtime at Facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (pp. 1071-1080). ACM.
- Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sen Sarma, J., … & Liu, H. (2010, June). Data warehousing and analytics infrastructure at facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 1013-1020). ACM.
- Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., … & Murthy, R. (2010, March). Hive-a petabyte scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on (pp. 996-1005). IEEE.
And (of course!) with a bioinformatics flavour: (the last two in particular look really interesting)
- Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC bioinformatics, 11(Suppl 12), S1.
- O’Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). ‘Big data’, Hadoop and cloud computing in genomics. Journal of biomedical informatics, 46(5), 774-781.
- Prekopcsak, Z., Makrai, G., Henk, T., & Gaspar-Papanek, C. (2011, June). Radoop: Analyzing big data with rapidminer and hadoop. In Proceedings of the 2nd RapidMiner Community Meeting and Conference (RCOMM 2011) (pp. 865-874).
- Kumari, P., & Kumar, S. Analyze Human Genome Using Big Data.
- Franklin, M. (2013, October). The Berkeley Data Analytics Stack: Present and future. In Big Data, 2013 IEEE International Conference on (pp. 2-3). IEEE.
Michael Stonebraker at his best! Love it!
Hadoop is good neither at data management nor analytics. (Stonebraker et al., 2014).
source: Taft, R., Vartak, M., Satish, N. R., Sundaram, N., Madden, S., & Stonebraker, M. (2014, June). Genbase: a complex analytics genomics benchmark. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 177-188). ACM.
The cloud is frequently hyped as a tool for high performance, scalable and distributed computing & collaboration. How is this going in the sciences?
- Gannon, D., Fay, D., Green, D., Takeda, K., & Yi, W. (2014, June). Science in the cloud: lessons from three years of research projects on microsoft azure. In Proceedings of the 5th ACM workshop on Scientific cloud computing (pp. 1-8). ACM.
Well, I know this is a hodge-podge of resources, but these seem like interesting and varied starting points. If I can chew through them, I will probably have a great headache, and perhaps some tantalising ideas to help define my thesis…