“One interesting trend is NoSQL offerings that now include SQL interfaces. Hadoop and Cassandra do this. I attended a talk at the recent Cassandra Summit 2012 where the main developer of CQL (Cassandra’s SQL interface) discussed the motivations for using SQL on Cassandra. In a nutshell he said SQL (including language bindings/APIs) is robust, well thought-out, and does what a lot of people want, so why reinvent something else?”

Robert Hodges (2012). Source: http://www.dbms2.com/2012/08/08/hcatalog-yes-it-matters/

I will continue to update this with interesting tidbits as I battle to answer this question…

**HIVE & Impala**

Apache, Hortonworks and Cloudera have emerged as the major vendors of production-ready Hadoop stacks, addressing the critical performance issues that plagued the young Hadoop ecosystem. Naturally they have “improved” Hive: Hortonworks with the Stinger initiative and Cloudera with Impala. In particular I like Matt Brandwein’s comment that Impala is suited to interactive SQL and Hive to batch SQL.

More Info:

- http://www.dbms2.com/2014/02/09/distinctions-in-sqlhadoop-integration/
- http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
- http://impala.io/

**Case Studies**

No surprise here, Facebook is leading the charge with data warehousing and analytics on the Hadoop stack. Of the many case studies, it seems that the lessons of LinkedIn and Facebook are at the forefront. Some interesting reads:

- Borthakur, D., Gray, J., Sarma, J. S., Muthukkaruppan, K., Spiegelberg, N., Kuang, H., … & Aiyer, A. (2011, June). Apache Hadoop goes realtime at Facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (pp. 1071-1080). ACM.
- Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sen Sarma, J., … & Liu, H. (2010, June). Data warehousing and analytics infrastructure at facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 1013-1020). ACM.
- Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., … & Murthy, R. (2010, March). Hive-a petabyte scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on (pp. 996-1005). IEEE.

And (of course!) with a **bioinformatics flavour**: (the last two in particular look really interesting)

- Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC bioinformatics, 11(Suppl 12), S1.
- O’Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). ‘Big data’, Hadoop and cloud computing in genomics. Journal of biomedical informatics, 46(5), 774-781.
- Prekopcsak, Z., Makrai, G., Henk, T., & Gaspar-Papanek, C. (2011, June). Radoop: Analyzing big data with rapidminer and hadoop. In Proceedings of the 2nd RapidMiner Community Meeting and Conference (RCOMM 2011) (pp. 865-874).
- Kumari, P., & Kumar, S. Analyze Human Genome Using Big Data.
- Franklin, M. (2013, October). The Berkeley Data Analytics Stack: Present and future. In Big Data, 2013 IEEE International Conference on (pp. 2-3). IEEE.

**Michael Stonebraker at his best! Love it!**

Hadoop is good neither at data management nor analytics (Stonebraker et al., 2014).

source: Taft, R., Vartak, M., Satish, N. R., Sundaram, N., Madden, S., & Stonebraker, M. (2014, June). Genbase: a complex analytics genomics benchmark. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 177-188). ACM.

**Cloud-based Science**

The cloud is frequently hyped as a tool for high performance, scalable and distributed computing & collaboration. How is this going in the sciences?

- Gannon, D., Fay, D., Green, D., Takeda, K., & Yi, W. (2014, June). Science in the cloud: lessons from three years of research projects on Microsoft Azure. In Proceedings of the 5th ACM workshop on Scientific cloud computing (pp. 1-8). ACM.
- http://www.researchgate.net/publication/220717812_AzureBlast_a_case_study_of_developing_science_applications_on_the_cloud

Well I know this is a hodge-podge of resources. But these seem like interesting and varied starting points. If I can chew through these then I will probably have a great headache, and perhaps some tantalising ideas to help define my thesis…

]]>

Via generalised linear regression we were able to confirm that Td and the relative difference in Ta and Td had a significant effect upon the rate of relaxation. In this post we are going to explore the linear model in more detail. We will see if it is suitable for the accurate prediction of Tw (given Ta and Td), or if not, whether we can use predicted values to provide better starting conditions for the iterative relaxation.

**The Algorithm**

Recall from our earlier posts that the calculation of WBGT requires the iterative approximation of Tw. The algorithm takes Ta and Td as inputs. Initially, Tw is set equal to Td and then slowly refined. In pseudocode the algorithm looks like this:

```
calcWBGT(Ta, Td):
    Tw := Td + 0.2
    while not converged:
        check_converged()
        Tw += 0.2
    end
```

Here we have glossed over the details of what check_converged() does, but you can find the full algorithm in our earlier post here. Long story short, the relaxation of Tw described above can often take many hundreds of iterations, which is a significant problem at scale.
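To make that cost concrete, here is a minimal Python sketch of the relaxation loop (a straight translation of the R routine from our earlier post; the helper names are mine, and the hard-coded constants come from that routine — the convergence check is the sign change in McPherson's balance):

```python
import math

def esat(t):
    # saturation vapour pressure, as in the original R helper calcE
    return 0.6106 * math.exp(17.27 * t / (237.3 + t))

def relax_tw(ta, td, step=0.2):
    """Relax Tw upward from Td until McPherson's balance changes sign
    (or Tw reaches Ta). Returns the relaxed Tw and the iteration count."""
    ed = esat(td)
    tw = td
    prev, curr = 10000.0, 10000.0
    n = 0
    while not ((prev > 0 and curr < 0) or tw >= ta):
        prev, curr = curr, (ed - esat(tw)) * (1556 - 1.484 * tw) + 101 * (ta - tw)
        tw += step
        n += 1
    return tw, n
```

Each pass through the loop is one evaluation of the balance equation, so the iteration count `n` is exactly the quantity we would like to shrink.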

**Predicting Tw**

Previously we used generalised linear regression to explore the effects of Ta and Td on the rate of relaxation. We found that the following model provided a reasonable approximation:

The model was *decay ~ Td + diff + Td:diff*, where *diff* is the difference between the two input temperatures *(Ta – Td)* and *Td:diff* is the interaction of these two terms. The rate of decay was modeled as count data, assuming a Poisson distribution.

Of course the best case scenario would allow us to accurately predict the rate of decay of Tw and thus avoid any iteration. However, while reasonable within certain temperature ranges, as Td increased the predicted values began to overestimate and skew the results, as shown in the figure below:

We can observe from the plot above that WBGT calculated using predicted Tw agrees with the original algorithm to within +/- 10 degrees Celsius. The greatest errors occur where Td is very low (Td < -50 degrees Celsius), e.g. up around Greenland. Errors also increase at the upper ranges of Td; note the colour gradient that runs from the Northern hemisphere (February, i.e. a winter month with low Td) down to the Southern hemisphere (a summer month with higher Td). Some of this variation will be captured by the confidence intervals, but the disagreement is impractical when considering physiological heat stress, where a deviation of as little as 1 degree Celsius is of significant interest.

**Estimating Initial Conditions for Tw**

So while the linear model is not as accurate as we might like, perhaps we can use it to estimate a better starting point for the relaxation. The core issue with the performance of this algorithm is the sheer number of iterations required. So perhaps we can achieve a quick and effective performance improvement simply by providing a better initial estimate for Tw.

In general, the predicted values for Tw are being over-estimated as shown in the histogram below:

Despite the observed overestimation, the prediction tends to be reasonably close to the mark, such that subtracting a small offset (which we will call *delta*) should provide an estimate very close to the final stopping conditions. We re-calculated WBGT using a range of values for *delta* from 1 through to 15, comparing the mean squared error (actual WBGT cf. predicted WBGT):

Comparison of the mean squared error revealed an optimal region of between 8 and 10 for the offset. Using an offset of 10, we compared the predicted WBGT with the original WBGT and found near-perfect agreement at all temperature ranges:
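The sweep over *delta* is simple to sketch in Python (the function and variable names here are hypothetical, and the 0.67/0.33 WBGT weighting is taken from the algorithm in our earlier post; this assumes we already hold vectors of actual WBGT, model-predicted Tw and the corresponding Ta values):

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def best_delta(actual_wbgt, predicted_tw, ta_values, deltas=range(1, 16)):
    """Shift the predicted Tw down by each candidate offset, recompute
    WBGT = 0.67 * Tw + 0.33 * Ta, and score each offset by MSE."""
    scores = {}
    for d in deltas:
        wbgt = [0.67 * (tw - d) + 0.33 * ta
                for tw, ta in zip(predicted_tw, ta_values)]
        scores[d] = mse(actual_wbgt, wbgt)
    best = min(scores, key=scores.get)
    return best, scores
```

The returned `scores` dictionary is what the comparison plot above is built from; the minimising offset is the *delta* we carry forward.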

There are two areas of disagreement in the above plot. We suggest that these most likely coincide with missing data points in the raw data sets.

**Comparing the Performance**

Using *delta = 10*, we benchmarked the performance of the original algorithm against the predictive algorithm.

The performance of the original algorithm rapidly degrades as the scale of the problem increases. Conversely, the performance of the predictive algorithm (used to optimise the initial estimation of Tw) is dramatically better.
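The shape of that improvement is easy to reproduce by counting iterations from a cold start (Tw = Td) versus a warm start. A self-contained Python sketch (the warm-start value below is just an illustrative stand-in for the model's prediction minus *delta*, not output from the real model):

```python
import math

def esat(t):
    # saturation vapour pressure, as in the original R helper
    return 0.6106 * math.exp(17.27 * t / (237.3 + t))

def iterations_to_relax(ta, td, tw_start, step=0.2):
    """Count iterations until McPherson's balance changes sign (or Tw hits Ta)."""
    ed = esat(td)
    tw, prev, curr, n = tw_start, 10000.0, 10000.0, 0
    while not ((prev > 0 and curr < 0) or tw >= ta):
        prev, curr = curr, (ed - esat(tw)) * (1556 - 1.484 * tw) + 101 * (ta - tw)
        tw += step
        n += 1
    return n

# cold start from Td versus a warm start from a (hypothetical) predicted value
cold = iterations_to_relax(30.0, 20.0, tw_start=20.0)
warm = iterations_to_relax(30.0, 20.0, tw_start=22.0)
```

Every degree recovered by the initial estimate saves `1/step` iterations per grid point, which is where the gains at scale come from.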

**Conclusions**

Using the previously determined linear model, we were able to effectively ‘short-circuit’ the number of iterations required to fully relax Tw and thus, dramatically improve the performance at scale. As a quick win this is a great result and if performance was our only goal then this would be a good job.

However, I still think that we should be able to build a better model than this iterative approximation. The relaxation of Tw is a well-behaved, continuous exponential function. It is still my hope that we might build a mechanistic model to describe this process, and perhaps gain some additional insight into the approximation of WBGT along the way. In future posts we will explore an algorithm published by Stull (2011), derived through genetic programming, as well as possible non-linear models for the relaxation of Tw.

**References**

Stull, R. (2011). Wet-bulb temperature from relative humidity and air temperature. Journal of Applied Meteorology and Climatology, 50(11), 2267-2269.

]]>

The iterative relaxation of the wet bulb temperature (Tw) is fundamental to the calculation of the heat stress indicator WBGT. While Tw can be measured, the equipment is notoriously inaccurate and it is not commonly recorded. Hence, considerable effort has gone into building models and approximations. Currently, the most commonly accepted method is an iterative algorithm that takes two inputs: atmospheric temperature (Ta) and the dew point temperature (Td). Tw is initialised at the same value as the dew point and then iteratively refined. A very basic function for the relaxation of Tw is given below (written in R):

```r
calcE <- function (t) {
    return (0.6106 * exp(17.27 * t / (237.3 + t)))
}

calcWBGT <- function (ta, td) {
    # tw counters used as stopping conditions
    tw.previous <- 10000
    tw.current  <- 10000

    # initialise ed and tw (tw := td + 0.2)
    ed <- calcE(td)
    tw <- td + 0.2

    # Relax tw
    stop.point <- FALSE
    while (!stop.point) {
        # stopping conditions
        stop.point <- if (tw.previous > 0 & tw.current < 0) TRUE else if (tw >= ta) TRUE else FALSE

        ew <- calcE(tw)
        tw.previous <- tw.current
        tw.current  <- (ed - ew) * (1556 - 1.484 * tw) + 101 * (ta - tw)

        # step forward tw
        tw <- tw + 0.2
    }

    tw   <- if (tw > (td + 0.3)) tw - 0.3 else td
    wbgt <- 0.67 * tw + 0.33 * ta
    return (wbgt)
}
```

To explore the relaxation of Tw we made a slight adjustment to the function above and recorded the intermediate values of tw at each step. These values were kept in a vector called *iter.tw* and plotted as shown below:

We can see from the plot above that the relaxation of Tw is a smooth, continuous decay. We should be able to model this with a high level of certainty. However, through simulation we were able to show that the decay function is strongly dependent on the inputs as shown in the figure below:

On the surface, the relaxation function appears to be very similar for each set of inputs. But closer inspection reveals that the axes are all on very different scales. The differing relaxation profiles become clear when we fix the axes:

There is a fair amount of variation in the above relaxation profiles. We can clearly see that the y-intercept and rate of decay are strongly dependent upon the inputs Ta and Td. Less apparent in the above plots is that the relative difference between Ta and Td also seems to have a strong effect on the rate of relaxation. We investigated the relationship between the inputs and the number of steps required to fully relax Tw:

```r
# TMax - even steps
# Tdew - even steps with some randomness
sim.data <- data.frame(tmax       = seq(-20, 50, length.out = 1000),
                       tdew       = seq(-50, 20, length.out = 1000) + runif(1000, -40, 15),
                       rate_decay = rep(0, 1000))

# calcWBGT has been modified (as described above) to append each
# intermediate value of tw to the global vector iter.tw
for (i in 1:1000) {
    iter.tw <- c()
    calcWBGT(sim.data[i, "tmax"], sim.data[i, "tdew"])
    sim.data$rate_decay[i] <- length(iter.tw)
}

sim.data$diff  <- sim.data$tmax - sim.data$tdew
sim.data$ratio <- sim.data$tmax / sim.data$tdew

pairs(sim.data, upper.panel = panel.cor, diag.panel = panel.hist)
```

The simulation supports our theory that the rate of decay is strongly dependent on the inputs. More specifically, it seems that the rate of decay is strongly correlated with the dew point temperature and the difference between Tmax and Tdew (Tmax – Tdew). Observationally, it seems that the rate of decay increases as Tdew decreases. And (Tmax – Tdew) is strongly correlated with Tdew, which explains the correlation with the rate of decay.

Interestingly, the simulated data suggests that the rate of decay and (Tmax – Tdew) are not strongly correlated with Tmax, suggesting that Tdew has the greater influence.

The ratio of Tmax to Tdew does not seem to be an important factor here.

**Generalised Linear Regression**

We used generalised linear regression to investigate the observations above. The rate of decay (i.e. the number of steps required to fully relax Tw) was modeled as count data assuming the following:

- the rate of decay follows a Poisson distribution
- the observations are independent
- the residuals are approximately normally distributed with constant variance
- that the sample size is sufficient
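A quick way to internalise the first assumption: a Poisson distribution has equal mean and variance, so computing both for the observed counts is a cheap sanity check before fitting. A small Python sketch, using Knuth's algorithm to draw Poisson samples (purely illustrative, not part of the original analysis):

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method: multiply uniforms until the product drops below exp(-lam)
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)
counts = [sample_poisson(5.0, rng) for _ in range(5000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
# for genuinely Poisson counts, mean and variance should be close;
# a sample variance much larger than the mean signals overdispersion
```

This mean-versus-variance check foreshadows the overdispersion we see later in the residual analysis.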

Using stepwise regression, an initial model (*model.step*) was obtained:

However, analysis of deviance showed that the ratio term was not significant (p-value = 0.1358). This term was removed and the model refitted with only Tmax and Tdew (model.inputs). There was no significant difference between the two AIC values, which were 9119.105 and 9119.33 for model.step and model.inputs respectively.

An interaction model (*decay ~ tmax*tdew*) (interaction.one) was built and compared to the additive model. Analysis of deviance confirmed that these two models were significantly different and that the interaction term was significant (p-value << 0.05). The resulting AIC of the interaction model was 7962.14.

A full additive model (*decay ~ tmax + tdew + diff + ratio*) was investigated, and the *diff* term was found to be insignificant due to its strong correlation with *tdew* and *tmax*. However, we are primarily interested in the rate of decay, which the pairwise plot shown previously suggested was strongly correlated with the difference between *tdew* and *tmax*. Subsequently we built an interaction model with the *diff* variable in place of *tmax*: *decay ~ tdew*diff* (interaction.two). Analysis of deviance confirmed that the interaction term was significant (p-value << 0.05). We compared the AIC of the two interaction models, which were 7962.14 and 7360.523 for interaction.one and interaction.two respectively. Thus, we conclude that interaction.two is more suitable in this case.

Finally, we considered the model accuracy of both interaction.one (residual deviance of 1248.9 on 996 residual degrees of freedom) and interaction.two (residual deviance of 647.26 on 996 residual degrees of freedom) and found that interaction.two provided an adequate fit to the data.

Standard residual analysis plots are shown below:

The plot of Residuals vs. Fitted values shows strong signs of overdispersion, and the Normal Q-Q plot indicates that the data deviates slightly from the expected normal distribution at the top end. These would give us cause for concern about any predictions made from this model, and suggest that the confidence intervals may be wide for some sets of inputs. However, as we are not using this model for prediction, I am comfortable making inferences as to the observed effects.

**Conclusions**

The iterative relaxation of Tw is a significant performance problem when applied to large data sets. However, the relaxation of Tw appears to be a smooth, continuous exponential function which we might be able to describe accurately.

Initial simulations suggest that the rate of decay (and therefore, overall performance) is strongly dependent on Tdew, which is used as the initial estimate for Tw. Additionally, the difference between Tmax and Tdew is strongly correlated with the performance of the iterative algorithm. We might be able to exploit this to optimise the choice of initial value for Tw. However, this would still leave us with an iterative approach and potential problems at scale.

In future posts we will explore the relaxation more closely. Specifically, we will re-examine the empirical formula published by Stull (2011) and investigate nonlinear regression as a potential solution for modeling the decay.

**References**

Bernard, T.E., & Pourmoghani, M. (1999). Prediction of workplace wet bulb global temperature. Applied Occupational and Environmental Hygiene, 14(2), 126-134.

Lemke, B., & Kjellstrom, T. (2012). Calculating workplace WBGT from meteorological data: a tool for climate change assessment. Industrial health, 50(4), 267-278.

Stull, R. (2011). Wet-bulb temperature from relative humidity and air temperature. Journal of Applied Meteorology and Climatology, 50(11), 2267-2269.

]]>

Around the globe there is ongoing debate as to whether climate change is a real phenomenon or simply a short-sighted perspective on normal cycles in weather patterns. John Coleman (2015) is outspoken about the integrity of climate change studies, believing that “heat waves are actually diminished, not increased”; a view supported by Professor Geoffrey Duffy (2010), who argues that “climate is always changing, and always will”. Even scientists who have made careers investigating climate change are cautioning against the public and political paranoia surrounding the topic, with noted New Zealand scientist David Evans (2011) suggesting that some of the claims are “outrageous fiction”. Sceptics such as Evans and Duffy argue that empirical, scientific evidence does not support global warming, and that these facts are simply ignored by political parties who use climate change as their main political platform.

However, even if we agree that global warming is a fiction, and regardless of the debate on climate change, we cannot ignore the physiological impact of extreme temperatures. Living and working in extreme heat is a serious problem, particularly in the ‘base-of-the-pyramid’ workforce of India, Africa and South East Asia. Overworked communities suffering under poverty and substandard conditions are a humanitarian problem. For corporations who operate in these environments there are also questions of sustainability, social responsibility and economic return. So while the science is often funded on the back of climate change, there are very real social and economic interests in the accurate measurement of physiological heat stress.

The UK’s Health and Safety Executive (2010) report the wet bulb globe temperature (WBGT) index as the most widely adopted measure of the effects of temperature on humans. However, the direct measurement of WBGT is highly inaccurate, largely due to the difficulty of accurately recording the natural wet bulb temperature (Tw). It is the approximation of Tw that I have been particularly interested in over the past couple of years. In this series of blog posts (which I will link to below as they are published) I will investigate the current models for approximating Tw, including the most widely accepted iterative approach (see Lemke and Kjellstrom, 2012) and an empirical model developed via genetic programming (see Stull, 2011), and I will investigate the use of nonlinear least squares regression for modelling the relaxation of Tw.

My aim is not to set the academic world afire with this research, but simply to explore the properties of the approximation of WBGT, which is fundamentally defined by the iterative relaxation of Tw. The iterative approach is generally regarded as the most accurate and suitable method at present. However, like many iterative algorithms it fails to scale to large data sets. I am intrigued that, given the relaxation process is well understood, there is no current mechanistic model for it. It could be that the relaxation is highly sensitive to the inputs and that there is no simple relationship between the inputs and the final WBGT. To complicate this further, there are issues with missing data, anomalies and potential inaccuracies with the method of interpolation used by the Climatic Research Unit (CRU) (New, Hulme and Jones, 2000; http://www.ipcc-data.org/observ/clim/cru_climatologies.html).

My hope is to be able to derive a useful model for the relaxation of Tw, but failing this hopefully in attempting I can shed some light on the inherent spatial qualities and inaccuracies within the data set that make the approximation of WBGT such a tough problem.

**References**

Coleman, J. Climate change is a lie, global warming is not real [News Blog]. (June 9, 2015). Retrieved from http://www.express.co.uk/news/clarifications-corrections/526191/Climate-change-is-a-lie-global-warming-not-real-claims-weather-channel-founder

Duffy, G.G. Climate Change – the real cause [Webpage]. (August 3, 2010). Retrieved from http://www.climaterealists.org.nz/node/601

Evans, D. Climate Models Go Cold [Blog Article]. (April 8, 2011). Retrieved from http://www.climaterealists.org.nz/node/798

Health and Safety Executive. Wet bulb globe temperature index [Webpage]. (August 15, 2010). Retrieved from http://www.hse.gov.uk/temperature/heatstress/measuring/wetbulb.htm

Lemke, B., & Kjellstrom, T. (2012). Calculating workplace WBGT from meteorological data: a tool for climate change assessment. Industrial health, 50(4), 267-278.

New, M., Hulme, M., & Jones, P. (2000). Representing twentieth-century space-time climate variability. Part II: Development of 1901-96 monthly grids of terrestrial surface climate. Journal of Climate, 13(13), 2217-2238.

Stull, R. (2011). Wet-bulb temperature from relative humidity and air temperature. Journal of Applied Meteorology and Climatology, 50(11), 2267-2269.

Sutter, J.D. EPA boss: Climate change could kill thousands [News Blog]. (July 22, 2015). Retrieved from http://edition.cnn.com/2015/06/22/opinions/sutter-epa-climate-cost/

]]>

**Attempt 1: Install R from Ubuntu repositories**

```
$ sudo apt-get install r-base r-base-dev
```

This is nice and simple. However, I had issues installing packages, mostly related to dependencies. The core issue was that installing devtools failed with an unmet dependency on xml2, which should have been satisfied, having installed libxml2-dev from the Ubuntu repositories. I was also getting a catch-all “plyr package is not available for R 3.0.1” when trying to install plyr.

**The Problem:**

As always, there are a myriad of forums all offering different solutions. Trying to install dependencies did not work for me.

I finally hit upon a Stack Overflow answer (http://stackoverflow.com/questions/30794035/install-packagesdevtools-on-r-3-0-2-fails-in-ubuntu-14-04) that worked for me. It simply seems that the Ubuntu repositories point to R version 3.0.2 by default, and this is horribly out of date.

**The Fix:**

The fix, it seems, is simple: install R version 3.2.0.

I added the Auckland CRAN mirror to the sources.list, ran an update and then reinstalled R 3.2.0. devtools and plyr then installed without issue. Many thanks to the people on stackoverflow!

**Useful links:**

https://help.ubuntu.com/community/Repositories/Ubuntu#Adding_Other_Repositories (adding the public key for the Auckland CRAN mirror)

]]>

So just very briefly – our models should describe our data.

**empirical models**: describe the data we have observed. Great for finding patterns, trends, hidden structure. Useful for prediction, *if* our observed data encapsulates the expected future behaviour / range of future data.

**mechanistic models**: more akin to the models we are used to seeing in physical sciences. They are designed to describe a behaviour, and it is likely (or even expected) that this behaviour is stable and will continue to hold beyond the range of observation. i.e. extrapolation is likely.

And of course, all models are inferential – there are certain assumptions that we make before we model and in picking our models:

With the proliferation of amazing machine learning tools (BigML, Azure Machine Learning, Amazon Machine Learning etc.) I am seeing an increasing number of very suspect examples in the documentation. For sure, the documentation is focused on the workflow and ability of these tools to implement machine learning more so than the robustness of the models. But you would think some thought would go into the accuracy of the models.

I think we still need some time for these tools to grow before they are truly useful for data science. Primarily, if we are to use them then we are going to want to be able to perform all of our data analysis in the one environment, including: data cleaning, transformation, exploratory analysis and so forth right down into modeling, prediction and production. These tools will need to integrate powerful visualisation tools natively as they mature. And of course, we will all need to be educated in how to produce reliable, robust and ultimately profitable models if they are to be successful.

]]>

**Setup – My Test Replication Environment**

As usual, I will use my playbox database, WorldOfMayhem. Currently, the *players* and the *ticktick* tables are replicated. We will add the *dragons* table and the *weapons* table.

First, I will show you what happens if you just go and add this table to the publication articles…

**Full Snapshot Example**

Let’s add the *weapons* table. Right click your publication, go to articles and untick the option to only show included articles. We are going to tick the *weapons* table and then click ok.

Now, right click your publication. Go to “View Snapshot Agent Status” and start the snapshot agent. If you head to the replication monitor you will see the following:

Look closely at this screenshot. The nice thing is that only the *weapons* data has been applied to the subscription; you can see this in the replication monitor. However, the Snapshot Agent Monitor clearly shows that a snapshot of 3 articles was generated. And if you look in your snapshot folder, you can see that all three articles are there. These files are bcp files and contain the table data.

The good news is that even though all the articles were part of the snapshot, only the *weapons* table is actually pushed over. So if you have a local snapshot folder, this isn’t a terrible solution. But if your snapshot folder is on a remote drive, then you are potentially going to push a lot of data over the network.

Let’s fix this so that the full snapshot is not necessary.

**Step 1 Change the publication properties**

All you have to do is disable 2 publication properties: *allow_anonymous* and *immediate_sync*. Here’s how:

```sql
exec sp_changepublication @publication = N'testWOM',
                          @property    = N'allow_anonymous',
                          @value       = 'false';
go

exec sp_changepublication @publication = N'testWOM',
                          @property    = N'immediate_sync',
                          @value       = 'false';
go
```

**Step 2 Add your article**

Now go back to the properties of the publication and add in a new article. This time, we will add the *dragons* table.

**Step 3 Start the snapshot agent**

Once you have added your article, right click the publication again and select “View Snapshot Agent Status” and start the snapshot agent. Fire up the replication monitor and head to your snapshot folder. You will see a very different scenario this time:

And that’s all there is to it. Very quick, low impact and makes for a happier change team in the small hours of the morning.

**Update – What Didn’t Work**

Thought I should add a note about what didn’t work when I was trying to do this. Using the GUI was the key. When I tried to add the article to the publication via the stored procedures I got a number of different errors.

1) If I did not drop the subscription, then I would get an error when I went to run the snapshot agent. It did not pick up that there had been changes to the publication. There is probably a way around this, but I didn’t find it.

2) When I did drop the subscription, then added the article to the publication and recreated the subscription, a full snapshot was generated. No good.

]]>

**Public vs. Private cloud**

You can think of a public cloud like a public swimming pool. Users share the same resources and, while users are restricted to their own “swimming lanes”, the boundaries are fluid and there are no physical boundaries between resource pools.

Contrast this to a private cloud, where individual resource pools are physically isolated within the data center. It is like having your own private swimming pool, maintained and cleaned for you by the local pool boy. The biggest consideration with a private cloud is whether to have it hosted onsite or in a remote data center.

**Platform (IaaS vs. PaaS vs. SaaS)**

Deciding to what level you wish to engage with your cloud provider is a really challenging question. At one extreme, you have Infrastructure as a Service (IaaS) which essentially provides an on-demand scalable virtual environment. You have the ability to stand up your own virtual machines, host your own applications and operate a standard virtual infrastructure. In this environment you share very little with the host and have the ability to lock down the logical boundaries of your machines.

Platform as a Service (PaaS) and Software as a Service (SaaS) offer greater convenience and lower management costs than IaaS. However, in going down these routes you increase your exposure to, and reliance on, your provider. You have to trust their configuration and security of applications, which has broader ethical and legal implications that must be considered.

**Data Encryption**

This is something that I think most DBAs do not give enough consideration! And I lump myself in that boat as well. When I reflect critically on my own practices, I realise that I treat data encryption as a “one time” concern – set it up and trust that it is working. But when you have your data sitting offsite, we need to be a little more disciplined in our practices. Not only should we be making use of SQL Server’s very user-friendly encryption layers, we should also be thinking about rotating our encryption keys on a regular basis. Perhaps we should even be considering two-factor security with physical tokens, or more than one authentication method.

**End-to-end security**

Finally, we have to consider securing our communications. The best firewall and encryption in the world are no use if unauthorised users are able to piggy-back on our connection. This is a really scary issue and it is surprisingly prevalent. Off the top of my head, I can think of a very large and very popular international remote management system that is open to piggy-back connections. It is a serious flaw in the security design and a liability for companies using it (especially if those companies are service providers with ethical and legal standards to maintain!).

Security remains one of the biggest barriers to adoption of cloud services. And as cloud providers like Amazon and Rackspace become larger, they become larger targets for security threats. We can be sure that developments in cloud security will continue to advance quickly, and we need to take time to very seriously consider our options and implement appropriate measures at all levels of service.

]]>

Information is more than raw data, it is a combination of intelligence, perspective and imagination. It is for this reason we need exceptional talent in the data industry, exceptional people who can see through the data to the opportunities within.

Check out the 3 different visualisations in this post. Each is created from the same data, but each tells a slightly different story. Exploring your data in different ways can help you gain a different perspective, give insight to improve your algorithm, help design a more efficient or more accurate model, or perhaps just get your message across to different people. In this example, these visualisations helped to develop a more accurate model for climate change – check out the full blog post: https://nickburns2013.wordpress.com/2014/10/21/an-improved-algorithm-for-the-calculation-of-wbgt/

Whatever area you work in, take the time to step back from your data and look at it again, with new eyes and a fresh perspective.


The vast majority of computational algorithms in science are iterative. While a closed-form (analytical) solution may exist for some problems, often one does not, and even when it does it may not scale well (for example, the normal equation for linear regression). Iterative methods provide a way to model increasingly complex problems over very large data sets. Euler’s method is a widely used iterative scheme which continues to update a hypothesis, h(x), while some boundary condition holds. The general form of Euler’s method is shown below:

```
repeat {
    h(x) := h(x) + alpha * h(x)
}
where alpha << 1
```

Given the general form above, where alpha << 1, the *“future hypothesis”* is *“mostly what it is now, plus a little bit”*. Generally, iteration will continue until some boundary condition is met. For example, in linear regression the hypothesis is updated until the mean squared error is minimised. The choice of *alpha* is often the single most critical element in the efficiency of the algorithm: if alpha is too large, the hypothesis may oscillate wildly and never converge; if alpha is too small, the rate of convergence will be slow.
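As a minimal, self-contained sketch of this general form (the update term here is purely illustrative and not part of the WBGT model), the loop below repeatedly adds “a little bit” to the hypothesis until the change becomes negligible:

```r
# Toy example of the general iterative form: the hypothesis is repeatedly
# nudged by alpha times an update term until the change becomes negligible.
# The update term (which drives x towards 5) is purely illustrative.
iterate <- function (x0, alpha = 0.1, eps = 1e-6, max_iter = 10000) {
    x <- x0
    for (i in 1:max_iter) {
        step <- alpha * (-2 * (x - 5))  # "a little bit": alpha times the update
        x <- x + step                   # h(x) := h(x) + alpha * update
        if (abs(step) < eps) break      # boundary condition: change is negligible
    }
    return (x)
}

iterate(0)  # converges towards 5 for a small alpha
```

Note that with this particular update, each iteration multiplies the error by (1 − 2 · alpha), so for alpha > 1 the iterates oscillate with growing amplitude and never converge – exactly the failure mode described above.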

**The Relaxation of Tw: The Existing Algorithm**

The computational modeling of WBGT is a practical example of Euler’s method in action. The critical step is the *relaxation of Tw*, an iterative process strongly reliant on the input values. Currently, there is no accurate numerical solution for Tw, and the physical measurement of atmospheric Tw is inherently inaccurate. The algorithm, in R, is given below:

```r
esat <- function (t) {
    return (0.6106 * exp(17.27 * t / (237.3 + t)))
}

mcphersons <- function (ta, td, tw) {
    return ( (esat(td) - esat(tw)) * (1556 - 1.484 * tw) + 101 * (ta - tw) )
}

relax <- function (ta, td) {
    # two additional variables to record intermediate state for visualisation
    twX <- c(td)
    mcpX <- c(10000)

    # initial conditions
    mcp_i <- 10000
    mcp_j <- 10000
    tw <- td

    # increment Tw while these boundary conditions hold
    while (mcp_i > 0 & mcp_j > 0 & tw < ta) {
        mcp_j <- mcp_i
        mcp_i <- mcphersons(ta, td, tw)
        tw <- tw + 0.3
        twX <- c(twX, tw)
        mcpX <- c(mcpX, mcp_i)
    }

    return (list(tw=twX, m=mcpX))
}
```

In the algorithm above, Tw is slowly “relaxed” until McPherson’s formula reaches zero. With each iteration, Tw is advanced by 0.3 degrees and McPherson’s formula is re-evaluated. Figure 1 shows the value of McPherson’s formula at each iteration:

The plot above was recorded with inputs (ta, td) = (20, -8). While the relaxation appears linear within this range, as the number of iterations increases the relaxation becomes noticeably non-linear. Relaxation continues until the value of McPherson’s formula crosses y = 0.

The number of steps required to fully relax Tw is strongly dependent on the input values Ta and Td, and can take several thousand iterations with real-world data. When applied to large data sets (for example the CRU data sets (2), for which there are currently more than 30 years of measured temperature observations) the relaxation can take many hours to complete.
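To make the iteration count concrete, here is a standalone run (repeating the definitions above so the snippet runs on its own) with the Figure 1 inputs (ta, td) = (20, -8). Even this mild case takes several dozen 0.3-degree steps, and the count grows with the gap between Td and the crossing point:

```r
# Helper definitions repeated from above, for a standalone run
esat <- function (t) { return (0.6106 * exp(17.27 * t / (237.3 + t))) }

mcphersons <- function (ta, td, tw) {
    return ( (esat(td) - esat(tw)) * (1556 - 1.484 * tw) + 101 * (ta - tw) )
}

relax <- function (ta, td) {
    twX <- c(td); mcpX <- c(10000)      # record intermediate state
    mcp_i <- 10000; mcp_j <- 10000      # initial conditions
    tw <- td
    while (mcp_i > 0 & mcp_j > 0 & tw < ta) {
        mcp_j <- mcp_i
        mcp_i <- mcphersons(ta, td, tw)
        tw <- tw + 0.3
        twX <- c(twX, tw)
        mcpX <- c(mcpX, mcp_i)
    }
    return (list(tw=twX, m=mcpX))
}

res <- relax(20, -8)
length(res$tw)   # number of 0.3-degree steps recorded
max(res$tw)      # the relaxed Tw (slightly overshoots the crossing point)
```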

In previous experiments we have successfully been able to short-circuit this iteration by projecting a linear trend and then stepping backwards to the target value. However, there are a number of issues with this approach. Firstly, it adds complexity to the code, which must determine the linear model, extrapolate out to y = 0 and then step backwards. Secondly, while it is effective at reducing the total number of iterations and improving overall performance, it departs from the classic iterative model, which makes it harder to understand and share with other researchers. Below, we explore a modified iterative scheme using **gradient descent**, which achieves a faster rate of convergence than the existing method and has the advantage that gradient descent is a well-understood and widely used algorithm.

**Exploring the Relaxation of Tw**

We know, from observation of the relaxation profile in Figure 1, that the general trend in relaxation is downwards. Looking at the code, we can see that the relaxation of Tw depends on the value of McPherson’s formula, and that the boundary condition is approximately where McPherson’s formula dips below the x-axis. Given this boundary condition, the target value of Tw is accurate to within +/- 0.3 degrees. However, if we plot the square of McPherson’s formula we observe a more accurate limit:

Figure 2 provides a different perspective on the relaxation of Tw. The plot of the square of McPherson’s formula is a classic convex function with a global minimum that corresponds to our target value of Tw. Looking at Figure 2, it seems that we could employ *gradient descent* to relax Tw. An algorithm for gradient descent is given below:

```r
descend <- function (ta, td, alpha=0.001, eps=0.05) {
    # initial conditions
    twX <- c(td)
    tw_i <- td
    tw_j <- 100000    # just an impractically high temperature

    # increment Tw, while delta(tw) > epsilon
    while (abs(tw_i - tw_j) > eps & tw_i < ta) {
        tw_j <- tw_i
        tw_i <- tw_i + alpha * mcphersons(ta, td, tw_i)
        twX <- c(twX, tw_i)
    }

    return (twX)
}
```

And if we plot Tw, we can see that the *descend()* function asymptotes at the target value of Tw, ~ 8.2.

Figure 3 shows the relaxation of Tw via the gradient descent algorithm, which is able to accurately and efficiently approximate Tw. By using gradient descent, the accuracy of the approximation is improved (accurate to within the threshold *epsilon*) and the total number of iterations is greatly reduced. The model can be further refined by adjusting the learning rate, *alpha*, and the threshold, *epsilon*.
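A standalone check (again repeating the helper functions so the snippet runs on its own) shows *descend()* settling near the same Tw for (ta, td) = (20, -8), in noticeably fewer iterations than the 0.3-degree stepping of *relax()*:

```r
# Helper definitions repeated from above, for a standalone run
esat <- function (t) { return (0.6106 * exp(17.27 * t / (237.3 + t))) }

mcphersons <- function (ta, td, tw) {
    return ( (esat(td) - esat(tw)) * (1556 - 1.484 * tw) + 101 * (ta - tw) )
}

descend <- function (ta, td, alpha=0.001, eps=0.05) {
    twX <- c(td)
    tw_i <- td
    tw_j <- 100000    # just an impractically high temperature
    while (abs(tw_i - tw_j) > eps & tw_i < ta) {
        tw_j <- tw_i
        tw_i <- tw_i + alpha * mcphersons(ta, td, tw_i)
        twX <- c(twX, tw_i)
    }
    return (twX)
}

tw <- descend(20, -8)
tail(tw, 1)   # approaches the target value of Tw, ~8.2
length(tw)    # iterations to converge
```

Because the step size is alpha times McPherson’s formula, the steps shrink automatically as Tw approaches the target, which is why the approach converges from below rather than overshooting.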

**Comparing the two models**

To assess the suitability of the gradient descent algorithm, we compared the values of Tw and WBGT from the gradient descent algorithm to those from the original algorithm, using a randomly generated test set of 10,000 observations. The experimental setup is shown below:

```r
n <- 10000
data <- data.frame(
    ta = sample(5:40, n, replace=TRUE),
    td = sample(-20:20, n, replace=TRUE),
    tw_relax = rep(NA, n),
    tw_descent = rep(NA, n)
)

for (i in 1:n) {
    original <- relax(data$ta[i], data$td[i])$tw
    descent <- descend(data$ta[i], data$td[i], eps=0.05)
    data$tw_relax[i] <- max(original)
    data$tw_descent[i] <- max(descent)
}

data$delta_tw <- data$tw_relax - data$tw_descent
```

The agreement between the two models was determined as the difference between Tw from the original *relax()* algorithm and Tw from the new *descend()* algorithm, and is shown in Figure 4:

Figure 4 shows reasonable agreement between the original algorithm and the gradient descent algorithm. It seems that the original algorithm has a tendency to over-shoot the true minimum of McPherson’s formula, and it typically produces values of Tw that are ~0.5 degrees greater than those of the gradient descent algorithm.

WBGT is calculated as:
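As an illustrative stand-in, the snippet below uses a commonly used simplified form of WBGT for when globe temperature is not available, WBGT ≈ 0.7·Tw + 0.3·Ta; the exact weights here are an assumption, not necessarily the formula used in this comparison:

```r
# Assumption: a commonly used simplified WBGT form (no globe temperature),
# shown for illustration only - not necessarily the exact formula used here.
wbgt <- function (tw, ta) { return (0.7 * tw + 0.3 * ta) }

# With weights like these, a ~0.5-degree difference in Tw between the two
# algorithms shrinks to a smaller difference in WBGT (0.7 * 0.5):
wbgt(8.7, 20) - wbgt(8.2, 20)   # ~0.35
```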

We compared the values of WBGT calculated using Tw from the original algorithm with those calculated using Tw from gradient descent:

Figure 5 shows good agreement between the WBGT values calculated by the original algorithm and those calculated by the gradient descent algorithm. A 95 % confidence interval was determined for the agreement of WBGT(relax) and WBGT(descend):

Based on the 95 % confidence interval above, the gradient descent algorithm is entirely suitable for the approximation of WBGT and could serve as an efficient and reasonable substitute for the existing algorithm.
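For reference, a 95 % confidence interval of this kind can be computed from the per-observation differences along the following lines. The simulated deltas below are placeholders standing in for *data$delta_tw*, not the study’s data:

```r
# Illustrative only: simulated differences standing in for data$delta_tw
set.seed(42)
delta <- rnorm(10000, mean = 0.18, sd = 0.08)

# 95% confidence interval for the mean difference (Student t quantile)
n <- length(delta)
ci_half <- qt(0.975, df = n - 1) * sd(delta) / sqrt(n)
c(mean(delta) - ci_half, mean(delta) + ci_half)
```

With 10,000 observations the interval is very narrow (a half-width on the order of thousandths of a degree), which is the same shape of result as the 0.18 +/- 0.0015 agreement reported in the conclusion.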

**Conclusion**

Accurate and efficient prediction of heat stress indicators, such as WBGT, is important for the modeling of climate change. There is no accurate numerical solution for the calculation of WBGT where Tw is unknown, which has led to the development of iterative algorithms. However, iterative algorithms are expensive when applied to large databases of real-world climate data. To improve the accuracy and efficiency of calculating WBGT, we propose a method for the relaxation of Tw that uses the well-known and widely used gradient descent algorithm. We have shown excellent agreement between the original algorithm and the gradient descent algorithm, to within 0.18 +/- 0.0015 degrees Celsius at a 95 % confidence interval. Overall, it appears that the gradient descent algorithm is able to accurately and efficiently approximate WBGT. And because gradient descent is widely used in learning algorithms, such as linear regression and logistic regression, it has the advantage that it is relatively easy to implement and share across research groups.

**References**

(1) Lemke, B., & Kjellstrom, T. (2012). Calculating workplace WBGT from meteorological data: a tool for climate change assessment. *Industrial Health*, *50*(4), 267-278.

(2) The University of East Anglia, Climatic Research Unit. http://www.cru.uea.ac.uk/data
