Friday, March 29, 2013

Topics of Interest

Because I just don't always have time to fully develop a post on everything I come across, here are a few shorties:

Paul Allison has written some great posts recently related to logistic regression and model assessment.

With regard to the pseudo R^2, see this post as well as the article associated with a new proposed alternative:

Tjur, T. (2009) “Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination.” The American Statistician 63: 366-372.

(I've written about the pseudo R-square before here.)

His most recent post discusses the Hosmer-Lemeshow test. In the future I'd like to expand on this, but for now: he's critical of the test because it is sensitive to strata size, and I am too. I've also seen many criticisms related to its sensitivity to large sample sizes. I'll expand on that later, or in a separate post, but for now I'm just looking forward to his next article, in which Paul will discuss some recent advancements and alternatives to the HL test.
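For a concrete sense of how Tjur's proposed alternative works: the coefficient of discrimination is just the difference between the mean predicted probability among events and the mean among non-events. Here is a minimal sketch in Python; the synthetic probabilities and variable names are my own, standing in for the fitted values of an actual logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: observed 0/1 outcomes, and predicted probabilities
# as if they came from a fitted logistic regression model.
y = rng.integers(0, 2, size=200)
p = np.clip(0.5 + 0.3 * (y - 0.5) + rng.normal(0, 0.15, 200), 0.01, 0.99)

# Tjur's coefficient of discrimination: mean fitted probability among
# events minus mean fitted probability among non-events.
tjur_r2 = p[y == 1].mean() - p[y == 0].mean()
print(round(tjur_r2, 3))
```

A model with no discriminating power gives a value near 0; a model that separates the groups perfectly gives a value near 1.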

Thursday, March 21, 2013

Big Ag Meets Big Data Part II

Previously I discussed the role of social media in producing ‘big data’ and tools that may be used to get the most from this data in the ag industry. In this second installment I’m going to discuss other sources of ‘big data.’ 

I recall attending a UK College of Agriculture field day in Princeton, KY, about 10 years ago, where someone made a comment that went something like this:

“these events are good because on the farm we don’t have time to set up experiments, collect data, and analyze to figure out best practices. We can’t stop and measure and record and report about everything we do.”

It’s certainly true that extension services will continue to conduct valuable research, and it will probably remain a fact that producers aren’t necessarily going to have the time and resources to reduce their operation to a collection of well-crafted scientific experiments. However, every decision made on the farm is a trial of sorts, and with modern technology it is much easier to collect and log data about your operation. Some companies are now figuring out ways to take this farm-level data and turn it into powerful analytical tools that can boost productivity and efficiency. In a recent article, ‘Building Big Data: Farming Big Data Goes To The Cows,’ the following statement is made:
 "The major problem we keep on seeing — especially in bigger, modern farms — is that there's a lot of data being created and not being used, on how they're performing, what they're doing."
How is this data being generated? Lots of it is generated by your equipment, including GPS:

“Next generation farm equipment like combines and tillers are going to be able to take soil samples as they move along, perform analysis on those samples, and feed the results of the analysis back to the manufacturer for crunching on a macro scale. This will result in a better understanding of what is happening in that entire area and make it possible to adjust things like the amount or types of fertilizer and chemicals that should be applied. If the farm equipment manufacturers figure out how to harness all this information, this kind of big-picture analysis could change the commodity trading markets forever.” – from 4 Examples of Big Data Trends. September 27, 2012. VMware Blogs.

And how might we use this data? Well, some seed companies are already combining farm-level data, public data, and their own proprietary data to develop some pretty powerful analytical tools. As discussed recently in an AgWeb technology article, Steyer Seeds offers a great example with its ACRES tool, which is based on a complex form of decision tree:

“After they sign up, customers start by selecting their fields from Google Earth maps. Back-end programming then pulls up a wealth of information – everything from soil type to yield potential. As farmers enter in additional information about their farm, such as crop rotation, traits used, etc., the ACRES algorithm spits out recommendations, which users can accept or tweak as needed.” AgWeb - Unlock Your Farm Data

Another company, Climate Corporation, is also taking advantage of the massive amounts of data useful in agricultural applications:

"We took 60 years of crop yield data, and 14 terabytes of information on soil types, every two square miles for the United States, from the Department of Agriculture," says David Friedberg, chief executive of the Climate Corporation, …We match that with the weather information for one million points the government scans with Doppler radar — this huge national infrastructure for storm warnings — and make predictions for the effect on corn, soybeans and winter wheat." –New York Times

We’ve seen lots of efficiency, environmental, and productivity gains in agriculture related to GPS/GIS and biotechnology. But with every trip across the field, more and more data is being generated. Combining these technologies with ‘big data’ will definitely have its benefits, if not continue to revolutionize the industry.

References and Further Reading:

Vance, Ashlee. "Climate Corp. Updates Crop Insurance via High Tech." Bloomberg BusinessWeek, March 22, 2012.

"Big Data Goes to the Cows."

Hardy, Quentin. "Big Data in the Dirt (and the Cloud)." New York Times, October 11, 2011.

"4 Examples of Big Data Trends." VMware Blogs, September 27, 2012.

Gonzalez, Sarah. "Data analysis, biotech are key in agriculture's future sustainability." Agri-Pulse Communications, Inc.

Potter, Ben. "Unlock Your Farm Data." Farm Journal Technology, February 15, 2013.

Wednesday, March 6, 2013

Decision Trees and Gradient Boosting

Decision Trees

Decision tree algorithms search through the input space and find values of the input variables (split values) that maximize the differences in the target value between groups created by the split. The final model is characterized by the split values for each explanatory variable and creates a set of rules for classifying new cases.
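The split search described above can be sketched in a few lines for the simplest case: a single input variable and a single split (a "stump"). The data and function names below are invented for illustration; real tree algorithms use impurity criteria like Gini or entropy and recurse over many variables.

```python
import numpy as np

def best_split(x, y):
    """Scan candidate split values of one input variable and return the
    value that maximizes the difference in mean target value between the
    two groups the split creates."""
    best_val, best_diff = None, -np.inf
    for val in np.unique(x)[:-1]:          # candidate thresholds
        left, right = y[x <= val], y[x > val]
        diff = abs(left.mean() - right.mean())
        if diff > best_diff:
            best_val, best_diff = val, diff
    return best_val, best_diff

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500)
y = (x > 6).astype(float)                  # true group boundary at x = 6
split, diff = best_split(x, y)
print(split, diff)
```

The recovered split value lands right at the true boundary, and the resulting rule ("if x > 6 predict 1, else 0") is exactly the kind of classification rule the final model is built from.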

Gradient Boosting

Boosting algorithms are ensemble methods that combine the predictions of a series of weak learners. Gradient boosting involves fitting a sequence of trees, with each successive tree fit to the residuals (the negative gradient of the loss function) left by the trees fit before it; the combined series of trees forms a single predictive model. This differs from other ensemble methods using trees, such as random forests. Random forests are a modified type of bootstrap aggregation, or bagging (Hastie et al., 2009). With random forests, we get a predictor that is an average of a series of trees grown on bootstrap samples of the training data, with only a random subset of the available inputs considered at each split (De Ville, 2006). Gradient boosting can perform similarly to random forests, and boosting tends to dominate bagging methods in many applications (Hastie et al., 2009).
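The fit-to-the-residuals idea can be shown in a stripped-down sketch: squared-error gradient boosting on one-split stumps. All names and data here are my own invention; real implementations (gbm, XGBoost, scikit-learn) add shrinkage schedules, subsampling, and multi-way trees.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression stump minimizing squared error."""
    best = None
    for val in np.unique(x)[:-1]:
        left, right = y[x <= val], y[x > val]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, val, left.mean(), right.mean())
    _, val, lmean, rmean = best
    return lambda z: np.where(z <= val, lmean, rmean)

def gradient_boost(x, y, n_trees=50, lr=0.1):
    """Each successive stump is fit to the residuals (negative gradient
    of squared-error loss) of the current ensemble."""
    pred = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_trees):
        tree = fit_stump(x, y - pred)      # fit to current residuals
        pred += lr * tree(x)
        trees.append(tree)
    base = y.mean()
    return lambda z: base + lr * sum(t(z) for t in trees)

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(0, 0.1, 300)
model = gradient_boost(x, y)
mse = np.mean((model(x) - y)**2)
print(round(mse, 4))
```

Each round shrinks the training error relative to the constant-mean baseline, which is the sense in which the sequence of weak stumps combines into a single strong predictor.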


Friedman, Jerome H. (2001), Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189-1232. Available at http://stat.stanford.

Hastie, Tibshirani, and Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer-Verlag.

De Ville, Barry. (2006). Decision Trees for Business Intelligence and Data Mining Using SAS® Enterprise Miner. SAS® Institute.

Monday, March 4, 2013

Big Ag Meets Big Data: Part 1

Social media has allowed farmers to organize and communicate about their industry. The #agchat conversations on Twitter are a good example, not to mention Facebook (see Agriculture Proud, for example) and YouTube (like this look behind the scenes of a family farm). We've seen powerful examples of how social media can be used to mobilize voices and impact perceptions on a national level (for example, issues related to Yellow Tail wine and Pilot Travel Centers).

Social media also provides a rich data source for measuring sentiment or perceptions about the industry. Take, for instance, text mining. With Twitter, Facebook, email, online forums, open-response surveys, customer and reader comments on web pages and news articles, etc., there is a lot of information available to companies and organizations in the form of text. Rather than hiring experts to read through thousands of pages of text and make subjective claims about its meaning, text mining allows us to take otherwise unusable 'qualitative' data and convert it into quantitative measures that we can use for various types of reporting and modeling. Companies are finding that by mining text from web pages, comments, blogs, and social media, they can measure consumer perceptions almost as well as, or better than, they can through explicit surveys or other directly measurable outcomes in their databases. In my own experience, I've benchmarked predictions made from traditional database variables against text mining and found remarkably comparable performance. The validity of these tools is based not necessarily on their ability to make new breakthrough discoveries but, on the contrary, on how these algorithms give us almost exactly what we would expect if we had time to manually process all of the information social media provides. (For a basic example of mining tweets related to 'factory farms' see: ).
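The simplest version of this kind of quantification is just counting words. A toy sketch, assuming a hand-made sentiment lexicon (the word lists and example texts below are invented; real applications use curated lexicons or supervised models):

```python
from collections import Counter
import re

# Toy sentiment lexicon -- illustrative only, not a real lexicon.
POSITIVE = {"great", "love", "healthy", "sustainable", "good"}
NEGATIVE = {"cruel", "dirty", "bad", "upset", "awful"}

def sentiment_score(text):
    """Crude bag-of-words sentiment: (# positive - # negative) tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)

tweets = [
    "Love seeing a family farm produce healthy, sustainable food",
    "Factory farms are dirty and cruel",
]
scores = [sentiment_score(t) for t in tweets]
print(scores)
```

Aggregated over thousands of posts, even a crude score like this turns unstructured text into a quantitative measure that can be tracked over time or fed into a model.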

Besides the actual text we get from social media, the structure of social networks themselves can also be very informative. Social network analysis (SNA) allows us to answer questions such as: Who are the key actors in a network? Who are its most influential members? Who seems to be acting on the periphery? Which connections in the network are most important? Are there key players bridging connections or information between otherwise disconnected groups? Have policies or other forces changed the overall dynamics of interaction between people in the network (i.e., has the network structure changed in any meaningful way), and does that relate to some other performance outcome or goal? I’ve recently used this kind of information to help a company develop a predictive model to improve its viral marketing campaigns.
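The "key actors" question can be sketched with a small invented network and plain-Python degree centrality (the names and edges are made up; real SNA work would also look at betweenness, eigenvector centrality, and so on):

```python
from collections import defaultdict

# Hypothetical network: who interacts with whom (undirected, names invented).
edges = [
    ("ann", "bob"), ("ann", "cal"), ("bob", "cal"),   # cluster 1
    ("dee", "eli"), ("dee", "fay"), ("eli", "fay"),   # cluster 2
    ("cal", "dee"),                                   # the only bridge
]

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Degree centrality: share of other members each node is connected to.
n = len(adj)
degree = {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

# The two nodes bridging the clusters come out on top.
key_actors = sorted(degree, key=degree.get, reverse=True)[:2]
print(key_actors)
```

Here the bridging pair scores highest, which is exactly the "who connects otherwise disconnected groups" question; on real networks the same idea scales up with dedicated tools like NetworkX or Gephi.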

Of course, it doesn't take a rocket scientist to read tweets, Facebook posts, or blog comments to know when people are upset about a product. But there is also a wealth of knowledge to be gained from this type of information that is so voluminous, it would take an army of social media experts to glean and analyze. This is the essence of what has been termed in the industry as 'big data.' It requires new tools for capturing, storing, processing and analyzing this data, and a new type of analyst referred to as a data scientist.  These powerful analytics could be very beneficial to those in the ag industry or agvocacy groups. But this goes beyond social media, and I will discuss how big data is revolutionizing agriculture at the farm level in the second part of this two part series on big data.

*Note: I’m not using the term ‘big ag’ in the derogatory sense used by anti-agricultural activists, but in a complimentary sense referring to the complex network of modern family farms, biotechnology companies, food processors, other agribusinesses and retailers that cooperate to bring healthy and sustainable food to your table.


Social Media Analytics. Matt Bogard, Applied Econometric and Analytical Consulting.

With Hadoop, Big Data Analytics Challenges Old-School Business Intelligence. Doug Henschen, Information Week
Big Bets On Big Data. Eric Savitz, Forbes.

Creative Commons Image Attributions:
Handheld GPS: Paul Downey from Berkhamsted, UK (Earthcache De Slufter, uploaded by Partyzan_XXI) [CC-BY-2.0], via Wikimedia Commons
Satellite: NAVSTAR-2 (GPS-2) satellite, public domain (US Air Force), via Wikimedia Commons
Tractor: bdk [CC-BY-SA-3.0], via Wikimedia Commons