Sunday, June 17, 2018
Big Data and Death at America's Racetracks

The original title for this post was "Use of Artificial Intelligence Techniques to Predict Racetrack Breakdowns", but I kind of wanted more than five people to read it. Hence the current title with a bit more sizzle and a nod to the New York Times investigative series from last year. Fair warning though, there's no getting around the fact that we'll be talking here about the advanced analysis of data to build a hypothetical predictive model, but I'll try to make that as accessible and interesting as possible.

If you follow Thoroughbred horse racing (and even if you don't) then you probably know that the chronic issue of racehorses breaking down and being euthanized is a serious problem. Not that it's a new problem by any measure -- as long as horses have raced, horses have suffered fatal injuries in the process. What is fairly new though is an increasing intolerance for it, both within and outside of racing. What has slowly changed over the years is our collective cultural attitude towards animals, especially mammals, including horses. They are not just expendable things. We are obliged to them in certain ways.

How long does it usually take for the subject of breakdowns to come up when you meet someone new and they learn about your interest or involvement in Thoroughbred racing? My guess is not very long. That's kind of telling.

The early days of Hollywood are notorious for the utter disregard for the welfare of horses. Indeed, they were routinely, and intentionally, killed to "get the scene". The exact number of horses killed by trip-wire and other ways during the filming of the 1925 version of Ben-Hur may never be known. Last year, HBO's horse racing drama "Luck" was completely shut down when a few horses met with fatal accidents. One death, no matter the circumstances, is now too many.

It's much the same in Thoroughbred racing. Things are different now. Injured horses are not "wastage". Still, we accept that accidents will happen in a dangerous sport. The fatality rate can never be 0%. However, it is incumbent on the industry to discover how low the fatality rate can be.

Racing has really never come to terms with what to do and how to react when a horse breaks down on the racetrack. So typically the reaction is to do and say nothing, lest we draw attention to the matter. Sure, the track announcer will inform the crowd that a horse is down, but that's likely his last word on the subject. Onlookers who have seen it all before may mutter something about "the dark side of the game" and flip their programs to the next race. I can only guess what goes through the mind of racetrack newcomers. It's only recently that journalists in attendance have taken to Twitter to release previously elusive information on the aftermath of racetrack incidents.

There's an uncomfortable disconnect in racing. We rightly point to the horses as the stars of the game. Yet an injured star might be silently euthanized on the racetrack, with one eye in the dirt and the other looking to the sky. And the show goes on without missing a beat. It just feels strange.

Now we can probably do better at how we handle racing fatalities, but we can certainly do better at preventing them. Are we really doing everything we can to maximize equine safety? The next time someone says "There is always something more we can do" in regard to breakdowns, ask them what that something is. Be annoying in your persistence. The day may come when horse racing can say with sincerity "we are doing everything we can imagine to ensure their safety".

You're likely aware of the "data revolution" unfolding all around us during this era of "big data". Maybe you're involved in this revolution in some way. Never before have we produced so much data and had access to so much data. And never before have we been more able to utilize those data to solve problems. Data-ism is a veritable movement, and for good reason. Areas of application include science, marketing, medicine, sports and many others. There's someone out there right now who is thinking about how to analyze and leverage certain data in a way that's never been done before. And it's going to be a big deal.

It's a combination of factors converging at once to make all this possible. Computing power and data storage are cheap. Data science as a cross-disciplinary field of study is beginning to mature. It really wasn't that long ago that a sophisticated data-driven predictive model would require a team of statisticians working with expensive software on a mainframe computer.

But today, cutting-edge analytical software like R is open source. Resources to learn how to use this free software are widely available at little to no cost. A motivated individual can take free online courses by professors from Johns Hopkins and other top schools. The folks over at Predixion have even designed an advanced analytics module that integrates with plain old Microsoft Excel, with the idea of bringing predictive analytics straight to the desktops of those who are most familiar with the data. Their product is not free, but it's not expensive either.

And lastly, let's face it, predicting stuff and being right is all the rage. Just ask Nate Silver.

So what does all this highfalutin data talk have to do with reducing racetrack breakdowns? Quite a bit, as it turns out, or as it might turn out.

You are probably already familiar with the spectrum of data available in typical horse racing past performances. There is a truly impressive amount of information embedded in racing charts and running lines. And plenty more that could be derived through further analysis.

Cutting to the chase: Can we use readily available data from racing charts to predict which horses are at risk of breaking down? I think we probably can, and I'll explain how.

First, a little about predictive modeling in general. Predictive analyses aim to accomplish quite a bit more than just describe or explore data. The objective is to develop an algorithm that can predict something in the real world. Sometimes predictive analyses will shine a light on underlying causal factors too. But that's just a welcome bonus. What we want to know is whether X predicts Y. Because if it does then we can go out into the world and measure X and expect to find Y, with the assumption that finding Y is an important and valuable thing to do.

A project that intends to produce a useful predictive model will generally proceed as follows:

Plan how the predictive model will be used in the real world
Define the question/objective
Imagine the ideal data
Conduct a census of relevant data
Clean the data
Create derived variables
Partition the data
Explore the data
Model the data
Interpret results
Challenge results
Plan how the predictive model will be used in the real world

You can see that the first step and the last step are the same. It makes little sense to build a predictive model that cannot be implemented in practice. Unless, that is, you simply want to submit a scientific article to an academic journal. Our goal here is to build something to help solve a real life problem.

Turning to the specifics of racetrack breakdowns, what is it then that we want to predict? Well, we want to predict when a horse is at risk of breaking down. But upon further thought it's not as simple as that. We really want to predict unsoundness. Horses may break down as a result of accidents too, and we can't predict that. We can't predict that because we have no theoretical justification to be able to predict that. But what if the racetrack surface is somehow unsafe and a contributing factor to injury? Both sound and unsound horses are at risk if the racetrack surface is unsafe. And how confident can we really be when chalking up a specific breakdown to either accident or unsoundness?

It's already getting complicated and we're still discussing the project objective. That's absolutely fine. That's the way it should be. Statistical modeling efforts require a lot of thinking before any of the heavy lifting should begin. The computer just crunches numbers according to our instructions and always has an answer. It's up to us to interpret the input and output and make sure theory is governing the process every step of the way.

So ideally we want to predict unsoundness but will likely have to settle for predicting "likelihood of breaking down". The breakdowns in our data will include an unknown number of random accidents confounding the thing we really want to predict. So the model will not be as precise as it could be, which is not to say it will be any less valuable. With more robust data, future iterations of the model may be better at isolating unsoundness.

Let's go ahead and imagine the ideal data for a project like this. Well, we need to know when a horse has broken down. And we want to know as much as possible about each specific horse and as much as possible about the specific circumstances. And we want many years of data.

We'll start with identifying horses who have broken down. The good news is that industry organizations and regulators are counting breakdowns. Of course, this hasn't always been the case. The high profile breakdown of Eight Belles in the 2008 Kentucky Derby convinced many in the racing industry that breakdowns are important enough to be counted. So we now have a centralized Equine Injury Database maintained by the Jockey Club. This is a very good thing in concept but the disclaimer following the fatality rates gives one pause, as does the list of participating racetracks. No matter, we'll take a leap of faith and assume these data are going to be very useful in our model.

Many individual racing jurisdictions collect and track fatality statistics as well. And one jurisdiction, the state of New York, impressively maintains and shares a fatality database that can be drilled into for details. There are probably other "watchdog" sources too who may have relevant data they are willing to share.

Finally, there are electronic past performances and charts going back a long way, a couple of decades at least. These are valuable because incidents and fatalities are often noted in some fashion in the chart callers' comments. That notation may not amount to much more than a "Went wrong" comment in many cases, but other times there is more information including confirmation of fatality.
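As a rough illustration of how chart callers' comments might be screened, here is a minimal Python sketch. Only "went wrong" and "pulled up" come from the comments quoted in this post; the other phrases, and the function name flag_incident, are assumptions for illustration. A real phrase list would have to come from reading actual charts, and a flagged comment is only a candidate for review, not a confirmed fatality.

```python
import re

# Hypothetical phrase list. Only "went wrong" and "pulled up" appear in this
# post; the rest are assumed examples of what chart callers might write.
INCIDENT_PATTERNS = [
    r"went wrong",
    r"pulled up",
    r"broke down",
    r"vanned off",
    r"euthanized",
]
INCIDENT_RE = re.compile("|".join(INCIDENT_PATTERNS), re.IGNORECASE)


def flag_incident(comment: str) -> bool:
    """Return True if a chart comment contains any incident phrase."""
    return bool(INCIDENT_RE.search(comment))


# Made-up comments, for illustration only.
comments = [
    "Pulled up in distress",
    "Rallied wide, just missed",
    "Went wrong, vanned off",
]
flagged = [c for c in comments if flag_incident(c)]
print(flagged)
```

Simple phrase matching will miss unusual wording and will catch horses that were pulled up and walked home fine, which is one reason to cross-check against other sources.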

It's ironic that this chart for the 2008 Kentucky Derby, the moment that ushered in a new era of sorts regarding Thoroughbred safety and awareness, only informs us that Eight Belles "pulled up in distress". Kind of an understatement, don't you think?

The last step we might take to tally breakdowns is bold and arguably impractical. But since we're just writing ideas on the whiteboard... Why not conduct a census of all owners and trainers, asking them about any of their horses that have broken down over the last X number of years? They certainly have records, right? Will they cooperate and even provide some circumstantial information about the incident? Beats me, but they might. They should.

Is all of this overkill? No, not really. It's called triangulation. We can use multiple methods to gather the same information. Holes and gaps in one data source might be filled in by another data source.

So, in the end, there's plenty of breakdown information to be ferreted out. And we don't have to pick and choose -- we'll take it all and cross-reference between different datasets until confident that we've captured the most complete and accurate census of racetrack breakdowns possible given existing data. Good data results in a good model.
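The cross-referencing idea can be sketched in a few lines of Python. Everything here is made up for illustration: the horse names, the dates, the source labels, and the choice to match records on horse name plus race date. Real matching would need something sturdier, such as a registration number.

```python
from collections import defaultdict

# Toy records: (horse name, race date, source). All values are invented.
records = [
    ("Horse A", "2012-05-01", "injury_database"),
    ("horse a ", "2012-05-01", "chart_comments"),
    ("Horse B", "2012-06-10", "chart_comments"),
    ("Horse C", "2012-07-04", "state_database"),
]


def triangulate(rows):
    """Group incident reports by (normalized horse name, date)."""
    sources_by_incident = defaultdict(set)
    for horse, date, source in rows:
        key = (horse.strip().lower(), date)
        sources_by_incident[key].add(source)
    return dict(sources_by_incident)


incidents = triangulate(records)

# Incidents reported by only one source deserve a closer look.
single_source = sorted(k for k, v in incidents.items() if len(v) == 1)
print(len(incidents), "incidents;", len(single_source), "seen in one source only")
```

An incident that shows up in two or three sources can be trusted more than one that shows up in a single chart comment, and the single-source cases tell us where the holes are.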

This multi-method data collection approach is really not much different from the way Joe Drape and the New York Times approached the issue. There was some griping at the time about the Times' methodology which relied heavily on chart caller comments, but I don't really know why. Is there a better way to go back in time to estimate racetrack fatalities? This might be the only way to conduct a retrospective study going back many years.

The major difference between the Times' analysis and this predictive modeling idea is that they set out to merely describe the data they assembled and produce breakdown statistics over a recent window of time. We'll of course be able to do that too, with hopefully cleaner and more accurate, inclusive data.

Now, since we're building a predictive model tied to past performances, we want ALL the racing charts for a specific time frame, not just those for breakdowns. We'll explain why in a moment.

The remaining predictive modeling steps are in a sense "easier" than the giant task of collecting all relevant data. In brief, next the data will be cleaned for consistency and to prepare it for processing and analysis. A library of variables and features will be built. Then the data will be randomly partitioned for modeling purposes -- divided into training, testing and validation sets.
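For the curious, random partitioning might look something like this in Python. The 60/20/20 split, the fixed seed, and the function name partition are my assumptions, not a recommendation; the right proportions depend on how much data there is.

```python
import random


def partition(rows, train=0.6, test=0.2, seed=42):
    """Shuffle rows and split them into training, testing and validation sets.

    Whatever is left after the training and testing shares goes to validation.
    """
    rows = list(rows)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(rows)
    n_train = int(len(rows) * train)
    n_test = int(len(rows) * test)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_test],
        rows[n_train + n_test:],
    )


# Stand-in for 1,000 individual starts taken from the racing charts.
starts = list(range(1000))
train_set, test_set, validation_set = partition(starts)
print(len(train_set), len(test_set), len(validation_set))
```

One design choice worth thinking about: since the same horse appears in many charts, it may be safer to partition by horse rather than by individual start, so the model is never tested on a horse it has already seen in training.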

Predictive modeling would begin using a variety of analytical techniques, maybe logistic regression, neural networks, or genetic algorithms. The final choice of modeling technique, which serves as the underlying engine in the model, is usually a less important consideration than the data. A simpler model built with more comprehensive data will outperform a more complex model built on comparatively limited data.

If you're really interested in learning about predictive analytics in greater detail go ahead and Google it; there's no shortage of overviews and basic (and advanced) explanations out there. The main thing to understand is that algorithms would be trained to learn patterns of information that lead to breakdowns, compared with patterns of information that do not. When presented with new information it has never seen, the algorithm would look for those "risky" patterns and produce a relatively higher risk score when it finds them.
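To make "trained to learn patterns" a little more concrete, here is a toy logistic regression written in plain Python. The two yes/no features and the eight training rows are invented purely for illustration, and a real model would use far more data and an established library. The point is only the mechanics: fit on past starts, then score a start the model has never seen.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and a bias by simple stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the log loss with respect to z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def risk_score(w, b, x):
    """Return a score between 0 and 1; higher means riskier."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Invented features: [sharp class drop (0/1), sudden speed-figure decline (0/1)]
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
# Invented outcomes: 1 = broke down in the next start, 0 = did not
y = [1, 0, 0, 0, 1, 0, 0, 0]

w, b = train_logistic(X, y)
high = risk_score(w, b, [1, 1])  # both warning signs present
low = risk_score(w, b, [0, 0])   # neither present
print(round(high, 3), round(low, 3))
```

In this made-up data only the combination of both warning signs precedes a breakdown, so the start with both signs gets the higher score. Real patterns would be nowhere near this tidy.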

If you're a horseplayer, think about how the "sheets" players focus on patterns of speed figures to represent form cycles. Certain form cycles are associated with expected performance levels in the next race. When those form cycles are encountered in the future, handicapping decisions can be quickly made. This model would be kind of like that too in a general sense, except certain patterns of speed figures (if demonstrated to be predictive of breakdowns in the next start) would be simultaneously evaluated in combination with many other factors.

Which data then would be useful in predicting breakdowns? Well, the only way to answer that is to begin the analysis. The data always has secrets. But we can speculate with a laundry list of possibilities.

Speed figures: Already mentioned this; analysis of patterns of figures; sudden declines/improvements.
Pedigree: Imagine a giant table of breakdown risk indices representing different sires or sire lines.
Class: Analysis of patterns of downward class movement; sharp drop in class.
Other: Trainer, workout patterns, typical time between races, equipment, track surface, track condition, weight carried, weather, medication, recent scratch.
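Two of the items on that list can be turned into derived variables with very little code. This Python sketch is only an assumption about how one might define them: the change in the latest speed figure versus the average of the earlier ones, and the percentage drop in claiming price from the previous start. The numbers are invented.

```python
from statistics import mean


def derive_features(speed_figures, claiming_prices):
    """Build two illustrative variables from a horse's recent starts.

    Both lists run oldest to newest and need at least two entries.
    """
    last_fig, prior_figs = speed_figures[-1], speed_figures[:-1]
    prev_price, last_price = claiming_prices[-2], claiming_prices[-1]
    return {
        # Negative means the latest figure fell below the horse's recent norm.
        "figure_change": last_fig - mean(prior_figs),
        # Fraction by which the claiming price dropped since the last start.
        "class_drop_pct": (prev_price - last_price) / prev_price,
    }


features = derive_features(
    speed_figures=[80, 85, 84, 70],
    claiming_prices=[50000, 50000, 40000, 20000],
)
print(features)  # figure_change is -13, class_drop_pct is 0.5
```

Whether a 13-point figure decline paired with a 50% class drop actually means anything is exactly what the modeling would have to tell us.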

But don't get the impression that this would be a mere fishing expedition for correlation. Fortunately there's an existing body of research to inform the modeling efforts. There have been a couple of updates from Dr. Tim Parkin of the University of Glasgow who is analyzing the Jockey Club's EID. His research efforts and others like it would be utilized fully.

"All models are wrong, but some are useful." - George E. P. Box

Mr. Box, a statistician, is quite right. Even if a model predicting racetrack breakdowns is working well, it will usually be wrong, but still very useful. How can that be? Bear with me on this.

There's an important concept in predictive modeling known as a False Positive Ratio. Basically, this is the ratio of predictions that turn out to be wrong to predictions that turn out to be right, over the long term. If our model predicts that a certain horse is at high risk of breaking down, but the horse appears sound and continues to run without incident, this is a false positive. Likewise, a horse assigned a high risk of breakdown score who does suffer injury, or better yet (much better), is prevented from racing upon discovery of soundness issues, is a True Positive.

Some time ago I worked for a major credit card company building predictive models trained to discover fraudulent transactions. Maybe you've received a call from your bank about a card transaction that seemed suspicious. Your bank was using such a model. The models were typically calibrated to perform at a 30:1 false positive ratio. A fraud investigator would be expected to discover, on average, one true fraudulent transaction for every thirty that the model flagged as risky.
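The arithmetic behind a false positive ratio is simple enough to show. In this Python sketch the counts are invented to land on the same 30:1 figure: out of 1,000 cases the model flags 62, and 2 of those turn out to be real. Under the definition given above, that is 60 wrong flags for every 2 right ones.

```python
def false_positive_ratio(flags, outcomes):
    """Ratio of wrong flags to right flags.

    flags:    True where the model flagged the case as risky
    outcomes: True where the bad outcome really happened
    """
    false_pos = sum(1 for f, o in zip(flags, outcomes) if f and not o)
    true_pos = sum(1 for f, o in zip(flags, outcomes) if f and o)
    if true_pos == 0:
        raise ValueError("no true positives, so the ratio is undefined")
    return false_pos / true_pos


# Invented example: 1,000 cases, 62 flagged, 2 of the flagged are real.
flags = [True] * 62 + [False] * 938
outcomes = [True] * 2 + [False] * 998

ratio = false_positive_ratio(flags, outcomes)
print(f"{ratio:.0f}:1")  # prints 30:1
```

Raising the score threshold flags fewer cases and usually lowers the ratio, at the cost of missing more real ones. That trade-off is what gets tuned in the field.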

Our predictive breakdown model would work in a similar way. It will take a little time to discover what false positive ratio performs best "in the field".

Are you still with me so far? If not, that's only because I haven't explained a complex topic simply enough. The rest is more straightforward though.

Imagine that we've built a predictive breakdown model that we feel is ready to "go live". Now what? Remember this question was answered in the very first step of the predictive modeling process. It was at that time that we might have decided it would be feasible to run the model on the back end when entries are produced (yes, there's generally a substantial software engineering component involved with predictive model implementation). We may have also imagined an automated process to funnel high breakdown scores (above a certain threshold) to racetrack veterinarians. Now would be the time to fine-tune that process.
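The funnel itself could be as plain as the Python sketch below. The horse names, the scores, the 0.25 threshold, and the function name flag_for_vet are all hypothetical; the real threshold would come out of the false positive ratio tuning described earlier.

```python
def flag_for_vet(entries, threshold):
    """Return entries at or above the threshold, highest risk first."""
    flagged = [(horse, score) for horse, score in entries if score >= threshold]
    return sorted(flagged, key=lambda e: e[1], reverse=True)


# Invented risk scores for one race's entries.
entries = [
    ("Horse A", 0.02),
    ("Horse B", 0.31),
    ("Horse C", 0.08),
    ("Horse D", 0.45),
]

vet_list = flag_for_vet(entries, threshold=0.25)
for horse, score in vet_list:
    print(f"{horse}: {score:.2f}")
```

Sorting highest risk first is just a convenience so the vet's time goes to the most worrying horses when time is short.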

I'll not pretend to have any special insight into the track vet's prospective role in this process. I'm assuming though that there could be a more intensive pre-race inspection performed on those horses identified as at-risk. If the unsoundness predicted by the model can't be routinely and efficiently detected then the model will be a tool of little practical value.

I was once told by a Thoroughbred racing industry leader who will remain unnamed (it was Alex Waldrop) that "this industry needs to be more analytical in the way that it approaches problems". Alex could not be more right of course, yet we don't see it, not yet.

We don't see an analytical approach in marketing that appears to be targeting any specific profile of potential customers. Actually, I take that back, some marketing tactics do appear to be targeting certain demographic groups -- millennials for example. But I don't know why. Where is the data that suggests millennials represent the horseplayers of tomorrow?

We don't see an analytical approach in product pricing (pari-mutuel takeout rates). There is a theoretically "correct" answer to what these rates should be to maximize profit. Research and analysis can point the way to enlightenment. But instead, regulators set takeout rates like giggling elves pulling levers in a secret workshop.

And we don't see an analytical approach in reducing breakdowns. Hence the idea behind this post.

There is an important question still to answer. Which racing organization is best positioned to take on a project to build a predictive breakdown model?

The Jockey Club is the obvious answer. This would fit right in with the many projects they've backed in the last year or two. They have the required data and the influence to acquire more data, and they have the resources.

Another less obvious candidate might be the NYSRWB and the New York Racing Association. The new and improved NYRA is... well... new and improved. This initiative feels like something they could be interested in exploring. The racing taking place under their aegis might stand to benefit more so than other jurisdictions. There's probably enough racing in New York alone to build a predictive model. It could be an informative pilot study.

As you can see, I feel rather strongly that Thoroughbred racing should get started down the path to predicting breakdowns and intervening to prevent them from happening. I also think that progress down that path could and should be measured not in years, but months. Just getting started and letting the public know would be good PR for racing. Such a modern, data-driven initiative would be respected by those whose respect the industry has lost or is losing, including certain members of the U.S. Congress.

The first models don't have to be brilliant, just very useful. We'll save the brilliant part for later when we're able to merge vet records into the data.