There’s been lots of debate over whether the data scientist position is the sexiest job of the 21st century. Despite the Unicorn hype, spending a day with them at the Wrangle conference, an event staged by Cloudera, was a surprisingly earthy experience. It wasn’t an event chock full of algorithms, but instead, it was about the trials and tribulations of making data science work in a business. The issues were surprisingly mundane. And by the way, the brains in the room spoke perfectly understandable English.
It starts with questions as elementary as finding the data, and enough of it, to learn something meaningful. Or defining your base assumptions; a data scientist with a financial payments processor found definitions of fraud were not as black and white as she (or anybody) would have expected. And assuming you’ve found those data sets and established some baseline truths, there are the usual growing pains of scaling infrastructure and analytics. What might compute well in a 10-node cluster might have issues when you scale many times that. Significantly, the hiccups could be logical as well as physical; if your computations have any interdependencies; surprises can emerge as the threads multiply.
But let’s get down to brass tacks. Like why run a complex algorithm when a simple one will do. For instance, when a flyer tweets about bad services, it’s far more effective for the airline to simply respond to the tweet asking the customer to provide their booking number (thru private message) rather than resort to elaborate graph analytics establishing the customer’s identity. And don’t just show data for the sake of it; there’s a good reason why Google Maps GPS simply shows colored lines to highlight best routes rather than dashboards at each intersection showing which percentages of drivers turned left or went straight. When formulating queries or hypotheses, look outside your peer group to see if it makes sense through other peoples’ eyes.
Data scientists face many of the same issues as developers at large. One of the speakers admitted resorting to Python scripts rather than some heavier weight frameworks like Storm or Kafka; the question in retrospect is how well are those scripts documented for future reference. Another spoke of the pain of scale up of infrastructure not designed for sophisticated analytics; in this case, a system built with Ruby scripting (not exactly well suited for statistical programming) on a Mongo database (not well suited for analytics), and taking Band-Aid approaches (e.g., replicating the database nightly to a Hadoop cluster) before finally biting the bullet and rewriting the code to eliminate the need for wasteful data transfers. Another spoke of the difficulty of debugging machine learning algorithms that get too complex for their own good.
There are moral questions as well. Clare Corthell, who heads her own ML consulting firm, made an impassioned plea for data scientists to root out bias in their algorithms. Of course, the idea of any human viewing data or querying it objectively is a literal impossibility as we’re all human, we see things through our own mental lenses. In essence, it means factoring in human biases even in the most objective computational problems. For instance, the algorithms for online dating sites should factor skews, such as Asian men tending to rate African American women more negatively than average; or that loan approvals based on ‘objective’ metrics such as income, assets, and zip code in effect perpetuate the same redlining practices that fair lending laws were supposed to prohibit.
Data science may be a much hyped profession; the supply is far dwarfed by demand. We’ve long believed that there will always be need for data scientists, but that also, for the large mass of enterprises, the applications will start embedding data science. And it’s already happening, thanks to machine learning providing a system assist to humans in BI tools and data prep/data wrangling tools. But at the end of the day, as much as they might be considered unicorns, data scientists face very familiar issues.