Tech Insights

Contributor Columns on Information Technology and Security

A Primer on Hadoop Plus Machine Learning and Why You Should Care

Big data, and particularly big data analytics and machine learning, is widely touted as the next big thing in technology. Predictive data applications based on machine learning are already changing everyday life by combining historical data with near real-time user interactions to make accurate predictions.

Consider, for instance, the more advanced recommender systems that combine a user's previous selections with current searches to recommend the product that person is most likely looking for. Similar applications tell us whether airline ticket prices will increase or whether a credit card authorization request is indicative of fraud, and they filter our social network interactions based on our preferences and sentiment.

These current uses are only the tip of the iceberg. Machine learning can save us significant energy costs by predicting the right thermostat settings for the environmental conditions, make our health care more effective through personalization of medicines, improve customer and citizen satisfaction with services by analyzing text and sentiment, and have a significant impact in myriad other cases. Even the typically conservative Gartner Research puts predictive analytics in the “plateau of productivity” in its latest hype cycle on Big Data, a sign of its true potential. However, we are still at a point in the market where the current range of applications and benefits of predictive analytics only scratches the surface.

Enter Hadoop and Big Data Architecture

How, then, do businesses share in the gains in productivity and creativity that machine learning and big data analytics make possible?

While there isn’t a single right way to get started in big data, a lot of firms choose to tackle the problem of corralling their data first. Part of what makes data “big,” besides volume, is its variety: many different types of information residing in different locations, formats and stores. The Hadoop Distributed File System (HDFS) is particularly well suited for aggregating large data sets (i.e., file sizes in the gigabytes and terabytes) and making that data available to applications that require streaming access to it. As an alternative to traditional data warehouses and file systems, HDFS is the near de facto storage architecture of the big data ecosystem. It gives organizations a natural starting point for aggregating, cleaning, organizing and managing all of the data that will be accessed by applications for things like business intelligence and predictive analytics.

Why Machine Learning and How it is Used

Machine learning is an area of technology that advances the ability of computers to learn from historical data, recognize patterns and make accurate predictions with that knowledge, all in an automated way. Although known in academia for decades, machine learning has now emerged in industry as Business Intelligence (BI) 2.0, a discipline that makes it possible to go from dynamic report generation on historical data to predicting business outcomes. The impact of predictive services on shortening decision-making and improving the bottom line is massive.
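To make "learning from historical data" concrete, here is a minimal sketch in Python of one of the simplest possible learners, a nearest-centroid classifier. It is an illustration only, not a method named in this column: the churn scenario, the two features and all of the numbers are hypothetical, and a real system would use far more data and a more capable model.

```python
from math import dist
from statistics import mean


def fit_centroids(rows, labels):
    """Learn one average point (centroid) per label from historical rows."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    # zip(*members) turns a list of rows into per-feature columns.
    return {
        label: tuple(mean(column) for column in zip(*members))
        for label, members in grouped.items()
    }


def predict(centroids, row):
    """Predict the label whose centroid is closest to the new row."""
    return min(centroids, key=lambda label: dist(centroids[label], row))


# Hypothetical history: (support calls per month, years as a customer).
history = [(5, 1), (6, 0.5), (4, 1.5), (0, 4), (1, 5), (0, 6)]
outcome = ["churned", "churned", "churned", "stayed", "stayed", "stayed"]

model = fit_centroids(history, outcome)
print(predict(model, (5, 1)))  # a customer who resembles past churners
print(predict(model, (0, 5)))  # a customer who resembles past stayers
```

The "learning" step is nothing more than averaging past examples per outcome, and the "prediction" step is a distance comparison, but the shape is the same as in production systems: fit on history once, then score new records automatically.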

Consider how long it takes for a sales report to drive product promotion decisions within a company, compared with a recommender system that analyzes buyer behavior and automatically offers the product most likely to be purchased. The time savings for this use case are measured in orders of magnitude, as they are for many machine learning use cases.
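As a rough illustration of how such a recommender can turn buyer behavior into an automatic offer, the sketch below counts which products were bought together in past baskets and suggests the most frequent companion of whatever the shopper has selected. This is a deliberately simple co-occurrence approach chosen for the example, not the technique used by any particular product; the function names and the sample purchase history are hypothetical.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(baskets):
    """Count how often each ordered pair of products shares a basket."""
    counts = {}
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def recommend(counts, current_items, k=1):
    """Rank products by how often they were bought with the current items."""
    scores = Counter()
    for item in current_items:
        for other, n in counts.get(item, {}).items():
            if other not in current_items:
                scores[other] += n
    return [product for product, _ in scores.most_common(k)]


# Hypothetical purchase history, one list per past basket.
history = [
    ["tent", "sleeping bag", "lantern"],
    ["tent", "sleeping bag"],
    ["tent", "stove"],
    ["lantern", "batteries"],
]

counts = build_cooccurrence(history)
print(recommend(counts, ["tent"]))  # ['sleeping bag']
```

Once the counts are built, each recommendation is a dictionary lookup and a sort, which is why the offer can be made while the shopper is still browsing rather than after the next sales report.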

The applications for machine learning are limited only by the aspirations of the data scientist. Some of the most common ones in production today include:

  • Retail – recommender systems, pricing prediction
  • Financial services – fraud detection
  • Marketing and advertising – targeting and sentiment analysis
  • Telecommunications – churn prediction
  • Social network analysis – network and friend recommendations

 

Challenges in Taking Machine Learning Insights to Production

Despite their promise, predictive applications based on machine learning remain the domain of academia and the few organizations that invested in data science talent early on. Even so, you may be an organization or business that has both talented data scientists and access to copious clean data in your Hadoop cluster, and yet you’re still unable to turn those stores into big data gold. That’s because the journey from big data inspiration to production involves multiple tiers of skill sets and many non-integrated, hard-to-use tools. Data scientists will start with a laptop and an idea for a data product. They will build a prototype using a variety of disparate tools and a small subset of the available data. Getting to this point is hard enough, and then the journey often stalls because of a problem we have termed the “big data chasm”: scaling the prototype for use in a proof-of-concept phase that will leverage larger data sets. To do this, the model has to be re-implemented in programming languages robust enough for production environments (for example, ones that support redundancy and distributed computing). Data scientists often lack the programming skills needed to cross the big data chasm, so their inspiration and its potential remain unrealized.

Emerging Trend: Holistic Machine Learning Platforms

The challenges of the big data chasm have been at the center of innovative work by a few machine learning scholars and experts for some years. This has resulted in the emergence of complete platforms that include all of the tools and expertise needed to bridge the gap from inspiration to production for data scientists and developers alike.

The democratization of machine learning and the tools and platforms that are making that possible will be the subject of the next series of features.

*Images courtesy of GraphLab and Cloudera
