Tutorial: Declarative Machine Learning

Machine learning explores the study and construction of algorithms that learn and make predictions based on data. In the field of machine learning, data scientists, who specialize in analyzing data, are responsible for writing and modifying such algorithms.

Initially, a data scientist writes an algorithm based on a set of data features. This is generally an iterative process in which the data scientist explores different algorithms for predictive purposes. In this process, the amount of data and the number of features chosen for analysis may change. The data itself may be sparse or dense, compressed or uncompressed. Once the data and its analysis no longer fit on a single machine, operations are typically scaled to a cluster of machines. In summary, analysis is an iterative process over changes in the feature set and changes in the amount and type of data, which leads to the customization of algorithms. We can consider this to be “domain specific analytics.”

Analysis using a single machine drifts to a Big Data problem

Generally, a data scientist writes an algorithm in R, Python or another scripting language using a data set small enough to fit on a single machine. At a certain point, the algorithm works for a given situation.

However, the problem is not yet fully solved. In most cases, the data will grow to the point where it cannot fit onto a single machine, and it becomes a Big Data problem.

How does an organization solve a Big Data Machine Learning problem?

Generally, in an organization, a data scientist writes an algorithm using a small data set that fits on a single machine and makes that “prototype” work. Then, a systems programmer who is an expert in clustered environments gets involved to run this algorithm in a clustered environment with larger-scale data. Adjusting the algorithm for the cluster is an iterative process that requires continuous communication between the data scientist and the systems programmer until both are satisfied with the outcome. Many organizations make this model work, but it comes with quite a few challenges. The data scientist writes the algorithm in R, Python or another scripting language, and that script must then run efficiently and effectively on target platforms such as Apache Spark or Hadoop, which is not a trivial job.

Challenges in analyzing vertically across the business?

As I said earlier, this is “Domain specific analytics.” What will happen if the data scientist needs to analyze the data vertically across the business? An organization may have expertise in a particular business, say a “Car Selling” or “Car Manufacturing” business. Do we expect a person who is specialized in a particular business to be an analyst as well? Probably not, since it’s difficult to find a person with such a set of skills.

We see there are at least four types of issues in such a situation:

    1. It’s difficult to hire a person with multi-dimensional skills.
    2. It’s a dual effort to write an algorithm, first for a single machine with small data and then on a cluster with big data.
    3. The process of doing domain specific analysis is iterative and will need to change per situation.
    4. Human factors, involving communication between the data scientist and systems programmer to run an algorithm successfully on a cluster that was originally written for small data on a single machine, will slow down the effort.

Challenges for Data Scientists?

How do you expect a data scientist to deal with such dynamism? Do you expect data scientists to handle these variations in data or runtime environment when they write an algorithm? Ideally, we would like a data scientist to be able to write an algorithm that is independent of data characteristics and runtime environment.

Motivation for “Declarative Machine Learning”

This motivated us to develop “Declarative Machine Learning” so that data scientists can write an algorithm in an expressive language. An algorithm written by a data scientist should be independent of data characteristics, scale of data, and runtime environment where the algorithm will be run. Data scientists should have the flexibility to write new algorithms, reuse existing algorithms, or customize algorithms as needed. We wanted data scientists to be able to treat this as a single machine problem.

This leads to four high level requirements:

    1. High-level semantics: A data scientist should be able to write an algorithm in a high-level language without focusing on any low-level implementation details. He or she should be able to express goals through easy semantics. A data scientist should be able to understand the semantics and debug easily as needed.
    2. Flexibility: A data scientist should have flexibility to leverage existing algorithms with or without any customization. A data scientist should be able to write new algorithms easily.
    3. Data independence: A data scientist should not worry about data characteristics while writing the algorithms. Data could be sparse/dense, it could be analyzed per row or column, it may need to be cached, it could be in compressed or non-compressed form, but the algorithm should be able to be written without considering any of these data characteristics.
    4. Scale independence: The size of the data could be small or large. It could fit on a single machine for analysis or across a distributed environment.
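
To make the data-independence requirement a little more concrete, here is a minimal Python sketch. It is only an illustration under assumptions, not SystemML code: the classes DenseMatrix and SparseMatrix and the function col_means are hypothetical names invented for this example. The point is that the "algorithm" (col_means) is written once against a small common interface and never inspects how the data is stored.

```python
class DenseMatrix:
    """Hypothetical dense storage: a list of rows."""

    def __init__(self, data):
        self.data = data
        self.rows = len(data)
        self.cols = len(data[0]) if data else 0

    def entries(self):
        # Yield (row, col, value) for every nonzero cell.
        for i, row in enumerate(self.data):
            for j, v in enumerate(row):
                if v != 0.0:
                    yield i, j, v


class SparseMatrix:
    """Hypothetical sparse storage: a dict keyed by (row, col)."""

    def __init__(self, rows, cols, cells):
        self.rows = rows
        self.cols = cols
        self.cells = dict(cells)

    def entries(self):
        # Only nonzero cells are stored, so just walk the dict.
        for (i, j), v in self.cells.items():
            yield i, j, v


def col_means(m):
    """The 'algorithm': written once, unaware of the storage format."""
    sums = [0.0] * m.cols
    for _, j, v in m.entries():
        sums[j] += v
    return [s / m.rows for s in sums]


dense = DenseMatrix([[1.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
sparse = SparseMatrix(3, 2, {(0, 0): 1.0, (1, 0): 3.0, (2, 1): 4.0})
same_result = col_means(dense) == col_means(sparse)
```

In a declarative system the choice between such representations would be made by the compiler and runtime rather than by the person writing col_means.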

Based on these requirements, we need something that can understand algorithms written in a high-level language and transform those into instructions to be executed on a target environment based on data characteristics. We realized we needed to leverage database optimization technologies and other database features to handle such a transformation. We developed a query optimizer to do the transformation from high-level statements written in a script to runtime instructions on a target environment based on characteristics of the data and runtime environment. The goal is to have a high-level language with the ability to scale in many different dimensions to many different data characteristics so that we can iterate faster.

Query Optimizer

The query optimizer is based on database query optimization techniques. The query optimizer will read statements written in a high-level language. These statements will be converted into smaller statement blocks. Based on data characteristics and the available runtime environment, individual smaller blocks get translated into generic data flow representations called high-level operations (HOPs). Subsequently, the optimizer applies dynamic rewrites and optimizes HOPs to generate low-level operations (LOPs) that will be a generic representation of the runtime execution plan. Currently, if the amount of memory required to run a particular instruction is available on a single node, then that instruction will be run in memory on the single node (Control Program); otherwise, that instruction will run in a distributed environment such as Spark or Hadoop as per the user’s preference.
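
The memory-based choice described above can be sketched in a few lines of Python. This is a simplified illustration, not SystemML's actual cost model: the names MatrixStats, estimate_bytes and choose_exec_type are hypothetical, and the 8 bytes per dense cell and 12 bytes per sparse nonzero are assumptions made for the example.

```python
from dataclasses import dataclass

BYTES_PER_DOUBLE = 8        # assumed size of one dense cell
BYTES_PER_SPARSE_CELL = 12  # assumed: 8-byte value + 4-byte column index


@dataclass(frozen=True)
class MatrixStats:
    rows: int
    cols: int
    sparsity: float = 1.0   # fraction of nonzero cells; 1.0 means dense


def estimate_bytes(m):
    """Rough memory estimate: the cheaper of a dense and a sparse layout."""
    cells = m.rows * m.cols
    dense = cells * BYTES_PER_DOUBLE
    sparse = int(cells * m.sparsity) * BYTES_PER_SPARSE_CELL
    return min(dense, sparse)


def choose_exec_type(inputs, output, memory_budget_bytes,
                     distributed_backend="SPARK"):
    """Run on the single node ("CP") if inputs and output fit in memory,
    otherwise fall back to the user's preferred distributed backend."""
    needed = sum(estimate_bytes(m) for m in inputs) + estimate_bytes(output)
    return "CP" if needed <= memory_budget_bytes else distributed_backend


budget = 2 ** 30  # assume 1 GiB is available to the control program

# X (1000 x 100) times v (100 x 1) fits comfortably on one node.
small_plan = choose_exec_type(
    [MatrixStats(1000, 100), MatrixStats(100, 1)], MatrixStats(1000, 1), budget)

# The same statement over 10^8 rows no longer fits and goes to the cluster.
large_plan = choose_exec_type(
    [MatrixStats(10 ** 8, 1000), MatrixStats(1000, 1)],
    MatrixStats(10 ** 8, 1), budget)
```

The script being compiled is the same in both cases; only the size information fed to the decision differs.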

SystemML Architecture

The above diagram shows a pictorial representation of SystemML. At the top of the diagram, you can see that a user can write an algorithm in an R-like or Python-like language supported by SystemML. SystemML has the capability to expand language support for other languages as well.

An algorithm, which is expressed as a set of statements in a script, will be parsed by the parser for validation and then transformed into smaller blocks. These blocks go through static and dynamic rewrites to generate high-level operations (HOPs) that are a generic representation of data flow. Any static transformation is independent of data characteristics, whereas dynamic transformation is based on data characteristics. At the HOP level, the known data size is labeled and propagated across the HOP tree for a given block. Based on data size, the target runtime platform gets determined at the HOP level.
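
As a rough illustration of the two ideas in this paragraph, the Python sketch below builds a tiny HOP tree, applies one static rewrite that needs no size information (a double transpose cancels out), and then propagates known dimensions from the leaves upward. The Hop class, the operator names "read", "t" and "matmul", and both functions are hypothetical and far simpler than what SystemML does internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Hop:
    op: str                                  # "read", "t" (transpose) or "matmul"
    inputs: List["Hop"] = field(default_factory=list)
    dims: Optional[Tuple[int, int]] = None   # (rows, cols); None means unknown


def rewrite_static(h):
    """Static rewrite t(t(X)) -> X; independent of data characteristics."""
    h.inputs = [rewrite_static(c) for c in h.inputs]
    if h.op == "t" and h.inputs[0].op == "t":
        return h.inputs[0].inputs[0]
    return h


def propagate_dims(h):
    """Label each HOP with its output size, bottom-up through the tree."""
    child_dims = [propagate_dims(c) for c in h.inputs]
    if h.op == "read":
        return h.dims                        # sizes of inputs come from metadata
    if any(d is None for d in child_dims):
        h.dims = None                        # unknown sizes stay unknown
    elif h.op == "t":
        rows, cols = child_dims[0]
        h.dims = (cols, rows)
    elif h.op == "matmul":
        (r1, c1), (r2, c2) = child_dims
        if c1 != r2:
            raise ValueError("inner dimensions do not match")
        h.dims = (r1, c2)
    else:
        raise ValueError("unsupported operator: " + h.op)
    return h.dims


# Hypothetical statement: t(t(t(X))) %*% y
X = Hop("read", dims=(1000, 10))
y = Hop("read", dims=(1000, 1))
root = Hop("matmul", [Hop("t", [Hop("t", [Hop("t", [X])])]), y])

root = rewrite_static(root)      # left input collapses to t(X)
result_dims = propagate_dims(root)
```

Once every HOP carries a size, a decision such as single node versus cluster can be made per operation rather than per script.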

Every HOP gets transformed to one or more low-level operations (LOPs) that are a generic representation of the runtime execution plan. These transformations are based on dynamic rewrites and dynamic recompilation. Physical operators are substituted in the runtime execution plan based on data and runtime characteristics, and then the runtime execution plan is executed on the target environment.
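
The substitution of physical operators can be pictured with the following Python sketch. It is a hedged illustration only: the function select_matmul_operator, the operator names in_memory_matmul, broadcast_matmul and shuffle_matmul, and the broadcast threshold are all assumptions made for this example and are not SystemML's real operator names or heuristics.

```python
from typing import Optional, Tuple

Dims = Tuple[int, int]


def select_matmul_operator(left: Optional[Dims], right: Optional[Dims],
                           exec_type: str,
                           broadcast_limit_cells: int = 1_000_000) -> str:
    """Pick a physical operator for a logical matrix multiplication."""
    if exec_type == "CP":
        # Single node: everything is in memory, one operator suffices.
        return "in_memory_matmul"
    if left is None or right is None:
        # Sizes unknown at compile time: pick the conservative operator
        # and revisit the decision when the plan is recompiled.
        return "shuffle_matmul"
    smaller = min(left[0] * left[1], right[0] * right[1])
    # If one side is small, ship it to every worker instead of shuffling.
    if smaller <= broadcast_limit_cells:
        return "broadcast_matmul"
    return "shuffle_matmul"


# At compile time the left input's size is not yet known.
compiled = select_matmul_operator(None, (1000, 1), "SPARK")

# At runtime the size becomes known, so the choice is made again.
recompiled = select_matmul_operator((10 ** 8, 1000), (1000, 1), "SPARK")
```

Calling the same selection function again once sizes are known is the essence of recompilation in this sketch: the logical operation does not change, only the physical operator chosen for it.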

SystemML Compilation Chain

The following diagram shows various compilation processes discussed in an earlier section. Please have a look at the cited paper for more detailed information on each of these processes.

Out-of-the-box Algorithms

We have implemented several common algorithms. These algorithms illustrate how SystemML can be leveraged to write algorithms in a high-level language with R-like or Python-like syntax very easily.

Algorithms in the SystemML package
Cox Proportional Hazards Regression Analysis
Cubic Spline
Decision Tree
Generalized Linear Model
Kaplan-Meier Survival Analysis
L2 Support Vector Machine
Linear Regression
Minimal Support Vector Machine
Multi Log Regression
Principal Component Analysis
Random Forest
Step GLM
Step Linear Regression DS


We briefly looked at Declarative Machine Learning and discussed domain specific knowledge. We also discussed the language and compiler support needed to implement Declarative Machine Learning. A primary goal of this approach is to allow data scientists to write efficient and effective algorithms and improve productivity without thinking about data and runtime characteristics.

GitHub link and documentation
  1. Conference Publication
      Ghoting, Amol, et al. “SystemML: Declarative machine learning on MapReduce.” ICDE 2011.
  2. Journal Paper
      Boehm, Matthias, et al. “SystemML’s Optimizer: Plan Generation for Large-Scale Machine Learning Programs.” IEEE Data Eng. Bull. 2014.

Arvind Surve

