Channeling Oceans of IoT Data

IBM researchers in Haifa, together with partners from the COSMOS EU-funded project, are using Apache Spark™ to analyze the new wave of IoT data and solve problems in a way that is generic, integrated, and practical.

I presented this work at the Apache Spark Summit in Amsterdam in October 2015. See my slides on SlideShare: How Spark Enables the Internet of Things: Efficient Integration of Multiple Spark Components for Smart City Use Cases.

Way back before the days of the web, most data was created by manual data entry, which seriously limited the rate at which it could be produced. The next wave brought the Internet, in which meaningful data could be generated automatically by tracking the mouse clicks of web surfers. More recently, we’ve been seeing social networks, with individuals around the world sharing information about themselves, their locations, friends, and lifestyles. We could call this the ‘Internet of People’ – its size is determined by the number of humans using its services. Today, a myriad of sensors amassing volumes of information each day is bringing about another transformation known as the ‘Internet of Things’. This means data about our planet can be automatically generated, collected, stored, analyzed, and acted upon to do things smarter and faster, with little or no human intervention.

Why is everyone talking about the Internet of Things?

Sensors capturing data are by no means a new phenomenon. We have long seen them on space shuttles and missiles; what is novel here is their commoditization. As their cost decreases, they become ubiquitous – pervading our lives in our cellphones, cars, homes, clothing, appliances, and even pets. However, we are not just talking about cool gadgets and gimmicks. Traffic sensors collect data on vehicle speed, volume, and intensity around the clock. Healthcare centers continuously monitor patient heartbeats. Utility providers use meters to track the consumption of power, water, and fuel. Sensor-generated data is the fastest growing segment of data today – potentially even going beyond media and entertainment. Studies predict that by 2020, the number of network-connected sensors will reach 30 billion, more than four times the world population, and that the market opportunity will grow by trillions of dollars. The real question is: what is the best way to collect this data, store it, and analyze it so we can take advantage of this gold mine?

As part of the COSMOS EU-funded project, I have been working with colleagues at IBM Research – Haifa, ATOS, and the University of Surrey, to study this data. We’re taking a closer look at historical data to teach computers what is ‘normal behavior’ and have them notify us when things go wrong. As part of this effort, we are working to monitor traffic in Madrid, to help our COSMOS partner, the EMT Madrid bus company.

Our team is using machine learning to examine historical data and learn from it what constitutes regular behavior signifying that ‘all is well’, and which data signals anomalies that indicate potential problems such as traffic jams. By analyzing the historical data together with real-time information, we can use what we learn to make smart and fast decisions in real time as new data comes in.

Finding the truth about traffic
Madrid has 3,000 traffic sensors installed on various roads around the city. We selected the 300 sensors with the most data and began collecting, storing, and indexing it. By applying machine learning techniques to the data for each sensor, we are able to understand what residents can expect for the area – whether morning, afternoon, or evening. Using the historical data helps us learn, in an intelligent way, what is expected and what is unusual and needs to be reported. For example, 20 km per hour during rush hour is normal, but 20 km per hour at midnight means something is wrong.
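To make the rush-hour versus midnight example concrete, here is a minimal Python sketch of learning a per-period baseline for a single sensor and flagging readings that stray from it. It is an illustration only: the time-of-day buckets, the sample readings, the function names, and the three-standard-deviation rule are all assumptions made for this example, and a simple mean-and-spread check stands in for the machine learning techniques the project actually uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def period_of(hour):
    """Map an hour (0-23) to a coarse time-of-day period (hypothetical buckets)."""
    if 7 <= hour < 10 or 17 <= hour < 20:
        return "rush_hour"
    if 10 <= hour < 17:
        return "daytime"
    return "night"


def learn_baselines(history):
    """Learn (mean, standard deviation) of speed per period for ONE sensor.

    history is an iterable of (hour, speed_kmh) readings.
    """
    by_period = defaultdict(list)
    for hour, speed in history:
        by_period[period_of(hour)].append(speed)
    return {p: (mean(v), pstdev(v)) for p, v in by_period.items()}


def is_anomalous(baselines, hour, speed, k=3.0, min_std=1.0):
    """Flag a reading more than k standard deviations from its period's mean.

    min_std guards against a zero spread when the history is very uniform.
    """
    mu, sigma = baselines[period_of(hour)]
    return abs(speed - mu) > k * max(sigma, min_std)


# Made-up historical readings for one sensor: (hour of day, speed in km/h).
history = [
    (8, 18), (8, 22), (9, 20), (18, 19), (18, 21),    # rush hour
    (0, 58), (1, 62), (23, 60), (2, 61), (0, 59),     # night
    (12, 44), (13, 46), (15, 45),                     # daytime
]
baselines = learn_baselines(history)

rush_hour_alert = is_anomalous(baselines, hour=8, speed=20)
midnight_alert = is_anomalous(baselines, hour=0, speed=20)
print(rush_hour_alert, midnight_alert)
```

With these made-up readings, 20 km per hour at 8:00 falls inside the rush-hour baseline, while the same speed at midnight is far outside the night baseline and gets flagged.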

The EMT bus company has a control room in which dozens of employees watch over monitors and make manual adjustments to the bus schedules when things are not working right. By analyzing this data and having alerts come in automatically, we are helping EMT work more efficiently to both handle and prevent problems.

The same technique being used for Madrid traffic can be used to detect leaky pipes before damage is caused in the home, changes to occupancy in office buildings so lights or air conditioning can be turned off, an appliance about to break down, or problems in oil pipes so authorities can be alerted. Imagine if your home could detect that there is a change to the moisture levels under the floor, conclude that there is pipe leakage, and automatically turn off the main water supply to the apartment before it floods.

Where Spark fits in

Because IoT data is so massive, we needed a low-cost and scalable way to store and analyze it. This is where OpenStack Swift and Apache Spark fit in. Swift stores data scalably as objects, which can be annotated with metadata, and we use Spark to access this data. One nice thing about the Spark platform is Spark SQL, a package that allows us to access multiple external data sources through an SQL interface. It also provides an API for plugging in new data sources by implementing a driver. We took advantage of this and introduced a new technique that uses metadata search over Swift objects to retrieve only the data that is relevant to our queries. Using this method, we were able to cut down the number of data requests from Swift to Spark by a factor of 20, without making core changes to Spark.

Once we have efficient access to the data, Spark provides a whole library of machine learning algorithms to choose from for analysis. In the case of Madrid traffic, we use K-means clustering, although we could plug in a different algorithm for a different IoT use case. The bottom line is that using Spark and Swift, we can learn from the IoT history in order to make smart and fast decisions in real time.
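The idea of using metadata search to fetch only the relevant objects can be sketched in plain Python, without Spark or Swift. Everything here is hypothetical: `InMemoryStore`, `search_metadata`, the metadata keys `sensor_id`, `min_ts`, and `max_ts`, and the sample data are invented for illustration and are not the Swift API, the Spark SQL data source API, or the driver described above. The sketch only shows why consulting metadata first reduces the number of object fetches.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    """Stand-in for one stored object annotated with searchable metadata."""
    name: str
    metadata: dict   # hypothetical keys: sensor_id, min_ts, max_ts
    rows: list       # (sensor_id, timestamp_hours, speed_kmh) readings


class InMemoryStore:
    """Toy object store that counts how many objects are fetched."""

    def __init__(self, objects):
        self._objects = {obj.name: obj for obj in objects}
        self.get_requests = 0

    def list_names(self):
        return list(self._objects)

    def search_metadata(self, sensor_id, t_start, t_end):
        """Names of objects whose metadata says they may hold matching rows."""
        return [
            obj.name
            for obj in self._objects.values()
            if obj.metadata["sensor_id"] == sensor_id
            and obj.metadata["min_ts"] <= t_end
            and obj.metadata["max_ts"] >= t_start
        ]

    def get(self, name):
        self.get_requests += 1
        return self._objects[name].rows


def query(store, sensor_id, t_start, t_end, use_metadata):
    """Return readings for one sensor in [t_start, t_end]."""
    names = (store.search_metadata(sensor_id, t_start, t_end)
             if use_metadata else store.list_names())
    matches = []
    for name in names:
        for row in store.get(name):
            row_sensor, ts, _speed = row
            if row_sensor == sensor_id and t_start <= ts <= t_end:
                matches.append(row)
    return matches


def build_store():
    """Three made-up sensors, four days each, one object per sensor per day."""
    objects = []
    for sensor in ("s1", "s2", "s3"):
        for day in range(4):
            hours = [day * 24 + h for h in (0, 8, 16)]
            rows = [(sensor, ts, 40.0) for ts in hours]
            meta = {"sensor_id": sensor, "min_ts": hours[0], "max_ts": hours[-1]}
            objects.append(StoredObject(f"{sensor}/day{day}", meta, rows))
    return InMemoryStore(objects)


pruned_store, full_store = build_store(), build_store()
pruned_rows = query(pruned_store, "s2", 24, 47, use_metadata=True)
full_rows = query(full_store, "s2", 24, 47, use_metadata=False)
print(pruned_store.get_requests, full_store.get_requests)
```

In this toy setup the metadata-guided query fetches 1 object where the full scan fetches 12, and both return the same rows. The actual savings depend on how the data is laid out across objects.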

Our architecture is generic. In addition to applying it to Madrid traffic, we are using exactly the same approach to implement other IoT use cases such as occupancy detection to turn off air-conditioners when no one is present, or anomaly detection to alert and possibly respond to malfunctioning electrical appliances. For me, this project has brought home the potential impact of IoT and how it is about to change our day-to-day lives. It’s going to be fascinating to see how having everything connected and generating the next wave of data will influence what we can do and how we do it.

