So far, fast data processing has been a purely technology-driven topic. The focus was clearly on mastering the technology to provide reliable and robust frameworks for working with moving data. From that perspective, things have made good progress, as hard challenges like “exactly-once processing” have been solved (see the data Artisans blog on how Apache Flink solves this).
Now it is time to look ahead to a future where data processing is not just getting faster, but where streaming technologies have a significant impact on the implementation of what John Launchbury, director of DARPA’s Information Innovation Office (I2O), calls the third wave of AI systems (https://www.youtube.com/watch?v=-O01G3tSYpU). In this context, constructing explanatory models for classes of real-world phenomena in order to introduce contextual adaptation requires constant development over time, which makes it a perfect match for stream processing.
Before we lose ourselves in the future, let’s get back to the present. Although sophisticated technologies are available and could help us make significant progress, we are actually quite hesitant, or even reluctant, to apply them. Why?
During a presentation, someone asked me about unique use cases for stream processing technologies that cannot be implemented with batch-oriented approaches.
At first I was a bit puzzled, as I could not come up with an answer on the spot, especially one that did not sound too sophisticated, like supporting explanatory models. Although I could have pointed to fraud detection or behavior-driven recommendation systems, these are not the kind of exclusive use cases the inquirer expected.
In the end, my answer was that there might be some special use cases, but for now fast data adds one important aspect to existing cases: the timeliness of data.
Most current use cases are based on historical data that is collected and provided through batch processes built on the concept of data-at-rest. One way to gain a significant edge over competitors is to offer services based on more current data. This is where data-in-motion enters the scene.
But that still leaves us with the question of why the transformation of businesses and processes from data-at-rest to data-in-motion happens so sluggishly. Is it simply a question of technology? Probably not, as frameworks like Akka have been around for quite a while now, and even newer ones like Apache Flink have recently matured into the mainstream.
From my point of view, and from what I have experienced so far, this is neither a technology issue nor a question of budget and resources. It is more of a human factor.
People tend to stick with traditions, existing processes and proven knowledge. At some point they were trained to work with SQL on data-at-rest, and they were introduced to the concept of reproducible results: running the same operation on the same stored data returns the same answer.
However, these two characteristics, data persistence and reproducible results, are absent from stream processing architectures and are replaced by uncertainty and time: it is not certain when a piece of information will arrive, or whether it will be received at all. Furthermore, data must be processed the moment it becomes available, since it is immediately superseded by the next event.
Although adding temporary caches or mixing in data from persistent storage is a well-established approach, people must come to terms with the nature of this paradigm to fully understand and leverage it.
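To make the shift from reproducible results to uncertainty and time a bit more tangible, here is a minimal sketch in plain Python. It is not tied to Akka or Apache Flink and does not use their APIs; the names `batch_counts` and `StreamingWindowCounter`, the `(event_time, payload)` tuples, the window size and the `allowed_lateness` parameter are all assumptions made up for illustration. The batch function sees the complete stored data set, while the streaming counter sees each event once, in arrival order, and has to decide at some point that a time window is finished.

```python
from collections import defaultdict


def batch_counts(events, window_size):
    """Data-at-rest: the complete data set is stored and can be re-read at will."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        counts[event_time // window_size * window_size] += 1
    return dict(counts)


class StreamingWindowCounter:
    """Data-in-motion (hypothetical sketch): every event is seen exactly once."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None            # highest event time seen so far
        self.open_windows = defaultdict(int)  # windows that may still change
        self.closed_windows = {}              # results that were already emitted
        self.dropped = 0                      # events that arrived too late

    def on_event(self, event_time, payload):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Assumption: nothing older than this point in time is waited for.
        watermark = self.max_event_time - self.allowed_lateness

        window_start = event_time // self.window_size * self.window_size
        if window_start + self.window_size <= watermark:
            # The window this event belongs to is already finished.
            self.dropped += 1
        else:
            self.open_windows[window_start] += 1

        # Emit every window that ends at or before the watermark.
        for start in sorted(self.open_windows):
            if start + self.window_size <= watermark:
                self.closed_windows[start] = self.open_windows.pop(start)


# Events in ARRIVAL order; event times 3 and 4 arrive out of order.
events = [(1, "a"), (2, "b"), (12, "c"), (3, "d"), (25, "e"), (4, "f")]

counter = StreamingWindowCounter(window_size=10, allowed_lateness=5)
for event_time, payload in events:
    counter.on_event(event_time, payload)

print(batch_counts(events, window_size=10))  # {0: 4, 10: 1, 20: 1}
print(counter.closed_windows)                # {0: 3, 10: 1}
print(counter.dropped)                       # 1
```

The batch job counts four events in the first window, while the streaming counter has already emitted three and drops the straggler. Neither result is wrong; the streaming result simply depends on when events arrive and how long the system is willing to wait, which is exactly the trade-off people trained on data-at-rest need to get used to.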
To turn your fast data architecture into a success, do not only tackle architectural questions but also focus on the people who are going to use your platform in the future. Help them embrace the future, address their fears and make them first-class citizens of your platform.