Twitter Hadoop project gets Apache's blessing

Sunday, 5 October 2014



Twitter's open source, real-time computation framework picks up the Apache Foundation's full backing and support

Storm, a framework for real-time data processing in Hadoop, got a major promotion. After joining the Apache Incubator in September 2013, it's now a full-blown, top-level Apache Foundation project.

Storm's main application is the processing of streaming real-time data ("fast data," per John Hugg's description). Its processing power is designed to scale across multiple nodes, with up to 1 million 100-byte messages per second per node as an advertised benchmark. As with most other work done for Hadoop, Java is the most broadly supported language for working in Storm, though other languages are in the mix.
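To put the advertised benchmark in more familiar units, the arithmetic can be sketched in a few lines of Python. The message rate and size come from the figure quoted above; the function names and the assumption of perfectly linear scaling across nodes are illustrative only and are not claims made by the Storm project.

```python
# Back-of-the-envelope throughput from the advertised benchmark.
MESSAGES_PER_SECOND_PER_NODE = 1_000_000  # advertised figure quoted in the article
MESSAGE_SIZE_BYTES = 100                  # advertised figure quoted in the article


def node_throughput_bytes(messages_per_second=MESSAGES_PER_SECOND_PER_NODE,
                          message_size=MESSAGE_SIZE_BYTES):
    """Bytes per second one node would handle at the advertised rate."""
    return messages_per_second * message_size


def cluster_throughput_bytes(nodes):
    """Aggregate bytes per second, ASSUMING perfectly linear scaling (hypothetical)."""
    return nodes * node_throughput_bytes()


per_node_mb = node_throughput_bytes() / 1_000_000           # decimal megabytes
ten_node_gb = cluster_throughput_bytes(10) / 1_000_000_000  # decimal gigabytes
print(f"{per_node_mb:.0f} MB/s per node, {ten_node_gb:.0f} GB/s across 10 nodes")
```

In other words, the advertised rate works out to roughly 100 MB of message payload per second per node. Real clusters rarely scale perfectly linearly, so the ten-node figure is best read as an upper bound.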

What's the significance of Storm becoming a "top-level project"? Mainly, it's a vote of confidence from the Apache Foundation that the project and the community around it have demonstrated a high degree of sustainability and robustness. For those efforts, Storm receives the full backing of the Foundation for its development. Other Hadoop-related top-level projects at Apache include Tez, Spark, and Mesos.

Storm first found a foothold at Twitter following the company's acquisition of original developer BackType in July 2011; Twitter later released Storm as open source. Since becoming an Apache project, Storm has received commits and contributions from a number of other Hadoop-related entities, such as Yahoo (the original developer of Hadoop) and Hortonworks.

Storm is not the only Hadoop framework for processing streaming data. Another Apache project, Spark, also analyzes streaming data, though with an emphasis on fast in-memory processing; it can handle both batch and streaming jobs. Former Lawrence Livermore National Laboratory software engineer Xinh Huynh explains one key difference between the two: "Storm is a good choice if you need sub-second latency and no data loss. Spark Streaming is better if you need stateful computation, with the guarantee that each event is processed exactly once."
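The distinction Huynh draws between "no data loss" and "processed exactly once" can be illustrated with a toy model. The Python sketch below does not use the Storm or Spark APIs; every name in it is hypothetical. It assumes that "no data loss" is achieved by replaying events whose acknowledgement went missing (commonly called at-least-once delivery), and it shows why a stateful computation such as a running count then needs extra bookkeeping to behave as if each event arrived exactly once.

```python
from collections import Counter


def deliver_at_least_once(events, lost_acks):
    """Yield every event; replay those whose acknowledgement was 'lost'.

    `lost_acks` is a hypothetical set of event ids whose ack never arrived,
    so the source sends them a second time. Nothing is ever dropped.
    """
    for event_id, word in events:
        yield event_id, word
        if event_id in lost_acks:
            yield event_id, word  # replayed duplicate


def count_naive(deliveries):
    """Count every delivery, so replays inflate the totals."""
    counts = Counter()
    for _event_id, word in deliveries:
        counts[word] += 1
    return counts


def count_exactly_once(deliveries):
    """Apply each event id once by remembering the ids already counted."""
    seen = set()
    counts = Counter()
    for event_id, word in deliveries:
        if event_id in seen:
            continue  # duplicate from a replay; skip it
        seen.add(event_id)
        counts[word] += 1
    return counts


events = [(1, "storm"), (2, "spark"), (3, "storm")]
lost = {3}  # pretend the ack for event 3 never arrived

naive = count_naive(deliver_at_least_once(events, lost))
exact = count_exactly_once(deliver_at_least_once(events, lost))
print(dict(naive))  # {'storm': 3, 'spark': 1}
print(dict(exact))  # {'storm': 2, 'spark': 1}
```

No event is lost in either case, but only the second counter produces totals that match the original stream. Tracking which events have already been applied is one simple way to get that effect; real systems use their own, more scalable mechanisms.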

Storm is one of several recent software projects written in Clojure, a functional language reminiscent of Lisp that runs on the Java Virtual Machine. Clojure's power and expressiveness have made it a choice language for programmers trying to build big data projects, according to its creator Rich Hickey. Puppet Labs recently decided to rewrite key parts of the server-side functionality for its Puppet automation framework in Clojure to gain speed and take advantage of those same qualities.

In addition, Clojure gives programmers access to the entire ecosystem of libraries and tools already available in Java. That makes it an appealing language for Hadoop-related projects, given their deep, existing roots in the Java world.