See the full meetup event details at
http://www.meetup.com/Spark-NYC/events/220733560/
At this meetup, co-organized with Spark Summit East, we heard first about Spark on Google Cloud Platform and second about the new Spark DataFrame abstraction.
=================
First, a demo of Spark on Google Cloud Platform.
1) Seamlessly deploy Apache Spark 1.2 on Google Cloud Platform with the bdutil command-line tool, and start developing on your cluster within minutes.
2) Take advantage of both Apache Spark and Google Cloud Dataflow: the open source Spark Dataflow Runner lets you run the same pipeline code on-premises on Spark clusters and in the cloud with the managed Dataflow service (a sketch follows below).
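To make the portability claim concrete, here is a minimal word-count sketch written against the Google Cloud Dataflow Java SDK and executed on Spark via the open source runner. The input and output paths are hypothetical, and the SparkPipelineRunner entry point is assumed from the spark-dataflow project's early API, so exact class and method names may differ:

```java
import com.cloudera.dataflow.spark.EvaluationResult;
import com.cloudera.dataflow.spark.SparkPipelineRunner;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class PortableWordCount {
  public static void main(String[] args) {
    // The pipeline definition is backend-agnostic
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    p.apply(TextIO.Read.from("input.txt"))      // hypothetical input path
     .apply(Count.<String>perElement())         // count each distinct line
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         // Format each (line, count) pair for text output
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("counts"));         // hypothetical output prefix

    // Execute on a Spark cluster via the open source runner instead of
    // the managed Cloud Dataflow service
    EvaluationResult result = SparkPipelineRunner.create().run(p);
  }
}
```

The pipeline logic itself stays the same across backends; only the runner choice changes when moving between an on-premises Spark cluster and the managed service.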
=================
Second, Michael Armbrust, Databricks software engineer and lead of the Spark SQL project, presented the new DataFrame abstraction in Spark for large-scale data science. (EDITOR'S NOTE: Speaker change; previously was Reynold Xin)
Data frames in R and Python have become the de facto standards for data science. When it comes to Big Data, however, neither R data frames nor Python data frames integrate well with Big Data tooling or scale to large datasets.
Inspired by R and Pandas, Spark's DataFrame provides concise, powerful programmatic interfaces designed for structured data manipulation. In particular, when compared with traditional data frame implementations, it enables:
- Scaling from kilobytes to petabytes of data
- Reading structured datasets (JSON, Parquet, CSV, relational tables, ...)
- Machine learning integration
- Cross-language support for Java, Scala, and Python
Internally, the DataFrame API builds on Spark SQL's query optimization and query processing capabilities for efficient execution. Data scientists and engineers can use this API to express common analytics operations more elegantly. It makes Spark accessible to a broader range of users and improves optimization for existing ones.
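As a flavor of the API, here is a minimal, self-contained Java sketch against the Spark 1.3-era DataFrame API. It loads a structured JSON dataset and runs a small filter/aggregate; the file name and application name are illustrative:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DataFrameSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("df-sketch").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(sc);

    // Read a structured dataset; "people.json" is a hypothetical sample file
    DataFrame people = sqlContext.jsonFile("people.json");

    // Concise, declarative manipulation; Spark SQL plans the execution
    people.filter(people.col("age").gt(21))
          .groupBy("age")
          .count()
          .show();

    sc.stop();
  }
}
```

Note that the chain of operations is not executed eagerly step by step: Spark SQL's optimizer compiles it into a single optimized physical plan, which is what lets the same concise code scale from kilobytes to petabytes.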
Tags:
#Spark_SQL #Apache_Spark #Spark #Google_Cloud_Platform #Spark_DataFrame