How do you scale geospatial analytics on big data? And while you’re at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity.
By leveraging space-filling curves and indexing geometric shapes on the fly, Magellan computes massive geospatial joins scalably while hiding the complexities of indexing and join optimization from the end user. It has also been benchmarked among the fastest geospatial engines, even on a single node. Ram outlines Magellan’s design considerations, explains how it scales geospatial analytics without sacrificing simplicity or expressiveness and how it achieves fast single-node performance despite Spark’s usual framework overheads, and shares what’s next for the project.
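To make the space-filling curve idea concrete, here is a minimal Python sketch of a Z-order (Morton) index for longitude/latitude points. It is an illustration only and not Magellan's implementation: the function names (`interleave_bits`, `z_index`, `shared_prefix_bits`), the fixed 16-bit grid resolution, and the simple linear scaling of coordinates are all assumptions chosen for this example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits land on even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits land on odd positions
    return code


def z_index(lon: float, lat: float, bits: int = 16) -> int:
    """Map a lon/lat pair to a cell on a 2^bits x 2^bits grid, then to a Z-order code.

    Assumes lon is in [-180, 180] and lat is in [-90, 90] (hypothetical convention).
    """
    cells = 1 << bits
    # Scale into [0, cells) and clamp so lon=180 or lat=90 stays on the grid.
    x = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    y = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    return interleave_bits(x, y, bits)


def shared_prefix_bits(a: int, b: int, bits: int = 16) -> int:
    """Count how many leading bits two codes share, out of 2 * bits."""
    return 2 * bits - (a ^ b).bit_length()
```

Two codes that share a long prefix fall in the same coarse grid cell, so grouping records by code prefix is one way to narrow the candidate pairs for a spatial join. The match is approximate: two nearby points that straddle a cell boundary can share very few leading bits, so an exact geometric test is still needed on the candidates.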
Ram Sriharsha is the product manager for Apache Spark at Databricks and an Apache Spark committer and PMC member. Previously, Ram was architect of Spark and data science at Hortonworks and principal research scientist at Yahoo Labs, where he worked on scalable machine learning and data science. He holds a PhD in theoretical physics from the University of Maryland and a BTech in electronics from the Indian Institute of Technology, Madras.
Comments on this page are now closed.
View a complete list of Strata Data Conference contacts
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
Comments
Thanks for the excellent talk! I learned a lot and I’m looking forward to getting a version of Magellan with the Python API so I can play with it on our stack.
I was wondering if you had any good references for other implementations of Z-order curves? We have some non-Spark pieces of the stack that may also benefit from approximate matches with Z-order curves.