
Making Big Data Portable

Soam Acharya (Altiscale), Charles Wimmer (Altiscale), David Chaiken (Altiscale)
Hadoop and Beyond
GA Ballroom J
Average rating: 2.33 (6 ratings)

The growing popularity of Hadoop has led to an increasing number of clusters worldwide, often several within the same organization. However, in order to leverage this computing capability, the clusters must first be primed with data. Frequently, this entails uploading existing client repositories into a remote cluster. Such a move can be challenging for the following reasons:

  • size: the data to be transferred can be very large. Enterprises typically do not consider adopting Big Data technologies until their current system is actively struggling to handle the existing volume. By that point, their data has usually grown to significant levels and, consequently, is much more difficult to manage.
  • networks: if the target cluster is remote, one option is to move data over wide area networks. This presents hurdles in terms of limited throughput and bandwidth, as well as security. Transferring large volumes of data this way can be very time consuming. A special case arises when the source and destination clusters are within the same data center but belong to different organizations; this scenario requires a different set of specialized skills to set up a network architecture that allows data to flow.
  • lack of domain knowledge & tools: the various approaches for bulk data uploads to a Hadoop cluster are not widely understood. In addition, common data transfer tools such as scp, ftp, and rsync do not interface directly with HDFS, and HDFS-aware alternatives are scarce. While there are tools to facilitate cluster-to-cluster copies, using them across organizations and multiple Hadoop versions is challenging.
  • security: data is particularly vulnerable in transit. Safely transporting high-volume data across organizational boundaries and networks demands a thorough understanding of security protocols and practices.

In this talk, we present a number of techniques and best practices for uploading large quantities of data to a remote Hadoop cluster. Our presentation is based on real-world experience transferring large amounts of data on behalf of various clients. Topics covered will include:

- DistCp: a tool bundled with Hadoop for cluster-to-cluster copying. DistCp has many possible configuration settings, and Hadoop 2.0 adds more. We will provide parameters for robust DistCp performance and sketch possible pitfalls.
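
To give a flavor of what such tuning looks like, here is an illustrative DistCp invocation. The hostnames, paths, and numbers are placeholders, not the talk's recommended settings:

```shell
# Illustrative DistCp invocation -- hostnames, paths, and tuning values
# are placeholders, not recommendations from the talk.
#   -m 50              number of parallel map tasks
#   -bandwidth 40      per-map bandwidth cap (MB/s), to avoid saturating the WAN
#   -strategy dynamic  Hadoop 2 scheduling that rebalances work away from slow maps
#   -update            copy only missing or changed files, so the job is resumable
# Reading the source via webhdfs:// lets DistCp copy between different
# Hadoop versions, since the REST protocol is version-agnostic.
hadoop distcp -m 50 -bandwidth 40 -strategy dynamic -update \
  webhdfs://source-nn:50070/data hdfs://dest-nn:8020/data
```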

- WebHDFS and HttpFS: WebHDFS is a newer REST protocol that securely exposes HDFS functionality, and HttpFS is a separate server implementation of the same API. We will show how to use both to safely and rapidly transfer bulk data to a remote cluster.
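
As a rough sketch of what the protocol looks like on the wire: a WebHDFS file upload is a two-step REST exchange. The helper function and hostnames below are illustrative, but the URL layout follows the WebHDFS specification:

```python
# Sketch of the WebHDFS REST URL layout. The helper function and host
# names are illustrative; the /webhdfs/v1 prefix and op= query parameter
# follow the WebHDFS specification.
from urllib.parse import urlencode

WEBHDFS_PREFIX = "/webhdfs/v1"

def webhdfs_url(host, port, hdfs_path, op, **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}{WEBHDFS_PREFIX}{hdfs_path}?{query}"

# Step 1: an HTTP PUT to this NameNode URL returns a 307 redirect naming
# the DataNode to stream the file contents to (step 2). HttpFS accepts
# the same URLs but proxies the data itself, so only one host is exposed.
url = webhdfs_url("namenode.example.com", 50070, "/data/logs.gz",
                  "CREATE", overwrite="true")
```

The same URL scheme works against an HttpFS gateway (conventionally on port 14000), which matters when firewall rules prevent clients from reaching every DataNode directly.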

- Leveraging AWS infrastructure: we will illustrate how to take advantage of AWS infrastructure such as S3 and availability zone connectivity for moving data across geographical regions. As part of our discussion, we will present s3copy, an open source tool we have developed that facilitates copying from one S3 bucket to another.
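
The abstract does not describe s3copy's internals, but bucket-to-bucket copies of large objects generally rest on S3's multipart copy API, which copies an object in parallel byte ranges. The sketch below plans those ranges; the part size is an assumption (S3 requires parts of at least 5 MB, except the last):

```python
# Rough sketch of planning byte ranges for an S3 multipart bucket-to-bucket
# copy (UploadPartCopy with a CopySourceRange header per part). The part
# size is an illustrative assumption; this is background for the idea
# behind a tool like s3copy, not its actual implementation.

PART_SIZE = 64 * 1024 * 1024  # 64 MB parts (tunable; S3 minimum is 5 MB)

def part_ranges(object_size, part_size=PART_SIZE):
    """Return inclusive (first_byte, last_byte) ranges covering the object."""
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# Each range becomes one UploadPartCopy call, so many parts of a large
# object can be copied in parallel and retried independently.
```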

- Compression and Encryption: we will explore the role of preprocessing the data prior to transfer in order to reduce its size and increase security.
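
A minimal illustration of why preprocessing pays off, using a synthetic log-like payload (the sample data and ratio are illustrative, not measurements from the talk):

```python
# Minimal illustration of the size reduction from compressing before
# transfer; the payload is synthetic. In practice, encryption (e.g. AES
# via OpenSSL or a crypto library) is applied AFTER compression, because
# encrypted bytes look random and no longer compress well.
import gzip

payload = b"2013-10-28 GET /index.html 200\n" * 10_000  # repetitive log-like data
compressed = gzip.compress(payload)

ratio = len(compressed) / len(payload)  # well under 1.0 for text-like data
```

Order matters: compress first, then encrypt; reversing the steps forfeits nearly all of the compression benefit.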

- Network cross connects: a special case in which the source and destination clusters are housed in the same data center but belong to different organizations. We will show the architecture necessary to achieve high-bandwidth connectivity in this situation.

- Storage Appliance: sometimes, simply shipping a storage appliance is the speediest approach. We will offer practical guidance on selecting the right configuration and workflow in this eventuality.
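
A back-of-envelope calculation shows when shipping wins; the link speed and data size below are assumptions for illustration, not figures from the talk:

```python
# Back-of-envelope comparison of WAN transfer time versus shipping a
# storage appliance. The data size and usable link speed are illustrative
# assumptions, not figures from the talk.

def wan_transfer_days(terabytes, usable_mbits_per_sec):
    """Days needed to move `terabytes` over a link with the given usable throughput."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (usable_mbits_per_sec * 1e6)
    return seconds / 86_400

# 100 TB over a 200 Mb/s usable WAN link takes over a month of sustained
# transfer -- versus a few days to ship and bulk-load an appliance.
days_over_wan = wan_transfer_days(100, 200)
```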

We will conclude with a set of commonly encountered transfer scenarios and our recommendations for the right combination of technologies listed above to best attack each case.


Soam Acharya

Head of Application Architecture, Altiscale

Soam Acharya is Head of Application Architecture at Altiscale, Inc., a company focused on building the world’s best Hadoop clusters. Previously, he was Chief Scientist and Architect of the Limelight Video Platform, where he focused on video analytics, hybrid cloud architectures, and big data issues. Prior to its acquisition by Limelight, he led Delve Networks’ initiatives in semantic video, video search, analytics, and cloud computing on AWS. Soam has also worked on web and enterprise search at Inktomi, LookSmart, and Yahoo.

Charles Wimmer

Head of Operations, Altiscale

Prior to joining Altiscale, Charles was a Site Reliability Engineer on LinkedIn’s web operations team, managing their Apache Traffic Server and HAProxy deployments. Before LinkedIn, he was a Principal Service Engineer at Yahoo!, where he ran tens of thousands of nodes in some of the world’s largest Hadoop clusters. He has experience in all aspects of large-scale technical operations, from network to storage to distributed systems. Charles has spoken at Hadoop Summits and Big Data Camps on the subject of Hadoop operations.

David Chaiken

Member of the technical staff, Altiscale

David Chaiken is a member of the technical staff at Altiscale, a start-up in Palo Alto that runs Hadoop as a service for other companies.

Before Altiscale, David served as the Chief Architect of Yahoo!, where he led teams building consumer advertising and media systems with Hadoop at their core. Over his career, David has also built voice search products for consumers, mobile enterprise applications, network management systems, project management software, a large-scale multiprocessor architecture, a tablet computer and several other information appliances.

David is an experienced speaker at industry conferences including keynoting the Saturn Conference and presenting at the ACM Chennai Chapter Lectures.

David’s favorite technologies include the RSA encryption algorithm, the C programming language, the ARM instruction set architecture, the CentOS distribution of Linux, and the build-on-grid-push-to-serving design pattern. In 1994, David earned a Ph.D. in electrical engineering and computer science from MIT.