Circa 2009, LinkedIn had a traditional reporting and data warehouse ecosystem. We had a relational operational data store (ODS) with nightly refreshes and an Informatica-based ETL pipeline feeding reports built on MicroStrategy. We also had a fragile logging pipeline that carried important impression data, such as views of profiles and ads, and fed into the same ETL pipeline.
By 2011, this stack was struggling to keep up with the growth in data and with the barrage of requirements from a rising number of internal users. We invested in building and adopting new open-source technologies to alleviate these issues. We started working with Hadoop as early as 2009, and by 2011 regular nightly dumps of snapshot and incremental data from our online Oracle databases were landing in Hadoop. We built Kafka to become the central pipeline for all our user activity and logging data, and this data was also piped into Hadoop regularly. Hadoop had become central to building the recommendations and other insights that powered data products like People You May Know and Who Viewed My Profile.
Now that we had liberated a lot of our data and achieved true data democracy, our intrepid analysts realized that some of this data could be used for computing important business metrics as well. Different groups started computing metrics for themselves off this data, which led to short-term happiness but brought a host of new challenges. We had problems with data quality in the pipelines, duplicated but slightly different business logic across the metrics computation scripts, and operational challenges in computing so many different metrics on time every day. All these symptoms had one root cause: we didn’t have a single source of truth for metrics.
Since 2014, we’ve been building a unified reporting platform based on Hadoop to centralize all metrics computation at LinkedIn, while striving to keep the authoring process completely decentralized and the onboarding process as friction-free as possible.
In this session, we’ll discuss the reporting platform, its core tenets and data models, and the infrastructure that powers it, from computation frameworks to visualization tools. We’ll also talk about its organizational impact, from culture change to the new processes that were created to make this work for LinkedIn. Finally, we’ll discuss new frontiers around real-time monitoring and anomaly detection, as well as operational learnings around SLAs and QoS.
Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He’s working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.