At Improve Digital we collect and store large volumes of machine-generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines a traditional RDBMS (MySQL), columnar stores (Infobright, Impala + Parquet) and Hadoop.
We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:
• The architectural need for an approach to data collection and distribution as a first-class capability
• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.
• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro for schema evolution, which is essential for handling data whose formats will change over time.
• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.
• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.
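To make the schema evolution point concrete, here is a minimal sketch of the idea behind it: a consumer reading with a newer schema fills fields that older records lack from declared defaults, and ignores fields it no longer knows about. This is plain Python that only mimics the record-resolution rule, not the Avro library itself; the `AdEvent` schema, its field names and the `resolve` helper are hypothetical and do not come from our pipeline.

```python
# Hypothetical "version 2" reader schema, written as an Avro-style record
# definition. The "country" field was added after older data was produced,
# so it carries a default value.
READER_SCHEMA_V2 = {
    "type": "record",
    "name": "AdEvent",
    "fields": [
        {"name": "ad_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}


def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader schema.

    Simplified illustration of record resolution:
    - fields present in the record are copied across,
    - fields missing from the record take the reader's default,
    - fields the reader schema does not list are dropped,
    - a missing field with no default is an error.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


# A record produced before "country" existed, still carrying a field
# ("legacy_flag") that the newer schema has since removed.
old_event = {"ad_id": "a-1", "timestamp": 1420070400, "legacy_flag": True}
new_event = resolve(old_event, READER_SCHEMA_V2)
```

In this sketch, `new_event` gains `country` from the default and loses `legacy_flag`, so a consumer on the newer schema can keep reading older messages without a coordinated upgrade of every producer.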
This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.
Garry Turkington joined Improve Digital as VP Data Engineering in 2012 and is now the company CTO. One of his current focuses is building out the company’s ability to derive more value from its substantial data asset. Prior to Improve Digital he was a Software Development Manager at Amazon, where he led teams responsible for systems that process the data in the Amazon retail catalog. Before Amazon he spent over a decade in various government roles with a focus on large-scale distributed systems.
He has PhD and BSc degrees in Computer Science from Queen's University Belfast in Northern Ireland and an MEng in Systems Engineering from Stevens Institute of Technology in Hoboken, New Jersey, USA.
Gabriele Modena is a Data Scientist at Improve Digital.
In his current position he uses Hadoop to manage, process and analyze behavioural and machine-generated data. Prior to joining Improve Digital he held a number of positions in academia and industry, where he researched and applied machine learning techniques to areas such as Natural Language Processing, Information Retrieval and Recommendation Systems.
He holds a BSc in Computer Science from the University of Trento in Italy and a Research MSc in Artificial Intelligence – Learning Systems from the University of Amsterdam in The Netherlands.