It’s a common story. Software developers are working hard to get a project off the ground. They set up logging to catch errors, which is great, but when they go to do data science with those logs down the road, they find problems. Maybe their logs are missing crucial information, or their database schema lacks unique identifiers that are consistent across data sets. A few days spent up front on a “data audit” could have saved them time, made them piles of money, and helped them gain insight into their customers.
This talk will give you the toolkit you need to collect data properly, years before you bring on a data scientist. You will be able to do your own data audit, even if you don’t know anything about data science. You will learn the three major things to check: is your data complete? Is it correct? And is it connectable?
You’ll also get a concise list of command-line tools for quickly looking through your data and building some intuition for what’s hiding in those CSVs. Be a hero to your future data team.
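As a rough illustration of the three checks named above (complete, correct, connectable), here is a minimal Python sketch. It is not taken from the talk: the sample data, the column names (`user_id`, `email`, `signup_date`, `event_id`, `action`), and the helper functions are all hypothetical, chosen only to show what each check might look like on two small CSV data sets.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample data; column names and values are illustrative assumptions.
USERS_CSV = """user_id,email,signup_date
1,a@example.com,2015-01-03
2,,2015-01-04
3,c@example.com,not-a-date
"""

EVENTS_CSV = """event_id,user_id,action
10,1,login
11,2,purchase
12,9,login
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def incomplete_rows(rows, required):
    """Complete? Return rows in which any required field is empty."""
    return [
        row for row in rows
        if any(not (row.get(field) or "").strip() for field in required)
    ]


def invalid_dates(rows, field, fmt="%Y-%m-%d"):
    """Correct? Return rows whose `field` does not parse as a date."""
    bad = []
    for row in rows:
        try:
            datetime.strptime(row[field], fmt)
        except ValueError:
            bad.append(row)
    return bad


def orphan_keys(child_rows, parent_rows, key):
    """Connectable? Return key values in child_rows with no match in parent_rows."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)


users = read_rows(USERS_CSV)
events = read_rows(EVENTS_CSV)

missing = incomplete_rows(users, ["user_id", "email", "signup_date"])
bad_dates = invalid_dates(users, "signup_date")
orphans = orphan_keys(events, users, "user_id")

print("Rows with missing fields:", [row["user_id"] for row in missing])
print("Rows with invalid dates:", [row["user_id"] for row in bad_dates])
print("Event user_ids with no matching user:", orphans)
```

In this sketch the date check stands in for "correct" only as one example; a real audit would pick validity rules that fit its own fields.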
Sasha is the founding data scientist and engineer at Polynumeral, a data science consultancy in New York City. She helps clients, including the World Bank, New York Public Radio, DonorsChoose.org, and Warby Parker, solve hard data problems and design their data strategy. Previously, she worked at Twilio and was an early employee at Codecademy. She founded Women Who Code, a global non-profit that connects 16,000 technical women in 14 countries.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.