Pervasive Data Munging Gremlins

Data Science Ballroom AB
Average rating: 2.50 (4 ratings)

Data science may seem like all machine learning hotness, but the reality is less flattering. For many of us, counting the minor annoyances would result in an overflow unless you explicitly cast that variable as double-precision.

• Are those data little-endian or big-endian?
• Were those dates created using Unix time, the Windows version of Excel, or the OS X version?
• Is that governmental PDF machine-readable? No? Good luck.
• Does the EU follow daylight saving time? What about Arizona?
• Is that lat/lng inside that geographic boundary or not?
• This JSON object has nonsense attribute names… which data am I supposed to use?
• Oh, there are non-printing Unicode characters in this file breaking my munging algorithm. I’m glad I only spent 7 hours hunting that bug down.
• These data have a non-explicit ID numbering system based on their ordinal position. Great. But does their system use zero-based or one-based indexing?

You’re all just bits of information, just play together nicely!
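As one illustration of the date question in the list above, here is a minimal Python sketch that interprets the same raw number under three common conventions: seconds since the Unix epoch, Excel's 1900 date system (the Windows default), and Excel's 1904 date system (used by older Mac versions). The function name `candidate_dates` is hypothetical, and the sketch assumes the Excel values are day serials of 61 or more, which sidesteps Excel's phantom 29 February 1900.

```python
from datetime import datetime, timedelta

# Assumption: Excel day serials >= 61. Using 1899-12-30 as the base absorbs
# Excel's fictitious 1900-02-29, so later serials map to the right date.
EXCEL_1900_BASE = datetime(1899, 12, 30)
# In the 1904 date system, serial 0 is 1904-01-01.
EXCEL_1904_BASE = datetime(1904, 1, 1)
UNIX_EPOCH = datetime(1970, 1, 1)


def candidate_dates(value):
    """Return the date a raw number would mean under each convention (hypothetical helper)."""
    return {
        "unix_seconds": UNIX_EPOCH + timedelta(seconds=value),
        "excel_1900": EXCEL_1900_BASE + timedelta(days=value),
        "excel_1904": EXCEL_1904_BASE + timedelta(days=value),
    }


if __name__ == "__main__":
    # Print all three readings so the plausible one can be picked by eye.
    for convention, date in candidate_dates(41000).items():
        print(f"{convention:>12}: {date.isoformat()}")
```

Printing every candidate side by side is a cheap sanity check: usually only one of the three lands in a plausible range for the data set, and the two Excel readings always differ by 1,462 days.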

In this presentation I’ll outline some of the common data munging frustrations that I’ve encountered and offer advice on how to spot and avoid them, covering specific examples from my own work such as those hinted at above.
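As a hedged example of what "spotting" one of these gremlins might look like, the sketch below scans a string for non-printing or look-alike whitespace characters using Python's standard `unicodedata` module. It is not taken from the talk; the function name `find_invisible_chars` and the choice of which Unicode categories to flag are assumptions made for illustration.

```python
import unicodedata

# Assumption: these general categories are the usual suspects.
# Cc = control, Cf = format (zero-width space, BOM), Zl/Zp = line/paragraph separators.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}
# Ordinary whitespace that should not be reported.
ALLOWED = {"\n", "\r", "\t", " "}


def find_invisible_chars(text):
    """Return (index, code point, name) for each suspicious character (hypothetical helper)."""
    hits = []
    for index, char in enumerate(text):
        if char in ALLOWED:
            continue
        category = unicodedata.category(char)
        # Zs covers space look-alikes such as the no-break space.
        if category in SUSPECT_CATEGORIES or category == "Zs":
            name = unicodedata.name(char, "<unnamed>")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits


if __name__ == "__main__":
    sample = "\ufeffid,na\u200bme\u00a0x"
    for hit in find_invisible_chars(sample):
        print(hit)
```

Running a check like this on a file's header row before parsing can turn a seven-hour bug hunt into a one-line report of offending positions.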


Bradley Voytek

UC San Diego

Bradley Voytek, PhD is a UCSF neuroscientist making use of data, brain-computer interfacing, and machine learning to figure out cognition. He is also the Data Evangelist for the San Francisco-based on-demand car service, Uber, Inc. Brad is an avid science teacher, outreach advocate, and world zombie brain expert. He’s spoken at events ranging from elementary schools to Ignite, TEDxBerkeley, @GoogleTalks, and SciFoo. His research and science writing have been featured in The Washington Post, Wired, Forbes, The New York Times, The New Yorker, The Guardian, The Atlantic, and Scientific American. He runs the blog Oscillatory Thoughts (http://blog.ketyov.com), tweets at @bradleyvoytek, and co-created brainSCANr.com with his wife Jessica Bolger Voytek.
