GitHub hosts tens of millions of people collaborating on more than 20 million repositories, an unprecedented treasure trove of data for software engineering researchers, companies, and project teams alike. Researchers are interested in developer behavior and code evolution: branching, collaboration, bug/fix rates, software quality, and distributed software development. Companies want to see how projects (their own and others') are doing and to discover industry trends. Project teams want to understand the health of their projects, the uptake of their offerings, API usage, and more.
Jeff McAffer, Georgios Gousios, and Kevin Lewis explore tools and techniques for sifting through terabytes of content, present key insights they discovered, and explain how you can follow suit. They offer an overview of the architecture of GHTorrent/DataLake, an infrastructure for tracking the activity of all (20 million) public GitHub repos and their thousands (and thousands) of events per hour. (Using this infrastructure, GitHub has analyzed the behavior of Microsoft and other repositories.) They then present real insights in areas ranging from contribution handling with pull requests and issues to API usage, tool adoption, and notions of project health. These insights are applicable to researchers, developers, community members and managers, product teams, and executive sponsors. They conclude by outlining the open source stack you can use to get insights of your own.
Jeff McAffer is the director of open source engineering at Microsoft, where he helps drive the company’s transition to an “open source engagement first” model. Jeff was one of the founders of the Eclipse open source project. He is an active community leader, core contributor, book author, and frequent conference speaker.
Georgios Gousios does research with big data in software engineering. Georgios has published more than 40 papers in his field and coedited Beautiful Architectures (O’Reilly, 2009). His research interests include software engineering, systems software, and programming languages.
Kevin Lewis is a developer at Microsoft specializing in data warehousing and management using SQL Server and Cosmos. He is currently working with GitHub data from GHTorrent.org using Azure Data Lake.
©2016, O'Reilly Media, Inc.