The Web is a treasure trove filled with all kinds of information people care about. However, until recently this dataset was reserved for the “big players”, because only they had the infrastructure and expertise to acquire and work with data at this scale. This changed with the advent of cloud computing, which allows virtually anyone to access and process datasets at Web scale on a pay-as-you-go basis. Furthermore, recent advances in Big Data and Natural Language Processing technologies enable us to unlock raw Web data and gain insights from it.
In this joint presentation we will introduce Common Crawl, an open repository of Web data, and MIA, a cloud-based platform for processing and analyzing Web data.
Common Crawl provides over a petabyte of open data that anyone can access for free. We will detail the types of data in the Common Crawl corpus and explain how you can access the data directly.
MIA is a cloud-based platform for processing and analyzing Web data. End users can describe their analytical task in a structured query language based on SQL. We will describe how the platform gathers relevant text data, extracts information from it, joins, aggregates, and groups the extracted data, and returns the results as database tables. We will provide examples of analyses done on the MIA platform and discuss practical applications of Web data analysis that you can incorporate into your research or products.
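To give a feel for the gather, extract, join, aggregate, and group steps described above, here is a minimal sketch in Python using the standard library's sqlite3 module. It is not MIA's query language or API, which this abstract does not specify. The sample pages, the company lookup table, and the names extract_mentions and run_analysis are all hypothetical and exist only for illustration.

```python
import re
import sqlite3

# Hypothetical sample pages standing in for gathered Web text (not real crawl data).
PAGES = [
    ("http://example.com/a", "Acme opens a new office in Berlin."),
    ("http://example.com/b", "Acme and Globex both hire in Berlin."),
    ("http://example.org/c", "Globex expands to Paris."),
]

# Hypothetical reference table to join the extracted mentions against.
COMPANIES = [("Acme", "manufacturing"), ("Globex", "energy")]


def extract_mentions(pages, names):
    """Toy extraction step: find known company names in each page's text."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return [(url, m.group(1)) for url, text in pages for m in pattern.finditer(text)]


def run_analysis():
    """Load extracted mentions into SQL tables, then join, group, and count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mentions (url TEXT, company TEXT)")
    conn.execute("CREATE TABLE companies (name TEXT, sector TEXT)")
    conn.executemany("INSERT INTO companies VALUES (?, ?)", COMPANIES)
    names = [name for name, _ in COMPANIES]
    conn.executemany(
        "INSERT INTO mentions VALUES (?, ?)", extract_mentions(PAGES, names)
    )
    rows = conn.execute(
        """
        SELECT c.name, c.sector, COUNT(*) AS n_mentions
        FROM mentions AS m
        JOIN companies AS c ON c.name = m.company
        GROUP BY c.name, c.sector
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_analysis():
        print(row)
```

The result comes back as ordinary table rows (company, sector, mention count), which mirrors the idea of returning analysis results as database tables. A real platform would replace the regular expression with proper information extraction and run over far larger inputs.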
Lisa Green is the Director at the Common Crawl Foundation, where she oversees the foundation’s mission of building, maintaining and openly disseminating a comprehensive crawl of the web. Prior to joining Common Crawl, she was the Chief of Staff at Creative Commons. Lisa holds a PhD in physical chemistry from the University of California, Berkeley, lives in San Francisco and is passionate about open systems.
Peter is an R&D Project Manager at Neofonie, a Berlin-based full-service company providing text and data mining software and services. He studied computer science and linguistics at the Humboldt University in Berlin and worked for six years at the German Research Centre for Artificial Intelligence (DFKI). His work focuses on developing and improving information access and knowledge management solutions by applying linguistic analyses at various levels to large amounts of data, in order to arrive at application-specific semantics with minimal human effort.
©2015, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.