Teaching Python for Big Open Data

Sep 26, 2014 • Raymond Yee


This coming Friday, 2014.09.26, 4:30-5:30pm (during the off-week for the Python Worker’s Party), Raymond Yee (former lecturer at the School of Information), along with Lisa Green and Stephen Merity (of CommonCrawl.org), will lead a discussion on teaching Python for big data. We (Stephen, Raymond, and Lisa) have been developing training materials for computing on web crawl data as a vehicle for teaching both web science and techniques for handling large amounts of data.

We’re actively working on the training materials and would love feedback on our work in progress. Some topics we hope to sketch out this Friday are:

  • What is a web crawl, what exactly is in the CommonCrawl data sets, and how is the data (housed in AWS S3) structured?
  • How might Python programmers process this data with mrjob? (Stephen has already developed some materials on this front: cc-mrjob)
  • How might we use a combination of Python multiprocessing and/or IPython Parallel + Docker + AWS + the IPython notebook to do some exploratory data analysis?
  • Use BCE for some computations?
  • Figure out how to work in Apache Spark?
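
To make the multiprocessing topic above a bit more concrete, here is a minimal sketch of the map-and-merge pattern, using only the Python standard library. It is an illustration, not part of the training materials: the function names are made up for this example, and the input is assumed to be plain lists of URL strings rather than actual CommonCrawl records, which would first have to be read from the crawl files (not shown here).

```python
from collections import Counter
from multiprocessing import Pool
from urllib.parse import urlsplit


def count_hosts(urls):
    """Map step: count hostnames in one chunk of URLs (hypothetical helper)."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # lowercased hostname, or None
        if host:  # skip strings that do not parse as absolute URLs
            counts[host] += 1
    return counts


def merge_counts(partials):
    """Reduce step: add up the per-chunk counters."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_hosts_parallel(chunks, processes=2):
    """Run the map step in worker processes, then merge in the parent."""
    with Pool(processes=processes) as pool:
        partials = pool.map(count_hosts, chunks)
    return merge_counts(partials)


# Example call (put it under an `if __name__ == "__main__":` guard so that
# worker processes can import this module safely):
#     chunks = [["http://example.com/a"], ["https://example.org/"]]
#     print(count_hosts_parallel(chunks).most_common(3))
```

The same two steps (count within a chunk, then merge the partial counts) are the shape a mapper and reducer would take in a MapReduce-style job, so a small in-memory version like this can be a useful warm-up before moving to a cluster.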

Everyone is welcome!

Google hangout on air