We live in a world awash in data. As individuals and as a society we need to learn how to better understand, analyze and use it. To that end, Berkeley recently started an exciting initiative to teach every undergraduate the fundamentals of data science. As the first chapter of the open source book for Data 8 (one of the courses) puts it:
"For whatever aspect of the world we wish to study—whether it’s the Earth’s weather, the world’s markets, political polls, or the human mind—data we collect typically offer an incomplete description of the subject at hand. A central challenge of data science is to make reliable conclusions using this partial information."
To this end, students learn statistical inference through randomization and computation, an approach that is driving hundreds of new computationally intensive courses across all disciplines for thousands of students.
To make such a sweeping program effective requires more than lectures and textbooks. For this type of class, students learn best by getting their hands on data and writing code. That once meant multiple class sessions and one-on-one help just to get the required tools installed on students' computers before they could even start. Today we have an easier way, through a collaborative, open computing project developed at Berkeley with other institutions. Project Jupyter lets people in any discipline work in dozens of programming languages (the name Jupyter is a portmanteau of three of the first supported languages: Julia, Python, and R) directly in electronic documents alongside rich media. The result is a notebook that can be used in classes, labs, projects -- anywhere a computing tool for analyzing data might be useful. In classrooms, students point their web browsers at shareable online “Jupyter notebooks” that contain instructional materials. They can explore datasets using code to analyze text, solve equations, and generate visualizations, entering notes on their observations as they go.
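To give a flavor of the kind of exploration described above, here is a sketch of a cell a student might run in a notebook. The dataset is made up for illustration; the shuffling step is a simplified version of the randomization idea taught in the course.

```python
# A notebook-style cell: summarize a small (made-up) dataset of daily
# temperature readings, then try one randomization step.
import random
from statistics import mean, stdev

temps_f = [58, 61, 64, 59, 66, 70, 63]  # one week of readings (illustrative)

avg = mean(temps_f)
spread = stdev(temps_f)
print(f"mean: {avg:.1f} F, sample std dev: {spread:.2f} F")

# Randomization sketch: shuffle the readings and compare two groups,
# asking whether an observed difference could arise by chance alone.
random.seed(0)          # fixed seed so the cell is reproducible
shuffled = temps_f[:]
random.shuffle(shuffled)
diff = mean(shuffled[:3]) - mean(shuffled[3:])
print(f"one shuffled-group difference: {diff:.2f} F")
```

In an actual course notebook the data would be loaded from a real dataset and the shuffle repeated many times to build up a distribution, but the workflow -- code, output, and notes side by side -- is the same.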
Some of the courses have more than a thousand students – with every student requiring their own unique digital notebook.
Beyond classes in the Division of Data Sciences, JupyterHub has evolved into a major scientific computing platform. At the 2017 Internet2 Global Summit conference in Washington DC, Larry Smarr's talk, Toward A National Big Data Superhighway, highlighted the importance of Jupyter as "the digital fabric in which data science is going to be done". Vendor support is also growing: many cloud providers are building Jupyter-as-a-service offerings. Amazon’s machine learning service, SageMaker, is built around Jupyter notebooks, and Azure Notebooks and Google's Cloud Datalab provide them as well. At our first UC Berkeley Cloud Meetup in March, Lindsey Heagy presented a fascinating overview of how she uses Jupyter for geoscience (see video here).
Berkeley provisions groups of many Jupyter notebooks from a platform known as JupyterHub, which can provide access to powerful cloud computing resources. As the number of courses grows, so does the need for multiple JupyterHub instances, most of which run on one of the major cloud providers. What began as an experiment, supported by the academic staff who designed and built Jupyter notebooks and JupyterHub, now calls for more operationally intensive IT support as the academic programs expand.
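As a rough sketch of what provisioning a hub involves: a JupyterHub instance is driven by a Python configuration file, conventionally jupyterhub_config.py. The fragment below is a minimal, hypothetical example, not Berkeley's actual configuration; real course hubs layer on cloud-specific spawners and campus single sign-on.

```python
# jupyterhub_config.py -- minimal sketch of a course hub configuration.
# The `c` object is injected by JupyterHub when it loads this file.
# Values here are illustrative, not a real campus deployment.

# Where the hub's public proxy listens.
c.JupyterHub.bind_url = 'http://:8000'

# Start each student in JupyterLab rather than the classic notebook UI.
c.Spawner.default_url = '/lab'

# Per-user resource hints, so one hub can serve a large class fairly.
c.Spawner.mem_limit = '1G'
c.Spawner.cpu_limit = 0.5
```

Multiplying this by many courses, each with its own hub, user list, and cloud resources, is what turns an experiment into the operationally intensive service described above.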
(Photo credit: UC Berkeley Division of Data Sciences, 2018)
In future posts, we’ll explore how JupyterHub is evolving and how the campus (including IT) is responding to ensure that the JupyterHub deployments stay up and running when students and instructors need them. In the meantime, for anyone who wants a sense of where UC Berkeley is today with Project Jupyter, I highly recommend this inspiring and informative talk from IPython creator and Project Jupyter co-founder Fernando Pérez, delivered last month at TEDxBerkeley: