11: Pandas for data analysis with Python: Part 2#
Learning objectives#
Last week, we learned:
Pandas is a library in Python that is designed for data manipulation and analysis
How to use libraries (import them, access their functions and data structures with
library.function_name()
)About the
dataframe
data structure: basically a smart spreadsheet, with rows of observations, and columns of variables/data for each observation - sort of a cross between a list (sortable, indexable) and a dictionary (quickly access data by key)Some basic operations: constructing a dataframe, summarizing, subsetting, reshaping
This week, we’ll most dig into more advanced operations for reshaping/modifying your dataframe:
Use
.apply()
to apply functions to one or more columns to generate new columnsUse
.groupby()
to split your data into subgroups, apply some function to their data, then combine them into a new dataframe for further analysis (the “split-apply-combine” pattern that is fundamental to data analysis with pandas)Use some basic plotting functions to explore your data
These roughly correspond to Qs 6-8 in your PCEs.
If we have time, we’ll learn a bit more about summarization:
Use
.value_counts()
to summarize categorical dataHow to plot data
Creating/modifying data columns based on one or more columns using .apply()
#
More advanced operations on dataframes involve modifying or creating new columns!
In data analysis, we often want to do things to data in our columns for data preparation/cleaning. Sometimes there is missing data we want to recode, or we want to redescribe data or reclassify it for our analysis. We can do this with a combination of functions and the apply()
method.
In this way, again making a connection back to lists, .apply()
is a little like the map()
function that we used for lists to transform items from one list to another list with equal length (e.g., convert scores to letter grades).
.apply()
with a single column#
The simpler version of .apply()
only takes input from one column.
To illustrate let’s do some operations on the dataset INST courses.csv
.
# import the pandas library
import pandas as pd
# read in the dataset
fpath = 'INST courses.csv'
courses = pd.read_csv(fpath) # read in the file into a dataframe called courses
courses # use the .head() function to show the top 5 rows in the dataframe
/Users/joelchan/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.4' or newer of 'numexpr' (version '2.7.3' currently installed).
from pandas.core.computation.check import NUMEXPR_INSTALLED
/Users/joelchan/opt/anaconda3/lib/python3.9/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.2' currently installed).
from pandas.core import (
Code | Title | Description | Prereqs | Credits | |
---|---|---|---|---|---|
0 | INST126 | Introduction to Programming for Information Sc... | An introduction to computer programming for st... | Minimum grade of C- in MATH115; or must have m... | 3.0 |
1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | NaN | 3.0 |
2 | INST311 | Information Organization | Examines the theories, concepts, and principle... | Must have completed or be concurrently enrolle... | 3.0 |
3 | INST314 | Statistics for Information Science | Basic concepts in statistics including measure... | Must have completed or be concurrently enrolle... | 3.0 |
4 | INST326 | Object-Oriented Programming for Information Sc... | An introduction to programming, emphasizing un... | 1 course with a minimum grade of C- from (INST... | 3.0 |
5 | INST327 | Database Design and Modeling | Introduction to databases, the relational mode... | 1 course with a minimum grade of C- from (CMSC... | 3.0 |
6 | INST335 | Teams and Organizations | Team development and the principles, methods a... | 1 course with a minimum grade of C- from (INST... | 3.0 |
7 | INST346 | Technologies Infrastructure and Architecture | Examines the basic concepts of local and wide-... | 1 course with a minimum grade of C- from (INST... | 3.0 |
8 | INST352 | Information User Needs and Assessment | Focuses on use of information by individuals, ... | 1 course with a minimum grade of C- from (INST... | 3.0 |
9 | INST354 | Decision-Making for Information Science | Examines the use of information in organizatio... | INST314. | 3.0 |
10 | INST362 | User-Centered Design | Introduction to human-computer interaction (HC... | 1 course with a minimum grade of C- from (INST... | 3.0 |
11 | INST377 | Dynamic Web Applications | An exploration of the basic methods and tools ... | INST327. | 3.0 |
12 | INST408Y | Special Topics in Information Science; Privacy... | NaN | NaN | |
13 | INST408Z | Special Topics in Information Science; The Apo... | NaN | NaN | |
14 | INST414 | Data Science Techniques | An exploration of how to extract insights from... | INST314. | 3.0 |
15 | INST447 | Data Sources and Manipulation | Examines approaches to locating, acquiring, ma... | INST326 or CMSC131; and INST327. | 3.0 |
16 | INST462 | Introduction to Data Visualization | Exploration of the theories, methods, and tech... | INST314. | 3.0 |
17 | INST466 | Technology, Culture, and Society | Individual, cultural, and societal outcomes as... | INST201. | 3.0 |
18 | INST490 | Integrated Capstone for Information Science | The capstone provides a platform for Informati... | Minimum grade of C- in INST314, INST335, INST3... | 3.0 |
19 | INST604 | Introduction to Archives and Digital Curation | Overview of the principles, practices, and app... | NaN | 3.0 |
20 | INST612 | Information Policy | Nature, structure, development and application... | NaN | 3.0 |
21 | INST614 | Literacy and Inclusion | The educational and psychological dimensions o... | NaN | 3.0 |
22 | INST616 | Open Source Intelligence | An introduction to Open Source Intelligence (O... | NaN | 3.0 |
23 | INST622 | Information and Universal Usability | Information services and technologies to provi... | NaN | 3.0 |
24 | INST627 | Data Analytics for Information Professionals | Skills and knowledge needed to craft datasets,... | NaN | 3.0 |
25 | INST630 | Introduction to Programming for the Informatio... | An introduction to computer programming intend... | NaN | 3.0 |
26 | INST652 | Design Thinking and Youth | Methods of design thinking specifically within... | NaN | 3.0 |
27 | INST702 | Advanced Usability Testing | Usability testing methods -- how to design and... | Permission of instructor; or (INFM605 or INST6... | 3.0 |
28 | INST709 | Independent Study | NaN | NaN | |
29 | INST728G | Special Topics in Information Studies; Smart C... | NaN | NaN | |
30 | INST728V | Special Topics in Information Studies; Digital... | NaN | NaN | |
31 | INST733 | Database Design | Principles of user-oriented database design. ... | LBSC690, LBSC671, or INFM603; or permission of... | 3.0 |
32 | INST737 | Introduction to Data Science | An exploration of some of the best and most ge... | INST627; and (LBSC690, LBSC671, or INFM603). O... | 3.0 |
33 | INST741 | Social Computing Technologies and Applications | Tools and techniques for developing and config... | INFM603 and INFM605; or (LBSC602 and LBSC671);... | 3.0 |
34 | INST742 | Implementing Digital Curation | Management of and technology for application o... | INST604; or permission of instructor. | 3.0 |
35 | INST746 | Digitization of Legacy Holdings | Through hands on exercises and real-world proj... | INST604. | 3.0 |
36 | INST762 | Visual Analytics | Visual analytics is the use of interactive vis... | INFM603 or INST630; or permission of instructor. | 3.0 |
37 | INST767 | Big Data Infrastructure | Principles and techniques of data science and ... | INST737; or permission of instructor. | 3.0 |
38 | INST776 | HCIM CAPSTONE PROJECT | The opportunity to apply the skills learned th... | INST775; or permission of instructor. | 3.0 |
39 | INST785 | Documentation, Collection, and Appraisal of Re... | Development of documentation strategies and pl... | INST604; or permission of instructor. | 3.0 |
40 | INST794 | Capstone in Youth Experience | Through a supervised project, to synthesize de... | INST650, INST651, and INST652; or permission o... | 3.0 |
Let’s say we want to have a prereqs column that is sortable, maybe 0 = No prereqs, and 1 = has prereqs
Step 1: Define the function you want to apply#
# Step 1: define the function you want to apply
def has_prereq(prereq_descr):
# assume we get a string prereq description
if pd.isnull(prereq_descr):
return 0
elif "None" in prereq_descr:
return 0
else:
return 1
# test the function
prereq = "BMGT301; or instructor permission" # this should yield 1
prereq2 = "None" # this should yield 0
print(has_prereq(prereq))
print(has_prereq(prereq2))
1
0
Step 2: Apply the function to a column and save the result in a (new) column#
# Step 2: apply it to a column and save the result in the (new) `had_prereqs` column
courses['has_prereqs'] = courses['Prereqs'].apply(has_prereq) # apply the has_prereq() function to every row in the prereqs column in the courses data frame
courses.head(10)
Code | Title | Description | Prereqs | Credits | has_prereqs | |
---|---|---|---|---|---|---|
0 | INST126 | Introduction to Programming for Information Sc... | An introduction to computer programming for st... | Minimum grade of C- in MATH115; or must have m... | 3.0 | 1 |
1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | NaN | 3.0 | 0 |
2 | INST311 | Information Organization | Examines the theories, concepts, and principle... | Must have completed or be concurrently enrolle... | 3.0 | 1 |
3 | INST314 | Statistics for Information Science | Basic concepts in statistics including measure... | Must have completed or be concurrently enrolle... | 3.0 | 1 |
4 | INST326 | Object-Oriented Programming for Information Sc... | An introduction to programming, emphasizing un... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 |
5 | INST327 | Database Design and Modeling | Introduction to databases, the relational mode... | 1 course with a minimum grade of C- from (CMSC... | 3.0 | 1 |
6 | INST335 | Teams and Organizations | Team development and the principles, methods a... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 |
7 | INST346 | Technologies Infrastructure and Architecture | Examines the basic concepts of local and wide-... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 |
8 | INST352 | Information User Needs and Assessment | Focuses on use of information by individuals, ... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 |
9 | INST354 | Decision-Making for Information Science | Examines the use of information in organizatio... | INST314. | 3.0 | 1 |
Another example: let’s say we want ot know if a course is an introductory course. How might we do this?
# first define a function to check if the course is an intro course
def is_intro(title):
if "introduction" in title.lower():
return 1
else:
return 0
# then apply it to the courses column and save the result in the (new) `is_intro` column
courses['is_intro'] = courses['Title'].apply(is_intro)
courses.head(10)
Code | Title | Description | Prereqs | Credits | has_prereqs | is_intro | |
---|---|---|---|---|---|---|---|
0 | INST126 | Introduction to Programming for Information Sc... | An introduction to computer programming for st... | Minimum grade of C- in MATH115; or must have m... | 3.0 | 1 | 1 |
1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | NaN | 3.0 | 0 | 1 |
2 | INST311 | Information Organization | Examines the theories, concepts, and principle... | Must have completed or be concurrently enrolle... | 3.0 | 1 | 0 |
3 | INST314 | Statistics for Information Science | Basic concepts in statistics including measure... | Must have completed or be concurrently enrolle... | 3.0 | 1 | 0 |
4 | INST326 | Object-Oriented Programming for Information Sc... | An introduction to programming, emphasizing un... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 |
5 | INST327 | Database Design and Modeling | Introduction to databases, the relational mode... | 1 course with a minimum grade of C- from (CMSC... | 3.0 | 1 | 0 |
6 | INST335 | Teams and Organizations | Team development and the principles, methods a... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 |
7 | INST346 | Technologies Infrastructure and Architecture | Examines the basic concepts of local and wide-... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 |
8 | INST352 | Information User Needs and Assessment | Focuses on use of information by individuals, ... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 |
9 | INST354 | Decision-Making for Information Science | Examines the use of information in organizatio... | INST314. | 3.0 | 1 | 0 |
If you’re lazy, you can pass in anonymous functions too, with lambda
: https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7
is_introductory = courses['Title'].apply(lambda title: 1 if "introduction" in title.lower() else 0)
is_introductory.head(10) # show the top 10
0 1
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
Name: Title, dtype: int64
What’s happening under the hood when you .apply()
a function to a column#
Pandas is iterating through every row in that column, and applying the function to the value in that row.
# let's show this for the first 10 courses
for prereq in courses['Prereqs'].head(10):
print(f"Result of applying has_prereq() to {prereq}: {has_prereq(prereq)}")
Result of applying has_prereq() to Minimum grade of C- in MATH115; or must have math eligibility of MATH140 or higher; or permission of instructor.: 1
Result of applying has_prereq() to nan: 0
Result of applying has_prereq() to Must have completed or be concurrently enrolled in INST201; or INST301.: 1
Result of applying has_prereq() to Must have completed or be concurrently enrolled in INST201; or must have completed or be concurrently enrolled in INST301. And minimum grade of C- in INST201 and INST301; and MATH115; and STAT100; and minimum grade of C- in MATH115 and STAT100.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST126, CMSC106); and must have completed or be concurrently enrolled in INST201 or INST301. And minimum grade of C- in INST201; or minimum grade of C- in INST301.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (CMSC106, CMSC122, INST126); and must have completed or be concurrently enrolled in INST201 or INST301; and minimum grade of C- in INST201 and INST301.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and minimum grade of C- in PSYC100.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and 1 course with a minimum grade of C- from (INST326, CMSC131); and minimum grade of C- in INST327.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and minimum grade of C- in INST311.: 1
Result of applying has_prereq() to INST314.: 1
The .apply()
function returns a pandas Series that is the same length as the input column (which is also a Series), with a corresponding value for each input.
print(f"the Prereqs column has {len(courses['Prereqs'])} rows")
print(f"the Series created by applying `has_prereq()` to the Prereqs column has {len(courses['Prereqs'].apply(has_prereq))} rows")
the Prereqs column has 41 rows
the Series created by applying `has_prereq()` to the Prereqs column has 41 rows
To save the results of the apply()
for later analysis, we then need to assign it to a column, new or existing.
Remember, pandas prefers immutability in general (return a new object instead of modifying the object), and sometimes enforces it. With .apply()
, it’s enforced: you can’t directly modify the column, you have to assign the returned Series to a column if you want it to persist.
Like with other assignment statements, just running the .apply()
and assigning its return value to a column will not yield output. You’ll need to print out the dataframe to check the results.
PRACTICE: Let’s say I want to know how many courses we have in each area. We don’t have that data in the dataset; at least not explicitly. Fortunately we can make it with some simple programming that you already know how to do! The problem here is, given a code (i.e., data from one column), how do we “extract” the area?
# Step 1: define the function
def extract_area(code):
# heuristic: just grab the first four characters
return
c = "CMSC250"
extract_area(c)
# Step 2: apply the function
courses['area'] = courses['code'].apply(extract_area)
courses.head(10)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
index.pyx in pandas._libs.index.IndexEngine.get_loc()
index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'code'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/var/folders/xz/_hjc5hsx743dclmg8n5678nc0000gn/T/ipykernel_46864/1864781452.py in <module>
1 # Step 2: apply the function
----> 2 courses['area'] = courses['code'].apply(extract_area)
3 courses.head(10)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
KeyError: 'code'
PRACTICE: With the wunderground.csv
dataset, how can we extract the year/month/day from the date column?
PRACTICE: With the BreadBasket_DMS.csv
dataset, how can we extract the hour for each transaction from the Time column?
# the bread dataset
bread = pd.read_csv("data/BreadBasket_DMS.csv")
def extract_hour(time):
return
bread['Hour'] = bread['Time'].apply(extract_hour)
bread.sort_values(by="Hour")
.apply()
with data from multiple columns#
What if you want to have a way to filter the courses in terms of “easy entry points” (i.e., both introductory and has no prerequisites)? That might also be interesting to analyze by area to see how many departments offer these easy entry points into the department for students from other departments.
Core thing we need to know here is that our .apply()
will now apply a function that has a row as input, not an element of a single column. That way, we can access data from any column in the row: in this case, data from the “is_intro” and “has_prereq” columns.
There are two key differences between this use of .apply()
and the single-column case:
First, we do
.apply()
with the whole dataframe, not from a single columnWe specify an argument for the
axis
parameter to tell it to use rows as inputs. We need to pass the argument1
to the axis parameter when we call.apply()
so it knows to pass a row into the function, not just a single column element. See here for more details: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
# is_entry_point function
def is_entry_point(row):
# if the value of the "is intro" column for this row is 1
# AND the value of hte "has_prereq" column for this row is 0
# return 1
if row['is_intro'] == 1 and row['has_prereqs'] == 0:
return 1
else:
return 0
# this should yield 1
test_row = {
'is_intro': 1,
'has_prereqs': 0
}
# this should yield 0
test_row2 = {
'is_intro': 1,
'has_prereqs': 1
}
print(is_entry_point(test_row))
print(is_entry_point(test_row2))
1
0
# Step 2 apply the function and save the result
courses['is_entrypoint'] = courses.apply(is_entry_point, axis=1) # need to specify axis=1 to apply it to every row
# courses['classlevel'] = courses['classcode'].apply(level)
courses.head()
# compare to .apply() with a single column
# courses['is_intro'] = courses['title'].apply(is_intro)
# key differences:
# - here for multiple columns, we start with the whole dataframe, instead of a specific column
# - and we pass the argument 1 to the axis parameter instead of letting it use the default 0 value
Code | Title | Description | Prereqs | Credits | has_prereqs | is_intro | is_entrypoint | |
---|---|---|---|---|---|---|---|---|
0 | INST126 | Introduction to Programming for Information Sc... | An introduction to computer programming for st... | Minimum grade of C- in MATH115; or must have m... | 3.0 | 1 | 1 | 0 |
1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | None | 3.0 | 0 | 1 | 1 |
2 | INST311 | Information Organization | Examines the theories, concepts, and principle... | Must have completed or be concurrently enrolle... | 3.0 | 1 | 0 | 0 |
3 | INST314 | Statistics for Information Science | Basic concepts in statistics including measure... | Must have completed or be concurrently enrolle... | 3.0 | 1 | 0 | 0 |
4 | INST326 | Object-Oriented Programming for Information Sc... | An introduction to programming, emphasizing un... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 | 0 |
# show me all the courses that are intro and have no prereqs
courses[courses['is_entrypoint'] == 1]
# if we have a list, we can do indexing like this to get the first 4 elements, say: courses[:4]
# if we have a dictionary, we can retrieve base don key, like this courses['hello']
Code | Title | Description | Prereqs | Credits | has_prereqs | is_intro | is_entrypoint | |
---|---|---|---|---|---|---|---|---|
1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | None | 3.0 | 0 | 1 | 1 |
19 | INST604 | Introduction to Archives and Digital Curation | Overview of the principles, practices, and app... | None | 3.0 | 0 | 1 | 1 |
25 | INST630 | Introduction to Programming for the Informatio... | An introduction to computer programming intend... | None | 3.0 | 0 | 1 | 1 |
More examples?
# for ncaa: make a column that is 1 if you had a winning season AND reached the round of 32
ncaa = pd.read_csv("data/ncaa-team-data-cleanCoachNames.csv")
ncaa.head()
def underdog_season(row):
if row['w'] < 21 and row['ncaa_result'] == "Won National Final":
return 1
else:
return 0
ncaa['underdog'] = ncaa.apply(underdog_season, axis=1)
ncaa[ncaa['underdog']==1]
Unnamed: 0 | school | conf | rk | w | l | srs | sos | pts_for | pts_vs | ... | ap_pre | ap_high | ap_final | pts_diff | ncaa_result | ncaa_numeric | season | year | coach | underdog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8229 | 8229 | indiana | Big Ten | 78 | 20 | 3 | NaN | NaN | NaN | NaN | ... | 30 | 30 | 30 | NaN | Won National Final | 48 | 1939-40 | 1939 | Branch McCracken | 1 |
23515 | 23515 | wisconsin | Big Ten | 77 | 20 | 3 | NaN | NaN | NaN | NaN | ... | 30 | 30 | 30 | NaN | Won National Final | 48 | 1940-41 | 1940 | Bud Foster | 1 |
2 rows × 21 columns
set(ncaa['ncaa_result'])
{'Lost First Four',
'Lost First Round',
'Lost National Final',
'Lost National Semifinal',
'Lost Opening Round',
'Lost Regional Final',
'Lost Regional Final (Final Four)',
'Lost Regional Semifinal',
'Lost Second Round',
'Lost Third Round',
'Playing First Four',
'Playing First Round',
'Won National Final',
nan}
# for ncaa: create a column taht is 1 if the pt differential is non-negative BUT the team didn't make the playoffs (numeric >= 32 or 0)
# apply the get_month function to every value in the Date column in bread
# and store the resulting series in the Month column
# convert the ncaa_result column to a numerical ranking of season outcome
Analyze subgroups of your data with the split-apply-combine pattern#
Going more deeply on the path of “reshaping”, we often want to compute data based on subsets of the data, grouped by some column.
For example, we might want to see how many departments offer easy entry points.
We can do this with the “split-apply-combine” pattern, which is implemented in the .groupby()
function.
Basically, it goes like this:
Split the data into subgroups (e.g., split courses into department subgroups)
Apply some computation on each subgroup (e.g., find number of easy entry points for each department subgroup)
Combine subgroup-computation information into an overall new dataframe that has subgroups as entries
Here’s a picture with a simpler datset to give an intuition:
for area, areaData in courses.groupby('area'):
print(area)
print(areaData)
Split with .groupby()
#
The first step is to split the dataset into subgroups.
The manual way
# get all the unique area values
course_areas = set(courses['area'].values)
# iterate through each unique area value
for area in course_areas:
print(area)
# get the subset of the course data that is associated with this area
area_df = courses[courses['area'] == area]
print(area_df.head())
# summarize the credits for this subset of hte dataframe
print(area_df['credits'].mean())
print("\n")
PLCY
code title \
319 PLCY101 Great Thinkers on Public Policy
320 PLCY201 Public Leaders and Active Citizens
321 PLCY215 Innovation and Social Change: Creating Change...
322 PLCY301 Sustainability
323 PLCY302 Examining Pluralism in Public Policy
description prereqs credits \
319 Great ideas in public policy, such as equalit... None 3
320 Aims to inspire, teach and engage students in... None 3
321 A team-based, highly interactive and dynamic ... None 3
322 Designed for students whose academic majors w... None 3
323 Understanding pluralism and how groups and in... None 3
prereq_type area has_prereqs is_intro is_entrypoint
319 None PLCY 0 0 0
320 None PLCY 0 0 0
321 None PLCY 0 0 0
322 None PLCY 0 0 0
323 None PLCY 0 0 0
3.0
COMM
code title \
108 COMM200 Critical Thinking and Speaking
109 COMM331 News Writing and Reporting for Public Relations
110 COMM332 News Editing for Public Relations
111 COMM351 Public Relations Techniques
112 COMM353 New Media Writing for Public Relations
description \
108 Theory and practice of persuasive discourse a...
109 Writing and researching news and information ...
110 Copy editing, graphic principles and processe...
111 The techniques of public relations, including...
112 Students learn the uses and influence of new ...
prereqs credits prereq_type \
108 None 3 None
109 COMM201; and must have completed or be concur... 3 Flexible
110 Minimum grade of C- in COMM331; or students w... 3 Flexible
111 COMM332. 3 Hard
112 Minimum grade of C- in COMM351. 3 Hard
area has_prereqs is_intro is_entrypoint
108 COMM 0 0 0
109 COMM 1 0 0
110 COMM 1 0 0
111 COMM 1 0 0
112 COMM 1 0 0
2.870967741935484
CMSC
code title \
62 CMSC122 Introduction to Computer Programming via the ...
63 CMSC131 Object-Oriented Programming I
64 CMSC132 Object-Oriented Programming II
65 CMSC133 Object Oriented Programming I Beyond Fundamen...
66 CMSC216 Introduction to Computer Systems
description \
62 Introduction to computer programming in the c...
63 Introduction to programming and computer scie...
64 Introduction to use of computers to solve pro...
65 An introduction to computer science and objec...
66 Introduction to the interaction between user ...
prereqs credits prereq_type \
62 None 3 None
63 None 4 None
64 Minimum grade of C- in CMSC131; or must have ... 4 Flexible
65 None 2 None
66 Minimum grade of C- in CMSC132; and minimum g... 4 Hard
area has_prereqs is_intro is_entrypoint
62 CMSC 0 1 1
63 CMSC 0 0 0
64 CMSC 1 0 0
65 CMSC 0 0 0
66 CMSC 1 1 0
3.0
ENTS
code title \
209 ENTS630 The Economics of International Telecommunicat...
210 ENTS632 Telecommunications Marketing Management
211 ENTS635 Decision Support Methods for Telecommunicatio...
212 ENTS641 Networks and Protocols II
description prereqs credits \
209 Basic microeconomic principles used by teleco... None 3
210 Topics covered include strategic marketing, s... None 3
211 The aim of this course is to introduce manage... None 3
212 Techniques for the specification, design, ana... ENTS640. 3
prereq_type area has_prereqs is_intro is_entrypoint
209 None ENTS 0 0 0
210 None ENTS 0 0 0
211 None ENTS 0 0 0
212 Hard ENTS 1 0 0
3.0
URSP
code title \
407 URSP600 Research Design and Application
408 URSP601 Research Methods
409 URSP604 The Planning Process
410 URSP606 Planning Economics
411 URSP631 Transportation and Land Use
description prereqs credits \
407 Techniques in urban research, policy analysis... None 3
408 Use of measurement, statistics, quantitative ... None 3
409 Legal framework for U.S. planning; approaches... None 3
410 Resource allocation in a market economy, the ... None 3
411 The interrelationship between transportation ... None 3
prereq_type area has_prereqs is_intro is_entrypoint
407 None URSP 0 0 0
408 None URSP 0 0 0
409 None URSP 0 0 0
410 None URSP 0 0 0
411 None URSP 0 0 0
3.0
STAT
code title \
392 STAT100 Elementary Statistics and Probability
393 STAT400 Applied Probability and Statistics I
394 STAT401 Applied Probability and Statistics II
395 STAT410 Introduction to Probability Theory
396 STAT420 Theory and Methods of Statistics
description \
392 Simplest tests of statistical hypotheses; app...
393 Random variables, standard distributions, mom...
394 Point estimation - unbiased and consistent es...
395 Probability and its properties. Random variab...
396 Point estimation, sufficiency, completeness, ...
prereqs credits prereq_type \
392 MATH110, MATH112, MATH113, or MATH115; or per... 3 Flexible
393 1 course with a minimum grade of C- from (MAT... 3 Flexible
394 1 course with a minimum grade of C- from (STA... 3 Hard
395 1 course with a minimum grade of C- from (MAT... 3 Hard
396 1 course with a minimum grade of C- from (SUR... 3 Hard
area has_prereqs is_intro is_entrypoint
392 STAT 1 0 0
393 STAT 1 0 0
394 STAT 1 0 0
395 STAT 1 1 0
396 STAT 1 0 0
3.0
ENSP
code title \
203 ENSP102 Introduction to Environmental Policy
204 ENSP250 Lawns in the Landscape: Environmental Hero or...
205 ENSP305 Applied Quantitative Methods in Environmental...
206 ENSP330 Introduction to Environmental Law
207 ENSP342 Environmental Threats to Oceans and Coasts: T...
description \
203 Second of two courses that introduce students...
204 Examination of the lawn as an element in the ...
205 Intended for students interested in pursuing ...
206 An overview of environmental law, from its co...
207 An interdisciplinary study of the challenges ...
prereqs credits prereq_type \
203 None 3 None
204 None 3 None
205 BIOM301, ECON321, GEOG306, PSYC200, or SOCY20... 3 Flexible
206 None 3 None
207 None 3 None
area has_prereqs is_intro is_entrypoint
203 ENSP 0 1 1
204 ENSP 0 0 0
205 ENSP 1 0 0
206 ENSP 0 1 1
207 ENSP 0 0 0
3.0
AMST
code title \
0 AMST101 Introduction American Studies
1 AMST298C Introduction to Asian American Studies
2 AMST340 Introduction to History, Theories and Methods...
3 AMST418N Asian American Public Policy
4 AMST450 Seminar in American Studies
description \
0 Introduces students to the interdisciplinary ...
1 The aggregate experience of Asian Pacific Ame...
2 Introduction to the process of interdisciplin...
3
4 Developments in theories and methods of Ameri...
prereqs credits prereq_type \
0 None 3 None
1 None 3 None
2 Must have completed AMST201; and 2 courses in... 3 Hard
3 None 3 None
4 AMST201 and AMST340; and 1 course in AMST. 3 Hard
area has_prereqs is_intro is_entrypoint
0 AMST 0 1 1
1 AMST 0 1 1
2 AMST 1 1 0
3 AMST 0 0 0
4 AMST 1 0 0
3.0
MATH
code title \
265 MATH107 Introduction to Math Modeling and Probability
266 MATH113 College Algebra and Trigonometry
267 MATH115 Precalculus
268 MATH120 Elementary Calculus I
269 MATH121 Elementary Calculus II
description \
265 A goal is to convey the power of mathematics ...
266 Topics include elementary functions including...
267 Preparation for MATH120, MATH130 or MATH140. ...
268 Basic ideas of differential and integral calc...
269 Differential and integral calculus, with emph...
prereqs credits prereq_type \
265 Must have math eligibility of Math 107 or hig... 3 Flexible
266 Must have math eligibility of MATH113 or high... 3 Flexible
267 Must have math eligibility of MATH115 or high... 3 Flexible
268 1 course with a minimum grade of C- from (MAT... 3 Flexible
269 MATH120, MATH130, MATH136, or MATH140; or mus... 3 Flexible
area has_prereqs is_intro is_entrypoint
265 MATH 1 1 0
266 MATH 1 0 0
267 MATH 1 0 0
268 MATH 1 0 0
269 MATH 1 0 0
3.061224489795918
INFM
code title \
213 INFM605 Users and Use Context
214 INFM612 Management Concepts and Principles for Inform...
215 INFM620 Introduction to Strategic Information Managem...
216 INFM700 Information Architecture
217 INFM737 Information Management Capstone Experience
description \
213 Use of information by individuals. Nature of ...
214 Key aspects of management - focusing on plann...
215 Strategic management is the comprehensive col...
216 Principles and techniques of information orga...
217 The Information Management Capstone Experienc...
prereqs credits prereq_type \
213 None 3 None
214 None 3 None
215 INFM612; or LBSC631; or permission of instruc... 3 Flexible
216 INFM603; or permission of instructor. 3 Flexible
217 INFM736; and must have earned a minimum of 27... 3 Flexible
area has_prereqs is_intro is_entrypoint
213 INFM 0 0 0
214 INFM 0 0 0
215 INFM 1 1 0
216 INFM 1 0 0
217 INFM 1 0 0
3.0
PHSC
code title \
314 PHSC401 History of Public Health
315 PHSC412 Food, Policy, and Public Health
316 PHSC415 Essentials of Public Health Biology: The Cell...
317 PHSC440 Public Health Nutrition
318 PHSC497 Public Health Science Capstone
description \
314 Emphasis is on the history of public health i...
315 Broad overview of the impact of food and food...
316 Presents the basic scientific and biomedical ...
317 Engages students in conceptual thinking about...
318 The capstone course is the culminating experi...
prereqs credits prereq_type \
314 None 3 None
315 Must have completed HLSA300 with a C- or high... 3 Flexible
316 Minimum grade of C- in BSCI202. 3 Hard
317 A minimum of C- in BSCI170, BSCI171, CHEM131,... 3 Hard
318 Must have completed the professional writing ... 3 Flexible
area has_prereqs is_intro is_entrypoint
314 PHSC 0 0 0
315 PHSC 1 0 0
316 PHSC 1 0 0
317 PHSC 1 0 0
318 PHSC 1 0 0
3.0
BMGT
code title \
9 BMGT190H Introduction to Design and Quality
10 BMGT210 Foundations of Accounting for Non Business Ma...
11 BMGT302 Designing Applications for Business Analytics
12 BMGT310 Intermediate Accounting I
13 BMGT311 Intermediate Accounting II
description \
9 QUEST students learn and apply design practic...
10 Provides an understanding of the common state...
11 Provides an introduction to structured progra...
12 Comprehensive analysis of financial accountin...
13 Continuation of BMGT310.
prereqs credits prereq_type \
9 None 4 None
10 None 3 None
11 BMGT301; or permission of BMGT-Robert H. Smit... 3 Flexible
12 BMGT221. 3 Hard
13 BMGT310. 3 Hard
area has_prereqs is_intro is_entrypoint
9 BMGT 0 1 1
10 BMGT 0 0 0
11 BMGT 1 0 0
12 BMGT 1 0 0
13 BMGT 1 0 0
3.0377358490566038
ECON
code title \
139 ECON200 Principles of Microeconomics
140 ECON201 Principles of Macroeconomics
141 ECON230 Applied Economic Statistics
142 ECON305 Intermediate Macroeconomic Theory and Policy
143 ECON306 Intermediate Microeconomic Theory & Policy
description \
139 Introduces economic models used to analyze ec...
140 An introduction to how market economies behav...
141 Introductory course to develop understanding ...
142 Analysis of the determination of national inc...
143 Analysis of the theories of consumer behavior...
prereqs credits prereq_type \
139 MATH107 or MATH110; or must have math eligibi... 3 Flexible
140 MATH107 or MATH110; or must have math eligibi... 3 Flexible
141 Must have math eligibility of MATH113 or high... 3 Flexible
142 Minimum grade of C- in ECON201 and ECON200. A... 3 Flexible
143 1 course with a minimum grade of C- from (ECO... 3 Flexible
area has_prereqs is_intro is_entrypoint
139 ECON 1 0 0
140 ECON 1 0 0
141 ECON 1 0 0
142 ECON 1 0 0
143 ECON 1 0 0
2.984375
SPHL
code title \
385 SPHL100 Foundations of Public Health
386 SPHL291 Can we move beyond medication? Examining yoga...
387 SPHL333 Fundamentals of Undergraduate Teaching for Ed...
388 SPHL600 Foundations of Public Health
389 SPHL610 Program and Policy Planning, Implementation, ...
description prereqs credits \
385 An overview of the goals, functions, and meth... None 3
386 Does yoga improve the health of wounded warri... None 3
387 Supports the professional and personal develo... None 1
388 An overview of the goals, functions, and meth... None 3
389 This second course in the MPH/MHA integrated ... None 5
prereq_type area has_prereqs is_intro is_entrypoint
385 None SPHL 0 0 0
386 None SPHL 0 0 0
387 None SPHL 0 0 0
388 None SPHL 0 0 0
389 None SPHL 0 0 0
2.4285714285714284
INST
code title \
218 INST126 Introduction to Programming for Information S...
219 INST155 Social Networking
220 INST201 Introduction to Information Science
221 INST311 Information Organization
222 INST314 Statistics for Information Science
description \
218 An introduction to computer programming for s...
219 Introduces methods for analyzing and understa...
220 Examining the effects of new information tech...
221 Examines the theories, concepts, and principl...
222 Basic concepts in statistics including measur...
prereqs credits prereq_type \
218 Minimum grade of C- in MATH115; or must have ... 3 Flexible
219 None 3 None
220 None 3 None
221 None 3 None
222 Minimum grade of C- in STAT100 and MATH115 (o... 3 Flexible
area has_prereqs is_intro is_entrypoint
218 INST 1 1 0
219 INST 0 0 0
220 INST 0 1 1
221 INST 0 0 0
222 INST 1 0 0
3.0
PSYC
code title \
347 PSYC123 The Psychology of Getting Hired
348 PSYC200 Statistical Methods in Psychology
349 PSYC221 Social Psychology
350 PSYC300 Research Methods in Psychology Laboratory
351 PSYC301 Biological Basis of Behavior
description \
347 Designed to introduce students to the science...
348 A basic introduction to quantitative methods ...
349 The influence of social factors on the indivi...
350 A general introduction and overview to the fu...
351 Recent advances in neuroscience are radically...
prereqs credits prereq_type \
347 None 1 None
348 PSYC100. And 1 course with a minimum grade of... 3 Flexible
349 PSYC100. 3 Hard
350 PSYC200. 4 Hard
351 PSYC100. And BSCI170 and BSCI171; or BSCI105. 3 Flexible
area has_prereqs is_intro is_entrypoint
347 PSYC 0 0 0
348 PSYC 1 0 0
349 PSYC 1 0 0
350 PSYC 1 0 0
351 PSYC 1 0 0
3.026315789473684
The .groupby()
way
# use groupby to split the courses df into subsets grouped by area
# we can iterate through the resulting collection of dataframe subsets
# where each step in the iteration allows us to grab
# 1. the name of the subset, which is the shared value (in this case area)
# 2. the subset dataframe (here called areaDF)
for area, areaDF in courses.groupby('area'):
print(area)
print(areaDF.head())
print(areaDF['credits'].mean()) # starting to get into the "apply"
AMST
code title \
0 AMST101 Introduction American Studies
1 AMST298C Introduction to Asian American Studies
2 AMST340 Introduction to History, Theories and Methods...
3 AMST418N Asian American Public Policy
4 AMST450 Seminar in American Studies
description \
0 Introduces students to the interdisciplinary ...
1 The aggregate experience of Asian Pacific Ame...
2 Introduction to the process of interdisciplin...
3
4 Developments in theories and methods of Ameri...
prereqs credits prereq_type \
0 None 3 None
1 None 3 None
2 Must have completed AMST201; and 2 courses in... 3 Hard
3 None 3 None
4 AMST201 and AMST340; and 1 course in AMST. 3 Hard
area has_prereqs is_intro is_entrypoint
0 AMST 0 1 1
1 AMST 0 1 1
2 AMST 1 1 0
3 AMST 0 0 0
4 AMST 1 0 0
3.0
BMGT
code title \
9 BMGT190H Introduction to Design and Quality
10 BMGT210 Foundations of Accounting for Non Business Ma...
11 BMGT302 Designing Applications for Business Analytics
12 BMGT310 Intermediate Accounting I
13 BMGT311 Intermediate Accounting II
description \
9 QUEST students learn and apply design practic...
10 Provides an understanding of the common state...
11 Provides an introduction to structured progra...
12 Comprehensive analysis of financial accountin...
13 Continuation of BMGT310.
prereqs credits prereq_type \
9 None 4 None
10 None 3 None
11 BMGT301; or permission of BMGT-Robert H. Smit... 3 Flexible
12 BMGT221. 3 Hard
13 BMGT310. 3 Hard
area has_prereqs is_intro is_entrypoint
9 BMGT 0 1 1
10 BMGT 0 0 0
11 BMGT 1 0 0
12 BMGT 1 0 0
13 BMGT 1 0 0
3.0377358490566038
CMSC
code title \
62 CMSC122 Introduction to Computer Programming via the ...
63 CMSC131 Object-Oriented Programming I
64 CMSC132 Object-Oriented Programming II
65 CMSC133 Object Oriented Programming I Beyond Fundamen...
66 CMSC216 Introduction to Computer Systems
description \
62 Introduction to computer programming in the c...
63 Introduction to programming and computer scie...
64 Introduction to use of computers to solve pro...
65 An introduction to computer science and objec...
66 Introduction to the interaction between user ...
prereqs credits prereq_type \
62 None 3 None
63 None 4 None
64 Minimum grade of C- in CMSC131; or must have ... 4 Flexible
65 None 2 None
66 Minimum grade of C- in CMSC132; and minimum g... 4 Hard
area has_prereqs is_intro is_entrypoint
62 CMSC 0 1 1
63 CMSC 0 0 0
64 CMSC 1 0 0
65 CMSC 0 0 0
66 CMSC 1 1 0
3.0
COMM
code title \
108 COMM200 Critical Thinking and Speaking
109 COMM331 News Writing and Reporting for Public Relations
110 COMM332 News Editing for Public Relations
111 COMM351 Public Relations Techniques
112 COMM353 New Media Writing for Public Relations
description \
108 Theory and practice of persuasive discourse a...
109 Writing and researching news and information ...
110 Copy editing, graphic principles and processe...
111 The techniques of public relations, including...
112 Students learn the uses and influence of new ...
prereqs credits prereq_type \
108 None 3 None
109 COMM201; and must have completed or be concur... 3 Flexible
110 Minimum grade of C- in COMM331; or students w... 3 Flexible
111 COMM332. 3 Hard
112 Minimum grade of C- in COMM351. 3 Hard
area has_prereqs is_intro is_entrypoint
108 COMM 0 0 0
109 COMM 1 0 0
110 COMM 1 0 0
111 COMM 1 0 0
112 COMM 1 0 0
2.870967741935484
ECON
code title \
139 ECON200 Principles of Microeconomics
140 ECON201 Principles of Macroeconomics
141 ECON230 Applied Economic Statistics
142 ECON305 Intermediate Macroeconomic Theory and Policy
143 ECON306 Intermediate Microeconomic Theory & Policy
description \
139 Introduces economic models used to analyze ec...
140 An introduction to how market economies behav...
141 Introductory course to develop understanding ...
142 Analysis of the determination of national inc...
143 Analysis of the theories of consumer behavior...
prereqs credits prereq_type \
139 MATH107 or MATH110; or must have math eligibi... 3 Flexible
140 MATH107 or MATH110; or must have math eligibi... 3 Flexible
141 Must have math eligibility of MATH113 or high... 3 Flexible
142 Minimum grade of C- in ECON201 and ECON200. A... 3 Flexible
143 1 course with a minimum grade of C- from (ECO... 3 Flexible
area has_prereqs is_intro is_entrypoint
139 ECON 1 0 0
140 ECON 1 0 0
141 ECON 1 0 0
142 ECON 1 0 0
143 ECON 1 0 0
2.984375
ENSP
code title \
203 ENSP102 Introduction to Environmental Policy
204 ENSP250 Lawns in the Landscape: Environmental Hero or...
205 ENSP305 Applied Quantitative Methods in Environmental...
206 ENSP330 Introduction to Environmental Law
207 ENSP342 Environmental Threats to Oceans and Coasts: T...
description \
203 Second of two courses that introduce students...
204 Examination of the lawn as an element in the ...
205 Intended for students interested in pursuing ...
206 An overview of environmental law, from its co...
207 An interdisciplinary study of the challenges ...
prereqs credits prereq_type \
203 None 3 None
204 None 3 None
205 BIOM301, ECON321, GEOG306, PSYC200, or SOCY20... 3 Flexible
206 None 3 None
207 None 3 None
area has_prereqs is_intro is_entrypoint
203 ENSP 0 1 1
204 ENSP 0 0 0
205 ENSP 1 0 0
206 ENSP 0 1 1
207 ENSP 0 0 0
3.0
ENTS
code title \
209 ENTS630 The Economics of International Telecommunicat...
210 ENTS632 Telecommunications Marketing Management
211 ENTS635 Decision Support Methods for Telecommunicatio...
212 ENTS641 Networks and Protocols II
description prereqs credits \
209 Basic microeconomic principles used by teleco... None 3
210 Topics covered include strategic marketing, s... None 3
211 The aim of this course is to introduce manage... None 3
212 Techniques for the specification, design, ana... ENTS640. 3
prereq_type area has_prereqs is_intro is_entrypoint
209 None ENTS 0 0 0
210 None ENTS 0 0 0
211 None ENTS 0 0 0
212 Hard ENTS 1 0 0
3.0
INFM
code title \
213 INFM605 Users and Use Context
214 INFM612 Management Concepts and Principles for Inform...
215 INFM620 Introduction to Strategic Information Managem...
216 INFM700 Information Architecture
217 INFM737 Information Management Capstone Experience
description \
213 Use of information by individuals. Nature of ...
214 Key aspects of management - focusing on plann...
215 Strategic management is the comprehensive col...
216 Principles and techniques of information orga...
217 The Information Management Capstone Experienc...
prereqs credits prereq_type \
213 None 3 None
214 None 3 None
215 INFM612; or LBSC631; or permission of instruc... 3 Flexible
216 INFM603; or permission of instructor. 3 Flexible
217 INFM736; and must have earned a minimum of 27... 3 Flexible
area has_prereqs is_intro is_entrypoint
213 INFM 0 0 0
214 INFM 0 0 0
215 INFM 1 1 0
216 INFM 1 0 0
217 INFM 1 0 0
3.0
INST
code title \
218 INST126 Introduction to Programming for Information S...
219 INST155 Social Networking
220 INST201 Introduction to Information Science
221 INST311 Information Organization
222 INST314 Statistics for Information Science
description \
218 An introduction to computer programming for s...
219 Introduces methods for analyzing and understa...
220 Examining the effects of new information tech...
221 Examines the theories, concepts, and principl...
222 Basic concepts in statistics including measur...
prereqs credits prereq_type \
218 Minimum grade of C- in MATH115; or must have ... 3 Flexible
219 None 3 None
220 None 3 None
221 None 3 None
222 Minimum grade of C- in STAT100 and MATH115 (o... 3 Flexible
area has_prereqs is_intro is_entrypoint
218 INST 1 1 0
219 INST 0 0 0
220 INST 0 1 1
221 INST 0 0 0
222 INST 1 0 0
3.0
MATH
code title \
265 MATH107 Introduction to Math Modeling and Probability
266 MATH113 College Algebra and Trigonometry
267 MATH115 Precalculus
268 MATH120 Elementary Calculus I
269 MATH121 Elementary Calculus II
description \
265 A goal is to convey the power of mathematics ...
266 Topics include elementary functions including...
267 Preparation for MATH120, MATH130 or MATH140. ...
268 Basic ideas of differential and integral calc...
269 Differential and integral calculus, with emph...
prereqs credits prereq_type \
265 Must have math eligibility of Math 107 or hig... 3 Flexible
266 Must have math eligibility of MATH113 or high... 3 Flexible
267 Must have math eligibility of MATH115 or high... 3 Flexible
268 1 course with a minimum grade of C- from (MAT... 3 Flexible
269 MATH120, MATH130, MATH136, or MATH140; or mus... 3 Flexible
area has_prereqs is_intro is_entrypoint
265 MATH 1 1 0
266 MATH 1 0 0
267 MATH 1 0 0
268 MATH 1 0 0
269 MATH 1 0 0
3.061224489795918
PHSC
code title \
314 PHSC401 History of Public Health
315 PHSC412 Food, Policy, and Public Health
316 PHSC415 Essentials of Public Health Biology: The Cell...
317 PHSC440 Public Health Nutrition
318 PHSC497 Public Health Science Capstone
description \
314 Emphasis is on the history of public health i...
315 Broad overview of the impact of food and food...
316 Presents the basic scientific and biomedical ...
317 Engages students in conceptual thinking about...
318 The capstone course is the culminating experi...
prereqs credits prereq_type \
314 None 3 None
315 Must have completed HLSA300 with a C- or high... 3 Flexible
316 Minimum grade of C- in BSCI202. 3 Hard
317 A minimum of C- in BSCI170, BSCI171, CHEM131,... 3 Hard
318 Must have completed the professional writing ... 3 Flexible
area has_prereqs is_intro is_entrypoint
314 PHSC 0 0 0
315 PHSC 1 0 0
316 PHSC 1 0 0
317 PHSC 1 0 0
318 PHSC 1 0 0
3.0
PLCY
code title \
319 PLCY101 Great Thinkers on Public Policy
320 PLCY201 Public Leaders and Active Citizens
321 PLCY215 Innovation and Social Change: Creating Change...
322 PLCY301 Sustainability
323 PLCY302 Examining Pluralism in Public Policy
description prereqs credits \
319 Great ideas in public policy, such as equalit... None 3
320 Aims to inspire, teach and engage students in... None 3
321 A team-based, highly interactive and dynamic ... None 3
322 Designed for students whose academic majors w... None 3
323 Understanding pluralism and how groups and in... None 3
prereq_type area has_prereqs is_intro is_entrypoint
319 None PLCY 0 0 0
320 None PLCY 0 0 0
321 None PLCY 0 0 0
322 None PLCY 0 0 0
323 None PLCY 0 0 0
3.0
PSYC
code title \
347 PSYC123 The Psychology of Getting Hired
348 PSYC200 Statistical Methods in Psychology
349 PSYC221 Social Psychology
350 PSYC300 Research Methods in Psychology Laboratory
351 PSYC301 Biological Basis of Behavior
description \
347 Designed to introduce students to the science...
348 A basic introduction to quantitative methods ...
349 The influence of social factors on the indivi...
350 A general introduction and overview to the fu...
351 Recent advances in neuroscience are radically...
prereqs credits prereq_type \
347 None 1 None
348 PSYC100. And 1 course with a minimum grade of... 3 Flexible
349 PSYC100. 3 Hard
350 PSYC200. 4 Hard
351 PSYC100. And BSCI170 and BSCI171; or BSCI105. 3 Flexible
area has_prereqs is_intro is_entrypoint
347 PSYC 0 0 0
348 PSYC 1 0 0
349 PSYC 1 0 0
350 PSYC 1 0 0
351 PSYC 1 0 0
3.026315789473684
SPHL
code title \
385 SPHL100 Foundations of Public Health
386 SPHL291 Can we move beyond medication? Examining yoga...
387 SPHL333 Fundamentals of Undergraduate Teaching for Ed...
388 SPHL600 Foundations of Public Health
389 SPHL610 Program and Policy Planning, Implementation, ...
description prereqs credits \
385 An overview of the goals, functions, and meth... None 3
386 Does yoga improve the health of wounded warri... None 3
387 Supports the professional and personal develo... None 1
388 An overview of the goals, functions, and meth... None 3
389 This second course in the MPH/MHA integrated ... None 5
prereq_type area has_prereqs is_intro is_entrypoint
385 None SPHL 0 0 0
386 None SPHL 0 0 0
387 None SPHL 0 0 0
388 None SPHL 0 0 0
389 None SPHL 0 0 0
2.4285714285714284
STAT
code title \
392 STAT100 Elementary Statistics and Probability
393 STAT400 Applied Probability and Statistics I
394 STAT401 Applied Probability and Statistics II
395 STAT410 Introduction to Probability Theory
396 STAT420 Theory and Methods of Statistics
description \
392 Simplest tests of statistical hypotheses; app...
393 Random variables, standard distributions, mom...
394 Point estimation - unbiased and consistent es...
395 Probability and its properties. Random variab...
396 Point estimation, sufficiency, completeness, ...
prereqs credits prereq_type \
392 MATH110, MATH112, MATH113, or MATH115; or per... 3 Flexible
393 1 course with a minimum grade of C- from (MAT... 3 Flexible
394 1 course with a minimum grade of C- from (STA... 3 Hard
395 1 course with a minimum grade of C- from (MAT... 3 Hard
396 1 course with a minimum grade of C- from (SUR... 3 Hard
area has_prereqs is_intro is_entrypoint
392 STAT 1 0 0
393 STAT 1 0 0
394 STAT 1 0 0
395 STAT 1 1 0
396 STAT 1 0 0
3.0
URSP
code title \
407 URSP600 Research Design and Application
408 URSP601 Research Methods
409 URSP604 The Planning Process
410 URSP606 Planning Economics
411 URSP631 Transportation and Land Use
description prereqs credits \
407 Techniques in urban research, policy analysis... None 3
408 Use of measurement, statistics, quantitative ... None 3
409 Legal framework for U.S. planning; approaches... None 3
410 Resource allocation in a market economy, the ... None 3
411 The interrelationship between transportation ... None 3
prereq_type area has_prereqs is_intro is_entrypoint
407 None URSP 0 0 0
408 None URSP 0 0 0
409 None URSP 0 0 0
410 None URSP 0 0 0
411 None URSP 0 0 0
3.0
courses.head()
Apply and Combine#
The “manual” way: apply and combine into a new dataframe we construct from scratch#
# create an empty list to hold the new rows of the new COMBINED dataframe
eba = []
# SPLIT the dataframe by area, and iterate through each split
for area, areaDF in courses.groupby('area'):
# APPLY operations on the dataframe split
# count the number of entry point courses in the subarea
num_entrypoints = areaDF['is_entrypoint'].sum()
# count the number of total courses in the subarea
num_classes = len(areaDF)
# count the ratio of entry points to total classes
entrypoint_ratio = num_entrypoints/num_classes
# prepare a new row to put into the COMBINEd dataframe that has areas as rows instead of courses
# the row is represented as a dictionary, where each key is a column
entry = {
'area': area, # each row is an area
'num_entrypoints': num_entrypoints,
'num_classes': num_classes,
'entry_point_ratio': entrypoint_ratio
}
# COMBINE the resulting subcomputation into a new dataset
eba.append(entry)
# convert the list of new entries into a dataframe
eba = pd.DataFrame(eba)
eba
area | num_entrypoints | num_classes | entry_point_ratio | |
---|---|---|---|---|
0 | AMST | 2 | 9 | 0.222222 |
1 | BMGT | 1 | 53 | 0.018868 |
2 | CMSC | 1 | 46 | 0.021739 |
3 | COMM | 0 | 31 | 0.000000 |
4 | ECON | 0 | 64 | 0.000000 |
5 | ENSP | 2 | 6 | 0.333333 |
6 | ENTS | 0 | 4 | 0.000000 |
7 | INFM | 0 | 5 | 0.000000 |
8 | INST | 4 | 47 | 0.085106 |
9 | MATH | 0 | 49 | 0.000000 |
10 | PHSC | 0 | 5 | 0.000000 |
11 | PLCY | 0 | 28 | 0.000000 |
12 | PSYC | 1 | 38 | 0.026316 |
13 | SPHL | 0 | 7 | 0.000000 |
14 | STAT | 0 | 15 | 0.000000 |
15 | URSP | 0 | 7 | 0.000000 |
# how many classes are in each area?
# make a data frame that summarizes it
# make a list to hold the combined data by subgroup
summarized = []
# SPLIT-APPLY-COMBINE
# SPLIT the dataset by area
for area, areaData in courses.groupby('area'):
# APPLY a computation to get the number of courses
# which is just the number of rows in this subset dataframe
num_courses = len(areaData)
entry = {
'area': area,
'num_courses': num_courses
}
# add it to the COMBINED summarized dataset
summarized.append(entry)
# turn the list of dictionaries into a dataframe
summarized = pd.DataFrame(summarized)
summarized
area | num_courses | |
---|---|---|
0 | AMST | 9 |
1 | BMGT | 53 |
2 | CMSC | 46 |
3 | COMM | 31 |
4 | ECON | 64 |
5 | ENSP | 6 |
6 | ENTS | 4 |
7 | INFM | 5 |
8 | INST | 47 |
9 | MATH | 49 |
10 | PHSC | 5 |
11 | PLCY | 28 |
12 | PSYC | 38 |
13 | SPHL | 7 |
14 | STAT | 15 |
15 | URSP | 7 |
# iterate through the dataframe, row by row
for index, row in eba.iterrows():
print(row['area'])
print(row['prereq_classes'])
We can define a whole new function that is much more complicated than a simple sum or mean, to APPLY to each subset of our SPLIT
For example, we can write a function that takes in a prereq description, and extracts all the prereq classes mentinoed in the description
def extract_prereqs(descr):
prereqs = []
for word in descr.split():
# clean the word
clean_word = ""
for char in word:
if char.isalpha() or char.isdigit():
clean_word += char
if len(clean_word) == 7 and clean_word[:4].isalpha() and clean_word[-3:].isdigit():
prereqs.append(clean_word)
return prereqs
for area, areaDF in courses.groupby('area'):
area_prereqs = set()
for prereq in areaDF['prereqs']:
prereqs = extract_prereqs(prereq)
for prereq_descr in courses[courses['has_prereqs'] == 1]['prereqs'].values:
print(prereq_descr)
print(extract_prereqs(prereq_descr))
Shortcut apply-combine with .agg()
#
A more concise way to apply and combine is to chain the .agg()
function to a .groupby()
object to tell pandas to aggregate particular columns in particular ways (e.g., count the number of entry point courses in a given department, vs. give an average proportion of classes that are entry points).
# SPLIT by area
courses.groupby("area", as_index=False).agg(
# create a new column named num_entrypoints,
# and give it the value of the sum function applied to the is_entrypoints column
# APPLY tehse computations and COMBINE into a new data frame using .agg
num_entrypoints=('is_entrypoint', sum),
num_classes=('area', "count")
)
area | num_entrypoints | num_classes | |
---|---|---|---|
0 | AMST | 2 | 9 |
1 | BMGT | 1 | 53 |
2 | CMSC | 1 | 46 |
3 | COMM | 0 | 31 |
4 | ECON | 0 | 64 |
5 | ENSP | 2 | 6 |
6 | ENTS | 0 | 4 |
7 | INFM | 0 | 5 |
8 | INST | 4 | 47 |
9 | MATH | 0 | 49 |
10 | PHSC | 0 | 5 |
11 | PLCY | 0 | 28 |
12 | PSYC | 1 | 38 |
13 | SPHL | 0 | 7 |
14 | STAT | 0 | 15 |
15 | URSP | 0 | 7 |
Anatomy of combining .groupby()
with .agg()
This pattern is explained in the section “Recommended: Tuple Named Aggregations” in this article: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
# group the courses by the area column (and make sure that they show up as columns in the resulting dataframe)
# then apply the functions in the .agg() function to each subgroup
# and stitch it back into a dataframe that we'll put into the entrypoints_by_area variable
entrypoints_by_area = courses.groupby("area", as_index=False).agg(
# create a new column named num_entrypoints,
# and give it the value of the sum function applied to the is_entrypoints column
the_num_entrypoints=('is_entrypoint', "sum"),
num_classes=('area', "count")
)
entrypoints_by_area
# let's now compute the proportion of entry point classes, as a proxy for "openness"
# step 1: define the function
def openness(row):
return row['num_entrypoints']/row['num_classes']
# step 2: apply the function and save the results
entrypoints_by_area['openness'] = entrypoints_by_area.apply(openness, axis=1)
entrypoints_by_area
# what are some fun groupbys we can do on the other datasets?
# e.g., for donations, we can do average and sum and range, etc. by team
# or number of wins by conference in ncaa
# or total sales per hour of day for bread
In my experience, this works really well for standard analysis tasks, but is less flexible than the manual approach. This approach would be tough to adapt easily for the bread Project 4, for example. I also like teaching the manual approach first for getting an intuition for what is happening under the hood.
Saving data / results for later analysis#
We often want to save the results of our analysis for later. This can be done in a few different ways (depending on what file format will be useful later, such as json
, html
, xlsx
(excel spreadsheets), or csv
).
In this class, we’ll practice saving to csv
, a common file format for data (the same one you practice reading into pandas!)
# example of saving the entrypoints_by_area dataframe to a csv file
entrypoints_by_area.to_csv("outputs/entrypoints_by_area.csv")
The .to_csv()
method can take a number of optional arguments to control what happens, but the only required one is the file path to where the csv file will be saved (similar to what you need when you want to write to a file with open()
)
In this example, we’re saving the entrypoints_by_area
dataframe to the entrypoints_by_area.csv
in the output/
folder.
Just make sure (as with files), that the folder actually exists before you try to put a file there!
Extras#
This is stuff we may not get to in class but is available because it may be useful for your projects and beyond (though you can certainly solve most of Project 4 without these).
Use .value_counts()
to summarize categorical data in your dataframe#
Last week we learned how to compute some basic statistics, overall, and by column, for quantitative data. Today, we’ll learn how to use value_counts()
to quickly summarize categorical data.
.value_counts()
does exactly what you think it might do based on the name: it counts the frequency of each unique value in a column! In other words, it gives us a way to count how many times each value shows up in a column. In this way, it’s kinda similar to the basic “count-based” indexing we did in Module 3.
Hint: this could be useful for Problem 4 for Project 4!
Here’s an example for the courses data: how many times does each “area” show up?
# access the area column in the courses dataframe
# and apply the value_counts method to that column
courses['area'].value_counts()
The syntax here is:
nameOfDataFrame[‘nameOfColumn’].value_counts()
value_counts()
is a method that a Pandas series (i.e., column in a dataframe) data structure can do (again, make the connection back to .append()
for lists, and .split()
for strings).
Let’s try some other queries!
# for ncaa dataset
# how many entries do we have for each conference?
# for bls data
# how many entries do we have for each category?
Value counts returns a series, which has nice properties of both lists and dictionaries.
Like lists, we can sort it using the .sort_values()
method, though we need to make sure to either force it to run “in place” (with inplace=True
as an argument for .sort_values()
), or save it to a variable.
area_counts.sort_values(ascending=True)
And access items by index position, which allows us to get the first thing, or the first 5 things, or the last 5 things, etc.
# get the first value in the series
# note: you only get the value, not the "name"
area_counts[0]
area_counts[:5]
area_counts = courses['area'].value_counts()
# like a cross between a dictionary anda list
# can get value by named key like a dict
print("INST", area_counts['INST'])
print("most frqeuent item count", area_counts[0])
area_counts.keys()
# let's say we want the top 5 most populous areas
# we can slice/subset the series just like a list
# and then get the keys from that subset
area_counts[:5].keys()
# let's try with the other datasets!
# ncaa-team-data
# bls-by-category
# BreadBasket_DMS
bread = pd.read_csv(f'{folder}/BreadBasket_DMS.csv')
bread.head()
# how do we get the frequency counts for items in the bread dataframe?
bread['Item'].value_counts()
Plotting#
The main library for plotting in Python is matplotlib
. You can learn that library later. It has lots of fine-grained controls.
For now, you can use pandas “wrapper” over matplotlib (basically calling matplotlib from inside pandas), which is a bit easier to learn.
entrypoints_by_area
def openness(row):
return row['num_entrypoints']/row['num_classes']
entrypoints_by_area['openness'] = entrypoints_by_area.apply(openness, axis=1)
entrypoints_by_area
# sort the data by the openness column
# make sure we assign to the entry points variable again so we don't lose it (bc pandas treats dataframes as immutable, like strings, unless we force it to do otherwise)
entrypoints_by_area = entrypoints_by_area.sort_values(by="openness", ascending=False)
entrypoints_by_area
# plot openness by area
entrypoints_by_area.plot(
x="area",
y="openness",
kind='bar',
xlabel="AREA",
ylabel="Proportion of entry point classes",
title="Classes openness by area"
)
entrypoints_by_area.plot(y="openness", kind="hist")
entrypoints_by_area.sort_values(by="num_classes", ascending=False).plot(x="area", y="num_classes", kind="bar")
Reminder: More resources#
The pandas website is decent place to start: https://pandas.pydata.org/
This “cheat sheet” is also a really helpful guide to more common operations that you may run into later: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
There are also many blogs that are helpful, like towardsdatascience.com
The cool thing about pandas and data analysis in python is that many people share notebooks that you can inspect / learn from / adapt code for your own projects (just like mine!).