11: Pandas for data analysis with Python: Part 2#

Learning objectives#

Last week, we learned:

  • Pandas is a library in Python that is designed for data manipulation and analysis

  • How to use libraries (import them, access their functions and data structures with library.function_name())

  • About the dataframe data structure: basically a smart spreadsheet, with rows of observations, and columns of variables/data for each observation - sort of a cross between a list (sortable, indexable) and a dictionary (quickly access data by key)

  • Some basic operations: constructing a dataframe, summarizing, subsetting, reshaping

This week, we’ll most dig into more advanced operations for reshaping/modifying your dataframe:

  • Use .apply() to apply functions to one or more columns to generate new columns

  • Use .groupby() to split your data into subgroups, apply some function to their data, then combine them into a new dataframe for further analysis (the “split-apply-combine” pattern that is fundamental to data analysis with pandas)

  • Use some basic plotting functions to explore your data

These roughly correspond to Qs 6-8 in your PCEs.

If we have time, we’ll learn a bit more about summarization:

  • Use .value_counts() to summarize categorical data

  • How to plot data

Creating/modifying data columns based on one or more columns using .apply()#

More advanced operations on dataframes involve modifying or creating new columns!

In data analysis, we often want to do things to data in our columns for data preparation/cleaning. Sometimes there is missing data we want to recode, or we want to redescribe data or reclassify it for our analysis. We can do this with a combination of functions and the apply() method.

In this way, again making a connection back to lists, .apply() is a little like the map() function that we used for lists to transform items from one list to another list with equal length (e.g., convert scores to letter grades).

.apply() with a single column#

The simpler version of .apply() only takes input from one column.

To illustrate let’s do some operations on the dataset INST courses.csv.

# import the pandas library
import pandas as pd

# read in the dataset
fpath = 'INST courses.csv'
courses = pd.read_csv(fpath) # read in the file into a dataframe called courses
courses # use the .head() function to show the top 5 rows in the dataframe
Code Title Description Prereqs Credits
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0
12 INST408Y Special Topics in Information Science; Privacy... NaN NaN
13 INST408Z Special Topics in Information Science; The Apo... NaN NaN
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0
21 INST614 Literacy and Inclusion The educational and psychological dimensions o... None 3.0
22 INST616 Open Source Intelligence An introduction to Open Source Intelligence (O... None 3.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0
27 INST702 Advanced Usability Testing Usability testing methods -- how to design and... Permission of instructor; or (INFM605 or INST6... 3.0
28 INST709 Independent Study NaN NaN
29 INST728G Special Topics in Information Studies; Smart C... NaN NaN
30 INST728V Special Topics in Information Studies; Digital... NaN NaN
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0
32 INST737 Introduction to Data Science An exploration of some of the best and most ge... INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0
33 INST741 Social Computing Technologies and Applications Tools and techniques for developing and config... INFM603 and INFM605; or (LBSC602 and LBSC671);... 3.0
34 INST742 Implementing Digital Curation Management of and technology for application o... INST604; or permission of instructor. 3.0
35 INST746 Digitization of Legacy Holdings Through hands on exercises and real-world proj... INST604. 3.0
36 INST762 Visual Analytics Visual analytics is the use of interactive vis... INFM603 or INST630; or permission of instructor. 3.0
37 INST767 Big Data Infrastructure Principles and techniques of data science and ... INST737; or permission of instructor. 3.0
38 INST776 HCIM CAPSTONE PROJECT The opportunity to apply the skills learned th... INST775; or permission of instructor. 3.0
39 INST785 Documentation, Collection, and Appraisal of Re... Development of documentation strategies and pl... INST604; or permission of instructor. 3.0
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0

Let’s say we want to have a prereqs column that is sortable, maybe 0 = No prereqs, and 1 = has prereqs

Step 1: Define the function you want to apply#

# Step 1: define the function you want to apply
def has_prereq(prereq_descr):
  # assume we get a string prereq description
  if pd.isnull(prereq_descr):
    return 0
  elif "None" in prereq_descr:
    return 0
  else:
    return 1
# test the function
prereq = "BMGT301; or instructor permission" # this should yield 1
prereq2 = "None" # this should yield 0
print(has_prereq(prereq))
print(has_prereq(prereq2))
1
0

Step 2: Apply the function to a column and save the result in a (new) column#

# Step 2: apply it to a column and save the result in the (new) `had_prereqs` column
courses['has_prereqs'] = courses['Prereqs'].apply(has_prereq) # apply the has_prereq() function to every row in the prereqs column in the courses data frame
courses.head(10)
Code Title Description Prereqs Credits has_prereqs
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 1
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0 1
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0 1
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 1
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 1
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0 1
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0 1
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0 1
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0 1

Another example: let’s say we want ot know if a course is an introductory course. How might we do this?

# first define a function to check if the course is an intro course
def is_intro(title):
    if "introduction" in title.lower():
        return 1
    else:
        return 0

# then apply it to the courses column and save the result in the (new) `is_intro` column
courses['is_intro'] = courses['Title'].apply(is_intro)
courses.head(10)
Code Title Description Prereqs Credits has_prereqs is_intro
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 1 1
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 0 1
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0 1 0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0 1 0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 1 0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 1 0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0 1 0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0 1 0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0 1 0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0 1 0

If you’re lazy, you can pass in anonymous functions too, with lambda: https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7

is_introductory = courses['Title'].apply(lambda title: 1 if "introduction" in title.lower() else 0)
is_introductory.head(10) # show the top 10
0    1
1    1
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: Title, dtype: int64
What’s happening under the hood when you .apply() a function to a column#

Pandas is iterating through every row in that column, and applying the function to the value in that row.

# let's show this for the first 10 courses
for prereq in courses['Prereqs'].head(10):
    print(f"Result of applying has_prereq() to {prereq}: {has_prereq(prereq)}")
Result of applying has_prereq() to Minimum grade of C- in MATH115; or must have math eligibility of MATH140 or higher; or permission of instructor.: 1
Result of applying has_prereq() to None: 0
Result of applying has_prereq() to Must have completed or be concurrently enrolled in INST201; or INST301.: 1
Result of applying has_prereq() to Must have completed or be concurrently enrolled in INST201; or must have completed or be concurrently enrolled in INST301. And minimum grade of C- in INST201 and INST301; and MATH115; and STAT100; and minimum grade of C- in MATH115 and STAT100.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST126, CMSC106); and must have completed or be concurrently enrolled in INST201 or INST301. And minimum grade of C- in INST201; or minimum grade of C- in INST301.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (CMSC106, CMSC122, INST126); and must have completed or be concurrently enrolled in INST201 or INST301; and minimum grade of C- in INST201 and INST301.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and minimum grade of C- in PSYC100.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and 1 course with a minimum grade of C- from (INST326, CMSC131); and minimum grade of C- in INST327.: 1
Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and minimum grade of C- in INST311.: 1
Result of applying has_prereq() to INST314.: 1

The .apply() function returns a pandas Series that is the same length as the input column (which is also a Series), with a corresponding value for each input.

print(f"the Prereqs column has {len(courses['Prereqs'])} rows")
print(f"the Series created by applying `has_prereq()` to the Prereqs column has {len(courses['Prereqs'].apply(has_prereq))} rows")
the Prereqs column has 41 rows
the Series created by applying `has_prereq()` to the Prereqs column has 41 rows

To save the results of the apply() for later analysis, we then need to assign it to a column, new or existing.

Remember, pandas prefers immutability in general (return a new object instead of modifying the object), and sometimes enforces it. With .apply(), it’s enforced: you can’t directly modify the column, you have to assign the returned Series to a column if you want it to persist.

Like with other assignment statements, just running the .apply() and assigning its return value to a column will not yield output. You’ll need to print out the dataframe to check the results.

PRACTICE: Let’s say I want to know how many courses we have in each area. We don’t have that data in the dataset; at least not explicitly. Fortunately we can make it with some simple programming that you already know how to do! The problem here is, given a code (i.e., data from one column), how do we “extract” the area?

# Step 1: define the function
def extract_area(code):
    # heuristic: just grab the first four characters
    return 
c = "CMSC250"
extract_area(c)
# Step 2: apply the function
courses['area'] = courses['code'].apply(extract_area)
courses.head(10)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'code'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/var/folders/xz/_hjc5hsx743dclmg8n5678nc0000gn/T/ipykernel_89416/1864781452.py in <module>
      1 # Step 2: apply the function
----> 2 courses['area'] = courses['code'].apply(extract_area)
      3 courses.head(10)

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'code'

PRACTICE: With the wunderground.csv dataset, how can we extract the year/month/day from the date column?

PRACTICE: With the BreadBasket_DMS.csv dataset, how can we extract the hour for each transaction from the Time column?

# the bread dataset
bread = pd.read_csv("data/BreadBasket_DMS.csv")
def extract_hour(time):
    return 

bread['Hour'] = bread['Time'].apply(extract_hour)
bread.sort_values(by="Hour")

.apply() with data from multiple columns#

What if you want to have a way to filter the courses in terms of “easy entry points” (i.e., both introductory and has no prerequisites)? That might also be interesting to analyze by area to see how many departments offer these easy entry points into the department for students from other departments.

Core thing we need to know here is that our .apply() will now apply a function that has a row as input, not an element of a single column. That way, we can access data from any column in the row: in this case, data from the “is_intro” and “has_prereq” columns.

There are two key differences between this use of .apply() and the single-column case:

  1. First, we do .apply() with the whole dataframe, not from a single column

  2. We specify an argument for the axis parameter to tell it to use rows as inputs. We need to pass the argument 1 to the axis parameter when we call .apply() so it knows to pass a row into the function, not just a single column element. See here for more details: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

# is_entry_point function
def is_entry_point(row):
  # if the value of the "is intro" column for this row is 1
  # AND the value of hte "has_prereq" column for this row is 0
  # return 1
  if row['is_intro'] == 1 and row['has_prereqs'] == 0: 
    return 1
  else:
    return 0
# this should yield 1
test_row = {
    'is_intro': 1,
    'has_prereqs': 0
}

# this should yield 0
test_row2 = {
    'is_intro': 1,
    'has_prereqs': 1
}

print(is_entry_point(test_row))
print(is_entry_point(test_row2))
1
0
# Step 2 apply the function and save the result
courses['is_entrypoint'] = courses.apply(is_entry_point, axis=1) # need to specify axis=1 to apply it to every row
# courses['classlevel'] = courses['classcode'].apply(level)
courses.head()

# compare to .apply() with a single column
# courses['is_intro'] = courses['title'].apply(is_intro)
# key differences:
# - here for multiple columns, we start with the whole dataframe, instead of a specific column
# - and we pass the argument 1 to the axis parameter instead of letting it use the default 0 value
Code Title Description Prereqs Credits has_prereqs is_intro is_entrypoint
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 1 1 0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 0 1 1
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0 1 0 0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0 1 0 0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 1 0 0
# show me all the courses that are intro and have no prereqs
courses[courses['is_entrypoint'] == 1]
# if we have a list, we can do indexing like this to get the first 4 elements, say: courses[:4]
# if we have a dictionary, we can retrieve base don key, like this courses['hello']
Code Title Description Prereqs Credits has_prereqs is_intro is_entrypoint
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 0 1 1
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0 0 1 1
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0 0 1 1

More examples?

# for ncaa: make a column that is 1 if you had a winning season AND reached the round of 32
ncaa = pd.read_csv("data/ncaa-team-data-cleanCoachNames.csv")
ncaa.head()

def underdog_season(row):
    if row['w'] < 21 and row['ncaa_result'] == "Won National Final":
        return 1
    else:
        return 0
    
ncaa['underdog'] = ncaa.apply(underdog_season, axis=1)
ncaa[ncaa['underdog']==1]
Unnamed: 0 school conf rk w l srs sos pts_for pts_vs ... ap_pre ap_high ap_final pts_diff ncaa_result ncaa_numeric season year coach underdog
8229 8229 indiana Big Ten 78 20 3 NaN NaN NaN NaN ... 30 30 30 NaN Won National Final 48 1939-40 1939 Branch McCracken 1
23515 23515 wisconsin Big Ten 77 20 3 NaN NaN NaN NaN ... 30 30 30 NaN Won National Final 48 1940-41 1940 Bud Foster 1

2 rows × 21 columns

set(ncaa['ncaa_result'])
{'Lost First Four',
 'Lost First Round',
 'Lost National Final',
 'Lost National Semifinal',
 'Lost Opening Round',
 'Lost Regional Final',
 'Lost Regional Final (Final Four)',
 'Lost Regional Semifinal',
 'Lost Second Round',
 'Lost Third Round',
 'Playing First Four',
 'Playing First Round',
 'Won National Final',
 nan}
# for ncaa: create a column taht is 1 if the pt differential is non-negative BUT the team didn't make the playoffs (numeric >= 32 or 0)
# apply the get_month function to every value in the Date column in bread
# and store the resulting series in the Month column
# convert the ncaa_result column to a numerical ranking of season outcome

Analyze subgroups of your data with the split-apply-combine pattern#

Going more deeply on the path of “reshaping”, we often want to compute data based on subsets of the data, grouped by some column.

For example, we might want to see how many departments offer easy entry points.

We can do this with the “split-apply-combine” pattern, which is implemented in the .groupby() function.

Basically, it goes like this:

  1. Split the data into subgroups (e.g., split courses into department subgroups)

  2. Apply some computation on each subgroup (e.g., find number of easy entry points for each department subgroup)

  3. Combine subgroup-computation information into an overall new dataframe that has subgroups as entries

Here’s a picture with a simpler datset to give an intuition:

for area, areaData in courses.groupby('area'):
    print(area)
    print(areaData)

Split with .groupby()#

The first step is to split the dataset into subgroups.

The manual way

# get all the unique area values
course_areas = set(courses['area'].values)

# iterate through each unique area value
for area in course_areas:
    print(area)
    # get the subset of the course data that is associated with this area
    area_df = courses[courses['area'] == area]
    print(area_df.head())
    # summarize the credits for this subset of hte dataframe
    print(area_df['credits'].mean())
    print("\n")
PLCY
         code                                              title  \
319  PLCY101                    Great Thinkers on Public Policy    
320  PLCY201                 Public Leaders and Active Citizens    
321  PLCY215    Innovation and Social Change: Creating Change...   
322  PLCY301                                     Sustainability    
323  PLCY302               Examining Pluralism in Public Policy    

                                           description prereqs  credits  \
319   Great ideas in public policy, such as equalit...   None         3   
320   Aims to inspire, teach and engage students in...   None         3   
321   A team-based, highly interactive and dynamic ...   None         3   
322   Designed for students whose academic majors w...   None         3   
323   Understanding pluralism and how groups and in...   None         3   

    prereq_type  area  has_prereqs  is_intro  is_entrypoint  
319        None  PLCY            0         0              0  
320        None  PLCY            0         0              0  
321        None  PLCY            0         0              0  
322        None  PLCY            0         0              0  
323        None  PLCY            0         0              0  
3.0


COMM
         code                                              title  \
108  COMM200                     Critical Thinking and Speaking    
109  COMM331    News Writing and Reporting for Public Relations    
110  COMM332                  News Editing for Public Relations    
111  COMM351                        Public Relations Techniques    
112  COMM353             New Media Writing for Public Relations    

                                           description  \
108   Theory and practice of persuasive discourse a...   
109   Writing and researching news and information ...   
110   Copy editing, graphic principles and processe...   
111   The techniques of public relations, including...   
112   Students learn the uses and influence of new ...   

                                               prereqs  credits prereq_type  \
108                                              None         3        None   
109   COMM201; and must have completed or be concur...        3    Flexible   
110   Minimum grade of C- in COMM331; or students w...        3    Flexible   
111                                        COMM332.           3        Hard   
112                 Minimum grade of C- in COMM351.           3        Hard   

     area  has_prereqs  is_intro  is_entrypoint  
108  COMM            0         0              0  
109  COMM            1         0              0  
110  COMM            1         0              0  
111  COMM            1         0              0  
112  COMM            1         0              0  
2.870967741935484


CMSC
        code                                              title  \
62  CMSC122    Introduction to Computer Programming via the ...   
63  CMSC131                      Object-Oriented Programming I    
64  CMSC132                     Object-Oriented Programming II    
65  CMSC133    Object Oriented Programming I Beyond Fundamen...   
66  CMSC216                   Introduction to Computer Systems    

                                          description  \
62   Introduction to computer programming in the c...   
63   Introduction to programming and computer scie...   
64   Introduction to use of computers to solve pro...   
65   An introduction to computer science and objec...   
66   Introduction to the interaction between user ...   

                                              prereqs  credits prereq_type  \
62                                              None         3        None   
63                                              None         4        None   
64   Minimum grade of C- in CMSC131; or must have ...        4    Flexible   
65                                              None         2        None   
66   Minimum grade of C- in CMSC132; and minimum g...        4        Hard   

    area  has_prereqs  is_intro  is_entrypoint  
62  CMSC            0         1              1  
63  CMSC            0         0              0  
64  CMSC            1         0              0  
65  CMSC            0         0              0  
66  CMSC            1         1              0  
3.0


ENTS
         code                                              title  \
209  ENTS630    The Economics of International Telecommunicat...   
210  ENTS632            Telecommunications Marketing Management    
211  ENTS635    Decision Support Methods for Telecommunicatio...   
212  ENTS641                          Networks and Protocols II    

                                           description       prereqs  credits  \
209   Basic microeconomic principles used by teleco...         None         3   
210   Topics covered include strategic marketing, s...         None         3   
211   The aim of this course is to introduce manage...         None         3   
212   Techniques for the specification, design, ana...   ENTS640.           3   

    prereq_type  area  has_prereqs  is_intro  is_entrypoint  
209        None  ENTS            0         0              0  
210        None  ENTS            0         0              0  
211        None  ENTS            0         0              0  
212        Hard  ENTS            1         0              0  
3.0


URSP
         code                              title  \
407  URSP600    Research Design and Application    
408  URSP601                   Research Methods    
409  URSP604               The Planning Process    
410  URSP606                 Planning Economics    
411  URSP631        Transportation and Land Use    

                                           description prereqs  credits  \
407   Techniques in urban research, policy analysis...   None         3   
408   Use of measurement, statistics, quantitative ...   None         3   
409   Legal framework for U.S. planning; approaches...   None         3   
410   Resource allocation in a market economy, the ...   None         3   
411   The interrelationship between transportation ...   None         3   

    prereq_type  area  has_prereqs  is_intro  is_entrypoint  
407        None  URSP            0         0              0  
408        None  URSP            0         0              0  
409        None  URSP            0         0              0  
410        None  URSP            0         0              0  
411        None  URSP            0         0              0  
3.0


STAT
         code                                    title  \
392  STAT100    Elementary Statistics and Probability    
393  STAT400     Applied Probability and Statistics I    
394  STAT401    Applied Probability and Statistics II    
395  STAT410       Introduction to Probability Theory    
396  STAT420         Theory and Methods of Statistics    

                                           description  \
392   Simplest tests of statistical hypotheses; app...   
393   Random variables, standard distributions, mom...   
394   Point estimation - unbiased and consistent es...   
395   Probability and its properties. Random variab...   
396   Point estimation, sufficiency, completeness, ...   

                                               prereqs  credits prereq_type  \
392   MATH110, MATH112, MATH113, or MATH115; or per...        3    Flexible   
393   1 course with a minimum grade of C- from (MAT...        3    Flexible   
394   1 course with a minimum grade of C- from (STA...        3        Hard   
395   1 course with a minimum grade of C- from (MAT...        3        Hard   
396   1 course with a minimum grade of C- from (SUR...        3        Hard   

     area  has_prereqs  is_intro  is_entrypoint  
392  STAT            1         0              0  
393  STAT            1         0              0  
394  STAT            1         0              0  
395  STAT            1         1              0  
396  STAT            1         0              0  
3.0


ENSP
         code                                              title  \
203  ENSP102               Introduction to Environmental Policy    
204  ENSP250    Lawns in the Landscape: Environmental Hero or...   
205  ENSP305    Applied Quantitative Methods in Environmental...   
206  ENSP330                  Introduction to Environmental Law    
207  ENSP342    Environmental Threats to Oceans and Coasts: T...   

                                           description  \
203   Second of two courses that introduce students...   
204   Examination of the lawn as an element in the ...   
205   Intended for students interested in pursuing ...   
206   An overview of environmental law, from its co...   
207   An interdisciplinary study of the challenges ...   

                                               prereqs  credits prereq_type  \
203                                              None         3        None   
204                                              None         3        None   
205   BIOM301, ECON321, GEOG306, PSYC200, or SOCY20...        3    Flexible   
206                                              None         3        None   
207                                              None         3        None   

     area  has_prereqs  is_intro  is_entrypoint  
203  ENSP            0         1              1  
204  ENSP            0         0              0  
205  ENSP            1         0              0  
206  ENSP            0         1              1  
207  ENSP            0         0              0  
3.0


AMST
        code                                              title  \
0   AMST101                      Introduction American Studies    
1  AMST298C             Introduction to Asian American Studies    
2   AMST340    Introduction to History, Theories and Methods...   
3  AMST418N                       Asian American Public Policy    
4   AMST450                        Seminar in American Studies    

                                         description  \
0   Introduces students to the interdisciplinary ...   
1   The aggregate experience of Asian Pacific Ame...   
2   Introduction to the process of interdisciplin...   
3                                                      
4   Developments in theories and methods of Ameri...   

                                             prereqs  credits prereq_type  \
0                                              None         3        None   
1                                              None         3        None   
2   Must have completed AMST201; and 2 courses in...        3        Hard   
3                                              None         3        None   
4      AMST201 and AMST340; and 1 course in AMST.           3        Hard   

   area  has_prereqs  is_intro  is_entrypoint  
0  AMST            0         1              1  
1  AMST            0         1              1  
2  AMST            1         1              0  
3  AMST            0         0              0  
4  AMST            1         0              0  
3.0


MATH
         code                                            title  \
265  MATH107    Introduction to Math Modeling and Probability    
266  MATH113                 College Algebra and Trigonometry    
267  MATH115                                      Precalculus    
268  MATH120                            Elementary Calculus I    
269  MATH121                           Elementary Calculus II    

                                           description  \
265   A goal is to convey the power of mathematics ...   
266   Topics include elementary functions including...   
267   Preparation for MATH120, MATH130 or MATH140. ...   
268   Basic ideas of differential and integral calc...   
269   Differential and integral calculus, with emph...   

                                               prereqs  credits prereq_type  \
265   Must have math eligibility of Math 107 or hig...        3    Flexible   
266   Must have math eligibility of MATH113 or high...        3    Flexible   
267   Must have math eligibility of MATH115 or high...        3    Flexible   
268   1 course with a minimum grade of C- from (MAT...        3    Flexible   
269   MATH120, MATH130, MATH136, or MATH140; or mus...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
265  MATH            1         1              0  
266  MATH            1         0              0  
267  MATH            1         0              0  
268  MATH            1         0              0  
269  MATH            1         0              0  
3.061224489795918


INFM
         code                                              title  \
213  INFM605                              Users and Use Context    
214  INFM612    Management Concepts and Principles for Inform...   
215  INFM620    Introduction to Strategic Information Managem...   
216  INFM700                           Information Architecture    
217  INFM737         Information Management Capstone Experience    

                                           description  \
213   Use of information by individuals. Nature of ...   
214   Key aspects of management - focusing on plann...   
215   Strategic management is the comprehensive col...   
216   Principles and techniques of information orga...   
217   The Information Management Capstone Experienc...   

                                               prereqs  credits prereq_type  \
213                                              None         3        None   
214                                              None         3        None   
215   INFM612; or LBSC631; or permission of instruc...        3    Flexible   
216           INFM603; or permission of instructor.           3    Flexible   
217   INFM736; and must have earned a minimum of 27...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
213  INFM            0         0              0  
214  INFM            0         0              0  
215  INFM            1         1              0  
216  INFM            1         0              0  
217  INFM            1         0              0  
3.0


PHSC
         code                                              title  \
314  PHSC401                           History of Public Health    
315  PHSC412                    Food, Policy, and Public Health    
316  PHSC415    Essentials of Public Health Biology: The Cell...   
317  PHSC440                            Public Health Nutrition    
318  PHSC497                     Public Health Science Capstone    

                                           description  \
314   Emphasis is on the history of public health i...   
315   Broad overview of the impact of food and food...   
316   Presents the basic scientific and biomedical ...   
317   Engages students in conceptual thinking about...   
318   The capstone course is the culminating experi...   

                                               prereqs  credits prereq_type  \
314                                              None         3        None   
315   Must have completed HLSA300 with a C- or high...        3    Flexible   
316                 Minimum grade of C- in BSCI202.           3        Hard   
317   A minimum of C- in BSCI170, BSCI171, CHEM131,...        3        Hard   
318   Must have completed the professional writing ...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
314  PHSC            0         0              0  
315  PHSC            1         0              0  
316  PHSC            1         0              0  
317  PHSC            1         0              0  
318  PHSC            1         0              0  
3.0


BMGT
         code                                              title  \
9   BMGT190H                 Introduction to Design and Quality    
10   BMGT210    Foundations of Accounting for Non Business Ma...   
11   BMGT302      Designing Applications for Business Analytics    
12   BMGT310                          Intermediate Accounting I    
13   BMGT311                         Intermediate Accounting II    

                                          description  \
9    QUEST students learn and apply design practic...   
10   Provides an understanding of the common state...   
11   Provides an introduction to structured progra...   
12   Comprehensive analysis of financial accountin...   
13                          Continuation of BMGT310.    

                                              prereqs  credits prereq_type  \
9                                               None         4        None   
10                                              None         3        None   
11   BMGT301; or permission of BMGT-Robert H. Smit...        3    Flexible   
12                                          BMGT221.         3        Hard   
13                                          BMGT310.         3        Hard   

    area  has_prereqs  is_intro  is_entrypoint  
9   BMGT            0         1              1  
10  BMGT            0         0              0  
11  BMGT            1         0              0  
12  BMGT            1         0              0  
13  BMGT            1         0              0  
3.0377358490566038


ECON
         code                                           title  \
139  ECON200                    Principles of Microeconomics    
140  ECON201                    Principles of Macroeconomics    
141  ECON230                     Applied Economic Statistics    
142  ECON305    Intermediate Macroeconomic Theory and Policy    
143  ECON306      Intermediate Microeconomic Theory & Policy    

                                           description  \
139   Introduces economic models used to analyze ec...   
140   An introduction to how market economies behav...   
141   Introductory course to develop understanding ...   
142   Analysis of the determination of national inc...   
143   Analysis of the theories of consumer behavior...   

                                               prereqs  credits prereq_type  \
139   MATH107 or MATH110; or must have math eligibi...        3    Flexible   
140   MATH107 or MATH110; or must have math eligibi...        3    Flexible   
141   Must have math eligibility of MATH113 or high...        3    Flexible   
142   Minimum grade of C- in ECON201 and ECON200. A...        3    Flexible   
143   1 course with a minimum grade of C- from (ECO...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
139  ECON            1         0              0  
140  ECON            1         0              0  
141  ECON            1         0              0  
142  ECON            1         0              0  
143  ECON            1         0              0  
2.984375


SPHL
         code                                              title  \
385  SPHL100                       Foundations of Public Health    
386  SPHL291    Can we move beyond medication? Examining yoga...   
387  SPHL333    Fundamentals of Undergraduate Teaching for Ed...   
388  SPHL600                       Foundations of Public Health    
389  SPHL610    Program and Policy Planning, Implementation, ...   

                                           description prereqs  credits  \
385   An overview of the goals, functions, and meth...   None         3   
386   Does yoga improve the health of wounded warri...   None         3   
387   Supports the professional and personal develo...   None         1   
388   An overview of the goals, functions, and meth...   None         3   
389   This second course in the MPH/MHA integrated ...   None         5   

    prereq_type  area  has_prereqs  is_intro  is_entrypoint  
385        None  SPHL            0         0              0  
386        None  SPHL            0         0              0  
387        None  SPHL            0         0              0  
388        None  SPHL            0         0              0  
389        None  SPHL            0         0              0  
2.4285714285714284


INST
         code                                              title  \
218  INST126    Introduction to Programming for Information S...   
219  INST155                                  Social Networking    
220  INST201                Introduction to Information Science    
221  INST311                           Information Organization    
222  INST314                 Statistics for Information Science    

                                           description  \
218   An introduction to computer programming for s...   
219   Introduces methods for analyzing and understa...   
220   Examining the effects of new information tech...   
221   Examines the theories, concepts, and principl...   
222   Basic concepts in statistics including measur...   

                                               prereqs  credits prereq_type  \
218   Minimum grade of C- in MATH115; or must have ...        3    Flexible   
219                                              None         3        None   
220                                              None         3        None   
221                                              None         3        None   
222   Minimum grade of C- in STAT100 and MATH115 (o...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
218  INST            1         1              0  
219  INST            0         0              0  
220  INST            0         1              1  
221  INST            0         0              0  
222  INST            1         0              0  
3.0


PSYC
         code                                        title  \
347  PSYC123              The Psychology of Getting Hired    
348  PSYC200            Statistical Methods in Psychology    
349  PSYC221                            Social Psychology    
350  PSYC300    Research Methods in Psychology Laboratory    
351  PSYC301                 Biological Basis of Behavior    

                                           description  \
347   Designed to introduce students to the science...   
348   A basic introduction to quantitative methods ...   
349   The influence of social factors on the indivi...   
350   A general introduction and overview to the fu...   
351   Recent advances in neuroscience are radically...   

                                               prereqs  credits prereq_type  \
347                                              None         1        None   
348   PSYC100. And 1 course with a minimum grade of...        3    Flexible   
349                                          PSYC100.         3        Hard   
350                                        PSYC200.           4        Hard   
351     PSYC100. And BSCI170 and BSCI171; or BSCI105.         3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
347  PSYC            0         0              0  
348  PSYC            1         0              0  
349  PSYC            1         0              0  
350  PSYC            1         0              0  
351  PSYC            1         0              0  
3.026315789473684

The .groupby() way

# use groupby to split the courses df into subsets grouped by area
# we can iterate through the resulting collection of dataframe subsets
# where each step in the iteration allows us to grab 
# 1. the name of the subset, which is the shared value (in this case area)
# 2. the subset dataframe (here called areaDF)
for area, areaDF in courses.groupby('area'):
    print(area)
    print(areaDF.head())
    print(areaDF['credits'].mean()) # starting to get into the "apply"
AMST
        code                                              title  \
0   AMST101                      Introduction American Studies    
1  AMST298C             Introduction to Asian American Studies    
2   AMST340    Introduction to History, Theories and Methods...   
3  AMST418N                       Asian American Public Policy    
4   AMST450                        Seminar in American Studies    

                                         description  \
0   Introduces students to the interdisciplinary ...   
1   The aggregate experience of Asian Pacific Ame...   
2   Introduction to the process of interdisciplin...   
3                                                      
4   Developments in theories and methods of Ameri...   

                                             prereqs  credits prereq_type  \
0                                              None         3        None   
1                                              None         3        None   
2   Must have completed AMST201; and 2 courses in...        3        Hard   
3                                              None         3        None   
4      AMST201 and AMST340; and 1 course in AMST.           3        Hard   

   area  has_prereqs  is_intro  is_entrypoint  
0  AMST            0         1              1  
1  AMST            0         1              1  
2  AMST            1         1              0  
3  AMST            0         0              0  
4  AMST            1         0              0  
3.0
BMGT
         code                                              title  \
9   BMGT190H                 Introduction to Design and Quality    
10   BMGT210    Foundations of Accounting for Non Business Ma...   
11   BMGT302      Designing Applications for Business Analytics    
12   BMGT310                          Intermediate Accounting I    
13   BMGT311                         Intermediate Accounting II    

                                          description  \
9    QUEST students learn and apply design practic...   
10   Provides an understanding of the common state...   
11   Provides an introduction to structured progra...   
12   Comprehensive analysis of financial accountin...   
13                          Continuation of BMGT310.    

                                              prereqs  credits prereq_type  \
9                                               None         4        None   
10                                              None         3        None   
11   BMGT301; or permission of BMGT-Robert H. Smit...        3    Flexible   
12                                          BMGT221.         3        Hard   
13                                          BMGT310.         3        Hard   

    area  has_prereqs  is_intro  is_entrypoint  
9   BMGT            0         1              1  
10  BMGT            0         0              0  
11  BMGT            1         0              0  
12  BMGT            1         0              0  
13  BMGT            1         0              0  
3.0377358490566038
CMSC
        code                                              title  \
62  CMSC122    Introduction to Computer Programming via the ...   
63  CMSC131                      Object-Oriented Programming I    
64  CMSC132                     Object-Oriented Programming II    
65  CMSC133    Object Oriented Programming I Beyond Fundamen...   
66  CMSC216                   Introduction to Computer Systems    

                                          description  \
62   Introduction to computer programming in the c...   
63   Introduction to programming and computer scie...   
64   Introduction to use of computers to solve pro...   
65   An introduction to computer science and objec...   
66   Introduction to the interaction between user ...   

                                              prereqs  credits prereq_type  \
62                                              None         3        None   
63                                              None         4        None   
64   Minimum grade of C- in CMSC131; or must have ...        4    Flexible   
65                                              None         2        None   
66   Minimum grade of C- in CMSC132; and minimum g...        4        Hard   

    area  has_prereqs  is_intro  is_entrypoint  
62  CMSC            0         1              1  
63  CMSC            0         0              0  
64  CMSC            1         0              0  
65  CMSC            0         0              0  
66  CMSC            1         1              0  
3.0
COMM
         code                                              title  \
108  COMM200                     Critical Thinking and Speaking    
109  COMM331    News Writing and Reporting for Public Relations    
110  COMM332                  News Editing for Public Relations    
111  COMM351                        Public Relations Techniques    
112  COMM353             New Media Writing for Public Relations    

                                           description  \
108   Theory and practice of persuasive discourse a...   
109   Writing and researching news and information ...   
110   Copy editing, graphic principles and processe...   
111   The techniques of public relations, including...   
112   Students learn the uses and influence of new ...   

                                               prereqs  credits prereq_type  \
108                                              None         3        None   
109   COMM201; and must have completed or be concur...        3    Flexible   
110   Minimum grade of C- in COMM331; or students w...        3    Flexible   
111                                        COMM332.           3        Hard   
112                 Minimum grade of C- in COMM351.           3        Hard   

     area  has_prereqs  is_intro  is_entrypoint  
108  COMM            0         0              0  
109  COMM            1         0              0  
110  COMM            1         0              0  
111  COMM            1         0              0  
112  COMM            1         0              0  
2.870967741935484
ECON
         code                                           title  \
139  ECON200                    Principles of Microeconomics    
140  ECON201                    Principles of Macroeconomics    
141  ECON230                     Applied Economic Statistics    
142  ECON305    Intermediate Macroeconomic Theory and Policy    
143  ECON306      Intermediate Microeconomic Theory & Policy    

                                           description  \
139   Introduces economic models used to analyze ec...   
140   An introduction to how market economies behav...   
141   Introductory course to develop understanding ...   
142   Analysis of the determination of national inc...   
143   Analysis of the theories of consumer behavior...   

                                               prereqs  credits prereq_type  \
139   MATH107 or MATH110; or must have math eligibi...        3    Flexible   
140   MATH107 or MATH110; or must have math eligibi...        3    Flexible   
141   Must have math eligibility of MATH113 or high...        3    Flexible   
142   Minimum grade of C- in ECON201 and ECON200. A...        3    Flexible   
143   1 course with a minimum grade of C- from (ECO...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
139  ECON            1         0              0  
140  ECON            1         0              0  
141  ECON            1         0              0  
142  ECON            1         0              0  
143  ECON            1         0              0  
2.984375
ENSP
         code                                              title  \
203  ENSP102               Introduction to Environmental Policy    
204  ENSP250    Lawns in the Landscape: Environmental Hero or...   
205  ENSP305    Applied Quantitative Methods in Environmental...   
206  ENSP330                  Introduction to Environmental Law    
207  ENSP342    Environmental Threats to Oceans and Coasts: T...   

                                           description  \
203   Second of two courses that introduce students...   
204   Examination of the lawn as an element in the ...   
205   Intended for students interested in pursuing ...   
206   An overview of environmental law, from its co...   
207   An interdisciplinary study of the challenges ...   

                                               prereqs  credits prereq_type  \
203                                              None         3        None   
204                                              None         3        None   
205   BIOM301, ECON321, GEOG306, PSYC200, or SOCY20...        3    Flexible   
206                                              None         3        None   
207                                              None         3        None   

     area  has_prereqs  is_intro  is_entrypoint  
203  ENSP            0         1              1  
204  ENSP            0         0              0  
205  ENSP            1         0              0  
206  ENSP            0         1              1  
207  ENSP            0         0              0  
3.0
ENTS
         code                                              title  \
209  ENTS630    The Economics of International Telecommunicat...   
210  ENTS632            Telecommunications Marketing Management    
211  ENTS635    Decision Support Methods for Telecommunicatio...   
212  ENTS641                          Networks and Protocols II    

                                           description       prereqs  credits  \
209   Basic microeconomic principles used by teleco...         None         3   
210   Topics covered include strategic marketing, s...         None         3   
211   The aim of this course is to introduce manage...         None         3   
212   Techniques for the specification, design, ana...   ENTS640.           3   

    prereq_type  area  has_prereqs  is_intro  is_entrypoint  
209        None  ENTS            0         0              0  
210        None  ENTS            0         0              0  
211        None  ENTS            0         0              0  
212        Hard  ENTS            1         0              0  
3.0
INFM
         code                                              title  \
213  INFM605                              Users and Use Context    
214  INFM612    Management Concepts and Principles for Inform...   
215  INFM620    Introduction to Strategic Information Managem...   
216  INFM700                           Information Architecture    
217  INFM737         Information Management Capstone Experience    

                                           description  \
213   Use of information by individuals. Nature of ...   
214   Key aspects of management - focusing on plann...   
215   Strategic management is the comprehensive col...   
216   Principles and techniques of information orga...   
217   The Information Management Capstone Experienc...   

                                               prereqs  credits prereq_type  \
213                                              None         3        None   
214                                              None         3        None   
215   INFM612; or LBSC631; or permission of instruc...        3    Flexible   
216           INFM603; or permission of instructor.           3    Flexible   
217   INFM736; and must have earned a minimum of 27...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
213  INFM            0         0              0  
214  INFM            0         0              0  
215  INFM            1         1              0  
216  INFM            1         0              0  
217  INFM            1         0              0  
3.0
INST
         code                                              title  \
218  INST126    Introduction to Programming for Information S...   
219  INST155                                  Social Networking    
220  INST201                Introduction to Information Science    
221  INST311                           Information Organization    
222  INST314                 Statistics for Information Science    

                                           description  \
218   An introduction to computer programming for s...   
219   Introduces methods for analyzing and understa...   
220   Examining the effects of new information tech...   
221   Examines the theories, concepts, and principl...   
222   Basic concepts in statistics including measur...   

                                               prereqs  credits prereq_type  \
218   Minimum grade of C- in MATH115; or must have ...        3    Flexible   
219                                              None         3        None   
220                                              None         3        None   
221                                              None         3        None   
222   Minimum grade of C- in STAT100 and MATH115 (o...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
218  INST            1         1              0  
219  INST            0         0              0  
220  INST            0         1              1  
221  INST            0         0              0  
222  INST            1         0              0  
3.0
MATH
         code                                            title  \
265  MATH107    Introduction to Math Modeling and Probability    
266  MATH113                 College Algebra and Trigonometry    
267  MATH115                                      Precalculus    
268  MATH120                            Elementary Calculus I    
269  MATH121                           Elementary Calculus II    

                                           description  \
265   A goal is to convey the power of mathematics ...   
266   Topics include elementary functions including...   
267   Preparation for MATH120, MATH130 or MATH140. ...   
268   Basic ideas of differential and integral calc...   
269   Differential and integral calculus, with emph...   

                                               prereqs  credits prereq_type  \
265   Must have math eligibility of Math 107 or hig...        3    Flexible   
266   Must have math eligibility of MATH113 or high...        3    Flexible   
267   Must have math eligibility of MATH115 or high...        3    Flexible   
268   1 course with a minimum grade of C- from (MAT...        3    Flexible   
269   MATH120, MATH130, MATH136, or MATH140; or mus...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
265  MATH            1         1              0  
266  MATH            1         0              0  
267  MATH            1         0              0  
268  MATH            1         0              0  
269  MATH            1         0              0  
3.061224489795918
PHSC
         code                                              title  \
314  PHSC401                           History of Public Health    
315  PHSC412                    Food, Policy, and Public Health    
316  PHSC415    Essentials of Public Health Biology: The Cell...   
317  PHSC440                            Public Health Nutrition    
318  PHSC497                     Public Health Science Capstone    

                                           description  \
314   Emphasis is on the history of public health i...   
315   Broad overview of the impact of food and food...   
316   Presents the basic scientific and biomedical ...   
317   Engages students in conceptual thinking about...   
318   The capstone course is the culminating experi...   

                                               prereqs  credits prereq_type  \
314                                              None         3        None   
315   Must have completed HLSA300 with a C- or high...        3    Flexible   
316                 Minimum grade of C- in BSCI202.           3        Hard   
317   A minimum of C- in BSCI170, BSCI171, CHEM131,...        3        Hard   
318   Must have completed the professional writing ...        3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
314  PHSC            0         0              0  
315  PHSC            1         0              0  
316  PHSC            1         0              0  
317  PHSC            1         0              0  
318  PHSC            1         0              0  
3.0
PLCY
         code                                              title  \
319  PLCY101                    Great Thinkers on Public Policy    
320  PLCY201                 Public Leaders and Active Citizens    
321  PLCY215    Innovation and Social Change: Creating Change...   
322  PLCY301                                     Sustainability    
323  PLCY302               Examining Pluralism in Public Policy    

                                           description prereqs  credits  \
319   Great ideas in public policy, such as equalit...   None         3   
320   Aims to inspire, teach and engage students in...   None         3   
321   A team-based, highly interactive and dynamic ...   None         3   
322   Designed for students whose academic majors w...   None         3   
323   Understanding pluralism and how groups and in...   None         3   

    prereq_type  area  has_prereqs  is_intro  is_entrypoint  
319        None  PLCY            0         0              0  
320        None  PLCY            0         0              0  
321        None  PLCY            0         0              0  
322        None  PLCY            0         0              0  
323        None  PLCY            0         0              0  
3.0
PSYC
         code                                        title  \
347  PSYC123              The Psychology of Getting Hired    
348  PSYC200            Statistical Methods in Psychology    
349  PSYC221                            Social Psychology    
350  PSYC300    Research Methods in Psychology Laboratory    
351  PSYC301                 Biological Basis of Behavior    

                                           description  \
347   Designed to introduce students to the science...   
348   A basic introduction to quantitative methods ...   
349   The influence of social factors on the indivi...   
350   A general introduction and overview to the fu...   
351   Recent advances in neuroscience are radically...   

                                               prereqs  credits prereq_type  \
347                                              None         1        None   
348   PSYC100. And 1 course with a minimum grade of...        3    Flexible   
349                                          PSYC100.         3        Hard   
350                                        PSYC200.           4        Hard   
351     PSYC100. And BSCI170 and BSCI171; or BSCI105.         3    Flexible   

     area  has_prereqs  is_intro  is_entrypoint  
347  PSYC            0         0              0  
348  PSYC            1         0              0  
349  PSYC            1         0              0  
350  PSYC            1         0              0  
351  PSYC            1         0              0  
3.026315789473684
SPHL
         code                                              title  \
385  SPHL100                       Foundations of Public Health    
386  SPHL291    Can we move beyond medication? Examining yoga...   
387  SPHL333    Fundamentals of Undergraduate Teaching for Ed...   
388  SPHL600                       Foundations of Public Health    
389  SPHL610    Program and Policy Planning, Implementation, ...   

                                           description prereqs  credits  \
385   An overview of the goals, functions, and meth...   None         3   
386   Does yoga improve the health of wounded warri...   None         3   
387   Supports the professional and personal develo...   None         1   
388   An overview of the goals, functions, and meth...   None         3   
389   This second course in the MPH/MHA integrated ...   None         5   

    prereq_type  area  has_prereqs  is_intro  is_entrypoint  
385        None  SPHL            0         0              0  
386        None  SPHL            0         0              0  
387        None  SPHL            0         0              0  
388        None  SPHL            0         0              0  
389        None  SPHL            0         0              0  
2.4285714285714284
STAT
         code                                    title  \
392  STAT100    Elementary Statistics and Probability    
393  STAT400     Applied Probability and Statistics I    
394  STAT401    Applied Probability and Statistics II    
395  STAT410       Introduction to Probability Theory    
396  STAT420         Theory and Methods of Statistics    

                                           description  \
392   Simplest tests of statistical hypotheses; app...   
393   Random variables, standard distributions, mom...   
394   Point estimation - unbiased and consistent es...   
395   Probability and its properties. Random variab...   
396   Point estimation, sufficiency, completeness, ...   

                                               prereqs  credits prereq_type  \
392   MATH110, MATH112, MATH113, or MATH115; or per...        3    Flexible   
393   1 course with a minimum grade of C- from (MAT...        3    Flexible   
394   1 course with a minimum grade of C- from (STA...        3        Hard   
395   1 course with a minimum grade of C- from (MAT...        3        Hard   
396   1 course with a minimum grade of C- from (SUR...        3        Hard   

     area  has_prereqs  is_intro  is_entrypoint  
392  STAT            1         0              0  
393  STAT            1         0              0  
394  STAT            1         0              0  
395  STAT            1         1              0  
396  STAT            1         0              0  
3.0
URSP
         code                              title  \
407  URSP600    Research Design and Application    
408  URSP601                   Research Methods    
409  URSP604               The Planning Process    
410  URSP606                 Planning Economics    
411  URSP631        Transportation and Land Use    

                                           description prereqs  credits  \
407   Techniques in urban research, policy analysis...   None         3   
408   Use of measurement, statistics, quantitative ...   None         3   
409   Legal framework for U.S. planning; approaches...   None         3   
410   Resource allocation in a market economy, the ...   None         3   
411   The interrelationship between transportation ...   None         3   

    prereq_type  area  has_prereqs  is_intro  is_entrypoint  
407        None  URSP            0         0              0  
408        None  URSP            0         0              0  
409        None  URSP            0         0              0  
410        None  URSP            0         0              0  
411        None  URSP            0         0              0  
3.0
courses.head()

Apply and Combine#

The “manual” way: apply and combine into a new dataframe we construct from scratch#

# create an empty list to hold the new rows of the new COMBINED dataframe
eba = []
# SPLIT the dataframe by area, and iterate through each split
for area, areaDF in courses.groupby('area'): 
  
    # APPLY operations on the dataframe split
    # count the number of entry point courses in the subarea
    num_entrypoints = areaDF['is_entrypoint'].sum()
    
    # count the number of total courses in the subarea
    num_classes = len(areaDF)
    
    # count the ratio of entry points to total classes
    entrypoint_ratio = num_entrypoints/num_classes
    
    # prepare a new row to put into the COMBINEd dataframe that has areas as rows instead of courses
    # the row is represented as a dictionary, where each key is a column
    entry = {
      'area': area, # each row is an area
      'num_entrypoints': num_entrypoints, 
      'num_classes': num_classes,
        'entry_point_ratio': entrypoint_ratio
    }
    # COMBINE the resulting subcomputation into a new dataset
    eba.append(entry) 
# convert the list of new entries into a dataframe
eba = pd.DataFrame(eba)
eba
area num_entrypoints num_classes entry_point_ratio
0 AMST 2 9 0.222222
1 BMGT 1 53 0.018868
2 CMSC 1 46 0.021739
3 COMM 0 31 0.000000
4 ECON 0 64 0.000000
5 ENSP 2 6 0.333333
6 ENTS 0 4 0.000000
7 INFM 0 5 0.000000
8 INST 4 47 0.085106
9 MATH 0 49 0.000000
10 PHSC 0 5 0.000000
11 PLCY 0 28 0.000000
12 PSYC 1 38 0.026316
13 SPHL 0 7 0.000000
14 STAT 0 15 0.000000
15 URSP 0 7 0.000000
# how many classes are in each area?
# make a data frame that summarizes it

# make a list to hold the combined data by subgroup
summarized = []

# SPLIT-APPLY-COMBINE
# SPLIT the dataset by area
for area, areaData in courses.groupby('area'):
    # APPLY a computation to get the number of courses
    # which is just the number of rows in this subset dataframe
    num_courses = len(areaData)
    entry = {
        'area': area,
        'num_courses': num_courses
    }
    # add it to the COMBINED summarized dataset
    summarized.append(entry)

# turn the list of dictionaries into a dataframe
summarized = pd.DataFrame(summarized)
summarized
area num_courses
0 AMST 9
1 BMGT 53
2 CMSC 46
3 COMM 31
4 ECON 64
5 ENSP 6
6 ENTS 4
7 INFM 5
8 INST 47
9 MATH 49
10 PHSC 5
11 PLCY 28
12 PSYC 38
13 SPHL 7
14 STAT 15
15 URSP 7
# iterate through the dataframe, row by row
for index, row in eba.iterrows():
    print(row['area'])
    print(row['prereq_classes'])

We can define a whole new function that is much more complicated than a simple sum or mean, to APPLY to each subset of our SPLIT

For example, we can write a function that takes in a prereq description, and extracts all the prereq classes mentinoed in the description

def extract_prereqs(descr):
    prereqs = []
    for word in descr.split():
        # clean the word
        clean_word = ""
        for char in word:
            if char.isalpha() or char.isdigit():
                clean_word += char
        if len(clean_word) == 7 and clean_word[:4].isalpha() and clean_word[-3:].isdigit():
            prereqs.append(clean_word)
    return prereqs
for area, areaDF in courses.groupby('area'):
    
    area_prereqs = set()
    for prereq in areaDF['prereqs']:
        prereqs = extract_prereqs(prereq)
for prereq_descr in courses[courses['has_prereqs'] == 1]['prereqs'].values:
    print(prereq_descr)
    print(extract_prereqs(prereq_descr))

Shortcut apply-combine with .agg()#

A more concise way to apply and combine is to chain the .agg() function to a .groupby() object to tell pandas to aggregate particular columns in particular ways (e.g., count the number of entry point courses in a given department, vs. give an average proportion of classes that are entry points).

# SPLIT by area
courses.groupby("area", as_index=False).agg(
    # create a new column named num_entrypoints, 
    # and give it the value of the sum function applied to the is_entrypoints column
    # APPLY tehse computations and COMBINE into a new data frame using .agg
    num_entrypoints=('is_entrypoint', sum), 
    num_classes=('area', "count")
)
area num_entrypoints num_classes
0 AMST 2 9
1 BMGT 1 53
2 CMSC 1 46
3 COMM 0 31
4 ECON 0 64
5 ENSP 2 6
6 ENTS 0 4
7 INFM 0 5
8 INST 4 47
9 MATH 0 49
10 PHSC 0 5
11 PLCY 0 28
12 PSYC 1 38
13 SPHL 0 7
14 STAT 0 15
15 URSP 0 7

Anatomy of combining .groupby() with .agg()

![image.png]()

This pattern is explained in the section “Recommended: Tuple Named Aggregations” in this article: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

# group the courses by the area column (and make sure that they show up as columns in the resulting dataframe)
# then apply the functions in the .agg() function to each subgroup
# and stitch it back into a dataframe that we'll put into the entrypoints_by_area variable
entrypoints_by_area = courses.groupby("area", as_index=False).agg(
    # create a new column named num_entrypoints, 
    # and give it the value of the sum function applied to the is_entrypoints column
    the_num_entrypoints=('is_entrypoint', "sum"), 
    num_classes=('area', "count")
)
entrypoints_by_area
# let's now compute the proportion of entry point classes, as a proxy for "openness"

# step 1: define the function
def openness(row):
    return row['num_entrypoints']/row['num_classes']

# step 2: apply the function and save the results
entrypoints_by_area['openness'] = entrypoints_by_area.apply(openness, axis=1)

entrypoints_by_area
# what are some fun groupbys we can do on the other datasets?

# e.g., for donations, we can do average and sum and range, etc. by team
# or number of wins by conference in ncaa
# or total sales per hour of day for bread

In my experience, this works really well for standard analysis tasks, but is less flexible than the manual approach. This approach would be tough to adapt easily for the bread Project 4, for example. I also like teaching the manual approach first for getting an intuition for what is happening under the hood.

Saving data / results for later analysis#

We often want to save the results of our analysis for later. This can be done in a few different ways (depending on what file format will be useful later, such as json, html, xlsx (excel spreadsheets), or csv).

In this class, we’ll practice saving to csv, a common file format for data (the same one you practice reading into pandas!)

# example of saving the entrypoints_by_area dataframe to a csv file
entrypoints_by_area.to_csv("outputs/entrypoints_by_area.csv")

The .to_csv() method can take a number of optional arguments to control what happens, but the only required one is the file path to where the csv file will be saved (similar to what you need when you want to write to a file with open())

In this example, we’re saving the entrypoints_by_area dataframe to the entrypoints_by_area.csv in the output/ folder. Just make sure (as with files), that the folder actually exists before you try to put a file there!

Extras#

This is stuff we may not get to in class but is available because it may be useful for your projects and beyond (though you can certainly solve most of Project 4 without these).

Use .value_counts() to summarize categorical data in your dataframe#

Last week we learned how to compute some basic statistics, overall, and by column, for quantitative data. Today, we’ll learn how to use value_counts() to quickly summarize categorical data.

.value_counts() does exactly what you think it might do based on the name: it counts the frequency of each unique value in a column! In other words, it gives us a way to count how many times each value shows up in a column. In this way, it’s kinda similar to the basic “count-based” indexing we did in Module 3.

Hint: this could be useful for Problem 4 for Project 4!

Here’s an example for the courses data: how many times does each “area” show up?

# access the area column in the courses dataframe
# and apply the value_counts method to that column
courses['area'].value_counts()

The syntax here is:

nameOfDataFrame[‘nameOfColumn’].value_counts()

value_counts() is a method that a Pandas series (i.e., column in a dataframe) data structure can do (again, make the connection back to .append() for lists, and .split() for strings).

Let’s try some other queries!

# for ncaa dataset
# how many entries do we have for each conference?
# for bls data
# how many entries do we have for each category?

Value counts returns a series, which has nice properties of both lists and dictionaries.

Like lists, we can sort it using the .sort_values() method, though we need to make sure to either force it to run “in place” (with inplace=True as an argument for .sort_values()), or save it to a variable.

area_counts.sort_values(ascending=True)

And access items by index position, which allows us to get the first thing, or the first 5 things, or the last 5 things, etc.

# get the first value in the series
# note: you only get the value, not the "name"
area_counts[0]
area_counts[:5]
area_counts = courses['area'].value_counts()
# like a cross between a dictionary anda  list
# can get value by named key like a dict
print("INST", area_counts['INST'])
print("most frqeuent item count", area_counts[0])
area_counts.keys()
# let's say we want the top 5 most populous areas
# we can slice/subset the series just like a list
# and then get the keys from that subset
area_counts[:5].keys()
# let's try with the other datasets!
# ncaa-team-data
# bls-by-category
# BreadBasket_DMS
bread = pd.read_csv(f'{folder}/BreadBasket_DMS.csv')
bread.head()
# how do we get the frequency counts for items in the bread dataframe?
bread['Item'].value_counts()

Plotting#

The main library for plotting in Python is matplotlib. You can learn that library later. It has lots of fine-grained controls.

For now, you can use pandas “wrapper” over matplotlib (basically calling matplotlib from inside pandas), which is a bit easier to learn.

entrypoints_by_area
def openness(row):
    return row['num_entrypoints']/row['num_classes']

entrypoints_by_area['openness'] = entrypoints_by_area.apply(openness, axis=1)
entrypoints_by_area
# sort the data by the openness column
# make sure we assign to the entry points variable again so we don't lose it (bc pandas treats dataframes as immutable, like strings, unless we force it to do otherwise)
entrypoints_by_area = entrypoints_by_area.sort_values(by="openness", ascending=False)
entrypoints_by_area
# plot openness by area
entrypoints_by_area.plot(
    x="area", 
    y="openness", 
    kind='bar', 
    xlabel="AREA", 
    ylabel="Proportion of entry point classes",
    title="Classes openness by area"
)
entrypoints_by_area.plot(y="openness", kind="hist")
entrypoints_by_area.sort_values(by="num_classes", ascending=False).plot(x="area", y="num_classes", kind="bar")

Reminder: More resources#

The pandas website is decent place to start: https://pandas.pydata.org/

This “cheat sheet” is also a really helpful guide to more common operations that you may run into later: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

There are also many blogs that are helpful, like towardsdatascience.com

The cool thing about pandas and data analysis in python is that many people share notebooks that you can inspect / learn from / adapt code for your own projects (just like mine!).