10: Pandas for data analysis with Python: Part 1#

What is Pandas?#

Pandas is a Python library designed for data manipulation and analysis.

It is especially suited to tabular data, as in an SQL table or Excel spreadsheet. That includes things like:

  • Time series data

  • Arbitrary matrix data with meaningful row and column labels

  • Any other form of observational / statistical data sets

Example / motivating use cases#

Importing the pandas library (getting started)#

What is a library?#

You can think of a library as a collection of functions and data structures. You import a library (or subsets of it) into your program / notebook so that you have access to its special functions or data structures in your program.

You are already using Python’s standard library, which includes built-in functions like print(), and built-in data structures like str and dict. Every time you fire up Python, these are “imported” into your program in the background.

As you advance in your programming career, you will often find that you want to solve (sub)problems that others have already tackled. Often they wrote a collection of functions and/or data structures that solves those problems really well, and saved that collection into a library that others can use. Take advantage of this!

You should learn how to read documentation for libraries#

You should have handy access to (and know how to use):

  • Docs for “ground truth”

  • Some collection of examples for reference.

The pandas website is a decent place to start: https://pandas.pydata.org/

This “cheat sheet” is also a really helpful guide to more common operations that you may run into later: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

There are also many blogs that are helpful, like towardsdatascience.com

The cool thing about pandas and data analysis in Python is that many people share notebooks (just like mine!) that you can inspect, learn from, and adapt for your own projects.

Learning how to use libraries is training for learning to code in teams, using code from others. Basically nobody writes anything all from scratch, unless they are trying to really REALLY learn something deeply.

“importing” a library: mechanics#

Here’s what it looks like to import a library and use it, conceptually with a “fake” library, and with the pandas library.

We often want to import libraries with “as”.

The name after as is sort of like a variable name; we usually do this if the library name is clunky, or if it might conflict with variable names we want to use.

For pandas, by convention people usually import it as pd.

Let’s do that quickly to illustrate.

# import the pandas library, give it the name pd for easier access
import pandas as pd
# test that the import worked: read a csv file into a dataframe
courses = pd.read_csv("INST courses.csv")
courses
Code Title Description Prereqs Credits
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0
12 INST408Y Special Topics in Information Science; Privacy... NaN NaN
13 INST408Z Special Topics in Information Science; The Apo... NaN NaN
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0
21 INST614 Literacy and Inclusion The educational and psychological dimensions o... None 3.0
22 INST616 Open Source Intelligence An introduction to Open Source Intelligence (O... None 3.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0
27 INST702 Advanced Usability Testing Usability testing methods -- how to design and... Permission of instructor; or (INFM605 or INST6... 3.0
28 INST709 Independent Study NaN NaN
29 INST728G Special Topics in Information Studies; Smart C... NaN NaN
30 INST728V Special Topics in Information Studies; Digital... NaN NaN
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0
32 INST737 Introduction to Data Science An exploration of some of the best and most ge... INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0
33 INST741 Social Computing Technologies and Applications Tools and techniques for developing and config... INFM603 and INFM605; or (LBSC602 and LBSC671);... 3.0
34 INST742 Implementing Digital Curation Management of and technology for application o... INST604; or permission of instructor. 3.0
35 INST746 Digitization of Legacy Holdings Through hands on exercises and real-world proj... INST604. 3.0
36 INST762 Visual Analytics Visual analytics is the use of interactive vis... INFM603 or INST630; or permission of instructor. 3.0
37 INST767 Big Data Infrastructure Principles and techniques of data science and ... INST737; or permission of instructor. 3.0
38 INST776 HCIM CAPSTONE PROJECT The opportunity to apply the skills learned th... INST775; or permission of instructor. 3.0
39 INST785 Documentation, Collection, and Appraisal of Re... Development of documentation strategies and pl... INST604; or permission of instructor. 3.0
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0
import os

os.getcwd()
'/Users/joelchan/Projects/inst126-intro-programming-notes'
import random

random.randint(1,6)
5
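
To compare the import forms side by side, here is a minimal sketch that uses only modules that ship with Python (random, statistics, and math). The alias stats is just an illustrative choice here, not a community convention the way pd is for pandas.

```python
# Three common ways to import, shown with standard-library modules.
import random                 # plain import: access names as random.<name>
import statistics as stats    # import with an alias: access names as stats.<name>
from math import sqrt         # import one name directly into your program

random.seed(126)              # seeding makes the "random" roll repeatable
roll = random.randint(1, 6)   # an integer from 1 to 6, inclusive

average = stats.mean([1, 2, 3, 4])   # uses the alias instead of the full name
root = sqrt(16)                      # no prefix needed for a directly imported name

print(roll, average, root)
```

The alias only changes the name you type; stats.mean and statistics.mean refer to the same function.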

The core of Pandas: The dataframe data structure#

We’ve so far progressed from single-item data structures (str, int, float) to “basic” collections (list, dict)

Now we will learn about the dataframe, which has:

  • nice properties of both lists (orderable, indexable) and dictionaries (can retrieve things quickly by key, store associated values)

  • and other properties and built-in algorithms and methods that are useful for data analysis (e.g., summarizing, grouping, statistics, etc.)

Remember that data structures and algorithms go hand in hand: people made dataframes (and the associated pandas library) so that we can run particular kinds of algorithms more easily.

Dataframes are basically like smart spreadsheets that Python can read/write.

The data is in rows and columns. Each column in a pandas dataframe is a special data structure called a Series.

More here
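
As a small sketch of the dataframe / Series relationship, the snippet below builds a made-up two-row dataframe (the name toy and its contents are just for illustration) and pulls out one column to inspect it.

```python
import pandas as pd

# A tiny, made-up dataframe for illustration only
toy = pd.DataFrame({"Code": ["INST126", "INST201"], "Credits": [3.0, 3.0]})

column = toy["Credits"]           # selecting one column gives back a Series

print(type(toy).__name__)         # the whole table is a DataFrame
print(type(column).__name__)      # a single column is a Series
print(column.name)                # the Series remembers its column name
print(list(column.index))         # and shares the dataframe's row labels
print(column.sum())               # Series come with built-in methods like sum()
```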

Dataframes combine the best characteristics of lists and dictionaries, and more!#

  • Can sort (from lists)

  • Can access data by key (from dictionaries)

  • Can also reindex easily!
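
The three bullets above can be sketched on a small made-up dataframe. One assumption to flag: "reindex" is read here as choosing a different column to serve as the row labels, which is done with .set_index(); the dataframe toy and its values are hypothetical.

```python
import pandas as pd

# A tiny, made-up dataframe for illustration only
toy = pd.DataFrame({
    "Code": ["INST326", "INST126", "INST201"],
    "Credits": [3.0, 3.0, 3.0],
})

# Like a list: rows can be sorted
by_code = toy.sort_values(by="Code")

# Like a dictionary: a column is retrieved by its key
codes = toy["Code"]

# Reindexing (assumed meaning): use the Code column as the row labels,
# then look up a single value by row label and column name
indexed = toy.set_index("Code")
credits_126 = indexed.loc["INST126", "Credits"]

print(list(by_code["Code"]))
print(credits_126)
```

Note that sort_values() and set_index() both return new dataframes; toy itself is left unchanged.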

# show me the "columns"
courses.columns
Index(['Code', 'Title', 'Description', 'Prereqs', 'Credits'], dtype='object')
# get the code column
courses['Code']
0      INST126
1      INST201
2      INST311
3      INST314
4      INST326
5      INST327
6      INST335
7      INST346
8      INST352
9      INST354
10     INST362
11     INST377
12    INST408Y
13    INST408Z
14     INST414
15     INST447
16     INST462
17     INST466
18     INST490
19     INST604
20     INST612
21     INST614
22     INST616
23     INST622
24     INST627
25     INST630
26     INST652
27     INST702
28     INST709
29    INST728G
30    INST728V
31     INST733
32     INST737
33     INST741
34     INST742
35     INST746
36     INST762
37     INST767
38     INST776
39     INST785
40     INST794
Name: Code, dtype: object
# find the courses that are 3 credits
courses[courses['Credits'] == 3.0]
Code Title Description Prereqs Credits
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0
21 INST614 Literacy and Inclusion The educational and psychological dimensions o... None 3.0
22 INST616 Open Source Intelligence An introduction to Open Source Intelligence (O... None 3.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0
27 INST702 Advanced Usability Testing Usability testing methods -- how to design and... Permission of instructor; or (INFM605 or INST6... 3.0
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0
32 INST737 Introduction to Data Science An exploration of some of the best and most ge... INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0
33 INST741 Social Computing Technologies and Applications Tools and techniques for developing and config... INFM603 and INFM605; or (LBSC602 and LBSC671);... 3.0
34 INST742 Implementing Digital Curation Management of and technology for application o... INST604; or permission of instructor. 3.0
35 INST746 Digitization of Legacy Holdings Through hands on exercises and real-world proj... INST604. 3.0
36 INST762 Visual Analytics Visual analytics is the use of interactive vis... INFM603 or INST630; or permission of instructor. 3.0
37 INST767 Big Data Infrastructure Principles and techniques of data science and ... INST737; or permission of instructor. 3.0
38 INST776 HCIM CAPSTONE PROJECT The opportunity to apply the skills learned th... INST775; or permission of instructor. 3.0
39 INST785 Documentation, Collection, and Appraisal of Re... Development of documentation strategies and pl... INST604; or permission of instructor. 3.0
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0
# find all courses where the title contains the word introduction
courses[courses['Title'].str.contains("Introduction")]
Code Title Description Prereqs Credits
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0
32 INST737 Introduction to Data Science An exploration of some of the best and most ge... INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0
courses.head(10) # show me the top 10 rows in the dataframe
Code Title Description Prereqs Credits
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0
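
The filtering expressions above, such as courses[courses['Credits'] == 3.0], really happen in two steps: the inner comparison builds a Series of True/False values (often called a mask), and the outer square brackets keep only the rows where the mask is True. Here is a minimal sketch on a made-up dataframe (toy and its values are hypothetical):

```python
import pandas as pd

# A tiny, made-up dataframe; None becomes a missing value (NaN) in Credits
toy = pd.DataFrame({
    "Title": [
        "Introduction to Programming",
        "Information Organization",
        "Introduction to Data Science",
    ],
    "Credits": [3.0, 3.0, None],
})

# Step 1: the comparison produces one True/False per row
mask = toy["Credits"] == 3.0
print(mask.tolist())

# Step 2: indexing with the mask keeps only the True rows
three_credit = toy[mask]

# String columns get their own helpers under .str
intro_mask = toy["Title"].str.contains("Introduction")

# Masks can be combined with & (and) or | (or)
both = toy[mask & intro_mask]
print(both)
```

A missing value never equals 3.0, so the row with no credits listed drops out of the first filter. That matches what happens to rows 12, 13, 28, 29, and 30 in the courses output above.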

Common operations (basic)#

Let’s go over some common operations with dataframes. This will overlap with your PCE, mostly Q1-5 and Q8.

Constructing a dataframe#

From other data structures (e.g., lists, dictionaries)#

We seldom use this at the start (usually we import data from an external file, like a .csv file, into a dataframe).

But I do use this frequently when I’m creating new dataframes for analysis from existing data(frames). Might not be the best pattern to emulate (but it works for me!): a lot of what I do could probably be done more elegantly with proper use of .groupby() and .apply() (more on this next week).

But it’s useful to do this to get a sense of how a dataframe combines aspects of lists and dictionaries. A common input “literal” for a dataframe (just as the input literal for an int has to be digits) is a set of “records”: a list of dictionaries, where each dictionary is a row and, within each dictionary, each key is a column name (with an associated value).

basic_data = [
    {'name': 'Joel', 'role': 'instructor'},
    {'name': 'Sarah', 'role': 'UTA'}
]
# construct a dataframe from the basic_data list of dictionaries
example_df = pd.DataFrame(basic_data)
example_df
name role
0 Joel instructor
1 Sarah UTA
example_df.sort_values(by="name", ascending=False)
name role
1 Sarah UTA
0 Joel instructor
more_basic_data = [
    {'school': 'UMD', 'fundingModel': 'public', 'conference': 'Big Ten'},
    {'school': 'Harvard', 'fundingModel': 'private', 'conference': 'Ivy League'}
]
# let's make this into a dataframe!
schoolsDF = pd.DataFrame(more_basic_data)
schoolsDF
school fundingModel conference
0 UMD public Big Ten
1 Harvard private Ivy League
# let's make another sample dataset!
marvel_movies = [
    {"name": "Iron Man 1", "Phase": 1, "Year release": 2008},
    {"name": "Avengers 1", "Phase": 1, "Year release": 2012},
    {"name": "Avengers: Endgame", "Phase": 3, "Year release": 2019}
]

# and turn it into a dataframe
marvel_df = pd.DataFrame(marvel_movies)
marvel_df
name Phase Year release
0 Iron Man 1 1 2008
1 Avengers 1 1 2012
2 Avengers: Endgame 3 2019
marvel_df.to_csv("marvel-movies.csv")
marvel_df[marvel_df['Phase'] == 1]
name Phase Year release
0 Iron Man 1 1 2008
1 Avengers 1 1 2012
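
The “records” layout used above (a list of dictionaries, one per row) is not the only input pd.DataFrame() accepts. As a hedged sketch, the same small table can also be written as a single dictionary of columns, and a dataframe can be turned back into records with .to_dict("records"). The variable names below are just for illustration.

```python
import pandas as pd

# Row-oriented: a list of dictionaries, one dictionary per row
records = [
    {"name": "Joel", "role": "instructor"},
    {"name": "Sarah", "role": "UTA"},
]

# Column-oriented: one dictionary, where each key is a column name
# and each value is the list of entries in that column
columns = {
    "name": ["Joel", "Sarah"],
    "role": ["instructor", "UTA"],
}

from_records = pd.DataFrame(records)
from_columns = pd.DataFrame(columns)

# Both routes produce the same dataframe
print(from_records.equals(from_columns))

# And a dataframe can be converted back into a list of row dictionaries
print(from_records.to_dict("records"))
```

Which layout to use is mostly a matter of how your data arrives: records are natural when you build up rows one at a time in a loop, while a dictionary of columns is natural when you already have one list per variable.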

From (external) data files#

Most frequently this is done with .read_csv(), but there are many other common formats, such as json. See here for a full listing

CSV stands for comma-separated values.

It is commonly used because it is plain text. This means any program that can read a string can read the file and get something meaningful out of it. Not so with Excel files!
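
To make the "plain text" point concrete, here is a minimal sketch that reads a CSV from an ordinary Python string instead of a file. The string contents are made up, and io.StringIO is used only to make the string behave like a file for read_csv().

```python
import io
import pandas as pd

# A csv "file" is just text: a header line, then one line per row,
# with commas between the values
csv_text = "name,Phase\nIron Man 1,1\nAvengers 1,1\n"

# io.StringIO wraps the string so that read_csv can treat it like a file
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# With no path given, to_csv() returns the csv text as a string.
# By default the row labels (the index) are written as an extra first column...
with_index = df.to_csv()
print(with_index)

# ...and index=False leaves them out, which reproduces the original text
out = df.to_csv(index=False)
print(out)
```

This is worth keeping in mind when saving with .to_csv("some-file.csv"): unless you pass index=False, the saved file gains an extra unlabeled first column holding the row numbers.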

courses = pd.read_csv("INST courses.csv") # needs a path to a csv file
courses
Code Title Description Prereqs Credits
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0
12 INST408Y Special Topics in Information Science; Privacy... NaN NaN
13 INST408Z Special Topics in Information Science; The Apo... NaN NaN
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0
21 INST614 Literacy and Inclusion The educational and psychological dimensions o... None 3.0
22 INST616 Open Source Intelligence An introduction to Open Source Intelligence (O... None 3.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0
27 INST702 Advanced Usability Testing Usability testing methods -- how to design and... Permission of instructor; or (INFM605 or INST6... 3.0
28 INST709 Independent Study NaN NaN
29 INST728G Special Topics in Information Studies; Smart C... NaN NaN
30 INST728V Special Topics in Information Studies; Digital... NaN NaN
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0
32 INST737 Introduction to Data Science An exploration of some of the best and most ge... INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0
33 INST741 Social Computing Technologies and Applications Tools and techniques for developing and config... INFM603 and INFM605; or (LBSC602 and LBSC671);... 3.0
34 INST742 Implementing Digital Curation Management of and technology for application o... INST604; or permission of instructor. 3.0
35 INST746 Digitization of Legacy Holdings Through hands on exercises and real-world proj... INST604. 3.0
36 INST762 Visual Analytics Visual analytics is the use of interactive vis... INFM603 or INST630; or permission of instructor. 3.0
37 INST767 Big Data Infrastructure Principles and techniques of data science and ... INST737; or permission of instructor. 3.0
38 INST776 HCIM CAPSTONE PROJECT The opportunity to apply the skills learned th... INST775; or permission of instructor. 3.0
39 INST785 Documentation, Collection, and Appraisal of Re... Development of documentation strategies and pl... INST604; or permission of instructor. 3.0
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0

Inspecting your dataframe#

Common operations:

  • summarizing

  • filtering / accessing

  • sorting

Summarizing#

With:

  • .head()

  • .describe()

  • various stats
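
As a quick sketch of these summarizing tools on a small made-up dataframe (toy and its values are hypothetical), note in particular how missing values are handled: the count row of .describe() only counts the non-missing entries.

```python
import pandas as pd

# A tiny, made-up dataframe; None becomes a missing value (NaN) in Credits
toy = pd.DataFrame({
    "Code": ["INST126", "INST201", "INST709"],
    "Credits": [3.0, 3.0, None],
})

first_two = toy.head(2)          # the first 2 rows (the default is 5)

summary = toy.describe()         # by default, summarizes numeric columns only
count = summary.loc["count", "Credits"]   # number of non-missing values

mean = toy["Credits"].mean()     # individual stats are also available as methods
counts = toy["Credits"].value_counts()    # how often each value occurs

print(first_two)
print(summary)
print(count, mean)
print(counts)
```

This is why a summary can report a count that is smaller than the number of rows in the dataframe: rows with missing values in a column are skipped when that column is summarized.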

# we have a dataframe named courses
# courses has a method called head
# can optionally pass in a parameter to tell it how many rows from the top to return
courses.head(20) # show the top 20
Code Title Description Prereqs Credits
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0
12 INST408Y Special Topics in Information Science; Privacy... NaN NaN
13 INST408Z Special Topics in Information Science; The Apo... NaN NaN
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0
import random # importing a library! :) to generate random numbers
courses['random_number'] = [c + random.randint(0,5) for c in courses['Credits']]
courses.head(25)
Code Title Description Prereqs Credits random_number
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 8.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 7.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0 3.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0 4.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 6.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0 4.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0 7.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0 6.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0 7.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0 4.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0 6.0
12 INST408Y Special Topics in Information Science; Privacy... NaN NaN NaN
13 INST408Z Special Topics in Information Science; The Apo... NaN NaN NaN
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0 3.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0 4.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0 5.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0 5.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0 6.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0 8.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0 6.0
21 INST614 Literacy and Inclusion The educational and psychological dimensions o... None 3.0 4.0
22 INST616 Open Source Intelligence An introduction to Open Source Intelligence (O... None 3.0 3.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0 7.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0 7.0
courses.describe()
       Credits  random_number
count     36.0      36.000000
mean       3.0       5.388889
std        0.0       1.572583
min        3.0       3.000000
25%        3.0       4.000000
50%        3.0       6.000000
75%        3.0       7.000000
max        3.0       8.000000
ncaa = pd.read_csv("ncaa-team-data.csv")
ncaa.head()
school conf rk w l srs sos pts_for pts_vs pts_total ap_pre ap_high ap_final pts_diff ncaa_result ncaa_numeric season coaches year
0 air-force MWC 1 12 21 -2.99 1.08 73.1 75.1 148.2 30 30 30 -2.0 NaN 0 2016-17 Dave Pilipovich (12-21) 2016.0
1 air-force MWC 2 14 18 -5.51 0.66 68.4 72.8 141.2 30 30 30 -4.4 NaN 0 2015-16 Dave Pilipovich (14-18) 2015.0
2 air-force MWC 3 14 17 -1.85 -0.71 65.7 65.1 130.8 30 30 30 0.6 NaN 0 2014-15 Dave Pilipovich (14-17) 2014.0
3 air-force MWC 4 12 18 -4.08 1.71 66.0 69.1 135.1 30 30 30 -3.1 NaN 0 2013-14 Dave Pilipovich (12-18) 2013.0
4 air-force MWC 5 18 14 4.18 4.28 70.0 67.8 137.8 30 30 30 2.2 NaN 0 2012-13 Dave Pilipovich (18-14) 2012.0
ncaa['school'].value_counts()
yale                       122
minnesota                  122
bucknell                   122
penn-state                 121
temple                     121
                          ... 
hampton                     22
southwestern-ks             21
allegheny                   21
wisconsin-stevens-point     21
loyola-la                   21
Name: school, Length: 338, dtype: int64
ncaa.describe()
rk w l srs sos pts_for pts_vs pts_total ap_pre ap_high ap_final pts_diff ncaa_numeric year
count 24029.000000 24029.000000 24029.000000 16945.000000 16945.000000 7815.000000 6916.000000 6916.000000 24029.000000 24029.000000 24029.000000 6916.000000 24029.000000 24029.000000
mean 43.961380 13.641974 11.607807 -0.554195 -0.184372 70.232028 68.843855 138.557244 29.084897 27.811478 28.904865 0.869534 0.972283 1969.397312
std 30.694366 6.366854 5.509024 9.975686 5.447946 6.537464 5.977788 10.468501 4.266874 6.542723 4.666435 6.317951 4.754037 32.310531
min 1.000000 0.000000 0.000000 -44.010000 -22.460000 0.000000 35.200000 81.300000 1.000000 1.000000 1.000000 -26.700000 0.000000 1892.000000
25% 18.000000 9.000000 8.000000 -7.480000 -4.210000 66.100000 65.200000 132.500000 30.000000 30.000000 30.000000 -3.300000 0.000000 1943.000000
50% 37.000000 13.000000 11.000000 -0.570000 -0.300000 70.200000 68.850000 138.600000 30.000000 30.000000 30.000000 1.000000 0.000000 1976.000000
75% 67.000000 18.000000 15.000000 6.350000 3.990000 74.400000 72.500000 145.100000 30.000000 30.000000 30.000000 5.200000 0.000000 1997.000000
max 122.000000 38.000000 31.000000 34.800000 16.000000 101.000000 99.900000 199.100000 30.000000 30.000000 30.000000 24.600000 48.000000 2016.000000
ncaa['w'].min()
0
ncaa.hist(column="w")
[Figure: histogram of the w (wins) column]

Getting/accessing parts of our dataframe#

The most basic operation is getting a specific column. The syntax looks like the way we index into lists or dictionaries: put the column name in square brackets.

courses.columns
Index(['Code', 'Title', 'Description', 'Prereqs', 'Credits', 'random_number'], dtype='object')
courses['Code']
0      INST126
1      INST201
2      INST311
3      INST314
4      INST326
5      INST327
6      INST335
7      INST346
8      INST352
9      INST354
10     INST362
11     INST377
12    INST408Y
13    INST408Z
14     INST414
15     INST447
16     INST462
17     INST466
18     INST490
19     INST604
20     INST612
21     INST614
22     INST616
23     INST622
24     INST627
25     INST630
26     INST652
27     INST702
28     INST709
29    INST728G
30    INST728V
31     INST733
32     INST737
33     INST741
34     INST742
35     INST746
36     INST762
37     INST767
38     INST776
39     INST785
40     INST794
Name: Code, dtype: object

Let’s say you want a particular statistic for only one column. You can do this by selecting the column (which gives you a Series) and calling the method for that statistic on it.

courses['random_number'].median()
6.0
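
Here is a minimal sketch of the same idea on a made-up dataframe. The `toy` dataframe and its values are my own stand-ins (not the real courses data), so you can run it without the course file:

```python
import pandas as pd

# Hypothetical toy data standing in for the courses dataframe
toy = pd.DataFrame({
    "Code": ["INST126", "INST201", "INST311"],
    "Credits": [3.0, 3.0, 3.0],
    "random_number": [8.0, 7.0, 4.0],
})

col = toy["random_number"]   # selecting one column gives back a Series
print(type(col).__name__)    # Series
print(col.median())          # 7.0
print(col.max())             # 8.0
```

Other statistics such as `.mean()`, `.min()`, and `.std()` are called on the Series in the same way.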

Filtering the data based on one or more columns#

But we sometimes also want to get subsets of the data, depending on one or more column values.

We can do this with indexing notation (I use this because I’m used to it).

The stuff you put in the brackets is a Boolean expression. Any row where the answer is True will come back; any row where the answer is False is filtered out.
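
To see what that Boolean expression produces on its own, here is a small sketch on a made-up three-row dataframe (`toy` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data standing in for the courses dataframe
toy = pd.DataFrame({
    "Code": ["INST126", "INST201", "INST311"],
    "random_number": [8.0, 7.0, 4.0],
})

mask = toy["random_number"] > 5   # a Series of True/False, one value per row
print(mask.tolist())              # [True, True, False]

filtered = toy[mask]              # keeps only the rows where mask is True
print(filtered["Code"].tolist())  # ['INST126', 'INST201']
```

Writing `toy[toy["random_number"] > 5]` does both steps at once, which is the form used in the cells that follow.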

# get me all the rows where the value of the column Code is equal to INST126
courses[courses['Code']=="INST126"] 
Code Title Description Prereqs Credits random_number
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 8.0
# get me all the rows where the value of the column random_number is greater than 5
courses[courses['random_number'] > 5] 
Code Title Description Prereqs Credits random_number
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 8.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 7.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 6.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0 7.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0 6.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0 7.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0 6.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0 6.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0 8.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0 6.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0 7.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0 7.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0 6.0
27 INST702 Advanced Usability Testing Usability testing methods -- how to design and... Permission of instructor; or (INFM605 or INST6... 3.0 7.0
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0 7.0
33 INST741 Social Computing Technologies and Applications Tools and techniques for developing and config... INFM603 and INFM605; or (LBSC602 and LBSC671);... 3.0 8.0
34 INST742 Implementing Digital Curation Management of and technology for application o... INST604; or permission of instructor. 3.0 6.0
37 INST767 Big Data Infrastructure Principles and techniques of data science and ... INST737; or permission of instructor. 3.0 6.0
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0 6.0
# find all of the seasons where yale had at least 20 wins
ncaa[(ncaa['school'] == "yale") & (ncaa['w'] >= 20)]
school conf rk w l srs sos pts_for pts_vs pts_total ap_pre ap_high ap_final pts_diff ncaa_result ncaa_numeric season coaches year
23871 yale Ivy 2 23 7 9.08 -1.03 74.9 63.8 138.7 30 30 30 11.1 Lost Second Round 2 2015-16 James Jones (23-7) 2015.0
23872 yale Ivy 3 22 10 3.53 -0.87 65.6 58.5 124.1 30 30 30 7.1 NaN 0 2014-15 James Jones (22-10) 2014.0
23885 yale Ivy 16 21 11 -0.08 -5.32 74.8 70.5 145.3 30 30 30 4.3 NaN 0 2001-02 James Jones (21-11) 2001.0
23938 yale Ivy 69 22 8 NaN NaN 69.6 56.6 126.2 30 11 11 13.0 Lost Regional Semifinal 8 1948-49 Howard Hobson (22-8) 1948.0
23979 yale Ivy 110 20 9 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1907-08 William Lush (20-9) 1907.0
23980 yale Ivy 111 30 7 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1906-07 William Lush (30-7) 1906.0
23982 yale Ivy 113 22 13 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1904-05 Unknown 1904.0
# find all of the seasons where the school was in the Big Ten OR the team had at least 20 wins
# (use & instead of | to require both conditions)
ncaa[(ncaa['conf'] == "Big Ten") | (ncaa['w'] >= 20)]
school conf rk w l srs sos pts_for pts_vs pts_total ap_pre ap_high ap_final pts_diff ncaa_result ncaa_numeric season coaches year
10 air-force MWC 11 26 9 15.19 3.34 69.0 56.0 125.0 30 13 30 13.0 NaN 0 2006-07 Jeff Bzdelik (26-9) 2006.0
11 air-force MWC 12 24 7 10.20 2.49 64.2 54.7 118.9 30 30 30 9.5 Lost First Round 1 2005-06 Jeff Bzdelik (24-7) 2005.0
13 air-force MWC 14 22 7 9.12 0.08 59.9 50.9 110.8 30 25 30 9.0 Lost First Round 1 2003-04 Joe Scott (22-7) 2003.0
60 akron MAC 1 26 8 3.97 -1.52 77.2 70.3 147.5 30 30 30 6.9 NaN 0 2016-17 Keith Dambrot (26-8) 2016.0
61 akron MAC 2 26 9 5.55 -1.24 76.6 68.0 144.6 30 30 30 8.6 NaN 0 2015-16 Keith Dambrot (26-9) 2015.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23938 yale Ivy 69 22 8 NaN NaN 69.6 56.6 126.2 30 11 11 13.0 Lost Regional Semifinal 8 1948-49 Howard Hobson (22-8) 1948.0
23979 yale Ivy 110 20 9 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1907-08 William Lush (20-9) 1907.0
23980 yale Ivy 111 30 7 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1906-07 William Lush (30-7) 1906.0
23982 yale Ivy 113 22 13 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1904-05 Unknown 1904.0
24011 youngstown-state Mid-Cont 20 20 9 -1.92 -7.66 72.3 64.6 136.9 30 30 30 7.7 NaN 0 1997-98 Dan Peters (20-9) 1997.0

5293 rows × 19 columns

# find all of the seasons where harvard had at least 15 losses
ncaa[(ncaa['school'] == "harvard") & (ncaa['l'] >= 15)]
school conf rk w l srs sos pts_for pts_vs pts_total ap_pre ap_high ap_final pts_diff ncaa_result ncaa_numeric season coaches year
7329 harvard Ivy 2 14 16 -0.96 -0.25 66.7 66.2 132.9 30 30 30 0.5 NaN 0 2015-16 Tommy Amaker (14-16) 2015.0
7337 harvard Ivy 10 8 22 -10.76 -4.86 68.5 74.4 142.9 30 30 30 -5.9 NaN 0 2007-08 Tommy Amaker (8-22) 2007.0
7338 harvard Ivy 11 12 16 -9.67 -4.56 72.3 77.4 149.7 30 30 30 -5.1 NaN 0 2006-07 Frank Sullivan (12-16) 2006.0
7340 harvard Ivy 13 12 15 -6.73 -4.47 67.1 69.4 136.5 30 30 30 -2.3 NaN 0 2004-05 Frank Sullivan (12-15) 2004.0
7341 harvard Ivy 14 4 23 -16.19 -4.37 64.7 76.6 141.3 30 30 30 -11.9 NaN 0 2003-04 Frank Sullivan (4-23) 2003.0
7342 harvard Ivy 15 12 15 -5.80 -3.65 71.4 73.3 144.7 30 30 30 -1.9 NaN 0 2002-03 Frank Sullivan (12-15) 2002.0
7345 harvard Ivy 18 12 15 -12.75 -9.75 65.9 67.7 133.6 30 30 30 -1.8 NaN 0 1999-00 Frank Sullivan (12-15) 1999.0
7350 harvard Ivy 23 6 20 -14.50 -9.10 66.8 NaN NaN 30 30 30 NaN NaN 0 1994-95 Frank Sullivan (6-20) 1994.0
7351 harvard Ivy 24 9 17 -12.43 -8.75 68.0 NaN NaN 30 30 30 NaN NaN 0 1993-94 Frank Sullivan (9-17) 1993.0
7352 harvard Ivy 25 6 20 -16.14 -4.70 69.5 NaN NaN 30 30 30 NaN NaN 0 1992-93 Frank Sullivan (6-20) 1992.0
7353 harvard Ivy 26 6 20 -17.52 -4.40 NaN NaN NaN 30 30 30 NaN NaN 0 1991-92 Frank Sullivan (6-20) 1991.0
7354 harvard Ivy 27 9 17 -12.59 -4.59 NaN NaN NaN 30 30 30 NaN NaN 0 1990-91 Peter Roby (9-17) 1990.0
7356 harvard Ivy 29 11 15 -14.74 -8.66 NaN NaN NaN 30 30 30 NaN NaN 0 1988-89 Peter Roby (11-15) 1988.0
7357 harvard Ivy 30 11 15 -13.28 -5.69 NaN NaN NaN 30 30 30 NaN NaN 0 1987-88 Peter Roby (11-15) 1987.0
7358 harvard Ivy 31 9 17 -7.78 -5.36 NaN NaN NaN 30 30 30 NaN NaN 0 1986-87 Peter Roby (9-17) 1986.0
7359 harvard Ivy 32 6 20 -19.73 -7.69 NaN NaN NaN 30 30 30 NaN NaN 0 1985-86 Peter Roby (6-20) 1985.0
7363 harvard Ivy 36 11 15 -10.07 -6.25 NaN NaN NaN 30 30 30 NaN NaN 0 1981-82 Frank McLaughlin (11-15) 1981.0
7365 harvard Ivy 38 11 15 -13.37 -7.37 NaN NaN NaN 30 30 30 NaN NaN 0 1979-80 Frank McLaughlin (11-15) 1979.0
7366 harvard Ivy 39 8 21 -10.58 -1.79 NaN NaN NaN 30 30 30 NaN NaN 0 1978-79 Frank McLaughlin (8-21) 1978.0
7367 harvard Ivy 40 11 15 -9.60 -4.90 NaN NaN NaN 30 30 30 NaN NaN 0 1977-78 Frank McLaughlin (11-15) 1977.0
7368 harvard Ivy 41 9 16 -12.73 -3.33 NaN NaN NaN 30 30 30 NaN NaN 0 1976-77 Tom Sanders (9-16) 1976.0
7369 harvard Ivy 42 8 18 -9.20 -5.12 NaN NaN NaN 30 30 30 NaN NaN 0 1975-76 Tom Sanders (8-18) 1975.0
7375 harvard Ivy 48 7 19 -10.19 0.25 NaN NaN NaN 30 30 30 NaN NaN 0 1969-70 Robert Harrison (7-19) 1969.0
7376 harvard Ivy 49 7 18 -9.80 -3.12 NaN NaN NaN 30 30 30 NaN NaN 0 1968-69 Robert Harrison (7-18) 1968.0
7382 harvard Ivy 55 6 15 -11.34 -6.28 NaN NaN NaN 30 30 30 NaN NaN 0 1962-63 Floyd S. Wilson (6-15) 1962.0
7386 harvard Ivy 59 10 15 -14.71 -7.38 NaN NaN NaN 30 30 30 NaN NaN 0 1958-59 Floyd S. Wilson (10-15) 1958.0
7389 harvard Ivy 62 8 16 -17.54 -7.10 NaN NaN NaN 30 30 30 NaN NaN 0 1955-56 Floyd S. Wilson (8-16) 1955.0
7390 harvard Ivy 63 6 17 -11.97 -4.33 NaN NaN NaN 30 30 30 NaN NaN 0 1954-55 Floyd S. Wilson (6-17) 1954.0
7391 harvard Ivy 64 9 16 -14.61 -3.02 NaN NaN NaN 30 30 30 NaN NaN 0 1953-54 Bo Shepard (9-16) 1953.0
7392 harvard Ivy 65 7 16 -15.52 -3.69 NaN NaN NaN 30 30 30 NaN NaN 0 1952-53 Bo Shepard (7-16) 1952.0
7393 harvard Ivy 66 5 17 -16.94 -2.75 NaN NaN NaN 30 30 30 NaN NaN 0 1951-52 Bo Shepard (5-17) 1951.0
7394 harvard Ivy 67 8 18 -8.66 1.56 NaN NaN NaN 30 30 30 NaN NaN 0 1950-51 Bo Shepard (8-18) 1950.0
7395 harvard Ivy 68 9 15 -7.61 -1.75 NaN NaN NaN 30 30 30 NaN NaN 0 1949-50 Bo Shepard (9-15) 1949.0
7396 harvard Ivy 69 3 20 NaN NaN 51.3 61.4 112.7 30 30 30 -10.1 NaN 0 1948-49 William Barclay (3-20) 1948.0
7397 harvard Ivy 70 5 20 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1947-48 William Barclay (5-20) 1947.0
7403 harvard Ivy 76 8 16 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1941-42 Earl Brown (8-16) 1941.0
7409 harvard Ivy 82 7 15 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1935-36 Wes Fesler (7-15) 1935.0
7411 harvard Ivy 84 3 19 NaN NaN NaN NaN NaN 30 30 30 NaN NaN 0 1933-34 Wes Fesler (3-19) 1933.0
ncaa.columns
Index(['school', 'conf', 'rk', 'w', 'l', 'srs', 'sos', 'pts_for', 'pts_vs',
       'pts_total', 'ap_pre', 'ap_high', 'ap_final', 'pts_diff', 'ncaa_result',
       'ncaa_numeric', 'season', 'coaches', 'year'],
      dtype='object')

Combine multiple Boolean expressions using logical operators, just like with conditionals, BUT unfortunately with different syntax. Wrap each expression in its own parentheses, as in the examples above.

and: &

or: |
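
A minimal sketch of how the two operators differ, on made-up data. The `toy` dataframe is hypothetical; only its column names mirror the ncaa dataframe. The `~` (not) operator is an extra that is not covered above:

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the ncaa dataframe
toy = pd.DataFrame({
    "school": ["yale", "yale", "harvard", "harvard"],
    "w": [23, 9, 21, 6],
    "l": [7, 18, 9, 20],
})

# and: a row is kept only if BOTH conditions are True
both = toy[(toy["school"] == "yale") & (toy["w"] >= 20)]
print(len(both))      # 1

# or: a row is kept if EITHER condition is True
either = toy[(toy["school"] == "yale") | (toy["w"] >= 20)]
print(len(either))    # 3

# not: ~ flips every True/False in the mask
not_yale = toy[~(toy["school"] == "yale")]
print(len(not_yale))  # 2
```

The parentheses matter because `&` and `|` bind more tightly than `==` and `>=` in Python, so without them the expression is grouped in a way you did not intend and typically raises an error.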

# find all of the seasons where an ACC school had a winning record
# find all of the losing seasons for Coach K

Many of the basic comparison operators from Boolean expressions apply here, like > and ==.

But in Pandas we also have access to Boolean “methods” for strings, like .str.contains() or .str.startswith(). They check every value in a column and give back True or False for each row. It works like this:

courses[courses['Title'].str.contains("Design")] # get all the rows where the value of the Title column contains the word Design
Code Title Description Prereqs Credits random_number
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 7.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0 5.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0 7.0
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0 3.0
# get all the courses that are INST 300-level courses
courses[courses['Code'].str.startswith("INST3")] 
Code Title Description Prereqs Credits random_number
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0 4.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0 5.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 7.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 7.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0 3.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0 6.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0 5.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0 6.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0 5.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0 5.0
# get all the courses that have programming in their course description?
courses[courses['Description'].str.contains("programming")]
Code Title Description Prereqs Credits random_number
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 3.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 7.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0 3.0
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0 3.0
print(courses[courses['Code'] == "INST794"]['Description'])
40    Through a supervised project, to synthesize de...
Name: Description, dtype: object
# get all the courses that have a "minimum grade" prereq
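
One way to approach this, sketched on made-up data (the `toy` dataframe is hypothetical). It assumes the prereq column may have missing values and mixed capitalization, which is why `na=False` and `case=False` are passed:

```python
import pandas as pd

# Hypothetical toy data; one course has a missing (None) prereq
toy = pd.DataFrame({
    "Code": ["INST126", "INST201", "INST408Y"],
    "Prereqs": ["Minimum grade of C- in MATH115", "None", None],
})

# na=False treats a missing prereq as "does not contain"
# case=False matches "Minimum grade" as well as "minimum grade"
has_min_grade = toy["Prereqs"].str.contains("minimum grade", case=False, na=False)
print(toy[has_min_grade]["Code"].tolist())  # ['INST126']
```

Without `na=False`, the mask would contain a missing value for the third row, and pandas refuses to filter with a mask like that.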

Reshaping#

The most basic reshaping operation is sorting.

We will discuss more advanced operations, like transposing, next week.

courses.sort_values(by="Code", ascending=False)
Code Title Description Prereqs Credits random_number
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0 3.0
39 INST785 Documentation, Collection, and Appraisal of Re... Development of documentation strategies and pl... INST604; or permission of instructor. 3.0 5.0
38 INST776 HCIM CAPSTONE PROJECT The opportunity to apply the skills learned th... INST775; or permission of instructor. 3.0 7.0
37 INST767 Big Data Infrastructure Principles and techniques of data science and ... INST737; or permission of instructor. 3.0 5.0
36 INST762 Visual Analytics Visual analytics is the use of interactive vis... INFM603 or INST630; or permission of instructor. 3.0 3.0
35 INST746 Digitization of Legacy Holdings Through hands on exercises and real-world proj... INST604. 3.0 7.0
34 INST742 Implementing Digital Curation Management of and technology for application o... INST604; or permission of instructor. 3.0 3.0
33 INST741 Social Computing Technologies and Applications Tools and techniques for developing and config... INFM603 and INFM605; or (LBSC602 and LBSC671);... 3.0 3.0
32 INST737 Introduction to Data Science An exploration of some of the best and most ge... INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0 3.0
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0 3.0
30 INST728V Special Topics in Information Studies; Digital... NaN NaN NaN
29 INST728G Special Topics in Information Studies; Smart C... NaN NaN NaN
28 INST709 Independent Study NaN NaN NaN
27 INST702 Advanced Usability Testing Usability testing methods -- how to design and... Permission of instructor; or (INFM605 or INST6... 3.0 3.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0 7.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0 3.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0 8.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0 3.0
22 INST616 Open Source Intelligence An introduction to Open Source Intelligence (O... None 3.0 5.0
21 INST614 Literacy and Inclusion The educational and psychological dimensions o... None 3.0 5.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0 6.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0 8.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0 5.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0 5.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0 5.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0 7.0
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0 6.0
13 INST408Z Special Topics in Information Science; The Apo... NaN NaN NaN
12 INST408Y Special Topics in Information Science; Privacy... NaN NaN NaN
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0 5.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0 5.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0 6.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0 5.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0 6.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 7.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 7.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0 5.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0 4.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 4.0
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 3.0
courses
Code Title Description Prereqs Credits random_number
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 3.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 4.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0 4.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0 5.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 7.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 7.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0 3.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0 6.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0 5.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0 6.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0 5.0
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0 5.0
12 INST408Y Special Topics in Information Science; Privacy... NaN NaN NaN
13 INST408Z Special Topics in Information Science; The Apo... NaN NaN NaN
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0 6.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0 7.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0 5.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0 5.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0 5.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0 8.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0 6.0
21 INST614 Literacy and Inclusion The educational and psychological dimensions o... None 3.0 5.0
22 INST616 Open Source Intelligence An introduction to Open Source Intelligence (O... None 3.0 5.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0 3.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0 8.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0 3.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0 7.0
27 INST702 Advanced Usability Testing Usability testing methods -- how to design and... Permission of instructor; or (INFM605 or INST6... 3.0 3.0
28 INST709 Independent Study NaN NaN NaN
29 INST728G Special Topics in Information Studies; Smart C... NaN NaN NaN
30 INST728V Special Topics in Information Studies; Digital... NaN NaN NaN
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0 3.0
32 INST737 Introduction to Data Science An exploration of some of the best and most ge... INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0 3.0
33 INST741 Social Computing Technologies and Applications Tools and techniques for developing and config... INFM603 and INFM605; or (LBSC602 and LBSC671);... 3.0 3.0
34 INST742 Implementing Digital Curation Management of and technology for application o... INST604; or permission of instructor. 3.0 3.0
35 INST746 Digitization of Legacy Holdings Through hands on exercises and real-world proj... INST604. 3.0 7.0
36 INST762 Visual Analytics Visual analytics is the use of interactive vis... INFM603 or INST630; or permission of instructor. 3.0 3.0
37 INST767 Big Data Infrastructure Principles and techniques of data science and ... INST737; or permission of instructor. 3.0 5.0
38 INST776 HCIM CAPSTONE PROJECT The opportunity to apply the skills learned th... INST775; or permission of instructor. 3.0 7.0
39 INST785 Documentation, Collection, and Appraisal of Re... Development of documentation strategies and pl... INST604; or permission of instructor. 3.0 5.0
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0 3.0
# sort the dataframe, and make sure the change applies to the df itself
courses.sort_values(by="Code", ascending=False, inplace=True) # sort in descending order by the Code column
# sort the dataframe, and store the resulting sorted copy (here we reassign it to the same variable)
courses = courses.sort_values(by="Code", ascending=False)
courses
Code Title Description Prereqs Credits random_number
40 INST794 Capstone in Youth Experience Through a supervised project, to synthesize de... INST650, INST651, and INST652; or permission o... 3.0 3.0
39 INST785 Documentation, Collection, and Appraisal of Re... Development of documentation strategies and pl... INST604; or permission of instructor. 3.0 5.0
38 INST776 HCIM CAPSTONE PROJECT The opportunity to apply the skills learned th... INST775; or permission of instructor. 3.0 7.0
37 INST767 Big Data Infrastructure Principles and techniques of data science and ... INST737; or permission of instructor. 3.0 5.0
36 INST762 Visual Analytics Visual analytics is the use of interactive vis... INFM603 or INST630; or permission of instructor. 3.0 3.0
35 INST746 Digitization of Legacy Holdings Through hands on exercises and real-world proj... INST604. 3.0 7.0
34 INST742 Implementing Digital Curation Management of and technology for application o... INST604; or permission of instructor. 3.0 3.0
33 INST741 Social Computing Technologies and Applications Tools and techniques for developing and config... INFM603 and INFM605; or (LBSC602 and LBSC671);... 3.0 3.0
32 INST737 Introduction to Data Science An exploration of some of the best and most ge... INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0 3.0
31 INST733 Database Design Principles of user-oriented database design. ... LBSC690, LBSC671, or INFM603; or permission of... 3.0 3.0
30 INST728V Special Topics in Information Studies; Digital... NaN NaN NaN
29 INST728G Special Topics in Information Studies; Smart C... NaN NaN NaN
28 INST709 Independent Study NaN NaN NaN
27 INST702 Advanced Usability Testing Usability testing methods -- how to design and... Permission of instructor; or (INFM605 or INST6... 3.0 3.0
26 INST652 Design Thinking and Youth Methods of design thinking specifically within... None 3.0 7.0
25 INST630 Introduction to Programming for the Informatio... An introduction to computer programming intend... None 3.0 3.0
24 INST627 Data Analytics for Information Professionals Skills and knowledge needed to craft datasets,... None 3.0 8.0
23 INST622 Information and Universal Usability Information services and technologies to provi... None 3.0 3.0
22 INST616 Open Source Intelligence An introduction to Open Source Intelligence (O... None 3.0 5.0
21 INST614 Literacy and Inclusion The educational and psychological dimensions o... None 3.0 5.0
20 INST612 Information Policy Nature, structure, development and application... None 3.0 6.0
19 INST604 Introduction to Archives and Digital Curation Overview of the principles, practices, and app... None 3.0 8.0
18 INST490 Integrated Capstone for Information Science The capstone provides a platform for Informati... Minimum grade of C- in INST314, INST335, INST3... 3.0 5.0
17 INST466 Technology, Culture, and Society Individual, cultural, and societal outcomes as... INST201. 3.0 5.0
16 INST462 Introduction to Data Visualization Exploration of the theories, methods, and tech... INST314. 3.0 5.0
15 INST447 Data Sources and Manipulation Examines approaches to locating, acquiring, ma... INST326 or CMSC131; and INST327. 3.0 7.0
14 INST414 Data Science Techniques An exploration of how to extract insights from... INST314. 3.0 6.0
13 INST408Z Special Topics in Information Science; The Apo... NaN NaN NaN
12 INST408Y Special Topics in Information Science; Privacy... NaN NaN NaN
11 INST377 Dynamic Web Applications An exploration of the basic methods and tools ... INST327. 3.0 5.0
10 INST362 User-Centered Design Introduction to human-computer interaction (HC... 1 course with a minimum grade of C- from (INST... 3.0 5.0
9 INST354 Decision-Making for Information Science Examines the use of information in organizatio... INST314. 3.0 6.0
8 INST352 Information User Needs and Assessment Focuses on use of information by individuals, ... 1 course with a minimum grade of C- from (INST... 3.0 5.0
7 INST346 Technologies Infrastructure and Architecture Examines the basic concepts of local and wide-... 1 course with a minimum grade of C- from (INST... 3.0 6.0
6 INST335 Teams and Organizations Team development and the principles, methods a... 1 course with a minimum grade of C- from (INST... 3.0 3.0
5 INST327 Database Design and Modeling Introduction to databases, the relational mode... 1 course with a minimum grade of C- from (CMSC... 3.0 7.0
4 INST326 Object-Oriented Programming for Information Sc... An introduction to programming, emphasizing un... 1 course with a minimum grade of C- from (INST... 3.0 7.0
3 INST314 Statistics for Information Science Basic concepts in statistics including measure... Must have completed or be concurrently enrolle... 3.0 5.0
2 INST311 Information Organization Examines the theories, concepts, and principle... Must have completed or be concurrently enrolle... 3.0 4.0
1 INST201 Introduction to Information Science Examining the effects of new information techn... None 3.0 4.0
0 INST126 Introduction to Programming for Information Sc... An introduction to computer programming for st... Minimum grade of C- in MATH115; or must have m... 3.0 3.0
# with inplace=True, sort_values() modifies df itself and returns None,
# so storing the result leaves sorted_df as None
sorted_df = df.sort_values(by="Code", ascending=False, inplace=True)
print(type(sorted_df))  # <class 'NoneType'>
# sort by the code column, in ascending order
# sort by prereqs
# sort by prereqs, then random
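
One possible way to fill in sorts like the ones described in the comments above, sketched on a small made-up dataframe. The name `toy_df`, the `Prereqs` column name, and the values are assumptions for illustration, and using `Code` as the second sort key is just one choice of tie-breaker.

```python
import pandas as pd

# Toy data; the "Prereqs" column name and all values are assumptions for illustration
toy_df = pd.DataFrame({
    "Code": ["INST462", "INST126", "INST201", "INST314"],
    "Prereqs": ["INST201", "MATH115", "None", "INST201"],
})

# Sort by the Code column, in ascending order (ascending=True is the default)
by_code = toy_df.sort_values(by="Code")

# Sort by Prereqs first, then break ties between equal Prereqs using Code
by_prereqs_then_code = toy_df.sort_values(by=["Prereqs", "Code"])

print(by_code["Code"].tolist())
print(by_prereqs_then_code["Code"].tolist())
```

Passing a list to `by` sorts on the first column and only uses the later columns when earlier ones are equal.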

Aside: dataframes are (mostly) immutable#

Pandas wants you to treat dataframes as if they were immutable: by default, most dataframe methods (like sort_values()) return a modified copy (just like string methods do), rather than modifying the dataframe itself.

This means you can make the same mistake as with strings: your modifications won’t stick around if you don’t save the resulting copy in a variable.

Like this:
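
A minimal sketch of that mistake, using a small made-up dataframe (`toy_df` and its contents are assumptions for illustration, not the course table):

```python
import pandas as pd

# Toy dataframe, made up for illustration
toy_df = pd.DataFrame({"Code": ["INST314", "INST126", "INST201"]})

# The sorted copy is created and then thrown away, because we never store it
toy_df.sort_values(by="Code")
codes_after_unsaved_sort = toy_df["Code"].tolist()  # still in the original order

# Saving the result in a variable keeps the sorted copy
toy_sorted = toy_df.sort_values(by="Code")
codes_after_saved_sort = toy_sorted["Code"].tolist()

print(codes_after_unsaved_sort)
print(codes_after_saved_sort)
```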

You can get around this if you want, by passing an inplace=True argument to many method calls (such as sort_values()). The method then modifies the dataframe itself and returns None.
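
As a rough illustration on the same kind of made-up dataframe (again, `toy_df` is an assumption, not the course table), here is what happens to the original and to the return value:

```python
import pandas as pd

# Toy dataframe, made up for illustration
toy_df = pd.DataFrame({"Code": ["INST314", "INST126", "INST201"]})

# With inplace=True the dataframe itself is reordered...
result = toy_df.sort_values(by="Code", inplace=True)

# ...and the call hands back None instead of a new dataframe
print(result)
print(toy_df["Code"].tolist())
```

This is why you should not combine inplace=True with an assignment: the variable on the left ends up holding None.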

But most of the time you will treat them like strings and make sure you save the result of a modification into a variable.