{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "NxDEC90LqWMi" }, "source": [ "# 11: Pandas for data analysis with Python: Part 2\n", "\n", "## Learning objectives\n", "\n", "Last week, we learned:\n", "- Pandas is a library in Python that is designed for data manipulation and analysis\n", "- How to use libraries (import them, access their functions and data structures with `library.function_name()`)\n", "- About the `dataframe` data structure: basically a smart spreadsheet, with rows of observations, and columns of variables/data for each observation - sort of a cross between a list (sortable, indexable) and a dictionary (quickly access data by key)\n", "- Some basic operations: constructing a dataframe, summarizing, subsetting, reshaping\n", "\n", "This week, we'll mostly dig into more advanced operations for reshaping/modifying your dataframe:\n", "- Use `.apply()` to apply functions to one or more columns to generate new columns\n", "- Use `.groupby()` to split your data into subgroups, apply some function to their data, then combine them into a new dataframe for further analysis (the \"**split-apply-combine**\" pattern that is fundamental to data analysis with pandas)\n", "- Use some basic plotting functions to explore your data\n", "\n", "These roughly correspond to Qs 6-8 in your PCEs.\n", "\n", "If we have time, we'll learn a bit more about summarization:\n", "- Use `.value_counts()` to summarize categorical data\n", "- How to plot data" ] }, { "cell_type": "markdown", "metadata": { "id": "-Jy7Hmo3Tbgo" }, "source": [ "## Creating/modifying data columns based on one or more columns using `.apply()`" ] }, { "cell_type": "markdown", "metadata": { "id": "8cekigZU82xM" }, "source": [ "More advanced operations on dataframes involve modifying or creating new columns!" ] }, { "cell_type": "markdown", "metadata": { "id": "DAHrk-uWTbgo" }, "source": [ "In data analysis, we often want to do things to data in our columns for data preparation/cleaning. 
Sometimes there is missing data we want to recode, or we want to redescribe data or reclassify it for our analysis. We can do this with a combination of functions and the `.apply()` method.\n", "\n", "To connect this back to lists: `.apply()` is a little like the `map()` function that we used to transform the items of one list into another list of equal length (e.g., convert scores to letter grades)." ] }, { "cell_type": "markdown", "metadata": { "id": "nxl6erzh8G3a" }, "source": [ "### `.apply()` with a single column" ] }, { "cell_type": "markdown", "metadata": { "id": "f_b6BhOauQpU" }, "source": [ "The simpler version of `.apply()` only takes input from one column.\n", "\n", "To illustrate, let's do some operations on the dataset `INST courses.csv`. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 271 }, "executionInfo": { "elapsed": 4181, "status": "ok", "timestamp": 1620651417488, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "V3IHERsSssMe", "outputId": "478edb38-7cd5-4cba-83ee-9e931f8c80c0" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Code Title \\\n", "0 INST126 Introduction to Programming for Information Sc... \n", "1 INST201 Introduction to Information Science \n", "2 INST311 Information Organization \n", "3 INST314 Statistics for Information Science \n", "4 INST326 Object-Oriented Programming for Information Sc... \n", "5 INST327 Database Design and Modeling \n", "6 INST335 Teams and Organizations \n", "7 INST346 Technologies Infrastructure and Architecture \n", "8 INST352 Information User Needs and Assessment \n", "9 INST354 Decision-Making for Information Science \n", "10 INST362 User-Centered Design \n", "11 INST377 Dynamic Web Applications \n", "12 INST408Y Special Topics in Information Science; Privacy... \n", "13 INST408Z Special Topics in Information Science; The Apo... \n", "14 INST414 Data Science Techniques \n", "15 INST447 Data Sources and Manipulation \n", "16 INST462 Introduction to Data Visualization \n", "17 INST466 Technology, Culture, and Society \n", "18 INST490 Integrated Capstone for Information Science \n", "19 INST604 Introduction to Archives and Digital Curation \n", "20 INST612 Information Policy \n", "21 INST614 Literacy and Inclusion \n", "22 INST616 Open Source Intelligence \n", "23 INST622 Information and Universal Usability \n", "24 INST627 Data Analytics for Information Professionals \n", "25 INST630 Introduction to Programming for the Informatio... \n", "26 INST652 Design Thinking and Youth \n", "27 INST702 Advanced Usability Testing \n", "28 INST709 Independent Study \n", "29 INST728G Special Topics in Information Studies; Smart C... \n", "30 INST728V Special Topics in Information Studies; Digital... 
\n", "31 INST733 Database Design \n", "32 INST737 Introduction to Data Science \n", "33 INST741 Social Computing Technologies and Applications \n", "34 INST742 Implementing Digital Curation \n", "35 INST746 Digitization of Legacy Holdings \n", "36 INST762 Visual Analytics \n", "37 INST767 Big Data Infrastructure \n", "38 INST776 HCIM CAPSTONE PROJECT \n", "39 INST785 Documentation, Collection, and Appraisal of Re... \n", "40 INST794 Capstone in Youth Experience \n", "\n", " Description \\\n", "0 An introduction to computer programming for st... \n", "1 Examining the effects of new information techn... \n", "2 Examines the theories, concepts, and principle... \n", "3 Basic concepts in statistics including measure... \n", "4 An introduction to programming, emphasizing un... \n", "5 Introduction to databases, the relational mode... \n", "6 Team development and the principles, methods a... \n", "7 Examines the basic concepts of local and wide-... \n", "8 Focuses on use of information by individuals, ... \n", "9 Examines the use of information in organizatio... \n", "10 Introduction to human-computer interaction (HC... \n", "11 An exploration of the basic methods and tools ... \n", "12 \n", "13 \n", "14 An exploration of how to extract insights from... \n", "15 Examines approaches to locating, acquiring, ma... \n", "16 Exploration of the theories, methods, and tech... \n", "17 Individual, cultural, and societal outcomes as... \n", "18 The capstone provides a platform for Informati... \n", "19 Overview of the principles, practices, and app... \n", "20 Nature, structure, development and application... \n", "21 The educational and psychological dimensions o... \n", "22 An introduction to Open Source Intelligence (O... \n", "23 Information services and technologies to provi... \n", "24 Skills and knowledge needed to craft datasets,... \n", "25 An introduction to computer programming intend... \n", "26 Methods of design thinking specifically within... 
\n", "27 Usability testing methods -- how to design and... \n", "28 \n", "29 \n", "30 \n", "31 Principles of user-oriented database design. ... \n", "32 An exploration of some of the best and most ge... \n", "33 Tools and techniques for developing and config... \n", "34 Management of and technology for application o... \n", "35 Through hands on exercises and real-world proj... \n", "36 Visual analytics is the use of interactive vis... \n", "37 Principles and techniques of data science and ... \n", "38 The opportunity to apply the skills learned th... \n", "39 Development of documentation strategies and pl... \n", "40 Through a supervised project, to synthesize de... \n", "\n", " Prereqs Credits \n", "0 Minimum grade of C- in MATH115; or must have m... 3.0 \n", "1 None 3.0 \n", "2 Must have completed or be concurrently enrolle... 3.0 \n", "3 Must have completed or be concurrently enrolle... 3.0 \n", "4 1 course with a minimum grade of C- from (INST... 3.0 \n", "5 1 course with a minimum grade of C- from (CMSC... 3.0 \n", "6 1 course with a minimum grade of C- from (INST... 3.0 \n", "7 1 course with a minimum grade of C- from (INST... 3.0 \n", "8 1 course with a minimum grade of C- from (INST... 3.0 \n", "9 INST314. 3.0 \n", "10 1 course with a minimum grade of C- from (INST... 3.0 \n", "11 INST327. 3.0 \n", "12 NaN NaN \n", "13 NaN NaN \n", "14 INST314. 3.0 \n", "15 INST326 or CMSC131; and INST327. 3.0 \n", "16 INST314. 3.0 \n", "17 INST201. 3.0 \n", "18 Minimum grade of C- in INST314, INST335, INST3... 3.0 \n", "19 None 3.0 \n", "20 None 3.0 \n", "21 None 3.0 \n", "22 None 3.0 \n", "23 None 3.0 \n", "24 None 3.0 \n", "25 None 3.0 \n", "26 None 3.0 \n", "27 Permission of instructor; or (INFM605 or INST6... 3.0 \n", "28 NaN NaN \n", "29 NaN NaN \n", "30 NaN NaN \n", "31 LBSC690, LBSC671, or INFM603; or permission of... 3.0 \n", "32 INST627; and (LBSC690, LBSC671, or INFM603). O... 3.0 \n", "33 INFM603 and INFM605; or (LBSC602 and LBSC671);... 
3.0 \n", "34 INST604; or permission of instructor. 3.0 \n", "35 INST604. 3.0 \n", "36 INFM603 or INST630; or permission of instructor. 3.0 \n", "37 INST737; or permission of instructor. 3.0 \n", "38 INST775; or permission of instructor. 3.0 \n", "39 INST604; or permission of instructor. 3.0 \n", "40 INST650, INST651, and INST652; or permission o... 3.0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import the pandas library\n", "import pandas as pd\n", "\n", "# read in the dataset\n", "fpath = 'INST courses.csv'\n", "courses = pd.read_csv(fpath) # read in the file into a dataframe called courses\n", "courses # display the whole dataframe (use courses.head() to show only the top 5 rows)" ] }, { "cell_type": "markdown", "metadata": { "id": "ojewpk2m5vdy" }, "source": [ "Let's say we want to have a prereqs column that is sortable, maybe 0 = no prereqs, and 1 = has prereqs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 1: Define the function you want to apply" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "executionInfo": { "elapsed": 446, "status": "ok", "timestamp": 1620651438118, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "LXPbCqQc6Bdq" }, "outputs": [], "source": [ "# Step 1: define the function you want to apply\n", "def has_prereq(prereq_descr):\n", "    # prereq_descr is normally a string; missing values arrive as NaN\n", "    if pd.isnull(prereq_descr):\n", "        return 0\n", "    elif \"None\" in prereq_descr:\n", "        return 0\n", "    else:\n", "        return 1" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 701, "status": "ok", "timestamp": 1620050164279, "user": { "displayName": "Joel Chan", "photoUrl": 
"https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "Dtn5-jWD6QIG", "outputId": "2542376e-5d19-4416-92c3-657a6418d7e4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "0\n" ] } ], "source": [ "# test the function\n", "prereq = \"BMGT301; or instructor permission\" # this should yield 1\n", "prereq2 = \"None\" # this should yield 0\n", "print(has_prereq(prereq))\n", "print(has_prereq(prereq2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 2: Apply the function to a column and save the result in a (new) column" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 546 }, "executionInfo": { "elapsed": 182, "status": "ok", "timestamp": 1620651440591, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "mEvHuoFH6iid", "outputId": "55ff4823-9787-4180-b6cd-e34b6b06503d", "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Code Title \\\n", "0 INST126 Introduction to Programming for Information Sc... \n", "1 INST201 Introduction to Information Science \n", "2 INST311 Information Organization \n", "3 INST314 Statistics for Information Science \n", "4 INST326 Object-Oriented Programming for Information Sc... \n", "5 INST327 Database Design and Modeling \n", "6 INST335 Teams and Organizations \n", "7 INST346 Technologies Infrastructure and Architecture \n", "8 INST352 Information User Needs and Assessment \n", "9 INST354 Decision-Making for Information Science \n", "\n", " Description \\\n", "0 An introduction to computer programming for st... \n", "1 Examining the effects of new information techn... \n", "2 Examines the theories, concepts, and principle... \n", "3 Basic concepts in statistics including measure... \n", "4 An introduction to programming, emphasizing un... \n", "5 Introduction to databases, the relational mode... \n", "6 Team development and the principles, methods a... \n", "7 Examines the basic concepts of local and wide-... \n", "8 Focuses on use of information by individuals, ... \n", "9 Examines the use of information in organizatio... \n", "\n", " Prereqs Credits has_prereqs \n", "0 Minimum grade of C- in MATH115; or must have m... 3.0 1 \n", "1 None 3.0 0 \n", "2 Must have completed or be concurrently enrolle... 3.0 1 \n", "3 Must have completed or be concurrently enrolle... 3.0 1 \n", "4 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "5 1 course with a minimum grade of C- from (CMSC... 3.0 1 \n", "6 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "7 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "8 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "9 INST314. 
3.0 1 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 2: apply it to a column and save the result in the (new) `has_prereqs` column\n", "courses['has_prereqs'] = courses['Prereqs'].apply(has_prereq) # apply the has_prereq() function to every row in the Prereqs column of the courses dataframe\n", "courses.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another example: let's say we want to know if a course is an introductory course. How might we do this?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 667 }, "executionInfo": { "elapsed": 193, "status": "ok", "timestamp": 1620651446449, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "-rkq4M_AhLN-", "outputId": "f1331af4-801f-4b29-9dfc-dd5cbd60a169" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Code Title \\\n", "0 INST126 Introduction to Programming for Information Sc... \n", "1 INST201 Introduction to Information Science \n", "2 INST311 Information Organization \n", "3 INST314 Statistics for Information Science \n", "4 INST326 Object-Oriented Programming for Information Sc... \n", "5 INST327 Database Design and Modeling \n", "6 INST335 Teams and Organizations \n", "7 INST346 Technologies Infrastructure and Architecture \n", "8 INST352 Information User Needs and Assessment \n", "9 INST354 Decision-Making for Information Science \n", "\n", " Description \\\n", "0 An introduction to computer programming for st... \n", "1 Examining the effects of new information techn... \n", "2 Examines the theories, concepts, and principle... \n", "3 Basic concepts in statistics including measure... \n", "4 An introduction to programming, emphasizing un... \n", "5 Introduction to databases, the relational mode... \n", "6 Team development and the principles, methods a... \n", "7 Examines the basic concepts of local and wide-... \n", "8 Focuses on use of information by individuals, ... \n", "9 Examines the use of information in organizatio... \n", "\n", " Prereqs Credits has_prereqs \\\n", "0 Minimum grade of C- in MATH115; or must have m... 3.0 1 \n", "1 None 3.0 0 \n", "2 Must have completed or be concurrently enrolle... 3.0 1 \n", "3 Must have completed or be concurrently enrolle... 3.0 1 \n", "4 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "5 1 course with a minimum grade of C- from (CMSC... 3.0 1 \n", "6 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "7 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "8 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "9 INST314. 
3.0 1 \n", "\n", " is_intro \n", "0 1 \n", "1 1 \n", "2 0 \n", "3 0 \n", "4 0 \n", "5 0 \n", "6 0 \n", "7 0 \n", "8 0 \n", "9 0 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# first define a function to check if the course is an intro course\n", "def is_intro(title):\n", "    if \"introduction\" in title.lower():\n", "        return 1\n", "    else:\n", "        return 0\n", "\n", "# then apply it to the `Title` column and save the result in the (new) `is_intro` column\n", "courses['is_intro'] = courses['Title'].apply(is_intro)\n", "courses.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For short functions, you can also pass in an anonymous function with `lambda`: https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 0\n", "3 0\n", "4 0\n", "5 0\n", "6 0\n", "7 0\n", "8 0\n", "9 0\n", "Name: Title, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_introductory = courses['Title'].apply(lambda title: 1 if \"introduction\" in title.lower() else 0)\n", "is_introductory.head(10) # show the top 10" ] }, { "cell_type": "markdown", "metadata": { "id": "tE7sj3uIxFH2" }, "source": [ "##### What's happening under the hood when you `.apply()` a function to a column\n", "\n", "Pandas is *iterating* through every row in that column, and *applying* the function to the value in that row." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Result of applying has_prereq() to Minimum grade of C- in MATH115; or must have math eligibility of MATH140 or higher; or permission of instructor.: 1\n", "Result of applying has_prereq() to None: 0\n", "Result of applying has_prereq() to Must have completed or be concurrently enrolled in INST201; or INST301.: 1\n", "Result of applying has_prereq() to Must have completed or be concurrently enrolled in INST201; or must have completed or be concurrently enrolled in INST301. And minimum grade of C- in INST201 and INST301; and MATH115; and STAT100; and minimum grade of C- in MATH115 and STAT100.: 1\n", "Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST126, CMSC106); and must have completed or be concurrently enrolled in INST201 or INST301. And minimum grade of C- in INST201; or minimum grade of C- in INST301.: 1\n", "Result of applying has_prereq() to 1 course with a minimum grade of C- from (CMSC106, CMSC122, INST126); and must have completed or be concurrently enrolled in INST201 or INST301; and minimum grade of C- in INST201 and INST301.: 1\n", "Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and minimum grade of C- in PSYC100.: 1\n", "Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and 1 course with a minimum grade of C- from (INST326, CMSC131); and minimum grade of C- in INST327.: 1\n", "Result of applying has_prereq() to 1 course with a minimum grade of C- from (INST201, INST301); and minimum grade of C- in INST311.: 1\n", "Result of applying has_prereq() to INST314.: 1\n" ] } ], "source": [ "# let's show this for the first 10 courses\n", "for prereq in courses['Prereqs'].head(10):\n", " print(f\"Result of applying has_prereq() to 
{prereq}: {has_prereq(prereq)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.apply()` method returns a pandas Series that is the same length as the input column (which is also a Series), with a corresponding value for each input." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the Prereqs column has 41 rows\n", "the Series created by applying `has_prereq()` to the Prereqs column has 41 rows\n" ] } ], "source": [ "print(f\"the Prereqs column has {len(courses['Prereqs'])} rows\")\n", "print(f\"the Series created by applying `has_prereq()` to the Prereqs column has {len(courses['Prereqs'].apply(has_prereq))} rows\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To save the results of the `.apply()` for later analysis, we then need to assign it to a column, new or existing. \n", "\n", "Remember, most pandas operations return a new object instead of modifying the original. `.apply()` is one of them: it never changes the column in place, so you have to assign the returned Series to a column if you want the result to persist.\n", "\n", "Like with other assignment statements, just running the `.apply()` and assigning its return value to a column will not yield output. You'll need to display the dataframe to check the results.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "d9moZ_FCwNEs" }, "source": [ "**PRACTICE:** Let's say I want to know how many courses we have in each area. We don't have that data in the dataset, at least not explicitly. Fortunately, we can make it with some simple programming that you already know how to do! The problem here is, given a code (i.e., data from one column), how do we \"extract\" the area?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "executionInfo": { "elapsed": 216, "status": "ok", "timestamp": 1620651452553, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "KUWhFrzsTbgo" }, "outputs": [], "source": [ "# Step 1: define the function\n", "def extract_area(code):\n", " # heuristic: just grab the first four characters\n", " return " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "executionInfo": { "elapsed": 367, "status": "ok", "timestamp": 1620050634026, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "nsmT_dWOTbgo", "outputId": "bae6f033-a834-4d4f-d1a7-0d7d0b0f89c8" }, "outputs": [], "source": [ "c = \"CMSC250\"\n", "extract_area(c)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 667 }, "executionInfo": { "elapsed": 164, "status": "ok", "timestamp": 1620651455017, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "SaOKpq5OTbgo", "outputId": "74885e32-33e7-4174-a934-73f49e7c2ba0" }, "outputs": [], "source": [ "# Step 2: apply the function\n", "courses['area'] = courses['code'].apply(extract_area)\n", "courses.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**PRACTICE**: With the `wunderground.csv` dataset, how can we extract the year/month/day from the date column?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**PRACTICE:** With the `BreadBasket_DMS.csv` dataset, how can we extract the hour for each transaction from the Time column?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the bread dataset\n", "bread = pd.read_csv(\"data/BreadBasket_DMS.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 417 }, "executionInfo": { "elapsed": 1635, "status": "ok", "timestamp": 1620050938934, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "hMcGkkjsjRVm", "outputId": "a09cc8bc-1c83-4936-fff8-5fa683cd1f79" }, "outputs": [], "source": [ "def extract_hour(time):\n", " return \n", "\n", "bread['Hour'] = bread['Time'].apply(extract_hour)\n", "bread.sort_values(by=\"Hour\")" ] }, { "cell_type": "markdown", "metadata": { "id": "2p5cEALzwUit" }, "source": [ "### `.apply()` with data from multiple columns\n", "\n", "What if you want a way to filter the courses down to \"easy entry points\" (i.e., courses that are introductory *and* have no prerequisites)? It might also be interesting to analyze these by area, to see how many departments offer easy entry points for students from other departments." ] }, { "cell_type": "markdown", "metadata": { "id": "YfLr69-fxQ_3" }, "source": [ "The core thing we need to know here is that `.apply()` will now call a function that takes a **row** as input, not an element of a single column. 
That way, we can access data from any column in the row: in this case, data from the \"is_intro\" and \"has_prereqs\" columns.\n", "\n", "There are two key differences between this use of `.apply()` and the single-column case:\n", "1. We call `.apply()` on the whole dataframe, not on a single column\n", "2. We specify an argument for the `axis` parameter to tell it to use rows as inputs. We pass `axis=1` when we call `.apply()` so that it passes each row into the function; with the default `axis=0`, the function would receive each column instead. See here for more details: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "executionInfo": { "elapsed": 956, "status": "ok", "timestamp": 1620651475427, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "dfe45SocxJ9D" }, "outputs": [], "source": [ "# is_entry_point function\n", "def is_entry_point(row):\n", " # if the value of the \"is_intro\" column for this row is 1\n", " # AND the value of the \"has_prereqs\" column for this row is 0\n", " # return 1; otherwise return 0\n", " if row['is_intro'] == 1 and row['has_prereqs'] == 0: \n", " return 1\n", " else:\n", " return 0" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 639, "status": "ok", "timestamp": 1620051476022, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "a773Hm4w-2xi", "outputId": "15d7770d-c04c-4fbe-ab9a-fff671c4c0ed" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "0\n" ] } ], "source": [ "# this should yield 1\n", "test_row = {\n", " 'is_intro': 
1,\n", " 'has_prereqs': 0\n", "}\n", "\n", "# this should yield 0\n", "test_row2 = {\n", " 'is_intro': 1,\n", " 'has_prereqs': 1\n", "}\n", "\n", "print(is_entry_point(test_row))\n", "print(is_entry_point(test_row2))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 551 }, "executionInfo": { "elapsed": 662, "status": "ok", "timestamp": 1620651479735, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "_oUWTToLxMju", "outputId": "be8c58b4-d2cb-4b06-9871-3dff97684633" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Code Title \\\n", "0 INST126 Introduction to Programming for Information Sc... \n", "1 INST201 Introduction to Information Science \n", "2 INST311 Information Organization \n", "3 INST314 Statistics for Information Science \n", "4 INST326 Object-Oriented Programming for Information Sc... \n", "\n", " Description \\\n", "0 An introduction to computer programming for st... \n", "1 Examining the effects of new information techn... \n", "2 Examines the theories, concepts, and principle... \n", "3 Basic concepts in statistics including measure... \n", "4 An introduction to programming, emphasizing un... \n", "\n", " Prereqs Credits has_prereqs \\\n", "0 Minimum grade of C- in MATH115; or must have m... 3.0 1 \n", "1 None 3.0 0 \n", "2 Must have completed or be concurrently enrolle... 3.0 1 \n", "3 Must have completed or be concurrently enrolle... 3.0 1 \n", "4 1 course with a minimum grade of C- from (INST... 3.0 1 \n", "\n", " is_intro is_entrypoint \n", "0 1 0 \n", "1 1 1 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 2 apply the function and save the result\n", "courses['is_entrypoint'] = courses.apply(is_entry_point, axis=1) # need to specify axis=1 to apply it to every row\n", "# courses['classlevel'] = courses['classcode'].apply(level)\n", "courses.head()\n", "\n", "# compare to .apply() with a single column\n", "# courses['is_intro'] = courses['title'].apply(is_intro)\n", "# key differences:\n", "# - here for multiple columns, we start with the whole dataframe, instead of a specific column\n", "# - and we pass the argument 1 to the axis parameter instead of letting it use the default 0 value" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 838 }, "executionInfo": { "elapsed": 381, "status": "ok", "timestamp": 1607358177617, "user": { "displayName": "Joel Chan", "photoUrl": 
"https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 300 }, "id": "gJim7Jfh_gY0", "outputId": "792b6647-4ece-4f11-ead5-f27b975955ad" }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Code Title \\\n", "1 INST201 Introduction to Information Science \n", "19 INST604 Introduction to Archives and Digital Curation \n", "25 INST630 Introduction to Programming for the Informatio... \n", "\n", " Description Prereqs Credits \\\n", "1 Examining the effects of new information techn... None 3.0 \n", "19 Overview of the principles, practices, and app... None 3.0 \n", "25 An introduction to computer programming intend... None 3.0 \n", "\n", " has_prereqs is_intro is_entrypoint \n", "1 0 1 1 \n", "19 0 1 1 \n", "25 0 1 1 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show all the courses that are intro and have no prereqs\n", "courses[courses['is_entrypoint'] == 1]\n", "# compare: with a list, we can index like this to get the first 4 elements: courses[:4]\n", "# and with a dictionary, we can retrieve a value based on its key, like this: courses['hello']" ] }, { "cell_type": "markdown", "metadata": { "id": "qiwpZzR3xPBE" }, "source": [ "More examples?" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Unnamed: 0 school conf rk w l srs sos pts_for pts_vs \\\n", "8229 8229 indiana Big Ten 78 20 3 NaN NaN NaN NaN \n", "23515 23515 wisconsin Big Ten 77 20 3 NaN NaN NaN NaN \n", "\n", " ... ap_pre ap_high ap_final pts_diff ncaa_result \\\n", "8229 ... 30 30 30 NaN Won National Final \n", "23515 ... 30 30 30 NaN Won National Final \n", "\n", " ncaa_numeric season year coach underdog \n", "8229 48 1939-40 1939 Branch McCracken 1 \n", "23515 48 1940-41 1940 Bud Foster 1 \n", "\n", "[2 rows x 21 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# for ncaa: make a column that is 1 if the team won fewer than 21 games AND won the national final\n", "ncaa = pd.read_csv(\"data/ncaa-team-data-cleanCoachNames.csv\")\n", "ncaa.head()\n", "\n", "def underdog_season(row):\n", " if row['w'] < 21 and row['ncaa_result'] == \"Won National Final\":\n", " return 1\n", " else:\n", " return 0\n", " \n", "ncaa['underdog'] = ncaa.apply(underdog_season, axis=1)\n", "ncaa[ncaa['underdog']==1]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Lost First Four',\n", " 'Lost First Round',\n", " 'Lost National Final',\n", " 'Lost National Semifinal',\n", " 'Lost Opening Round',\n", " 'Lost Regional Final',\n", " 'Lost Regional Final (Final Four)',\n", " 'Lost Regional Semifinal',\n", " 'Lost Second Round',\n", " 'Lost Third Round',\n", " 'Playing First Four',\n", " 'Playing First Round',\n", " 'Won National Final',\n", " nan}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "set(ncaa['ncaa_result'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# for ncaa: create a column that is 1 if the point differential is non-negative BUT the team didn't make the playoffs (numeric >= 32 or 0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "# apply the get_month function to every value in the Date column in bread\n", "# and store the resulting series in the Month column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert the ncaa_result column to a numerical ranking of season outcome" ] }, { "cell_type": "markdown", "metadata": { "id": "OLt6ewQFyaKW", "tags": [] }, "source": [ "## Analyze subgroups of your data with the split-apply-combine pattern\n", "\n", "Going deeper down the path of \"reshaping\", we often want to compute summaries for subsets of the data, grouped by the values in some column.\n", "\n", "For example, we might want to see how many departments offer easy entry points.\n", "\n", "We can do this with the \"split-apply-combine\" pattern, which pandas implements with the `.groupby()` method.\n", "\n", "Basically, it goes like this:\n", "1. **Split** the data into subgroups (e.g., split courses into department subgroups)\n", "2. **Apply** some computation to each subgroup (e.g., count the easy entry points in each department subgroup)\n", "3. 
**Combine** the results of the subgroup computations into a new dataframe with one entry per subgroup\n", "\n", "Here's a picture with a simpler dataset to give an intuition:\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 698, "status": "ok", "timestamp": 1620051947962, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "vKLWqdvznfVS", "outputId": "e6dab799-9522-4812-aa8c-a6b1195488c3", "scrolled": true }, "outputs": [], "source": [ "for area, areaData in courses.groupby('area'):\n", " print(area)\n", " print(areaData)" ] }, { "cell_type": "markdown", "metadata": { "id": "-mJPLBh193_3" }, "source": [ "### Split with `.groupby()`\n", "\n", "The first step is to split the dataset into subgroups." ] }, { "cell_type": "markdown", "metadata": { "id": "INdrHFZE9-Sc" }, "source": [ "The manual way" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 550, "status": "ok", "timestamp": 1620052025194, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "XUEuVbRKOjTy", "outputId": "ecf2a32b-6f8a-4a3d-e938-cf2c637a2799" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PLCY\n", " code title \\\n", "319 PLCY101 Great Thinkers on Public Policy \n", "320 PLCY201 Public Leaders and Active Citizens \n", "321 PLCY215 Innovation and Social Change: Creating Change... \n", "322 PLCY301 Sustainability \n", "323 PLCY302 Examining Pluralism in Public Policy \n", "\n", " description prereqs credits \\\n", "319 Great ideas in public policy, such as equalit... 
None 3 \n", "320 Aims to inspire, teach and engage students in... None 3 \n", "321 A team-based, highly interactive and dynamic ... None 3 \n", "322 Designed for students whose academic majors w... None 3 \n", "323 Understanding pluralism and how groups and in... None 3 \n", "\n", " prereq_type area has_prereqs is_intro is_entrypoint \n", "319 None PLCY 0 0 0 \n", "320 None PLCY 0 0 0 \n", "321 None PLCY 0 0 0 \n", "322 None PLCY 0 0 0 \n", "323 None PLCY 0 0 0 \n", "3.0\n", "\n", "\n", "COMM\n", " code title \\\n", "108 COMM200 Critical Thinking and Speaking \n", "109 COMM331 News Writing and Reporting for Public Relations \n", "110 COMM332 News Editing for Public Relations \n", "111 COMM351 Public Relations Techniques \n", "112 COMM353 New Media Writing for Public Relations \n", "\n", " description \\\n", "108 Theory and practice of persuasive discourse a... \n", "109 Writing and researching news and information ... \n", "110 Copy editing, graphic principles and processe... \n", "111 The techniques of public relations, including... \n", "112 Students learn the uses and influence of new ... \n", "\n", " prereqs credits prereq_type \\\n", "108 None 3 None \n", "109 COMM201; and must have completed or be concur... 3 Flexible \n", "110 Minimum grade of C- in COMM331; or students w... 3 Flexible \n", "111 COMM332. 3 Hard \n", "112 Minimum grade of C- in COMM351. 3 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "108 COMM 0 0 0 \n", "109 COMM 1 0 0 \n", "110 COMM 1 0 0 \n", "111 COMM 1 0 0 \n", "112 COMM 1 0 0 \n", "2.870967741935484\n", "\n", "\n", "CMSC\n", " code title \\\n", "62 CMSC122 Introduction to Computer Programming via the ... \n", "63 CMSC131 Object-Oriented Programming I \n", "64 CMSC132 Object-Oriented Programming II \n", "65 CMSC133 Object Oriented Programming I Beyond Fundamen... \n", "66 CMSC216 Introduction to Computer Systems \n", "\n", " description \\\n", "62 Introduction to computer programming in the c... 
\n", "63 Introduction to programming and computer scie... \n", "64 Introduction to use of computers to solve pro... \n", "65 An introduction to computer science and objec... \n", "66 Introduction to the interaction between user ... \n", "\n", " prereqs credits prereq_type \\\n", "62 None 3 None \n", "63 None 4 None \n", "64 Minimum grade of C- in CMSC131; or must have ... 4 Flexible \n", "65 None 2 None \n", "66 Minimum grade of C- in CMSC132; and minimum g... 4 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "62 CMSC 0 1 1 \n", "63 CMSC 0 0 0 \n", "64 CMSC 1 0 0 \n", "65 CMSC 0 0 0 \n", "66 CMSC 1 1 0 \n", "3.0\n", "\n", "\n", "ENTS\n", " code title \\\n", "209 ENTS630 The Economics of International Telecommunicat... \n", "210 ENTS632 Telecommunications Marketing Management \n", "211 ENTS635 Decision Support Methods for Telecommunicatio... \n", "212 ENTS641 Networks and Protocols II \n", "\n", " description prereqs credits \\\n", "209 Basic microeconomic principles used by teleco... None 3 \n", "210 Topics covered include strategic marketing, s... None 3 \n", "211 The aim of this course is to introduce manage... None 3 \n", "212 Techniques for the specification, design, ana... ENTS640. 3 \n", "\n", " prereq_type area has_prereqs is_intro is_entrypoint \n", "209 None ENTS 0 0 0 \n", "210 None ENTS 0 0 0 \n", "211 None ENTS 0 0 0 \n", "212 Hard ENTS 1 0 0 \n", "3.0\n", "\n", "\n", "URSP\n", " code title \\\n", "407 URSP600 Research Design and Application \n", "408 URSP601 Research Methods \n", "409 URSP604 The Planning Process \n", "410 URSP606 Planning Economics \n", "411 URSP631 Transportation and Land Use \n", "\n", " description prereqs credits \\\n", "407 Techniques in urban research, policy analysis... None 3 \n", "408 Use of measurement, statistics, quantitative ... None 3 \n", "409 Legal framework for U.S. planning; approaches... None 3 \n", "410 Resource allocation in a market economy, the ... 
None 3 \n", "411 The interrelationship between transportation ... None 3 \n", "\n", " prereq_type area has_prereqs is_intro is_entrypoint \n", "407 None URSP 0 0 0 \n", "408 None URSP 0 0 0 \n", "409 None URSP 0 0 0 \n", "410 None URSP 0 0 0 \n", "411 None URSP 0 0 0 \n", "3.0\n", "\n", "\n", "STAT\n", " code title \\\n", "392 STAT100 Elementary Statistics and Probability \n", "393 STAT400 Applied Probability and Statistics I \n", "394 STAT401 Applied Probability and Statistics II \n", "395 STAT410 Introduction to Probability Theory \n", "396 STAT420 Theory and Methods of Statistics \n", "\n", " description \\\n", "392 Simplest tests of statistical hypotheses; app... \n", "393 Random variables, standard distributions, mom... \n", "394 Point estimation - unbiased and consistent es... \n", "395 Probability and its properties. Random variab... \n", "396 Point estimation, sufficiency, completeness, ... \n", "\n", " prereqs credits prereq_type \\\n", "392 MATH110, MATH112, MATH113, or MATH115; or per... 3 Flexible \n", "393 1 course with a minimum grade of C- from (MAT... 3 Flexible \n", "394 1 course with a minimum grade of C- from (STA... 3 Hard \n", "395 1 course with a minimum grade of C- from (MAT... 3 Hard \n", "396 1 course with a minimum grade of C- from (SUR... 3 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "392 STAT 1 0 0 \n", "393 STAT 1 0 0 \n", "394 STAT 1 0 0 \n", "395 STAT 1 1 0 \n", "396 STAT 1 0 0 \n", "3.0\n", "\n", "\n", "ENSP\n", " code title \\\n", "203 ENSP102 Introduction to Environmental Policy \n", "204 ENSP250 Lawns in the Landscape: Environmental Hero or... \n", "205 ENSP305 Applied Quantitative Methods in Environmental... \n", "206 ENSP330 Introduction to Environmental Law \n", "207 ENSP342 Environmental Threats to Oceans and Coasts: T... \n", "\n", " description \\\n", "203 Second of two courses that introduce students... \n", "204 Examination of the lawn as an element in the ... 
\n", "205 Intended for students interested in pursuing ... \n", "206 An overview of environmental law, from its co... \n", "207 An interdisciplinary study of the challenges ... \n", "\n", " prereqs credits prereq_type \\\n", "203 None 3 None \n", "204 None 3 None \n", "205 BIOM301, ECON321, GEOG306, PSYC200, or SOCY20... 3 Flexible \n", "206 None 3 None \n", "207 None 3 None \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "203 ENSP 0 1 1 \n", "204 ENSP 0 0 0 \n", "205 ENSP 1 0 0 \n", "206 ENSP 0 1 1 \n", "207 ENSP 0 0 0 \n", "3.0\n", "\n", "\n", "AMST\n", " code title \\\n", "0 AMST101 Introduction American Studies \n", "1 AMST298C Introduction to Asian American Studies \n", "2 AMST340 Introduction to History, Theories and Methods... \n", "3 AMST418N Asian American Public Policy \n", "4 AMST450 Seminar in American Studies \n", "\n", " description \\\n", "0 Introduces students to the interdisciplinary ... \n", "1 The aggregate experience of Asian Pacific Ame... \n", "2 Introduction to the process of interdisciplin... \n", "3 \n", "4 Developments in theories and methods of Ameri... \n", "\n", " prereqs credits prereq_type \\\n", "0 None 3 None \n", "1 None 3 None \n", "2 Must have completed AMST201; and 2 courses in... 3 Hard \n", "3 None 3 None \n", "4 AMST201 and AMST340; and 1 course in AMST. 3 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "0 AMST 0 1 1 \n", "1 AMST 0 1 1 \n", "2 AMST 1 1 0 \n", "3 AMST 0 0 0 \n", "4 AMST 1 0 0 \n", "3.0\n", "\n", "\n", "MATH\n", " code title \\\n", "265 MATH107 Introduction to Math Modeling and Probability \n", "266 MATH113 College Algebra and Trigonometry \n", "267 MATH115 Precalculus \n", "268 MATH120 Elementary Calculus I \n", "269 MATH121 Elementary Calculus II \n", "\n", " description \\\n", "265 A goal is to convey the power of mathematics ... \n", "266 Topics include elementary functions including... \n", "267 Preparation for MATH120, MATH130 or MATH140. ... 
\n", "268 Basic ideas of differential and integral calc... \n", "269 Differential and integral calculus, with emph... \n", "\n", " prereqs credits prereq_type \\\n", "265 Must have math eligibility of Math 107 or hig... 3 Flexible \n", "266 Must have math eligibility of MATH113 or high... 3 Flexible \n", "267 Must have math eligibility of MATH115 or high... 3 Flexible \n", "268 1 course with a minimum grade of C- from (MAT... 3 Flexible \n", "269 MATH120, MATH130, MATH136, or MATH140; or mus... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "265 MATH 1 1 0 \n", "266 MATH 1 0 0 \n", "267 MATH 1 0 0 \n", "268 MATH 1 0 0 \n", "269 MATH 1 0 0 \n", "3.061224489795918\n", "\n", "\n", "INFM\n", " code title \\\n", "213 INFM605 Users and Use Context \n", "214 INFM612 Management Concepts and Principles for Inform... \n", "215 INFM620 Introduction to Strategic Information Managem... \n", "216 INFM700 Information Architecture \n", "217 INFM737 Information Management Capstone Experience \n", "\n", " description \\\n", "213 Use of information by individuals. Nature of ... \n", "214 Key aspects of management - focusing on plann... \n", "215 Strategic management is the comprehensive col... \n", "216 Principles and techniques of information orga... \n", "217 The Information Management Capstone Experienc... \n", "\n", " prereqs credits prereq_type \\\n", "213 None 3 None \n", "214 None 3 None \n", "215 INFM612; or LBSC631; or permission of instruc... 3 Flexible \n", "216 INFM603; or permission of instructor. 3 Flexible \n", "217 INFM736; and must have earned a minimum of 27... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "213 INFM 0 0 0 \n", "214 INFM 0 0 0 \n", "215 INFM 1 1 0 \n", "216 INFM 1 0 0 \n", "217 INFM 1 0 0 \n", "3.0\n", "\n", "\n", "PHSC\n", " code title \\\n", "314 PHSC401 History of Public Health \n", "315 PHSC412 Food, Policy, and Public Health \n", "316 PHSC415 Essentials of Public Health Biology: The Cell... 
\n", "317 PHSC440 Public Health Nutrition \n", "318 PHSC497 Public Health Science Capstone \n", "\n", " description \\\n", "314 Emphasis is on the history of public health i... \n", "315 Broad overview of the impact of food and food... \n", "316 Presents the basic scientific and biomedical ... \n", "317 Engages students in conceptual thinking about... \n", "318 The capstone course is the culminating experi... \n", "\n", " prereqs credits prereq_type \\\n", "314 None 3 None \n", "315 Must have completed HLSA300 with a C- or high... 3 Flexible \n", "316 Minimum grade of C- in BSCI202. 3 Hard \n", "317 A minimum of C- in BSCI170, BSCI171, CHEM131,... 3 Hard \n", "318 Must have completed the professional writing ... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "314 PHSC 0 0 0 \n", "315 PHSC 1 0 0 \n", "316 PHSC 1 0 0 \n", "317 PHSC 1 0 0 \n", "318 PHSC 1 0 0 \n", "3.0\n", "\n", "\n", "BMGT\n", " code title \\\n", "9 BMGT190H Introduction to Design and Quality \n", "10 BMGT210 Foundations of Accounting for Non Business Ma... \n", "11 BMGT302 Designing Applications for Business Analytics \n", "12 BMGT310 Intermediate Accounting I \n", "13 BMGT311 Intermediate Accounting II \n", "\n", " description \\\n", "9 QUEST students learn and apply design practic... \n", "10 Provides an understanding of the common state... \n", "11 Provides an introduction to structured progra... \n", "12 Comprehensive analysis of financial accountin... \n", "13 Continuation of BMGT310. \n", "\n", " prereqs credits prereq_type \\\n", "9 None 4 None \n", "10 None 3 None \n", "11 BMGT301; or permission of BMGT-Robert H. Smit... 3 Flexible \n", "12 BMGT221. 3 Hard \n", "13 BMGT310. 
3 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "9 BMGT 0 1 1 \n", "10 BMGT 0 0 0 \n", "11 BMGT 1 0 0 \n", "12 BMGT 1 0 0 \n", "13 BMGT 1 0 0 \n", "3.0377358490566038\n", "\n", "\n", "ECON\n", " code title \\\n", "139 ECON200 Principles of Microeconomics \n", "140 ECON201 Principles of Macroeconomics \n", "141 ECON230 Applied Economic Statistics \n", "142 ECON305 Intermediate Macroeconomic Theory and Policy \n", "143 ECON306 Intermediate Microeconomic Theory & Policy \n", "\n", " description \\\n", "139 Introduces economic models used to analyze ec... \n", "140 An introduction to how market economies behav... \n", "141 Introductory course to develop understanding ... \n", "142 Analysis of the determination of national inc... \n", "143 Analysis of the theories of consumer behavior... \n", "\n", " prereqs credits prereq_type \\\n", "139 MATH107 or MATH110; or must have math eligibi... 3 Flexible \n", "140 MATH107 or MATH110; or must have math eligibi... 3 Flexible \n", "141 Must have math eligibility of MATH113 or high... 3 Flexible \n", "142 Minimum grade of C- in ECON201 and ECON200. A... 3 Flexible \n", "143 1 course with a minimum grade of C- from (ECO... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "139 ECON 1 0 0 \n", "140 ECON 1 0 0 \n", "141 ECON 1 0 0 \n", "142 ECON 1 0 0 \n", "143 ECON 1 0 0 \n", "2.984375\n", "\n", "\n", "SPHL\n", " code title \\\n", "385 SPHL100 Foundations of Public Health \n", "386 SPHL291 Can we move beyond medication? Examining yoga... \n", "387 SPHL333 Fundamentals of Undergraduate Teaching for Ed... \n", "388 SPHL600 Foundations of Public Health \n", "389 SPHL610 Program and Policy Planning, Implementation, ... \n", "\n", " description prereqs credits \\\n", "385 An overview of the goals, functions, and meth... None 3 \n", "386 Does yoga improve the health of wounded warri... None 3 \n", "387 Supports the professional and personal develo... 
None 1 \n", "388 An overview of the goals, functions, and meth... None 3 \n", "389 This second course in the MPH/MHA integrated ... None 5 \n", "\n", " prereq_type area has_prereqs is_intro is_entrypoint \n", "385 None SPHL 0 0 0 \n", "386 None SPHL 0 0 0 \n", "387 None SPHL 0 0 0 \n", "388 None SPHL 0 0 0 \n", "389 None SPHL 0 0 0 \n", "2.4285714285714284\n", "\n", "\n", "INST\n", " code title \\\n", "218 INST126 Introduction to Programming for Information S... \n", "219 INST155 Social Networking \n", "220 INST201 Introduction to Information Science \n", "221 INST311 Information Organization \n", "222 INST314 Statistics for Information Science \n", "\n", " description \\\n", "218 An introduction to computer programming for s... \n", "219 Introduces methods for analyzing and understa... \n", "220 Examining the effects of new information tech... \n", "221 Examines the theories, concepts, and principl... \n", "222 Basic concepts in statistics including measur... \n", "\n", " prereqs credits prereq_type \\\n", "218 Minimum grade of C- in MATH115; or must have ... 3 Flexible \n", "219 None 3 None \n", "220 None 3 None \n", "221 None 3 None \n", "222 Minimum grade of C- in STAT100 and MATH115 (o... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "218 INST 1 1 0 \n", "219 INST 0 0 0 \n", "220 INST 0 1 1 \n", "221 INST 0 0 0 \n", "222 INST 1 0 0 \n", "3.0\n", "\n", "\n", "PSYC\n", " code title \\\n", "347 PSYC123 The Psychology of Getting Hired \n", "348 PSYC200 Statistical Methods in Psychology \n", "349 PSYC221 Social Psychology \n", "350 PSYC300 Research Methods in Psychology Laboratory \n", "351 PSYC301 Biological Basis of Behavior \n", "\n", " description \\\n", "347 Designed to introduce students to the science... \n", "348 A basic introduction to quantitative methods ... \n", "349 The influence of social factors on the indivi... \n", "350 A general introduction and overview to the fu... \n", "351 Recent advances in neuroscience are radically... 
\n", "\n", " prereqs credits prereq_type \\\n", "347 None 1 None \n", "348 PSYC100. And 1 course with a minimum grade of... 3 Flexible \n", "349 PSYC100. 3 Hard \n", "350 PSYC200. 4 Hard \n", "351 PSYC100. And BSCI170 and BSCI171; or BSCI105. 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "347 PSYC 0 0 0 \n", "348 PSYC 1 0 0 \n", "349 PSYC 1 0 0 \n", "350 PSYC 1 0 0 \n", "351 PSYC 1 0 0 \n", "3.026315789473684\n", "\n", "\n" ] } ], "source": [ "# get all the unique area values\n", "course_areas = set(courses['area'].values)\n", "\n", "# iterate through each unique area value\n", "for area in course_areas:\n", "    print(area)\n", "    # get the subset of the course data that is associated with this area\n", "    area_df = courses[courses['area'] == area]\n", "    print(area_df.head())\n", "    # summarize the credits for this subset of the dataframe\n", "    print(area_df['credits'].mean())\n", "    print(\"\\n\")" ] },
{ "cell_type": "markdown", "metadata": { "id": "lCpDzfsN-CLO" }, "source": [ "The `.groupby()` way" ] },
{ "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 619, "status": "ok", "timestamp": 1620653002516, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "FpCRlL8a-HYh", "outputId": "b9c2c137-a6b3-4207-898c-b73696da34f7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AMST\n", " code title \\\n", "0 AMST101 Introduction American Studies \n", "1 AMST298C Introduction to Asian American Studies \n", "2 AMST340 Introduction to History, Theories and Methods... \n", "3 AMST418N Asian American Public Policy \n", "4 AMST450 Seminar in American Studies \n", "\n", " description \\\n", "0 Introduces students to the interdisciplinary ... \n", "1 The aggregate experience of Asian Pacific Ame... 
\n", "2 Introduction to the process of interdisciplin... \n", "3 \n", "4 Developments in theories and methods of Ameri... \n", "\n", " prereqs credits prereq_type \\\n", "0 None 3 None \n", "1 None 3 None \n", "2 Must have completed AMST201; and 2 courses in... 3 Hard \n", "3 None 3 None \n", "4 AMST201 and AMST340; and 1 course in AMST. 3 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "0 AMST 0 1 1 \n", "1 AMST 0 1 1 \n", "2 AMST 1 1 0 \n", "3 AMST 0 0 0 \n", "4 AMST 1 0 0 \n", "3.0\n", "BMGT\n", " code title \\\n", "9 BMGT190H Introduction to Design and Quality \n", "10 BMGT210 Foundations of Accounting for Non Business Ma... \n", "11 BMGT302 Designing Applications for Business Analytics \n", "12 BMGT310 Intermediate Accounting I \n", "13 BMGT311 Intermediate Accounting II \n", "\n", " description \\\n", "9 QUEST students learn and apply design practic... \n", "10 Provides an understanding of the common state... \n", "11 Provides an introduction to structured progra... \n", "12 Comprehensive analysis of financial accountin... \n", "13 Continuation of BMGT310. \n", "\n", " prereqs credits prereq_type \\\n", "9 None 4 None \n", "10 None 3 None \n", "11 BMGT301; or permission of BMGT-Robert H. Smit... 3 Flexible \n", "12 BMGT221. 3 Hard \n", "13 BMGT310. 3 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "9 BMGT 0 1 1 \n", "10 BMGT 0 0 0 \n", "11 BMGT 1 0 0 \n", "12 BMGT 1 0 0 \n", "13 BMGT 1 0 0 \n", "3.0377358490566038\n", "CMSC\n", " code title \\\n", "62 CMSC122 Introduction to Computer Programming via the ... \n", "63 CMSC131 Object-Oriented Programming I \n", "64 CMSC132 Object-Oriented Programming II \n", "65 CMSC133 Object Oriented Programming I Beyond Fundamen... \n", "66 CMSC216 Introduction to Computer Systems \n", "\n", " description \\\n", "62 Introduction to computer programming in the c... \n", "63 Introduction to programming and computer scie... \n", "64 Introduction to use of computers to solve pro... 
\n", "65 An introduction to computer science and objec... \n", "66 Introduction to the interaction between user ... \n", "\n", " prereqs credits prereq_type \\\n", "62 None 3 None \n", "63 None 4 None \n", "64 Minimum grade of C- in CMSC131; or must have ... 4 Flexible \n", "65 None 2 None \n", "66 Minimum grade of C- in CMSC132; and minimum g... 4 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "62 CMSC 0 1 1 \n", "63 CMSC 0 0 0 \n", "64 CMSC 1 0 0 \n", "65 CMSC 0 0 0 \n", "66 CMSC 1 1 0 \n", "3.0\n", "COMM\n", " code title \\\n", "108 COMM200 Critical Thinking and Speaking \n", "109 COMM331 News Writing and Reporting for Public Relations \n", "110 COMM332 News Editing for Public Relations \n", "111 COMM351 Public Relations Techniques \n", "112 COMM353 New Media Writing for Public Relations \n", "\n", " description \\\n", "108 Theory and practice of persuasive discourse a... \n", "109 Writing and researching news and information ... \n", "110 Copy editing, graphic principles and processe... \n", "111 The techniques of public relations, including... \n", "112 Students learn the uses and influence of new ... \n", "\n", " prereqs credits prereq_type \\\n", "108 None 3 None \n", "109 COMM201; and must have completed or be concur... 3 Flexible \n", "110 Minimum grade of C- in COMM331; or students w... 3 Flexible \n", "111 COMM332. 3 Hard \n", "112 Minimum grade of C- in COMM351. 3 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "108 COMM 0 0 0 \n", "109 COMM 1 0 0 \n", "110 COMM 1 0 0 \n", "111 COMM 1 0 0 \n", "112 COMM 1 0 0 \n", "2.870967741935484\n", "ECON\n", " code title \\\n", "139 ECON200 Principles of Microeconomics \n", "140 ECON201 Principles of Macroeconomics \n", "141 ECON230 Applied Economic Statistics \n", "142 ECON305 Intermediate Macroeconomic Theory and Policy \n", "143 ECON306 Intermediate Microeconomic Theory & Policy \n", "\n", " description \\\n", "139 Introduces economic models used to analyze ec... 
\n", "140 An introduction to how market economies behav... \n", "141 Introductory course to develop understanding ... \n", "142 Analysis of the determination of national inc... \n", "143 Analysis of the theories of consumer behavior... \n", "\n", " prereqs credits prereq_type \\\n", "139 MATH107 or MATH110; or must have math eligibi... 3 Flexible \n", "140 MATH107 or MATH110; or must have math eligibi... 3 Flexible \n", "141 Must have math eligibility of MATH113 or high... 3 Flexible \n", "142 Minimum grade of C- in ECON201 and ECON200. A... 3 Flexible \n", "143 1 course with a minimum grade of C- from (ECO... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "139 ECON 1 0 0 \n", "140 ECON 1 0 0 \n", "141 ECON 1 0 0 \n", "142 ECON 1 0 0 \n", "143 ECON 1 0 0 \n", "2.984375\n", "ENSP\n", " code title \\\n", "203 ENSP102 Introduction to Environmental Policy \n", "204 ENSP250 Lawns in the Landscape: Environmental Hero or... \n", "205 ENSP305 Applied Quantitative Methods in Environmental... \n", "206 ENSP330 Introduction to Environmental Law \n", "207 ENSP342 Environmental Threats to Oceans and Coasts: T... \n", "\n", " description \\\n", "203 Second of two courses that introduce students... \n", "204 Examination of the lawn as an element in the ... \n", "205 Intended for students interested in pursuing ... \n", "206 An overview of environmental law, from its co... \n", "207 An interdisciplinary study of the challenges ... \n", "\n", " prereqs credits prereq_type \\\n", "203 None 3 None \n", "204 None 3 None \n", "205 BIOM301, ECON321, GEOG306, PSYC200, or SOCY20... 3 Flexible \n", "206 None 3 None \n", "207 None 3 None \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "203 ENSP 0 1 1 \n", "204 ENSP 0 0 0 \n", "205 ENSP 1 0 0 \n", "206 ENSP 0 1 1 \n", "207 ENSP 0 0 0 \n", "3.0\n", "ENTS\n", " code title \\\n", "209 ENTS630 The Economics of International Telecommunicat... 
\n", "210 ENTS632 Telecommunications Marketing Management \n", "211 ENTS635 Decision Support Methods for Telecommunicatio... \n", "212 ENTS641 Networks and Protocols II \n", "\n", " description prereqs credits \\\n", "209 Basic microeconomic principles used by teleco... None 3 \n", "210 Topics covered include strategic marketing, s... None 3 \n", "211 The aim of this course is to introduce manage... None 3 \n", "212 Techniques for the specification, design, ana... ENTS640. 3 \n", "\n", " prereq_type area has_prereqs is_intro is_entrypoint \n", "209 None ENTS 0 0 0 \n", "210 None ENTS 0 0 0 \n", "211 None ENTS 0 0 0 \n", "212 Hard ENTS 1 0 0 \n", "3.0\n", "INFM\n", " code title \\\n", "213 INFM605 Users and Use Context \n", "214 INFM612 Management Concepts and Principles for Inform... \n", "215 INFM620 Introduction to Strategic Information Managem... \n", "216 INFM700 Information Architecture \n", "217 INFM737 Information Management Capstone Experience \n", "\n", " description \\\n", "213 Use of information by individuals. Nature of ... \n", "214 Key aspects of management - focusing on plann... \n", "215 Strategic management is the comprehensive col... \n", "216 Principles and techniques of information orga... \n", "217 The Information Management Capstone Experienc... \n", "\n", " prereqs credits prereq_type \\\n", "213 None 3 None \n", "214 None 3 None \n", "215 INFM612; or LBSC631; or permission of instruc... 3 Flexible \n", "216 INFM603; or permission of instructor. 3 Flexible \n", "217 INFM736; and must have earned a minimum of 27... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "213 INFM 0 0 0 \n", "214 INFM 0 0 0 \n", "215 INFM 1 1 0 \n", "216 INFM 1 0 0 \n", "217 INFM 1 0 0 \n", "3.0\n", "INST\n", " code title \\\n", "218 INST126 Introduction to Programming for Information S... 
\n", "219 INST155 Social Networking \n", "220 INST201 Introduction to Information Science \n", "221 INST311 Information Organization \n", "222 INST314 Statistics for Information Science \n", "\n", " description \\\n", "218 An introduction to computer programming for s... \n", "219 Introduces methods for analyzing and understa... \n", "220 Examining the effects of new information tech... \n", "221 Examines the theories, concepts, and principl... \n", "222 Basic concepts in statistics including measur... \n", "\n", " prereqs credits prereq_type \\\n", "218 Minimum grade of C- in MATH115; or must have ... 3 Flexible \n", "219 None 3 None \n", "220 None 3 None \n", "221 None 3 None \n", "222 Minimum grade of C- in STAT100 and MATH115 (o... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "218 INST 1 1 0 \n", "219 INST 0 0 0 \n", "220 INST 0 1 1 \n", "221 INST 0 0 0 \n", "222 INST 1 0 0 \n", "3.0\n", "MATH\n", " code title \\\n", "265 MATH107 Introduction to Math Modeling and Probability \n", "266 MATH113 College Algebra and Trigonometry \n", "267 MATH115 Precalculus \n", "268 MATH120 Elementary Calculus I \n", "269 MATH121 Elementary Calculus II \n", "\n", " description \\\n", "265 A goal is to convey the power of mathematics ... \n", "266 Topics include elementary functions including... \n", "267 Preparation for MATH120, MATH130 or MATH140. ... \n", "268 Basic ideas of differential and integral calc... \n", "269 Differential and integral calculus, with emph... \n", "\n", " prereqs credits prereq_type \\\n", "265 Must have math eligibility of Math 107 or hig... 3 Flexible \n", "266 Must have math eligibility of MATH113 or high... 3 Flexible \n", "267 Must have math eligibility of MATH115 or high... 3 Flexible \n", "268 1 course with a minimum grade of C- from (MAT... 3 Flexible \n", "269 MATH120, MATH130, MATH136, or MATH140; or mus... 
3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "265 MATH 1 1 0 \n", "266 MATH 1 0 0 \n", "267 MATH 1 0 0 \n", "268 MATH 1 0 0 \n", "269 MATH 1 0 0 \n", "3.061224489795918\n", "PHSC\n", " code title \\\n", "314 PHSC401 History of Public Health \n", "315 PHSC412 Food, Policy, and Public Health \n", "316 PHSC415 Essentials of Public Health Biology: The Cell... \n", "317 PHSC440 Public Health Nutrition \n", "318 PHSC497 Public Health Science Capstone \n", "\n", " description \\\n", "314 Emphasis is on the history of public health i... \n", "315 Broad overview of the impact of food and food... \n", "316 Presents the basic scientific and biomedical ... \n", "317 Engages students in conceptual thinking about... \n", "318 The capstone course is the culminating experi... \n", "\n", " prereqs credits prereq_type \\\n", "314 None 3 None \n", "315 Must have completed HLSA300 with a C- or high... 3 Flexible \n", "316 Minimum grade of C- in BSCI202. 3 Hard \n", "317 A minimum of C- in BSCI170, BSCI171, CHEM131,... 3 Hard \n", "318 Must have completed the professional writing ... 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "314 PHSC 0 0 0 \n", "315 PHSC 1 0 0 \n", "316 PHSC 1 0 0 \n", "317 PHSC 1 0 0 \n", "318 PHSC 1 0 0 \n", "3.0\n", "PLCY\n", " code title \\\n", "319 PLCY101 Great Thinkers on Public Policy \n", "320 PLCY201 Public Leaders and Active Citizens \n", "321 PLCY215 Innovation and Social Change: Creating Change... \n", "322 PLCY301 Sustainability \n", "323 PLCY302 Examining Pluralism in Public Policy \n", "\n", " description prereqs credits \\\n", "319 Great ideas in public policy, such as equalit... None 3 \n", "320 Aims to inspire, teach and engage students in... None 3 \n", "321 A team-based, highly interactive and dynamic ... None 3 \n", "322 Designed for students whose academic majors w... None 3 \n", "323 Understanding pluralism and how groups and in... 
None 3 \n", "\n", " prereq_type area has_prereqs is_intro is_entrypoint \n", "319 None PLCY 0 0 0 \n", "320 None PLCY 0 0 0 \n", "321 None PLCY 0 0 0 \n", "322 None PLCY 0 0 0 \n", "323 None PLCY 0 0 0 \n", "3.0\n", "PSYC\n", " code title \\\n", "347 PSYC123 The Psychology of Getting Hired \n", "348 PSYC200 Statistical Methods in Psychology \n", "349 PSYC221 Social Psychology \n", "350 PSYC300 Research Methods in Psychology Laboratory \n", "351 PSYC301 Biological Basis of Behavior \n", "\n", " description \\\n", "347 Designed to introduce students to the science... \n", "348 A basic introduction to quantitative methods ... \n", "349 The influence of social factors on the indivi... \n", "350 A general introduction and overview to the fu... \n", "351 Recent advances in neuroscience are radically... \n", "\n", " prereqs credits prereq_type \\\n", "347 None 1 None \n", "348 PSYC100. And 1 course with a minimum grade of... 3 Flexible \n", "349 PSYC100. 3 Hard \n", "350 PSYC200. 4 Hard \n", "351 PSYC100. And BSCI170 and BSCI171; or BSCI105. 3 Flexible \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "347 PSYC 0 0 0 \n", "348 PSYC 1 0 0 \n", "349 PSYC 1 0 0 \n", "350 PSYC 1 0 0 \n", "351 PSYC 1 0 0 \n", "3.026315789473684\n", "SPHL\n", " code title \\\n", "385 SPHL100 Foundations of Public Health \n", "386 SPHL291 Can we move beyond medication? Examining yoga... \n", "387 SPHL333 Fundamentals of Undergraduate Teaching for Ed... \n", "388 SPHL600 Foundations of Public Health \n", "389 SPHL610 Program and Policy Planning, Implementation, ... \n", "\n", " description prereqs credits \\\n", "385 An overview of the goals, functions, and meth... None 3 \n", "386 Does yoga improve the health of wounded warri... None 3 \n", "387 Supports the professional and personal develo... None 1 \n", "388 An overview of the goals, functions, and meth... None 3 \n", "389 This second course in the MPH/MHA integrated ... 
None 5 \n", "\n", " prereq_type area has_prereqs is_intro is_entrypoint \n", "385 None SPHL 0 0 0 \n", "386 None SPHL 0 0 0 \n", "387 None SPHL 0 0 0 \n", "388 None SPHL 0 0 0 \n", "389 None SPHL 0 0 0 \n", "2.4285714285714284\n", "STAT\n", " code title \\\n", "392 STAT100 Elementary Statistics and Probability \n", "393 STAT400 Applied Probability and Statistics I \n", "394 STAT401 Applied Probability and Statistics II \n", "395 STAT410 Introduction to Probability Theory \n", "396 STAT420 Theory and Methods of Statistics \n", "\n", " description \\\n", "392 Simplest tests of statistical hypotheses; app... \n", "393 Random variables, standard distributions, mom... \n", "394 Point estimation - unbiased and consistent es... \n", "395 Probability and its properties. Random variab... \n", "396 Point estimation, sufficiency, completeness, ... \n", "\n", " prereqs credits prereq_type \\\n", "392 MATH110, MATH112, MATH113, or MATH115; or per... 3 Flexible \n", "393 1 course with a minimum grade of C- from (MAT... 3 Flexible \n", "394 1 course with a minimum grade of C- from (STA... 3 Hard \n", "395 1 course with a minimum grade of C- from (MAT... 3 Hard \n", "396 1 course with a minimum grade of C- from (SUR... 3 Hard \n", "\n", " area has_prereqs is_intro is_entrypoint \n", "392 STAT 1 0 0 \n", "393 STAT 1 0 0 \n", "394 STAT 1 0 0 \n", "395 STAT 1 1 0 \n", "396 STAT 1 0 0 \n", "3.0\n", "URSP\n", " code title \\\n", "407 URSP600 Research Design and Application \n", "408 URSP601 Research Methods \n", "409 URSP604 The Planning Process \n", "410 URSP606 Planning Economics \n", "411 URSP631 Transportation and Land Use \n", "\n", " description prereqs credits \\\n", "407 Techniques in urban research, policy analysis... None 3 \n", "408 Use of measurement, statistics, quantitative ... None 3 \n", "409 Legal framework for U.S. planning; approaches... None 3 \n", "410 Resource allocation in a market economy, the ... None 3 \n", "411 The interrelationship between transportation ... 
None 3 \n", "\n", " prereq_type area has_prereqs is_intro is_entrypoint \n", "407 None URSP 0 0 0 \n", "408 None URSP 0 0 0 \n", "409 None URSP 0 0 0 \n", "410 None URSP 0 0 0 \n", "411 None URSP 0 0 0 \n", "3.0\n" ] } ], "source": [ "# use groupby to split the courses df into subsets grouped by area\n", "# we can iterate through the resulting collection of dataframe subsets\n", "# where each step in the iteration allows us to grab\n", "# 1. the name of the subset, which is the shared value (in this case area)\n", "# 2. the subset dataframe (here called areaDF)\n", "for area, areaDF in courses.groupby('area'):\n", "    print(area)\n", "    print(areaDF.head())\n", "    print(areaDF['credits'].mean()) # starting to get into the \"apply\"" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "courses.head()" ] },
{ "cell_type": "markdown", "metadata": { "id": "TI8gquPh-U0q" }, "source": [ "### Apply and Combine" ] },
{ "cell_type": "markdown", "metadata": { "id": "JMpzeZt6-Y5o" }, "source": [ "#### The \"manual\" way: apply and combine into a new dataframe we construct from scratch" ] },
{ "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 955 }, "executionInfo": { "elapsed": 506, "status": "ok", "timestamp": 1620654326212, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "uJ-duLT0vAiE", "outputId": "28036dbb-5a4e-44bc-8df4-12dd27bbc185" }, "outputs": [ { "data": { "text/plain": [ " area num_entrypoints num_classes entry_point_ratio\n", "0 AMST 2 9 0.222222\n", "1 BMGT 1 53 0.018868\n", "2 CMSC 1 46 0.021739\n", "3 COMM 0 31 0.000000\n", "4 ECON 0 64 0.000000\n", "5 ENSP 2 6 0.333333\n", "6 ENTS 0 4 0.000000\n", "7 INFM 0 5 0.000000\n", "8 INST 4 47 0.085106\n", "9 MATH 0 49 0.000000\n", "10 PHSC 0 5 0.000000\n", "11 PLCY 0 28 0.000000\n", "12 PSYC 1 38 0.026316\n", "13 SPHL 0 7 0.000000\n", "14 STAT 0 15 0.000000\n", "15 URSP 0 7 0.000000" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create an empty list to hold the new rows of the new COMBINED dataframe\n", "eba = []\n", "# SPLIT the dataframe by area, and iterate through each split\n", "for area, areaDF in courses.groupby('area'):\n", "\n", "    # APPLY operations on the dataframe split\n", "    # count the number of entry point courses in the area\n", "    num_entrypoints = areaDF['is_entrypoint'].sum()\n", "\n", "    # count the total number of courses in the area\n", "    num_classes = len(areaDF)\n", "\n", "    # compute the ratio of entry points to total classes\n", "    entrypoint_ratio = num_entrypoints/num_classes\n", "\n", "    # prepare a new row to put into the COMBINED dataframe that has areas as rows instead of courses\n", "    # the row is represented as a dictionary, where each key is a column\n", "    entry = {\n", "        'area': area, # each row is an area\n", "        'num_entrypoints': num_entrypoints,\n", "        'num_classes': num_classes,\n", "        'entry_point_ratio': entrypoint_ratio\n", "    }\n", "    # COMBINE the resulting subcomputation into a new dataset\n", "    eba.append(entry)\n", "# convert the list of new entries into a dataframe\n", "eba = pd.DataFrame(eba)\n", "eba" ] },
{ "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " area num_courses\n", "0 AMST 9\n", "1 BMGT 53\n", "2 CMSC 46\n", "3 COMM 31\n", "4 ECON 64\n", "5 ENSP 6\n", "6 ENTS 4\n", "7 INFM 5\n", "8 INST 47\n", "9 MATH 49\n", "10 PHSC 5\n", "11 PLCY 28\n", "12 PSYC 38\n", "13 SPHL 7\n", "14 STAT 15\n", "15 URSP 7" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# how many classes are in each area?\n", "# make a data frame that summarizes it\n", "\n", "# make a list to hold the combined data by subgroup\n", "summarized = []\n", "\n", "# SPLIT-APPLY-COMBINE\n", "# SPLIT the dataset by area\n", "for area, areaData in courses.groupby('area'):\n", "    # APPLY a computation to get the number of courses\n", "    # which is just the number of rows in this subset dataframe\n", "    num_courses = len(areaData)\n", "    entry = {\n", "        'area': area,\n", "        'num_courses': num_courses\n", "    }\n", "    # add it to the COMBINED summarized dataset\n", "    summarized.append(entry)\n", "\n", "# turn the list of dictionaries into a dataframe\n", "summarized = pd.DataFrame(summarized)\n", "summarized" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 453, "status": "ok", "timestamp": 1620654361760, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "70JrXX2rfydd", "outputId": "efa8ae3b-898e-40f1-c53c-9076733ff8ee" }, "outputs": [], "source": [ "# iterate through the eba dataframe, row by row,\n", "# printing each area alongside its entry point ratio\n", "for index, row in eba.iterrows():\n", "    print(row['area'])\n", "    print(row['entry_point_ratio'])" ] },
{ "cell_type": "markdown", "metadata": { "id": "oY7YIBwGiECP" }, "source": [ "We can also define a whole new function, much more complicated than a simple sum or mean, to APPLY to each subset of our SPLIT.\n", "\n", "For example, we can write a function that 
takes in a prereq description and extracts all the prereq classes mentioned in the description." ] },
{ "cell_type": "code", "execution_count": 36, "metadata": { "id": "jTZ4wdV8-YQm" }, "outputs": [], "source": [ "def extract_prereqs(descr):\n", "    prereqs = []\n", "    for word in descr.split():\n", "        # clean the word: keep only letters and digits\n", "        clean_word = \"\"\n", "        for char in word:\n", "            if char.isalpha() or char.isdigit():\n", "                clean_word += char\n", "        # a course code is 4 letters followed by 3 digits (e.g., INST126)\n", "        if len(clean_word) == 7 and clean_word[:4].isalpha() and clean_word[-3:].isdigit():\n", "            prereqs.append(clean_word)\n", "    return prereqs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for area, areaDF in courses.groupby('area'):\n", "\n", "    # collect the unique prereq classes mentioned across this area\n", "    area_prereqs = set()\n", "    for prereq in areaDF['prereqs']:\n", "        prereqs = extract_prereqs(prereq)\n", "        area_prereqs.update(prereqs)\n", "    print(area, area_prereqs)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 973, "status": "ok", "timestamp": 1620653937432, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "RmgvzqpGfVpx", "outputId": "2cb67d89-0f3c-4c8c-af7f-095054dba2c9" }, "outputs": [], "source": [ "for prereq_descr in courses[courses['has_prereqs'] == 1]['prereqs'].values:\n", "    print(prereq_descr)\n", "    print(extract_prereqs(prereq_descr))" ] },
{ "cell_type": "markdown", "metadata": { "id": "tlWe4YWQ0HIc" }, "source": [ "#### Shortcut apply-combine with `.agg()`" ] },
{ "cell_type": "markdown", "metadata": { "id": "buVP8v1Ly3xu" }, "source": [ "A more concise way to apply and combine is to chain the `.agg()` method onto a `.groupby()` object to tell pandas to *aggregate* particular columns in particular ways (e.g., count the number of entry point courses in a given department, vs. give an average *proportion* of classes that are entry points)."
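, "\n", "\n", "As a hedged illustration of the two aggregations mentioned above (a count versus a proportion), here is a minimal sketch on a tiny made-up dataframe. The name `toy_courses` and its values are assumptions for illustration only; they are not the real course data. Because `is_entrypoint` is coded as 0/1, its sum is the number of entry point courses and its mean is the proportion of courses that are entry points.

```python
import pandas as pd

# hypothetical toy data standing in for the courses dataframe
toy_courses = pd.DataFrame({
    'area': ['INST', 'INST', 'INST', 'MATH', 'MATH'],
    'is_entrypoint': [1, 0, 1, 0, 0],
})

# SPLIT by area, then APPLY and COMBINE in one step with named aggregation:
# new_column_name=(column_to_aggregate, aggregation_function)
toy_summary = toy_courses.groupby('area', as_index=False).agg(
    num_entrypoints=('is_entrypoint', 'sum'),    # count of entry point courses
    prop_entrypoints=('is_entrypoint', 'mean'),  # proportion that are entry points
)
print(toy_summary)
```

Each keyword argument to `.agg()` becomes a column in the result, so more summaries can be added by adding more `name=(column, function)` pairs."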
] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 539 }, "executionInfo": { "elapsed": 372, "status": "ok", "timestamp": 1620655141089, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "DmfjhlAapgYX", "outputId": "df851456-cc01-4408-a00c-bdad8ac69b8b" }, "outputs": [ { "data": { "text/plain": [ " area num_entrypoints num_classes\n", "0 AMST 2 9\n", "1 BMGT 1 53\n", "2 CMSC 1 46\n", "3 COMM 0 31\n", "4 ECON 0 64\n", "5 ENSP 2 6\n", "6 ENTS 0 4\n", "7 INFM 0 5\n", "8 INST 4 47\n", "9 MATH 0 49\n", "10 PHSC 0 5\n", "11 PLCY 0 28\n", "12 PSYC 1 38\n", "13 SPHL 0 7\n", "14 STAT 0 15\n", "15 URSP 0 7" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# SPLIT by area\n", "courses.groupby(\"area\", as_index=False).agg(\n", "    # create a new column named num_entrypoints,\n", "    # and give it the value of the sum function applied to the is_entrypoint column\n", "    # APPLY these computations and COMBINE into a new data frame using .agg\n", "    num_entrypoints=('is_entrypoint', \"sum\"),\n", "    num_classes=('area', \"count\")\n", ")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "6D-6Ec5nk4JG" }, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": { "id": "S5MEruDm0Q3p" }, "source": [ "Anatomy of combining `.groupby()` with `.agg()`" ] },
{ "cell_type": "markdown", "metadata": { "id": "tAf8E2vQ0WvP" }, "source": [ "![image.png]()" ] },
{ "cell_type": "markdown", "metadata": { "id": "uyAbfUETE1BE" }, "source": [ "This pattern is explained in the section \"Recommended: Tuple Named Aggregations\" in this article: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 539 }, "executionInfo": { "elapsed": 274, "status": "ok", "timestamp": 1620655422600, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "s2x6hyKOCHRQ", "outputId": "636fe607-e1c1-4ff9-d11f-c9ec0a1e14d9" }, "outputs": [], "source": [ "# group the courses by the area column (and make sure that they show 
up as columns in the resulting dataframe)\n", "# then apply the functions listed in .agg() to each subgroup\n", "# and stitch the results back into a dataframe that we'll put into the entrypoints_by_area variable\n", "entrypoints_by_area = courses.groupby(\"area\", as_index=False).agg(\n", "    # create a new column named num_entrypoints,\n", "    # and give it the value of the sum function applied to the is_entrypoint column\n", "    num_entrypoints=('is_entrypoint', \"sum\"),\n", "    num_classes=('area', \"count\")\n", ")\n", "entrypoints_by_area" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 539 }, "executionInfo": { "elapsed": 714, "status": "ok", "timestamp": 1620651701802, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "zJ5YFX-GD3lm", "outputId": "008daaff-493d-4e6e-8d47-5d386e00513c" }, "outputs": [], "source": [ "# let's now compute the proportion of entry point classes, as a proxy for \"openness\"\n", "\n", "# step 1: define the function\n", "def openness(row):\n", "    return row['num_entrypoints']/row['num_classes']\n", "\n", "# step 2: apply the function and save the results\n", "entrypoints_by_area['openness'] = entrypoints_by_area.apply(openness, axis=1)\n", "\n", "entrypoints_by_area" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# what are some fun groupbys we can do on the other datasets?\n", "\n", "# e.g., for donations, we can do average and sum and range, etc. by team\n", "# or number of wins by conference in ncaa\n", "# or total sales per hour of day for bread" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In my experience, this works really well for standard analysis tasks, but is less flexible than the manual approach. 
This approach would be tough to adapt for the bread Project 4, for example. I also like teaching the manual approach first for getting an intuition for what is happening under the hood." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving data / results for later analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We often want to save the results of our analysis for later. This can be done in a few different ways (depending on what file format will be useful later, such as `json`, `html`, `xlsx` (Excel spreadsheets), or `csv`). \n", "\n", "In this class, we'll practice saving to `csv`, a common file format for data (the same one you practiced reading into pandas!)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# example of saving the entrypoints_by_area dataframe to a csv file\n", "entrypoints_by_area.to_csv(\"outputs/entrypoints_by_area.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.to_csv()` method can take a number of optional arguments to control what happens, but the only one you **need** for saving to a file is the file path to where the csv file will be saved (similar to what you need when you want to write to a file with `open()`). If you leave the path out, pandas returns the csv text as a string instead of writing a file.\n", "\n", "In this example, we're saving the `entrypoints_by_area` dataframe to the file `entrypoints_by_area.csv` in the `outputs/` folder.\n", "Just make sure (as with files) that the folder actually exists before you try to put a file there!" ] }, { "cell_type": "markdown", "metadata": { "id": "VJtL4Q1p1RVn" }, "source": [ "## Extras\n", "\n", "This is stuff we may not get to in class but is available because it may be useful for your projects and beyond (though you can certainly solve most of Project 4 without these)." 
] }, { "cell_type": "markdown", "metadata": { "id": "7vL4TajexuKQ" }, "source": [ "### Use `.value_counts()` to summarize categorical data in your dataframe" ] }, { "cell_type": "markdown", "metadata": { "id": "i2EsBiOj6yZe" }, "source": [ "Last week we learned how to compute some basic statistics, overall, and by column, for quantitative data. Today, we'll learn how to use `value_counts()` to quickly summarize *categorical* data.\n", "\n", "`.value_counts()` does exactly what you think it might do based on the name: it counts the frequency of each unique value in a column! In other words, it gives us a way to count how many times each value shows up in a column. In this way, it's kinda similar to the basic \"count-based\" indexing we did in Module 3.\n", "\n", "*Hint: this could be useful for Problem 4 for Project 4!*\n", "\n", "Here's an example for the courses data: how many times does each \"area\" show up?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 522, "status": "ok", "timestamp": 1620048143838, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "4_1N-Nl1691T", "outputId": "aecc7744-1308-4632-cf71-8b4c91ed2fa8" }, "outputs": [], "source": [ "# access the area column in the courses dataframe\n", "# and apply the value_counts method to that column\n", "courses['area'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntax here is:\n", "\n", "nameOfDataFrame['nameOfColumn'].value_counts()\n", "\n", "`value_counts()` is a method that a Pandas *series* (i.e., column in a dataframe) data structure can do (again, make the connection back to `.append()` for lists, and `.split()` for strings).\n", "\n", "Let's try some other queries!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# for ncaa dataset\n", "# how many entries do we have for each conference?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# for bls data\n", "# how many entries do we have for each category?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.value_counts()` returns a series, which has nice properties of both lists and dictionaries.\n", "\n", "Like lists, we can sort it using the `.sort_values()` method, though we need to make sure to either force it to run \"in place\" (with `inplace=True` as an argument for `.sort_values()`), or save it to a variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "area_counts = courses['area'].value_counts() # save the counts so we can reuse them\n", "area_counts.sort_values(ascending=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also access items by index position, which allows us to get the first thing, or the first 5 things, or the last 5 things, etc." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the first value in the series (by position, using .iloc)\n", "# note: you only get the value, not the \"name\"\n", "area_counts.iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "area_counts[:5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "area_counts = courses['area'].value_counts()\n", "# like a cross between a dictionary and a list\n", "# can get value by named key like a dict\n", "print(\"INST\", area_counts['INST'])\n", "print(\"most frequent item count\", area_counts.iloc[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1359, "status": "ok", "timestamp": 1620048245583, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "DW8eZc8P7r6i", "outputId": "942f1b0c-2ea3-4c7a-a05e-5ea9fe46c8c7" }, "outputs": [], "source": [ "area_counts.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 905, "status": "ok", "timestamp": 1620048238418, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "YvzfwMKzZQKb", "outputId": "b37d6a7b-0f9e-4e8b-ee4f-3c38504ea450" }, "outputs": [], "source": [ "# let's say we want the top 5 most populous areas\n", "# we can slice/subset the series just like a list\n", "# and then get the keys from that subset\n", "area_counts[:5].keys()" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 203 }, "executionInfo": { "elapsed": 696, "status": "ok", "timestamp": 1620048479492, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "hyc_2Yqf78Dz", "outputId": "9221a89b-f03f-43db-dfd8-6985032d0741" }, "outputs": [], "source": [ "# let's try with the other datasets!\n", "# ncaa-team-data\n", "# bls-by-category\n", "# BreadBasket_DMS\n", "bread = pd.read_csv(f'{folder}/BreadBasket_DMS.csv')\n", "bread.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 987, "status": "ok", "timestamp": 1620048666329, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "xYy94pjPakws", "outputId": "3768cc28-0534-4155-de0a-be955bd63d1c" }, "outputs": [], "source": [ "# how do we get the frequency counts for items in the bread dataframe?\n", "bread['Item'].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "dXEkLZn0Tbgp" }, "source": [ "### Plotting\n" ] }, { "cell_type": "markdown", "metadata": { "id": "a2LwuPxI-yOJ" }, "source": [ "The main library for plotting in Python is `matplotlib`. It has lots of fine-grained controls, and you can learn it later.\n", "\n", "For now, you can use the pandas \"wrapper\" over matplotlib (the `.plot()` method, which basically calls matplotlib from inside pandas), which is a bit easier to learn." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 539 }, "executionInfo": { "elapsed": 667, "status": "ok", "timestamp": 1620656070972, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "KuRl5jUEoQOI", "outputId": "a79498f6-a04d-4a0d-e862-3dc0b65e00d7" }, "outputs": [], "source": [ "entrypoints_by_area" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 539 }, "executionInfo": { "elapsed": 482, "status": "ok", "timestamp": 1620656189955, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "rAHhcEk7oc0K", "outputId": "65dce514-83e7-43e9-9752-f3772b2698e3" }, "outputs": [], "source": [ "def openness(row):\n", " return row['num_entrypoints']/row['num_classes']\n", "\n", "entrypoints_by_area['openness'] = entrypoints_by_area.apply(openness, axis=1)\n", "entrypoints_by_area" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 539 }, "executionInfo": { "elapsed": 251, "status": "ok", "timestamp": 1620656239902, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "eZmRKyJlIN-w", "outputId": "46263d4c-81df-4575-c521-2b0e07256a90" }, "outputs": [], "source": [ "# sort the data by the openness column\n", "# make sure we assign to the entry points variable again so we don't lose it (bc .sort_values() returns a new sorted dataframe and leaves the original unchanged, unless we pass inplace=True)\n", "entrypoints_by_area = 
entrypoints_by_area.sort_values(by=\"openness\", ascending=False)\n", "entrypoints_by_area" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 335 }, "executionInfo": { "elapsed": 565, "status": "ok", "timestamp": 1620656508606, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "C-31_pvp1g4d", "outputId": "45a933ca-f52a-44a8-ac9f-3195de2f28f9" }, "outputs": [], "source": [ "# plot openness by area\n", "entrypoints_by_area.plot(\n", " x=\"area\", \n", " y=\"openness\", \n", " kind='bar', \n", " xlabel=\"AREA\", \n", " ylabel=\"Proportion of entry point classes\",\n", " title=\"Classes openness by area\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "executionInfo": { "elapsed": 899, "status": "ok", "timestamp": 1620656550256, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "kw-gJsSkIZ55", "outputId": "fec28f9d-108f-4ec2-adbe-d12b046a5ea0" }, "outputs": [], "source": [ "entrypoints_by_area.plot(y=\"openness\", kind=\"hist\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 319 }, "executionInfo": { "elapsed": 675, "status": "ok", "timestamp": 1620656744053, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "kI4USaEwqM0U", "outputId": "ee49724e-e803-4230-b3bb-efc530dbcdca" }, "outputs": [], "source": [ "entrypoints_by_area.sort_values(by=\"num_classes\", 
ascending=False).plot(x=\"area\", y=\"num_classes\", kind=\"bar\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Eng-xmqohx28" }, "source": [ "## Reminder: More resources" ] }, { "cell_type": "markdown", "metadata": { "id": "G_Tyt430hx29" }, "source": [ "The pandas website is a decent place to start: https://pandas.pydata.org/\n", "\n", "This \"cheat sheet\" is also a really helpful guide to more common operations that you may run into later: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf\n", "\n", "There are also many blogs that are helpful, like towardsdatascience.com\n", "\n", "The cool thing about pandas and data analysis in python is that many people share notebooks that you can inspect / learn from / adapt code for your own projects (just like mine!)." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "INST126_SP21_Week15-16_Pandas-2.ipynb", "provenance": [ { "file_id": "1kJy7BqKo1Edp7m9QOPL43MWZ5VRsTzXl", "timestamp": 1606742957063 } ], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }