{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "NxDEC90LqWMi" }, "source": [ "# 11: Pandas for data analysis with Python: Part 2\n", "\n", "## Learning objectives\n", "\n", "Last week, we learned:\n", "- Pandas is a library in Python that is designed for data manipulation and analysis\n", "- How to use libraries (import them, access their functions and data structures with `library.function_name()`)\n", "- About the `dataframe` data structure: basically a smart spreadsheet, with rows of observations, and columns of variables/data for each observation - sort of a cross between a list (sortable, indexable) and a dictionary (quickly access data by key)\n", "- Some basic operations: constructing a dataframe, summarizing, subsetting, reshaping\n", "\n", "This week, we'll mostly dig into more advanced operations for reshaping/modifying your dataframe:\n", "- Use `.apply()` to apply functions to one or more columns to generate new columns\n", "- Use `.groupby()` to split your data into subgroups, apply some function to their data, then combine them into a new dataframe for further analysis (the \"**split-apply-combine**\" pattern that is fundamental to data analysis with pandas)\n", "- Use some basic plotting functions to explore your data\n", "\n", "These roughly correspond to Qs 6-8 in your PCEs.\n", "\n", "If we have time, we'll learn a bit more about summarization:\n", "- Use `.value_counts()` to summarize categorical data\n", "- How to plot data" ] }, { "cell_type": "markdown", "metadata": { "id": "-Jy7Hmo3Tbgo" }, "source": [ "## Creating/modifying data columns based on one or more columns using `.apply()`" ] }, { "cell_type": "markdown", "metadata": { "id": "8cekigZU82xM" }, "source": [ "More advanced operations on dataframes involve modifying or creating new columns!" ] }, { "cell_type": "markdown", "metadata": { "id": "DAHrk-uWTbgo" }, "source": [ "In data analysis, we often want to do things to data in our columns for data preparation/cleaning. 
Sometimes there is missing data we want to recode, or we want to redescribe data or reclassify it for our analysis. We can do this with a combination of functions and the `.apply()` method.\n", "\n", "In this way, again making a connection back to lists, `.apply()` is a little like the `map()` function that we used for lists to transform items from one list to another list with equal length (e.g., convert scores to letter grades)." ] }, { "cell_type": "markdown", "metadata": { "id": "nxl6erzh8G3a" }, "source": [ "### `.apply()` with a single column" ] }, { "cell_type": "markdown", "metadata": { "id": "f_b6BhOauQpU" }, "source": [ "The simpler version of `.apply()` only takes input from one column.\n", "\n", "To illustrate, let's do some operations on the dataset `INST courses.csv`. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 271 }, "executionInfo": { "elapsed": 4181, "status": "ok", "timestamp": 1620651417488, "user": { "displayName": "Joel Chan", "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GiBPPpBf_QqgDL3pMurAsPu9WJJE_x_6UtgW13UFQ=s64", "userId": "15153559228409906865" }, "user_tz": 240 }, "id": "V3IHERsSssMe", "outputId": "478edb38-7cd5-4cba-83ee-9e931f8c80c0" }, "outputs": [ { "data": { "text/html": [ "
| | Code | Title | Description | Prereqs | Credits |
|---|---|---|---|---|---|
| 0 | INST126 | Introduction to Programming for Information Sc... | An introduction to computer programming for st... | Minimum grade of C- in MATH115; or must have m... | 3.0 |
| 1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | None | 3.0 |
| 2 | INST311 | Information Organization | Examines the theories, concepts, and principle... | Must have completed or be concurrently enrolle... | 3.0 |
| 3 | INST314 | Statistics for Information Science | Basic concepts in statistics including measure... | Must have completed or be concurrently enrolle... | 3.0 |
| 4 | INST326 | Object-Oriented Programming for Information Sc... | An introduction to programming, emphasizing un... | 1 course with a minimum grade of C- from (INST... | 3.0 |
| 5 | INST327 | Database Design and Modeling | Introduction to databases, the relational mode... | 1 course with a minimum grade of C- from (CMSC... | 3.0 |
| 6 | INST335 | Teams and Organizations | Team development and the principles, methods a... | 1 course with a minimum grade of C- from (INST... | 3.0 |
| 7 | INST346 | Technologies Infrastructure and Architecture | Examines the basic concepts of local and wide-... | 1 course with a minimum grade of C- from (INST... | 3.0 |
| 8 | INST352 | Information User Needs and Assessment | Focuses on use of information by individuals, ... | 1 course with a minimum grade of C- from (INST... | 3.0 |
| 9 | INST354 | Decision-Making for Information Science | Examines the use of information in organizatio... | INST314. | 3.0 |
| 10 | INST362 | User-Centered Design | Introduction to human-computer interaction (HC... | 1 course with a minimum grade of C- from (INST... | 3.0 |
| 11 | INST377 | Dynamic Web Applications | An exploration of the basic methods and tools ... | INST327. | 3.0 |
| 12 | INST408Y | Special Topics in Information Science; Privacy... | | NaN | NaN |
| 13 | INST408Z | Special Topics in Information Science; The Apo... | | NaN | NaN |
| 14 | INST414 | Data Science Techniques | An exploration of how to extract insights from... | INST314. | 3.0 |
| 15 | INST447 | Data Sources and Manipulation | Examines approaches to locating, acquiring, ma... | INST326 or CMSC131; and INST327. | 3.0 |
| 16 | INST462 | Introduction to Data Visualization | Exploration of the theories, methods, and tech... | INST314. | 3.0 |
| 17 | INST466 | Technology, Culture, and Society | Individual, cultural, and societal outcomes as... | INST201. | 3.0 |
| 18 | INST490 | Integrated Capstone for Information Science | The capstone provides a platform for Informati... | Minimum grade of C- in INST314, INST335, INST3... | 3.0 |
| 19 | INST604 | Introduction to Archives and Digital Curation | Overview of the principles, practices, and app... | None | 3.0 |
| 20 | INST612 | Information Policy | Nature, structure, development and application... | None | 3.0 |
| 21 | INST614 | Literacy and Inclusion | The educational and psychological dimensions o... | None | 3.0 |
| 22 | INST616 | Open Source Intelligence | An introduction to Open Source Intelligence (O... | None | 3.0 |
| 23 | INST622 | Information and Universal Usability | Information services and technologies to provi... | None | 3.0 |
| 24 | INST627 | Data Analytics for Information Professionals | Skills and knowledge needed to craft datasets,... | None | 3.0 |
| 25 | INST630 | Introduction to Programming for the Informatio... | An introduction to computer programming intend... | None | 3.0 |
| 26 | INST652 | Design Thinking and Youth | Methods of design thinking specifically within... | None | 3.0 |
| 27 | INST702 | Advanced Usability Testing | Usability testing methods -- how to design and... | Permission of instructor; or (INFM605 or INST6... | 3.0 |
| 28 | INST709 | Independent Study | | NaN | NaN |
| 29 | INST728G | Special Topics in Information Studies; Smart C... | | NaN | NaN |
| 30 | INST728V | Special Topics in Information Studies; Digital... | | NaN | NaN |
| 31 | INST733 | Database Design | Principles of user-oriented database design. ... | LBSC690, LBSC671, or INFM603; or permission of... | 3.0 |
| 32 | INST737 | Introduction to Data Science | An exploration of some of the best and most ge... | INST627; and (LBSC690, LBSC671, or INFM603). O... | 3.0 |
| 33 | INST741 | Social Computing Technologies and Applications | Tools and techniques for developing and config... | INFM603 and INFM605; or (LBSC602 and LBSC671);... | 3.0 |
| 34 | INST742 | Implementing Digital Curation | Management of and technology for application o... | INST604; or permission of instructor. | 3.0 |
| 35 | INST746 | Digitization of Legacy Holdings | Through hands on exercises and real-world proj... | INST604. | 3.0 |
| 36 | INST762 | Visual Analytics | Visual analytics is the use of interactive vis... | INFM603 or INST630; or permission of instructor. | 3.0 |
| 37 | INST767 | Big Data Infrastructure | Principles and techniques of data science and ... | INST737; or permission of instructor. | 3.0 |
| 38 | INST776 | HCIM CAPSTONE PROJECT | The opportunity to apply the skills learned th... | INST775; or permission of instructor. | 3.0 |
| 39 | INST785 | Documentation, Collection, and Appraisal of Re... | Development of documentation strategies and pl... | INST604; or permission of instructor. | 3.0 |
| 40 | INST794 | Capstone in Youth Experience | Through a supervised project, to synthesize de... | INST650, INST651, and INST652; or permission o... | 3.0 |

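A minimal sketch of single-column `.apply()` on data shaped like the table above. The three rows are copied from that table; the helper name `has_prereqs` and the rule it encodes (the string `None` means "no prerequisites") are assumptions for illustration, not necessarily the notebook's own code.

```python
import pandas as pd

# A few rows copied from the course table above (columns trimmed for brevity)
courses = pd.DataFrame({
    "Code": ["INST201", "INST354", "INST604"],
    "Title": [
        "Introduction to Information Science",
        "Decision-Making for Information Science",
        "Introduction to Archives and Digital Curation",
    ],
    "Prereqs": ["None", "INST314.", "None"],
})


def has_prereqs(prereq_text):
    """Hypothetical helper: 1 if the course lists prerequisites, else 0."""
    # Assumption: the literal string "None" marks a course with no prerequisites
    if prereq_text == "None":
        return 0
    return 1


# .apply() calls the function once per value in the column and
# returns a new Series of the same length, which we store as a new column
courses["has_prereqs"] = courses["Prereqs"].apply(has_prereqs)
print(courses[["Code", "has_prereqs"]])
```

Note that the function is passed by name, without parentheses: `.apply()` does the calling, one value at a time, much like `map()` does for a list.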
| | Code | Title | Description | Prereqs | Credits | has_prereqs |
|---|---|---|---|---|---|---|
| 0 | INST126 | Introduction to Programming for Information Sc... | An introduction to computer programming for st... | Minimum grade of C- in MATH115; or must have m... | 3.0 | 1 |
| 1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | None | 3.0 | 0 |
| 2 | INST311 | Information Organization | Examines the theories, concepts, and principle... | Must have completed or be concurrently enrolle... | 3.0 | 1 |
| 3 | INST314 | Statistics for Information Science | Basic concepts in statistics including measure... | Must have completed or be concurrently enrolle... | 3.0 | 1 |
| 4 | INST326 | Object-Oriented Programming for Information Sc... | An introduction to programming, emphasizing un... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 |
| 5 | INST327 | Database Design and Modeling | Introduction to databases, the relational mode... | 1 course with a minimum grade of C- from (CMSC... | 3.0 | 1 |
| 6 | INST335 | Teams and Organizations | Team development and the principles, methods a... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 |
| 7 | INST346 | Technologies Infrastructure and Architecture | Examines the basic concepts of local and wide-... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 |
| 8 | INST352 | Information User Needs and Assessment | Focuses on use of information by individuals, ... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 |
| 9 | INST354 | Decision-Making for Information Science | Examines the use of information in organizatio... | INST314. | 3.0 | 1 |

| | Code | Title | Description | Prereqs | Credits | has_prereqs | is_intro |
|---|---|---|---|---|---|---|---|
| 0 | INST126 | Introduction to Programming for Information Sc... | An introduction to computer programming for st... | Minimum grade of C- in MATH115; or must have m... | 3.0 | 1 | 1 |
| 1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | None | 3.0 | 0 | 1 |
| 2 | INST311 | Information Organization | Examines the theories, concepts, and principle... | Must have completed or be concurrently enrolle... | 3.0 | 1 | 0 |
| 3 | INST314 | Statistics for Information Science | Basic concepts in statistics including measure... | Must have completed or be concurrently enrolle... | 3.0 | 1 | 0 |
| 4 | INST326 | Object-Oriented Programming for Information Sc... | An introduction to programming, emphasizing un... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 |
| 5 | INST327 | Database Design and Modeling | Introduction to databases, the relational mode... | 1 course with a minimum grade of C- from (CMSC... | 3.0 | 1 | 0 |
| 6 | INST335 | Teams and Organizations | Team development and the principles, methods a... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 |
| 7 | INST346 | Technologies Infrastructure and Architecture | Examines the basic concepts of local and wide-... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 |
| 8 | INST352 | Information User Needs and Assessment | Focuses on use of information by individuals, ... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 |
| 9 | INST354 | Decision-Making for Information Science | Examines the use of information in organizatio... | INST314. | 3.0 | 1 | 0 |

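When a new column depends on more than one existing column, `.apply()` can be called on the whole dataframe with `axis=1`, so the function receives one row at a time. The sketch below uses the `has_prereqs` and `is_intro` flags of three rows from the table above; the function name `is_entrypoint` and its rule (an intro course with no prerequisites) are assumptions made for illustration.

```python
import pandas as pd

# Flags for three rows, copied from the table above
courses = pd.DataFrame({
    "Code": ["INST126", "INST201", "INST311"],
    "has_prereqs": [1, 0, 1],
    "is_intro": [1, 1, 0],
})


def is_entrypoint(row):
    """Hypothetical rule: an intro course that has no prerequisites."""
    # `row` is a Series, so columns are read by name like dictionary keys
    if row["is_intro"] == 1 and row["has_prereqs"] == 0:
        return 1
    return 0


# axis=1 passes each row to the function, so it can read several columns at once
courses["is_entrypoint"] = courses.apply(is_entrypoint, axis=1)
print(courses)
```

Without `axis=1`, `DataFrame.apply()` would hand the function one column at a time instead, and looking up `row["is_intro"]` would fail.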
| | Code | Title | Description | Prereqs | Credits | has_prereqs | is_intro | is_entrypoint |
|---|---|---|---|---|---|---|---|---|
| 0 | INST126 | Introduction to Programming for Information Sc... | An introduction to computer programming for st... | Minimum grade of C- in MATH115; or must have m... | 3.0 | 1 | 1 | 0 |
| 1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | None | 3.0 | 0 | 1 | 1 |
| 2 | INST311 | Information Organization | Examines the theories, concepts, and principle... | Must have completed or be concurrently enrolle... | 3.0 | 1 | 0 | 0 |
| 3 | INST314 | Statistics for Information Science | Basic concepts in statistics including measure... | Must have completed or be concurrently enrolle... | 3.0 | 1 | 0 | 0 |
| 4 | INST326 | Object-Oriented Programming for Information Sc... | An introduction to programming, emphasizing un... | 1 course with a minimum grade of C- from (INST... | 3.0 | 1 | 0 | 0 |

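Once a flag column like `is_entrypoint` exists, it can be used to subset the dataframe. A minimal sketch with a boolean mask follows; the four `Code`/`is_entrypoint` pairs are taken from the tables in this section, and the variable names `mask` and `entrypoints` are only illustrative.

```python
import pandas as pd

courses = pd.DataFrame({
    "Code": ["INST126", "INST201", "INST311", "INST604"],
    "is_entrypoint": [0, 1, 0, 1],
})

# Comparing a column to a value gives a True/False Series (a "mask")
mask = courses["is_entrypoint"] == 1

# Indexing with the mask keeps only the rows where it is True;
# the surviving rows keep their original index labels
entrypoints = courses[mask]
print(entrypoints)
```

Because the original index labels are kept, a filtered dataframe can show non-consecutive row numbers.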
| | Code | Title | Description | Prereqs | Credits | has_prereqs | is_intro | is_entrypoint |
|---|---|---|---|---|---|---|---|---|
| 1 | INST201 | Introduction to Information Science | Examining the effects of new information techn... | None | 3.0 | 0 | 1 | 1 |
| 19 | INST604 | Introduction to Archives and Digital Curation | Overview of the principles, practices, and app... | None | 3.0 | 0 | 1 | 1 |
| 25 | INST630 | Introduction to Programming for the Informatio... | An introduction to computer programming intend... | None | 3.0 | 0 | 1 | 1 |

| | Unnamed: 0 | school | conf | rk | w | l | srs | sos | pts_for | pts_vs | ... | ap_pre | ap_high | ap_final | pts_diff | ncaa_result | ncaa_numeric | season | year | coach | underdog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8229 | 8229 | indiana | Big Ten | 78 | 20 | 3 | NaN | NaN | NaN | NaN | ... | 30 | 30 | 30 | NaN | Won National Final | 48 | 1939-40 | 1939 | Branch McCracken | 1 |
| 23515 | 23515 | wisconsin | Big Ten | 77 | 20 | 3 | NaN | NaN | NaN | NaN | ... | 30 | 30 | 30 | NaN | Won National Final | 48 | 1940-41 | 1940 | Bud Foster | 1 |

2 rows × 21 columns

| | area | num_entrypoints | num_classes | entry_point_ratio |
|---|---|---|---|---|
| 0 | AMST | 2 | 9 | 0.222222 |
| 1 | BMGT | 1 | 53 | 0.018868 |
| 2 | CMSC | 1 | 46 | 0.021739 |
| 3 | COMM | 0 | 31 | 0.000000 |
| 4 | ECON | 0 | 64 | 0.000000 |
| 5 | ENSP | 2 | 6 | 0.333333 |
| 6 | ENTS | 0 | 4 | 0.000000 |
| 7 | INFM | 0 | 5 | 0.000000 |
| 8 | INST | 4 | 47 | 0.085106 |
| 9 | MATH | 0 | 49 | 0.000000 |
| 10 | PHSC | 0 | 5 | 0.000000 |
| 11 | PLCY | 0 | 28 | 0.000000 |
| 12 | PSYC | 1 | 38 | 0.026316 |
| 13 | SPHL | 0 | 7 | 0.000000 |
| 14 | STAT | 0 | 15 | 0.000000 |
| 15 | URSP | 0 | 7 | 0.000000 |

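A per-area summary like the one above is an instance of the split-apply-combine pattern. The sketch below shows one way such a table could be built with `.groupby()` and named aggregation; the toy data and the area labels `AAAA` and `BBBB` are made up, and this is not necessarily how the table above was actually produced.

```python
import pandas as pd

# Made-up toy data: one row per course, with its area and entry-point flag
courses = pd.DataFrame({
    "area": ["AAAA", "AAAA", "AAAA", "BBBB", "BBBB"],
    "is_entrypoint": [1, 0, 0, 1, 1],
})

# Split by area, apply two summaries to is_entrypoint, combine into one dataframe
summary = (
    courses.groupby("area")
    .agg(
        num_entrypoints=("is_entrypoint", "sum"),   # the 1s add up to a count of entry points
        num_classes=("is_entrypoint", "count"),     # number of rows in each group
    )
    .reset_index()  # turn the group labels back into an ordinary column
)

# Column arithmetic is element-wise, so this gives one ratio per area
summary["entry_point_ratio"] = summary["num_entrypoints"] / summary["num_classes"]
print(summary)
```

Summing a 0/1 flag counts how many rows have the flag set, which is why `sum` and `count` on the same column give the numerator and denominator of the ratio.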
| | area | num_courses |
|---|---|---|
| 0 | AMST | 9 |
| 1 | BMGT | 53 |
| 2 | CMSC | 46 |
| 3 | COMM | 31 |
| 4 | ECON | 64 |
| 5 | ENSP | 6 |
| 6 | ENTS | 4 |
| 7 | INFM | 5 |
| 8 | INST | 47 |
| 9 | MATH | 49 |
| 10 | PHSC | 5 |
| 11 | PLCY | 28 |
| 12 | PSYC | 38 |
| 13 | SPHL | 7 |
| 14 | STAT | 15 |
| 15 | URSP | 7 |

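Counting rows per group, as in the `num_courses` table above, is the simplest split-apply-combine case. The sketch below assumes the area is the first four characters of the course code (an assumption, since the source does not show how `area` was derived) and compares `.groupby().size()` with `.value_counts()`.

```python
import pandas as pd

courses = pd.DataFrame({"Code": ["INST201", "INST311", "CMSC131", "INST604"]})

# Assumption: the area is the first four characters of the course code
courses["area"] = courses["Code"].apply(lambda code: code[:4])

# size() counts the rows in each group; reset_index(name=...) turns the
# result into a dataframe with a named count column
counts = courses.groupby("area").size().reset_index(name="num_courses")

# value_counts() gives the same counts as a Series indexed by area,
# sorted from most to least common
same_counts = courses["area"].value_counts()

print(counts)
print(same_counts)
```

`.value_counts()` is the quicker choice for summarizing a single categorical column; `.groupby()` is more flexible when other columns need summarizing at the same time.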
| | area | num_entrypoints | num_classes |
|---|---|---|---|
| 0 | AMST | 2 | 9 |
| 1 | BMGT | 1 | 53 |
| 2 | CMSC | 1 | 46 |
| 3 | COMM | 0 | 31 |
| 4 | ECON | 0 | 64 |
| 5 | ENSP | 2 | 6 |
| 6 | ENTS | 0 | 4 |
| 7 | INFM | 0 | 5 |
| 8 | INST | 4 | 47 |
| 9 | MATH | 0 | 49 |
| 10 | PHSC | 0 | 5 |
| 11 | PLCY | 0 | 28 |
| 12 | PSYC | 1 | 38 |
| 13 | SPHL | 0 | 7 |
| 14 | STAT | 0 | 15 |
| 15 | URSP | 0 | 7 |