7: Strings#
What are strings and why should we care about them?#
Strings are everywhere#
We need to learn to work with strings because a lot of the data we want to work with lives in the world as mixed data:
Email addresses
Webpage URLs
Names
Documents, words
Sales records
Etc.
Strings are the ultimate “lingua franca” between systems
Data is often passed in “serialized” form (e.g., JSON: JavaScript Object Notation)
We assume strings are coming in, and we parse them appropriately. This can include data (numbers/records), as we see in one of the Projects for this module!
This also includes the “human system” (i.e., the user)!
Let’s see what comes back from the input() function we introduced in Iteration.
age = input('What is your age?')
What is your age? 37
age
'37'
What data type is in age?
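One way to answer that question is to ask Python directly with type(). The sketch below hard-codes the value '37' in place of a live input() call (an assumption, made so the snippet runs without a prompt) and shows the conversion needed before doing arithmetic.

```python
# Stand-in for: age = input('What is your age?')
# (hard-coded here so the example runs without waiting for a user)
age = "37"

print(type(age))    # <class 'str'>: input() hands back a string

# Convert to an integer before doing math with it
age_num = int(age)
print(age_num + 1)  # 38
```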
Strings are sequences of characters#
But what is a string? It’s fundamentally a sequence of characters.
And that’s exactly what a string is in Python too: it’s a sequence of characters, much like (though not exactly like) a list. This means we can iterate through a string in much the same way that we iterate through a list.
s = "banana"
for char in s:
    print(char)
b
a
n
a
n
a
sentence = "she sells seashells by the seashore, except when she doesn't want to sell seashells"
for char in sentence:
    print(char)
s
h
e
s
e
l
l
s
s
e
a
s
h
e
l
l
s
b
y
t
h
e
s
e
a
s
h
o
r
e
,
e
x
c
e
p
t
w
h
e
n
s
h
e
d
o
e
s
n
'
t
w
a
n
t
t
o
s
e
l
l
s
e
a
s
h
e
l
l
s
Characters don’t have to be visible/letters!#
Notice that even the “blank space” is a character! A string that includes an empty space character is NOT the same as an empty string (i.e., a list of characters of length zero), even though they print out the same. This distinction is very important to remember as you work with real world data.
a = "" # a blank/empty string
b = " " # a string with one blank space *character*
print("Printing out the value of a")
print(a)
print("Printing out the value of b")
print(b)
print(len(a), len(b))
print(a == b)
Printing out the value of a
Printing out the value of b
0 1
False
This means that whitespace at the beginning or end of a string makes it a completely different string! For example:
a = "James"
b = " James"
c = "James "
print(a == b)
print(b == c)
print(a == c)
False
False
False
clean = ""
for char in b:
    if char != " ":
        clean = clean + char
clean
'James'
Other kinds of characters that don’t look like “letters”:
\t for tabs
\n for newlines
# tab is \t
s = "a\ttab\nand a newline"
print(s)
for idx, char in enumerate(s):
    print(idx, char)
a tab
and a newline
0 a
1
2 t
3 a
4 b
5
6 a
7 n
8 d
9
10 a
11
12 n
13 e
14 w
15 l
16 i
17 n
18 e
Again, even though they may look similar to our eyes (“blank”), they are not the same!
a = " "
b = "\t"
c = "\n"
print("a is ", a, "with length", len(a))
print("b is ", b, "with length", len(b))
print("c is ", c, "with length", len(c))
print("is a the same as b?", a == b)
print("is b the same as c?", b == c)
print("is a the same as c?", a == c)
We’ll see in the next module how text data often comes in with a mix of various whitespace characters, which we can use to parse the data into structured records.
For example, the following string may look ridiculous, but the \t and \n characters in the string actually break it up quite nicely into a structured format that is readable by both humans and Python.
records = "NAME\tSCORE\tGRADE\nJoel\t81\tB-\nRony\t98\tA+\nSravya\t99\tA+"
print(records)
NAME SCORE GRADE
Joel 81 B-
Rony 98 A+
Sravya 99 A+
Properties of strings#
Because strings are sequences, much like lists, most of the properties and functions that apply to lists also apply to strings (e.g., they are sortable, have a length, and support checking whether something is “in” them). There is one important exception: strings are immutable. You can never modify a string directly; you can only create a new string, which you must then assign to a variable (or reassign to the same variable) if you want to preserve the change. More on this when we talk about working with strings.
Strings are ordered (and therefore can be sorted and indexed and sliced like lists)#
a_string = "hello world Hi"
print(sorted(a_string))
[' ', ' ', 'H', 'd', 'e', 'h', 'i', 'l', 'l', 'l', 'o', 'o', 'r', 'w']
# can be indexed
a_string[0]
'h'
# and sliced
a_string[:5]
'hello'
Strings have length#
# has length
print(len(a_string))
14
Strings are IMMUTABLE (you can’t modify them directly)#
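a_string = "hello"
print(a_string.upper()) # .upper() returns a new, upper-cased string
a_string.upper() # without an assignment, that new string is discarded
print(a_string) # the original string is unchanged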
HELLO
hello
This property of strings has one critical implication for how you work with them (vs. other data types): anytime you modify a string, you must have some kind of variable assignment statement (even if it is back to itself) to preserve the change.
This is a really important point we’ll drive home later when we dig into ways of working with strings.
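A minimal sketch of what immutability means in practice: trying to change a character in place raises a TypeError, so the usual workaround is to build a new string and reassign it. The variable name is illustrative only.

```python
word = "hello"

# Strings do not support item assignment the way lists do
try:
    word[0] = "J"
except TypeError as err:
    print("Cannot modify in place:", err)

# Instead, build a new string and reassign the variable
word = "J" + word[1:]
print(word)  # Jello
```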
Aside: string encoding#
The previous observation about blank spaces illustrates a larger point: we deliberately say strings are sequences of characters, not letters. This is because strings can include numbers, as we’ve seen (think usernames like joelchan86, or your uids), but also all sorts of other characters, including various kinds of blank spaces — like tabs, spaces, and newlines — and even emoji!
Check this resource for an overview and initial guide: https://realpython.com/python-encodings-guide/
This is something I want to show you to give you a better intuition for what strings are, but there is also an important practical implication: you need to be very careful to transform or normalize your strings when you want to sort or compare them. What’s the same or different to your human eye will often not be the same or different to the computer’s eye.
For example, “A” and “a” are encoded as different characters, so Python does not see them as the same “letter”. Sometimes you’ll even be reading in strings that are in a different encoding.
s1 = "James "
s2 = "James"
s3 = "james"
s4 = "JAmes"
s1 == s4
False
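To make this concrete, the built-in ord() function returns the number Python stores for a character. The sketch below also shows one possible normalization (.strip() plus .lower(), both covered later in this chapter) that makes the four variants above compare equal; whether that is the right normalization depends on your data.

```python
# Upper- and lower-case letters are stored as different numbers
print(ord("A"), ord("a"))  # 65 97

variants = ["James ", "James", "james", "JAmes"]

# One possible normalization: drop surrounding whitespace, then lower-case
normalized = []
for v in variants:
    normalized.append(v.strip().lower())

print(normalized)  # ['james', 'james', 'james', 'james']
```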
Working with strings: basics#
Similar to lists, many basic operations with strings revolve around indexing and iteration.
Getting parts of a string#
Indexing#
Works similarly to lists.
s = "my name is inigo montoya, you killed my father, prepare to die!"
for i in range(len(s)):
    char = s[i]
    print(i, char)
0 m
1 y
2
3 n
4 a
5 m
6 e
7
8 i
9 s
10
11 i
12 n
13 i
14 g
15 o
16
17 m
18 o
19 n
20 t
21 o
22 y
23 a
24 ,
25
26 y
27 o
28 u
29
30 k
31 i
32 l
33 l
34 e
35 d
36
37 m
38 y
39
40 f
41 a
42 t
43 h
44 e
45 r
46 ,
47
48 p
49 r
50 e
51 p
52 a
53 r
54 e
55
56 t
57 o
58
59 d
60 i
61 e
62 !
s = "my name is inigo montoya, you killed my father, prepare to die!"
for char in s:
    print(char)
s = "my name is inigo montoya, you killed my father, prepare to die!"
# give me the last letter
print(s[-1])
!
l = [1, 4, 5, 6, 7]
# give me the first item
print(l[0])
1
PRACTICE: How would you get the first number of the level (after the four-letter code)?
code = "INST201"
# your code here
'2'
PRACTICE: How would you get the first initial for each name?
names = ["Joel", "Sarah", "John", "Michael", "Patrick", "Kacie"]
# your code here
J
S
J
M
P
K
Slicing#
Remember slicing? Here we can think about substrings. Super useful for truncation, or getting particular parts of strings when you know the pattern (e.g., the first four characters of a course code are always the department).
Remember: the index before the : indicates where you want to start, and the index after the : indicates where you want to stop before. So [0:4] will go from index 0 to index 3 (before index 4). Leaving out an index implicitly says “to the max” (e.g., from 0 or until the end).
code = "INST201"
# get the first four chars (index 0 to 3, stop before 4)
area = code[:4]
print(area)
INST
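The same start/stop rules apply to any string with a predictable layout. As a sketch, assume a date string in YYYY-MM-DD form (a made-up example, not from the course data):

```python
date = "2023-09-14"  # hypothetical date in YYYY-MM-DD form

year = date[:4]    # index 0 up to (not including) 4
month = date[5:7]  # index 5 up to (not including) 7
day = date[8:]     # index 8 through the end

print(year, month, day)  # 2023 09 14
```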
PRACTICE: how would you get the course number?
# get last three characters
code = "INST201"
# your code here
'201'
PRACTICE: How would you get the first 2 letters of the name?
name = "Michelle"
# your code here
We can put these into filtering/counting patterns that check parts of strings!#
Practice: How many students in the class have names that begin with Eli?
names = [
    "Eliana",
    "John",
    "Elias",
    "Esther",
    "Joseph",
    "Ebenezer",
    "Eric",
    "Josiah",
    "Joe",
    "Eliza",
    "Frank",
    "Ellie",
]

count = 0

for name in names:
    if name[:3] == "Eli": # fill in your boolean expression
        count += 1

count
3
PRACTICE: What about grabbing the names that begin with Jo?
names = [
    "Eliana",
    "John",
    "Elias",
    "Esther",
    "Joseph",
    "Ebenezer",
    "Eric",
    "Josiah",
    "Joe",
    "Eliza",
    "Frank",
    "Ellie",
]

target_names = []

# your code here
['John', 'Joseph', 'Josiah', 'Joe']
Join strings#
We’ve also already shown you concatenation.
s1 = "Hello"
s2 = " World!"
print(s1 + s2)
Hello World!
Now that you’ve seen lists, you can get a bit more intuition for how it works.
l1 = [1, 2, 3]
l2 = [4, 5, 6]
print(l1 + l2)
[1, 2, 3, 4, 5, 6]
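One caveat, sketched below: + only concatenates strings with strings. Mixing in a number raises a TypeError unless the number is converted with str() first. The variable names here are made up for illustration.

```python
label = "Score: "
score = 81

# "Score: " + 81 raises a TypeError, so convert the number first
try:
    line = label + score
except TypeError:
    line = label + str(score)

print(line)  # Score: 81
```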
Check if character(s) is in string#
Just like lists, the in operator also works for strings. We can think of this as checking whether some substring (could be a single character, or a sequence of characters) is part of a target string.
Example: check whether this message contains a keyword
message = "hello, my name is inigo montoya"
keyword = "ingo"
# let's check if the message mentions my name!
print(keyword in message)
False
PRACTICE: check whether these strings contain a space or tab!
tab is represented by '\t'
space is represented by ' '
s1 = "\tInigo"
s2 = " Inigo"
# your code
True
Put it into our filtering pattern!#
PRACTICE: Only grab the classes from CMSC
course_codes = ["INST201", "INST126", "INFM322", "CMSC126"]

target_courses = []

# your code
['CMSC126']
PRACTICE: Only grab the emails that are from .edu domain
emails = ["oasislab@gmail.com", "joelchan@terpmail.umd.edu", "rony@terpmail.com", "joelchan@umd.edu", "joelchan@gmail.com", "sarahp@umd.edu", "sarah@umd.org"]
target_emails = []

# your code
['joelchan@terpmail.umd.edu', 'joelchan@umd.edu', 'sarahp@umd.edu']
Working with strings: advanced#
Similar to lists, strings come with a collection of built-in methods (functions in Python that operate on strings): https://docs.python.org/3/library/stdtypes.html#string-methods
s = "hello"
# dir(s)
I’m not going to show you all of them, but I will walk through some fairly common ones.
No need to memorize them – just know:
There are many methods that allow you to do things with strings – if you want to do something, first search for that method! It’s often way more efficient/bug-free than what you’ll write (even after you get good)
Where to find the documentation for a method, and how to figure out how it works
More importantly, I want you to practice reading documentation, get a sense of how to use functions (code that other people have written that you can reuse): what are the parameters? return values? what can you learn from examples? how do you learn how to use it appropriately in your own code?
message = "Hello, my name is Inigo Montoya"
# let's check if the message mentions my name!
print("inigo" in message.lower())
Checking a string#
Common methods include:
.isnumeric() - is it all numeric?
.isalnum() - is it all letters and numbers?
.isalpha() - is it all letters?
.startswith() - does it start with some substring?
.endswith() - does it end with some substring?
Example: .isnumeric() checks if the string is entirely composed of numeric characters.
a = " 123"
a.isnumeric()
False
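The other checks in the list above work the same way: each returns True or False. A quick sketch with made-up values (note that the letters-and-numbers check is spelled .isalnum() in Python):

```python
code = "INST201"
email = "joelchan@umd.edu"

print(code.isalpha())           # False: contains digits
print(code.isalnum())           # True: only letters and digits
print(email.isalnum())          # False: '@' and '.' are neither
print(code.startswith("INST"))  # True
print(email.endswith(".edu"))   # True
```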
# we want to do math
a = "x123"
b = "567"
# but first we want to make sure the strings are all numbers before we convert them
if a.isnumeric() and b.isnumeric():
    a = int(a)
    b = int(b)
    print(a*b)
else:
    # one way to salvage just the numeric characters from a
    a_num = int("".join(filter(lambda x: x.isnumeric(), a)))
    print("One of the input strings contains non-digits!")
One of the input strings contains non-digits!
Example application: cleaning a sales record!
# need to turn into a number so I can do math with it
sales_record = "$1,000,000"

# with iteration
cleaned = "" # initialize clean string as a blank/empty string
# for each character in the sales record string
for char in sales_record:
    if char.isnumeric(): # if the character is numeric
        cleaned += char # grab it (notice the use of concatenation here)
print(cleaned)
1000000
Another example: .startswith() and .endswith() check whether the beginning/end of a string matches a substring (a single character or a sequence of characters).
l = ["INST201", "INST126", "INFM322", "CMSC126", "joelchan@umd.edu", "joelchan", ".edu", "sarah@umd.edu"]
# get all the strings that start with INST
for item in l:
    if item.startswith("INST"):
        print(item)
PRACTICE: Use .startswith() to only grab the classes from CMSC
course_codes = ["INST201", "INST126", "INFM322", "CMSC126"]
target_courses = []
# your code
['CMSC126']
PRACTICE: Use .endswith() to only grab the emails that are from .edu domain
emails = ["oasislab@gmail.com", "joelchan@terpmail.umd.edu", "rony@terpmail.com", "joelchan@umd.edu", "joelchan@gmail.com", "sarahp@umd.edu", "sarah@umd.org"]

target_emails = []
# your code
['joelchan@terpmail.umd.edu', 'joelchan@umd.edu', 'sarahp@umd.edu']
“Cleaning” / normalizing a string#
Often we get data in string form, and we need to make sure it conforms to our expectations. Sometimes this means we modify it.
Common methods include:
.lower() or .upper() - convert the string to all lowercase or uppercase so we eliminate differences that have to do with case; remember that A and a are different to Python!
.replace() - replace parts of a string with something else - often we use this to strip out characters we don’t like (by replacing them with a blank string)
.strip() - remove hidden whitespace at the beginning and end of strings. super handy in data cleaning scenarios! closely related are .lstrip() and .rstrip() (can you guess what they do?)
Note: you can “chain” methods, and often want to do so! Often this is handy to “bundle together” cleaning/normalizing operations.
# can use .replace() if you know in advance which characters you want to strip out
def normalize_sales_record(sale):
    return sale.replace("$","").replace(",","")

sales_record = "$1,000,000"
cleaned = normalize_sales_record(sales_record)
print(cleaned)
1000000
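Once the string is cleaned, converting it with int() makes arithmetic possible. A sketch that reuses the same replace-based cleanup; the function name parse_sales_record and the 10% commission are hypothetical, chosen only for illustration:

```python
def parse_sales_record(sale):
    """Strip '$' and ',' from a sales string and return it as an integer."""
    cleaned = sale.replace("$", "").replace(",", "")
    return int(cleaned)

amount = parse_sales_record("$1,000,000")
print(amount)        # 1000000
print(amount // 10)  # 100000 (a hypothetical 10% commission)
```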
Chaining works because a string method such as .replace() returns a string, and that returned string can in turn call another method. We can do this as many times as we like, though be careful to make sure your code is still readable!
def normalize_string(s):
    return s.upper().strip() # convert the string to upper case and remove leading and trailing blank spaces

# need to make sure it's normalized and we remove all weird stuff
n = " Josh Lyman"
m = "JOSH LYMAN"
print(n)
print(m)
print(n == m)
n_normal = normalize_string(n)
m_normal = normalize_string(m)
print(n_normal)
print(m_normal)
print(n_normal == m_normal)
Josh Lyman
JOSH LYMAN
False
JOSH LYMAN
JOSH LYMAN
True
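For comparison, here is a sketch of how .strip(), .lstrip(), and .rstrip() differ on the same padded string. The bars are printed only to make the remaining whitespace visible.

```python
padded = "  Josh Lyman  "

print("|" + padded.strip() + "|")   # |Josh Lyman|
print("|" + padded.lstrip() + "|")  # |Josh Lyman  |
print("|" + padded.rstrip() + "|")  # |  Josh Lyman|
```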
“Parsing” a string (getting specific bits we want)#
You can do this if you know there is some separator that you can rely on to divide the string into the “bits” you want.
Examples:
Parse an email
Parse a URL
Parse a sentence into words!
Parse a time stamp
These all use the .split() method, which takes a separator as input and returns a list of the substrings found between occurrences of that separator.
Example: get elements of an email address.
email = "joelchan@umd.edu"
# we want only the username
elements = email.split("@") # split on the `@` character
print(elements)
username = elements[0] # the username is the 1st element in the split
print(username)
['joelchan', 'umd.edu']
joelchan
email = "joelchan@umd.edu"
# if we only want the domain (.edu), we can do a multiple split
split1 = email.split("@") # split the email by the @ separator
domainserver = split1[1] # grab the second item
split2 = domainserver.split(".") # split that second item by the . separator
domain = split2[1] # get the second item from that one
print(domain)
Example: get elements of a url.
url = "www.ischool.umd.edu"
elements = url.split(".") # split on the `.` character
elements[1]
'ischool'
PRACTICE: get elements of a timestamp (get the hour).
# get hour
timestamp = "13:30:31"
# your code here
'13'
PRACTICE: get the words in a sentence.
# get words
message = "She sells seashells by the sea shore, with sea in the wind, and sea in my shoes"
# your code here
['She',
'sells',
'seashells',
'by',
'the',
'sea',
'shore,',
'with',
'sea',
'in',
'the',
'wind,',
'and',
'sea',
'in',
'my',
'shoes']
A more complicated example: parse records string into a list of lists!
records_string = "NAME\tSCORE\tGRADE\nJoel\t81\tB-\nRony\t98\tA+\nSravya\t99\tA+"

records = []

for row in records_string.split("\n"):
    row_data = row.split("\t")
    records.append(row_data)
records
[['NAME', 'SCORE', 'GRADE'],
['Joel', '81', 'B-'],
['Rony', '98', 'A+'],
['Sravya', '99', 'A+']]
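Every value in the parsed records is still a string, including the scores. A sketch of one follow-up step, assuming the first row is always the header and the second column always holds a whole number:

```python
records_string = "NAME\tSCORE\tGRADE\nJoel\t81\tB-\nRony\t98\tA+\nSravya\t99\tA+"

rows = records_string.split("\n")
header = rows[0].split("\t")  # ['NAME', 'SCORE', 'GRADE']

total = 0
for row in rows[1:]:          # skip the header row
    name, score, grade = row.split("\t")
    total += int(score)       # convert the score string to a number

print(header)
print(total / len(rows[1:]))  # average score
```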
PRACTICE: A more complicated example: parse comma-separated records string into a list of lists!
records_string = "NAME,SCORE,GRADE\nJoel,81,B-\nRony,98,A+\nSravya,99,A+"
records = []
# your code here
[['NAME', 'SCORE', 'GRADE'],
['Joel', '81', 'B-'],
['Rony', '98', 'A+'],
['Sravya', '99', 'A+']]
REMEMBER: STRINGS ARE IMMUTABLE#
Remember! Unlike many list methods, which change the list in place, string methods return a new object (and do not modify the original string), since strings are immutable.
This means if you don’t assign the return value of the string method to a new variable, the change will be lost. Remember this!
a = "hello"
b = "Hello"
print(a.lower())
print(b.lower())
print(a == b)
print(a, b)
hello
hello
False
hello Hello
a = "hello"
b = "Hello"
a = a.lower()
b = b.lower()
print(a == b)
print(a, b)
message = "Hello, my name is Inigo Montoya"
print(message)
# let's check if the message mentions my name!
message = message.lower() # change to lower case
message = message.replace("inigo", "MYSTERY")
print(message)
Hello, my name is Inigo Montoya
hello, my name is MYSTERY montoya
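The contrast with lists, sketched side by side: a list method such as .sort() changes the list in place, while a string method such as .upper() leaves the original alone and hands back a new string.

```python
numbers = [3, 1, 2]
numbers.sort()               # modifies the list itself
print(numbers)               # [1, 2, 3]

greeting = "hello"
greeting.upper()             # the result is thrown away
print(greeting)              # hello

greeting = greeting.upper()  # reassign to keep the change
print(greeting)              # HELLO
```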
String formatting#
So far we’ve taken strings as given, and we often specify a string directly. But frequently it is useful to compose a string programmatically, from variables.
Often this is done for debugging (to read the state of your program at various steps), but it is just as often used to produce the outputs of your program, intermediate or final.
Here’s an example
msg = "hello"
friend = "rony"
name = "anna"
output = f"{msg} {friend}, my name is {name}!"
print(output)
hello rony, my name is anna!
weight_kgs = 120
print(f"{weight_kgs} is {weight_kgs*2.2} lbs")
120 is 264.0 lbs
Example application: a game!
# game parameters
name = "sarah"
attempts = 5
target = 138

# initial guess
guess = input("Guess the number (or type q to quit): ")

# as long as there are attempts remaining
# and the user hasn't said quit
while attempts > 0 and guess != "q":
    # change to number
    guess_num = int(guess)

    # if the guess is correct
    if guess_num == target:
        # congratulate the user
        msg = f"congrats, {name}! {guess} is indeed the number!"
        print(msg)
        # and break out of the loop (no need to keep guessing!)
        break
    # otherwise
    else:
        # subtract 1 from the number of attempts
        attempts -= 1
        # tell the user how many attempts are left
        msg = f"sorry that wasn't right, {name}, you have {attempts} remaining attempts"
        print(msg)
        # get another guess
        guess = input("Take another guess (or type q to quit): ")
Guess the number (or type q to quit): 3
sorry that wasn't right, sarah, you have 4 remaining attempts
Take another guess (or type q to quit): 5
sorry that wasn't right, sarah, you have 3 remaining attempts
Take another guess (or type q to quit): 1
sorry that wasn't right, sarah, you have 2 remaining attempts
Take another guess (or type q to quit): 138
congrats, sarah! 138 is indeed the number!
sales = ["$100", "$250", "$500"]

for idx, sale in enumerate(sales):
    print(f"Processing the item at index {idx}: {sale}") # example of debugging/tracing statement
    print(sale)
Processing the item at index 0: $100
$100
Processing the item at index 1: $250
$250
Processing the item at index 2: $500
$500
The basics#
Let’s look in more detail. The intuition here is that you’re defining a series of “slots” for variables. Each slot is indicated with the {} curly braces. And you put data / variables in them (which can include expressions that yield data!).
You also indicate that you’re doing this slot thing by prefixing the string with the letter f.
Here’s how it looks:
names = ["Joel", "Sarah", "Michael", "Kacie"]
for name in names:
    message = f"Welcome, {name}!"
    print(message)
birth_year = 1956
this_year = 2023
name = "Joel"
message = f"Happy birthday, {name}! You are {this_year - birth_year} this year!"
print(message)
Happy birthday, Joel! You are 67 this year!
PRACTICE: print out the following message for each row: “{name} got a {grade}”!
records_string = "NAME,SCORE,GRADE\nJoel,81,B-\nRony,98,A+\nSravya,99,A+"
for row in records_string.split("\n")[1:]:
    name, score, grade = row.split(",")
    # add your msg here!
Controlling the way it looks#
You can also control how the string looks! For example, you can control how many decimal places are printed out (very useful when doing math), or how wide or indented the string is.
# most common
x = 2
y = 3
message = f"{x} divided by {y} is {x/y:.2f}" # only show two decimal places for the float value of result
print(message)
# print(result)
2 divided by 3 is 0.67
The general design pattern here is to put a colon after the expression inside the curly braces and then specify some kind of formatting option. More details here: http://zetcode.com/python/fstring/
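A few other format options, as a sketch (the values are made up): a comma adds thousands separators, < and > pad to a fixed width, and % formats a fraction as a percentage.

```python
revenue = 1234567.891
name = "Joel"
score = 81
rate = 0.256

print(f"{revenue:,.2f}")         # 1,234,567.89
print(f"{name:<8}|{score:>5}|")  # Joel    |   81|
print(f"{rate:.1%}")             # 25.6%
```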
PRACTICE: complete the output message so it prints out something like this: “Please charge my card for $5.23” (if the total value is 5.23)
tip = 0.18
check = 25.00
total_value = check + check*tip
# complete the output msg
msg = "" # your code here
print(msg)
For the curious: there was a time when string formatting was done differently (but Python’s creators basically tell everyone not to use it anymore): just pointing it out as a historical novelty in case you see it in the wild in other people’s code (cough Joel’s code cough).
https://realpython.com/python-string-formatting/#1-old-style-string-formatting-operator