String operations

Strings as objects

Consider the string "Alice" and the string "Bob". Although both are strings, they are clearly distinct objects. In programming, we think of objects as specific instances of a class (or data type). We will be more thorough with this definition later (it is not fully correct as stated, because we are leaving details out), but this naive understanding will suffice for now.

For many objects, it is possible to invoke the methods associated with that object. Methods are special kinds of functions associated with the data type of an object. As a concrete example, consider the code

myMsg = 'Hello, World!'
myMsgUpper = myMsg.upper()
print(myMsgUpper)

If you run this code, you would obtain the following:

HELLO, WORLD!

The .upper() part of the code is an example of a method implemented in Python’s str data type. When it is invoked on myMsg, it operates specifically on the contents of myMsg and returns a new, uppercased string. Thus, we arrive at the string HELLO, WORLD!.
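
As a small sketch of the same idea, the snippet below invokes a few other methods of the str data type (lower, replace, and strip). The variable names are made up for illustration. Each call returns a new string, and myMsg itself is left as it was.

```python
myMsg = 'Hello, World!'

myMsgLower = myMsg.lower()                     # all letters lowercased
myMsgSwapped = myMsg.replace('World', 'Bob')   # swap one substring for another
myMsgTrimmed = '  padded  '.strip()            # drop leading/trailing spaces

print(myMsgLower)    # hello, world!
print(myMsgSwapped)  # Hello, Bob!
print(myMsgTrimmed)  # padded
print(myMsg)         # Hello, World!
```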

Python’s str data type implements many different methods. For a full list, consult the string methods section of the Python documentation. We will cover a few of these, but not all of them.

Indexing strings

There are many cases where we may wish to access a particular single character from a string. To do this, we use the indexing operator:

letters = 'ABCDE'
print(letters[1])

This will output

B

Notice that letters[1] selects the second character of letters, not the first. Python, like most programming languages, starts counting indices from zero, not one. There are a couple of reasons for this convention:

  • Dijkstra’s note “Why numbering should start at zero”
  • In lower-level programming languages such as C and C++, indexing is a special case of pointer arithmetic, where index 0 naturally refers to the first element.

Typically, we think of indices as telling us how far to go along the sequence of characters in the string from left to right. There are cases, however, where it may be useful to go from right to left instead. A common example is if we need the last character of a string:

myMsg = 'Hello, World!'
print(myMsg[-1])

This will output

!

That is, the \((-1)\)-index corresponds to the last character in the string. Accordingly, the \((-2)\)-index corresponds to the second to last character in the string, and so on.
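
One way to picture negative indices, sketched below, is that index -k refers to the same character as index len(myMsg) - k. The len() function is covered further down this page. The names n and k are only for illustration.

```python
myMsg = 'Hello, World!'
n = len(myMsg)  # number of characters in myMsg

# Check the correspondence for every valid negative index.
for k in range(1, n + 1):
    assert myMsg[-k] == myMsg[n - k]

print(myMsg[-2])     # d
print(myMsg[n - 2])  # d
```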

Slicing strings

Building off of indexing strings, sometimes we don’t want just one character, but rather a substring of our string. To obtain those substrings, we slice the string. For example,

myString = 'ABCDEFG'
mySlice = myString[1:4]
print(mySlice)

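This prints BCD. That is, myString[1:4] is interpreted as “return the part of myString from index 1 up to index 4, including the character at the start index and excluding the character at the stop index.” Since indices start at zero, this selects the second, third, and fourth characters.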

There are a few variants on this:

myString = 'ABCDEFG'
print(myString[:3])  # prints ABC
print(myString[3:])  # prints DEFG

We can even use negative indices:

myString = 'ABCDEFG'
print(myString[:-2])  # prints ABCDE
print(myString[-2:])  # prints FG
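
A few more slices are sketched below to show how the pieces fit together. The index k is an arbitrary choice for illustration. Unlike indexing, slicing does not raise an error when an index falls outside the string.

```python
myString = 'ABCDEFG'
k = 3

# Cutting at index k and gluing the halves back gives the original string.
rejoined = myString[:k] + myString[k:]
print(rejoined == myString)  # prints True

print(myString[2:100])  # prints CDEFG; the stop index is clamped to the end
print(myString[5:2])    # prints an empty line; start comes after stop
print(myString[-3:-1])  # prints EF
```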

Length of string

To find the length of a string (i.e. the number of characters in a string), we use the len() function:

myString = 'ABCDEFG'
print(len(myString))  # prints 7

A common mistake that many programmers (even experienced ones) make is the following:

myString = 'ABCDEFG'
print(myString[len(myString)])  # try to print last character

This gives IndexError: string index out of range. This occurs because Python is \(0\)-indexed: the first character of a string is at index 0, so the last character is at index len(myString) - 1, and index len(myString) is one past the end. Accordingly, the fix is to change myString[len(myString)] to myString[len(myString) - 1].
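
The sketch below puts the two ways of reaching the last character side by side. The name lastIndex is only illustrative. The try/except part uses syntax not covered on this page; it is included only so that the error can be shown without stopping the program.

```python
myString = 'ABCDEFG'
lastIndex = len(myString) - 1  # 6, the largest valid index

print(myString[lastIndex])  # prints G
print(myString[-1])         # prints G

try:
    myString[len(myString)]  # index 7 is one past the end
except IndexError as error:
    errorMessage = str(error)
    print('IndexError:', errorMessage)
```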

Immutability of strings

It is possible that, at some point, we may want to change the content of a string. Python prevents this, however:

myString = 'Alice'
myString[0] = 'B'  # error
print(myString)

Instead of printing 'Blice', we are met with the runtime error TypeError: 'str' object does not support item assignment. The precise term for this is that strings are immutable (i.e. they cannot be modified).

The immutability of strings is intentional. In fact, it would be bad for strings to be mutable. Why? Suppose that strings were mutable and we had something like

class Person:
    def __init__(self, name):
        self.__name = name
    
    def get_name(self):
        return self.__name

aliceName = 'Alice'
alice = Person(aliceName)

aliceName[0] = 'B'
print(alice.get_name())

We haven’t talked about classes yet, so don’t worry about not knowing exactly what is going on there. For now, interpret that part of the code as a template that defines a custom data type called Person. In any case, if strings were mutable and that code were valid, it would print Blice. At this scale, that might seem unimportant, but imagine a project with thousands of instances of Person that all hold a reference to aliceName. Modifying aliceName so that its first character is 'B' would then silently change the name of thousands of Person objects, which would be bad.

Accordingly, it is simpler to prevent strings from being modified. If we want to obtain 'Blice' from 'Alice', the best we can do is build a new string through slicing and string concatenation:

myString = 'Alice'
myString = 'B' + myString[1:]
print(myString)
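
To see that concatenation builds a new string rather than modifying the old one, here is a small sketch with a second variable pointing at the original string. The names original and alias are made up for illustration.

```python
original = 'Alice'
alias = original  # both names refer to the same string object

# This creates a new string and makes `original` refer to it.
# The string 'Alice' itself is never modified.
original = 'B' + original[1:]

print(original)  # prints Blice
print(alias)     # prints Alice
```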

Traversing over strings

In many scenarios, it is useful to be able to traverse a string from left to right. This is easily done via the for loop:

for character in 'Alice':
    print(character)

When run, this code outputs

A
l
i
c
e

Another way to traverse a string is via indices. For instance:

name = 'Alice'
for i in range(len(name)):
    character = name[i]
    print(character)

This approach is useful if we need to have access to the position of a particular character in a string. If we don’t need the position, however, then it is recommended to use the previous approach.
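
When both the position and the character are needed, Python's built-in enumerate() function is a common alternative to range(len(name)). It is not covered elsewhere on this page, so treat the snippet below as an optional sketch.

```python
name = 'Alice'

# enumerate() yields (index, character) pairs, starting from index 0.
for i, character in enumerate(name):
    print(i, character)
```

This prints one line per character, from 0 A through 4 e.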

The in operator

There are many cases where one may wish to check if a string is a substring of another string. In Python, we check for this by using the in operator.

print('A' in 'Alice')   # True
print('a' in 'Alice')   # False --- case-sensitive
print('li' in 'Alice')  # True
print('il' in 'Alice')  # False --- order-sensitive

We can also combine the not operator with in:

print('A' not in 'Alice')   # False
print('a' not in 'Alice')   # True
print('li' not in 'Alice')  # False
print('il' not in 'Alice')  # True
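
Two related checks are sketched below. The first makes the test case-insensitive by lowercasing both strings before using in. The second uses the find method of str, which reports where the first match starts, or -1 if there is no match. The variable name is only illustrative.

```python
name = 'Alice'

print('a' in name)                  # False --- case-sensitive
print('a'.lower() in name.lower())  # True --- both sides lowercased first

print(name.find('li'))  # 1 --- the match starts at index 1
print(name.find('il'))  # -1 --- no match
```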

Splitting strings

Suppose we are dealing with data, in the form of a string, that is delimited by commas:

White,Walter,Chemist

We might be interested in splitting the string up into a list of strings so that we can access each entry of the data separately. For instance:

data = 'White,Walter,Chemist'
dataList = data.split(',')
print(dataList)

This outputs

['White', 'Walter', 'Chemist']

Of course, in real-life work we would not want to keep our data as one gigantic string. Because of this, we will later talk about representing data more sanely using the pandas library.
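
As a brief sketch of working with the result, the pieces can be assigned to separate variables and glued back together with the join method of str. The field names lastName, firstName, and occupation are a guess at what each entry means; the data itself does not say.

```python
data = 'White,Walter,Chemist'

# Unpack the three entries into separate (hypothetically named) variables.
lastName, firstName, occupation = data.split(',')
print(firstName, lastName)  # prints Walter White

# join() is the inverse of split(): it glues a list back into one string.
rejoined = ','.join([lastName, firstName, occupation])
print(rejoined == data)  # prints True

# An empty entry between two commas is kept as an empty string.
print('a,,b'.split(','))  # prints ['a', '', 'b']
```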

Formatting strings

At some point, you are going to hate having to write string concatenations that look something like this:

a = -7
b = 2
q = a // b
r = a % b
print('Long division on ' + str(a) + ': ' + str(a) + ' = ' + str(q) + '*' + str(b) + ' + ' + str(r))

Such statements are not only ugly and verbose, but also hard to read. One use of string formatting is precisely to get rid of code that looks like this. The above can be rewritten as

a = -7
b = 2
q = a // b
r = a % b
print('Long division on {}: {} = {}*{} + {}'.format(a, a, q, b, r))

The author can hardly do string formatting justice, so the recommendation here is simply to read the Python documentation. String formatting is quite powerful and is useful for many purposes beyond the one example shown above.
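
Two other spellings of the same statement are sketched below, assuming Python 3.6 or newer for the first one. An f-string places the variables directly inside the braces, and numbered fields in .format() let a be passed once and used twice. The names viaFString and viaNumbered are only illustrative.

```python
a = -7
b = 2
q = a // b  # -4, since floor division rounds toward negative infinity
r = a % b   # 1

# f-string: expressions inside the braces are evaluated in place.
viaFString = f'Long division on {a}: {a} = {q}*{b} + {r}'

# Numbered fields: {0} refers to the first argument, {1} to the second, etc.
viaNumbered = 'Long division on {0}: {0} = {1}*{2} + {3}'.format(a, q, b, r)

print(viaFString)   # Long division on -7: -7 = -4*2 + 1
print(viaNumbered)  # Long division on -7: -7 = -4*2 + 1
```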

Exercises

  1. Write a function count_occurences(searchStr, letter) which returns the number of times letter occurs in the string searchStr.

  2. One variant of the slice notation not mentioned above is that we can specify a “step” value. What happens when we run the following?
    start = 1
    stop = 13
    step = 2
    myStr = 'ABCDEFGHIJKLMNOP'
    print(myStr[start:stop:step])
    

    Explain why we get that output. Experiment more with this version of slice notation. What happens if step = -1?

  3. Write a function that reverses the order of characters in a string.

  4. Write a function that removes all occurrences of a substring in a string. (Hint: Use the find method.)
