One of the most controversial design choices of Python is the use of dynamic types. Dynamic types of variables can often lead to confusion for beginners, but even for experts it is a common source of hard-to-find bugs. For this reason the concept of type annotations has been introduced later in the language to allow for static code analysis and more detailed source code documentation. However, type annotations are not enforced by the Python interpreter and therefore do not prevent the user from passing the wrong type of data to a function or class. This can lead to delayed code failure, bad user experiences, and difficult-to-find bugs.
For beginners the dynamic typing of Python can be a source of confusion. Consider the following example:
def my_function(x):
return 2 * x
# Expected behaviour
print(my_function(3))
# Half-Expected behaviour
print(my_function(3.14))
6
6.28
Now we call the same method with a different type of argument:
# Possibly unexpected behaviour
print(my_function("hello"))
hellohello
# Maybe even mor unexpected behaviour
print(my_function(x=True))
2
Now let’s define a different fucntion, which naively does the same as before:
def my_other_function(x):
return 2.0 * x
# Now try this
print(my_other_function("hello"))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[4], line 6
2 return 2.0 * x
5 # Now try this
----> 6 print(my_other_function("hello"))
Cell In[4], line 2, in my_other_function(x)
1 def my_other_function(x):
----> 2 return 2.0 * x
TypeError: can't multiply sequence by non-int of type 'float'
Python will “happily” execute the calls to my_function
, but the result is possibly un-expected. In contrast the call to my_other_function
fails. The reason is that the *
operator is overloaded for strings / bools and integers, but not for strings and floats. This behaviour is built into the Python language and cannot really be changed.
In this example we read an example input YAML file, which contains temperature data. The data is then used to calculate the average temperature. However, the input data is not validated and therefore the code will fail at a later point in time.
import yaml
def parse_data_into_lists(data):
"""Parse data into lists of values"""
temperature = [_["temperature"] for _ in data]
time = [_["time"] for _ in data]
return {
"temperature": temperature,
"time": time,
}
with open("my-data.yaml", "r") as fh:
data = yaml.safe_load(fh)
processed = parse_data_into_lists(data=data)
# Compute mean
mean = sum(processed["temperature"]) / len(processed["temperature"])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[5], line 20
17 processed = parse_data_into_lists(data=data)
19 # Compute mean
---> 20 mean = sum(processed["temperature"]) / len(processed["temperature"])
TypeError: unsupported operand type(s) for +: 'int' and 'str'
This is especially annoying if the code is part of a larger application and the reading and processing takes time.
One last example is of the category “stupid bug”: this includes typos, incorrect use of APIs and other mistakes that are easy to make and sometimes hard to find, often only by close inspection of the code.
from matplotlib import pyplot as plt
x = [1, 2, 3, 4, 5]
y = [my_function(i) for i in x]
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.xlim = (0, 6) # Searching for half an hour why the plot has the wrong limits
plt.xlim(0, 6) # Now trying to correct in the notebook...ops even more confusing...
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 plt.xlim(0, 6) # Now trying to correct in the notebook...ops even more confusing...
TypeError: 'tuple' object is not callable
Quick survey:
Recall that classes in Python are defined with the class
keyword followed by the name of the class (usually capitalized) and a colon. Class and instance methods are defined inside the class by indenting once. Let’s create a simple class to represent a two dimensional point coordinate. We will add “V1” to the end of the class name so that we can make improvements later (“V2”, “V3”, etc.) and compare versions.
class PointV1:
"""Representation of a two-dimensional point coordinate."""
def __init__(self, x, y):
"""Initializes a PointV1 with the given coordinates."""
self.x = x
self.y = y
def distance_to(self, other):
"""Computes the distance to another `PointV1`."""
dx = self.x - other.x
dy = self.y - other.y
return (dx**2 + dy**2) ** 0.5
This class definition is correct and will behave as expected, assuming the user knows how to use the class properly. But this class suffers from two problems:
First, users of this class may be able to understand that x
and y
are intended to be integers or floats. But they would need to look at the code in order to come to this conclusion and this may be difficult for more complex classes. This can be improved by modifying the docstrings to include the intended argument types, but we will soon see an even better way to show the user the intended argument and output types.
Secondly, your IDE will not understand the intended argument and output types and thus will not highlight bugs such as the following:
p1 = PointV1(x="5", y="7")
Of course the user will discover the bug when attempting to compute the distance to another PointV1
:
p2 = PointV1(x=-1, y=4)
p1.distance_to(p2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[10], line 3
1 p2 = PointV1(x=-1, y=4)
----> 3 p1.distance_to(p2)
Cell In[8], line 11, in PointV1.distance_to(self, other)
9 def distance_to(self, other):
10 """Computes the distance to another `PointV1`."""
---> 11 dx = self.x - other.x
12 dy = self.y - other.y
13 return (dx**2 + dy**2) ** 0.5
TypeError: unsupported operand type(s) for -: 'str' and 'int'
But it would be nice if your code editor (IDE) could highlight this mistake for you before even executing a single line of code.
Introduced in Python 3.5, type annotations (also known as type hints) provide a mechanism to solve both problems simultaneously.
Quick survey:
Intended types can be supplied after an argument or variable by separating the two with a colon. And output types can by supplied after the closing parenthesis and before the colon with an arrow ->
. Let’s improve the PointV1
class with type annotations.
class PointV2:
"""Representation of a two-dimensional point coordinate."""
def __init__(self, x: float, y: float) -> None:
"""Initializes a PointV2 with the given coordinates."""
self.x = x
self.y = y
def distance_to(self, other: "PointV2") -> float:
"""Computes the distance to another `PointV2`."""
dx = self.x - other.x
dy = self.y - other.y
return (dx**2 + dy**2) ** 0.5
The first problem is solved by clearly indicating the expected argument and output types. x
and y
are marked as float
indicating that the user should provide floats
for these arguments. The output of __init__
is marked as None
indicating that there is no value returned by this method. other
is marked as "PointV2"
to indicate that the the argument supplied to distance_to
should be another PointV2
. Note the quotes around "PointV2"
because the PointV2
class is not fully defined until after the body of the PointV2
class. Most modern IDEs will understand that "PointV2"
means PointV2
. And the output of distance_to
is marked as float
to indicate that the output should a float.
One other note is that self
is not given any type hints. This is convention and modern IDEs will understand that self
is a reference to an instance of PointV2
.
The second problem is solved because now your IDE will understand if you have a bug like the one above. For example, PyCharm will warn you by highlighting the offending pieces of code as in the following screen shot:
Notice the orange squiggly lines under x="5"
and y="7"
in the creation of p1
and the lack of yellow squiggly lines under x=-1
and y=4
in the creation of p2
.
You can also use a tool such as mypy to check the types of your code. Let’s check the code in the example my-script.py
:
!cat my-script.py
class PointV2:
"""Representation of a two-dimensional point coordinate."""
def __init__(self, x: float, y: float) -> None:
"""Initializes a PointV2 with the given coordinates."""
self.x = x
self.y = y
def distance_to(self, other: "PointV2") -> float:
"""Computes the distance to another `PointV2`."""
dx = self.x - other.x
dy = self.y - other.y
return (dx**2 + dy**2) ** 0.5
p1 = PointV2(x="5", y="7")
p2 = PointV2(x=5, y=7)
!mypy my-script.py
my-script.py:16: [1m[31merror:[m Argument [m[1m"x"[m to [m[1m"PointV2"[m has incompatible type [m[1m"str"[m; expected [m[1m"float"[m [m[33m[arg-type][m
my-script.py:16: [1m[31merror:[m Argument [m[1m"y"[m to [m[1m"PointV2"[m has incompatible type [m[1m"str"[m; expected [m[1m"float"[m [m[33m[arg-type][m
[1m[31mFound 2 errors in 1 file (checked 1 source file)[m
However it is important to repeat that type annotations are not enforced by the Python interpreter, but are typically only used by external tools to perform static code analysis.
There is another small problem with the class definitions we introduced above: it includes a lot of so called “Boilerplate Code”. This is code that is just required to make the defintion work and does not add any functionality. In this case the boilerplate code is the __init__
method. This method is required to initialize the class, but it does not add any functionality.
In fact, the __init__
method is so common that Python introduced various shortcuts in the standard library to simplfy the definiton of such data structures. The most common ones are:
dataclass
decoratornamedtuple
methodTypedDict
typeQuick survey:
dataclass
, namedtuple
or TypedDict
before?Let’s write the Point class, but this time using the dataclass
decorator:
from dataclasses import dataclass
@dataclass
class PointV3:
"""Representation of a two-dimensional point coordinate."""
x: float
y: float
def distance_to(self, other: "PointV3") -> float:
"""Computes the distance to another `PointV3`."""
dx = self.x - other.x
dy = self.y - other.y
return (dx**2 + dy**2) ** 0.5
This allows to simplify the class definition and skip the definiton of the __init__
method.
p1 = PointV3(x=5, y=7)
p2 = PointV3(x=-1, y=4)
print(p1.distance_to(other=p2))
6.708203932499369
There are similar concepts to dataclass
, such as namedtuple
, which allows to define a class with a fixed set of attributes:
from collections import namedtuple
PointV4 = namedtuple("PointV4", "x y")
p1 = PointV4(x=5, y=7)
display(p1)
PointV4(x=5, y=7)
However they have very specific use cases and are not as flexible as dataclass
. For example, namedtuple
does not directly allow to define methods and attributes are immutable:
p1.x = 64
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[17], line 1
----> 1 p1.x = 64
AttributeError: can't set attribute
Another variation is to use a TypedDict
to define a class with a fixed set of attributes:
from typing import TypedDict
class PointV5(TypedDict):
x: float
y: float
However this is used in a rather unconventional way:
p1: PointV5 = {"x": 5, "y": 7}
And as with the other classes, the types are not enforced during runtime, such that they can be changed anytime later:
p1["x"] = "34"
The TypedDict
is again mostly useful for static code analysis, in which case the statement above would be highlighted as an error.
All these limitations and issues listed above can be addressed with the pydantic library. This library allows to define data classes with type annotations and to validate the input data against the type annotations.