Additional Material: Persistence
Last updated on 2024-11-20 | Edit this page
Introduction
Follow up from Section 3
This episode could be read as a follow up from the end of Section 3 on software design and development.
Our patient data system so far can read in some data, process it, and display it to people. What’s missing?
Well, at the moment, if we wanted to add a new patient or perform a new observation, we would have to edit the input CSV file by hand. We might not want our staff to have to manage their patients by making changes to the data by hand, but rather provide the ability to do this through the software. That way we can perform any necessary validation (e.g. inflammation measurements must be a number) or transformation before the data gets accepted.
If we want to bring in this data, modify it somehow, and save it back to a file, all using our existing MVC architecture pattern, we’ll need to:
- Write some code to perform data import / export (persistence)
- Add some views we can use to modify the data
- Link it all together in the controller
Serialisation and Serialisers
The process of converting data from an object to and from storable formats is often called serialisation and deserialisation and is handled by a serialiser. Serialisation is the process of exporting our structured data to a usually text-based format for easy storage or transfer, while deserialisation is the opposite. We’re going to be making a serialiser for our patient data, but since there are many different formats we might eventually want to use to store the data, we’ll also make sure it’s possible to add alternative serialisers later and swap between them. So let’s start by creating a base class to represent the concept of a serialiser for our patient data - then we can specialise this to make serialisers for different formats by inheriting from this base class.
By creating a base class we provide a contract that any kind of patient serialiser must satisfy. If we create some alternative serialisers for different data formats, we know that we will be able to use them all in exactly the same way. This technique is part of an approach called design by contract.
We’ll call our base class PatientSerializer
and put it
in file inflammation/serializers.py
.
PYTHON
# file: inflammation/serializers.py
from inflammation import models
class PatientSerializer:
model = models.Patient
@classmethod
def serialize(cls, instances):
raise NotImplementedError
@classmethod
def save(cls, instances, path):
raise NotImplementedError
@classmethod
def deserialize(cls, data):
raise NotImplementedError
@classmethod
def load(cls, path):
raise NotImplementedError
Our serialiser base class has two pairs of class methods (denoted by
the @classmethod
decorators), one to serialise (save) the
data and one to deserialise (load) it. We’re not actually going to
implement any of them quite yet as this is just a template for how our
real serialisers should look, so we’ll raise
NotImplementedError
to make this clear if anyone tries to
use this class directly. The reason we’ve used class methods is that we
don’t need to be able to pass any data in using the
__init__
method, as we’ll be passing the data to be
serialised directly to the save
function.
There are many different formats we could use to store our data, but a good one is JSON (JavaScript Object Notation). This format comes originally from JavaScript, but is now one of the most widely used serialisation formats for exchange or storage of structured data, used across most common programming languages.
Data in JSON format is structured using nested arrays (very similar to Python lists) and objects (very similar to Python dictionaries). For example, we’re going to try to use this format to store data about our patients:
JSON
[
{
"name": "Alice",
"observations": [
{
"day": 0,
"value": 3
},
{
"day": 1,
"value": 4
}
]
},
{
"name": "Bob",
"observations": [
{
"day": 0,
"value": 10
}
]
}
]
Compared to the CSV format, this gives us much more flexibility to describe complex structured data. If we wanted to represent this data in CSV format, the most natural way would be to have two separate files: one with each row representing a patient, the other with each row representing an observation. We’d then need to use a unique identifier to link each observation record to the relevant patient. This is how relational databases work, but it would be quite complicated to manage this ourselves with CSVs.
Now, if we are going to follow TDD
(Test Driven Development), we should write some test code. Our JSON
serialiser should be able to save and load our patient data to and from
a JSON file, so for our test we could try these save-load steps and
check that the result is the same as the data we started with. Again you
might need to change these examples slightly to get them to fit with how
you chose to implement your Patient
class.
PYTHON
# file: tests/test_serializers.py
from inflammation import models, serializers
def test_patients_json_serializer():
# Create some test data
patients = [
models.Patient('Alice', [models.Observation(i, i + 1) for i in range(3)]),
models.Patient('Bob', [models.Observation(i, 2 * i) for i in range(3)]),
]
# Save and reload the data
output_file = 'patients.json'
serializers.PatientJSONSerializer.save(patients, output_file)
patients_new = serializers.PatientJSONSerializer.load(output_file)
# Check that we've got the same data back
for patient_new, patient in zip(patients_new, patients):
assert patient_new.name == patient.name
for obs_new, obs in zip(patient_new.observations, patient.observations):
assert obs_new.day == obs.day
assert obs_new.value == obs.value
Here we set up some patient data, which we save to a file named
patients.json
. We then load the data from this file and
check that the results match the input.
With our test, we know what the correct behaviour looks like - now
it’s time to implement it. For this, we’ll use one of Python’s built-in
libraries. Among other more complex features, the json
library provides functions for converting between Python data structures
and JSON formatted text files. Our test also didn’t specify what the
structure of our output data should be, so we need to make that decision
here - we’ll use the format we used as JSON example earlier.
PYTHON
# file: inflammation/serializers.py
import json
from inflammation import models
class PatientSerializer:
model = models.Patient
@classmethod
def serialize(cls, instances):
return [{
'name': instance.name,
'observations': instance.observations,
} for instance in instances]
@classmethod
def deserialize(cls, data):
return [cls.model(**d) for d in data]
class PatientJSONSerializer(PatientSerializer):
@classmethod
def save(cls, instances, path):
with open(path, 'w') as jsonfile:
json.dump(cls.serialize(instances), jsonfile)
@classmethod
def load(cls, path):
with open(path) as jsonfile:
data = json.load(jsonfile)
return cls.deserialize(data)
For our save
/ serialize
methods, since the
JSON format is similar to nested Python lists and dictionaries, it makes
sense as a first step to convert the data from our Patient
class into a dictionary - we do this for each patient using a list
comprehension. Then we can pass this to the json.dump
function to save it to a file.
As we might expect, the load
/ deserialize
methods are the opposite of this. Here we need to first read the data
from our input file, then convert it to instances of our
Patient
class. The **
syntax here may be
unfamiliar to you - this is the dictionary unpacking
operator. The dictionary unpacking operator can be used when
calling a function (like a class __init__
method) and
passes the items in the dictionary as named arguments to the function.
The name of each argument passed is the dictionary key, the value of the
argument is the dictionary value.
When we run the tests however, we should get an error:
ERROR
FAILED tests/test_serializers.py::test_patients_json_serializer - TypeError: Object of type Observation is not JSON serializable
This means that our patient serializer almost works, but we need to write a serializer for our observation model as well!
Since this new serializer is not a type of
PatientSerializer
, we need to inherit from a new base class
which holds the design that is shared between
PatientSerializer
and ObservationSerializer
.
Since we don’t actually need to save the observation data to a file
independently, we won’t worry about implementing the save
and load
methods for the Observation
model.
PYTHON
# file: inflammation/serializers.py
from inflammation import models
class Serializer:
@classmethod
def serialize(cls, instances):
raise NotImplementedError
@classmethod
def save(cls, instances, path):
raise NotImplementedError
@classmethod
def deserialize(cls, data):
raise NotImplementedError
@classmethod
def load(cls, path):
raise NotImplementedError
class ObservationSerializer(Serializer):
model = models.Observation
@classmethod
def serialize(cls, instances):
return [{
'day': instance.day,
'value': instance.value,
} for instance in instances]
@classmethod
def deserialize(cls, data):
return [cls.model(**d) for d in data]
...
Now we can link this up to the PatientSerializer
and our
test should finally pass.
PYTHON
# file: inflammation/serializers.py
...
class PatientSerializer(Serializer):
model = models.Patient
@classmethod
def serialize(cls, instances):
return [{
'name': instance.name,
'observations': ObservationSerializer.serialize(instance.observations),
} for instance in instances]
@classmethod
def deserialize(cls, data):
instances = []
for item in data:
item['observations'] = ObservationSerializer.deserialize(item.pop('observations'))
instances.append(cls.model(**item))
return instances
...
Linking it All Together
We’ve now got some code which we can use to save and load our patient data, but we’ve not yet linked it up so people can use it.
Just like we did with the display_patient
view in Section 3, try
adding some views to work with our patient data using the JSON
serialiser. When you do this, think about the design of the command line
interface - what arguments will you need to get from the user, what
output should they receive back?
Equality Testing
When we wrote our serialiser test, we said we wanted to check that
the data coming out was the same as our input data, but we actually
compared just parts of the data, rather than just using
assert patients_new == patients
.
The reason for this is that, by default, ==
comparing
two instances of a class tests whether they’re stored at the same
location in memory, rather than just whether they contain the same
data.
Add some code to the Patient
and
Observation
classes, so that we get the expected result
when we do assert patients_new == patients
. When you have
this comparison working, update the serialiser test to use this
instead.
Hint: The method Python uses to check for equality
of two instances of a class is called __eq__
and takes the
arguments self
(as all normal methods do) and
other
.
Advanced Challenge: Abstract Base Classes
Since our Serializer
class is designed not to be
directly usable and its methods raise NotImplementedError
,
it ideally should be an abstract base class. An abstract base class is
one which is intended to be used only by creating subclasses of it and
can mark some or all of its methods as requiring implementation in the
new subclass.
Using Python’s documentation on the abc module,
convert the Serializer
class into an ABC.
Hint: The only component that needs to be changed is
Serializer
- this should not require any changes to the
other classes.
Hint: The abc module documentation refers to metaclasses - don’t worry about these. A metaclass is a template for creating a class (classes are instances of a metaclass), just like a class is a template for creating objects (objects are instances of a class), but this isn’t necessary to understand if you’re just using them to create your own abstract base classes.
Advanced Challenge: CSV Serialization
Try implementing an alternative serialiser, using the CSV format instead of JSON.
Hint: Python also has a module for handling CSVs - see the documentation for the csv module. This module provides a CSV reader and writer which are a bit more flexible, but slower for purely numeric data, than the ones we’ve seen previously as part of NumPy.
Can you think of any cases when a CSV might not be a suitable format to hold our patient data?