Extra Content: Persistence

Last updated on 2024-12-06 | Edit this page

Estimated time: 50 minutes

Overview

Questions

  • How can we store and transfer structured data?
  • How can we make it easier to substitute new components into our software?

Objectives

  • Describe how the environment in which software is used may constrain its design.
  • Identify common components of multi-layer software projects.
  • Define serialisation and deserialisation.
  • Store and retrieve structured data using an appropriate format.
  • Define what is meant by a contract in the context of Object Oriented design.
  • Explain the benefits of contracts and implement software components which fulfill them.

Follow up from Section 3

This episode could be read as a follow up from the end of Section 3 on software design and development.

Our patient data system so far can read in some data, process it, and display it to people. What’s missing?

Well, at the moment, if we wanted to add a new patient or perform a new observation, we would have to edit the input CSV file by hand. We might not want our staff to have to manage their patients by making changes to the data by hand, but rather provide the ability to do this through the software. That way we can perform any necessary validation (e.g. inflammation measurements must be a number) or transformation before the data gets accepted.

If we want to bring in this data, modify it somehow, and save it back to a file, all using our existing MVC architecture pattern, we will need to:

  • Write some code to perform data import / export (persistence)
  • Add some views we can use to modify the data
  • Link it all together in the controller

Serialisation and Serialisers


The process of converting data from an object to and from storable formats is often called serialisation and deserialisation and is handled by a serialiser. Serialisation is the process of exporting our structured data to a usually text-based format for easy storage or transfer, while deserialisation is the opposite. We are going to be making a serialiser for our patient data, but since there are many different formats we might eventually want to use to store the data, we will also make sure it is possible to add alternative serialisers later and swap between them. So let us start by creating a base class to represent the concept of a serialiser for our patient data - then we can specialise this to make serialisers for different formats by inheriting from this base class.

By creating a base class we provide a contract that any kind of patient serialiser must satisfy. If we create some alternative serialisers for different data formats, we know that we will be able to use them all in exactly the same way. This technique is part of an approach called design by contract.

We will call our base class PatientSerializer and put it in file inflammation/serializers.py.

PYTHON

# file: inflammation/serializers.py

from inflammation import models


class PatientSerializer:
    model = models.Patient

    @classmethod
    def serialize(cls, instances):
        raise NotImplementedError

    @classmethod
    def save(cls, instances, path):
        raise NotImplementedError

    @classmethod
    def deserialize(cls, data):
        raise NotImplementedError

    @classmethod
    def load(cls, path):
        raise NotImplementedError

Our serialiser base class has two pairs of class methods (denoted by the @classmethod decorators), one to serialise (save) the data and one to deserialise (load) it. We are not actually going to implement any of them quite yet as this is just a template for how our real serialisers should look, so we will raise NotImplementedError to make this clear if anyone tries to use this class directly. The reason we have used class methods is that we do not need to be able to pass any data in using the __init__ method, as we will be passing the data to be serialised directly to the save function.

There are many different formats we could use to store our data, but a good one is JSON (JavaScript Object Notation). This format comes originally from JavaScript, but is now one of the most widely used serialisation formats for exchange or storage of structured data, used across most common programming languages.

Data in JSON format is structured using nested arrays (very similar to Python lists) and objects (very similar to Python dictionaries). For example, we are going to try to use this format to store data about our patients:

JSON

[
    {
        "name": "Alice",
        "observations": [
            {
                "day": 0,
                "value": 3
            },
            {
                "day": 1,
                "value": 4
            }
        ]
    },
    {
        "name": "Bob",
        "observations": [
            {
                "day": 0,
                "value": 10
            }
        ]
    }
]

Compared to the CSV format, this gives us much more flexibility to describe complex structured data. If we wanted to represent this data in CSV format, the most natural way would be to have two separate files: one with each row representing a patient, the other with each row representing an observation. We would then need to use a unique identifier to link each observation record to the relevant patient. This is how relational databases work, but it would be quite complicated to manage this ourselves with CSVs.

Now, if we are going to follow TDD (Test Driven Development), we should write some test code. Our JSON serialiser should be able to save and load our patient data to and from a JSON file, so for our test we could try these save-load steps and check that the result is the same as the data we started with. Again you might need to change these examples slightly to get them to fit with how you chose to implement your Patient class.

PYTHON

# file: tests/test_serializers.py

from inflammation import models, serializers

def test_patients_json_serializer():
    # Create some test data
    patients = [
        models.Patient('Alice', [models.Observation(i, i + 1) for i in range(3)]),
        models.Patient('Bob', [models.Observation(i, 2 * i) for i in range(3)]),
    ]

    # Save and reload the data
    output_file = 'patients.json'
    serializers.PatientJSONSerializer.save(patients, output_file)
    patients_new = serializers.PatientJSONSerializer.load(output_file)

    # Check that we have got the same data back
    for patient_new, patient in zip(patients_new, patients):
        assert patient_new.name == patient.name

        for obs_new, obs in zip(patient_new.observations, patient.observations):
            assert obs_new.day == obs.day
            assert obs_new.value == obs.value

Here we set up some patient data, which we save to a file named patients.json. We then load the data from this file and check that the results match the input.

With our test, we know what the correct behaviour looks like - now it is time to implement it. For this, we will use one of Python’s built-in libraries. Among other more complex features, the json library provides functions for converting between Python data structures and JSON formatted text files. Our test also didn’t specify what the structure of our output data should be, so we need to make that decision here - we will use the format we used as JSON example earlier.

PYTHON

# file: inflammation/serializers.py

import json
from inflammation import models

class PatientSerializer:
    model = models.Patient

    @classmethod
    def serialize(cls, instances):
        return [{
            'name': instance.name,
            'observations': instance.observations,
        } for instance in instances]

    @classmethod
    def deserialize(cls, data):
        return [cls.model(**d) for d in data]


class PatientJSONSerializer(PatientSerializer):
    @classmethod
    def save(cls, instances, path):
        with open(path, 'w') as jsonfile:
            json.dump(cls.serialize(instances), jsonfile)

    @classmethod
    def load(cls, path):
        with open(path) as jsonfile:
            data = json.load(jsonfile)

        return cls.deserialize(data)

For our save / serialize methods, since the JSON format is similar to nested Python lists and dictionaries, it makes sense as a first step to convert the data from our Patient class into a dictionary - we do this for each patient using a list comprehension. Then we can pass this to the json.dump function to save it to a file.

As we might expect, the load / deserialize methods are the opposite of this. Here we need to first read the data from our input file, then convert it to instances of our Patient class. The ** syntax here may be unfamiliar to you - this is the dictionary unpacking operator. The dictionary unpacking operator can be used when calling a function (like a class __init__ method) and passes the items in the dictionary as named arguments to the function. The name of each argument passed is the dictionary key, the value of the argument is the dictionary value.

When we run the tests however, we should get an error:

ERROR

FAILED tests/test_serializers.py::test_patients_json_serializer - TypeError: Object of type Observation is not JSON serializable

This means that our patient serializer almost works, but we need to write a serializer for our observation model as well!

Since this new serializer is not a type of PatientSerializer, we need to inherit from a new base class which holds the design that is shared between PatientSerializer and ObservationSerializer. Since we do not actually need to save the observation data to a file independently, we will not worry about implementing the save and load methods for the Observation model.

PYTHON

# file: inflammation/serializers.py

from inflammation import models


class Serializer:
    @classmethod
    def serialize(cls, instances):
        raise NotImplementedError

    @classmethod
    def save(cls, instances, path):
        raise NotImplementedError

    @classmethod
    def deserialize(cls, data):
        raise NotImplementedError

    @classmethod
    def load(cls, path):
        raise NotImplementedError


class ObservationSerializer(Serializer):
    model = models.Observation

    @classmethod
    def serialize(cls, instances):
        return [{
            'day': instance.day,
            'value': instance.value,
        } for instance in instances]

    @classmethod
    def deserialize(cls, data):
        return [cls.model(**d) for d in data]

...

Now we can link this up to the PatientSerializer and our test should finally pass.

PYTHON

# file: inflammation/serializers.py
...

class PatientSerializer(Serializer):
    model = models.Patient

    @classmethod
    def serialize(cls, instances):
        return [{
            'name': instance.name,
            'observations': ObservationSerializer.serialize(instance.observations),
        } for instance in instances]

    @classmethod
    def deserialize(cls, data):
        instances = []

        for item in data:
            item['observations'] = ObservationSerializer.deserialize(item.pop('observations'))
            instances.append(cls.model(**item))

        return instances

...

Linking it All Together

We have now got some code which we can use to save and load our patient data, but we have not yet linked it up so people can use it.

Try adding some views to work with our patient data using the JSON serialiser. When you do this, think about the design of the command line interface - what arguments will you need to get from the user, what output should they receive back?

Equality Testing

When we wrote our serialiser test, we said we wanted to check that the data coming out was the same as our input data, but we actually compared just parts of the data, rather than just using assert patients_new == patients.

The reason for this is that, by default, == comparing two instances of a class tests whether they are stored at the same location in memory, rather than just whether they contain the same data.

Add some code to the Patient and Observation classes, so that we get the expected result when we do assert patients_new == patients. When you have this comparison working, update the serialiser test to use this instead.

Hint: The method Python uses to check for equality of two instances of a class is called __eq__ and takes the arguments self (as all normal methods do) and other.

Advanced Challenge: Abstract Base Classes

Since our Serializer class is designed not to be directly usable and its methods raise NotImplementedError, it ideally should be an abstract base class. An abstract base class is one which is intended to be used only by creating subclasses of it and can mark some or all of its methods as requiring implementation in the new subclass.

Using Python’s documentation on the abc module, convert the Serializer class into an ABC.

Hint: The only component that needs to be changed is Serializer - this should not require any changes to the other classes.

Hint: The abc module documentation refers to metaclasses - do not worry about these. A metaclass is a template for creating a class (classes are instances of a metaclass), just like a class is a template for creating objects (objects are instances of a class), but this is not necessary to understand if you are just using them to create your own abstract base classes.

Advanced Challenge: CSV Serialization

Try implementing an alternative serialiser, using the CSV format instead of JSON.

Hint: Python also has a module for handling CSVs - see the documentation for the csv module. This module provides a CSV reader and writer which are a bit more flexible, but slower for purely numeric data, than the ones we have seen previously as part of NumPy.

Can you think of any cases when a CSV might not be a suitable format to hold our patient data?

Key Points

  • Planning software projects in advance can save a lot of effort later - even a partial plan is better than no plan at all.
  • The environment in which users run our software has an effect on many design choices we might make.
  • By breaking down our software into components with a single responsibility, we avoid having to rewrite it all when requirements change.
  • These components can be as small as a single function, or be a software package in their own right.
  • When writing software used for research, requirements always change.