This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

Intermediate Research Software Development: Refactor 1: Software Design

Introduction

Ideally, we should have at least a rough design of our software sketched out before we write a single line of code. This design should be based on the requirements and the structure of the problem we are trying to solve: what concepts do we need to represent in our code, and what are the relationships between them? Importantly, who will be using our software, and how will they interact with it?

As a piece of software grows, it will reach a point where there is too much code for us to keep in mind at once. At this point, it becomes particularly important to think about the overall design and structure of our software: how all the pieces of functionality should fit together, and how we should work towards fulfilling this overall design throughout development. Even if you did not think about the design of your software from the very beginning, it is not too late to start now.

It is not easy to come up with a complete definition for the term software design, but some of the common aspects are:

There is literature on each of the above software design aspects - we will not go into details of them all here. Instead, we will learn some techniques to structure our code better to satisfy some of the requirements of ‘good’ software and revisit our software’s MVC architecture in the context of software design.

Good Software Design Goals

Aspirationally, what makes good code can be summarised in the following quote from the Intent HG blog:

“Good code is written so that is readable, understandable, covered by automated tests, not over complicated and does well what is intended to do.”

Software has become a crucial aspect of reproducible research, as well as an asset that can be reused or repurposed. Thus, it is even more important to take time to design the software to be easily modifiable and extensible, to save ourselves and our team a lot of time later on when we have to fix a problem or the software’s requirements change.

Satisfying the above properties will lead to an overall software design goal of having maintainable code, which is:

Now that we know what goals we should aspire to, let us take a critical look at the code in our software project and try to identify ways in which it can be improved.

Our software project contains a branch full-data-analysis with code for a new feature of our catchment analysis software. Recall that you can see all your branches as follows:

$ git branch --all

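Let’s check out a new local branch from the full-data-analysis branch, making sure we have saved and committed all current changes before doing so. When no local branch of that name exists and exactly one remote has it, the following command creates a local branch that tracks the remote one (whereas adding the -b option without a start point would create the new branch from the current commit instead).

$ git checkout full-data-analysis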

This new feature enables the user to pass a new command-line parameter, --full-data-analysis, which causes the software to find the directory containing the first input data file (provided via the command-line parameter infiles) and run the data analysis over all the data files in that directory. This functionality is handled by catchment-analysis.py in the project root. For example:

python catchment-analysis.py data/rain_data_small.csv --full-data-analysis
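
To make the behaviour described above more concrete, here is a minimal sketch of how such a flag could be parsed and how the data directory could be derived from the first input file. It is not the code from catchment-analysis.py: the function names build_parser() and data_directory() are hypothetical, and only the parameter names infiles and --full-data-analysis come from the lesson.

```python
import argparse
import os


def build_parser():
    """Build a parser with one or more input files and an optional flag.

    Hypothetical helper for illustration; not the project's actual parser.
    """
    parser = argparse.ArgumentParser(description="Illustrative catchment analysis CLI")
    parser.add_argument("infiles", nargs="+", help="Input CSV file(s)")
    parser.add_argument(
        "--full-data-analysis",
        action="store_true",
        help="Analyse every data file in the first input file's directory",
    )
    return parser


def data_directory(infiles):
    """Return the directory that contains the first input file."""
    return os.path.dirname(infiles[0])


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["data/rain_data_small.csv", "--full-data-analysis"])
if args.full_data_analysis:
    print(f"Would analyse all data files in: {data_directory(args.infiles)}")
```

Note that argparse converts the dashes in --full-data-analysis to underscores, so the value is read as args.full_data_analysis.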

The new data analysis code is located in the compute_data.py file within the catchment directory, in a function called analyse_data(). This function loads all the data files for a given directory path, then calculates and compares the standard deviation across all the data by day, and finally plots a graph.

Exercise: Identifying How Code Can Be Improved

Critically examine the code in the analyse_data() function in the compute_data.py file.

In what ways does this code not live up to the ideal properties of ‘good’ code? Think about the ways in which you find it hard to understand. Think about the kinds of changes you might want to make to it, and what would make those changes challenging to carry out.

Solution

You may have found others, but here are some of the things that make the code hard to read, test and maintain.

  • Hard to read: everything is implemented in a single function. In order to understand it, you need to understand how file loading works at the same time as the analysis itself.
  • Hard to modify: if you wanted to use the data for some other purpose, and not just for plotting the graph, you would have to change the analyse_data() function.
  • Hard to modify or test: it always analyses a fixed set of CSV data files within whichever directory it accesses, rather than just the file that is given as an argument.
  • Hard to modify: it does not have any tests so we cannot be 100% confident the code does what it claims to do; any changes to the code may break something and it would be harder and more time-consuming to figure out what.

Make sure to keep the list you have created in the exercise above. For the remainder of this section, we will work on improving this code. At the end, we will revisit your list to check that you have learnt ways to address each of the problems you had found.

There may be other things to improve in the code on this branch, e.g. how command-line parameters are handled in catchment-analysis.py, but we are focussing on the analyse_data() function for the time being.

Poor Design Choices & Technical Debt

When faced with a problem that you need to solve by writing code, it may be tempting to skip the design phase and dive straight into coding. What happens if you do not follow good software design and development practices? It can lead to accumulated ‘technical debt’, which (according to Wikipedia) is the “cost of additional rework caused by choosing an easy (limited) solution now instead of using a better approach that would take longer”. The pressure to achieve project goals can sometimes lead to quick and easy solutions, which make the software messier, more complex, and more difficult to understand and maintain. The extra effort required to make changes in the future is the interest paid on the (technical) debt. It is natural for software to accrue some technical debt, but it is important to pay off that debt during a maintenance phase, by simplifying and clarifying the code and making it easier to understand, to keep the interest payments on future changes manageable.

There is only so much time available in a project. How much effort should we spend on designing our code properly and using good development practices? The following XKCD comic summarises this tension:

[Image: XKCD comic about writing good code]

At an intermediate level there is a wealth of practices that could be used, and applying suitable design and coding practices is what separates an intermediate developer from someone who has just started coding. The key for an intermediate developer is to balance these concerns appropriately for each software project, and to employ design and development practices just enough that progress can be made. It is very easy to under-design software, but remember that it is also possible to over-design it.

Techniques for Improving Code

How code is structured is important for helping people who are developing and maintaining it to understand and update it. By breaking down our software into components with a single responsibility, we avoid having to rewrite it all when requirements change. Such components can be as small as a single function, or be a software package in their own right. These smaller components can be understood individually without having to understand the entire codebase at once.

Code Refactoring

Code refactoring is the process of improving the design of existing code: changing the internal structure of the code without changing its external behaviour, with the goal of making the code more readable, maintainable, efficient or easier to test. This can include things such as renaming variables, reorganising functions to avoid code duplication and increase reuse, and simplifying conditional statements.
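
As a small illustration of refactoring, the sketch below shows a function before and after simplifying its nested conditionals and naming a magic number. The rainfall labels, the threshold value and all function names are hypothetical and are not taken from the lesson's project; the point is only that the external behaviour stays the same while the internal structure becomes easier to read and test.

```python
# Before: nested conditionals and an unexplained number make the logic hard to follow.
def classify_rainfall_before(values):
    labels = []
    for v in values:
        if v >= 0:
            if v > 10:
                labels.append("heavy")
            else:
                if v > 0:
                    labels.append("light")
                else:
                    labels.append("dry")
        else:
            labels.append("invalid")
    return labels


# After: same external behaviour, but flatter conditionals and a named constant.
HEAVY_RAIN_THRESHOLD = 10  # hypothetical threshold, for illustration only


def classify_reading(value):
    """Classify a single reading; small enough to test on its own."""
    if value > HEAVY_RAIN_THRESHOLD:
        return "heavy"
    if value > 0:
        return "light"
    if value == 0:
        return "dry"
    return "invalid"


def classify_rainfall(values):
    """Classify every reading by delegating to classify_reading()."""
    return [classify_reading(v) for v in values]


# A quick check that the refactoring did not change the results.
readings = [-1, 0, 2.5, 12]
print(classify_rainfall_before(readings) == classify_rainfall(readings))
```

In practice, a check like the final line would live in an automated test, so that behaviour can be verified after every refactoring step.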

Code Decoupling

Code decoupling is a code design technique that involves breaking a (complex) software system into smaller, more manageable parts, and reducing the interdependence between these different parts of the system. This means that a change in one part of the code usually does not require a change in the other, thereby making its development more efficient and less error prone.
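
One way to picture decoupling is to split loading, calculating and presenting into separate functions that only communicate through plain values. The sketch below is an assumption-laden illustration rather than the project's code: the function names and the "rainfall" column name are hypothetical, and it uses only the Python standard library with in-memory data so that it can be run as is.

```python
import csv
import io
import statistics


def load_readings(csv_file):
    """Loading only: read the hypothetical 'rainfall' column from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return [float(row["rainfall"]) for row in reader]


def compute_standard_deviation(readings):
    """Calculation only: no file access and no output."""
    return statistics.pstdev(readings)


def report(value):
    """Presentation only: format the result for display."""
    return f"Standard deviation: {value:.2f}"


# In-memory CSV text stands in for a real data file.
sample = io.StringIO("rainfall\n2.0\n4.0\n4.0\n4.0\n5.0\n5.0\n7.0\n9.0\n")
readings = load_readings(sample)
print(report(compute_standard_deviation(readings)))
```

Because each function has a single responsibility, the calculation can be tested with a plain list of numbers, and the data source or the output format can change without touching the other parts.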

Code Abstraction

Abstraction is the process of hiding the implementation details of a piece of code (typically behind an interface) - i.e. the details of how something works are hidden away, leaving code developers to deal only with what it does. This allows developers to work with the code at a higher level of abstraction, without needing to understand fully (or keep in mind) all the underlying details at any given time and thereby reducing the cognitive load when programming.

Abstraction can be achieved through techniques such as encapsulation, inheritance, and polymorphism, which we will explore in the next episodes. There are other abstraction techniques available too.
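
As a minimal sketch of hiding implementation details behind an interface, the example below defines an abstract class that says what a data source does, and two implementations that differ in how they do it. All class and function names here are hypothetical and chosen for illustration only; the lesson's own approach may differ.

```python
from abc import ABC, abstractmethod


class ReadingSource(ABC):
    """Interface: callers only know that a source can provide readings."""

    @abstractmethod
    def readings(self):
        """Return a list of numeric readings."""


class InMemorySource(ReadingSource):
    """Keeps the readings in a list held by the object."""

    def __init__(self, values):
        self._values = list(values)

    def readings(self):
        return list(self._values)


class CsvTextSource(ReadingSource):
    """Parses comma-separated text; the parsing details stay hidden."""

    def __init__(self, text):
        self._text = text

    def readings(self):
        return [float(item) for item in self._text.split(",") if item.strip()]


def mean_reading(source):
    """Works with any ReadingSource without knowing how it stores its data."""
    values = source.readings()
    return sum(values) / len(values)


print(mean_reading(InMemorySource([1.0, 2.0, 3.0])))
print(mean_reading(CsvTextSource("2.0, 4.0, 6.0")))
```

The mean_reading() function depends only on the interface, so a new kind of source could be added later without changing it.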

Improving Our Software Design

Refactoring our code to make it more decoupled and to introduce abstractions to hide all but the relevant information about parts of the code is important for creating more maintainable code. It will help to keep our codebase clean, modular and easier to understand.

Writing good code is hard and takes practice. You may also be faced with an existing piece of code that breaks some (or all) of the good code principles, and your job will be to improve or refactor it so that it can evolve further. We will now look into some examples of the techniques that can help us redesign our code and incrementally improve its quality.