This lesson is still being designed and assembled (Pre-Alpha version)

Setting up a Project

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do I set up a project in practice?

  • What organization will help support the goals of my project?

  • What additional infrastructure will support opening my project?

Objectives
  • Create a project structure

  • Save helper excerpts of code

Project Organization

Now that we’ve brainstormed the parts of a project and talked a little about what each of them consists of, how should we organize the code to help our future selves and collaborators?

There isn’t a single right answer, but there are some guiding principles. There are also packages that create a basic setup for you. These can be helpful for getting started if you are building something that follows a lot of standards, but they do not help you reorganize your existing code.

In this section we begin by starting from scratch, although in reality you often already have code that you want to organize and sort to make it more functional. Starting clean lets us introduce the ideas and concepts; we’ll then return to how to sort and organize existing code into the bins we created.

Exercise

Let’s look around on GitHub for some examples and compare and contrast them.

Here are some ideas to consider:

Questions

  1. What files and directory structures are common?
  2. Which ones do you think you could get started with right away?
  3. What different goals do they seem to be organized for?

Next, we consider which of these ideas to adopt and look at some specific advice on each topic.

File Naming

This is the step of least resistance that you can take to make your code more reusable. Naming things is an important aspect of programming. This Data Carpentry episode provides some useful principles for file naming.

These are the three main characteristics of a good file name:

  1. Machine readable
  2. Human readable
  3. Plays well with default ordering
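
As a rough illustration of these three characteristics, here is a minimal Python sketch. The function `make_file_name`, the field order, and the example names are assumptions made for this illustration, not a convention prescribed by the lesson.

```python
from datetime import date


def make_file_name(day: date, project: str, description: str, index: int, ext: str) -> str:
    """Build a name like 2021-03-05_gapminder_life-expectancy_02.csv."""
    # ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain text.
    # "_" separates fields and "-" separates words, so the name is easy to split.
    # Zero-padding keeps 02 before 10 in the default (alphabetical) ordering.
    return f"{day.isoformat()}_{project}_{description}_{index:02d}.{ext}"


names = [
    make_file_name(date(2021, 3, 5), "gapminder", "life-expectancy", 10, "csv"),
    make_file_name(date(2021, 3, 5), "gapminder", "life-expectancy", 2, "csv"),
    make_file_name(date(2020, 12, 1), "gapminder", "life-expectancy", 1, "csv"),
]

# Plays well with default ordering: a plain sort is also chronological and numeric.
ordered = sorted(names)

# Machine readable: the fields can be recovered by splitting on "_".
stem = ordered[0].rsplit(".", 1)[0]
fields = stem.split("_")
```

The names stay human readable because each field is a plain word or date, while a script can still split them apart without guessing.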

Guiding Principles

There are numerous resources on good practices for starting and developing your project, such as:

In this lesson, we are going to create a project that attempts to abide by the guiding principles presented in these resources.

Setting up a project

Sometimes we get to start from scratch, so we can set up everything from the beginning.

Templates

For some types of projects there are tools that generate the base structure for you. These tools are sometimes called “cookie cutters” or simply project templates. They are available in a variety of languages, and some examples include:

For our lesson, we will be manually creating a small project. However, it will be similar to the examples above.

git clone
cd project
mkdir data
mkdir docs
mkdir experiments
mkdir package
touch setup.py
touch README.md
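
For comparison, the same skeleton can be created from Python, which is roughly what template tools automate on a larger scale. This is only a sketch: the function name `create_skeleton` and the constants are made up for this example, and the demonstration writes to a temporary directory rather than to a real project.

```python
from pathlib import Path
import tempfile

DIRECTORIES = ["data", "docs", "experiments", "package"]
FILES = ["setup.py", "README.md"]


def create_skeleton(root: Path) -> None:
    """Create the lesson's directories and empty files under root."""
    for name in DIRECTORIES:
        # exist_ok=True makes the function safe to re-run on an existing project.
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates an empty file, like the shell command of the same name.
        (root / name).touch()


# Demonstrate in a temporary directory so nothing is written to the current folder.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    create_skeleton(project)
    created = sorted(p.name for p in project.iterdir())
```

A script like this can be useful when you set up many similar projects, but for a single project the shell commands are just as good.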

Exercise

Create each of the following files in the correct location in the project by replacing the __ on each line.

touch __/raw_data.csv # raw data for processing
touch __/generate_figures.py # functions to create figures for presentation/publication
touch __/new_technique.py # contains the novel method at the core of your publication
touch __/reproduce_paper.py # code to re-run the analyses reported in your methods paper about the package
touch __/helper_functions.py # auxiliary functions for routine tasks associated with the novel method
touch __/how_to_setup.md # details to help others prepare equivalent experiments to those presented in your paper

Solution

touch data/raw_data.csv
touch experiments/generate_figures.py
touch package/new_technique.py
touch experiments/reproduce_paper.py
touch package/helper_functions.py
touch docs/how_to_setup.md

Where to store results?

A question that we may ask ourselves is where to store our results, that is, our final, processed data. This is debatable, and depends on the characteristics of our project. If we have many intermediate files between our raw data and final results, it may be worth creating a results/ directory. If we only have a couple of intermediate files, we could simply store our results in the data/ directory. If we generate many figures from our data, we could create a directory called figures/ or img/.
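
Whichever layout we choose, it helps if analysis code builds its output paths in one place. The sketch below assumes a hypothetical helper, `output_path`, and made-up file names; it creates results/ or figures/ on demand so scripts do not fail when the directory is missing.

```python
from pathlib import Path
import tempfile


def output_path(project_root: Path, kind: str, file_name: str) -> Path:
    """Return project_root/<kind>/<file_name>, creating the directory if needed."""
    # `kind` is the output directory, for example "results" or "figures".
    directory = project_root / kind
    directory.mkdir(parents=True, exist_ok=True)
    return directory / file_name


# Demonstrate in a temporary directory standing in for the project root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    table = output_path(root, "results", "summary.csv")
    table.write_text("country,mean\n")
    figure = output_path(root, "figures", "trend.png")
    relative = [p.relative_to(root).as_posix() for p in (table, figure)]
    results_exists = (root / "results" / "summary.csv").exists()
```

If the layout changes later, for example from results/ to data/, only the calls to the helper need to be updated.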

Configuration files and .gitignore

The root of the project, i.e. the project folder containing all of our subdirectories, may also contain configuration files from various applications. Some types of configuration files include:

The .gitignore is particularly useful if we have large data files, and don’t want them to be attached to our package distribution. To address this, we should include instructions or scripts to download the data. Results should also be ignored. For our example project, we could create a .gitignore file:

touch .gitignore

And add the following content to it:

data/*.csv

This would ignore the data/raw_data.csv file, preventing it from being added to version control. This can be very important depending on the size of our data! We don’t want users to have to download very large files along with our repository, and large files may even cause problems with hosting services. If a results/ subdirectory was created, we should also add it to the .gitignore file.
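
Since the ignored data is no longer part of the repository, a small download script can take its place. The following is a minimal sketch, not a definitive implementation: `DATA_URL` is a placeholder address and `fetch_data` is a name invented for this example. It uses `urlretrieve` from Python's standard library.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical location of the raw data; replace with the real address.
DATA_URL = "https://example.org/raw_data.csv"


def fetch_data(url: str, destination: Path, download=urlretrieve) -> bool:
    """Download url to destination unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if destination.exists():
        return False
    # Create data/ (or any other parent directory) if it is missing.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # urlretrieve(url, filename) saves the response body to filename.
    download(url, str(destination))
    return True
```

A collaborator would then call `fetch_data(DATA_URL, Path("data") / "raw_data.csv")` once after cloning. The `download` parameter exists only so the function can be tried out without network access.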

Exercise

Label each of the following excerpts with where it belongs in the project.

excerpt 1

Getting Started
----------------

to install

excerpt 2

for data_file in file_list:
  proc_data = pkg.preprocess(data_file)
  proc_data.to_csv(data_file[:-4] + '_proc.csv')
  pkg.new_method(proc_data)

excerpt 3

df = pd.read_csv(data_file)
df.head()
df.describe()

excerpt 4

This technique involves the best new analysis technique ever
the background to understand the technique is these three things

Open Source Basics, MWE

Open source guidelines are generally written to be ready to scale. Here we propose the basics to get your project live and usable, as opposed to the things that will help if it grows and builds a community.

README

A README file is the first information about your project most people will see. It should encourage people to start using it and cover key steps in that process. It includes key information, such as:

If you are not sure of what to put in your README, these bullet points are a good starting point. There are many resources on how to write good README files, such as Awesome README.

Exercise

Choose 2 README files from the Awesome README gallery examples or from projects that you regularly use and discuss with a group:

  • What are common sections?
  • What is the purpose of the file?
  • What useful information does it contain?

Licenses

As a creative work, software is subject to copyright. When code is published without a license describing the terms under which it can be used by others, all of the author’s rights are reserved by default. This means that no one else is allowed to copy, re-use, or adapt the software without the express permission of the author. Such cases are surprisingly common but, if you want your methods to be useful to, and used by, other people, you should make sure to include a license to tell them how you want them to do this.

Choosing a license for your software can be intimidating and confusing, and you should make sure you feel well-informed before you do so. This lesson and the paper linked from it provide more information about why licenses are important, which are in common use for research software, and what you might consider when choosing one for your own project. Choosealicense.com is another helpful tool to guide you through this process.

Exercise

Using the resources linked above, compare the terms of the following licenses:

What do you think are the benefits and drawbacks of each with regard to research software?

Discuss with a partner before sharing your thoughts with the rest of the group.

Open Source, Next Steps

Other common components are

Even more advanced for building a community

For training and mentoring see Mozilla Open Leaders. For reading, check out the curriculum.

Re-organizing a project

Practice working on projects

To practice organising a project, download the Gapminder example project and spend a few minutes organising it: create directories, move and rename files, and even edit the code if needed. Be sure to have a glance at the content of each file. To download it, either use this link to download a ZIP file or clone the repository.

Solution

A possible way to organise the project is in the project’s tidy branch. You can open the tidy branch on GitHub by clicking on this link. Discuss: what is different from the version you downloaded?

  • How are directories organised?
  • Where is each file located?
  • Which directory represents a Python package?

Key Points

  • Data and code should be governed by different principles

  • A package enables a project to be installed

  • An environment allows different people to all have the same versions and run software more reliably

  • Documentation is an essential component of any complete project and should exist with the code