Project directory structure#
Creating a general projects directory#
Now that we know where to create our project directory, it is time to create the directory itself.
In this tutorial, we assume that you create the project directory in your home directory (/home/<user>
or /Users/<user>, where <user> is your user name).
To keep the top of your home directory clean, it is best to create a main directory for all projects.
This is a bit subjective, but if you do that, it clearly separates your science project files from
downloads, pictures, etc. We create the directory Projects for this purpose, but of course you can
give it a different name if you like.
On Linux and Mac, you can create the folder in the terminal like this:
user@terminal:~> mkdir ~/Projects
This will be the place for all your future projects.
The created directory structure#
The NWO cookiecutter creates the following directory structure:
├── README.md <- The top-level README.md file for information about the project.
├── data
│ ├── external <- Data from third party sources.
│ ├── intermediate <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data set.
│ └── raw <- The original, immutable data dump.
│
├── references <- Related papers, manuals, and all other explanatory materials.
│ ├── manuals <- Directory for manuals.
│ ├── papers <- Directory for related papers.
│ └── other <- Other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ ├── presentations <- Presentations given about the project.
│ ├── meetings <- Updates presented in group meetings, etc.
│ └── figures <- Generated graphics and figures to be used in reporting.
│
├── .env <- A file to contain all the environment variables
│ and software sources to run the project.
├── Load-Env.ps1 <- A script to load the environment in Windows Powershell.
│
├── src <- Source code for use in this project (* Git).
│ │
│ ├── docs <- Documentation for the source code.
│ │
│ ├── notebooks <- Jupyter notebooks.
│ │
│ ├── <repo_name> <- Python module for project.
│ │ │
│ │ ├── __init__.py <- Make this directory a python module.
│ │ ├── scripts <- Directory to add python scripts.
│ │ └── functions.py <- Add your python functions here.
│ │
│ └── pyproject.toml <- Python project definition file used to build a python package.
│
└── paper <- Directory to write your paper (* Git/Overleaf).
Data#
Now that the directory structure is created, we can discuss the contents. In the data directory you can store your data. These data can be in various stages of analysis. There can be raw data from an instrument or data from an external source, from which you derive intermediate data and end results.
It is usually a good idea to separate the original data (raw or external) from the processed data. You usually do not want to alter the raw or external files in any way, so saving them in separate directories avoids accidentally changing these files.
Subdirectories#
The data directory contains the following subdirectories:
external: To store external data,
raw: To store raw data from the instrument,
intermediate: To store intermediate data products while processing,
processed: To store the end results of the processing.
Link to large file storage#
If your datasets are large, it is best to store them on a disk with high
capacity. You can still include them in this directory structure by creating
a symbolic link. It is best to replace the current directory. For example,
the external directory needs to be removed before we can make a link to
an existing directory elsewhere. Be careful with the rm command!
First check whether the external directory is empty:
user@linux:~/Projects/hubble/data> ls external/
If the directory does not contain anything that you want to keep, then remove it:
user@linux:~/Projects/hubble/data> rm -rf external/
If we have the external data stored in /path/to/storage/external, then
we can create the new link with the command:
user@linux:~/Projects/hubble/data> ln -s /path/to/storage/external external
The data will still be available at data/external, but the files will be
physically stored in /path/to/storage/external.
The .env file#
Since you probably use a number of external software packages in this
project, it may be a good idea to use .env as a source file to
set all the software packages and environment variables that you need.
In the file, you can source software packages to make them available in your environment. In addition, you can set project specific variables. For example, you can set a variable containing the absolute path to your project directory. You can then read this variable in your Python programs and use it to find your files relative to this path. This way, you can move your directory to another system and all you need to change is the path contained in the environment variable.
For an example, see PROJECT_DATA_DIR set in .env. In
src/<repo_name>/functions.py you can find the get_data_dir function
which reads and checks the path from PROJECT_DATA_DIR. This
way, you can use the path throughout your package.
Please make sure that your .env file does not end up in
a git repository, or at least make sure that the environment
does not contain sensitive information.
EXERCISE
Uncomment the PROJECT_DIR and PROJECT_DATA_DIR environment
variables in .env and change the directory path to your hubble
project location.
Once you have saved the .env file, you can load the variables into
your environment using the source command (on Linux and Mac):
(base) user@terminal:~/Projects/hubble> source .env
In the Windows Powershell you can load the environment variables using
the dotenv module. You need to install this first (one time):
> Install-Module dotenv -Scope CurrentUser
Then run the load script .\Load-Env.ps1 to load the environment variables
into the session.
References#
You can use this directory to save any explanatory materials related to your project. These can be papers, manuals, and other documents that are useful to have around.
Reports#
The reports directory can be used to store all kinds of (intermediate) reports about the project. From figures that you presented at group meetings to presentations given at conferences.
Tip for the meeting directory: If you create a subdirectory for each
meeting with the date in YYYY-MM-DD format (the ISO-8601 standard),
they will be nicely ordered by time in your directory listing.
Source directory (src)#
In this directory, you can put your Python source code. It is usually very helpful to create functions or classes for operations that you use often.
It is very useful to keep this directory under git version control. This will be discussed in the next section.
The directory is set up in a way such that you can easily
create a Python package. It contains the docs and
notebooks directories to document the code and provide
examples as jupyter notebooks.
There is a lot to say about structuring a Python project. If you want to learn more about the dos and don’ts of structuring Python, you can read this part of the Hitchhiker’s guide to Python, for example.
Docs#
You can use this directory to create source code documentation.
A way to do that is to use the python sphinx package.
If you are interested, then look at the Sphinx Quickstart guide.
Notebooks#
In this directory, you can store jupyter notebooks that you use to explore the data and visualize it.
Pyproject.toml#
The pyproject.toml file describes the properties of your python
package, what packages it depends on and how it is installed. You
can find more information about this file here:
Write your own pyproject.toml.
Paper#
In the paper directory, you can write your (LaTeX) paper. Many people use Overleaf these days and it is possible to use git version control to sync your paper files with Overleaf.
Please see the Git integration manual of Overleaf to get your Overleaf project into this directory.
README#
EXERCISE
Rename the standard README.md file in the top directory and download the template README.md file from this Github page. This will become our README file for the reproduction package at the end.
File formats#
When you store your data or text in files, it is very important that you do this in open file formats. There are many ways to encode data in files, and for the computer to read them, you usually need a special library or module. For the file format you choose, it is best that that format has publicly available read and write software. Large tables with numbers are usually most efficiently stored in binary files, like FITS, or HDF5. Text that needs to be human readable, like READMEs or source code are stored in ASCII format. In all cases, make sure that the files you create are easy to read by others.
Naming files#
Next to a good directory structure, giving your files consistent names is also very helpful. Checkout the checklist below from The Turing way handbook:
Name your files consistently
Keep it short but descriptive
Avoid special characters or spaces to keep it machine-compatible
Use capitals, dashes, or underscores to keep it human-readable
Use consistent date formatting, for example ISO 8601: YYYY-MM-DD to maintain default order
Include a version number when applicable
Share/establish a naming convention when working with collaborators
Record a naming convention in your data management plan (if you have one)