Code Incomplete

VERSION: 53126eb-dirty

1 Introduction

This is a collection of my passive-aggressive notes on software development. It primarily concerns C++, Linux, and robotic applications, in other words, embedded / headless systems consisting of a large number of heterogeneous software components, running with minimal human intervention.

Sections of the document are ordered by decreasing level of abstraction; feel free to skip those that seem non-substantial.

The markdown source of this document is available at https://github.com/asherikov/codeincomplete, an HTML version – at http://www.sherikov.net/codeincomplete/.

2 First principles

  1. Human friendliness: minimize everyone’s mental work.

    • The less you need to think the more you can do.
  2. Consistency: make your choices and stick to them.

    • Ensures (1).
  3. Automation: any action that is performed more than once must be automated.

    • Ensures (1) and (2).

3 Organization

3.1 Process vs progress

3.2 Let it go

The field is quite dynamic and, although the changes are not always for the best, you have to adapt and be ready to abandon your past achievements if necessary and move on.

3.3 Reuse vs reinvent

Professional curiosity and ambition, as well as laziness, often push developers to reinvent existing tools and methodologies rather than study the alternatives and make an informed, pragmatic choice. Maintaining your own code is often more expensive than working around the quirks of 3rd party software.

3.4 What and why before how

When a new development is being discussed it is common to lose focus and drift toward discussion of implementation rather than the purpose. Even experienced developers tend to be carried away. For this reason, it is advisable to explicitly separate discussions of what must be done, and how it must be done. The first must result in a requirements document that would help to ensure that development is moving in the right direction and to track progress.

3.5 Adequate processes

It feels that many teams fail to match the complexity of their processes to the team size and project scale. Five-person teams spend a third of their time writing KPIs and doing sprint reviews, while fifty-person teams do not bother to track their tasks in Jira. Both extremes hurt performance, so it is crucial to find a reasonable balance and to evolve the processes as the team and the projects change.

3.6 Communication

Work-related communication should happen through channels that facilitate tracing and sharing of information and decisions: wiki, email, issue tracker, source code management systems, etc. It is a crime to ignore your notifications and expect everyone to reach you in person to resolve issues.

4 Design

4.1 Architectural patterns

4.1.1 Transformation API symmetry

If you transform data to a different representation, e.g., by writing it from memory to a file, you or somebody else is almost always going to need the reverse transformation. Take this into account when designing a new API and don’t neglect the implementation of reverse operations: it helps to find bugs, facilitates testing, and makes your code and data more coherent. Also, assuming that users must manually compose input files for your code is a dick move.
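A minimal sketch of such a symmetric pair, assuming a made-up Waypoint type and a plain-text format (neither comes from a real project): the writer and the reader live side by side, so a round-trip check exercises both in one go.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical data type used only for illustration.
struct Waypoint
{
    int id;
    double x;
    double y;
};

// Forward transformation: memory -> text.
std::string toText(const Waypoint &wp)
{
    std::ostringstream out;
    out.precision(17);  // 17 significant digits restore a double exactly
    out << wp.id << ' ' << wp.x << ' ' << wp.y;
    return out.str();
}

// Reverse transformation: text -> memory; returns false on malformed input.
bool fromText(const std::string &text, Waypoint &wp)
{
    std::istringstream in(text);
    return static_cast<bool>(in >> wp.id >> wp.x >> wp.y);
}

// Round-trip check: the cheapest test you get for free from the symmetry.
bool roundTripOk(const Waypoint &wp)
{
    Waypoint restored{};
    return fromText(toText(wp), restored) && restored.id == wp.id
           && restored.x == wp.x && restored.y == wp.y;
}
```

With both directions available, test inputs can be generated by the code itself instead of being composed by hand.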

Examples:

4.1.2 Configuration-driven development

4.1.3 Process isolation

Isolation between processes is way better than between threads. Use this to your advantage by isolating 3rd-party, potentially unreliable, or non-critical components of your software system in standalone processes. For the same reason, I recommend using ROS nodelets only when absolutely necessary: they are just a workaround for poor IPC.
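To illustrate the point, here is a rough sketch for Linux / POSIX that runs a task in a forked child and reports how it ended. The names runIsolated and IsolatedResult are made up for this example, and the sketch ignores details such as EINTR handling and timeouts.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Outcome of a task executed in a child process (hypothetical type).
enum class IsolatedResult
{
    SUCCESS,      // child exited with status 0
    FAILURE,      // child exited with a non-zero status
    CRASH,        // child was terminated by a signal (SIGSEGV, SIGABRT, ...)
    LAUNCH_ERROR  // fork() or waitpid() failed
};

// Runs `task` in a forked child so that its crash cannot take down the caller.
IsolatedResult runIsolated(const std::function<int()> &task)
{
    const pid_t pid = fork();
    if (pid < 0)
    {
        return IsolatedResult::LAUNCH_ERROR;
    }
    if (0 == pid)
    {
        // Child: _exit() skips atexit handlers inherited from the parent.
        _exit(task());
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
    {
        return IsolatedResult::LAUNCH_ERROR;
    }
    if (WIFSIGNALED(status))
    {
        return IsolatedResult::CRASH;
    }
    return (WIFEXITED(status) && 0 == WEXITSTATUS(status)) ? IsolatedResult::SUCCESS
                                                           : IsolatedResult::FAILURE;
}
```

A thread that calls std::abort() kills the whole process; here the parent merely observes IsolatedResult::CRASH and can decide what to do next.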

4.1.4 Let it crash

The “let it crash” approach to failure handling comes from Erlang – a language designed for telecommunication applications https://en.wikipedia.org/wiki/Erlang_(programming_language)#%22Let_it_crash%22_coding_style

You have to accept that your programs are going to crash, which means that:

4.1.5 Let it vanish

This is also an old and ubiquitous design pattern, but I don’t recall a specific term for it so I named it to relate to “let it crash”. The main idea is that you have to design your software to lose data when appropriate:
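One common form of this idea, sketched under the assumption that fresh samples matter more than stale ones, is a bounded queue that deliberately discards the oldest entry when it is full. DroppingQueue is a hypothetical name, not an existing library class.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Keeps at most `capacity` most recent samples; older ones are dropped on purpose.
template <class t_Sample>
class DroppingQueue
{
public:
    explicit DroppingQueue(const std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity > 0);
    }

    // Returns true if an old sample had to be discarded to make room.
    bool push(const t_Sample &sample)
    {
        bool dropped = false;
        if (samples_.size() == capacity_)
        {
            samples_.pop_front();
            ++dropped_count_;
            dropped = true;
        }
        samples_.push_back(sample);
        return dropped;
    }

    // Returns false if the queue is empty.
    bool pop(t_Sample &sample)
    {
        if (samples_.empty())
        {
            return false;
        }
        sample = samples_.front();
        samples_.pop_front();
        return true;
    }

    // Losses are intentional, but they should still be observable.
    std::size_t droppedCount() const
    {
        return dropped_count_;
    }

private:
    std::size_t capacity_;
    std::size_t dropped_count_ = 0;
    std::deque<t_Sample> samples_;
};
```

Memory use stays bounded no matter how slow the consumer is, and the drop counter makes the loss visible in monitoring instead of hiding it.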

4.1.6 I/O isolation

I/O is expensive, especially when it is performed via interactive terminals. It is a common practice to localize I/O in separate threads to avoid interference with time critical operations. For a similar reason, you should generally delay transferring of debug and logging information until the end of time critical methods, such as control loops.
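A minimal sketch of delayed logging, assuming a hypothetical DeferredLog class with a fixed capacity of 64 entries: the time-critical code only stores numbers, while formatting and output happen later.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Fixed-capacity log: record() only stores numbers, flush() performs the I/O.
class DeferredLog
{
public:
    // Cheap enough for a control loop: no allocation, no formatting, no I/O.
    void record(const int code, const double value)
    {
        if (size_ < entries_.size())
        {
            entries_[size_] = Entry{code, value};
            ++size_;
        }
        // When the buffer is full, the entry is silently dropped.
    }

    // Call after the time-critical part of the cycle is done.
    void flush(std::ostream &out)
    {
        for (std::size_t i = 0; i < size_; ++i)
        {
            out << entries_[i].code << ": " << entries_[i].value << '\n';
        }
        size_ = 0;
    }

private:
    struct Entry
    {
        int code;
        double value;
    };

    std::array<Entry, 64> entries_{};
    std::size_t size_ = 0;
};

// Usage sketch: nothing is written while the "control loop" part runs.
std::string runCycle()
{
    DeferredLog log;
    std::ostringstream sink;  // stands in for a file or a terminal
    log.record(1, 0.5);       // time-critical part
    log.flush(sink);          // after the deadline
    return sink.str();
}
```

In a real system flush() would more likely hand the entries over to a dedicated I/O thread; the sketch only shows the separation between recording and writing.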

4.1.7 State machines

State machines are often employed for representing robot behaviors, but in my opinion they should be used more for implementation of individual services. Any time you work on a service that changes its behavior in response to some command messages, for example using http://wiki.ros.org/actionlib, it is necessary to consider a finite state machine.
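As a sketch of what this could look like for a goal-driven service, the states and events below are hypothetical and loosely inspired by action servers; they do not reproduce the actionlib state machine.

```cpp
#include <cassert>

// Hypothetical states and events of a goal-driven service.
enum class State { IDLE, ACTIVE, PREEMPTING, FAILED };
enum class Event { GOAL, CANCEL, DONE, FAULT };

// Total transition function: every (state, event) pair has an explicit outcome.
State transition(const State state, const Event event)
{
    if (Event::FAULT == event)
    {
        return State::FAILED;  // a fault is fatal in any state
    }
    switch (state)
    {
        case State::IDLE:
            return (Event::GOAL == event) ? State::ACTIVE : State::IDLE;
        case State::ACTIVE:
            if (Event::CANCEL == event)
            {
                return State::PREEMPTING;
            }
            if (Event::DONE == event)
            {
                return State::IDLE;
            }
            return State::ACTIVE;  // a new goal is ignored while busy
        case State::PREEMPTING:
            return (Event::DONE == event) ? State::IDLE : State::PREEMPTING;
        case State::FAILED:
            return State::FAILED;  // only a restart clears the failure
    }
    return State::FAILED;  // unreachable for valid enum values
}
```

Writing the transitions down as a single pure function forces you to decide what happens for unexpected commands, e.g., a cancel request that arrives when nothing is running.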

4.1.8 Fail early

Early failures usually have lower cost. For example, if you can detect a failure during source code compilation, you save time on deployment and tests; if you can validate drone state before takeoff, you can potentially avoid a crash; and so on.
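Both levels can be sketched in a few lines. The constants, the DroneState fields, and the thresholds below are assumptions made up for illustration, not recommendations for a real vehicle.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical controller settings known at build time.
constexpr double CONTROL_PERIOD_S = 0.002;
constexpr int FILTER_WINDOW = 8;

// Earliest possible failure: an invalid constant does not even compile.
static_assert(CONTROL_PERIOD_S > 0.0, "control period must be positive");
static_assert(FILTER_WINDOW > 0 && (FILTER_WINDOW & (FILTER_WINDOW - 1)) == 0,
              "filter window must be a power of two");

// Hypothetical drone state sampled on the ground.
struct DroneState
{
    double battery_fraction;
    int gps_satellites;
    bool imu_calibrated;
};

// Next line of defence: refuse to take off instead of failing in the air.
// All thresholds are illustrative assumptions.
std::vector<std::string> preflightErrors(const DroneState &state)
{
    std::vector<std::string> errors;
    if (state.battery_fraction < 0.3)
    {
        errors.push_back("battery below 30%");
    }
    if (state.gps_satellites < 6)
    {
        errors.push_back("fewer than 6 GPS satellites");
    }
    if (!state.imu_calibrated)
    {
        errors.push_back("IMU is not calibrated");
    }
    return errors;
}
```

Returning the full list of problems rather than stopping at the first one saves the operator a few trips to the robot.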

4.1.9 Single source shared parameters

Parameters shared by different components of the stack should come from the same source. If they are stored in multiple independent locations, they are inevitably going to get out of sync, which, in turn, leads to failures or poor performance. Extraction and distribution of shared parameters need not happen at runtime; on the contrary, it may be preferable to perform them during startup or build phases in order to comply with the “fail early” principle.

For example, a URDF / SDF model of a robot can be the source of its total mass and geometric dimensions.

4.1.10 Separate logic from interfaces

Separate software logic from interfaces, both human and API:

For example, if your node receives a path, do not generate visual debugging data, such as ROS markers for rviz, in the same node, but rather create an independent node that also subscribes to the path and generates the necessary visualization info.

4.1.11 Separate representation and handling

Separate data containers from the logic that uses or modifies the data, e.g., use separate classes for a message and for a message publisher.

4.2 Project bootstrapping

There are a few things that should be done first when starting a new project.

4.2.1 Infrastructure first

4.2.2 Simulation

Simulation is a crucial component for testing your system. All code should always be validated in simulation before deployment to save time and reduce risks. For this reason simulation must be implemented as soon as possible.

4.2.3 Integration tests

Integration tests are more important at early stages of development: a simple test that starts your system with a simulator and terminates immediately has more value than a single thoroughly unit-tested component. Integration tests focus your attention on the whole system rather than its parts. However, complex integration tests are difficult to maintain and tend to become fragile at later stages of development, at which point you should support them with component-specific tests.

5 Development

5.1 Programming languages

5.1.1 Three elephants

5.2 Version control and reviews

5.3 Handling dependencies

5.4 Build tools

5.4.1 cmake

cmake is probably the most common build tool for C++ projects. It may seem simple to use, but in my experience that impression is wrong. Making a project portable, cross-compilable, usable as a dependency, relocatable after installation, etc. is tricky, especially due to the tendency to half-ass cmake scripts: nobody bothers to study the meaning of parameters and their implications in different scenarios.

5.5 Continuous integration

5.6 Deployment

Deployment can be divided into three stages:

  1. system image generation;
  2. system configuration;
  3. binary package deployment.

At this point docker is perceived as the way to go, but keep in mind that it covers only the third stage and has other drawbacks. Binary packages, however, for example Debian packages, can address the second stage as well using embedded installation scripts.

5.6.1 Test deployments

Development of robotic systems usually requires quick deployments of packages for testing. A common approach to this task in a ROS environment is to perform a plain copy of locally compiled packages to a target machine, e.g., from a workspace installation directory. This method does not facilitate tracking of deployments, which may become a significant issue when the target is shared by many developers. It is, however, even more inconvenient to perform binary package releases for this purpose – the ROS buildfarm system is not designed for this, which is, in my opinion, a direct consequence of treating this task as non-interactive and “pure-CI”. I’ve tried to address it in https://github.com/asherikov/ccws by allowing developers to generate binary packages locally.

5.7 Documentation

5.8 Source management

5.8.1 git

6 Runtime failures

6.1 Out of memory (OOM)

6.2 Segmentation fault

6.3 A note on return status

6.4 Races

In my experience races are quite common and particularly difficult to debug. A service's behavior may depend on the states of multiple other services, hardware components, etc., in which case it is necessary to be extra cautious when responding to changes in these states and to avoid implicit assumptions about the order in which they appear.

7 Algorithms and data structures

7.1 Performance

7.2 Volumetric data

OcTree (https://octomap.github.io/) is commonly used for representation of volumetric data, but it is not always a good solution:

7.3 Rotations / orientations

There are multiple ways to represent rotations. Some people claim that a quaternion is always the right thing to use – they are wrong.

7.3.1 Euler angles

7.3.2 Rotation matrices

7.3.3 Angle-axis

7.3.4 Quaternions

8 Geodesy

8.1 Gravity

Gravity is not constant: it depends on position on the globe and altitude – in some applications this variation is non-negligible.
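To get a feeling for the magnitude, normal gravity can be approximated with the closed-form (Somigliana) formula on the WGS84 ellipsoid plus a first-order free-air correction for height. The sketch below is an approximation under these assumptions: it ignores local anomalies, and the function name is made up.

```cpp
#include <cassert>
#include <cmath>

// Normal gravity [m/s^2] on the WGS84 ellipsoid (Somigliana formula) with a
// first-order free-air correction for the height above the ellipsoid.
double normalGravity(const double latitude_rad, const double height_m)
{
    constexpr double G_EQUATOR = 9.7803253359;  // normal gravity at the equator, m/s^2
    constexpr double K = 0.00193185265241;      // WGS84 gravity formula constant
    constexpr double E2 = 0.00669437999014;     // first eccentricity squared
    constexpr double FREE_AIR = 3.086e-6;       // approximate gradient, (m/s^2) per meter

    const double sin2 = std::sin(latitude_rad) * std::sin(latitude_rad);
    const double on_ellipsoid = G_EQUATOR * (1.0 + K * sin2) / std::sqrt(1.0 - E2 * sin2);

    return on_ellipsoid - FREE_AIR * height_m;
}
```

According to this model gravity changes by about 0.052 m/s^2 (roughly 0.5%) between the equator and the poles, and decreases by about 0.003 m/s^2 per kilometer of altitude.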

8.2 Geodetic coordinates

9 Date, time, and locale

10 Protocols and serialization

11 General style policies

11.1 Naming

11.1.1 Ordering numbers

11.1.2 Filenames

11.1.3 Hierarchies

11.1.4 Versions

11.1.5 Prefix

11.1.6 Other

11.2 Formatting

11.3 Use tools

12 Coding styles

12.1 Follow common styles

12.1.1 C++

12.2 Don’t follow styles literally

Even widely accepted conventions are sometimes inconvenient or dangerous, such as double space indentations or usage of auto. Ignore uncompromising zealots that defend bad practices because “our father did it that way” or “this is the modern style”.

12.3 General rules

12.3.1 Comments

12.3.2 Hungarian notation

I am generally against any conventions that cannot be automatically validated and rely entirely on the developer, including https://en.wikipedia.org/wiki/Hungarian_notation. There are, however, several cases when indicating the kind of variable, type, or function may be useful:

12.4 C++

12.4.1 General rules

12.4.2 Fundamental (built-in) types

12.4.3 Type names

12.4.4 Enumerations

12.4.5 Variables

12.4.6 Functions and methods

12.4.7 Classes

  struct MyClass
  {
      Subscription sub_; // invokes callback() asynchronously
      CallbackData data_;

      void callback()
      {
          // Members are destroyed in reverse order of declaration: data_ is
          // destroyed before sub_, so the callback may get executed when
          // data_ is already destroyed.
          use(data_);
      }
  };
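The destruction order in the snippet above can be verified with a small sketch. Tracer and both owner classes are hypothetical stand-ins for a real subscription and its data; declaring the subscription last is one possible way to make it go away first.

```cpp
#include <cassert>
#include <string>

// Hypothetical member that appends its name to a log when destroyed.
struct Tracer
{
    Tracer(const char *name, std::string &log) : name_(name), log_(log)
    {
    }
    ~Tracer()
    {
        log_ += name_;
    }

    std::string name_;
    std::string &log_;
};

// Same layout as above: the subscription is declared before its data.
struct SubscriptionFirst
{
    explicit SubscriptionFirst(std::string &log) : sub_("sub ", log), data_("data ", log)
    {
    }
    Tracer sub_;
    Tracer data_;
};

// Alternative layout: the subscription is declared last, so it is destroyed first.
struct SubscriptionLast
{
    explicit SubscriptionLast(std::string &log) : data_("data ", log), sub_("sub ", log)
    {
    }
    Tracer data_;
    Tracer sub_;
};

// Creates and destroys an owner, returns the recorded destruction order.
template <class t_Owner>
std::string destructionOrder()
{
    std::string log;
    {
        t_Owner owner(log);
    }
    return log;
}
```

destructionOrder<SubscriptionFirst>() yields "data sub ", i.e., the data is gone while the subscription is still alive, whereas destructionOrder<SubscriptionLast>() yields "sub data ".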

12.4.8 Macro

12.4.9 Namespaces

12.4.10 Templates

12.4.11 Headers

12.4.12 Avoid auto

auto can break semantics of your program without breaking its syntax. Consider the following program using Eigen:

#include <cstdlib>
#include <iostream>
#include <Eigen/Core>

namespace
{
// Explicit type: Random() is evaluated once and stored in m.
Eigen::Matrix3d getRandomSymmetricMatrix()
{
    Eigen::Matrix3d m = Eigen::Matrix3d::Random();

    return (m.transpose() * m);
}

// Deduced type: m is a generator expression, not a matrix.
Eigen::Matrix3d getRandomSymmetricMatrix2()
{
    auto m = Eigen::Matrix3d::Random();

    return (m.transpose() * m);
}
}

int main()
{
    std::cout << getRandomSymmetricMatrix() << std::endl << std::endl;
    std::cout << getRandomSymmetricMatrix2() << std::endl;

    return (EXIT_SUCCESS);
}

Example output:

  1.22374 -0.234887  0.579107
-0.234887  0.994478 -0.659169
 0.579107 -0.659169   1.07338

-0.482262  0.839256 0.0753774
 -1.02165 -0.202315  0.942273
  0.15891  -0.84441  0.247298

Note that the output of the second function is wrong: it is not symmetric. The reason is that Random() returns a random matrix generator rather than a matrix, so the second function returns a product of two different matrices generated on the spot. Such errors cannot be detected by the compiler or sanitizers, since the code is 100% correct. They are also difficult to pick up by reading the code – you have to know how the Eigen API works and what to look for. Even though the issue is well known and documented https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3 developers following the ‘modern’ style fall for it over and over, e.g.

Another possible side effect of using auto with Eigen is performance degradation: in auto mat3 = mat2 * mat1, mat3 is an expression rather than a result of multiplication – it is going to be reevaluated every time it is used in the code.
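The effect is not specific to Eigen. A minimal sketch that relies only on the standard library, where auto deduces a proxy type for an element of std::vector<bool> (the function name is made up for this example):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns {value seen through `auto`, value seen through `bool`}.
std::pair<bool, bool> readAfterWrite()
{
    std::vector<bool> flags(1, false);

    auto deduced = flags[0];  // std::vector<bool>::reference, a proxy into the vector
    bool spelled = flags[0];  // an independent copy of the value

    flags[0] = true;  // modify the container afterwards

    return {static_cast<bool>(deduced), spelled};
}
```

The two variables are initialized by the same expression, yet the function returns {true, false}: the deduced one silently tracks later changes of the container.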

Sometimes auto is encouraged to avoid typing long typenames, e.g., std::map<std::string, std::pair<int, std::string>>::const_iterator. This is a wrong solution to this problem:
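One alternative that keeps the type visible is a type alias. The sketch below assumes a hypothetical TaskRegistry container and lookup function; it shows a possible approach rather than the only one.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical container: name -> (priority, description).
using TaskRegistry = std::map<std::string, std::pair<int, std::string>>;

// The alias is short, yet the reader still sees which container and which
// kind of iterator are involved, and the compiler enforces both.
int findPriority(const TaskRegistry &registry, const std::string &name, const int fallback)
{
    const TaskRegistry::const_iterator it = registry.find(name);

    return (registry.end() == it) ? fallback : it->second.first;
}
```

The alias also gives the container a name from the problem domain, which auto cannot do.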

This brings us to another important point: a type is documentation that is automatically verified and enforced by the compiler. Type omission makes the code more difficult to comprehend, e.g., consider an example from http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines

auto hello = "Hello!"s; // a std::string
auto world = "world"; // a C-style string

C++ is a language where the difference between std::string and C-style string can be important, so we should avoid obscuring such details.
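One place where the difference shows up is embedded null characters. A small sketch with made-up function names, assuming C++14 or later for the s literal suffix:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by std::string: the embedded '\0' is an ordinary character.
std::size_t lengthAsStdString()
{
    using namespace std::string_literals;

    const std::string text = "ab\0cd"s;

    return text.size();
}

// Length as seen by C string functions: everything after '\0' is ignored.
std::size_t lengthAsCString()
{
    const char *text = "ab\0cd";

    return std::strlen(text);
}
```

The first function returns 5 and the second returns 2, although the literals differ by a single character that is easy to miss when the types are not spelled out.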

12.4.13 Performance

12.4.14 Patterns

13 Package conventions

13.1 Naming

13.2 Layout

13.3 Installation paths

14 Networking

14.1 TCP

Sometimes TCP may seem to be a good choice for critical real-time data transfer since it doesn’t lose data. However, a better design choice is to embrace possible data loss and build a system that can tolerate it. Moreover, there are important implementation details in TCP that need to be kept in mind:

15 Fixing robot models

15.1 Inertia

Ironically, some commercial CAD systems incorrectly export inertia matrices of rigid bodies to URDF, so it is a good idea to verify them. One way to achieve this is to perform eigendecomposition of the matrix https://en.wikipedia.org/wiki/Moment_of_inertia#Principal_axes to obtain principal axes and moments of inertia, which can be used to specify the orientation and extent of a rectangular cuboid. The obtained cuboid should roughly match the visual representation of the rigid body.
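The last step can be sketched as follows, assuming that the principal moments have already been obtained with an eigensolver of your choice and that the body is approximated by a solid cuboid of uniform density. The function name is made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Computes edge lengths of a solid uniform cuboid with the given mass and
// principal moments of inertia. Returns false if the input is not physically
// plausible: non-positive mass or moments violating the triangle inequality.
bool cuboidFromPrincipalMoments(const double mass,
                                const std::array<double, 3> &moments,
                                std::array<double, 3> &edges)
{
    if (!(mass > 0.0))
    {
        return false;
    }
    for (std::size_t i = 0; i < 3; ++i)
    {
        const std::size_t j = (i + 1) % 3;
        const std::size_t k = (i + 2) % 3;

        // I_i = m / 12 * (e_j^2 + e_k^2)  =>  e_i^2 = 6 / m * (I_j + I_k - I_i)
        const double squared_edge = 6.0 / mass * (moments[j] + moments[k] - moments[i]);
        if (squared_edge < 0.0)
        {
            return false;
        }
        edges[i] = std::sqrt(squared_edge);
    }
    return true;
}
```

For example, a mass of 12 with principal moments (13, 10, 5) corresponds to a 1 x 2 x 3 cuboid. A failed check is a strong hint that the exported matrix is broken, since the principal moments of any real rigid body satisfy the triangle inequality.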

16 Other

16.1 ROS

16.2 Telemetry