If you’re not careful, entropy creeps its way into your code base. You take a shortcut or code something in a way that you know is sloppy and you say to yourself “I’ll come back to that later”, but later you’re faced with new feature requests or some other Imminent Disaster(tm) and you don’t go back and clean up the mess you made earlier. This is only natural in a code base and when the messes are few and far between, it is tolerable.
However, at some point the messes accumulate and you need to do something about it because the mess is preventing you from making forward progress in a reasonable manner. Entire books have been written about this situation such as “Working Effectively With Legacy Code” by Michael Feathers (see my review), “Refactoring: Improving the Design of Existing Code” by Martin Fowler and “Refactoring to Patterns” by Joshua Kerievsky. One could even consider “Domain-Driven Design: Tackling Complexity in the Heart of Software” by Eric Evans to be a book full of advice about how to avoid the accumulation of entropy in your code base. Still, sometimes you find yourself with too much entropy in your code base and it’s time to do something about it.
In this post, we’ll take a look at an open source project with a code base that is over 30 years old and has accumulated some “cruft” along the way. We’ll discuss various strategies for coping with the cruft and how to get rid of it in as safe a manner as possible. After all, we don’t want to introduce bugs while we clean up cruft.
Note: Links to the case study source repository are shown with an asterisk (*).
Our case study is the open source fractal generator Iterated Dynamics*, a fork of FRACTINT. This is a 30 year old code base that started as a 16-bit MS-DOS program written in C and 16-bit x86 assembly. The code base has contributions from many, many authors over the years. As a result, source formatting and identifier conventions are varied due to the large number of authors. Source code organization has been tortured by the constraints of early MS-DOS programs operating in a limited amount of physical memory and the use of overlays. Conditional compilation was sprinkled throughout to enable various debugging facilities or experiments in alternative implementations. As C did not get inline functions until C99, 11 years after this code base started, there are some places that heavily use preprocessor macros. Some of the functions are exceedingly long*. The program uses an inordinately large number of global variables*.
The code exhibits almost every code smell that you could have, except for those that require object-oriented programming. The code uses global headers for data* and function* declarations, exacerbating compile times.
A common analogy for increasing entropy in your code is to think of it as technical debt. Just as in life, a little debt is not necessarily a bad thing. Few of us can afford to buy a house in cash from our bank account, so if we can afford mortgage payments then taking out a mortgage in order to have the benefits of home ownership is a good thing. However, if most our disposable income is going to a mortgage payment, a car payment, utility bills, food and other consumables and servicing of outstanding credit card balances, we could be in for a tough time. The same is true with the technical debt that’s been accumulating in your code.
Prioritize Your Work
Just as with multiple outstanding monetary loans, it helps to prioritize the outstanding technical debt that you have in your code base. If you’re going to be working in this code base often, take care of the daily annoyances first that are going to be bothering you every time you work in the code. Sometimes these are just small things like formatting or filenames or whatnot, but they irritate you every time you have to look at or modify this code. You may not think of these items as having a high interest rate on their debt, but they compound every time you go to work in the code and they keep you grumpy, so just take care of them if they are easy to fix.
Sometimes you’ll encounter something that is irritating, but you realize that changing it would be a pervasive modification through a large number of source files, classes or functions. In this case, you might want to “let sleeping dogs lie” or “don’t wake the dragon!”. If you don’t need to make a big change now, particularly if you have no automated test suite to cover your changes, defer it to later.
Work Carefully to Ensure Always Forward Progress
Older, cruftier code bases are often the same code bases that have few, if any, automated tests. That means we’ll need to be particularly careful as we change the code. In “Working Effectively With Legacy Code”, Michael Feathers presents a catalog of techniques you can use to make your code more testable and drive changes with test-driven development. All of that is good advice and I recommend making functional changes in this manner. However, there are some other things we can do to work carefully on our legacy code base.
First, the code should be in some sort of version control system. If it isn’t already in version control (or perhaps it’s in another version control system that is different from what your team is currently using), then that’s your first order of business. Secondly, once under version control, we can use the version control system to our advantage to help us work carefully. Remember,
that commits to your version control system are free. There’s no reason to prefer gigantic commits over a set of small, focused commits.
When we’re making many changes to a legacy code base, it helps to make them in small steps. This includes things like running a formatting tool over source files. You’d be surprised how many times a formatting tool gets things wrong and introduces breaking changes into your code. It’s helpful to apply such changes to subdirectories in a large project so that when reviewing the diffs they’re not ridiculously long.
When you’ve got a set of changes that you want to apply to your legacy code base, proceed through them one at a time. Use isolated commits for each different kind of change that you are making. You want to expand tabs into spaces and eliminate all trailing white space? Use two separate commits: one for the tab expansion and another for trailing white space elimination. If at some point later you discover an error as a result of these changes (say there was an embedded TAB inside a raw string literal that needed to be preserved), it’s going to be easier to back out an isolated change or review its diff than to figure out what went wrong in a composite mega commit. Remember, commits are free.
You may not be sure exactly how to achieve the kind of change you are after, but you still want to be able to lock-in positive progress by comitting partial work. Use a branch to isolate your work so that you can experiment and commit forward progress without impacting any of your teammates. Once you’ve got it figured out in your branch, you can merge back down into the working branch to bring the change to the rest of the team. If you’re working in a branch, remember to integrate positive forward progress often so that you don’t suffer from difficult merges because the main development has diverged significantly from your branch.
Remember to stay focused on the task at hand; if you encounter other changes that should be made, add them to your To Do List and come back to them later. Only work on one kind of change at a time. This will give you smaller, focused commits that lock in forward progress and you avoid becoming “Refactoring Cat”.
If you have existing automated tests, great! Use them frequently, even for changes that “don’t matter” to make sure that you’re making forward progress in getting rid of cruft and not accidentally introducing bugs. If you don’t have any automated tests, consider creating automated regression tests before making risky changes. A tool like FitNesse can assist you in creating automated regression tests to cover large chunks of your application’s functionality. When you are making changes to smaller chunks of code, in addition to the techniques described by Michael Feathers you may want to consider using property-based testing, or fuzz testing using libFuzzer or OSS-Fuzz for open source projects.
A legacy code base often has legacy infrastructure around it. The software development community has advanced tremendously in the past couple of decades with respect to infrastructure. Cloud services and devops might not even have existed as concepts when your code base was started. Infrastructure includes everything needed to build, package, and deploy your software as a product.
If you’ve got old crufty code, chances are you also have an old, crufty build. Maybe it uses a build style that no one on your team understands or if you have cross-platform code you have one build script per platform. Worse still, using old build scripts can tie you to old versions of your compilers and keep you from benefitting from C++11, C++14 and C++17 features. Now would be a good time to switch your build to CMake so that you can have portable build scripts that aren’t coupled to specific IDEs.
When you switch to CMake, you’ll want to use CMake 3.0 or later to take advantage of modern CMake facilities. Speaking of which, you’ll want to adhere to modern CMake best practices when creating your new build. You can create your new CMake build along side your existing build, migrating components one at a time in small commits as you go. You’ll want to keep your build cross-platform as you go by ensuring that each component builds properly on all platforms you support. This will keep a working system so that you don’t have to backfill platform specific requirements across many components at the end.
Eventually, you can replace your custom build scripts with a
CMakeLists.txt* that generates project files for IDEs in a version-independent manner. Depending on the size and complexity of your build, this transition could take some time. For large projects, you may wish to opt for a hybrid approach where some components are built with the legacy build system and some are built with CMake. This can be useful to ensure that your CMake based build is building the components correctly. If the components are built to the same specification, then it should be possible to mix and match components from the two build systems and integrate the system properly. Here is another case where application level acceptance or regression tests will give you confidence in the changes you are making.
If you upgrade your compiler, you’ll be able to take advantage of better optimizations provided by newer compilers. You’ll also make available to yourself newer language features that can simplify your code base. For instance, it is much easier to use standard library algorithms that take functors or predicates if you can write C++11 lambda expressions. Using C++14 and C++17 extends lambda expressions and
constexpr in ways that can simplify your code even further. Consider upgrading your compiler to the latest version you can for all the platforms you have to support. Consider C++14 a minimum as of 2018.
Many times the compiler you’re using is dictated by the IDE your legacy code base is forcing you to use. You may have ABI or other compatibility issues that prevent you from using more modern environments without introducing a breaking change. Only you can decide the pros and cons of such a trade-off, but I would recommend erring in the direction of moving forward instead of clinging to the past. The longer you defer modernizing your toolchain, the harder it will be to continue supporting the old code.
If you’ve migrated your build to modern CMake, then switching between different versions of the IDE or compiler becomes much simpler because CMake generates IDE project files and build scripts that invoke the compiler. Your build description has taken one step back from the specifics of a particular IDE release or compiler and has become that much more generic allowing you to target different releases.
FRACTINT suffered from such an antiquated build toolchain, Microsoft Visual C 1.52, that it was difficult (or expensive) to obtain the necessary tools for building and debugging on a modern machine. FRACTINT still hasn’t migrated to a Win32 program and current users struggle with proper configurations of DOSBox in order to get the application to run properly. Clearly, clinging to 16-bit DOS executables is a drag here and the DOS compatibility is more of a bug than a feature. Iterated Dynamics began as work in the FRACTINT tree to port the code base to Win32. I wanted to go further and embrace OpenGL and more GPU-oriented approaches to the application, but the existing developers wanted to stick with the existing approach and this resulted in the fork of Iterated Dynamics.
If you aren’t yet using continuous integration, now would be a good time to establish this for your legacy code. If you already have a continuous integration system, consider setting up a build job for your experimental legacy improvement branches to help catch simple mistakes as soon as possible. For open source projects use travis and AppVeyor. For commercial projects, consider using a hosted continuous integration service or install Jenkins. Continuous integration helps lock in forward progress.
Aside from monitoring the health of your builds through continuous integration, you can also gather data about the health of your source code. You can gather metrics relating to complexity, cohesion, coupling and other forms of source code analysis. SonarQube is an open source web application for monitoring the quality of your application. With the open source C++ support plug-in, you can monitor the health of your application using a convenient dashboard for displaying and understanding trends. A similar sort of view can be obtained by using graph plugins with Jenkins, but SonarQube is source code centric allowing you to drill into views of your source from problem reports.
For instance, suppose you have a legacy application that uses deprecated header files for a particular operating system or framework. This can result in an excessive number of warnings as the inclusion of each deprecated header file emits a warning in the compilation output. For a decent sized application, the number of warnings can be large enough that the compiler will stop counting after it reaches some limit, such as 10,000 warnings. In this sort of scenario, how will we know if we are adding new warnings to the build? They’ll be lost in the sea of deprecation warnings.
If we setup a Jenkins job to scrape the build output for warnings and simply report the total number of warnings, we can start improving the situation. If we spend an hour every morning eliminating the most commonly occurring warnings, we’ll eventually get the count down below the maximum limit. As we continue to eliminate warnings every day, we start to push the number of warnings lower and lower. With a graph of the number of warnings per build, we will be rewarded with a steadily decreasing curve. Any commits that introduce new warnings will be shown as an uptick in the graph. We might eventually get to a warnings-free build of the application. At that point we can lock in positive progress by changing the build settings to treat warnings as errors. Any new warnings will cause failed builds and we’ll be notified by our continuous integration system.
Sometimes it’s the little things that just bug you every time you have to work in a code base. One of the things that can be a big distraction is inconsistent formatting of the code base and inconsistent naming of identifiers. Fortunately, there are automated tools that can assist you in these areas.
For code formatting, you can proceed incrementally or do a once-over on your entire code base. Which approach works will work best for you is a discussion for your team. Sometimes it suffices to reformat a file and submit that as an isolated commit before you do other work in the file. Other times, you want to start making more global changes on the code and it makes sense to do a global reformat as a series of commits clumped together. Remember to proceed slowly and carefully; even reformatting tools have bugs.
Visual Studio can reformat code in the editor, but doesn’t have an option for automatically reformatting an entire project or solution. This is probably best used when you decide to proceed one file at a time as you make other changes. ReSharper for C++ provides more extensive formatting options than the base Visual Studio, allowing you to customize the style in more detail.
For batch reformatting, you can use Artistic Style or clang-format. Both of these tools support recording their formatting options in a text file; you should commit that file to your source tree along with the results of the reformatting so that you can adjust the style and reformat again automatically as you discover situations where you didn’t get the best results out of the tool. For ongoing editing, a tool like EditorConfig can help ensure that you keep the basics consistent. Don’t forget to add your
.editorconfig parameters to the source tree. Remember to proceed in small steps* by working one file at a time to avoid surprises and bugs. Compile your code frequently as you reformat and leverage your continuous integration build on your branch. Use a periodic job that executes the formatter to keep source code consistently formatted (cron job, jenkins, etc.) if inconsistencies are likely to creep back into your code based on past team behavior.
Formatting in Iterated Dynamics
At the time I reformatted Iterated Dynamics, I chose Artistic Style. Most of the time it did a reasonable job and while not every file was formatted as I would have done it manually, it was enough of an improvement that I accepted the results of the tool. I committed the style options file* to the repository for future reference, but I haven’t found the need to rerun the tool again wholesale.
Name Identifiers Consistently
Naming identifiers in a consistent manner helps you to understand identifiers in context when you read the code. Is this identifier a namespace or a class name? Is this identifier a local variable, a member variable, a constant, a file-scope static variable or a global variable? Do I use
underline_case for multi-word identifiers? Whatever conventions you decide to adopt, it will help if they are applied consistently and uniformly throughout your code.
Again, you have the choice of addressing them individually as you make changes, or you can attempt to address the problem globally. However, the there are additional wrinkles that don’t come up with formatting changes. You don’t want to just global replace text because that might change comments, string literals and unrelated identifiers. So how can you do this in a safe manner?
The Manual Approach
First, let’s discuss a manual approach. Identify the defining point of the symbol you want to change it and rename it at that location. Then you can use Michael Feather’s technique of Leaning on the Compiler to find all references and rename them manually. This works well for identifiers that are highly localized: the number of build errors will be small and you can quickly iterate through them and adjust the usages to match. For identifiers with broader scope, the number of build errors may be overwhelming.
For identifiers used in a broader scope, we might be able to approach the problem incrementally so that we can proceed in safe small steps. For instance, if we are interested in renaming a namespace, we can use a namespace alias to forward the old name to the new name. Not every construct in the language that has a name has a forwarding mechanism like this, but many of them do have a mechanism we can use to roll out name changes incrementally. If we have a tool that can reliably flag identifier violations, we can use that to build a list of identifiers needing migration and work through it incrementally over time. Remember to proceed in small steps; commits are free.
One wrinkle to watch out for is inactive preprocessor blocks in your code base; some tools can’t rename inside inactive alternatives and you might miss them when you Lean on the Compiler if you only try your changes on one platform. Use “Find in Files” (e.g. grep) over your source tree to ensure that all mentions of the old name are gone. Leverage your continuous integration build to ensure that the change was properly propagated to all build platforms.
The Automated Approach
Second, let’s look at an automated approach to having consistent identifiers. Changing identifiers amounts to the Rename refactoring, however automatic renaming is not perfect. Clang-Tidy has a check named readability-identifier-naming that will examine identifiers for consistency with a naming convention.
The current version of this clang-tidy check offers fixes for any identifiers that fail to match the naming convention. It can be useful to use clang-tidy to create a “To Do List” of inconsistent identifiers. For a large code base you may want to simply reduce the total number of warnings to single number and plot that in your continuous integration system, such as Jenkins. This will give you the ability to monitor the trend and ensure things are going in the right direction.
ReSharper for C++ can be configured with an identifier convention for various categories of identifiers and can highlight identifiers that fail to match the convention. ReSharper provides you with a quick fix that can rename the identifier to follow the convention. The latest release of ReSharper also integrates clang-tidy checks, so you can use a combination of methods to locate identifiers not following your convention.
Global Variables in Iterated Dynamics
In Iterated Dynamics, there exists a large number of global variables* and they didn’t follow any consistent naming scheme due to the large number of authors. When reading code, particularly in long functions, it was difficult to tell which variables were local and which were global. I used a combination of manual and automated tools to rename the globals to use a
g_ prefix. I used ReSharper for C++ to rename them, one at a time, in series of small steps*. Along the way my continuous integration system (Travis and AppVeyor) alerted me when the unix build was broken because my tool didn’t rename through conditionally compiled blocks of code.
Refactoring is the process of changing the structure of the code without changing the functionality. With legacy code, we’ll probably identify quite a number of things that we’d like to refactor. Remember to work carefully and slowly, on one refactoring task at a time. As you encounter items that need refactoring, make a To Do List for them. You may decide that longer-term goals of refactoring to a new architectural structure require certain smaller scale refactorings to take place first.
Convert C to C++
One simple refactoring you may want to approach first is to Convert C to C++. C is mostly syntactically compatible with C++ (some keywords in C++ are not keywords or reserved words in C), but C lacks C++’s zero-overhead abstraction facilities and C++’s stronger static type checking. If application binary interoperability is a concern, then functions can be exported with “C” linkage. This will allow you to convert source files to C++ incrementally, one at a time. Once all the source files have been converted to C++, you can remove the
extern "C" linkage and simply treat the entire code base as a C++ application.
C++ has better alternatives for most uses of the preprocessor. When examining the use of the preprocessor in your legacy code, it might be a good time to examine how conditional compilation is used. Look for dead code and eliminate it. Look for conditional compilation that is no longer relevant and eliminate it, e.g. do you still need to support the VAX?. Conditionally compiled code is significantly harder to reason about than unconditionally compiled code, both for humans and for automated refactoring tools. Seek to minimize the amount of conditionally compiled code in your code base.
Text replacing macros often have better counterparts in C++. Symbolic constants are best expressed as typed,
const values. Two common uses for macros are for sequentially named constants and bitwise disjoint constants used as flags. For the former, prefer an
enum class style scoped enumerations. For the latter, an
enum style unscoped enumeration may be the simplest. Chances are that combinations of the old macro names are stored in
int variables and an unscoped enumeration’s implicit conversion to and from
int can result in a simpler change away from the macro usage. Some frameworks provide facilities for a flags class implemented with a scoped enumeration and providing the usual bit-wise operators for combining flags into a single value, e.g. Qt’s QFlags class. A good first step is to replace the macros with an enum.
Function-like macros that compute values are best replaced with inline functions. This will provide type safety and improved debugging experience in the places where these macros are used.
What remains will generally be two categories of macros: those that expand to new strings or identifiers based on macro arguments and those that exhibit inline control structures, such as
Code that expands macro arguments as strings or combines them to create new tokens through the
## operators are best left as-is on a first pass through preprocessor usage. There is no C++ language equivalent for these operations and these operators are often used in fairly complex expansion contexts, so it is risky to change them.
For macros with inline control structures, a larger change is needed at the call sites in order to replace them with equivalent non-macro code. Therefore it is safer to leave these macros as-is on a first pass. These macros are likely to expand to multiple statements. Multi-statement macros can be problematic when they are used in the context of a control structure expecting a single statement or a block. When the macro is used without being enclosed in a block, there is the possibility that the control structure is satisfied with the first statement from the multi-statement macro, but our intention was that the entire series of statements from the expansion of the multi-statement macro be executed as the body of the control structure. If a block is always used to enclose the statements in a control structure, this avoids the problem. However, a legacy code base may not have followed this practice. You can use the refactoring Guard Macro Body for any such macros to ensure that the multi-statement macro expansion is executed as expected.
Preprocessor Usage in Iterated Dynamics
Since Iterated Dynamics began as a C application, macros were used for most of the situations described above. Proceeding in small, careful steps, most macro usage was eliminated. Symbolic constant macros were eliminated by applications of Replace Macro With Enum*. Function-like macros that computed a value were eliminated by applications of Replace Macro With Inline Function*.
Replace Integer With Boolean
Legacy code, particularly code originating as C or originating from C programmers, may contain
bool values disguised as
int values. We want to express ourselves directly in code, so we apply the refactoring Replace Integer With Boolean.
As you apply this refactoring locally, you may find that it implies that return values of functions should change from
bool as well. We can proceed carefully in small steps by using a
bool within the function and at return statements use an adapting expression such as
return pred ? 1 : 0; to maintain compatibility with existing callers of the function. One by one you can push the boolean values up through their uses the call chain in safe steps.
Replace Integer With Boolean in Iterated Dynamics
Iterated Dynamics contains many on/off flags and options for the fractal rendering process. All of these were encoded as zero/non-zero values integer variables. Repeated applications of Replace Integer With Boolean* makes it clear which options are a simple either-or choices and which options are multi-valued indicating one of a set of options.
The Global Header
In a C or C++ application, the organization of declarations in header files can have implications for compilation times as well as implications for design. For instance, one style of organizing declarations is to place all function declarations in a single header file, e.g.
prototypes.h, and to place all declarations of global data in another header file, e.g.
externs.h. Similarly, there may be a single header file containing all datatypes for the application, e.g.
types.h. These global headers are included in every translation unit in the application.
This header organization requires that every translation unit be recompiled whenever any change is made to any of these header files. For instance, renaming a global variable, renaming a global function, renaming a member of a global type, or reordering fields within a global type, forces the entire application to be recompiled. This is unfortunate from a build perspective, but it also has a design implication that through these header files every translation unit is tightly coupled to every other translation unit. We desire our designs to have high cohesion and loose coupling.
We can shatter these global headers into headers for each translation unit by following the following algorithm:
- Create a header file for a single source file.
- Move declarations exported from the single source file from the global header(s) to the new header file.
- Lean on the compiler to identify all dependent source files that need to include the new header file.
- Repeat until all declarations have been removed from the global header.
- Remove the global header and all the
#includedirectives of it.
Remember to check your continuous integration build to ensure that each new module header works on all your supported platforms.
The Global Header in Iterated Dynamics
Iterated Dynamics contained global headers for all global data and global functions. By proceeding in small steps*, the global headers were shattered into headers for each translation unit. Now when looking at any individual translation unit, we can see the dependencies on other translation units by looking at which header files it includes. Due to the nature of overlays, the data and functions are still not “clumped” together in a manner that represents the best cohesion. For instance, data is often defined in one translation unit but manipulated in another translation unit. Shattering the global header into module headers is a step in the right direction and module cohesion can be improved later, if desired.
We’d like to follow the SOLID principles of object-oriented design. This applies to packaging (libraries, translation units, header files, etc.) just as much as it applies to functions, classes and interfaces.
In a legacy C++ application, code organization is often impacted by the drudgery of dealing with IDE project files across multiple platforms. It can simply be less of a hassle to add a new class to an existing source file and existing header file than to create a new set of source files for the new class. If we add them to existing source files, we don’t have to mess with the IDE project files. Fixing these files might mean a physical trip over to a different machine in order to manipulate the IDE natively at the screen. Our version control system might not make it easy for us to transport a change in progress from one machine to another, e.g. subversion. In such situations, we end up with lots of source files files that couple classes together and provide for low cohesion within a header file or a translation unit.
If we modernized our build with CMake, addressing these concerns becomes fairly trivial for all platforms we support. We no longer need to visit other physical machines in order to introduce new source files or even split new libraries out from existing libraries. We can apply the refactoring Split Module to increase the cohesion of our translation units and reduce unwanted coupling. Proceed in a careful manner, one translation unit at a time. Keep an eye on your continuous integration builds along the way to make sure that all supported platforms are building and running properly.
Modernizing Code with Clang-Tidy
The best way to process C++ source files in an automatic manner is to use a tool that understands how to completely and accurately parse the language. Clang-Tidy is a tool built on top of the clang compiler’s parser and preprocessor. This gives clang-tidy an excellent view of our source code and a good framework for applying automatic source transformations. In particular, clang-tidy has been evolving in a direction that moves C++ source code towards more modern idioms such as range for loops and the use of
auto. It also contains checks that alert you towards suspicious constructs in your code based on a variety of coding guidelines.
When using clang-tidy from the command-line, it requires a JSON compilation database, commonly named
compile_commands.json. This file gives clang-tidy specific instructions on how every translation unit in your project is compiled. In particular, it provides compilation flags and definitions of preprocessor macros supplied through the command-line. Both of these affect how code is interpreted. For instance, we won’t suggest use of range for loops unless the compilation environment supports C++11 or later. Macros defined on the command-line may affect conditional compilation and other text expansion and any source transformations to be applied will need to take this into account.
CMake 2.8.5 or later can generate the necessary JSON file on unix-like platforms by defining
CMAKE_EXPORT_COMPILE_COMMANDS when running CMake. Unfortunately, it does not generate the file on Windows. For Windows, there are several options for running clang-tidy, all of which rely on an extension to Visual Studio. You can create a Visual Studio project and use the sourcetrail add-on to generate a compilation database and run clang-tidy from the command-line. The latest version of ReSharper for C++, another add-on for Visual Studio, integrates clang-tidy support and allows you to supply clang-tidy transformations interactively to your code. Clang Power Tools is a free add-on that can run clang-tidy on your code. You install clang-tidy from the LLVM web site directly for use with Clang Power Tools.
A word of caution is in order. Transforming C++ source code correctly is a tricky business, particularly when interactions with the preprocessor are concerned. Clang-tidy is not a commercial tool, so support is in on a volunteer basis. You may encounter bugs with some of the transformations and it is always recommended that you review the changes and proceed one translation unit at a time. Remember, go slow in small steps, commit often and keep an eye on your continuous integration build system for signs of problems.
Here are some suggested checks you may want to use on a legacy code base. Consult the clang-tidy documentation for an exhaustive list of checks and how they work.
- modernize-loop-convert to switch to range for loops
- modernize-redundant-void-arg to remove old C-style declarations/definitions
- readability-implicit-bool-cast is a good way to find
booldisguised as an integral type
Modernizing Iterated Dynamics With Clang-Tidy
Several of the above checks were applied to Iterated Dynamics: modernize-use-nullptr*, misc-macro-parentheses*, readability-braces-around-statements*, readability-else-after-return*, and readability-simplify-boolean-expr*. The only problems encountered were during misc-macro-parenthesis, where a lack of indentation at the beginning of lines in a multi-line macro gave the check a hard time. Remember to review the diffs of tools that change your source code to check for suspicious changes, proceed in small steps with small grained commits and leverage your continuous integration builds across platforms.
A legacy code base probably hasn’t been scrutinized by any static or dynamic analysis tools. New tools continue to appear and existing tools continue to improve in the quality of their analysis over time. Different tools have different strengths and weaknesses. The general advice is to run as many tools as you can and check for as many things as you can manage. While the aim of every tool is to report zero false positives, they are not perfect. If your first attempt at using these tools drowns you in warnings, dial back the parameters to focus on specific problems for which you’ve verified that a sample of the reports are not false positives.
Static Analysis Tools
Static analysis tools work by examining the source code and looking for suspicious constructs. They may operate by pattern matching or they may perform some more sophisticated data flow analysis to track special values like
nullptr. Some static analysis tools you may want to consider are: OCLint, cppcheck, the clang static analyzer and Visual Studio. Clang’s static analyzer checks can be run via clang-tidy. The Clang Power Tools and ReSharper for C++ add-ons for Visual Studio provide clang static analyzer support via their clang-tidy integration. Visual Studio also provides built-in support for static analysis using the Microsoft C++ compiler. Visual Studio releases offer increasing support for the C++ Core Guidelines. Coverity is a commercial static analyzer that is free for open source projects and can be licensed for commercial use.
Dynamic Analysis Tools
Dynamic analysis tools work by instrumenting the code at compilation time. In order for the analysis to proceed, the code must be executed. If you have regression or acceptance tests for your application, these are perfect execution scenarios for dynamic analysis tools. Using dynamic analysis, we can check for things like misuse of the memory, threading errors or undefined behavior. For instance, we may be able to detect memory used after being freed, memory freed twice, memory leaked, storing data outside the bounds of a block of dynamically allocated memory, storing outside the bounds of data allocated on the stack. Just as for static analysis tools, different dynamic analysis tools have their own strengths and weaknesses and we should run as many tools as we can to look for problems.
The clang compiler supports a number of dynamic analysis tools called “sanitizers”. Different sanitizers check for different types of problems:
- Address sanitizer detects memory errors
- Memory sanitizer detects reads of uninitialized memory
- Thread sanitizer detects data races
- Undefined behavior sanitizer
Other open source dynamic analysis tools, such as valgrind, and commercial dynamic analysis tools such as BoundsChecker are available. Just as for static analysis tools, you should use as many dynamic analysis tools as you can to leverage their different strengths.
Adding Static Analysis to Iterated Dynamics
Adding static analysis to Iterated Dynamics was relatively straightforward*. Over time, the static analysis checks have ensured that forward progress remains locked in. All the checks performed by cppcheck and clang’s static analyzer get flagged as failed builds in the continuous integration server and that prevents problems, such as reading a variable before initializing it, from slowly creeping back into the code base.
Legacy code can be frustrating to deal with because the backlog of problems and technical debt can be overwhelming. However, with a little discipline we can get the debt under control and come up with a plan for paying down the debt so that working in the code can be enjoyable, or at least less irritating. If we work carefully and in an incremental manner, we can take care of the problems in a manner that always locks in forward progress. Our goal isn’t to fix everything at once. Our goal is to ratchet up positive improvements and not slide backwards.
Iterated Dynamics is in better shape now than when it forked from FRACTINT. Code with excessive technical debt is difficult to read and difficult to understand. Technical debt in an open source project creates an unnecessary barrier for new contributors. Yet the investment in this code is worth keeping instead of throwing it away and starting from scratch. Iterated Dynamics generates a wide variety of fractal images, possibly more than any other open source fractal renderer. It has a large body of integrated help text that is worth preserving. There is more work to do for Iterated Dynamics to support multiple cores, GPU based rendering and distributed rendering, just to mention a few. These are significant architectural extensions that can’t be made until the large number of interrelated global variables are encapsulated. By proceeding carefully in small steps, locking in forward progress at every step, we ensure that we are making things better and not worse.
Thanks to Barbara from the C++ Community Leaders mailing list for suggestions on the consistent naming of identifiers, clang-format, continuous integration with Jenkins, clang sanitizers and CMake.