Counting Flops

Original Post - 06 Jul 2025 - Michael H. Scott

When I was an undergraduate at Pine State University, all engineering freshmen had to take a programming course. We could pick between Fortran, Pascal, and C++. From what I recall, most civil and mechanical engineering students took Fortran because that’s how you crunch numbers. I chose C++ and liked it so much I willingly took a second programming class.

It was C++ from the mid 90s. Templates were a thing, but widespread adoption of the STL was a few years over the horizon. So I learned the ins and outs of memory management, how to DIY linked lists, and other “low level” issues that modern C++ programmers take for granted. And even though MATLAB was becoming a “programming language” while I was in graduate school, I got a heavy dose of implementation efficiency (via Fortran) in computational mechanics.

And that dose has been a constant reminder to un-MATLAB your OpenSees and to unroll your element and constitutive model implementations. This post shows another example.

Consider the CorotCrdTransf2d class implementation of getGlobalResistingForce(). The first three lines of the method transform the basic forces to the local coordinate system.

const Vector &
CorotCrdTransf2d::getGlobalResistingForce(const Vector &pb, const Vector &p0)
{
    // transform resisting forces from the basic system to local coordinates
    this->compTransfMatrixBasicLocal(Tbl);
    static Vector pl(6);
    pl.addMatrixTransposeVector(0.0, Tbl, pb, 1.0);    // pl = Tbl ^ pb;

    //
    // More stuff down here
    //

}

The first line calls a private function that populates the local-to-basic transformation matrix based on the displaced configuration of the element. Then, after declaring a static Vector (a post for another day), the addMatrixTransposeVector() method transforms the forces from basic to local, \({\bf p}_l = {\bf T}_{bl}^T {\bf p}_b\).

The form of the local-to-basic transformation matrix is known a priori.

Transformation matrix from local to basic system

In and of itself, the transformation of basic to local forces via dense matrix-vector multiplication in addMatrixTransposeVector() requires 30 flops (18 multiplies and 12 additions).

But we have no need to multiply the third and sixth columns of the transformation matrix by anything because these columns are all zeros and ones. Furthermore, the entries in columns one, two, four, and five form an exploitable pattern that becomes obvious after we write out the six equations in \({\bf p}_l = {\bf T}_{bl}^T {\bf p}_b\).

Unrolled basic to local equilibrium equations

To minimize flops, we could re-implement the first three lines of getGlobalResistingForce() as follows:

const Vector &
CorotCrdTransf2d::getGlobalResistingForce(const Vector &pb, const Vector &p0)
{
    // transform resisting forces from the basic system to local coordinates
    static Vector pl(6);
    // unrolling pl = Tbl ^ pb;
    double axial = pb(0);
    double sinAxial = sinAlpha*axial;
    double cosAxial = cosAlpha*axial;
    double shear = (pb(1)+pb(2))/Ln;
    double sinShear = sinAlpha*shear;
    double cosShear = cosAlpha*shear;
    pl(3) = cosAxial + sinShear;
    pl(0) = -pl(3);
    pl(4) = sinAxial - cosShear;
    pl(1) = -pl(4);
    pl(2) = pb(1);
    pl(5) = pb(2);
 
    //
    // More stuff down here
    //

}

Now we’re down to seven flops–four multiplies and three additions. I’m not counting the division by Ln because that also happens in the compTransfMatrixBasicLocal() method (four times in fact, but we don’t need to go there).

Unrolled code is not always easy to read. But, generally, what you give up in readability, you gain in performance…although you will find many implementations in OpenSees that are both unreadable and inefficient.

A brief script offers some insight on efficiency gains from unrolling the basic-to-local force transformation for the CorotCrdTransf2d class. The model is a simple elastic cantilever with the corotational transformation and an increasing axial load–basically something to get through a lot of uneventful state determinations.

import openseespy.opensees as ops
import time

ops.wipe()
ops.model('basic','-ndm',2,'-ndf',3)

ops.node(1,0,0); ops.fix(1,1,1,1)
ops.node(2,48,0)

ops.geomTransf('Corotational',1)

ops.element('elasticBeamColumn',1,1,2,20,29000,800,1)

ops.timeSeries('Linear',1)
ops.pattern('Plain',1,1)
ops.load(2,1.0,0,0)

ops.analysis('Static','-noWarnings')

Nsteps = 1000000
totalTime = 0
for i in range(Nsteps):
    t0 = time.time()
    ok = ops.analyze(1)
    t1 = time.time()
    totalTime += (t1-t0)

print(totalTime)

Running this script five times with the original implementation takes an average of 1.85 sec while the unrolled implementation takes 1.78 sec on average. That’s about a 3% reduction in run-time. Not huge, but it’s something that adds up. It’s also something that may get washed out in the analysis of larger models. Still something nonetheless.

We’d see larger reductions in run time for the CorotCrdTransf2d class if we unrolled the matrix triple product that transforms basic stiffness to the local system and the addition of geometric stiffness terms to the local stiffness matrix. That unrolled code is more unreadable than the transformation of forces and won’t blog well, but you get the idea.

OpenSees Cloud

OpenSees AMI

Counting Flops