Unrolling the Four Node Quad
Original Post - 11 Sep 2022 - Michael H. Scott
The FourNodeQuad was one of the first, if not the first, solid elements in OpenSees.
Despite the element’s mediocre implementation, the coding style was copied by others into subsequent solid element implementations. Fortunately, before that copying happened, I revised the implementation to be more computationally efficient, and I was kind enough to leave the original author’s name in the source code.
The improved computational efficiency came from unrolling the matrix-vector operations involving the strain-displacement matrix, B, which enters several important stages of the element state determination:
- Compatibility of strains with nodal displacements: \({\boldsymbol \varepsilon}={\bf B}{\bf u}\)
- Equilibrium of nodal forces and stresses: \({\displaystyle {\bf P} = \int_V {\bf B}^T {\boldsymbol \sigma} \: dV}\)
- Tangent stiffness matrix: \({\displaystyle {\bf K} = \int_V {\bf B}^T {\bf D} {\bf B} \: dV}\)
For more details on these equations, see the “Cook book”.
These operations with the B matrix occur at each Gauss point, in each quad element, at each Newton iteration within each analysis time step. So, it behooves us to trim as many flops as possible from the state determination.
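To put a number on it, the plane four node quad has a \(3\times 8\) B matrix, a \(3\times 3\) D matrix, and an \(8\times 8\) stiffness matrix, so a naive \({\bf B}^T{\bf D}{\bf B}\) costs \(3\cdot 8\cdot 3=72\) multiplications for \({\bf D}{\bf B}\) plus \(8\cdot 8\cdot 3=192\) for \({\bf B}^T({\bf D}{\bf B})\), roughly 264 multiply-adds per Gauss point for the stiffness alone, or over a thousand per element per iteration with \(2\times 2\) quadrature.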
The original implementation of FourNodeQuad used the overloaded matrix-vector operators (where ^ is the overloaded transpose-times operator of the OpenSees Matrix class):
// Compatibility
eps = B*u;
// Equilibrium
P = P + B^sig*dV;
// Stiffness
K = K + B^D*B*dV;
The first step to efficiency was to use the daxpy-like functions in order to avoid the temporary Matrix and Vector objects created by the overloaded operators:
// Compatibility
eps.addMatrixVector(0.0, B, u, 1.0);
// Equilibrium
P.addMatrixTransposeVector(1.0, B, sig, dV);
// Stiffness
K.addMatrixTripleProduct(1.0, B, D, dV);
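In the OpenSees Matrix and Vector classes, the first argument of these daxpy-like methods scales the object being accumulated into and the last argument scales the product, so the three calls compute \({\boldsymbol \varepsilon} = {\bf B}{\bf u}\), \({\bf P} = {\bf P} + {\bf B}^T {\boldsymbol \sigma} \: dV\), and \({\bf K} = {\bf K} + {\bf B}^T {\bf D} {\bf B} \: dV\) in place, with no temporary Matrix or Vector objects along the way.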
But we can do better than that. Inspection of the B matrix shows there are always zeros in known locations:
\({\displaystyle {\bf B} = \left[ \begin{array}{cccc} {\bf B}_1 & {\bf B}_2 & {\bf B}_3 & {\bf B}_4 \end{array} \right], \qquad {\bf B}_\beta = \left[ \begin{array}{cc} N_{\beta,x} & 0 \\ 0 & N_{\beta,y} \\ N_{\beta,y} & N_{\beta,x} \end{array} \right]}\)
Taking this sparsity pattern into account, the final unrolled implementation does away with storage of the B matrix and instead loops over the nodes for their contributions to the strains, forces, and stiffness. In the code below, shp[0][beta] and shp[1][beta] store the shape function derivatives \(N_{\beta,x}\) and \(N_{\beta,y}\).
// Compatibility
eps.Zero();
for (int beta = 0; beta < 4; beta++) {
  eps(0) += shp[0][beta]*u[0][beta];                           // eps_xx
  eps(1) += shp[1][beta]*u[1][beta];                           // eps_yy
  eps(2) += shp[0][beta]*u[1][beta] + shp[1][beta]*u[0][beta]; // gamma_xy
}
// Equilibrium
for (int alpha = 0, ia = 0; alpha < 4; alpha++, ia += 2) {
  // x- and y-components of B_alpha^T * sigma * dV at node alpha
  P(ia)   += dV*(shp[0][alpha]*sigma(0) + shp[1][alpha]*sigma(2));
  P(ia+1) += dV*(shp[1][alpha]*sigma(1) + shp[0][alpha]*sigma(2));
}
// Stiffness
/*
A lot of code that's not shown here
*/
I omitted some important implementation details here, but you get the idea. For the full picture, take a look at FourNodeQuad.cpp. Also, FEAP programmers may recognize the unrolling style, as I embarked on this efficiency quest after taking CE 233 in Fall 2000 from Prof. Taylor, who came out of quasi-retirement to help cover Prof. Armero’s sabbatical.
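For the curious, here is a sketch of what the omitted stiffness loop boils down to, assuming the same shp, D, dV, and K names used above and a small scratch array DB (a name I made up for this sketch) holding D times node beta's block of B; the authoritative version is in FourNodeQuad.cpp.
// Stiffness (sketch): K is zeroed before the loop over Gauss points,
// then each node pair (alpha, beta) gets B_alpha^T * D * B_beta * dV
double DB[3][2];
for (int alpha = 0, ia = 0; alpha < 4; alpha++, ia += 2) {
  for (int beta = 0, ib = 0; beta < 4; beta++, ib += 2) {
    // DB = D * B_beta * dV, exploiting the zeros in B_beta
    DB[0][0] = dV*(D(0,0)*shp[0][beta] + D(0,2)*shp[1][beta]);
    DB[1][0] = dV*(D(1,0)*shp[0][beta] + D(1,2)*shp[1][beta]);
    DB[2][0] = dV*(D(2,0)*shp[0][beta] + D(2,2)*shp[1][beta]);
    DB[0][1] = dV*(D(0,2)*shp[0][beta] + D(0,1)*shp[1][beta]);
    DB[1][1] = dV*(D(1,2)*shp[0][beta] + D(1,1)*shp[1][beta]);
    DB[2][1] = dV*(D(2,2)*shp[0][beta] + D(2,1)*shp[1][beta]);

    // K block = B_alpha^T * DB, again exploiting the zeros in B_alpha
    K(ia,ib)     += shp[0][alpha]*DB[0][0] + shp[1][alpha]*DB[2][0];
    K(ia,ib+1)   += shp[0][alpha]*DB[0][1] + shp[1][alpha]*DB[2][1];
    K(ia+1,ib)   += shp[1][alpha]*DB[1][0] + shp[0][alpha]*DB[2][0];
    K(ia+1,ib+1) += shp[1][alpha]*DB[1][1] + shp[0][alpha]*DB[2][1];
  }
}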
So, how much difference does all of this make?
I created a mesh of 20 elastic isotropic quad elements, then loaded the model to a peak value over 100,000 time steps. Even though the model is linear, I used a Newton algorithm to mimic what goes on in a nonlinear analysis. The mesh is small enough that solving the system of equations does not dominate the run time, and the 100,000 time steps wash out the variance in timing individual analysis steps.
Everything else being equal, differences in implementation of the
element state determination are apparent from timing the
ops.analyze(100000)
command.
| Implementation | Wall Time Relative to Unrolled Implementation |
| --- | --- |
| Unrolled | 1.000 |
| daxpy-like | 1.227 |
| Overloaded operators | 1.322 |
Using the overloaded matrix-vector operators takes about 30% more wall time than the unrolled implementation. Similarly, the daxpy-like implementation takes about 20% more wall time than the unrolled implementation. And that is with a B matrix that is not all that sparse: only a third of its entries are zero.
So, going from the overloaded operators to the daxpy-like implementation speeds up the analysis, which is a win. But if you want to go for the gold, unroll the calculations to account for zeros, especially if you are multiplying block diagonal matrices. Unrolling will make a BIG difference.