OpenGL Matrix Transforms
This tutorial is designed to help the beginner understand how to use OpenGL's basic transformations - and how not to get burned by them. As usual, in order to do any real 3D work, a solid grasp of vector geometry is needed, and a basic understanding of linear algebra (particularly matrix operations.) I've tried to iron out some of the kinks so that you won't need to go break out the textbooks.
What you'll know at the end of this:
- How and where OpenGL translation, rotation, and scaling operations should be used
- The derivation of those operations
- How to track coordinate systems by using "reference frames" instead of individual transformations
- How to manipulate a camera in world coordinates, rather than attempting to manipulate the world in camera coordinates
- How to evaluate what the user is pointing at in your world
It took me a bit of time to derive these mechanisms, so I thought I'd share my work.
Contents
- Fundamentals
- The Matrix Stack and Transformations
- Reference Frame Interpretation
- The Camera
- Putting It All Together
- Calculating the Cursor Position in World Coordinates
Notation
I'm not terribly precise with my notation - I tend to forget to use it when I feel the context is sufficient. Let me know if you think this is a problem somewhere. In general, I follow the rules below:
- vectors will be in bold, such as the vector a
- scalars will be in italic, such as the scalar v
- dot product: v = a . b
- cross product: v = a x b
- magnitude of a vector: |a|
- the vector a is a column vector (ax, ay, az, aw)T
Review
Properties of dot products:
- a . b = |a|b|cos(angle between a and b)
- If a is perpendicular to b, a . b = 0 (obvious from the above)
- Dot products are scalar
Properties of cross products:
- |a x b| = |a|b|sin(angle between a and b)
- If a is parallel to b, a x b = 0 (again, see above)
- Cross products are vectors.
- The cross product of any two vectors is perpendicular to *both* vectors.
Properties of matrices:
- M1 * M2 != M2 * M1
- (M1*M2)-1 = M2-1 * M1-1
- (M1*M2)T = M2T * M1T
- det(M-1) = 1/det(M)
- det(M) != 0: Matrix is invertible.
- det(M1 * M2) = det(M1) * det(M2)
- det(MT) = det(M)
All of the above are taken for granted; if you want to understand that better, refer to any basic college math text.
Vertices in OpenGL: x, y, z, w?
OpenGL uses 4x4 matrices to represent all vertex transformations. Matrices are used because they can accumulate linear transformations - a series of rotations, translations, and scaling operations can be accumulated into a single matrix. The main reason for a 4x4 matrix is that translations cannot be represented as a linear operation with a 3x3 matrix.
As a consequence, throughout the GL you'll find calls like glVertex4d - and you'll find that they're rather unpredictable if you're not careful with your w coordinate (the last one, after z.) There is a method to this madness, however:
- If the vertex you're referring to is a position, then the w coordinate must be 1.
- If the vertex you're referring to is a direction, the w coordinate must be 0.
Position vertices are used to locate a point; direction vertices point out a direction, but do not depend on a particular point in space. The difference between two position vertices is a direction. A position + any value times a direction is another position. And so forth. It's not necessary to be this pedantic, generally, but paying attention to your w coords can help you check operations for sanity.
Note that, in the above convention, the origin is the position (0, 0, 0, 1), and the direction from the origin to a point p is p - (0, 0, 0, 1); the convention is consistent.
The Matrix Stacks
There are two main matrices to concern yourself with here: the Projection and Modelview matrices.
- The Projection matrix is used to transform, as the book says, from eye coordinates to clip coordinates (and onwards to screen coordinates.) We'll cover it in more detail when we get to "determining the cursor position in world coordinates"
- The Modelview matrix is used to transform from your coordinate systems to eye coordinates. This is the matrix we'll focus on in this tutorial.
The distinction between the two matrices is somewhat subtle, but may cause you grief. For example, the Projection matrix is not used in transforming surface normals used for lighting or fog coordinates. Steve Baker's article on projection abuse sums up the issue nicely; in super-short, use the Projection matrix for your initial calls to glOrtho*/glFrustum*/gluPerspective *only*, and leave the rest of the transforms to the modelview matrix.
The transformation of a vertex all the way from glVertex:
From there, 3-dimensions NormalizedDeviceCoords = (1.0 / ClipCoordsw) * (ClipCoordsx, ClipCoordsy, ClipCoordsz). Typically, ClipCoordsw will equal 1 - this indicates that you used a "position" vertex for glVertex.
If the NormalizedDeviceCoords fall into the cube (-1 <= x <= 1, -1 <= y <= 1, -1 <= z <= 1), the Z coordinate is compared against the depth buffer as appropriate and the pixel is rendered if the depth test passes. Otherwise, it's either culled or clipped (I believe the GL will automatically clip your surface to the viewport boundaries).
Translation
The translation matrix, applied by glTranslate*, is the simplest conceptually: It translates (read: moves) your coordinate system. Translation of the vector a by (x, y, z, w=1) can be performed by the application of the matrix:
M = | 0 1 0 y | , M*a = | ay + y |
| 0 0 1 z | | az + z |
| 0 0 0 w | | aw * w |
Your W coordinate should be 1 here to preserve your sanity.
Note that the determinant of the translation matrix is w - that is, the inverse is defined as long as w != 0. Its inverse is intuitive - translate it back:
inverse(M) = | 0 1 0 -y |
| 0 0 1 -z |
| 0 0 0 1/w |
Scaling
The scaling matrix is used to stretch or shrink your coordinate system. Scaling is something that should not be used too lightly - it can interfere with lighting calculations, and scaling by large factors or factors close to zero can introduce error into your system. To scale the vector a by (sx, sy, sz, sw=1.0), use the matrix:
M = | 0 sy 0 0 |, M*a = | ay*sy |
| 0 0 sz 0 | | az*sz |
| 0 0 0 sw | | aw*sw |
Again, w=1.0 or you'll go insane. The determinant of the scaling matrix is sx*sy*sz*sw - the inverse is defined as long as *none* of the scale factors = 0.
Its inverse, also intuitive ("unstretch" it):
inverse(M) = | 0 1/sy 0 0 |
| 0 0 1/sz 0 |
| 0 0 0 1/sw |
Rotation
The rotation matrix is used to rotate vertices about a given axis. The derivation and use of this matrix is much trickier to understand than that of scaling or translation. Initially when I tried to wrap my mind around it, I kept thinking in terms of angles, sines, cosines. This turned out to be a mistake.
Basically, to rotate one point X about an axis A, you do this: First, find the part of the axis of rotation shared by the
X and axis A. This is the vector v1 = (x . A) A. From here, the vector X' = v + a vector of fixed length on the plane normal to A. Let's look at the two vectors
= X - (X.A) A v3
= A x v2
= A x (X - v1)
= A x (X - (X.A)A)
Since v1 is parallel to A, A x v1 = 0, we reduce v3 to simply A x X. Note that v2 and v3 are both perpendicular to the axis A and to each other. Assuming that |A| = 1 - the axis of rotation A is a unit vector - then v2 and v3 also have the same magnitude:
= X^2 sin(θ)^2 |X - (X.A)A|^2
= X^2 + X^2 A^4 cos(θ)^2 - 2 X^2 A^2 cos(θ)^2
= X^2(1 - cos(θ)^2) = X^2 sin(θ)^2
Therefore, v1 + cos(?)v2 + sin(?)v3 defines a circle around v1. Finally we can see that:
==> expanding out vectors v ==>
X' = (X . A) A + cos(?)*(X - (X . A) A) + sin(?) * (A x X)
Expanding this into Cartesian coordinates, we get:
X'=| Xx(AyAx) + Xy(Ay^2) + Xz(AyAz) |+cos(θ)*| -Xx(AyAx) + Xy(1-Ay^2) - Xz(AyAz) |+sin(θ)*| Xx(Az) - Xz(Ax) |
| Xx(AzAx) + Xy(AzAy) + Xz(Az^2) | | -Xx(AzAx) - Xy(AzAy) + Xz(1-Az^2) | | -Xx(Ay) +Xy(Ax) |
==> combining and factoring out X ==>
| Ax^2 + cos(θ)(1-Ax^2) AxAy - cos(θ)AxAy - sin(θ)Az AxAz - cos(θ)AxAz + sin(θ)Ay |
X' = | AyAx - cos(θ)AyAx + sin(θ)Az Ay^2 + cos(θ)(1-Ay^2) AyAz - cos(θ)AxAz - sin(θ)Ax | X
| AzAx - cos(θ)AzAx - sin(θ)Ay AzAy - cos(θ)AzAy + sin(θ)Ax Az^2 + cos(θ)(1-Az^2) |
Finally, factoring out, adding the extra row and column (which have no impact on the rotation itself), and solving just for the matrix itself:
R = | AyAx(1 - cos(θ)) + sin(θ)Az Ay^2(1 - cos(θ)) + cos(θ) AyAz(1 - cos(θ)) - sin(θ)Ax 0 |
| AzAx(1 - cos(θ)) - sin(θ)Ay AzAy(1 - cos(θ)) + sin(θ)Ax Az^2(1 - cos(θ)) + cos(θ) 0 |
| 0 0 0 1 |
To invert this operation, we use the intuitive definition of inversion: Inversion of a linear operation is the same as reversing that operation. To reverse a rotation operation, we simply replace "θ" with "-θ". Substituting cos(θ) == cos(-θ), and -sin(θ) == sin(θ), we find out that R-1 == RT, and R*RT = R*R-1 = I. It follows that:
and therefore, det(R) = +/- 1. In other words, the inverse of a rotation matrix is *always* well-defined, and is simply its transpose.
The Reference Frame
This is where I diverge from the book a bit. I've found the problem of tracking sequences of translations, rotations, and scales somewhat intractable, and found a different, equivalent way of thinking about it that's easier to process. We'll get back to those matrices shortly.
I define an object's "reference frame" as its orientation and position in the world. In order to uniquely specify an object's position and orientation, you need at least two direction vectors to orient it - let's call them "up" and "right" - and one position vector to place it with respect to the origin. Given these, we can calculate the last orientation vector - let's call it "backwards" (we'll get to that) - as right x up = backwards.
By this definition, we can define OpenGL's default reference frame:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
Note that defining backwards=(0, 0, 1, 0) is consistent with OpenGL's "looking down the negative z axis", for a camera. It is no coincidence that that looks like an identity matrix; again, we'll get to that.
Let's define our object O as a triangle consisting of the vertices: (0, 0, 0), (1, 0, 0), (0, 1, 0). (If I omit the w coord, it's 1 for a position like these vertices, and 0 for a direction.) We want to move this triangle around, but we don't want to manually calculate the vertices for this triangle (and the other 10000 triangles that may constitute a more complex model.) Let's put it at the position P=(2, 3, -10) for 2 to the right, 3 up, and 10 forwards. Let's also rotate it a bit - let's say 30 degrees about the Z axis. To do this, you could call glTranslate* followed by a specially-concocted glRotate*, but let's not do that right now. Don't rush.
Let's instead consider the new reference frame of our object. We calculate the new right, up, and forward axes as:
| sqrt(3)/2 0.5 0 2 |
F = | -0.5 sqrt(3)/2 0 3 |
| 0 0 1 -10 |
| 0 0 0 1 |
Now, if you go ahead and calculate what you'd get if you'd called glTranslate* and glRotate*, you'll find that you got *exactly that matrix.* This always works, no matter the orientation or position, though be careful with scaling (haven't thought that part through yet.)
In other words, you can interpret the composite matrix F = T * R as four column vectors, (Right, Up, Backwards, Position); and more importantly, vice-versa. That is, if you orient your models to this frame before saving them, you can always quickly place a model by simply building the reference frame matrix with the Right-Up-Backwards-Position vectors given.
To sum it up: We define the reference frame matrix F as the matrix T * R:
F = | 0 1 0 Ty | * | Ryx Ryy Ryz 0 | = | Ryx Ryy Ryz Ty | = | Ry Uy By Py | = | Right Up Backwards Position |
| 0 0 1 Tz | | Rzx Rzy Rzz 0 | | Rzx Rzy Rzz Tz | | Rz Uz Bz Pz |
| 0 0 0 1 | | 0 0 0 1 | | 0 0 0 1 | | 0 0 0 1 |
Controlling Object Motion in a Reference Frame
There are a few key advantages to the reference frame interpretation of matrix transforms.
One is that, since you know which way right, up, and backwards are, you can implement a control set trivially: To go forward or backward, add -/+ v * backwards to the position vector. To go up and down, add +/- v * up to the position vector; to strafe left and right, add +/- v * right to position vector.
Controlling direction is also easy:
Pitch: refframe = refframe * rotation_matrix(refframe.right, angle)
Roll: refframe = refframe * rotation_matrix(refframe.backwards, -angle)
Since OpenGL specifies matrices in column-major order, an array of four vectors (right, up, backwards, position) can be directly casted to double* and applied via glMultMatrix*. To be totally honest, I suspect this might be the rationale behind the GL's choice of column-major rather than row-major matrix storage, though I can't speak for history. More of a minor note than a major advantage.
The Camera
No rule is good without an exception. Cameras work slightly differently.
When applying an object refframe transform, the object's refframe is stored in the parent frame's coordinates, and converts from parent (world) coords to child (object) coords.
The camera's refframe is stored in the child frame's (world's) coordinates, and converts from child (world) to parent (eye) coords. This is precisely the inverse of the refframe transform.
Thus, you need to apply the camera's refframe's inverse to convert world to eye coordinates. You can still store the camera refframe in world coordinates - and there's good reason to do it, since 1) the forward transform does have a meaning which we'll use later, 2) you can treat your camera like a physical object, and 3) you can implement first-person controls as mentioned above. You just have to invert the camera refframe before applying it to the modelview matrix.
Matrix inversion isn't really covered here, but basically you have three options:
- The way I've seen other people taught that has to do with adjuncts and submatrices
- Gauss-Jordan elimination
- Reverse the operations of the camera refframe
The first way is one I don't remember how to do. I implemented it once, found it overly complex to implement, not intuitive for manual calculation, and it ran slower than my unoptimized Gauss-Jordan elimination mechanism. I don't know, maybe this way would lend itself better to streamlining via SIMD operations, but really, you shouldn't be doing so many inverses/frame that it should matter. Matrix inversion is a very expensive operation no matter how you do it.
The second way, Gauss-Jordan elimination, I found to be relatively fast. If you need to invert a matrix, look up that mechanism and implement it. You can assume that the inverse exists because we've already *proven* above that a matrix composed of translation and rotation operations has an inverse.
The third way is by looking at the camera refframe matrix, decomposing it into its translation and rotation matrices, inverting those, and multiplying them. I find this to be the easiest mechanism, *as long as you don't incorporate any scaling operations into your camera's reference frame.*
F_camera = | Ry Uy By Py | = | 0 1 0 Py | * | Ry Uy By 0 |
| Rz Uz Bz Pz | | 0 0 1 Pz | | Rz Uz Bz 0 |
| 0 0 0 1 | | 0 0 0 1 | | 0 0 0 1 |
==> remember, inverse of rotation matrix == transpose of rotation matrix ==>
| Rx Ry Rz 0 | | 1 0 0 -Px |
inverse(F_camera) = | Ux Uy Uz 0 | * | 0 1 0 -Py |
| Bx By Bz 0 | | 0 0 1 -Pz |
| 0 0 0 1 | | 0 0 0 1 |
Putting It Together
So, your typical GL application, with camera and objects, will have a display function rather like this:
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(aspect, fovy, etc);
// now, set up modelview to identity
glMatrixMode(GL_MODELVIEW);
glLoadIdentity(); // interpret as "GL default reference frame"
glMultMatrixd(camera.refframe.inverse);
for(o in object_list) {
glPushMatrix();
glMultMatrixd(o.refframe);
o.render();
glPopMatrix();
}
}
Note that you can apply successive reference frames - that is, they are recursive. I believe this is why the matrix stack exists.
Calculating the Cursor Position in World Coordinates
So, now that you've got your world set up, you might want to interact with it. Maybe even with a mouse. Maybe not. If you do want to use a mouse, this section should help.
First, let's look back on how your pixel goes from glVertex* to screen coordinates:
Next step was NormalizedDeviceCoords = (1/w) ClipCoords[0:3], or just the first three components of ClipCoords if w is sane (1).
These are then clipped to the boundaries (-1 <= x <= 1), (-1 <= y <= 1), (-1 <= z <= 1), and the Z coord is used for depth testing and then discarded. The hidden transform here is the one performed by the viewport.
glViewport maps the domain (-1<=x<= 1), (-1<=y<=1) to the window coordinates specified by the viewport. We'll call this viewport (Vx1 = left, Vy1 = top, Vx2 = left + width, Vy2 = top + height), assuming the Windows convention of X = 0 at left, increasing rightwards, and Y=0 at top, increasing downwards.
If your mouse cursor is in the window coordinate position m = (mx, my), then the cursor position in quasi-normalized-device-coords is the line segment connecting p and q:
py = qy = (2.0*(Vy2 - my) / (Vy2 - Vy1) - 1.0)
pz = -zNear, qz = -zFar pw = qw = 1.0
where zNear is the distance to the near clipping plane and zFar, to the far clip plane. zNear and zFar can both be positive or negative. Technically, in normalized device coords, pz = -1 and qz = 1, but the above values for pz and qz help us dodge tricky calculations later.
This is your cursor line pre-projection. For ortho matrices centered on the origin, we can now calculate p' and q' in eye coords:
M = | 0 (top-bottom)/2 0 0 |
| 0 0 (far-near)/2 0 |
| 0 0 0 1 |
p' = Mp
q' = Mq
For gluPerspective matrices, a subset of symmetric glFrustum matrices also centered on the origin, we can also calculate p' and q' in eye coords:
p' = | 0 zNear*sin(fovy/2) 0 0| p
| 0 0 1 0|
| 0 0 0 1|
| aspect*zFar*sin(fovy/2) 0 0 0|
q' = | 0 zFar*sin(fovy/2) 0 0| q
| 0 0 1 0|
| 0 0 0 1|
That is to say, p' and q' are scaled to fit the viewing volume indicated by gluPerspective, and the choice of pz=-zNear and qz=-zFar prevents us from having to think of the Z scale factor.
Finally, calculate p'' and q'' in world coordinates, by:
q'' = camera.refframe * q'
p'' is the position of the cursor on the near clipping plane, and q'' on the far plane.
You may find that you hit a complex object and want to know what part of the object you hit. No fear here: p''' and q''' in object coords are simply object.refframe.inverse * p'' and object.refframe.inverse * q'' respectively. You can dig deeper into nested reference frames by repeating this process.
In general, you can look at it this way: When your projection and modelview are sane and you give a vertex with w=1,
=>
vertex_world_coords = ProjectionMatrix.inverse * camera.refframe * device_xform_inverse(screen_coords)
The "device_xform_inverse" is the operation above that scales your x and y coordinates to the domains (-1, 1), and adds the z dimension (values -1 and 1 for p and q respectively) and the w dimension (value 1).
Hopefully that's enough to get you started building a coherent, interactive world using OpenGL. I won't delve into the more complicated issues of collision detection; that's still up to you. Happy rendering!
