USS Clueless - System control

Stardate 20040702.1344

(Captain's log): In response to comments here about latency, Bart writes:

As an agronomist (soil chemist) and farmer I work with living systems. My experience is that an overwhelming percentage of the time cause and effect are separated in both time (latency, as you discuss) and space. We are fortunate indeed to pry an R² of 35 out of most work. I'm curious... is something like mechanical engineering 'cleaner' in that respect?

It's different. I don't know that it's cleaner. Our time and space issues were more constrained, but our standard for acceptable performance was much more strict than anything Bart ever has to satisfy.

And for us, latency was one problem among many.

I'm not an ME; I'm a programmer. I spent most of my career doing embedded software. That means I wrote firmware for microprocessors which were incorporated into larger systems. Usually the microprocessor was responsible for the human interface or the control interface fed by another computer, and it also had to control custom hardware in the system and report back what happened. But most of my jobs involved controlling custom electronics. There was only one time in my career when I worked on controlling mechanical systems.

It was at a company which produced robotic arms for the semiconductor processing industry. Our robots operated in ultra-clean high-vacuum environments, and were designed to move silicon wafers around. There were incredible constraints in terms of particulates, vacuum, speed, reliability, and precision.

Our motors had to be outside the clean volume so they could be cooled. (Also, motors are inherently "dirty" and shed particles.) That meant we had to run shafts through the wall from normal air into high vacuum. We had to make sure we didn't leak excessively at the interface. There's always some leakage; it's impossible to avoid. But when you're trying to maintain vacuum at levels below 10^-7 torr, you can't tolerate damned much leakage. Sealing rotating shafts without screwing up their ability to rotate is an interesting problem.

We generally had excellent reliability. The requirements were quite strict (but not unreasonable) for mean time to failure, mean time to repair, mean interval of maintenance, and mean time to perform maintenance. (All of which was summarized as an "uptime" spec.)

We had quite challenging requirements for precision. Absolute accuracy was not important at all. But our spec for repeatability was extremely strict: no more than 5 mils worst case (where a "mil" was .001 inch), about 125 microns (one eighth of a millimeter).

We also had major constraints for cleanliness. Every particle which lands on an IC and is present during some kinds of processing steps ruins the die it sits on, so obviously you'd like to minimize that.

We were not permitted to touch the top or edge of the wafer or even to have anything extend spatially above the top plane of the wafer anywhere near it. So we couldn't clamp onto the top of the wafer or hold it in place by its edges.

That meant the wafer sat on three small plastic pads on the end effector, making contact only on the bottom, well away from the edges. The wafer was held in place solely by friction. Given that the wafer didn't weigh much, there wasn't really very much friction. So when we moved, we had to be extremely careful to make sure that the lateral force between pads and wafer did not become so great as to cause the wafer to shift.

The primary challenge in motion control related to the geometry of the robot arms, and before I discuss that it would probably be helpful for me to give you a mental model of the robot.

Imagine a human holding his hands flat, palms face up, in line with his shoulders. His elbows stick out to the side. (That isn't a very comfortable position, but just imagine it that way.) His job is to pick up and move dinner plates, and he is only permitted to touch the bottom of the plate. The robot actually had only one "hand" (the "end effector"), and it was connected to both arms. So imagine that the person's hands are taped together.

The robot had three "motions" it could perform. The "rotate" motion was analogous to the human turning in place over a single spot. "Lift/lower" was analogous to the human using his calf muscles to raise and lower himself with his ankles. The "extend/retract" motion was like the human moving his hands horizontally in whatever direction he's facing, directly away from his chest or towards his chest by straightening or bending his arms.

To move a wafer from one place to another, we executed the following sequence of motions:

1. With the arm retracted, lowered and empty, rotate the robot to face the source station. (fast)
2. Extend the arm to the wafer station. The end effector slides under the wafer. (fast)
3. Lift the wafer.
4. Retract the arm, holding the wafer. (slow)
5. Rotate to face the destination station. (slow)
6. Extend the arm carrying the wafer to the destination station. (slow)
7. Lower the wafer.
8. Retract the arm, leaving the wafer in place. (fast)
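
The sequence above can also be written down as data. The Python sketch below is purely illustrative: the `Step` class, the `TRANSFER` list and the "short" label for lift and lower are assumptions made for this example (the list gives no speed class for those two steps), and none of it is the robot's actual firmware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    motion: str      # "rotate", "extend", "retract", "lift" or "lower"
    has_wafer: bool  # True while the wafer rests on the end effector

    @property
    def profile(self) -> str:
        # Lift and lower are tiny moves; the list assigns them no speed class.
        if self.motion in ("lift", "lower"):
            return "short"
        # Any other move is slow when carrying a wafer, fast when empty.
        return "slow" if self.has_wafer else "fast"


TRANSFER = [
    Step("rotate", False),   # 1. face the source station
    Step("extend", False),   # 2. slide under the wafer
    Step("lift", True),      # 3. pick the wafer up
    Step("retract", True),   # 4. pull back with the wafer
    Step("rotate", True),    # 5. face the destination station
    Step("extend", True),    # 6. carry the wafer out
    Step("lower", True),     # 7. set the wafer down
    Step("retract", False),  # 8. pull back empty
]

for number, step in enumerate(TRANSFER, start=1):
    print(number, step.motion, step.profile)
```

The only point of the sketch is that the speed class follows from one fact: whether a wafer is on the end effector.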

"Lift" and "lower" usually was on the order of two or three millimeters. But most of the robots I worked on could extend to a position more than a meter from the center of the robot, and retract so that the wafer was less than a quarter of a meter from the center of the robot. They also had a rotary range of 370° stop to stop.

What we referred to as "movement profiles" were extremely complex to manage, since our customers wanted wafers moved as fast as possible.

"Fast" motions had to be smooth but didn't have to limit the force used, since there was no wafer on the end effector. The goal was to move as quickly as possible without overshoot or other potentially disastrous miscontrol.

"Slow" motions, when there was a wafer, were a real bitch. The wafer was held in place solely by friction, so we had to make sure that the total force applied to the wafer by the end effector pads never was great enough to exceed friction and cause the wafer to slip.

At all times, we had to know all the force vectors applied to the wafer and had to make sure that the magnitude of the vector sum didn't exceed the threshold which would overwhelm friction and result in wafer movement relative to the end effector.
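
To make that constraint concrete, here is a minimal Python sketch. It assumes a level end effector and simple Coulomb friction, in which case the wafer's mass cancels and the check reduces to comparing the magnitude of the summed lateral acceleration against mu times g. The function name, the friction coefficient of 0.3 and the sample accelerations are all hypothetical; they are not figures from the real robot.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def wafer_holds(accelerations, mu):
    """Return True if friction can supply the summed lateral acceleration.

    accelerations: iterable of (ax, ay) contributions in m/s^2, expressed
    in the plane of the wafer. mu: assumed static friction coefficient.
    """
    ax = sum(a[0] for a in accelerations)
    ay = sum(a[1] for a in accelerations)
    # Slip threshold: |F| <= mu * m * g, and m cancels from both sides.
    return math.hypot(ax, ay) <= mu * G


radial = (2.5, 0.0)      # hypothetical contribution along the arm
tangential = (0.0, 2.5)  # hypothetical contribution across the arm

print(wafer_holds([radial], 0.3))              # one contribution alone
print(wafer_holds([tangential], 0.3))          # the other alone
print(wafer_holds([radial, tangential], 0.3))  # both together
```

With these made-up numbers each contribution is acceptable by itself, but the vector sum is not, which is why the magnitude of the sum is what has to be checked.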

When we had a wafer and were rotating, if we rotated too rapidly then the centripetal force could cause the wafer to shift. A small shift would violate our precision spec. A medium shift could result in eventual collision. A huge shift would throw the wafer like a Frisbee. Any shift was very bad.

When we were rotating rapidly, there was considerable centripetal force. That meant we couldn't use as much force to accelerate and decelerate. When rotational speed was lower, at the beginning and end of a rotational movement, we could use more force to accelerate or decelerate.
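
That tradeoff can be sketched numerically. Under the same simplifying assumptions as before (level end effector, Coulomb friction, wafer treated as a point at a fixed radius), the centripetal acceleration is omega squared times r, and whatever remains of the friction budget is available for speeding up or slowing down the rotation. The function name and all numbers below are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def tangential_budget(omega, radius, mu):
    """Largest tangential acceleration (m/s^2) that avoids slip.

    omega: rotation rate in rad/s, radius: wafer distance from the
    rotation axis in meters, mu: assumed static friction coefficient.
    """
    limit = mu * G                    # total lateral acceleration allowed
    centripetal = omega ** 2 * radius
    if centripetal >= limit:
        return 0.0                    # rotation alone already uses the budget
    # The two components are perpendicular, so they add as a vector sum.
    return math.sqrt(limit ** 2 - centripetal ** 2)


for omega in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(omega, round(tangential_budget(omega, 0.25, 0.3), 3))
```

The budget is largest at rest and shrinks as the rotation rate rises, which is the behavior described above: more room to accelerate at the start and end of a rotation, less in the middle.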

Controlling the motion so we didn't exceed the friction between the end effector and the wafer was tough. But we couldn't "play safe".

Our customers wanted the wafers moved as fast as possible. In a fab, amortized cost of processing equipment is huge even when measured per-day. A modern state-of-the-art fab can cost upwards of $2 billion, and most of that is for the processing equipment. That is by far the largest component of the operation cost for a fab, and most of the other operating costs are also "fixed".

Revenue comes from sales of ICs, and if you want to be profitable, your revenue better exceed your operating cost. Roughly speaking, total revenue per day is the product of the revenue per IC and the number of ICs produced per day. The number of ICs produced per day, in turn, is roughly the yield (useful ICs per wafer) multiplied by the processing speed (wafers processed per day).
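
As a back-of-the-envelope illustration of that arithmetic, here is a small Python sketch. The function name and every number in it are invented for the example; they are not real fab figures.

```python
def daily_revenue(revenue_per_ic, good_ics_per_wafer, wafers_per_day):
    """Rough daily revenue: price per IC times good ICs produced per day."""
    ics_per_day = good_ics_per_wafer * wafers_per_day
    return revenue_per_ic * ics_per_day


# Hypothetical fab: $2 parts, 1000 good dice per wafer, 500 wafers per day.
baseline = daily_revenue(2.0, 1000, 500)

# The same fab moving wafers 10% faster with no loss of yield.
faster = daily_revenue(2.0, 1000, 550)

print(baseline, faster)
```

Because most of the operating cost is fixed, extra wafers per day at the same yield go almost entirely to the bottom line.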

One tradeoff is part complexity. Small ICs which don't require many processing steps have a low commercial value, but you get a lot of them per wafer and yields tend to be very high. Big ICs which are very complex require far more processing steps. There are fewer per wafer and a larger percentage will be bad. But they also sell for a much higher price.

However, no matter how you decide to handle that tradeoff, the more wafers you can process per day the better. The limiting factor is another tradeoff, because to some extent processing speed trades off with yield. When you process faster, you ruin more dice per wafer, but you can also process more wafers per day.

Wafers have to be moved in order for them to be processed. Time spent moving wafers is time spent not processing them. Our customers wanted them moved as fast as possible with negligible impact on yield.

That meant that when we moved wafers, we had to push the friction limit as close as we could without ever exceeding it. If we "played safe" we'd move too slowly.

The extend/retract motion was by far the most complicated to control. Our actual control mechanism was extremely indirect. Our motors drove the angle between the body and the upper arm member. We had no "muscle" controlling the elbows and wrists; instead, there were passive mechanisms built into the arm which made them behave properly.

So at the first level of indirection, the microprocessor controlled the angular force applied by the motor at the shoulders. Our "sense" was an encoder on one of the shafts which precisely measured the angle of one of the shoulders.

The basic geometry changes as the end effector extends, so a constant rotational force at the shoulder yields widely variable force on the wafer at different points of the motion. It is lowest when fully retracted or extended, and peaks when the angle between the upper and lower arm members is somewhere near 90° (usually, though by no means always). "Fully retracted" and "fully extended" were extremes of permissible motion. "Full retraction" didn't mean 0° angle at the elbow, and "full extension" didn't mean 180°.

The actual physical position (and mechanical response) of the robot end effector (the "hand") as a function of shoulder angle was complicated to even describe mathematically, let alone to control well. Just keeping motion smooth was tough. Stopping was also tough. In these kinds of systems, you may end up with metastable oscillation centered on the destination point, because the control loop doesn't settle.
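
For a feel of why position is a nonlinear function of shoulder angle, here is a deliberately idealized Python sketch. It assumes two mirror-image arms with rigid links, both shoulders on one axis and both wrists meeting at a single point on the line of extension. The link lengths, function names and angle convention are assumptions made for the example. The real arm was considerably more complicated, and this sketch says nothing about its force profile.

```python
import math


def reach(theta, upper=0.3, lower=0.3):
    """Distance (m) from the shoulder axis to the wrist point.

    theta: angle in radians between each upper arm and the line of
    extension. upper, lower: assumed link lengths in meters.
    """
    offset = upper * math.sin(theta)  # elbow's sideways distance from the line
    if abs(offset) > lower:
        raise ValueError("lower arm cannot reach the line at this angle")
    return upper * math.cos(theta) + math.sqrt(lower ** 2 - offset ** 2)


def reach_per_radian(theta, h=1e-6, **links):
    """Central-difference estimate of how fast reach changes with theta."""
    return (reach(theta + h, **links) - reach(theta - h, **links)) / (2 * h)


for degrees in (10, 30, 50, 70):
    theta = math.radians(degrees)
    print(degrees, round(reach(theta), 4), round(reach_per_radian(theta), 4))
```

Even in this toy model, the same small change in shoulder angle moves the end effector by very different amounts depending on where in the stroke it occurs, so a control loop that thinks in shoulder angle has to keep converting to what the wafer actually experiences.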

One potential hazard was that the force profile would have the right overall shape, but would have a high frequency oscillation imposed on it. The integral would be right, but if the amplitude of the high frequency oscillation was great it could cause us to exceed the permissible force threshold and result in wafer shift. That kind of oscillation can easily happen in this kind of system if the control loop isn't tuned well. Unfortunately, it isn't always apparent to the eye when it's happening. You have to use accelerometers to find out for sure.
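
The following Python sketch illustrates that hazard with made-up numbers. It builds a smooth one-second acceleration hump, then the same hump with a 200 Hz ripple added, and compares the integral and the peak of each. The profile shape, the ripple amplitude and the slip limit (a friction coefficient of 0.3 times g) are assumptions chosen for the example.

```python
import math

STEPS = 10_000            # samples across one second
LIMIT = 0.3 * 9.81        # hypothetical slip threshold, m/s^2


def profile(t, ripple_amp=0.0, ripple_hz=200.0):
    """Acceleration (m/s^2) at time t: a smooth hump plus optional ripple."""
    base = 2.0 * math.sin(math.pi * t)  # smooth hump peaking at 2.0
    return base + ripple_amp * math.sin(2.0 * math.pi * ripple_hz * t)


def summarize(ripple_amp):
    """Return (integral, peak) of the sampled profile over one second."""
    samples = [profile(i / STEPS, ripple_amp) for i in range(STEPS + 1)]
    integral = sum(samples) / STEPS     # velocity gained over the move
    peak = max(abs(s) for s in samples)
    return integral, peak


smooth = summarize(0.0)
rippled = summarize(1.0)
print(smooth, smooth[1] <= LIMIT)
print(rippled, rippled[1] <= LIMIT)
```

Both profiles deliver essentially the same velocity change, so the move ends up in the right place either way, but only the smooth one stays under the assumed slip limit throughout.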

That's one place where latency came into the picture. That kind of oscillation is very often caused by time lag between measurement/command and physical response if the control loop logic doesn't properly take the lag into account.

In terms of how latency affected the robot, there were three primary ways. The simplest one was due to the fact that the microprocessor wasn't infinitely fast. It controlled movement using a pre-calculated motion profile. During motion, it monitored the motion to make sure it conformed to the profile, and compensated for any divergence. There was non-zero time between when the microprocessor measured and detected a deviation from the profile and when it began compensating for it by modifying the commands sent to the hardware.
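
A toy simulation shows why even a short lag matters. The Python sketch below is not the robot's control scheme: it assumes a bare proportional loop driving a position toward a target, where each correction is computed from a measurement that is a few steps old. The gain of 0.3 and the delay of 3 steps are arbitrary values picked for illustration.

```python
def simulate(gain, delay_steps, steps=300, target=1.0):
    """Proportional correction toward target using a stale measurement.

    Returns the list of positions, starting at rest at 0.0.
    """
    history = [0.0]
    for k in range(steps):
        seen = k - delay_steps
        # Before any delayed sample exists, the controller sees the start value.
        measured = history[seen] if seen >= 0 else 0.0
        correction = gain * (target - measured)
        history.append(history[-1] + correction)
    return history


prompt = simulate(gain=0.3, delay_steps=0)
lagged = simulate(gain=0.3, delay_steps=3)

print(round(max(prompt), 3), round(prompt[-1], 3))
print(round(max(lagged), 3), round(lagged[-1], 3))
```

With no delay the position creeps up to the target and never passes it. With the same gain and a three-step lag the loop keeps pushing on stale information, overshoots by roughly half the move, and only then rings down to the target.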

Another source of latency was motor response. Above I mentioned that at the "first level of indirection" what we controlled was angular force applied by the motors to the shoulders. What the microprocessor actually controlled was how much current passed through a small number of power JFETs. The JFETs controlled phases on the motors.

It's kind of hard to describe how multiphase analog motors work without thousands of words. The short description is that the microprocessor controlled the amount of current fed to different motor phase coils aligned at different angles. That created a polarized magnetic field inside the motor, and the permanent magnets on the rotor naturally tried to align themselves with that field. It was possible to control the angle to a very small fraction of a degree, and to control the field intensity. Indirectly, therefore, that controlled the amount of force applied to the rotor if it wasn't aligned, which effectively was the force applied to the arms at the shoulder.

The JFETs were blazingly fast, but the motor phase coils had significant inductances, and therefore it took some time from when we changed the current to when the induced magnetic field fully responded. (Just to make things even more interesting, the effective inductance of each motor phase coil was partially a function of rotor orientation and the magnetic field orientation of the rotor's magnets, and also of the fields being produced by other phase coils.)

The more force we needed to apply, the more change we made in the current flow, and the longer the delay until the inductor responded.
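
The textbook model for this is a series resistor and inductor driven by a voltage step, where the current approaches its final value with time constant L/R. The Python sketch below uses that idealization with hypothetical component values (24 V, 2 ohms, 4 mH). It ignores the position-dependent inductance and coupling between coils mentioned above.

```python
import math


def coil_current(t, volts, ohms, henries):
    """Current (A) at time t after a voltage step into an ideal series RL coil."""
    tau = henries / ohms
    return (volts / ohms) * (1.0 - math.exp(-t / tau))


def time_to_current(amps, volts, ohms, henries):
    """Time (s) for the coil to reach the requested current."""
    final = volts / ohms
    if not 0.0 <= amps < final:
        raise ValueError("requested current must be below volts / ohms")
    tau = henries / ohms
    return -tau * math.log(1.0 - amps / final)


# Hypothetical coil: 24 V supply, 2 ohms, 4 mH (time constant 2 ms).
for amps in (3.0, 6.0, 9.0):
    print(amps, round(time_to_current(amps, 24.0, 2.0, 0.004) * 1000.0, 3), "ms")
```

In this model a larger requested change in current takes disproportionately longer to reach, which matches the observation that bigger force demands came with longer delays.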

Processor response time was significant and had to be taken into account. I believe that motor response was treated as one kind of inertia folded in with all the other kinds of inertia, and was handled as part of the overall feedback control loop.

I think the worst latency problem was due to physical resonance in the arm assembly. The arm members (the "bones") were as stiff as we could make them subject to other constraints, but there is always some degree of flexibility, and if you twist one end of a long object, the other end doesn't follow instantly. The force is transmitted by arm "springiness", and there's a small delay in response at the far end.

There was also a secondary resonance. We applied force to the upper-arm member, and its stiffness and mass yielded a resonance. As a function of that resonance, there would be a small delay in response. But once it did respond, that then applied force to the lower arms, which also had a small delay in responding. As they resisted, responded, and resonated, that fed back force to the upper arm member, affecting its response to our motor force.

There's a tendency for such systems to ring. In terms of mechanical design, you try to minimize that by making all the members as stiff as possible and all the linkages as tight as possible. It also helps to make the lower arm members physically as similar as possible in length and mass to the upper arm members. But this can't be eliminated entirely, and the motion control algorithm still had to compensate for it.

Another problem is manufacturing tolerance. You can't insist that every manufactured unit be exactly identical because your yields would be dreadful. You have to allow a degree of variation in manufacturing, and what that meant was that each individual robot responded a bit differently, because the motors were different, the bearings were different, the seals were different, and the arms were different. The differences were small, but were more than enough so that they could not be ignored. That meant that the control algorithm had to be tuned manually for each robot.

So the control algorithm had to be designed to be tunable. There needed to be parameters which could be set per-robot and stored in that robot in non-volatile memory, and there had to be a way to test the completed robot and figure out what those parameters should be set to for that particular robot.
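
As a rough picture of what "parameters stored per-robot" might look like in software, here is a Python sketch. Everything in it is hypothetical: the `RobotTuning` class, its field names, the serial number and the values are invented, and a JSON string stands in for the non-volatile memory. It shows only the idea of a record that is determined once during acceptance testing and then loaded by the controller.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RobotTuning:
    serial: str        # which physical robot these values belong to
    gains: dict        # hypothetical control-loop gains, by name
    max_accel: float   # hypothetical acceleration ceiling with a wafer, m/s^2


def save(tuning):
    """Serialize a tuning record (stand-in for writing non-volatile memory)."""
    return json.dumps(asdict(tuning), sort_keys=True)


def load(blob):
    """Rebuild a tuning record from its serialized form."""
    return RobotTuning(**json.loads(blob))


record = RobotTuning("R-0001", {"extend": 1.8, "rotate": 2.4}, 2.5)
stored = save(record)
print(stored)
print(load(stored) == record)
```

Keeping the tuning as data rather than code means the same firmware image can run on every robot, with only this record differing from unit to unit.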

I wasn't directly involved in the design for any of this. I helped create some of the tools which were peripherally involved in the testing process, to support the tech who did acceptance testing on completed robots and who figured out the parameters to tune each one.

Originally that tuning process was done by an engineer. (He was quite a character; a really good guy. I enjoyed working with him a lot. He was Persian. His English was so good that I thought he'd grown up in the US. He had no trace whatever of a foreign accent. In fact, to my Oregon-tuned ears he had a slight Boston accent. So I was surprised when I learned that he came to the US for college. He had been an anti-Khomeini activist here as a student, and when he graduated and his student visa expired, after the revolution, he was granted "political refugee" status by the State Department. [They contacted him about that, not the other way around.] Eventually he naturalized, and like every other naturalized immigrant I've ever worked with he was fiercely patriotic about the US, because he knew how much worse things could be from personal experience.)

Setting up those parameters for each individual robot was really hard. If they were set wrong one way, the robot could exhibit jerky motion. Set wrong another way, it could lose wafers. Set wrong yet another way, it would be slower than it should be. There was a lot of art involved, and I don't really know how it was done.

What I do know is that we had long since reached the point where PID didn't cut it. The first robots I was involved with had a hardware PID controller, but the microprocessor had to monitor and supervise it – and override it when necessary. Later robots got even larger, and we abandoned PID entirely. We ended up using a dedicated signal processor which ran a much more complicated control algorithm which I didn't even slightly understand.
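
For readers who haven't met the term, PID stands for proportional-integral-derivative. The Python sketch below is the generic textbook form in discrete time, with arbitrary gains. It is not a model of the hardware controller in those early robots, and it has none of the refinements a real controller needs.

```python
class PID:
    """Textbook discrete PID controller (no anti-windup, no filtering)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0  # starting at zero gives a kick on the first call

    def update(self, setpoint, measured):
        """Return the control output for one time step."""
        error = setpoint - measured
        self.integral += error * self.dt                    # accumulated error
        derivative = (error - self.prev_error) / self.dt    # rate of change
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.1)
print(pid.update(1.0, 0.0))  # first step: includes the derivative kick
print(pid.update(1.0, 0.0))  # second step: error unchanged, no derivative term
```

A loop like this has three fixed gains and assumes the plant behaves the same everywhere. An arm whose response changes with extension, lags and resonates is the kind of system where that assumption starts to fail.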

I have by no means listed all the problems; there were many others. For instance, when the arm was extended, it sagged a bit. So the end effector faced away a bit. That meant the force we applied wasn't aligned with the wafer plane. When we began to retract, a small part of the force applied to the wafer would try to lift the wafer off the pads rather than trying to move the wafer laterally. Thus the effective "weight" was reduced, and there was less friction holding the wafer in place.

Is this problem "simpler" than the problems Bart deals with in agronomy? In some ways it is. We didn't have to concern ourselves with outside influences which were variable and unpredictable, like weather or banking interest rates. It was possible to analyze our system using statics and dynamics and the principles of classical mechanics. The resulting model was grotesquely complex, but didn't contain any black-boxes labeled "Heere be draggons".

Any feedback system is potentially chaotic, but it wasn't necessary for us to use chaos theory in our design. (If our system became chaotic, we had already failed.)

And though latency was a problem for us, the latency time constants were well understood and quite consistent, and we could influence them to some degree with the mechanical/electrical/software design.

The challenge Bart faces is that his system isn't fully understood or analyzed. There are draggons in his system, and it is influenced strongly by external factors which are unpredictable. Our chall