Mediated Reality: University of Toronto RWM Project
As I wrote last month, I am an inventor who likes to think outside the box and chose to use the Linux operating system because it gives me a programming environment in which the box is not welded shut. I described my framework for human-machine intelligence which I call Humanistic Intelligence (HI). I also described the apparatus of an invention I call “WearComp” that embodies HI. In particular, I outlined some of the reasons for choosing Linux for WearComp and emphasized how problems like that of software fascism are much more pronounced on WearComp than they are in the context of regular desktop computing.
In this article, I will explore practical uses of WearComp. I will also explain how WearComp turns the traditional business model of real-world information space (e.g., advertising) upside down.
Summarizing briefly, WearComp is a wearable computational device that provides the user with a self-created personal space. The most fundamental issue of WearComp is that of personal empowerment (see Resources 1). I will show by way of example how WearComp provides the wearer with a self-created visual space. I will also describe the concept of “Mediated Reality” and the use of a “visual filter” that allows the wearer to create a visual attention access control system.
If the eye is the window to the soul, then our soul is available for anyone to steal. Our visual attention is a valuable resource that can be consumed by billboards, distracting advertising and other material often thrust upon us against our will.
Solitude is a valuable form of humanistic property all too easily subject to theft.
I am taking the liberty of using two strongly judgmental words, steal and theft. Such strong wording, however, is already present in the context of intellectual property. We readily accept terms like “software piracy”, which make an analogy between someone who copies a floppy disk and someone who seizes control of an ocean-going vessel, often killing everyone on board. An analogy between such gruesome mass murder and copying software ought to raise certain questions about our social value system. Thus, against this backdrop, I believe that the use of terms like theft and steal is not out of line in the context of what I call humanistic property.
Those who steal our solitude not only take away humanistic property, but force material upon us that can put our lives in danger.
Advertising is an evolving entity. In the old days, there were fixed signs with a static display of company slogans. Once we became accustomed to these signs, new ones were invented with more distracting and vibrant colours and even moving parts to fight for our attention. As we became accustomed to these signs, they were made brighter. Concepts such as light chasers, lamp sequencers and the like were introduced so that motion arising from sequentially illuminated bulbs could further distract us.
Then came the pixel boards, which also got brighter as we became accustomed to them. Some pixel boards use as many as 2000 watts per pixel. When lights this bright are put along major highways, they pose a serious threat to road safety. Still, we do our best to ignore these distractions and keep our eyes on the road or on whatever task has our attention.
The latest trend is something I call “signal-band advertising”. It tries to trick us by resituating advertising (“noise”) into what we perceive to be a “signal” band. For example, we are now seeing WWW banner advertisements with an imitation of a cursor; thus, the user is momentarily tricked into thinking there are two cursors on his screen. The advertisement contains what looks like a cursor, which moves around very much like a real cursor normally does. These kinds of ads are the cyberspace equivalent of trying to get attention by yelling “Fire!” in a crowded movie theater.
Another example of signal-band advertising exists on parking-lot booms. By renting a sign on the boom of a parking lot, advertisers can further confuse drivers by placing ads where only road signs would normally be. We now need to distinguish between advertising and important road signs, both of which are directly in our path in the center of the road. The advertising is no longer only off to the side of the road. This theft of visual attention makes it that much harder to see stop signs and other important traffic markers.
Perhaps next, advertisers will start to make their signs red and octagon-shaped and hang them on lamp posts along the street, so they will be able to grab even more of our attention. A red octagon, with a product slogan in white letters in the center, posted at a busy intersection could get lots of attention and would be harder to ignore than traditional billboards. This is what I mean by “signal-band advertising”.
Those who steal our visual attention are not content to clutter roads and open public space with advertising; they appear to want to intrude on more private spaces as well.
A solution to this problem may be obtained through something I call “Mediated Reality”. Mediated Reality (MR) differs from virtual reality (or augmented reality) in the sense that it allows us to filter out things we do not wish to have thrust upon us against our will. This capability is implicit in the notion of self-determination and mastery over one's own destiny. Just as a Sony Walkman allows us to drown out Muzak with our own choice of music, MR allows us to implement a “visual filter”. I will now describe how MR works. Later, we will see the importance of a good software basis for MR and why Linux was selected as the operating system for the apparatus of the invention (WearComp) upon which MR is based.
To understand how the reality mediator works, imagine first a device called a “Lightspace Analyzer” (see Figure 1). The Lightspace Analyzer is a hypothetical “lightspace glass” that absorbs and quantifies incoming light; it is completely opaque. It provides a numerical description of that light (i.e., it turns light into numbers). It is not necessarily flat; the analyzer is drawn curved in the figure to emphasize this point.
Imagine also a “Lightspace Synthesizer” (see Figure 2). The Lightspace Synthesizer turns an input stream of numbers into corresponding rays of light.
Suppose we connect the output of the Lightspace Analyzer to the input of the Lightspace Synthesizer (see Figure 3). We now have an illusory transparency.
Moreover, suppose we could bring the Lightspace Analyzer glass into direct contact with the Lightspace Synthesizer glass. Placing the two back to back would create a collinear illusory transparency, in which any emergent ray of virtual light would be collinear with the incoming ray of real light that gave rise to it. (See Figure 4.)
Now, a natural question to ask is, why all this effort in making a simple illusion of transparency when we could just as easily purchase a small piece of clear glass? The answer is that we have the ability to modify our perception of visual reality by inserting a WearComp between the Lightspace Analyzer and the Lightspace Synthesizer. (See Figure 5.)
In practice, there are other more practical embodiments of this invention than the one described above, but the basic principle is the same. Some practical examples are described further elsewhere in the literature (see Proceedings of IEEE ISWC98, “WearCam, the Wearable Camera”, by Steve Mann, pages 124-131). The result is a computational means of altering one's visual perception of reality.
WearComp has the potential to make all the world virtual as well as real; moreover, the potential is there to create a modified perception of visual reality. Such a computer-mediated reality can not only augment, but also diminish or otherwise alter the perception of reality.
Why would one want to do this? Why would anyone buy a pair of sunglasses that made them see less?
An example might be when we are driving and trying to concentrate on the road. Sunglasses that not only diminish the glare of the sun's rays but also filter out distracting billboards could help us see the road better, and therefore drive more safely.
Moreover, Mediated Reality can help us reclaim solitude in personal spaces. By wearing special sunglasses in which a visual filter is implemented (see Figure 6), it is possible to filter out offensive advertising.
Lightspace entering the analysis side of the glass manifests itself as an input image sequence where it is absorbed and quantified by the special sunglasses. Figure 7 shows Convention Hall as it truly is; Figure 8 shows its transformation with visual filtering.
Recall that the sunglasses are totally opaque except for the fact that the WearComp copies the input side to the output side, with possible modification. In the case of an offensive advertisement, the modification could take the form of replacing the advertisement with a more calming image of a waterfall.
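The copy-with-modification step can be illustrated with a minimal sketch. It assumes a frame is a grayscale image stored as a list of rows and that the offending advertisement has already been located as a rectangle; the function name `mediate_frame` and the image representation are hypothetical, chosen only for illustration, and are not taken from the WearComp software.

```python
from typing import List, Tuple

# Assumed representation: a grayscale frame as rows of pixel values.
Image = List[List[int]]


def mediate_frame(frame: Image,
                  region: Tuple[int, int, int, int],
                  replacement: Image) -> Image:
    """Copy the input frame to the output, replacing one rectangle.

    region is (top, left, height, width); replacement is expected to be
    a height x width image (for instance, a patch of a waterfall picture).
    """
    top, left, height, width = region
    # Every pixel outside the region passes through unmodified.
    out = [row[:] for row in frame]
    for r in range(height):
        for c in range(width):
            y, x = top + r, left + c
            # Ignore any part of the region that falls outside the frame.
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = replacement[r][c]
    return out
```

With an empty region the output equals the input, which corresponds to the illusory transparency described earlier; the filter is simply the special case in which some pixels are swapped on the way through.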
Personal Imaging is a camera-based computational framework in which the camera behaves as a true extension of the mind and body, after a period of long-term adaptation (see Resources 2). In this framework, the computer becomes a device that allows the wearer to augment, diminish, or otherwise alter his visual perception of reality. Moreover, it lets the wearer allow others to alter his visual perception of reality, thereby becoming a communication device.
The communication capabilities of WearComp allow for multiple wearers of the special sunglasses to share a common visual reality. Currently, the sunglasses are connected to the Internet by way of a 2Mbps (megabit per second) radio. This is a significant speed upgrade from the old 1987 radio design (running at only 56Kbps); thus, the shared realities may be updated at a much higher rate. The current system permits real-time video update rates for shared video.
One application of computer-mediated reality is to create, for each user of the apparatus, a possibly different interpretation of the same visual reality. Because the apparatus shares the user's first-person perspective (and in fact the apparatus is what enables the user to see at all), it provides the processing system (WearComp) with a view of how the user is interacting with the world. In this way, each user may build his or her own user interface within the real world. For example, one user may decide to have the computer automatically run a telephone directory program whenever it sees the user pick up a telephone. This example is similar to hypertext, in the sense that picking up the telephone is like clicking on it with a mouse as if it were in an HTML document. “Clicking” on real objects is done by simply touching them.
Outlining objects with the fingertip is another example of a reality user interface (RUI).
When windows are used together with the RUI, a new kind of window manager results. For example, while waiting in a lounge or other waiting area, a user might define walls around the lounge as various windows. In this way, screen real estate is essentially infinite. Although not all screens are visible at any one time, portions of them become visible when they are looked at through the WearComp glasses. Others in the lounge need not be able to see them, unless they are wearing similar glasses and the user has permitted them access to these windows (as when two users are planning upon the same calendar space).
There are no specific boundaries in this form of window manager. For example, if a user runs out of space in the lounge, he or she can walk out into the hall and create more windows on the walls of the hallway leading into the lounge. It is also easier to remember where all the windows are when they are associated with the real world. Part of this ease of memory comes from having to walk around the space or at least turn one's head around in the space.
This window manager, called RWM, also provides a means of making the back of the head “transparent” in a sense so that one can see windows in the front as rightside up and windows behind as upside down. This scheme simply obeys the laws of projective geometry. Rearview windows may be turned on and off, since they are distracting for concentration, but they are useful for quick navigation around a room. An illustration depicting the function of RWM to operate a video recording system is given in Figure 8.
A vision analysis processor typically uses the output of the Lightspace Analyzer for head tracking. This head tracking determines the relative orientation (yaw, pitch and roll) of the head based on the visual location of objects in the Lightspace Analyzer's field of view.
A vision analysis processor is implemented in the WearComp, as well as remotely, by way of the radio connection. The choice of which of these to use is made automatically based on how good a radio connection can be established.
The vision analysis processor does 3-D object recognition and parameter estimation, or constructs a 3-D scene representation. An information processor takes this visual information and decides which virtual objects, if any, to insert into the Lightspace Synthesizer.
A graphics synthesis processor creates a computer-graphics rendering of a portion of the 3-D scene specified by the information processor and presents this computer-graphics rendering to the wearer by way of the Lightspace Synthesizer.
The objects displayed are synthetic (virtual) objects overlaid in the same position as some of the real objects from the scene. The virtual objects displayed on the Lightspace Synthesizer correspond to real objects within the Lightspace Synthesizer field of view. Thus, even though the Lightspace Synthesizer may have only 480 lines of resolution, a virtual television screen of extremely high resolution, wrapping around the wearer, may be implemented by virtue of the Lightspace Analyzer head tracker. The wearer may then view very high-resolution pictures through what appears to be a small window that pans back and forth across the picture in response to head movements.
Optionally, in addition to overlaying synthetic objects on real objects to enhance them, the graphics synthesis processor may cause the display of other synthetic objects on the virtual television screen.
For example, Figure 9 illustrates a virtual television screen with some virtual (synthetic) objects such as an Emacs Buffer upon an xterm (text window in the commonly used X Window System graphical user interface). The graphics synthesis processor causes the Lightspace Synthesizer screen to display a reticle seen in a virtual view finder window.
The viewfinder has 640 pixels across and 480 down, which is just enough resolution to display one xterm window since an xterm window is typically 640 pixels across and 480 down also (sufficient for 24 rows of 80 characters of text). Thus, by turning his head to look back and forth, the wearer can position the viewfinder reticle on top of any number of xterms that appear to hover in space above various objects. The true objects, when positioned inside the mediation zone established by the viewfinder, may also be visually enhanced as seen through the viewfinder.
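The panning behaviour can be sketched as a mapping from head orientation to the position of the 640x480 viewfinder on a much larger virtual screen. This is only an illustration under stated assumptions: the virtual screen is treated as a flat canvas whose full width corresponds to 360 degrees of yaw and wraps around horizontally, pitch uses the same pixels-per-degree scale, and the function name `viewport_origin` is hypothetical.

```python
def viewport_origin(yaw_deg: float, pitch_deg: float,
                    canvas_w: int, canvas_h: int,
                    view_w: int = 640, view_h: int = 480):
    """Return the (left, top) canvas pixel of the viewfinder window.

    Assumptions: canvas_w pixels span 360 degrees of yaw and wrap around;
    pitch 0 looks at the vertical middle of the canvas; looking up
    (positive pitch) moves the viewfinder toward the top of the canvas.
    """
    px_per_deg = canvas_w / 360.0
    center_x = (yaw_deg % 360.0) * px_per_deg
    center_y = canvas_h / 2.0 - pitch_deg * px_per_deg
    # Horizontal position wraps, because the screen surrounds the wearer.
    left = int(round(center_x - view_w / 2.0)) % canvas_w
    # Vertical position is clamped so the viewfinder stays on the canvas.
    top = int(round(center_y - view_h / 2.0))
    top = max(0, min(canvas_h - view_h, top))
    return left, top
```

For example, on a hypothetical 7200x1440 canvas (20 pixels per degree), turning the head 90 degrees moves the viewfinder 1800 pixels across the picture, so a small display can sweep out a picture far larger than itself.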
Suppose the wearer of the apparatus is in a department store and, after picking up a $7 item for purchase, he hands the cashier a $20 bill but receives only $3 change (i.e., change for a $10 bill). Upon realizing this fact a minute or so later, the wearer locates a fresh, available xterm (i.e., one that has no programs running in it, so that it can accept commands). The wearer makes this xterm active by head movement up and to the right, as shown in Figure 9. Thus, the Lightspace Analyzer (typically implemented by a camera with special optics) functions also as a head tracker, and it is by orienting the head (and hence the camera) that the cursor may be positioned. Making a window active in the X Window System is normally done by placing the mouse cursor on the window and sometimes clicking on it. However, using a mouse with a wearable camera/computer system is difficult, owing to the fact that it requires a great deal of dexterity to position a cursor while walking around. With the invention described here, the wearer's head is the mouse and the center of the viewfinder is the cursor.
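The idea that the center of the viewfinder acts as the cursor amounts to a simple hit test. The sketch below assumes each xterm has a known rectangle in the coordinates of the virtual screen and that the head tracker supplies the point the viewfinder center currently lands on; the `VirtualWindow` class and `active_window` function are hypothetical names for illustration, not part of RWM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VirtualWindow:
    """A flat window pinned at a fixed place on the virtual screen."""
    name: str
    left: int
    top: int
    width: int = 640
    height: int = 480

    def contains(self, x: float, y: float) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def active_window(windows: List[VirtualWindow],
                  cursor_x: float,
                  cursor_y: float) -> Optional[VirtualWindow]:
    """Return the first window under the viewfinder center, if any."""
    for window in windows:
        if window.contains(cursor_x, cursor_y):
            return window
    return None
```

Because the test uses only the center point, a window can become active while only partly inside the viewfinder, which matches the behaviour described for commands sent to a window that is not completely visible.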
In Figures 8 and 9, objects outside the viewfinder mediation zone are depicted in dashed lines, because they are not actually visible to the wearer. He can see real objects outside the field of vision of the viewfinder (either through the remaining eye, or because the viewfinder permits one to see around it). However, only xterms in the viewfinder are visible. Portions of the xterms within the viewfinder are shown with solid lines, as this is all the wearer will see.
Once the wearer selects the desired window by looking at it, he then presses “d” to begin “recorDing”, as indicated on the window selected. Note that “d” is pressed for “recorD”, because “r” means “Recall” (in some ways equivalent to “Rewind” on a VCR). Letters are selected by way of a small number of belt-mounted switches that can be operated with one hand, in a manner similar to what courtroom stenographers use to form letters of the alphabet by pressing various combinations of pushbutton switches. Note that the wearer does not need to look right into the center of the desired window: the window accepts commands as long as it is active and doesn't need to be completely visible to accept commands.
Recording is typically retroactive, in the sense that the wearable camera system, by default, always records into a 5-minute circular buffer, so that pressing “d” begins recording starting 5 minutes before “d” is actually pressed. This means that if the wearer presses “d” within a couple of minutes of realizing that the cashier shortchanged him, then the transaction will have been successfully recorded. The customer can then review the past 5 minutes and can assert with confidence (through perfect photographic/videographic memory Recall, e.g., by pressing “r”) to the cashier that a $20 bill was given. The extra degree of personal confidence afforded by the invention typically makes it unnecessary to actually present the video record (e.g., to a supervisor) in order to correct the situation. Of course, if necessary, the customer could file a report or notify authorities while at the same time submitting the recording as evidence. The recording is also sent to the Internet by way of the 2Mbps transmitter so that the cashier or other representatives of the department store (such as a security guard who might be a close friend of the cashier) cannot seize and destroy the storage medium upon which the recording was made.
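Retroactive recording of this kind can be sketched with a fixed-length circular buffer. The example below is a minimal illustration, not the WearComp implementation: the class name `RetroactiveRecorder`, the frame rate, and the in-memory storage are all assumptions, and frames can be any Python objects.

```python
from collections import deque


class RetroactiveRecorder:
    """Keep the most recent frames so a recording can start in the past."""

    def __init__(self, buffer_seconds: int = 300, fps: int = 30):
        # A deque with maxlen silently discards the oldest frame when full,
        # which is exactly the behaviour of a circular buffer.
        self.buffer = deque(maxlen=buffer_seconds * fps)
        self.recording = None  # becomes a list once recording starts

    def add_frame(self, frame) -> None:
        if self.recording is not None:
            self.recording.append(frame)  # kept permanently
        else:
            self.buffer.append(frame)     # may later be overwritten

    def start_recording(self) -> None:
        """Equivalent of pressing 'd': keep the buffered past, then go on."""
        if self.recording is None:
            self.recording = list(self.buffer)
            self.buffer.clear()

    def recall(self) -> list:
        """Equivalent of pressing 'r': return everything retained so far."""
        if self.recording is not None:
            return list(self.recording)
        return list(self.buffer)
```

At an assumed 30 frames per second, the default 300-second buffer holds 9000 frames; once recording starts, nothing further is discarded.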
Note that here, the drawings depict objects moved translationally (i.e., by the group of translations specified by two scalar parameters), while in actual practice, virtual objects undergo a projective coordinate transformation in two dimensions governed by eight scalar parameters, or objects undergo three-dimensional coordinate transformations. When the virtual objects, such as text windows, are flat, the user interface is called a “Reality Window Manager”.
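The difference between the two-parameter and eight-parameter cases can be made concrete with a short sketch. It assumes the usual form of a two-dimensional projective coordinate transformation, x' = (A x + b) / (c . x + 1), where A is a 2x2 matrix (four parameters), b is a translation (two) and c controls the perspective "chirp" (two), eight scalars in all. The function name `project` is hypothetical.

```python
from typing import Sequence, Tuple


def project(point: Tuple[float, float],
            A: Sequence[Sequence[float]],
            b: Sequence[float],
            c: Sequence[float]) -> Tuple[float, float]:
    """Apply x' = (A x + b) / (c . x + 1) to a single 2-D point."""
    x, y = point
    denom = c[0] * x + c[1] * y + 1.0
    if denom == 0.0:
        raise ZeroDivisionError("point maps to infinity under this transformation")
    new_x = (A[0][0] * x + A[0][1] * y + b[0]) / denom
    new_y = (A[1][0] * x + A[1][1] * y + b[1]) / denom
    return new_x, new_y


# The translation-only case drawn in the figures: A is the identity,
# c is zero, and only the two components of b are free.
IDENTITY = ((1.0, 0.0), (0.0, 1.0))
shifted = project((1.0, 2.0), IDENTITY, (3.0, 4.0), (0.0, 0.0))
```

Setting c to a nonzero value makes the denominator vary across the image, which is what lets a flat window appear correctly foreshortened when viewed at an angle.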
When using the invention, various windows appear to hover above various real objects. Regardless of the orientation of the wearer's head (position of the viewfinder), the system sustains the illusion that the virtual objects (in this example, xterms) are attached to real objects. Panning the head back and forth in order to navigate around the space of virtual objects may also cause an extremely high resolution picture to be acquired through appropriate processing of multiple pictures captured on the special camera. This action mimics the function of the human eye, where saccades are replaced with head movements to sweep out the scene using the camera's light measurement ability, typical in “Photoquantigraphic Imaging”. Thus, head movements are used to direct the camera to scan out a scene in the same way eyeball movements normally orient the eye for this purpose.
Of course, one cannot expect a head-tracking device to be provided in all possible environments, so head tracking is done by the reality mediator, using the VideoOrbits (see Resources 3) tracking algorithm. (The VideoOrbits package upon which RWM is based is freely available at http://wearcam.org/orbits/index.html.) The VideoOrbits head tracker does head tracking based on a visually observed environment, yet works without the need for high-level object recognition.
VideoOrbits builds upon the tradition of image processing (see Resources 4 and 5) combined with the Horn and Schunck equations (see Resources 6) and some new ideas in algebraic projective geometry and homometric imaging, using a spatiotonal model, p, that works in the neighborhood of the identity:
where φ^T = [Fx(xy, x, y, 1), Fy(xy, x, y, 1), F, 1], F(x,t) = f(q(x)) at time t, Fx(x,t) = (df/dq)(dq(x)/dx) at time t, and Ft(x,t) is the difference of adjacent frames. This “approximate model” is used in the innermost loop of a repetitive process, then related to the parameters of an exact projectivity and gain group of transformations, so that the true group structure is preserved throughout. In this way, virtual objects inserted into the “reality stream” of the person wearing the glasses follow the orbit of this group of transformations, hence the name VideoOrbits.
A quantigraphic version of VideoOrbits is also based on the fact that the unknown nonlinearity of the camera, f, can be obtained from differently exposed images f(q) and f(kq), etc., and that these can be combined to estimate the actual quantity of light entering the imaging system:
where ci is the derivative of the recovered nonlinear response function of the camera, f, and A, b and c are the parameters of the true projective coordinate transformation of the light falling on the image sensor. This method allows the actual quantity of light entering the reality mediator to be determined. In this way, the reality mediator absorbs and truly quantifies the rays of light entering it. Moreover, light rays entering the eye due to the real and virtual objects are placed on an equal footing.
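The idea of combining differently exposed measurements can be illustrated for a single pixel. The sketch below is not the VideoOrbits code and makes several loud assumptions: the camera response f is taken to be a known, clipped power law with a hypothetical exponent `GAMMA` (in practice f must be recovered from the images), the images are already registered, and the weight for each exposure is taken as the derivative of f with respect to the logarithm of q, which is one possible reading of the certainty weights described in the text.

```python
GAMMA = 2.2  # hypothetical response exponent, assumed known for this sketch


def response(q: float) -> float:
    """Assumed camera response f(q): a power law that saturates at 1.0."""
    return min(1.0, max(0.0, q) ** (1.0 / GAMMA))


def inverse_response(v: float) -> float:
    """Inverse of the unsaturated part of the assumed response."""
    return v ** GAMMA


def certainty(v: float) -> float:
    """Weight for a pixel value: df/d(log q), zero where f is clipped."""
    if v <= 0.0 or v >= 1.0:
        return 0.0  # black or saturated pixels carry no usable information
    return v / GAMMA


def estimate_light(pixel_values, exposures):
    """Weighted average of f^-1(v_i) / k_i over exposures k_i.

    Returns None when every measurement is black or saturated.
    """
    numerator = 0.0
    denominator = 0.0
    for v, k in zip(pixel_values, exposures):
        c = certainty(v)
        numerator += c * inverse_response(v) / k
        denominator += c
    if denominator == 0.0:
        return None
    return numerator / denominator
```

In this sketch an overexposed measurement is ignored automatically because its weight is zero, while the remaining exposures are each divided by their exposure factor before being averaged.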
MR sets forth a new computational framework in which the visual interpretation of reality is finely customized to the needs of each individual wearer of the apparatus. The computer becomes very much like a prosthetic device or like prescription eyeglasses. Just as you would not want to wear another person's undergarments or mouth guard, you may not want to find yourself wearing another person's computer.
The traditional paradigm of one worldwide software vendor providing everyone with identical copies of an executable-only distribution no longer applies. Instead, complete reconfigurability is needed, and each user will customize his or her own environment. Since many laypersons are not well-versed in operating system kernel source code, a need will grow for system administrators and consultants.
In the future, software will be free and users will buy support. There will be little problem with software piracy, both because the software will be free and because a version of the software customized for one person will be of less use to someone with different needs. Because the computer will function as a true extension of the user's mind and body, it would not serve the user well to ingest software owned by someone else. The computer will function much like a “second brain”, and in the true spirit of freedom of thought, it would be preferable that any commercial interests in the customization and contents of one's “second brain” be a work for hire (i.e., an interaction in which the end user owns the rights) rather than a software purchase. Thus, there will be an exponentially growing need for personal system administrators as new people enter the community of connected, collective, humanistic intelligence.
Linux has eliminated the need for pirated copies of poorly written commercial operating systems. Freely distributable software has resulted in improved operating system software and changed the nature of intellectual property.
Similarly, there is the issue of Humanistic Property. Humanistic Property was formerly free for others to steal, but now a technological means to prevent theft of Humanistic Property is proposed. This means that in the future, individuals will decide what advertisements they would like to see or not see.
For example, I am currently not interested in seeing advertisements for cars, cleaning products or condoms. However, I am currently in the market for certain components that I need to build the next embodiment of WearComp, so I would very much welcome the opportunity to see any advertisements by vendors of these products. I do not believe we will see the end of advertising, just the end of unwanted advertising—the end of theft of our visual attention.
Thanks to Kodak and Digital Equipment Corporation (DEC) for assistance with the Personal Imaging and Humanistic Intelligence projects.
Steve Mann, inventor of WearComp (wearable computer) and WearCam (eyetap camera and reality mediator), is currently a faculty member at the University of Toronto, Department of Electrical and Computer Engineering. Dr. Mann has been working on his WearComp invention for more than 20 years, dating back to his high school days in the 1970s. He brought his inventions and ideas to the Massachusetts Institute of Technology in 1991, founding what was later to become the MIT Wearable Computing Project, and received his Ph.D. degree from MIT in 1997 in this new field he established. Anyone interested in joining or helping out with the “community of cyborgs” project or the RWM project may wish to contact the author by e-mail at firstname.lastname@example.org.