Mediated Reality: University of Toronto RWM Project

Dr. Mann describes his WearComp (“Wearable Computer”) invention as a tool for “Mediated Reality”. WearComp originated in the context of photographic tools as true extensions of the mind and body and evolved into a philosophical basis for self-determination, characteristic of the Linux operating system that runs on WearComp.

Connected Collective Human Intelligence

Personal Imaging is a camera-based computational framework in which the camera behaves as a true extension of the mind and body, after a period of long-term adaptation (see Resources 2). In this framework, the computer becomes a device that allows the wearer to augment, diminish, or otherwise alter his visual perception of reality. Moreover, it lets the wearer allow others to alter his visual perception of reality, thereby becoming a communication device.

The communication capabilities of WearComp allow for multiple wearers of the special sunglasses to share a common visual reality. Currently, the sunglasses are connected to the Internet by way of a 2Mbps (megabit per second) radio. This is a significant speed upgrade from the old 1987 radio design (running at only 56Kbps); thus, the shared realities may be updated at a much higher rate. The current system permits real-time video update rates for shared video.

Reality User Interface

One application of computer-mediated reality is to create, for each user of the apparatus, a possibly different interpretation of the same visual reality. Because the apparatus shares the user's first-person perspective (and is in fact what enables the user to see at all), it provides the processing system (WearComp) with a view of how the user is interacting with the world. In this way, each user may build his or her own user interface within the real world. For example, one user may decide to have the computer automatically run a telephone directory program whenever it sees the user pick up a telephone. This example is similar to hypertext, in the sense that picking up the telephone is like clicking on it with a mouse, as if it were a link in an HTML document. “Clicking” on real objects is done by simply touching them.
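The telephone example can be sketched as a table of bindings from recognized events to user-chosen actions. This is only an illustration, not WearComp's actual software: the class `RealityUserInterface`, the event names `"telephone"` and `"pick_up"`, and the function `run_telephone_directory` are all hypothetical, and the vision system that would recognize the event is assumed to exist elsewhere.

```python
from typing import Callable, Dict, List, Tuple


class RealityUserInterface:
    """Maps recognized (object, gesture) events to user-defined actions.

    Hypothetical sketch: the names here are illustrative assumptions.
    """

    def __init__(self) -> None:
        self._bindings: Dict[Tuple[str, str], List[Callable[[], str]]] = {}

    def bind(self, obj: str, gesture: str, action: Callable[[], str]) -> None:
        """Attach an action to a real-world object and gesture."""
        self._bindings.setdefault((obj, gesture), []).append(action)

    def on_event(self, obj: str, gesture: str) -> List[str]:
        """Called when the vision system reports an event; runs bound actions."""
        return [action() for action in self._bindings.get((obj, gesture), [])]


def run_telephone_directory() -> str:
    # Stand-in for launching a real directory program.
    return "telephone directory started"


rui = RealityUserInterface()
rui.bind("telephone", "pick_up", run_telephone_directory)

# The (assumed) vision system sees the user pick up a telephone:
results = rui.on_event("telephone", "pick_up")
```

Because each wearer holds a separate binding table, two users looking at the same telephone could trigger entirely different actions, which matches the idea of a per-user interpretation of the same scene.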

Outlining objects with the fingertip is another example of a reality user interface (RUI).

Reality Window Manager

When windows are used together with the RUI, a new kind of window manager results. For example, while waiting in a lounge or other waiting area, a user might define walls around the lounge as various windows. In this way, screen real estate is essentially infinite. Although not all screens are visible at any one time, portions of them become visible when they are looked at through the WearComp glasses. Others in the lounge need not be able to see them, unless they are wearing similar glasses and the user has permitted them access to these windows (as when two users are planning in a shared calendar space).

There are no specific boundaries in this form of window manager. For example, if a user runs out of space in the lounge, he or she can walk out into the hall and create more windows on the walls of the hallway leading into the lounge. It is also easier to remember where all the windows are when they are associated with the real world. Part of this ease of memory comes from having to walk around the space or at least turn one's head around in the space.

This window manager, called RWM, also provides a means of making the back of the head “transparent” in a sense, so that one can see windows in front as right side up and windows behind as upside down. This scheme simply obeys the laws of projective geometry. Rearview windows may be turned on and off: they can be distracting, but they are useful for quick navigation around a room. Figure 8 illustrates the use of RWM to operate a video recording system.

Figure 8

A vision analysis processor typically uses the output of the Lightspace Analyzer for head tracking. This head tracking determines the relative orientation (yaw, pitch and roll) of the head based on the visual location of objects in the Lightspace Analyzer's field of view.

A vision analysis processor is implemented in the WearComp, as well as remotely, by way of the radio connection. The choice of which of these to use is made automatically based on how good a radio connection can be established.
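The automatic choice between the on-body and remote processors might look like the following sketch. The text does not say how link quality is measured, so the `LinkStatus` fields and both thresholds are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkStatus:
    """Hypothetical summary of the radio link; fields are assumptions."""
    connected: bool
    throughput_kbps: float
    round_trip_ms: float


def choose_processor(link: LinkStatus,
                     min_kbps: float = 500.0,
                     max_rtt_ms: float = 100.0) -> str:
    """Return "remote" when the link is good enough, otherwise "local".

    The threshold values are illustrative, not taken from the article.
    """
    good_link = (link.connected
                 and link.throughput_kbps >= min_kbps
                 and link.round_trip_ms <= max_rtt_ms)
    return "remote" if good_link else "local"


strong = choose_processor(LinkStatus(True, 2000.0, 40.0))
weak = choose_processor(LinkStatus(True, 56.0, 40.0))
offline = choose_processor(LinkStatus(False, 0.0, 0.0))
```

Falling back to the local processor whenever the link is doubtful keeps head tracking working even with no radio coverage at all.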

The vision analysis processor does 3-D object recognition and parameter estimation, or constructs a 3-D scene representation. An information processor takes this visual information and decides which virtual objects, if any, to insert into the Lightspace Synthesizer.

A graphics synthesis processor creates a computer-graphics rendering of a portion of the 3-D scene specified by the information processor and presents this computer-graphics rendering to the wearer by way of the Lightspace Synthesizer.

The objects displayed are synthetic (virtual) objects overlaid in the same position as some of the real objects from the scene. The virtual objects displayed on the Lightspace Synthesizer correspond to real objects within the Lightspace Synthesizer field of view. Thus, even though the Lightspace Synthesizer may have only 480 lines of resolution, the Lightspace Analyzer head tracker makes it possible to implement a virtual television screen of extremely high resolution that wraps around the wearer. The wearer views very high-resolution pictures through what appears to be a small window that pans back and forth across the picture in response to head movements.

Optionally, in addition to overlaying synthetic objects on real objects to enhance them, the graphics synthesis processor may cause the display of other synthetic objects on the virtual television screen.

For example, Figure 9 illustrates a virtual television screen with some virtual (synthetic) objects, such as an Emacs buffer in an xterm (a text window in the commonly used X Window System graphical user interface). The graphics synthesis processor causes the Lightspace Synthesizer screen to display a reticle seen in a virtual viewfinder window.

The viewfinder has 640 pixels across and 480 down, which is just enough resolution to display one xterm window since an xterm window is typically 640 pixels across and 480 down also (sufficient for 24 rows of 80 characters of text). Thus, by turning his head to look back and forth, the wearer can position the viewfinder reticle on top of any number of xterms that appear to hover in space above various objects. The true objects, when positioned inside the mediation zone established by the viewfinder, may also be visually enhanced as seen through the viewfinder.

Figure 9

Suppose the wearer of the apparatus is in a department store and, after picking up a $7 item for purchase, he hands the cashier a $20 bill but receives only $3 change (i.e., change for a $10 bill). Upon realizing this fact a minute or so later, the wearer locates a fresh, available xterm (i.e., one that has no programs running in it, so that it can accept commands). The wearer makes this xterm active by head movement up and to the right, as shown in Figure 9. Thus, the Lightspace Analyzer (typically implemented by a camera with special optics) also functions as a head tracker, and it is by orienting the head (and hence the camera) that the cursor may be positioned. Making a window active in the X Window System is normally done by placing the mouse cursor on the window and sometimes clicking on it. However, using a mouse with a wearable camera/computer system is difficult, because positioning a cursor while walking around requires a great deal of dexterity. With the invention described here, the wearer's head is the mouse and the center of the viewfinder is the cursor.
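The idea that the head is the mouse can be sketched as a hit test: head orientation gives the position of the viewfinder center on the virtual screen, and whichever window contains that point becomes active. The 640 by 480 window size comes from the text; the linear mapping from degrees to pixels, the `px_per_deg` value, and the window positions are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Window:
    """An xterm placed on the virtual screen (coordinates are hypothetical)."""
    name: str
    x: float  # left edge, virtual-screen pixels
    y: float  # top edge, virtual-screen pixels (y grows downward)
    w: float = 640.0
    h: float = 480.0

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


def viewfinder_center(yaw_deg: float, pitch_deg: float,
                      px_per_deg: float = 20.0) -> Tuple[float, float]:
    """Map head yaw/pitch to the cursor position on the virtual screen.

    Assumes a simple linear mapping: turning right moves the cursor right,
    looking up moves it up (toward smaller y).
    """
    return (yaw_deg * px_per_deg, -pitch_deg * px_per_deg)


def active_window(windows: List[Window], yaw_deg: float,
                  pitch_deg: float) -> Optional[Window]:
    """Return the window under the viewfinder center, if any."""
    cx, cy = viewfinder_center(yaw_deg, pitch_deg)
    for win in windows:
        if win.contains(cx, cy):
            return win
    return None


windows = [
    Window("xterm-A", x=-320.0, y=-240.0),  # straight ahead
    Window("xterm-B", x=400.0, y=-800.0),   # up and to the right
]

ahead = active_window(windows, yaw_deg=0.0, pitch_deg=0.0)
up_right = active_window(windows, yaw_deg=30.0, pitch_deg=25.0)
nothing = active_window(windows, yaw_deg=-60.0, pitch_deg=0.0)
```

A real system would derive the cursor position from the tracked camera pose rather than from a fixed pixels-per-degree constant, but the selection logic would stay the same.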

In Figures 8 and 9, objects outside the viewfinder mediation zone are depicted in dashed lines, because they are not actually visible to the wearer. He can see real objects outside the field of vision of the viewfinder (either through the remaining eye, or because the viewfinder permits one to see around it). However, only xterms in the viewfinder are visible. Portions of the xterms within the viewfinder are shown with solid lines, as this is all the wearer will see.

Once the wearer selects the desired window by looking at it, he then presses “d” to begin “recorDing”, as indicated on the window selected. Note that “d” is pressed for “recorD”, because “r” means “Recall” (in some ways equivalent to “Rewind” on a VCR). Letters are selected by way of a small number of belt-mounted switches that can be operated with one hand, in a manner similar to what courtroom stenographers use to form letters of the alphabet by pressing various combinations of pushbutton switches. Note that the wearer does not need to look right into the center of the desired window: the window accepts commands as long as it is active and doesn't need to be completely visible to accept commands.

Recording is typically retroactive, in the sense that the wearable camera system, by default, always records into a 5-minute circular buffer, so that pressing “d” begins recording starting 5 minutes before “d” is actually pressed. This means that if the wearer presses “d” within a couple of minutes of realizing that the cashier shortchanged him, then the transaction will have been successfully recorded. The customer can then review the past 5 minutes and can assert with confidence (through perfect photographic/videographic memory Recall, e.g., by pressing “r”) to the cashier that a $20 bill was given. The extra degree of personal confidence afforded by the invention typically makes it unnecessary to actually present the video record (e.g., to a supervisor) in order to correct the situation. Of course, if necessary, the customer could file a report or notify authorities while at the same time submitting the recording as evidence. The recording is also sent to the Internet by way of the 2Mbps transmitter so that the cashier or other representatives of the department store (such as a security guard who might be a close friend of the cashier) cannot seize and destroy the storage medium upon which the recording was made.
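Retroactive recording with a circular buffer can be sketched as follows. This is a minimal illustration, not the system's actual code: the class name, the method names, and the use of integers as stand-ins for video frames are assumptions. The buffer length of 300 seconds corresponds to the 5 minutes mentioned in the text.

```python
from collections import deque
from typing import Deque, List, Optional


class RetroactiveRecorder:
    """Keeps the most recent frames so recording can start in the past."""

    def __init__(self, buffer_seconds: int = 300, fps: int = 1) -> None:
        # A deque with maxlen drops the oldest frame automatically,
        # which gives circular-buffer behavior.
        self._buffer: Deque[int] = deque(maxlen=buffer_seconds * fps)
        self._recording: Optional[List[int]] = None

    def capture(self, frame: int) -> None:
        """Called for every frame, whether or not recording is active."""
        if self._recording is not None:
            self._recording.append(frame)
        else:
            self._buffer.append(frame)

    def press_d(self) -> None:
        """Start recording, seeded with everything still in the buffer."""
        if self._recording is None:
            self._recording = list(self._buffer)
            self._buffer.clear()

    def press_r(self) -> List[int]:
        """Recall: return a copy of what has been recorded so far."""
        return list(self._recording or [])


# Short demonstration with a 5-frame buffer instead of 5 minutes.
recorder = RetroactiveRecorder(buffer_seconds=5, fps=1)
for t in range(8):          # frames 0..7 pass by unrecorded
    recorder.capture(t)
recorder.press_d()          # recording starts 5 frames in the past
recorder.capture(8)
recorder.capture(9)
recalled = recorder.press_r()
```

In the demonstration, frames 0 through 2 have already been overwritten when “d” is pressed, so the recall starts at frame 3. That is the same reason the wearer has to press “d” within the buffer's time span of the event.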

Note that the drawings depict objects moved translationally (i.e., by the group of translations specified by two scalar parameters), while in actual practice, virtual objects undergo a projective coordinate transformation in two dimensions governed by eight scalar parameters, or objects undergo three-dimensional coordinate transformations. When the virtual objects, such as text windows, are flat, the user interface is called a “Reality Window Manager”.
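A two-dimensional projective coordinate transformation with eight scalar parameters is commonly written as x' = (ax + by + c)/(gx + hy + 1) and y' = (dx + ey + f)/(gx + hy + 1). The sketch below uses that standard form. The parameter names and their ordering are a notational assumption, since the text does not give its own notation. Pure translation, as in the drawings, is the special case with two free parameters.

```python
from typing import Tuple

Params = Tuple[float, float, float, float, float, float, float, float]


def project(params: Params, x: float, y: float) -> Tuple[float, float]:
    """Apply an eight-parameter planar projective transformation.

    params = (a, b, c, d, e, f, g, h); this naming is an assumption.
    """
    a, b, c, d, e, f, g, h = params
    denom = g * x + h * y + 1.0
    if denom == 0.0:
        raise ValueError("point maps to infinity under this transformation")
    return ((a * x + b * y + c) / denom, (d * x + e * y + f) / denom)


def translation(tx: float, ty: float) -> Params:
    """Two-parameter special case: identity scaling, no perspective terms."""
    return (1.0, 0.0, tx, 0.0, 1.0, ty, 0.0, 0.0)


# Translation only, as the drawings depict:
shifted = project(translation(10.0, -5.0), 3.0, 4.0)

# A nonzero g introduces perspective: points farther along x shrink.
perspective = project((1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.5, 0.0), 2.0, 4.0)
```

With g and h set to zero the denominator is always 1 and the mapping reduces to an affine one. The two perspective terms are what let a flat window appear correctly foreshortened as the head turns.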

When using the invention, various windows appear to hover above various real objects. Regardless of the orientation of the wearer's head (position of the viewfinder), the system sustains the illusion that the virtual objects (in this example, xterms) are attached to real objects. Panning the head back and forth in order to navigate around the space of virtual objects may also cause an extremely high resolution picture to be acquired through appropriate processing of multiple pictures captured on the special camera. This action mimics the function of the human eye, where saccades are replaced with head movements to sweep out the scene using the camera's light measurement ability, typical in “Photoquantigraphic Imaging”. Thus, head movements are used to direct the camera to scan out a scene in the same way eyeball movements normally orient the eye for this purpose.