Towards Spontaneous Interaction with the
Perceptive Workbench, a
Semi-Immersive Virtual Environment
Bastian Leibe, Thad Starner, William Ribarsky, Zachary Wartell, David
Krum, Justin Weeks, Brad Singletary, and Larry Hodges
GVU Center, Georgia Institute of Technology
Abstract
The Perceptive Workbench enables a spontaneous and unimpeded
interface between the physical and virtual worlds. It uses vision-based
methods for interaction that eliminate the need for wired input devices and
wired tracking. Objects are recognized and tracked when placed on the
display surface. Through the use of multiple infrared light sources, the
object's 3D shape can be captured and inserted into the virtual interface.
This ability permits spontaneity since either preloaded objects or those objects
selected on the spot by the user can become physical icons. Integrated
into the same vision-based interface is the ability to identify 3D hand
position, pointing direction, and sweeping arm gestures. Such gestures can
enhance selection, manipulation, and navigation tasks. In this paper, the
Perceptive Workbench is used for augmented reality gaming and terrain navigation
applications, which demonstrate the utility and flexibility of the
interface.
1. Introduction
Until now, we have interacted with computers mostly by
using devices that are constrained by wires. Typically, the wires limit the
distance of movement and inhibit freedom of orientation. In addition, most
interactions are indirect. The user moves a device as an analogue for the
action to be created in the display space. We envision an interface
without these restrictions. It is untethered; accepts direct, natural
gestures; and is capable of spontaneously accepting as interactors any objects
we choose.
In conventional 3D interaction, the devices that track position
and orientation are still usually tethered to the machine by wires.
Devices such as pinch gloves, which promise a more natural-seeming interface, often do not perform as well and are less preferred by users than simple handheld devices with buttons [Kessler95,
Seay99].
Pinch gloves carry assumptions about the position of the user's hand and
fingers with respect to the tracker. Of course, users' hands differ
in size and shape, so the assumed tracker position must be recalibrated for each
user. This is hardly ever done. Also, the glove interface causes
subtle changes to recognized hand gestures. The result is that fine
manipulations can be imprecise, and the user comes away with the feeling that
the interaction is slightly off in an indeterminate way. If we can
recognize gestures directly, differences in hand size and shape are automatically taken into account.
An additional problem is that any device held in the hand can become awkward
while gesturing. We have found this even with a simple pointing device,
such as a stick with a few buttons [Seay99].
Also, a user, unless fairly skilled, often has to pause to identify and select
buttons on the stick. With accurately tracked hands, most of this
awkwardness disappears. We are adept at pointing in almost any direction
and can quickly pinch fingers, for example, without looking at them.
Finally, physical objects are often natural interactors (such as phicons [Ullmer97]).
However, with current systems these objects must be inserted in advance or
specially prepared. One would like the system to accept objects that one
chooses spontaneously for interaction.
Figure 1: A user interacting
with Georgia Tech's Virtual Workbench using a 6DoF
pointer.
In this paper we discuss
methods for producing more seamless interaction between the physical and virtual
environments through the creation of the Perceptive Workbench. The system
is then applied to an augmented reality game and a terrain navigation
system. The Perceptive Workbench can reconstruct 3D virtual
representations of previously unseen real-world objects placed on its
surface. In addition, the Perceptive Workbench identifies and tracks such
objects as they are manipulated on the desk's surface and allows the user to
interact with the augmented environment through 2D and 3D gestures. These
gestures can be made on the plane of the desk's surface or in the 3D space above
the desk. Taking its cue from the user's actions, the Perceptive Workbench
switches between these modes automatically, and all interaction is controlled
through computer vision, freeing the user from the wires of traditional sensing
techniques.
2. Related Work
While the Perceptive Workbench [Leibe00]
is unique in its extensive ability to interact with the physical world,
it has a rich heritage of related work [Arai95,
Bimber99,
Coquillart99,
Kobayashi98,
Krueger91,
Krueger95,
May99,
Rekimoto97,
Schmalstieg99,
Seay99,
Ullmer97,
Underkoffler98,
vdPol99,
Wellner93].
Many augmented desk and virtual reality designs use tethered props, tracked by
electromechanical or ultrasonic means, to encourage interaction through
manipulation and gesture [Bolt92,
Bimber99,
Coquillart99,
Schmalstieg99,
Seay99,
Sturman92,
vdPol99].
Such designs tether the user to the desk and require the time-consuming ritual
of donning and doffing the appropriate equipment.
Fortunately, the computer vision community has taken up
the task of tracking the user's hands and identifying gestures. While
generalized vision systems track the body in room and desk-based scenarios for
games, interactive art, and augmented environments [Bobick96,
Wren95,
Wren97],
reconstruction of fine hand detail involves carefully calibrated systems and is
computationally intensive [Rehg93].
Even so, complicated gestures such as those used in sign language [Starner98,
Vogler98]
or the manipulation of physical objects [Sharma97]
can be recognized. The Perceptive Workbench uses computer vision techniques to
maintain a wireless interface.
Most directly related to the Perceptive Workbench, the "Metadesk" [Ullmer97]
identifies and tracks objects placed on the desk's display surface using a
near-infrared computer vision recognizer, originally designed by Starner.
Unfortunately, since not all objects reflect infrared light and infrared shadows
are not used, objects often need infrared reflective "hot mirrors" placed in
patterns on their bottom surfaces to aid tracking and identification. Similarly,
Rekimoto and Matsushita's "Perceptual Surfaces" [Rekimoto97]
employ 2D barcodes to identify objects held against the "HoloWall" and
"HoloTable." In addition, the HoloWall can track the user's hands (or other body
parts) near or pressed against its surface, but its potential recovery of the
user's distance from the surface is relatively coarse compared to the 3D
pointing gestures of the Perceptive Workbench. Davis and Bobick's SIDEshow [Davis98]
is similar to the HoloWall except that it uses cast shadows in infrared for
full-body 2D gesture recovery. Some augmented desks have cameras and projectors
above the surface of the desk and are designed to augment the process of
handling paper or interacting with models and widgets through the use of
fiducials or barcodes [Arai95,
Kobayashi98,
Underkoffler98,
Wellner93].
Krueger's VIDEODESK [Krueger91],
an early desk-based system, used an overhead camera and a horizontal visible
light table (for high contrast) to provide hand gesture input for interactions
displayed on a monitor on the far side of the desk. In contrast with the
Perceptive Workbench, none of these systems address the issues of introducing
spontaneous 3D physical objects into the virtual environment in real-time and
combining 3D deictic (pointing) gestures with object tracking and
identification.
3. Apparatus
The display environment for the Perceptive Workbench is
based on Fakespace's immersive workbench, consisting of a wooden desk
with a horizontal frosted glass surface on which a stereoscopic image can be
projected from behind the Workbench.
Figure 2: Light and camera positions for the
Perceptive Workbench.
The top view shows how shadows are cast and
the 3D arm position is tracked.
We placed a
standard monochrome surveillance camera under the projector that watches the
desk surface from underneath (see Figure
2). A filter placed in front of the camera lens makes it insensitive to
visible light and to images projected on the desk's surface. Two infrared
illuminators placed next to the camera flood the surface of the desk with
infrared light that is reflected toward the camera by objects placed on the
desk's surface (Figure
4).
A ring of seven similar light-sources is mounted on the ceiling surrounding
the desk (Figure
2). Each computer-controlled light casts distinct shadows on the
desk's surface based on the objects on the table (Figure
3a). A second camera, this one in color, is placed next to the desk to
provide a side view of the user's arms (Figure
3b). This side camera is used solely for recovering 3D pointing gestures.
Figure 3: Images seen by the infrared and color
cameras: (a) arm shadow from ceiling IR lights;
(b) image from side
camera
We decided to use near-infrared light since it is invisible to the
human eye. Thus, illuminating the scene with it does not interfere with
the user's interaction. Neither the illumination from the IR light
sources underneath the table, nor the shadows cast from the overhead lights can
be observed by the user. On the other hand, IR light can still be seen by
most standard CCD cameras. This makes it a very inexpensive method for
observing the interaction. In addition, by equipping the camera with an infrared
filter, the camera image can be analyzed regardless of changes in (visible)
scene lighting.
All vision processing is done on two SGI R10000 O2s (one for each camera),
which communicate with a display client on an SGI Onyx via sockets.
However, the vision algorithms could also be run on one SGI with two digitizer
boards or be implemented using inexpensive, semi-custom signal-processing
hardware.
We use this setup for three different kinds of interaction which
will be explained in more detail in the following sections: recognition and
tracking of objects placed on the desk surface based on their contour, full 3D
reconstruction of object shapes on the desk surface from shadows cast by the
ceiling light-sources, and recognition and quantification of hand and arm
gestures.
For display on the Perceptive Workbench, we use the Simple Virtual
Environment Toolkit (SVE), a graphics and sound library developed by the Georgia
Tech Virtual Environments Group [Kessler97].
SVE permits us to rapidly prototype applications used in this work. In
addition we use the workbench version of VGIS, a global terrain visualization
and navigation system [Lindstrom96,
Lindstrom97]
as an application for interaction using hand and arm gestures. The workbench
version of VGIS has stereoscopic rendering and an intuitive interface for
navigation [Wartell99a,
Wartell99b].
Both systems are built on OpenGL and have both SGI and PC implementations.
4. Object Tracking & Recognition
As a basic building block for our
interaction framework, we want to enable the user to manipulate the virtual
environment by placing objects on the desk surface. The system should
recognize these objects and track their positions and orientations while they
are being moved over the table. The user should be free to pick any set of
physical objects he wants to use.
The
motivation behind this is to use physical objects in a graspable user interface
[Fitzmaurice95].
Physical objects are often natural interactors (such as "phicons" [Ullmer97]).
They can provide physical handles to control the virtual application in a way
that is very intuitive for the user [Ishii97].
In addition, the use of objects encourages two-handed direct manipulation and
allows parallel input specification, thereby improving the communication
bandwidth with the computer [Fitzmaurice95,
Ishii97].
To achieve this goal, we use an improved version of the technique described
in Starner et al. [Starner00].
The underside of the desk is illuminated by two near-infrared light-sources (Figure
2). Every object close to the desk surface (including the user's
hands) reflects this light and can be seen by the camera under the display
surface (Figure
4). Using a combination of intensity thresholding and background
subtraction, we extract interesting regions of the camera image and analyze
them. The resulting blobs are classified as different object types based on a
set of features, including area, eccentricity, perimeter, moments, and the
contour shape.
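To make this processing step concrete, the following Python/OpenCV sketch shows one way such blobs and their shape features could be extracted from an infrared frame. The threshold, minimum area, and exact feature computations are illustrative assumptions, not the parameters of our implementation.

    import cv2
    import numpy as np

    def extract_blob_features(frame_gray, background_gray, thresh=40, min_area=200):
        """Segment bright IR reflections and compute simple shape features per blob."""
        # Background subtraction followed by intensity thresholding.
        diff = cv2.absdiff(frame_gray, background_gray)
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)

        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        features = []
        for c in contours:
            area = cv2.contourArea(c)
            if area < min_area:                   # discard noise blobs
                continue
            m = cv2.moments(c)
            cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
            # Eccentricity derived from the central second moments.
            cov = np.array([[m["mu20"], m["mu11"]],
                            [m["mu11"], m["mu02"]]]) / m["m00"]
            eig = np.sort(np.linalg.eigvalsh(cov))
            ecc = np.sqrt(1.0 - eig[0] / eig[1]) if eig[1] > 0 else 0.0
            features.append({"centroid": (cx, cy), "area": area,
                             "perimeter": cv2.arcLength(c, True),
                             "eccentricity": ecc, "contour": c})
        return features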
Figure 4: Image of reflection from the IR lights
underneath the desk, as seen from the infrared
camera.
There are several complications due
to the hardware arrangement. The foremost problem is that our two light sources
under the table can only provide a very uneven lighting over the whole desk
surface, bright in the middle, and getting weaker toward the borders. In
addition, the light rays are not parallel, and the reflection on the mirror
surface further exacerbates this effect. As a result, the perceived sizes and
shapes of objects on the desk surface can vary depending on position and
orientation. Finally, when the user moves an object, the reflection from his
hand can also add to the perceived shape. This makes it necessary to use an
additional stage in the recognition process that matches detected objects to
objects known to be on the table, filters out misclassifications, and can
even handle complete loss of information about an object for several frames.
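A minimal sketch of such a matching stage is given below. The greedy nearest-neighbor association and the distance threshold are assumptions, intended only to illustrate how misclassifications and short losses of an object can be smoothed over.

    import math

    def associate(known_objects, detections, max_dist=80.0):
        """Greedily match detected blobs to objects known to be on the table.

        known_objects: dict id -> {"centroid": (x, y), "missing": int}
        detections:    list of feature dicts containing a "centroid" key
        """
        unmatched = list(range(len(detections)))
        for obj in known_objects.values():
            best, best_d = None, max_dist
            for i in unmatched:
                dx = detections[i]["centroid"][0] - obj["centroid"][0]
                dy = detections[i]["centroid"][1] - obj["centroid"][1]
                d = math.hypot(dx, dy)
                if d < best_d:
                    best, best_d = i, d
            if best is not None:
                obj["centroid"] = detections[best]["centroid"]
                obj["missing"] = 0
                unmatched.remove(best)
            else:
                obj["missing"] += 1   # tolerate losing an object for several frames
        return unmatched              # detections not explained by any known object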
In this work, we are using the object recognition and tracking capability
mainly for "cursor objects". Our focus is on fast and accurate position
tracking, but the system may be trained on a different set of objects to be used
as navigational tools or physical icons [Ullmer97].
A future project will explore different modes of interaction based on this
technology.
5. Deictic Gesture Tracking
Hand gestures can be roughly classified into
symbolic (iconic, metaphoric, and beat) and deictic (pointing) gestures. Symbolic gestures carry an abstract meaning that may still be
recognizable in iconic form in the associated hand movement. Without the
necessary cultural context, however, the meaning may be arbitrary. Examples of
symbolic gestures include most conversational gestures in everyday use as well as
entire gesture languages such as American Sign Language. Previous work by
Starner [Starner98]
has shown that a large set of symbolic gestures can be distinguished and
recognized from live video images using hidden Markov models
(HMMs).
Deictic gestures, on the other hand, are characterized by a
strong dependency on location and orientation of the performing hand. Their
meaning is determined by the location at which a finger is pointing, or by the
angle of rotation of some part of the hand. This information acts not only as a
symbol for the gesture's interpretation, but also as a measure of how
much the corresponding action should be executed, or of which object it should be
applied to.
For navigation and object manipulation in a virtual environment, many
gestures are likely to have a deictic component. It is usually not enough to
recognize that an object should be rotated, but we will also need to know the
desired amount of rotation. For object selection or translation, we want to
specify the object or location of our choice just by pointing at it. For these
cases, gesture recognition methods that only take the hand shape and trajectory
into account will not be sufficient. We need to recover 3D information about the
user's hand and arm in relation to his body.
In the past, this information has largely been obtained by using
wired gloves or suits, or magnetic trackers [Bolt92,
Bimber99].
Such methods provide sufficiently accurate results but rely on wires and have to
be tethered to the user's body, or to specific interaction devices, with
all the aforementioned problems. Our goal is to develop a purely vision-based
architecture that facilitates unencumbered 3D interaction.
With vision-based 3D tracking techniques, the first issue is to determine
which information in the camera image is relevant, i.e. which regions represent
the user's hand or arm. This task is made even more difficult by
variation in user clothing or skin color and by background activity. Although typically only one user interacts with the environment at
a given time using traditional methods of interaction, the physical dimensions
of large semi-immersive environments such as the workbench invite people to
watch and participate.
In a virtual workbench environment, there are few places
where a camera can be placed to provide reliable hand position
information. One camera can be set up next to the table without overly
restricting the available space for users, but if a similar second camera were
to be used at this location, either multi-user experience or accuracy would be
compromised. We have addressed this problem by employing our shadow-based
architecture (as described in the hardware
section). The user stands in front of the workbench and extends an arm over
the surface. One of the IR light-sources mounted on the ceiling to the left of,
and slightly behind the user, shines its light on the desk surface, from where
it can be seen by the IR camera under the projector (see Figure
5). When the user moves his arm over the desk, it casts a shadow on the desk
surface (see Figure
6a). From this shadow, and from the known light-source position, we can
calculate a plane in which the user's arm must lie.
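A sketch of this plane computation is shown below, assuming the shadow has already been mapped into calibrated desk coordinates; the function name and coordinate conventions are illustrative.

    import numpy as np

    def shadow_plane(light_pos, shadow_p1, shadow_p2):
        """Plane containing the light source and the arm's shadow line on the desk.

        light_pos, shadow_p1, shadow_p2: 3D points in desk coordinates; the shadow
        points lie on the desk surface (z = 0). Returns (n, d) such that points x
        on the plane satisfy n . x = d.
        """
        light = np.asarray(light_pos, dtype=float)
        n = np.cross(np.asarray(shadow_p1, dtype=float) - light,
                     np.asarray(shadow_p2, dtype=float) - light)
        n = n / np.linalg.norm(n)
        return n, float(np.dot(n, light))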
Figure 5: Principle of
pointing direction recovery.
Simultaneously, the second camera to the right of the table (Figure
5 and Figure
7a) records a side view of the desk surface and the user's arm. It
detects where the arm enters the image and the position of the fingertip. From
this information, it extrapolates two lines in 3D space, on which the observed
real-world points must lie. By intersecting these lines with the shadow plane,
we get the coordinates of two 3D points, one on the upper arm, and one on the
fingertip. This gives us the user's hand position, and the direction in
which the user is pointing. As shown in Figure
13a, this information can be used to project an icon for the hand position
and a selection ray in the workbench display.
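Under the same assumptions as above, recovering the two 3D points amounts to intersecting the side camera's viewing rays with that plane, roughly as follows.

    import numpy as np

    def intersect_ray_plane(cam_center, ray_dir, n, d):
        """Intersect a side-camera viewing ray with the shadow plane (n, d)."""
        cam_center = np.asarray(cam_center, dtype=float)
        ray_dir = np.asarray(ray_dir, dtype=float)
        denom = np.dot(n, ray_dir)
        if abs(denom) < 1e-9:
            return None                          # ray parallel to the plane
        t = (d - np.dot(n, cam_center)) / denom
        return cam_center + t * ray_dir          # 3D point on the arm

Applying this to the rays through the detected arm-entry point and fingertip yields the hand position, and their difference gives the pointing direction.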
Obviously, the success of the gesture tracking capability
relies very strongly on how fast the image processing can be done. It is
therefore necessary to use simple algorithms. Fortunately, we can make some
simplifying assumptions about the image content.
We must first
recover arm direction and fingertip position from both the camera and the shadow
image. Since the user is standing in front of the desk and the user's arm
is connected to the user's body, the arm's shadow should always
touch the image border. Thus our algorithm exploits intensity thresholding and
background subtraction to discover regions of change in the image and searches
for areas in which these touch the front border of the desk surface (which
corresponds to the top border of the shadow image or the left border of the
camera image). It then takes the middle of the touching area as an approximation
for the origin of the arm (Figure
6b and Figure
7b). For simplicity we will call this point the "shoulder", although in most
cases it is not. Tracing the contour of the shadow, the algorithm searches for
the point that is farthest away from the shoulder and takes it as the fingertip.
The line from the shoulder to the fingertip reveals the 2D direction of the arm.
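The following sketch summarizes this contour analysis for a binary shadow mask; it assumes the desk's front edge maps to the top image row, as described above.

    import cv2
    import numpy as np

    def arm_from_shadow(mask):
        """Approximate 'shoulder' and fingertip from a binary arm-shadow mask."""
        top = np.flatnonzero(mask[0, :] > 0)      # where the shadow touches the border
        if top.size == 0:
            return None
        shoulder = np.array([top.mean(), 0.0])    # middle of the touching area (x, y)

        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        pts = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(float)
        dists = np.linalg.norm(pts - shoulder, axis=1)
        fingertip = pts[np.argmax(dists)]         # contour point farthest from the shoulder
        return shoulder, fingertip                # their difference gives the 2D arm direction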
Figure 6: (a) Arm shadow from
overhead IR lights; (b) resulting contour with recovered arm
direction.
In our experiments, the
point thus obtained was coincident with the pointing fingertip in all but a few
extreme cases (such as the fingertip pointing straight down at a right angle to
the arm). The method does not depend on a pointing gesture; it also works
for most other hand shapes, including, but not restricted to, a hand held
horizontally, vertically, or in a fist. These shapes may be distinguished by
analyzing a small section of the side camera image and may be used to trigger
specific gesture modes in the future.
Figure 7: (a) image from side camera; (b) arm contour (from similar image) with recovered arm
direction.
The computed arm
direction is correct as long as the user's arm is not overly bent (see Figure
7). In such cases, the algorithm still connected shoulder and fingertip,
resulting in a direction somewhere between the direction of the arm and the one
given by the hand. Although the absolute resulting pointing position did not
match the position towards which the finger was pointing, it still managed to
capture the trend of movement very well. Surprisingly, the technique is
sensitive enough that the user can stand at the desk with his arm extended
over the surface and direct the pointer simply by moving his index finger,
without arm movement.
Limitations
The architecture used poses several limitations. The primary
problem with the shadow approach is finding a position for the light source that
can give a good shadow of the user's arm for a large set of possible
positions, while avoiding capture of the shadow from the user's body.
Since the area visible to the IR camera coincides with the desk surface,
there are necessarily regions where the shadow is not visible, touches the
border, or falls outside it. Our solution to this problem is to switch
automatically to a different light source whenever such a situation is detected,
the choice of the new light source depending on where the shadows touched the
border. By choosing overlapping regions for all light sources, we can keep the
number of light source switches to a necessary minimum. In practice, four light
sources in the original set of seven were enough to cover the relevant area of
the desk surface. However, an additional spotlight, mounted directly above
the desktop, has been added to provide more direct coverage of the desktop
surface.
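A simplified sketch of this switching policy is given below; the border labels and fallback table are hypothetical, standing in for the overlapping coverage regions chosen for our lights.

    def choose_light(current, touched_borders, fallbacks):
        """Pick a ceiling light when the current shadow becomes unusable.

        touched_borders: borders ('left', 'right', 'back') the shadow touches or
                         crosses, besides the front edge where the arm enters
        fallbacks:       dict border -> ordered list of light ids whose usable
                         regions overlap the current one (an assumption)
        """
        if not touched_borders:
            return current                       # shadow fully visible; keep the light
        for border in touched_borders:
            for light in fallbacks.get(border, []):
                if light != current:
                    return light                 # first suitable alternative light
        return current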
Another problem can be seen in Figure
7b, where segmentation based on color background subtraction detects both
the hand and the change in the display on the workbench. A more recent
implementation replaces the side color camera with an infrared spotlight and a
monochrome camera equipped with an infrared-pass filter. By adjusting the angle
of the light to avoid the surface of the desk or any other close objects, the
user's arm is illuminated and made distinct from the background, and changes in
the workbench's display do not affect the tracking.
A bigger problem is caused by the actual location of the side camera. If the
user extends both of his arms over the desk surface, or if more than one user
tries to interact with the environment at the same time, the images of these
multiple limbs can overlap and be merged into a single blob. As a consequence, our
approach will fail to detect the hand positions and orientations in these cases.
A more sophisticated approach using previous position and movement information
could yield more reliable results, but we chose, at this first stage, to accept
this restriction and concentrate on high frame rate support for one-handed
interaction. This may not be a serious limitation for a single user for certain
tasks; a recent study shows that for a task normally requiring
two hands in
a real environment, users have no preference for one versus two hands in a
virtual environment that does not
model effects such as gravity and inertia
[Seay99].
6. 3D Reconstruction
To complement the capabilities
of the Perceptive Workbench, we want to be able to insert real objects into the
virtual world and to share them with other users at different locations. An
example application for this could be a telepresence or computer-supported
collaborative work (CSCW) system. Instead of verbally describing an object, the
user would be able to quickly create a 3D reconstruction and send it to his
co-workers (see Figure
8). For this, it is necessary to design a reconstruction mechanism that does
not interrupt the interaction. Our focus is on providing an almost instantaneous
visual cue for the object as part of the interaction, not necessarily on
creating a highly accurate model.
Figure 8: Real object
inserted into the virtual world.
Several methods have been designed to reconstruct objects from silhouettes
[Srivastava90,
Sullivan98]
or dynamic shadows [Daum98]
using either a moving camera or light-source on a known trajectory or a
turntable for the object [Sullivan98].
Several systems have been developed for the reconstruction of relatively simple
objects, including the commercial system Sphinx3D.
However, the necessity to move either the camera or the object imposes severe
constraints on the working environment. To reconstruct an object with these
methods, it is usually necessary to interrupt the user's interaction with
it, take the object out of the user's environment, and place it into a
specialized setting (sometimes in a different room). Other approaches make use
of multiple cameras from different view points to avoid this problem at the
expense of more computational power to process and communicate the results.
In this project, using only one camera and the infrared light sources, we
analyze the shadows cast on the object from multiple directions (see Figure
9). As the process is based on infrared light, it can be applied
independently of the lighting conditions and without interfering with the user's
natural interaction with the desk.
The existing approaches to reconstruct shape from shadows
or silhouettes can be divided into two camps. The volume approach, pioneered by
Baumgart [Baumgart74]
intersects view volumes to create a representation for the object. Common
representations for the resulting model are polyhedra [Baumgart74,
Conolly89],
or octrees [Srivastava90,
Chien86].
The surface approach reconstructs the surface as the envelope of its tangent
planes. It has been realized in several systems [Boyer96,
Seales95].
Both approaches can be combined, as in Sullivan's work [Sullivan98],
which uses volume intersection to create an object and then smooths the surfaces
with splines.
We have chosen to use a volume approach to create
polyhedral reconstructions for several reasons. We want to create models that
can be used instantaneously in a virtual environment. Our focus is not on
getting a photorealistic reconstruction, but on creating a quick low
polygon-count model for an arbitrary real-world object in real-time, without
interrupting the ongoing interaction. Polyhedral models offer significant
advantages over other representations, such as generalized cylinders,
superquadrics, or polynomial splines. They are simple and computationally
inexpensive; Boolean set operations can be performed on them with reasonable
effort; and most current VR engines are optimized for fast rendering of
polygonal scenes. In addition, polyhedral models are the basis for many later
processing steps. If desired, they can still be refined with splines using a
surface approach.
Figure 9: Principle of the 3D
reconstruction.
To obtain the different
views, we mounted a ring of seven infrared light sources in the ceiling, each
one of which is switched independently by computer control. The system detects
when a new object is placed on the desk surface, and the user can initiate the
reconstruction by touching a virtual button rendered on the screen (Figure
13). This action is detected by the camera, and, after only one second, all
shadow images are taken. In another second, the reconstruction is complete (Figure
12), and the newly reconstructed object is part of the virtual world.
Our approach is fully automated and does not require any special hardware
(e.g. stereo cameras, laser range finders, structured lighting,
etc.). The method is extremely inexpensive, both in hardware and in
computational cost. In addition, there is no need for extensive calibration,
which is usually necessary in other approaches to recover the exact position or
orientation of the object in relation to the camera. We only need to know
the approximate position of the light-sources (±2 cm), and we need to adjust
the camera to reflect the size of the display surface, which must be done only
once. Neither the camera, the light-sources, nor the object is moved during
the reconstruction process, so recalibration is unnecessary. We
have replaced all mechanical moving parts, which are often prone to wear
and imprecision, with a series of light beams from known locations.
An obvious limitation for this approach is that we are confined to a fixed
number of different views from which to reconstruct the object. The turntable
approach, on the other hand, allows the system to take an arbitrary number of
images from different view points. However, Sullivan's work [Sullivan98]
and our experience with our system have shown that even for quite complex
objects, usually seven to nine different views are enough to get a reasonable 3D
model of the object. Note that the reconstruction uses
the same hardware as the deictic gesture tracking capability discussed in the
previous section. Thus, it comes at no additional cost.
The speed of the reconstruction process is mainly limited by the switching
time of the light sources. Whenever a new light-source is activated, the image
processing system has to wait for several frames to receive a valid image. The
camera under the desk records the sequence of shadows cast by an object on the
table when illuminated by the different lights. Figure
10 shows two series of contour shadows extracted from two sample objects by
using different IR sources. By approximating each shadow as a polygon (not
necessarily convex) [Rosin95],
we create a set of polyhedral "view cones", extending from the light source to
the polygons. The intersection of these cones creates a polyhedron that roughly
contains the object. Figure
11 shows some more shadows with the resulting
polygons and a visualization of the intersection of polyhedral cones.
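The cone construction itself is straightforward; the sketch below assembles a cone's side faces as a triangle list from a shadow polygon and its light source. Closing and intersecting the resulting solids is then a job for a solid-modeling library; the representation here is only illustrative.

    import numpy as np

    def view_cone_faces(light_pos, shadow_polygon):
        """Side faces of the view cone from a light source to its shadow polygon.

        light_pos:      3D position of the IR light source (the cone's apex)
        shadow_polygon: ordered 3D desk-plane vertices (z = 0), not necessarily convex
        Returns one triangle (apex, v_i, v_{i+1}) per shadow edge; a cap over the
        base polygon would be added to close the solid before intersection.
        """
        light = np.asarray(light_pos, dtype=float)
        base = [np.asarray(p, dtype=float) for p in shadow_polygon]
        faces = []
        for i in range(len(base)):
            j = (i + 1) % len(base)
            faces.append((light, base[i], base[j]))
        return faces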
Figure 10: Shadow contours
and interpolated polygons from a watering can (left) and a teapot
(right).
Intersecting non-convex polyhedral objects is a very
complex problem, further complicated by numerous special cases that have to be
taken care of. Fortunately, this problem has already been exhaustively
researched, and solutions are available [Mantyla88].
For the intersection calculations in our application, we used Purdue
University's TWIN Solid Modeling Library [TWIN95].
Figure 11: Steps of 3D object reconstruction
including extraction of contour shapes from shadows
and intersection
of multiple view cones (bottom).
Figure
13b shows a freshly reconstructed model of a watering can placed on the desk
surface. The same model can be seen in more detail in Figure
12, along with the reconstruction of a teapot. The colors were chosen to
highlight the different model faces by interpreting the face normal as a vector
in RGB color space.
Figure 12: 3D reconstruction
of the watering can seen in Figure 5-5 (top) and a teapot
(bottom).
The colors are chosen to
highlight the model faces.
Limitations
An obvious
limitation to our approach is that not every non-convex object can be exactly
reconstructed from its silhouettes or shadows. The closest approximation
that can be obtained with volume intersection is its visual hull. But even
for objects with a polyhedral visual hull an unbounded number of silhouettes may
be necessary for an exact reconstruction [Laurentini97].
In practice, we can get sufficiently accurate models for a large variety of
real-world objects, even from a relatively small number of different
views.
Exceptions are spherical or cylindrical objects.
The quality of reconstruction for these objects depends largely on the number of
available views. With only seven light-sources, the resulting model will appear
faceted. This problem can be solved by either adding more light-sources,
or by improving the model with the help of splines.
Apart from this, the accuracy by which objects can be
reconstructed is bounded by another limitation of our architecture. All
our light-sources are mounted to the ceiling. From this point of view they
cannot provide full information about the object's shape. There is
a pyramidal "blind spot" above all horizontal flat surfaces that the
reconstruction cannot eliminate. The slope of these pyramids depends on
the angle between the desk surface and the rays from the light-sources.
For our current hardware setting, this angle ranges between 37° and 55°,
depending on the light-source. Only structures with a greater slope will
be reconstructed entirely without error. This problem is intrinsic to the
method and also occurs with the turntable approach, but on a much smaller
scale.
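The blind-spot angle follows directly from the light geometry. The small example below, with hypothetical positions, computes the elevation of a light above a desk point, which bounds the steepest slope that can be carved correctly.

    import math

    def light_elevation_deg(light_pos, desk_point):
        """Elevation angle of a ceiling light as seen from a point on the desk (z = 0)."""
        dx = light_pos[0] - desk_point[0]
        dy = light_pos[1] - desk_point[1]
        dz = light_pos[2] - desk_point[2]
        return math.degrees(math.atan2(dz, math.hypot(dx, dy)))

    # A light 1.9 m above the desk and 1.5 m away horizontally gives roughly 52 degrees,
    # which falls inside the 37-55 degree range cited above.
    print(round(light_elevation_deg((1.5, 0.0, 1.9), (0.0, 0.0, 0.0)), 1))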
We
expect that we can greatly reduce the effects of this error by using the image
from the side camera and extracting an additional silhouette of the object from
this point of view. This will help to keep the error angle well below
10°. Calculations based on the current position of the side camera
(optimized for the gesture recognition) promise an error angle of only
7°.
In the current version of our software, an additional
error is introduced by the fact that we do not yet handle holes in the
shadows. This is merely an implementation issue, which will be
resolved in a future extension of the project.
7. Performance Analysis
Object and Gesture Tracking
Both object and
gesture tracking perform at a stable 12-18 frames per second. Frame rate
depends on the number of objects on the table and the size of the shadows,
respectively. Both techniques are able to follow fast motions and
complicated trajectories. Latency is currently 0.25-0.33 seconds but has
improved since last testing (an acceptable threshold is typically around 0.1
second). Surprisingly, this level of latency seems adequate for most
pointing gestures in our current applications. Since the user is provided
with continuous feedback about his hand and pointing position, and most
navigation controls are relative rather than absolute, the user adapts his
behavior readily to the system. With object tracking, the physical object
itself can provide the user with adequate tactile feedback as the system catches
up with the user's manipulations. In general, since the user is moving
objects across a very large desk surface, the lag is noticeable but rarely
troublesome in the current applications.
Even so, we expect that simple improvements in the socket communication
between the vision and rendering code and in the vision code itself will improve
latency significantly. For the terrain navigation task below, rendering
speed provides a limiting factor. However, Kalman filters may compensate
for render lag and will also add to the stability of the tracking system.
3D Reconstruction
Calculating the error from
the 3D reconstruction process requires choosing known 3D models, performing the
reconstruction process, aligning the reconstructed model and the ideal model,
and calculating an error measure. For simplicity, a cone and pyramid were
chosen. The centers of mass of the ideal and reconstructed models were set
to the same point in space, and their principal axes were aligned.
To measure error, we used the Metro tool [Cignoni98].
It approximates the real distance between the two surfaces by choosing a set of
(100,000-200,000) points on the reconstructed surface, and then calculating the
two-sided distance (Hausdorff distance) between each of these points and the
ideal surface. This distance is defined as

    H(S1, S2) = max{ E(S1, S2), E(S2, S1) },

with E(S1, S2) denoting the one-sided distance between the surfaces S1 and S2:

    E(S1, S2) = max_{p in S1} min_{p' in S2} dist(p, p').
The Hausdorff distance directly corresponds to the reconstruction
error. In addition to the maximum distance, we also calculated the mean
and mean square distances. Table
1 shows the results. In these examples, the relatively large maximal
error was caused by the difficulty in accurately reconstructing the tip of the
cone and the pyramid.
                       Cone              Pyramid
    Maximal Error      0.0215 (7.26 %)   0.0228 (6.90 %)
    Mean Error         0.0056 (1.87 %)   0.0043 (1.30 %)
    Mean Square Error  0.0084 (2.61 %)   0.0065 (1.95 %)

Table 1: Reconstruction errors averaged over three runs
(in meters and percentage of object diameter).
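For reference, a distance of this kind can be approximated with a few lines of code once both surfaces are densely sampled. The sketch below uses point samples on both surfaces, which is a simplification of Metro's point-to-surface test but yields comparable maximum, mean, and RMS figures for dense samples.

    import numpy as np
    from scipy.spatial import cKDTree

    def hausdorff_stats(points_a, points_b):
        """Approximate two-sided surface distance from dense point samples.

        points_a, points_b: (N, 3) arrays sampled on the reconstructed and ideal surfaces.
        Returns maximum, mean, and root-mean-square distances.
        """
        d_ab, _ = cKDTree(points_b).query(points_a)   # one-sided E(S1, S2)
        d_ba, _ = cKDTree(points_a).query(points_b)   # one-sided E(S2, S1)
        both = np.concatenate([d_ab, d_ba])
        return {"max": float(both.max()),
                "mean": float(both.mean()),
                "rms": float(np.sqrt((both ** 2).mean()))}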
While improvements may be made
by precisely calibrating the camera and lighting system, by adding more light
sources, and by obtaining a silhouette from the side camera (to eliminate
ambiguity about the top of the surface), the system meets its goal of providing
virtual presences for physical objects in a quick and timely manner that
encourages spontaneous interactions.
8. Putting It to Use: Spontaneous Gesture Interfaces
All the components of the Perceptive Workbench (deictic gesture tracking,
object recognition and tracking, and 3D reconstruction) can be combined into
a single, consistent framework. The Perceptive Workbench interface detects how
the user wants to interact with it and automatically switches to the desired
mode.
When the user moves his hand above the display surface, the hand and arm are
tracked as described in Section 5. A cursor appears at the projected hand position on the display surface,
and a ray emanates along the projected arm axis. These can be used in selection
or manipulation, as in Figure
13a. When the user places an object on the surface, the cameras recognize
this and identify and track the object. A virtual button also appears on the
display (indicated by the arrow in Figure
13b). Through shadow tracking, the system determines when the hand overlaps
the button, selecting it. This action causes the system to capture the 3D object
shape, as described in Section 6.
Figure 13: (a) Pointing gesture with hand icon
and selection ray; (b) Virtual button
rendered on the screen when
an object is detected on the surface.
The decision whether or not
there is an object on the desk surface is made easy by the fact that shadows
from the user's arms always touch the image border. Thus, if a shadow is
detected that does not touch any border, the system can be sure that it is
caused by an object on the desk surface. As a result, it will switch to object
recognition and tracking mode. Similarly, the absence of such shadows, for a
certain period, indicates that the object has been taken away, and the system
can safely switch back to gesture tracking mode. Note that once the system is in
object recognition mode, the ceiling lights are switched off, and the light
sources underneath the table are activated instead. It is therefore safe for the
user to grab and move objects on the desk surface, since his arms will not cast
any shadows that could disturb the perceived object contours.
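The sketch below captures this switching logic; the frame threshold is an assumption, since only "a certain period" is specified above.

    def update_mode(state, blobs, empty_threshold=30):
        """Switch between gesture tracking and object recognition mode.

        blobs: regions seen by the camera under the desk, each with a boolean
               'touches_border' flag (arm shadows always touch the front border)
        state: dict holding the current 'mode' and an 'empty_frames' counter
        """
        if any(not b["touches_border"] for b in blobs):
            state["mode"] = "object"        # region not touching a border: an object
            state["empty_frames"] = 0
        elif not blobs:
            state["empty_frames"] += 1
            if state["empty_frames"] > empty_threshold:
                state["mode"] = "gesture"   # object removed; resume deictic tracking
        else:
            state["empty_frames"] = 0       # only arm shadows present; keep the mode
        return state["mode"]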
This set provides the elements of a perceptual interface, operating without
wires and without restrictions as to objects employed. For example, we have
constructed a simple application where objects placed on the desk are selected,
reconstructed, and then placed in a "template" set, displayed as slowly rotating
objects on the left border of the workbench display. These objects can then be
grabbed by the user and could act as new physical icons that are attached by the
user to selection or manipulation modes. Or the shapes themselves could be used
in model-building or other applications.
An Augmented Reality Game
We have created a more elaborate collaborative
interface using the Perceptive Workbench. This involves the workbench
communicating with a person in a separate space wearing an augmented reality
headset. All interaction is via image-based gesture tracking without attached
sensors. The game is patterned after a martial arts fighting game. The user in
the augmented reality headset is the player, and one or more people interacting
with the workbench are the game masters. The workbench display surface acts as a
top-down view of the player's space. The game masters place different
objects on the surface, which appear to the player as distinct monsters at
different vertical levels in his space. The game masters move the objects around
the display surface, toward and away from the player; this motion is replicated
in the player's view by the monsters which move in their individual planes. Figure
14a shows the game masters moving objects, and Figure
15b displays the moving monsters in the virtual space.
Figure 14: (a) Two game
masters controlling virtual monsters; (b) hardware outfit worn by mobile
player.
The mobile player wears a
"see-through" Sony Glasstron (Figure
14b) equipped with two cameras. Fiducials or natural features in the
player's space are tracked by the forward facing camera to recover head
orientation. This allows graphics (such as monsters) to be rendered roughly
registered with the physical world. The second camera looks down at the
player's hands to recognize "martial arts" gestures [Starner98].
Figure 15: (a) Mobile player
performing Kung-Fu gestures to ward off monsters;
(b) Virtual monsters overlaid on the real background as seen by
the mobile player.
While a more
sophisticated hidden Markov model system is under development, a simple template
matching method is sufficient for recognizing a small set of martial arts
gestures. To effect attacks on the monsters, the user accompanies the
appropriate attack gesture (Figure
15a) with a Kung Fu yell ("heee-YAH"). Each foe requires a different
gesture. Foes that are not destroyed enter the player's personal space
and injure him. Enough injuries will cause the player's defeat.
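As an illustration of the template matching mentioned above, a generic normalized-correlation matcher of the following form suffices for a small gesture set; the patch size and scoring are assumptions rather than the actual implementation.

    import numpy as np

    def classify_gesture(hand_patch, templates):
        """Nearest-template classification of a segmented hand image.

        hand_patch: small grayscale array (e.g. 32x32) cropped around the hand
        templates:  dict gesture_name -> array of the same shape
        """
        x = (hand_patch - hand_patch.mean()) / (hand_patch.std() + 1e-6)
        best, best_score = None, -np.inf
        for name, t in templates.items():
            tn = (t - t.mean()) / (t.std() + 1e-6)
            score = float((x * tn).mean())      # normalized cross-correlation
            if score > best_score:
                best, best_score = name, score
        return best, best_score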
The system has been used by faculty and graduate students in the
GVU lab. They have found the experience compelling and balanced. Since it is
difficult for the game master to keep pace with the player, two or more game
masters may participate (Figure
14a). The Perceptive Workbench's object tracker scales naturally to
handle multiple, simultaneous users. For a more detailed description of this
application, see Starner et al. [Starner00].
3D Terrain Navigation
We have developed a global terrain navigation
system on the virtual workbench which allows one to fly continuously from outer
space to terrain or buildings with features at one foot or better resolution [Wartell99a].
Since features are displayed stereoscopically [Wartell99b],
the navigation is both compelling and detailed. In our third person
navigation interface, the user interacts with the terrain as if it were an
extended relief map laid out below on a curved surface. Main
interactions include zooming, panning, and rotating. Since the user is
head-tracked, he can move his head to look at the 3D objects from different
angles. Previously, interaction was performed with button sticks that had six
degree-of-freedom electromagnetic trackers attached. We employ the
deictic gestures of the Perceptive Workbench, as described in Section
5, to remove this constraint. Direction of navigation is chosen by
pointing and can be changed continuously (Figure
16c). Moving the hand towards the display increases speed towards the
earth and moving it away increases speed away from the earth. Panning is
accomplished by lateral gestures in the direction to be panned (Figure
16a). Rotation is accomplished by making a rotating gesture with the
arm (Figure
16b). At present these three modes are chosen by keys on a keyboard
attached to the workbench. In the future we expect to use gestures
entirely (e.g., pointing will indicate zooming).
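To make the mapping concrete, the following sketch translates tracked hand motion into navigation parameters; the gains and the exact rotation mapping are hypothetical, not the values used in VGIS.

    import numpy as np

    def navigation_command(mode, hand_pos, prev_hand_pos, zoom_gain=2.0, pan_gain=1.5):
        """Map tracked 3D hand motion to terrain navigation parameters.

        mode:     'zoom', 'pan', or 'rotate' (currently selected with keyboard keys)
        hand_pos: current and previous 3D hand positions above the desk (desk coords)
        """
        delta = np.asarray(hand_pos, dtype=float) - np.asarray(prev_hand_pos, dtype=float)
        if mode == "zoom":
            # Moving the hand toward the display increases speed toward the earth.
            return {"zoom_speed": -zoom_gain * delta[2]}
        if mode == "pan":
            # Lateral motion pans the terrain in the corresponding direction.
            return {"pan": (pan_gain * delta[0], pan_gain * delta[1])}
        if mode == "rotate":
            # The lateral swing of the arm drives the rotation rate about the lever axis.
            return {"rotate_rate": float(np.arctan2(delta[1], delta[0]))}
        return {}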
Although there are currently some problems with latency and accuracy (both of
which will be diminished in the future), a user can successfully employ gestures
for navigation. In addition the set of gestures are quite natural to
use. Further, we find that the vision system can distinguish hand
articulation and orientation quite well. Thus we will be able to attach
interactions to hand movements (even without the larger arm movements). At the
time of this writing, an HMM framework has been developed to allow the user to
train his own gestures for recognition. This system, in association with the
terrain navigation database, should allow more sophisticated interactions in the
future.
Figure 16: Terrain navigation using deictic
gestures: (a) panning; (b) rotation (about an axis perpendicular to
and through the end of the rotation lever); (c) zooming
in.
Telepresence, CSCW
As
another application of the Perceptive Workbench, we have built a small
telepresence system. Using the sample interaction framework described at the
beginning of this section, a user can point to any location on the desk,
reconstruct objects, and move them across the desk surface. Every one of his
actions is immediately applied to a virtual reality model of the workbench
mirroring the current state of the real desk. Thus, when performing deictic
gestures, the current hand and pointing position is displayed on the model
workbench by a red selection ray. Similarly, the reconstructed shapes of objects
on the desk surface are displayed at the corresponding positions in the model
(Figure
17). This makes it possible for co-workers at a distant location to follow
the user's actions in real-time, while having complete freedom to choose
a favorable view point.
Figure 17: An example of a telepresence system: a virtual instantiation of the
workbench mirroring, in real time, the type and position of objects placed on the real
desk.
9. Future Work and Conclusions
Several improvements can be made to the
Perceptive Workbench. Higher resolution reconstruction and improved recognition
for small objects can be achieved via an active pan/tilt/zoom camera mounted
underneath the desk. The color side camera can be used to improve 3D
reconstruction and construct texture maps for the digitized object. The
reconstruction code can be modified to handle holes in objects and to correct
errors caused by non-square pixels. The latency of the gesture/rendering loop
can be improved through code refinement and the application of Kalman filters.
With a difficult object, recognition based on reflections from the light sources
underneath can be progressively improved by using the cast shadows from the light
sources above, or by using the 3D reconstructed model directly. Hidden Markov models can
be employed to recognize symbolic hand gestures [Starner98]
for controlling the interface. Finally, as hinted by the multiple game masters
in the gaming application, several users may be supported through careful,
active allocation of resources.
In conclusion, the Perceptive Workbench uses a vision-based system to enable
a rich set of interactions, including hand and arm gestures, object recognition
and tracking, and 3D reconstruction of objects placed on its surface.
These elements are combined seamlessly into the same interface and can be used
in diverse applications. In addition, the sensing system is relatively
inexpensive, retailing at approximately $1000 for the cameras and lighting
equipment plus the cost of a computer with one or two video digitizers,
depending on the functions desired. As seen from the multiplayer gaming
and terrain navigation applications, the Perceptive Workbench provides an
untethered and spontaneous interface that encourages the inclusion of physical
objects in the virtual environment.
Acknowledgements
This work is supported in part by a contract from the
Army Research Lab, an NSF grant, an ONR AASERT grant, and funding from Georgia
Tech's Broadband Institute. We thank Brygg Ullmer, Jun Rekimoto, and Jim
Davis for their discussions and assistance. In addition we thank Paul
Rosin and Geoff West for their line segmentation code [Rosin95],
the Purdue CADLAB for TWIN [TWIN95],
and P. Cignoni, C. Rocchini, and R. Scopigno for Metro [Cignoni98].
References
[Arai95] Arai, T., K. Machii, and S. Kuzunuki. Retrieving Electronic Documents with Real-World Objects on InteractiveDesk. UIST '95, pp. 37-38 (1995).
[Baumgart74] Baumgart, B.G. Geometric Modeling for Computer Vision. PhD Thesis, Stanford University, Palo Alto, CA, 1974.
[Bimber99] Bimber, O. Gesture Controlled Object Interaction: A Virtual Table Case Study. Computer Graphics, Visualization, and Interactive Digital Media, Vol. 1, Plzen, Czech Republic, 1999.
[Bobick96] Bobick, A., S. Intille, J. Davis, F. Baird, C. Pinhanez, L. Campbell, Y. Ivanov, A. Schutte, and A. Wilson. The KidsRoom: A Perceptually-Based Interactive and Immersive Story Environment. MIT Media Lab Technical Report (1996).
[Bolt92] Bolt, R. and E. Herranz. Two-handed gesture in multi-modal natural dialog. UIST '92, pp. 7-14 (1992).
[Boyer96] Boyer, E. Object Models from Contour Sequences. Proceedings of the Fourth European Conference on Computer Vision, Cambridge, England, April 1996, pp. 109-118.
[Chien86] Chien, C.H. and J.K. Aggarwal. Computation of Volume/Surface Octrees from Contours and Silhouettes of Multiple Views. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '86), Miami Beach, FL, June 22-26, 1986, pp. 250-255.
[Cignoni98] Cignoni, P., C. Rocchini, and R. Scopigno. Metro: Measuring Error on Simplified Surfaces. Computer Graphics Forum, 17(2), June 1998, pp. 167-174.
[Conolly89] Connolly, C.I. and J.R. Stenstrom. 3D Scene Reconstruction from Multiple Intensity Images. Proceedings of the IEEE Workshop on Interpretation of 3D Scenes, Austin, TX, Nov. 1989, pp. 124-130.
[Coquillart99] Coquillart, S. and G. Wesche. The Virtual Palette and the Virtual Remote Control Panel: A Device and an Interaction Paradigm for the Responsive Workbench. IEEE Virtual Reality '99 Conference (VR '99), Houston, March 13-17, 1999.
[Daum98] Daum, D. and G. Dudek. On 3-D Surface Reconstruction Using Shape from Shadows. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '98), 1998.
[Davis98] Davis, J.W. and A.F. Bobick. SIDEshow: A Silhouette-based Interactive Dual-screen Environment. MIT Media Lab Technical Report No. 457 (1998).
[Fitzmaurice95] Fitzmaurice, G.W., H. Ishii, and W. Buxton. Bricks: Laying the Foundations for Graspable User Interfaces. Proceedings of CHI '95, pp. 442-449 (1995).
[Ishii97] Ishii, H. and B. Ullmer. Tangible Bits: Towards Seamless Interfaces between People, Bits, and Atoms. Proceedings of CHI '97, pp. 234-241 (1997).
[Kessler95] Kessler, G.D., L.F. Hodges, and N. Walker. Evaluation of the CyberGlove as a Whole-Hand Input Device. ACM Transactions on Computer-Human Interaction, 2(4), pp. 263-283 (1995).
[Kessler97] Kessler, D., R. Kooper, and L. Hodges. The Simple Virtual Environment Library: User's Guide Version 2.0. Graphics, Visualization, and Usability Center, Georgia Institute of Technology, 1997.
[Kobayashi98] Kobayashi, M. and H. Koike. EnhancedDesk: Integrating Paper Documents and Digital Documents. Proceedings of the 3rd Asia Pacific Conference on Computer Human Interaction, pp. 57-62 (1998).
[Krueger91] Krueger, M. Artificial Reality II. Addison-Wesley, 1991.
[Krueger95] Krueger, W., C.-A. Bohn, B. Froehlich, H. Schueth, W. Strauss, and G. Wesche. The Responsive Workbench: A Virtual Work Environment. IEEE Computer, 28(7), July 1995, pp. 42-48.
[Laurentini97] Laurentini, A. How Many 2D Silhouettes Does It Take to Reconstruct a 3D Object? Computer Vision and Image Understanding, 67(1), July 1997, pp. 81-87.
[Leibe00] Leibe, B., T. Starner, W. Ribarsky, Z. Wartell, D. Krum, B. Singletary, and L. Hodges. The Perceptive Workbench: Towards Spontaneous and Natural Interaction in Semi-Immersive Virtual Environments. IEEE Virtual Reality 2000 Conference (VR 2000), New Brunswick, NJ, March 2000, pp. 13-20.
[Lindstrom96] Lindstrom, P., D. Koller, W. Ribarsky, L. Hodges, N. Faust, and G. Turner. Real-Time, Continuous Level of Detail Rendering of Height Fields. Report GIT-GVU-96-02, SIGGRAPH 96, pp. 109-118 (1996).
[Lindstrom97] Lindstrom, P., D. Koller, W. Ribarsky, L. Hodges, and N. Faust. An Integrated Global GIS and Visual Simulation System. Georgia Tech Report GVU-97-07 (1997).
[Mantyla88] Mäntylä, M. An Introduction to Solid Modeling. Computer Science Press, 1988.
[May99] May, R. HI-SPACE: A Next Generation Workspace Environment. Master's Thesis, Washington State University EECS, June 1999.
[Rehg93] Rehg, J.M. and T. Kanade. DigitEyes: Vision-Based Human Hand-Tracking. School of Computer Science Technical Report CMU-CS-93-220, Carnegie Mellon University, December 1993.
[Rekimoto97] Rekimoto, J. and N. Matsushita. Perceptual Surfaces: Towards a Human and Object Sensitive Interactive Display. Workshop on Perceptual User Interfaces (PUI '97), 1997.
[Rosin95] Rosin, P.L. and G.A.W. West. Non-parametric segmentation of curves into various representations. IEEE PAMI, 17(12), pp. 1140-1153 (1995).
[Schmalstieg99] Schmalstieg, D., L.M. Encarnacao, and Z. Szalavari. Using Transparent Props for Interaction with the Virtual Table. Symposium on Interactive 3D Graphics (I3DG '99), Atlanta, 1999.
[Seales95] Seales, B. Building Three-Dimensional Object Models from Image Sequences. Computer Vision and Image Understanding, 61, 1995, pp. 308-324.
[Seay99] Seay, A.F., D. Krum, W. Ribarsky, and L. Hodges. Multimodal Interaction Techniques for the Virtual Workbench. Proceedings of CHI '99.
[Sharma97] Sharma, R. and J. Molineros. Computer vision based augmented reality for guiding manual assembly. Presence, 6(3) (1997).
[Srivastava90] Srivastava, S.K. and N. Ahuja. An Algorithm for Generating Octrees from Object Silhouettes in Perspective Views. Computer Vision, Graphics and Image Processing, 49(1), pp. 68-84 (1990).
[Starner98] Starner, T., J. Weaver, and A. Pentland. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE PAMI, 20(12), pp. 1371-1375 (1998).
[Starner00] Starner, T., B. Leibe, B. Singletary, and J. Pair. MIND-WARPING: Towards Creating a Compelling Collaborative Augmented Reality Gaming Interface through Wearable Computers and Multi-modal Input and Output. IEEE International Conference on Intelligent User Interfaces (IUI 2000), 2000.
[Sturman92] Sturman, D. Whole-hand input. PhD Thesis, MIT Media Lab (1992).
[Sullivan98] Sullivan, S. and J. Ponce. Automatic Model Construction, Pose Estimation, and Object Recognition from Photographs Using Triangular Splines. IEEE PAMI, 20(10), pp. 1091-1097 (1998).
[TWIN95] TWIN Solid Modeling Package Reference Manual. Computer Aided Design and Graphics Laboratory (CADLAB), School of Mechanical Engineering, Purdue University, 1995. http://cadlab.www.ecn.purdue.edu/cadlab/twin/TWIN_TOC.html
[Ullmer97] Ullmer, B. and H. Ishii. The metaDESK: Models and Prototypes for Tangible User Interfaces. Proceedings of UIST '97, October 14-17, 1997.
[Underkoffler98] Underkoffler, J. and H. Ishii. Illuminating Light: An Optical Design Tool with a Luminous-Tangible Interface. Proceedings of CHI '98, April 18-23, 1998.
[vdPol99] van de Pol, R., W. Ribarsky, L. Hodges, and F. Post. Interaction in Semi-Immersive Large Display Environments. Report GIT-GVU-98-30, Virtual Environments '99, pp. 157-168 (Springer, Wien, 1999).
[Vogler98] Vogler, C. and D. Metaxas. ASL Recognition Based on a Coupling between HMMs and 3D Motion Analysis. Sixth International Conference on Computer Vision, pp. 363-369 (1998).
[Wartell99a] Wartell, Z., W. Ribarsky, and L.F. Hodges. Third Person Navigation of Whole-Planet Terrain in a Head-tracked Stereoscopic Environment. Report GIT-GVU-98-31, IEEE Virtual Reality 99, pp. 141-149 (1999).
[Wartell99b] Wartell, Z., L. Hodges, and W. Ribarsky. Distortion in Head-Tracked Stereoscopic Displays Due to False Eye Separation. Report GIT-GVU-99-01, SIGGRAPH 99, pp. 351-358 (1999).
[Wellner93] Wellner, P. Interacting with paper on the DigitalDesk. Communications of the ACM, 36(7), pp. 86-89 (1993).
[Wren95] Wren, C., F. Sparacino, A. Azarbayejani, T. Darrell, T. Starner, A. Kotani, C. Chao, M. Hlavac, K. Russell, and A. Pentland. Perceptive Spaces for Performance and Entertainment: Untethered Interaction Using Computer Vision and Audition. Applied Artificial Intelligence, 11(4), pp. 267-284 (1995).
[Wren97] Wren, C., A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-Time Tracking of the Human Body. IEEE PAMI, 19(7), pp. 780-785 (1997).