Intelligent Image Processing
Steve Mann
University of Toronto
Copyright 2002 by John Wiley & Sons, Inc. All rights reserved.
ISBNs: 0-471-40637-6 (Hardback); 0-471-22163-5 (Electronic)
For more information about Wiley products, visit our web site at www.Wiley.com.
CONTENTS
Preface xv
Appendixes
A Safety First! 295
Bibliography 332
Index 341
PREFACE
This book has evolved from the author’s course on personal imaging taught at the
University of Toronto since fall 1998. It also presents original material from the
author’s own experience in inventing, designing, building, and using wearable
computers and personal imaging systems since the early 1970s.
The idea behind this book is to provide the student with the fundamental
knowledge needed in the rapidly growing field of personal imaging. This field is
often referred to colloquially as wearable computing, mediated (or augmented)
‘reality,’ personal technologies, mobile multimedia, and so on. Rather than trying
to address all aspects of personal imaging, the book places a particular emphasis
on the fundamentals.
New concepts of image content are essential to multimedia communications.
Human beings obtain their main sensory information from their visual system.
Accordingly, visual communication is essential for creating an intimate
connection between the human and the machine. Visual information processing also
provides the greatest technical challenges because of the bandwidth and
complexity that is involved.
A computationally mediated visual reality is a natural extension of the next-
generation computing machines. Already we have witnessed a pivotal shift from
mainframe computers to personal/personalizable computers owned and operated
by individual end users. We have also witnessed a fundamental change in the
nature of computing from large mathematical “batch job” calculations to the
use of computers as a communications medium. The explosive growth of the
Internet (which is primarily a communications medium as opposed to a
calculations medium), and more recently the World Wide Web, is a harbinger of what
will evolve into a completely computer-mediated world. Likely in the immediate
future we will see all aspects of life handled online and connected.
This will not be done by implanting devices into the brain — at least not in
this course — but rather by noninvasively “tapping” the highest bandwidth “pipe”
into the brain, namely the eye. This “eye tap” forms the basis for devices that are
being currently built into eyeglasses (prototypes are also being built into contact
lenses) to tap into the mind’s eye.
STEVE MANN
University of Toronto
1
HUMANISTIC INTELLIGENCE AS A BASIS FOR INTELLIGENT IMAGE PROCESSING
There are two kinds of constancy: one is called operational constancy, and
the other is called interactional constancy [2]. Operational constancy also refers
to an always ready-to-run condition, in the sense that although the apparatus may
have power-saving (“sleep”) modes, it is never completely “dead” or shut down
or in a temporarily inoperable state that would require noticeable time from which
to be “awakened.”
The other kind of constancy, called interactional constancy, refers to a
constancy of user-interface. It is the constancy of user-interface that separates
systems embodying a personal imaging architecture from other personal devices,
such as pocket calculators, personal digital assistants (PDAs), and other imaging
devices, such as handheld video cameras.
For example, a handheld calculator left turned on but carried in a shirt pocket
lacks interactional constancy, since it is not always ready to be interacted with
(e.g., there is a noticeable delay in taking it out of the pocket and getting ready
to interact with it). Similarly a handheld camera that is either left turned on or is
designed such that it responds instantly, still lacks interactional constancy because
it takes time to bring the viewfinder up to the eye in order to look through it. In
order for it to have interactional constancy, it would need to always be held up
to the eye, even when not in use. Only if one were to walk around holding the
camera viewfinder up to the eye during every waking moment could we say that it
has true interactional constancy at all times.
By interactionally constant, what is meant is that the inputs and outputs of the
device are always potentially active. Interactionally constant implies operationally
constant, but operationally constant does not necessarily imply interactionally
constant. The examples above of a pocket calculator worn in a shirt pocket, and
left on all the time, or of a handheld camera even if turned on all the time, are said
to lack interactional constancy because they cannot be used in this state (e.g., one
still has to pull the calculator out of the pocket or hold the camera viewfinder up
to the eye to see the display, enter numbers, or compose a picture). A wristwatch
is a borderline case. Although it operates constantly in order to continue to keep
proper time, and it is wearable, one must make some degree of conscious effort
to orient it within one’s field of vision in order to interact with it.
radar screen, envisioned that the cathode ray screen could also display letters
of the alphabet, as well as computer-generated pictures and graphical content,
and thus envisioned computing as an interactive experience for manipulating
words and pictures. Engelbart envisioned the mainframe computer as a tool for
augmented intelligence and augmented communication, in which a number of
people in a large amphitheatre could interact with one another using a large
mainframe computer [11,12]. While Engelbart himself did not seem to understand
the significance of the personal computer, his ideas are certainly embodied in
modern personal computing.
What is now described is a means of realizing a similar vision, but with
the computational resources re-situated in a different context, namely the
truly personal space of the user. The idea here is to move the tools of
augmented intelligence, augmented communication, computationally mediated
visual communication, and imaging technologies directly onto the body. This will
give rise not only to a new genre of truly personal image computing but also to some
new capabilities and affordances arising from direct physical contact between
the computational imaging apparatus and the human mind and body. Most
notably, a new family of applications arises, categorized as “personal imaging,”
in which the body-worn apparatus facilitates an augmenting and computational
mediating of the human sensory capabilities, namely vision. Thus the augmenting
of human memory translates directly to a visual associative memory in which
the apparatus might, for example, play previously recorded video back into the
wearer’s eyeglass mounted display, in the manner of a visual thesaurus [13] or
visual memory prosthetic [14].
Figure 1.1 The three basic operational modes of WearComp. (a) Signal flow paths for a
computer system that runs continuously, constantly attentive to the user’s input, and constantly
providing information to the user. Over time, constancy leads to a symbiosis in which the user
and computer become part of each other’s feedback loops. (b) Signal flow path for augmented
intelligence and augmented reality. Interaction with the computer is secondary to another
primary activity, such as walking, attending a meeting, or perhaps doing something that
requires full hand-to-eye coordination, like running down stairs or playing volleyball. Because
the other primary activity is often one that requires the human to be attentive to the environment
as well as unencumbered, the computer must be able to operate in the background to augment
the primary experience, for example, by providing a map of a building interior, and other
information, through the use of computer graphics overlays superimposed on top of the
real world. (c) WearComp can be used like clothing to encapsulate the user and function
as a protective shell, whether to protect us from cold, protect us from physical attack (as
traditionally facilitated by armor), or to provide privacy (by concealing personal information
and personal attributes from others). In terms of signal flow, this encapsulation facilitates the
possible mediation of incoming information to permit solitude, and the possible mediation
of outgoing information to permit privacy. It is not so much the absolute blocking of these
information channels that is important; it is the fact that the wearer can control to what extent,
and when, these channels are blocked, modified, attenuated, or amplified, in various degrees,
that makes WearComp much more empowering to the user than other similar forms of portable
computing. (d) An equivalent depiction of encapsulation (mediation) redrawn to give it a similar
form to that of (a) and (b), where the encapsulation is understood to comprise a separate
protective shell.
between the human and computers makes it harder to attack directly, for
example, as one might look over a person’s shoulder while they are typing
or hide a video camera in the ceiling above their keyboard.1
Because of its ability to encapsulate us, such as in embodiments of
WearComp that are actually articles of clothing in direct contact with our
flesh, it may also be able to make measurements of various physiological
quantities. Thus the signal flow depicted in Figure 1.1a is also enhanced by
the encapsulation as depicted in Figure 1.1c. To make this signal flow more
explicit, Figure 1.1c has been redrawn, in Figure 1.1d, where the computer
and human are depicted as two separate entities within an optional protective
shell that may be opened or partially opened if a mixture of augmented and
mediated interaction is desired.
Note that these three basic modes of operation are not mutually exclusive in the
sense that the first is embodied in both of the other two. These other two are also
not necessarily meant to be implemented in isolation. Actual embodiments of
WearComp typically incorporate aspects of both augmented and mediated modes
of operation. Thus WearComp is a framework for enabling and combining various
aspects of each of these three basic modes of operation. Collectively, the space of
possible signal flows giving rise to this entire space of possibilities is depicted in
Figure 1.2. The signal paths typically comprise vector quantities. Thus multiple
parallel signal paths are depicted in this figure to remind the reader of this vector
nature of the signals.
[Figure 1.2 diagram: six signal flow paths between Human and Computer, labeled Unmonopolizing, Unrestrictive, Controllable, Observable, Attentive, and Communicative.]
Figure 1.2 Six signal flow paths for the new mode of human–computer interaction provided
by WearComp. These six signal flow paths each define one of the six attributes of WearComp.
1 For the purposes of this discussion, privacy is not so much the absolute blocking or concealment of
personal information; rather, it is the ability to control or modulate this outbound information channel.
For example, one may want certain members of one’s immediate family to have greater access to
personal information than the general public. Such a family-area network may be implemented with
an appropriate access control list and a cryptographic communications protocol.
technology. Computer systems will become part of our everyday lives in a much
more immediate and intimate way than in the past.
Physical proximity and constancy were simultaneously realized by the
WearComp project2 of the 1970s and early 1980s (Figure 1.3). This was a first
attempt at building an intelligent “photographer’s assistant” around the body,
and it comprised a computer system attached to the body. A display means was
constantly visible to one or both eyes, and the means of signal input included a
series of pushbutton switches and a pointing device (Figure 1.4) that the wearer
could hold in one hand to function as a keyboard and mouse do, but still be able
to operate the device while walking around. In this way the apparatus re-situated
the functionality of a desktop multimedia computer with mouse, keyboard, and
video screen, as a physical extension of the user’s body. While the size and
weight reductions of WearComp over the last 20 years have been quite dramatic,
the basic qualitative elements and functionality have remained essentially the
same, apart from the obvious increase in computational power.
However, what makes WearComp particularly useful in new and interesting
ways, and what makes it particularly suitable as a basis for HI, is the collection of
other input devices. Not all of these devices are found on a desktop multimedia
computer.
Figure 1.3 Early embodiments of the author’s original ‘‘photographer’s assistant’’ application
of personal imaging. (a) Author wearing WearComp2, an early 1980s backpack-based
signal-processing and personal imaging system with right eye display. Two antennas operating
at different frequencies facilitated wireless communications over a full-duplex radio link. (b)
WearComp4, a late 1980s clothing-based signal processing and personal imaging system with
left eye display and beamsplitter. Separate antennas facilitated simultaneous voice, video, and
data communication.
2 For a detailed historical account of the WearComp project, and other related projects, see [19,20].
Figure 1.4 Author using some early input devices (‘‘keyboards’’ and ‘‘mice’’) for WearComp.
(a) 1970s: Input device comprising pushbutton switches mounted to a wooden hand-grip.
(b) 1980s: Input device comprising microswitches mounted to the handle of an electronic
flash. These devices also incorporated a detachable joystick (controlling two potentiometers),
designed as a pointing device for use in conjunction with the WearComp project.
The last three, in particular, are not found on standard desktop computers, and
even the first three, which often are found on standard desktop computers, appear
in a different context in WearComp than they do on a desktop computer. For
example, in WearComp the camera does not show an image of the user, as it
does typically on a desktop computer, but rather it provides information about
the user’s environment. Furthermore the general philosophy, as will be described
in Chapter 4, is to regard all of the input devices as measurement devices. Even
something as simple as a camera is regarded as a measuring instrument within
the proposed signal-processing framework.
Certain applications use only a subset of these devices, but including all of
them in the design facilitates rapid prototyping and experimentation with new
applications. Most embodiments of WearComp are modular so that devices can
be removed when they are not being used.
A side effect of this WearComp apparatus is that it replaces much of the
personal electronics that we carry in our day-to-day living. It enables us to interact
with others through its wireless data communications link, and therefore replaces
the pager and cellular telephone. It allows us to perform basic computations,
and thus replaces the pocket calculator, laptop computer, and personal data
assistant (PDA). It can record data from its many inputs, and therefore it replaces
and subsumes the portable dictating machine, camcorder, and the photographic
camera. And it can reproduce (“play back”) audiovisual data, so it subsumes the
portable audio cassette player. It keeps time, as any computer does, and this may
be displayed when desired, rendering a wristwatch obsolete. (A calendar program
that produces audible, vibrotactile, or other output also renders the alarm clock
obsolete.)
However, WearComp goes beyond replacing all of these items, because
not only is it currently far smaller and far less obtrusive than the sum
of what it replaces, but these functions are interwoven seamlessly, so that
they work together in a mutually assistive fashion. Furthermore entirely new
functionalities, and new forms of interaction, arise such as enhanced sensory
capabilities.
The wearable signal-processing apparatus of the 1970s and early 1980s was
cumbersome at best. An effort was directed toward not only reducing its size and
weight but, more important, reducing its undesirable and somewhat obtrusive
appearance. An effort was also directed at making an apparatus of a given
size and weight more comfortable to wear and bearable to the user [1] by
bringing components in closer proximity to the body, thereby reducing torques
and moments of inertia. Starting in 1982, Eleveld and Mann [20] began to build
circuitry directly into clothing. The term “smart clothing” refers to variations of
WearComp that are built directly into clothing and are characterized by (or at
least an attempt at) making components distributed rather than lumped, whenever
possible or practical.
It was found [20] that the same apparatus could be made much more
comfortable by bringing the components closer to the body. This had the effect
of reducing the torque felt bearing the load as well as the moment of inertia felt
in moving around.
More recent related work by others [22] also involves building circuits into
clothing. A garment is constructed as a monitoring device to determine the
location of a bullet entry. The WearComp differs from this monitoring apparatus
in the sense that the WearComp is totally reconfigurable in the field, and also
in the sense that it embodies HI (the apparatus reported in [22] performs a
monitoring function but does not facilitate wearer interaction, and therefore is
not an embodiment of HI).
Figure 1.5 Author’s personal imaging system equipped with sensors for measuring
biological signals. The sunglasses in the upper right are equipped with built-in video cameras
and a display system. These look like ordinary sunglasses when worn (wires are concealed
inside the eyeglass holder). At the left side of the picture is an 8 channel analog to digital
converter together with a collection of biological sensors, both manufactured by Thought
Technologies Limited, of Canada. At the lower right is an input device called the ‘‘twiddler,’’
manufactured by HandyKey, and to the left of that is a Sony Lithium Ion camcorder battery with
custom-made battery holder. In the lower central area of the image is the computer, equipped
with special-purpose video-processing/video capture hardware (visible as the top stack on this
stack of PC104 boards). This computer, although somewhat bulky, may be concealed in the
small of the back, underneath an ordinary sweater. To the left of the computer, is a serial to
fiber-optic converter that provides communications to the 8 channel analog to digital converter
over a fiber-optic link. Its purpose is primarily one of safety, to isolate high voltages used in
the computer and peripherals (e.g., the 500 volts or so present in the sunglasses) from the
biological sensors, which are in close proximity, and typically with very good connection, to the
body of the wearer.
3 The first wearable computers equipped with multichannel biosensors were built by the author
during the 1980s inspired by a collaboration with Dr. Ghista of McMaster University. Later, in
1995, the author put together an improved apparatus based on a Compaq Contura Aero 486/33 with
a ProComp eight channel analog to digital converter, worn in a Mountainsmith waist bag, and
sensors from Thought Technologies Limited. The author subsequently assisted Healey in duplicating
this system for use in trying to understand human emotions [23].
2
WHERE ON THE BODY IS THE BEST PLACE FOR A PERSONAL IMAGING SYSTEM?
This chapter considers the question of where to place the sensory and display
apparatus on the body. Although the final conclusion is that both should be placed,
effectively, right within the eye itself, various other possibilities are considered and
explained first. In terms of the various desirable properties, the apparatus should be:
• Covert: It must not have an unusual appearance that may cause objections
or ostracization. It is known, for
instance, that blind or visually challenged persons are very concerned about
their physical appearance notwithstanding their own inability to see their
own appearance.
• Incidentalist: Others cannot determine whether or not the apparatus is in
use, even when it is not entirely covert. For example, its operation should
not convey an outward intentionality.
• Natural: The apparatus must provide a natural user interface, such as may
be given by a first-person perspective.
• Cybernetic: It must not require conscious thought or effort to operate.
These attributes are desired over a range of operational modes rather than fixed at one
point in that range. Thus, for example, it may at times be desirable for the apparatus to be highly
visible, as when it is used as a personal safety device to deter crime. Then
one may wish it to be very obvious that video is being recorded and transmitted.
So ideally in these situations the desired attributes are affordances rather than
constraints. For example, the apparatus may be ideally covert but with an additional
means of making it obvious when desired. Such an additional means may include a
display viewable by others, or a blinking red light indicating transmission of video
data. Thus the system would ideally be operable over a wide range of obviousness
levels, over a wide range of incidentalism levels, and the like.
• Wet plate process: Large glass plates that must be prepared in a darkroom
tent. Apparatus requires mule-drawn carriages or the like for transport.
• Dry plates: Premade individual sheets, typically 8 by 10 or 4 by 5 inches,
were available, so it was possible for one person to haul the apparatus in a
backpack.
• Film: A flexible image recording medium that is also available in rolls so
that it can be moved through the camera with a motor. Apparatus may be
carried easily by one person.
• Electronic imaging: For example, Vidicon tube recording on analog
videotape.
• Advanced electronic imaging: For example, solid state sensor arrays, image
capture on computer hard drives.
• Laser EyeTap: The eye itself is made to function as the camera, to
effortlessly capture whatever one looks at. The size and weight of the apparatus
is negligible. It may be controlled by brainwave activity, using biofeedback,
so that pictures are taken automatically during exciting moments in life.
Originally, only pictures of very important people or events were ever recorded.
However, imaging became more personal as cameras became affordable and more
pervasive, leading to the concept of family albums. It is known that when there
is a fire or flood, the first thing that people will try to save is the family photo
album. It is considered priceless and irreplaceable to the family, yet family albums
often turn up in flea markets and church sales for pennies. Clearly, the value of
one’s collection of pictures is a very personal matter; that is, family albums are
often of little value outside the personal context. Accordingly an important aspect
of personal imaging is the individuality, and the individual personal value of the
picture as a prosthesis of memory.
Past generations had only a handful of pictures, perhaps just one or two glass
plates that depicted important points in their life, such as a wedding. As cameras
became cheaper, people captured images in much greater numbers, but still a
small enough number to easily sort through and paste into a small number of
picture albums.
However, today’s generation of personal imaging devices includes handheld
digital cameras that double as still and movie cameras, and often capture
thousands of pictures before any need arises to delete any of them. The family of the
future will be faced with a huge database of images, and thus there are obvious
problems with storage, compression, retrieval, and sorting, for example.
Tomorrow’s generation of personal imaging devices will include mass-
produced versions of the special laser EyeTap eyeglasses that allow the eye
itself to function as a camera, as well as contact lens computers that might
capture a person’s entire life on digital video. These pictures will be transmitted
wirelessly to friends and relatives, and the notion of a family album will be far
more complete, compelling, and collaborative, in the sense that it will be a shared
real-time videographic experience space.
Personal imaging is not just about family albums, though. It will also
radically change the way large-scale productions such as feature-length movies
are made. Traditionally movie cameras were large and cumbersome, and were
fixed to heavy tripods. With the advent of the portable camera, it was possible
to capture real-world events. Presently, as cameras, even professional
cameras, get smaller and lighter, a new “point-of-eye” genre has emerged.
Sports and other events can now be covered from the eye perspective of the
participant so that the viewer feels as if he or she is actually experiencing
the event. This adds a personal element to imaging. Thus personal imaging
also suggests new photographic and movie genres. In the future it will be
possible to incorporate an EyeTap camera of sufficiently high resolution into
a contact lens so that a high-quality cinematographic experience can be
recorded.
This chapter addresses the fundamental question as to where on the body a
personal imaging system is best located. The chapter follows an organization
given by the following evolution from portable imaging systems to EyeTap
mediated reality imaging systems:
Imaging systems have evolved from once cumbersome cameras with large glass
plates to portable film-based systems.
Next these portable cameras evolved into small handheld devices that could be
operated by one person. The quality and functionality of modern cameras allow
a personal imaging system to replace an entire film crew. This gave rise to new
genres of cinematography and news reporting.
of bringing the camera up to the eye. Even if the size of the camera could be
reduced to the point of being negligible (e.g., suppose that the whole apparatus is
made no bigger than the eyecup of a typical camera viewfinder), the very gesture
of bringing a device up to the eye would still be unnatural and would attract
considerable attention, especially in large public establishments like department
stores, or establishments owned by criminal or questionable organizations (some
gambling casinos come to mind) where photography is often prohibited.
However, it is in these very establishments in which a visitor or customer
may wish to have a video record of the clerk’s statement of the refund policy
or the terms of a sale. Just as department stores often keep a video recording
of all transactions (and often even a video recording of all activity within the
establishment, sometimes including a video recording of customers in the fitting
rooms), the goal of the present invention is to assist a customer who may wish
to keep a video record of a transaction, interaction with a clerk, manager, refund
explanation, or the like.
Already there exists a variety of covert cameras, such as a camera concealed
beneath the jewel of a necktie clip, cameras concealed in baseball caps, and
cameras concealed in eyeglasses. However, such cameras tend to produce inferior
images, not just because of the technical limitations imposed by their small
size but, more important, because they lack a viewfinder system (a means of
viewing the image to adjust camera angle, orientation, exposure, etc., for the
best composition). Because of the lack of a viewfinder system, the subject matter
of traditional covert cameras is not necessarily centered well in the viewfinder,
or even captured by the camera at all, and thus these covert cameras are not well
suited to personal documentary or for use in a personal photographic/videographic
memory assistant or a personal machine vision system.
[Figure 2.1 diagram labels: main camera, auxiliary camera, pen, main screen, auxiliary screen, battery pack, communications system, and computer.]
Figure 2.1 Diagram of a simple embodiment of the invention having a camera borne by a
personal digital assistant (PDA). The PDA has a separate display attached to it to function as a
viewfinder for the camera.
[Figure 2.2 diagram labels: paper sheet to conceal screen, camera, viewfinder screen, writing surface, pen, battery pack, communications system, and computer.]
Figure 2.2 Diagram of an alternate embodiment of the system in which a graphics tablet is
concealed under a pad of paper and an electronic pen is concealed inside an ordinary ink pen
so that all of the writing on the paper is captured and recorded electronically together with
video from the subject in front of the user of the clipboard while the notes are being taken.
way they are neither hidden nor visible, but rather they serve as an uncertain
deterrent to criminal conduct. While they could easily be hidden inside smoke
detectors, ventilation slots, or small openings, the idea of the dome is to make the
camera conceptually visible yet completely hidden. In a similar manner a large
lens opening on the clipboard may, at times, be desirable, so that the subject will
be reminded that there could be a recording but will be uncertain as to whether
or not such a recording is actually taking place. Alternatively, a large dark shiny
plexiglass strip, made from darkly smoked plexiglass (typically 1 cm high and
22 cm across), is installed across the top of the clipboard as a subtle yet visible
deterrent to criminal behavior. One or more miniature cameras are then installed
behind the dark plexiglass, looking forward through it. In other embodiments, a
camera is installed in a PDA, and then the top of the PDA is covered with dark
smoky plexiglass.
The video camera (see Fig. 2.1) captures a view of a person standing in front
of the user of the PDA and displays the image on an auxiliary screen, which may
be easily concealed by the user’s hand while the user is writing or pretending to
write on the PDA screen. In commercial manufacture of this device the auxiliary
screen may not be necessary; it may be implemented as a window displaying
the camera’s view on a portion of the main screen, or overlaid on the main
screen. Annotations made on the main screen are captured and stored together
with videoclips from the camera so that there is a unified database in which
the notes and annotations are linked with the video. An optional second camera
may be present if the user wishes to make a video recording of himself/herself
while recording another person with the main camera. In this way, both sides
of the conversation may be simultaneously recorded by the two cameras. The
resulting recordings could be edited later, and there could be a cut back and
forth between the two cameras to follow the natural flow of the conversation.
Such a recording might, for example, be used for an investigative journalism story
on corrupt organizations. In the early research prototypes, an additional wire was
run up the sleeve of the user into a separate body-worn pack powered by its own
battery pack. The body-worn pack typically contained a computer system which
houses video capture hardware and is connected to a communications system
with packet radio terminal node controller (high-level data link controller with
modem) and radio; this typically establishes a wireless Internet connection. In the
final commercial embodiment of this invention, the body-worn pack will likely
disappear, since this functionality would be incorporated into the handheld device
itself.
The clipboard version of this invention (Fig. 2.2) is fitted with an electronic
display system that includes the capability of displaying the image from the
camera. The display serves then as a viewfinder for aiming the camera at the
subject. Moreover the display is constructed so that it is visible only to the user
of the clipboard or, at least, so that the subject of the picture cannot readily see
the display. Concealment of the display may be accomplished through the use of
a honeycomb filter placed over the display. Such honeycomb filters are common
in photography, where they are placed over lights to make the light sources
behave more directionally. They are also sometimes placed over traffic lights
at a wye intersection, so that the lights can be seen from only one direction
and do not confuse drivers on another branch of the wye
intersection that faces almost the same way. Alternatively, the display may be
designed to provide an inherently narrow field of view, or other barriers may be
constructed to prevent the subject from seeing the screen.
The video camera (see Fig. 2.2) displays on a miniature screen mounted to
the clipboard. A folded-back piece of paper conceals the screen. The rest of
the sheets of paper are placed slightly below the top sheet so that the user can
write on them in a natural fashion. From the perspective of someone facing the
user (the subject), the clipboard will have the appearance of a normal clipboard
in which the top sheet appears to be part of the stack. The pen is a combined
electronic pen and real pen so that the user can simultaneously write on the paper
with real ink, as well as make an electronic annotation by virtue of a graphics
tablet below the stack of paper, provided that the stack is not excessively thick.
In this way there is a computer database linking the real physical paper with
its pen strokes and the video recorded of the subject. From a legal point of
view, real physical pen strokes may have some forensic value that the electronic
material may not (e.g., if the department store owner asks the customer to sign
something, or even just to sign for a credit card transaction, the customer may
place it over the pad and use the special pen to capture the signature in the
customer’s own computer and index it to the video record). In this research
prototype there is a wire going from the clipboard, up the sleeve of the user.
This wire would be eliminated in the commercially produced version of the
apparatus, by construction of a self-contained video clipboard with miniature
built-in computer, or by use of a wireless communications link to a very small
body-worn intelligent image-processing computer.
The function of the camera is integrated with the clipboard. This way textual
information, as well as drawings, may be stored in a computer system, together
with pictures or videoclips. (Hereafter still pictures and segments of video will
both be referred to as videoclips, with the understanding that a still picture is just
a video sequence that is one frame in length.)
Since videoclips are stored in the computer together with other information,
these videoclips may be recalled by an associative memory working together
with that other information. Thus tools like the UNIX “grep” command may
be applied to videoclips by virtue of the associated textual information which
typically resides as a videographic header. For example, one can grep for the
word “meijer,” and may find various videoclips taken during conversations with
clerks in the Meijer department store. Thus such a videographic memory system
may give rise to a memory recall of previous videoclips taken during previous
visits to this department store, provided that one has been diligent enough to
write down (e.g., enter textually) the name of the department store upon each
visit.
Videoclips are typically time-stamped (e.g., there exist file creation dates) and
GPS-stamped (e.g., there exist global positioning system headers from the last valid
readout) so that one can search on setting (time + place).
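As a rough sketch of how such associative retrieval over videoclip headers might be realized (the header fields, helper functions, and in-memory list below are hypothetical illustrations, not the apparatus’s actual header format):

# Minimal sketch of grep-like retrieval over videoclip headers (hypothetical fields).
from dataclasses import dataclass
from datetime import datetime
from math import hypot

@dataclass
class VideoClip:
    path: str            # file containing the clip (a still is a one-frame clip)
    text: str            # textual annotations entered by the wearer
    timestamp: datetime  # file creation date
    lat: float           # last valid GPS readout
    lon: float

def grep_clips(clips, keyword):
    """Return clips whose textual header contains the keyword (case-insensitive)."""
    return [c for c in clips if keyword.lower() in c.text.lower()]

def clips_near(clips, when, lat, lon, hours=24, radius_deg=0.01):
    """Return clips taken near a given setting (time + place)."""
    return [c for c in clips
            if abs((c.timestamp - when).total_seconds()) < hours * 3600
            and hypot(c.lat - lat, c.lon - lon) < radius_deg]

# Example: find clips of conversations recorded at the Meijer department store.
# meijer_clips = grep_clips(all_clips, "meijer")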
Thus the video clipboard may be programmed so that the act of simply taking
notes causes previous related videoclips to play back automatically in a separate
window (in addition to the viewfinder window, which should always remain
active for continued proper aiming of the camera). Such a video clipboard may,
for example, assist in a refund explanation by providing the customer with an
index into previous visual information to accompany previous notes taken during
a purchase. This system is especially beneficial when encountering department
store representatives who do not wear name tags and who refuse to identify
themselves by name (as is often the case when they know they have done
something wrong, or illegal).
This apparatus allows the user to take notes with pen and paper (or pen and
screen) and continuously record video together with the written notes. Even
if there is insufficient memory to capture a continuous video recording, the
invention can be designed so that the user will always end up with the ability to
produce a picture from something that was seen a couple of minutes ago. This
may be useful to everyone in the sense that we may not want to miss a great
photo opportunity, and often great photo opportunities only become known to
us after we have had time to think about something we previously saw. At the
very least, if, for example, a department store owner or manager becomes angry
and insulting to the customer, the customer may retroactively record the event
by opening a circular buffer.
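One simple way to realize this kind of retroactive capture is a fixed-length circular buffer that continuously overwrites its oldest frames; “opening” the buffer simply freezes its current contents for saving. A minimal sketch (the frame rate, buffer length, and class name are illustrative assumptions):

from collections import deque

class RetroactiveRecorder:
    """Keep only the last few minutes of video so an event can be saved after the fact."""
    def __init__(self, fps=30, seconds=120):
        self.buffer = deque(maxlen=fps * seconds)  # oldest frames drop off automatically

    def add_frame(self, frame):
        self.buffer.append(frame)  # called continuously while the camera runs

    def open_buffer(self):
        """Freeze and return the retained frames, e.g., when an incident occurs."""
        return list(self.buffer)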
allows the wearer to place the wrist upon a countertop and rotate the entire arm
and wrist about a fixed point. Either embodiment is well suited to shooting a
high-quality panoramic picture or orbit of an official behind a high counter, as
is typically found at a department store, bank, or other organization.
Moreover the invention may perform other useful tasks such as functioning as
a personal safety device and crime deterrent by virtue of its ability to maintain
a video diary transmitted and recorded at multiple remote locations. As a tool
for photojournalists and reporters, the invention has clear advantages over other
competing technologies.
and finally appeared on the cover of Linux Journal, July 2000, issue 75, together
with a feature article.
Although it was a useful invention, the idea of a wristwatch videoconferencing
computer is fundamentally flawed, not so much because of the difficulty in
inventing, designing, and building it but rather because it is difficult to operate
without conscious thought and effort. In many ways the wristwatch computer
was a failure not because of technology limitations but because it was not a very
good idea to start with, when the goal is constant online connectivity that drops
below the conscious level of awareness. The failure arose because of the need to
lift the hand and shift focus of attention to the wrist.
allow the apparatus to provide operational modes that drop below the conscious
level of awareness. However, before we consider eyeglass-based systems, let us
consider some other possibilities, especially in situations where reality only needs
to be augmented (e.g., where nothing needs to be mediated, filtered, or blocked
from view).
The telepointer is one such other possibility. The telepointer is a wearable
hands-free, headwear-free device that allows the wearer to experience a visual
collaborative telepresence, with text, graphics, and a shared cursor, displayed
directly on real-world objects. A mobile person wears the device clipped onto
his tie, which sends motion pictures to a video projector at a base (home) where
another person can see everything the wearer sees. When the person at the base
points a laser pointer at the projected image of the wearer’s site, the wearer’s
aremac’s1 servo points a laser at the same thing the wearer is looking at. It is
completely portable and can be used almost anywhere, since it does not rely on
infrastructure. It is operated through a reality user interface (RUI) that allows the
person at the base to have direct interaction with the real world of the wearer,
establishing a kind of computing that is completely free of metaphors, in the
sense that a laser at the base controls the wearable laser aremac.
1 An aremac is to a projector as a camera is to a scanner. The aremac directs light at 3-D objects.
[Figure 2.4 block diagram: SUBJECT MATTER, WEAR STATION (WEAR CAM, COMP, AREMAC), and BASE STATION (BASE CAM, COMP, PROJ., SCREEN), with the light and signal paths between them.]
Figure 2.4 Telepointer system for collaborative visual telepresence without the need for
eyewear or headwear or infrastructural support: The wearable apparatus is depicted on the
left; the remote site is depicted on the right. The author wears the WEAR STATION, while his
wife remotely watches on a video projector, at BASE STATION. She does not need to use a
mouse, keyboard, or other computerlike device to interact with the author. She simply points
a laser pointer at objects displayed on the SCREEN. For example, while the author is shopping,
she can remotely see what’s in front of him projected on the livingroom wall. When he’s
shopping, she sees pictures of the grocery store shelves transmitted from the grocery store to
the livingroom wall. She points her laser pointer at these images of objects, and this pointing
action teleoperates a servo-mounted laser pointer in the apparatus worn by the author. When
she points her laser pointer at the picture of the 1% milk, the author sees a red dot appear on
the actual carton of 1% milk in the store. The user interface metaphor is very simple, because
there is none. This is an example of a reality user interface: when she points her laser at an
image of the milk carton, the author’s laser points at the milk carton itself. Both parties see
their respective red dots in the same place. If she scribbles a circle around the milk carton, the
author will see the same circle scribbled around the milk carton.
[Figure 2.5 diagram: BASE STATION with SCREEN, laser BASE POINT, and PICTURED SUBJECT MATTER; WEAR STATION with AREMAC LASER, azimuth (AZ.) and elevation (EL.) galvos, and WEAR POINT on the actual SUBJECT MATTER.]
Figure 2.5 Details of the telepointer (TM) aremac and its operation. For simplicity the
livingroom or manager’s office is depicted on the left, where the manager can point at
the screen with a laser pointer. The photo studio, or grocery store, as the case may be, is
depicted on the right, where a body-worn laser aremac is used to direct the beam at objects in
the scene.
Figure 2.5 illustrates how the telepointer works to use a laser pointer (e.g., in
the livingroom) to control an aremac (wearable computer controlled laser in the
grocery store). For simplicity, Figure 2.5 corresponds to only the portion of the
signal flow path shown in bold lines of Figure 2.4.
SUBJECT MATTER in front of the wearer of the WEAR STATION is transmitted and
displayed as PICTURED SUBJECT MATTER on the projection screen. The screen is
updated, typically, as a live video image in a graphical browser such as glynx,
while the WEAR STATION transmits live video of the SUBJECT MATTER.
One or more persons at the base station are sitting at a desk, or on a sofa,
watching the large projection screen, and pointing at this large projection screen
using a laser pointer. The laser pointer makes, upon the screen, a bright red dot,
designated in the figure as BASE POINT.
The BASE CAM, denoted in this figure as SCREEN CAMERA, is connected to a
vision processor (denoted VIS. PROC.) of the BASE COMP, which simply determines
the coordinates of the brightest point in the image seen by the SCREEN CAMERA. The
SCREEN CAMERA does not need to be a high-quality camera, since it will only be
used to see where the laser pointer is pointing. A cheap black-and-white camera
will suffice for this purpose.
Selection of the brightest pixel will tell us the coordinates, but a better estimate
can be made by using the vision processor to determine the coordinates of a
bright red blob, BASE POINT, to subpixel accuracy. This helps reduce the resolution
needed, so that smaller images can be used, and therefore cheaper processing
hardware and a lower-resolution camera can be used for the SCREEN CAMERA.
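A minimal sketch of this brightest-blob estimation (assuming NumPy, a single-channel image from the SCREEN CAMERA such as its red channel, and an arbitrary brightness threshold):

import numpy as np

def base_point_coordinates(img, threshold=0.8):
    """Estimate the laser dot (BASE POINT) position to subpixel accuracy.

    img: 2-D array of pixel intensities (e.g., the red channel, scaled 0..1).
    Returns (row, col) as floats: the intensity-weighted centroid of the pixels
    near the brightest value, which is more accurate than taking the single
    brightest pixel.
    """
    mask = img >= threshold * img.max()           # bright red blob
    rows, cols = np.nonzero(mask)
    weights = img[rows, cols]
    r = np.sum(rows * weights) / np.sum(weights)  # centroid row
    c = np.sum(cols * weights) / np.sum(weights)  # centroid column
    return r, c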
These coordinates are sent as signals denoted EL. SIG. and AZ. SIG. and are
received at the WEAR STATION. They are fed to a galvo drive mechanism (servo)
that controls two galvos. Coordinate signal AZ. SIG. drives azimuthal galvo AZ.
Coordinate signal EL. SIG. drives elevational galvo EL. These galvos are calibrated
by the unit denoted as GALVO DRIVE in the figure. As a result the AREMAC LASER is
directed to form a red dot, denoted WEAR POINT, on the object that the person at
the base station is pointing at from her livingroom or office.
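The mapping from the BASE POINT pixel coordinates to the AZ. SIG. and EL. SIG. values can be as simple as a calibrated linear interpolation across the wearable camera’s field of view. A sketch (the angular ranges are made-up placeholder calibration constants):

def screen_to_galvo(col, row, width, height,
                    az_range=(-20.0, 20.0), el_range=(-15.0, 15.0)):
    """Convert a BASE POINT pixel position into azimuth/elevation angles (degrees)
    for the wearable aremac's galvos.  az_range and el_range stand in for the
    calibrated angular extents of the wearer's camera field of view."""
    az = az_range[0] + (col / width) * (az_range[1] - az_range[0])
    # image row 0 is at the top, so flip the vertical axis for elevation
    el = el_range[0] + (1.0 - row / height) * (el_range[1] - el_range[0])
    return az, el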
The AREMAC LASER, together with the GALVO DRIVE and galvos EL and AZ,
comprises the device called an aremac, which is generally concealed in a brooch
pinned to a shirt, or in a tie clip attached to a necktie, or is built into a necklace.
The author generally wears this device on a necktie. The aremac and WEAR CAM
must be registered, mounted together (e.g., on the same tie clip), and properly
calibrated. The aremac and WEAR CAM are typically housed in a hemispherical
dome where the two are combined by way of beamsplitter W.B.S.
Figure 2.6 Wearable portion of apparatus, as worn by author. The necktie-mounted visual
augmented reality system requires no headwear or eyewear. The apparatus is concealed in
a smoked plexiglass dome of wine-dark opacity. The dark dome reduces the laser output to
safe levels, while at the same time making the apparatus blatantly covert. The dome matches
the decor of nearly any department store or gambling casino. When the author has asked
department store security staff what’s inside their dark ceiling domes, he’s been called
‘‘paranoid,’’ or told that they are light fixtures or temperature sensors. Now the same security
guards are wondering what’s inside this dome.
Figure 2.7 Necktie clip portion. The necktie-mounted visual augmented reality system. A
smoked plexiglass dome of wine-dark opacity is used to conceal the inner components. Wiring
from these components to a body-concealed computer runs through the crack in the front of
the shirt. The necktie helps conceal the wiring.
“Today we saw Mary Baker Eddy with one eye!” — a deliberately cryptic sentence
inserted into a commercial shortwave broadcast to secretly inform colleagues across
the Atlantic of the successful radar imaging of a building (spire of Christian Science
building; Mary Baker Eddy, founder) with just one antenna for both receiving and
transmitting. Prior to this time, radar systems required two separate antennas, one
to transmit, and the other to receive.
The telepointer, the necktie-worn dome (“tiedome”) of the previous section, bears
a great similarity to radar and to how radar in general works. In many ways the
telepointer tiedome is quite similar to the radomes used for radar antennas. The
telepointer was a front-facing two-way imaging apparatus. We now consider a
backward-facing imaging apparatus built into a dome that is worn on the back.
Time–frequency and q-chirplet-based signal processing is applied to data from
a small portable battery-operated pulse Doppler radar vision system designed and
built by the author. The radar system and computer are housed in a miniature
radome backpack together with video cameras operating in various spectral bands,
to be backward-looking, like an eye in the back of the head. Therefore all the
ground clutter is moving away from the radar when the user walks forward,
and is easy to ignore because the radar has separate in-phase and quadrature
channels that allow it to distinguish between negative and positive Doppler.
A small portable battery powered computer built into the miniature radome
allows the entire system to be operated while attached to the user’s body. The
fundamental hypothesis upon which the system operates is that actions such as an
attack or pickpocket by someone sneaking up behind the user, or an automobile
on a collision course from behind the user, are governed by accelerational
Figure 2.8 Sliding window Fourier transform of small but dangerous floating iceberg fragment
as seen by an experimental pulse Doppler X-band marine radar system having separate
in-phase and quadrature components. The radar output is a complex-valued signal for which
we can distinguish between positive and negative frequencies. The chosen window comprises a
family of discrete prolate spheroidal sequences [27]. The unique sinusoidally varying frequency
signature of iceberg fragments gave rise to the formulation of the w-chirplet transform [28].
Safer navigation of oceangoing vessels was thus made possible.
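The value of having separate in-phase and quadrature components can be illustrated with a complex-valued FFT, in which approaching targets produce positive Doppler frequencies and receding targets negative ones. A small synthetic sketch (assuming NumPy; the sample rate and Doppler shifts are invented for illustration):

import numpy as np

fs = 8000.0                       # sample rate (Hz), illustrative only
t = np.arange(0, 0.5, 1 / fs)

# Complex baseband radar return: I + jQ.
approaching = np.exp(2j * np.pi * 200.0 * t)     # +200 Hz Doppler (closing target)
ground_clutter = np.exp(2j * np.pi * -50.0 * t)  # -50 Hz Doppler (receding clutter)
z = approaching + 0.5 * ground_clutter

spectrum = np.fft.fftshift(np.fft.fft(z))
freqs = np.fft.fftshift(np.fft.fftfreq(len(z), 1 / fs))
peak = freqs[np.argmax(np.abs(spectrum))]
print(f"dominant Doppler: {peak:+.0f} Hz")   # positive sign means a closing target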
which is the q-chirplet transform of signal z(t) taken with a Gaussian window.
Q-chirplets are also related to the fractional Fourier transform [34].
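One commonly used form of such a Gaussian-windowed chirplet inner product is sketched below (the normalization and sign conventions here are illustrative and may differ from the exact form used in the text):

S(t_0, b, c) = \int_{-\infty}^{\infty} z(t)\, \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{(t - t_0)^2}{2\sigma^2}\right) \exp\!\left(-j 2\pi \left[\, b\,(t - t_0) + c\,(t - t_0)^2 \,\right]\right) dt,

where b is the frequency, c is the chirpiness (rate of change of instantaneous frequency), σ is the window size, and t_0 is the time origin.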
Figure 2.9 Early personal safety device (PSD) with radar vision system designed and built by
the author, as pictured on exhibit at List Visual Arts Center, Cambridge, MA (October 1997). The
system contains several sensing instruments, including radar, and camera systems operating
in various spectral bands, including infrared. The headworn viewfinder display shows what is
behind the user when targets of interest or concern appear from behind. The experience of
using the apparatus is perhaps somewhat like having eyes in the back of the head, but with
extra signal processing as the machine functions like an extension of the brain to provide visual
intelligence. As a result the user experiences a sixth or seventh sense as a radar vision system.
The antenna on the hat was for an early wireless Internet connection allowing multiple users to
communicate with each other and with remote base stations.
radomes the size of a large building rather than in sizes meant for a battery-
operated portable system.
Note that the museum artifact pictured in Figure 2.9 is a very crude early
embodiment of the system. The author has since designed and built many newer
systems that are now so small that they are almost completely invisible.
which is unlikely to change over the short time period of an attack. The instant
the attacker spots a wallet in a victim’s back pocket, the attacker may accelerate
by applying a roughly constant force (defined by his fixed degree of physical
fitness) against the constant mass of the attacker’s own body. This gives rise to
uniform acceleration which shows up as a straight line in the time–frequency
distribution.
Some examples following the principle of accelerational intentionality are
illustrated in Figure 2.10.
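The reason constant acceleration appears as a straight line in the time–frequency distribution follows directly from the Doppler relation for a monostatic radar; with λ the radar wavelength, v_0 the initial closing speed, and a the (constant) acceleration,

f_D(t) = \frac{2\,v(t)}{\lambda} = \frac{2\,(v_0 + a\,t)}{\lambda},

which is linear in t, that is, a straight line in the time–frequency plane, whenever a is constant.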
[Figure 2.11 plots: REAL and IMAGinary sample values versus sample index; Spectrogram (frequency versus time); Chirplet transform (ending frequency versus beginning frequency).]
Figure 2.11 Most radar systems do not provide separate real and imaginary components
and therefore cannot distinguish between positive and negative frequencies (e.g., whether an
object is moving toward the radar or going away from it). The author’s radar system provides
in-phase and quadrature components: REAL and IMAG (imaginary) plots for 4,000 points (half a
second) of radar data are shown. The author was walking at a brisk pace, while a car was
accelerating toward the author. From the time–frequency distribution of these data we see the
ground clutter moving away and the car accelerating toward the author. The chirplet transform
shows two distinct peaks, one corresponding to all of the ground clutter (which is all moving
away at the same speed) and the other corresponding to the accelerating car.
transform, in which the window size σ is kept constant, and the time origin t0
is also kept constant. The two degrees of freedom of frequency b and chirpiness
c are parameterized in terms of instantaneous frequency at the beginning and
end of the data record, to satisfy the Nyquist chirplet criterion [28]. Here we see
a peak for each of the two targets: the ground clutter (e.g., the whole world)
moving away; and the car accelerating toward the radar. Other examples of
chirplet transforms from the miniature radar set are shown in Figure 2.12.
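For a chirp whose phase is 2π[b(t − t_0) + c(t − t_0)²], the instantaneous frequency is b + 2c(t − t_0); so, if the time origin is taken at the start of a record of duration T, the beginning and end frequencies used as coordinates in Figures 2.11 and 2.12 relate to b and c roughly as follows (a sketch; the exact convention of the Nyquist chirplet criterion in [28] may differ):

f_{\mathrm{beg}} = b, \qquad f_{\mathrm{end}} = b + 2 c T, \qquad \text{hence} \quad c = \frac{f_{\mathrm{end}} - f_{\mathrm{beg}}}{2T}.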
Experimental Results
Radar targets were classified based on their q-chirplet transforms, with
approximately 90% accuracy, using the mathematical framework and methods described
in [28] and [35]. Some examples of the radar data are shown as time–frequency
distributions in Figure 2.14.
Figure 2.12 Chirplet transforms for ground clutter only, and pickpocket only. Ground clutter
falls in the lower left quadrant because it is moving away from the radar at both the beginning
and end of any time record (window). Note that the pickpocket is the only kind of activity
that appears in the lower right-hand quadrant of the chirplet transform. Whenever there is
any substantial energy content in this quadrant, we can be very certain there is a pickpocket
present.
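In other words, the quadrant in which the dominant chirplet peak falls already provides a crude classifier. A sketch (the decision about what counts as “substantial energy” is omitted; negative frequency denotes motion away from the radar):

def classify_peak(f_beg, f_end):
    """Crudely classify a dominant chirplet peak by its quadrant.

    f_beg, f_end: instantaneous Doppler frequency at the beginning and end of
    the time record (negative = receding, positive = approaching).
    """
    if f_beg > 0 and f_end < 0:
        return "pickpocket signature"   # approached, then retreated (lower right)
    if f_beg < 0 and f_end < 0:
        return "ground clutter"         # receding throughout (lower left)
    if f_beg > 0 and f_end > 0:
        return "approaching target"     # e.g., an oncoming car (upper right)
    return "receding, then approaching target"  # upper left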
[Figure 2.13 plots: Uncalibrated and Calibrated scatter plots of imaginary versus real components, with the corresponding frequency-versus-time distributions below each.]
Figure 2.13 The author’s home-built radar generates a great deal of distortion. Notice, for
example, that a plot of real versus imaginary data shows a strong correlation between real and
imaginary axes, and also an unequal gain in the real and imaginary axes, respectively (note
that the unequal signal strength of REAL and IMAG returns in the previous figure as well). Note
further that the dc offset gives rise to a strong signal at f = 0, even though there was nothing
moving at exactly the same speed as the author (e.g., nothing that could have given rise to a
strong signal at f = 0). Rather than trying to calibrate the radar exactly, and to remove dc offset
in the circuits (all circuits were dc coupled), and risk losing low-frequency components, the
author mitigated these problems by applying a calibration program to the data. This procedure
subtracted the dc offset inherent in the system, and computed the inverse of the complex
Choleski factorization of the covariance matrix (e.g., covz defined as covariance of real and
imaginary parts), which was then applied to the data. Notice how the CALIBRATED data forms
an approximately isotropic circular blob centered at the origin when plotted as REAL versus
IMAGinary. Notice also the removal of the mirroring in the FREQ = 0 axis in the CALIBRATED data,
which was quite strong in the UNCALIBRATED data.
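A rough NumPy rendering of that calibration procedure (dc-offset removal followed by whitening with the inverse Cholesky factor of the real/imaginary covariance; this is a sketch of the idea, not the author’s actual program):

import numpy as np

def calibrate_iq(z):
    """Calibrate raw complex radar data z = I + jQ.

    Removes the dc offset, then applies the inverse of the Cholesky factor of
    the 2x2 covariance of (real, imag) so the calibrated data forms an
    approximately isotropic circular blob centered at the origin.
    """
    z = np.asarray(z, dtype=complex)
    z = z - z.mean()                      # subtract dc offset
    ri = np.vstack([z.real, z.imag])      # 2 x N matrix of components
    cov = np.cov(ri)                      # covariance of real and imaginary parts
    L = np.linalg.cholesky(cov)           # Choleski factorization
    ri_cal = np.linalg.solve(L, ri)       # apply inverse factor (whitening)
    return ri_cal[0] + 1j * ri_cal[1]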
When both the image acquisition and image display embody a headworn first-
person perspective (e.g., computer takes input from a headworn camera and
provides output to a headworn display), a new and useful kind of experience
results, beyond merely augmenting the real world with a virtual world.
Figure 2.14 Various test scenarios were designed in which volunteers carried metal objects
to simulate weapons, or lunged toward the author with pieces of metal to simulate an attack.
Pickpockets were simulated by having volunteers sneak up behind the author and then retreat.
The ‘‘pickpocket signature’’ is a unique radar signature in which the beginning and ending
frequency fall on either side of the author's own walking Doppler frequency.
It was found that, of all the radar signatures, the pickpocket signature was the most distinctive
and the easiest to classify. The car plot in the middle of the array of plots was misclassified as a
stabbing. It appears the driver stepped on the accelerator lightly at about time 1 second. Then
just before time 3 seconds it appears that the driver had a sudden change of intentionality
(perhaps suddenly realized lateness, or perhaps suddenly saw that the coast was clear) and
stepped further on the accelerator, giving rise to an acceleration signature having two distinct
portions.
Images shot from a first-person perspective give rise to some new research
directions, notably new forms of image processing based on the fact that smaller,
lighter cameras can track and orient themselves much faster, relative to the effort
needed to move one’s entire body. The mass (and general dynamics) of the
camera is becoming small as compared to that of the body, so the ultimate limiting
factor becomes the physical constraint of moving one’s body. Accordingly the
laws of projective geometry have greater influence, allowing newer methods of
image processing to emerge.
we fail to capture on film: by the time we find the camera and load it with film,
the moment has passed us by.
2 The heart may stop, or skip a beat at first, but over time, on average, experience tells us that our
heart beats faster when frightened, unless the victim is killed in which case the apparatus should
detect the absence of a heart beat.
3 It has been suggested that the robber might become aware that his or her victim is wearing a personal
safety device and try to eliminate the device or perhaps even target it for theft. In anticipation of
these possible problems, personal safety devices operate by continuous transmission of images, so
that the assailant cannot erase or destroy the images depicting the crime. Moreover the device itself,
owing to its customized nature, would be unattractive and of little value to others, much as are
undergarments, a mouthguard, or prescription eyeglasses. Furthermore devices could be protected by
a password embedded into a CPLD that functions as a finite state machine, making them inoperable
by anyone but the owner. To protect against passwords being extracted through torture, a personal
distress password may be provided to the assailant by the wearer. The personal distress password
unlocks the system but puts it into a special tracking and distress notification mode.
Thus the concept of HI has become blurred across geographical boundaries, and
between more than one human and more than one computer.
1. The general spirit of MR, like typical AR, includes adding virtual objects
but also the desire to take away, alter, or more generally to visually
“mediate” real objects. Thus MR affords the apparatus the ability to
augment, diminish, or otherwise alter our perception of reality.
2. Typically an AR apparatus is tethered to a computer workstation that is
connected to an ac outlet, or constrains the user to some other specific
site (a workcell, helicopter cockpit, etc.). What is proposed (and reduced
to practice) in this chapter is a system that facilitates the augmenting,
diminishing, or altering of the visual perception of reality in the context of
ordinary day-to-day living.
MR uses a body-worn apparatus where both the real and virtual objects are
placed on an equal footing, in the sense that both are presented together via a
synthetic medium (e.g., a video display).
Successful implementations have been realized by viewing the real world
using a head-mounted display (HMD) fitted with video cameras, body-worn
processing, and/or bidirectional wireless communications to one or more remote
computers, or supercomputing facilities. This portability enabled various forms
of the apparatus to be tested extensively in everyday circumstances, such as while
riding the bus, shopping, banking, and various other day-to-day interactions.
The proposed approach shows promise in applications where it is desired to
have the ability to reconfigure reality. For example, color may be deliberately
diminished or completely removed from the real world at certain times when it
is desired to highlight parts of a virtual world with graphic objects having unique
colors. The fact that vision may be completely reconfigured also suggests utility
to the visually handicapped.
simple cube) with the real world in a meaningful way. Steve Feiner’s group was
responsible for demonstrating the viability of AR as a field of research, using
sonar (Logitech 3D trackers) to track the real world so that the real and virtual
worlds could be registered [41,42]. Other research groups [43] have contributed
to this development. Some research in AR arises from work in telepresence [44].
AR, although lesser known than VR, is currently used in some specific
applications. Helicopter pilots often use a see-through visor that superimposes
virtual objects over one eye, and the F18 fighter jet, for example, has a
beamsplitter just inside the windshield that serves as a heads-up display (HUD),
projecting a virtual image that provides the pilot with important information.
The general spirit of AR is to add computer graphics or the like to the real
world. A typical AR apparatus does this with beamsplitter(s) so that the user sees
directly through the apparatus while simultaneously viewing a computer screen.
The goal of this chapter is to consider a wireless (untethered) personal imaging
apparatus worn over the eyes that in real time, computationally reconfigures
reality. The apparatus allows the wearer to augment, diminish, or otherwise alter
the perception of reality in addition to simply adding to reality. This “mediation”
of reality may be thought of as a filtering operation applied to reality and then
a combining operation to insert overlays (Fig. 2.15a). Equivalently the addition
of computer-generated material may be regarded as arising from the filtering
operation itself (Fig. 2.15b).
Figure 2.15 Two equivalent interpretations of mediated reality (MR). (a) Besides the ability to
add computer-generated (synthetic) material to the wearer’s visual world, there is potential
to alter reality through a ‘‘visual filter.’’ The coordinate transformation embodied in the visual
filter may either be inserted into the virtual channel, or the graphics may be rendered in the
coordinate system of the filtered reality channel so that the real and virtual channels are in
register. (b) The visual filter need not be a linear system. The visual filter may itself embody the
ability to create computer-generated objects and therefore subsume the ‘‘virtual’’ channel.
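The equivalence of the two interpretations can be made concrete with a toy example (a sketch only; the particular filter, the overlay, and the array conventions are assumptions for illustration, working on a float RGB frame with values in [0, 1]):

    import numpy as np

    def visual_filter(frame):
        # Example visual filter: diminish reality by removing color, as in the
        # text's example of deliberately removing color from the real world.
        gray = frame.mean(axis=2, keepdims=True)
        return np.repeat(gray, 3, axis=2)

    def render_virtual(frame):
        # Hypothetical virtual channel: a uniquely colored rectangular overlay.
        overlay = np.zeros_like(frame)
        overlay[10:30, 10:60, 0] = 1.0        # a red box (assumes frame is at least 30 x 60)
        return overlay

    def mediate_a(frame):
        # Figure 2.15a: filter the real channel, then combine with the virtual channel.
        return np.clip(visual_filter(frame) + render_virtual(frame), 0.0, 1.0)

    def mediate_b(frame):
        # Figure 2.15b: one (nonlinear) filter that generates the overlay itself,
        # subsuming the virtual channel; it produces the same output as mediate_a.
        out = visual_filter(frame)
        out[10:30, 10:60, 0] = np.clip(out[10:30, 10:60, 0] + 1.0, 0.0, 1.0)
        return out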
It is this “always ready” nature of mediated reality, of living life through the
camera viewfinder, that makes it more successful than the other embodiments
such as VideoClips or the wristwatch system described previously. In particular,
the reality mediator embodies the concepts of incidentalist imaging, while at the
same time providing a constancy of interaction with the real world. Thus, for
example, the same retroactive recording capabilities of the VideoClips system
are present in a more available way.
but in many ground applications (e.g., in a typical building interior) the differing
depth planes destroy the illusion of unity between real and virtual worlds.
With the illusory transparency approach of mediated reality, the real and virtual
worlds exist in the same medium. They therefore are not only registered in
location but also in depth. Any depth limitations of the display device affect
both the virtual and real environments in exactly the same way.
in particular, a linear integral operator [51]. For each ray of light, a linear time-
invariant system collapses all wavelengths into a single quantity giving rise to a
ray of light with a flat spectrum emerging from the other side.
Of course, the visual filter of Figure 2.15b cannot actually be realized
through a linear system but through an equivalent nonlinear filter arising from
incorporating the generation of virtual objects into the filtering operation.
Figure 2.17 An embodiment of author’s reality mediator as of late 1994. The color stereo
head-mounted display (VR4) has two cameras mounted to it. The intercamera distance and field
of view match approximately author’s interocular distance and field of view with the apparatus
removed. The components around author’s waist comprise radio communications equipment
(video transmitter and receiver). The antennas are located at the back of the head-mount to
balance the weight of the cameras, so that the unit is not front-heavy. Steve Mann, 1994.
4 I have been unsuccessful in contacting McGreevy to determine how he routed the signals from two
input, sometimes also deliberately diminished in other ways, could give rise to a
new genre of cinematography and other related experiences.
To be able to experiment with diminished reality in a controlled manner, it is
desirable to first attain a system that can come close to passing reality through
with more bandwidth than desired. In this way the exact desired bandwidth can be
found experimentally. The system of Figure 2.17 overcame some of the problems
associated with McGreevy’s system, in having 34 times more visual bandwidth. It
was just at the point where it was possible to conduct much of daily life through
this illusion of transparency. The degree of reality could be reduced down to
the level of McGreevy’s system, and less, in a computer-controlled environment
to find out how much visual bandwidth is needed for the user to conduct daily
affairs. (The visual bandwidth may be crudely calculated as the number of pixels
times two for stereo, times another three for color, although there is redundancy
because left and right, as well as color channels are quite similar to one another.)
For various tasks it was found that there was a certain point at which it was
possible to function in the RM. In particular, anything below about 1/8 of the
system’s full bandwidth made most tasks very difficult, or impossible. Once
it becomes possible to live within the shortcomings of the RM’s ability to be
transparent, new and interesting experiments may be performed.
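For a concrete sense of scale, the crude calculation in the parenthetical note above works out as follows (the resolution and frame rate below are illustrative assumptions, not the actual figures of the apparatus):

    # Crude visual "bandwidth": pixels x 2 (stereo) x 3 (color) x frame rate,
    # ignoring the redundancy between left/right and between color channels.
    width, height, fps = 640, 480, 30                 # assumed values, for illustration
    samples_per_frame = width * height * 2 * 3        # 1,843,200 sample values
    samples_per_second = samples_per_frame * fps      # about 55 million sample values/s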
Figure 2.18 Simple implementation of a reality mediator (RM). The stereo camera
(implemented as two separate cameras) sends video to one or more computer systems
over a high-quality microwave communications link, called the ‘‘inbound channel.’’ The
computer system(s) send back the processed image over a UHF communications link called
the ‘‘outbound channel.’’ Note the designations ‘‘i’’ for inbound (e.g., iTx denotes inbound
transmitter) and ‘‘o’’ for outbound. The designation ‘‘visual filter’’ refers to the process(es) that
mediate(s) the visual reality and possibly insert ‘‘virtual’’ objects into the reality stream.
of future more recent generations of WearComp, even though they were, at the
time, not technologically feasible in a self-contained body-worn package.
To a very limited extent, looking through a camcorder provides a mediated
reality experience, because we see the real world (usually in black and white, or
in color but with a very limited color fidelity) together with virtual text objects,
such as shutter speed and other information about the camera. If, for example,
the camcorder has a black and white viewfinder, the visual filter (the color-
blindness one experiences while looking through the viewfinder with the other
eye closed) is unintentional in the sense that the manufacturer would likely rather
have provided a full-color viewfinder. This is a very trivial example of a
mediated reality environment where the filtering operation is unintentional but
nevertheless present.
Although the color-blinding effect of looking through a camcorder may be
undesirable most of the time, there are times when it is desirable. The diminished
reality it affords may be a desired artifact of the reality mediator (e.g., where the
user chooses to remove color from the scene either to tone-down reality or to
accentuate the perceptual differences between light and shade). The fact is that a
mediated reality system need not function as just a reality enhancer, but rather,
it may enhance, alter, or deliberately degrade reality.
Stuart Anstis [54], using a camcorder that had a “negation” switch on the
viewfinder, experimented with living in a “negated” world. He walked around
holding the camcorder up to one eye, looking through it, and observed that
he was unable to learn to recognize faces in a negated world. His negation
experiment bore a similarity to Stratton’s inversion experiment mentioned in
Section 2.7.5, but the important difference within the context of this chapter is
that Anstis experienced his mediated visual world through a video signal. In
some sense both the regular eyeglasses that people wear, as well as the special
glasses researchers have used in prism adaptation experiments [49,48], are reality
mediators. However, it appears that Anstis was among the first to explore, in
detail, an electronically mediated world.
Figure 2.19 Living in a ‘‘Rot 90’’ world. It was found necessary to rotate both cameras rather
than just one. Thus it does not seem possible to fully adapt to, say, a prism that rotates the
image of each eye, but the use of cameras allows the up-down placement of the ‘‘eyes.’’ The
parallax, now in the up-down direction, affords a similar sense of depth to what we normally experience
with eyes spaced from left to right together with left-right parallax.
One of the most exciting developments in the field of low vision is the Low Vision
Enhancement System (LVES). This is an electronic vision enhancement system that
provides contrast enhancement . . .. Future enhancements to the device include text
manipulation, autofocus and image remapping. (quote from their WWW page [55],
emphasis added)
Figure 2.20 Living in coordinate-transformed worlds. Color video images are transmitted,
coordinate-transformed, and then received back at 30 frames/s — the full frame-rate of the
VR4 display device. (a) This visual filter would allow a person with very poor vision to read
(due to the central portion of the visual field being hyperfoveated for a very high degree of
magnification in this area), yet still have good peripheral vision (due to a wide visual field of view
arising from demagnified periphery). (b) This visual filter would allow a person with a scotoma
(a blind or dark spot in the visual field) to see more clearly, once having learned the mapping.
The visual filter also provides edge enhancement in addition to the coordinate transformation.
Note the distortion in the cobblestones on the ground and the outdoor stone sculptures.
Their research effort suggests the utility of the real-time visual mappings
(Fig. 2.20) already implemented using the apparatus of Figure 2.17.
The idea of living in a coordinate transformed world has been explored
extensively by other authors [49,48], using optical methods (prisms, etc.).
Much could be written about the author’s experiences in various electronically
coordinate transformed worlds, but a detailed account of all the various
experiences is beyond the scope of this chapter. Of note, however, the author has
observed that visual filters differing only slightly from the identity (e.g., rotation by a
few degrees) left a more lasting impression upon removal of the apparatus (e.g.,
the author was left incapacitated for a longer period after taking it off)
than visual filters far from the identity (e.g., rotation by 180 degrees, upside-
down). Furthermore the visual filters close to the identity tended to leave an
opposite aftereffect (e.g., causing the author to consistently reach too high after
taking off the RM where the images had been translated down slightly, or reach
too far clockwise after removing the RM that had been rotating images a few
degrees counterclockwise). Visual filters far from the identity (e.g., reversal or
upside-down mappings) did not leave an opposite aftereffect: the author would
not see the world as being upside-down upon removing upside-down glasses.
This phenomenon might be thought of as being analogous to learning a second
language (either a natural language or computer language). When the second
language is similar to the one we already know, we make more mistakes switching
back and forth than when the two are distinct. When two (or more) adaptation
spaces were distinct, for example, in the case of the identity map and the rotation
operation (“rot 90”), it was possible to mentally sustain a dual adaptation space
and switch back and forth between the identity operator and the “rot 90” operator
without one causing lasting aftereffects in the other.
Regardless of how much care is taken in creating the illusion of transparency,
there will be a variety of flaws. Not the least of these are limited resolution, lack
of dynamic range, limited color (mapping from the full spectrum of visible light
to three responses of limited color gamut), and improper alignment and placement
of the cameras. In Figure 2.17, for example, the cameras are mounted above the
eyes. Even if they were mounted in front of the eyes, they would protrude, causing the
wearer to assume the visual capabilities of some hypothetical organism whose
eyes stick out of its head some three or four inches.
After wearing the apparatus for an extended period of time, the author
eventually adapted, despite its flaws, whether these be unintended (limited
dynamic range, limited color gamut, etc.), or intended (e.g., deliberately
presenting an upside-down image). It appears that in some sense the visual
reconfiguration is subsumed and induced into the brain. This way, the apparatus
may act as an extension of the body and the mind. Conscious effort is no longer
needed in order to use the machine.
Extended Baseline
Having the cameras above the display (as in Fig. 2.17) induced some parallax
error for nearby objects, so the author tried mounting the cameras at the
Figure 2.21 Giant’s eyes: Extended baseline. (a) With a 212 mm baseline, author could
function in most everyday tasks but would see crosseyed at close conversational distances.
(b) With a 1 m baseline, author could not function in most situations but had a greatly enhanced
sense of depth for distant objects (e.g., while looking out across the river). Wires from the
cameras go down into author’s waist bag containing the rest of the apparatus. Inbound transmit
antenna is just visible behind author’s head.
sides (Fig. 2.21a). This gave an interocular distance of approximately 212 mm,
resulting in an enhanced sense of depth. Objects appeared smaller and closer
than they really were so that the world looked like a reduced-scale model of
reality. While walking home that day (wearing the apparatus), the author felt
a need to duck down to avoid hitting what appeared to be a low tree branch.
However, recollection from previous walks home had been that there were no
low branches on the tree, and, removing the RM, it was observed that the tree
branch that appeared to be within arm’s reach was several feet in the air. After
this enhanced depth perception was adapted to, the cameras were mounted on a
1 m baseline for further experimentation. Crossing the street provided an illusion
of small toy cars moving back and forth very close, giving the feeling that one
might just push them out of the way, but better judgment served to wait until
there was a clearing in the traffic before crossing the road to get to the river.
Looking out across the river provided an illusion that the skyscrapers on the other
side were within arm’s reach in both distance and height.
Edgertonian Sampling
Instead of a fixed delay of the video signal, the author next experimented by
applying a repeating freeze-frame effect to it (with the cameras’ own shutter set
to 1/10,000 second). With this video sample and hold, it was found that nearly
periodic patterns would appear to freeze at certain speeds. For example, while
looking out the window of a car, periodic railings that were a complete blur
without the RM would snap into sharp focus with the RM. Slight differences in
each strut of the railing would create interesting patterns that would dance about
revealing slight irregularities in the structure. (Regarding the nearly periodic
structure as a true periodic signal plus noise, the noise is what gave rise to the
interesting patterns.) Looking out at another car, traveling at approximately the
same speed, it was easy to read the writing on the tires, and count the number of
bolts on the wheel rims. Looking at airplanes in flight, the number of blades on
the spinning propellers could be counted; depending on the sampling rate of the
RM, the blades would appear to rotate slowly backward or forward, much the same way
objects do under the stroboscopic lights of Harold Edgerton's experiments [57]. By
manually adjusting the processing parameters of the RM, many things that escape
normal vision could be seen.
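In sketch form (the names and the frame representation are assumptions, not the actual implementation), the repeating freeze-frame effect amounts to passing one frame through and holding it for the rest of the sampling period:

    def sample_and_hold(frames, period):
        # Pass through every `period`-th frame and hold it until the next sample.
        # When `period` (in frames) nearly matches the period of a repeating
        # structure's motion, that structure appears frozen or slowly drifting,
        # much as under Edgerton's stroboscopic lights.
        held = None
        out = []
        for i, frame in enumerate(frames):
            if i % period == 0:
                held = frame
            out.append(held)
        return out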
Wyckoff’s World
One of the problems with the RM is the limited dynamic range of CCDs.
One possible solution is to operate at a higher frame rate than needed, while
underexposing, say, odd frames and overexposing even frames. The shadow detail
may then be derived from the overexposed stream, the highlight detail from the
underexposed stream, and the midtones from a combination of the two streams.
The resulting extended-response video may be displayed on a conventional HMD
by using Stockham’s homomorphic filter [58] as the visual filter. The principle of
extending dynamic range by simultaneously using sensors of different sensitivity
is known as the Wyckoff principle [59] (to be described in Chapter 4), in honor of Charles Wyckoff.
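A minimal sketch of such a combination for a single pair of frames is given below. It assumes both frames are linear in the quantity of light, that the overexposed frame received a known multiple of the underexposed frame's exposure, and that only highlight clipping needs special handling; the author's actual approach, based on the Wyckoff principle and a homomorphic visual filter, is described in the text and in Chapter 4.

    import numpy as np

    def wyckoff_combine(under, over, exposure_ratio=4.0):
        # under, over: float images assumed linear in light, where `over` received
        # `exposure_ratio` times as much exposure as `under` (values are assumptions).
        est_from_over = over                         # trustworthy in shadows and midtones
        est_from_under = under * exposure_ratio      # trustworthy where `over` is clipped
        # Blend away from `over` as it approaches saturation:
        w = np.clip((0.95 - over) / 0.25, 0.0, 1.0)
        return w * est_from_over + (1.0 - w) * est_from_under   # extended-range estimate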
Figure 2.22 Partially mediated reality. (a) Half MR: Author’s right eye is completely immersed
in a mediated reality environment arising from a camera on the right, while the left eye is free
to see unmediated real-world objects. (b) Substantially less than half MR: Author’s left eye is
partially immersed in a mediated reality environment arising from a camera also on the left.
Betty and Steve Mann, July 1995.
eyes, through the transparent visor, and then look over to the mediation zone
where the left eye sees “through” the illusion of transparency in the display.
Again, one can switch attention back and forth between the mediated reality and
ordinary vision. Depending on the application or intent, there may be a desire to
register or to deliberately misregister the possibly overlapping direct and mediated
zones.
With two personal imaging systems, configured as reality mediators of the kind
depicted in Figure 2.22b, the author set the output radio communications of one
to the input of the other, and vice versa, so that an exchange of viewpoints
resulted (each person would see out through the other person’s eyes). The virtual
vision glasses allowed the wearer to concentrate mainly on what was in one’s
own visual field of view (because of the transparent visor) but at the same time
have a general awareness of the other person’s visual field. This “seeing eye-to-
eye” allowed for an interesting form of collaboration. Seeing eye-to-eye through
the apparatus of Figure 2.17 requires a picture-in-picture process (unless one
wishes to endure the nauseating experience of looking only through the other
person’s eyes), usually having the wearer’s own view occupy most of the space,
while using the apparatus of Figure 2.22b does not require any processing at all.
Usually, when we communicate (e.g., by voice or video), we expect the
message to be received and concentrated on, while when “seeing eye-to-eye,”
there is not the expectation that the message will always be seen by the other
person. Serendipity is the idea. Each participant sometimes pays attention and
sometimes not.
2.10.1 Viewfinders
Virtual reality (VR) systems block out the real world. For example, if you could
make a VR headset portable, tetherless, and wireless, you could still not safely
walk around while wearing it; you would bump into objects in the real world. A
VR headset functions much like a blindfold as far as the real world is concerned.
Mediated reality (MR), as described in the next chapter, allows one to see
a modified version of the real world. A reality mediator is a system that
allows the wearer to augment, deliberately diminish, or otherwise alter his/her
perception of visual reality, or to allow others (other people or computer
programs) to alter his/her perception of reality. Allowing others to alter one’s
perception of reality can serve as a useful communications medium. For example,
the wearable face-recognizer operates when the wearer allows the WearComp
(wearable computational system) to alter his/her perception of reality by inserting
a virtual name tag on top of the face of someone the system recognizes. See
http://wearcam.org/aaai disabled.ps.gz.
The newest reality mediators allow one to experience “life through the screen”
(e.g., life as experienced through a viewfinder). The concept of living visual life
through the viewfinder, as opposed to using a camera as a device that is carried
and looked through occasionally, is what gives rise to this new form of interaction.
Accordingly it is necessary that viewfinders be understood.
To prepare for the material in the next chapter (Chapter 3), and especially
that of Chapter 4, it will be helpful to acquire an understanding of how camera
viewfinders work. If you are taking this course for credit, your group will be
given a small low-cost instamatic film camera to break apart in order to learn how
viewfinders work. Before disassembling the camera try to answer the following
questions, with particular reference to cheap “rangefinder” type cameras, as they
are the most enlightening and interesting to study in the context of mediated
reality: Why do the edges of viewfinders appear somewhat sharp? That is, why
is it that a viewfinder can provide a sharply defined boundary? Some viewfinders
have a reticle or graticule, or some form of crosshairs. Why is it that these appear
sharply focused to the eye, though no part of the camera is far enough away for
the eye to focus on?
Reality mediators have been built into ordinary sunglasses. How do you think
it is possible that the screen appears in sharp focus? If you simply place an object
(e.g., a piece of newspaper) inside your sunglasses, you will not likely be able
to see the text in sharp focus because it is too close to your eyes to focus on. A
magnifying lens between your eye and the newsprint could make it appear sharp,
but the rest of the world will be blurry. How does a camera viewfinder make the
graticule, reticle, or crosshairs appear sharp while keeping the scene in sharp focus as
well?
After thinking about this, take apart the camera and see the answer. Try to
answer the same questions now that you have seen how the camera viewfinder
is made. Now you should understand how things may be overlaid onto the real
world.
3
THE EYETAP PRINCIPLE:
EFFECTIVELY LOCATING THE
CAMERA INSIDE THE EYE
AS AN ALTERNATIVE TO
WEARABLE CAMERA SYSTEMS
This chapter discloses the operational principles of the EyeTap reality mediator,
both in its idealized form and as practical embodiments of the invention. The inner
workings of the reality mediator, in particular, its optical arrangement, are described.
A device that measures and resynthesizes light that would otherwise pass through
the lens of an eye of a user is described. The device diverts at least a portion of
eyeward-bound light into a measurement system that measures how much light
would have entered the eye in the absence of the device. In one embodiment, the
device uses a focus control to reconstruct light in a depth plane that moves to
follow subject matter of interest. In another embodiment, the device reconstructs
light in a wide range of depth planes, in some cases having infinite or near-
infinite depth of field. The device has at least one mode of operation in which
it reconstructs these rays of light, under the control of a portable computational
system. Additionally the device has other modes of operation in which it can,
by program control, cause the user to experience an altered visual perception of
reality. The device is useful as a visual communications system, for electronic
newsgathering, or to assist the visually challenged.
To understand how the reality mediator works, consider the first of these
three components, namely the device called a “lightspace analyzer” (Fig. 3.1).
The lightspace analyzer absorbs and quantifies incoming light. Typically (but not
necessarily) it is completely opaque. It provides a numerical description (e.g., it
turns light into numbers). It is not necessarily flat (e.g., it is drawn as curved to
emphasize this point).
The second component, the lightspace modifier, is typically a processor
(WearComp, etc.) and will be described later, in relation to the first and third
components.
The third component is the “lightspace synthesizer” (Fig. 3.2). The lightspace
synthesizer turns an input (stream of numbers) into the corresponding rays of
light.
Now suppose that we connect the output of the lightspace analyzer to the
input of the lightspace synthesizer (Fig. 3.3). What we now have is an illusory
transparency.
Figure 3.1 Lightspace analyzer absorbs and quantifies every ray of incoming light. It converts
every incoming ray of light into a numerical description. Here the lightspace analyzer is depicted
as a piece of glass. Typically (although not necessarily) it is completely opaque.
Figure 3.2 The lightspace synthesizer produces rays of light in response to a numerical input.
An incoming numerical description provides information pertaining to each ray of outgoing light
that the device produces. Here the lightspace synthesizer is also depicted as a special piece
of glass.
Figure 3.3 Illusory transparency formed by connecting the output of the lightspace analysis
glass to the input of the lightspace synthesis glass.
Figure 3.4 Collinear illusory transparency formed by bringing together the analysis glass and
the synthesis glass to which it is connected.
Moreover suppose that we could bring the lightspace analyzer glass into direct
contact with the lightspace synthesizer glass. Placing the two back-to-back would
create a collinear illusory transparency in which any emergent ray of virtual
light would be collinear with the incoming ray of real light that gave rise to it
(Fig. 3.4).
Now a natural question to ask is: Why go to all this effort to create a simple illusion
of transparency, when we could just as easily purchase a small piece of clear
glass?
The answer is the second component, the lightspace modifier, which gives us
the ability to modify our perception of visual reality. This ability is typically
achieved by inserting a WearComp between the lightspace analyzer and the
lightspace synthesizer (Fig. 3.5). The result is a computational means of altering
the visual perception of reality.
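In pseudocode-like sketch form (the ray representation and the particular modification are placeholders, not the actual signal format of the apparatus), the three components compose as follows:

    def lightspace_analyzer(incoming_rays):
        # Absorb and quantify incoming light: turn each ray into a number.
        return [ray["intensity"] for ray in incoming_rays]

    def lightspace_modifier(numbers):
        # WearComp stage: any computational alteration of the numerical
        # description. Hypothetical example: diminish reality by halving
        # every ray's intensity.
        return [0.5 * n for n in numbers]

    def lightspace_synthesizer(numbers):
        # Turn the (possibly modified) numbers back into outgoing rays of
        # "virtual light", each collinear with the real ray that gave rise to it.
        return [{"intensity": n} for n in numbers]

    def reality_mediator(incoming_rays):
        return lightspace_synthesizer(
            lightspace_modifier(lightspace_analyzer(incoming_rays)))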
(Figure 3.5: a WearComp, serving as the lightspace modifier, inserted between the incoming lightspace analysis glass and the outgoing lightspace synthesis glass.)
In summary: the lightspace analyzer converts incoming rays of light into a numerical description, the lightspace modifier (the WearComp) alters that numerical description, and the lightspace synthesizer converts the modified numbers back into outgoing rays of light.
In practice, there are other embodiments of this invention than the one described
above. One of these practical embodiments will now be described.
Figure 3.6 Eyeglasses made from lightspace analysis and lightspace synthesis systems can
be used for virtual reality, augmented reality, or mediated reality. Such a glass, made into a visor,
could produce a virtual reality (VR) experience by ignoring all rays of light from the real world,
and generating rays of light that simulate a virtual world. Rays of light from real (actual) objects
indicated by solid shaded lines; rays of light from the display device itself indicated by dashed
lines. The device could also produce a typical augmented reality (AR) experience by creating
the ‘‘illusion of transparency’’ and also generating rays of light to make computer-generated
‘‘overlays.’’ Furthermore it could ‘‘mediate’’ the visual experience, allowing the perception of
reality itself to be altered. In this figure a less useful (except in the domain of psychophysical
experiments) but illustrative example is shown: objects are left-right reversed before being
presented to the viewer.
1. orthospatial (collinear)
a. orthoscopic
b. orthofocal
2. orthotonal
a. orthoquantigraphic (quantigraphic overlays)
b. orthospectral (nonmetameric overlays)
3. orthotemporal (nonlagging overlays)
mediated reality might, over a long period of time, cause brain damage, such as
damage to the visual cortex, in the sense that learning (including the learning of
new spatial mappings) permanently alters the brain.
This consideration is particularly important if one wishes to photograph, film,
or make video recordings of the experience of eating or playing volleyball, and
the like, by doing the task while concentrating primarily on the eye that is
looking through the camera viewfinder. Indeed, since known cameras were never
intended to be used this way (to record events from a first-person perspective
while looking through the viewfinder), it is not surprising that performance of
any of the apparatus known in the prior art is poor in this usage.
The embodiments of the wearable camera system sometimes give rise to a
small displacement between the actual location of the camera, and the location
of the virtual image of the viewfinder. Therefore either the parallax must be
corrected by a vision system, followed by 3D coordinate transformation, followed
by rerendering, or if the video is fed through directly, the wearer must learn to
make this compensation mentally. When this mental task is imposed upon the
wearer while performing tasks at close range, such as looking into a microscope
through the glasses, the discrepancy is difficult to learn, and it
may give rise to unpleasant psychophysical effects such as nausea or “flashbacks.”
If an eyetap is not properly designed, initially one wearing the eyetap will
tend to put the microscope eyepiece up to an eye rather than to the camera, if
the camera is not the eye. As a result the apparatus will fail to record exactly
the wearer’s experience, unless the camera is the wearer’s own eye. Effectively
locating the cameras elsewhere (other than in at least one eye of the wearer)
does not give rise to a proper eyetap, as there will always be some error. It
is preferred that the apparatus record exactly the wearer’s experience. Thus, if
the wearer looks into a microscope, the eyetap should record that experience for
others to observe vicariously through at least one eye of the wearer. Although the
wearer can learn the difference between the camera position and the eye position,
it is preferable that this not be required, for otherwise, as previously described,
long-term usage may lead to undesirable flashback effects.
Figure 3.7 A modern camcorder (denoted by the reference numeral 10 in the figure) could, in
principle, have its zoom setting set for unity magnification. Distant objects 23 would then appear
to the eye, when viewed through the camcorder, identical in size and position to how they would
appear in the absence of the camcorder. However, nearby subject matter 23N lies at distance dc
from the effective center of projection of the camcorder, which is closer than the distance de to
the effective center of projection of the eye. The eye is denoted by reference numeral 39, while the camera
iris denoted 22i defines the center of projection of the camera lens 22. For distant subject
matter the difference in location between iris 22i and eye 39 is negligible, but for nearby subject
matter it is not. Therefore nearby subject matter will be magnified, as denoted by the dotted
line figure having reference numeral 23F. Alternatively, setting the camcorder zoom for unity
magnification of nearby subject matter will result in significantly less than unity magnification
for distant subject matter. Thus there is no zoom setting that will make both near and far subject
matter simultaneously appear as they would in the absence of the camcorder.
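A rough way to quantify this effect (a sketch assuming a simple pinhole model, not a statement of the actual camcorder optics): with the zoom set for unity magnification of distant subject matter, nearby subject matter at distance dc from the camera's center of projection, and at distance de from the eye's, appears magnified by approximately m ≈ de/dc. This ratio tends to unity for distant subject matter (where de ≈ dc) and grows as the subject matter approaches the apparatus, consistent with the magnified dotted outline 23F in the figure.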
it really is. It captures all eyeward bound rays of light, and we can imagine
that it processes these rays in a collinear fashion. However, this reasoning is pure
fiction, and it breaks down as soon as we consider a scene that has some depth
of field, such as is shown in Figure 3.9.
Thus we may regard the apparatus consisting of a camera and display as being
modeled by a fictionally large camera opening, but only over subject matter
confined to a plane.
Even if the lens of the camera has sufficient depth of focus to form an image
of subject matter at various depths, this collinearity criterion will only hold at
one such depth, as shown in Figure 3.10. This same argument may be made
for the camera being off-axis. Thus, when the subject matter is confined to a
single plane, the illusory transparency can be sustained even when the camera is
off-axis, as shown in Figure 3.11.
Some real-world examples are shown in Figure 3.12. An important limitation
is that the system obviously only works for a particular viewpoint and for
Figure 3.8 Suppose that the camera portion of the camcorder, denoted by reference numeral
10C, were fitted with a very large objective lens 22F. This lens would collect eyeward bound
rays of light 1E and 2E. It would also collect rays of light coming toward the center of projection
of lens 22. Rays of light coming toward this camera center of projection are denoted 1C and
2C. Lens 22 converges rays 1E and 1C to point 24A on the camera sensor element. Likewise
rays of light 2C and 2E are focused to point 24B. Ordinarily the image (denoted by reference
numeral 24) is upside down in a camera, but cameras and displays are designed so that when
the signal from a camera is fed to a display (e.g., a TV set) it shows rightside up. Thus the
image appears with point 32A of the display creating rays of light such as the one denoted 1D. Ray 1D
is responsive to, and collinear with, eyeward bound ray 1E that
would have entered the eye in the absence of the apparatus. Likewise, by similar reasoning,
ray 2D is responsive to, and collinear with, eyeward bound ray 2E. It should be noted, however,
that the large lens 22F is just an element of fiction. Thus lens 22F is a fictional lens because
a true lens should be represented by its center of projection; that is, its behavior should not
change other than by depth of focus, diffraction, and amount of light passed when its iris is
opened or closed. Therefore we could replace lens 22F with a pinhole lens and simply imagine
lens 22 to have captured rays 1E and 2E, when it actually only captures rays 1C and 2C.
subject matter in a particular depth plane. This same setup could obviously be
miniaturized and concealed in ordinary looking sunglasses, in which case the
limitation to a particular viewpoint is not a problem (since the sunglasses could
be anchored to a fixed viewpoint with respect to at least one eye of a user).
However, the other important limitation, that the system only works for subject
matter in the same depth plane, remains.
Figure 3.9 The small lens 22 shown in solid lines collects rays of light 1C and 2C. Consider,
for example, eyeward bound ray of light 1E, which may be imagined to be collected by a
large fictional lens 22F (when in fact ray 1C is captured by the actual lens 22), and focused to
point 24A. The sensor element collecting light at point 24A is displayed as point 32A on the
camcorder viewfinder, which is then viewed through a magnifying lens and emerges as ray 1D into
eye 39. It should be noted that the top of nearby subject matter 23N also images to point 24A
and is displayed at point 32A, emerging as ray 1D as well. Thus nearby subject matter 23N
will appear as shown in the dotted line denoted 23F, with the top point appearing as 23FA
even though the actual point should appear as 23NA (e.g., would appear as point 23NA in the
absence of the apparatus).
Figure 3.10 Camera 10C may therefore be regarded as having a large fictional lens 22F,
despite the actual much smaller lens 22, so long as we limit our consideration to a single depth
plane and exclude from consideration subject matter 23N not in that same depth plane.
Figure 3.11 Subject matter confined to a single plane 23 may be collinearly imaged and
displayed by using the same large fictional lens model. Imagine therefore that fictional lens 22F
captures eyeward bound rays such as 1E and 2E when in fact rays 1C and 2C are captured.
These rays are then samplings of fictional rays 1F and 2F that are resynthesized by the display
(shown here as a television receiver) that produces rays 1D and 2D. Consider, for example, ray
1C, which forms an image at point 24A in the camera denoted as 10C. The image, transmitted
by transmitter 40T, is received as 40R and displayed as pixel 32A on the television. Therefore,
although this point is responsive to light along ray 1C, we can pretend that it was responsive
to light along ray 1E. So the collinearity criterion is modeled by a fictionally large lens 22F.
Obviously, subject matter moved closer to the apparatus will no longer appear
properly lined up. Clearly, a person standing right in front of the camera will
not be behind the television yet will appear on the television. Likewise a person
standing directly behind the television will not be seen by the camera which is
located to the left of the television. Thus subject matter that exists at a variety
of different depths, and not confined to a plane, may be impossible to line up in
all areas with its image on the screen. See, for example, Figure 3.13.
Figure 3.12 Illusory transparency. Examples of a camera supplying a television with an image
of subject matter blocked by the television. (a) A television camera on a tripod at left supplies an
Apple ‘‘Studio’’ television display with an image of the lower portion of Niagara Falls blocked
by the television display (resting on an easel to the right of the camera tripod). The camera
and display were carefully arranged by the author, along with a second camera to capture this
picture of the apparatus. Only when viewed from the special location of the second camera,
does the illusion of transparency exist. (b) Various still cameras set up on a hill capture pictures
of trees on a more distant hillside on Christian Island. One of the still cameras having an NTSC
output displays an image on the television display.
Figure 3.13 Various cameras with television outputs are set up on the walkway, but none of
them can recreate the subject matter behind the television display in a manner that conveys
a perfect illusion of transparency, because the subject matter does not exist in a single depth
plane. There exists no choice of camera orientation, zoom setting, and viewer location that
creates an exact illusion of transparency for the portion of the Brooklyn Bridge blocked by the
television screen. Notice how the railings don’t quite line up correctly as they vary in depth with
respect to the first support tower of the bridge.
Figure 3.14 The orthoscopic reality mediator. A double-sided mirror diverts incoming rays
of light to a camera while providing the eye with a view of a display screen connected to
the wearable computer system. The display screen appears backward to the eye. But, since
the computer captures a backward stream of images (the camera’s view of the world is also
through a mirror), display of that video stream will create an illusion of transparency. Thus the
leftmost ray of light diverted by the mirror, into the camera, may be quantified, and that quantity
becomes processed and resynthesized by virtue of the computer’s display output. This way it
appears to emerge from the same direction as if the apparatus were absent. Likewise for the
rightmost ray of light, as well as any in between. This principle of ‘‘virtual light’’ generalizes
to three dimensions, though the drawing has simplified it to two dimensions. Typically such
an apparatus may operate with orthoquantigraphic capability through the use of quantigraphic
image processing [63].
embodiments of the personal imaging system have used two cameras and two
viewfinders. In some embodiments the vergence of the viewfinders was linked
to the focus mechanism of the viewfinders and the focus setting of cameras. The
result was a single automatic or manual focus adjustment for viewfinder vergence,
camera vergence, viewfinder focus, and camera focus. However, a number of
these embodiments became too cumbersome for unobtrusive implementation,
rendering them unacceptable for ordinary day-to-day usage. Therefore most of
what follows will describe other variations of single-eyed (partially mediated)
systems.
apparatus less obtrusive and allow others to see the wearer’s eye(s) unobstructed
by the mediation zone.
The apparatus of Figure 3.14 does not permit others to make full eye contact
with the wearer. Therefore a similar apparatus was built using a beamsplitter
instead of the double-sided mirror. In this case a partial reflection of the display
is visible to the eye of the wearer by way of the beamsplitter. The leftmost ray
of light of the partial view of the display is aligned with the direct view of the
leftmost ray of light from the original scene, and likewise for the rightmost ray,
or any ray within the field of view of the viewfinder. Thus the wearer sees a
superposition of whatever real object is located in front of the apparatus and a
displayed picture of the same real object at the same location. The degree of
transparency of the beamsplitter affects the degree of mediation. For example,
a half-silvered beamsplitter gives rise to a 50% mediation within the mediation
zone.
In order to prevent video feedback, in which light from the display screen
would shine into the camera, a polarizer was positioned in front of the camera.
The polarization axis of the polarizer was aligned at right angles to the
polarization axis of the polarizer inside the display screen, in situations where
the display screen already had a built-in polarizer as is typical of small battery-
powered LCD televisions, LCD camcorder viewfinders, and LCD computer
displays. In embodiments of this form of partially mediated reality where the
display screen did not have a built in polarizer, a polarizer was added in front
of the display screen. Thus video feedback was prevented by virtue of the two
crossed polarizers in the path between the display and the camera. If the display
screen displays the exact same rays of light that come from the real world, the
view presented to the eye is essentially the same as it might otherwise be.
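The feedback suppression afforded by the crossed polarizers can be quantified by Malus's law (a standard optics result, stated here for clarity rather than taken from the text): the intensity of already-polarized light (here, the display light that has passed through the display's own polarizer) transmitted through a second polarizer whose axis differs by an angle θ is I = I0 cos²θ. At θ = 90 degrees the display light reaching the camera is therefore ideally zero, limited in practice by the extinction ratio of the polarizers.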
In order that the viewfinder provide a distinct view of the world, it was found
to be desirable that the virtual light from the display screen be made different in
color from the real light from the scene. For example, simply using a black-and-
white display, or a black-and-green display, gave rise to a unique appearance
of the region of the visual field of the viewfinder by virtue of a difference in
color between the displayed image and the real world upon which it is exactly
superimposed. Even with such chromatic mediation of the displayed view of the
world, it was still found to be far more difficult to discern whether or not video
was correctly exposed, than when the double-sided mirror was used instead of the
beamsplitter. Therefore, when using these partially see-through implementations
of the apparatus, it was found to be necessary to use a pseudocolor image or
unique patterns to indicate areas of overexposure or underexposure. Correct
exposure and good composition are important, even if the video is only used
for object recognition (e.g., if there is no desire to generate a picture as the final
result). Thus even in tasks such as object recognition, a good viewfinder system
is of great benefit.
In this see-through embodiment, calibration was done by temporarily removing
the polarizer and adjusting for maximum video feedback. The apparatus may be
concealed in eyeglass frames in which the beamsplitter is embedded in one or
both lenses of the eyeglasses, or behind one or both lenses. In the case in which
a monocular version of the apparatus is being used, the apparatus is built into
one lens, and a dummy version of the beamsplitter portion of the apparatus may
be positioned in the other lens for visual symmetry. It was found that such an
arrangement tended to call less attention to itself than when only one beamsplitter
was used.
These beamsplitters may be integrated into the lenses in such a manner as to have
the appearance of the lenses in ordinary bifocal eyeglasses. Moreover magnifica-
tion may be unobtrusively introduced by virtue of the bifocal characteristics of
such eyeglasses. Typically the entire eyeglass lens is tinted to match the density
of the beamsplitter portion of the lens, so there is no visual discontinuity intro-
duced by the beamsplitter. It is not uncommon for modern eyeglasses to have a
light-sensitive tint so that a slight glazed appearance does not call attention to
itself.
Apart from large-view cameras upon which the image is observed on a ground
glass, most viewfinders present an erect image. See, for example, U.S. Pat.
5095326 entitled “Keppler-type erect image viewfinder and erecting prism.” In
contrast to this fact, it is well known that one can become accustomed, through
long-term psychophysical adaptation (as reported by George M. Stratton, in
Psychology Review, in 1896 and 1897), to eyeglasses that present an upside-
down image. After wearing upside-down glasses constantly, for eight days
(keeping himself blindfolded when removing the glasses for bathing or sleeping),
Stratton found that he could see normally through the glasses. More recent
experiments, conducted and reported by Mann in an MIT technical report,
Mediated Reality, medialab vismod TR-260 (1994; the report is available at
http://wearcam.org/mediated-reality/index.html), suggest that slight
transformations such as rotation by a few degrees or small image displacements
give rise to a reversed aftereffect that is more rapidly assimilated by the user of
the device. Often more detrimental effects were found in performing other tasks
through the camera as well as in flashbacks upon removal of the camera after
it has been worn for many hours while doing tasks that require good hand-to-
eye coordination, and the like. These findings suggest that merely mounting a
conventional camera such as a small 35 mm rangefinder camera or a small video
camcorder to a helmet, so that one can look through the viewfinder and use it
hands-free while performing other tasks, will result in poor performance at doing
those tasks while looking through the camera viewfinder.
Part of the reason for poor performance associated with simply attaching a
conventional camera to a helmet is the induced parallax and the failure to provide
an orthoscopic view. Even viewfinders that correct for parallax, as described
in U.S. Pat. 5692227 in which a rangefinder is coupled to a parallax error
compensating mechanism, only correct for parallax between the viewfinder and
the camera lens that is taking the picture. They do not correct for parallax between
the viewfinder and the image that would be observed with the naked eye while
not looking through the camera.
Open-air viewfinders are often used on extremely low-cost cameras (e.g.,
disposable 35 mm cameras), as well as on some professional cameras for use
at night when the light levels are too low to tolerate any optical loss in the
viewfinder. Examples of open-air viewfinders used on professional cameras, in
addition to regular viewfinders, include those used on the Graflex press cameras
of the 1940s (which had three different kinds of viewfinders: a regular optical
viewfinder, a ground glass, and an open-air viewfinder), as well as those used on
some twin-lens reflex cameras.
While such viewfinders, if used with a wearable camera system, have the
advantage of not inducing the problems such as flashback effects described above,
the edges of the open-air viewfinder are not in focus. They are too close to the eye
for the eye to focus on, and they have no optics to make the viewfinder appear
sharp. Moreover, although such open-air viewfinders induce no parallax error in
subject matter viewed through such viewfinders, they fail to eliminate the offset
between the camera’s center of projection and the actual center of projection of
the eye (to the extent that one cannot readily remove one’s eye and locate the
camera in the eye socket, exactly where the eye’s normal center of projection
resides).
Electronic Viewfinders
Many modern cameras use electronic viewfinders. They therefore provide an
electronically mediated environment in which the visual perception of reality
is altered: geometrically (owing to the same parallax errors found in optical
viewfinders), tonally (e.g., color distortion or complete loss of color, together
with reduced dynamic range), and temporally (a reduced frame rate, since the
electronic viewfinder updates images only 30 or 60 times per second, and
sometimes even less). (Many studies have been done on display update rates.
Most of these studies come from the virtual reality community [62].)
Not all aspects of a viewfinder-altered visual perception are bad, though. One
of the very reasons for having an electronic viewfinder is to alter the user’s
visual perception by introducing indicia (shutter speed, or other text and graphical
overlays) or by actually mediating visual perception more substantively (e.g., by
applying zebra-stripe banding to indicate areas of overexposure). This altered
visual perception serves a very useful and important purpose.
Electronic information displays are well known. They have been used
extensively in the military and industrial sectors [65], as well as in virtual
reality (VR) systems [39]. Using any of these various information displays as a
viewfinder gives rise to similar problems such as the offset between the camera’s
center of projection and the actual center of projection of the eye.
Augmented reality systems [40] use displays that are partially transparent.
Augmented reality displays have many practical uses in industry [66]. When
these displays are used as viewfinders, they function much like the open-air
viewfinders. They provide sharp focus but still do not solve the discrepancy
between the eye’s center of projection and camera’s center of projection.
Other kinds of information displays such as Microvision’s scanning displays
[67], or the Private Eye manufactured by Reflection Technologies, can also be
used for virtual reality or augmented reality. Nevertheless, similar problems arise
when an attempt is made to use them as camera viewfinders.
A so-called infinity sight [68], commonly used in telescopes, has also been
used to superimpose crosshairs for a camera. However, the camera’s center of
projection would not line up with that of the eye looking through the device.
Another problem with all of the above-mentioned camera systems is the fixed
focus of the viewfinders. Although the camera lens itself has depth of field control
(automatic focus, automatic aperture, etc.), known viewfinders lack such control.
The aremac, introduced below, is a viewfinder in which the subject matter viewed
through the device is presented at the same focal distance as subject matter not
viewed through the device. Thus the device operates
in such a manner as to cause zero or near-zero eyestrain, while allowing the user
of the device to capture video in a natural fashion. What is presented therefore is
a more natural kind of camera that can function as a true extension of the mind
and body, and in which the visual perception of reality may be computationally
altered in a controlled way without causing eyestrain.
There are two embodiments of the aremac: (1) one in which a focuser (e.g., an
electronically focusable lens) tracks the focus of the camera to reconstruct rays of
diverted light in the same depth plane as imaged by the camera, and (2) another
in which the aremac has extended or infinite depth of focus so that the eye itself
can focus on different objects in a scene viewed through the apparatus.
[Figure 3.15 diagram, panels (a) and (b): SENSOR, SYNTH, AREMAC, DIVERTER, the L1 and L2 FOCUSERs, FOCUS CONTROLLER, and PROC., with nearby and distant subject matter, points P0 through P3, lenses L1 through L3, and the EYE.]
Figure 3.15 Focus tracking aremac. (a) With a NEARBY SUBJECT, a point P0 that would otherwise
be imaged at P3 in the EYE of a user of the device is instead imaged to point P1 on the image
SENSOR, because the DIVERTER diverts EYEward bound light to lens L1 . When subject matter is
nearby, the L1 FOCUSER moves objective lens L1 out away from the SENSOR automatically in
much the same way as an automatic focus camera functions. A signal from the L1 FOCUSER
directs the L2 FOCUSER, by way of the FOCUS CONTROLLER, to move lens L2 outward away from
the light SYNTHesizer. At the same time an image from the SENSOR is directed through an image
PROCessor, into the light SYNTHesizer. Point P2 of the display element is responsive to point P1 of
the SENSOR. Likewise other points on the light SYNTHesizer are each responsive to corresponding
points on the SENSOR so that the SYNTHesizer produces a complete image for viewing through
lens L2 by the EYE, after reflection off of the back side of the DIVERTER. The position of L2 is
such that the EYE’s own lens L3 will focus to the same distance as it would have focused in the
absence of the entire device. Therefore, when lens L1 moves outward, the eye’s lens muscles
tense up and lens L3 becomes thicker. (b) With DISTANT SUBJECT MATTER, rays of parallel light are
diverted toward the SENSOR where lens L1 automatically retracts to focus these rays at point
P1 . When lens L1 retracts, so does lens L2 . When lens L2 retracts, the light SYNTHesizer ends
up generating parallel rays of light that bounce off the backside of the DIVERTER. These parallel
rays of light enter the EYE and cause its own lens L3 to relax to infinity (L3 gets thinner), as it
would have done in the absence of the entire device. Rays denoted by solid lines are real light
(diverging or parallel) from the subject matter, whereas rays denoted by dotted lines are virtual
light (converging toward the eye) synthesized by the device.
Figure 3.16 Aremac depth tracking in the EyeTap system is achieved by having the aremac and camera both focused
together by a single focus control input, either manual or automatic. Solid lines denote real light from subject matter,
and dashed lines denote virtual light synthesized by the aremac. (a) Aremac focus controlled by autofocus camera. When
the camera focuses to infinity, the aremac focuses so that it presents subject matter that appears as if it is infinitely far.
When the camera focuses close, the aremac presents subject matter that appears to be at the same close distance.
A zoom input controls both the camera and aremac to negate any image magnification and thus maintain the EyeTap
condition. Rays of light defining the widest field of view are denoted W. Rays of light defining the narrowest field of view
are denoted T (for ‘‘Tele’’ ). Note that the camera and aremac fields of view correspond. (b) Aremac and camera focus
both controlled by eye focus. An eyefocus measurer (by way of a beamsplitter called the ‘‘eyefocus diverter’’ ) obtains
an approximate estimate of the focal distance of the eye. Both the camera and aremac then focus to approximately this
same distance. (c) Focus of right camera and both aremacs (including vergence) controlled by autofocus camera on left
side. In a two-eyed system it is preferable that both cameras and both aremacs focus to the same distance. Therefore
one of the cameras is a focus master, and the other camera is a focus slave. Alternatively, a focus combiner is used to
average the focus distance of both cameras and then make the two cameras focus at equal distance. The two aremacs,
as well as the vergence of both systems, also track this same depth plane as defined by camera autofocus.
Figure 3.17 Aremac based on aperture stop between reversed Petzval lens groups. The use
of two lens groups in a viewfinder permits the insertion of an aperture stop in between. This
results in a viewfinder having extended depth of focus. As a result, no matter at what depth the
eye of the user focuses, the image plane will appear sharp. The nonshooting eye (the eye
not looking through the camera) can focus on subject matter at any distance, and the camera
eye (the eye looking into the aremac) can assert its own focus without eyestrain. Clearly, in
some sense, the aremac is a nonassertive viewfinder in that it does not impose a strong need
for the user’s eye to focus at any particular distance.
[Figure 3.18 diagram: element spacings of 9 mm, 12 mm, and 18 mm; lens focal lengths f = 8 mm and f = 4 mm; aperture stop, image plane, and eye shown on a 0 to 40 mm scale.]
Figure 3.18 Aremac with distal aperture stop. As will be seen later in this chapter, an important
first step toward making the apparatus covert is moving the aperture stop out from within the
optics, and bringing it further back toward the image plane. The dimensions of the preferred
embodiment are indicated in millimeters, and the focal lengths of each of the lenses are also so
indicated. Again the eye is indicated in dashed lines, since it is not actually part of the aremac
apparatus.
This extended depth of focus embodiment is shown in Figure 3.17, where the eye is depicted
in dashed lines, since it is not part of the aremac.
The author has found it preferable that the optics be concealed within what
appear to be ordinary eyeglasses. The aperture stop, being dark (usually black),
tends to be visible in the eyeglasses, and so should best be removed from where
it can be seen. Figure 3.18 depicts an embodiment of the aremac invention in
which the aperture stop is more distant from the eye, and is located further back
toward the display element. In this way the aperture stop is conveniently made
as part of the display element housing, which reduces cost and makes the system
more easily manufacturable.
Figure 3.19 Aremac with distal aperture stop and folded optical path. Folding the optical
path moves elements of the display system out of the way so that most of the elements can
be concealed by the temple side piece of the eyeglasses. Moreover the folded design takes
most of the aremac out of the way of the user’s view of the world, such that less of the user’s
vision is obstructed. A camera that is concealed in the nosebridge of the eyeglasses can look
across and obtain light reflected off the backside of the folding optics. The folding is by way
of 45 degree mirrors or beamsplitters, the one closer to the eye being called a ‘‘diverter.’’ This
diverter diverts light that would otherwise have entered an eye of the user of the device, into the
camera. The distance from the eye to the optical center of the diverter is called the ‘‘EyeTap’’
distance, and is denoted by the letter d. This distance is equal to the distance from the camera
to the optical center of the diverter. Thus the camera provides an image of rays of light that
would otherwise enter an eye of the user of the device. So the effective center of projection of
the camera is located in the left eye of the user of the device. Since the camera and aremac
can both have depth of focus controls, both can be in focus from four inches to infinity. In this
way the apparatus can easily operate over the normal range of focusing for the healthiest of
human eyes.
The optical path of the aremac may also be folded (Fig. 3.19), where the folding has an
additional purpose: one of the folds has the additional function of a diverter to divert light
sideways into the eye. The
purpose of the diverter is threefold:
• To get the optics and display element out from in front of the eye and to
allow portions of the apparatus to be moved closer to the user’s head. This
reduces the moment of inertia, for example, and makes the apparatus more
covert and more comfortable to wear.
• To get more of the apparatus out of the user’s field of view so that less of
the user’s vision is obstructed.
• To facilitate the insertion of a camera, in such a way that the camera can
receive light that would otherwise pass through the center of projection of
a lens of an eye of the user of the apparatus.
With this folded design, the lens group nearest the eye is reduced in size and visibility
at the expense of making the further lens group larger and otherwise more visible.
Since the further lens group can
be concealed at the edge of the eyeglass lens, by the frames of the eyeglasses,
such a design provides a very useful trade-off.
An aperture stop, by its nature, is far more difficult to conceal in an eyeglass
lens than transparent optics. Aperture stops are generally black. A large black
disk, for example, with a small hole in it, would be hard to conceal in the main
central portion of an eyeglass lens. Therefore the use of a distal aperture stop
helps by getting the aperture stop back far enough that it can be concealed by
the frames of the eyeglasses. Preferably the frames of the eyeglasses are black,
so that the aperture stop which is also preferably black, along with a housing
to prevent light leakage, will all blend in to be concealed within the eyeglass
frames.
[Figure 3.20 diagram, panels (a) and (b): point source (PS), spatial light modulator (SLM), condensing lens (COND), diverter, and the eye (eye lens and retina), with the displaced eye position and the 0th diffractive order indicated.]
Figure 3.20 Laser EyeTap system. (a) RM based on the pinhole aremac. The pinhole aremac
consists of a point source of light (PS), a spatial light modulator (SLM), and a condensing
lens to convert real light into virtual light (COND). Incoming rays of real light from the scene are
replaced (or augmented) with corresponding rays of virtual light from the pinhole aremac. The
point source of light is derived from a solid state laser, along with a simple spatial filter and
spreading optics. The pinhole aremac is a vitrionic display system and is therefore prototyped
through immersion in a xylene solution. (b) The diverter effectively locates the camera in the
eye, or equivalently, it locates the eye (denoted EYE , with its lens and retina denoted EYE LENS
and RETINA respectively) at the camera. Here the original eye location is shown in dashed lines,
and the effective eye location in solid lines, to show where various diffractive orders land on
the eye. It is essential to design the apparatus so that only one diffractive order enters the lens
of the eye.
for pinhole aremacs that present not perfectly sharp images). In the author’s
opinion, diffraction blurring is acceptable, but periodic diffraction is not. Periodic
diffraction causes the appearance of ghosted replicas of subject matter, and is
most evident when there is text displayed on the aremac.
Through the incorporation of a higher-order diffractive excluder, this problem
can be resolved. The finished design (see Fig. 3.21) has no moving parts, and it
can be economically manufactured.
The apparatus is preferably concealed in eyeglass frames in which the diverter
is either embedded in one or both lenses of the eyeglasses. In the case where a
monocular version of the apparatus is being used, the apparatus is built into one
lens, and a dummy version of the diverter portion of the apparatus is positioned
in the other lens for visual symmetry. It was found that such an arrangement
tended to call less attention to itself than when only one diverter was used for a
monocular embodiment.
These diverters may be integrated into the lenses to have the appearance
of the lenses in ordinary bifocal eyeglasses. Moreover magnification may
be unobtrusively introduced by virtue of the bifocal characteristics of such
eyeglasses. Typically the entire eyeglass lens is tinted to match the density of the
beamsplitter portion of the lens, so there is no visual discontinuity introduced by
the diverter. It is common for modern eyeglasses to have a light-sensitive tint so
that a slight glazed appearance does not call attention to itself.
Figure 3.21 Covert prototype of camera and viewfinder having appearance of ordinary bifocal
eyeglasses. This is a monocular left-eyed system, with a dummy lens in the right eye having
the same physical appearance as the left-eye lens. The apparatus has essentially infinite
depth of focus and therefore provides essentially zero eyestrain. This apparatus is useful for
shooting documentary video, electronic newsgathering, as well as for mediated shared visual
communications space.
[Figure 3.22 diagram: eyeglass frames containing the camera (sensor and camera aperture) and the aremac (PS, SLM, lens group), with the diverter shown in two positions (DIV and DIV') and the corresponding eyetap points (Eye and Eye').]
Figure 3.22 The diverter constancy phenomenon. Movement of the diverter does not affect
the collinearity criterion but simply shifts the eyetap point. Therefore, the diverter position can
be controlled by a device that tracks the eye’s position to cause the eyetap point to follow the
eye.
A portion of this light passes through the beamsplitter and is absorbed and
quantified by a wide-camera. A portion of this incoming light is also reflected
by the beamsplitter and directed to the tele-camera. The image from the wide-
camera is displayed on a large screen, of size 0.7 inches (approx. 18 mm) on
the diagonal, forming a widefield-of-view image of virtual light from the wide-
camera. The image from the tele-camera is displayed on a small screen, typically
of screen size 1/4 inch (approx. 6 mm) on the diagonal, forming a virtual image
of the tele-camera as virtual light. A smaller display screen is used to display the
image from the tele-camera in order to negate the increased magnification that
the tele-camera would otherwise provide. This way there is no magnification, and
both images appear as if the rays of light are passing through the apparatus, with
the virtual light rays aligned with the real light rays as they would have been had they not been intercepted by
the double-sided mirror. The large display screen is viewed as a reflection in the
mirror, while the small display screen is viewed as a reflection in the beamsplitter.
Note also that the distance between the two display screens, as measured along
the optical axis of the wide-camera is set to be equal to the distance between the
double-sided mirror and the beamsplitter as measured along the optical axis of
the tele-camera. The apparent distance to both display screens is the same, so the
Figure 3.23 Foveated personal imaging system. A large display screen (0.7 inch diagonal)
displays virtual light from a wide-camera, while a small display screen (0.25 inch diagonal)
displays virtual light from a tele-camera. The wide-camera typically performs only head-tracking,
as in the foveated system depicted earlier, but having a viewfinder for the head-tracker was
still found to be of benefit in long-term adaptation.
wearer experiences a view of the two displays superimposed upon one another
in the same depth plane.
In most embodiments the display screens were equipped with lenses to form an
apparent image in the same depth plane as a central real object in the scene. Alter-
natively, the display screens may be equipped with lens assemblies of differing
magnification. Then the display screen displaying
the image from the tele-camera would subtend a smaller visual angle than the
display screen displaying the image from the wide-camera, and so these visual angles
would match the visual angles of the incoming rays of light. In this way two
displays of equal size may be used to simplify manufacture of the apparatus.
An important aspect of the author’s research has been the teaching of students
through simple examples. In order to teach the fundamentals, the author made
the deliberately large and crude EyeTap systems shown in Figure 3.24.
Figure 3.24 Deliberately large and crude examples of EyeTap systems for illustrative teaching
purposes. (a) Side view of crude helmet-mounted device for teaching the principles of tapping
the eye. Note the Connectix QuickCam below the diverter and the aremac above the diverter.
There is no need for focus controls because the apparatus has a very wide depth of field.
(b) The eye is a camera. Front view showing the camera effectively located in the user’s left
eye. Note the elongated slot for adjustability of the EyeTap distance. (c) Close-up of left eye, as
seen while wearing the EyeTap apparatus. These EyeTap devices cause the eye to behave as
a camera in the sense that the camera replaces the eye, and the aremac replaces the subject
matter that would otherwise be seen by the eye.
Figure 3.25 Eyeglass-based EyeTap system (left eye tapped by camera and aremac).
These devices cause the eye itself to behave, in effect, as if it were a camera, in the sense
that the camera replaces the eye. This “eye is a camera” concept is evident in Figure 3.24c.
The author also teaches the students how to build the apparatus onto and into
eyeglasses, as shown in Fig. 3.25 (in both covert and noncovert systems). The
diverter is made from plexiglass mirror because of the ease with which it can
be cut by students on a table, by scratching with a knife and ruler and breaking
at the edge of the table. This material is also safer (free of fragments of glass
that might get in the user’s eye). Two pieces of mirror are glued back-to-back
to make a two-sided mirror. The resulting ghosting helps teach the students the
importance of first-surface mirrors, which are used later in the course.
This simple apparatus allows a portion of the user’s visual field of view
to be replaced by the exact same subject matter, in perfect spatial register
with the real world. The image is also registered in tonal range, using the
quantigraphic imaging framework for estimating the unknown nonlinear response
of the camera, and also estimating the response of the aremac and compensating
for both [64].
[Figure 3.26 diagrams: (a) the plane of the diverter seen edge-on, with the EyeTap distance d, the camera half-angle θ, the lengths l1 and l2 on either side of the EyeTap point, and half-heights 3l1/4 and 3l2/4; (b) the trapezoidal front view; (c) the blurred edge, focused edge, EyeTap point, and circles of confusion.]
Figure 3.26 A simple planar diverter. (a) View showing the plane of the diverter as a line.
(b) View perpendicular to the plane of the diverter. (c) Compensation for the fact that the diverter
is not in sharp focus by either the eye or the camera.
Figure 3.26 depicts a close-up view of just the diverter where the dimensions
of the diverter are indicated. These dimensions are indicated in Figure 3.26a
with respect to the EyeTap point of the diverter, where a view is shown in which
the plane of the diverter is seen as a line. The EyeTap point is defined as the
point where the optical axis of the camera intersects the diverter. To calculate
the dimensions, a square is drawn between the upper right end of the diverter
and the EyeTap point, and a smaller square is drawn between the EyeTap point
and the lower left end of the diverter.
It should be emphasized that the diverter should be the correct size. If it is
too big, it will obstruct vision beyond that which is reconstructed as virtual light
from the aremac. If it is too small, it will provide a smaller mediation zone than
that which the apparatus is capable of producing.
Let the horizontal half-angle subtended by the camera be θ , as illustrated in
Figure 3.26a.
Let d denote the distance from the center of projection of the eye’s lens (i.e.,
the location of the pencil of rays corresponding to the reflection of the diverted
light) to the point on the diverter at which the optical axis of this pencil of rays
intersects the diverter, as illustrated in Figure 3.26a. Note that the optical axis
of the camera is considered to intersect the optical axis of the eye, and this point
at which the two optical axes intersect is called the EyeTap point. Also note that
the distance from the center of projection of the eye to the EyeTap point is equal
to the distance from the center of projection of the camera to the EyeTap point.
This distance is called the EyeTap distance, and is denoted by the letter d.
Let l1 denote the length of the diverter to the left of this point as projected
orthographically onto a plane parallel to the image plane. This distance l1 is the
edge length of the smaller square in Figure 3.26a.
Let l2 denote the length of diverter to the right of this point as projected
orthographically onto a plane parallel to the image plane. This distance l2 is the
edge length of the larger square in Figure 3.26a. Therefore
$$\tan\theta = \frac{l_1}{d - l_1} \tag{3.1}$$

and

$$l_1 = \frac{d\tan\theta}{1 + \tan\theta}. \tag{3.2}$$

Similarly

$$\tan\theta = \frac{l_2}{d + l_2} \tag{3.3}$$

and

$$l_2 = \frac{d\tan\theta}{1 - \tan\theta}. \tag{3.4}$$

The width of the diverter is thus the square root of two times the sum of these
two lengths:

$$w = \sqrt{2}\,(l_1 + l_2) = \sqrt{2}\left(\frac{d\tan\theta}{1 + \tan\theta} + \frac{d\tan\theta}{1 - \tan\theta}\right). \tag{3.5}$$
Since one end of the diverter is closer to the camera than the other, the diverter
will have a trapezoidal shape, as illustrated in Figure 3.26b. The side closer to
the eye will be less tall than the side further from the eye. When oriented at a
45 degree angle inside a pair of eyeglasses, the diverter should subtend a solid
rectangular cone 4 units wide and 3 units high, for a VGA or NTSC aspect ratio
of 4 : 3. The left and right sides subtend the same angle, even though the right
side is taller in reality. Dimensions of the diverter as viewed at a 0 degree angle
(i.e., directly head-on) are illustrated in Figure 3.26b. We know that there will be
no foreshortening of height (i.e., foreshortening in the vertical direction) because
both the near and far edge of the diverter are parallel to the image plane in the
eye.
The standard aspect ratio of video images (whether NTSC, VGA, SVGA, or
DVGA) is 4 units wide and 3 units high, so we know that the total height at
the EyeTap point (where the optical axis intersects the diverter) is $\frac{3}{4}\sqrt{2}\,(l_1 + l_2)$.
Moreover the left half of the diverter is simply a perspective projection of the
left half of an aremac image onto a 45 degree angle, and so is $\frac{3}{2}l_1$ high at that point
(i.e., in the plane defined along distance $l_1$, we would see the left side of the
picture). It is easiest to think of the diverter in four quadrants. For this reason
the diverter has been partitioned into four quadrants of projectively equal sizes
(but of course the Euclidean sizes of the right quadrants are larger owing to the
fact that they are farther from the camera). These quadrants are indicated by the
centerlines that are drawn in Figure 3.26.
From this observation we have that the diverter is
$\frac{3}{2}\,l_1 = \frac{3}{2}\,\frac{d\tan\theta}{1 + \tan\theta}$
high at its near edge and
$\sqrt{2}\left(\frac{d\tan\theta}{1 + \tan\theta} + \frac{d\tan\theta}{1 - \tan\theta}\right)$
wide.
This is a simplification of the actual size of the diverter because the diverter
will be slightly out of focus. There is a small increase in size to account for
the circle of confusion of the camera lens. For this reason the diverter typically
has a slightly larger trapezoidal size and the corners are slightly rounded. This
increased size is depicted in Figure 3.26c.
Figure 3.26c depicts a frontal view of the diverter with corners rounded by a
circle of projectively constant size. This is an approximation to the actual shape,
since the circle of confusion will be projectively smaller further from the camera
(toward the scene and thus toward better focus). The actual size of the blurring
circle will be somewhere between constant projective size and constant Euclidean
size. However, the important thing to note is the rounded corners and the fact that
it is slightly larger than that calculated by way of focused “pinhole geometry”
depicted in Figure 3.26a and Figure 3.26b.
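To make the sizing concrete, the following is a minimal sketch, not from the original text, that evaluates equations (3.2), (3.4), and (3.5) for a hypothetical EyeTap distance and camera half-angle, and pads the result by a hypothetical blur margin to allow for the circle of confusion discussed above.

import math

def diverter_dimensions(d_mm, half_angle_deg, blur_margin_mm=0.0):
    """Size a planar 45-degree diverter from the EyeTap distance d and the
    horizontal half-angle theta subtended by the camera, for a 4:3 aspect ratio."""
    t = math.tan(math.radians(half_angle_deg))
    assert t < 1.0, "equation (3.4) requires tan(theta) < 1"
    l1 = d_mm * t / (1.0 + t)            # equation (3.2): near (left) side
    l2 = d_mm * t / (1.0 - t)            # equation (3.4): far (right) side
    width = math.sqrt(2.0) * (l1 + l2)   # equation (3.5): width along the 45-degree plane
    near_height = 1.5 * l1               # (3/2) l1: trapezoid height at its near edge
    far_height = 1.5 * l2                # (3/2) l2: trapezoid height at its far edge
    pad = 2.0 * blur_margin_mm           # hypothetical allowance for the circle of confusion
    return width + pad, near_height + pad, far_height + pad

# Hypothetical values: 25 mm EyeTap distance, 20-degree half-angle, 0.5 mm blur margin.
w, h_near, h_far = diverter_dimensions(25.0, 20.0, blur_margin_mm=0.5)
print(f"width {w:.1f} mm, near height {h_near:.1f} mm, far height {h_far:.1f} mm")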
EyeTap systems have, for example, unique point spread functions (PSFs) and modulation
transfer functions (MTFs) that can easily be measured and compensated for.
Photoquantigraphic imaging is a new signal-processing methodology
that is well suited to calibration and enhancement of EyeTap images [70]. This
theory states that modification of the image material is best done in lightspace
rather than imagespace, and it provides a methodology for treatment of the
camera as a measurement instrument in which the camera reports the actual
quantity of light arriving from each direction in space. Using this new mathe-
matical theory results in much more natural appearances of the image material
in an EyeTap system. The concept of photoquantigraphic imaging provides
for a new kind of image restoration (see Fig. 3.27) quite suitable for EyeTap
content.
Figure 3.27 Photoquantigraphic Image restoration. (a) Images captured by EyeTap devices
are often blurry and milky in appearance. (b) Traditional image restoration does a poor job
of recovering the original image even when the point spread function (PSF) and modulation
transfer function (MTF) are known or easily measured. (c) Attempts at inverse filtering and further
tone scale adjustments fail to provide a clear image. Although no longer ‘‘milky’’ in appearance,
the image still suffers from undesirable inverse filtering artifacts. (d) By recognizing that the
degradation happens in lightspace [70], not in imagespace, photoquantigraphic restoration
brings improved results. Since EyeTap systems operate in lightspace anyway, in the sense that
images are converted into lightspace, modified (augmented, diminished, or otherwise altered),
and then converted back to imagespace for display on the aremac, there is little additional
computational burden in restoring them in lightspace. Such lightspace processing will be
described in more detail in Chapters 4 and 5.
USING THE DEVICE AS A REALITY MEDIATOR
Because the device absorbs, quantifies, processes, and reconstructs light passing
through it, there are extensive applications in mediated reality. Mediated reality
differs from virtual reality in the sense that mediated reality allows the visual
perception of reality to be augmented, deliberately diminished, or, more generally,
computationally altered. Reality mediators have been demonstrated as useful for
assisting the visually challenged [1], as well as for various other purposes such
as filtering out unwanted advertising (as a filter for packets of light, as a photonic
firewall for establishing forwarding rules on visual information, etc.).
Fully mediated reality, which typically involves a mediation zone (field of
view of the camera and aremac) over which visual reality can be completely
reconfigured, has been explored previously [1]. However, a more moderate form
of mediated reality is possible using the apparatus of Figure 3.21. Mediation
is often only partial in the sense that it affects only part of the field of view
(e.g., one eye or part of one eye), but mediation can also be partial within the
mediation zone. The original reason for introducing this concept was to make
the apparatus less obtrusive so that others can see the user’s eye(s) unobstructed
by the mediation zone.
The author has built many devices for partially mediated reality, often using
a beamsplitter instead of the double-sided mirror for the diverter. This allowed
a partial reflection of the aremac to be visible to the eye of the user by way of
the beamsplitter.
Thus the user sees a superposition of whatever real object is located in front of
the apparatus and an aremac picture of the same real object at the same location.
The degree of transparency of the beamsplitter affects the degree of mediation.
For example, a half-silvered beamsplitter gives rise to a 50% mediation within
the mediation zone.
To prevent video feedback, in which light from the aremac would shine into the
camera, a polarizer was positioned in front of the camera. The polarization axis of
the polarizer was aligned at right angles to the polarization axis of the polarization
of the aremac (since most aremacs are polarized). Thus video feedback was
prevented by the two crossed polarizers in the path between the display screen
and the camera. If the aremac displays the exact same rays of light that come from
the real world, the view presented to the eye is essentially the same. However,
for the RM to provide a distinct view of the world, it was found that the virtual
light from the aremac had to be made different in color from the real light of the
scene. For example, simply using a black-and-white aremac, or a black-and-red
aremac, gave rise to a unique appearance of the region of the mediation zone
of the RM by virtue of a difference in color between the aremac image and the
real world upon which it is exactly superimposed. Even with such chromatic
mediation of the aremac view of the world, it was still found to be far more
difficult to discern whether or not the video was correctly exposed than when
the double-sided mirror was used instead of the beamsplitter. Therefore, when
using these partially see-through implementations of the apparatus, it was found
Besides providing reduced eyestrain, the author has found that the EyeTap system
allows the user to capture dynamic events, such as a volleyball game, from the
perspective of a participant. To confirm the benefits of the new device, the author
has done extensive performance evaluation testing of the device as compared to
wearable camera systems. An example of one of the performance test results
appears in Figure 3.28.
This chapter described a new device. The device includes a capability to measure
a quantity of light in each ray that would otherwise enter an eye of the user, and
a reconstruction capability to regenerate these rays of light. The measurement
capability causes the eye itself to function as a camera. The reconstruction
capability is by way of a device called an aremac. The aremac is a display
[Figure 3.28 plot: Score (0 to 1.0) versus frame rate in frames per second (60, 30, 15, 7.5, 3.75, 1.875) for EyeTap and WearCam systems.]
Figure 3.28 The sharp knee in the curve of frame rate versus ability to do many tasks. Many
tasks require a certain minimum frame rate below which performance drops off rapidly. EyeTap
systems work better than wearable camera systems at a given frame rate. EyeTap systems can
be used at lower frame rates to obtain the same degree of performance as can be obtained
with a wearable camera system operating at a higher frame rate.
technology that does not impose a sense of focus on the eye of a user looking
into it. An aremac is essentially like a camera in reverse, providing depth of
focus control for a display medium. Two embodiments were described, one in
which the aremac has focus controlled by the camera portion of the device and
the other in which the aremac has extended depth of focus. An extreme case
of the latter, called the pinhole aremac, was also described. The pinhole aremac
is implemented with solid state lasers, has no moving parts, and is cheap and
simple to manufacture. It has infinite depth of field, and its use results in zero
eyestrain.
The device using the aremac is well suited to augmented reality, mediated
reality, or simply as an improved camera viewfinder. Some of the technical
problems associated with EyeTap systems, such as the milky and blurred
appearance of video shot with the apparatus, were also addressed. The end result
is a lightweight comfortable device suitable for shooting lifelong documentary
video.
[Figure 3.29 diagram: the leftmost and rightmost rays of real light, the leftmost and rightmost rays of virtual light, the eye, the diverter (two-sided mirror), and two unlabeled positions marked with question marks.]
Figure 3.29 The ‘‘virtual light’’ principle of mediated reality implemented in the ‘‘diverter’’
embodiment of the WearCam invention. Two of the components have been removed from this
diagram. What are they, and where should they go?
cone of view. What are the approximate dimensions of the diverter as a function
of angle subtended by the mediation zone for this embodiment of the reality
mediator (WearCam)? Assume that the mediation zone has an aspect ratio given
by four units of width and three units of height.
4
COMPARAMETRIC EQUATIONS,
QUANTIGRAPHIC IMAGE
PROCESSING, AND
COMPARAGRAPHIC RENDERING
The EyeTap glasses of the previous chapter absorb and quantify rays of light,
process these rays of light, and then resynthesize corresponding rays of light.
Each synthesized ray of light is collinear with, and responsive to, a corresponding
absorbed ray of light. The exact manner in which it is responsive is the subject of
this chapter. In other words, this chapter provides meaning to the word “quantify”
in the phrase “absorb and quantify.”
It is argued that hidden within the flow of signals from typical cameras, through
image processing, to display media, is a homomorphic filter. While homomor-
phic filtering is often desirable, there are some occasions when it is not. Thus
cancellation of this implicit homomorphic filter is proposed, through the intro-
duction of an antihomomorphic filter. This concept gives rise to the principle
of photoquantigraphic image processing, wherein it is argued that most cameras
can be modeled as an array of idealized light meters each linearly responsive to
a semimonotonic function of the quantity of light received and integrated over
a fixed spectral response profile. This quantity, called the “photoquantigraphic
quantity,” is neither radiometric nor photometric but rather depends only on the
spectral response of the sensor elements in the camera. A particular class of func-
tional equations, called “comparametric equations,” is introduced as a basis for
photoquantigraphic image processing. Comparametric equations are fundamental
to the analysis and processing of multiple images differing only in exposure. The
well-known gamma correction of an image is presented as a simple example of a
comparametric equation, for which it is shown that the underlying photoquanti-
graphic function does not pass through the origin. For this reason it is argued that
exposure adjustment by gamma correction is inherently flawed, and alternatives
are provided. These alternatives, when applied to a plurality of images that differ
only in exposure, give rise to a new kind of processing in the amplitude domain
(as opposed to the time domain or the frequency domain). While the theoret-
ical framework presented in this chapter originated within the field of wearable
cybernetics (wearable photographic apparatus) in the 1970s and early 1980s, it is
applicable to the processing of images from nearly all types of modern cameras,
wearable or otherwise. This chapter follows roughly a 1992 unpublished report
by the author entitled “Lightspace and the Wyckoff Principle.”
The quantity of light falling on an image sensor array, or the like, is a real-
valued function q(x, y) of two real variables x and y. An image is typically
a degraded measurement of this function, where degradations may be divided
into two categories: those that act on the domain (x, y) and those that act on
the range q. Sampling, aliasing, and blurring act on the domain, while noise
(including quantization noise) and the nonlinear response function of the camera
act on the range q.
Registering and combining multiple pictures of the same subject matter will
often result in an improved image of greater definition. There are four classes of
such improvement:
4.2.1 What’s Good for the Domain Is Good for the Range
The notion of producing a better picture by combining multiple input pictures
has been well studied with regard to the domain (x, y) of these pictures. Horn
and Schunck, for example, provide means of determining optical flow [71], and
many researchers have then used this result to spatially register multiple images
and provide a single image of increased spatial resolution and increased spatial
extent. Subpixel registration methods such as those proposed by [72] attempt to
increase domain resolution. These methods depend on slight (subpixel) shift from
one image to the next. Image compositing (mosaicking) methods such as those
proposed by [73,74] attempt to increase domain extent. These methods depend
on large shifts from one image to the next.
Although methods that are aimed at increasing domain resolution and domain
extent tend to also improve tonal fidelity by virtue of a signal-averaging and
noise-reducing effect, we will see in what follows that images of different
exposure can be combined to further improve upon tonal fidelity and dynamic
range. Just as spatial shifts in the domain (x, y) improve the image, we will also
see that exposure shifts (shifts in the range, q) also improve the image.
embodied by the Wyckoff film) has been recently proposed [63]. In fact, most
everyday scenes have a far greater dynamic range than can be recorded on a
photographic film or electronic imaging apparatus. A set of pictures that appear
identical except for their exposure collectively show us much more dynamic range
than any single picture from a set, and this also allows the camera’s response
function to be estimated, to within a single constant unknown scalar [59,63,77].
A set of functions
$$f_i(x) = f(k_i q(x)), \tag{4.1}$$
where qs (λ) is the actual light falling on the image sensor and s is the spectral
sensitivity of an element of the sensor array. It is assumed that the spectral
sensitivity does not vary across the sensor array.
Each camera will typically have its own photoquantigraphic unit. In this way the
camera may be regarded as an array of light meters:
$$q(x, y) = \int_0^\infty q_{ss}(x, y, \lambda)\, s(\lambda)\, d\lambda, \tag{4.3}$$
where qss is the spatially varying spectral distribution of light falling on the
image sensor. This light might, in principle, be captured by an ideal Lippman
photography process that preserves the entire spectral response at every point on
an ideal film plane, but more practically, it can only be captured in grayscale or
tricolor (or a finite number of colors) response at each point.
Thus varying numbers of photons of lesser or greater energy (frequency times
Planck’s constant) are absorbed by a given element of the sensor array and, over
the temporal integration time of a single frame in the video sequence (or the
picture-taking time of a still image), result in the photoquantigraphic quantity
given by (4.3).
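As a concrete illustration of equation (4.3), the sketch below (an illustrative assumption, not from the text) approximates the photoquantity at one pixel by a discrete sum over wavelength, using a made-up incident spectral distribution and a made-up Gaussian spectral sensitivity.

import numpy as np

# Wavelength grid in nanometres spanning the visible band.
wavelengths = np.linspace(400.0, 700.0, 301)

# Hypothetical spectral distribution of the light arriving at one pixel, q_ss(x0, y0, lambda).
q_ss = 1.0 + 0.5 * np.sin(wavelengths / 50.0)

# Hypothetical spectral sensitivity s(lambda) of a sensor element:
# a Gaussian centred at 550 nm, roughly a luminance-like response.
s = np.exp(-((wavelengths - 550.0) ** 2) / (2.0 * 60.0 ** 2))

# Equation (4.3) approximated by the trapezoidal rule.
q = np.trapz(q_ss * s, wavelengths)
print(f"photoquantity q at this pixel (arbitrary units): {q:.3f}")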
In a color camera, q(x, y) is simply a vector quantity, such as [qr (x, y),
qg (x, y), qb (x, y)], where each component is derived from a separate spectral
sensitivity function. In this chapter the theory will be developed and explained
for grayscale images, where it is understood that most images are color images
for which the procedures are applied to the separate color channels. Thus in
both grayscale and color cameras the continuous spectral information qs (λ) is
lost through conversion to a single number q or, typically, to three numbers, qr, qg,
and qb.
Ordinarily cameras give rise to noise. That is, there is noise from the sensor
elements and further noise within the camera (or equivalently noise due to film
grain and subsequent scanning of a film, etc.). A goal of photoquantigraphic
imaging is to estimate the photoquantity q in the presence of noise. Since qs (λ)
is destroyed, the best we can do is to estimate q. Thus q is the fundamental or
“atomic” unit of photoquantigraphic image processing.
Figure 4.1 Typical camera and display. Light from subject matter passes through lens
(approximated with simple algebraic projective geometry, or an idealized ‘‘pinhole’’) and is
quantified in q units by a sensor array where noise nq is also added to produce an output
that is compressed in dynamic range by an unknown function f. Further noise nf is introduced
by the camera electronics, including quantization noise if the camera is a digital camera and
compression noise if the camera produces a compressed output such as a jpeg image, giving
rise to an output image f1 (x, y). The apparatus that converts light rays into f1 (x, y) is labeled
CAMERA. The image f1 is transmitted or recorded and played back into a DISPLAY system, where
the dynamic range is expanded again. Most cathode ray tubes exhibit a nonlinear response
to voltage, and this nonlinear response is the expander. The block labeled ‘‘expander’’ is
therefore not usually a separate device. Typical print media also exhibit a nonlinear response
that embodies an implicit expander.
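The signal path of Figure 4.1 can be simulated numerically. The sketch below is a rough illustration under assumed parameters, namely a power-law compressor f(q) = q^0.45 and small Gaussian sensor and image noise; it is not the author's implementation, only a way to see how noise enters before and after dynamic range compression.

import numpy as np

rng = np.random.default_rng(0)

def f(q):
    """Assumed dynamic range compressor: a power law similar to typical video cameras."""
    return np.clip(q, 0.0, None) ** 0.45

def f_inverse(y):
    """Expander corresponding to the assumed compressor."""
    return np.clip(y, 0.0, None) ** (1.0 / 0.45)

q = np.linspace(0.0, 1.0, 11)             # photoquantities of the subject matter
k = 2.0                                   # exposure setting of the camera
n_q = rng.normal(0.0, 0.005, q.shape)     # sensor noise, added before compression
n_f = rng.normal(0.0, 0.005, q.shape)     # image noise (electronics, quantization), added after

f1 = f(k * q + n_q) + n_f                 # recorded picture, as in Figure 4.1
q_tilde = f_inverse(f1) / k               # naive estimate of q (expand, then divide by the exposure)

print("true q   :", np.round(q, 3))
print("estimate :", np.round(q_tilde, 3))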
1 It should be noted that some cameras, such as many modern video surveillance cameras, operate
[Figure 4.2 plots: left, normalized response (0 to 1) versus photoquantity q (0 to 10), comparing a power-law curve with a logarithmic curve; right, photoquantity q (0 to 10) versus renormalized signal level f1 (0 to 1), comparing an antilog curve with a power-law curve.]
Figure 4.2 The power law dynamic range compression implemented inside most cameras
showing approximately the same shape of curve as the logarithmic function, over the range of
signals typically used in video and still photography. The power law response of typical cathode
ray tubes, as well as that of typical print media, is quite similar to the antilog function. The act of
doing conventional linear filtering operations on images obtained from typical video cameras,
or from still cameras taking pictures intended for typical print media, is in effect homomorphic
filtering with an approximately logarithmic nonlinearity.
screens have approximately the same kind of built-in dynamic range compression
suitable for print media as well.
It is interesting to compare this naturally occurring (and somewhat accidental)
development in video and print media with the deliberate introduction of
companders (compressors and expanders) in audio. Both the accidentally
occurring compression and expansion of picture signals and the deliberate use
of logarithmic (or mu-law) compression and expansion of audio signals serve to
allow these signals, in most cases, to be encoded satisfactorily in 8 bits.
(Without dynamic range compression, 12 to 16 bits would be needed to obtain
satisfactory reproduction.)
Most still cameras also provide dynamic range compression built into the
camera. For example, the Kodak DCS-420 and DCS-460 cameras capture
internally in 12 bits (per pixel per color) and then apply dynamic range
compression, and finally output the range-compressed images in 8 bits (per pixel
per color).
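A rough sketch of that kind of in-camera range compression is shown below, using an assumed power-law curve rather than the actual firmware of any particular camera: a linear 12-bit sensor value is mapped to an 8-bit output, giving the deep shadows a disproportionately large share of the output codes.

import numpy as np

def compress_12_to_8(raw12, gamma=0.45):
    """Map linear 12-bit sensor values (0..4095) to 8-bit output values (0..255)
    with an assumed power-law dynamic range compressor."""
    q = np.asarray(raw12, dtype=np.float64) / 4095.0   # normalize to [0, 1]
    return np.round(255.0 * q ** gamma).astype(np.uint8)

# Deep shadows, midtones, and highlights: the shadows receive far more of the
# 8 output bits than they would under a straight linear requantization.
print(compress_12_to_8([16, 256, 1024, 2048, 4095]))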
Figure 4.3 The antihomomorphic filter. Two new elements f̂⁻¹ and f̂ have been inserted, as
compared to Figure 4.1. These are estimates of the inverse and forward nonlinear response
function of the camera. Estimates are required because the exact nonlinear response of a
camera is generally not part of the camera specifications. (Many camera vendors do not even
disclose this information if asked.) Because of noise in the signal f1 , and also because of noise
in the estimate of the camera nonlinearity f, what we have at the output of f̂⁻¹ is not q but rather
an estimate q̃. This signal is processed using linear filtering, and then the processed result is
passed through the estimated camera response function f̂, which returns it to a compressed
tone scale suitable for viewing on a typical television, computer, and the like, or for further
processing.
where each image has, associated with it, a separate realization of a photoquanti-
graphic noise process nq and an image noise process nf that includes noise
introduced by the electronics of the dynamic range compressor f , and other
electronics in the camera that affect the signal after its dynamic range has been
compressed. In a digital camera, nf also includes the two effects of finite word
length, namely quantization noise (applied after the image has undergone dynamic
range compression), and the clipping or saturation noise of limited dynamic range.
In a camera that produces a data-compressed output, such as the Kodak DC260
which produces JPEG images, nf also includes data-compression noise (JPEG
artifacts, etc., which are also applied to the signal after it has undergone dynamic
range compression). Refer again to Figure 4.1.
If it were not for noise, we could obtain the photoquantity q from any one
of a plurality of differently exposed pictures of the same subject matter, for
example, as
$$q = \frac{1}{k_i}\, f^{-1}(f_i), \tag{4.7}$$
where the existence of an inverse for f follows from the semimonotonicity
assumption. Semimonotonicity follows from the fact that we expect pixel values
to either increase or stay the same with increasing quantity of light falling on the
image sensor.2 However, because of noise, we obtain an advantage by capturing
multiple pictures that differ only in exposure. The dark (underexposed) pictures
show us highlight details of the scene that would have been overcome by noise
(i.e., washed out) had the picture been “properly exposed.” Similarly the light
pictures show us some shadow detail that would not have appeared above the
noise threshold had the picture been “properly exposed.”
Each image thus provides us with an estimate of the actual photoquantity q:
$$q = \frac{1}{k_i}\left(f^{-1}(f_i - n_{f_i}) - n_{q_i}\right), \tag{4.8}$$
where nqi is the photoquantigraphic noise associated with image i, and nfi is the
image noise for image i. This estimate of q, q̂, may be written

$$\hat{q}_i = \frac{1}{\hat{k}_i}\, \hat{f}^{-1}(f_i), \tag{4.9}$$
2 Except in rare instances where the illumination is so intense as to damage the imaging apparatus,
for example, when the sun burns through photographic negative film and appears black in the final
print or scan.
1. $k_i q(x_0, y_0) \gg n_{q_i}$, and
2. $c_i(q(x_0, y_0)) \gg c_i\!\left(\frac{1}{k_i} f^{-1}(n_{f_i})\right)$.
The first criterion indicates that for every pixel in the output image, at least
one of the input images provides sufficient exposure at that pixel location to
overcome sensor noise, nqi . The second criterion states that of those at least one
input image provides an exposure that falls favorably (i.e., is neither overexposed
nor underexposed) on the response curve of the camera, so as not to be overcome
by camera noise nfi . The manner in which differently exposed images of the same
subject matter are combined is illustrated, by way of an example involving three
input images, in Figure 4.4.
Moreover it has been shown [59] that the constants ki as well as the unknown
nonlinear response function of the camera can be determined, up to a single
unknown scalar constant, given nothing more than two or more pictures of the
same subject matter in which the pictures differ only in exposure. Thus the
reciprocal exposures used to tonally register (tonally align) the multiple input
images are estimates 1/k̂ i in Figure 4.4. These exposure estimates are generally
made by applying an estimation algorithm to the input images, either while
simultaneously estimating f or as a separate estimation process (since f only
has to be estimated once for each camera, but the exposure ki is estimated for
every picture i that is taken).
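One simple way to form such an exposure estimate, once f has been measured separately, is a median of ratios of the expanded pixel values over well-exposed pixels. The sketch below is only an illustration of that idea under an assumed response function; it is not the estimation algorithm of [59].

import numpy as np

def estimate_exposure_ratio(image1, image2, f_inverse, lo=0.1, hi=0.9):
    """Estimate k2/k1 from two spatially registered images of the same scene
    that differ only in exposure, given an estimate of the inverse response.
    Only pixels reasonably exposed in both images are used, so that noise in
    the toe and shoulder regions does not dominate the estimate."""
    a = np.asarray(image1, dtype=np.float64)
    b = np.asarray(image2, dtype=np.float64)
    mask = (a > lo) & (a < hi) & (b > lo) & (b < hi)
    return np.median(f_inverse(b[mask]) / f_inverse(a[mask]))

# Demonstration with a synthetic scene and an assumed response f(q) = q ** 0.45.
rng = np.random.default_rng(1)
q = rng.uniform(0.05, 0.5, size=10000)
f = lambda q: q ** 0.45
f_inv = lambda y: y ** (1.0 / 0.45)
image1 = np.clip(f(1.0 * q) + rng.normal(0.0, 0.01, q.shape), 0.0, 1.0)
image2 = np.clip(f(2.0 * q) + rng.normal(0.0, 0.01, q.shape), 0.0, 1.0)
print("estimated k2/k1:", estimate_exposure_ratio(image1, image2, f_inv))  # close to 2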
[Figure 4.4 block diagram: three CAMERAs, set to exposures 1, 2, and 3, capture the same subject matter. In each camera the light is quantified by the sensor (adding sensor noise nqi), compressed by f, and further corrupted by image noise nfi to give fi. Each fi is passed through the estimated expander f̂⁻¹, multiplied by the inverse of the estimated exposure 1/k̂i, and weighted by the estimated certainty ĉi; the weighted estimates are combined into q̂, which may optionally be passed through an antihomomorphic Wyckoff filter acting on the estimated photoquantity before being range-compressed with the estimated compressor f̂ for the DISPLAY (cathode ray tube expander). Assumptions (see text for corresponding equations): for every point in the input image set, (1) there exists at least one image in the set for which the exposure is sufficient to overcome sensor noise, and (2) at least one image that satisfies the above has an exposure that is not lost in image noise by being too far into the toe or shoulder region of the response curve.]
Owing to the large dynamic range that some Wyckoff sets can cover, small
errors in f tend to have adverse effects on the overall estimate q̂. Thus it
is preferable to estimate f as a separate process (i.e., by taking hundreds of
exposures with the camera under computer program control). Once f is known
(previously measured), then ki can be estimated for a particular set of images.
The final estimate for q, depicted in Figure 4.4, is given by
$$\hat{q}(x, y) = \frac{\displaystyle\sum_i \hat{c}_i\, \hat{q}_i}{\displaystyle\sum_i \hat{c}_i} = \frac{\displaystyle\sum_i \left[\hat{c}_i(\hat{q}(x, y))/\hat{k}_i\right] \hat{f}^{-1}(f_i(x, y))}{\displaystyle\sum_i \hat{c}_i(\hat{q}(x, y))}, \tag{4.10}$$
From this expression we can see that ci (log(q)) are just shifted versions of
c(log(q)), or dilated versions of c(q).
The intuition behind the certainty function is that it captures the slope of
the response function, which indicates how quickly the output (pixel value or
the like) of the camera varies for given input. In the noisy camera, especially
a digital camera where quantization noise is involved, generally the camera’s
Figure 4.4 The Wyckoff principle. Multiple differently exposed images of the same subject
matter are captured by a single camera. In this example there are three different exposures.
The first exposure (CAMERA set to exposure 1) gives rise to an exposure k1 q, the second to
k2 q, and the third to k3 q. Each exposure has a different realization of the same noise process
associated with it, and the three noisy pictures that the camera provides are denoted f1 , f2 ,
and f3 . These three differently exposed pictures comprise a noisy Wyckoff set. To combine
them into a single estimate, the effect of f is undone with an estimate fˆ that represents our
best guess of what the function f is. While many video cameras use something close to the
standard f = kq^0.45 function, it is preferable to attempt to estimate f for the specific camera
in use. Generally, this estimate is made together with an estimate of the exposures ki . After
re-expanding the dynamic ranges with fˆ−1 , the inverse of the estimated exposures 1/k̂ i is
applied. In this way the darker images are made lighter and the lighter images are made darker,
so they all (theoretically) match. At this point the images will all appear as if they were taken
with identical exposure, except for the fact that the pictures that were brighter to start with
will be noisy in lighter areas of the image and those that had been darker to start with will be
noisy in dark areas of the image. Thus rather than simply applying ordinary signal averaging,
a weighted average is taken. The weights are the spatially varying certainty functions ci (x, y).
These certainty functions turn out to be the derivative of the camera response function shifted
up or down by an amount ki . In practice, since f is an estimate, so is ci , and it is denoted ĉi in
the figure. The weighted sum is q̂(x, y), the estimate of the photoquantity q(x, y). To view this
quantity on a video display, it is first adjusted in exposure, and it may be adjusted to a different
exposure level than any of the exposure levels used in taking the input images. In this figure,
for illustrative purposes, it is set to the estimated exposure of the first image, k̂ 1 . The result is
then range-compressed with fˆ for display on an expansive medium (DISPLAY).
output will be reliable where it is most sensitive to a fixed change in input light
level. This point where the camera is most responsive to changes in input is at
the peak of the certainty function c. The peak in c tends to be near the middle
of the camera’s exposure range. On the other hand, where the camera exposure
input is extremely large or small (i.e., the sensor is very overexposed or very
underexposed), the change in output for a given input is much less. Thus the
output is not very responsive to the input and the change in output can be easily
overcome by noise. Thus c tends to fall off toward zero on either side of its peak
value.
The certainty functions are functions of q. We may also write the uncertainty
functions, which are functions of pixel value in the image (i.e., functions of
grayvalue in fi ), as
$$U(x, y) = \frac{dF^{-1}(f_i(x, y))}{df_i(x, y)}. \tag{4.12}$$
Its reciprocal is the certainty function C in the domain of the image (i.e., the
certainty function in pixel coordinates):
$$C(x, y) = \frac{df_i(x, y)}{dF^{-1}(f_i(x, y))}, \tag{4.13}$$

where $F = \log f$ and $F^{-1}(\cdot) = \log(f^{-1}(\cdot))$. Note that C is the same for all images
(i.e., for all values of image index i), whereas ci was defined separately for
each image. For any i the function ci is a shifted (dilated) version of any other
certainty function cj , where the shift (dilation) depends on the log exposure Ki
(the exposure ki ).
The final estimate of q (4.10) is simply a weighted sum of the estimates of
q obtained from each of the input images, where each input image is weighted
by the certainties in that image.
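The sketch below puts equations (4.9) and (4.10) together for a synthetic Wyckoff set of three exposures. It assumes a power-law response f(q) = q^0.45 and substitutes a simple Gaussian-of-pixel-value certainty function for the derivative-based certainty functions described above, so it illustrates the weighted combination rather than reproducing the author's exact procedure.

import numpy as np

rng = np.random.default_rng(2)

gamma = 0.45
f = lambda q: np.clip(q, 0.0, None) ** gamma              # assumed camera response (compressor)
f_inv = lambda y: np.clip(y, 0.0, None) ** (1.0 / gamma)  # its inverse (expander)

def certainty(pixel_values, width=0.15):
    """Stand-in certainty function of pixel value: high for midtones,
    near zero for pixels close to 0 (underexposed) or 1 (saturated)."""
    return np.exp(-((pixel_values - 0.5) ** 2) / (2.0 * width ** 2))

# Synthetic scene with a large dynamic range, captured at three exposures k_i.
q_true = np.concatenate([np.linspace(0.01, 0.1, 50), np.linspace(0.1, 2.0, 50)])
exposures = [0.25, 1.0, 4.0]
images = []
for k in exposures:
    compressed = f(k * q_true + rng.normal(0.0, 0.002, q_true.shape))                   # sensor noise
    images.append(np.clip(compressed + rng.normal(0.0, 0.01, q_true.shape), 0.0, 1.0))  # image noise

# Equation (4.9): per-image estimates; equation (4.10): certainty-weighted combination.
numerator = np.zeros_like(q_true)
denominator = np.zeros_like(q_true)
for k, f_i in zip(exposures, images):
    q_hat_i = f_inv(f_i) / k
    c_i = certainty(f_i)
    numerator += c_i * q_hat_i
    denominator += c_i
q_hat = numerator / np.maximum(denominator, 1e-12)

print("mean abs. error, best single exposure:",
      min(np.mean(np.abs(f_inv(f_i) / k - q_true)) for k, f_i in zip(exposures, images)))
print("mean abs. error, Wyckoff combination :", np.mean(np.abs(q_hat - q_true)))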
[Figure: the Wyckoff principle illustrated as an image analysis pipeline. Three differently exposed pictures of the same subject matter (f1 underexposed, f2 ‘‘properly’’ exposed, f3 overexposed) form a Wyckoff set; the camera response function and acquisition noise nf make only the midtones of each picture reliable. Each picture is passed through the inverse F̂ of the estimated camera response function and multiplied by the inverse of its estimated relative exposure (1/k̂1 = 4.01, 1/k̂2 = 0.995, 1/k̂3 = 0.26). The resulting estimates q̂1, q̂2, q̂3 are weighted by the certainty functions ĉ1, ĉ2, ĉ3 (derivatives of the estimates of the response function, shifted by the exposure estimates), which work like overlapping filters in a filterbank, but in the ‘‘amplitude domain’’ rather than the frequency domain, to yield an estimate q̂ of the photoquantigraphic quantity q(x, y). Four synthetic images fa, fb, fc, fd from a variety of extrapolated or interpolated exposure levels are then produced.]
the output image without loss of fine details. The result can be printed on paper
or presented to an electronic display in such a way as to have optimal tonal
definition.
We can verify that (4.15) is a solution of (4.14) by noting that

$$g(q) = f(kq) = \exp\!\left((kq)^{\log_k \gamma}\right) = \exp\!\left(k^{\log_k \gamma}\, q^{\log_k \gamma}\right) = \left(\exp\!\left(q^{\log_k \gamma}\right)\right)^{\gamma} = f^{\gamma}.$$
Example Two images, f1 and f2, differ only in exposure. Image f2 was taken with
twice as much exposure as f1; that is, if f1 = f(q), then f2 = f(2q). Suppose
that we wish to tonally align the two images by darkening f2. If we darken f2
by squaring all the pixel values of f2 (normalized on the interval from 0 to 1, of
course), then we have implicitly assumed, whether we choose to admit it or not,
that the camera response function must have been $f(q) = \exp(q^{\log_k(2)}) = \exp(q)$.
We see that the underlying solution of gamma correction, namely the camera
response function (4.15), does not pass through the origin. In fact f (0) = 1.
Since most cameras are designed so that they produce a signal level output of
zero when the light input is zero, the function f (q) does not correspond to a
realistic or reasonable camera response function. Even a medium that does not
itself fall to zero at zero exposure (e.g., film) is ordinarily scanned in such a
way that the scanned output is zero for zero exposure, assuming that the dmin
(minimum density for the particular emulsion being scanned) is properly set in
the scanner. Therefore it is inappropriate and incorrect to use gamma correction
to lighten or darken differently exposed images of the same subject matter, when
the goal of this lightening or darkening is tonal registration (making them look
the “same,” apart from the effects of noise which is accentuated in the shadow
detail of images that are lightened and the highlight detail of images that are
darkened).
[Figure 4.6 schematic: a high-speed tape player whose output is f(2t) and a low-speed tape player (half the tape speed) whose output is f(t), with both outputs fed into an XY plotter.]
Figure 4.6 A system that generates comparametric plots. To gain a better intuitive
understanding of what a comparametric plot is, consider two tape recorders that record
identical copies of the same subject matter and then play it back at different speeds. The
outputs of the two tape recorders are fed into an XY plotter, so that we have a plot of f(t) on the
X axis and a plot of f(2t) on the Y axis. Plotting the function f against a contracted or dilated
(stretched out) version of itself gives rise to a comparametric plot. If the two tapes start playing
at the same time origin, a linear comparametric plot is generated.
Figure 4.6 shows how one can generate a comparametric plot by playing two tape recordings of the same subject matter (i.e., two copies of exactly the same recorded arbitrary signal) at two different speeds into an XY plotter. If the subject matter recorded on the tapes is simply a
sinusoidal waveform, then the resulting comparametric plot is a Lissajous figure.
Lissajous figures are comparametric plots where the function f is a sinusoid.
However, for arbitrary signals recorded on the two tapes, the comparametric plot
is a generalization of the well-known Lissajous figure.
Depending on when the tapes are started, and on the relative speeds of
the two playbacks, the comparametric plot takes on the form x = f (t) and
y = f (at + b), where t is time, f is the subject matter recorded on the tape, x
is the output of the first tape machine, and y is the output of the second tape
machine.
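The tape-recorder analogy is easy to reproduce numerically. The following Octave sketch (the sinusoid and the time grid are arbitrary choices) plots f(t) against f(2t), tracing out a Lissajous figure:

t = linspace(0, 10, 1000);            % time axis for the "recorded" signal
f = @(t) sin(t);                      % sinusoidal subject matter on the tape
plot(f(t), f(2*t));                   % linear comparametric plot with k = 2
xlabel('f(t)'); ylabel('f(2t)');      % result is a Lissajous figure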
The plot (f (t), f (at + b)) will be called an affine comparametric plot.
The special case when b = 0 will be called a linear comparametric plot, and
corresponds to the situation when both tape machines begin playing back the
subject matter at exactly the same time origin, although at possibly different
speeds. Since the linear comparametric plot is of particular interest in this book,
it will be assumed, when not otherwise specified, that b = 0 (we are referring to
a linear comparametric plot).
More precisely, the linear comparametric plot is defined as follows:

Definition 4.3.1 The comparametric plot of a function f(q) is a parametric plot of f(q) versus f(kq), where k is a scalar constant and q is the comparameter.

Here the quantity q is used, rather than time t, because it will not necessarily be time in all applications. In fact it will most often (in the rest of this book)
be a quantity of light rather than an axis of time. The function f () will also
be an attribute of the recording device (camera), rather than an attribute of the
input signal. Thus the response function of the camera will take on the role of
the signal recorded on the tape in this analogy.
A function f (q) has a family of comparametric plots, one for each value of
the constant k, which is called the comparametric ratio.
Thus the plot in Definition 4.3.1 may be rewritten as a plot (f, g(f )), not
involving q. In this form the function g is called the comparametric function,
and it expresses the range of the function f (kq) as a function of the range of
the function f (q), independently of the domain q of the function f .
The plot g defines what is called a comparametric equation:
Definition 4.3.2 Equations of the form g(f (q)) = f (kq) are called compara-
metric equations [64].
The commutative diagram (4.16) relates q, kq, f(q) = f, and f(kq) through the maps f, k, and g, wherein it is evident that there are two equivalent paths to follow from q to f(kq):

g ◦ f = f ◦ k.   (4.17)

Equivalently,

g = f ◦ k ◦ f^{-1},   (4.18)
f(q) = \frac{\exp(q) - 1}{\kappa}   (4.22)

has the comparametric equation

g = \frac{(\kappa f + 1)^{\gamma} - 1}{\kappa}.   (4.23)

With \kappa = 2^{\zeta} - 1, the response function becomes

f(q) = \frac{\exp(q) - 1}{2^{\zeta} - 1}.   (4.24)
The comparametric equation (4.23), with the denominator deleted, forms the
basis for zeta correction of images:
g = \begin{cases} \left((2^{\zeta} - 1)f + 1\right)^{1/\zeta} - 1 & \forall \zeta \neq 0,\\ 2^{f} - 1 & \text{for } \zeta = 0, \end{cases}   (4.25)
where γ has been fixed to be equal to 1/ζ so that there is only one degree of
freedom, ζ .
Implicit in zeta correction of images is the assumption of an exponential
camera response function, scaled. Although this is not realistic (given that the
exponential function expands dynamic range, and most cameras have compressive
response functions rather than expansive response functions), it is preferable
to gamma correction because of the implicit notion of a response function for
which f (0) = 0. With standard IEEE arithmetic, values of ζ can range from
approximately −50 to +1000.
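For illustration, a minimal Octave sketch of zeta correction as defined by (4.25) follows; the function name is an arbitrary choice.

% Zeta correction (4.25) of an image f normalized to [0, 1].
function g = zeta_correct(f, zeta)
  if zeta == 0
    g = 2.^f - 1;
  else
    g = ((2^zeta - 1)*f + 1).^(1/zeta) - 1;
  end
end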
f(q) = aq^2 + bq + c.   (4.26)

This gives

q = \frac{-b \pm \sqrt{b^2 - 4a(c - f)}}{2a}.   (4.29)

Similarly, for g, we have

kq = \frac{-b \pm \sqrt{b^2 - 4a(c - g)}}{2a}.   (4.30)

So, setting k times (4.29) equal to (4.30), we have

k\,\frac{-b \pm \sqrt{b^2 - 4a(c - f)}}{2a} = \frac{-b \pm \sqrt{b^2 - 4a(c - g)}}{2a},   (4.31)
g = \alpha \pm \beta d + \gamma d^2,   (4.34)
A simpler and much more accurate and consistent way to vary the output of a
light source is to move it further from or closer to the sensor, or to cover portions
of it with black cardboard. So we begin with the light source far away, and move
it toward the sensor (camera, cell, or whatever) until some small output f1 is
observable by the sensor. We associate this light output with the quantity of light
q1 produced by the light source. Then we cover half the light source: if it is a small lamp with a round reflector, we cover exactly half of the reflector output of the lamp with black paper. This causes the quantity of light received at the sensor to decrease to q0 = q1/2. The measured quantity at the sensor is now
f0 = f (q0 ). Next we move the half-covered lamp toward the sensor until the
quantity f1 is observed. At this point, although the lamp is half covered up, it
is closer to the sensor, so the same amount of light q1 reaches the sensor as did
when the lamp was further away and not half covered. Now, if we uncover the
other half of the lamp, the quantity of light received at the sensor will increase
to q2 = 2q1 . Thus, whatever quantity we observe, call it f2 , it will be equal to
f (2q1 ) which is equal to f (4q0 ), where f is the unknown response function of
the camera. We continue this process, now covering half the lamp back up again
to reduce its output back down to that of q1 , and then moving it still closer to
the sensor until we observe an output of f2 on the sensor. At this point we know
that the lamp is providing a quantity of light q2 to the sensor even though it is
half covered. We can uncover the lamp in order to observe f3 which we know
will be f3 = f(2q2) = f(4q1) = f(8q0). As we repeat the process, we are able to measure the response function of the sensor on a logarithmic scale in which the base of the logarithm is 2.³
This process is called “log unrolling,” and we will denote it by the function
logunroll( ). Alternatively, we could use the inverse square law of light to
determine the response function of the camera.
Unfortunately, both the log-unrolling method, and the inverse square law
method suffer from various problems:
• Only one element (i.e., one pixel or one region of pixels) of the sensor array
is used, so these methods are not very robust.
• Most cameras have some kind of automatic gain control or automatic
exposure. Even cameras that claim to provide manual exposure settings often
fail to provide truly nonimage-dependent settings. Thus most cameras, even when set to “manual,” will exhibit a change in output at one area of the sensor that depends on light incident on other areas of the sensor.
• The output scale is too widely spaced. We only get one reading per doubling
of the exposure in the half covering method.
³ This log spacing is quite wide; we only get to find f on a very coarse q axis that doubles each time. However, we could use a smaller interval by covering the lamp in quarter sections, or smaller sections, such as varying the lamp output in smaller increments with pie-shaped octants of black paper.
Figure 4.7 Picture of test pattern taken by author on Wednesday December 20, 2000, late
afternoon, with imaging portion of wearable camera system mounted to tripod, for the purpose
of determining the response function f(q) of the imaging apparatus. Exposure was 1/30 s at
f/16 (60 mm lens on a D1 sensor array). WearComp transmission index v115 (115th light vector
of this transmission from a Xybernaut MA IV).
[Figure 4.8 plots: pixel value versus spatial coordinate x, panels (a) and (b); the legend in (b) includes exposures v115, v116, and v118.]
Figure 4.8 Test pattern plots generated by segmenting out bars and averaging down columns
to average out noise and improve the estimate. (a) The greatest exposure for which the highest
level on the chart was sufficiently below 255 that clipping would not bias the estimate was v115
(1/30 s exposure). (b) Plots shown for one-third stop below and above v115, as well as a full
stop below and above v115. Exposures were 1/60 s for v112, 1/40 s for v114, 1/25 s for v116,
and 1/15 s for v118.
f = \frac{-b \pm \sqrt{b^2 - 4a(c - q)}}{2a}   (4.36)

and

kq = ag^2 + bg + c.   (4.38)

Thus

g = \frac{-b \pm \sqrt{b^2 - 4a\left(c - k(af^2 + bf + c)\right)}}{2a},   (4.39)

which will be called the “squadratic model,” so named because of its similarity to the square root of a quadratic formula.
Solving for the parameters a, b, and c, for the 12 data points of the WearCam
system gives curves plotted in Figure 4.11a.
[Figure 4.9 plots: f(q), normalized 0 to 1, versus quantigraphic unit q, panels (a) and (b).]
Figure 4.9 Plots of f(q) together with best quadratic curve fits. The range f is normalized on
the interval from 0 to 1. (a) Unfortunately, a quadratic fit comes far from passing through the
origin, and (b) even if constrained to do so, the curve becomes concave upward in places. This
is a very poor fit to the observed data.
[Figure 4.10 plots: cubic fits of the form aq^3 + bq^2 + cq + d to f(q) versus quantigraphic unit q, panels (a) and (b).]
Figure 4.10 (a) Unfortunately, a cubic fit still comes far from passing through the origin, and
(b) even if constrained to do so, the curve becomes concave upwards in places. This is a very
poor fit to the observed data.
[Figure 4.11 plots: f(q) versus quantigraphic unit q, panels (a) and (b).]
Figure 4.11 (a) The best-fit inverse quadratic for f(q) turns out to not be a function.
(b) Constraining it to pass through the origin does not help solve the problem.
[Figure 4.12 plots: sqrtic fit aq + b\sqrt{q} + c to the response function f(q) versus quantigraphic unit q, panels (a) and (b).]
Figure 4.12 (a) The best-fit inverse quadratic for f(q) turns out to not be a function.
(b) Constraining it to pass through the origin does not help solve the problem.
[Figure 4.13 plots: g(f(q)) = f(kq) versus f(q) for comparametric ratio k = 2; (a) quadratic model, (b) sqrtic model.]
Figure 4.13 Comparison of the comparametric plots for the last two models, with k = 2:
(a) Comparaplot for quadratic model of f(q); (b) comparaplot for sqrtic model of f(q).
matter, we can in fact apply this same philosophy to the way we obtain ground-
truth measurement data.
Table 4.1 was generated from picture v115, and likewise Table 4.2 shows
19 sets of ordered pairs arising from each of 19 differently exposed pictures
(numbered v100 to v118) of the test chart. The actual known quantity of light
entering the wearable imaging system is the same for all of these, and is denoted
q in the table.
Plotting the data in Table 4.2, we obtain the 19 plots shown in Figure 4.14a.
Shifting these results appropriately (i.e., by the Ki values), to line them up, gives
the ground-truth, known response function, f , shown in Figure 4.14b.
For simplicity, the base of the logarithm is chosen such that the shift in Q is by an integer quantity for each of the exposures. In this example, the base of the logarithm is 2^{1/3}, since the pictures were all taken at third F-stop exposure intervals, by way of the exposure times listed in Table 4.3.
Table 4.2  Ordered pairs (q, f) arising from 19 differently exposed pictures (v100 to v118) of the test chart

q v100 v101 v102 v103 v104 v105 v106 v107 v108 v109 v110 v111 v112 v113 v114 v115 v116 v117 v118
0.011 1.7 1.7 1.8 2.2 2.6 3.1 3.6 4.4 5.5 7.1 8.7 10.3 12.9 16.1 19.7 23.8 28.7 34.0 41.9
0.021 2.6 2.7 3.1 3.8 4.6 5.8 7.1 8.9 11.5 15.3 18.1 21.6 26.9 32.5 38.5 44.9 52.4 60.4 72.6
0.048 4.3 4.8 5.5 7.1 8.7 11.2 13.8 17.1 21.7 28.2 32.5 37.9 45.3 53.2 61.5 70.4 80.9 92.3 108.8
0.079 6.1 7.0 8.2 10.5 13.0 16.5 20.1 24.8 30.9 38.8 44.1 50.8 59.8 69.7 79.6 90.5 103.1 117.0 137.2
0.126 9.2 10.6 12.6 15.9 19.4 24.4 29.6 35.2 43.0 52.7 59.2 67.7 78.8 90.8 102.9 116.2 131.9 148.7 172.5
0.185 12.9 14.7 17.3 21.7 26.6 32.4 38.4 45.5 54.6 65.9 73.5 83.8 95.9 109.6 125.7 140.7 159.0 175.0 198.3
0.264 18.3 20.6 24.4 30.0 35.6 42.9 50.2 58.6 69.6 83.4 92.6 104.0 120.6 136.9 152.8 171.2 189.2 207.9 227.7
0.358 23.0 26.1 30.3 36.7 43.4 51.7 60.1 69.9 82.2 97.5 107.3 121.5 137.7 156.3 176.7 193.2 211.7 223.3 236.9
0.472 29.4 32.7 37.6 45.2 52.6 62.3 72.1 83.1 97.0 114.6 126.3 141.6 161.5 181.1 199.0 216.5 230.5 239.8 249.8
0.622 36.2 40.2 46.1 54.5 63.3 74.4 85.5 98.0 113.7 133.8 146.9 164.2 184.5 203.6 221.0 233.2 242.0 249.2 253.0
0.733 40.7 45.0 51.2 60.3 70.0 81.9 93.9 107.2 124.2 145.8 159.7 177.3 197.3 216.4 230.5 239.4 247.5 253.0 254.0
0.893 46.7 51.2 58.0 68.2 78.8 91.7 105.0 119.5 138.0 161.4 176.0 193.1 213.3 228.8 238.9 246.7 252.7 254.0 255.0
[Figure 4.14 plots: f (response byte) versus Q (quantity of light), panels (a) and (b).]
Figure 4.14 (a) Each of the 19 exposures produced 11 ordered pairs in a plot of f(Q) as
a function of Q = log(q). (The point at the origin is omitted since the data is plotted on a
logarithmic scale). The data points are represented as circles, and are connected together by
straight lines to make it more clear which of the points belongs to which exposure. Note that
the known Q values (as provided by DSC Labs) are not uniformly spaced. (b) Shifting these
19 plots left or right by the appropriate exposure constant Ki allowed them all to line up to
produce
√ the ground truth known-response function f(Q). The base of the logarithm is chosen
3
as 2 (the ki ratios) so that the amount of left-right shift is equal to an integer for all plots.
Solving (4.42) means determining the general class of functions f (q) for
which (4.42) is true. Let f (q) be expressed as a Laurent series:
f(q) = \cdots + c_{-2}q^{-2} + c_{-1}q^{-1} + c_0 + c_1 q + c_2 q^2 + \cdots = \sum_{n=-\infty}^{\infty} c_n q^n.   (4.43)
\frac{dg}{df} = a,   (4.46)

so that

\frac{dg}{dq} = \frac{dg}{df}\,\frac{df}{dq} = a\,\frac{df}{dq},   (4.47)
which can hold only if at most one of the coefficients cn is nonzero. Let at most
one nonzero coefficient be the mth coefficient cm so that
\frac{df}{dq} = m c_m q^{m-1}.   (4.49)
where continuous values of α, β, and γ are allowed, and the coefficient at the
origin is not necessarily set to zero.
One of these solutions turns out, perhaps coincidentally, to be the familiar
model
f(q) = \alpha + \beta q^{\gamma},   (4.51)

whose corresponding comparametric equation is

g = k^{\gamma} f + \alpha(1 - k^{\gamma}).   (4.53)
Note that the constant β does not appear in the comparametric equation. Thus we
cannot determine β from the comparametric equation. The physical (intuitive)
interpretation is that we can only determine the nonlinear response function of a
camera up to a single unknown scalar constant.
Note that (4.14) looks quite similar in form to (4.51). It in fact is identical if we
set α = 0 and β = 1. However, one must recall that (4.14) is a comparametric
equation and that (4.51) is a solution to a (different) comparametric equation.
Thus we must be careful not to confuse the two. The first corresponds to gamma
correction of an image, while the second corresponds to the camera response
function that is implicit in applying (4.53) to lighten or darken the image. To
make this distinction clear, applying (4.53) to lighten or darken an image will be
called affine correcting (i.e., correcting by modeling the comparametric function
with a straight line). The special case of affine correction when the intercept is
equal to zero will be called linear correction.
Preferably affine correction of an image also includes a step of clipping values greater than one to one, and values less than zero to zero, in the output image:

g = \min\!\left(\max(af + b,\, 0),\, 1\right).   (4.54)
If the intercept is zero and the slope is greater than one, the effect, neglecting
noise, of (4.54), is to lighten the image in a natural manner that properly simulates
the effect of having taken the picture with greater exposure. In this case the
effect is theoretically identical to that which would have been obtained by using
a greater exposure on the camera, assuming that the response function of the
camera follows the power law f = q γ , as many cameras do in practice. Thus
it has been shown that the correct way to lighten an image is to apply linear
correction, not gamma correction (apart from correction of an image to match an
incorrectly adjusted display device or the like, where gamma correction is still
the correct operation to apply).
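To make the distinction concrete, the following Octave fragment (with an arbitrary slope of 2 and an image f normalized to [0, 1]) contrasts linear correction, which simulates a longer exposure for a power-law camera, with gamma correction:

f_linear = min(max(2.0 * f, 0), 1);   % linear (affine, zero-intercept) correction with clipping
f_gamma  = f .^ (1/2);                % gamma correction (implicitly assumes a response with f(0) = 1)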
As before, we have worked forward, starting with the solution (4.51) and
deriving the comparametric equation (4.53) of which (4.51) is a solution. It is
much easier to generate comparametric equations from their solutions than it is
to solve comparametric equations.
[Figure 4.15 plots: f (response byte) versus Q (quantity of light); panel (a) over the full exposure range, panel (b) over a narrower range.]
Figure 4.15 The standard power law photographic response function (4.51) can only fit the
response of the imaging apparatus over a narrow region of exposure latitude. (a) Best fit over
the full 37/3 F-stops is poor. (b) Best fit over an interval of ten thirds of a stop is satisfactory.
Although this region of exposures is typical of conventional photography, a feature of cybernetic
photography is the use of deliberate massive overexposure and underexposure. Indeed, the
human eye has a much wider exposure latitude than is suggested by the narrow region over
which the power law model is valid. Therefore a new model that captures the essence of the
imaging system’s response function in regions of extreme exposure is required.
which has only three parameters. Thus no extra unnecessary degrees of freedom
(which might otherwise capture or model noise) have been added over and above
the number of degrees of freedom in the previous model (4.51).
An intuitive understanding of (4.55) can be better had by rewriting it:
f = \begin{cases} \dfrac{1}{\left(1 + e^{-(a \log(q)+b)}\right)^{c}} & \forall q \neq 0,\\ 0 & \text{for } q = 0. \end{cases}   (4.56)

Written in this form, the soft transition into the toe (region of underexposure) and shoulder (region of overexposure) regions is evident by the shape this curve has if plotted on a logarithmic exposure scale,

f = \frac{1}{\left(1 + e^{-(aQ+b)}\right)^{c}},   (4.57)
[Figure 4.16 panels: (a) density D versus Q for three response functions; (b) the tonally aligned responses; (c) the corresponding certainty functions versus Q.]
Figure 4.16 Example of response functions 1/(1 + e−(a log(q)+b) )c , which have soft transition
into the toe region of underexposure and shoulder region of overexposure. Traditionally these
responses would be on film, such as the Wyckoff film, having density ‘D’ as a function of log
exposure (i.e., Q = log(q)). (a) Response functions corresponding to three different exposures.
In the Wyckoff film these would correspond to a coarse-grained (fast) layer, denoted by a
dashed line, that responds to a small quantity of light, a medium-grained layer, denoted by a
dotted line, that responds moderately, and a fine-grained layer, denoted by a solid line, that
responds to a large quantity of light. Ideally, when the more sensitive layer saturates, the next
most sensitive layer begins responding, so each layer covers one of a set of slightly overlapping
amplitude bins. (b) Tonally aligning (i.e., tonally ‘‘registering’’ by comparadjustment), the images
creates a situation where each image provides a portion of the overall response curve. (c) The
amplitude bin over which each contributes is given by differentiating each of these response
functions, to obtain the relative ‘‘certainty function.’’ Regions of highest certainty are regions
where the sensitivity (change in observable output with respect to a given change in input) is
maximum.
g(f) = \frac{f\,k^{ac}}{\left(\sqrt[c]{f}\,(k^{a} - 1) + 1\right)^{c}},   (4.58)

where K = \log(k), so that k^{a} = e^{aK}.
Again, note that g(f ) does not depend on b, which is consistent with our
knowledge that the comparametric equation captures the information of f (q) up
to a single unknown scalar proportionality constant.
Therefore we may rewrite (4.55) in a simplified form
f(q) = \left(\frac{q^{a}}{q^{a} + 1}\right)^{c},   (4.59)
where b has been normalized to zero, and where it is understood that q > 0,
since it is a quantity of light (therefore f is always real). Thus we have, for q,
q = \sqrt[a]{\frac{\sqrt[c]{f(q)}}{1 - \sqrt[c]{f(q)}}}.   (4.60)
From this simple form, we see that there are two degrees of freedom, given by
the free parameters a and c. It is useful and intuitive to consider the slope of the
corresponding comparametric equation (4.58),
\frac{dg}{df} = \frac{k^{ac}}{\left(\sqrt[c]{f}\,(k^{a} - 1) + 1\right)^{c+1}}.   (4.61)
Once the slope at the origin is fixed (note that this slope depends only on the product ac), we can replace a with a*sharpness and replace c with c/sharpness, where sharpness typically varies from 1 to 200, depending on the nature of the camera or imaging system (e.g., 10 might be a typical value of sharpness for a typical camera system).
Once the values of a and c are determined for a particular camera, the
response function of that camera is known by way of (4.60). Equation (4.60)
provides a recipe for converting from imagespace to lightspace. It is thus
implemented, for example, in the comparametric toolkit as function pnm2plm
(from http://wearcam.org/cement), which converts images to portable
lightspace maps.
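A minimal Octave rendering of this conversion recipe, with a and c assumed already known for the camera, might look as follows (the anonymous-function names are illustrative):

f_of_q = @(q, a, c) (q.^a ./ (q.^a + 1)).^c;               % forward model (4.59)
q_of_f = @(f, a, c) (f.^(1/c) ./ (1 - f.^(1/c))).^(1/a);   % inverse (4.60): imagespace to lightspace
% e.g., q = q_of_f(double(img)/255, a, c) converts a normalized image to photoquantities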
It should also be emphasized that this model never saturates: only as q increases without bound does f approach one (the maximum value). In an actual camera, such as one having 8 bits per pixel per channel, the value 255 would never quite be reached.
In practice, however, we know that cameras do saturate (i.e., there is usually
a finite value of q for which the camera will give a maximum output). Thus the
actual behavior of a camera is somewhere between the classic model (4.51) and
that of (4.60). In particular, a saturated model turns out to be the best.
and

k^{a} q^{a} = \frac{\sqrt[c]{g}}{\sqrt[c]{s} - \sqrt[c]{g}}.   (4.64)

This obtains

\sqrt[c]{g} = \frac{k^{a}\,\sqrt[c]{s}\,\sqrt[c]{f}}{\sqrt[c]{f}\,(k^{a} - 1) + \sqrt[c]{s}},   (4.65)
[Figure 4.17 plot: g(f(q)) = f(kq) versus f(q), both axes from 0 to 1.2, showing equicomparametric and clipping regions and the points s0, s1, s2, s3.]
Figure 4.17 A scaling factor with saturation arithmetic to the power of root over root
plus constant correction. The scaling factor allows for an equicomparametric function with
equicomparametricity not necessarily equal to unity. A typical linear scaling saturation constant
of equicomparametricity is 1.2, as shown in the figure. This model accounts for soft saturation
toward a limit, followed by hard saturation right at the limit. Four points are shown in the
illustration: the curve starts at s0 (the origin) and moves toward s1 , where it is clipped, and then
moves toward s2 . A fourth point, s3 shows, where it would have gone (from s2 ) if it were not for
the saturation operator.
giving
g(f) = \frac{s\,f\,k^{ac}}{\left(\sqrt[c]{f}\,(k^{a} - 1) + \sqrt[c]{s}\right)^{c}}.   (4.66)

Dividing through by s,

\frac{g}{s} = \frac{(f/s)\,k^{ac}}{\left(\sqrt[c]{f/s}\,(k^{a} - 1) + 1\right)^{c}},   (4.67)
where we can see the effect of the saturation parameter is simply to set forth
a new scaling. This new scaling is shown in Figure 4.17. The result is a three-
parameter model with parameters a, c, and s. This model accurately describes the
relationships among sets of images coming from a very large number of different
kinds of imaging systems.
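For reference, the saturated model (4.66) is straightforward to evaluate; a hedged Octave sketch follows, with all names illustrative:

% Saturated comparametric model (4.66): predicts the lighter image g from
% the darker image f, given exposure ratio k and parameters a, c, s.
g_of_f = @(f, k, a, c, s) s .* f .* k.^(a*c) ./ ...
                          (f.^(1/c) .* (k.^a - 1) + s.^(1/c)).^c;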
Table 4.4  Comparametric equations g(f(q)) = f(kq) and their solutions (camera response functions) f(q)

• g = f^{\gamma}  ;  f = \exp(q), \ \gamma = k
• g = k^{\gamma} f  ;  f = q^{\gamma}
• g = af + b, \ \forall a \neq 1 \ \text{or}\ b = 0  ;  f = \alpha + \beta q^{\gamma}, \ a = k^{\gamma}, \ b = \alpha(1 - k^{\gamma})
• g = f + a \log k  ;  f = a \log(q) + b
• g = (\sqrt[\gamma]{f} + \log k)^{\gamma}  ;  f = \log^{\gamma} q
• g = (f + 1)^{\gamma} - 1  ;  f = \exp(\beta q) - 1, \ \gamma = k
• g = e^{b} f^{a} = e^{\alpha(1 - k^{\gamma})} f^{(k^{\gamma})}  ;  \log f = \alpha + \beta q^{\gamma}
• g = \exp\!\left((\log f)^{(k^{b})}\right)  ;  f = \exp\!\left(a^{(q^{b})}\right)
• g = \exp(\log^{k} f)  ;  f = \exp(a^{bq})
• g = \alpha \pm \beta d + \gamma d^{2}, where d = \sqrt{b^{2} - 4a(c - f)}, \ \alpha = \dfrac{k^{2}b^{2} - 2kb^{2} + 4ac}{4a}, \ \beta = \dfrac{b(k - k^{2})}{2a}, \ \gamma = \dfrac{k^{2}}{4a}  ;  f = aq^{2} + bq + c
• g = \dfrac{2}{\pi}\arctan\!\left(k \tan\dfrac{\pi f}{2}\right)  ;  f = \dfrac{2}{\pi}\arctan(q)
• g = \dfrac{1}{\pi}\arctan\!\left(b\pi \log k + \tan\!\left(\pi f - \dfrac{\pi}{2}\right)\right) + \dfrac{1}{2}  ;  f = \dfrac{1}{\pi}\arctan(b\pi \log q) + \dfrac{1}{2} \ \forall q \neq 0, and f = 0 for q = 0
• g = \dfrac{f\,k^{ac}}{\left(\sqrt[c]{f}\,(k^{a} - 1) + 1\right)^{c}}  ;  f = \left(\dfrac{e^{b} q^{a}}{e^{b} q^{a} + 1}\right)^{c} = \left(\dfrac{1}{1 + e^{-(a \log q + b)}}\right)^{c} \ \forall q \neq 0, and f = 0 for q = 0
• g = \exp\!\left(\dfrac{(\log f)\,k^{ac}}{\left(\sqrt[c]{\log f}\,(k^{a} - 1) + 1\right)^{c}}\right)  ;  f = \exp\!\left(\left(\dfrac{e^{b} q^{a}}{e^{b} q^{a} + 1}\right)^{c}\right) = \exp\!\left(\left(\dfrac{1}{1 + e^{-(a \log q + b)}}\right)^{c}\right) \ \forall q \neq 0, and f = 0 for q = 0

Note: The third equation from the top and second from the bottom were found to describe a large variety of cameras and have been used in a wide variety of photoquantigraphic image-processing applications. The second equation from the bottom is the one that is most commonly used by the author. The bottom entry in the table is for use when camera output is logarithmic, that is, when scanning film in units of density.
This solution also appears in Table 4.4. We may also use this solution to seed
the solution of the comparametric equation second from the bottom of Table 4.4,
by using h(x) = x/(x + 1). The equation second from the bottom of Table 4.4
may then be further coordinate transformed into the equation at the bottom of
Table 4.4 by using h(x) = exp(x). Thus properties of comparametric equations,
such as those summarized in Table 4.5, can be used to help solve comparametric
equations, such as those listed in Table 4.4.
Table 4.5  Some properties of comparametric equations g(f(q)) = f(kq) and their solutions (camera response functions) f(q)

• ğ(f) = g(f̆), where ğ(f(q)) = g(f(h(q)))  ;  f̆(q) = f(h(q)), ∀ bijective h
• g(f) = ğ(f), where ğ(f(q)) = f̆(βq)  ;  f(q) = f̆(βq)
• g(f) = g(h(f̆)) = h(ğ), where ğ(q) = f̆(kq)  ;  f = h(f̆)
• h^{-1}(g) = ğ(f̆)  ;  f = h(f̆)
The process (4.69) of “registering” the second image with the first differs from the image registration procedure commonly used in much of machine vision [81–84] and image resolution enhancement [72–73] because it operates on the range f(q(x)) (tonal range) of the image fi(x) as opposed to its domain (spatial coordinates) x = (x, y).
\varepsilon = \sum_{m=0}^{255}\sum_{n=0}^{255}\left(\frac{n}{255} - \hat{f}\!\left(k \hat{f}^{-1}\!\left(\frac{m}{255}\right)\right)\right)^{2} J[m, n],   (4.70)

or that minimizes

\varepsilon = \sum_{m=0}^{255}\sum_{n=0}^{255}\left(\hat{f}^{-1}(y) - k \hat{f}^{-1}(x)\right)^{2} J[m, n] = \sum_{m=0}^{255}\sum_{n=0}^{255}\left(\hat{f}^{-1}\!\left(\frac{n}{255}\right) - k \hat{f}^{-1}\!\left(\frac{m}{255}\right)\right)^{2} J[m, n],   (4.71)
or equivalently, that minimizes the logarithm of the degree to which this equation
(4.69) is untrue. This is obtained by minimizing
\varepsilon = \sum_{m=0}^{255}\sum_{n=0}^{255}\left(\hat{F}^{-1}\!\left(\frac{n}{255}\right) - \hat{F}^{-1}\!\left(\frac{m}{255}\right) - K\right)^{2} J[m, n],   (4.72)
Figure 4.18 Understanding the comparagram by using it to recover the curve in Modify Curves. (a) Original image. (b) The Modify Curves function of the GIMP menu item Curves is applied
to the original image. (c) Modified image resulting from this Modify Curves operation being
applied to the original image. (d) The comparagram of the original image and the modified
image. Notice how the modification is recovered. The comparagram allows us to recover, from
just the two images (original and modified), the curve that was applied by the Modify Curves
function. Completely black areas of the comparagram indicate bin counts of zero, whereas
lighter areas indicate high bin counts. From here on, however, comparagrams will be shown in
a reversed tone scale to conserve ink or toner.
A lookup table (LUT) for converting from an image format such as Portable PixMap (PPM) to a Portable LightspaceMap (PLM) is implemented by the pnm2plm program.
Additionally, the estimate f̂^{-1} (or equivalently, F^{-1}) is generally constrained to be semimonotonic (and preferably smooth as well). Semimonotonicity is forced by preventing its derivative from going negative, and this is usually done using a quadratic programming system such as qp.m in Octave or Matlab, for example starting from
H=A.'*A;
b=A.'*y;
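A minimal Octave sketch of this monotonicity-constrained least squares fit is given below. It assumes the overdetermined system A*x ≈ y (with x holding the 256 samples of F̂^{-1}) has already been assembled from the comparagrams; the variable names and starting guess are illustrative assumptions rather than the author's actual qp.m invocation.

n  = columns(A);                      % number of samples of the estimate F^-1
D  = diff(eye(n));                    % first-difference operator (discrete derivative)
x0 = linspace(0, 1, n).';             % any monotone starting point
% qp minimizes 0.5*x'*H*x + x'*q subject to A_lb <= A_in*x <= A_ub,
% so pass q = -b and require 0 <= D*x (semimonotonicity):
[x, obj, info] = qp(x0, H, -b, [], [], [], [], zeros(n-1, 1), D, Inf(n-1, 1));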
We now see how well this method works on a typical dataset comprised of
pictures differing only in exposure. The sequence of pictures is from a dark
interior looking out into bright sunlight, with bright sky in the background. The
dynamic range of the original subject matter is far in excess of what can be
captured in any one of the constituent pictures. Such an image sequence is shown
in Figure 4.19. From the comparagrams, the response function is determined
Figure 4.19 A sequence of differently exposed pictures of the same subject matter. Such
variable gain sequences give rise to a family of comparagrams. In this sequence the gain
happens to have increased from left to right. The square matrix J (called a comparagram)
is shown for each pair of images under the image pairs themselves, for k = 2^1 = 2. The next row shows pairwise comparagrams for skip = 2 (e.g., k = 2^2 = 4), and then for skip = 3 (e.g., k = 2^3 = 8). Various skip values give rise to families of comparagrams that capture all the
necessary exposure difference information. Each skip value provides a family of comparagrams.
Comparagrams of the same skip value are added together and displayed at the bottom, for
k = 2, k = 4, and k = 8.
using the least squares method, with monotonicity and smoothness constraints,
to obtain the recovered response function shown in Figure 4.20a.
Although it is constrained by smoothness and monotonicity, the model as
fitted in Figure 4.20a has 256 degrees of freedom. In fact there are simpler
models that have fewer degrees of freedom (and therefore better noise immunity).
So, rather than trying to estimate the 256 sample values of the LUT directly
from the comparagrams, we can use any of the various models previously
presented, in order to break the problem down into two separate simpler (and
better constrained) steps:
[Figure 4.20 plots: (a) f (response byte) versus Q (quantity of light); (b) certainty function versus pixel integer (0 to 255).]
Figure 4.20 (a) A least squares solution to the data shown in Figure 4.19, using a novel
multiscale smoothing algorithm, is shown as a solid line. The plus signs denote known ground
truth data measured from the camera using the test chart. We can see that the least squares
solution recovered from the data in Figure 4.19 is in close agreement with the data recovered
from the test chart. (b) The derivative of the computed response function is the certainty
function. Note that despite the excellent fit to the known data, the certainty function allows
us to see slight roughness in the curve which is accentuated by the process of taking the
derivative.
[Figure 4.21 panels: (a) comparametric plot of f(kq) versus f(q); (b) points f(kq0), f(kq1), ... of the response curve f(q) plotted against q/q0.]
Figure 4.21 Log unrolling. Logarithmic logistic unrolling procedure for finding the pointwise
nonlinearity of an image sensor from the comparametric equation that describes the relationship
between two pictures differing only in their exposures. (a) Comparametric plot: Plot of pixel
values in one image against corresponding pixel values in the other. (b) Response curve: Points
on the response curve, found from only the two pictures, without any knowledge about the
characteristics of the image sensor. These discrete points are only for illustrative purposes. If
a logarithmic exposure scale is used (which is what most photographers use), then the points
fall uniformly on the Q = log(q/q0 ) axis.
Separating the estimation process into two stages allows us a more direct route
to “registering” the image domains if, for example, we do not need to know
f but only require g, which is the recipe for expressing the range of f (kq)
in the units of f (q). In particular, we can lighten or darken images to match
one another without ever having to solve for q. The comparadjustment process
of tonally adjusting (i.e., registering) images using comparagraphic information
and the corresponding program function name appear on the WWW site that
accompanies this text as “comparadj().”
The first part of this two-step process allows us to determine the relationship
between two pictures that differ only in exposure. So we can directly perform
operations like image exposure interpolation and extrapolation as in Figure 4.5
and skip the intermediate step of computing q. Not all image processing
applications require determining q, so there is great value in understanding the
simple relationship between differently exposed pictures of the same subject
matter.
At the very least, even when we do not need to find f (q), we may wish to
find g(f ). One simple algorithm for estimating the comparametric equation g(f )
from actual comparagrams is to find the peaks (indexes of the highest bin count)
along each row of the comparagram, or along each column. This way lookup
tables may be used to convert an image from the tone scale of the first image
from which the comparagram was computed, to the second image from which
the comparagram was computed, and vice versa (depending on whether one is
computing along rows or along columns of the comparagram). However, this
simplistic approach is undesirable for various reasons. Obviously only integer
values will result, so converting one image to the tone scale of another image will result in loss of precision (i.e., likely differing pixel values will end up
being converted to identical pixel values). If we regard the comparagram as a
joint probability distribution (i.e., joint histogram), the interpretation of selecting
highest bin counts corresponds to a maximum likelihood estimate (MLE).
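For concreteness, a small Octave sketch of this peak-selection (MLE) lookup table follows; the image names and the row/column orientation of the comparagram J are assumptions for illustration.

% f1, f2: two 8-bit images (0..255) of the same size, differing only in exposure.
J = accumarray([double(f1(:)) + 1, double(f2(:)) + 1], 1, [256, 256]);  % comparagram
[mx, idx] = max(J, [], 2);            % peak along each row: most likely f2 value for each f1 value
lut = uint8(idx - 1);                 % integer lookup table approximating g(f)
f1_registered = lut(double(f1) + 1);  % f1 converted to the tone scale of f2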
We may regard the process of selecting the maximum bin count across each
row or column as just one example of a moment calculation, and then consider
other moments. The first moment (center of gravity along each row or down each
column) typically gives us a noninteger value for each entry of this lookup table.
If the comparagram were regarded as a joint probability distribution function (i.e., cross-histogram), this method of selecting first moments across rows or down columns would amount to a Bayes least squares formulation (i.e., a Bayes least error, or BLE, estimate).
Calculating moments across rows or down columns is somewhat successful in
“slenderizing” the comparagram into a comparametric plot. However, it still does
not enforce monotonicity or smoothness. Although smoothness is an arbitrary
imposition, we do know for certain that the phenomenon should be monotonic.
Therefore, even if not imposing smoothness (i.e., making no assumptions about
the data), we should at least impose monotonicity.
To impose the monotonicity constraint, we proceed as follows:
4 Each of these images was gathered by signal averaging (capturing 16 times, and then averaging the
images together) to reduce noise. This step is probably not necessary with most full-sized cameras,
but noise from the EyeTap sensor array was very high because a very small sensor array was used
and built into an ordinary pair of sunglasses, in such a way that the opening through which light
entered was very small. Primarily because the device needed to be covert, the image quality was
very poor. However, as we will see in subsequent chapters, this poor image quality can be mitigated
by various new image-processing techniques.
Figure 4.22 (a–e) Collection of differently exposed images used to calibrate the author’s
eyeglass-based personal imaging system. These images differ only in exposure. (A–E) Certainty
images corresponding to each image. The certainty images, c(f(x, y)), are calculated by evaluating the derivative of the estimated response function at f. Areas of higher certainty are
white and correspond to the midtones, while areas of low certainty are black and correspond
to highlights and shadows, which are clipped or saturated at the extrema (toe or shoulder of
the response curve) of possible exposures. Another interpretation of the proposed method of
combining multiple images of different exposure is to think of the result as a weighted sum of
exposure adjusted images (adjusted to the same range), where the weights are the certainty
images.
[Figure 4.23 plots: (a) range–range plots for k = 1/16 to k = 16; (b) synthetic range–range plots for various ag values; both axes run from 0 to 250.]
Figure 4.23 Comparametric plots, g(f(q(x, y))) = f(kq(x, y)), characterizing the specific
eyeglass-mounted CCD sensor array and wearable digitizer combination designed and built
by author. (a) Plots estimated from comparagrams of differently exposed pictures of the same
subject matter, using the proposed nonparametric self-calibration algorithm. (b) Family of
curves generated for various values of k, by interpolating between the nine curves in (a).
on exhibit at the List Visual Arts Center, MIT, in a completely darkened room,
illuminated with a bare flash lamp from one side only) was selected because
of its great dynamic range that could not be captured in any single scan. A
Wyckoff set was constructed by scanning the same negative at five different
“brightness” settings (Fig. 4.24). The settings were controlled by a slider that was
calibrated in arbitrary units from −99 to +99, while running Kodak’s proprietary
scanning software. Kodak provides no information about what these units mean.
Accordingly the goal of the experiment was to find a closed-form mathematical
equation describing the effects of the “brightness” slider on the scans, and to
recover the unknown nonlinearity of the scanner. In order to make the problem
a little more challenging and, more important, to better illustrate the principles
of comparametric image processing, the dmin procedure of scanning a blank film
at the beginning of the roll was overridden.
Jointly (pairwise) comparagrams J01 , J12 , J23 , and J34 were computed from
the five images (v0 through v4 ) of Figure 4.24. They are displayed as density
plots (i.e., treated as images of dimension 256 by 256 pixels, where the darkness
of the image is proportional to the number of counts — darkness rather than
Figure 4.24 These scans from a photographic negative differ only in the choice of ‘‘brightness’’
setting selected using the slider provided on the X-windows screen by the proprietary Kodak
PhotoCD scanning software. The slider is calibrated in arbitrary units from −99 to +99. Five
scans were done and the setting of the slider is noted above each scan.
Figure 4.25 Pairwise comparagrams of the images in Figure 4.24. It is evident that the data are well fitted by a straight line, which suggests that Kodak must have used the standard nonlinear response function f(q) = α + βq^γ in the design of their PhotoCD scanner.
lightness to make it easier to see the pattern) in Figure 4.25. Affine regression
(often incorrectly referred to as “linear regression” ) was applied to the data, and
the best-fit straight line is shown passing through the data points. The best-fit
straight line
g = af + b (4.73)
\varepsilon = \sum_{m=0}^{255}\sum_{n=0}^{255} e^{2} = \sum_{m=0}^{255}\sum_{n=0}^{255}\left(a\,\frac{m}{255} + b - \frac{n}{255}\right)^{2} J[m, n].   (4.74)
\frac{d\varepsilon}{da} = 2\sum_{m,n}\left(af + b - g\right) f\, J(f, g),   (4.75)

\frac{d\varepsilon}{db} = 2\sum_{m,n}\left(af + b - g\right) J(f, g),   (4.76)
with the solution
\begin{pmatrix} \sum_{m,n} f^{2} J[m,n] & \sum_{m,n} f J[m,n] \\ \sum_{m,n} f J[m,n] & \sum_{m,n} J[m,n] \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{m,n} f g\, J[m,n] \\ \sum_{m,n} g\, J[m,n] \end{pmatrix}.   (4.77)
Because the dmin procedure was overridden, notice that the plots do not pass
through the origin. The two leftmost plots had nearly identical slopes and
intercepts, and likewise for the two rightmost, which indicates that the arbitrary Kodak units of “brightness” are self-consistent (i.e., J01, which describes the relationship between a scan at a “brightness” of 40 units and one of 20 units, is essentially the same as J12, which describes the relationship between a scan at a “brightness” of 20 units and one of 0 units). Since there are three parameters in (4.51), k, α, and γ, which describe only two degrees of freedom (slope and intercept), γ may be chosen so that k = \sqrt[\gamma]{a} works out to be linearly proportional to the arbitrary Kodak units. Thus setting \sqrt[\gamma]{a_{\text{left}}}\,/\,\sqrt[\gamma]{a_{\text{right}}} = 20/30 (where a_left is the average slope of the two leftmost plots and a_right the average slope of the two rightmost plots) results in the value γ = 0.2254. From this we obtain α = b/(1 − a) = 23.88. Thus we have that

\hat{q}_i = \frac{1}{k_i}\,\sqrt[\gamma]{f_i(x, y) - \alpha},
Figure 4.26 The certainty functions express the rate of change of f(q(x, y)) with Q(x, y). The
certainty functions may be used to compute the certainty images, c(fi). White areas in one
of the certainty images indicate that pixel values f(q) change fastest with a corresponding
change in the photoquantity, Q. When using the camera as a lightmeter (a photoquantigraphic
instrument to estimate q), it will be most sensitive where the certainty images are white. White
areas of these certainty images correspond to midgray values (midtones) of the corresponding
original images in Figure 4.24, while dark areas correspond to extreme pixel values (either
highlights or shadows) of the original images in Figure 4.24. Black areas of the certainty image
indicate that Q changes drastically with small changes in pixel value, and thus an estimate of
Q in these areas will be overcome by image noise nfi .
differentiated without the further increase in the noise that usually accompanies
differentiation. Otherwise, when determining the certainty functions from poor
estimates of f , the certainty functions would be even more noisy than the poor
estimate of f itself. The resulting certainty images, denoted by c(fi ), are shown
in Figure 4.30. Each of the images, fi (x, y), gives rise to an actual estimate of
the quantity of light arriving at the image sensor (4.9). These estimates were
combined by way of (4.10), resulting in the composite image shown in Figure 4.31. Note that the resulting image Î1 looks very similar to f1, except that
it is a floating point image array of much greater tonal range and image quality.
Furthermore, given a Wyckoff set, a composite image may be rendered
at any in-between exposure from the set (exposure interpolation), as well as
somewhat beyond the exposures given (exposure extrapolation). This result
suggests the “VirtualCamera” [64], which allows images to be rendered at any
desired exposure once q is computed.
Figure 4.27 Noisy images badly scanned from a publication. These images are identical
except for exposure and a good deal of quantization noise, additive noise, scanning noise,
and the like. (a) Darker image shows clearly the eight people standing outside the doorway
but shows little of the architectural details of the dimly lit interior. (b) Lighter image shows the
architecture of the interior, but it is not possible to determine how many people are standing
outside, let alone recognize any of them.
[Figure 4.28 plots, panels (a), (b), and (c): pixel value of image 2 versus pixel value of image 1 (0 to 250), and lighter image g(f) = f(kq) versus darker image f(q) (0 to 1).]
Figure 4.28 Comparametric regression. (a) Joint comparagram. Note that because the images
were extremely noisy, the comparagram is spread out over a fat ridge. Gaps appear in the
comparagram owing to the poor quality of the scanning process. (b) Even the comparagram of
the images prior to the deliberately poor scan of them is spread out, indicating that the images
were quite noisy to begin with. (c) Comparametric regression is used to solve for the parameters
of the comparametric function. The resulting comparametric plot is a noise-removed version
of the joint-comparagram; it provides a smoothly constrained comparametric relationship
between the two differently exposed images.
1A variant similar to BFGS, written by M. Adnan Ali, Corey Manders, and Steve Mann.
[Figure 4.29 plots: response (0 to 200) and its derivative (0 to 0.5) versus quantity of light Q, from −10 to 0.]
Figure 4.29 Relative response functions F(Q + Ki ) recovered from the images in Figure 4.27,
plotted with their derivatives. The derivatives of these response functions suggests a degree of
confidence in the estimate Q̂i = F −1 (fi ) − Ki derived from each input image.
Figure 4.30 Certainty images that are used as weights when the weighted sum of estimates of
the actual quantity of light is computed. Bright areas correspond to large degrees of certainty.
the values a = 4.0462 and c = 0.1448. The response curve for these values is
shown in Figure 4.32, together with known ground-truth data. Since a closed-
form solution has been obtained, it may be easily differentiated without the
further increase in noise that usually accompanies differentiation. (Compare to
Figure 4.20b.)
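A minimal sketch of such a comparametric regression in Octave is given below. It fits the two parameters of (4.58) to a comparagram by weighted least squares, using the general-purpose fminsearch optimizer rather than the BFGS-like program mentioned in the footnote; the comparagram J, the exposure ratio k, and the initial guess are all assumed inputs.

[mm, nn] = ndgrid(0:255, 0:255);      % mm indexes the darker image f, nn the lighter image g
F = mm(:)/255;  G = nn(:)/255;  w = J(:);    % bin centers and bin counts (weights)
gmodel = @(p, f) f .* k.^(p(1)*p(2)) ./ (f.^(1/p(2)) .* (k.^p(1) - 1) + 1).^p(2);  % model (4.58)
cost   = @(p) sum(w .* (gmodel(p, F) - G).^2);   % bin-count-weighted squared error
p = fminsearch(cost, [1, 10]);        % initial guess for [a, c] is arbitrary
a = p(1);  c = p(2);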
Once the response function is found, a modified version shifted over slightly
for each possible exposure value can be formulated (see Fig. 4.33a). Each of
these will have a corresponding certainty function (Fig. 4.33b). Together, the
certainty functions form a bank of amplitude domain filters decomposing the
image into various tonal bands. Thus, the quantity of light for each frame of
Figure 4.31 Composite image made by simultaneously estimating the unknown nonlinearity
of the camera as well as the true quantity of light incident on the camera’s sensor array, given
two input images from Figure 4.27. The combined optimal estimate of q̂ is expressed here in
the coordinates of the lighter (rightmost) image. Nothing was done to appreciably enhance this
image (i.e., the procedure of estimating q and then just converting it back into a picture again
may seem pointless). Still we can note that while the image appears much like the rightmost
input image, the clipping of the highlight details has been softened somewhat. Later we will
see methods of actual image enhancement done by processing q̂ prior to converting it back to
an image again. Steve Mann, 1993.
the sequence can optimally contribute to the output rendition of the wearable
imaging system, so that the quantity of light sent into the eye of the wearer is
appropriately optimal and free of noise. Such a system actually allows the wearer
to see more than could be ordinarily seen (e.g., to see into deep dark shadows
while simultaneously looking straight into the bright sun or the flame of an arc
welder’s rig without eye damage). The apparatus also allows the wearer to see
in nearly total darkness.
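To illustrate how such a bank of shifted response and certainty functions can be generated numerically, here is a brief Octave sketch using the two-parameter model with the a and c values quoted above; the exposure shifts and the Q grid are arbitrary choices.

a = 4.0462;  c = 0.1448;              % parameter values fitted earlier in this chapter
K = -6:0;                             % example exposure shifts (arbitrary)
Q = linspace(-12, 2, 700);            % log-exposure axis, Q = log(q)
for i = 1:numel(K)
  fi = (exp(a*(Q + K(i))) ./ (exp(a*(Q + K(i))) + 1)).^c;  % shifted response function
  ci = gradient(fi, Q(2) - Q(1));     % corresponding certainty function
  % ... plot or accumulate fi and ci to visualize the amplitude-domain filterbank
end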
Most print and display media have limited dynamic range. Thus one might be
tempted to argue against the utility of the Wyckoff principle based on this fact.
One could ask, for example, why bother building a Wyckoff camera that can
capture such dynamic ranges if televisions and print media cannot display more
than a very limited dynamic range? Why bother capturing the photoquantity q
with more accuracy than is needed for display?
[Figure 4.32 plots: (a) f (response byte) versus Q (quantity of light); (b) certainty function versus f (response byte, 0 to 255).]
Figure 4.32 The author’s simple two parameter model fits the response curve almost as well
as the much more complicated 256 parameter model (e.g., the lookup table) of Figure 4.20a,
and in some areas (e.g., near the top of the response curve) the fit is actually better. This
method of using comparametric equations is far more efficient than the least squares method
that produced the data in Figure 4.20a. Moreover, the result provides a closed-form solution
rather than merely a lookup table. (b) This method also results in a very smooth response
function, as we can see by taking its derivative to obtain the certainty function. Here the
relative certainty function is shown on both a linear scale (solid line) and log scale (dashed line).
Compare this certainty function to that of Figure 4.20b to note the improved smoothing effect
of the simple two parameter model.
[Figure 4.33 plots: (a) the family of shifted response curves and (b) their certainty functions, both versus quantity of light Q = log(q).]
Figure 4.33 Response curves and their certainty functions: (a) Response functions shifted
for each possible K (e.g., each exposure that the imaging apparatus is capable of making).
(b) Amplitude domain filterbanks arise from the overlapping shifted certainty functions.
1. Estimates of q are still useful for machine vision and other applications
that do not involve direct viewing of a final picture. An example is the
wearable face recognizer [14] that determines the identity of an individual
from a plurality of differently exposed pictures of that person, and then
presents the identity in the form of a text label (virtual name tag) on
the retina of an eye of the wearer of the eyeglass–based apparatus.
Since q̂ need not be displayed, the problem of output dynamic range,
and the like, of the display (i.e., number of distinct intensity levels of
the laser beam shining into a lens of the eye of the wearer) is of no
consequence.
2. Although the ordinary dynamic range and the range resolution (typically
8 bits) is sufficient for print media (given the deliberately introduced
nonlinearities that best use the limited range resolution), when performing
operations such as deblurring, noise artifacts become more evident. In
general, sharpening involves high-pass filtering, and thus sharpening will
often tend to uncover noise artifacts that would normally exist below
the perceptual threshold when viewed through ordinary display media. In
particular, sharpening often uncovers noise in the shadow areas, making
172 COMPARAMETRIC EQUATIONS, QUANTIGRAPHIC IMAGE PROCESSING
dark areas of the image appear noisy in the final print or display. Thus in
addition to the benefits of performing sharpening photoquantigraphically
by applying an antihomomorphic filter as in Figure 4.3 to undo the
blur of (4.5), there is also further benefit from doing the generalized
antihomomorphic filtering operation at the point q̂ in Figure 4.4, rather
than just that depicted in Figure 4.3.
3. A third benefit from capturing a true and accurate measurement of the
photoquantity, even if all that is desired is a nice picture (i.e., even
if what is desired is not necessarily a true or accurate depiction of
reality), is that additional processing may be done to produce a picture
in which the limited dynamic range of the display or print medium
shows a much greater dynamic range of input signal, through the use
of further image processing on the photoquantity prior to display or
printing.
octave:5> t=linspace(0,6);
octave:6> plot(cos(t),sin(t))
What do you observe about the shape of the plot? Is the shape complete? If
not, what must be done to make the shape complete?
Now construct the same parametric plot for 100 points of the parameter t
ranging from 0 to 6000. This can be done using the Octave commands:
octave:7> t=linspace(0,6000);
octave:8> plot(cos(t),sin(t))
What do you observe? Why is the shape not that of a smooth relation?
This might seem like a rather expensive and complicated “window,” and one
might ask, Why go to such trouble just to make a transparent viewing window?
The reason is that this creates a visual “reality stream” that allows one to modify,
or to allow others to modify, one’s visual perception of reality. To do this, step 2
above is changed from just passing through to modifying light, while it is in
numerical form.
In this chapter we saw that these numbers are referred to as “photoquanta.” It
is desired that the “photoquanta” be calculated from ordinary digitized pictures.
To understand photoquanta, consider two images that differ only in exposure.
These two images may be matched to a so-called comparaplot (comparametric
plot).
Recall the following definition: The comparametric plot of a function f (q) is
defined as the parametric plot of f (q) versus f (kq), where k is a scalar constant
and q is the real-valued comparameter.
Consider the function f (q) = q 1/3 . Construct a comparametric plot for a fixed
value of the comparametric ratio k = 2. In other words, construct a plot of f (q)
versus f (2q). What do you notice about this comparametric plot?
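One possible Octave sketch for this part of the exercise (the grid of q values is an arbitrary choice):

q = linspace(0, 1, 200);
f = @(q) q.^(1/3);
plot(f(q), f(2*q));                   % comparametric plot for k = 2
xlabel('f(q)'); ylabel('f(2q)');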
various blur radii, and comment on your results. Also try sharpening the
final resulting image. Experiment with a combination of blurring the certainty
functions and sharpening the final resultant combined image.
In addition to submitting the final pictures, explain your results. Feel free to
try other laws of composition to combine the two (or more) differently exposed
pictures to obtain a single image of greater dynamic range.
5
LIGHTSPACE AND ANTIHOMOMORPHIC VECTOR SPACES
The research described in this chapter arises from the author’s work in designing
and building a wearable graphics production facility used to create a new
kind of visual art over the past 15 or 20 years. This work bridges the gap
between computer graphics, photographic imaging, and painting with powerful
yet portable electronic flashlamps. Beyond being of historical significance (the
invention of the wearable computer, mediated reality, etc.), this background can
lead to broader and more useful applications.
The work described in this chapter follows on the work of Chapter 4, where it
was argued that hidden within the flow of signals from a camera, through image
processing, to display, is a homomorphic filter. While homomorphic filtering
is often desirable, there are occasions when it is not. The cancellation of this
implicit homomorphic filter, as introduced in Chapter 4, through the introduction
of an antihomomorphic filter, will lead us, in this chapter, to the concept of
antihomomorphic superposition and antihomomorphic vector spaces. This chapter
follows roughly a 1992 unpublished report by the author, entitled “Lightspace
and the Wyckoff Principle,” and describes a new genre of visual art that the
author developed in the 1970s and early 1980s.
The theory of antihomomorphic vector spaces arose out of a desire to create a
new kind of visual art combining elements of imaging, photography, and graphics,
within the context of personal imaging.
Personal imaging is an attempt to:
1. resituate the camera in a new way — as a true extension of the mind and
body rather than merely a tool we might carry with us; and
2. allow us to capture a personal account of reality, with a goal toward:
a. personal documentary; and
The last goal is not to alter the scene content, as is the goal of much in the
way of digital photography [87] — through such programs as GIMP or its weaker
work-alikes such as Adobe’s PhotoShop. Instead, a goal of personal imaging is
to manipulate the tonal range and apparent scene illumination, with the goal of
faithfully, but expressively, capturing an image of objects actually present in the
scene.
In much the same way that Leonardo da Vinci’s or Jan Vermeer’s paintings
portray realistic scenes, but with inexplicable light and shade (i.e., the shadows
often appear to correspond to no single possible light source), a goal of personal
imaging is to take a first step toward a new direction in imaging to attain a
mastery over tonal range, light-and-shadow, and so on.
Accordingly, a general framework for understanding some simple but impor-
tant properties of light, in the context of a personal imaging system, is put forth.
5.1 LIGHTSPACE
A mathematical framework that describes a model of the way that light interacts
with a scene or object is put forth in this chapter. This framework is called
“lightspace.” It is first shown how any of a variety of typical light sources
(including those found in the home, office, and photography studio) can be
mathematically represented in terms of a collection of primitive elements
called “spotflashes.” Due to the photoquantigraphic (linearity and superposition)
properties of light, it is then shown that any lighting situation (combination of
sunlight, fluorescent light, etc.) can be expressed as a collection of spotflashes.
Lightspace captures everything that can be known about how a scene will respond
to each of all possible spotflashes and, by this decomposition, to any possible
light source.
What information about the world is contained in the light filling a region of space?
Space is filled with a dense array of light rays of various intensities. The set of rays
passing through any point in space is mathematically termed a pencil. Leonardo da
Vinci refers to this set of rays as a “radiant pyramid.” [88]
Leonardo expressed essentially the same idea, realizing the significance of this
complete visual description:
The body of the air is full of an infinite number of radiant pyramids caused by the
objects located in it.1 These pyramids intersect and interweave without interfering
with each other during their independent passage throughout the air in which they
are infused. [89]
We can also ask how we might benefit from being able to capture, analyze,
and resynthesize these light rays. In particular, black-and-white (grayscale)
photography captures the pencil of light at a particular point in space time
(x, y, z, t) integrated over all wavelengths (or integrated together with the
spectral sensitivity curve of the film). Color photography captures three readings
of this wavelength-integrated pencil of light each with a different spectral
sensitivity (color). An earlier form of color photography, known as Lippman
photography [90,91] decomposes the light into an infinite2 number of spectral
bands, providing a record of the true spectral content of the light at each point on
the film.
A long-exposure photograph captures a time-integrated pencil of light. Thus
a black-and-white photograph captures the pencil of light at a specific spatial
location (x, y, z), integrated over all (or a particular range of) time, and over all
(or a particular range of) wavelengths. Thus the idealized (conceptual) analog
camera is a means of making uncountably many measurements at the same time
(i.e., measuring many of these light rays at once).
1 Perhaps more correctly, by the interaction of light with the objects located in it.
2 While we might argue about infinities, in the context of quantum (i.e., discretization) effects of
light, and the like, the term “infinite” is used in the same conceptual spirit as Leonardo used it, that
is, without regard to practical implementation, or actual information content.
3 Neglecting any uncertainty effects due to the wavelike nature of light, and any precision effects
selects a particular wavelength of light more efficiently than a prism, though the familiar triangular
icon is used to denote this splitting up of the white light into a rainbow of wavelengths.
5 Neglecting the theoretical limitations of both sensor noise and the quantum (photon) nature of light.
Figure 5.1 Every point in an illuminated 3-D scene radiates light. Conceptually, at least,
we can characterize the scene, and the way it is illuminated, by measuring these rays in all
directions of the surrounding space. At each point in space, we measure the amount of light
traveling in every possible direction (direction being characterized by a unit vector that has two
degrees of freedom). Since objects have various colors and, more generally, various spectral
properties, so too will the rays of light reflected by them, so that wavelength is also a quantity
that we wish to measure. (a) Measurement of one of these rays of light. (b) Detail of measuring
apparatus comprising omnidirectional point sensor in collimating apparatus. We will call this
apparatus a ‘‘spot-flash-spectrometer.’’
There are seven degrees of freedom in this measuring apparatus.6 These are
denoted by θ, φ, λ, t, x, y, and z, where the first two degrees of freedom are
derived from a unit vector that indicates the direction we are aiming the apparatus,
and the last three denote the location of the apparatus in space (or the last four
denote the location in 4-space, if one prefers to think that way). At each point
in this seven-dimensional analysis space we obtain a reading that indicates the
quantity of light at that point in the space. This quantity of light might be found,
for example, by observing an integrating voltmeter connected to the light-sensing
element at the end of the collimator tube. The entire apparatus, called a “spot-
flash-spectrometer” or “spot-spectrometer,” is similar to the flash spotmeter that
photographers use to measure light bouncing off a single spot in the image.
Typically this is over a narrow (one degree or so) beam spread and a short (about
1/500 second) time interval.
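Purely as an illustrative sketch (the record layout and field names here are assumptions, not a prescribed implementation), a single reading of this conceptual instrument can be represented as the seven analysis parameters together with the measured quantity of light:

    from dataclasses import dataclass

    # One conceptual spot-flash-spectrometer reading: seven degrees of freedom
    # plus the measured quantity of light at that point of the analysis space.
    @dataclass
    class SpotReading:
        theta: float       # azimuth of the direction the collimator is aimed
        phi: float         # elevation of that direction
        wavelength: float  # wavelength selected by the prism
        t: float           # time of the measurement
        x: float           # position of the apparatus in space
        y: float
        z: float
        quantity: float    # reading of the integrating meter

    # A discretized lightspace analysis function is then, conceptually, a
    # real-valued array indexed by (theta, phi, wavelength, t, x, y, z).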
Suppose that we obtain a complete set of these measurements of the
uncountably7 many rays of light present in the space around the scene.
6 Note that in a transparent medium one can move along a ray of light with no change. So measuring
the lightspace along a plane will suffice, making the measurement of it throughout the entire volume
redundant. In many ways, of course, the lightspace representation is conceptual rather than practical.
7 Again, the term “uncountable” is used in a conceptual spirit. If the reader prefers to visualize the
rationals — dense in the reals but countable — or prefers to visualize a countably infinite discrete
lattice, or a sufficiently dense finite sampling lattice, this will still convey the general spirit of light
theorized by Leonardo.
Figure 5.2 A number of spotmeters arranged to simultaneously measure multiple rays of light.
Here the instruments measure rays at four different wavelengths, traveling in three different
directions, but the rays all pass through the same point in space. If we had uncountably many
measurements over all possible wavelengths and directions at one point, we would have an
apparatus capable of capturing a complete description of the pencil of light at that point in
space.
Although this is impossible in practice, the human eye comes very close, with its 100
million or so light-sensitive elements. Thus we will denote this collection of spot-
flash-spectrometers by the human-eye icon (“eyecon”) depicted in Figure 5.3.
However, the important difference to keep in mind when making this analogy is
that the human eye only captures three spectral bands (i.e., represents all spectral
readings as three real numbers denoting the spectrum integrated with each of the
three spectral sensitivities), whereas the proposed collection of spot-spectrometers
captures all spectral information of each light ray passing through the particular
point where it is positioned, at every instant in time, so that a multichannel
recording apparatus could be used to capture this information.
So far a great deal has been said about rays of light. Now let us consider an
apparatus for generating one. If we take the light-measuring instrument depicted
in Figure 5.1 and replace the light sensor with a flashtube (a device capable of
creating a brief burst of white light that radiates in all directions), we obtain
a similar unit that functions in reverse. The flashtube emits white light in all
directions (Fig. 5.4), and the prism (or diffraction grating) causes these rays of
white light to break up into their component wavelengths. Only the ray of light
that has a certain specific wavelength will make it out through the holes in the two
apertures. The result is a single ray of light that is localized in space (by virtue
of the selection of its location), in time (by virtue of the instantaneous nature of
electronic flash), in wavelength (by virtue of the prism), and in direction (azimuth
and elevation).
Perhaps the closest actual realization of a spotflash would be a pulsed variable
wavelength dye-laser10 which can create short bursts of light of selectable
wavelength, confined to a narrow beam.
As with the spotmeter, there are seven degrees of freedom associated with
this light source: azimuth, θl ; elevation, φl ; wavelength, λl ; time, tl ; and spatial
position (xl , yl , zl ).
10 Though lasers are well known for their coherency, in this chapter we ignore the coherency
properties of light, and use lasers as examples of shining rays of monochromatic light along a
single direction.
Figure 5.4 Monochromatic flash spotlight source of adjustable wavelength. This light source
is referred to as a ‘‘spotflash’’ because it is similar to a colored spotlight that is flashed for a
brief duration. (Note the integrating sphere around the flashlamp; it is reflective inside, and has
a small hole through which light can emerge.)
White Spotflash
The ideal spotflash is infinitesimally11 small, so we can pack arbitrarily many of
them into as small a space as desired. If we pack uncountably many spotflashes
close enough together, and have them all shine in the same direction, we can set
each one at a slightly different wavelength. The spotflashes will act collectively
to produce a single ray of light that contains all wavelengths. Now imagine that
we connect all of the trigger inputs together so that they all flash simultaneously
at each of the uncountably many component wavelengths. We will call this light
source the “white-spotflash.” The white-spotflash produces a brief burst of white
light confined to a narrow beam. Now that we have built a white-spotflash, we
put it into our conceptual toolbox for future use.
Point Source
Say we take a flash point source and fire it repeatedly12 to obtain a flashing light.
If we allow the time period between flashes to approach zero, the light stays on
continuously. We have now constructed a continuous source of white light that
radiates in all directions. We place this point source in the conceptual toolbox
for future use.
In practice, if we could use a microflash point source that lasts a third of a
microsecond, and flash it with a 3 MHz trigger signal (three million flashes per
second), it would light up continuously.13
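The arithmetic is simply that the flash duration equals the trigger period, so the duty cycle is unity:
\[
\text{duty cycle} \;=\; \frac{t_{\text{flash}}}{T_{\text{trigger}}}
\;=\; \frac{\tfrac{1}{3}\,\mu\text{s}}{1/(3\,\text{MHz})}
\;=\; \frac{\tfrac{1}{3}\,\mu\text{s}}{\tfrac{1}{3}\,\mu\text{s}} \;=\; 1 .
\]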
The point source is much like a bare light bulb, or a household lamp with the
shade removed, continuously radiating white light in all directions, but from a
single point in (x, y, z) space.
Linelight
We can take uncountably many point sources and arrange them along a line in
3-space (x, y, z), or we can take a lineflash and flash it repeatedly so that it stays
on. Either way we obtain a linear source of light called the “linelight,” which we
place in the conceptual toolbox for future use. This light source is similar to the
long fluorescent tubes that are used in office buildings.
Sheetlight
A sheetflash fired repetitively, so that it stays on, produces a continuous light source
called a “sheetlight.” Videographers often use a light bulb placed behind a white
cloth to create a light source similar to the “sheetlight.” Likewise we “construct”
a sheetlight and place it in our conceptual lighting toolbox for future use.
Volume Light
Uncountably many sheetlights stacked on top of one another form a “volume
light,” which we now place into our conceptual toolbox. Some practical examples
12 Alternatively, we can think of this arrangement as a row of flash point sources arranged along the
time axis.
13 In practice, a flashtube needs some time to recharge in order to be ready for the next flash. In this
thought experiment, recycle time is neglected. Alternatively, imagine a xenon arc lamp that stays on
continuously.
of volumetric light sources include the light from luminous gas like the sun, or
a flame. Note that we have made the nonrealistic assumption that each of these
constituent sheetlights is transparent.
Figure 5.5 ‘‘For now we see through a glass, lightly.’’ Imagine that there is a plane of light
(i.e., a glass or sheet that produces light itself). Imagine now that this light source is totally
transparent and that it is placed between you and some object. The resulting light is very soft
upon the object, providing a uniform illumination without distinct shadows. Such a light source
does not exist in practice but may be simulated by photoquantigraphically combining multiple
pictures (as was described in Chapter 4), each taken with a linear source of light (‘‘linelight’’).
Here a linelight was moved from left to right. Note that the linelight need not radiate equally in
all directions. If it is constructed so that it will radiate more to the right than to the left, a nice and
subtle shading will result, giving the kind of light we might expect to find in a Vermeer painting
(very soft yet distinctly coming from the left). The lightspace framework provides a means of
synthesizing such otherwise impossible light sources — light sources that could never exist in
reality. Having a ‘‘toolbox’’ containing such light sources affords great artistic and
creative potential.
real-valued function of four integer variables. Rather than integrating over the
desired light shape, we would proceed to sum (antihomomorphically) over the
desired light vector subspace. This summation corresponds to taking a weighted
sum of the images themselves. Examples of these summations are depicted in
Figure 5.8.
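A minimal sketch of such a weighted sum follows (the response function f and its inverse are assumed to be available from Chapter 4; the weights describe whatever light shape one wishes to synthesize):

    import numpy as np

    def synthesize_lighting(frames, weights, f, f_inv):
        """frames[i]: picture of the scene lit by the i-th primitive source
        (e.g., one lamp position of the pushbroom); weights[i]: desired
        strength of that source in the synthesized light shape."""
        q_total = np.zeros(frames[0].shape, dtype=float)
        for frame, w in zip(frames, weights):
            q_total += w * f_inv(frame.astype(float))  # sum photoquantities
        return f(q_total)                              # back to picture space

Summing in the photoquantity domain, rather than averaging the pictures themselves, is what makes this superposition antihomomorphic.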
It should also be noted that the linelight, which is made from uncountably
many point sources (or a finite approximation), may also have fine structure. Each
of these point sources may be such that it radiates unequally in various directions.
A simple example of a picture that was illuminated with an approximation to a
Figure 5.6 Early embodiments of the author’s original ‘‘photographer’s assistant’’ application
of personal imaging. (a) 1970s ‘‘painting with light vectors’ pushbroom’’ system and 1980 CRT
display. The linear array of lamps, controlled by a body-worn processor (WearComp), was
operated much like a dot-matrix printer to sweep out spatial patterns of structured light. (b) Jacket
and clothing based computing. As this project evolved in the 1970s and into the early 1980s,
the components became spread out on clothing rather than located in a backpack. Separate
0.6 inch cathode ray tubes attachable/detachable to/from ordinary safety glasses, as well as
waist-worn television sets replaced the earlier and more cumbersome helmet-based screens of
the 1970s. Notice the change from the two antennas in (a) to the single antenna in (b), which
provided wireless communication of video, voice, and data to and from a remote base station.
linelight appears in Figure 5.9.14 Here the linelight is used as a light source to
indirectly illuminate subject matter of interest. In certain situations the linelight
may appear directly in the image as shown in Figure 5.10.
14 As we learn how the image is generated, it will become quite obvious why the directionality arises.
Here the author set up the three models in a rail boxcar, open at both sides, but stationary on a set of
railway tracks. On an adjacent railway track, a train with headlamps moved across behind the models,
during a long exposure which was integrated over time. However, the thought exercise — thinking of
this process as a single static long slender light source, composed of uncountably many point sources
that each radiate over some fixed solid angle to the right — helps us to better understand the principle
of lightspace.
Figure 5.7 Partial lightspace acquired from a system similar to that depicted in Figure 5.6a.
(Pictured here is a white glass ball, a roll of cloth tape, wooden blocks, and white plastic letters.)
As the row of lamps is swept across (sequenced), it traces out a plane of light (‘‘sheetlight’’). The
resulting measurement space is a four-dimensional array, parameterized by two indexes (azimuth
and elevation) describing rays of incoming light, and two indexes (azimuth and elevation)
describing rays of outgoing light. Here this information is displayed as a block matrix, where
each block is an image. The indexes of the block indicate the light vector, while the indexes
within the block are pixel coordinates.
Dimensions of Light
Figure 5.11 illustrates some of these light sources, categorized by the number of
dimensions (degrees of freedom) that they have in both 4-space (t, x, y, z), and
7-space (θ, φ, λ, t, x, y, z).
Figure 5.8 Antihomomorphic superposition over various surfaces in lightspace. (a) Here the
author synthesized the effect of a scene illuminated with long horizontal slender light source
(i.e., as would be visible if it were lit with a bare fluorescent light tube), reconstructing shadows
that appear sharp perpendicular to the line of the lamp but soft across it. Notice the slender
line highlight in the specular sphere. (b) Here the effect of a vertical slender light source is
synthesized. (c) Here the effect of two light sources is synthesized so that the scene appears
as if lit by a vertical line source, as well as a star-shaped source to the right of it. Both
sources come from the left of the camera. The soft yet highly directional light is in some
way reminiscent of a Vermeer painting, yet all of the input images were taken by the harsh but
moving light source of the pushbroom apparatus.
Figure 5.9 Subject matter illuminated, from behind, by linelight. This picture is particularly
illustrative because the light source itself (the two thick bands, and two thinner bands in the
background which are the linelights) is visible in the picture. However, we see that the three
people standing in the open doorway, illuminated by the linelight, are lit on their left side more
than on their right side. Also notice how the doorway is lit more on the right side of the picture
than on the left side. This directionality of the light source is owing to the fact that the picture
is effectively composed of point sources that each radiate mostly to the right. © Steve Mann,
1984.
Figure 5.10 Painting with linelight. Ordinarily the linelight is used to illuminate subject matter.
It is therefore seldom itself directly seen in a picture. However, to illustrate the principle of
the linelight, it may be improperly used to shine light directly into the camera rather than for
illuminating subject matter. (a) In integrating over a lattice of light vectors, any shape or pattern
of light can be created. Here light shaped like text, HELLO, is created. The author appears with
linelight at the end of letter ‘‘H.’’ (b) Different integration over lightspace produces text shaped
like WORLD. The author with linelight appears near the letter ‘‘W.’’ (c) The noise-gated version
is only responsive to light due to the pushbroom itself. The author does not appear anywhere
in the picture. Various interwoven patterns of graphics and text may intermingle. Here we see
text HELLO WORLD.
Packing uncountably many spotflashes into the same location, but pointing in all possible
directions in a given plane, and maintaining separate control of each spotflash, provides us
with a source that can produce any pencil of light, varying in intensity and spectral
distribution, as a function of angle. This apparatus is called the “controllable flashpencil,”
and it takes as input a real-valued function of two real variables.
Figure 5.11 A taxonomy of light sources in 4-space. The dimensionality in the 4-space
(x, y, z, t) is indicated below each set of examples, while the dimensionality in the new 7-space
is indicated in parentheses. (a) A flash point source located at the origin gives a brief flash
of white light that radiates in all directions (θ, φ) over all wavelengths, λ, and is therefore
characterized as having 3 degrees of freedom. A fat dot is used to denote a practical real-world
approximation to the point flash source, which has a nonzero flash duration and a nonzero
spatial extent. (b) Both the point source and the lineflash have 4 degrees of freedom. Here the
point source is located at the spatial (x, y, z) origin and extends out along the t axis, while
the lineflash is aligned along the x axis. A fat line of finite length is used to denote a typical
real-world approximation to the ideal source. (c) A flash behind a planar diffuser, and a long
slender fluorescent tube are both approximations to these light sources that have 5 degrees
of freedom. (d) Here a tungsten bulb behind a white sheet gives a dense planar array of point
sources that is confined to the plane z = 0 but spreads out over the 6 remaining degrees
of freedom. (e) A volumetric source, such as might be generated by light striking particles
suspended in the air, radiates white light from all points in space and in all directions. It is
denoted as a hypercube in 4-space, and exhibits all 7 degrees of freedom in 7-space.
The Aremac
Similarly, if we apply time-varying control to the controllable flash point source,
we obtain a controllable point source which is the aremac. The aremac is capable
of producing any bundle of light rays that pass through a given point. It is
driven by a control signal that is a real-valued function of four real variables,
θl , φl , λl , and tl . The aremac subsumes the controllable flash point source, and
the controllable spotflash as special cases. Clearly, it also subsumes the white
spotflash, and the flash point source as special cases.
The aremac is the exact reverse concept of the pinhole camera. The ideal
pinhole camera16 absorbs and quantifies incoming rays of light and produces a
real-valued function of four variables (x, y, t, λ) as output. The aremac takes as
input the same kind of function that the ideal pinhole camera gives as output.
The closest approximation to the aremac that one may typically come across
is the video projector. A video projector takes as input a video signal (three real-
valued functions of three variables, x, y, and t). Unfortunately, its wavelength
is not controllable, but it can still produce rays of light in a variety of different
directions, under program control, to be whatever color is desired within its
limited color gamut. These colors can evolve with time, at least up to the
frame/field rate.
A linear array of separately addressable aremacs produces a controllable line
source. Stacking these one above the other (and maintaining separate control
of each) produces a controllable sheet source. Now, if we take uncountably
many controllable sheet sources and place them one above the other, maintaining
separate control of each, we arrive at a light source that is controlled by a real-
valued function of seven real variables, θl , φl , λl , tl , xl , yl , and zl . We call this
light source the “lightspace aremac.”
The lightspace aremac subsumes all of the light sources that we have mentioned
so far. In this sense it is the most general light source — the only one we really
need in our conceptual lighting toolbox.
An interesting subset of the lightspace aremac is the computer screen.
Computer screens typically comprise over a million small light sources spread out over a two-dimensional surface.
15 An ordinary tungsten-filament lightbulb can also be driven with a time-varying voltage. But it
responds quite sluggishly to the control voltage because of the time required to heat or cool the
filament. The electronic flash is much more in keeping with the spirit of the ideal time-varying
lightsource. Indeed, visual artist Joe Davis has shown that the output intensity of an electronic flash
can be modulated at video rates so that it can be used to transmit video to a photoreceptor at some
remote location.
16 The ideal pinhole camera of course does not exist in practice. The closest approximation would
Figure 5.12 Using a computer screen to simulate the light from a window on a cloudy day.
All of the regions on the screen that are shaded correspond to areas that should be set to the
largest numerical value (typically 255), while the solid (black) areas denote regions of the screen
that should be set to the lowest numerical value (0). The light coming from the screen would
then light up the room in the same way as a window of this shape and size. This trivial example
illustrates the way in which the computer screen can be used as a controllable light source.
17 On bright sunny days, a small flash helps to fill in some of the shadows, which results in a much
Figure 5.13 Measuring one point in the lightspace around a particular scene, using a
spot-flash-spectrometer and a spotflash. The measurement provides a real-valued quantity
that indicates how much light comes back along the direction (θ, φ), at the wavelength λ, and
time t, to location (x, y, z), as a result of flashing a monochromatic ray of light in the direction
(θl , φl ), having a wavelength of λl , at time tl , from location (xl , yl , zl ).
Figure 5.14 In practice, not all light rays that are sent out will return to the sensor. (a) If we try
to measure the response before the excitation, we would expect a zero result. (b) Many natural
objects will radiate red light as a result of blue excitation, but the reverse is not generally true.
(These are ‘‘density plots,’’ where black indicates a Boolean true for causality, i.e., the response
comes after the excitation.)
If we try to measure the response before the excitation, we would not expect to see any response. In general, then, the lightspace will be
zero whenever t < tl or λ < λl .
Now, if we flash a ray of light at the scene, and then look a few seconds later,
we may still pick up a nonzero reading. Consider, for example, a glow-in-the-
dark toy (or clock), a computer screen, or a TV screen. Even though it might be
turned off, it will glow for a short time after it is excited by an external source
of light, due to the presence of phosphorescent materials. Thus the objects can
absorb light at one time, and reradiate it at another.
Similarly some objects will absorb light at short wavelengths (e.g., ultraviolet
or blue light) and reradiate at longer wavelengths. Such materials are said to
be fluorescent. A fluorescent red object might, for example, provide a nonzero
return to a sensor tuned to λ = 700 nm (red), even though it is illuminated only
by a source at λ = 400 nm (blue). Thus, along the time and wavelength axes,
lightspace is upper triangular18 (Fig. 5.14b).
In practice, the lightspace is too unwieldy to work with directly. It is, instead, only
useful as a conceptual framework in which to pose other practical problems. As
was mentioned earlier, rather than using a spotflash and spot-flash-spectrometer
to measure lightspace, we will most likely use a camera. A videocamera, for
example, can be used to capture an information-rich description of the world, so
in a sense it provides many measurements of the lightspace.
In practice, the measurements we make with a camera will be cruder than those
made with the precise instruments (spotflash and spot-flash-spectrometer). The
camera makes a large number of measurements in parallel (at the same time). The
crudeness of each measurement is expressed by integrating the lightspace together
with some kind of 14-D blurring function. For example, a single grayscale picture,
taken with a camera having an image sensor of dimension 480 by 640 pixels,
and taken with a particular kind of illumination, may be expressed as a collection
of 480 × 640 = 307,200 crude measurements of the light rays passing through a
particular point. Each measurement corresponds to a certain sensing element of
the image array that is sensitive to a range of azimuthal angles, θ and elevational
angles, φ. Each reading is also sensitive to a very broad range of wavelengths,
and the shutter speed dictates the range of time to which the measurements are
sensitive.
Thus the blurring kernel will completely blur the λ axis, sample the time
axis at a single point, and somewhat blur the other axes. A color image of
the same dimensions will provide three times as many readings, each one
blurred quite severely along the wavelength axis, but not so severely as the
grayscale image readings. A color motion picture will represent a blurring and
repetitive sampling of the time axis. A one second (30 frames per second)
movie then provides us with 3 × 30 × 480 × 640 = 27,648,000 measurements
(i.e., 27,000 K of data), each blurred by the nature of the measurement device
(camera).
18 The term is borrowed from linear algebra and denotes matrices with entries of zero below the
main diagonal.
We can trade some resolution in the camera parameters for resolution in the
excitation parameters, for example, by having a flash activated every second
frame so that half of the frames are naturally lit and the other half are flash-lit.
Using multiple sources of illumination in this way, we could attempt to crudely
characterize the lightspace of the scene. Each of the measurements is given by
\[
q_k \;=\; \int\!\cdots\!\int
L(\theta_l,\phi_l,\lambda_l,t_l,x_l,y_l,z_l,\theta,\phi,\lambda,t,x,y,z)\,
B_k(\theta_l,\phi_l,\lambda_l,t_l,x_l,y_l,z_l,\theta,\phi,\lambda,t,x,y,z)\;
d\theta_l\,d\phi_l\,d\lambda_l\,dt_l\,dx_l\,dy_l\,dz_l\,
d\theta\,d\phi\,d\lambda\,dt\,dx\,dy\,dz,
\tag{5.1}
\]
where L is the lightspace and Bk is the blurring kernel of the kth measurement
apparatus (incorporating both excitation and response).
We may rewrite this measurement (in the Lebesgue sense rather than the
Riemann sense, and so avoid writing out all the integral signs):
\[
q_k \;=\; \int L(\theta_l,\phi_l,\lambda_l,t_l,x_l,y_l,z_l,\theta,\phi,\lambda,t,x,y,z)\,d\mu_k,
\tag{5.2}
\]
where µk is the measure associated with the blurring kernel of the kth measuring
apparatus.
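Although the full lightspace is far too large ever to be tabulated, the structure of (5.1) is easy to mimic on a coarse sampling lattice. The sketch below (with names chosen only for illustration) simply replaces the integral by a sum over lattice cells:

    import numpy as np

    def measurement(L_samples, B_k_samples, cell_volume=1.0):
        """Discrete approximation to (5.1): q_k is the sum, over all lattice
        points, of the sampled lightspace L times the sampled blurring kernel
        B_k, times the volume of one lattice cell."""
        return float(np.sum(L_samples * B_k_samples) * cell_volume)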
We will refer to a collection of such blurred measurements as a “lightspace
subspace,” for it fails to capture the entire lightspace. The lightspace subspace,
rather, slices through portions of lightspace (decimates or samples it) and blurs
those slices.
One such subspace is the lightvector subspace arising from multiple differently
exposed images. In Chapter 4 we combined multiple pictures together to arrive
at a single floating point image that captured both the great dynamic range and
the subtle differences in intensities of the light on the image plane. This was due
to a particular fixed lighting of the subject matter that did not vary other than by
overall exposure. In this chapter we extend this concept to multiple dimensions.
A vector, v, of length L, may be regarded as a single point in a multidimensional
space, R^L. Similarly a real-valued grayscale picture defined on a discrete
lattice of dimensions M (height) by N (width) may also be regarded as a single
point in R^(M×N), a space called “imagespace.” Any image can be represented
as a point in imagespace because the picture may be unraveled into one long
vector, row by row.19 Thus, if we linearize, using the procedure of Chapter 4,
19 This is the way that a picture (or any 2-D array) is typically stored in a file on a computer. The
then, in the ideal noise-free world, all of the linearized elements of a Wyckoff
set, qn (x, y) = f −1 (fn (x, y)), are linearly related to each other through a simple
scale factor:
qn = kn q0 , (5.3)
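Writing out this step explicitly, with the zeroth exposure taken as the reference (k0 = 1): since fn(x, y) = f(kn q(x, y)),
\[
q_n(x,y) \;=\; f^{-1}\big(f_n(x,y)\big) \;=\; f^{-1}\big(f(k_n\,q(x,y))\big) \;=\; k_n\,q(x,y) \;=\; k_n\,q_0(x,y),
\]
so each linearized element of the Wyckoff set is simply a scalar multiple of the linearized reference exposure.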
Figure 5.15 Multiple exposures to varying quantity of illumination. A single light source is
activated multiple times from the same fixed location to obtain multiple images differing only in
exposure. In this example there are three different exposures. The first exposure with LAMP SET
TO QUARTER OUTPUT gives rise to an exposure k1q , the second, with LAMP SET TO HALF OUTPUT, to
k2q , and the third, with LAMP SET TO FULL OUTPUT, to k3q . Each exposure gives rise to a different
realization of the same noise process, and the three noisy pictures that the camera provides
are denoted f1 , f2 , and f3 . These three differently exposed pictures comprise a noisy Wyckoff
set (i.e., a set of approximately collinear lightvectors in the antihomomorphic vector space).
To combine them into a single estimate of the lightvector they collectively define, the effect of
f is undone with an estimate fˆ that represents our best guess of the function f, which varies
from camera to camera. Linear filters hi are next applied in an attempt to filter out sensor
noise nqi . Generally, the f estimate is made together with an estimate of the exposures ki . After
reexpanding the dynamic ranges with fˆ−1 , the inverse of the estimated exposures 1/k̂ i are
applied. In this way the darker images are made lighter and the lighter images are made darker
so that they all (theoretically) match. At this point the images will all appear as if they were taken
with identical exposure to light, except for the fact that the pictures with higher lamp output will
be noisy in lighter areas of the image and those taken with lower lamp output will be noisy in
dark areas of the image. Thus, rather than simply applying ordinary signal averaging, a weighted
average is taken by applying weights wi , which include the estimated global exposures ki and
the spatially varying certainty functions ci (x, y). These certainty functions turn out to be the
derivative of the camera response function shifted up or down by an amount ki . The weighted
sum is q̂(x, y), and the estimate of the photoquantity is q(x, y). To view this quantity on a video
display, it is first adjusted in exposure; it may be adjusted to a different exposure level not
present in any of the input images. In this case it is set to the estimated exposure of the first
image, k̂ 1 . The result is then range-compressed with fˆ for display on an expansive medium
(DISPLAY).
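One compact way to summarize the estimator that Figure 5.15 describes (a sketch; the weighting used in practice may differ in detail) is
\[
\hat{q}(x,y) \;=\; \frac{\displaystyle\sum_i w_i(x,y)\,\frac{1}{\hat{k}_i}\,\hat{f}^{-1}\big(f_i(x,y)\big)}{\displaystyle\sum_i w_i(x,y)},
\qquad
\text{displayed picture} \;=\; \hat{f}\big(\hat{k}_1\,\hat{q}(x,y)\big),
\]
where the weights wi combine the estimated exposures k̂i with the spatially varying certainty functions ci(x, y).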
response of the scene or object to any quantity of the light, as directed from a
particular location in the scene.
As will become evident later, the output image is just one of many lightvectors,
each of which describe the same subject matter but with the light source placed
at a different location. To the extent that the output image is a lightvector,
one can dial up or down the illumination of that particular lightvector. For
example, through this interpolation process, one can increase or decrease a
particular skyscraper in the cityscape in virtual effective light output, as seen
in the computerized eyeglasses, while viewing the lightvector painting being
generated.
It should be noted that the process of interpolation or extrapolation of the
output of a single light source (or more generally, of a lightvector contribution
to an image) is isomorphic to the process of Chapter 4. Recall that comparadjusting
the tonal range of an image corresponds to being able to adjust the picture so
that it is equivalent to one that would have been taken using any desired exposure.
(This process was illustrated in Fig. 4.5.)
Of the various comparametric functions introduced in Chapter 4, the power
of root over root plus constant correction model, or the saturated power of root
over root plus constant correction model, is the vastly preferred model because
it accurately portrays the toe and shoulder regions of the response curve. In
traditional photography these regions are ignored; all that is of interest is the
linear mid portion of the density versus log exposure curve. This interest in only
the midtones is because, in traditional photography, areas outside this region are
considered to be incorrectly exposed. However, many of the images in a Wyckoff
set are deliberately underexposed and overexposed. In fact this deliberate
overexposure of some images and underexposure of other images is often taken to
extremes. Therefore the additional sophistication of the model (4.55) is of great
value in capturing the essence of these extreme exposures, in which exposure
into both the toe and shoulder regions is the norm rather than an aberration
when practicing the art of painting with lightvectors.
Figure 5.16 The generation of two lightvectors qL and qR corresponding to the scene as lit
from the left and right respectively. Each lightvector is generated from three different exposures.
The first lightvector is generated from the Wyckoff set f1 , f2 , f3 , and the second lightvector is
generated from the Wyckoff set f4 , f5 , f6 .
Fig. 5.16). The result will be an image of the scene as it would appear illuminated
by light sources at both locations. This image can be estimated as the photoquantigraphic
sum of the two lightvectors, f(qL + qR) (compare the summation f(v0 + v1 + v2 + · · ·)
described in Figure 5.25).
Figure 5.17 A picture sampled on a discrete 480 by 640 lattice may be regarded as a
single point in R^(480×640) = R^307,200. These lightvectors are not necessarily orthogonal, and thus
there may be more than 307,200 such vectors present, depicted v1, v2, . . . , v999,999. Typically,
however, the number of lightvectors is much less than the dimensionality of the space, 307,200.
Some of the lightvectors are typically collinear, and the sets of collinear lightvectors are called
‘‘Wyckoff sets.’’ A Wyckoff set formed by three pictures, differing only in exposure, is depicted
as 3 collinear arrows coming out from the origin, v2 , v3 , v4 . Suppose that these three pictures
were taken with no flash (just the natural light present in the scene). Suppose that another three
pictures, v6 , v7 , v8 , are taken with a flash, and with such a high shutter speed that they are
representative of the scene as lit by only the flash (i.e., the shutter speed is high enough to
neglect contribution from the v2 , v3 , v4 axis). These three pictures define a second Wyckoff set
v6 , v7 , v8 , also depicted as three collinear arrows from the origin. The 2-D subspace formed
from these two Wyckoff sets (natural light set and flash set) is depicted as a shaded gray planar
region spanned by the two Wyckoff sets.
Why use more than a single picture to define each of these axes? In particular,
from just two images, such as f1 = v2 and f2 = v6 , of differently illuminated
subject matter, a single picture of any desired effective combination of the
illuminants may be estimated:
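One plausible form of such an estimate, in keeping with the antihomomorphic summation used throughout this chapter (the coefficients a and b, which set the desired effective strengths of the two illuminants, are merely illustrative), is
\[
\hat{f}_{a,b}(x,y) \;=\; f\Big(a\,f^{-1}\big(f_1(x,y)\big) \;+\; b\,f^{-1}\big(f_2(x,y)\big)\Big).
\]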
The answer is that in extreme situations we may wish to process the resulting
images in lightspace, performing arithmetic that uses extended dynamic range. In
such situations, where noise arises, the use of multiple exposures for each lightvector
is preferred in order to overcome the noise.
Thus the estimate is improved if we use multiple elements (i.e., two or three)
of each of the two Wyckoff sets, that is, derive f1 from the set v2, v3, v4 and
derive f2 from the set v6, v7, v8, as described in Chapter 4.
Because light quantities are additive, if we take a picture with a combination
of natural light and flash light (i.e., if we leave the shutter open for an extended
Figure 5.18 The 2-D subspace formed from various pictures, each taken with different
combinations of the two light sources (natural and flash). Lightvectors denoted as bold lines
exist as points in R^307,200 that lie within the planar (2-D) subspace defined by these two light
sources. Note that since the light intensity cannot be negative, this subspace is confined
to the portion of the plane bounded by the two pure lightvectors (natural light and flash).
period of time, and fire the flash sometime during that exposure), we obtain a
point in R^307,200 that lies on the plane defined by the two (natural light and flash)
pure lightvectors v2, v3, v4 and v6, v7, v8. Various combinations of natural light
with flash are depicted in this way in Figure 5.18. Nonnegativity of the lightvectors
can be enforced by introducing an arbitrary complex-valued wavefunction ψ such
that the lightvector is equal to the inner product ⟨ψ|ψ⟩.
Figure 5.19 Hypothetical collection of pictures taken with different amounts of ambient
(natural) light and flash. One axis denotes the quantity of natural light present (exposure), and
the other denotes the quantity of flash, activated during the exposure. The flash adds to the
total illumination of the scene, and affects primarily the foreground objects in the scene, while
the natural light exposure affects the whole scene. These pictures form a 2-D image subspace,
which may be regarded as being formed from the dimensions defined by two lightvectors. For
argument’s sake we might define the two lightvectors as being the image at coordinates (1, 0)
and the image at (0, 1). These two images can form a basis set that may be used to generate all
nine depicted here in this figure, as well as any other linear combination of the two lightvectors.
the image). However, to the extent that this might not be possible,20 we may apply
a coordinate transformation that shears the lightvector subspace. In particular, we
effectively subtract out the natural light, and obtain a “total-illumination” axis
so that the images of Figure 5.19 move to new locations (Fig. 5.20) in the image
space.
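As a sketch of one such shear (the symbols qa, qf, and qtot for the ambient, flash, and total-illumination coordinates are introduced here only for illustration): since the flash adds to the ambient light, qtot = qa + qf, so
\[
\begin{pmatrix} q_a \\ q_{\text{tot}} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} q_a \\ q_f \end{pmatrix},
\qquad
q_f \;=\; q_{\text{tot}} - q_a ,
\]
and subtracting the ambient coordinate recovers the flash-only contribution.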
In Real Practice
In practice, the coordinate transformation depicted in these hypothetical situations
may not be directly applied to pictures because the pictures have undergone some
20 In practice, this happens with large studio flash systems because the capacitors are so big that the
flash duration starts getting a little long, on the order of 1/60 second or so.
Figure 5.20 A coordinate-transformed 2-D image subspace formed from the dimensions of
two lightvectors (natural light and on-camera flash). The new axes, natural light and total light,
define respectively the light levels of the background and foreground objects in the scene.
The author often used lamps such as those pictured in Figure 5.28 to walk around
and illuminate different portions of subject matter, “painting in” various portions
of scenes or objects. The resulting photoquantigraphic image composites, such
as depicted in Figure 5.24, are at the boundary between painting, photography,
and computer-generated imagery.
Using antihomomorphic vector spaces, pictures of extremely high tonal
fidelity and very low noise, in all areas of the image from the darkest shadows
to the brightest highlights, can be produced.
Figure 5.21 Photoquantigraphic superposition. (a) Long exposure picture of the city of
Cambridge (Massachusetts). This picture of the large cityscape was taken under its natural
(ambient) light. (b) Short exposure image taken with electronic flash (FT-623 flash tube operating
at 4000 volts main, 30 kV trigger, and 16 kJ energy, in 30 inch highly polished chrome reflector,
as shown in Fig. 5.28a) shows more detail on the rooftops of the buildings where very little
of the city’s various light sources are shining. Note the thin sliver of shadow to left of each
building, as the flash was to the right of the camera. Notice how some portions of the picture
in (a) are better represented while other portions in (b) are better. (c) Linear combination of the
two images. Notice undesirable ‘‘muted’’ highlights. (d) Photoquantigraphic estimate provides
an image with much better contrast and tonal fidelity. © Steve Mann, 1994.
Figure 5.22 Two-dimensional antihomomorphic vector space formed from two input
lightvectors (denoted by thick outlines). One lightvector represents the response of the
subject matter to natural ambient light. The other lightvector represents the response of
the subject matter to a flash lamp. From these two ‘‘basis’’ lightvectors, the response of the
scene to any combination of natural ambient illumination and flash lamp can be synthesized.
This two-dimensional lightvector space is a generalization of the one-dimensional lightvector
synthesis that was depicted in Figure 4.5.
Figure 5.23 A 3-D antihomomorphic lightvector space. Each of the 75 pictures corresponds
to a point in this sampling of the lightvector space on a 3 by 5 by 5 lattice. The three unit basis
lightvectors, again indicated by a heavy black outline, are located at 1.0 ambient room light,
1.0 flash, and 1.0 TV on. The last axis was generated by taking one picture with no flash, and
also with the lamp in the room turned off, so that the only source of light was the television set
itself. The new axis (TV on) is depicted by itself to the right of this 3-D space, where there are
three points along this axis. The origin at the bottom is all black, while the middle picture is the
unit (input image) lightvector, and the top image is synthesized by calculating f(2q2 ), where
q2 (x, y) is the quantity of light falling on the image sensor due to the television set being turned
on. Near the top of the figure, the final selection is indicated, which is the picture selected
according to personal taste, from among all possible points in this 3-D lightvector space.
are depicted in Figure 5.26. Throughout the 1980s a small number of other artists
also used the author’s apparatus to create various lightvector paintings. However,
due to the cumbersome nature of the early WearComp hardware, and to the fact
that much of the apparatus was custom fit to the author, it was not widely used,
over any extended periods of time, by others. However, the personal imaging
system proved to be a new and useful invention for a variety of photographic
imaging tasks.
To the extent that the artist’s light sources were made far more powerful than
the natural ambient light levels, the artist had a tremendous degree of control
over the illumination in the scene. The resulting image was therefore a depiction
of what was actually present in the scene, together with a potentially visually rich
illumination sculpture surrounding it. Typically the illumination sources that the
artist carried were powered by batteries. (Gasoline-powered light sources were
found to be unsuitable in many environments such as indoor spaces where noise,
exhaust, etc., were undesirable.) Therefore, owing to limitations on the output
Figure 5.24 Output of antihomomorphic lightvector image processing. This picture was
mathematically generated from a number of differently illuminated pictures of the same subject
matter, in which the subject matter was illuminated by walking around with a bank of flash
lamps controlled by a wearable computer system in wireless communication with another
computer controlling a camera at a fixed base station. © Steve Mann, 1985.
capabilities of these light sources, the art was practiced in spaces that could be
darkened sufficiently or, in the case of outdoor scenes, at times when the natural
light levels were lowest.
In a typical application the user positioned the camera on a hillside, or the roof
of a building, overlooking a city. Usually an assistant oversaw the operation of
the camera. The user would roam about the city, walking down various streets,
and use the light sources to illuminate buildings one at a time. Typically, for
the wearable or portable light sources to be of sufficient strength compared to the
natural light in the scene (so that it was not necessary to shut off the electricity to
the entire city to darken it sufficiently for the artist's light source to be of greater
relative brightness), some form of electronic flash was used as the light source. In
some embodiments of the personal imaging invention, an FT-623 lamp (the most
powerful lamp in the world, with output of 40 kJ, and housed in a lightweight 30
inch highly polished reflector, with a handle that allowed it to be easily held in
one hand and aimed as shown in Figure 5.28a) was used to illuminate various
buildings such as tall skyscrapers throughout a city. The viewfinder on the helmet
displayed material from a remotely mounted camera with computer-generated text
and graphics overlaid in the context of a collaborative telepresence environment.
The assistant at the remote site wore a similar apparatus with a similar body-worn
backpack-based processing system.
Figure 5.25 Antihomomorphic vector spaces. Arbitrarily many (typically 1000 or more)
differently exposed images of the same subject matter are captured by a single camera.
Typically pictures are signal averaged in sets of 16 for each of three exposures to obtain a
background (ambient) illumination estimate q̂q . These lightvectors are denoted q000 through
q047 . After applying appropriate weights and certainty functions, the computer calculates
the ambient illumination lightvector v0 . Then, using the apparatus depicted in Figure 5.28,
exposures are made in sets of four (four lamps, two lamps, one lamp, half). These give rise
to estimates of q100 through q103 , and so on. After weighting and application of certainty
functions, lightvector v1 is calculated as the response of the subject matter to this source of
illumination. The wearer of the computer system then moves to a new location and exposes
the subject matter to another rapid burst of four flashes of light, producing an estimate of the
subject matter’s response to lightvector v2 . The lightvectors are then added, and the output
image is calculated as f(v0 + v1 + v2 + . . .).
Figure 5.26 The camera used as a sketch pad or artist’s canvas. A goal of personal imaging
[1] is to create something more than the camera in its usual context. The images produced
as artifacts of Personal Imaging are somewhere at the intersection of painting, computer
graphics, and photography. (a) Notice how the broom appears to be its own light source (e.g.
self-illuminated), while the open doorway appears to contain a light source emanating from
within. The rich tonal range and details of the door itself, although only visible at a grazing
viewing angle, are indicative of the affordances of the Lightspace Rendering [92,93] method.
(b) Hallways offer a unique perspective, which can also be illuminated expressively. © Steve
Mann, sometime in the mid-1980s.
Figure 5.27 (a) Source of illumination: flash opening with inbound and outbound wireless
transmit (Tx) and receive (Rx) links. (b) Camera and base station: solenoid-operated camera
shutter, also with inbound and outbound wireless links.
A comparatively small lamp (small since the lamp and housing must be held
in one hand) can illuminate a tall skyscraper or an office tower and yet appear, in
the final image, to be the dominant light source, compared to interior fluorescent
lights that are left turned on in a multistory building, or compared to moonlight
or the light from streetlamps.
Figure 5.28 Early cybernetic photography systems for painting with lightvectors. A powerful
handheld and wearable photographic lighting studio system with wearable multimedia computer
is used to illuminate various subjects at different locations in space. The portable nature of the
apparatus allows the wearer to move around and collaborate in a computer-mediated space.
(a) A backpack-based wearable computer system that was completed in 1981 was used in
conjunction with a 40 kJ flash lamp in a 30 inch (762 mm) reflector. (b) A jacket-based computer
system that was completed in the summer of 1985 and used in conjunction with a 2.4 kJ flash
lamp in a 14 inch (356 mm) reflector. Three separate long communications antennas are visible,
two from the backpack and one from the jacket-based computer. (c) The backpack-based rig
used to light up various skyscrapers in Times Square produces approximately 12 kJ of light
into six separate lamp housings, providing better energy localization in each lamp, and giving a
100 ISO Guide Number of about 2000. The two antennae on the author’s eyeglasses wirelessly
link the eyeglasses to the base station shown in Figure 5.29.
to other sites when other collaborators like art directors must be involved in
manipulating and combining the exposures and sending their comments to the
artist by email, or in overlaying graphics onto the artist’s head-mounted display,
which then becomes a collaborative space. In the most recent embodiments of
the 1990s this was facilitated through the World Wide Web. The additional
communication facilitates the collection of additional exposures if it turns out
that certain areas of the scene or object could be better served if they were more
accurately described in the dataset.
Figure 5.29 Setting up the base station in Times Square. (a) A portable graphics computer
and image capture camera are set up on a tripod, from the base station (facing approximately
south–southwest: Broadway is to the left; Seventh Avenue is to the right) with antennas for
wireless communication to the wearable graphics processor and portable rig. Various other
apparatus are visible on the table to the right of and behind the imager tripod. (b) One of the
items on the table is an additional antenna stand. To the left of the antenna stand (on the easel)
is a television that displays the ‘‘lightvector painting’’ as it is being generated. This television
allows the person operating the base station to see the output of the author’s wearable graphics
processor while the author is walking through Times Square illuminating various skyscrapers
with the handheld flash lamp.
Figure 5.30 Lightvector paintings of Times Square. (a) Author illuminated the foreground in a
single exposure. (b) Author illuminated some of the skyscrapers to the left in five exposures.
(c) Author illuminated the central tower with the huge screen Panasonic television in three
exposures. The six lamp rig puts out enough light to overpower the bright lights of Times Square,
and becomes the dominant source of light in the lightvector painting. (d) Author illuminated
the skyscrapers to the right in six exposures. (e) Author generated the final lightvector painting
from approximately one hundred such exposures (lightvectors) while interactively adjusting
their coefficients.
Figure 5.31 Lightvector paintings of the Brooklyn Bridge. (a) Natural ambient light captured in
a single exposure. (b) The first tower of the bridge captured in 4 exposures to electronic flash.
(c) The cabling of the bridge captured in 10 exposures. (d) Foreground to convey a dynamic
sense of activity on the busy footpath walkway and bicycle route along the bridge, captured
in 22 exposures. (e) Linear image processing (signal averaging). Note the muted highlights
(areas of the image that are clipped at a gray value less than the dynamic range of the output
medium), which cannot be fixed simply by histogram stretching. Linear signal processing (signal
averaging) fails to create a natural-looking lightvector summation. (f) Antihomomorphic image
processing using the methodology proposed in this chapter. Note the natural-looking bright
highlights.
to aim the flash, and allows the artist to see what is included within the cone of
light that the flash will produce. Furthermore, when viewpoints are exchanged,
the assistant at the main camera can see what the flash is pointed at prior to
activation of the flash.
Typically there is a command that may be entered to switch between local
mode (where the artist sees the flash viewfinder) and exchanged mode (where
the artist sees out through the main camera and the assistant at the main camera
sees out through the artist’s eyes/flash viewfinder).
The personal imaging systems were used over a period of many years for
the production of visual art of this genre. Within the course context, this method
was taught recently to a 1998 class (Fig. 5.32). Here students shared a common
but altered perception of reality: all participants share the same perception of a
visual reality that any one of the participants can alter.
Typically the artist’s wearable computer system comprises a visual display that
is capable of displaying the image from the base station camera (typically sent
wirelessly over a 2 megabit per second data communications link from the
computer that controls the camera). Typically also this display is updated with
each new lightstroke (exposure).
A number of participants may each carry a light source and point it to shoot at
objects in the scene (or at each other), while having the effect of the light source
persist in the mediated reality space even after the light source has finished
producing light.
The display update is typically switchable between a mode that shows only
the new exposure, and a cumulative mode that shows a photoquantigraphic
summation over time that includes the new exposure photoquantigraphically
added to previous exposures. This temporally cumulative display makes the
device useful to the artist because it helps in the envisioning of a completed
lightvector painting.
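As an illustrative sketch only (not the actual WearComp software), this photoquantigraphic accumulation might be expressed as follows in Python with NumPy, assuming for simplicity a gamma-like camera response with exponent GAMMA; in practice the response would be estimated, for example, by the Wyckoff/comparametric methods discussed elsewhere in this book. Each exposure is converted to linear photoquantities, the photoquantities are added, and the total is mapped back through the response for display.

import numpy as np

GAMMA = 2.2  # assumed gamma-like response exponent (illustrative only; the actual
             # response would be estimated rather than assumed)

def to_photoq(img_uint8, gamma=GAMMA):
    """Map an 8-bit exposure to estimated linear photoquantities."""
    return (img_uint8.astype(np.float64) / 255.0) ** gamma

def from_photoq(q, gamma=GAMMA):
    """Map accumulated photoquantities back to an 8-bit displayable image."""
    return np.clip(255.0 * np.clip(q, 0, None) ** (1.0 / gamma), 0, 255).astype(np.uint8)

def cumulative_lightvector_display(exposures, weights=None):
    """Photoquantigraphically add a list of exposures (lightvectors) and
    return a single displayable image of the running total."""
    if weights is None:
        weights = [1.0] * len(exposures)
    total = sum(w * to_photoq(e) for w, e in zip(weights, exposures))
    return from_photoq(total)

Calling cumulative_lightvector_display with the list of exposures collected so far corresponds to the temporally cumulative display mode; passing only the newest exposure corresponds to the mode that shows each new lightstroke by itself.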
5.8.1 Lightpaintball
The temporally cumulative display is useful in certain applications of the
apparatus to gaming. For example, a game can be devised in which two players
compete against each other. One player may try to paint the subject matter before
the camera red, and the other will try to paint the subject matter blue. When the
subject matter is an entire cityscape as seen from a camera located on the roof of
a tall building, the game can be quite competitive and interesting. Additionally
players can work either cooperatively on the same team or competitively, as
when two teams each try to paint the city a different color, and “claim” territory
with their color.
In some embodiments of the game the players can also shoot at each other
with the flashguns. For example, if a player from the red team “paints” a blue-
team photoborg red, he may disable or “kill” the blue-team photoborg, shutting
Figure 5.32 Mediated reality field trials. Each participant has the opportunity to aim various
light sources at real subject matter in the scene. Within the ‘‘painting with lightvectors’’
paradigm, each lightvector alters the visual perception of reality for the entire group. All
participants share a common perception of reality that any one of the participants can
alter. (a) A series of field trials in 1998 demonstrated the capabilities of mediated reality for
WearComp-supported collaboration. (b) The field trial takes place both indoors and
outdoors, in a wide variety of settings. In collecting image data, the subject matter is illuminated
inside and out. (c) Instruction and training are provided in real use scenarios. (d) Students learn
how to program the apparatus in the field, and thus learn the importance of advance preparation
(e.g. learning how to deal with typical problems that require debugging in the field). (e) Students
collaborate and help each other with various unexpected problems that might arise. (f) In
subsequent years field trials may be repeated, applying what was learned from previous years.
Note the increased proportion of eyeglass-based rigs, and covert rigs.
down his flashgun. In other embodiments, the “kill” and “shoot” aspects can
be removed, in which case the game is similar to a game like squash where the
opponents work in a collegial fashion, getting out of each other’s way while each
side takes turns shooting. The red-team flashgun(s) and blue-team flashgun(s)
can be fired alternately by a free-running base-station camera, or they can all fire
together. When they fire alternately there is no problem disambiguating them.
When they fire together, there is preferably a blue filter over each of the flashguns
of the blue team, and a red filter over each of the flashguns of the red team, so that
flashes of light from each team can be disambiguated in the same machine cycle.
5.9 CONCLUSIONS
1. day.jpg (or take instead daysmall.jpg initially for debugging, since it will
use less processing power), which was taken Nov 7 15:36, at f/2.8 for
1/125 s; and
2. night.jpg (or nightsmall.jpg) which was taken Nov 7 17:53, f/2.8 for
1/4 s.
The object is to combine these pictures and arrive at a single picture that renders
the scene as it might have appeared if the sensor had responded to both sources
of illumination (daytime natural light plus night-time building lights).
Begin by adding the two pictures together. You might do this with the GNU
Image Manipulation Program (GIMP) by loading one image, say “gimp day.jpg,”
and then clicking on “open” to obtain a second image, say “night.jpg,” in the
same session of gimp. Then you can press control A in one window (to select
all), then control C (to copy), and then control V to paste into the other window
on top of the “day.jpg” image. Now the two images will appear as two layers in
the same window. You can adjust the “transparency” of the top layer to about
50% and flatten the result to a single image to save it.
Alternatively, you could write a simple program to operate on two image
arrays. Use a simple for-next loop through all pixels, and scale each total back
down. [Hint: Since pixels go from 0 to 255, the total will go from 0 to 510, so
you need to scale back down by a factor of 255/510.]
Whether you use the gimp or write your own program, describe the picture
that results from adding the day.jpg and night.jpg images together.
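A hypothetical version of such a program (using Python with NumPy and the Pillow imaging library, and the file names from the exercise) is sketched below; the vectorized array addition takes the place of the explicit for-next loop:

import numpy as np
from PIL import Image

day = np.asarray(Image.open("day.jpg"), dtype=np.float64)
night = np.asarray(Image.open("night.jpg"), dtype=np.float64)

total = day + night               # each channel value now ranges from 0 to 510
scaled = total * (255.0 / 510.0)  # scale back down by 255/510, as in the hint

Image.fromarray(scaled.astype(np.uint8)).save("sum.jpg")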
5.10.6 CEMENT
Experiment with different weightings for rendering different light sources with
different intensity. Arrive at a selected set of weights that provides what you feel
is the most expressive image. This is an artistic choice that could be exhibited
as a class project (i.e., an image or images from each participant in the class).
Your image should be 2048 pixels down by 3072 pixels across, and ppm or jpeg
compressed with at least jpeg quality 90.
A convenient way of experimenting with the lightvectors in this directory is to
use the “cementinit,” “cementi,” etc., programs called by the “trowel.pl” script.
These programs should be installed as /usr/local/bin/trowel on systems
used in teaching this course.
A program such as “cementi” (cement in) typically takes one double (type
P8) image and cements in one uchar (P6) image. First, you need to decide what
lightvector will be the starting image into which others will be cemented. Suppose
that we select “day.jpg” as the basis.
pnmuchar2double day.ppm -o day.pdm
You may want to work with smaller images, for example, using pnmscale -xsize
640 -ysize 480 . . . . Alternatively, you could convert it to plm (portable lightspace
map), as follows:
Next verify that this is a valid plm file by looking at the beginning:
You could also use the program “cementinit” to convert directly from pnm to
plm and introduce an RGB weight as part of the conversion process.
Now try
Observe and describe the result. To view the image, you need to convert it back
to uchar (pnmdouble2uchar total35.pdm -o total35.ppm) and then use a program
like gimp or xv to view it. Note that pnmcement nocolor is missing the color weightings.
The “trowel.pl” script is a convenient wrapper that lets you cement together
pnm images and jpeg images with a cement.txt file. It should be installed as
/usr/local/bin/trowel.
Three lightvectors, as shown in Figure 5.33, were combined using CEMENT,
and the result is shown in Figure 5.34.
There is a tendency for students to come up with interesting images but
to forget to save the cement.txt file that produced them. For every image you
generate always save the CEMENT file that made it (e.g., cement.txt). That way
you can reproduce or modify it slightly. The CEMENT file is like the source
code, and the image is like the executable. Don’t save the image unless you also
save the source cement.txt.
6
VIDEOORBITS: THE PROJECTIVE GEOMETRY RENAISSANCE
In the early days of personal imaging, a specific location was selected from
which a measurement space or the like was constructed. From this single vantage
point, a collection of differently illuminated/exposed images was constructed
using the wearable computer and associated illumination apparatus. However,
this approach was often facilitated by transmitting images from a specific location
(base station) back to the wearable computer, and vice versa. Thus, when the
author developed the eyeglass-based computer display/camera system, it was
natural to exchange viewpoints with another person (i.e., the person operating
the base station). This mode of operation (“seeing eye-to-eye”) made the notion
of perspective a critical factor, with projective geometry at the heart of personal
imaging.
Personal imaging situates the camera such that it provides a unique first-person
perspective. In the case of the eyeglass-mounted camera, the machine captures
the world from the same perspective as its host (human).
In this chapter we will consider results of a new algorithm of projective geom-
etry invented for such applications as “painting” environmental maps by looking
around, wearable tetherless computer-mediated reality, the new genre of personal
documentary that arises from this mediated reality, and the creation of a collective
adiabatic intelligence arising from shared mediated-reality environments.
6.1 VIDEOORBITS
of greater resolution or spatial extent. The approach is “exact” for two cases of
static scenes: (1) images taken from the same location of an arbitrary 3-D scene,
with a camera that is free to pan, tilt, rotate about its optical axis, and zoom
and (2) images of a flat scene taken from arbitrary locations. The featureless
projective approach generalizes interframe camera motion estimation methods
that have previously used an affine model (which lacks the degrees of freedom to
“exactly” characterize such phenomena as camera pan and tilt) and/or that have
relied upon finding points of correspondence between the image frames. The
featureless projective approach, which operates directly on the image pixels, is
shown to be superior in accuracy and ability to enhance resolution. The proposed
methods work well on image data collected from both good-quality and poor-
quality video under a wide variety of conditions (sunny, cloudy, day, night).
These new fully automatic methods are also shown to be robust to deviations
from the assumptions of a static scene and zero parallax.
Many problems require finding the coordinate transformation between two
images of the same scene or object. In order to recover camera motion between
video frames, to stabilize video images, to relate or recognize photographs taken
from two different cameras, to compute depth within a 3-D scene, or for image
registration and resolution enhancement, it is important to have a precise descrip-
tion of the coordinate transformation between a pair of images or video frames
and some indication as to its accuracy.
Traditional block matching (as used in motion estimation) is really a special
case of a more general coordinate transformation. In this chapter a new solution to
the motion estimation problem is demonstrated, using a more general estimation
of a coordinate transformation, and techniques for automatically finding the 8-
parameter projective coordinate transformation that relates two frames taken of
the same static scene are proposed. It is shown, both by theory and example,
how the new approach is more accurate and robust than previous approaches
that relied upon affine coordinate transformations, approximations to projective
coordinate transformations, and/or the finding of point correspondences between
the images. The new techniques take as input two frames, and automatically
output the 8 parameters of the “exact” model, to properly register the frames.
They do not require the tracking or correspondence of explicit features, yet they
are computationally easy to implement.
Although the theory presented makes the typical assumptions of static scene
and no parallax, it is shown that the new estimation techniques are robust to
deviations from these assumptions. In particular, a direct featureless projective
parameter estimation approach to image resolution enhancement and compositing
is applied, and its success on a variety of practical and difficult cases, including
some that violate the nonparallax and static scene assumptions, is illustrated.
An example image composite, made with featureless projective parameter esti-
mation, is reproduced in Figure 6.1 where the spatial extent of the image is
increased by panning the camera while compositing (e.g., by making a panorama),
and the spatial resolution is increased by zooming the camera and by combining
overlapping frames from different viewpoints.
Figure 6.1 Image composite made from three image regions (author moving between two
different locations) in a large room: one image taken looking straight ahead (outlined in a
solid line); one image taken panning to the left (outlined in a dashed line); one image taken
panning to the right with substantial zoom-in (outlined in a dot-dash line). The second two
have undergone a coordinate transformation to put them into the same coordinates as the
first outlined in a solid line (the reference frame). This composite, made from NTSC-resolution
images, occupies about 2000 pixels across and shows good detail down to the pixel level.
Note the increased sharpness in regions visited by the zooming-in, compared to other areas.
(See magnified portions of composite at the sides.) This composite only shows the result of
combining three images, but in the final production, many more images can be used, resulting
in a high-resolution full-color composite showing most of the large room. (Figure reproduced
from [63], courtesy of IS&T.)
6.2 BACKGROUND
[Plot: projective ‘‘operator function,’’ showing range coordinate value (x2) versus domain coordinate value (x1); panels (a)–(d).]
Figure 6.2 The projective chirping phenomenon. (a) A real-world object that exhibits peri-
odicity generates a projection (image) with ‘‘chirping’’ — periodicity in perspective. (b) Center
raster of image. (c) Best-fit projective chirp of form sin[2π((ax + b)/(cx + 1))]. (d) Graphical
depiction of exemplar 1-D projective coordinate transformation of sin(2π x1 ) into a projective
chirp function, sin(2π x2 ) = sin[2π((2x1 − 2)/(x1 + 1))]. The range coordinate as a function of
the domain coordinate forms a rectangular hyperbola with asymptotes shifted to center at
the vanishing point, x1 = −1/c = −1, and exploding point, x2 = a/c = 2; the chirpiness is
c′ = c²/(bc − a) = −1/4.
Figure 6.3 Pictorial effects of the six coordinate transformations of Table 6.1, arranged left to
right by number of parameters. Note that translation leaves the ORIGINAL house figure unchanged,
except in its location. Most important, all but the AFFINE coordinate transformation affect the
periodicity of the window spacing (inducing the desired ‘‘chirping,’’ which corresponds to what
we see in the real world). Of these five, only the PROJECTIVE coordinate transformation preserves
straight lines. The 8-parameter PROJECTIVE coordinate transformation ‘‘exactly’’ describes the
possible image motions (‘‘exact’’ meaning under the idealized zero-parallax conditions).
[72,97] have assumed affine motion (six parameters) between frames. For the
assumptions of static scene and no parallax, the affine model exactly describes
rotation about the optical axis of the camera, zoom of the camera, and pure
shear, which the camera does not do, except in the limit as the lens focal length
approaches infinity. The affine model cannot capture camera pan and tilt, and
therefore cannot properly express the “keystoning” (projections of a rectangular
shape to a wedge shape) and “chirping” we see in the real world. (By “chirping”
what is meant is the effect of increasing or decreasing spatial frequency with
respect to spatial location, as illustrated in Fig. 6.2.) Consequently the affine
model attempts to fit the wrong parameters to these effects. Although it has
fewer parameters, the affine model is more susceptible to noise because it lacks
the correct degrees of freedom needed to properly track the actual image motion.
The 8-parameter projective model gives the desired 8 parameters that exactly
account for all possible zero-parallax camera motions; hence there is an important
need for a featureless estimator of these parameters. The only algorithms proposed
to date for such an estimator are [63] and, shortly after, [98]. In both algorithms
a computationally expensive nonlinear optimization method was presented. In the
earlier publication [63] a direct method was also proposed. This direct method
uses simple linear algebra, and it is noniterative insofar as methods such as
Levenberg–Marquardt, and the like, are in no way required. The proposed method
instead uses repetition with the correct law of composition on the projective
group, going from one pyramid level to the next by application of the group’s
law of composition. The term “repetitive” rather than “iterative” is used, in partic-
ular, when it is desired to distinguish the proposed method from less preferable
iterative methods, in the sense that the proposed method is direct at each stage of
computation. In other words, the proposed method does not require a nonlinear
optimization package at each stage.
Because the parameters of the projective coordinate transformation had tradi-
tionally been thought to be mathematically and computationally too difficult to
solve, most researchers have used the simpler affine model or other approxima-
tions to the projective model. Before the featureless estimation of the parameters
of the “exact” projective model is proposed and demonstrated, it is helpful to
discuss some approximate models.
Going from first order (affine), to second order, gives the 12-parameter
biquadratic model. This model properly captures both the chirping (change
in spatial frequency with position) and converging lines (keystoning) effects
associated with projective coordinate transformations. It does not constrain
chirping and converging to work together (the example in Fig. 6.3, chosen
with zero convergence yet substantial chirping, illustrates this point). Despite
its larger number of parameters, there is still considerable discrepancy between
a projective coordinate transformation and the best-fit biquadratic coordinate
transformation. Why stop at second order? Why not use a 20-parameter bicubic
model? While an increase in the number of model parameters will result in a
better fit, there is a trade-off where the model begins to fit noise. The physical
camera model fits exactly in the 8-parameter projective group; therefore we know
that eight are sufficient. Hence it seems reasonable to have a preference for
approximate models with exactly eight parameters.
The 8-parameter bilinear model seems to be the most widely used model [99]
in image processing, medical imaging, remote sensing, and computer graphics.
This model is easily obtained from the biquadratic model by removing the four
x 2 and y 2 terms. Although the resulting bilinear model captures the effect of
converging lines, it completely fails to capture the effect of chirping.
The 8-parameter pseudoperspective model [100] and an 8-parameter relative-
projective model both capture the converging lines and the chirping of a
projective coordinate transformation. The pseudoperspective model, for example,
may be thought of as first a means of removal of two of the quadratic terms
(q_{x′y²} = q_{y′x²} = 0), which results in a 10-parameter model (the q-chirp of [101]),
and then of constraining the four remaining quadratic parameters to have two
degrees of freedom. These constraints force the chirping effect (captured by
q_{x′x²} and q_{y′y²}) and the converging effect (captured by q_{x′xy} and q_{y′xy}) to work
together to match as closely as possible the effect of a projective coordinate
transformation. In setting q_α = q_{x′x²} = q_{y′xy}, the chirping in the x-direction is
forced to correspond with the converging of parallel lines in the x-direction (and
likewise for the y-direction).
Of course, the desired “exact” 8 parameters come from the projective model,
but they have been perceived as being notoriously difficult to estimate. The
parameters for this model have been solved by Tsai and Huang [102], but their
solution assumed that features had been identified in the two frames, along
with their correspondences. The main contribution of this chapter is a simple
featureless means of automatically solving for these 8 parameters.
Other researchers have looked at projective estimation in the context of
obtaining 3-D models. Faugeras and Lustman [83], Shashua and Navab [103],
and Sawhney [104] have considered the problem of estimating the projective
parameters while computing the motion of a rigid planar patch, as part of a larger
problem of finding 3-D motion and structure using parallax relative to an arbitrary
plane in the scene. Kumar et al. [105] have also suggested registering frames of
video by computing the flow along the epipolar lines, for which there is also
an initial step of calculating the gross camera movement assuming no parallax.
However, these methods have relied on feature correspondences and were aimed
at 3-D scene modeling. My focus is not on recovering the 3-D scene model,
but on aligning 2-D images of 3-D scenes. Feature correspondences greatly
simplify the problem; however, they also have many problems. The focus of this
chapter is simple featureless approaches to estimating the projective coordinate
transformation between image pairs.
1 When using low-cost wide-angle lenses, there is usually some barrel distortion, which we correct
to the quantity of light received.4 With these assumptions, the exact camera
motion that can be recovered is summarized in Table 6.2.
6.2.3 Orbits
Tsai and Huang [102] pointed out that the elements of the projective group give
the true camera motions with respect to a planar surface. They explored the
group structure associated with images of a 3-D rigid planar patch, as well as the
associated Lie algebra, although they assume that the correspondence problem
has been solved. The solution presented in this chapter (which does not require
prior solution of correspondence) also depends on projective group theory. The
basics of this theory are reviewed, before presenting the new solution in the next
section.
4 This condition can be enforced over a wide range of light intensity levels, by using the Wyckoff
principle [75,59].
5 Also known as a group action or G-set [107].
6.2.4 VideoOrbits
Here the orbit of particular interest is the collection of pictures arising from one
picture through applying all possible projective coordinate transformations to that
picture. This set is referred to as the VideoOrbit of the picture in question. Image
sequences generated by zero-parallax camera motion on a static scene contain
images that all lie in the same VideoOrbit.
The VideoOrbit of a given frame of a video sequence is defined to be the
set of all images that can be produced by applying operators from the projective
group to the given image. Hence the coordinate transformation problem may be
restated: Given a set of images that lie in the same orbit of the group, it is desired
to find for each image pair, that operator in the group which takes one image to
the other image.
If two frames, f1 and f2, are in the same orbit, then there is a group operation,
p, such that the mean-squared error (MSE) between f1 and f2′ = p ◦ f2 is zero.
In practice, however, the goal is to find which element of the group takes one
image “nearest” the other, for there will be a certain amount of parallax, noise,
interpolation error, edge effects, changes in lighting, depth of focus, and so on.
Figure 6.4 illustrates the operator p acting on frame f2 to move it nearest to frame
f1 . (This figure does not, however, reveal the precise shape of the orbit, which
occupies a 3-D parameter space for 1-D images or an 8-D parameter space for 2-
D images.) For simplicity the theory is reviewed first for the projective coordinate
transformation in one dimension.6
Suppose that we take two pictures, using the same exposure, of the same
scene from fixed common location (e.g., where the camera is free to pan, tilt,
and zoom between taking the two pictures). Both of the two pictures capture the
Figure 6.4 Video orbits. (a) The orbit of frame 1 is the set of all images that can be produced
by acting on frame 1 with any element of the operator group. Assuming that frames 1 and 2
are from the same scene, frame 2 will be close to one of the possible projective coordinate
transformations of frame 1. In other words, frame 2 ‘‘lies near the orbit of’’ frame 1. (b) By
bringing frame 2 along its orbit, we can determine how closely the two orbits come together at
frame 1.
6 In this 2-D world, the “camera” consists of a center of projection (pinhole “lens”) and a line (1-D
same pencil of light,7 but each projects this information differently onto the film
or image sensor. Neglecting that which falls beyond the borders of the pictures,
each picture captures the same information about the scene but records it in a
different way. The same object might, for example, appear larger in one image
than in the other, or might appear more squashed at the left and stretched at
the right than in the other. Thus we would expect to be able to construct one
image from the other, so that only one picture should need to be taken (assuming
that its field of view covers all the objects of interest) in order to synthesize
all the others. We first explore this idea in a make-believe “Flatland” where
objects exist on the 2-D page, rather than the 3-D world in which we live, and
where pictures are real-valued functions of one real variable, rather than the more
familiar real-valued functions of two real variables.
For the two pictures of the same pencil of light in Flatland, a common COP is
defined at the origin of our coordinate system in the plane. In Figure 6.5 a single
camera that takes two pictures in succession is depicted as two cameras shown
together in the same figure. Let Zk , k ∈ {1, 2} represent the distances, along
each optical axis, to an arbitrary point in the scene, P , and let Xk represent the
distances from P to each of the optical axes. The principal distances are denoted
zk . In the example of Figure 6.5, we are zooming in (increased magnification) as
we go from frame 1 to frame 2.
Considering an arbitrary point P in the scene, subtending in a first picture
an angle α = arctan(X1/Z1) = arctan(x1/z1), the geometry of Figure 6.5 defines
a mapping from x1 to x2, based on a camera rotating through an angle of θ
between the taking of two pictures [108,17]:

x2 = z2 tan(arctan(x1/z1) − θ),  ∀x1 ≠ o1.   (6.1)
7 We neglect the boundaries (edges or ends of the sensor) and assume that both pictures have sufficient
Figure 6.5 Camera at a fixed location. An arbitrary scene is photographed twice, each time
with a different camera orientation and a different principal distance (zoom setting). In both
cases the camera is located at the same place (COP) and thus captures the same pencil of
light. The dotted line denotes a ray of light traveling from an arbitrary point P in the scene to the
COP. Heavy lines denote both camera optical axes in each of the two orientations as well as
the image sensor in each of its two pan and zoom positions. The two image sensors (or films)
are in front of the camera to simplify mathematical derivations.
First, note the well-known trigonometric identity for the difference of two
angles:

tan(α − θ) = (tan(α) − tan(θ))/(1 + tan(α) tan(θ)).   (6.2)

Substituting tan(α) = x1/z1 into (6.1) gives

x2 = z2 (x1/z1 − tan(θ))/(1 + (x1/z1) tan(θ)).   (6.3)

Letting constants a = z2/z1, b = −z2 tan(θ), and c = tan(θ)/z1, the trigonometric
computations are removed from the independent variable, so that

x2 = (ax1 + b)/(cx1 + 1),  ∀x1 ≠ o1.   (6.4)
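As a small numerical illustration (not from the original text), the following Python sketch applies (6.4) to a sinusoid, producing the projective chirp of Figure 6.2; the values a = 2, b = −2, c = 1 are those of that example, with the vanishing point at x1 = −1/c = −1:

import numpy as np

def projective_1d(x1, a, b, c):
    """1-D projective coordinate transformation x2 = (a*x1 + b)/(c*x1 + 1), eq. (6.4)."""
    return (a * x1 + b) / (c * x1 + 1)

# Values from the example of Figure 6.2: a = 2, b = -2, c = 1,
# with vanishing point x1 = -1/c = -1 and exploding point x2 = a/c = 2.
x1 = np.linspace(-0.9, 8.0, 2000)            # stay clear of the singularity at x1 = -1
x2 = projective_1d(x1, a=2.0, b=-2.0, c=1.0)
chirp = np.sin(2 * np.pi * x2)               # a "projective chirp" of sin(2*pi*x)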
Figure 6.6 Cameras at 90 degree angles. In this situation o1 = 0 and o2 = 0. If we had in the
domain x1 a function such as sin(x1 ), we would have the chirp function sin(1/x1 ) in the range,
as defined by the mapping x2 = 1/x1 .
group that takes any function g on image line 1 to a function h on image line 2:

h(x2) = g(x1) = g((b − x2)/(cx2 − a)) = g ◦ x1 = g ◦ p⁻¹ ◦ x2,  ∀x2 ≠ o2,   (6.5)

where

x1 = (b − x2)/(cx2 − a),  ∀x2 ≠ o2.   (6.6)
Proof Consider a geometric argument. The mapping from the first (1-D) frame
of an image sequence, g(x1 ) to the next frame, h(x2 ) is parameterized by the
following: camera translation perpendicular to the object, tz ; camera translation
parallel to the object, tx ; pan of frame 1, θ1 ; pan of frame 2, θ2 ; zoom of frame 1,
z1 ; and zoom of frame 2, z2 (see Fig. 6.7). We want to obtain the mapping from
Figure 6.7 Two pictures of a flat (straight) object. The point P is imaged twice, each time
with a different camera orientation, a different principal distance (zoom setting), and different
camera location (resolved into components parallel and perpendicular to the object).
Proposition 6.2.2 says that an element of the (ax + b)/(cx + 1) group can
be used to align any two images of linear objects in Flatland regardless of camera
movement.
[Plots: (a) affine ‘‘operator function’’ and (b) projective ‘‘operator function,’’ each showing range coordinate value (x2) versus domain coordinate value (x1).]
Figure 6.8 Comparison of 1-D affine and projective coordinate transformations, in terms of
their operator functions, acting on a sinusoidal image. (a) Orthographic projection is equivalent
to affine coordinate transformation, y = ax + b. Slope a = 2 and intercept b = 3. The operator
function is a straight line in which the intercept is related to phase shift (delay), and the
slope to dilation (which affects the frequency of the sinusoid). For any function g(t) in the
range, this operator maps functions g ∈ G(=o1 ) to functions h ∈ H(=o2 ) that are dilated by
a factor of 2 and translated by 3. Fixing g and allowing slope a ≠ 0 and intercept b to vary
produces a family of wavelets where the original function g is known as the mother wavelet.
(b) Perspective projection for a particular fixed value of p = {1, 2, 45◦ }. Note that the plot is
a rectangular hyperbola like x2 = 1/(c x1 ) but with asymptotes at the shifted origin (−1, 2).
Here h = sin(2π x2 ) is ‘‘dechirped’’ to g. The arrows indicate how a chosen cycle of chirp g
is mapped to the corresponding cycle of the sinusoid h. Fixing g and allowing a ≠ 0, b, and
c to vary produces a class of functions, in the range, known as P-chirps. Note the singularity
in the domain at x1 = −1 and the singularity in the range at x2 = a/c = 2. These singularities
correspond to the exploding point and vanishing point, respectively.
Figure 6.9 Graphical depiction of a situation where two pictures are related by a zoom from
1 to 2, and a 45 degree angle between the two camera positions. The geometry of this
situation corresponds, in particular, to the operator p = [2, −2; 1, 1] which corresponds to
p = {1, 2, 45◦ }, that is, zoom from 1 to 2, and an angle of 45 degrees between the optical
axes of the camera positions. This geometry corresponds to the operator functions plotted in
Figure 6.8b and Figure 6.2d.
10 For 2-D images in a 3-D world, the isomorphism no longer holds. However, the projective
To lay the framework for the new results, existing methods of parameter
estimation for coordinate transformations will be reviewed. This framework will
apply to existing methods as well as to new methods. The purpose of this review
is to bring together a variety of methods that appear quite different but actually
can be described in a more unified framework as is presented here.
The framework given breaks existing methods into two categories: feature-
based, and featureless. Of the featureless methods, consider two subcategories:
methods based on minimizing MSE (generalized correlation, direct nonlinear
optimization) and methods based on spatiotemporal derivatives and optical flow.
Variations such as multiscale have been omitted from these categories, since
multiscale analysis can be applied to any of them. The new algorithms proposed
in this chapter (with final form given in Section 6.4) are featureless, and based
on (multiscale if desired) spatiotemporal derivatives.
Some of the descriptions of methods will be presented for hypothetical 1-D
images taken of 2-D “scenes” or “objects.” This simplification yields a clearer
comparison of the estimation methods. The new theory and applications will be
presented subsequently for 2-D images taken of 3-D scenes or objects.
using least squares if there are more than three correspondence points. The
extension from 1-D “images” to 2-D images is conceptually identical. For the
affine and projective models, the minimum number of correspondence points
needed in 2-D is three and four, respectively, because the number of degrees of
freedom in 2-D is six for the affine model and eight for the projective model.
Each point correspondence anchors two degrees of freedom because it is in 2-D.
A major difficulty with feature-based methods is finding the features. Good
features are often hand-selected, or computed, possibly with some degree of
human intervention [112]. A second problem with features is their sensitivity
to noise and occlusion. Even if reliable features exist between frames (e.g.,
line markings on a playing field in a football video; see Section 6.5.2), these
features may be subject to signal noise and occlusion (e.g., running football
players blocking a feature). The emphasis in the rest of this chapter will be on
robust featureless methods.
analysis [35], giving rise to the so-called q-chirplet [35], which differs from the projective chirplet
discussed here.
chirplet approach; thus, for the remainder of this chapter, we consider featureless
methods based on spatiotemporal derivatives.
13 While one may choose to debate whether or not this quantity is actually in units of brightness, this
is the term used by Horn [71]. It is denoted by Horn using the letter E. Variables E, F , G, and H
will be used to denote this quantity throughout this book, where, for example, F (x, t) = f (q(x, t))
is a typically unknown nonlinear function of the actual quantity of light falling on the image sensor.
14 The 1-D affine model is a simple yet sufficiently interesting (non-Abelian) example selected to
and then globally fitting this flow with an affine model (affine fit), and rewriting
the optical flow equation in terms of a single global affine (not translation) motion
model (affine flow ).
Affine Fit
Wang and Adelson [119] proposed fitting an affine model to the optical flow field
between two 2-D images. Their approach with 1-D images is briefly examined.
The reduction in dimensions simplifies analysis and comparison to affine flow.
Denote coordinates in the original image, g, by x, and in the new image, h, by
x′. Suppose that h is a dilated and translated version of g so that x′ = ax + b
for every corresponding pair (x′, x). Equivalently the affine model of velocity
(normalizing Δt = 1), um = x′ − x, is given by um = (a − 1)x + b. We can
expect a discrepancy between the flow velocity, uf , and the model velocity,
um , due either to errors in the flow calculation or to errors in the affine model
assumption. Therefore we apply linear regression to get the best least-squares fit
by minimizing

ε_fit = Σ_x (um − uf)² = Σ_x (um + Et/Ex)² = Σ_x ((a − 1)x + b + Et/Ex)².   (6.12)
The constants a and b that minimize εf it over the entire patch are found by
differentiating (6.12), with respect to a and b, and setting the derivatives to zero.
This results in what are called the affine fit equations:
[Σ_x x², Σ_x x; Σ_x x, Σ_x 1] [a − 1; b] = −[Σ_x x Et/Ex; Σ_x Et/Ex].   (6.13)
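A minimal sketch of the affine fit (6.13) in Python, assuming the derivatives Ex and Et (and the coordinate array x) have already been computed, for example by finite differences between the two frames; the small eps guards against division by zero where Ex vanishes:

import numpy as np

def affine_fit_1d(Ex, Et, x, eps=1e-12):
    """Solve the affine-fit equations (6.13) for the 1-D affine parameters a and b."""
    uf = -Et / (Ex + eps)                      # pointwise flow estimate, guarded against Ex = 0
    M = np.array([[np.sum(x * x), np.sum(x)],
                  [np.sum(x),     float(x.size)]])
    rhs = np.array([np.sum(x * uf), np.sum(uf)])
    a_minus_1, b = np.linalg.solve(M, rhs)
    return a_minus_1 + 1.0, b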
Affine Flow
Alternatively, the affine coordinate transformation may be directly incorporated
into the brightness change constraint equation (6.10). Bergen et al. [120]
proposed this method, affine flow, to distinguish it from the affine fit model
of Wang and Adelson (6.13). Let us see how affine flow and affine fit are related.
Substituting um = (ax + b) − x directly into (6.11) in place of uf and summing
the squared error, we have

ε_flow = Σ_x (um Ex + Et)² = Σ_x (((a − 1)x + b)Ex + Et)²   (6.14)
over the whole image. Then differentiating, and equating the result to zero gives
us a linear solution for both a − 1 and b:

[Σ_x x²Ex², Σ_x xEx²; Σ_x xEx², Σ_x Ex²] [a − 1; b] = −[Σ_x xExEt; Σ_x ExEt].   (6.15)

To see how result (6.15) compares to the affine fit, we rewrite (6.12):

ε_fit = Σ_x ((um Ex + Et)/Ex)²   (6.16)
and observe, comparing (6.14) and (6.16), that affine flow is equivalent to a
weighted least-squares fit (i.e., a weighted affine fit), where the weighting is
given by Ex2 . Thus the affine flow method tends to put more emphasis on areas
of the image that are spatially varying than does the affine fit method. Of course,
one is free to separately choose the weighting for each method in such a way
that affine fit and affine flow methods give the same result.
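For comparison, a corresponding sketch of the affine flow solution (6.15), under the same assumption of precomputed Ex and Et; the only difference from the fit is the Ex² weighting inside the sums:

import numpy as np

def affine_flow_1d(Ex, Et, x):
    """Solve the affine-flow equations (6.15): an Ex**2-weighted version of the fit."""
    w = Ex ** 2
    M = np.array([[np.sum(x * x * w), np.sum(x * w)],
                  [np.sum(x * w),     np.sum(w)]])
    rhs = -np.array([np.sum(x * Ex * Et), np.sum(Ex * Et)])
    a_minus_1, b = np.linalg.solve(M, rhs)
    return a_minus_1 + 1.0, b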
Both intuition and practical experience tend to favor the affine flow weighting.
More generally, perhaps we should ask "What is the best weighting?" Lucas and
Kanade [121], among others, considered weighting issues, but the rather obvious
difference in weighting between fit and flow did not enter into their analysis,
nor does it appear elsewhere in the literature. The fact that the two approaches provide similar
results, and yet drastically different weightings, suggests that we can exploit the
choice of weighting. In particular, we will observe in Section 6.3.3 that we can
select a weighting that makes the implementation easier.
Another approach to the affine fit involves computation of the optical flow
field using the multiscale iterative method of Lucas and Kanade, and then
fitting to the affine model. An analogous variant of the affine flow method
involves multiscale iteration as well, but in this case the iteration and multiscale
hierarchy are incorporated directly into the affine estimator [120]. With the
addition of multiscale analysis, the fit and flow methods differ in additional
respects beyond just the weighting. My intuition and experience indicates that
the direct multiscale affine flow performs better than the affine fit to the multiscale
flow. Multiscale optical flow makes the assumption that blocks of the image are
moving with pure translational motion; then, paradoxically, the affine
fit refutes this pure-translation assumption. However, fit provides some utility
over flow when it is desired to segment the image into regions undergoing
different motions [122], or to gain robustness by rejecting portions of the image
not obeying the assumed model.
and minimize the sum of the squared difference as was done in (6.12):

ε = Σ_x ((ax + b)/(cx + 1) − x + Et/Ex)².   (6.18)
Projective Flow
For projective-flow (p-flow), substitute um = (ax + b)/(cx + 1) − x into (6.14).
Again, weighting by (cx + 1) gives
ε_w = Σ (axEx + bEx + c(xEt − x²Ex) + Et − xEx)²   (6.20)

(the subscript w denotes weighting has taken place). The result is a linear system
of equations for the parameters:

(Σ φ_w φ_w^T) [a, b, c]^T = Σ (xEx − Et)φ_w,   (6.21)

where φ_w = [xEx, Ex, xEt − x²Ex]^T.
and use the first three terms, obtaining enough degrees of freedom to account
for the three parameters being estimated. Letting the squared error due to higher-
order terms in the Taylor series approximation be ε = Σ(−h.o.t.)² = Σ((b +
(a − bc − 1)x + (bc − a)cx²)Ex + Et)², q2 = (bc − a)c, q1 = a − bc − 1, and
q0 = b, and differentiating with respect to each of the 3 parameters of q, setting
the derivatives equal to zero, and solving, gives the linear system of equations
for unweighted projective flow:

[Σ x⁴Ex², Σ x³Ex², Σ x²Ex²; Σ x³Ex², Σ x²Ex², Σ xEx²; Σ x²Ex², Σ xEx², Σ Ex²] [q2; q1; q0]
    = −[Σ x²ExEt; Σ xExEt; Σ ExEt].   (6.24)
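A sketch of the unweighted projective-flow solve (6.24) in the same style, again assuming precomputed 1-D derivatives Ex and Et; once q = (q2, q1, q0) has been found, the parameters a, b, and c can be recovered from q0 = b, q1 = a − bc − 1, and q2 = (bc − a)c:

import numpy as np

def projective_flow_1d_unweighted(Ex, Et, x):
    """Solve the 3x3 system (6.24) for q = (q2, q1, q0) of the unweighted projective flow."""
    w = Ex ** 2
    m = lambda k: np.sum(x ** k * w)           # weighted spatial moments: sum of x^k * Ex^2
    M = np.array([[m(4), m(3), m(2)],
                  [m(3), m(2), m(1)],
                  [m(2), m(1), m(0)]])
    rhs = -np.array([np.sum(x ** 2 * Ex * Et),
                     np.sum(x * Ex * Et),
                     np.sum(Ex * Et)])
    return np.linalg.solve(M, rhs)             # q2, q1, q0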
u_f^T Ex + Et ≈ 0.   (6.25)
As is well known [71] the optical flow field in 2-D is underconstrained.15 The
model of pure translation at every point has two parameters, but there is only
one equation (6.25) to solve. So it is common practice to compute the optical
flow over some neighborhood, which must be at least two pixels, but is generally
taken over a small block, 3 × 3, 5 × 5, or sometimes larger (including the entire
image as in this chapter).
Our task is not to deal with the 2-D translational flow, but with the 2-D
projective flow, estimating the eight parameters in the coordinate transformation:
[x′; y′] = (A[x, y]^T + b)/(c^T [x, y]^T + 1) = (Ax + b)/(c^T x + 1).   (6.26)

The desired eight scalar parameters are denoted by p = [A, b; c, 1], A ∈ ℝ^(2×2),
b ∈ ℝ^(2×1), and c ∈ ℝ^(2×1).
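To make (6.26) concrete, here is a brief sketch (illustrative only) that applies the eight-parameter transformation p = [A, b; c, 1] to an array of 2-D points; the example values of A, b, and c are arbitrary:

import numpy as np

def projective_transform_2d(points, A, b, c):
    """Apply x' = (A x + b) / (c^T x + 1), eq. (6.26), to an (N, 2) array of points."""
    points = np.asarray(points, dtype=np.float64)
    num = points @ A.T + b          # A x + b, for every point
    den = points @ c + 1.0          # c^T x + 1, for every point
    return num / den[:, None]

# Example: a mild zoom with a small amount of chirp (keystoning) in the x-direction.
A = np.array([[1.1, 0.0], [0.0, 1.1]])
b = np.array([5.0, -3.0])
c = np.array([1e-3, 0.0])
print(projective_transform_2d([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]], A, b, c))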
Differentiating with respect to the free parameters A, b, and c, and setting the
result to zero gives a linear solution:
(Σ φφ^T) [a11, a12, b1, a21, a22, b2, c1, c2]^T = Σ (x^T Ex − Et)φ,   (6.29)

where φ^T = [Ex (x, y, 1), Ey (x, y, 1), xEt − x²Ex − xyEy, yEt − xyEx − y²Ey].
um + x = q_{x′xy} xy + (q_{x′x} + 1)x + q_{x′y} y + q_{x′},
vm + y = q_{y′xy} xy + q_{y′x} x + (q_{y′y} + 1)y + q_{y′}.   (6.30)

Incorporating these into the flow criteria yields a simple set of eight linear
equations in eight unknowns:

(Σ_{x,y} φ(x, y) φ^T(x, y)) q = Σ_{x,y} Et φ(x, y).   (6.31)
16 Use of an approximate model that doesn’t capture chirping or preserve straight lines can still lead
to the true projective parameters as long as the model captures at least eight degrees of freedom.
To see how well the model describes the coordinate transformation between
two images, say, g and h, one might warp 17 h to g, using the estimated motion
model, and then compute some quantity that indicates how different the resampled
version of h is from g. The MSE between the reference image and the warped
image might serve as a good measure of similarity. However, since we are really
interested in how the exact model describes the coordinate transformation, we
assess the goodness of fit by first relating the parameters of the approximate
model to the exact model, and then find the MSE between the reference image
and the comparison image after applying the coordinate transformation of the
exact model. A method of finding the parameters of the exact model, given the
approximate model, is presented in Section 6.4.1.
1. Select four ordered pairs (the four corners of the bounding box
containing the region under analysis, the four corners of the image if
the whole image is under analysis, etc.). Here, for simplicity, suppose
that these points are the corners of the unit square: s = [s1 , s2 , s3 , s4 ] =
[(0, 0)T , (0, 1)T , (1, 0)T , (1, 1)T ].
2. Apply the coordinate transformation using the Taylor series for the
approximate model, such as (6.30), to these points: r = um (s).
17 The term warp is appropriate here, since the approximate model does not preserve straight lines.
[x′k; y′k] = [xk, yk, 1, 0, 0, 0, −xk x′k, −yk x′k; 0, 0, 0, xk, yk, 1, −xk y′k, −yk y′k]
    [a_{x′x}, a_{x′y}, b_{x′}, a_{y′x}, a_{y′y}, b_{y′}, cx, cy]^T.   (6.34)
We remind the reader that the four corners are not feature correspondences as
used in the feature-based methods of Section 6.3.1 but are used so that the two
featureless models (approximate and exact) can be related to one another.
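Stacking (6.34) over the four point pairs gives an 8 × 8 linear system; a sketch of that step is given below (the numerical values of r are placeholders, standing in for the output of step 2):

import numpy as np

def exact_params_from_four_points(s, r):
    """Stack (6.34) over four point pairs s_k -> r_k and solve the resulting 8x8 system
    for [a_x'x, a_x'y, b_x', a_y'x, a_y'y, b_y', c_x, c_y]."""
    M, v = [], []
    for (x, y), (xp, yp) in zip(s, r):
        M.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        M.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        v.extend([xp, yp])
    return np.linalg.solve(np.array(M, dtype=float), np.array(v, dtype=float))

# The corners of the unit square (step 1), mapped through the approximate model (step 2),
# would serve as s and r here; the numbers below are placeholders for illustration.
s = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
r = [(0.01, 0.02), (0.03, 1.01), (1.02, 0.00), (1.05, 1.04)]
print(exact_params_from_four_points(s, r))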
It is important to realize the full benefit of finding the exact parameters. While
the “approximate model” is sufficient for small deviations from the identity, it is
not adequate to describe large changes in perspective. However, if we use it to
track small changes incrementally, and each time relate these small changes to
the exact model (6.26), then we can accumulate these small changes using the law
of composition afforded by the group structure. This is an especially favorable
contribution of the group framework. For example, with a video sequence, we
can accommodate very large accumulated changes in perspective in this manner.
The problems with cumulative error can be eliminated, for the most part, by
constantly propagating forward the true values, computing the residual using the
approximate model, and each time relating this to the exact model to obtain a
goodness-of-fit estimate.
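In homogeneous 3 × 3 matrix form the law of composition is simply matrix multiplication, so the accumulation of incremental estimates might be sketched as follows (a generic sketch, not the book's implementation):

import numpy as np

def as_matrix(A, b, c):
    """Pack the parameters p = [A, b; c, 1] into a 3x3 homogeneous matrix."""
    return np.block([[A, b.reshape(2, 1)],
                     [c.reshape(1, 2), np.ones((1, 1))]])

def accumulate(incremental):
    """Compose per-frame incremental estimates (3x3 matrices) into absolute
    transformations relative to the first frame, using matrix multiplication
    as the projective group's law of composition."""
    absolute, P = [], np.eye(3)
    for M in incremental:
        P = M @ P                       # new small change composed onto the running total
        absolute.append(P / P[2, 2])    # renormalize so the lower-right entry stays 1
    return absolute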
Repeat until either the error between hk and g falls below a threshold, or until
some maximum number of repetitions is achieved. After the first repetition, the
parameters q2 tend to be near the identity, since they account for the residual
between the “perspective-corrected” image h1 and the “true” image g. We find
Figure 6.10 Method of computation of eight parameters p between two images from the
same pyramid level, g and h. The approximate model parameters q are related to the exact
model parameters p in a feedback system.
that only two or three repetitions are usually needed for frames from nearly the
same orbit.
A rectangular image assumes the shape of an arbitrary quadrilateral when it
undergoes a projective coordinate transformation. In coding the algorithm, the
undefined portions are padded with the quantity NaN, a standard IEEE arithmetic
value, so that any calculations involving these values automatically inherit NaN
without slowing down the computations. The algorithm runs at a few frames per
second on a typical WearComp.
18 A commutative (or Abelian) group is one in which elements of the group commute, for example,
translation along the x-axis commutes with translation along the y-axis, so the 2-D translation group
is commutative.
19 While the Heisenberg group deals with translation and frequency-translation (modulation), some
of the concepts could be carried over to other more relevant group structures.
[Flowchart: start with an image pair, g and h; repeat the estimation until h is close enough to g; then proceed to the next pyramid level, repeating until the final desired level is reached.]
Figure 6.11 The VideoOrbits head-tracking algorithm. The new head-tracking algorithm
requires no special devices installed in the environment. The camera in the personal imaging
system (or the EyeTap) simply tracks itself based on its view of objects in the environment.
The algorithm is based on algebraic projective geometry. It provides an estimate of the true
projective coordinate transformation, which, for successive image pairs, is composed using
the projective group. Successive pairs of images may be estimated in the neighborhood of
the identity coordinate transformation (using an approximate representation), while absolute
head-tracking is done using the exact group by relating the approximate parameters q to the
exact parameters p in the innermost loop of the process. The algorithm typically runs at 5 to 10
frames per second on a general-purpose computer, but the simple structure of the algorithm
makes it easy to implement in hardware for the higher frame rates needed for full-motion video.
Figure 6.12 shows some frames from a typical image sequence. Figure 6.13
shows the same frames brought into the coordinate system of frame (c), that
is, the middle frame was chosen as the reference frame.
Given that we have established a means of estimating the projective coordinate
transformation between any pair of images, there are two basic methods we use
for finding the coordinate transformations between all pairs of a longer image
sequence. Because of the group structure of the projective coordinate transfor-
mations, it suffices to arbitrarily select one frame and find the coordinate trans-
formation between every other frame and this frame. The two basic methods are:
Figure 6.12 Frames from original image orbit, sent from author’s personal imaging apparatus.
The camera is mounted sideways so that it can ‘‘paint’’ out the image canvas with a wider
‘‘brush,’’ when sweeping across for a panorama. Thus the visual field of view that the
author experienced was rotated through 90 degrees. Much like George Stratton did with his
upside-down glasses, the author adapted, over an extended period of time, to experiencing
the world rotated 90 degrees. (Adaptation experiments were covered in Chapter 3.)
Figure 6.13 Frames from original image video orbit after a coordinate transformation to move
them along the orbit to the reference frame (c). The coordinate-transformed images are alike
except for the region over which they are defined. The regions are not parallelograms; thus
methods based on the affine model fail.
The frames of Figure 6.12 were brought into register using the differential
parameter estimation, and “cemented” together seamlessly on a common canvas.
Figure 6.14 Frames of Figure 6.13 ‘‘cemented,’’ together on single image ‘‘canvas,’’ with
comparison of affine and projective models. Good registration of the projective/projective
image is possible despite the noise in the amateur television receiver, wind-blown trees, and
the fact that the rotation of the camera was not actually about its center of projection. The
affine model fails to properly estimate the motion parameters (affine/affine), and even if the
‘‘exact’’ projective model is used to estimate the affine parameters, there is no affine coordinate
transformation that will properly register all of the image frames.
“Cementing” involves piecing the frames together, for example, by median, mean,
or trimmed mean, or combining on a subpixel grid [129]. (The trimmed mean was
used here, but the particular method made little visible difference.) Figure 6.14
shows this result (projective/projective), with a comparison to two nonprojective
cases. The first comparison is to affine/affine where affine parameters were
estimated (also multiscale) and used for the coordinate transformation. The
second comparison, affine/projective, uses the six affine parameters found by
estimating the eight projective parameters and ignoring the two chirp parameters
c (which capture the essence of tilt and pan). These six parameters A, b are
more accurate than those obtained using the affine estimation, as the affine
estimation tries to fit its shear parameters to the camera pan and tilt. In other
words, the affine estimation does worse than the six affine parameters within the
projective estimation. The affine coordinate transform is finally applied, giving
the image shown. Note that the coordinate-transformed frames in the affine case
are parallelograms.
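The "cementing" step itself can be sketched as below (an illustrative trimmed-mean combiner, not the code used for the figures): each registered frame is assumed to have been warped onto a common canvas, with NaN marking its undefined region, and at every pixel the lowest and highest valid values are discarded before averaging, provided enough frames overlap there.

import numpy as np

def trimmed_mean_composite(warped_frames):
    """Combine registered frames (NaN where a frame is undefined) by discarding the
    single lowest and highest valid value at each pixel, where at least three frames
    overlap, and averaging the rest."""
    stack = np.stack(warped_frames).astype(np.float64)   # shape (frames, rows, cols[, channels])
    valid = ~np.isnan(stack)
    n = valid.sum(axis=0)                                # how many frames cover each pixel

    total = np.nansum(stack, axis=0)
    lo = np.where(n > 0, np.min(np.where(valid, stack, np.inf), axis=0), 0.0)
    hi = np.where(n > 0, np.max(np.where(valid, stack, -np.inf), axis=0), 0.0)

    trimmed_sum = np.where(n >= 3, total - lo - hi, total)
    trimmed_n = np.where(n >= 3, n - 2, n)
    return np.where(n > 0, trimmed_sum / np.maximum(trimmed_n, 1), np.nan)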
1. The camera movement is small, so any pair of frames chosen from the
VideoOrbit has a substantial amount of overlap when expressed in a
common coordinate system. (Use differential parameter estimation.)
In the example of Figure 6.14, any cumulative errors are not particularly
noticeable because the camera motion is progressive; that is, it does not reverse
direction or loop around on itself. Now let us look at an example where the
camera motion loops back on itself and small errors, because of violations of the
assumptions (fixed camera location and static scene) accumulate.
Consider the image sequence shown in Figure 6.15. The composite arising
from bringing these 16 image frames into the coordinates of the first frame
exhibited somewhat poor registration due to cumulative error; this sequence is
used to illustrate the importance of subcomposites.
The differential support matrix20 appears in Figure 6.16. The entry qm,n of the matrix tells
us how much frame n overlaps with frame m when expressed in
the coordinates of frame m, for the sequence of Figure 6.15.
Examining the support matrix, and the mean-squared error estimates, the local
maxima of the support matrix correspond to the local minima of the mean-squared
error estimates, suggesting the subcomposites:21 {7, 8, 9, 10, 6, 5}, {1, 2, 3, 4},
and {15, 14, 13, 12}. It is important to note that when the error is low, if the
support is also low, the error estimate might not be valid. For example, if the
two images overlap in only one pixel, then even if the error estimate is zero (i.e.,
the pixel has a value of 255 in both images), the alignment is not likely good.
The selected subcomposites appear in Figure 6.17. Estimating the coordinate
transformation between these subcomposites, and putting them together into
a common frame of reference results in a composite (Fig. 6.17) about 1200
pixels across. The image is sharp despite the fact that the person in the picture
was moving slightly and the camera operator was also moving (violating the
assumptions of both static scene and fixed center of projection).
Figure 6.15 The Hewlett Packard ‘‘Claire’’ image sequence, which violates the assumptions
of the model (the camera location was not fixed, and the scene was not completely static).
Images appear in TV raster-scan order.
20 The differential support matrix is not necessarily symmetric, whereas the cumulative support matrix,
for which the entry qm,n tells us how much frame n overlaps with frame m when expressed in the
coordinates of frame 0 (the reference frame), is symmetric.
21 Researchers at Sarnoff also consider the use of subcomposites; they refer to them as tiles [130,131].
[Figure 6.16 data: (a) the 16 × 16 differential support matrix for the sequence of Figure 6.15, with entries ranging from 1.00 on the diagonal down to about 0.17 for widely separated frames; brackets in the figure mark the subPIC groupings, frames 0 to 4, frames 5 to 11, and frames 12 to 15. (c) The per-frame mean-squared registration error, ranging from about 3.90 to 7.59.]
Figure 6.16 Support matrix and mean-squared registration error defined by image sequence
in Figure 6.15 and the estimated coordinate transformations between images. (a) Entries in
table. The diagonal entries are one, since every frame is fully supported in itself. The entries just
above (or below) the diagonal give the amount of pairwise support. For example, frames 0 and
1 share high mutual support (0.91), as do frames 7, 8, and 9 (again 0.91). (b) Corresponding
density plot (denser ink indicates higher values). (c) Mean-squared registration error;
(d) corresponding density plot.
[Figure 6.17 panels: Frames 0–4; Frames 5–11; Frames 12–15; Completed PIC.]
Figure 6.17 Subcomposites are each made from subsets of the images that share high
quantities of mutual support and low estimates of mutual error, and then combined to form the
final composite.
Figure 6.18 Image composite made from 16 video frames taken from a television broadcast
sporting event. Note the ‘‘Edgertonian’’ appearance, as each player traces out a stroboscopic
path. The proposed method works robustly, despite the movement of players on the field.
(a) Images are expressed in the coordinates of the first frame. (b) Images are expressed in a new
useful coordinate system corresponding to none of the original frames. The slight distortions arise
because football fields are never perfectly flat but are raised slightly in the center.
Despite the moving players in the video, the proposed method successfully
registers all of the images in the orbit, mapping them into a single high-resolution
image composite of the entire playing field. Figure 6.18a shows 16 frames of
video from a football game combined into a single image composite, expressed in
the coordinates of the first image in the sequence. The choice of coordinate system
was arbitrary, and any of the images could have been chosen as the reference
frame. In fact a coordinate system other than one chosen from the input images
could even be used. In particular, a coordinate system where parallel lines never
meet, and periodic structures are “dechirped” (Fig. 6.18b) lends itself well to
machine vision and player-tracking algorithms [132]. Even if the entire playing
field is not visible in a particular image, collectively the video from an entire
game will reveal every square yard of the playing surface at one time or another,
enabling a composite to be made of the entire playing surface.
6.6.1 Overview
In this section the Wyckoff [63] photoquantigraphic imaging principle presented
in Chapter 5 is combined with VideoOrbits. The result suggests that a camera
rotated about its center of projection may be used as a measuring instrument.
The look-and-paint metaphor of this chapter will now provide not just a picture
of increased spatial extent but a way to “paint” a set of photoquantigraphic
measurements onto an empty “canvas,” as the camera is swept around. (In the
case where the camera is the eye, the painting is accomplished by gaze-based
measurement spaces.) These measurements describe, up to a single unknown
scalar constant (for the entire canvas), the quantity of light arriving from every
direction in space.
6.6.5 AGC
If what is desired is a picture of increased spatial extent or spatial resolution,
the nonlinearity is not a problem so long as it is not image dependent. However,
most low-cost cameras have built-in automatic gain control (AGC), electronic
level control, auto-iris, and other forms of automatic exposure that cannot be
turned off or disabled. (For simplicity all such methods of automatic exposure
Figure 6.19 The fire-exit sequence, taken using a camera with AGC. (a)–( j) frames 0 to 9: As
the camera pans across to take in more of the open doorway, the image brightens and shows
more of the interior, while at the same time, clipping highlight detail. Frame 0 (a) shows the
writing on the white paper taped to the door very clearly, but the interior is completely black. In
frame 5 (f) the paper is completely obliterated. It is so washed out that we cannot even discern
that there is a paper present. Although the interior is getting brighter, it is still not discernible in
frame 5 (f), but more and more detail of the interior becomes visible as we proceed through the
sequence, showing that the fire exit is blocked by the clutter inside. (A)–(J) ‘‘certainty’’ images
corresponding to (a)–(j).
control are referred to as AGC, whether or not they are actually implemented
using gain adjustment.) This means that the unknown response function, f (q),
is image dependent. It will therefore change over time, as the camera framing
changes to include brighter or darker objects.
AGC was a good invention for its intended application, serving the interests of
most camera users who merely wish to have a properly exposed picture without
having to make adjustments to the camera. However, it has thwarted attempts to
estimate the projective coordinate transformation between frame pairs. Examples
from an image sequence acquired using a camera with AGC appear in Figure 6.19.
A joint estimation of the projective coordinate transformation and the tone-
scale change may be regarded as a “motion estimation” problem if we extend
the concept of motion estimation to include both domain motion (motion in the
traditional sense) and range motion (Fig. 6.20).
Figure 6.20 One row across each of two images from the fire-exit sequence. ‘‘Domain
motion’’ is motion in the traditional sense (motion from left to right, zoom, etc.), while ‘‘range
motion’’ refers to a tone-scale adjustment (e.g., lightening or darkening of the image). The
camera is panning to the right, so domain motion is to the left. However, when panning to
the right, the camera points more and more into the darkness of an open doorway, causing
the AGC to adjust the exposure. Thus there is some upward motion of the curve as well as
leftward motion. Just as panning the camera across causes information to leave the frame at
the left, and new information to enter at the right, the AGC causes information to leave from the
top (highlights get clipped) and new information to enter from the bottom (increased shadow
detail).
As before, an approximate model expanded about the identity is used. For the projective group, the approximate model has the form
q2(x) = q1((ax + b)/(cx + 1)).
For the Wyckoff group (which is a one-parameter group isomorphic to addition
over the reals, or multiplication over the positive reals), the approximate model
may be taken from (4.51), from which we see that g(f) is affine (a linear equation plus a constant offset) in f. This affine relationship
suggests that linear regression on the comparagram between two images would
provide an estimate of α and γ, while leaving β unknown. This result is consistent
with the fact that the response curve may only be determined up to a constant
scale factor.
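As a concrete illustration of this regression, the sketch below assumes, for illustration only, a power-law-plus-offset response model f(q) = α + βq^γ, for which the tone-scale change g(f) = k^γ f + α(1 − k^γ) is affine in f (consistent with the roles of α, γ, and β described above). The function names and the synthetic test are assumptions, not the author's code.

```python
import numpy as np

def comparagram(img1, img2, bins=256):
    """Joint histogram J[u, v] = number of pixel locations where
    img1 == u and img2 == v (both images already spatially registered)."""
    h, _, _ = np.histogram2d(img1.ravel(), img2.ravel(),
                             bins=bins, range=[[0, bins], [0, bins]])
    return h

def fit_affine_comparametric(J):
    """Weighted least-squares line v = slope * u + offset through the comparagram.
    Under the assumed model g(f) = k**gamma * f + alpha*(1 - k**gamma):
    slope ~ k**gamma and offset ~ alpha*(1 - k**gamma); with a known exposure
    ratio k, gamma and alpha then follow, while beta (overall scale) stays unknown."""
    u, v = np.meshgrid(np.arange(J.shape[0]), np.arange(J.shape[1]), indexing="ij")
    w = J.ravel()
    A = np.stack([u.ravel(), np.ones(u.size)], axis=1).astype(float)
    Aw = A * w[:, None]
    slope, offset = np.linalg.solve(Aw.T @ A, Aw.T @ v.ravel().astype(float))
    return slope, offset

if __name__ == "__main__":
    # Synthetic test: two registered exposures of the same scene (hypothetical constants).
    rng = np.random.default_rng(1)
    alpha, beta, gamma, k = 20.0, 1.5, 0.6, 2.0
    q = rng.uniform(1, 2000, (120, 160))              # photoquantity
    f1 = alpha + beta * q**gamma
    f2 = alpha + beta * (k * q)**gamma
    slope, offset = fit_affine_comparametric(comparagram(f1.astype(int), f2.astype(int)))
    print("slope  ~ k^gamma          :", slope, " expected", k**gamma)
    print("offset ~ alpha(1-k^gamma) :", offset, " expected", alpha * (1 - k**gamma))
```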
From (6.36) we have the (generalized) brightness change constraint equation,
where F(x, t) = f(q(x, t)). Combining this equation with its Taylor series
representation and minimizing ε2 yields a linear solution in the parameters of
the approximate model:
$$
\left(\sum
\begin{bmatrix}
x^4 F_x^2 & x^3 F_x^2 & x^2 F_x^2 & x^2 F F_x & x^2 F_x \\
x^3 F_x^2 & x^2 F_x^2 & x F_x^2 & x F F_x & x F_x \\
x^2 F_x^2 & x F_x^2 & F_x^2 & F F_x & F_x \\
x^2 F F_x & x F F_x & F F_x & F^2 & F \\
x^2 F_x & x F_x & F_x & F & 1
\end{bmatrix}\right)
\begin{bmatrix} q_2 \\ q_1 \\ q_0 \\ 1 - k^\gamma \\ \alpha k^\gamma - \alpha \end{bmatrix}
= -\sum
\begin{bmatrix} x^2 F_x F_t \\ x F_x F_t \\ F_x F_t \\ F F_t \\ F_t \end{bmatrix},
$$
where the sums are taken over the entire image, $F_x = \partial F/\partial x$, and $F_t = \partial F/\partial t$.
The parameters of the approximate model are related to those of the exact model,
as was illustrated earlier in this chapter (using the feedback process of Fig. 6.10).
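A minimal numerical sketch of assembling and solving these normal equations for a single scan line follows. It assumes the 1-D quadratic (chirp) approximation to the projective flow and uses simple finite differences for Fx and Ft; the function and variable names, and the synthetic test, are illustrative rather than the author's implementation.

```python
import numpy as np

def estimate_spatiotonal_1d(F1, F2):
    """Solve the 5x5 normal equations above for the approximate 1-D spatiotonal model.

    Unknowns (in the order used above):
        p = [q2, q1, q0, 1 - k**gamma, alpha*k**gamma - alpha]
    where q2*x**2 + q1*x + q0 approximates the projective (chirp) flow and the
    last two entries describe the tone-scale (gain) change between the frames.
    """
    x = np.arange(len(F1), dtype=float)
    F = 0.5 * (F1 + F2)                     # brightness
    Fx = np.gradient(F)                     # spatial derivative
    Ft = F2 - F1                            # temporal derivative
    phi = np.stack([x**2 * Fx, x * Fx, Fx, F, np.ones_like(F)])   # 5 x N regressors
    A = phi @ phi.T                         # sum over pixels of outer products
    b = -(phi @ Ft)
    return np.linalg.solve(A, b)

if __name__ == "__main__":
    # Synthetic row: frame 2 is frame 1 shifted by a small flow and tonally rescaled.
    x = np.arange(200, dtype=float)
    scene = 100 + 60 * np.sin(2 * np.pi * x / 40.0) + 0.2 * x
    flow = 0.00002 * x**2 + 0.01 * x + 0.5      # hypothetical chirp-like flow
    gain, offset = 1.05, -3.0                   # hypothetical tone-scale change
    F1 = scene
    F2 = gain * np.interp(x + flow, x, scene) + offset
    print(estimate_spatiotonal_1d(F1, F2))      # estimated parameter vector p
```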
This new mathematical result enables images to be brought not just into
register in the traditional domain motion sense, but also into the same tonal
scale through antihomomorphic gain adjustment. The combination of a spatial
coordinate transformation with a tone-scale adjustment is referred to
as a “spatiotonal transformation.” In particular, it is the spatiotonal transformation
of antihomomorphic homography that is of interest (i.e., homographic coordinate
transformation combined with antihomomorphic gain adjustment).
Figure 6.23 Floating-point photoquantigraphic image constructed from the fire-exit sequence.
The dynamic range of the image is far greater than that of a computer screen or printed page.
The photoquantigraphic information may be viewed interactively on the computer screen,
not only as an environment map (with pan, tilt, and zoom) but also with control of exposure and
contrast. With a virtual camera we can move around in the photoquantigraph, both spatially
and tonally.
22 The dynamic range of some papers is around 100 : 1, while that of many films is around 500 : 1.
Figure 6.24 Fixed-point image made by tone-scale adjustments that are only locally
monotonic, followed by quantization to 256 graylevels. We can see clearly both the small
piece of white paper on the door (and can even read what it says — ‘‘COFFEE HOUSE CLOSED’’), as
well as the details of the dark interior. We could not have captured such a nicely exposed image
using an on-camera fill-flash to reduce scene contrast, because the fill-flash would mostly
light up the areas near the camera (which happen to be the areas that are already too bright),
while hardly affecting objects at the end of the dark corridor, which are already too dark. One
would then need to set up additional photographic lighting equipment to obtain a picture of this
quality. This image demonstrates the advantage of a small lightweight personal imaging system
built unobtrusively into a pair of eyeglasses. In this setup an image of very high quality was
captured by simply looking around, without entering the corridor. This system is particularly
useful when trying to report a violation of fire-safety laws, while at the same time not appearing
to be trying to capture an image. The present image was shot some distance away from the
premises (using a miniaturized tele lens that the author built into his eyeglass-based system).
The effects of perspective, though present, are not as immediately obvious as in some of the
other extreme wide-angle image composites presented earlier in this chapter.
this chapter. As with the photographic lateral inhibition, these methods also relax
the monotonicity constraint.
Thus, in order to print a photoquantigraph, it may be preferable to relax
the monotonicity constraint, and perform some local tone-scale adjustments
(Fig. 6.24).
between the two exposures. Thus the photoquantigraphic estimate q̂ has far
greater dynamic range than can be directly viewed on a television or on the
printed page. Display of f̂(k̂₁q̂) would fail to show the shadow details, while
display of f̂(k̂₂q̂) would fail to show the highlight details.
In this case, even if we had used the virtual camera architecture depicted in
Figure 4.5, there would be no single value of display exposure kd for which
a display image fd = fˆ(kd q̂) would capture both the inside of the abandoned
fortress and the details looking outside through the open doorway.
Therefore a strong highpass (sharpening) filter, S, was applied to q̂ to sharpen
the photoquantity q̂, as well as to provide lateral inhibition similar to the way in
which the human eye functions. Then the filtered result,
$$
\hat f\!\left(k_d\, S\, \hat q\!\left(\frac{\hat A_2\, x + \hat b_2}{\hat c_2^{\,T} x + \hat d_2}\right)\right),
$$
was displayed on the printed page (Fig. 6.25c), in the projective coordinates of
the second (rightmost) image, i = 2. Note the introduction of spatial coordinates
A, b, c, and d. These compensate for projection (which occurs if the camera
moves slightly between pictures), as described in [2,77]. In particular, the
parameters of a projective coordinate transformation are typically estimated
together with the nonlinear camera response function and the exposure ratio
between pictures [2,77].
As a result of the filtering operation, notice that there is no longer a monotonic
relationship between input photoquantity q and output level on the printed page.
For example, the sail is as dark as some shadow areas inside the fortress. Because
of this filtering the dynamic range of the image can be reduced to that of
printed media, while still revealing details of the scene. This example answers
the question, Why capture more dynamic range than you can display?
Even if the objective is a picture of limited dynamic range, perhaps where
the artist wishes to deliberately wash out highlights and mute down shadows for
expressive purposes, the proposed philosophy is still quite valid. The procedure
captures a measurement space, recording the quantity of light arriving at each
angle in space, and then from that measurement space, synthesizes the tonally
degraded image. This way as much information about the scene as possible
becomes embedded in a photoquantigraphic estimate, and the final picture is then
“expressed” from that estimate (by throwing away information in a controlled
fashion).
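The sketch below illustrates the kind of processing described here: sharpen a floating-point photoquantity with a highpass (unsharp-mask) filter as a crude stand-in for lateral inhibition, then pass it through an assumed display response and quantize for the printed page. The filter parameters, the 0.45 display exponent, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def spatiotonal_print(q_hat, kd=1.0, strength=0.8, sigma=8):
    """Sharpen the photoquantity q_hat (unsharp masking as a crude lateral
    inhibition), apply an assumed display response, and quantize to 8 bits."""
    # Separable Gaussian blur built from a 1-D kernel (keeps the sketch dependency-free).
    radius = 3 * sigma
    t = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (t / sigma) ** 2)
    g /= g.sum()
    blur = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, q_hat)
    blur = np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, blur)
    sharpened = np.clip(q_hat + strength * (q_hat - blur), 1e-6, None)  # highpass boost
    display = (kd * sharpened) ** 0.45        # assumed display response
    display = display / display.max()
    return np.round(255 * display).astype(np.uint8)

if __name__ == "__main__":
    # Synthetic high-dynamic-range "photoquantity": bright exterior, faint interior detail.
    y, x = np.mgrid[0:120, 0:160]
    q_hat = 1.0 + 5000.0 * (x > 80) + 3.0 * np.sin(x / 3.0) * np.sin(y / 4.0) + 3.5
    out = spatiotonal_print(q_hat)
    print(out.shape, out.dtype, int(out.min()), int(out.max()))
```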
Figure 6.26 Reality window manager (RWM). The viewport is defined by an EyeTap or laser
EyeTap device and serves as a view into the real world. This viewport is denoted by a reticle
and graticule with crosshairs. The mediation zone over the visual perception of reality can be
altered. There is a virtual screen. Portions of windows inside the viewport are denoted by solid
lines, and portions of windows outside the viewport are denoted by dotted lines. Actual objects
in the scene are drawn as solid lines, and are of course always visible. (a) Initially the mediation
zone is in the center of the virtual screen. (b) When it is desired to select a window, one looks
over to the desired window. This selects the window. The user then presses ‘‘d’’ to select
‘‘recorD’’ (r selects rewind, etc.).
Figure 6.28 Virtual message left on the wall under the entrance to the grocery store. When
the recipient of the message approaches the store wearing a reality mediator, the message
left there will suddenly appear. As the wearer moves his/her head, the message will appear
to be attached to the wall. This is because the homography of the plane is tracked, and a
projective coordinate transformation is performed on the message before it is inserted into the
wearer’s reality stream. (Top) Lens distortion in WearCam results in poor registration. (Bottom)
After correcting for lens distortion using the Campbell method [106], sub-pixel registration
is possible. (Data captured by author and processed later on base station; thanks to J. Levine
for assisting author in porting VideoOrbits to SGI architecture used at base station.)
Imagine an unauthorized user able to remotely log into your computing facility,
and then running a large computer program on your computer without your
knowledge or consent. Many people might refer to this as “theft” of CPU cycles
(i.e., obtaining computing services without payment).
In many ways advertising does the same thing to our brains. The idea behind
advertisements is to break into consumer brains by getting past their mental filters
that have evolved to block out extraneous information. Like the banner ads on
the WWW that try to simulate a second cursor, real-world ads try to trick the
brain into paying attention. This sort of “in-band” signaling continually evolves
in an attempt to try to bypass the mental filters we have developed to filter it out.
Like an intelligence arms race, both our coping mechanisms and the ads
themselves escalate the amount of mental processing required to either filter them
out or process them. In this way thefts of the brain’s CPU cycles are increasing
rapidly.
Ads are also being designed to be responsive. Consider the talking ads above
urinals that trigger on approach; it is as if we have a multimedia spectacle
that watches and responds to our movements. This is a closed-loop process
involving the theft of personal information (called humanistic property, analogous
to intellectual property) followed by spam that is triggered by this stolen
information.
To close the control system loop around a victim (e.g., to make the human
victim both controllable and observable), perpetrators of humanistic
property violations will typically commit the following violations:
• Steal personal information from the victim. This theft involves the violation
of acquisitional privacy.
• Traffick in this stolen information. This trafficking involves the violation of
disseminational privacy.
• Spam the victim. This spamming involves the violation of the victim’s
solitude.
It has been argued that intellectual property is already excessively protected [135].
Proponents of intellectual property have used the word “piracy” to describe
making “unauthorized” copies of informational “wares.” Of course, this immedi-
ately evokes villains who attack ocean-going vessels and kill all on board, tying
the bodies to life rafts and setting them adrift to signal to rival pirates not to
tread on their turf. Thus some consider the making of “unauthorized” copies
of information wares equal in culpability to piracy. Mitch Kapor, cofounder of
the Electronic Frontier Foundation (EFF), has criticized this equivalence between
copying floppy disks and “software piracy” as the software industry’s propaganda
(see also http://wearcam.org/copyfire.html).
Thus the use of terms “steal” and “traffick” is no more extreme when applied
to humanistic property than to intellectual property.
[Figure 6.29 block diagram: a controller C and the person being controlled, P, joined in a feedback loop. The forward path from C to P is labeled spam; the feedback path from P to C comprises theft of humanistic property, trafficking in stolen humanistic property, and dissemination (violation of disseminational privacy). Desired behavior is subtracted from actual behavior at a summing junction to form the behavioral “error” fed to C.]
Figure 6.29 Block diagram depicting person, P, controlled by feedback loop comprising the
three steps of external controllability: theft of humanistic property, followed by trafficking in said
stolen property, followed by spamming. P is said to be observable when such theft is possible,
and controllable when such spam is possible. Note also that desired behavior is subtracted
from the actual behavior to obtain a behavioral ‘‘error’’ signal that is used by the controlling
entity C to generate customized spam targeted at behavioral modification.
The problem summarized in Figure 6.29 provides us with the seeds of its own
solution. Notably, if we can find a way to break open this feedback loop, we will
find a way to subvert the hegemony of the controller.
This section deals primarily with solitude, defined as the freedom from
violation by an inbound channel controlled by remote entities. Solitude, in this
context, is distinct from privacy. For the purposes of this discussion, privacy
is defined as the freedom from violation by an outbound channel controlled by
remote entities.
While much has been written and proposed in the way of legislation and
other societal efforts at protecting privacy and solitude, we concentrate here on
a personal approach at the level of the point of contact between individuals and
their environment. A similar personal approach to privacy issues has already
appeared in the literature [138]. The main argument in [138] and [2] is that
personal empowerment is possible through wearable cybernetics and humanistic
intelligence.
This section applies a similar philosophical framework to the issue of
solitude protection. In particular, the use of mediated reality, together with the
VideoOrbits-based RWM, is suggested for the protection of personal solitude.
The owner of a building or other real estate can benefit directly from erecting
distracting and unpleasant (at least unpleasant to some people) advertising signs
into the line of sight of all who pass through the space in and around his or her
property. Such theft of solitude allows an individual to benefit at the expense of
others (i.e., at the expense of the commons).
Legislation is one solution to this problem. However, here, a diffusionist [138]
approach is proposed in the form of a simple engineering solution in which
the individual can filter out unwanted real-world spam. Since WearComp, when
functioning as a reality mediator, has the potential to create a modified perception
of visual reality, it can function as a visual filter.
WearComp, functioning as a reality mediator can, in addition to augmenting
reality, also diminish or otherwise alter the visual perception of reality. Why
would one want a diminished perception of reality? Why would anyone buy a
pair of sunglasses that made one see worse?
An example of why we might want to experience a diminished reality is when
we drive and must concentrate on the road. Sunglasses that not only diminish
the glare of the sun’s rays but also filter out distracting billboards could help us
see the road better, and therefore drive more safely.
An example of the visual filter (operating on the author’s view in Times
Square) is shown in Figure 6.30. Thanks to the visual filter, the spam (unwanted
advertising material) gets filtered out of the wearer’s view. The advertisements,
signs, or billboards are still visible, but they appear as windows, containing
alternate material, such as email, or further messages from friends and relatives.
This personalized world is a world of the wearer’s own making, not the world
that is thrust down our throats by advertisers.
Let us see how MR can prevent theft of visual attention and mental
processing resources for the spam shown in Figure 6.30. In the figure light
enters the front of the apparatus as is depicted in the image sequence, which
is used to form a photoquantigraphic image composite. This photoquanti-
graphic representation follows the gaze of the wearer of the apparatus, and
thus traces what the wearer would normally see if it were not for the appa-
ratus.
Using a mathematical representation, the WearComp apparatus deletes or
replaces the unwanted material with more acceptable (nonspam) material
(Fig. 6.31). The nonspam substitute comprises various windows from the
wearer’s information space. As the mathematical representation is revised by
the wearer’s gaze pattern, a new sequence of frames for the spam-free image
sequence is rendered, as shown in Figure 6.31. This is what the wearer sees.
Note that only the frames of the sequence containing spam are modified. The
visual filter in the WearComp apparatus makes it possible to filter out offensive
advertising and turn billboards into useful cyberspace.
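A minimal sketch of the replacement step is given below: given a homography (as would be tracked by VideoOrbits) mapping the replacement “window” content onto the billboard plane, every frame pixel whose inverse mapping lands inside the content is overwritten. The homography values, image sizes, and nearest-neighbour sampling are illustrative assumptions, not the EyeTap implementation.

```python
import numpy as np

def insert_window(frame, content, H):
    """Replace the planar (billboard) region of `frame` with `content`.

    H is the 3x3 homography mapping content coordinates (x, y, 1) to frame
    coordinates; here it is simply given. Frame pixels whose inverse mapping
    falls inside `content` are replaced (nearest-neighbour sampling)."""
    Hinv = np.linalg.inv(H)
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = Hinv @ pts
    sx, sy = src[0] / src[2], src[1] / src[2]
    ch, cw = content.shape[:2]
    inside = (sx >= 0) & (sx <= cw - 1) & (sy >= 0) & (sy <= ch - 1)
    out = frame.copy()
    out[ys.ravel()[inside], xs.ravel()[inside]] = content[
        np.round(sy[inside]).astype(int), np.round(sx[inside]).astype(int)]
    return out

if __name__ == "__main__":
    frame = np.full((240, 320), 128, dtype=np.uint8)                      # stand-in video frame
    content = (np.indices((60, 100)).sum(0) % 2 * 255).astype(np.uint8)   # checkerboard "window"
    # Hypothetical homography: scale, shear, and translate the content onto the billboard.
    H = np.array([[1.2, 0.1, 150.0],
                  [0.05, 1.1, 40.0],
                  [1e-4, 5e-5, 1.0]])
    print(insert_window(frame, content, H).shape)
```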
Figure 6.30 Successive video frames of Times Square view (frames 142–149).
Figure 6.31 Filtered video frames of Times Square view (frames 142–149 filtered). Note
the absence of spam within regions of images where spam was originally present. To
prevent the theft of solitude, spam may be deleted or replaced with useful material such
as personal email messages, the text of a favorite novel, or a favorite quote. Here the spam
was replaced with window diagrams and a table of equations from the wearer’s WWW site
(http://wearcam.org/orbits).
from off axis. We are used to seeing, for example, a flat movie screen from one
side of a theater and perceiving it as correct. Therefore we can quite comfortably
use the VideoOrbits head-tracker to experience the windowing system that we
call a reality window manager (RWM).
In some situations a single window in the RWM environment may be larger
than the field of view of the apparatus. When this occurs we simply look into
the space and find a larger window (see Figs. 6.32 and 6.33).
Figure 6.32 A large billboard subtends a greater field of view than can be seen by the author,
who is too close to the billboard to see its entire surface at once.
Figure 6.33 A window can replace the billboard even though it is larger than what can be
seen by a person who is too close to a large billboard. The wearer of the apparatus simply
looks around to see the entire screen. The experience is like being very close to a very
high-definition video display screen.
[Figure 6.34 diagram: a scene point viewed from a single center of projection COP; the two optical axes with distances Z1 and Z2 to the point, image planes X1 and X2 with image coordinates x1 and x2, zoom (principal distance) settings z1 and z2, and the angles α and θ.]
Figure 6.34 Projective coordinate transformation for two pictures of the same scene or
objects.
A particular point has been selected in the scene. Measured along the optical
axis of the first camera, the distance to the point in question is Z1; it is measured
as Z2 along the optical axis of the second camera. These are the same camera
in two different positions, but they are drawn as two optical axes in order to
simplify matters. Camera 1 (the first picture) has zoom setting z1 , while camera 2
(the second picture) has zoom setting z2 . (By “zoom setting,” what is meant is
the principal distance.) Derive a mathematical expression for the relationship
between coordinates x1 and x2 as a function of the angle between the cameras, θ ,
and the zoom settings, z1 and z2 . [Hint: The answer should not contain α, since
that is just the angle between the first optical axis and an arbitrary point in the
scene.]
If your answer contains trigonometric functions that depend on x1 or x2 ,
simplify your expression to remove any trigonometric functions that depend on
x1 or x2 . (Your head-tracker will later need to run on a small battery-powered
computer system, so computational efficiency will be important.)
Sketch a rough graph of x2 as a function of x1 to show its general shape.
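One way to sanity-check a candidate answer numerically is sketched below. It assumes the usual pinhole relation x = z tan(angle from the optical axis), which is not stated in this exercise and should be verified against your own derivation; the sketch evaluates both the trigonometric form and an equivalent rational (projective) form that contains no trigonometric functions of x1.

```python
import numpy as np

z1, z2 = 1.0, 2.0                    # zoom (principal distance) settings, as in Figure 6.35
theta = np.deg2rad(30.0)             # angle between the two camera orientations

def x2_trig(x1):
    # Assumed pinhole relation: x = z * tan(angle from the optical axis).
    phi = np.arctan2(x1, z1)
    return z2 * np.tan(phi - theta)

def x2_rational(x1):
    # Same relation after applying the tangent-difference identity, so no
    # trig functions of x1 remain (a linear-fractional, i.e., projective, form).
    t = np.tan(theta)
    return (z2 * x1 - z1 * z2 * t) / (z1 + x1 * t)

if __name__ == "__main__":
    x1 = np.linspace(-1.0, 3.0, 9)
    print(np.allclose(x2_trig(x1), x2_rational(x1)))   # True: the two forms agree
    for a, b in zip(x1, x2_rational(x1)):
        print(f"x1 = {a:+.2f}  ->  x2 = {b:+.3f}")
```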
[Figure 6.35 plot: axes X1 and X2 (ticks from −1.0 to 3.0), optical axes O1 and O2 meeting at the center of projection COP.]
Figure 6.35 Graphical depiction of a situation in which two pictures are related by a zoom
from 1 to 2, and a 30 degree angle between the two camera positions.
APPENDIX A
SAFETY FIRST!
APPENDIX B
MULTIAMBIC KEYER FOR
USE WHILE ENGAGED
IN OTHER ACTIVITIES
B.1 INTRODUCTION
Figure B.1 Some of the author’s early keyer inventions from the 1970s (keyers and pointing
devices) for WearComp. (a) Input device comprising pushbutton switches mounted to a
wooden lamp pushbroom handgrip; (b) input device comprising five microswitches mounted
to the handle of an electronic flash. A joystick (controlling two potentiometers), designed as a
pointing device for use in conjunction with the WearComp project, is also present.
An important aspect of the WearComp has been the details of the keyer, which
serves to enter commands into the apparatus. The keyer has traditionally been
attached to some other apparatus that needs to be held, such as a flashlamp and
a lightcomb, so that the effect is a hands-free data entry device (hands free in
the sense that one only needs to hold onto the flashlamp, or light source, and no
additional hand is needed to hold the keyer).
A distinction will be made among similar devices called keyboards, keypads,
and keyers. Often all three are called “keyboards,” conflating very different
devices and concepts. Therefore these devices, and the underlying concepts and
uses, will be distinguished in this appendix.
Another difference is that keyers require multiple keypresses. This means that
one either presses the same key more than once, or presses more than one key
at the same time, in order to generate a single letter or symbol.
Like keyboards and keypads, keyers are devices for communicating in symbols
such as letters of the alphabet, numbers, or selections. The simplest keyer is a
pushbutton switch or telegraph key pressed a short time (for a “dot”), or a long
time (for a “dash”), to key letters using a code, such as Morse code.
The next most complicated keyer is the so-called iambic keyer having two
keys, one to generate a “dot” and the other to generate a “dash.” The term
“iambic” derives from the Latin iambus and the Greek iambos, and refers to a
metrical foot, as in iambic pentameter, consisting of one short syllable followed
by one long syllable (or one unstressed syllable followed by one stressed
syllable).
Iambic keyers allow a person to key faster, since a computer can do the timing
to time the length of a dot or a dash. Most iambic keyers contain some kind of
microprocessor, although similar early units were based on a mechanical resonant
mass on a stiff spring steel wire.
In this appendix the term “bi-ambic” will be used to denote the iambic keyer,
so that it can be generalized as follows:
• uni-ambic: one pushbutton switch or key, could also be taken to mean “un-
iambic” (as in not iambic)
• bi-ambic: two pushbutton switches or keys
• tri-ambic: three pushbutton switches or keys
• ...
• pentambic: five pushbutton switches or keys
• ...
• septambic: seven pushbutton switches or keys
• ...
• multiambic: many pushbutton switches or keys
Figure B.2 Preparation for making a keyer. (a) Pouring water from the kettle. (b) Placing
low-temperature plastic in the hot water; (c) wooden chopsticks are used to handle the hot
plastic; (d) the soft plastic is tested for consistency; (e) the hand is cooled in icewater; (f –k) the
soft plastic is lifted out of the hot water in one quick movement while the hand is taken out of
the ice bath; (l) finishing up with soft plastic on the hand. If the pain threshold is reached, the
process can be repeated with more cooling of the hand.
and applied to the hand to conform to the shape of the hand. These pictures
were captured by the author wearing a modern embodiment of the wearable
photographic apparatus (WearComp). This figure also shows the usefulness of
the wearable photographic apparatus in documenting scientific experiments from
the perspective of the individual scientist (i.e., first-person perspective). Thus the
apparatus itself can serve a useful function in the context of automated collection
of experimental data and their documentation.
Next the soft-heated plastic is shaped to perfectly fit the hand as shown in
Figure B.3. Excess material is trimmed away as shown in Figure B.4.
Figure B.3 Shaping the soft-heated plastic to perfectly fit the hand. (a) Here a left-handed
keyer is in the making; (b–c) soft-heated plastic is pressed down to form nicely between the
fingers so that the keyer will conform perfectly to the hand’s shape; (d–h) when the right hand is
taken away, the material should feel comfortable and naturally stay in position on the left hand.
Figure B.4 Trimming the keyer support structure. When the plastic cools and hardens, it
should fit perfectly to the hand in its interior region. The edges, however, will be rough
and will need to be trimmed. (a–h) The initial rough trimming is done while the support structure
is worn, so that it can be trimmed to the hand’s shape; (i–l) then the plastic is removed and the
rough edges are easily rounded off.
Early keyers designed and built by the author in the 1970s were used in
conjunction with wearable photographic apparatus.
Figure B.5 Finishing off a keyer. (a) Selecting appropriate pushbutton switches; (b) testing for
positioning of the switches; (c) getting the thumb switches in position is most important; (d)
material being cut away and switches glued into place all while wearing the keyer; (e) keyer
being interfaced to WearComp by way of a Programmable Interface Controller, using a circuit
board designed by Wozniak (assembled by author) and a PIC program designed by Wozniak,
Mann, Moncrieff, and Fung; (f) circuit board (in author’s right hand) that interfaces key switches
to standard PS/2 connector (in author’s left hand); (g) the RJ45 connector used with eight wires,
one for each of the seven switches and the eighth for common; (h) the keyer being plugged
in by way of the RJ45 connector. The author has standardized the connections so that any
member of the community having difficulty with their keyer can temporarily swap with someone
else to determine whether the problem is in the keyer or in the conversion circuit.
For example, the original pentambic keyer had five keys, one for each of
the four fingers, and a fifth one for the thumb, so that characters were formed
by pressing the keys in different combinations. The computer could read when
each key was pressed, and when each key was released, as well as how fast the
key was depressed. Since the switches were double throw, the velocity sensing
capability arose from using both the normally closed (NC) and normally open
(NO) contacts, and measuring the time between when the common contact (C)
leaves the NC contact and meets the NO contact.
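A small sketch of this velocity-sensing idea follows: the press velocity is inferred from the time the common contact spends traveling between the NC and NO contacts. The class, event names, and timing source are hypothetical; real switch events would come from the keyer hardware.

```python
import time

class DoubleThrowKey:
    """Velocity-sensing keypress using both contacts of a double-throw switch.

    The common contact leaves the normally closed (NC) contact first and then
    reaches the normally open (NO) contact; the elapsed time between those two
    events gives the keypress velocity (shorter travel time = faster press)."""

    def __init__(self):
        self.t_left_nc = None

    def on_nc_open(self, t=None):
        # Called when the common contact leaves NC.
        self.t_left_nc = time.monotonic() if t is None else t

    def on_no_close(self, t=None):
        # Called when the common contact reaches NO; returns travel time in seconds.
        t = time.monotonic() if t is None else t
        if self.t_left_nc is None:
            return None
        travel = t - self.t_left_nc
        self.t_left_nc = None
        return travel

if __name__ == "__main__":
    key = DoubleThrowKey()
    key.on_nc_open(t=0.000)
    print("travel time:", key.on_no_close(t=0.004), "s")   # a brisk 4 ms press
```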
There are seven stages associated with pressing a combination of keys on a
multiambic keyer (see Fig. B.6). The Close–Sustain–Release progression exists
only in the intentionality of the user, so any knowledge of the progression from
within these three stages must be inferred, for example, by the time delays.
Arbitrary time constants could be used to make the keyer very expressive. For
example, characters could be formed by pressing keys for different lengths of time.
Indeed, a uniambic keyer, such as one used to tap out Morse code, relies heavily on
time constants. Two keys gave the iambic paddle effect, similar to that described in
a January 12, 1972 publication by William F. Brown (U.S. Pat. 3,757,045), which
was further developed in U.S. Pat. 5,773,769. Thus there was no need for a heavy
base (it could thus be further adapted to be used while worn).
[Figure B.6 timing diagram: the seven stages A, D, C, S, R, Y, O of a keypress; the D (Delay) interval is bracketed as the ARPA (arpeggio), and the Y (Yaled) interval as the APRA (oiggepra).]
Figure B.6 The seven stages of the keypress. (A) Attack is the exact instant when the first
switch is measurably pressed (when its common moves away from its first throw if it is a double
throw switch, or when the first switch is closed if it is a single throw switch). (D) Delay is the time
between Attack and when the last switch of a given chord has finished being pressed. Thus
Delay corresponds to an arpeggiation interval (ARPA, from Old High German harpha, meaning
harp, upon which strings were plucked in sequence but continued to sound together). This
Delay may be deliberate and expressive, or accidental. (C) Close is the exact instant at which
the last key of a desired chord is fully pressed. This Closure of the chord exists only in the mind
(in the first brain) of the user, because the second brain (i.e., the computational apparatus,
worn by, attached to, or implanted in the user) has no way of knowing whether there is a plan
to, or plans to, continue the chord with more switch closures, unless all of the switches have
been pressed. (S) Sustain is the continued holding of the chord. Much as a piano has a sustain
pedal, a chord on the keyer can be sustained. (Y) Yaled is the opposite of delay (yaled is delay
spelled backward). Yaled is the time over which the user releases the components of a chord.
Just as a piano is responsive to when keys are released, as well as when they are pressed, the
keyer can also be so responsive. The Yaled process is referred to as an APRA (OIGGEPRA),
namely ARPA (or arpeggio) spelled backward. (O) Open is the time at which the last key is
fully (measurably) released. At this point the chord is completely open, and no switches are
measurably pressed.
Figure B.7 Cybernetic keyer timing. Two keys would be pressed or released at exactly the
same time only on the set denoted by the line t0 = t1, which has measure zero in the (t0, t1)
plane, where t0 is the time of pressing or releasing of SWITCH 0, and t1 is the time of pressing
or releasing of SWITCH 1. To overcome this uncertainty, the particular meaning of the chord is
assigned ordinally (by the order of presses and releases), rather than by using a timing threshold. Here, for example, SWITCH 0
is pressed first and released after pressing SWITCH 1 but before releasing SWITCH 1. This situation
is for one of the possible symbols that can be produced from this combination of two switches.
This particular symbol will be numbered (4) and will be assigned the meaning of REW (rewind).
Thus two switches give rise to six unique symbols, excluding the Open chord (nothing pressed), as in the diagram
of Figure B.8.
The operation of the cybernetic keyer is better understood by way of a simple
example, illustrated in Figure B.9. The time space graph of Figure B.7 is really
just a four-dimensional time space collapsed onto two dimensions of the page.
Accordingly, we can view any combination of key presses that involves pressing
both switches within a finite time as a pair of ordered points on the graph. There
are six possibilities. Examples of each are depicted in Figure B.10.
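The ordinal decoding of a two-switch gesture can be sketched as follows, using the FLFL/LFLF/FLLF/LFFL naming of Figure B.8. The function signature, the use of floating-point timestamps, and the handling of non-overlapping presses are illustrative assumptions.

```python
def classify_chord(p0, r0, p1, r1):
    """Classify a two-switch gesture by ordinal timing, in the spirit of Figure B.8.

    p0, r0 : press/release times of SWITCH 0 (None if SWITCH 0 was not pressed)
    p1, r1 : press/release times of SWITCH 1 (None if SWITCH 1 was not pressed)
    Returns a list of symbol numbers: 1 (only SW0), 2 (only SW1), 3-6 for the
    four overlapping chords; non-overlapping presses give two separate symbols.
    """
    if p1 is None:
        return [1]
    if p0 is None:
        return [2]
    if r0 <= p1 or r1 <= p0:                     # no overlap: two separate symbols
        return [1, 2] if p0 < p1 else [2, 1]
    sw1_pressed_first = p1 < p0
    sw1_released_first = r1 < r0
    if sw1_pressed_first and sw1_released_first:
        return [3]                               # FLFL (FF)
    if not sw1_pressed_first and not sw1_released_first:
        return [4]                               # LFLF (REW)
    if sw1_pressed_first:
        return [5]                               # FLLF: SW1 held while SW0 pressed/released (REC)
    return [6]                                   # LFFL: SW0 held while SW1 pressed/released (PAUSE)

if __name__ == "__main__":
    print(classify_chord(0.00, 0.10, 0.20, 0.30))   # no overlap -> [1, 2] (PLAY then STOP)
    print(classify_chord(0.05, 0.20, 0.00, 0.25))   # SW1 held around SW0 -> [5]
```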
With three switches instead of two, there are many more combinations possible.
Even if the three switches are not velocity sensing (i.e., if they are only single
throw switches), there are still 51 combinations. They can be enumerated as
follows:
[Figure B.8 table (codes listed as SW1 SW0, with SW1 the top trace and SW0 the bottom trace): symbol 0 = 00 (Open); 1 = 01 (PLAY); 2 = 10 (STOP); the four overlapping 11 chords are 3 = FLFL (FF), 4 = LFLF (REW), 5 = FLLF (REC), and 6 = LFFL (PAUSE).]
Figure B.8 The cybernetic keyer. Timing information is depicted as dual traces: SWITCH 0 is
depicted by the bottom trace and SWITCH 1 by the top trace. The zeroth symbol 00 depicts
the open chord (no switches pressed). The first symbol 01 depicts the situation where only
SWITCH 0 is pressed. The second symbol 10 depicts the situation where only SWITCH 1 is
pressed. The third through sixth symbols 11 arise from situations where both switches are
pressed and then released, with overlap. The third symbol FLFL depicts the situation where
SWITCH 1 is pressed First, switch 0 is pressed Last, SWITCH 1 is released First, and switch 0 is
released Last. Similarly LFLF denotes Last First Last First (fourth symbol). FLLF denotes the
situation where SWITCH 1 is held down while SWITCH 0 is pressed and released (fifth symbol).
LFFL denotes the situation in which SWITCH 0 is held down while SWITCH 1 is pressed and
released (sixth symbol). The zeroth through sixth symbols are denoted by reference numerals
0 through 6, respectively. Each of the active ones (other than the Open chord, 0) are given
a meaning in operating a recording machine, with the functions PLAY, STOP, FastForward (FF),
REWind, RECord, and PAUSE.
Figure B.9 Cybernetic keyer timing example. In this example, the top trace denotes SWITCH 1,
and the bottom trace SWITCH 0. Initially SWITCH 0 is pressed and then SWITCH 1 is pressed.
However, because there is no overlap between these switch pressings, they are interpreted as
separate symbols (e.g. this is not a chord). The separate symbols are 1 (PLAY) and 2 (STOP). This
results in the playing of a short segment of video, which is then stopped. A little while later,
SWITCH 0 is pressed and then SWITCH 1 is pressed. However, because there is now overlap,
this action is considered to be a chord. Specifically it is an LFLF (Last First Last First) chord,
which is interpreted as symbol number 4 (REWIND). A REWind operation on a stopped system
is interpreted as high-speed rewind. A short time later, SWITCH 0 is held down while SWITCH 1
is pressed briefly. This action is interpreted as symbol number 6 (PAUSE). Since PAUSE would
normally be used only during PLAY or RECORD, the meaning during REWIND is overloaded with
a new meaning, namely slow down from high-speed rewind to normal speed rewind. Thus we
have full control of a recording system with only two switches, and without using any time
constants as might arise from other interfaces such as the iambic Morse code keyers used by
ham radio operators.
Figure B.10 Cybernetic keyer timings. The symbol ‘‘X’’ denotes pressing of the two keys,
and exists in the first pair of time dimensions, t0 and t1 . The releasing of the two keys exists
in the second pair of dimensions, which, for simplicity (since it is difficult to draw the
four-dimensional space on the printed page), are also denoted t0 and t1 , but with the symbol
‘‘O’’ for Open. Examples of symbols 3 through 6 are realized. Two other examples, for when
the switch closures do not overlap, are also depicted. These are depicted as 1,2 (symbol 1
followed by symbol 2) and 2,1 (symbol 2 followed by symbol 1).
• Using all three switches, at the ARPA (arpeggio, Fig. B.6) stage:
1. There are three choices for First switch.
2. Once the first switch is chosen, there remains the question as to which
of the remaining two will be pressed Next.
3. Then there is only one switch left, to press Last.
Thus at the ARPA stage, there are 3 ∗ 2 ∗ 1 = 6 different ways of pressing
all three switches. At the APRA (oiggepra, Fig. B.6) stage, there are an
equal number of ways of releasing these three switches that have all been
pressed. Thus there are six ways of pressing, and six ways of releasing,
which gives 6 ∗ 6 = 36 symbols that involve all three switches.
The total number of symbols on the three switch keyer is 3 + 12 + 36 = 51. That
is a sufficient number to generate the 26 letters of the alphabet, the numbers 0
through 9, the space character, and four additional symbols.
Uppercase and control characters are generated by using the four additional
symbols for SHIFT and CONTROL, for example, of the letter or symbol that follows.
Thus the multiplication sign is SHIFT followed by the number 8, and the @ sign
is SHIFT followed by the number 2, and so on.
It is best to have all the characters be single chords so that the user gets one
character for every chord. Having a separate SHIFT chord would require the user
to keep track of (i.e., remember) whether the SHIFT key was active, and that would also
slow down data entry.
Accordingly, if a fourth switch is added, we obtain a larger number of possible
combinations:
$$
\frac{4!}{1!(4-1)!}(1!)^2 + \frac{4!}{2!(4-2)!}(2!)^2 + \frac{4!}{3!(4-3)!}(3!)^2 + \frac{4!}{4!(4-4)!}(4!)^2
= 4\cdot 1^2 + 6\cdot 2^2 + 4\cdot 6^2 + 1\cdot 24^2 = 748. \tag{B.1}
$$
This number is sufficient to generate the 256 ASCII symbols, along with 492
additional symbols that may be each assigned to entire words, or to commonly
used phrases, such as a sig (signing off) message, a callsign, or commonly
needed sequences of symbols. Thus a callsign like “N1NLF” is a single chord.
A commonly used sequence of commands like ALT 192, ALT 255, ALT 192, is
also a single chord. Common words like “the” or “and” are also single chords.
The four switches can be associated with the thumb and three largest fingers,
leaving out the smallest finger. Claude Shannon’s information theory, however,
suggests that if we have a good strong clear channel, and a weaker channel,
we can get more error-free communication by using both the strong and weak
channels than we can by using only the strong channel. We could and should
use the weak (smallest) finger for at least a small portion of the bandwidth, even
though the other four will carry the major load. Referring back to Figure B.1,
especially Figure B.1b, we see that there are four strong double-throw switches
for the thumb and three largest fingers, and a fifth smaller switch having a very
long lever for the smallest finger. The long lever makes it easy to press this
switch with the weak finger but at the expense of speed and response time. In
fact each of the five switches has been selected upon learning the strength and
other attributes of what will press it. This design gives rise to the pentakeyer.
The result in (B.1) can be generalized. The number of possible chords for
a keyer with N switches, having only single-throw (ST) switches, and not
using any looping back at either the Delay or Yaled (Fig. B.6) stages of chord
development, is
$$
\sum_{n=1}^{N} \frac{N!}{n!(N-n)!}\,(n!)^2. \tag{B.2}
$$
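Equation (B.2) is easy to evaluate; the short sketch below reproduces the counts quoted in the text (51 chords for three switches, 748 for four, and 17,685 for the pentakeyer's five).

```python
from math import comb, factorial

def chord_count(num_switches):
    """Number of possible chords for a keyer with N single-throw switches and no
    looping back at the Delay or Yaled stages (equation B.2): choose which n
    switches take part, times n! press orders, times n! release orders."""
    return sum(comb(num_switches, n) * factorial(n) ** 2
               for n in range(1, num_switches + 1))

if __name__ == "__main__":
    for N in (2, 3, 4, 5):
        print(N, chord_count(N))   # 2 -> 6, 3 -> 51, 4 -> 748, 5 -> 17685
```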
B.6 REDUNDANCY
The pentakeyer provides enough chords to use one to represent each of the most
commonly used words in the English language. There are, for example, enough
chords to represent more than half the words recognized by the UNIX “spell”
command with a typical /usr/share/lib/dict/words having 25,143 words.
However, if all we want to represent is ASCII characters, the pentakeyer
gives us 17,685/256 > 69. That is more than 69 different ways to represent each
letter. This suggests, for example, that we can have 69 different ways of typing
the letter “a,” and more than 69 different ways of typing the letter “b,” and so
on. In this way we can choose whichever scheme is convenient in a given chord
progression.
In most musical instruments there are multiple ways of generating each chord.
For example, in playing the guitar, there are at least two commonly used G chords,
both of which sound quite similar. The choice of which G to use depends on
which is easiest to reach, based on what chord came before it, and what chord will
come after it, and so on. Thus the freedom in having two different realizations
of essentially the same chord makes playing the instrument easier.
Similarly, because there are so many different ways of typing the letter “a,”
the user is free to select the particular realization of the letter “a” that’s easiest to
type when considering whatever came before it and whatever will come after it.
Having multiple realizations of the same chord is called “chordic redundancy.”
Rather than distributing the chordic redundancy evenly across all letters, more
redundancy is applied where it is needed more, so there are more different ways
of typing the letter “a” than there are of typing the letter “q” or “u.” Part of
this reasoning is based on the fact that there are a wide range of letters that can
come before or after the letter “a,” whereas, for example, there is a smaller range
of, and tighter distribution on, the letters that can follow “q,” with the letter “u”
being at the peak of that relatively narrow distribution.
Redundancy need not be imposed on the novice. The first-time user can learn
one way of forming each symbol, and then gradually learn a second way of
forming some of the more commonly used symbols. Eventually an experienced
user will learn several ways of forming some of the more commonly used
symbols.
Additionally some chords are applied (in some cases even redundantly) to
certain entire words, phrases, expressions, and the like. An example with timing
diagrams for a chordic redundancy based keyer is illustrated in Figure B.11.
This approach, of having multiple chords to choose from in order to produce
a given symbol, is the opposite of an approach taken with telephone touchpad-
style keypads in which each number could mean different letters. In U.S. Pat.
6,011,554, issued January 4, 2000, assigned to Tegic Communications, Inc.
(Seattle, WA), Martin T. King; (Vashon, WA); Dale L. Grover; (Lansing, MI);
Clifford A. Kushler (Vashon, WA); and Cheryl A. Grunbock; (Vashon, WA)
describe a disambiguating system in which an inference is made as to what the
person might be trying to type. A drawback of this Tegic system is that the user
must remain aware of what the machine thinks he or she is typing. There is an
extra cognitive load imposed on the user, including the need to be constantly
vigilant that errors are not being made. Using the Tegic system is a bit like using
command line completion in Emacs. While it purports to speed up the
process, it can, in practice, slow down the process by imposing an additional
burden on the user.
Figure B.11 Keyer with a functional chordic redundancy generator or keyer having functional
chordic redundancy. This keyer is used to type in or enter the numbers from 0 to 9 using three
single throw switches. Each symbol (each number from 0 to 9) may be typed in various ways.
If we wish to type “001,” we can do this as follows: first press and release switch SW0, to
obtain symbol 0₀ (the zeroth embodiment of symbol 0). Then, to speed up the process (rather
than press the same switch again), we press switch SW1 while holding SW2, to obtain symbol
0₁, which is another realization of the symbol 0. We next choose a realization of the symbol 1,
namely 1₂, that does not begin with switch SW2. Thus before the chord for symbol 0₁ is
completely released (at the Yaled stage), we begin entering the chord for symbol 1₂, starting
with the available switch SW0.
In some sense the Tegic system is a form of antiredundancy, giving the user
less flexibility. For example, forming new words (not in the dictionary) is quite
difficult with the Tegic system, and when it does make mistakes, they are harder
to detect because the mistakes get mapped onto the space of valid words. Indeed,
chordic redundancy (choice) is much more powerful than antiredundancy (anti-
choice) in how characters can be formed.
A modifier key is a key that alters the function of another key. On a standard
keyboard, the SHIFT key modifies other letters by causing them to appear
capitalized. The SHIFT key modifies other keys between two states, namely a
lowercase state and an uppercase state.
Another modifier key of the standard keyboard is the control key. A letter
key pressed while the control key is held down is modified so that it becomes a
control character. Thus the letter “a” gets changed to Control-A (“^A”) if it is pressed while
the control key is held down.
With the cybernetic keyer, an approach is to have a modifier that is ordinally
conditional, so that its effect is responsive to where it is pressed in the chord.
See Figure B.12 for an example of how a four-key ordinally conditional modifier
is implemented.
B.8 ROLLOVER
One reason that chording keyers can be slow is that they often don’t provide
rollover. A regular QWERTY . . . keyboard allows for rollover. For example, the
Figure B.12 Example of keyer with ordinally conditional modifier. Letters are arranged in
order of letter frequency, starting with the letter ‘‘e’’ which is the most commonly used letter
of the alphabet. Each of the 26 letters, the 10 numbers, and some symbols are encoded with
the 51 possible chords that can be formed from 3 switches, a middle finger switch, SWm,
an index finger switch, SWi, and a thumb switch, SWt. (The more common letters, such as
e, t, and a, are also encoded redundantly so that there is more than one way to enter, for
example, the letter ‘‘e.’’) A ring finger switch, SWr, is the ordinally conditional modifier. If SWr
is not pressed, an ordinary lowercase character is assumed. If a chord leads with SWr, the
character is assumed to be an uppercase character. If the chord is entered while holding SWr,
the character is assumed to be a control character. If a chord trails with SWr, it is assumed
to represent a meta character. The ordinally conditional modifier is also applied to numbers to
generate some of the symbols. For example, an exclamation mark is entered by leading with
SWr into the chord for the number 1.
classic 1984 IBM model M keyboard, still to this day a favorite of many users,
will be responsive to any key while the letter “q” is held down. When the letters
“q” and “w” are held down, it is responsive to most keys (i.e., all those except
keys in the q and w columns). When the letters “q,” “w,” and “e” are held down,
it is responsive to other keys except from those three columns. When “q,” “w,”
“e,” and “r” are held down, it is still responsive to keys in the right-hand half
of the keyboard (i.e., keys that would ordinarily be pressed with the right hand).
Only when five keys are held down does it stop responding to new keypresses.
Thus the model M has quite a bit of rollover. This means that one can type
new letters before finishing the typing of previous letters. This ability to have
overlap between typing different letters allows a person to type faster because a
new letter can be pressed before letting go of the previous letter. Commercially
available chording keyers such as the Handykey Twiddler and the BAT don’t
allow for rollover. Thus typing on the Twiddler or BAT is a slower process.
A goal of the cybernetic keyer is to be able to type much more quickly.
Therefore the important consideration is the trade-off between loopbacks at the Delay
and Yaled stages (Fig. B.6) and rollover. If we decide, by design, that there will
be no loopback at the Delay or the Yaled stages, we can assume that a chord has
been committed to at the Release stage. Thus, once we reach the Release stage,
we can begin to accept another chord, so long as the other chord does not require
the use of any switches that are still depressed at the Release stage. However,
because of the 69-fold chordic redundancy, it is arranged that for most of the
commonly following letters, there exists at the Release stage at least one new
chord that can be built on keys not currently held down.
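A sketch of this rollover rule: once the current chord reaches its Release stage, the next chord may begin provided it avoids every key still held down, and chordic redundancy makes it likely that such a realization exists. The key names and chord sets below are hypothetical.

```python
def can_start_next_chord(still_depressed, next_chord_keys):
    """Rollover rule sketched in the text: after the Release stage of the current
    chord, a new chord may begin provided it uses none of the keys that are
    still held down from the previous chord."""
    return not (set(still_depressed) & set(next_chord_keys))

if __name__ == "__main__":
    # Keys 'i' and 'm' are still held from the previous chord; thanks to chordic
    # redundancy we can pick a realization of the next letter that avoids them
    # (key names and chord assignments are hypothetical).
    realizations_of_e = [{"i", "t"}, {"m", "r"}, {"t", "r"}]
    held = {"i", "m"}
    usable = [c for c in realizations_of_e if can_start_next_chord(held, c)]
    print(usable)    # the realization(s) avoiding 'i' and 'm'
```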
Figure B.13 Dual sensor keyer. Sensors S0 and S1 may be switches or transducers or other
forms of sensory apparatus. Sensors S0 and S1 are operable individually, or together, by way
of rocker block B01. Pressing straight down on block B01 will induce a response from both
sensors. A response can also be induced in only one of sensors S0 or S1 by pressing along its
axis.
If we relax the ordinality constraint, and permit just one time constant, pertaining
to simultaneity, we can obtain eleven symbols from just two switches. Such a
scheme can be used for entering numbers, including the decimal point.
Using this simple coding scheme, the number zero is entered by pressing the LSK.
The number one is entered by pressing the MSK. The number four, for example,
is entered by pressing the MSK first, then pressing the LSK, and then releasing
both at the same time, Within a certain time tolerance inside which the times are considered
the same. (The letter “W” denotes Within tolerance, as illustrated in Fig. B.14.)
The decimal point is entered by pressing both switches at approximately the same
Figure B.14 Example showing keyer chords within timing tolerances W0 and W1. Time
differences within the tolerance band are considered to be zero, so events falling within the
tolerance band defined by W0 and W1 are considered to be effectively simultaneous.
Symbol            Chord (SW1, SW0)
(open, unused)    00
0                 01
1                 10
2                 FLFL
3                 FLLF
4                 FLW
5                 LFFL
6                 LFLF
7                 LFW
8                 WFL
9                 WLF
decimal point     WW
(SW1 is the top trace and SW0 is the bottom trace; F means first, L means last,
and W means within tolerance.)
Figure B.15 Chordic keyer with timing tolerances. In addition to the unused Open chord,
there are 11 other chords that can be used for the numbers 0 through 9, and the decimal point.
The "approximately the same time" is defined according to a band around the line
t0 = t1 in Figure B.7. Such a timing band is depicted in Figure B.15.
This number system can be implemented either by two pushbutton switches
or by a single vector keyswitch of two components, as illustrated in Figure B.13.
In the latter case the entire set of number symbols can be entered with just one
switch, by just one finger. Note that each number involves just a single keypress,
unlike what would be the case if one entered numbers using a small wearable
Morse code keyer. Thus the cybernetic chordic keyer provides a much more
efficient entry of symbols.
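To illustrate how such a two-switch code might be decoded in software, the following C sketch classifies a completed gesture from the press and release times of the two switches. The tolerance value, the function names, and the calling convention are assumptions made for this example; the chord codes themselves are those of Figure B.15.

#include <stdio.h>
#include <string.h>

#define TOL 50.0  /* assumed width of the "within tolerance" band, in milliseconds */

/* Describe the relative order of a pair of events (SW1 event, SW0 event):
   "FL" if SW1 came first, "LF" if SW1 came last, "W" if the two events fall
   within the tolerance band and are treated as simultaneous.               */
static void order_code(double t_sw1, double t_sw0, char *out)
{
    if (t_sw1 - t_sw0 < TOL && t_sw0 - t_sw1 < TOL)
        strcpy(out, "W");
    else if (t_sw1 < t_sw0)
        strcpy(out, "FL");
    else
        strcpy(out, "LF");
}

/* Returns '0'..'9', '.' for the decimal point, or '?' if unrecognized.
   p1, p0 are the press times and r1, r0 the release times of SW1 and SW0.  */
char decode_chord(int sw1_used, int sw0_used,
                  double p1, double p0, double r1, double r0)
{
    static const struct { const char *code; char sym; } table[] = {
        {"01", '0'},  {"10", '1'},  {"FLFL", '2'}, {"FLLF", '3'},
        {"FLW", '4'}, {"LFFL", '5'}, {"LFLF", '6'}, {"LFW", '7'},
        {"WFL", '8'}, {"WLF", '9'}, {"WW", '.'},
    };
    char press[4], release[4], code[8];

    if (sw1_used && sw0_used) {
        order_code(p1, p0, press);    /* order in which the switches closed */
        order_code(r1, r0, release);  /* order in which they opened         */
        sprintf(code, "%s%s", press, release);
    } else {
        sprintf(code, "%d%d", sw1_used ? 1 : 0, sw0_used ? 1 : 0);
    }
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(code, table[i].code) == 0)
            return table[i].sym;
    return '?';  /* open chord, or a gesture outside the defined set */
}

For example, pressing SW1 (the MSK), then SW0 (the LSK), and releasing both within the tolerance band produces the code FLW, which decodes to the number 4, in agreement with the description above.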
Wearable keyers are known in the art of ham radio. For example, in U.S.
Pat. 4,194,085, March 18, 1980, Scelzi describes a “Finger keyer for code
transmission.” The telegraphic keyer fits over a finger, preferably the index finger,
of an operator, for tapping against the operator’s thumb or any convenient object.
It is used for transmission of code with portable equipment. The keyer is wearably
operable when walking, or during other activities.
All keyers, including previously known keyers and the proposed keyers, such
as the pentakeyer and the continuous 10-dimensional keying system, are much
easier to use if they are custom-made for the user. The important aspect is getting
the hand grip to fit well.
A subject of ongoing work, therefore, is designing ways of molding the keyers
to fit the hand of the wearer. Presently this is done by dipping the hand in ice water,
as was shown in Figure B.2, and draping it with heated plastic material that is
formed to the shape of the hand. Once the handpiece is formed, sensors are
selected and installed so the keyer will match the specific geometric shape of the
user’s hand.
The keyer described in this chapter is closely related to other work done on
so-called chording keyboards. Since this field is so rapidly changing, a list
of references is maintained dynamically, online, as a constantly evolving and
updated bibliography and resource for the wide variety of keyers and keyer-
related research [140,141].
Two essential differences should be noted, however:
• The multiambic keyer enables a variety of absolute timing, ordinal timing
(order), and other fine structure, as well as phenomena such as redundancy.
Chording keyboards of the prior art fail to provide these features. However,
these so-called chording keyboards (which often don’t have a “board,” and
so should perhaps be called keyers rather than keyboards) may be regarded
as special cases of the more general keyer described in this article.
• The multiambic keyer described in this chapter is customized to the
individual user, much like prescription eyeglasses, a mouthguard, or shoes
and clothing. Therefore it marks an important departure from traditional
computing in which the interface is mass produced. Although there are
some obvious problems in manufacture, such customization is not without
precedent. For example, we commonly fit shoes and clothing for an
individual person, and do not normally swap shoes and clothing at random,
and expect them to serve any person at random. Likewise it is believed
by the author that many individual persons might someday have their own
keyers, and use these as their own personal input devices.
The customization of the keyer marks a sharp departure from the environmental
intelligence paradigms in which the environment simply adapts or is alleged to
adapt to our needs [142]. Instead of having cameras and microphones everywhere,
watching us, and feeding into a network of pervasive computing, we simply
have the computer attached to our own body. Thus, through our own cybernetic
B.13 CONCLUSION
Learning to use the pentakeyer is not easy, just as learning how to play a musical
instrument is not easy. The pentakeyer evolved out of a different philosophy,
more than 20 years ago. This alternative philosophy knew nothing of so-called
user-friendly user-interface design, and therefore evolved along a completely
different path.
Just as playing the violin is much harder to master than playing the TV remote
control, it can also be much more rewarding and expressive. Thus, if we were
only to consider ease of use, we might be tempted to teach children how to operate a
television rather than how to play a violin, or how to read and write, because the
television is easier to learn. But if we did this, we would have an illiterate society in which
all we could do would be things that are easy to learn. It is the author’s belief
that a far richer experience can be attained with a lifelong computer interface
that is worn on the body, and used constantly for 10 to 20 years. On this kind of
time scale, an apparatus that functions as a true extension of the mind and body
may result. Just as it takes a long time to learn how to see, or to read and write,
or to operate one’s own body (e.g., it takes some years for the brain to figure
out how to operate the body so that it can walk, run, swim, etc. effectively), it
is expected that the most satisfying and powerful user interfaces will be learned
over many years.
B.14 ACKNOWLEDGMENTS
Simon Haykin, Woodrow Barfield, Richard Mann, Ruth Mann, Bill Mann, and
Steve Roberts (N4RVE) helped in the way of useful feedback and constructive
criticism as this work evolved.
Dr. Chuck Carter volunteered freely of his time to help in the design of the
interface to WearComp2 (a 6502-based wearable computer system of the early
1980s), and Kent Nickerson similarly helped with some of the miniature personal
radar units and photographic devices involved with this project throughout the
mid-1980s.
Programming of one of the more recent embodiments of the Keyer was done
in collaboration with Adam Wozniak, and elements of the recent embodiment
borrow code from Wozniak’s PICKEY. Students, including Eric Moncrieff, James
Fung, and Taneem Ahmed, are continuing with this work.
The author would also like to thank Corey Manders, Maneesh Yadav, and
Adnan Ali, who recently joined in this effort.
The author additionally extends his thanks to Xybernaut Corp., Digital
Equipment Corp., and Compaq, for lending or donating additional equipment
that made these experiments possible.
APPENDIX C
WEARCAM GNUX HOWTO
WARNING: Do not “make bootable” the WearComp GNUX partition (e.g. do not
install LILO) or you may clobber the hard drive and set yourself back a couple of
weeks. Unlike desktop computers, the wearcomps are ruggedized systems in which
the hard drive is very well sealed inside, and difficult to remove and recover from
a clobbered master boot record.
The LILO installation is often insidiously named something like "make
bootable" under the deselect menu. Ordinarily this would not be a problem.
However, the ruggedized nature of the WearComps makes it difficult to
remove the hard drives to recover from problems with LILO,
since the WearComps typically do not have a floppy drive with proper booting
support.
The goal of this exercise is to familiarize the student with some of the typical
situations that arise in getting a WearComp to boot a free-source operating system.
GNUX (GNU + LinUX) has been chosen as the basis for the WearComp
operating system (WOS) and picture transfer protocol (PTP). It will therefore be
used for this exercise. WOS will continue to be developed on top of GNUX.
The student should install the GNUX operating system on a computer that
has no floppy disk drive or Ethernet card. The GNUX system runs in a ramdisk so
that there is no dependence on the hard drive. This exercise can be done either on
one of the standard-issue WearComps (class set) or on any desktop computer
that has no GNUX file systems on the hard drive, and either has no floppy
drive or has the floppy drive disabled in order to simulate the condition of
having no floppy drive present. If you are using a WearComp, simply proceed.
If you do not have a WearComp, simply find a suitable computer with i386
type architecture; make sure that there are no GNUX filesystems on any of
the hard drives. Preferably there should only be one DOS partition on the hard
drive, of size 500 megabytes, in order to simulate the condition present on the
WearComps. Various open source freeDOS systems are preferred to proprietary
DOS. The student can install freeDOS, then, from the BIOS, disable the floppy
drive, and shut down the ethernet to simulate a typical WearComp situation.
Assuming that you are using the WearComp, familiarize yourself with the WearComp
system. Currently it has DOS on it, as well as a virus called "win." By virus, what is
meant is a program in which there has been a deliberate attempt to obfuscate its principle
of operation (i.e., by Closed Source), as in the formal definition of "virus" proposed in the
USENIX98 closing keynote address (http://wearcam.org/usenix98/index.html). Note that
the virus spreads by way of human users who write programs or develop applications that
depend on it, so it propagates through the need to install it in order to run these other
applications.
To eradicate the automatic running of this virus, edit autoexec.bat and remove the six
lines that set up and start the virus, including the line containing the "win" command
(the line that starts the virus).
Now the WearComp should boot into DOS only. DOS is not a particularly
good operating system but forms a satisfactory boot monitor from which to load
another operating system such as GNUX. On a WearComp it is preferable to use
loadlin, rather than LILO. Also using loadlin means that you will not continually
need to rewrite the MBR every time you want to change the default boot kernel.
If there is already LILO present, it can be removed with the “FDISK /MBR”
command which restores the Master Boot Record (MBR). You may want to add
the line “doskey” to the end of the autoexec.bat file. This is useful because it
will allow you to use the up/down arrows to get commands from a command
history.
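For example, once loadlin and a kernel image have been transferred (see below) and the GNUX partition exists, a typical boot command issued from the DOS prompt might look like the following, where the directory, the kernel image name, and the root partition are all assumptions that depend on the particular installation:

c:\loadlin\loadlin.exe c:\loadlin\vmlinuz root=/dev/hda2 ro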
We need to make room on the hard drive for GNUX, so delete some of the
unnecessary files. On a standard issue WearComp, proceed as follows:
deltree aol25
deltree apache
deltree cdrom
deltree fmpro
deltree ifloldpi
deltree ifls (takes a few minutes to delete)
deltree mmworks
deltree mtb30
deltree picture
deltree pman15
deltree policed
deltree powerpnt
deltree project1
deltree puzzle
deltree sidecar
deltree temp
deltree tigers (takes about 30 seconds to delete)
deltree verbex (takes about a minute to delete)
deltree voiceage
Keep the “dos” utilities, and the “win” directory (running the “win” virus can
tell you certain useful things about the system, what kind of mouse it has, etc.).
There are several files that are needed; they can be obtained from the
wearcam.org/freewear site. You will need access to a network-connected PC on which
you can use the serial or parallel port. This is a good exercise in transferring
files to and from the WearComp.
The best way to get the files onto the WearComp is to get them onto your
desktop computer in the DOS partition. From GNUX on your desktop computer
(or by any other means) obtain the following from wearcam.org which you will
put in the dos partition of your desktop computer:
• debian
• fips
• loadlin
When you obtain these files, make sure that they have the same filesize and
contents as the ones on wearcam.org, since some Web browsers, in the presence
of certain viruses such as Virus95, will corrupt files during transfer. In particular,
it is common under Virus95 for binary files to have something appended to the
end of the files. Thus double-check to make sure that your browser has not
appended a virus to the end of any of the files that you download.
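One convenient way to double-check, assuming that a checksum utility such as md5sum is available on the GNUX desktop machine, is to compute checksums of the downloaded files and compare them against checksums computed on a copy that is known to be good:

md5sum debian/* fips/* loadlin/*

A mismatch in either the byte count (ls -l) or the checksum indicates a corrupted transfer.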
The DOS programs interlnk and intersvr may be used to transfer files. If you
are not already in DOS, restart your computer in DOS, and use the DOS utilities
interlnk and intersvr to transfer these to the WearComp into the appropriate
directories.
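A typical session, with the paths and drive letters given only as an illustration, is to run the server on the desktop machine and load the interlnk driver on the WearComp:

rem On the desktop PC that holds the downloaded files:
intersvr

rem On the WearComp, add the following line to config.sys and reboot:
device=c:\dos\interlnk.exe

rem Then, still on the WearComp, copy the files across; here e: is assumed
rem to be the drive letter that interlnk assigns to the desktop's C: drive:
xcopy e:\debian c:\debian /s
xcopy e:\fips c:\fips /s
xcopy e:\loadlin c:\loadlin /s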
After you’ve transferred the additional files over, your directory structure
should look something like this:
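At a minimum, assuming the directory names used above, it will contain something like the following (the exact layout may differ slightly):

c:\dos
c:\win
c:\debian
c:\fips
c:\loadlin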
Edit the autoexec.bat and config.sys appropriately for the new material
added, and the old material deleted. For example, add directories such as loadlin
to your path, and delete paths to nonexistent directories, to shorten your path if it
gets too long. (Remember that if you goof up autoexec.bat or config.sys, you might
have to hold down F8 while booting, just before the prompt says "booting msdos.")
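For example, a trimmed path line in autoexec.bat might read as follows, where the directory names are simply the ones used in this exercise and should be adjusted to match what is actually on the drive:

set path=c:\dos;c:\loadlin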
C.6 DEFRAG
Run the defrag program to defragment the drive. This is a very important step.
Make sure you do a full defrag. If the disk is already less than 1% fragmented,
it will run a defrag of files only. If this occurs, make sure you still force it to do
a complete defrag.
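With the standard DOS defrag program, a full optimization can usually be forced from the command line, for example:

defrag c: /f

The /f switch requests full optimization; switch names can vary slightly between DOS versions, so check defrag /? on your system.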
Be careful not to “clobber” DOS, or you will have “killed” the WearComp,
since you will then have no way to boot it (it has no floppy drive, CD ROM, etc).
There is currently no convenient way to remove the hard drive, or to connect a
floppy disk or CD ROM. This is typical of WearComp systems. Consider yourself
lucky that it boots something. Do your best to keep it that way (i.e., keep it so
that it boots something).
C.7 FIPS
Before running fips, make sure that the defrag program has run. Be careful that
nothing happens during fips (e.g., tripping over the cable or pulling out power,
which can kill the WearComp). This is the time to be careful that nothing gets
clobbered.
Use fips.exe to reduce the partition size of the DOS partition to 100 megabytes.
When fips asks you if you want to make a backup, you will have to specify n
(no) because there is no way to make a backup at this time. Use the up/down
arrows in fips to specify the partition size. Select 100.4 MB and cylinder number
204. This leaves 399.2 MB for GNUX.
Once fips exits, you will need to reboot the computer before the size change
takes effect. Do not mess around creating or deleting files, as this can kill the
computer. Reboot as soon as fips exits. You will now see a smaller amount of
free space, and the computer will behave, in DOS, as if it has a 100 megabyte
hard drive instead of a 500 megabyte hard drive. The remaining 400 megabytes
(invisible to DOS except for fdisk) will be used for GNUX.
Although DOS fdisk can see the new 400 megabyte partition, don’t use DOS
fdisk to do anything to the new partition; just leave the remaining 399.2 MB as
it is for now (unformatted). This will later be repartitioned using GNUX fdisk,
and formatted using GNUX mke2fs.
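When that time comes, the GNUX-side commands will be along the following lines, where /dev/hda2 is only an assumed name for the new partition (verify the actual name first, e.g., with fdisk -l):

fdisk /dev/hda
mke2fs /dev/hda2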
Reboot, go into debian directory, and type install to run the install.bat file.
APPENDIX D
HOW TO BUILD A COVERT
COMPUTER IMAGING
SYSTEM INTO ORDINARY
LOOKING SUNGLASSES
For the WearComp reality mediator to be of use in everyday life, it must not
have an unusual appearance, especially given its use in corrupt settings such
as dishonest sales establishments, gambling casinos, corrupt customs border
stations, and political institutions where human rights violations are commonplace
or where objections are likely to be raised to recording apparatus.
Accordingly it has been proposed that the apparatus must pass the so-called
casino test [37]. Once the apparatus has successfully passed the scrutiny of the
most paranoid individuals, like the croupiers and pit bosses of criminally funded
organizations, it will then have reached what one might call a state of looking
normal.
A brief historical time line of the “computershades” (covert reality mediator)
follows:
Figure D.1 Covert embodiments of WearComp suitable for use in ordinary day-to-day situ-
ations. Both incorporate fully functional UNIX-based computers concealed in the small of the
back, with the rest of the peripherals, such as analog to digital converters, also concealed
under ordinary clothing. Both incorporate camera-based imaging systems concealed within
the eyeglasses. While these prototype units are detectable by physical contact with the body,
detection of the apparatus by others was not found to be a problem. This is, of course, because
normal social conventions are such that touching of the body is normally only the domain of
those known well to the wearer. As with any prosthetic device, first impressions are important
to normal integration into society, and discovery by those who already know the wearer well
(i.e., to the extent that close physical contact may occur) typically happens after an acceptance
is already established. Other prototypes have been integrated into the clothing in a manner that
feels natural to the wearer and to others who might come into physical contact with the wearer.
(a) Lightweight black-and-white version completed in 1995. (b) Full-color version completed in
1996 included special-purpose digital signal-processing hardware based on an array of TMS
320 series processors connected to a UNIX-based host processor, which is concealed in the
small of the back. A cross-compiler for the TMS 320 series chips was run remotely on a SUN
workstation, accessed wirelessly through radio and antennas concealed in the apparatus.
Early embodiments of the apparatus had an unusual appearance by modern standards. The computer has now been
made quite small. For example, in a seventh-generation system the components
are distributed and concealed in a structure similar to an athletic tank top for
being worn under casual clothing. This structure also allows the device to pick
up physiological measurements, such as respiration, heart rate, and in fact the
full ECG waveform. Alternatively, when it is not necessary to collect physio-
logical data, we sometimes use small-size commercial off-the-shelf computers,
such as the Expresso pocket computer, which can fit in a shirt pocket and can
be concealed under casual clothing without much difficulty.
Batteries can be distributed and are easy to conceal. In and of themselves,
the processor and batteries are easily concealed, and even the keyer can be
kept in a pocket, or held under a table during a meeting.
The display device is perhaps the most cumbersome characterizing feature
of the WearComp. Even if a 20-year-old backpack-based wearable computer,
or a 15-year-old jacket-based computer, is worn, it is the display that is most
objectionable and usually first noticed. Of all the various parts of the WearComp,
the information display is that which makes it most evident that a person is
networked, and it is usually also the most unnerving or disturbing to others.
A display system, even if quite small, will be distracting and annoying to
others, simply because it is right at the eye level. Of course, this is the area
where people pay the most attention when engaged in normal conversation. Thus
one might want to consider building a covert display system.
The seventh-generation WearComps (started in 1995) were characterized by
a display concealed inside what appear to be ordinary sunglasses. Some of the
sunglasses had more than just displays. Some were EyeTap devices that func-
tioned as cameras. However, for simplicity, we will look at the example of a
simple display device.
The simplest display medium is the Kopin CyberDisplay. The original Kopin
display system, selling for approximately U.S. $5,000, was donated to the author
by Kopin, and formed the basis for many of these experiments. Presently,
however, the cost has come down considerably, and units are now selling for
less than $100.
Originally the author had some connectors manufactured to connect to the
Kopin CyberDisplay. Later, however, in order to keep the overall size down, the
connector was eliminated and wires were soldered directly to the Kopin
CyberDisplay, in one of two ways. Assuming that most students (especially with
the growing emphasis on soft computing) are novices at soldering, we will focus
on the first method, which is a lot easier, especially for the beginning student.
The very first step is to number the wires. The importance of numbering the
wires cannot be overemphasized. After soldering the wires on, you will find it
somewhat difficult to determine which wire is which by following each wire.
Since the whole item is very fragile, it is strongly recommended that the wires
be numbered before soldering any of them to anything.
Take 20 or so wires (depending on whether you want to connect all the lines or
just the ones that are actually used; check the Kopin specifications for signal
levels, etc., and what you are going to use), bring them into a bundle, and
label them as shown in Figure D.2.
The author prefers to use all black wires because they are easier to conceal
in the eyeglass frames. Usually number 30 wire (AWG 30) is used. Sometimes
black number 30 stranded, and sometimes black number 30 solid are used. If
you’re a novice at soldering, use number 30 solid and splice it to number 30
stranded inside the eyeglass frames. Solid is a lot easier to solder to the Kopin
CyberDisplay.
With the reduction in cost by more than a factor of 50 (from U.S. $5,000
down to less than U.S. $100), we can afford to take more risks and solder directly
to the device, avoiding the bulk of a connector that would otherwise need to be concealed.
Figure D.2 Use of black wires for all the wires instead of color coding them. Using all
black wire makes the wires easier to conceal in the frames with optical epoxy than
variously colored wires would be. One begins by labeling (numbering) the wires. Here is
what a bundle of wires looks like when it is labeled. The 6 ("six") is underlined so that,
when it is upside down, it does not alias into the number 9.
Figure D.3 Begin soldering at one end, and work across carefully, since it is much easier to
solder when one side is free than to solder into the middle of a row of wires.
To solder the wires to the Kopin CyberDisplay, a fine-tip iron is best. The
author usually uses a Weller 921ZX iron, which is the sleekest and most slender
commercially produced soldering iron; it also takes the finest tip, number U9010.
Begin soldering at one end, and work across carefully. It is a lot easier to
solder when one side is free than to go back and fix a bad connection in the
middle of a row of wires as indicated in Figure D.3.
You’re now ready to install the unit into eyeglasses. The display requires a
backlight, and a circuit to drive it. If you’re versed in the art of field programmable
gate arrays (FPGAs), you should have no trouble designing a driver circuit for
the display. Alternatively, you could purchase a product that uses the Kopin
CyberDisplay. An example of a product that uses the Kopin CyberDisplay is the
M1 product made in Canada. If you purchase an M1, you will be able to test
your eyeglasses while you continue development of a smaller-sized drive circuit.
There are now two versions of the M1: one with remote driver board, and
the other in which the driver board is put inside the main box. The one with
Figure D.4 Completed covert eyeglass rig: Here only the right eye position is shown wired.
The second eye (left eye) position has not yet been installed. A second eyeglass safety strap
and a second display are installed next. Both can now be tested with the M1 drive circuit. The
next step is to build a smaller circuit that drives two Kopin CyberDisplays. Not shown in this
system is the installation of the cameras, which are each reflected off the back of the optical
element used for the display.
Figure D.5 Completed unit as it appeared on the cover of Toronto Computes, September
1999. The computershades enable a fashionable, or at least normal-looking existence in the
world of wearable computing.
remote driver board will tend to give a better picture (less image noise) because
of the reduced distance of signal path. In this case (see Fig. D.4), a short (e.g.,
18 inch) set of wires from the eyeglasses to the driver board can be concealed
in an eyeglass safety strap.
Not shown in Figure D.4 is the installation of the cameras. The cameras would
each be reflected off the back of the optical element shown for the display. If
you are just using the computer for normal computational applications, you don’t
need to worry about the camera or the second eye. Enjoy your covert seventh-
generation WearComp.
A complete version of the partially constructed eyeglasses pictured above
appeared on the cover of the September 1999 issue of Canada Computes
(Fig. D.5), where we can see that wearable computing can appear fashionable,
or at the very least normal looking.
A NOTE ON SAFETY: Neither the author nor the publisher can assume any liability
for any bad effects experienced. Displays can be distracting and can cause death or
injury because of the distraction. Care is also needed to prevent optical elements
from getting into the eye if the apparatus is struck, or if the wearer falls down, or
the like. Finally, excessive brightness over long-term usage can be harmful (with
clear glasses there is a tendency to turn up the brightness, which can lead to eye
damage). If you use a high-voltage source for the backlight, beware of possible bad
effects from having it close to the eye, together with exposed wires. The electric
shock, which feels like a bad mosquito bite even at low current, can also cause
distraction. So even if the voltage is not harmful in itself, it may cause other
injury. Remember SAFETY FIRST!
BIBLIOGRAPHY
[1] Steve Mann. Wearable computing: A first step toward personal imaging. IEEE
Computer, 30(2):25–32, Feb. 1997. http://wearcam.org/ieeecomputer.htm.
[2] Steve Mann. Humanistic intelligence/humanistic computing: “Wearcomp” as a new
framework for intelligent signal processing. Proc. IEEE, 86(11):2123–2151 +
cover, Nov. 1998. http://wearcam.org/procieee.htm.
[3] William A. S. Buxton and Ronald M. Baecker. Readings in Human-Computer
Interaction: A Multidisciplinary approach. Morgan Kaufmann, 1987, chs. 1, 2.
[4] Simon Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New
York, 1994.
[5] Simon Haykin. Radar vision. Second International Specialist Seminar on Parallel
Digital Processors, Portugal, Apr. 15–19, 1992.
[6] D. E. Rumelhart and J. L. McClelland, eds. Parallel Distributed Processing. MIT
Press, Cambridge, 1986.
[7] Tomaso Poggio and Federico Girosi. Networks for approximation and learning.
Proc. IEEE, 78(9):1481–1497, Sep. 1990.
[8] Bart Kosko. Fuzzy Thinking: The New Science of Fuzzy Logic. Hyperion, New York,
1993.
[9] Bart Kosko and Satoru Isaka. Fuzzy logic. Scientific American, 269:76–81, July
1993.
[10] Marvin Minsky. Steps toward artificial intelligence. In Phillip Laplante, ed., Great
Papers on Computer Science, Minneapolis/St. Paul, 1996 (paper in IRE 1960).
[11] D. C. Engelbart. Augmenting human intellect: A conceptual framework. Research
Report AFOSR-3223, Stanford Research Institute, Menlo Park, 1962. http://
www.histech.rwth-aachen.de/www/quellen/engelbart/ahi62index.html.
[12] Douglas C. Engelbart. A conceptual framework for the augmentation of man’s
intellect. In P. D. Howerton and D. C. Weeks, eds., Vistas in Information Handling.
Spartan Books, Washington, DC, 1963.
[13] Method and apparatus for relating and combining multiple images of the same
scene or object(s). Steve Mann and Rosalind W. Picard, U.S. Pat. 5706416, Jan. 6,
1998.
[14] Steve Mann. Wearable, tetherless computer–mediated reality: WearCam as a
wearable face–recognizer, and other applications for the disabled. TR 361, MIT
Media Lab Perceptual Computing Section. Also appears in AAAI Fall Symposium
on Developing Assistive Technology for People with Disabilities, Nov. 9–11, 1996,
MIT. http://wearcam.org/vmp.htm. Cambridge, Feb. 2 1996.
[15] Stephen M. Kosslyn. Image and Brain: The Resolution of the Imagery Debate. MIT
Press, Cambridge, 1994.
[16] S. Mann. “WearStation”: With today’s technology, it is now possible to build a fully
equipped ham radio station, complete with internet connection, into your clothing.
CQ-VHF, pp. 1–46, Jan., 1997.
[17] S. Mann. “Mediated reality.” TR 260, MIT Media Lab vismod, Cambridge, 1994.
http://wearcam.org/mr.htm.
[18] James D. Meindl. Low power microelectronics: Retrospect and prospect. Proc.
IEEE 83(4):619–635, Apr. 1995.
[19] Steve Mann. An historical account of the “WearComp” and “WearCam” projects
developed for “personal imaging”. In Int. Symp. on Wearable Computing, IEEE,
Cambridge, MA, Oct. 13–14, 1997.
[20] Steve Mann. Personal Imaging. Ph.D. thesis. Massachusetts Institute of Technology.
1997.
[21] Steve Mann. Further developments on “headcam”: Joint estimation of camera
rotation + gain group of transformations for wearable bi-foveated cameras. In Proc.
Int. Conf. on Acoustics, Speech and Signal Processing, vol. 4, Munich, Germany,
Apr. 1997. IEEE.
[22] Eric J. Lind and Robert Eisler. A sensate liner for personnel monitoring application.
In First Int. Symp. on Wearable Computing. IEEE, Cambridge, MA, Oct. 13–14,
1997.
[23] R. W. Picard and J. Healey. Affective wearables. In Proc. First Int. Symp. on
Wearable Computers, IEEE pp. 90–97, Cambridge, MA, Oct. 13–14, 1997.
[24] Steve Mann. Smart clothing: The wearable computer and wearcam. Personal
Technologies, 1(1):21–27, Mar. 1997.
[25] Simon Haykin, Carl Krasnor, Tim J. Nohara, Brian W. Currie, and Dave
Hamburger. A coherent dual-polarized radar for studying the ocean environment.
IEEE Trans. on Geosciences and Remote Sensing, 29(1):189–191, Jan. 1991.
[26] Brian W. Currie, Simon Haykin, and Carl Krasnor. Time-varying spectra for dual-
polarized radar returns from targets in an ocean environment. In IEEE Conf.
Proc. RADAR90, pp. 365–369, Arlington, VA, May 1990. IEEE Aerospace and
Electronics Systems Society.
[27] D. Slepian and H. O. Pollak. Prolate spheroidal wave functions, Fourier analysis
and uncertainty, I. Bell Syst. Tech. J., 40:43–64, Jan. 1961.
[28] Steve Mann and Simon Haykin. The chirplet transform: A generalization of Gabor’s
logon transform. Vision Interface ’91, pp. 205–212, June 3–7, 1991.
[29] G. Strang. Wavelets and dilation equations: A brief introduction. SIAM Rev.,
31(4):614–627, 1989.
[30] I. Daubechies. The wavelet transform, time-frequency localization and signal
analysis. IEEE Trans. Inf. Theory, 36(5):961–1005, 1990.
[31] D. Mihovilovic and R. N. Bracewell. Whistler analysis in the time-frequency plane
using chirplets. J. Geophys. Res., 97(A11):17199–17204, Nov. 1992.
[32] Richard Baraniuk and Doug Jones. Shear madness: New orthonormal bases and
frames using chirp functions. Trans. Signal Processing, 41, Dec. 1993. Special
issue on wavelets in signal processing.
[33] M. A. Saunders, S. S. Chen, and D. L. Donoho. Atomic decomposition by basis
pursuit. http://www-stat.stanford.edu/~donoho/Reports/1995/30401.pdf, pp. 1–29.
[34] H. M. Ozaktas, B. Barshan, D. Mendlovic, and L. Onural. Convolution, filtering,
and multiplexing in fractional Fourier domains and their relation to chirp and
wavelet transforms. J. Opt. Soc. America, A11(2):547–559, 1994.
[35] Steve Mann and Simon Haykin. The chirplet transform: Physical considerations.
IEEE Trans. Signal Processing, 43(11):2745–2761, Nov. 1995.
[36] Don Norman. Turn Signals Are the Facial Expressions of Automobiles. Addison
Wesley, Reading, MA, 1992.
[37] Steve Mann. “Smart clothing”: Wearable multimedia and “personal imaging”
to restore the balance between people and their intelligent environments.
In Proc. ACM Multimedia 96. pp. 163–174, Boston, Nov. 18–22, 1996.
http://wearcam.org/acm-mm96.htm.
[38] Steve Mann. Humanistic intelligence. Invited plenary lecture, Sept. 10, In Proc.
Ars Electronica, pp. 217–231, Sept. 8–13, 1997. http://wearcam.org/ars/;
http://www.aec.at/fleshfactor. Republished in Timothy Druckrey, ed., Ars
Electronica: Facing the Future, A Survey of Two Decades. MIT Press, Cambridge,
pp. 420–427.
[39] R. A. Earnshaw, M. A. Gigante, and H. Jones. Virtual Reality Systems. Academic
Press, San Diego, CA, 1993.
[40] Ivan E. Sutherland. A head-mounted three dimensional display. In Proc. Fall Joint
Computer Conf., Thompson Books, Washington, DC, 1968, pp. 757–764.
[41] S. Feiner, B. MacIntyre, and D. Seligmann. Knowledge-based augmented reality.
Commun. ACM, 36(7):52–62, July 1993.
[42] S. Feiner, B. MacIntyre, and D. Seligmann. Karma (knowledge-based augmented
reality for maintenance assistance). 1993. http://www.cs.columbia.edu/graphics/projects/karma/karma.html.
[43] Henry Fuchs, Mike Bajura, and Ryutarou Ohbuchi. Teaming ultrasound
data with virtual reality in obstetrics. http://www.ncsa.uiuc.edu/Pubs/MetaCenter/SciHi93/1c.Highlights-BiologyC.html.
[44] David Drascic. Papers and presentations, 1993. http://vered.rose.utoronto.ca/people/david_dir/Bibliography.html.
[45] S. Mann. Wearable Wireless Webcam, 1994. http://wearcam.org.
[46] Ronald Azuma. Registration errors in augmented reality: NSF/ARPA Science and
Technology Center for Computer Graphics and Scientific Visualization, 1994.
http://www.cs.unc.edu/~azuma/azuma_AR.html.
[47] George M. Stratton. Some preliminary experiments on vision without inversion of
the retinal image. Psycholog. Rev., 3: 611–617, 1896.
[48] Hubert Dolezal. Living in a World Transformed. Academic Press, Orlando, FL,
1982.
[49] Ivo Kohler. The Formation and Transformation of the Perceptual World, Vol. 3(4)
of Psychological Issues. International University Press, New York, 1964.
[50] Simon Haykin. Communication Systems, 2d ed. Wiley, New York, 1983.
[51] G. Arfken. Mathematical Methods for Physicists, 3rd ed. Academic Press, Orlando,
FL, 1985.
[52] K. Nagao. Ubiquitous talker: Spoken language interaction with real world objects,
1995. http://www.csl.sony.co.jp/person/nagao.html.
[53] Michael W. McGreevy. The presence of field geologists in Mars-like terrain.
Presence, 1(4):375–403, Fall 1992.
[54] Stuart Anstis. Visual adaptation to a negative, brightness-reversed world: Some
preliminary observations. In Gail Carpenter and Stephen Grossberg, eds., Neural
Networks for Vision and Image Processing. MIT Press, Cambridge, 1992, pp. 1–15.
[55] Wilmer Eye Institute. Lions Vision Research and Rehabilitation Center, Johns
Hopkins, 1995. http://www.wilmer.jhu.edu/low_vis/low_vis.htm.
[56] Bob Shaw. Light of Other Days. Analog, August 1966.
[57] Harold E. Edgerton. Electronic Flash/Strobe. MIT Press, Cambridge, 1979.
[58] T. G. Stockham Jr. Image processing in the context of a visual model. Proc. IEEE,
60(7):828–842, July 1972.
[59] S. Mann and R. W. Picard. Being “undigital” with digital cameras: Extending
dynamic range by combining differently exposed pictures. Technical Report 323,
MIT Media Lab Perceptual Computing Section, Cambridge, 1994. Also appears,
IS&T’s 48th Ann. Conf., pp. 422–428, May 7–11, 1995, Washington, DC.
http://wearcam.org/ist95.htm.
[60] Pattie Maes, Trevor Darrell, Bruce Blumberg, and Alex Pentland. The ALIVE
system: Full-body interaction with animated autonomous agents. TR 257, MIT
Media Lab Perceptual Computing Section, Cambridge, 1994.
[61] Steve Mann. Eyeglass mounted wireless video: Computer-supported collaboration
for photojournalism and everyday use. IEEE ComSoc, 36(6):144–151, June 1998.
Special issue on wireless video.
[62] W. Barfield and C. Hendrix. The effect of update rate on the sense of presence
within virtual environments. Virtual Reality: Research, Development, and Applica-
tion, 1(1):3–15, 1995.
[63] S. Mann. Compositing multiple pictures of the same scene. In Proc. 46th An. IS&T
Conf., pp. 50–52, Cambridge, MA, May 9–14, 1993. Society of Imaging Science
and Technology.
[64] Steve Mann. Personal imaging and lookpainting as tools for personal documentary
and investigative photojournalism. ACM Mobile Networking, 4(1):23–36, 1999.
[65] R. J. Lewandowski, L. A. Haworth, and H. J. Girolamo. Helmet- and head-mounted
displays III. Proc. SPIE, AeroSense 98, 3362, Apr. 12–14, 1998.
[66] T. Caudell and D. Mizell. Augmented reality: An application of heads-up display
technology to manual manufacturing processes. Proc. Hawaii Int. Conf. on Systems
Science, 2:659–669, 1992.
[67] Microvision, http://www.mvis.com/.
[68] Graham Wood. The infinity or reflex sight, 1998. http://www.graham-
wood.freeserve.co.uk/1xsight/finder.htm.
[69] Stephen R. Ellis, Urs J. Bucher, and Brian M. Menges. The relationship of
binocular convergence and errors in judged distance to virtual objects. Proc. Int.
Federation of Automatic Control, June 27–29, 1995.
[88] E. H. Adelson and J. R. Bergen. The plenoptic function and the elements of early
vision. In M. Landy and J. A. Movshon, eds., Computational Models of Visual
Processing. MIT Press, Cambridge, MA, pp. 3–20, 1991.
[89] Compiled and edited from the original manuscripts by Jean Paul Richter. The
Notebooks of Leonardo Da Vinci, 1452–1519 vol. 1. Dover, New York, 1970.
[90] Graham Saxby. Practical Holography, 2nd ed., Prentice-Hall, Englewood Cliffs,
New Jersey, 1994.
[91] B. R. Alexander, P. M. Burnett, J. -M. R. Fournier, and S. E. Stamper. Accurate
color reproduction by Lippman photography. In SPIE Proc. 3011–34 Practical
holography and holographic materials, Bellingham WA 98227. Feb. 11, 1997.
Photonics West 97 SPIE, Cosponsored by IS&T. Chair T. John Trout, DuPont.
[92] Steve Mann. Lightspace. Unpublished report (paper available from author).
Submitted to SIGGRAPH 92. Also see example images in http://wearcam.org/
lightspace, July 1992.
[93] Cynthia Ryals. Lightspace: A new language of imaging. PHOTO Electronic
Imaging, 38(2):14–16, 1995. http://www.peimag.com/ltspace.htm.
[94] S. S. Beauchemin, J. L. Barron, and D. J. Fleet. Systems and experiment perfor-
mance of optical flow techniques. Int. J. Comput. Vision, 12(1):43–77, 1994.
[95] A. M. Tekalp, M. K. Ozkan, and M. I. Sezan. High-resolution image reconstruc-
tion from lower-resolution image sequences and space-varying image restoration. In
Proc. Int. Conf. on Acoustics, Speech and Signal Proc., pp. III–169, San Francisco,
CA, Mar. 23–26, 1992. IEEE.
[96] Qinfen Zheng and Rama Chellappa. A Computational Vision Approach to Image
Registration. IEEE Trans. Image Processing, 2(3):311–325, 1993.
[97] L. Teodosio and W. Bender. Salient video stills: Content and context preserved.
Proc. ACM Multimedia Conf., pp. 39–46, Aug. 1993.
[98] R. Szeliski and J. Coughlan. Hierarchical spline-based image registration.
Computer Vision Pattern Recognition, pp. 194–201, June 1994.
[99] George Wolberg. Digital Image Warping. IEEE Computer Society Press, Los
Alamitos, CA, 1990. IEEE Computer Society Press Monograph.
[100] G. Adiv. Determining 3D motion and structure from optical flow generated
by several moving objects. IEEE Trans. Pattern Anal. Machine Intell., PAMI-
7(4):384–401, July 1985.
[101] Nassir Navab and Steve Mann. Recovery of relative affine structure using the
motion flow field of a rigid planar patch. Mustererkennung 1994, Tagungsband., 5
186–196, 1994.
[102] R. Y. Tsai and T. S. Huang. Estimating three-dimensional motion parameters
of a rigid planar patch I. IEEE Trans. Acoust., Speech, and Sig. Proc.,
ASSP(29):1147–1152, Dec. 1981.
[103] Amnon Shashua and Nassir Navab. Relative affine: Theory and application to 3D
reconstruction from perspective views. Proc. IEEE Conf. on Computer Vision and
Pattern Recognition, Jun. 1994.
[104] H. S. Sawhney. Simplifying motion and structure analysis using planar parallax
and image warping. International Conference on Pattern Recognition, 1:403–908,
Oct. 1994. 12th IAPR.
[105] R. Kumar, P. Anandan, and K. Hanna. Shape recovery from multiple views: A
parallax based approach. ARPA Image Understanding Workshop, Nov. 10, 1994.
[106] Lee Campbell and Aaron Bobick. Correcting for radial lens distortion: A
simple implementation. TR 322, MIT Media Lab Perceptual Computing Section,
Cambridge, Apr. 1995.
[107] M. Artin. Algebra. Prentice-Hall, Englewood Cliffs, NJ, 1991.
[108] S. Mann. Wavelets and chirplets: Time-frequency perspectives, with applications.
In Petriu Archibald, ed., Advances in Machine Vision, Strategies, and Applications.
World Scientific, Singapore, 1992.
[109] L. V. Ahlfors. Complex Analysis, 3rd ed., McGraw-Hill, New York, 1979.
[110] R. Y. Tsai and T. S. Huang. Multiframe Image Restoration and Registration. Vol 1,
in Advances in Computer Vision and Image Processing 1984. pp. 317–339.
[111] T. S. Huang and A. N. Netravali. Motion and structure from feature correspon-
dences: A review. Proc. IEEE, 82(2):252–268, Feb. 1994.
[112] Nassir Navab and Amnon Shashua. Algebraic description of relative affine
structure: Connections to Euclidean, affine and projective structure. MIT Media
Lab Memo 270, Cambridge, MA., 1994.
[113] Harry L. Van Trees. Detection, Estimation, and Modulation Theory. Wiley, New
York, 1968, part I.
[114] A. Berthon. Operator Groups and Ambiguity Functions in Signal Processing. In
J. M. Combes, ed. Wavelets: Time-Frequency Methods and Phase Space. Springer
Verlag, Berlin, 1989.
[115] A. Grossmann and T. Paul. Wave functions on subgroups of the group of
affine canonical transformations. Resonances — Models and Phenomena, Springer-
Verlag, Berlin, 1984, pp. 128–138.
[116] R. K. Young. Wavelet Theory and its Applications. Kluwer Academic, Boston,
1993.
[117] Lora G. Weiss. Wavelets and wideband correlation processing. IEEE Sign. Process.
Mag., pp. 13–32, Jan. 1994.
[118] Steve Mann and Simon Haykin. Adaptive “chirplet” transform: An adaptive
generalization of the wavelet transform. Optical Eng., 31(6):1243–1256, June 1992.
[119] John Y. A. Wang and Edward H. Adelson. Spatio-temporal segmentation of video
data. In SPIE Image and Video Processing II, pp. 120–128, San Jose, CA, Feb.
7–9, 1994.
[120] J. Bergen, P. J. Burt, R. Hingorani, and S. Peleg. Computing two motions from
three frames. In Proc. Third Int. Conf. Comput. Vision, pp. 27–32, Osaka, Japan,
Dec. 1990.
[121] B. D. Lucas and T. Kanade. An iterative image-registration technique with an
application to stereo vision. Proc. 7th Int. Joint conf. on Art. Intell. In Image
Understanding Workshop, pp. 121–130, 1981.
[122] J. Y. A. Wang and Edward H. Adelson. Representing moving images with layers.
Image Process. Spec. Iss: Image Seq. Compression, 12(1):625–638, Sep. 1994.
[123] Roland Wilson and Goesta H. Granlund. The uncertainty principle in image
processing. IEEE Trans. on Patt. Anal. Mach. Intell., 6:758–767, Nov. 1984.
[124] J. Segman, J. Rubinstein, and Y. Y. Zeevi. The canonical coordinates method for
pattern deformation: Theoretical and computational considerations. IEEE Trans. on
Patt. Anal. Mach. Intell., 14(12):1171–1183, Dec. 1992.
[125] J. Segman. Fourier cross-correlation and invariance transformations for an optimal
recognition of functions deformed by affine groups. J. Optical Soc. Am., A, 9(6):
895–902, June 1992.
[126] J. Segman and W. Schempp. Two methods of incorporating scale in the Heisenberg
group. Journal of Mathematical Imaging and Vision special issue on wavelets, 1993.
[127] Bernd Girod and David Kuo. Direct estimation of displacement histograms. OSA
Meeting on Image Understanding and Machine Vision, June 1989.
[128] Yunlong Sheng, Claude Lejeune, and Henri H. Arsenault. Frequency-domain
Fourier-Mellin descriptors for invariant pattern recognition. Optical Eng.,
27(5):354–357, May 1988.
[129] S. Mann and R. W. Picard. Virtual bellows: Constructing high-quality images from
video. In Proc. IEEE First Int. Conf. on Image Processing, pp. 363–367, Austin,
TX, Nov. 13–16, 1994.
[130] Peter J. Burt and P. Anandan. Image stabilization by registration to a reference
mosaic. ARPA Image Understanding Workshop, Nov. 10, 1994.
[131] M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. J. Burt. Real-time scene
stabilization and mosaic construction. ARPA Image Understanding Workshop, Nov.
10, 1994.
[132] S. Intille. Computers watching football, 1995. http://www-white.media.mit.
edu/vismod/demos/football/football.html.
[133] R. Wilson, A. D. Calway, E. R. S. Pearson, and A. R. Davies. An introduc-
tion to the multiresolution Fourier transform. Technical Report, Depart-
ment of Computer Science, University of Warwick, Coventry, UK, 1992.
ftp://ftp.dcs.warwick.ac.uk/reports/rr-204/.
[134] A. D. Calway, H. Knutsson, and R. Wilson. Multiresolution estimation of 2-D
disparity using a frequency domain approach. In British Machine Vision Conference.
Springer-Verlag, Berlin, 1992, pp. 227–236.
[135] Seth Shulman. Owning the Future. Houghton Mifflin, Boston, 1999.
[136] Michel Foucault. Discipline and Punish. Pantheon, New York, 1977. Trans. from
Surveiller et punir.
[137] Natalie Angier. Woman, An Intimate Geography. Houghton Mifflin, Boston, 1999.
[138] Steve Mann. Reflectionism and diffusionism. Leonardo, 31(2):93–102, 1998.
http://wearcam.org/leonardo/index.htm.
[139] Garrett Hardin. The tragedy of the commons. Science, 162:1243–1248, 1968.
[140] Current list of references for keyers, chording keyboards, etc. http://wearcam.org/keyer_references.htm.
[141] Online list of links and resources for keyers, chording keyboards, etc.
http://about.eyetap.org/tech/keyers.shtml.
[142] J. R. Cooperstock, S. S. Fels, W. Buxton, and K. C. Smith. Reactive environments:
Throwing away your keyboard and mouse, 1997. http://www.csl.sony.co.jp/person/jer/pub/cacm/cacm.html.