This paper presents a low-cost, wearable headset for 3D Point of Gaze (PoG) estimation in assistive applications. The device consists of an eye tracking camera and a forward-facing RGB-D scene camera which, together, provide an estimate of the user's gaze vector and its intersection with a 3D point in space. The resulting system computes the 3D PoG in real time using inexpensive and readily available hardware components.

Eye gaze interaction has been shown to be beneficial to people with physical disabilities. In the case study presented in [4], 16 amyotrophic lateral sclerosis (ALS) patients with severe motor disabilities (loss of mobility, inability to speak, etc.) were introduced to eye tracking devices over a 1-2 week period. Several patients reported a clear positive impact on their quality of life, resulting from the enhanced communication facilitated by the eye tracking devices. While the utility of gaze interaction has been demonstrated, existing head-mounted eye gaze systems suffer from limiting constraints. In general, they are designed for interaction with fixed computer displays or 2D scene images, and their 2D PoG does not directly translate into the 3D world. An accurate estimate of the user's 3D PoG within an environment is clearly useful, as it can be used to detect user attention and intention to interact [2]. Furthermore, existing systems tend to lack mobility, and the mobile 3D PoG tracking systems that have been proposed in the literature suffer from their own limitations. The head-mounted multi-camera system presented in [5], for example, gives the 3D PoG relative to the user's frame of reference, but does not map this point to the user's environment. Finally, the high monetary cost and proprietary nature of commercial eye tracking equipment limit widespread use, which has led to the development of low-cost solutions.

We propose a novel head-mounted system that addresses the limitations of current solutions. First, an eye tracking camera is used to estimate the 2D PoG. An inexpensive RGB-D scene camera then acquires a 3D representation of the environment structure. Finally, we provide a process by which the 2D PoG is transformed into 3D coordinates. This work focuses on the 3D PoG estimation method we describe in [1], for use in applications beyond the object-of-interest detection and recognition discussed therein.

Eye tracking is accomplished using an embedded USB camera module equipped with an infrared-pass filter. The user's eye is illuminated with a single infrared LED to provide consistent image data under varying ambient lighting conditions. The LED also produces a corneal reflection on the user's eye, which is visible to the camera and is exploited to enhance tracking accuracy. The LED was chosen according to the guidelines discussed by Mulvey et al. (2008) to ensure that the device can be used safely for indefinite periods of time. The image resolution of 640×480 pixels and frame rate of 30 Hz facilitate accurate tracking of the pupil and corneal reflection using image processing techniques, which are further discussed in Section 4. The eye camera is positioned such that the image frame is centered in front of one of the user's eyes. The module can be moved from one side of the headset frame to the other so that either eye can be used (to take advantage of user preference or eye dominance), while fine adjustments to the camera position and orientation are possible by manipulating the flexible mounting arm.
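To make the eye-image processing concrete, the sketch below shows one way the dark pupil and the bright corneal reflection could be localized in a single infrared frame using thresholding, morphology, and ellipse fitting with OpenCV. This is a simplified stand-in for the modified Starburst pipeline described below, not our implementation; all thresholds and size limits are illustrative assumptions.

```python
import cv2
import numpy as np

def detect_pupil_and_glint(eye_gray):
    """Estimate the pupil center and corneal-reflection (glint) center in a
    grayscale infrared eye image. Thresholds and size limits are illustrative
    assumptions and would need tuning for a specific camera and LED setup."""
    # Pupil: a dark blob under IR illumination. Threshold low intensities,
    # clean up with morphology, and fit an ellipse to the largest contour.
    _, pupil_mask = cv2.threshold(eye_gray, 40, 255, cv2.THRESH_BINARY_INV)
    pupil_mask = cv2.morphologyEx(pupil_mask, cv2.MORPH_OPEN,
                                  np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(pupil_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)  # OpenCV >= 4
    pupil_center = None
    candidates = [c for c in contours
                  if cv2.contourArea(c) > 200 and len(c) >= 5]
    if candidates:
        # fitEllipse returns ((cx, cy), (major, minor), angle).
        pupil_center = cv2.fitEllipse(max(candidates, key=cv2.contourArea))[0]

    # Corneal reflection: a small bright spot produced by the IR LED.
    _, glint_mask = cv2.threshold(eye_gray, 220, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(glint_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    glint_center = None
    if contours:
        m = cv2.moments(max(contours, key=cv2.contourArea))
        if m["m00"] > 0:
            glint_center = (m["m10"] / m["m00"], m["m01"] / m["m00"])

    return pupil_center, glint_center
```

The vector from the pupil center to the glint center is the difference vector used for gaze interpolation, as described next.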
Information about the environment in front of the user is provided by a forward-facing RGB-D camera, the Asus XtionPRO Live. This device provides a 640×480 color image of the environment along with a 640×480 depth image at a rate of 30 Hz. The two images are obtained from separate imaging sensors and registered by the device such that each color pixel is assigned actual 3D coordinates in space. This provides a complete scanning solution for the environment in the form of 3D "point clouds", which can be further processed in software. The completed headset is shown in Figure 1.

An estimate of the user's PoG is computed using a modified version of the Starburst algorithm presented in [7]. This algorithm creates a mapping between pupil positions and 2D scene image coordinates after a nine-point calibration procedure is performed. During the pupil detection phase of the algorithm, an ellipse is fitted to the pupil such that the ellipse center provides an accurate estimate of the pupil center. The center of the infrared corneal reflection is detected during the next phase of the algorithm and compared to the pupil center to obtain a difference vector. This difference vector is then used to interpolate the 2D PoG in the scene camera image, as shown in Figure 2. The 3D PoG is obtained from the 2D point by looking up the 3D coordinates of that pixel in the point cloud data structure provided by the RGB-D camera; a minimal sketch of this two-step mapping is given at the end of this section. Exploiting the RGB-D point cloud removes the need for the stereo eye tracking used in [5, 6] during 3D PoG estimation.

The resulting headset provides designers of assistive systems with valuable information on user intent, and the low-cost approach will enable the inclusion of 3D PoG in a wide variety of applications. Future work will explore the use of 3D PoG for control of electric wheelchairs and service robots in assistive environments.
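As a concrete illustration of the two-step mapping referenced above, the sketch below fits a second-order polynomial from pupil-corneal reflection difference vectors to scene image coordinates using the nine calibration samples, interpolates the 2D PoG for a new difference vector, and then reads the corresponding 3D point out of the organized point cloud. The polynomial basis, array layout, and NaN handling are assumptions made for illustration rather than details taken from the Starburst implementation in [7].

```python
import numpy as np

def fit_calibration(diff_vectors, scene_points):
    """Fit mapping coefficients (scene_x, scene_y) = f(dx, dy) from the nine
    calibration samples. Both inputs are (9, 2) arrays; the second-order
    polynomial basis below is an illustrative choice."""
    dx, dy = diff_vectors[:, 0], diff_vectors[:, 1]
    basis = np.column_stack([np.ones_like(dx), dx, dy, dx * dy, dx**2, dy**2])
    coeffs, *_ = np.linalg.lstsq(basis, scene_points, rcond=None)
    return coeffs                                   # shape (6, 2)

def gaze_to_3d(diff_vector, coeffs, cloud, width=640, height=480):
    """Interpolate the 2D PoG in the scene image, then look up its 3D
    coordinates in the organized point cloud (height x width x 3, with NaN
    where no depth was measured -- an assumed layout)."""
    dx, dy = diff_vector
    basis = np.array([1.0, dx, dy, dx * dy, dx**2, dy**2])
    u, v = basis @ coeffs                           # 2D PoG in scene pixels
    u = int(np.clip(round(u), 0, width - 1))
    v = int(np.clip(round(v), 0, height - 1))
    point = cloud[v, u]                             # 3D PoG candidate
    return None if np.isnan(point).any() else point
```

A real implementation would also need to handle pixels for which the depth sensor returns no measurement, for example by searching a small neighborhood around the gazed pixel for the nearest valid depth value.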