Research Programme 4 (RP4)

RP4 - Contextual Content Analysis
Programme Leaders: Prof. Noel O'Connor and Prof. Alan Smeaton

Context and Objectives

Sensed data in its raw unprocessed state is often worse than useless in any real application scenario. In fact, gathering such data merely adds to the pollution of a virtual world already suffering from overwhelming information overload. Thus, it is necessary to process the raw sensed data and extract only the relevant content it carries, that is, to extract and even discover semantic information in an application-centric manner (i.e. taking cognisance of the manner in which the semantic information will eventually be used). As the sensing modality becomes more sophisticated, it becomes increasingly difficult to extract useful semantics. For instance, while it is relatively straightforward to map the signal from a simple point temperature sensor to useful content-rich characterisations (such as “hot” and “cold” and gradations thereof), such an extraction task becomes much more challenging when we consider any of the spectrum of sensing modalities, from text, to multi-media, to ‘virtual’ sensors of higher-level user intent, that we address in CLARITY. Sensor networks composed of combinations of sensing modalities of various levels of sophistication only exacerbate this. The objective of this RP is to extend current capabilities for mining context information from sensed data with a view to leveraging this to assist in identifying semantic information. Context mining will be enabled by considering a capture mechanism augmented with multiple sensing modalities that provide either useful analysis constraints or reinforcement of unimodal analysis. Key ancillary objectives include developing fusion frameworks that handle multiple, potentially conflicting data sources, and extending the use of pattern recognition techniques to model the desired semantics using non-traditional sensor input in a manner that can be easily configured to different application scenarios.

This RP will provide output in the form of newly discovered semantic information to WP5.1 (Using User and Data Contexts for Retrieval). Primary input will be the aggregated, filtered, harmonised data provided by WP3.2 (Sensor Aggregation). In turn, outputs of WP4.2 will feed back to WP3.1 (Adaptive Capture & Filtering) to help inform subsequent data gathering. Sensors used to augment capture will include the inertial wireless sensors from WP2.1 (Nodes and Networks) and the human location, posture and movement sensors of WP2.3 (Body Sensor Networks). We will also use the profiles determined in WP6.1 (Profiling Preferences, Activities and Context) and the recommendations provided by WP6.2 (Recommendation and Collaboration) as a means to focus subsequent analysis in a predictive manner. This work will not depend upon a battery of tightly integrated novel capture devices/platforms, but research in WP1.4 (Configurable and Energy Aware Hardware) will be informed by the outputs of this RP. An over-arching goal shared by these tasks is to eventually realise an architecture for a wireless media sensing platform, a media mote. In turn, WP1.4 will work with WP2.1 to produce at least one such prototype platform late in the centre’s lifetime.

Work Packages

WP 4.1: Content Processing for Extracting Context & Semantics

This WP will leverage multi-modal capture frameworks and develop a suite of content analysis techniques for the specific application scenarios as dictated by the three demonstrators. Sensed content will feature a variety of sensor modalities from simple point and location/movement sensors to more sophisticated modalities such as text, audio and multi-spectral video distributed both spatially and temporally. Work will focus on extending existing content analysis techniques to incorporate multiple modalities, including non-content-based sources, in formal frameworks. An illustrative example is using location, direction and time sensors to assist visual analysis to understand where a user is located and what he/she is viewing in an active gaming context. Another example is understanding the content of images/video in a shared photo slideshow based on the profiles and interactions of the people involved. Yet another example is using multiple cameras of varying capabilities in conjunction with an activity profile to both focus and constrain audio-visual identification, tracking and activity recognition in an ambient assisted living or environmental monitoring application.

WP 4.2: Object, Event and Activity Analysis and Modeling

The reason for extracting user or application context and semantics from sensed data is to use this to help understand the real-world objects, events and activities represented. Detecting and characterising such material real-world things is typically the motivation for instantiating the sensor network in the first place. The actual objects, events and activities to be detected, characterised and recognised are application dependent and will be driven by the CLARITY demonstrators. The aim of this WP is to develop fully automatic approaches to characterising these, leveraging context to constrain analysis and thereby make the problem tractable. An example in a home monitoring or active gaming environment is determining the type of movement (running, jumping, falling, crawling) being carried out. Another example is using multiple views, captured in tandem with location and inertial sensed data, together with image modeling techniques to construct rudimentary 3D scene and object models for later indexing without the need for scene markers. This work builds naturally upon the work of WP4.1 via an iterative research effort between the two work packages. WP4.1 produces a suite of content analysis techniques, the outputs of which can be used to contextualise content-based pattern recognition methodologies in this WP.

Novelty

It is becoming accepted that a key enabler for extracting semantics is an understanding of context, relating either to the user (his/her information need, the data access channel, and virtual or physical constraints) or to the application (the time, date, location, environment and the circumstances that form the setting for why the data was gathered in the first place). This RP will develop analysis techniques that leverage complementary benefits of different sensor types to extract user/application context and from this a level of semantic meaning. This idea is not novel in itself. However, we believe that innovation is now possible in terms of the level of semantic meaning that can actually be extracted in challenging application scenarios. This is due in part to the recent advances in text, audio and visual analysis and the ongoing trend of research convergence between these fields, as described in 4.1.4. In this RP, we can avail of these advances, but innovate further by considering a change to the data capture paradigm itself. That is, we can consider augmenting sensing traditionally carried out in a uni-modal manner with other sensing modalities with a view to making traditionally ill-posed analysis problems more tractable. Good examples of work in this direction are [1][2], where the image capture mechanism is changed to facilitate better scene understanding in a variety of application scenarios. Another example is the work of [3], which looks at combining inertial and movement sensing with video capture for augmented reality applications. We ourselves have already shown how the combination of multiple visual sensing modalities can make inherently ill-posed problems such as object segmentation and tracking in unconstrained environments more robust [4]. Similarly, using the Microsoft SenseCam, we have leveraged non-visual sources to assist in scene recognition for determining user context, and in turn used this to perform content structuring based on semantic events [5][6].

CLARITY represents a unique opportunity to build upon this preliminary work by providing access to a range of sensing modalities and sensor platforms not available otherwise, whilst simultaneously providing the application and end-user motivation to inform the content analysis task. Taking as its starting point existing proven content analysis techniques, the novelty in WP4.1 will be manifested in re-designing these to use additional sensor data. We will leverage and extend proven formal frameworks for combining sensor evidence. One candidate that has already proven useful in our work on syntactic image segmentation is Smets’ Transferable Belief Model [7]. WP4.2 will develop models for objects, events and activities that incorporate the results of WP4.1 and other physical/virtual sensor data, thereby extending traditional approaches to pattern analysis that have been successfully used for abstraction, indexing and retrieval of multimedia [8] (as hinted at in [9]). A further novel aspect of this work will be factoring adaptation into the design of the analysis techniques to facilitate feedback from other layers in the sensor network, e.g. predictive profiles and recommendations from the personalization layer, or data gathering constraints dictated by the management of sensor communities layer.
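
As an illustration of the kind of evidence combination referred to above, the following is a minimal sketch of the unnormalised conjunctive rule used in the Transferable Belief Model, followed by the pignistic transformation for decision making. It is not the implementation used in our work: the activity labels, the two sensors and all mass values are hypothetical, chosen only to show how conflicting sources are handled.

```python
from itertools import product

# Hypothetical frame of discernment: candidate activities.
FRAME = frozenset({"running", "walking", "falling"})


def conjunctive_combine(m1, m2):
    """Unnormalised conjunctive rule: products of masses are assigned to
    the intersection of the focal sets. Conflicting evidence ends up on
    the empty set instead of being normalised away."""
    combined = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        key = a & b
        combined[key] = combined.get(key, 0.0) + wa * wb
    return combined


def pignistic(m):
    """Pignistic transformation: share each focal set's mass equally among
    its elements, rescaled by the mass not assigned to the empty set."""
    conflict = m.get(frozenset(), 0.0)
    if conflict >= 1.0:
        raise ValueError("total conflict: no decision possible")
    bet = {}
    for focal, w in m.items():
        for hypothesis in focal:  # the empty set contributes nothing
            share = w / (len(focal) * (1.0 - conflict))
            bet[hypothesis] = bet.get(hypothesis, 0.0) + share
    return bet


# Assumed (made-up) basic belief assignments from two sensing modalities.
video = {
    frozenset({"running"}): 0.6,
    frozenset({"running", "walking"}): 0.3,
    FRAME: 0.1,  # mass on the whole frame expresses ignorance
}
inertial = {
    frozenset({"falling"}): 0.5,
    frozenset({"running", "falling"}): 0.3,
    FRAME: 0.2,
}

fused = conjunctive_combine(video, inertial)
bet = pignistic(fused)
decision = max(bet, key=bet.get)
print("conflict mass:", fused[frozenset()])
print("decision:", decision)
```

The mass accumulated on the empty set gives a direct measure of how strongly the two sources disagree, which is one reason such a framework is attractive when fusing potentially conflicting sensing modalities.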

References:

[1] R. Raskar, T.-H. Tan, R. Feris, J. Yu, and M. Turk, “Non-photorealistic camera: Depth edge detection and stylized rendering using multi-flash imaging,” ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 679–688, 2004.
[2] R. Raskar, P. Beardsley, P. Dietz, and J. van Baar, “Tags are coming: Exploiting photosensing wireless tags for assisting geometric procedures,” in CVIIE, pp. 145–150, 2005.
[3] J. Hol, T. Schön, F. Gustafsson, and P. Slycke, “Sensor fusion for augmented reality,” in Proceedings of FUSION 2006, July 2006.
[4] C. O’Conaire, N. E. O’Connor, and A. F. Smeaton, “Thermo-visual feature fusion for object tracking using multiple spatiogram trackers,” Machine Vision and Applications (in press), 2007.
[5] C. O’Conaire, N. E. O’Connor, A. F. Smeaton, and G. J. Jones, “Organising a daily visual diary using multi-feature clustering,” in SPIE Electronic Imaging - Multimedia Content Access: Algorithms and Systems (EI121), 2007.
[6] M. Blighe, H. Le Borgne, N. O’Connor, A. F. Smeaton, and G. J. Jones, “Exploiting Context Information to aid Landmark Detection in SenseCam Images,” in ECHISE 2006 - 2nd International Workshop on Exploiting Context Histories in Smart Environments - Infrastructures and Design, 8th International Conference of Ubiquitous Computing (Ubicomp 2006), 2006.
[7] P. Smets, “The combination of evidence in the transferable belief model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 5, pp. 447–458, 1990.
[8] S. Antani, R. Kasturi, and R. Jain, “A survey on the use of pattern recognition methods for abstraction, indexing and retrieval of images and video,” Pattern Recognition, vol. 35, pp. 945–965, 2002.
[9] W. Hu, T. Tan, L. Wang, and S. Maybank, “A survey on visual surveillance of object motion and behaviors,” IEEE Transactions on Systems, Man and Cybernetics – Part C: Applications and Reviews, vol. 34, no. 3, August 2004.