3. Human activity recognition and understanding with multimodal sensing

  • Meeting Recording System via Multimodal Sensing
  • Automatically Acquiring Personal Preferences from TV Viewer's Behaviors



  • Meeting Recording System via Multimodal Sensing

      In order to record the vivid activities of human communication, multimedia meeting logs, which consist of audio/video data of the speakers and annotations of events, have been proposed.
      We propose a meeting recording system, called MMLogger, which supports the creation of multimedia meeting logs. MMLogger consists of a microphone array and an omnidirectional video camera. By placing it at the center of the meeting participants, we can make a faithful record of the meeting environment. The microphone array records the utterances of the speakers and is also used for estimating their directions. The omnidirectional video camera captures a 360-degree horizontal field of view, and by using the video image together with the estimated directions of the speakers, we can obtain frontal view images of the speakers.
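      As an illustration of the last step, the following sketch (our own, not necessarily how MMLogger implements it) maps a speaker direction estimated by the microphone array to a crop of the 360-degree panorama. It assumes an equirectangular image whose columns correspond linearly to azimuth angles from 0 to 360 degrees.

      import numpy as np

      def crop_frontal_view(panorama: np.ndarray, azimuth_deg: float,
                            fov_deg: float = 60.0) -> np.ndarray:
          """Cut a frontal-view slice out of a 360-degree panorama.

          panorama    : H x W x 3 image covering 360 degrees horizontally (assumed).
          azimuth_deg : speaker direction estimated by the microphone array.
          fov_deg     : angular width of the extracted frontal view.
          """
          _, w, _ = panorama.shape
          center_col = int(round((azimuth_deg % 360.0) / 360.0 * w))
          half_width = int(round(fov_deg / 360.0 * w / 2))
          # mode="wrap" handles views that straddle the 0/360-degree seam.
          cols = np.arange(center_col - half_width, center_col + half_width)
          return panorama.take(cols, axis=1, mode="wrap")

      # Example: dummy 512 x 2048 panorama, speaker estimated at 350 degrees.
      view = crop_frontal_view(np.zeros((512, 2048, 3), dtype=np.uint8), 350.0)
      print(view.shape)  # (512, 342, 3)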

    Figure 1. MMLogger
    Figure 2. Meeting using MMLogger



    Figure 3. MMLog



      Usually, the results and the progress of a meeting, such as what was decided and who said what, are summarized in a meeting log and recorded on paper-based media. Paper-based media are useful from the viewpoint of simplicity. However, they have a performance limitation when we want to make a precise record of the vivid activities of human communication and the environment during a meeting, such as the situation of the discussion, the behavior of the speakers (smiling, angry, confused, excited, etc.) and that of the other participants (nodding, agreeing, disagreeing, sleeping, etc.).
      Recently, in order to record such human activities during a meeting, an intelligent log system called a multimedia log system has been proposed. The multimedia log consists of two types of data: data in which the contents of the speakers' utterances are structured, and the audio/video data recorded at each event during the meeting. By linking these two types of data, a log user can observe the expressions and appearances of the speakers together with the contents of the meeting.
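      A minimal sketch of such a linked log is shown below; the record fields and clip references are our own assumptions for illustration, not a format defined by this work.

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class Utterance:
          speaker: str      # participant, identified from the estimated direction
          start_sec: float  # start time of the utterance in the recording
          end_sec: float    # end time of the utterance
          text: str         # structured content of the utterance ("who said what")
          video_ref: str    # pointer into the omnidirectional video recording
          audio_ref: str    # pointer into the microphone-array recording

      @dataclass
      class MultimediaMeetingLog:
          title: str
          participants: List[str]
          utterances: List[Utterance] = field(default_factory=list)

          def events_by(self, speaker: str) -> List[Utterance]:
              """Look up every logged event of one participant."""
              return [u for u in self.utterances if u.speaker == speaker]

      log = MultimediaMeetingLog("Weekly meeting", ["A", "B", "C"])
      log.utterances.append(Utterance("A", 12.4, 17.9, "Let's fix the release date.",
                                      "meeting.avi#t=12.4", "meeting.wav#t=12.4"))
      print(len(log.events_by("A")))  # 1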
      The meeting environment of our concern is a round-table meeting with a few participants. In implementing a meeting recording system, the following problems due to the sensing environment arise: a large-scale setup is required, even for a small-scale meeting, to sense the appearance of each participant, and it is difficult to capture the frontal view image of each speaker since the video cameras are located behind the participants to avoid disturbing the meeting. Therefore, it is very important to construct a meeting recording system that is free from these sensing environment problems.
      In order to overcome the above difficulties, we propose a meeting recording system using an omnidirectional video camera and a microphone array. They are located at the center of the table and record all the activities during a meeting. They are also used for estimating the directions of the speakers, and these estimates are exploited to reproduce the frontal view image of each speaker with high fidelity.
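      The direction estimation itself can be done with standard microphone-array techniques. The sketch below uses GCC-PHAT on a single microphone pair as an illustration (our own assumption, not necessarily the method adopted here); a circular array combines several such pairs to resolve a full 360-degree direction.

      import numpy as np

      def gcc_phat_doa(sig_a: np.ndarray, sig_b: np.ndarray,
                       fs: float, mic_distance: float, c: float = 343.0) -> float:
          """Bearing (degrees) of the dominant source for one microphone pair."""
          n = len(sig_a) + len(sig_b)
          spec = np.fft.rfft(sig_a, n) * np.conj(np.fft.rfft(sig_b, n))
          spec /= np.abs(spec) + 1e-12                  # PHAT weighting
          cc = np.fft.irfft(spec, n)
          max_lag = max(1, int(fs * mic_distance / c))  # physically possible lags
          cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
          tau = (np.argmax(np.abs(cc)) - max_lag) / fs  # time difference of arrival
          sin_theta = np.clip(tau * c / mic_distance, -1.0, 1.0)
          return float(np.degrees(np.arcsin(sin_theta)))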


          Figure 1. System


          Figure 2. Example images




    Automatically Acquiring Personal Preferences from TV Viewer's Behaviors

      The importance of information services that take personal preferences into account is increasing. In this research, we propose a system for automatically acquiring personal preferences from a TV viewer's behaviors sensed by cameras and microphones. We expect the system to acquire personal preferences on a per-scene basis. For example, if a user watches a baseball game, the system infers that he or she likes baseball. Furthermore, from his or her behaviors and the program metadata, it can infer that he or she was interested in a scene featuring a particular player.
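      A minimal sketch of this idea is given below, with assumed data and weights rather than the actual system: metadata keywords of watched scenes are accumulated into a preference profile, and scenes that drew a visible reaction count more heavily.

      from collections import Counter
      from typing import Dict, List, Tuple

      def build_preference_profile(
              watched_scenes: List[Tuple[List[str], bool]]) -> Dict[str, float]:
          """watched_scenes: (metadata keywords of the scene, viewer showed interest)."""
          profile: Counter = Counter()
          for keywords, interested in watched_scenes:
              weight = 2.0 if interested else 1.0  # reactions count double (assumption)
              for kw in keywords:
                  profile[kw] += weight
          total = sum(profile.values()) or 1.0
          return {kw: score / total for kw, score in profile.items()}

      # Example: two baseball scenes, one featuring "Player X" that drew a reaction.
      prefs = build_preference_profile([(["baseball"], False),
                                        (["baseball", "Player X"], True)])
      print(prefs)  # "baseball" dominates; "Player X" also receives weight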


                       Figure. System



    TV Viewing Interval Estimation for Personal Preference Acquisition

      Since there can be several people in front of the TV, identifying each user is necessary to acquire their personal preferences individually. Normally, at home, where the proposed system is expected to be used, the users are limited to specific people such as family members. Therefore, users can be registered beforehand to create face models for identification. However, a user might not be interested in the TV even when he or she is in front of it. Therefore, "when" a user is watching TV is considered to be highly related to the user's preferences. Here, assuming that users face the TV when they are watching it, we estimate their TV viewing intervals by identifying only the users who are facing the TV.
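      A minimal sketch of this step, under our own assumptions about the detector output, merges per-frame identifications of registered users facing the TV into viewing intervals, tolerating short gaps such as blinks or brief occlusions.

      from typing import Dict, List, Set, Tuple

      def viewing_intervals(frames: List[Set[str]], fps: float,
                            max_gap_sec: float = 2.0) -> Dict[str, List[Tuple[float, float]]]:
          """frames[i] = set of registered users identified as facing the TV in frame i."""
          max_gap = int(max_gap_sec * fps)
          intervals: Dict[str, List[Tuple[float, float]]] = {}
          open_start: Dict[str, int] = {}  # user -> frame where the current interval began
          last_seen: Dict[str, int] = {}   # user -> last frame the user faced the TV
          for i, users in enumerate(frames):
              for u in users:
                  if u not in open_start or i - last_seen[u] > max_gap:
                      if u in open_start:  # close the previous interval
                          intervals.setdefault(u, []).append(
                              (open_start[u] / fps, last_seen[u] / fps))
                      open_start[u] = i
                  last_seen[u] = i
          for u, start in open_start.items():
              intervals.setdefault(u, []).append((start / fps, last_seen[u] / fps))
          return intervals

      # Example at 1 fps: "mother" faces the TV in frames 0-2 and 10-11.
      frames = [{"mother"}] * 3 + [set()] * 7 + [{"mother"}] * 2
      print(viewing_intervals(frames, fps=1.0))  # {'mother': [(0.0, 2.0), (10.0, 11.0)]}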



    User's Intervals of Interest

      In our research, we estimate when a user was interested by focusing on temporal patterns of facial changes.
      We extract facial feature points assigned to facial parts from videos and, from these feature points, calculate feature quantities that express changes of the face (Figure 1). We then estimate the user's intervals of interest by using Hidden Markov Models to learn, from sample videos, the temporal patterns of feature-quantity changes that appear when the user is interested (Figure 2).
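      The sketch below illustrates this HMM step, assuming the hmmlearn library; the two-model comparison, the number of states, and the window length are our own illustrative choices, not the settings used in this work.

      import numpy as np
      from hmmlearn import hmm

      def train_model(sequences, n_states: int = 3) -> hmm.GaussianHMM:
          """sequences: list of (T_i x D) arrays of per-frame facial feature quantities."""
          model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
          model.fit(np.vstack(sequences), lengths=[len(s) for s in sequences])
          return model

      def interest_mask(features: np.ndarray, interested: hmm.GaussianHMM,
                        neutral: hmm.GaussianHMM, win: int = 30) -> np.ndarray:
          """True for each window of `features` that the 'interested' model explains better."""
          n_win = len(features) // win
          mask = np.zeros(n_win, dtype=bool)
          for k in range(n_win):
              seg = features[k * win:(k + 1) * win]
              mask[k] = interested.score(seg) > neutral.score(seg)
          return mask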
    Fig. 1. Facial feature points and feature quantities

    Fig. 2. Flow of interval-of-interest estimation

