Unified Architecture for Auditory Scene Analysis and Spoken Language Processing

Tomohiro Nakatani, Takeshi Kawabata, and Hiroshi G. Okuno (NTT Basic Research Laboratories)

We propose a unified architecture for auditory scene analysis (ASA) and spoken language processing (SLP). The unified system is expected to provide a robust and friendly human-computer interface in real acoustic environments. Because the number of auditory streams can be enormous, an adaptive stream attention mechanism is required. We consider that adaptive understanding (behavior) emerges from competing goals in a multi-agent system, and that such a behavior-based multi-agent system can explain various kinds of activities in human communication. In this paper, as the first stage of this approach, we design and implement a multi-agent-based stream segregation system. The system dynamically generates stream segregation agents, which extract auditory streams incrementally. As cues for segregation, these agents use only simple sound attributes: harmonics and average spectral intensity. To resolve stream interference, each agent communicates with the other agents through signal subtraction and modification of a common threshold. The resulting system, as a whole, segregates streams adaptively from the mixed sound. Experimental results show that the system can segregate two simultaneous voices effectively even under noisy conditions.
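
The abstract does not give implementation details, so the following Python sketch is only a rough, assumed illustration of the kind of processing described: agents that each track a harmonic stream, communicate by subtracting their estimated signals from a shared residual, and are spawned when the residual intensity exceeds a common threshold. The names (StreamAgent, segregate, estimate_pitch), the least-squares harmonic fit, and the autocorrelation pitch seed are all hypothetical choices for illustration, not the authors' algorithm.

```python
import numpy as np

class StreamAgent:
    """Tracks one harmonic stream: a fundamental frequency and its partials.
    (Hypothetical agent model; the paper's agents are not specified here.)"""
    def __init__(self, f0, n_partials=5, sample_rate=16000):
        self.f0 = f0
        self.n_partials = n_partials
        self.sr = sample_rate

    def extract(self, frame, t0):
        """Estimate this agent's harmonic component in the frame by a
        least-squares fit of sinusoids at multiples of f0."""
        n = len(frame)
        t = (t0 + np.arange(n)) / self.sr
        basis = []
        for k in range(1, self.n_partials + 1):
            basis.append(np.cos(2 * np.pi * k * self.f0 * t))
            basis.append(np.sin(2 * np.pi * k * self.f0 * t))
        B = np.stack(basis, axis=1)
        coeffs, *_ = np.linalg.lstsq(B, frame, rcond=None)
        return B @ coeffs


def estimate_pitch(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Crude autocorrelation pitch estimate used to seed a new agent."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    if hi >= len(ac) or np.max(ac[lo:hi]) <= 0:
        return None
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sample_rate / lag


def segregate(mixture, frame_len=512, sample_rate=16000, threshold=0.02):
    """Toy incremental segregation loop: per frame, existing agents subtract
    their streams from a shared residual; if the residual intensity still
    exceeds the common threshold, a new agent is spawned on the residual's
    strongest pitch."""
    agents, streams = [], {}
    for start in range(0, len(mixture) - frame_len + 1, frame_len):
        frame = mixture[start:start + frame_len]
        residual = frame.copy()
        # "Signal subtraction" style of inter-agent communication: each agent
        # removes its estimated stream before the next agent sees the frame.
        for i, agent in enumerate(agents):
            est = agent.extract(residual, start)
            streams.setdefault(i, []).append(est)
            residual = residual - est
        # Common-threshold check: leftover energy may indicate a new stream.
        if np.mean(residual ** 2) > threshold:
            f0 = estimate_pitch(residual, sample_rate)
            if f0 is not None:
                agents.append(StreamAgent(f0, sample_rate=sample_rate))
    return streams
```

As a usage sketch, calling segregate on a mixture of two synthetic harmonic tones returns per-agent lists of frame-wise stream estimates; in this toy version the shared threshold is fixed, whereas the paper describes the agents modifying a common threshold to resolve interference.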