Sound-induced flash illusion as an optimal percept

Ladan Shams (a,b), Wei Ji Ma (b) and Ulrik Beierholm (c)

(a) Department of Psychology, University of California Los Angeles, Los Angeles; (b) Division of Biology and (c) Computation and Neural Systems, California Institute of Technology, Pasadena, California, USA.

Correspondence and requests for reprints to Ladan Shams, PhD, Department of Psychology, University of California Los Angeles, Franz Hall 7445B, Los Angeles, CA 90095-1563, USA. E-mail: ladan@psych.ucla.edu

Sponsorship: W.J.M. was supported by the Swartz Foundation and the Netherlands Organization for Scientific Research. U.B. was supported by the David and Lucile Packard Foundation.

Received 3 August 2005; accepted 21 September 2005

Recently, it has been shown that visual perception can be radically altered by signals of other modalities. For example, when a single flash is accompanied by multiple auditory beeps, it is often perceived as multiple flashes. This effect is known as the sound-induced flash illusion. In order to investigate the principles underlying this illusion, we developed an ideal observer (derived using Bayes' rule) and compared human judgements with those of the ideal observer for this task. The human observers' performance was highly consistent with that of the ideal observer in all conditions, ranging from no interaction, to partial integration, to complete integration, suggesting that the rule used by the nervous system to decide when and how to combine auditory and visual signals is statistically optimal. Our findings show that the sound-induced flash illusion is an epiphenomenon of this general, statistically optimal strategy. NeuroReport 16:1923-1927. (c) 2005 Lippincott Williams & Wilkins.

Keywords: auditory-visual perception, Bayesian inference, cross-modal illusion, cue combination, ideal observer, multisensory integration, multisensory perception, sound-induced flash illusion

Introduction

Situations in which an individual is exposed to sensory signals in only one modality are the exception rather than the rule.
At any given instant, the brain is typically engaged in processing sensory stimuli from two or more modalities, and in order to achieve a coherent and ecologically valid perception of the physical world, it must determine which of these temporally coincident sensory signals are caused by the same physical source/event and thus should be integrated into a single percept. Spatial coincidence of the stimuli is not a very strong or exclusive determinant of cross-modal binding: information from two modalities may get seamlessly bound together despite large spatial inconsistencies (e.g. the ventriloquism effect), while spatially concordant stimuli may be perceived as separate entities (e.g. someone speaking behind a screen does not lead to the binding of the voice with the screen). This is not surprising, considering the relatively poor spatial resolution of the auditory, olfactory, and somatosensory modalities. The degree of consistency between the information conveyed by two sensory signals, on the other hand, is clearly an important factor in determining whether the cross-modal signals are to be integrated or segregated.

Previous models of cue combination [1-9] have all focused exclusively on conditions in which the signals of the different modalities get completely integrated (or appear to do so because the employed paradigms force participants to report only one percept, thus not revealing any potential conflict in percepts). Therefore, the previous models are unable to account for the vast number of situations in which the signals do not get integrated, or only partially integrate.

The sound-induced flash illusion [10,11] is a psychophysical paradigm in which both integration and segregation of auditory-visual signals occur, depending on the stimulus condition. When one flash is accompanied by one beep (i.e. when there is no discrepancy between the signals), the single flash and single beep appear to originate from the same source, and are completely fused.
When one flash is accompanied by four beeps (i.e. when the discrepancy is large), however, they are most often perceived as emanating from two separate events, and the two signals are segregated; that is, a single flash and four beeps are perceived. If the single flash is accompanied by two beeps (i.e. when the discrepancy is small), the single flash is often perceived as two flashes, and on these illusion trials the flashes and beeps are perceived as having originated from the same source; that is, integration occurs in a large fraction of trials. When a single flash is accompanied by three beeps, on a fraction of trials the single flash is perceived as two flashes while the three beeps are perceived veridically. These trials exemplify conditions of partial integration, in which the visual and/or auditory percepts are shifted towards each other but do not converge.

Therefore, the sound-induced flash illusion offers a paradigm encompassing the entire spectrum of bisensory situations. As signals are not always completely integrated, previous models of cross-modal integration cannot account for these effects. We therefore developed a new model in order to account for situations of segregation and partial integration, as well as complete integration. The model is an ideal observer and, in contrast to previous models of cue combination, it does not assume one source for all the sensory signals (which would enforce integration); instead, it assumes one source for the signal in each modality. The sources, however, are not taken to be statistically independent, and therefore the model allows inferences both about cases in which separate entities have caused the sensory signals and about cases in which the sensory signals are caused by one source.
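The generative structure just described, with one cause per modality linked by a possibly correlated joint prior, can be sketched as a sampling model. All numbers below are illustrative assumptions, not the paper's fitted values, and the discrete noise model is a stand-in for whatever noise actually corrupts the signals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint prior over the causes Z_A, Z_V (0-4 events each).
# Extra mass on the diagonal encodes the belief that the two signals
# often share a common source; the numbers are made up.
n = 5
prior = np.eye(n) * 0.12 + 0.02        # rows: Z_A, columns: Z_V
prior /= prior.sum()                   # normalize to a probability table

def sample_trial():
    """Draw one trial from the generative model: first the pair of causes
    from the joint prior, then each noisy signal from its own cause."""
    flat = rng.choice(n * n, p=prior.ravel())
    z_a, z_v = divmod(flat, n)
    # Conditionally independent noisy observations (illustrative noise):
    a = np.clip(z_a + rng.integers(-1, 2), 0, n - 1)   # auditory signal
    v = np.clip(z_v + rng.integers(-1, 2), 0, n - 1)   # visual signal
    return z_a, z_v, a, v
```

Because A depends on Z_V only through Z_A (and vice versa), the sampler draws each signal from its own cause alone, which is exactly the conditional-independence assumption discussed below.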
The model uses Bayes' rule to make inferences about the causes of the various sensory signals.

We presented observers with varying combinations of beeps and flashes, and asked them to report the perceived number of flashes and beeps in each trial. We then compared the human judgements with those of the ideal observer.

Materials and methods

Stimuli

The visual stimulus consisted of a uniform white disk subtending 1.5° of the visual field at 12° eccentricity below the fixation point (Fig. 1a), flashed for 10 ms on a black computer screen 1-4 times. The auditory stimulus was a 10-ms-long beep with 80 dB sound pressure level and 3.5 kHz frequency, also presented 0-4 times. A factorial design was used in which all combinations of 0-4 flashes and 0-4 beeps (except for the no flash-no beep combination) were presented, leading to a total of 24 conditions. The stimulus onset asynchronies (SOAs) of flashes and beeps were 70 and 58 ms, respectively (Fig. 1b). These specific SOAs were chosen because of certain constraints (e.g. frame rate, obtaining a strong illusion in the illusion conditions, and keeping the smallest sound SOA consistently above the flutter fusion threshold). The behavioral data are fairly robust to the exact visual and auditory SOAs. The relative timing of the flashes and beeps was set such that the centers of the flash and beep sequences were synchronous, in order to maximize the temporal overlap between the two stimuli. Sound was presented from two speakers placed adjacent to the two sides of the computer monitor, at the height at which the visual stimulus was presented, thus localizing at the same location as the visual stimulus.

Procedure

Ten naive observers participated in the experiment. Observers sat at a viewing distance of 57 cm from the computer screen and speakers. Throughout the trials, there was a constant fixation point at the center of the screen.
The observers' task was to judge both the number of flashes seen and the number of beeps heard after each trial (these reports provide P(Z_A, Z_V | A, V) as described below). The experiment consisted of 20 trials of each condition, amounting to a total of 480 trials, ordered randomly. A brief rest interval was given after every third trial of the experiment.

The ideal observer model

We assume that the auditory and visual signals are statistically independent given the auditory and visual causes (see Fig. 2). This is a common assumption, motivated by the hypothesis that the noise processes that corrupt the auditory and visual signals are independent. This conditional independence means that if the causes are known, knowledge about V provides no information about A, and vice versa, as the noises corrupting the two signals are independent. If the causes are not known, on the other hand, knowledge of V does provide information about A, and vice versa [12].

The information about the likelihood of sensory signal A occurring, given an auditory cause Z_A, is captured by the probability distribution P(A | Z_A). Similarly, P(V | Z_V) represents the likelihood of sensory signal V given a source Z_V in the physical world. The prior P(Z_A, Z_V) denotes the perceptual knowledge of the observer about auditory-visual events in the environment. In addition to the observer's experience, the priors may also reflect hard-wired biases imposed by the physiology and anatomy of the brain (e.g. the pattern of interconnectivity between the sensory areas [13,14]), as well as biases imposed by the task, the observer's state, etc.

The graph in Fig. 2 [15] illustrates the two key features of the model. First, there are two sources, Z_A and Z_V, for the two sensory signals A and V. This allows inference both in cases in which the signals A and V are caused by the same source and in cases in which they are caused by two distinct sources.
That is, in contrast to the previous models, this model does not a priori assume that the signals have to be integrated. Second, in this model, Z_V influences A only through its effect on Z_A, and likewise for Z_A and V. This corresponds to the assumption of independent likelihood functions, P(A, V | Z_A, Z_V) = P(A | Z_A) P(V | Z_V). This is a plausible assumption, motivated by the fact that either the two signals are caused by two different events, in which case A would be independent of Z_V (and likewise for V and Z_A), or they are caused by one event, in which case the dependence of A on Z_V can be captured by its dependence on Z_A.

Given the auditory and visual signals A and V, an ideal observer would try to make the best possible estimate of the physical sources Z_A and Z_V, based on the knowledge P(A | Z_A), P(V | Z_V), and P(Z_A, Z_V). These estimates are based on the posterior probabilities P(Z_A, Z_V | A, V), which can be calculated using Bayes' rule and simplified by the assumptions represented by the model structure (Fig. 2), resulting in the following inference rule:

    P(Z_A, Z_V | A, V) = P(A | Z_A) P(V | Z_V) P(Z_A, Z_V) / P(A, V).    (1)

This inference rule simply states that the posterior probability of events Z_A and Z_V is the normalized product of the single-modality likelihoods and the joint prior. In order to simplify calculations, we assume that P(A, V) has a uniform distribution. This, in turn, implies that P(A) and P(V) also have uniform distributions. Given a uniform P(A), the auditory likelihood term is computed as

    P(A | Z_A) = P(Z_A | A) P(A) / Sum_A [P(Z_A | A) P(A)] = P(Z_A | A) / Sum_A P(Z_A | A)

(and likewise for P(V | Z_V)).
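The inference rule of Eq. (1) is a pointwise product followed by normalization. A minimal sketch with made-up likelihoods and a made-up diagonal-heavy prior (none of these numbers come from the study) shows how the illusion can fall out of the computation:

```python
import numpy as np

def posterior(lik_a, lik_v, prior):
    """Eq. (1) with uniform P(A, V): the posterior over the causes is the
    normalized product of the two unisensory likelihoods and the joint prior.

    lik_a[z] = P(A = observed | Z_A = z), lik_v[z] = P(V = observed | Z_V = z),
    prior[za, zv] = P(Z_A = za, Z_V = zv).
    """
    unnorm = np.outer(lik_a, lik_v) * prior
    return unnorm / unnorm.sum()

# Illustrative numbers: the observer hears 2 beeps (reliable, peaked
# likelihood) and sees 1 flash (broader likelihood), with a joint prior
# that puts extra mass on equal numbers of events.
lik_a = np.array([0.05, 0.15, 0.60, 0.15, 0.05])
lik_v = np.array([0.10, 0.50, 0.30, 0.08, 0.02])
prior = np.eye(5) * 0.10 + 0.02          # sums to 1

post = posterior(lik_a, lik_v, prior)
za_map, zv_map = np.unravel_index(post.argmax(), post.shape)
# The most probable joint cause is (2, 2): the single flash is inferred
# to be two flashes, mirroring the sound-induced flash illusion.
```

The diagonal prior plus the sharper auditory likelihood is what pulls the visual estimate toward the auditory one in this toy setting.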
While the likelihood functions P(A | Z_A) and P(V | Z_V) are nicely approximated from the unisensory (visual-alone and auditory-alone) conditions, the prior probabilities P(Z_A, Z_V) involve both sensory modalities and cannot be obtained from unisensory conditions alone.

Estimation of the joint priors

In most models, the priors are not directly computable. Hence, the prior distribution is parameterized and the parameters are tuned to fit the observed data (i.e. the data to be predicted). Our experimental paradigm makes it possible for the joint priors to be approximated directly from the observed data, alleviating the need for any parameter tuning. The joint priors can be approximated by marginalizing the joint probabilities across all conditions, that is, all combinations of A and V:

    P(Z_A, Z_V) = Sum_{A,V} P(Z_A, Z_V | A, V) P(A, V).    (2)

Given a uniform P(A, V), this leads to a normalized marginalization of the posteriors. As this estimate requires marginalizing over all conditions, including the auditory-visual conditions, we used the data from a different set of observers (the first half of participants) for estimating the joint priors using the above formula, and excluded those data from the testing process (the second half of participants). In other words, these data were used only for calculating the priors and were discarded afterwards. Thus, the model remained predictive, not using any auditory-visual data for making predictions about performance in the auditory-visual conditions.

Although it may appear that the joint prior matrix introduces 24 free parameters into our model, it should be emphasized that this is not the case, as these parameters are not 'free'.
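With a uniform P(A, V), the marginalization in Eq. (2) reduces to averaging the per-condition posterior matrices and renormalizing. A minimal sketch (the toy matrices here are invented, not the observers' data):

```python
import numpy as np

def estimate_joint_prior(posteriors):
    """Eq. (2) with uniform P(A, V): approximate P(Z_A, Z_V) by averaging
    the observed posterior matrices P(Z_A, Z_V | A, V) over all stimulus
    conditions, then renormalizing.

    posteriors : array of shape (n_conditions, nZ, nZ), one response-
    probability matrix per condition, taken from the prior-estimation
    group of observers.
    """
    prior = np.asarray(posteriors).mean(axis=0)
    return prior / prior.sum()

# Two hypothetical 2x2 conditions, each a normalized posterior matrix.
toy = np.array([[[0.5, 0.5], [0.0, 0.0]],
                [[0.0, 0.0], [0.5, 0.5]]])
prior_hat = estimate_joint_prior(toy)     # elementwise average, re-summed to 1
```

Because the averaged matrices are themselves normalized, the renormalization is a safeguard rather than a correction; it matters only when the empirical response matrices do not sum exactly to one.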
The parameters of the joint prior matrix are set using the observed data; however, they were not tuned to minimize the error between the model predictions and the data. Therefore, the model has no 'free' parameters.

Results

The observers perform better in the auditory-alone conditions (first row of Fig. 3) than in the visual-alone conditions (first column of Fig. 3). As can be seen in Fig. 3, the human observers' performance is remarkably consistent with that of the ideal observer in all of the conditions (r^2 = 0.92), accounting for 600 data points [25 (Z_A, Z_V) combinations in each of 24 conditions] with no free parameters.

[Fig. 1: The spatio-temporal configuration of stimuli. (a) The spatial configuration of the stimuli. The visual stimulus was presented at 12° eccentricity below the fixation point. The sounds were presented from the speakers adjacent to the monitor and at the same height as the center of the visual stimulus. (b) The temporal profile of the stimuli in one of the conditions (2 flashes + 3 beeps). The centers of the visual and auditory sequences were aligned in all conditions.]

Only in conditions in which the visual and auditory stimuli are identical (i.e. the conditions displayed along the diagonal) do observers consistently indicate perceiving the same number of events in both modalities. In conditions in which the inconsistency between the auditory and visual stimuli is not too large, for instance in the 1 flash + 2 beeps condition or the 2 flashes + 1 beep condition, there is a strong tendency to combine the two modalities, as indicated by highly overlapping auditory and visual reports.
The high values along the diagonal in the joint posterior matrices of these conditions (not shown here) confirm that the same number of events was indeed experienced jointly in both modalities in these conditions. The integration of the auditory-visual percepts is achieved in these cases by a shift of the visual percept in the direction of the auditory percept. This occurs because the variance in the auditory-alone conditions is lower than that of the visual-alone conditions. In other words, because the auditory modality is more reliable, it dominates the overall percept in these auditory-visual conditions. This finding is consistent with previous studies of cue combination within [3,4,16,17] or across [7-9] modalities, in all of which the discrepancy between the two cues is small and the percept is dominated by the cue with lower variance (or higher reliability). The large fraction of trials in which the observers report seeing two flashes in the 1 flash + 2 beeps condition corresponds to the sound-induced flash illusion [10].

In conditions in which the discrepancy between the number of flashes and beeps is large (e.g. 1 flash + 4 beeps or 4 flashes + 1 beep), the overlap between the auditory and visual percepts is significantly smaller, indicating a considerably smaller degree of integration and a larger degree of segregation.

[Fig. 2: Graphical model describing the ideal observer. In a graphical model [15], the graph nodes represent random variables, and arrows denote potential conditionality. The absence of an arrow represents direct statistical independence between the two variables. The bidirectional arrow between Z_A and Z_V does not imply a recurrent relationship; it implies that the two causes are not necessarily independent.]

[Fig. 3: Comparison of the performance of human observers with the ideal observer. To facilitate interpretation of the data, instead of presenting joint posterior probabilities for each condition, only the marginalized posteriors are shown. The auditory and visual judgements of human observers are plotted in red circles and blue squares, respectively. Each panel represents one of the conditions. The first row and first column represent the auditory-alone and visual-alone conditions, respectively. The remaining panels correspond to conditions in which auditory and visual stimuli were presented simultaneously. The horizontal axes represent the response category (with zeros denoting absence of a stimulus and 1-4 representing the number of flashes or beeps). The vertical axes represent the probability of a perceived number of flashes or beeps. The data point enclosed by a green circle is an example of the sound-induced flash illusion, showing that in a large fraction of trials observers perceived two flashes when one flash was paired with two beeps. The data point enclosed by a brown circle reveals an opposite illusion, in which two flashes are perceived as one flash in a large fraction of trials in the 2 flashes + 1 beep condition.]

Next, we investigated the possibility that the Bayesian model of Eq. (1) is overly powerful and capable of predicting any data set.
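Such a control amounts to comparing model predictions against an order-shuffled copy of the data using the coefficient of determination. A minimal sketch with made-up numbers standing in for the 600 data points (not the study's actual responses or model output):

```python
import numpy as np

def r_squared(data, model):
    """Coefficient of determination between observed response probabilities
    and model predictions, both flattened across conditions."""
    data, model = np.ravel(data), np.ravel(model)
    ss_res = np.sum((data - model) ** 2)
    ss_tot = np.sum((data - data.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
data = rng.random(600)                    # stand-in for the 600 data points
perfect = r_squared(data, data)           # a model matching the data exactly
shuffled = r_squared(rng.permutation(data), data)
# perfect equals 1.0; shuffling destroys the correspondence, so the
# shuffled r^2 collapses toward (or below) zero.
```

The point of the control is that a model with genuine structure should fit the real data far better than it fits any reordered version of the same numbers.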
We shuffled the obtained human observer posterior probabilities P(Z_A, Z_V | A, V) in each auditory-visual condition, leading to a new data set that was identical to the human data in its overall content, although randomized in order. We applied our model to this data set. The model predictions did not match the shuffled data, even when we did not divide the data set into two halves and instead computed the priors from the same set for which we generated the predictions (r^2 = 0.05). We obtained qualitatively similar results regardless of the made-up distribution used. This finding strongly suggests that the predictions of the proposed ideal observer are distinctly consistent with the human observers' data and not with any arbitrary data set.

Discussion

Altogether, these results suggest that humans combine auditory and visual information in an optimal fashion. Our results extend earlier findings (e.g. [7,18]) by showing that the optimality of human performance is not restricted to situations in which the discrepancy between the two modalities is minute and the two modalities are completely integrated. Indeed, it can be shown that many earlier models of cue combination are special cases of the model described here.

The ideal observer model presented here differs in two important ways from previous models of cue combination, which have employed maximum likelihood estimation. First, as opposed to previous models (which assume one cause for all signals), our model allows a distinct cause for each signal. This is a structural difference between the present model and all the previous models, and it is the reason why the multisensory paradigm in the present study is beyond the scope of previous models. The assumption of a single cause makes previous models unable to account for a vast portion of the present data in which the visual and auditory information are not integrated, that is, all the trials in which participants reported different visual and auditory percepts.
Second, the previous models did not include any prior probability of events, which is equivalent to assuming a uniform prior distribution. In the present model, prior probabilities are not assumed to be uniform. In order to examine the importance of the priors in accounting for the data, we tested the model using a uniform prior. The goodness of fit was considerably reduced (r^2 = 0.62), indicating that in this task the priors depart significantly from a uniform distribution and are therefore necessary for accounting for the data.

Conclusion

The findings of this study suggest that the brain uses a mechanism similar to Bayesian inference [19] to decide whether, to what degree, and how (in which direction) to integrate the signals from the auditory and visual modalities, and that the sound-induced flash illusion can be viewed as an epiphenomenon of a statistically optimal computational strategy.

Acknowledgement

We are grateful to Graeme Smith for extensive discussions and help with programming. We thank Stefan Schaal, Bosco Tjan, Alan Yuille, and Zili Liu for their insightful discussions and comments.

References

1. Bulthoff HH, Mallot HA. Integration of depth modules: stereo and shading. J Opt Soc Am 1988; 5:1749-1758.
2. Knill DC. Mixture models and the probabilistic structure of depth cues. Vision Res 2003; 43:831-854.
3. Landy MS, Maloney LT, Johnston EB, Young M. Measurement and modeling of depth cue combination: in defense of weak fusion. Vision Res 1995; 35:389-412.
4. Yuille AL, Bulthoff HH. Bayesian decision theory and psychophysics. In: Knill DC, Richards W, editors. Perception as Bayesian inference. Cambridge: Cambridge University Press; 1996. pp. 123-161.
5. Alais D, Burr D. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol 2004; 14:257-262.
6. Massaro DW. Perceiving talking faces: from speech perception to a behavioral principle. Cambridge, Massachusetts: MIT Press; 1998.
7. Ernst MO, Banks MS. Humans integrate visual and haptic information in a statistically optimal fashion. Nature 2002; 415:429-433.
8. van Beers RJ, Sittig AC, Denier van der Gon JJ. Integration of proprioceptive and visual position information: an experimentally supported model. J Neurophysiol 1999; 81:1355-1364.
9. Ghahramani Z, Wolpert DM, Jordan MI. Computational models of sensorimotor integration. In: Morasso PG, Sanguineti V, editors. Self-organization, computational maps, and motor control. Amsterdam: North-Holland, Elsevier Press; 1997. pp. 117-147.
10. Shams L, Kamitani Y, Shimojo S. What you see is what you hear. Nature 2000; 408:788.
11. Shams L, Kamitani Y, Shimojo S. Visual illusion induced by sound. Cogn Brain Res 2002; 14:147-152.
12. Jordan MI. Graphical models. Stat Sci (Special Issue on Bayesian Stat) 2004; 19:140-155.
13. Falchier A, Clavagnier S, Barone P, Kennedy H. Anatomical evidence of multimodal integration in primate striate cortex. J Neurosci 2002; 22:5749-5759.
14. Rockland KS, Ojima H. Multisensory convergence in calcarine visual areas in macaque monkey. Int J Psychophysiol 2003; 50:19-26.
15. Pearl J. Probabilistic reasoning in intelligent systems: networks of plausible inference. San Mateo, California: Morgan Kaufmann; 1988.
16. Clark JJ, Yuille AL. Data fusion for sensory information processing systems. Boston: Kluwer Academic Publishers; 1990.
17. Jacobs R. Optimal integration of texture and motion cues to depth. Vision Res 1999; 39:3621-3639.
18. Battaglia PW, Jacobs RA, Aslin RN. Bayesian integration of visual and auditory signals for spatial localization. J Opt Soc Am 2003; 20:1391-1397.
19. Knill DC, Pouget A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci 2004; 27:712-719.