Title:
System and method for generating surround sound
Kind Code:
B1


Inventors:
Paczkowski, Jacek (Patents Factory Ltd. Sp. z o.o., Spawaczy 3b/2, 65-119 Zielona Góra, PL)
Kramek, Krzysztof (Patents Factory Ltd. Sp. z o.o., Spawaczy 3b/8, 65-119 Zielona Góra, PL)
Nalewa, Tomasz (Patents Factory Ltd. Sp. z o.o., Kossaka 83, 65-140 Zielona Góra, PL)
Application Number:
EP20140461580
Publication Date:
03/22/2017
Filing Date:
10/23/2014
Assignee:
Patents Factory Ltd. Sp. z o.o. (ul. Antoniego Wysockiego 8, 66-002 Nowy Kisielin, PL)
International Classes:
H04S7/00
Foreign References:
2006083383
2014112484
Attorney, Agent or Firm:
Blonski, Pawel (EP-Patent Konstruktorow 30/2, 65-119 Zielona Gora, PL)
Claims:
1. A signal comprising sound events (101) wherein each of said sound events (101) comprises:
  • time of event (102) information;
  • information regarding location in space with respect to a reference location point (103);
  • a movement trajectory in space (104);
  • orientation information (105);
characterized in that the signal further comprises at least three sound event data comprising
at least one acoustic sound event data comprising
  • spatial characteristic of the source of the event (106), comprising spatial characteristic of sound emission of an associated sound source, defined as a set of points of the spatial characteristic in horizontal and vertical planes;
  • information on sampling frequency (107);
  • information on signal resolution (108); and
  • a set of acoustic samples (109) of the sampling frequency (107) and at the signal resolution (108);
at least one textual sound event data further comprising
  • a library identifier (402) and a textual data field (403) wherein the textual data is to be used to generate sound by a speech synthesizer; and
at least one synthetic non-verbal sound event data further comprising
  • at least one code data (408) and a library selection field (402) referring to a music synthesizer library wherein the at least one code is to configure a music synthesizer.

2. The signal according to claim 1 characterized in that it further comprises a synthetic library packet comprising an identifier (404), language identifier (405) and audio samples data (406) that are referenced by at least one synthetic non-verbal sound event.

3. The signal according to claim 1 characterized in that the at least one textual sound data event further comprises a field specifying emotions in the textually defined event.

4. The signal according to claim 1 characterized in that the at least one textual sound event data further comprises a field of a person's characteristics.

5. The signal according to claim 1 characterized in that the at least one textual sound event data and/or the at least one synthetic non-verbal event data further comprises a field defining volume of the sound to be synthesized.

6. The signal according to claim 1 characterized in that the at least one textual sound event data further comprises a field defining tempo comprising information on speech synthesis timing including length of syllables and/or pauses between words.

Description:

The present invention relates to a system and method for generating surround sound. In particular, the present invention relates to a surround sound environment that is independent of the number of loudspeakers and of the configuration and placement of the respective loudspeakers.

Prior art surround sound systems such as Dolby Digital and DTS are based on multichannel transmission and presentation of sound. A disadvantage of these solutions is that the obtained effect depends on loudspeaker placement and room acoustics. Both technologies suggest an optimal loudspeaker placement, which, however, is often infeasible due to the shape and arrangement of the room.

There are known sound correction systems which, however, most often are based on a suitable delay of signals destined to each loudspeaker. The problem of sound reflections off the walls is similarly corrected.

Reflections may be used to generate virtual surround sound. This is the case in so-called sound projectors (an array of loudspeakers in a single casing - a so called sound bar).

Problems with surround sound arise from the fact that the data in the acoustic stream assume specific locations of each loudspeaker relative to the listener. Even the names of the channels define particular arrangements, i.e. central, left front, right front, left rear, right rear. In these prior art surround systems the same sound data stream is sent to the speakers of each listener, regardless of the actual position of the speakers in the presentation room.

It would be advantageous to provide a surround sound solution that would be independent of the number of loudspeakers and of the configuration and placement of the respective loudspeakers.

Prior art discloses the Ambisonics system, which is a full-sphere surround sound technique: in addition to the horizontal plane, it covers sound sources above and below the listener.

Unlike other multichannel surround formats, its transmission channels do not carry speaker signals. Instead, they contain a speaker-independent representation of a sound field called B-format, which is then decoded to the listener's speaker setup. This extra step allows the producer to think in terms of source directions rather than loudspeaker positions, and offers the listener a considerable degree of flexibility as to the layout and number of speakers used for playback (source: Wikipedia).

The aim of the present invention is a surround system and method that are independent of the number of loudspeakers and of the configuration and placement of the respective loudspeakers.

SUMMARY AND OBJECTS OF THE PRESENT INVENTION

An object of the present invention is a signal according to claim 1.

These and other objects of the invention presented herein, are accomplished by providing a system and method for generating surround sound. Further details and features of the present invention, its nature and various advantages will become more apparent from the following detailed description of the preferred embodiments shown in a drawing, in which:

Fig. 1
presents a diagram of a sound event;
Fig. 2
presents a diagram of the method according to the present invention;
Fig. 3
presents a diagram of the system according to the present invention;
Figs 4A - 5B
depict audio data packets.

NOTATION AND NOMENCLATURE

Some portions of the detailed description which follows are presented in terms of data processing procedures, steps or other symbolic representations of operations on data bits that can be performed on computer memory. Therefore, a computer executes such logical steps thus requiring physical manipulations of physical quantities.

Usually these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. For reasons of common usage, these signals are referred to as bits, packets, messages, values, elements, symbols, characters, terms, numbers, or the like.

Additionally, all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Terms such as "processing" or "creating" or "transferring" or "executing" or "determining" or "detecting" or "obtaining" or "selecting" or "calculating" or "generating" or the like, refer to the action and processes of a computer system that manipulates and transforms data represented as physical (electronic) quantities within the computer's registers and memories into other data similarly represented as physical quantities within the memories or registers or other such information storage.

A computer-readable (storage) medium, such as referred to herein, typically may be non-transitory and/or comprise a non-transitory device. In this context, a non-transitory storage medium may include a device that may be tangible, meaning that the device has a concrete physical form, although the device may change its physical state. Thus, for example, non-transitory refers to a device remaining tangible despite a change in state.

DESCRIPTION OF EMBODIMENTS

The present invention is independent of loudspeaker placement because the acoustic stream is not divided into channels but rather into sound events present in a three-dimensional space.

Fig. 1 presents a diagram of a sound event according to the present invention. The sound event 101 represents the fact of presence of a sound source in an acoustic space. Each such event has an associated set of parameters such as: time of event 102, location in space with respect to a reference location point 103. The location may be given as x,y,z coordinates (alternatively spherical coordinates r,α,β may be used).
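By way of a non-limiting illustration, the two coordinate representations mentioned above may be converted into one another as sketched below. The function names and the axis convention (α taken as an azimuth measured in the x-y plane from the x axis, β taken as an elevation above that plane, both in radians) are assumptions made for this example only; the description does not fix them.

```python
import math

def spherical_to_cartesian(r, alpha, beta):
    """Convert (r, alpha, beta) to (x, y, z).

    Assumed convention: alpha is the azimuth in the x-y plane measured from
    the x axis, beta is the elevation above the x-y plane (radians).
    """
    x = r * math.cos(beta) * math.cos(alpha)
    y = r * math.cos(beta) * math.sin(alpha)
    z = r * math.sin(beta)
    return x, y, z

def cartesian_to_spherical(x, y, z):
    """Inverse conversion under the same assumed convention."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # direction is undefined at the reference point
    alpha = math.atan2(y, x)
    beta = math.asin(max(-1.0, min(1.0, z / r)))  # clamp against rounding
    return r, alpha, beta
```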

The sound event 101 further comprises a movement trajectory in space 104 (for example in the case of a vehicle changing its location). The movement trajectory may be defined as n, Δt1, x1, y1, z1, γ1, δ1, Δt2, x2, y2, z2, γ2, δ2, ..., Δtn, xn, yn, zn, γn, δn, which is a definition of a curve along which the sound source moves. Here n is the number of points of the curve, xi, yi, zi are the coordinates of the i-th point in space, γi, δi is the momentary orientation of the sound source (azimuth and elevation) and Δti is a time increment.

The sound event 101 further comprises an orientation 105 (γ,δ - the direction in which the highest sound amplitude is generated; azimuth and elevation are defined relative to the orientation of the coordinate system).
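As a non-limiting sketch, a trajectory given in the flat form described above may be parsed and evaluated as follows. The class and function names are hypothetical, and two assumptions are made that the description does not state: each Δt is counted from the previous point (the first one from the time of the event), and the source moves linearly between consecutive points.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryPoint:
    dt: float     # time increment since the previous point (assumed meaning)
    x: float
    y: float
    z: float
    gamma: float  # azimuth of the source orientation at this point
    delta: float  # elevation of the source orientation at this point

def parse_trajectory(values):
    """Parse the flat list n, dt1, x1, y1, z1, gamma1, delta1, ... into points."""
    n = int(values[0])
    fields = values[1:]
    if len(fields) != 6 * n:
        raise ValueError("expected 6 values per trajectory point")
    return [TrajectoryPoint(*fields[6 * i:6 * i + 6]) for i in range(n)]

def position_at(start, points, t):
    """Linearly interpolate the source position t seconds after the event time.

    start is the initial (x, y, z) location of the sound event.
    """
    t = max(t, 0.0)
    prev_pos, elapsed = start, 0.0
    for p in points:
        target = (p.x, p.y, p.z)
        if p.dt > 0 and t <= elapsed + p.dt:
            f = (t - elapsed) / p.dt
            return tuple(a + f * (b - a) for a, b in zip(prev_pos, target))
        prev_pos, elapsed = target, elapsed + p.dt
    return prev_pos  # after the last point the source stays where it ended
```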

Additionally, the sound event 101 comprises a spatial characteristic 106 of the source of the event (the shape of a curve of the sound amplitude with respect to emission angle, where zero angle means emission in the direction of the highest amplitude). This parameter may be provided as s, λ1, u1, v1, λ2, u2, v2, λ3, u3, v3, ..., λs, us, vs, where the characteristic is symmetrical and described with s points, the ui describe the shape of the sound beam in the horizontal plane and the vi describe the respective shape in the vertical plane.

The sound event 101 further comprises information on sampling frequency 107 (in case it is different from the base sampling frequency of the sound stream), signal resolution 108 (the number of bits per sample; this parameter is present if a given source has a different span than the standard span of the sound stream) and a set of acoustic samples 109 at the given sampling frequency and resolution.
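The parameters 102 to 109 of a sound event may, purely as an illustration, be held in a record such as the one sketched below. The class name, field names and types are hypothetical; the optional fields model the statement that sampling frequency and resolution are present only when they differ from the stream defaults.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SoundEvent:
    time_of_event: float                                   # 102
    location: Tuple[float, float, float]                   # 103, relative to the reference point
    trajectory: List[Tuple[float, ...]] = field(default_factory=list)  # 104
    orientation: Tuple[float, float] = (0.0, 0.0)          # 105, (gamma, delta)
    spatial_characteristic: List[Tuple[float, float, float]] = field(
        default_factory=list)                              # 106, (lambda, u, v) points
    sampling_frequency: Optional[int] = None               # 107, None means stream default
    resolution_bits: Optional[int] = None                  # 108, None means stream default
    samples: List[int] = field(default_factory=list)       # 109, monophonic samples

    def duration(self, default_frequency):
        """Length of the event in seconds, using the stream default if needed."""
        rate = self.sampling_frequency or default_frequency
        return len(self.samples) / rate
```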

A plurality of sound events will typically be encoded into an output audio data stream.

The samples are always monophonic and are present as long as a given sound source emits a sound. In the case of speech this means that a sound source appears and disappears in the sound stream, which is the reason for naming such an event a sound event. In the case of a recording of an orchestra, appear/disappear events of the respective instruments will occur. Such an approach to the sound data stream results in a variable bitrate, and the changes may be substantial. When there are no sound events the bitrate will be close to zero, while in the case of multiple sound events the bitrate may be high (even higher than in the case of prior art surround systems).

The loudspeakers may be located in an arbitrary way; however, preferably they should not all be placed in a single place, for example on a single wall. According to the present invention the plurality of loudspeakers may be considered a cloud of loudspeakers. The more loudspeakers there are, the better the spatial effect that may be achieved. Preferably the loudspeakers are scattered in the presentation location, preferably on different walls of a room.

The loudspeakers may be either wired or wireless and are communicatively coupled to a sound decoder according to the present invention. The decoder may use loudspeakers of other electronic devices as long as communication may be established with controllers of such speakers (e.g. Bluetooth or Wi-Fi communication with loudspeakers of a TV set or mobile device).

The sound decoder according to the present invention may obtain information on location and characteristic of a given loudspeaker by sending to its controller a test sound stream and subsequently recording the played back test sound stream and analyzing the relevant acoustic response.

For the purpose of obtaining information on location and characteristic of a given loudspeaker there may be used an array of omnidirectional microphones, for example spaced from each other by 10cm and positioned on vertices of a cube or a tetrahedron. By measuring delays in a signal reaching respective microphones, one may estimate sound location. The characteristics of a given loudspeaker may be obtained by analyzing recorded sound at different frequencies.
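The delay-based estimation mentioned above may be illustrated, for a single pair of microphones of the array, by the following minimal sketch. It assumes a far-field source (plane wave), a speed of sound of 343 m/s and the 10 cm spacing given above as an example; the function name is hypothetical, and a complete localization would combine several microphone pairs.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def angle_of_arrival(delay_s, spacing_m=0.10):
    """Far-field angle (radians) between the source direction and the
    broadside of a two-microphone pair, from the arrival-time difference.

    Uses sin(theta) = c * delay / spacing.
    """
    ratio = SPEED_OF_SOUND * delay_s / spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.asin(ratio)
```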

Other methods for obtaining information on location and characteristic of a given loudspeaker include solutions presented in US20140112484 or in "Analysis of Loudspeaker Placement and Loudspeaker-Room Interaction, and Correction of Associated Effects" by Michael Hlatky of University of Applied Sciences Offenburg, Media and Information Technology, Bang & Olufsen a/s, Department of Acoustics, August 2007.

According to the present invention, sound reflections are used in order to generate sounds from directions where no loudspeaker is present. To this end the sound decoder executes sound location analysis aimed at using reflective surfaces (such as walls) to generate reflected sounds. All sound reflecting surfaces are divided into triangles and each of the triangles is treated by the decoder as a virtual sound source. Each triangle has an associated function defining the dependence of a sound virtually emitted by this triangle on sounds emitted by physical loudspeakers. This function defines the amplitude as well as the spatial characteristics of emission, which may be different for each physical loudspeaker. In order for the system to operate properly it is necessary to place, at a sound presentation location, microphones used by the sound decoder for constant measurements of compliance of the emitted sounds with expected sounds and for fine tuning the system.

Such a function is a sum of reflected signals emitted by all loudspeakers in a room, wherein a signal reflected from a given triangle depends on the triangle location, the loudspeaker location(s), the loudspeaker emission characteristics and the acoustic pressure emitted by the loudspeaker(s). The signal virtually emitted by the triangle will be a sum of the reflections generated by all loudspeakers. The spatial acoustic emission characteristic of such a triangle will depend on the physical loudspeakers, with each physical loudspeaker influencing it partially. Such a characteristic may be discrete, comprising narrow beams generated by different loudspeakers. Therefore, in order to eliminate sound reflected at a given location, an appropriate loudspeaker or a linear combination of loudspeakers has to be selected (appropriate means in line with the acoustic target, e.g. generating, from a given plane, a reflection in the direction of the listener such that other reflections do not ruin the effect).

The most important module of the system is a local sound renderer. This means that the renderer receives separate sound events and composes from them acoustic output streams that are subsequently sent to loudspeakers.

Due to the fact that the sound events comprise information on location of sound sources with respect to a reference location (for example the listener), the renderer shall select a speaker or speakers, which is/are closest to the location in space where the sound was emitted from. In case a speaker is not present in that location, speakers adjacent to this location shall be used, preferably speakers located at opposite sides of the location so that they may be configured in order to create an impression for the listener that the sound is emitted from its original location in space.

More than two loudspeakers may be used for one sound event in particular when a virtual sound source is to be positioned between them.

In case there are no physical loudspeakers in the vicinity of the location (direction) of the sound of a sound event, reflections from adjacent planes (such as walls) may be used to position the sound. Knowing the sound reflection function for a given reflective section, the optimal physical loudspeakers need to be chosen for generating the reflection effect.

The reference point location may be differently selected for a given sound rendering location or room. For example one may listen to the music in an armchair and watch television sitting on a sofa. Therefore, there are two different reference locations depending on circumstances. Consequently, the coordinates system changes. The reference location may be automatically obtained by different sensors such as an infrared camera or manually input by the listener. Such solution is possible only because of local sound rendering.

An exemplary normalized characteristic of a physical loudspeaker is shown in Fig. 1 B. The characteristic is usually symmetrical and described with s points, where u describes the shape of the sound beam in the horizontal plane and v the respective shape in the vertical plane. Such characteristics may be determined using an array of microphones as previously described.

In the case of reflection, the characteristic can be asymmetrical and discontinuous.

Fig. 2 presents a diagram of the method according to the present invention. After receiving a sound data stream according to Fig. 1, the method starts at step 201 by accessing a database of the loudspeakers present at the sound presentation location. Subsequently, at step 202, it is calculated which of the available loudspeakers may be used so as to achieve the effect closest to a perfect arrangement. This may be effected by location thresholding based on the records of the database of loudspeakers.

Such calculation needs to be executed for each sound event because sound events may run in parallel and the same loudspeaker(s) may be needed to emit them. Data for each loudspeaker has to be added by applying superposition approach (all sound events at a given moment of time that affect a selected loudspeaker).
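The superposition of parallel sound events on a single loudspeaker may be illustrated by the following minimal sketch. The function name, the use of a plain list of floating-point samples as the loudspeaker buffer and the per-event gain parameter are assumptions made for this example.

```python
def mix_into(buffer, samples, start_index, gain=1.0):
    """Add gain-scaled samples of one sound event into a loudspeaker buffer.

    Samples of events that overlap in time are summed (superposition);
    the buffer grows with silence if the event extends past its end.
    """
    end = start_index + len(samples)
    if end > len(buffer):
        buffer.extend([0.0] * (end - len(buffer)))
    for k, s in enumerate(samples):
        buffer[start_index + k] += gain * s
    return buffer
```

For example, mixing a three-sample event at index 0 and a two-sample event at index 2 yields a four-sample buffer in which only the sample at index 2 contains contributions from both events.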

In case a loudspeaker is close to a location in which a sound source is located, this loudspeaker will be used. In case the sound source is located between physical loudspeakers then the closest loudspeakers will be used in order to simulate a virtual loudspeaker, located where the sound source is located. A superposition principle may be applied for this purpose. It is necessary to take into account, during this process, the emission characteristics of the loudspeakers.

The physical loudspeakers selected for simulating a virtual loudspeaker will emit sound in the direction of the listener at predefined angles of azimuth and elevation. For these angles the attenuation level is to be read from the emission characteristic of the loudspeaker (the characteristic is normalized and therefore it will be a number from the range 0 ... 1) and multiplied by the emission strength of the loudspeaker (acoustic pressure). Only after that may superposition be executed. The signals are to be added by assigning weights to loudspeakers, the weights arising from the location of the virtual loudspeaker with respect to those used to generate it (based on a proportionality rule).
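One possible reading of the above weighting is sketched below for two physical loudspeakers flanking a virtual loudspeaker in azimuth. The function names are hypothetical, and the interpretation of the proportionality rule as linear weights that sum to one (the nearer loudspeaker receiving the larger weight) is an assumption; the description does not define the rule further.

```python
def proportional_weights(virtual_az, az_a, az_b):
    """Linear weights for loudspeakers a and b flanking a virtual source.

    The weight of each loudspeaker grows as the virtual source approaches it.
    """
    span = az_b - az_a
    if span == 0:
        raise ValueError("loudspeakers must be at different azimuths")
    w_b = (virtual_az - az_a) / span
    if not 0.0 <= w_b <= 1.0:
        raise ValueError("virtual source must lie between the loudspeakers")
    return 1.0 - w_b, w_b

def effective_level(attenuation, emission_strength):
    """Normalized attenuation (0 ... 1) multiplied by emission strength."""
    if not 0.0 <= attenuation <= 1.0:
        raise ValueError("attenuation is normalized to the range 0 ... 1")
    return attenuation * emission_strength

def superpose(sample, contributions):
    """Weighted sum over (weight, attenuation, emission_strength) triples."""
    return sum(w * effective_level(a, p) * sample for w, a, p in contributions)
```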

The calculations shall include not only the direction from which a sound event is emitted but also a distance from the listener (i.e. a delay of the signal in such a way so as to simulate the correct distance from the listener to the sound event). The properly selected loudspeakers surround the sound event location. There may be more than two selected loudspeakers that will emit a particular sound event data.
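The distance-related delay may, for example, be estimated as sketched below. The sketch assumes a speed of sound of 343 m/s and assumes that only the path difference between the sound event and the emitting loudspeaker needs to be compensated; both the function name and this interpretation are illustrative only.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def distance_delay_samples(event_distance_m, speaker_distance_m, sampling_frequency):
    """Number of samples by which a loudspeaker signal is delayed so that the
    sound appears to arrive from the distance of the sound event.

    A loudspeaker that is farther away than the event cannot be advanced,
    so the delay is clamped at zero.
    """
    extra_path = max(0.0, event_distance_m - speaker_distance_m)
    return round(extra_path / SPEED_OF_SOUND * sampling_frequency)
```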

At step 203 there is calculated an angular difference, in spherical coordinates, between the sound event location and the positions of the candidate loudspeakers. The sound event location is:

  • rssi - the distance of the i-th sound event location from the listener;
  • γi - the azimuth of the i-th sound event location;
  • δi - the elevation angle of the i-th sound event;
and the loudspeaker location is:
  • rsj - the distance of the j-th loudspeaker location from the listener;
  • γj - the azimuth of the j-th loudspeaker location;
  • δj - the elevation angle of the j-th loudspeaker.

Thus the angular difference is as follows:

Δγ = γi − γj
Δδ = δi − δj

A set of loudspeakers that have the lowest distance from the sound event location is selected at step 204. The loudspeakers are to be located at opposite sides (when facing the reference location of a user) with respect to the sound event location so that the listener has an impression that the sound arrives from the sound event location.
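Steps 203 and 204 may be illustrated by the following non-limiting sketch. Locations are given as (r, γ, δ) tuples with angles in degrees. The function names, the wrapping of the azimuth difference to the range from -180 to 180 degrees, the Euclidean combination of Δγ and Δδ as the selection metric and the choice of one loudspeaker per azimuth side are assumptions made for this example.

```python
import math

def angular_difference(event, speaker):
    """Return (d_gamma, d_delta) between a sound event and a loudspeaker."""
    d_gamma = (event[1] - speaker[1] + 180.0) % 360.0 - 180.0  # wrapped azimuth
    d_delta = event[2] - speaker[2]
    return d_gamma, d_delta

def select_flanking(event, speakers):
    """Indices of the angularly closest loudspeaker on each azimuth side.

    A loudspeaker exactly at the event azimuth forms its own group.
    The result is ordered from the closest loudspeaker to the farthest.
    """
    best = {}  # side (-1, 0 or +1) -> (angular distance, loudspeaker index)
    for j, spk in enumerate(speakers):
        d_gamma, d_delta = angular_difference(event, spk)
        dist = math.hypot(d_gamma, d_delta)
        side = (d_gamma > 0) - (d_gamma < 0)
        if side not in best or dist < best[side][0]:
            best[side] = (dist, j)
    return [j for _, j in sorted(best.values())]
```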

Subsequently, at step 205, in case of insufficient number of physical loudspeakers there may be created one or more virtual loudspeaker(s). Reflection of sound is utilized for this purpose. The reflections are generated by physical loudspeakers so that they imitate a physical loudspeaker in a given location of the sound presentation location. The generated sound will reflect from a selected surface and be directed towards the listener.

Knowing the location of the virtual loudspeaker, a straight line is to be virtually drawn from the listener to this location and further to a reflective plane (such as a wall). The point at which this line intersects the reflective plane indicates the triangle on the reflective plane which is to be used in order to generate a reflected sound. From the emission characteristic of that triangle it needs to be read which physical loudspeakers are to be used. Subsequently, the function defining the dependency of the emission of the triangle on particular loudspeakers needs to be used in order to generate data streams 206 that are to be sent to physical loudspeakers in order to achieve a reflected sound from that particular triangle. These data streams are to be added to other data emitted by the respective loudspeakers 207.
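The search for the triangle hit by the line drawn from the listener through the virtual loudspeaker may be sketched as follows. The sketch uses the well-known Möller-Trumbore ray-triangle intersection test, which is a technique chosen for this example and is not named in the description; the function names and the representation of triangles as triples of (x, y, z) vertices are likewise assumptions.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle_hit(origin, through, triangle, eps=1e-9):
    """Intersection point of the ray origin -> through with a triangle, or None."""
    v0, v1, v2 = triangle
    direction = sub(through, origin)
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle plane
    inv = 1.0 / det
    t_vec = sub(origin, v0)
    u = dot(t_vec, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    if t <= eps:
        return None  # the plane lies behind the listener
    return tuple(o + t * d for o, d in zip(origin, direction))

def find_reflecting_triangle(listener, virtual_speaker, triangles):
    """Return (index, hit point) of the first triangle hit, or None."""
    for index, tri in enumerate(triangles):
        hit = ray_triangle_hit(listener, virtual_speaker, tri)
        if hit is not None:
            return index, hit
    return None
```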

Fig. 3 presents a diagram of the system according to the present invention. The system may be realized using dedicated components or custom made FPGA or ASIC circuits. The system comprises a data bus 301 communicatively coupled to a memory 304. Additionally, other components of the system are communicatively coupled to the system bus 301 so that they may be managed by a controller 305.

The memory 304 may store computer program or programs executed by the controller 305 in order to execute steps of the method according to the present invention.

The system comprises a sound input interface 303, such as an audio/video communication connector, e.g. HDMI, or a communication connector such as Ethernet. The received sound data is processed by a sound renderer 302 managing the presentation of sounds using the loudspeaker setup at the listener's premises. The management of the presentation of sounds includes virtual loudspeakers management that is effected by a virtual loudspeakers module 307 operating according to the method described above.

Figs 4A - 5B depict audio data packets that are multiplexed in an output audio data stream by a suitable encoder. The audio data stream may comprise a header and packets of acoustic data (for example sound event 101 data packet). The packets are preferably multiplexed in a chronological order but some shifts of data encoding/decoding time versus presentation time are allowable since each packet of acoustic data comprises information regarding its presentation time and must be received sufficiently ahead of that presentation.
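Chronological multiplexing of packets from several sources may be illustrated by the following minimal sketch. Packets are represented as dictionaries with a hypothetical "presentation_time" key, and each source is assumed to be already ordered by that time; neither is prescribed by the description.

```python
import heapq

def multiplex(*packet_sources):
    """Merge time-ordered packet sequences into one stream ordered by
    presentation time (each input must itself be ordered)."""
    return list(heapq.merge(*packet_sources,
                            key=lambda p: p["presentation_time"]))
```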

The header may for example define a global sampling frequency and samples resolution.

Audio data stream may comprise acoustic events as shown in Fig. 4A. All properties of a sound event 101 are maintained with an addition of a language field that identifies audio language, for example with a use of an appropriate identifier. In case more than one language version is present, the acoustic event packets of different languages audio will differ by language identifier 401 and audio samples data 107, 108, 109. The remaining packet data fields will be identical between the respective audio language versions. An audio renderer will output only packets related to a language selected by a user.
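The selection of packets by language may be sketched as below. The dictionary representation and the "language" key are hypothetical; passing through packets that carry no language identifier is an assumption made for this example.

```python
def packets_for_language(packets, selected_language):
    """Keep packets of the selected language and packets without a language."""
    return [p for p in packets
            if p.get("language") in (None, selected_language)]
```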

Fig. 4B presents a special sound event packet which is a textual event packet. Instead of sound samples this packet comprises a library identifier 402 and a textual data field 403. Such textual data may be used to generate sound by a speech synthesizer. The library identifier may select a suitable voice of the speech synthesizer to be used by the sound renderer as well as provide processing parameters for the renderer.

Optionally, the textual event packet may comprise a field specifying emotions in the textually defined event such as whisper, scream, cry or the like. Further, a field of a person's characteristics may be defined such as gender, age, accent or the like. Thus, the generation of sound may be more accurate.

As another option, the textual event packet may comprise a field defining tempo. In particular this field may define speech synthesis timing, such as length of different syllables and/or pauses between words.

The aforementioned has the advantage of data reduction since textual data consume far less data than compressed audio samples data.

Similarly, Fig. 5A defines a Synthetic Non-verbal Event Packet. Instead of sound samples and language field, this packet comprises at least one code in the data field 408 and a library selection field 402 referring to a music synthesizer library. The codes configure a music synthesizer. Thereby sounds are generated locally based on codes thus saving transmission bandwidth.

Such synthesizers are usually based on built in sound libraries used for synthesis. By their nature such libraries are limited, therefore it may be necessary to transmit to a receiver such a library so that a local library may be changed. This allows for achieving an optimal acoustic effect. Such a Synthetic Library Packet has been presented in Fig. 5B. The library comprises an identifier 404, language identifier 405 and audio samples data 406. The library may further be extended with additional data depending on applied synthesizers. A synthetic non-verbal event packet may reference such library by identifying a specific sample and its parameters if applicable.

Optionally, the textual event packets and/or synthetic non-verbal event packets may comprise a field defining volume of the sound to be synthesized.

In one embodiment, in case of textual event packets and synthetic non-verbal event packets, the renderer interprets data (text or command) with built-in synthesizers and creates dynamic acoustic events packets that are subject to final sound rendering just as regular acoustic event packets.
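This embodiment may be illustrated by the following non-limiting sketch, in which textual and synthetic non-verbal packets are converted into acoustic event packets before final rendering. The dictionary keys ("kind", "library_id", "text", "codes", "samples") and the two synthesizer callables are hypothetical stand-ins; no particular synthesizer interface is implied.

```python
def render_to_acoustic(packet, speech_synth, music_synth):
    """Return an acoustic event packet for any supported packet kind.

    speech_synth(library_id, text) and music_synth(library_id, codes) are
    assumed to return a list of monophonic samples.
    """
    kind = packet["kind"]
    if kind == "acoustic":
        return packet  # regular acoustic packets pass through unchanged
    if kind == "textual":
        samples = speech_synth(packet["library_id"], packet["text"])
    elif kind == "synthetic":
        samples = music_synth(packet["library_id"], packet["codes"])
    else:
        raise ValueError("unknown packet kind: %r" % (kind,))
    # Keep the shared sound event fields (time, location, ...) and replace
    # the synthesis-specific fields with the generated samples.
    acoustic = {k: v for k, v in packet.items()
                if k not in ("text", "codes", "library_id")}
    acoustic.update(kind="acoustic", samples=samples)
    return acoustic
```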

The presence of textual event packets and synthetic non-verbal event packets allows for a radical decrease of the bandwidth required for transmission of audio data. In turn, the synthetic library packet requires some bandwidth but allows synthesis quality to be increased and still does not require as much data as regular audio samples recorded in real time.

The present invention relates to recording, encoding and decoding of sound in order to provide for surround playback independent of the loudspeaker setup at the sound presentation location. Therefore, the invention provides a useful, concrete and tangible result.

The aforementioned recording, encoding and decoding of sound takes place in special systems and processes sound data. Therefore the machine or transformation test is fulfilled and the idea is not abstract.

It can be easily recognized, by one skilled in the art, that the aforementioned method for generating surround sound may be performed and/or controlled by one or more computer programs. Such computer programs are typically executed by utilizing the computing resources in a computing device. Applications are stored on a non-transitory medium. An example of a non-transitory medium is a non-volatile memory, for example a flash memory, while an example of a volatile memory is RAM. The computer instructions are executed by a processor. These memories are exemplary recording media for storing computer programs comprising computer-executable instructions performing all the steps of the computer-implemented method according to the technical concept presented herein.

While the invention presented herein has been depicted, described, and has been defined with reference to particular preferred embodiments, such references and examples of implementation in the foregoing specification do not imply any limitation on the invention. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader scope of the technical concept. The presented preferred embodiments are exemplary only, and are not exhaustive of the scope of the technical concept presented herein.

Accordingly, the scope of protection is not limited to the preferred embodiments described in the specification, but is only limited by the claims that follow.