Musical Visuals

In his book, Vom Musikalisch-Schönen, Eduard Hanslick argues that music’s purpose is to communicate “tönend bewegte Formen” (tonally moving forms). He argues that the understanding of music is “not empty but rather filled, not mere borders in a vacuum but rather intellect shaping itself from within.” Geoffrey Payzant interprets this understanding as “mind giving external shape to itself from within.”

If the purpose of music is to interact with dynamic shapes and forms, then how can we visualize these forms? We’ll first explore the nature of real-world and synthetic sound, narrowing in on the forms of data we use to store and analyze synthetic (digital) sound. From these forms of digital data, we’ll begin to articulate a vision for visualizing this data as moving forms within a computer graphics context, exploring what the nature of these forms and shapes might be.

Real-World Sound

Sounds are a constant part of the natural world. They are produced by vibrations in the air hitting our eardrums, which send a signal to our brain; the brain then tries to figure out what the sound is, where it’s coming from, and how close it is.

Whether it’s our own motion relative to the things making sound, or their motion relative to ourselves, there is a constant dynamism to the sound in the world around us. Sounds get louder and softer, filtered and affected by the obstacles and conditions between our ears and their source. Each sound seems to have its own signature, such that we can identify different types of sound.

The end result of hearing a sound seems to have two components: a source, which has a location and signature; and a setting, which transforms the sound before it reaches our ear. Our brain puts all this together, and produces a sonic map describing where sounds are coming from and what is making them.

Synthetic Sound

When we listen to synthetic sounds, it is often through in-ear or over-ear headphones, or over a speaker. The sound sources are relatively static, and often transmit music in channels, where different speakers play different versions of the sound. It is up to the sound coming out of the speaker to be dynamic, changing its signatures, locations and settings, and doing so across a number of different channels.

To do this, we use effects like reverberation or echo, or apply different pannings and levels, to simulate different locations. These effects simulate some physical process, using building blocks like delays, filters and oscillators. The end result is a simulation of the sonic map, produced through synthetic means.

How is Synthetic Sound represented?

We see two main representations of synthetic sound data: temporal and spectral.


The temporal representation gives us samples over time, or the traditional audio waveform that we are used to. These samples correspond to offsets of the speaker cone’s position, which is how speakers know how to reproduce the sounds stored in the data. This representation is most commonly used to store, process and stream audio. We often store and stream audio in a compressed form, while we typically work with audio in a lossless (or uncompressed) form in our DAWs. We see this form of data in waveforms and in level meters, which are representations of the uncompressed form of audio (amplitudes over time).
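As a rough illustration of what a level meter computes from this representation, the sketch below takes a block of samples and reports its peak and RMS levels in dBFS. It assumes samples are normalized to the range -1.0 to 1.0; the function names and the test tone are illustrative and not taken from any particular DAW.

```python
import math

def peak_dbfs(samples):
    """Peak level in dBFS, assuming samples normalized to [-1.0, 1.0]."""
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")

def rms_dbfs(samples):
    """RMS (average energy) level in dBFS, assuming the same normalization."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

# Illustrative input: one second of a 440 Hz sine at half amplitude, 48 kHz.
sample_rate = 48_000
block = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
         for n in range(sample_rate)]

print(f"peak: {peak_dbfs(block):.2f} dBFS, rms: {rms_dbfs(block):.2f} dBFS")
```

A half-amplitude sine peaks at roughly -6 dBFS, with an RMS level about 3 dB below that.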


Without diving in too deep, we can think of sounds as a single wave, or as the summation of a large number of waves. We saw the first form above, and we’ll address the second form here. We know that sounds sound different (they have different characters, tones, pitches, timbres), and this is because they have different combinations of frequencies (waves).

The spectral representation gives us pictures of these frequencies, in the form of weighted bins over time (like a dynamic histogram). Each bin represents a band in the frequency spectrum, and its value represents the current strength of sound in that frequency band. This representation is more data-intensive than its temporal counterpart, and so it is often only stored temporarily (to do audio work or to display audio data within a DAW). We see this in the displays of EQs and spectrograms.

How can we reimagine?

DAWs are visual processors as much as sound processors. They make several features of digital audio — amplitude, frequency, time — legible as sonic objects, coloured boxes...
Michael Terren
Computer-based music production involves the eyes as much as the ears. The representations in audio editors like Pro Tools and Ableton Live are purely informational, waveforms and grids and linear graphs.
Ethan Hein

In trying to reimagine our visualization, we can approach it through the metaphor of painting. Namely, we can explore the canvas and the paint.

In other words, what is the dimensionality of the canvas (space) that sound can occupy, and what is the form and style of the paint (matter) that sound is represented with?

For the purposes of Sonik, we want to visualize the sonic map. Thus, we must be able to display sound in a natural environment, sound from multiple sources, and sound across time.

We want to be able to represent multiple sounds through a single, encompassing view. Rather than relying on level meters and EQ displays to serve as the points of visualization within the DAW, we want the DAW to be based around a dynamic picture of sound. This involves being able to route multiple sounds into a single display, and visualizing them across the temporal spectrum, the stereo spectrum, the decibel spectrum and the frequency spectrum. Essentially, we need a 4D understanding of sound to serve as our notion of its form over time (its “tönend bewegte Formen”, per Hanslick).


If we think of the purpose of a canvas, it is mainly to offer a space that contains the paint. It provides a certain dimensionality in which the visuals are contained. We can begin to think about how to define this dimensionality for our purposes. In our case, we want to first extend the traditional 2D canvas to a 3D canvas to capture the stereo, decibel and frequency spectrums of sound. We can map the stereo spectrum to the x-axis, working from its nature as representing left and right channels. We can map the frequency spectrum to the y-axis, capturing the “high-low” notion of frequency, as well as some of its psychoacoustic properties. We can map the decibel spectrum to the z-axis, reflecting the nature of softer sounds to feel further away and louder sounds to feel closer.
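One way this mapping might look in code is sketched below. The function name `to_canvas`, the 20 Hz to 20 kHz range, the logarithmic frequency scale and the -60 dB floor are all assumptions chosen for illustration; the text above only fixes which quantity goes on which axis.

```python
import math

def to_canvas(pan, freq_hz, level_db,
              f_min=20.0, f_max=20_000.0, db_floor=-60.0):
    """Map (pan, frequency, level) to a point (x, y, z) on a 3D canvas.

    x: stereo position, -1.0 (left) to +1.0 (right)
    y: frequency on a log scale, 0.0 (f_min) to 1.0 (f_max)
    z: closeness, 0.0 (db_floor, far away) to 1.0 (0 dB, close)
    All ranges and defaults are hypothetical choices for this sketch.
    """
    x = max(-1.0, min(1.0, pan))
    f = max(f_min, min(f_max, freq_hz))
    y = math.log(f / f_min) / math.log(f_max / f_min)
    db = max(db_floor, min(0.0, level_db))
    z = 1.0 - db / db_floor
    return x, y, z

# A sound panned slightly right, at 1 kHz, 12 dB below full scale.
print(to_canvas(0.25, 1000.0, -12.0))
```

A logarithmic y-axis is used here because equal musical intervals then take up equal vertical space; a linear or perceptual (for example, mel) scale could be swapped in without changing the rest of the model.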


While the canvas provides the bounding dimensionality, the paint provides the forms that fill this dimensionality. As our paint is supposed to represent sound, we ought to consider how we can derive its form from sound data. We can consider this through the metaphor of painting, exploring what it means to move a brush around and draw shapes of different sizes.


Digital sound is often stored in one of two forms, either as a mono or stereo signal. The ability to route sounds through different speakers allows us to define different signals, or channels, that can be routed to speakers. Commonly, we think of the situation in which we have two channels, the left and the right, which we output through headphones, or a pair of speakers.

Most digital music is stored in this way, as two channels. This dual-channel nature gives rise to the pan functionality, in which we can move a sound between left (L) and right (R) across the x-axis. The difference in level between the two speakers defines where in this stereo field we hear the sound.

We can imagine our brush going back and forth horizontally. As one channel increases in volume over the other, the brush moves across the canvas, its position controlled by the angle of the sound in stereo space.
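To make the idea of an angle in stereo space concrete, here is a minimal sketch that places a mono sound with a constant-power pan law and then recovers the pan position from the two channel levels. The pan law is an assumption (it is a common choice, but the text does not specify one), and the function names are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pan_gains(pan):
    """Constant-power pan law (assumed): pan -1 (left) to +1 (right)."""
    theta = (pan + 1.0) * math.pi / 4.0   # 0 = hard left, pi/2 = hard right
    return math.cos(theta), math.sin(theta)

def estimate_pan(left, right):
    """Estimate pan in [-1, 1] from the angle between the channel levels."""
    l, r = rms(left), rms(right)
    if l == 0.0 and r == 0.0:
        return 0.0                        # silence: treat as centered
    theta = math.atan2(r, l)
    return theta / (math.pi / 4.0) - 1.0

# Pan a short test tone halfway to the right, then read the position back.
mono = [math.sin(2 * math.pi * n / 32) for n in range(256)]
gain_l, gain_r = pan_gains(0.5)
left = [gain_l * s for s in mono]
right = [gain_r * s for s in mono]
print(round(estimate_pan(left, right), 3))
```

The estimated value is what would drive the brush along the x-axis. For material that was not panned with this law, the estimate is still a usable left-to-right indicator, though not an exact inverse.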


If we consider the level functionality, which can increase or lower the loudness of the sound, we can think about moving the sound further from or closer towards us. Perceptually, this is felt as the closeness of the sound to us, with louder sounds feeling closer and softer sounds feeling more distant.

On a two-dimensional canvas, there are several ways we can represent the depth of a three-dimensional space. Our brush can go from drawing larger shapes to drawing smaller shapes as the level drops. We may also use techniques like overlapping, perspective, or shadowing.
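As a sketch of the larger-to-smaller idea, the snippet below converts a level in dB to a depth and then scales a shape with a simple perspective divide. The -60 dB floor, the depth range and the focal length are hypothetical values picked for illustration.

```python
def depth_from_level(level_db, db_floor=-60.0, max_depth=10.0):
    """Quieter means further away: 0 dB sits at depth 0, db_floor at max_depth."""
    db = max(db_floor, min(0.0, level_db))
    return max_depth * (db / db_floor)

def projected_size(base_size, depth, focal_length=5.0):
    """Pinhole-style perspective: size shrinks as depth grows."""
    return base_size * focal_length / (focal_length + depth)

# The same 20-pixel brush drawn for a loud, a medium and a quiet sound.
for level in (0.0, -18.0, -48.0):
    size = projected_size(20.0, depth_from_level(level))
    print(f"{level:6.1f} dB -> {size:.1f} px")
```

Overlap and shadow could be layered on top of this by drawing shapes in depth order, but size alone already gives a readable cue.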


We can think of temporal data as giving the amplitudes of the whole signal over time, and spectral data as giving the amplitudes of different frequencies at a specific point in time. We can derive spectral data from temporal data, and thus we can take different spectral “snapshots” over time, leaving us with a temporal spectral signal.
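The snapshot idea can be sketched as a short-time Fourier transform: slice the samples into overlapping frames and compute the strength of each frequency bin per frame. The version below uses a naive DFT from the standard library so that it stays self-contained; a real pipeline would more likely use an FFT library. The frame size, hop size and Hann window are assumptions for illustration.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the first N/2 + 1 DFT bins of a real frame (naive O(N^2))."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc) / n)
    return mags

def spectrogram(samples, frame_size=64, hop=32):
    """One list of bin magnitudes per frame: spectral snapshots over time."""
    # Hann window to soften the frame edges before analysis.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_size)
              for i in range(frame_size)]
    snapshots = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        snapshots.append(dft_magnitudes(frame))
    return snapshots

# A low tone followed by a higher one: the strongest bin should move upward.
tone_a = [math.sin(2 * math.pi * 4 * n / 64) for n in range(128)]
tone_b = [math.sin(2 * math.pi * 12 * n / 64) for n in range(128)]
snapshots = spectrogram(tone_a + tone_b)
strongest = [frame.index(max(frame)) for frame in snapshots]
print(strongest[0], strongest[-1])
```

Each inner list is one column of the dynamic histogram described earlier, and the outer list is its evolution over time.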

To this spectral signal, we can apply filters, which boost or cut certain bands. This allows us to shape a signal, accentuating certain aspects and diminishing others. In a typical display in a filter or EQ plugin, this histogram is represented traditionally, with bars laid out left to right—lows on the left, highs on the right.

For our purposes, we can turn this orientation 90 degrees and represent lows at the bottom, and highs at the top. We imagine a brush crafting a form in space, and the ways it might vary the shape from top to bottom, creating areas of different weight or detail.

Other effects, like reverb and delay, can be thought of in similar ways, with blurring or smearing, or repeated forms.

Sound Pictures

In creating “pictures of sound”, we want the ability to represent all the small details that make up a full picture of the sonic map. Being able to take a more detailed snapshot of a sound signal, by breaking it apart into its frequency components and taking measurements of the strength of each component, allows us to have more control over each of these details. We can loosely equate this to a painting being the product of many brushstrokes.

Instead of using whole shapes (as in the above graphics), we can use small elements compositionally (points, lines or shapes), allowing each to take on a range of visual properties in order to encode different pieces of data. Through size, shape, color, texture, brightness, opacity, location, z-ordering, blurriness, shadow, and other visual properties, we can individually paint each data point according to independent, derived properties (like how loud it is, what source it is from, how long ago it occurred, etc.).

These visual properties allow the articulation of shape and form. By thinking compositionally, and utilizing the base component of sound (vibrating waves), we can think of drawing not whole forms, but aggregate forms. Decomposing forms into sets of points makes it easier to draw multiple different sources into the same display, as each unit of drawing takes up minimal space and can be independently fed data to control its display.
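A minimal sketch of such a drawing unit follows. The `SoundPoint` type, its fields and the linear fade are hypothetical; the point is only that each unit carries its own derived data, so points from any number of sources can be merged into one display list.

```python
from dataclasses import dataclass

@dataclass
class SoundPoint:
    source: str   # which sound source produced this point
    x: float      # stereo position, -1 (left) to +1 (right)
    y: float      # frequency position, 0 (low) to 1 (high)
    z: float      # closeness, 0 (far, quiet) to 1 (near, loud)
    age: float    # seconds since this measurement was taken

def opacity(point, fade_seconds=2.0):
    """Fade a point out linearly as it ages, so older sound leaves a trail."""
    return max(0.0, 1.0 - point.age / fade_seconds)

def draw_order(points):
    """Sort far-to-near so that nearer points are painted last (z-ordering)."""
    return sorted(points, key=lambda p: p.z)

points = [
    SoundPoint("vocals", 0.0, 0.6, 0.9, age=0.0),
    SoundPoint("bass", -0.2, 0.1, 0.4, age=1.0),
    SoundPoint("vocals", 0.1, 0.7, 0.2, age=3.0),
]
for p in draw_order(points):
    print(p.source, p.z, opacity(p))
```

Because the points are independent, adding another source means appending more points to the list rather than redesigning the display.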

Wrap Up


If we return to our goals, we wanted a display that a) visualized sound in a natural environment, b) visualized multiple sound sources and c) visualized sound across time. By using parameterized points—with the parameters being the x, y and z coordinates calculated from panning, frequency componentization and loudness calculations—we are able to accomplish these goals. We model the natural environment through a 3-dimensional approach to sound visualization, and we allow arbitrary sources and time to be shown by working with a small base unit that is driven by derived properties from the sound signal.

Separation of Content and Style

Additionally, this model does not impose any visual style on the visualization; it leaves the user free to specify ranges and values for properties like size, color, and texture. This allows the engine to be flexible and extensible, determining the content while leaving space for the user to determine the style. Thus, by developing a set of textures, colors and shapes that the user can choose from and assign to different sound sources, we can allow them to design and represent sound closer to how they internally view sound.
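One possible shape for this separation is sketched below: the engine supplies a normalized strength for each point, and a user-defined style decides what that strength looks like. The `Style` type, its fields and the example source names are hypothetical, not part of an existing Sonik interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Style:
    color: tuple          # RGB chosen by the user for a source
    size_range: tuple     # (smallest, largest) point size
    opacity_range: tuple  # (faintest, strongest) opacity

def lerp(lo, hi, t):
    """Linear interpolation between lo and hi for t in [0, 1]."""
    return lo + (hi - lo) * t

def apply_style(style, strength):
    """Turn engine-provided content (strength 0..1) into styled visuals."""
    t = max(0.0, min(1.0, strength))
    return {
        "color": style.color,
        "size": lerp(*style.size_range, t),
        "opacity": lerp(*style.opacity_range, t),
    }

# Styles are user data, keyed by sound source; the engine never hard-codes them.
styles = {
    "drums": Style(color=(255, 120, 0), size_range=(2.0, 12.0),
                   opacity_range=(0.2, 1.0)),
    "vocals": Style(color=(80, 160, 255), size_range=(1.0, 6.0),
                    opacity_range=(0.5, 1.0)),
}
print(apply_style(styles["drums"], 0.5))
```

Swapping the `styles` table changes how the picture looks without touching the code that computes the points.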

Next Steps

The task now is to translate this model into a graphics engine. First, we must establish a data pipeline that can take a sound input and return sets of points over timesteps. Next, we’ll need to build a graphics pipeline to take these points and draw them to a screen. Finally, we’ll need to build an interface that allows for sound properties to be mapped to different graphics properties, and for the ranges and values of these to be defined. To be continued…