The Audio Analysis endpoint provides low-level audio analysis for all of the tracks in the Spotify catalog. The Audio Analysis describes the track’s structure and musical content, including rhythm, pitch, and timbre. All information is precise to the audio sample.

Many elements of the analysis include a confidence value, a floating-point number ranging from 0.0 to 1.0. Confidence indicates the reliability of the corresponding attribute. Elements carrying a small confidence value should be considered speculative: there may not be sufficient data in the audio to compute the attribute with high certainty.

Endpoint

GET https://api.spotify.com/v1/audio-analysis/{id}

Request Parameters

Path Parameters

Path Parameter Value
id Required. The Spotify ID for the track.

Header Fields

Header Field Value
Authorization Required. A valid access token from the Spotify Accounts service: see the Web API Authorization Guide for details.

Response Format

On success, the HTTP status code in the response header is 200 OK and the response body contains an audio analysis object in JSON format. On error, the header status code is an error code and the response body contains an error object.

Examples

Get audio analysis for a track

curl -X GET "https://api.spotify.com/v1/audio-analysis/3JIxjvbbDrA9ztYlNcp3yL" -H "Authorization: Bearer {your access token}"
{
  "bars": [
    {
      "start": 251.98282,
      "duration": 0.29765,
      "confidence": 0.652
    }
  ],
  "beats": [
    {
      "start": 251.98282,
      "duration": 0.29765,
      "confidence": 0.652
    }
  ],
  "sections": [
    {
      "start": 237.02356,
      "duration": 18.32542,
      "confidence": 1,
      "loudness": -20.074,
      "tempo": 98.253,
      "tempo_confidence": 0.767,
      "key": 5,
      "key_confidence": 0.327,
      "mode": 1,
      "mode_confidence": 0.566,
      "time_signature": 4,
      "time_signature_confidence": 1
    }
  ],
  "segments": [
    {
      "start": 252.15601,
      "duration": 3.19297,
      "confidence": 0.522,
      "loudness_start": -23.356,
      "loudness_max_time": 0.06971,
      "loudness_max": -18.121,
      "loudness_end": -60,
      "pitches": [
        0.709,
        0.092,
        0.196,
        0.084,
        0.352,
        0.134,
        0.161,
        1,
        0.17,
        0.161,
        0.211,
        0.15
      ],
      "timbre": [
        23.312,
        -7.374,
        -45.719,
        294.874,
        51.869,
        -79.384,
        -89.048,
        143.322,
        -4.676,
        -51.303,
        -33.274,
        -19.037
      ]
    }
  ],
  "tatums": [
    {
      "start": 251.98282,
      "duration": 0.29765,
      "confidence": 0.652
    }
  ],
  "track": {
    "duration": 255.34898,
    "sample_md5": "",
    "offset_seconds": 0,
    "window_seconds": 0,
    "analysis_sample_rate": 22050,
    "analysis_channels": 1,
    "end_of_fade_in": 0,
    "start_of_fade_out": 251.73333,
    "loudness": -11.84,
    "tempo": 98.002,
    "tempo_confidence": 0.423,
    "time_signature": 4,
    "time_signature_confidence": 1,
    "key": 5,
    "key_confidence": 0.36,
    "mode": 0,
    "mode_confidence": 0.414,
    "codestring": "eJxVnAmS5DgOBL-ST-B9_P9j4x7M6qoxW9tpsZQSCeI...",
    "code_version": 3.15,
    "echoprintstring": "eJzlvQmSHDmStHslxw4cB-v9j_A-tahhVKV0IH9...",
    "echoprint_version": 4.12,
    "synchstring": "eJx1mIlx7ToORFNRCCK455_YoE9Dtt-vmrKsK3EBsTY...",
    "synch_version": 1,
    "rhythmstring": "eJyNXAmOLT2r28pZQuZh_xv7g21Iqu_3pCd160xV...",
    "rhythm_version": 1
  }
}
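
The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names `build_request` and `fetch_audio_analysis` are hypothetical, and the access token is assumed to have been obtained separately through the Spotify Accounts service.

```python
import json
import urllib.error
import urllib.request

API_BASE = "https://api.spotify.com/v1/audio-analysis/"


def build_request(track_id: str, access_token: str) -> urllib.request.Request:
    """Build the GET request for one track's audio analysis."""
    return urllib.request.Request(
        API_BASE + track_id,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


def fetch_audio_analysis(track_id: str, access_token: str) -> dict:
    """Return the parsed audio analysis object, or raise on an error status."""
    request = build_request(track_id, access_token)
    try:
        with urllib.request.urlopen(request) as response:
            # 200 OK: the body is the audio analysis object in JSON format.
            return json.load(response)
    except urllib.error.HTTPError as err:
        # Error status: the body is expected to hold an error object;
        # it is surfaced here as raw text rather than parsed.
        body = err.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"HTTP {err.code}: {body}") from err
```

Calling `fetch_audio_analysis("3JIxjvbbDrA9ztYlNcp3yL", token)` with a valid token would return a dictionary with the keys shown in the example response above.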


Audio Analysis Object

Key Value Type Value Description
bars an array of time interval objects The time intervals of the bars throughout the track. A bar (or measure) is a segment of time defined as a given number of beats. Bar offsets also indicate downbeats, the first beat of the measure.
beats an array of time interval objects The time intervals of beats throughout the track. A beat is the basic time unit of a piece of music; for example, each tick of a metronome. Beats are typically multiples of tatums.
sections an array of section objects Sections are defined by large variations in rhythm or timbre, e.g. chorus, verse, bridge, guitar solo, etc. Each section contains its own descriptions of tempo, key, mode, time_signature, and loudness.
segments an array of segment objects Audio segmentation attempts to subdivide a song into many segments, with each segment containing a roughly consistent sound throughout its duration.
tatums an array of time interval objects A tatum represents the lowest regular pulse train that a listener intuitively infers from the timing of perceived musical events (segments). For more information about tatums, see Rhythm (below).

Time Interval Object

This is a generic object used to represent various time intervals within Audio Analysis. For information about how Bars, Beats, Tatums, Sections, and Segments are determined, please see the Rhythm section below.

Key Value Type Value Description
start float The starting point (in seconds) of the time interval.
duration float The duration (in seconds) of the time interval.
confidence float The confidence, from 0.0 to 1.0, of the reliability of the interval.
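
As a rough sketch of how time interval objects might be consumed, the snippet below filters intervals by confidence and computes each interval's end time. The helper names and the 0.5 threshold are assumptions chosen for illustration, and the second bar is made-up data.

```python
def confident_intervals(intervals, min_confidence=0.5):
    """Keep time interval objects whose confidence meets the threshold."""
    return [iv for iv in intervals if iv["confidence"] >= min_confidence]


def interval_end(interval):
    """End time in seconds: start plus duration."""
    return interval["start"] + interval["duration"]


bars = [
    {"start": 251.98282, "duration": 0.29765, "confidence": 0.652},
    {"start": 252.28047, "duration": 0.30000, "confidence": 0.120},  # made-up entry
]

# Only the first bar survives the (arbitrary) 0.5 threshold.
kept = confident_intervals(bars)
```

A suitable threshold depends on the application; low-confidence intervals are speculative rather than necessarily wrong.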

Section Object

Key Value Type Value Description
start float The starting point (in seconds) of the section.
duration float The duration (in seconds) of the section.
confidence float The confidence, from 0.0 to 1.0, of the reliability of the section’s “designation”.
loudness float The overall loudness of the section in decibels (dB). Loudness values are useful for comparing relative loudness of sections within tracks.
tempo float The overall estimated tempo of the section in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
tempo_confidence float The confidence, from 0.0 to 1.0, of the reliability of the tempo. Some tracks contain tempo changes or sounds without a tempo (like pure speech), which would correspond to a low value in this field.
key integer The estimated overall key of the section. Values in this field range from 0 to 11 and map to pitches using standard Pitch Class notation (e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on). If no key was detected, the value is -1.
key_confidence float The confidence, from 0.0 to 1.0, of the reliability of the key. Songs with many key changes may correspond to low values in this field.
mode integer Indicates the modality (major or minor) of a track, that is, the type of scale from which its melodic content is derived. This field will contain a 0 for “minor”, a 1 for “major”, or a -1 for no result. Note that a major key (e.g. C major) is likely to be confused with the minor key 3 semitones lower (e.g. A minor), as both keys carry the same pitches.
mode_confidence float The confidence, from 0.0 to 1.0, of the reliability of the mode.
time_signature integer An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of “3/4” to “7/4”.
time_signature_confidence float The confidence, from 0.0 to 1.0, of the reliability of the time_signature. Sections with time signature changes may correspond to low values in this field.
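
The key and mode encodings above can be turned into readable names with a small lookup. The sketch below is illustrative only: `describe_key` and `relative_key` are hypothetical helpers, and the sharp/flat spellings are one conventional choice.

```python
PITCH_CLASSES = [
    "C", "C#/Db", "D", "D#/Eb", "E", "F",
    "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B",
]
MODES = {0: "minor", 1: "major"}


def describe_key(key, mode):
    """Return a name such as 'F major', or None when no key was detected."""
    if key == -1:
        return None
    mode_name = MODES.get(mode)  # None when mode is -1 (no result)
    name = PITCH_CLASSES[key]
    return f"{name} {mode_name}" if mode_name else name


def relative_key(key, mode):
    """Return the (key, mode) pair that shares the same pitches.

    A major key's relative minor is 3 semitones lower; a minor key's
    relative major is 3 semitones higher.
    """
    if mode == 1:
        return (key - 3) % 12, 0
    if mode == 0:
        return (key + 3) % 12, 1
    raise ValueError("mode must be 0 (minor) or 1 (major)")
```

For the section in the example response (key 5, mode 1), `describe_key` gives "F major". When mode_confidence is low, it may be worth also considering the pair returned by `relative_key`.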

Segment Object

Key Value Type Value Description
start float The starting point (in seconds) of the segment.
duration float The duration (in seconds) of the segment.
confidence float The confidence, from 0.0 to 1.0, of the reliability of the segmentation. Segments of the song which are difficult to logically segment (e.g. noise) may correspond to low values in this field.
loudness_start float The onset loudness of the segment in decibels (dB). Combined with loudness_max and loudness_max_time, these components can be used to describe the “attack” of the segment.
loudness_max float The peak loudness of the segment in decibels (dB). Combined with loudness_start and loudness_max_time, these components can be used to describe the “attack” of the segment.
loudness_max_time float The segment-relative offset of the segment peak loudness in seconds. Combined with loudness_start and loudness_max, these components can be used to describe the “attack” of the segment.
loudness_end float The offset loudness of the segment in decibels (dB). This value should be equivalent to the loudness_start of the following segment.
pitches array of floats A “chroma” vector representing the pitch content of the segment, corresponding to the 12 pitch classes C, C#, D to B, with values ranging from 0 to 1 that describe the relative dominance of every pitch in the chromatic scale. More details about how to interpret this vector can be found below.
timbre array of floats Timbre is the quality of a musical note or sound that distinguishes different types of musical instruments, or voices. Timbre vectors are best used in comparison with each other. More details about how to interpret this vector can be found below.
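
One way to summarize the “attack” of a segment from its three loudness fields is sketched below. The function name `attack_profile` and the returned keys are assumptions made for this example; the segment values are taken from the example response.

```python
def attack_profile(segment):
    """Summarize a segment's attack from its loudness fields."""
    rise_db = segment["loudness_max"] - segment["loudness_start"]
    offset = segment["loudness_max_time"]  # seconds, relative to segment start
    return {
        # Absolute position of the loudness peak within the track, in seconds.
        "peak_time": segment["start"] + offset,
        # How many dB the loudness rises from onset to peak.
        "rise_db": rise_db,
        # Average rise rate; undefined when the peak coincides with the onset.
        "slope_db_per_s": rise_db / offset if offset > 0 else None,
    }


example_segment = {
    "start": 252.15601,
    "duration": 3.19297,
    "loudness_start": -23.356,
    "loudness_max_time": 0.06971,
    "loudness_max": -18.121,
    "loudness_end": -60,
}
profile = attack_profile(example_segment)
```

A large rise reached in a short time suggests a sharp, percussive onset, while a small or slow rise suggests a softer one.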

Audio Analysis Details

Rhythm

Beats are subdivisions of bars. Tatums are subdivisions of beats. That is, bars always align with a beat, and beats likewise align with a tatum. Note that a low confidence does not necessarily mean the value is inaccurate. Exceptionally, a confidence of -1 indicates “no” value: the corresponding element must be discarded. A track may end up with no bars, no beats, and/or no tatums if no periodicity was detected. The time signature ranges from 3 to 7, indicating time signatures of 3/4 to 7/4. A value of -1 may indicate no time signature, while a value of 1 indicates a rather complex or changing time signature.

Rhythm subdivisions can be visualized in the following form: [Figure: Rhythm subdivisions]
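
The nesting of beats inside bars can also be sketched in code. The example below first discards elements with a confidence of -1 and then assigns each beat to the bar whose time span contains the beat's start. The helper names and all of the data are hypothetical, chosen only to illustrate the idea.

```python
def drop_no_value(intervals):
    """Discard elements whose confidence of -1 marks them as 'no value'."""
    return [iv for iv in intervals if iv["confidence"] != -1]


def group_by_bar(bars, beats):
    """For each bar, list the beats that start within [bar start, bar end)."""
    grouped = []
    for bar in bars:
        bar_end = bar["start"] + bar["duration"]
        grouped.append([b for b in beats if bar["start"] <= b["start"] < bar_end])
    return grouped


# Made-up data: two 4/4 bars of 2 seconds each, with a beat every 0.5 seconds.
bars = [
    {"start": 0.0, "duration": 2.0, "confidence": 0.8},
    {"start": 2.0, "duration": 2.0, "confidence": 0.7},
]
beats = [{"start": i * 0.5, "duration": 0.5, "confidence": 0.9} for i in range(8)]
beats.append({"start": 1.25, "duration": 0.5, "confidence": -1})  # must be discarded

beats_per_bar = [len(group) for group in group_by_bar(bars, drop_no_value(beats))]
```

The same grouping can be applied one level down to assign tatums to beats. With real data, a small tolerance on the boundary comparison may be needed to absorb floating-point rounding.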

Pitch

Pitch content is given by a “chroma” vector, corresponding to the 12 pitch classes C, C#, D to B, with values ranging from 0 to 1 that describe the relative dominance of every pitch in the chromatic scale. For example, a C Major chord would likely be represented by large values of C, E and G (i.e. classes 0, 4, and 7). Vectors are normalized to 1 by their strongest dimension; therefore, noisy sounds are likely represented by values that are all close to 1, while pure tones are described by one value at 1 (the pitch) and others near 0.

As can be seen below, the 12 vector indices are a combination of low-power spectrum values at their respective pitch frequencies. [Figure: Pitch vector]
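
To illustrate how a chroma vector might be read, the sketch below ranks the pitch classes of a segment by their relative dominance and computes the mean of the vector as a crude indicator of noisiness. Both helper names are hypothetical, and using the mean this way is an assumption drawn from the description above rather than a documented measure. The vector is the pitches array from the example response.

```python
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def dominant_pitches(chroma, count=3):
    """Return the names of the `count` strongest pitch classes, strongest first."""
    ranked = sorted(range(len(chroma)), key=lambda i: chroma[i], reverse=True)
    return [PITCH_NAMES[i] for i in ranked[:count]]


def chroma_mean(chroma):
    """Mean of the vector: near 1 for noisy sounds, much lower for a pure tone."""
    return sum(chroma) / len(chroma)


example_pitches = [0.709, 0.092, 0.196, 0.084, 0.352, 0.134,
                   0.161, 1, 0.17, 0.161, 0.211, 0.15]
top_three = dominant_pitches(example_pitches)
```

For this segment the three strongest classes are G, C and E (classes 7, 0 and 4), which matches the C Major chord pattern described above.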

Timbre

Timbre is the quality of a musical note or sound that distinguishes different types of musical instruments, or voices. It is a complex notion also referred to as sound color, texture, or tone quality, and is derived from the shape of a segment’s spectro-temporal surface, independently of pitch and loudness. The timbre feature is a vector that includes 12 unbounded values roughly centered around 0. Those values are high level abstractions of the spectral surface, ordered by degree of importance.

For completeness, however, the first dimension represents the average loudness of the segment; the second emphasizes brightness; the third is more closely correlated to the flatness of a sound; the fourth to sounds with a stronger attack; etc. See the image below representing the 12 basis functions (i.e. template segments).

[Figure: Timbre basis functions]

The actual timbre of the segment is best described as a linear combination of these 12 basis functions weighted by the coefficient values: timbre = c1 x b1 + c2 x b2 + … + c12 x b12, where c1 to c12 represent the 12 coefficients and b1 to b12 the 12 basis functions as displayed above. Timbre vectors are best used in comparison with each other.
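
Since timbre vectors are best used in comparison with each other, one simple approach is to measure the distance between the coefficient vectors of two segments. The sketch below uses plain Euclidean distance. This metric is an assumption made for illustration, not a method prescribed by the API, and `timbre_distance` is a hypothetical helper.

```python
import math


def timbre_distance(a, b):
    """Euclidean distance between two timbre coefficient vectors."""
    if len(a) != len(b):
        raise ValueError("timbre vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


example_timbre = [23.312, -7.374, -45.719, 294.874, 51.869, -79.384,
                  -89.048, 143.322, -4.676, -51.303, -33.274, -19.037]

# A segment is at distance 0 from itself; larger values mean less similar timbre.
self_distance = timbre_distance(example_timbre, example_timbre)
```

Because the first coefficient tracks average loudness, it may dominate the distance. Dropping or down-weighting it is a common design choice when only tone color matters.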