Data Processing with pandas

pandas

pandas is a popular Python package with excellent functionality for data analysis tasks.

DataFrames

A DataFrame is a pandas data structure, much like a table:

  • values are held in rows and columns
  • each row and column has a label

Typically, rows are numerically labelled, starting at zero, while columns have names in strings.

However, a DataFrame can have strings for row labels and they don't have to be unique.

Example repl

Let's jump in with an example. Fork the repl, remembering as always to add your credentials to a .env file.

Info

The walkthrough will highlight a handful of pandas techniques used for creating the Seeder app, but if you would like a more complete understanding of pandas and these techniques, please consult the excellent documentation.

import requests
from requests.auth import HTTPBasicAuth
import os
import pandas as pd

%load_ext dotenv
%dotenv

client_id = os.getenv("CLIENT_ID")
client_secret = os.getenv("CLIENT_SECRET")

response = requests.post('https://accounts.spotify.com/api/token', {    
    'client_id': client_id,
    'client_secret': client_secret,
    'grant_type': 'client_credentials'
})

access_token = response.json()['access_token']
print(access_token)

from spotter import Spotter
s = Spotter(client_id, client_secret)
recs = s.get_item('recommendations', params={'seed_tracks': '4c6vZqYHFur11FbWATIJ9P', 'limit': 100})
    The dotenv extension is already loaded. To reload it, use:
      %reload_ext dotenv
    BQBWyRZKc4-2vaTtqPHTM92O3RuhyMCG56EsBcJjViruXuhyqxTqNnX66IKa2gdIvlnmTASdfTsfDBOnHJw

Code walkthrough

Not much new to see in main.py; we've just substituted use of the .get_recommmendations() method for the .get_item() method we used previously to get individual track data:

recs = s.get_recommendations(track_id)
for track in recs:  
    ...  
    print(f'{track_name} - {track_artist}')
  • .get_recommendataions() returns a list containing data for several tracks, over which we iterate to print the track_name and track_artist
  • pandas has also been imported at the top of the script, and given the alias of pd

In spotter.py we'll find .get_recommendations() at the bottom, but also notice a small modification to .get_item():

def get_item(self, category, item_id=None, params=None):
    ...
    resp = requests.get(f'{self.api_root}{category}/{item_id}', params=params, headers=self.headers)
  • the get_item() method no longer requires an item_id, and can also accept a params argument
def get_recommendations(self, seed, number=6):
    recs = self.get_item('recommendations', params={'seed_tracks': seed, 'limit': 100})
  • we use .get_item() to fetch some recommendations
    • the track_id defined in main.py is the value for the seed_tracks parameter
    • we set a limit of 100 items to be returned

Creating a DataFrame

recs_df = pd.DataFrame(recs['tracks'])
recs_df.head(3)
album artists available_markets disc_number duration_ms explicit external_ids external_urls href id is_local name popularity preview_url track_number type uri
0 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... 1 291369 False {'isrc': 'GBAYE9600152'} {'spotify': 'https://open.spotify.com/track/0m... https://api.spotify.com/v1/tracks/0mIZErjrNQzO... 0mIZErjrNQzOTDtn4UYflo False Sun Hits The Sky 48 https://p.scdn.co/mp3-preview/7591511d739043e4... 6 track spotify:track:0mIZErjrNQzOTDtn4UYflo
1 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AE, AL, AR, AT, AU, BE, BG, BH, BO, BR, BY, C... 1 351626 False {'isrc': 'GBAAN9500078'} {'spotify': 'https://open.spotify.com/track/2f... https://api.spotify.com/v1/tracks/2fXKyAyPrEa2... 2fXKyAyPrEa24c6PJyqznF False Common People 69 None 3 track spotify:track:2fXKyAyPrEa24c6PJyqznF
2 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... 1 182680 False {'isrc': 'GBFFS0400061'} {'spotify': 'https://open.spotify.com/track/7m... https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... 7mAF2MJdbNT75VrVcgwT6F False Hounds of Love 55 https://p.scdn.co/mp3-preview/c98ce02fe11caae9... 14 track spotify:track:7mAF2MJdbNT75VrVcgwT6F
  • the value for the tracks key in recs was a list of 100 dictionaries, each one relating to a given track
  • the pd.DataFrame() function has created a DataFrame from the list, with:
    • the data for each track being within a single row
    • the column headings are the keys from the track dictionaries
  • the .head() method allows us to show just the first few records

Using .apply() and lambda functions

In this part of the code, we're going to extract the number of tracks we tracks we're looking for (which is 6 by default), but only take those which are available in as many markets as possible.

We'll do this by:

  1. Adding a column which contains number of markets listed in available_markets for each track
  2. Filtering the DataFrame to include only those with the max() number found
  3. Using the .head() method to fetch the first 6 tracks from the remaining tracks
recs_df['availability'] = recs_df['available_markets'].apply(lambda x: len(x))
recs_df.head(3)
album artists available_markets disc_number duration_ms explicit external_ids external_urls href id is_local name popularity preview_url track_number type uri availability
0 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... 1 291369 False {'isrc': 'GBAYE9600152'} {'spotify': 'https://open.spotify.com/track/0m... https://api.spotify.com/v1/tracks/0mIZErjrNQzO... 0mIZErjrNQzOTDtn4UYflo False Sun Hits The Sky 48 https://p.scdn.co/mp3-preview/7591511d739043e4... 6 track spotify:track:0mIZErjrNQzOTDtn4UYflo 91
1 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AE, AL, AR, AT, AU, BE, BG, BH, BO, BR, BY, C... 1 351626 False {'isrc': 'GBAAN9500078'} {'spotify': 'https://open.spotify.com/track/2f... https://api.spotify.com/v1/tracks/2fXKyAyPrEa2... 2fXKyAyPrEa24c6PJyqznF False Common People 69 None 3 track spotify:track:2fXKyAyPrEa24c6PJyqznF 80
2 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... 1 182680 False {'isrc': 'GBFFS0400061'} {'spotify': 'https://open.spotify.com/track/7m... https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... 7mAF2MJdbNT75VrVcgwT6F False Hounds of Love 55 https://p.scdn.co/mp3-preview/c98ce02fe11caae9... 14 track spotify:track:7mAF2MJdbNT75VrVcgwT6F 92
  • a new column availability has been created (scroll right to see it)
  • the .apply() method was used on the available_markets column, with a lambda function used as its argument
  • this lambda function means "take the original value as x, and return len(x), i.e. the length of the list"

Lambda functions are also known as anonymous functions, and tend to be found where the logic is only required for a short period of time, i.e. it will not be re-used.

Selecting data

tracks = recs_df[recs_df['availability'] == recs_df['availability'].max()].head(6).copy()
tracks.head(2)
album artists available_markets disc_number duration_ms explicit external_ids external_urls href id is_local name popularity preview_url track_number type uri availability
2 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... 1 182680 False {'isrc': 'GBFFS0400061'} {'spotify': 'https://open.spotify.com/track/7m... https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... 7mAF2MJdbNT75VrVcgwT6F False Hounds of Love 55 https://p.scdn.co/mp3-preview/c98ce02fe11caae9... 14 track spotify:track:7mAF2MJdbNT75VrVcgwT6F 92
5 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... 1 414080 False {'isrc': 'GBDGG0200012'} {'spotify': 'https://open.spotify.com/track/2w... https://api.spotify.com/v1/tracks/2wzTHuw8nhFS... 2wzTHuw8nhFSXUU27x0PFa False There Goes The Fear 51 None 3 track spotify:track:2wzTHuw8nhFSXUU27x0PFa 92
  • this can be read as "return the df [where the ['availablility] column in that df is equal to == the .max() value in that same column ]"
  • ...and the return the .head() of the result, containing the number of values specified
  • we then used .head() again to display the first two rows of the result
tracks['features'] = tracks['id'].apply(lambda x: s.get_item('audio-features', x))
tracks[['availability', 'features']]
availability features
2 92 {'danceability': 0.36, 'energy': 0.895, 'key':...
5 92 {'danceability': 0.499, 'energy': 0.975, 'key'...
6 92 {'danceability': 0.491, 'energy': 0.686, 'key'...
9 92 {'danceability': 0.45, 'energy': 0.656, 'key':...
10 92 {'danceability': 0.508, 'energy': 0.826, 'key'...
14 92 {'danceability': 0.47, 'energy': 0.959, 'key':...
  • here we use .apply() again, this time incorporating our .get_item() method and passing the value in the id column as one of the arguments to get audio-features data for each track
  • we can see a subset of columns of the tracks DataFrame by providing a list of columns in [square brackets ]

reset_index()

tracks = tracks.reset_index(drop=True)
  • notice in our previous outputs that the row indexes have been retained from the original DataFrame (and aren't continuous or starting from zero)
  • it will be helpful for us later on to reset these so that they are; the drop=True parameter means that our original index values won't be stored (drop=False would have resulted in the creation of a new column containing them)

Unpacking JSON

features = pd.json_normalize(tracks['features'])
features.head(2)
danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo type id uri track_href analysis_url duration_ms time_signature
0 0.360 0.895 9 -3.795 1 0.1170 0.032900 0.000001 0.0600 0.424 168.214 audio_features 7mAF2MJdbNT75VrVcgwT6F spotify:track:7mAF2MJdbNT75VrVcgwT6F https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... https://api.spotify.com/v1/audio-analysis/7mAF... 182680 4
1 0.499 0.975 5 -4.504 1 0.0538 0.000086 0.072100 0.0258 0.509 109.997 audio_features 2wzTHuw8nhFSXUU27x0PFa spotify:track:2wzTHuw8nhFSXUU27x0PFa https://api.spotify.com/v1/tracks/2wzTHuw8nhFS... https://api.spotify.com/v1/audio-analysis/2wzT... 414080 4
  • the json_normalize() function is used here to unpack the dictionaries returned by our .get_item() method calls requesting audio-features for each track

Concatenation

tracks_full = pd.concat([tracks, features], axis=1)
print(tracks_full.shape)
tracks_full.head(2)
    (6, 37)
album artists available_markets disc_number duration_ms explicit external_ids external_urls href id ... liveness valence tempo type id uri track_href analysis_url duration_ms time_signature
0 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... 1 182680 False {'isrc': 'GBFFS0400061'} {'spotify': 'https://open.spotify.com/track/7m... https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... 7mAF2MJdbNT75VrVcgwT6F ... 0.0600 0.424 168.214 audio_features 7mAF2MJdbNT75VrVcgwT6F spotify:track:7mAF2MJdbNT75VrVcgwT6F https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... https://api.spotify.com/v1/audio-analysis/7mAF... 182680 4
1 {'album_type': 'ALBUM', 'artists': [{'external... [{'external_urls': {'spotify': 'https://open.s... [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... 1 414080 False {'isrc': 'GBDGG0200012'} {'spotify': 'https://open.spotify.com/track/2w... https://api.spotify.com/v1/tracks/2wzTHuw8nhFS... 2wzTHuw8nhFSXUU27x0PFa ... 0.0258 0.509 109.997 audio_features 2wzTHuw8nhFSXUU27x0PFa spotify:track:2wzTHuw8nhFSXUU27x0PFa https://api.spotify.com/v1/tracks/2wzTHuw8nhFS... https://api.spotify.com/v1/audio-analysis/2wzT... 414080 4

2 rows × 37 columns

  • we have concatenated (joined together) the two DataFrames
  • axis=1 means that we do so in a column-wise manner
  • the row indices are used to match the data from each DataFrame
  • the .shape attribute of a DataFrame contains the number of (rows, columns)

The pandas .join() and merge() methods are useful for more complex SQL-like joins using specific columns as keys.

dupes = tracks_full.columns.duplicated()
dupes
    array([False, False, False, False, False, False, False, False, False,
           False, False, False, False, False, False, False, False, False,
           False, False, False, False, False, False, False, False, False,
           False, False, False,  True,  True,  True, False, False,  True,
           False])
  • .duplicated() is a method which identifies any repeat instances of column names

Location with .loc

tracks_all = tracks_full.loc[:,~dupes].copy()
tracks_all.shape
    (6, 33)
  • .loc is used to locate items in our DataFrame
  • the code before the , refers to the rows; here : means 'all rows'
  • the code after the , refers to the columns; ~ means 'not' or 'opposite'
  • .copy() makes a new object so that subsequent operations on tracks_all can be made without tracks_full being affected (to save memory, pandas may not otherwise store such a subset as a distinct object)
  • we now have four fewer columns; the duplicates have been removed

Multiple artist names

Sometimes our tracks will have multplie artists, and it would be nice to be able to use all of their names in our app.

Here we again use .apply() with a lambda function, incorporating the .join() method and a list comprehension to amalgamate the name values for all entries in artists:

tracks_all['artist_string'] = tracks_all['artists'].apply(lambda x: ', '.join([a['name'] for a in x]))
list(tracks_all['artist_string'])
    ['The Futureheads',
     'Doves',
     'The Stone Roses',
     'Dexys Midnight Runners',
     'Stereophonics',
     'Primal Scream']

Data structure conversion

tracks_data = tracks_all.to_dict(orient="records")
tracks_data[0].keys()
    dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track_number', 'type', 'uri', 'availability', 'features', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'track_href', 'analysis_url', 'time_signature', 'artist_string'])
  • here we use the .to_dict() DataFrame method, with the orient="records" parameter, to return a list of dictionaries, each one containing data about a given track
  • each of these dictionaries has the same structure (except with the additional audio-features and our new artist_string key:value pairs) as seen previously when calling .json() on a Response object returned from a get_item() request