Data Processing with pandas
pandas
pandas
is a popular Python package with excellent functionality for data analysis tasks.
DataFrames
A DataFrame is a pandas data structure, much like a table:
- values are held in rows and columns
- each row and column has a label
Typically, rows are numerically labelled, starting at zero, while columns have names in strings.
However, a DataFrame can have strings for row labels and they don't have to be unique.
Example repl
Let's jump in with an example. Fork the repl, remembering as always to add your credentials to a .env
file.
Info
The walkthrough will highlight a handful of pandas
techniques used for creating the Seeder app, but if you would like a more complete understanding of pandas
and these techniques, please consult the excellent documentation.
import requests
from requests.auth import HTTPBasicAuth
import os
import pandas as pd
%load_ext dotenv
%dotenv
client_id = os.getenv("CLIENT_ID")
client_secret = os.getenv("CLIENT_SECRET")
response = requests.post('https://accounts.spotify.com/api/token', {
'client_id': client_id,
'client_secret': client_secret,
'grant_type': 'client_credentials'
})
access_token = response.json()['access_token']
print(access_token)
from spotter import Spotter
s = Spotter(client_id, client_secret)
recs = s.get_item('recommendations', params={'seed_tracks': '4c6vZqYHFur11FbWATIJ9P', 'limit': 100})
The dotenv extension is already loaded. To reload it, use:
%reload_ext dotenv
BQBWyRZKc4-2vaTtqPHTM92O3RuhyMCG56EsBcJjViruXuhyqxTqNnX66IKa2gdIvlnmTASdfTsfDBOnHJw
Code walkthrough
Not much new to see in main.py
; we've just substituted use of the .get_recommmendations()
method for the .get_item()
method we used previously to get individual track data:
recs = s.get_recommendations(track_id)
for track in recs:
...
print(f'{track_name} - {track_artist}')
.get_recommendataions()
returns a list containing data for several tracks, over which we iterate to print thetrack_name
andtrack_artist
pandas
has also been imported at the top of the script, and given the alias ofpd
In spotter.py
we'll find .get_recommendations()
at the bottom, but also notice a small modification to .get_item()
:
def get_item(self, category, item_id=None, params=None):
...
resp = requests.get(f'{self.api_root}{category}/{item_id}', params=params, headers=self.headers)
- the
get_item()
method no longer requires anitem_id
, and can also accept aparams
argument
def get_recommendations(self, seed, number=6):
recs = self.get_item('recommendations', params={'seed_tracks': seed, 'limit': 100})
- we use
.get_item()
to fetch somerecommendations
- the
track_id
defined inmain.py
is the value for theseed_tracks
parameter - we set a
limit
of100
items to be returned
- the
Creating a DataFrame
recs_df = pd.DataFrame(recs['tracks'])
recs_df.head(3)
album | artists | available_markets | disc_number | duration_ms | explicit | external_ids | external_urls | href | id | is_local | name | popularity | preview_url | track_number | type | uri | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... | 1 | 291369 | False | {'isrc': 'GBAYE9600152'} | {'spotify': 'https://open.spotify.com/track/0m... | https://api.spotify.com/v1/tracks/0mIZErjrNQzO... | 0mIZErjrNQzOTDtn4UYflo | False | Sun Hits The Sky | 48 | https://p.scdn.co/mp3-preview/7591511d739043e4... | 6 | track | spotify:track:0mIZErjrNQzOTDtn4UYflo |
1 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AE, AL, AR, AT, AU, BE, BG, BH, BO, BR, BY, C... | 1 | 351626 | False | {'isrc': 'GBAAN9500078'} | {'spotify': 'https://open.spotify.com/track/2f... | https://api.spotify.com/v1/tracks/2fXKyAyPrEa2... | 2fXKyAyPrEa24c6PJyqznF | False | Common People | 69 | None | 3 | track | spotify:track:2fXKyAyPrEa24c6PJyqznF |
2 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... | 1 | 182680 | False | {'isrc': 'GBFFS0400061'} | {'spotify': 'https://open.spotify.com/track/7m... | https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... | 7mAF2MJdbNT75VrVcgwT6F | False | Hounds of Love | 55 | https://p.scdn.co/mp3-preview/c98ce02fe11caae9... | 14 | track | spotify:track:7mAF2MJdbNT75VrVcgwT6F |
- the value for the
tracks
key inrecs
was a list of 100 dictionaries, each one relating to a given track - the
pd.DataFrame()
function has created a DataFrame from the list, with:- the data for each track being within a single row
- the column headings are the keys from the track dictionaries
- the
.head()
method allows us to show just the first few records
Using .apply()
and lambda
functions
In this part of the code, we're going to extract the number
of tracks we tracks we're looking for (which is 6
by default), but only take those which are available in as many markets as possible.
We'll do this by:
- Adding a column which contains number of markets listed in
available_markets
for each track - Filtering the DataFrame to include only those with the
max()
number found - Using the
.head()
method to fetch the first6
tracks from the remaining tracks
recs_df['availability'] = recs_df['available_markets'].apply(lambda x: len(x))
recs_df.head(3)
album | artists | available_markets | disc_number | duration_ms | explicit | external_ids | external_urls | href | id | is_local | name | popularity | preview_url | track_number | type | uri | availability | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... | 1 | 291369 | False | {'isrc': 'GBAYE9600152'} | {'spotify': 'https://open.spotify.com/track/0m... | https://api.spotify.com/v1/tracks/0mIZErjrNQzO... | 0mIZErjrNQzOTDtn4UYflo | False | Sun Hits The Sky | 48 | https://p.scdn.co/mp3-preview/7591511d739043e4... | 6 | track | spotify:track:0mIZErjrNQzOTDtn4UYflo | 91 |
1 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AE, AL, AR, AT, AU, BE, BG, BH, BO, BR, BY, C... | 1 | 351626 | False | {'isrc': 'GBAAN9500078'} | {'spotify': 'https://open.spotify.com/track/2f... | https://api.spotify.com/v1/tracks/2fXKyAyPrEa2... | 2fXKyAyPrEa24c6PJyqznF | False | Common People | 69 | None | 3 | track | spotify:track:2fXKyAyPrEa24c6PJyqznF | 80 |
2 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... | 1 | 182680 | False | {'isrc': 'GBFFS0400061'} | {'spotify': 'https://open.spotify.com/track/7m... | https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... | 7mAF2MJdbNT75VrVcgwT6F | False | Hounds of Love | 55 | https://p.scdn.co/mp3-preview/c98ce02fe11caae9... | 14 | track | spotify:track:7mAF2MJdbNT75VrVcgwT6F | 92 |
- a new column
availability
has been created (scroll right to see it) - the
.apply()
method was used on theavailable_markets
column, with alambda
function used as its argument - this
lambda
function means "take the original value asx
, and returnlen(x)
, i.e. the length of the list"
Lambda functions are also known as anonymous functions, and tend to be found where the logic is only required for a short period of time, i.e. it will not be re-used.
Selecting data
tracks = recs_df[recs_df['availability'] == recs_df['availability'].max()].head(6).copy()
tracks.head(2)
album | artists | available_markets | disc_number | duration_ms | explicit | external_ids | external_urls | href | id | is_local | name | popularity | preview_url | track_number | type | uri | availability | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... | 1 | 182680 | False | {'isrc': 'GBFFS0400061'} | {'spotify': 'https://open.spotify.com/track/7m... | https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... | 7mAF2MJdbNT75VrVcgwT6F | False | Hounds of Love | 55 | https://p.scdn.co/mp3-preview/c98ce02fe11caae9... | 14 | track | spotify:track:7mAF2MJdbNT75VrVcgwT6F | 92 |
5 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... | 1 | 414080 | False | {'isrc': 'GBDGG0200012'} | {'spotify': 'https://open.spotify.com/track/2w... | https://api.spotify.com/v1/tracks/2wzTHuw8nhFS... | 2wzTHuw8nhFSXUU27x0PFa | False | There Goes The Fear | 51 | None | 3 | track | spotify:track:2wzTHuw8nhFSXUU27x0PFa | 92 |
- this can be read as "return the
df
[
where the[
'availablility]
column in thatdf
is equal to==
the.max()
value in that same column]
" - ...and the return the
.head()
of the result, containing thenumber
of values specified - we then used
.head()
again to display the first two rows of the result
tracks['features'] = tracks['id'].apply(lambda x: s.get_item('audio-features', x))
tracks[['availability', 'features']]
availability | features | |
---|---|---|
2 | 92 | {'danceability': 0.36, 'energy': 0.895, 'key':... |
5 | 92 | {'danceability': 0.499, 'energy': 0.975, 'key'... |
6 | 92 | {'danceability': 0.491, 'energy': 0.686, 'key'... |
9 | 92 | {'danceability': 0.45, 'energy': 0.656, 'key':... |
10 | 92 | {'danceability': 0.508, 'energy': 0.826, 'key'... |
14 | 92 | {'danceability': 0.47, 'energy': 0.959, 'key':... |
- here we use
.apply()
again, this time incorporating our.get_item()
method and passing the value in theid
column as one of the arguments to getaudio-features
data for each track - we can see a subset of columns of the
tracks
DataFrame by providing a list of columns in[
square brackets]
reset_index()
tracks = tracks.reset_index(drop=True)
- notice in our previous outputs that the row indexes have been retained from the original DataFrame (and aren't continuous or starting from zero)
- it will be helpful for us later on to reset these so that they are; the
drop=True
parameter means that our original index values won't be stored (drop=False
would have resulted in the creation of a new column containing them)
Unpacking JSON
features = pd.json_normalize(tracks['features'])
features.head(2)
danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.360 | 0.895 | 9 | -3.795 | 1 | 0.1170 | 0.032900 | 0.000001 | 0.0600 | 0.424 | 168.214 | audio_features | 7mAF2MJdbNT75VrVcgwT6F | spotify:track:7mAF2MJdbNT75VrVcgwT6F | https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... | https://api.spotify.com/v1/audio-analysis/7mAF... | 182680 | 4 |
1 | 0.499 | 0.975 | 5 | -4.504 | 1 | 0.0538 | 0.000086 | 0.072100 | 0.0258 | 0.509 | 109.997 | audio_features | 2wzTHuw8nhFSXUU27x0PFa | spotify:track:2wzTHuw8nhFSXUU27x0PFa | https://api.spotify.com/v1/tracks/2wzTHuw8nhFS... | https://api.spotify.com/v1/audio-analysis/2wzT... | 414080 | 4 |
- the
json_normalize()
function is used here to unpack the dictionaries returned by our.get_item()
method calls requestingaudio-features
for each track
Concatenation
tracks_full = pd.concat([tracks, features], axis=1)
print(tracks_full.shape)
tracks_full.head(2)
(6, 37)
album | artists | available_markets | disc_number | duration_ms | explicit | external_ids | external_urls | href | id | ... | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... | 1 | 182680 | False | {'isrc': 'GBFFS0400061'} | {'spotify': 'https://open.spotify.com/track/7m... | https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... | 7mAF2MJdbNT75VrVcgwT6F | ... | 0.0600 | 0.424 | 168.214 | audio_features | 7mAF2MJdbNT75VrVcgwT6F | spotify:track:7mAF2MJdbNT75VrVcgwT6F | https://api.spotify.com/v1/tracks/7mAF2MJdbNT7... | https://api.spotify.com/v1/audio-analysis/7mAF... | 182680 | 4 |
1 | {'album_type': 'ALBUM', 'artists': [{'external... | [{'external_urls': {'spotify': 'https://open.s... | [AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B... | 1 | 414080 | False | {'isrc': 'GBDGG0200012'} | {'spotify': 'https://open.spotify.com/track/2w... | https://api.spotify.com/v1/tracks/2wzTHuw8nhFS... | 2wzTHuw8nhFSXUU27x0PFa | ... | 0.0258 | 0.509 | 109.997 | audio_features | 2wzTHuw8nhFSXUU27x0PFa | spotify:track:2wzTHuw8nhFSXUU27x0PFa | https://api.spotify.com/v1/tracks/2wzTHuw8nhFS... | https://api.spotify.com/v1/audio-analysis/2wzT... | 414080 | 4 |
2 rows × 37 columns
- we have concatenated (joined together) the two DataFrames
axis=1
means that we do so in a column-wise manner- the row indices are used to match the data from each DataFrame
- the
.shape
attribute of a DataFrame contains the number of (rows
,columns
)
The pandas .join()
and merge()
methods are useful for more complex SQL-like joins using specific columns as keys.
dupes = tracks_full.columns.duplicated()
dupes
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, True, True, True, False, False, True,
False])
.duplicated()
is a method which identifies any repeat instances of column names
Location with .loc
tracks_all = tracks_full.loc[:,~dupes].copy()
tracks_all.shape
(6, 33)
.loc
is used to locate items in our DataFrame- the code before the
,
refers to the rows; here:
means 'all rows' - the code after the
,
refers to the columns;~
means 'not' or 'opposite' .copy()
makes a new object so that subsequent operations ontracks_all
can be made withouttracks_full
being affected (to save memory,pandas
may not otherwise store such a subset as a distinct object)- we now have four fewer columns; the duplicates have been removed
Multiple artist names
Sometimes our tracks will have multplie artists, and it would be nice to be able to use all of their names in our app.
Here we again use .apply()
with a lambda
function, incorporating the .join()
method and a list comprehension to amalgamate the name
values for all entries in artists
:
tracks_all['artist_string'] = tracks_all['artists'].apply(lambda x: ', '.join([a['name'] for a in x]))
list(tracks_all['artist_string'])
['The Futureheads',
'Doves',
'The Stone Roses',
'Dexys Midnight Runners',
'Stereophonics',
'Primal Scream']
Data structure conversion
tracks_data = tracks_all.to_dict(orient="records")
tracks_data[0].keys()
dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track_number', 'type', 'uri', 'availability', 'features', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'track_href', 'analysis_url', 'time_signature', 'artist_string'])
- here we use the
.to_dict()
DataFrame method, with theorient="records"
parameter, to return a list of dictionaries, each one containing data about a given track - each of these dictionaries has the same structure (except with the additional
audio-features
and our newartist_string
key:value pairs) as seen previously when calling.json()
on aResponse
object returned from aget_item()
request