Data Processing with `pandas`¶

`pandas`¶

pandas is a popular Python package with excellent functionality for data analysis tasks.

DataFrames¶

A DataFrame is a pandas data structure, much like a table:

values are held in rows and columns
each row and column has a label

Typically, rows are numerically labelled, starting at zero, while columns have names in strings.

However, a DataFrame can have strings for row labels and they don't have to be unique.

Example repl¶

Let's jump in with an example. Fork the repl, remembering as always to add your credentials to a .env file.

!!! info The walkthrough will highlight a handful of pandas techniques used for creating the Seeder app, but if you would like a more complete understanding of pandas and these techniques, please consult the excellent documentation.

Code walkthrough¶

Not much new to see in main.py; we've just substituted use of the .get_recommmendations() method for the .get_item() method we used previously to get individual track data:

recs = s.get_recommendations(track_id)
for track in recs:  
    ...  
    print(f'{track_name} - {track_artist}')

.get_recommendataions() returns a list containing data for several tracks, over which we iterate to print the track_name and track_artist
pandas has also been imported at the top of the script, and given the alias of pd

In spotter.py we'll find .get_recommendations() at the bottom, but also notice a small modification to .get_item():

def get_item(self, category, item_id=None, params=None):
    ...
    resp = requests.get(f'{self.api_root}{category}/{item_id}', params=params, headers=self.headers)

the get_item() method no longer requires an item_id, and can also accept a params argument

def get_recommendations(self, seed, number=6):
    recs = self.get_item('recommendations', params={'seed_tracks': seed, 'limit': 100})

we use .get_item() to fetch some recommendations
- the track_id defined in main.py is the value for the seed_tracks parameter
- we set a limit of 100 items to be returned

Creating a DataFrame¶

recs_df = pd.DataFrame(recs['tracks'])
recs_df.head(3)

Out[105]:

	album	artists	available_markets	disc_number	duration_ms	explicit	external_ids	external_urls	href	id	is_local	name	popularity	preview_url	track_number	type	uri
0	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...	1	291369	False	{'isrc': 'GBAYE9600152'}	{'spotify': 'https://open.spotify.com/track/0m...	https://api.spotify.com/v1/tracks/0mIZErjrNQzO...	0mIZErjrNQzOTDtn4UYflo	False	Sun Hits The Sky	48	https://p.scdn.co/mp3-preview/7591511d739043e4...	6	track	spotify:track:0mIZErjrNQzOTDtn4UYflo
1	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AE, AL, AR, AT, AU, BE, BG, BH, BO, BR, BY, C...	1	351626	False	{'isrc': 'GBAAN9500078'}	{'spotify': 'https://open.spotify.com/track/2f...	https://api.spotify.com/v1/tracks/2fXKyAyPrEa2...	2fXKyAyPrEa24c6PJyqznF	False	Common People	69	None	3	track	spotify:track:2fXKyAyPrEa24c6PJyqznF
2	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...	1	182680	False	{'isrc': 'GBFFS0400061'}	{'spotify': 'https://open.spotify.com/track/7m...	https://api.spotify.com/v1/tracks/7mAF2MJdbNT7...	7mAF2MJdbNT75VrVcgwT6F	False	Hounds of Love	55	https://p.scdn.co/mp3-preview/c98ce02fe11caae9...	14	track	spotify:track:7mAF2MJdbNT75VrVcgwT6F

the value for the tracks key in recs was a list of 100 dictionaries, each one relating to a given track
the pd.DataFrame() function has created a DataFrame from the list, with:
- the data for each track being within a single row
- the column headings are the keys from the track dictionaries
the .head() method allows us to show just the first few records

Using `.apply()` and `lambda` functions¶

In this part of the code, we're going to extract the number of tracks we tracks we're looking for (which is 6 by default), but only take those which are available in as many markets as possible.

We'll do this by:

Adding a column which contains number of markets listed in available_markets for each track
Filtering the DataFrame to include only those with the max() number found
Using the .head() method to fetch the first 6 tracks from the remaining tracks

recs_df['availability'] = recs_df['available_markets'].apply(lambda x: len(x))
recs_df.head(3)

Out[106]:

	album	artists	available_markets	disc_number	duration_ms	explicit	external_ids	external_urls	href	id	is_local	name	popularity	preview_url	track_number	type	uri	availability
0	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...	1	291369	False	{'isrc': 'GBAYE9600152'}	{'spotify': 'https://open.spotify.com/track/0m...	https://api.spotify.com/v1/tracks/0mIZErjrNQzO...	0mIZErjrNQzOTDtn4UYflo	False	Sun Hits The Sky	48	https://p.scdn.co/mp3-preview/7591511d739043e4...	6	track	spotify:track:0mIZErjrNQzOTDtn4UYflo	91
1	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AE, AL, AR, AT, AU, BE, BG, BH, BO, BR, BY, C...	1	351626	False	{'isrc': 'GBAAN9500078'}	{'spotify': 'https://open.spotify.com/track/2f...	https://api.spotify.com/v1/tracks/2fXKyAyPrEa2...	2fXKyAyPrEa24c6PJyqznF	False	Common People	69	None	3	track	spotify:track:2fXKyAyPrEa24c6PJyqznF	80
2	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...	1	182680	False	{'isrc': 'GBFFS0400061'}	{'spotify': 'https://open.spotify.com/track/7m...	https://api.spotify.com/v1/tracks/7mAF2MJdbNT7...	7mAF2MJdbNT75VrVcgwT6F	False	Hounds of Love	55	https://p.scdn.co/mp3-preview/c98ce02fe11caae9...	14	track	spotify:track:7mAF2MJdbNT75VrVcgwT6F	92

a new column availability has been created (scroll right to see it)
the .apply() method was used on the available_markets column, with a lambda function used as its argument
this lambda function means "take the original value as x, and return len(x), i.e. the length of the list"

Lambda functions are also known as anonymous functions, and tend to be found where the logic is only required for a short period of time, i.e. it will not be re-used.

Selecting data¶

tracks = recs_df[recs_df['availability'] == recs_df['availability'].max()].head(6).copy()
tracks.head(2)

Out[107]:

	album	artists	available_markets	disc_number	duration_ms	explicit	external_ids	external_urls	href	id	is_local	name	popularity	preview_url	track_number	type	uri	availability
2	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...	1	182680	False	{'isrc': 'GBFFS0400061'}	{'spotify': 'https://open.spotify.com/track/7m...	https://api.spotify.com/v1/tracks/7mAF2MJdbNT7...	7mAF2MJdbNT75VrVcgwT6F	False	Hounds of Love	55	https://p.scdn.co/mp3-preview/c98ce02fe11caae9...	14	track	spotify:track:7mAF2MJdbNT75VrVcgwT6F	92
5	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...	1	414080	False	{'isrc': 'GBDGG0200012'}	{'spotify': 'https://open.spotify.com/track/2w...	https://api.spotify.com/v1/tracks/2wzTHuw8nhFS...	2wzTHuw8nhFSXUU27x0PFa	False	There Goes The Fear	51	None	3	track	spotify:track:2wzTHuw8nhFSXUU27x0PFa	92

this can be read as "return the df [where the ['availablility] column in that df is equal to == the .max() value in that same column ]"
...and the return the .head() of the result, containing the number of values specified
we then used .head() again to display the first two rows of the result

tracks['features'] = tracks['id'].apply(lambda x: s.get_item('audio-features', x))
tracks[['availability', 'features']]

Out[108]:

	availability	features
2	92	{'danceability': 0.36, 'energy': 0.895, 'key':...
5	92	{'danceability': 0.499, 'energy': 0.975, 'key'...
6	92	{'danceability': 0.491, 'energy': 0.686, 'key'...
9	92	{'danceability': 0.45, 'energy': 0.656, 'key':...
10	92	{'danceability': 0.508, 'energy': 0.826, 'key'...
14	92	{'danceability': 0.47, 'energy': 0.959, 'key':...

here we use .apply() again, this time incorporating our .get_item() method and passing the value in the id column as one of the arguments to get audio-features data for each track
we can see a subset of columns of the tracks DataFrame by providing a list of columns in [square brackets ]

`reset_index()`¶

tracks = tracks.reset_index(drop=True)

notice in our previous outputs that the row indexes have been retained from the original DataFrame (and aren't continuous or starting from zero)
it will be helpful for us later on to reset these so that they are; the drop=True parameter means that our original index values won't be stored (drop=False would have resulted in the creation of a new column containing them)

Unpacking JSON¶

features = pd.json_normalize(tracks['features'])
features.head(2)

Out[110]:

	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	type	id	uri	track_href	analysis_url	duration_ms	time_signature
0	0.360	0.895	9	-3.795	1	0.1170	0.032900	0.000001	0.0600	0.424	168.214	audio_features	7mAF2MJdbNT75VrVcgwT6F	spotify:track:7mAF2MJdbNT75VrVcgwT6F	https://api.spotify.com/v1/tracks/7mAF2MJdbNT7...	https://api.spotify.com/v1/audio-analysis/7mAF...	182680	4
1	0.499	0.975	5	-4.504	1	0.0538	0.000086	0.072100	0.0258	0.509	109.997	audio_features	2wzTHuw8nhFSXUU27x0PFa	spotify:track:2wzTHuw8nhFSXUU27x0PFa	https://api.spotify.com/v1/tracks/2wzTHuw8nhFS...	https://api.spotify.com/v1/audio-analysis/2wzT...	414080	4

the json_normalize() function is used here to unpack the dictionaries returned by our .get_item() method calls requesting audio-features for each track

Concatenation¶

tracks_full = pd.concat([tracks, features], axis=1)
print(tracks_full.shape)
tracks_full.head(2)

(6, 37)

Out[111]:

	album	artists	available_markets	disc_number	duration_ms	explicit	external_ids	external_urls	href	id	...	liveness	valence	tempo	type	id	uri	track_href	analysis_url	duration_ms	time_signature
0	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...	1	182680	False	{'isrc': 'GBFFS0400061'}	{'spotify': 'https://open.spotify.com/track/7m...	https://api.spotify.com/v1/tracks/7mAF2MJdbNT7...	7mAF2MJdbNT75VrVcgwT6F	...	0.0600	0.424	168.214	audio_features	7mAF2MJdbNT75VrVcgwT6F	spotify:track:7mAF2MJdbNT75VrVcgwT6F	https://api.spotify.com/v1/tracks/7mAF2MJdbNT7...	https://api.spotify.com/v1/audio-analysis/7mAF...	182680	4
1	{'album_type': 'ALBUM', 'artists': [{'external...	[{'external_urls': {'spotify': 'https://open.s...	[AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, B...	1	414080	False	{'isrc': 'GBDGG0200012'}	{'spotify': 'https://open.spotify.com/track/2w...	https://api.spotify.com/v1/tracks/2wzTHuw8nhFS...	2wzTHuw8nhFSXUU27x0PFa	...	0.0258	0.509	109.997	audio_features	2wzTHuw8nhFSXUU27x0PFa	spotify:track:2wzTHuw8nhFSXUU27x0PFa	https://api.spotify.com/v1/tracks/2wzTHuw8nhFS...	https://api.spotify.com/v1/audio-analysis/2wzT...	414080	4

2 rows × 37 columns

we have concatenated (joined together) the two DataFrames
axis=1 means that we do so in a column-wise manner
the row indices are used to match the data from each DataFrame
the .shape attribute of a DataFrame contains the number of (rows, columns)

The pandas .join() and merge() methods are useful for more complex SQL-like joins using specific columns as keys.

dupes = tracks_full.columns.duplicated()
dupes

Out[112]:

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True,  True,  True, False, False,  True,
       False])

.duplicated() is a method which identifies any repeat instances of column names

Location with `.loc`¶

tracks_all = tracks_full.loc[:,~dupes].copy()
tracks_all.shape

Out[119]:

(6, 33)

.loc is used to locate items in our DataFrame
the code before the , refers to the rows; here : means 'all rows'
the code after the , refers to the columns; ~ means 'not' or 'opposite'
.copy() makes a new object so that subsequent operations on tracks_all can be made without tracks_full being affected (to save memory, pandas may not otherwise store such a subset as a distinct object)
we now have four fewer columns; the duplicates have been removed

Multiple artist names¶

Sometimes our tracks will have multplie artists, and it would be nice to be able to use all of their names in our app.

Here we again use .apply() with a lambda function, incorporating the .join() method and a list comprehension to amalgamate the name values for all entries in artists:

tracks_all['artist_string'] = tracks_all['artists'].apply(lambda x: ', '.join([a['name'] for a in x]))

list(tracks_all['artist_string'])

Out[121]:

['The Futureheads',
 'Doves',
 'The Stone Roses',
 'Dexys Midnight Runners',
 'Stereophonics',
 'Primal Scream']

Data structure conversion¶

tracks_data = tracks_all.to_dict(orient="records")
tracks_data[0].keys()

Out[118]:

dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track_number', 'type', 'uri', 'availability', 'features', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'track_href', 'analysis_url', 'time_signature', 'artist_string'])

here we use the .to_dict() DataFrame method, with the orient="records" parameter, to return a list of dictionaries, each one containing data about a given track
each of these dictionaries has the same structure (except with the additional audio-features and our new artist_string key:value pairs) as seen previously when calling .json() on a Response object returned from a get_item() request

Data Processing with pandas¶

pandas¶