Google Takeout – Music Activity Data

After looking through the Takeout data for the tracks I have listened to, I cannot see any sort of unique reference for track, artist or album. There isn’t any ID in the .csv file itself. Uh oh. I realised this was super gross: none of the data can be linked to actual track IDs or anything similar.

Photo by Martijn Baudoin on Unsplash

This is fine for skilling up on some Data Science techniques, but otherwise? Unless I am getting something seriously wrong, Google have really dropped the ball on interlinking their data here… I mean, it’s GOOGLE, right?

I scratched around and found that a subsection of Google Activity would help with the “when”, so I requested that archive. Surely this would be better and all would be revealed?

Let’s deselect all the other guff right now: a few clicks now save pointless wading later

At first I got the wrong format. As I intended to use pandas.DataFrame functions, I’d be better off with JSON as the starting point.

When working with pandas, JSON is much, much easier than a mess of HTML

Let’s have a look at this JSON in a Jupyter Notebook by displaying the head of the DataFrame, and see what we can determine.

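import requests
import pandas as pd

# Download the raw Takeout activity export from the GitHub repo
r = requests.get(r'https://raw.githubusercontent.com/FinancialRADDeveloper/GoogleMusicDataAnalysis/master/takeout_data/My%20Activity/Google%20Play%20Music/My%20Activity.json')
r.raise_for_status()  # stop here if the download failed

# pd.json_normalize (pandas 1.0+) replaces the deprecated pandas.io.json.json_normalize
activity_data_frame = pd.json_normalize(r.json())

# display() is available by default inside a Jupyter Notebook
display(activity_data_frame.head())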

This isn’t good news either. WTAF Google, how is anyone supposed to use this data?

activity dataframe head()

OK, so it is just as bad for interlinking as the track data was. I’m kicking myself for not using Spotify Premium a few years ago! Maybe I’m missing something, but the Spotify API is littered with URIs and other data to link everything together, whereas with Google Music I’m left with the option of scrabbling around with text parsing. However, while annoying, it is perfect for skilling up on some data science basics.
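To give a feel for what that text parsing might involve, here is a minimal sketch. It assumes each activity title starts with an action phrase followed by the track name. Only “Listened to” is confirmed by the data used in this post; “Skipped” and “Searched for” are guesses, and parse_activity_title is a made-up helper name.

```python
import re

# Assumed action prefixes: only "Listened to" is confirmed by the export used
# in this post; "Skipped" and "Searched for" are guesses for illustration.
ACTION_PATTERN = re.compile(r'^(Listened to|Skipped|Searched for)\s+(.*)$')


def parse_activity_title(title):
    """Split an activity title into (action, subject); (None, title) if unrecognised."""
    match = ACTION_PATTERN.match(title.strip())
    if match is None:
        return None, title
    return match.group(1), match.group(2).strip()


# Made-up sample titles, not taken from the real export
sample_titles = ['Listened to Some Track', 'Skipped Another Track', 'Opened app']
parsed = [parse_activity_title(t) for t in sample_titles]
print(parsed)
```

Anything the pattern does not recognise comes back with None as the action, so unexpected title formats are easy to spot and count.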

This leaves some questions we need to answer:

  • Can we do data verification between the tracks listened to, and the Google Music activity?
  • What does the general total and daily activity look like?
    • How should I display this?
  • Can I display meaningful stats on most popular choice per hour of the day?

I think any further analysis will need to delve into some more metadata, which will be covered in another post on web scraping or the Spotify API via spotipy.

Listening stats per hour

import matplotlib.pyplot as plt

# Each *_stats variable is expected to be a list of (hour, count) pairs built beforehand
plt.ion()
plt.bar([s[0] for s in daily_listening_stats], [s[1] for s in daily_listening_stats], color='blue', label='Listened')
plt.bar([s[0] for s in skipped_stats], [s[1] for s in skipped_stats], color='red', label='Skipped')
plt.bar([s[0] for s in search_stats], [s[1] for s in search_stats], color='purple', label='Searched')
plt.xlabel('Hour of the Day')
plt.ylabel('Total tracks')
plt.legend()  # show which colour is which

plt.show()

This seems roughly in line with expectations. Some listening late at night, then a big dip while asleep, and then a ramp up. I don’t tend to listen much on the commute, but will listen to something at the start of the working day. Then a dip for lunch, and some listening from then on, tailing off in the evening when eating or watching TV with my wife.
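The plotting code above indexes each stats entry with [0] and [1], so it needs lists of (hour, count) pairs. A rough sketch of how such a list could be built from the raw activity records follows, using only the standard library. The helper name hourly_counts, the sample records, the “Skipped” prefix and the exact timestamp format are all assumptions for illustration. Note that the hour comes out in UTC rather than local time.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(records, prefix):
    """Return sorted (hour, count) pairs for records whose title starts with prefix."""
    counts = Counter()
    for record in records:
        if record['title'].startswith(prefix):
            # Timestamps are assumed to be ISO 8601 in UTC, e.g. 2019-03-01T08:15:00.000Z
            stamp = datetime.fromisoformat(record['time'].replace('Z', '+00:00'))
            counts[stamp.hour] += 1
    return sorted(counts.items())


# Made-up records in the same shape as the activity JSON ('title' and 'time' keys)
sample_records = [
    {'title': 'Listened to Track A', 'time': '2019-03-01T08:15:00.000Z'},
    {'title': 'Listened to Track B', 'time': '2019-03-01T08:45:00.000Z'},
    {'title': 'Skipped Track C', 'time': '2019-03-01T08:50:00.000Z'},
    {'title': 'Listened to Track A', 'time': '2019-03-02T22:05:00.000Z'},
]

daily_listening_stats = hourly_counts(sample_records, 'Listened to')
skipped_stats = hourly_counts(sample_records, 'Skipped')
print(daily_listening_stats)
print(skipped_stats)
```

With the real export, the parsed JSON list from r.json() could be passed in place of sample_records, provided its entries carry the same two keys.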

Daily Listening Stats:

import requests
import pandas as pd
import matplotlib.pyplot as plt


r = requests.get(r'https://raw.githubusercontent.com/FinancialRADDeveloper/GoogleMusicDataAnalysis/master/takeout_data/My%20Activity/Google%20Play%20Music/My%20Activity.json')
r.raise_for_status()  # stop here if the download failed
# pd.json_normalize (pandas 1.0+) replaces the deprecated pandas.io.json.json_normalize
activity_data_frame = pd.json_normalize(r.json())

# .copy() gives an independent frame, so adding a column below avoids SettingWithCopyWarning
listened_to_dataframe = activity_data_frame[activity_data_frame['title'].str.contains("Listened to")].copy()

display(listened_to_dataframe.sample(10))

listened_to_dataframe['activity_datetime'] = pd.to_datetime(listened_to_dataframe['time'])

# Count listens per calendar day
daily_stats = listened_to_dataframe['activity_datetime'].dt.date.value_counts()
# sort_values orders the days by listen count (quietest first);
# use sort_index() instead for a chronological timeline
daily_stats_sorted = daily_stats.sort_values(ascending=True)
daily_stats_sorted.plot(legend=True)

Further analysis and visualisation

It would be really interesting to get into the metadata. I can work out my favourite tracks and artists per hour of the day, but it’s going to be much harder to work out styles or genres. As mentioned before, though, it should force me to go and learn another area of Data Science and Python.
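As a rough idea of the “favourite track per hour” part, here is a small pandas sketch on made-up rows shaped like the activity export. It assumes the track name is simply whatever follows “Listened to ” in the title, which may not hold for every entry, and the hour is taken in UTC.

```python
import pandas as pd

# Made-up sample rows in the same shape as the activity export
sample = pd.DataFrame({
    'title': ['Listened to Track A', 'Listened to Track A',
              'Listened to Track B', 'Listened to Track C'],
    'time': ['2019-03-01T08:15:00Z', '2019-03-02T08:20:00Z',
             '2019-03-03T08:40:00Z', '2019-03-01T22:05:00Z'],
})

prefix = 'Listened to '
listened = sample[sample['title'].str.startswith(prefix)].copy()
# Assumption: everything after the prefix is the track name
listened['track'] = listened['title'].str[len(prefix):]
listened['hour'] = pd.to_datetime(listened['time'], utc=True).dt.hour

# Count plays per (hour, track), then keep the most played track for each hour
plays = listened.groupby(['hour', 'track']).size().reset_index(name='plays')
favourites = (
    plays.sort_values(['hour', 'plays'], ascending=[True, False])
    .drop_duplicates('hour')
)
print(favourites)
```

Ties within an hour are resolved by whichever row happens to sort first, so a real analysis would probably want an explicit tie-break.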

Cross Validation

If each track entry has a ‘Play Count’ field, can I safely assume that if I pick out a sample of tracks, I’ll be able to find the matching Activity entries? Let us see, shall we? I’ll take some of the most popular tracks, as a few missed listens should matter a lot less.
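One way this check could be sketched is to count the “Listened to” entries per track name and compare that with the ‘Play Count’ field. The frames below are made-up stand-ins for the real data, and the ‘Title’ column name is an assumption about the tracks file. Matching on title text alone will also lump together different tracks that share a name.

```python
import pandas as pd

# Hypothetical stand-in for the Takeout tracks data; 'Title' is an assumed
# column name, 'Play Count' is the field discussed above
tracks = pd.DataFrame({
    'Title': ['Track A', 'Track B', 'Track C'],
    'Play Count': [3, 1, 2],
})

# Hypothetical stand-in for the activity export
activity = pd.DataFrame({
    'title': ['Listened to Track A', 'Listened to Track A', 'Listened to Track A',
              'Listened to Track B', 'Listened to Track C', 'Skipped Track C'],
})

prefix = 'Listened to '
listened = activity[activity['title'].str.startswith(prefix)]
activity_counts = (
    listened['title'].str[len(prefix):]
    .value_counts()
    .rename_axis('Title')
    .reset_index(name='activity_count')
)

# Left join keeps every track, including ones with no matching activity entries
comparison = tracks.merge(activity_counts, on='Title', how='left')
comparison['activity_count'] = comparison['activity_count'].fillna(0).astype(int)
comparison['difference'] = comparison['Play Count'] - comparison['activity_count']
print(comparison)
```

A difference of zero means the two sources agree for that track; a positive number means plays are recorded in the track data that have no matching activity entry.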