Minimal cover problems and the live Phish repertoire


A couple years back someone posted a question on reddit, paraphrased here:

Given the setlists for all shows performed by Phish, what is the minimum number of shows you need to listen to in order to cover all of their songs?

Setting aside that you probably want to listen to far more shows than this to hear the variety of performances of many songs, it is an interesting algorithms problem. Fortunately, the person who originally posted that question also conveniently posted a couple CSV files on GitHub that contain a snapshot of the shows up to that point in time (late January 2017) - so we have data to play with!

TL;DR

For the impatient who just want the answer, working with the static data I found on GitHub, which spans October 23, 1984 to January 15, 2017, we can generate two answers. The minimum number of shows to see all Phish original songs over that period is 70 shows. The minimum number of shows to see all Phish originals plus covers over that same time period is 246 shows. If one had the full data set up to the current date, it would be easy to re-run the analysis and determine a minimal cover.

Now – how did I calculate that? Read on.

A challenging algorithmic problem

A naïve approach rapidly shows that this is not a trivial problem. In fact, it turns out to be an instance of the minimum set cover problem, one of the canonical examples of an NP-complete problem. For those unfamiliar with algorithm analysis and theoretical computer science, in very hand-wavy terms NP-complete problems are very hard to solve because no “efficient” algorithms are known for them: the time an algorithm needs grows dramatically as the data set grows. This becomes apparent if you try a greedy approach or even a clever heuristic: while you may be able to find a cover for all of the songs, finding one that is minimal turns out to be quite time consuming. So much so that you’ll likely run out of patience during the search process and start googling for a better solution.

The set cover problem

Stepping back to look at the problem I mentioned above, we should look at what a set cover is and what it means for one to be minimal. The set cover problem can be summarized as follows. Say we have a set, S, of items - in this case, songs.

S = {A,B,C,D,E}

Here we have five songs (A,B,C,D,E). We also have another set, H, of shows where each member of H is a subset of the big set of songs: here, each subset corresponds to a show containing a handful of songs.

H = { {A,B}, {D,E}, {C,D,E}, {A,C,E}, {E}, {C} }

Graph of H and S.

The graphic above illustrates how these are related: for each song and show, there is an edge between them showing that they are related. Finding a set cover entails finding a subset H’ of H such that the union of the members of this subset is equal to S itself. For example:

H’ = {{A,B}, {D,E}, {E}, {C}}

This set H’ is a cover for the set S containing four members of H. Careful examination of H, though, shows that a smaller collection of shows is also a cover, and happens to be the smallest one that can be found:

H’ = {{A,B}, {C,D,E}}

Graph of H and S
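Checking whether a given collection is a cover is the easy part; it is minimality that is hard. A quick sketch in Python, using the sets from the example above (the function name is my own):

```python
# A collection of subsets covers S exactly when the union of its members is S.
S = {'A', 'B', 'C', 'D', 'E'}
H = [{'A', 'B'}, {'D', 'E'}, {'C', 'D', 'E'}, {'A', 'C', 'E'}, {'E'}, {'C'}]

def is_cover(subsets, universe):
    """True when every element of `universe` appears in some subset."""
    return set().union(*subsets) == universe

print(is_cover([{'A', 'B'}, {'D', 'E'}, {'E'}, {'C'}], S))  # four-set cover: True
print(is_cover([{'A', 'B'}, {'C', 'D', 'E'}], S))           # minimal cover: True
```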

Finding some set H’ is actually not too hard with a greedy algorithm: pick any member of H, add it to H’. Determine the remaining members of S that are not a part of any set in H’. Find any member of H that contains at least one of those, and repeat until no more members of S remain to be found. The issue is that such an algorithm, while it yields a set cover for S, doesn’t guarantee that the cover will be minimal. For the Phish setlist data, we can see that the data is much more complex - there are LOTS of edges between the shows on the left and the songs on the right. By the way, don’t read anything into any patterns you think that you see in this visualization - the ordering of vertices on both sides is arbitrary.

Graph of all show/song relations.
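The greedy procedure described above can be sketched as follows. The names and the “pick the subset covering the most uncovered elements” refinement are my own choices, and as noted, the result is a cover but not necessarily a minimal one:

```python
S = {'A', 'B', 'C', 'D', 'E'}
H = [{'A', 'B'}, {'D', 'E'}, {'C', 'D', 'E'}, {'A', 'C', 'E'}, {'E'}, {'C'}]

def greedy_cover(universe, subsets):
    """Repeatedly grab a subset covering uncovered elements until none remain.

    Yields *a* cover of `universe`, with no guarantee of minimality.
    """
    remaining = set(universe)
    cover = []
    while remaining:
        # Heuristic: take the subset covering the most uncovered elements.
        best = max(subsets, key=lambda s: len(s & remaining))
        if not best & remaining:
            raise ValueError('no cover exists')
        cover.append(best)
        remaining -= best
    return cover

print(greedy_cover(S, H))  # a cover of S; not guaranteed minimal in general
```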

How do we tackle solving this set cover problem for the larger setlist data? It turns out NP-complete problems like this come up all over the place in operations research, software verification, and a ton of other places. As such, a number of tools have been created that provide very fast methods to solve certain problems in that complexity class. I am using just one type of tool here to solve the problem: many others exist, especially if you start digging through the fields of operations research and optimal scheduling.

Satisfiability Modulo Theories

The class of tool that I use here is known as a Satisfiability Modulo Theories (SMT) solver. These are related to programs that solve another NP-complete problem: the Boolean satisfiability problem. Those solvers, known as SAT solvers, have been around for years. They are very powerful, but relatively tedious to use since they only know how to solve one problem: given a giant formula in propositional logic (variables that can be either true or false, connected by AND and OR operators, with some negation operators sprinkled in), find an assignment of Boolean values to the variables in the formula such that the whole thing comes out as true. These tools can also determine that no such assignment exists and report that the input formula is unsatisfiable. I showed a little example of using these for solving Sudoku puzzles in a past post.

SMT solvers allow us to write down formulas that involve more than Boolean values and truth constraints: we can use integers and other richer mathematical structures (hence the word “theories” in the name). This is where we find a way to encode our Phish concert problem in a form that we can use a powerful SMT solver to tackle.

Encoding the problem

How do we go about encoding it? Fortunately, the encoding is relatively straightforward. Let’s consider for a moment what our goal is. We want to listen to every song at least once, but want to minimize the number of shows needed to do so. This means we have a few things to tell the solver: which shows and songs exist, which songs were performed at which shows, that every song must be heard at least once, and that the number of shows should be as small as possible.

The way I solved this was to set up a collection of integer constraints for the solver to satisfy. In this post, I make use of the Z3 solver from Microsoft Research. Z3 is an impressively powerful solver that was made freely available in the last few years, so it’s worth playing with. It’s available on every major OS and can be programmed from a number of languages; here I use the Python bindings that it provides. My original version used the direct SMT-LIB encoding via s-expressions, but given that the audience is likely more familiar with Python, I went with Python here.

Data munging

To start, we need some data. Fortunately, the person who originally posed the problem on reddit wrote a scraper for Phish.net and put it up on GitHub. The data that was collected gives us a list of shows, a list of songs, and the setlist relationships indicating which songs were performed at each show.

Furthermore, their data allows us to identify songs as originals versus covers: a useful distinction if one wants to hear every song the band performed regardless of who wrote it, or to restrict the solution to only Phish compositions.

The scraper doesn’t provide any additional information, so date, venue, city, and so on, must be inferred from the URLs that they recorded. I will admit: I haven’t played much with the Phish.net API for this task, so that is likely a cleaner way to go that will give the information that we need more directly. But for now, the scraped data is sufficient (and allowed me to work offline, avoiding wasting the bandwidth of the Phish.net site).
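To make the shape of the scraped data concrete, here is a sketch of turning a two-column CSV of (show, song) rows into the structures used in the encoding below. The column layout is my assumption; the real files carry more fields (the URLs, the originals-versus-covers flag) that would be handled the same way:

```python
import csv
import io

def load_edges(csv_text):
    """Parse (show_id, song_id) rows into a show list, song list, and edge list."""
    shows, songs, edges = set(), set(), []
    for show_id, song_id in csv.reader(io.StringIO(csv_text)):
        shows.add(show_id)
        songs.add(song_id)
        edges.append((show_id, song_id))
    return sorted(shows), sorted(songs), edges

# Hypothetical sample data in the assumed two-column layout.
sample = "h1,s1\nh1,s2\nh2,s2\nh2,s3\n"
shows, songs, edges = load_edges(sample)
print(shows, songs, len(edges))  # → ['h1', 'h2'] ['s1', 's2', 's3'] 4
```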

Encoding

Let’s start the process of encoding the problem as constraints for the solver. Before I start, one word of warning: you might say during the process “why didn’t you do X instead?”. The short answer: no real reason. Often there are multiple ways to express problems in the form of constraints. If you have ideas on a different encoding, perhaps this article will give you enough of a starting point to try it out! To be honest, I started my solution by treating the problem as a minimum constrained bipartite graph matching problem, so I think some of the constraints are from my first experiments with the encoding a couple years ago. Regardless, let’s move on to the encoding.

Before the encoding, we have to load up the data. The IO code isn’t included here since it’s primarily responsible for converting the CSV tables to Python data structures. There are three lists that are created: a list of show identifiers (e.g., ‘h123’), a list of song identifiers (e.g., ‘s42’), and a list of edges that relate shows and songs based on setlist inclusion (e.g., (‘h123’, ‘s42’) indicating that song s42 occurred in show h123).

The easiest place to start is the song and show inclusion constraints. This is where we establish the set of variables representing each song or show. We know we are going to be eventually counting shows and songs, so we probably want to use Integer variables instead of Booleans. To start, we declare the Z3 variables for each show and song. The IntVector constructor gives a compact way of creating a bunch of them all at once.

song_v = IntVector('s',len(songs))
show_v = IntVector('h',len(shows))

We are also going to need a way of cross referencing the Z3 variables with our song IDs, and vice versa. This is useful when Z3 finds a solution: we want to be able to map the solution back to the original data from the Z3 variables. There may be a cleaner way of doing this by encoding the song IDs directly in the Z3 variable API - I didn’t bother looking (the Z3 API docs for Python are… non-standard.) The quick and dirty way involved a few simple dictionaries:

song_vs = dict(zip(song_v, songs.keys()))
show_vs = dict(zip(show_v, shows.keys()))
song_sv = dict(zip(songs.keys(), song_v))
show_sv = dict(zip(shows.keys(), show_v))

The suffix “vs” indicates a mapping from Z3 variable to source, and “sv” maps from source to Z3 variable.

Once we have that, we want to include a constraint that indicates a song or show either is or is not included. We aren’t using Booleans, so the best we can do is to say the value for each show/song is either 0 or 1 - or, in the language of constraint logic, for each variable X, X >= 0 and X <= 1.

song_c1 = [And(song >= 0, song <= 1) for song in song_v]
show_c1 = [And(show >= 0, show <= 1) for show in show_v]

Then, for each show/song pair connected by an edge, the corresponding variables must sum to at least one:

edges_c1 = [show_sv[show] + song_sv[song] >= 1 for (show, song) in edges]

The next constraint encodes our original goal: every song must be included.

song_c2 = [Sum(song_v) >= len(songs.keys())]

Combining this with the earlier constraint restricting each song variable to 0 or 1, we can see that every song must be assigned 1 for this to be satisfied.

Next, we want to impose the constraint that each song must be covered by at least one show. This requires us to first build up a list of Z3 variables for every show a song occurs in, so we can use it in the constraint later.

songEdgeMap = {}
for song in song_v:
    songEdgeMap[song] = []
for (show,song) in edges:
    songEdgeMap[song_sv[song]].append(show_sv[show])

The constraint is then easy to state:

song_c3 = [2 <= (song + Sum(songEdgeMap[song])) for song in song_v]

At this point, we have defined the cover problem, but have one remaining constraint to add: the one that restricts the size of the cover. That’s important since the cover size is the quantity we are attempting to minimize. Constraining the problem to a specific cover size (or minimizing it) is what makes the set cover problem challenging; just finding some set cover is easy - keep grabbing subsets until you cover the whole set. Say we want to ask if we can find a set of 71 or fewer shows that covers all songs. We can pose that this way:

show_c2 = [Sum(show_v) <= 71]

Solving the problem

Once we have this, we can hand the constraints off to Z3 to do the heavy lifting and see if an answer can be found. This entails instantiating the solver, adding the constraints, and asking if they can be satisfied:

s = Solver()
s.add(song_c1 + song_c2 + song_c3 + show_c1 + show_c2 + edges_c1)
if s.check() == sat:
    print("sat")
    m = s.model()
    for show in show_v:
        if m.evaluate(show).as_long() > 0:
            print(shows[show_vs[show]])
else:
    print("failed to solve")

In the event that the solver says they can be satisfied, we can then ask for the model it found and run through the set of variables looking for shows that are set to non-zero values. Once we have those, we can use the tables we created to map from variable objects back to the show data to figure out what shows it found for us.

In my initial code, I solved the optimization problem by iteratively reducing the number in the constraint show_c2 until the solver reported unsat. I have not used the Optimize class that Z3 provides which can do this for you (and perhaps do it more efficiently - I don’t know). It turns out the person who originally posed the problem on reddit found via a greedy solution that the answer was somewhere in the low 70s or high 60s, so I was able to start there and iteratively shrink the number until we hit unsat. Solving the same problem for the full song repertoire including covers was harder: in that case, I started with a number I knew was too few, and iteratively increased the show count until the solver found a satisfying solution.

Results

So what’s the answer? Working with the static data I found on GitHub, which spans October 23, 1984 to January 15, 2017, we can generate two answers. The minimum number of shows to see all Phish original songs over that period is 70 shows. The minimum number of shows to see all Phish originals plus covers over that same time period is 246 shows. In each case the solver took around 20-25 minutes to come up with a solution satisfying the constraints. That isn’t ideal, but to be honest, it didn’t seem that bad given that I put absolutely no effort into trying to speed it up (see notes below for some thoughts on that).

The show lists are provided below for those who are curious. I didn’t run the scraper found on GitHub to get a more up-to-date show list, which is unfortunate, since I think the Baker’s Dozen run from 2017 could have a big impact on the results (or not - without the data, who knows!). Aside from the fact that it appears the scraper violates the terms of use for the Phish.net data, I also couldn’t get it to rebuild (I don’t use Eclipse, and the original author appeared to, so I ran out of patience trying to rebuild it myself). As we say in the academic world when we have an idea but have run out of time to pursue it - “future work”.

Notes

Now, at this point a setlist geek like me might say “wait, that’s not the full story” - and you’d be correct. There are some obvious caveats to consider in the approach above.

Unfortunately, the compute time for running the solver, while far faster than a brute force approach, is not something one wants to spend often - especially behind the scenes of a web site with limited resources. So I don’t expect this kind of thing will ever be practical to add to the Phish.net stats that we can all look up for the shows we’ve attended or rated, unless a cleverer solver/encoding can be written that requires far less time.

The show lists

For the curious, here are the minimal covers that were found for the date range October 23, 1984 to January 15, 2017.

Just originals

  1. march 04 1985
  2. september 27 1985
  3. october 15 1986
  4. march 06 1987
  5. april 24 1987
  6. june 21 1988
  7. september 12 1988
  8. may 09 1989
  9. october 26 1989
  10. december 30 1989
  11. march 17 1990
  12. april 18 1990
  13. may 04 1990
  14. september 15 1990
  15. october 08 1990
  16. april 20 1991
  17. october 31 1991
  18. may 02 1992
  19. july 23 1992
  20. march 16 1993
  21. july 17 1993
  22. april 16 1994
  23. april 22 1994
  24. december 10 1994
  25. may 16 1995
  26. october 11 1995
  27. october 24 1995
  28. december 15 1995
  29. august 16 1996
  30. february 13 1997
  31. june 06 1997
  32. august 14 1997
  33. august 17 1997
  34. november 22 1997
  35. july 02 1998
  36. august 15 1998
  37. october 17 1998
  38. june 30 1999
  39. july 24 1999
  40. december 31 1999
  41. july 12 2003
  42. august 02 2003
  43. december 02 2003
  44. december 28 2003
  45. august 15 2004
  46. august 11 2009
  47. november 27 2009
  48. december 29 2009
  49. june 15 2010
  50. june 22 2010
  51. october 11 2010
  52. october 12 2010
  53. october 15 2010
  54. july 02 2011
  55. september 04 2011
  56. july 27 2013
  57. october 31 2013
  58. july 27 2014
  59. october 31 2014
  60. july 28 2015
  61. august 11 2015
  62. august 22 2015
  63. september 06 2015
  64. december 30 2015
  65. june 22 2016
  66. june 24 2016
  67. june 29 2016
  68. october 21 2016
  69. october 24 2016
  70. october 28 2016

Originals + covers

  1. december 02 1983
  2. november 03 1984
  3. december 01 1984
  4. march 04 1985
  5. may 03 1985
  6. october 17 1985
  7. october 30 1985
  8. november 14 1985
  9. november 23 1985
  10. february 03 1986
  11. april 01 1986
  12. december 06 1986
  13. february 21 1987
  14. march 06 1987
  15. april 24 1987
  16. april 29 1987
  17. august 21 1987
  18. february 03 1988
  19. february 07 1988
  20. march 11 1988
  21. march 12 1988
  22. may 21 1988
  23. july 07 1988
  24. july 12 1988
  25. july 30 1988
  26. september 12 1988
  27. september 13 1988
  28. february 23 1989
  29. march 01 1989
  30. april 13 1989
  31. may 09 1989
  32. may 21 1989
  33. may 28 1989
  34. august 12 1989
  35. october 22 1989
  36. october 26 1989
  37. december 30 1989
  38. march 17 1990
  39. april 18 1990
  40. april 22 1990
  41. may 04 1990
  42. august 04 1990
  43. september 13 1990
  44. september 22 1990
  45. september 28 1990
  46. december 28 1990
  47. february 16 1991
  48. april 11 1991
  49. april 16 1991
  50. april 20 1991
  51. july 12 1991
  52. august 03 1991
  53. november 01 1991
  54. november 08 1991
  55. march 11 1992
  56. march 28 1992
  57. march 30 1992
  58. april 21 1992
  59. may 14 1992
  60. july 16 1992
  61. november 20 1992
  62. december 31 1992
  63. february 20 1993
  64. march 02 1993
  65. march 16 1993
  66. march 28 1993
  67. april 10 1993
  68. april 25 1993
  69. may 06 1993
  70. may 08 1993
  71. july 15 1993
  72. august 06 1993
  73. august 26 1993
  74. august 28 1993
  75. april 22 1994
  76. april 23 1994
  77. april 26 1994
  78. may 07 1994
  79. may 17 1994
  80. may 27 1994
  81. june 18 1994
  82. june 22 1994
  83. june 30 1994
  84. october 15 1994
  85. october 26 1994
  86. october 31 1994
  87. november 16 1994
  88. november 17 1994
  89. november 19 1994
  90. may 16 1995
  91. june 17 1995
  92. september 29 1995
  93. september 30 1995
  94. october 02 1995
  95. october 31 1995
  96. november 14 1995
  97. november 16 1995
  98. november 28 1995
  99. november 29 1995
  100. december 15 1995
  101. december 31 1995
  102. august 04 1996
  103. august 07 1996
  104. august 16 1996
  105. october 22 1996
  106. october 31 1996
  107. november 15 1996
  108. november 23 1996
  109. december 06 1996
  110. december 29 1996
  111. december 31 1996
  112. february 25 1997
  113. march 18 1997
  114. june 06 1997
  115. june 25 1997
  116. july 02 1997
  117. july 10 1997
  118. august 13 1997
  119. august 14 1997
  120. august 17 1997
  121. november 30 1997
  122. december 30 1997
  123. december 31 1997
  124. july 15 1998
  125. august 01 1998
  126. august 02 1998
  127. august 03 1998
  128. august 06 1998
  129. august 09 1998
  130. august 11 1998
  131. august 12 1998
  132. august 15 1998
  133. august 16 1998
  134. october 03 1998
  135. october 17 1998
  136. october 18 1998
  137. october 31 1998
  138. november 02 1998
  139. november 14 1998
  140. november 20 1998
  141. november 21 1998
  142. november 27 1998
  143. november 29 1998
  144. december 31 1998
  145. july 13 1999
  146. july 16 1999
  147. july 17 1999
  148. july 18 1999
  149. july 21 1999
  150. july 24 1999
  151. july 26 1999
  152. september 16 1999
  153. september 17 1999
  154. september 21 1999
  155. october 03 1999
  156. october 08 1999
  157. december 30 1999
  158. december 31 1999
  159. june 22 2000
  160. july 11 2000
  161. september 08 2000
  162. september 11 2000
  163. september 29 2000
  164. october 05 2000
  165. october 06 2000
  166. february 14 2003
  167. february 16 2003
  168. february 18 2003
  169. february 24 2003
  170. february 26 2003
  171. july 07 2003
  172. july 15 2003
  173. july 18 2003
  174. july 29 2003
  175. july 30 2003
  176. august 02 2003
  177. august 03 2003
  178. december 30 2003
  179. december 31 2003
  180. april 15 2004
  181. june 18 2004
  182. august 11 2004
  183. august 15 2004
  184. march 08 2009
  185. june 09 2009
  186. june 14 2009
  187. august 11 2009
  188. august 14 2009
  189. august 16 2009
  190. october 31 2009
  191. november 01 2009
  192. november 27 2009
  193. december 30 2009
  194. december 31 2009
  195. june 12 2010
  196. june 15 2010
  197. june 22 2010
  198. june 24 2010
  199. june 25 2010
  200. june 26 2010
  201. june 27 2010
  202. june 29 2010
  203. july 04 2010
  204. october 11 2010
  205. october 22 2010
  206. october 26 2010
  207. october 30 2010
  208. october 31 2010
  209. december 28 2010
  210. december 31 2010
  211. june 07 2011
  212. june 19 2011
  213. july 02 2011
  214. july 03 2011
  215. august 08 2011
  216. august 09 2011
  217. september 04 2011
  218. june 10 2012
  219. july 07 2012
  220. december 31 2012
  221. july 17 2013
  222. july 27 2013
  223. august 02 2013
  224. august 30 2013
  225. august 31 2013
  226. september 01 2013
  227. october 20 2013
  228. october 31 2013
  229. july 27 2014
  230. october 27 2014
  231. october 28 2014
  232. october 31 2014
  233. july 22 2015
  234. july 28 2015
  235. august 05 2015
  236. august 21 2015
  237. august 22 2015
  238. september 06 2015
  239. december 30 2015
  240. january 17 2016
  241. june 29 2016
  242. october 14 2016
  243. october 18 2016
  244. october 21 2016
  245. october 24 2016
  246. october 31 2016