Based on the problem statement, your team of 4 need to solve 5 different cases of the problem. Each case is represented by an input text file. An output file should be generated for each input file and submitted to the Judge System. The Judge System will give a score for each output file. You can submit as many as you can, but only the highest score will be counted for each case. The summation of all the scores is the final score of the team. The top teams will be invited to the final round which will be held at Google Dublin on April 27, 2019.
The problem is always NP-hard, which means that you can never find the best solution in polynomial time. Fortunately, you can make the result better and better. The greedy algorithm and dynamic programming are very helpful.
- A slideshow contains a list of slides
- A slide contains either one horizontal photo or two vertical photos
- A photo contains a list of tags (string type)
- The tag set of a slide is the set of all tags of the photos in the slide
- If the slide containing 1 horizontal photos:
- If the slide containing 2 vertical photos:
- The slideshow is scored based on how interesting the transitions between each pair of subsequent slides are
- For two subsequent slides and , the score is the minimum (the smallest number of the three) of:
- the number of common tags between and
- the number of tags in but not in
- the number of tags in but not in
- Score formula:
For simplicity, horizontal slide refers to a slide containing 1 horizontal slide and vertical slide refers to a slide containing 2 vertical slides
First of all, let’s look at how the tags are distributed in the input files. This may help us understand the problem better.
|H||# distict tags||4||840000||1558||220||NA|
|H||# tags per photo||2.5||18.0||9.43||10.04||NA|
|H||# photos per tag||1.25||1.71||3.02||1369.09||NA|
|V||# distict tags||3||NA||1548||220||500|
|V||# tags per photo||2||NA||9.52||10.01||19.09|
|V||# photos per tag||1.3||NA||3.07||2732.06||3055.96|
- B, D, E will come up with a large amount of score because there are many photos in these cases
- B contains horizontal photos only, which would be a good start point to try ideas
- B has many distinct tags which are far more than D, E
- In B and E, photos have twice more tags than D
- In term of photos per tag, D and E are thousands of times larger than B
- E contains vertical photos only which would be the last challenge
Actually, we can consider the problem as building a sequence of slides and maximizing its overall transition score. Given the tag set of the previous slide, we can find a slide which maximizes the current transition. If every subsequent transition is optimized, we could eventually optimize the final score.
prevSlide <- create a slide of 1 horizontal unused photo or 2 vertical unused photos result += prevSlide while some photos are unused: (hSlide, hScore) <- Best horizontal slide w.r.t prevSlide (vSlide, vScore) <- Best vertical slide w.r.t prevSlide if both hSlide and vSlide are found, then if hScove >= vScore, then prevSlide <- hScove else prevSlide <- vScove else if hSlide not found, then prevSlide <- vSlide else if vSlide not found, then prevSlide <- hSlide else preSlide <- create a slide of 1 horizontal unused photo or 2 vertical unused photos res += prevSlide done return result
Obviously, we just iterate over the horizontal photos and take the photo with the highest score after the previous slide .
Actually, we don’t need to search all the photos in the pool of horizontal photos
hPool, instead, we only need to look for those having at least one common tag with the previous slide, otherwise, the score must be zero. The search space is restricted in those
Essentially, we can keep track of the photos sharing the same tag. We store this information in a
HashMap[String, HashSet[Photo]] call
hDict, which is a mapping from tag to a list of photos containing the tag.
This is the most tricky part of the problem.
One solution: we can iterate over all the combination of 2 vertical photos and choose the best vertical slide as we did for horizontal slide. However, the time complexity for each step is very high, . If you compute the
candidates size for each previous slides, we will find that it is, in average, about 7200 vertical photos in case D and 60000 in case E. The brute force solution may work for D, but never works for E.
A more reasonable solution is heuristic. Given a previous slide, we can first take the best vertical photo, denoted as , and then find the second vertical photo maximizing the score between the previous slide and the vertical slide of and .
We notice that the sequence can be built in parallel. More precisely, we can hash partition the input photos into blocks. Each block of photos can be used to build a sub-sequence independently, and then all the sub-sequences could be joined into a sequence. This is a classical map-reduce pattern which cam largely improve the performance.
You may find the code here (comments included)
With the heuristic algorithm and parallelization trick, we can obtain the following result. The final score is 1112255 which around top 20 in the qualification round, see here. Moreover, the time for D and E are no more than 5 mins which are also acceptable during the competition.
Case: a_example Time: 119 ms Score: 2 Case: b_lovely_landscapes Time: 17569 ms Score: 201792 Case: c_memorable_moments Time: 165 ms Score: 1736 Case: d_pet_pictures Time: 228491 ms Score: 428041 Case: e_shiny_selfies Time: 294582 ms Score: 480684 Final score = 1112255
The competition has been well organized as usual. However, I am a little disappointed at the problem itself. The provided data set is very easy to be cracked brute forced. One can just randomize the slide sequence several times and return the best one. Some teams did that and had a good result. Fortunately, these randomized brute force solutions are not good enough for the finalist. Anyway, this reminds me again the maxim Done is better than perfect.
The CPU information of the computer on which the programme has been run.
$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 94 Model name: Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz Stepping: 3 CPU MHz: 800.015 CPU max MHz: 4000.0000 CPU min MHz: 800.0000 ...