Shopee Code League (1) Data Analytics Challenge - Multi-Channel Contacts

English Data Analytics

written by LiaoWC on 2021-03-13


Github source code: link

Contributor:

Kaggle page: link

Method

We know that all tickets are unique, and that any two people have the same phone, email, or order ID are exactly the same people. Let's view all tickets as different nodes(vertices). We check phone, email, and order ID. If any two ticket are belong to the same person, we construct a undirected edge between them. Finally, we find the graph's connected components. Each unique user has a corresponding connected components.

Time complexity

Assume number of total tickets is n.

  • Sorting takes O(nlogn) time.
  • Getting edges by iterating takes O(n) time.
  • Finding connected component takes o(n^2) time[1].
  • Grouping takes O(nlogn) time.
  • Finding group data and write file takes O(nlogn) time.

The total time complexity is o(n^2).


[1] Common ways of finding connected components like Tarjan's algorithm using adjacent list takes not more than O(number of vertices + number of edges). In this case, we use scipy's connected component function. We've tested that it seems not take up to n^2 time and it's fast. If you're interested in the algorithm, you can find more information about how scipy implements this function.