(Last updated: 03/12/2019 | PERSONAL OPINION)

“Know your enemy and know yourself, find not a single defeat for 100 battles.” - Sun Tzu, The Art of War

Every spring, patent recipients at my company are treated to a dinner at a five-star hotel and a nice wooden plaque inscribed with the front page of the patent. When I got mine this year, a publication cited by the examiner caught my eye:

So it turned out that Apache Spark’s zipWithIndex() function suffers from exactly the same issue that my patent solved. Per Spark’s documentation: “The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated.”
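To make the documented caveat concrete, here is a small pure-Python simulation. It is not Spark code: `zip_with_index` is a hypothetical stand-in that only mimics the documented contract (indices follow partition order, then item order inside each partition), and the sort-first workaround is a common approach I am assuming for illustration, not the method from the patent.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Assign indices by partition order, then by item order inside each partition."""
    sizes = [len(p) for p in partitions]
    # Starting index of each partition = total size of all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        (item, offset + i)
        for part, offset in zip(partitions, offsets)
        for i, item in enumerate(part)
    ]


# First evaluation: a grouping step happens to emit items in this order.
first_run = [["b", "a"], ["d", "c"]]
# Re-evaluation: same items in each partition, but in a different order.
second_run = [["a", "b"], ["c", "d"]]

index_first = dict(zip_with_index(first_run))
index_second = dict(zip_with_index(second_run))


def stable_index(partitions):
    """Assumed workaround: impose a total order on the data before indexing."""
    ordered = sorted(item for part in partitions for item in part)
    return {item: i for i, item in enumerate(ordered)}


stable_first = stable_index(first_run)
stable_second = stable_index(second_run)
```

In this toy run, "a" gets index 1 the first time and index 0 the second time, while the sorted variant assigns the same index on both runs. The workaround only helps when the sort key is unique per element, and it costs a full sort.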

The open source folks think their job is done after they give away software for free. They’d rather spend energy blogging than actually fixing it. Almost a decade later, the issue still exists.

Charging a premium for analytics products, commercial folks like me must solve the issue! But are we better in every way? Honestly, I don’t know.

Kaggle for Analytics Vendors

Who are the best vendors for gradient boosting, the machine learning technique for regression and classification problems? What is the best analytical model for my problem? These are the questions data scientists want answered but cannot google or buy from Gartner. What if there were a benchmarking system where different vendors duke it out in the open? Different implementations of the same analytical model from different vendors are run on the same data set in the same compute environment, and various metrics are collected in the process:
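A rough sketch of what the core loop of such a system might look like, in plain Python. Everything here is hypothetical: `benchmark`, the two toy "vendor" implementations, and the tiny data set are made up for illustration, and a real system would plug in actual vendor libraries and collect far richer metrics.

```python
import time


def benchmark(implementations, train, test):
    """Run every implementation on the same data and collect the same metrics."""
    X_train, y_train = train
    X_test, y_test = test
    results = {}
    for name, fit in implementations.items():
        start = time.perf_counter()
        predict = fit(X_train, y_train)  # each fit() returns a predict function
        fit_seconds = time.perf_counter() - start

        start = time.perf_counter()
        predictions = [predict(x) for x in X_test]
        predict_seconds = time.perf_counter() - start

        correct = sum(p == y for p, y in zip(predictions, y_test))
        results[name] = {
            "accuracy": correct / len(y_test),
            "fit_seconds": fit_seconds,
            "predict_seconds": predict_seconds,
        }
    return results


def fit_majority(X, y):
    """Toy 'vendor A': always predict the most common training label."""
    majority = max(set(y), key=y.count)
    return lambda x: majority


def fit_nearest_mean(X, y):
    """Toy 'vendor B': predict the class whose training mean is closest."""
    means = {}
    for label in set(y):
        values = [x for x, lab in zip(X, y) if lab == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


train = ([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 0, 1, 1])
test = ([0.15, 0.25, 0.85, 0.95], [0, 0, 1, 1])

results = benchmark(
    {"vendor_a": fit_majority, "vendor_b": fit_nearest_mean}, train, test
)
```

The point of the design is that the harness, not the vendor, owns the data split, the timing, and the scoring, so no implementation gets to grade its own homework.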

Practicing data scientists should pay special attention to these benchmarks. As a group of academics at UPenn showed, evidence collected in such benchmarks tells us a lot about which models to use for different problem domains.

If you’re looking for startup ideas, I hereby authorize you to use the ideas herein for free (yes, ideas are this cheap!). I’m confident that Gartner will happily acquire your startup in no time.


  1. Evaluating Learning Algorithms: A Classification Perspective, a book dedicated to metrics for performance benchmarking.
  2. Data-driven Advice for Applying Machine Learning to Bioinformatics Problems, Randal S. Olson et al., Aug 2017
  3. AI Progress Measurement collects problems and metrics/datasets from the AI research literature and tracks progress on them.
  4. Who is the best at X tracks the state of the art in object classification.
Multi-Vendor Benchmarking System
