“In 2016 I worked at MarcPoint, a big data technology company providing digital marketing consultancy, as an algorithm and data mining engineer responsible for developing a similar-product classification system. The system aimed to help our clients understand market trends, learn from their competitors, and adapt to fast-changing customer preferences.
MarcPoint initiated the project because a growing number of Taobao vendors wanted to understand the current market and their competitors. We discovered that Taobao's “finding similar products” function recommended only identical or even inaccurate items, which gave these vendors little insight. Using natural language processing techniques, we designed accurate algorithms that filled this gap.
The algorithm was implemented in three stages: learning a low-dimensional representation of the data, calculating distances between items, and training model ensembles. First, we trained word embeddings on millions of texts retrieved from Weibo to capture the semantic and syntactic meaning of the items’ Chinese descriptions. Second, we combined six distance measures to model the similarity between the word embeddings of two selected items. Finally, we applied repeated random sampling and model ensembling to improve the generalizability and flexibility of the system. As a result, accuracy improved significantly, from 75% to over 95%.”
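The three stages described above can be illustrated with a small sketch. This is not the production system. The description names neither the six distance measures nor the base model, so everything specific here is an assumption: items are represented by averaging their word vectors, the six distances are Euclidean, Manhattan, Chebyshev, cosine, Canberra, and Bray-Curtis, the base learner is a nearest-centroid classifier, and the random sampling is a per-class bootstrap. All function names are hypothetical.

```python
import math
import random


def item_vector(tokens, embeddings):
    """Stage 1 (assumed): represent an item as the mean of its word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in item description")
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def distance_features(u, v):
    """Stage 2 (assumed choice of measures): six distances between two vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    euclidean = math.sqrt(sum(d * d for d in diffs))
    manhattan = sum(diffs)
    chebyshev = max(diffs)
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    dot = sum(a * b for a, b in zip(u, v))
    # Cosine distance is undefined for a zero vector; treat it as maximal.
    cosine = 1.0 - dot / (norm_u * norm_v) if norm_u and norm_v else 1.0
    # Canberra skips coordinates where both entries are zero.
    canberra = sum(
        d / (abs(a) + abs(b))
        for d, a, b in zip(diffs, u, v)
        if abs(a) + abs(b) > 0
    )
    denom = sum(abs(a + b) for a, b in zip(u, v))
    bray_curtis = manhattan / denom if denom else 0.0
    return [euclidean, manhattan, chebyshev, cosine, canberra, bray_curtis]


def train_centroid_model(rows, labels):
    """Fit a nearest-centroid classifier: one mean feature vector per label."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_centroid(model, row):
    """Return the label whose centroid is closest to the feature row."""
    return min(model, key=lambda label: math.dist(model[label], row))


def train_ensemble(rows, labels, n_models=11, seed=0):
    """Stage 3 (assumed): train one model per bootstrap sample of the data."""
    rng = random.Random(seed)
    by_label = {}
    for r, y in zip(rows, labels):
        by_label.setdefault(y, []).append(r)
    models = []
    for _ in range(n_models):
        sample_rows, sample_labels = [], []
        for y, members in by_label.items():
            # Sample with replacement within each label so no class is lost.
            picked = rng.choices(members, k=len(members))
            sample_rows.extend(picked)
            sample_labels.extend([y] * len(picked))
        models.append(train_centroid_model(sample_rows, sample_labels))
    return models


def predict_ensemble(models, row):
    """Combine the individual predictions by majority vote."""
    votes = [predict_centroid(m, row) for m in models]
    return max(set(votes), key=votes.count)
```

In use, each labeled pair of items would be turned into a six-element feature row with `distance_features(item_vector(...), item_vector(...))`, and `train_ensemble` would be fitted on those rows with labels such as 1 for similar and 0 for not similar. An odd `n_models` avoids tied votes in the two-class case.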