In October 2016 at the World of Watson conference in Las Vegas, I chatted with Michael Zargham, Director of Data Science at Cadent.tv, about possible data science collaborations. Two weeks later, we were ready to start a project using IBM's Data Science Experience.
Mike believes that high-quality data science does not come only from well-known machine learning methods like linear regression or decision trees. I fully agree with him on this one. In many cases, relating a data science problem to other hard sciences gives you the chance to apply well-established techniques in the context of your problem. In this collaboration, we applied methods from physics to determine whether a movie title is correctly written. Detecting and correcting errors within databases is key to achieving better results in data-driven decisions.
The collaboration was a great success and we're excited to have the chance to share our results here.
Unique noisy record recognition tool
In practical applications, cleaning data is a recurring challenge; data often includes manually entered text fields, which may contain errors or alternative representations that artificially inflate the number of unique records and cause pain points in the enrichment process. In this blog, we propose a methodology for determining whether a record is a new record or a perturbation of a known record.
A history of past events is maintained as a nested dictionary of associated keys including the frequencies of appearance for each key. Our approach leverages distance metrics and the event frequencies to dynamically update groupings of noisy keys (strings) associated with one "correct" natural key — and which of those keys is the "correct" natural key.
A novel aspect of our approach is that we developed it for an arbitrary distance metric. One could use a metric derived from a Levenshtein-based edit distance, regular-expression-encoded rules, NLP-based features with cosine or Euclidean distance, or some combination of the above.
In our tutorial, we emphasize the analysis of the chosen distance metric, because this choice fundamentally determines the effectiveness of the tool. The tool relies on the assumption that perturbations of the same natural key are small under the metric, while pairwise distances between distinct natural keys are large relative to their perturbations. The entire first section of this article is dedicated to this analysis.
Our solution is analytical and based on concepts of filtering and estimation. Using the frequency with which a string has appeared along with our distance measure, we use a simple analogy to physics: when a new string appears that is not already assigned to a group, it is assigned to the group with the greatest influence, which is analogous to gravity with frequency taking the role of mass. Likewise, we can use the same construction within a group of keys to identify the natural key as the one closest to the center of mass of all keys in the group.
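To make the gravity analogy concrete, here is a minimal sketch in Python. The formula influence = frequency / distance² is an assumption for illustration (the post does not spell out the exact functional form); similarity scores follow the 0-to-100 convention introduced below, where 100 means an exact match.

```python
# Hypothetical sketch of the gravity analogy (not the exact production code).
# Similarity scores follow the 0-100 convention (100 = exact match); we
# convert them to a distance d = 100 - s and apply influence = mass / d^2,
# with observation frequency playing the role of mass.

def influence(frequency, similarity):
    """Gravity-like influence of a known key on a candidate key."""
    d = max(100 - similarity, 1)   # clamp to avoid division by zero
    return frequency / d ** 2

# Two known keys, equally similar to some candidate: the more frequently
# observed key ("Ip Man", seen 10 times) exerts the greater influence.
frequencies = {"Ip Man": 10, "Ip MaM": 1}
similarities = {"Ip Man": 83, "Ip MaM": 83}
best = max(frequencies, key=lambda k: influence(frequencies[k], similarities[k]))
print(best)  # -> Ip Man
```

The same construction, applied within a group, identifies the natural key as the member with the largest aggregate pull from its neighbors.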
Data and human-like perturbations
For our numerical experiments, we used the movie titles from the IMDB database, which are easy to retrieve with the imdbpie python package. Here is a small sample of the movie titles in the IMDB database: “12 Angry Men”, “La Haine”, “Toy Story”, and “The Dark Knight”, to mention a few. We perturbed the movie titles from IMDB using human-like errors that can be summarized in 4 categories:
- Injection (randomly inject a repeated character)
- Typo (replacing a “correct” character with an “incorrect” character)
- Swap (swap places of two adjacent characters)
- Spelling (typical spelling errors)
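As an illustration, the first three categories can be generated with a few lines of Python. These generators are hypothetical stand-ins, not the exact ones used in the experiments; spelling errors, which typically come from a lookup table of common misspellings, are omitted.

```python
import random

# Illustrative generators for three of the four perturbation categories
# (these are assumptions for demonstration, not the experiment's exact code).

def inject(title, rng):
    """Randomly inject a repeated character."""
    i = rng.randrange(len(title))
    return title[:i] + title[i] + title[i:]

def typo(title, rng):
    """Replace a 'correct' character with an 'incorrect' character."""
    i = rng.randrange(len(title))
    return title[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + title[i + 1:]

def swap(title, rng):
    """Swap places of two adjacent characters."""
    if len(title) < 2:
        return title
    i = rng.randrange(len(title) - 1)
    return title[:i] + title[i + 1] + title[i] + title[i + 2:]

rng = random.Random(42)
print(inject("Toy Story", rng), "|", typo("Toy Story", rng), "|", swap("Toy Story", rng))
```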
There are 4 primary operations in our method:
- The influence measure, which combines the distance metric with frequency observations to determine a force analogous to gravity.
- A projection mapping, which leverages the influence measure and additional logic to output the natural key for a given candidate key.
- An entry shuffling, which leverages the influence measure to dynamically update the natural key assigned to a group as more candidate keys are observed and grouped.
- A dictionary cleaning, which is a higher-level function used to periodically inspect and clean up the whole dictionary, with a focus on using influence to infer when two groups of keys should be merged into one.
While there are four methods here, the primary use case is that, given a sequence of candidate keys, the projection mapping determines whether each candidate is a true natural key or a perturbation of one of the known natural keys.
To demonstrate this, we must first define a distance metric to work with. We use a mixture of Levenshtein distance with some basic regular-expression logic to parse special characters. Our distance between two strings ranges between 0 and 100, where 100 is an exact match. In other words, the closer the score is to 100, the closer the strings are.
For the computation of the Levenshtein distance we use the fuzzywuzzy python package. However, the proposed tool does not rely on a specific distance function, and it is up to the user to choose one. At any point in time, we can assume that there are N (non-negative integer) unique observed records, called r1, r2, ..., rN.
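For readers who want to experiment without installing fuzzywuzzy, the standard library's difflib provides a comparable 0-to-100 similarity score. Treat this as a stand-in for illustration, not the exact metric used in our experiments:

```python
from difflib import SequenceMatcher

# Stand-in for fuzzywuzzy's Levenshtein-based ratio, using only the
# standard library. Scores range over [0, 100], with 100 = exact match.

def similarity(a, b):
    """Return a similarity score in [0, 100]; higher means closer strings."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))

print(similarity("Toy Story", "Toy Story"))  # -> 100
print(similarity("Toy Story", "Toy Stoyr"))  # a small typo stays close to 100
print(similarity("Toy Story", "La Haine"))   # unrelated titles score low
```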
To create a compelling demonstration of the learning capability of this approach, we ran the following experiment: at time zero, N = 0 and no unique records have been observed, so r1 will automatically be considered unique. When a new record rnew is observed, the distance from rnew to each of r1, r2, ..., rN is computed.
In order to keep a bound on computational complexity as the cardinality of unique natural keys rises, we impose a threshold parameter ε, where 0 < ε < 100, and restrict attention to a subset of the known natural keys:

Set(rnew) = { ri : distance(rnew, ri) > ε }

Given ε, Set(rnew) contains the keys closest to rnew. If this set is empty, rnew is trivially mapped to itself and the set of unique records is updated to r1, r2, ..., rN, rnew. However, if this set is not empty, then rnew is not unique and is mapped to

r* = argmax influence(rnew, ri), where ri ∈ Set(rnew).
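Putting the pieces together, the projection mapping can be sketched as follows. The similarity and influence functions here are illustrative assumptions (difflib in place of fuzzywuzzy, and an inverse-square influence), and ε = 70 is an arbitrary choice:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def influence(freq, sim):
    # gravity analogy: frequency is mass, (100 - similarity) is distance
    return freq / max(100 - sim, 1) ** 2

def project(r_new, known, epsilon=70):
    """Map r_new to a known natural key, or register it as a new one."""
    # Set(r_new): known keys whose similarity to r_new exceeds epsilon
    candidates = {r: s for r in known
                  if (s := similarity(r_new, r)) > epsilon}
    if not candidates:
        known[r_new] = 1        # trivially unique: becomes a new natural key
        return r_new
    # r*: the candidate group exerting the greatest influence
    r_star = max(candidates, key=lambda r: influence(known[r], candidates[r]))
    known[r_star] += 1
    return r_star

known = {"Toy Story": 5, "La Haine": 3}
print(project("Toy Stoyr", known))        # maps to "Toy Story"
print(project("The Dark Knight", known))  # empty Set -> treated as unique
```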
The next table shows a few examples of tuples of records (ri, rj) and the distance between them:
The following image shows a visualization of the computed Levenshtein distances between each pair of IMDB movie titles. In this plot, close movie titles are shown in white, such as the diagonal, which shows the distance from each title to itself (100). The farther apart two movie titles are, the darker (closer to zero) the corresponding cell appears.
This is a great place to start for assessing the distance metric, but it is only a starting point, because our algorithm accounts for the frequencies of observed events in addition to string distances, which are symmetric. In the example below, the natural key "Ip Man" has occurred 10 times and the perturbed key "Ip MaM" has occurred once; while they are equidistant from one another, the influence is asymmetric.
If r̄ is an observation of natural key r and it is perturbed (r̄ ≠ r), then

distance(r, r̄) > max distance(r, r'),

where r' ranges over the other natural keys within the domain.
For each natural key r, the probability of any perturbation is less than that of getting r itself:
Pr(r)>Pr(r̄) ∀ r̄ ≠ r.
For stronger mathematical guarantees, impose the following: if r̄ is the observed key associated with natural key r, then
Pr(r̄ = r)>0.5.
Strictly speaking, we have no way to enforce that this is or will continue to be true. The great thing about this model is that it will work for any valid distance metric and any sequence of observed strings. The bad news is that it is only as good as the fit of the distance metric to that set of observations. Therefore, the upfront work of a data scientist is to design the distance metric through some combination of business logic and feature engineering, informed by careful analysis of a known set of observations.
Below are distance statistics between natural keys and their randomly generated perturbed keys:
One can immediately see the difference between the distributions for "La Haine" and "Touch of Evil" as compared to the degenerate cases "M" and "Up", for which even a few perturbation samples yield distances of 50 or less. In the next example, we look at the distances between the natural keys of sequels.
In order to demonstrate that the method works even if the feature space is not fully able to meet these criteria, we did not further optimize our distance metric. Instead we discuss the strengths and weaknesses of our tool in the context of the blind spots of the distance metric.
- Very Short Strings: small character perturbations such as typos have disproportionately large effects on distances.
- Movie Sequels: even with longer strings, two natural keys may differ by as little as one character.
In practice, one would work around these issues by building a deeper feature space that identifies and captures the notion of sequels. On the other hand, there is little one can do with strings that are too short (fewer than 4 characters), because they simply lack the information density to be distinguishable under human typographical errors.
We will briefly review some data from our example before proceeding with the learning example. This first table shows natural keys which are least separable.
- max title dist: the distance to the closest of all the other known natural keys
- min err dist: the distance to the farthest of all sampled perturbations of this natural key
- overlap = 'max title dist' - 'min err dist'
Positive values of overlap indicate non-separability. This chart highlights the particular issue with sequels, but this is an edge case. The figure below shows the full histogram of overlap, which has only a small tail of positive values.
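For concreteness, here is how the overlap statistic can be computed for a single natural key on a toy sample. The titles and perturbations below are illustrative, not the full IMDB data, and difflib stands in for the actual metric:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Toy sample: a handful of natural keys, plus sampled perturbations of one key
titles = ["Toy Story", "Toy Story 3", "La Haine", "The Dark Knight"]
perturbations = {"Toy Story": ["Toy Stoyr", "Toy Storyy", "Toi Story"]}

key = "Toy Story"
# closest other natural key (highest score, since 100 = exact match)
max_title_dist = max(similarity(key, t) for t in titles if t != key)
# farthest sampled perturbation of the key itself (lowest score)
min_err_dist = min(similarity(key, p) for p in perturbations[key])
overlap = max_title_dist - min_err_dist
print(overlap)  # positive: the sequel scores higher than the worst typo
```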
In practice, the non-separability of sequels leads to them being grouped together. For example, "Alien" and "Aliens", or "Toy Story" and "Toy Story 3", will be placed in the same group, and our center-of-mass calculation will determine which ends up being the natural key associated with the group, accounting for the frequency of their observation along with the frequencies of and distances to the perturbed versions of each of these keys.
- Start with a completely empty dictionary.
- Use the distance metric and the IMDB data set discussed above; titles exhibiting the aforementioned weaknesses (short titles and sequels) were not removed.
- Experimental set-up: 100 iterations of 100-sample title batches, with errors injected in 30 out of 100 entries on average.
- Run clean-up every 10 iterations, since it is a higher-complexity operation.
- Please see the next block of code for more details on the experimental set-up.
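The experimental loop can be sketched as follows. Here, project, perturb, and clean_dictionary are placeholders for the projection mapping, perturbation generator, and dictionary-cleaning routine described above; this is an illustrative reconstruction, not the exact code used in the experiments.

```python
import random

# Illustrative reconstruction of the experimental loop. `project`,
# `perturb`, and `clean_dictionary` are stand-ins supplied by the caller.

def run_experiment(titles, project, perturb, clean_dictionary,
                   iterations=100, batch_size=100, error_rate=0.3, seed=0):
    rng = random.Random(seed)
    known = {}          # the dictionary of groups, built up from scratch
    accuracy = []
    for it in range(iterations):
        correct = 0
        for _ in range(batch_size):
            truth = rng.choice(titles)
            # perturb ~30% of the entries, mirroring the set-up above
            observed = perturb(truth, rng) if rng.random() < error_rate else truth
            if project(observed, known) == truth:
                correct += 1
        accuracy.append(correct / batch_size)
        if (it + 1) % 10 == 0:       # clean-up is costlier, so run it rarely
            clean_dictionary(known)
    return accuracy
```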
The next figure shows the convergence results:
- Fraction of the titles mapped to the correct valid title on each iteration (green)
- Cumulative fraction of the titles mapped to the correct valid title (blue)
- The baseline accuracy of 0.7 (red)
The next piece of code and its output inspect a few examples of both the natural key representing each group of titles and the data dictionaries that collect movie titles and their frequencies.
The next figure shows the rate of incorrectness of the baseline divided by the rate of incorrectness of our mapping. In other words, the learned lift (blue) shows how many times more accurate the method is than the baseline (whose accuracy was set to 0.7 in the experiments).
- Spark implementation.
- Improve performance by using some concepts from nearest neighbor search to limit the number of direct comparisons required.
- Improve performance by storing the effective “mass” of a group and updating it incrementally, which could also yield significant runtime savings.
I would like to give a special thanks to Steve Moore for his great help on the technical writing of this post.