Written by Tamr
Leave it to a customer to provide the best real-world description of Tamr’s features and benefits.
In this case, it’s Wouter Dullaert — IT Architect at Toyota Motor Europe (TME) — colorfully portraying the challenges of uniting disparate customer data sources in a modern enterprise.
Titled “Customer Data Unification at TME,” Wouter’s recent presentation to the monthly meetup of the Brussels Data Science Community makes the case that we need to “Plumb Differently” if we want to tap into the full value of the customer information spread across enterprise data sources. Different, that is, from traditional data integration systems such as ETL, which he regards as:
- slow (lots of scripts and manual work)
- costly (only scales with people and often outsourced)
- opaque (no documentation or audit trail)
- inefficient (low-quality results and hard to keep up-to-date)
With “roughly 270 different databases that have customer data in them — and that’s before we actually get to the … retailer network,” the question for TME is “how do we link all of that data together … and feed all of this back” into the business?
“If you do this the traditional way,” Wouter argues, “with the Excel and start mapping columns, you’re going to hit some bumps. It’s not going to work.”
Instead, he presents an alternative approach that has TME working through four unification stages:
Using Tamr for Mapping and Linking
After describing TME’s extraction approach, Wouter summarizes why (starting at about the 20:23 mark in the video) TME is using Tamr for mapping and linking: “because we want to use the machine learning algorithms that this tool offers us to make the mapping and the linking, which are the hard part, easier.”
“The nice thing about Tamr is that it has machine learning. It’s a machine learning assistant. So they have this nice little robot that’s watching what you do. So when you’re mapping this first source, actually in the background it’s profiling all of the source columns that you have and it’s looking at how you’re mapping them. And for the next source it will offer you recommendations … I can now ask Tamr ‘give me some recommendations on which fields go here.’”
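To make the idea of profile-based mapping recommendations concrete, here is a minimal Python sketch. It is not Tamr’s algorithm or API: the function names, the single profile statistic (share of digit characters), the 0.6/0.4 weights, and the sample columns are all hypothetical. It simply suggests, for each column in a new source, the unified attribute of the most similar column that someone has already mapped.

```python
from difflib import SequenceMatcher

def digit_share(values):
    """Profile a column with one statistic: the fraction of digit characters."""
    text = "".join(str(v) for v in values)
    return sum(ch.isdigit() for ch in text) / max(len(text), 1)

def similarity(name_a, prof_a, name_b, prof_b):
    """Blend column-name similarity with profile closeness (illustrative weights)."""
    name_sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return 0.6 * name_sim + 0.4 * (1.0 - abs(prof_a - prof_b))

def recommend(new_columns, mapped_history):
    """Suggest a unified attribute for each new column, based on the most
    similar column that was mapped by hand in an earlier source."""
    suggestions = {}
    for name, values in new_columns.items():
        prof = digit_share(values)
        best = max(
            mapped_history,
            key=lambda h: similarity(name, prof, h["name"], h["profile"]),
        )
        suggestions[name] = best["target"]
    return suggestions

# Columns mapped by hand in an earlier (hypothetical) source.
history = [
    {"name": "cust_phone", "profile": digit_share(["0471234567", "029876543"]), "target": "phone"},
    {"name": "surname", "profile": digit_share(["Peeters", "Janssens"]), "target": "last_name"},
]

# A new source whose columns still need mapping.
new_source = {
    "PhoneNumber": ["0499112233", "0488556677"],
    "LastName": ["Maes", "Jacobs"],
}

recommendations = recommend(new_source, history)
print(recommendations)
```

Every source mapped by hand adds entries to the history, which is one simple way to picture why recommendations could improve as more sources are added.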
Wouter highlights six ways Tamr is helping TME “plumb differently” and conquer the problem of unifying disparate customer data sources.
1. Learning While Adding
Wouter spotlights how Tamr utilizes the “learning” in machine learning with a simple explanation of how the system gets smarter over time:
“For your third source, these recs will be quite ok, but not perfect. But once you get to the 5th, or the 8th or the 20th source, it’s going to be spot on. Because all of these sources contain the same entities. … Once you’ve given it a few sources, it’s actually really good at telling you what it should be in the output. At some point you can just copy over the recommendations because you know they’ll be right.”
2. Efficiencies of Scale
The benefit of Tamr’s learning process is a classic efficiency of scale:
“In this mapping stage, the effort goes down as you add more sources. The effort goes up when you do it traditionally. So we actually reverse this [with Tamr]; as I’m integrating more, it’s actually becoming easier … to scale across my data sources.”
3. Learning the Rules
“When we get to … linking data,” Wouter continues, “this is where the real power of the Tamr tool comes into play.”
“[Tamr] use[s] machine learning to figure out all the rules that you would previously do by hand. The human kind of breaks down when you have 10 or 20 interrelated functions to think about to determine whether something is the same or not. But the computer doesn’t have that problem. It can have a hundred or thousand different rules and nuances … So making these models, functions, that’s what ML is all about.”
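As an illustration of learning matching rules from examples rather than writing them by hand, the sketch below fits a small logistic model over per-field similarity scores. This is an assumption-laden toy, not a description of Tamr’s models: the field names (`name`, `postcode`, `email`), the sample records, and the training settings are all made up for the example.

```python
import math
from difflib import SequenceMatcher

def features(a, b):
    """Turn a record pair into per-field similarity scores in [0, 1]."""
    return [
        SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        1.0 if a["postcode"] == b["postcode"] else 0.0,
        1.0 if a["email"].lower() == b["email"].lower() else 0.0,
    ]

def predict(weights, bias, x):
    """Logistic model: estimated probability that the pair is one entity."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, epochs=500, lr=0.5):
    """Fit the weights with plain stochastic gradient descent on the log loss."""
    xs = [features(a, b) for a, b in pairs]
    weights, bias = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def rec(name, postcode, email):
    return {"name": name, "postcode": postcode, "email": email}

# Hypothetical labeled pairs: 1 = same customer, 0 = different customers.
pairs = [
    (rec("Bob Janssens", "1000", "bob@example.com"), rec("Bob Jansens", "1000", "bob@example.com")),
    (rec("Alice Peeters", "9000", "alice@example.com"), rec("A. Peeters", "9000", "alice@example.com")),
    (rec("Bob Janssens", "1000", "bob@example.com"), rec("Alice Janssens", "1000", "alice@example.com")),
    (rec("Tom Maes", "2000", "tom@example.com"), rec("Alice Peeters", "9000", "alice@example.com")),
]
labels = [1, 1, 0, 0]

weights, bias = train(pairs, labels)
for (a, b), y in zip(pairs, labels):
    print(a["name"], "|", b["name"], "| label", y, "| p =", round(predict(weights, bias, features(a, b)), 3))
```

The learned weights play the role of the hand-written rules: nobody has to decide how much a postcode match should count relative to a similar name, because the labeled pairs determine it.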
4. A Pair-Wise Approach
How does Tamr do this? “Really simple,” he says: “by evaluating record pairs.” He then relays results from a large, successful demo using customer data:
“Tamr goes out and profiles all of your data. This will give you a representative subset of record pairs. … You just label a few of these [pairs]. From one demo that we did, we had about 4 million source records, which we eventually consolidated down into 1.5 million entities. And we had to label 1,000 pairs for it to work. When you compare those numbers, that’s huge. Labeling 1,000 pairs, it’s boring work, but you do that in 2 hours. … That was all it took.”
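To put those numbers in perspective: 4 million records contain roughly 8 trillion possible record pairs, so 1,000 labels is a vanishingly small sample. The Python sketch below shows that arithmetic, plus one common way to turn pairwise match decisions into entities, namely grouping records that are connected by matches (union-find). The source does not say how Tamr forms entities from pairs, so the grouping step and the record IDs are assumptions for illustration.

```python
from math import comb

def find(parent, x):
    """Follow parent links to the group representative, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(record_ids, matched_pairs):
    """Group records into entities: records linked by a chain of matches
    end up in the same group."""
    parent = {r: r for r in record_ids}
    for a, b in matched_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for r in record_ids:
        groups.setdefault(find(parent, r), []).append(r)
    return list(groups.values())

# Hypothetical record IDs from three sources, and pairs judged to match.
records = ["crm-1", "crm-2", "web-7", "dealer-3", "dealer-9"]
matches = [("crm-1", "web-7"), ("web-7", "dealer-3")]

entities = cluster(records, matches)
print(len(records), "records ->", len(entities), "entities")

# Scale of the demo described in the talk.
all_pairs = comb(4_000_000, 2)
print("possible pairs among 4 million records:", all_pairs)
print("share covered by 1,000 labels:", 1_000 / all_pairs)
```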
5. Humans Know [Customers] Best
This balance of efficiency and accuracy is a testament to Tamr’s machine-driven, human-guided design pattern. Tamr routes data questions that the machine can’t answer to the people closest to and most familiar with the data: business users or, in the case Wouter relays, retailers, who can resolve the issues with maximum accuracy and minimal effort.
“In my business, the guy who really knows the customer, that’s the retailer. I want to send those pairs to my retailers to label. And ask them, ‘are Bob and Alice the same’ entity. And they say, ‘no they’re not the same, but they’re married and they have a dog.’ They know the customer up to that level. That is where your knowledge is.”
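One simple way to picture this routing is a confidence threshold: pairs the model is sure about are decided automatically, and the ambiguous ones go to an expert, most uncertain first. The sketch below is only an illustration of that idea. The thresholds, the scores, and the record IDs are hypothetical, and the source does not describe how Tamr actually selects which questions to send.

```python
def route(scored_pairs, low=0.2, high=0.8):
    """Split scored pairs into automatic decisions and questions for experts.

    scored_pairs: list of (pair, probability that the pair is one entity).
    """
    auto_same, auto_different, ask_expert = [], [], []
    for pair, prob in scored_pairs:
        if prob >= high:
            auto_same.append(pair)
        elif prob <= low:
            auto_different.append(pair)
        else:
            ask_expert.append((pair, prob))
    # Ask about the most ambiguous pairs first (closest to a coin flip).
    ask_expert.sort(key=lambda item: abs(item[1] - 0.5))
    return auto_same, auto_different, [pair for pair, _ in ask_expert]

# Hypothetical model scores for four candidate pairs.
scored = [
    (("crm-1", "web-7"), 0.97),
    (("crm-2", "dealer-9"), 0.03),
    (("crm-4", "dealer-3"), 0.55),
    (("web-8", "dealer-5"), 0.35),
]

same, different, questions = route(scored)
print("decided as same:", same)
print("decided as different:", different)
print("sent to the retailer:", questions)
```

The answers that come back become new labeled pairs, which is how a handful of expert labels can keep a model current.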
6. Continuous Quality
Wouter’s last point centered on Tamr’s ability, through human/machine collaboration, to keep the model up to date as new sources are added:
“It’s easy to add new information to the system. We can now regularly train new pairs as I add new sources … All it requires for any individual expert to do is label 5 or 10 more of these records, and, BOOM, my model is back up-to-date. So every now and then, once a week, these experts will get a mail and label 5 of the records and your unification model stays up to date [and] you get high quality results all the time.”
Wouter’s full presentation is as clear and elegant a description of the challenges and approaches of integrating disparate data sources as we’ve seen. Time very well spent, indeed.