When it comes to Data Science, the most recurring topic is modeling. Quite a few articles out there talk about data preparation, and only a handful cover how to communicate your results properly. However, there are hardly any dealing with the topic that we are going to cover today: data enrichment.
In our experience helping companies start using their data efficiently, the lowest-hanging fruit in most (especially bigger) organisations is context. Many organisations attempt to solve a particular problem with data and fail.
“There is not enough data.”
“The models are not good enough.”
“We cannot do anything with these results.”
…are outcomes a lot of folks are way too familiar with. Sometimes the simplest thing to do is to step back and ask yourself: “What does this data mean in the context of other data?”
Let’s talk about the two main kinds of data enrichment: data integration and data augmentation.
We understand data integration as combining all the data that each part of your business generates. This may seem like a no-brainer, but per-department data silos are very common in larger organisations, and bringing them together can be a non-trivial exercise. We see this happening in smaller companies as well, especially since the rise of subscription-based SaaS services, which has led to different teams using different tools and to data being replicated and scattered all over the place.
Talk to people:
If there is more than one person in your organisation, the odds are someone else knows something you don’t ;-) The differences in knowledge and skill sets are at the core of any modern company, yet people often act as if this wasn’t the case. Because of legacy organisational structures, acquisitions or independent initiatives, teams often end up in information bubbles, unaware of all the valuable insights and data that their colleagues might be sitting on. Our advice? Before starting a project, think about what information could be useful and who in your organisation would have an incentive to collect it. The odds are, they are collecting it.
Assume things are not the same:
Just because two fields share a name, or seem like they should represent the same piece of data, does not mean they do. Matching data between different data sources is the most crucial part of the process and it can make or break your analysis. There are plenty of reasons why client_id might mean something different across the databases your teams maintain: a legacy way of assigning an ID to a client, different products, or something as simple as one database using bigint and another int (different data types). Tying back to the step above, find someone who can clarify things and make sure your assumptions hold true.
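A quick way to test the assumption that two client_id columns really match is to compare the keys before joining anything. The snippet below is a minimal sketch in plain Python. The source names (crm_ids, billing_ids) and the sample values are made up for illustration, and casting to strings is just one possible fix that the data owner should confirm.

```python
def key_overlap_report(left_ids, right_ids):
    """Summarise how well two collections of IDs line up before a join."""
    left = set(left_ids)
    right = set(right_ids)
    return {
        "matched": len(left & right),
        "left_only": sorted(left - right, key=str),
        "right_only": sorted(right - left, key=str),
        # Differing types are a common reason for "missing" matches.
        "left_types": sorted({type(v).__name__ for v in left}),
        "right_types": sorted({type(v).__name__ for v in right}),
    }


# Hypothetical sample data: the CRM stores integers, billing stores strings.
crm_ids = [101, 102, 103]
billing_ids = ["101", "102", "205"]

report = key_overlap_report(crm_ids, billing_ids)
print(report)

# Assumption: the IDs mean the same thing and only the type differs.
fixed = key_overlap_report([str(i) for i in crm_ids], billing_ids)
print(fixed)
```

If the first report shows no matches and different types on each side, the mismatch is probably technical. If the types agree and the overlap is still poor, the IDs may simply mean different things, which is a question for a colleague rather than for code.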
One of the great tools for data integration available out there is Stitch Data. This service allows you to back up your data from different sources into a data warehouse of your choice in a nicely structured format, whether it comes from Google Sheets or .csv files, SaaS applications or custom events that you are collecting. They handle consistency, failures and maintenance so you don’t have to. Great for teams that are short on development resources.
We define data augmentation as getting new data that is not generated by your business in order to give context to the data you already have. This can mean data on spending at betting sites for a bank’s credit scoring, weather data for a car insurance company, or a customer’s social media accounts for an ecommerce site.
Many applications and web services today provide access to their data through an API. This is a way for a developer to process data from a service in a programmatic way without the use of a graphical user interface. Nowadays, there are APIs for everything - weather data, maps, electric grid information, social networks, fitness apps, communications tools, emailing tools, government organisations… you get where I’m going with this. If you can think of some information to augment your internal data with, odds are there is an API that can help you out.
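To make this concrete, here is a minimal sketch of augmenting internal records with data from an API, using the car insurance and weather example. Everything specific in it is an assumption made for illustration: the endpoint URL is a placeholder, and the field names (city, date, weather) and the response shape are invented. A real weather API will have its own parameters and authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint, not a real service.
WEATHER_URL = "https://api.example.com/weather"


def fetch_weather(city, date):
    """Call the (hypothetical) weather API and return the parsed JSON."""
    query = urlencode({"city": city, "date": date})
    with urlopen(f"{WEATHER_URL}?{query}", timeout=10) as response:
        return json.load(response)


def augment_claims(claims, lookup=fetch_weather):
    """Return copies of the claims with a 'weather' field added.

    Results are cached per (city, date), so each combination is
    requested only once.
    """
    cache = {}
    enriched = []
    for claim in claims:
        key = (claim["city"], claim["date"])
        if key not in cache:
            cache[key] = lookup(*key)
        enriched.append({**claim, "weather": cache[key]})
    return enriched


# A stub stands in for the API here so the example runs offline.
calls = []


def fake_lookup(city, date):
    calls.append((city, date))
    return {"condition": "rain"}


claims = [
    {"claim_id": 1, "city": "Prague", "date": "2017-03-01"},
    {"claim_id": 2, "city": "Prague", "date": "2017-03-01"},
    {"claim_id": 3, "city": "Brno", "date": "2017-03-02"},
]
enriched = augment_claims(claims, lookup=fake_lookup)
print(enriched[0])
```

Passing the lookup function in as a parameter keeps the enrichment logic separate from the network call, so the same code can be tried against a stub before pointing it at a real service.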
FullContact is the best API for augmenting data about companies and individuals. Searchable by email, Twitter handle or other personal info, it provides publicly available information about an individual, gathered from all over the internet. This includes their language and location, the social networks on which you can reach them, and the topics they are fans of. You can make sure that you are communicating with people about relevant things, in the right language and through the right channels.*
We love Zapier. It is the best way to augment data, especially if you don’t have many development resources to spare. It enables you to connect thousands of apps together through their APIs using a drag & drop interface. No need to worry about errors, maintenance, deployments or updates to new versions. Zapier handles all of that for you.
*We realise this information can be (and is being) misused by some. We in no way encourage this and are strong advocates of using additional data to make technology better.