Migrating to Open Source: SAS to Python

Intro

Nowadays, companies rely on data more than ever. Often, they use dozens of machine learning models to predict the behaviour of their clients, assess the reliability of their machines, or optimize pricing. The many software tools that help them leverage their data can be split into two categories:

  • Licensed software

  • Open-source software

Each one has its advantages and disadvantages, depending on the preferences of each company. For example, if you don’t mind spending extra money, or you don’t want to hire or train employees to support and maintain your code, the licensed version will be good for you. 

However, although I don't deny some of the benefits of licensed solutions, I can't help being an evangelist of open-source technology. I always try to explain its value for analytics and predictive models (with the primary focus on Python) to our clients. I believe that the hundreds of thousands of euros that companies spend on licenses can be invested more efficiently.

Licensed vs. open source

Let's take a look at three standard types of software frequently used for data science and data analytics – SAS, R, and Python – and how their popularity has evolved over time:

Source: Google Trends

We can see that while the popularity of SAS has been pretty much constant since 2012, the popularity of open-source software, and especially Python, has been continually growing. Let's take a look at some of the reasons why it might be so:

Flexibility

With licensed software, you can use its pre-built functionalities (as long as you're paying for them) and that's pretty much it. Adding new features and further customization, if possible, usually comes at a price: to use advanced features, you'll often have to buy new products or pay for upgrades. For some companies, the problem is just the opposite. They are paying for a robust enterprise solution but, in fact, only a handful of employees use very few of its functions. Open-source software has all the advanced features out there for you to take, and it's up to you how many of them you integrate. There are practically unlimited options for personalization and customization to build a solution that fits your needs.

Transparency

Because licensed software is closed, its functionalities are not very transparent and can be hard to understand, and you can neither access nor change the source code. Open source, by contrast, is transparent by nature: the functionalities and algorithms are documented in detail, and the source code of the algorithms is completely under your control.

Ease of learning

Licensed software is usually easier to learn and, thanks to a graphical interface, often does not require any coding skills. However, Python's simplicity and intuitive syntax also make it truly beginner-friendly and quite easy to learn. That allows people who are not computer scientists to “code” and somewhat reduces the advantages of clickable tools. Moreover, Python will not serve you for just one purpose but will open the door to a whole world of possibilities, so the extra learning time is a great investment.

Even if a company doesn't want to learn how to build algorithms in Python, it doesn't have to give up on open-source solutions. External companies, such as ours, can help with the migration, quickly train the employees to use the solution, and/or make the results of the algorithms accessible to them in a familiar way.

Technical support

Licensed software, such as SAS, comes with great technical support and a community that will answer your questions and will help you resolve your issues. But thanks to the growing popularity of open-source software, everything about it is well documented and the superb community around it is growing just as dynamically and will equally help you solve your problems.

If a company decides to outsource an open-source-based solution to an external provider, the technical support is usually part of the deal and only costs a fraction of the price of the license fees.

The same accuracy without license fees

Some might be thinking: “If it's free, it probably won't be good enough.” But don't forget that in life, too, some of the best things are free ;). While we are discussing some of the trade-offs between licensed and open-source software here, rest assured that accuracy is NOT one of them. You can get the same results using open-source software as with licensed software. The quality of the results won't suffer. Period.

It sounds “cooler” and attracts better talent

If someone knows how to use SAS, they have to look for a job in a company that uses SAS. But honestly, all the “cool” companies use Python (Google, Netflix, Uber, to name just a few). Also, knowing Python is a good starting point for learning other programming languages and tools. So if you want to attract the best talent, this might also be worth considering.

Ease of integration with various tools (Big Data, databases)

It's open source and it's free, so feel free to do whatever you deem fit with your solution and make the most of your data. Also, the algorithms can be built on top of almost any infrastructure that you’re already using and that your colleagues are familiar with.

Last but not least… THE COST

Cost is usually one of the first factors considered when deciding which option to choose, and I keep mentioning it throughout the article: expensive license vs. free software. Even with the initial cost of adopting a new technology, open source is the winner here, since you pay only for the features you use. There’s hardly a need to explain further.

Case Study: Migration from SAS to Python

Let's now look at licensed vs. open-source in practice. Recently, we helped one of our clients from the banking industry to migrate their prediction models from SAS to Python and I am now going to share the results.

The goal of this project was to migrate the existing prediction models from SAS:

  • 7 income prediction models (1 for each group of clients)

  • 4 logistic regression models that estimate the probability of each client's interest in a specific product.

These models are used by the bank to select optimal products for each customer.

Over 4 months, we recreated the same flow in Python, an open-source language. We managed to reproduce the data preparation, the feature engineering, and the scoring steps. We migrated everything to open-source technologies, saving the client hundreds of thousands of euros per year for the same predictions and results.

Accuracy was an important factor because of the regulations in the banking industry. The results of data preparation and feature engineering were exactly the same (up to twelve decimal places). In the scoring step, there were small differences after the second decimal place, because a regression refitted in a different tool (with a different optimizer, starting values, and convergence tolerance) will rarely converge to exactly the same coefficients. We could also have achieved exactly the same results in the regressions, but we would have had to hard-code the beta coefficients manually.
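To make the last two points more concrete, here is a minimal sketch of what “hard-coding the beta coefficients” and “the same up to twelve decimal places” could look like in Python. It is not the client's code: the coefficient values, the feature names, and the helper functions `score` and `matches` are made up for illustration.

```python
import math

# Hypothetical coefficients, as if copied from the legacy model's output.
# These are illustrative values, not figures from the project.
INTERCEPT = -1.25
BETAS = {"age": 0.03, "monthly_income": 0.0004, "has_mortgage": 0.8}


def score(client: dict) -> float:
    """Logistic-regression probability computed from fixed coefficients."""
    linear = INTERCEPT + sum(beta * client[name] for name, beta in BETAS.items())
    return 1.0 / (1.0 + math.exp(-linear))


def matches(legacy_value: float, new_value: float, decimals: int = 12) -> bool:
    """True when two values agree up to the given number of decimal places."""
    return abs(legacy_value - new_value) <= 0.5 * 10 ** -decimals


client = {"age": 40, "monthly_income": 2500, "has_mortgage": 1}
probability = score(client)
```

Because the coefficients are fixed rather than re-estimated, scoring becomes a plain formula, so both tools should produce the same probabilities up to floating-point rounding. A check like `matches` can then be run column by column on the outputs of the old and the new flow.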

You can often hear the argument against Python that it is slow. This might be true if we are talking about terabytes of data. However, when you are dealing with tens of gigabytes, sometimes even hundreds, Python will work just fine. Our solution is executed on a monthly basis and, in this case, the monthly run was five times faster in Python than in SAS (0.5 h vs. 2.5 h) for ~800k clients. Python (pandas) has very efficient vectorized operations so, as long as the data fits into memory, it will be fast. SAS loops through the data row by row, so it is easier to scale but slower than Python if you have enough RAM. We used the same server that had previously been the SAS server, with 64 GB of RAM and 8 cores, which is nothing special these days. Just for reference, the data for 800k clients with 1,500 variables takes up approximately 3.5-4 GB. That is well suited to in-memory processing and, therefore, quite fast in Python as well.
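For readers who have not seen the difference between looping and vectorized operations, here is a small sketch on synthetic data. The column names and sizes are made up, and the example only illustrates the idea; it says nothing about the actual speed-up in the project.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a client table (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.uniform(500, 5000, size=10_000),
    "expenses": rng.uniform(100, 4000, size=10_000),
})

# Row-by-row loop: one Python-level operation per client.
looped = [row.income - row.expenses for row in df.itertuples(index=False)]

# Vectorized: a single operation on whole columns, executed in compiled code.
vectorized = df["income"] - df["expenses"]

# In-memory size of the table in megabytes, useful for checking that it fits in RAM.
memory_mb = df.memory_usage(deep=True).sum() / 1e6
```

Both approaches return the same numbers; the vectorized version simply avoids the per-row overhead of the Python interpreter. Timing the two variants on your own data (for example with `timeit`) is the most reliable way to see how large the gap is in your case.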

The cost of the migration was 60,000 euros, which is less than the client used to pay for a 1-year license, so the project paid for itself within the first year. With license savings in the range of hundreds of thousands of euros per year, the estimated ROI was more than 100% within the first year. We also trained the people responsible for the solution to manage and use it in Python, and now they can handle everything on their own.
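The ROI arithmetic can be sketched as follows. Only the 60,000 euro migration cost comes from the project; the annual license fee of 200,000 euros is an assumed placeholder, and the calculation ignores any ongoing support or infrastructure costs.

```python
def first_year_roi(migration_cost: float, annual_license_fee: float) -> float:
    """First-year ROI: (license fee saved - migration cost) / migration cost."""
    return (annual_license_fee - migration_cost) / migration_cost


MIGRATION_COST = 60_000            # euros, from the case study
ASSUMED_LICENSE_FEE = 200_000      # euros per year, hypothetical placeholder

roi = first_year_roi(MIGRATION_COST, ASSUMED_LICENSE_FEE)

# License fee at which the first-year ROI is exactly 100%:
# the savings must be twice the migration cost.
break_even_fee_for_100_percent = 2 * MIGRATION_COST
```

Under these assumptions, any annual license fee above 120,000 euros gives a first-year ROI above 100%, and the savings recur in every following year.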

Conclusion

We're still far from open-source software being the standard, and who knows if we'll ever get there. Sure, it might not be for everyone, but if there's space for it even in a field as strongly regulated as banking, then you're going to have a hard time convincing me that it isn't the better option for you. If some of the advantages mentioned in this article, such as greater flexibility, lower cost, easy integration, or attracting talent, sound appealing to you, I suggest you give it a try (or at least a thought ;).

If you want to discuss options of migration from licensed software to open source, feel free to contact us:

GET IN TOUCH
