Sansan Builders Blog

Sharing insights on technology, design, and product management from the members behind Sansan's product development

Economics Meets Data Science: Coding With Style


Hello, Juan here! It’s been a few months since my last post. We have been quite busy at DSOC releasing some sweet stuff for Sansan users, and I’m still working on the next part of the Structural Estimation Series. For now, I want to talk about coding style.

Coming from Economics, my programming education was limited to punching my keyboard until STATA gave me some results. I was rarely asked to share my code because I mostly did research on my own; when I had to, the fact that everyone’s code was all over the place was considered perfectly normal. In fact, there's this implicit belief that data analysis is not "coding" in the sense that web devs and Machine Learning engineers use the word.

¯\_(ツ)_/¯


When I was asked to revise and resubmit my papers, going back to my code was a true nightmare. Here are some of the problems I always encountered:

  • I always forgot in which order to run the code files.
  • The path to the data files was hardcoded, in many different places.
  • It was not obvious which macros to define before running the code (yes, “macro” means “variable” in STATA’s jargon, to avoid confusion with the columns of a tabular dataset, which are called “variables”).
  • I did not use loops wisely, which resulted in a lot of copy-pasting.
  • I was too lazy to refactor common code into functions.
  • My version control consisted of copying all the code files into time-stamped directories, and modifying them accordingly.
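To make a few of these concrete, here is a minimal Python sketch of how the hardcoded paths, the copy-pasting, and the missing functions could be handled. The directory name, file naming scheme, and years are all made up for illustration; they do not come from any real project.

```python
from pathlib import Path

# Hypothetical layout: every path is derived from one root,
# so moving the project means changing a single line.
DATA_DIR = Path("data")
YEARS = [2017, 2018, 2019]  # illustrative values only


def data_file(year: int, data_dir: Path = DATA_DIR) -> Path:
    """Build the path to one year's raw file instead of hardcoding it."""
    return data_dir / f"survey_{year}.csv"


def summarize(values: list[float]) -> dict[str, float]:
    """Shared logic lives in one function rather than in copy-pasted blocks."""
    return {"n": len(values), "mean": sum(values) / len(values)}


# One loop replaces a copy-pasted block per year.
paths = [data_file(year) for year in YEARS]
```

The same idea carries over to STATA or R: define the root path once, loop over what varies, and name the pieces you reuse.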

If you’re a software engineer you’re probably cringing right now, but most Economics programs don’t pay any attention to coding style, so these kinds of problems are rampant. And guess what, teamwork is as horrible as you can imagine. So much time is wasted trying to understand what the others did.

Unfortunately, software engineers are not exempt from this. While working in web development, I felt that fresh programmers were very strong in the theoretical aspects of software, such as data structures, while attention to coding style was more prevalent among seasoned programmers.

A Reading List to Rule Them All

2020 is a good year to be more at home, read and hone your programming skills, whatever your trade is. So I compiled a reading list with some great advice on coding style. Object Oriented, Functional, imperative or declarative, statically or dynamically typed, whatever rocks your world, you should be able to find something useful here.

The Hitchhiker’s Guide to Python (Kenneth Reitz & Real Python)

Find it here

This is a great Python style guide by the creator of the Requests library. I found it the first time I was required to write a Python library and it saved me. It includes good advice on everything from coding style, managing virtual environments, and structuring your code to testing, logging, and even the little gotchas of Python that can take even seasoned programmers by surprise. And it’s SUCH a pleasure to read. I’d say it’s a must for people who work mostly with Python. You just need to check Kenneth’s website to realize that you’re in good hands when it comes to learning good style.

Every year many more economists are joining the ranks of Python programmers, so this is a book I'd really want everyone to read. It's available for free online, check it out.

R for Data Science (Hadley Wickham & Garrett Grolemund)

Find it here

I'm not an R expert. I do most of the data wrangling in Python and move to R for its econometric power. But something that I really like about it is that it has great tools for standardizing your team's code in a clean and easily reproducible way. The coding style promoted by this book produces clean, readable code thanks to function composition using the pipe syntax, encourages the DRY (Don't Repeat Yourself) principle through functions and different flavors of map, and provides great tips on R's type system and data structures. I'd say this is a must-read for anyone programming in R.
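Since most of my wrangling happens in Python, here is a rough Python analogue of that style, just to show the idea. The `pipe` helper and the two cleaning steps are hypothetical names I made up for this sketch; they are not taken from the book or from any library.

```python
from functools import reduce
from typing import Callable


def pipe(value, *steps: Callable):
    """Pass `value` through each step in order, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


def drop_missing(xs):
    """Remove None entries."""
    return [x for x in xs if x is not None]


def demean(xs):
    """Subtract the mean from every element."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


# Composition reads top to bottom, like a pipe in R.
clean = pipe([1.0, None, 2.0, 3.0], drop_missing, demean)

# map applies the same pipeline to every column instead of repeating the call.
columns = {"a": [1.0, 3.0], "b": [10.0, None, 30.0]}
demeaned = dict(
    zip(columns, map(lambda xs: pipe(xs, drop_missing, demean), columns.values()))
)
```

Each step is a small named function, so the pipeline documents itself and nothing gets copy-pasted per column.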

Functional Programming Isn’t The Answer (John De Goes)

Find it here

Functional Programming (FP) became really popular in the last decade, thanks to the recent trend in microservices, which made it much more necessary to coordinate concurrent processes in a thread-safe way. And while FP really helps make everything more manageable in a distributed environment, functional languages can introduce unnecessary complexity in many ways. As an example, Scala is such a powerful language (it embraces both OOP and FP, it uses implicit variables much to the confusion of anyone familiar with the Zen of Python, some of its libraries rely heavily on macros, etc.) that it was way more difficult for me to learn than other languages.

Also all that jargon!!

Homomorphism. Homomorphisms are a special type of function f : A => B that preserve a given algebraic structure. For example, the square function (which returns its input multiplied by itself) is a semigroup homomorphism, because it preserves, for example, the multiplicative semigroup.

¯\_(ツ)_/¯

Honestly it all makes sense once you start getting into it, but what a barrier to entry!
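For what it's worth, the definition quoted above looks a lot less scary in code. Here is a small Python spot check (a handful of hand-picked integer pairs, so a sanity check rather than a proof) that squaring preserves multiplication but not addition:

```python
def square(x: int) -> int:
    """Return the input multiplied by itself."""
    return x * x


# Hand-picked sample pairs, chosen only for illustration.
pairs = [(2, 3), (-4, 5), (0, 7)]

# Preserving multiplication: square(a * b) == square(a) * square(b).
preserves_multiplication = all(
    square(a * b) == square(a) * square(b) for a, b in pairs
)

# The same check fails for addition, e.g. square(2 + 3) != square(2) + square(3).
preserves_addition = all(
    square(a + b) == square(a) + square(b) for a, b in pairs
)
```

So "homomorphism" here just says that you can square before or after multiplying and get the same answer.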

FP can get as confusing as one wants to make it. That’s why it is important to understand when using it does more good than harm. This post by John De Goes, a key contributor to popular libraries such as Scalaz and ZIO, keeps our feet on the ground by reminding us that FP is just a tool for a greater goal. Actually, the whole blog includes very technical yet down-to-earth articles on FP. This is one website you don't wanna miss if you're into FP.

Out of the Tar Pit (Moseley & Marks)

Find it here

You must know how this feels: you write some code in a Jupyter Notebook, run it to test it, modify it, run it some more and for some strange reason it starts behaving weirdly. That function should be returning something completely different. You search for the error for a while but you just can’t find it. Finally, you try restarting the kernel and running your code again, and guess what!

NameError: name 'recursive_burrito' is not defined

The function was referencing a global variable whose definition you forgot you had deleted.

¯\_(ツ)_/¯

State is tricky. Out of the Tar Pit argues that it is actually the biggest source of code complexity, and that you can considerably simplify your code by adopting a programming style that avoids mutable state and side effects. The authors argue that most of the complexity in large-scale software is not essential, and can therefore be avoided by employing Functional Programming (FP).
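The notebook story can be reproduced in a few lines. This is only a sketch: `wrap_stateful` and `wrap_pure` are names invented for the example, and the `del` statement stands in for deleting a cell and restarting the kernel.

```python
# Stateful version: the result depends on a global that may or may not exist.
def wrap_stateful(filling):
    return f"{recursive_burrito}({filling})"  # looked up when the function runs


recursive_burrito = "burrito"
before = wrap_stateful("beans")  # works while the global is still defined

del recursive_burrito  # stands in for a deleted cell plus a kernel restart
try:
    wrap_stateful("beans")
    error_message = None
except NameError as exc:
    error_message = str(exc)  # the hidden dependency finally shows up


# Stateless version: everything the function needs arrives as an argument.
def wrap_pure(wrapper, filling):
    return f"{wrapper}({filling})"


after = wrap_pure("burrito", "beans")
```

The pure version cannot break because of something that happened in another cell, since its output depends only on its arguments.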

Before reading this paper I thought that functional programmers' obsession with avoiding mutable variables and data structures was just neurosis. Once I learned about the many problems that state causes, I started seeing it everywhere.

Design Patterns: Elements of Reusable Object-Oriented Software (Gamma, Helm, Johnson and Vlissides)

This bad boy

The famous Gang of Four book. Just leaving it here in case you haven’t read at least some part of it. I decided to check it out when I was creating a Java 7 library which was really getting out of hand. After refactoring, extending the library and explaining it to others became a breeze. Even if you're not writing a backend, the "write a good interface first, then go for the concrete implementation" principle can help you a lot to organize your simulation's code.
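Here is a minimal sketch of what "interface first" could look like for simulation code, written in Python rather than Java. The `DemandModel` interface, the `LinearDemand` class, and the numbers are all hypothetical, made up for this example and not taken from the book.

```python
from abc import ABC, abstractmethod


class DemandModel(ABC):
    """Hypothetical interface: what the simulation needs from any demand model."""

    @abstractmethod
    def quantity(self, price: float) -> float:
        """Quantity demanded at a given price."""


class LinearDemand(DemandModel):
    """One concrete implementation, written only after the interface was fixed."""

    def __init__(self, intercept: float, slope: float) -> None:
        self.intercept = intercept
        self.slope = slope

    def quantity(self, price: float) -> float:
        # Demand cannot be negative, so it is floored at zero.
        return max(0.0, self.intercept - self.slope * price)


def simulate_revenue(model: DemandModel, prices: list[float]) -> list[float]:
    """Depends only on the interface, so any DemandModel can be swapped in."""
    return [price * model.quantity(price) for price in prices]


revenues = simulate_revenue(LinearDemand(intercept=10.0, slope=2.0), [1.0, 2.0, 6.0])
```

Adding, say, a logit demand model later means writing one new class; `simulate_revenue` stays untouched.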

BONUS MATERIAL: Tokyo Storefronts (Mateusz Urbanowicz)

Find it here

Well, this is not really about coding, but it has a lot to do with style. I find his videos inspiring when beginning a new coding project. I'll leave this here, just to give you an idea of what your code should aspire to be: as beautiful as a watercolor of a random storefront in Kugenuma.

Until the next time!



buildersbox.corp-sansan.com

© Sansan, Inc.