Python code to make sure two DataFrames have the same columns in the same order, using X1 as the reference. I used this to make sure that two DataFrames had the same dummy columns after using pd.get_dummies:

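# Columns that exist in X1 but not in X2
missing_cols = set(X1.columns) - set(X2.columns)
for c in missing_cols:
    X2[c] = 0  # a missing dummy column means that category never occurred in X2
# Reorder X2 to match X1; this also drops any columns that exist only in X2
X2 = X2[X1.columns]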
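
The same alignment can probably be done in one step with DataFrame.reindex. This is only a sketch: the train and test frames and the color column are made up for illustration, and dtype=int is passed to pd.get_dummies so that the filled-in zeros have the same type as the existing dummy columns.

```python
import pandas as pd

# Hypothetical example data: "purple" appears only in test,
# "green" and "blue" appear only in train.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "purple"]})

X1 = pd.get_dummies(train, dtype=int)
X2 = pd.get_dummies(test, dtype=int)

# reindex adds the columns X2 is missing (filled with 0), drops the
# columns that exist only in X2, and puts everything in X1's order.
X2 = X2.reindex(columns=X1.columns, fill_value=0)

print(X2)
```

Unlike the loop, this builds a new DataFrame rather than adding columns to X2 one at a time, so the original X2 is not modified in place.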

Labels: coding, python, machine_learning


My Favorite Languages

Jan. 23, 2018, 11:25 a.m.

My favorite languages, in order:

  1. MATLAB - MATLAB is just a beautiful language. It is simple but very powerful and makes it easy to do very complex things. The fact that it is designed solely for numeric computation is a drawback for using it for anything else, but it is also one of the reasons I love it.
  2. Python - Python is by far my favorite scripting language. It is very powerful with a lot of features, but that also makes it a bit complex. It isn't as elegant as MATLAB, but it is way more useful.
  3. R - I consider R to be somewhere in between MATLAB and Python. It is optimized for numerical computing, but can also be used with text and character data. For statistics it stands alone - Python can do pretty much anything R can do, but R is simpler and easier. However, it is more a functional language than a programming language.
  4. SQL - I've worked with MySQL for almost 20 years and I know SQL very well. It is great at working with normalized data. However, as storage costs have gone down and RAM has gone up, I'm not sure normalizing data really makes all that much sense these days. Having to join a bunch of tables can really impact performance, which doesn't seem worth it when you just need to get a string out of a joined table. For web sites it makes sense, but for computational purposes I'm not sure it's really necessary unless you have more data than you can fit in memory. Still, I will always remember SQL as my first love.
  5. C - I used C in university, but not much since then. I have forgotten most of what I once knew, but I plan on learning C again because of its speed and efficiency. The fact that you can use C to write extensions for R, Python, MATLAB, etc. makes it very useful.
  6. PHP - PHP is a really ugly language. It has a lot of features, but it is inconsistent in syntax and not designed for manipulating data. Its main advantage is that it is easy to learn, but this also makes it very easy to do badly. PHP can be done very well, but good PHP programmers are few and far between and they seem to be getting crowded out by mediocre programmers.
  7. JavaScript - JavaScript has really been maturing recently. I started using JavaScript back in 1996 or so, when all it could really do was alerts and confirmations. It can do a whole lot more than that these days. I have not yet worked with Node.js, so I am not all that familiar with its full power. I don't really know what my problem with JavaScript is. Maybe I still see it as the silly little language it was when I first learned it. Anyway, the fact that it is at the bottom of my list should in no way be taken as a reflection of its value.

 

Labels: coding, data_science


Data vs Web

Jan. 18, 2018, 11:16 a.m.

In 1998, when I graduated from university, the internet was in its infancy and the dot com bubble was just ramping up. At the time I thought the internet would change the world by making information easily accessible and available, and I was excited to be a part of something revolutionary.

That started to change a few years ago. The focus of internet companies shifted from providing useful services to collecting as much data as possible on users in order to better target advertisements. The emphasis moved from providing useful and informative content and services to keeping users online for as much time as possible. While the negative effects of this business model on users and society are becoming more and more noticeable, notably in the recent US election, the tech companies continue to ignore them. This is no longer the industry that I signed up to work for, and I no longer want to be a part of it.

Anyone who is familiar with the work of Kahneman and Tversky knows that the human brain is very poor at processing and analyzing data. Most of our decisions are made using heuristics, or rules of thumb, that allow us to make quick and easy judgements. These result in cognitive biases, which are ways in which our brains distort reality for the purpose of making decisions. One of the most famous cognitive biases is "confirmation bias", the tendency to interpret new information in a way that supports our existing beliefs. Kahneman and Tversky conducted experiments on people ranging from college undergraduates to statistics professors, and everyone was subject to these biases - even PhD level statisticians who should know better. This is why data science is so important.

Our brains are not designed to gather and analyze large amounts of data, and we are incredibly bad at doing so. We tend to draw conclusions from small, isolated, but memorable bits of information rather than looking at the big picture. One example is how worried Americans are about terrorism, even though on average only six Americans die per year from foreign terrorism. The media likes to report these stories because they are sensational and memorable, but doing so greatly exaggerates the real risks. There are also numerous medications which are commonly prescribed despite having minimal positive effects, or no benefits at all.

Data science is a way to draw knowledge from actual observation of the world, rather than from whatever thoughts happen to be strung together in our heads, or whatever sound bites on a given subject most easily come to mind. I can come up with whatever theories and ideas I want, but unless they actually reflect the real world they are meaningless. This is the basis of scientific inquiry, and this is why I am getting out of web development and into data science.

Labels: personal, coding, data_science


My Record Collection

Jan. 14, 2018, 3:59 p.m.

Unfortunately, it seems I have to get rid of my record collection. It is currently in the US, and between the cost of shipping it here, the cost of finding space to store it here, and the cost of buying new turntables here, it just doesn't make sense to keep it. The collection has been my prized possession for over 20 years, and selling it feels like selling a child. But life goes on...

Labels: music

