Recent posts

It’s time to upgrade your scheduler to Airflow

4 minute read

Airflow is an open-source scheduling tool incubated at Airbnb. It is gaining popularity, and more tech companies are starting to use it. Compared with our company’s existing scheduling tool, crontab, it offers advantageous features such as a user-friendly web UI, multi-process/distributed execution, and notifications on failure with automatic retries. In this post, I record my journey of setting up Airflow. Content: 1. Install Airflow 2. Configure Airflow 3. Choices of Executors 4. Final Note...
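To make the comparison with crontab concrete, here is a minimal sketch (not taken from the post) of what a scheduled job looks like as an Airflow DAG, written against the Airflow 1.x-era API; the owner, alert address, DAG name, and script path are all hypothetical.

```python
# A minimal sketch (Airflow 1.x-style API, not from the post itself) of a DAG
# that replaces a crontab entry and adds retry/notification behaviour.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-team",                  # hypothetical owner name
    "start_date": datetime(2018, 1, 1),
    "retries": 2,                          # re-try on failure
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["alerts@example.com"],       # hypothetical alert address
}

dag = DAG(
    dag_id="daily_report",                 # hypothetical job name
    default_args=default_args,
    schedule_interval="0 6 * * *",         # same cron expression crontab would use
)

run_report = BashOperator(
    task_id="run_report",
    bash_command="python /opt/jobs/daily_report.py",  # hypothetical script path
    dag=dag,
)
```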

Shiny + shinydashboard + googleVis = Powerful Interactive Visualization

4 minute read

If you are a data scientist who has spent several weeks developing a fantastic model, you’d like an equally awesome way to visualize and demo your results. For R users, ggplot is a good option, but no longer sufficient. R Shiny + shinydashboard + googleVis can be a wonderful combination for a quick demo application. For the purpose of illustration, I downloaded a random sample data file test.csv from one of kaggle’s latest competitions: https://www.kaggle.com/c/new-york-city-taxi-fare-pre...

Digit Recognition with TensorFlow

7 minute read

This time I am going to continue with the kaggle 101-level competition – Digit Recognizer – using the deep learning tool TensorFlow. In the previous post, I used PCA and pooling methods to reduce the dimensions of the dataset and trained a linear SVM. Due to the limited efficiency of the R SVM package, I only sampled 500 records and performed a 10-fold cross validation, with a resulting accuracy of about 82.7%. This time, with TensorFlow, we can address the problem differently: Deep Lea...
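As a flavour of the TensorFlow 1.x style this era of posts used, here is a minimal sketch (not the post's actual deep model) of a softmax classifier on flattened 28x28 digit images; `train_images` and `train_labels` are assumed to be already loaded as numpy arrays.

```python
# A minimal sketch (TensorFlow 1.x style, not the post's final model) of a
# softmax classifier on flattened 28x28 digit images.
import numpy as np
import tensorflow as tf

# Assumed: train_images has shape (n, 784) and train_labels is one-hot (n, 10).
x = tf.placeholder(tf.float32, [None, 784])
y_true = tf.placeholder(tf.float32, [None, 10])

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        # Sample a mini-batch of 100 rows per step.
        idx = np.random.choice(len(train_images), 100, replace=False)
        sess.run(train_step,
                 feed_dict={x: train_images[idx], y_true: train_labels[idx]})
```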

Implementation of Model-Based Recommendation System in R

1 minute read

The most straightforward recommendation systems are either user-based CF (collaborative filtering) or item-based CF, both categorized as memory-based methods. User-based CF recommends products based on the behaviour of similar users, while item-based CF recommends products similar to those the user has purchased. Either way, a user-user or item-item similarity matrix, which can be sizable, has to be computed. In contrast, a model-based app...
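To illustrate the similarity matrix that memory-based CF has to build (the cost a model-based approach avoids), here is a minimal numpy sketch in Python with a made-up toy rating matrix; the post itself works in R.

```python
# A minimal numpy sketch (the post itself works in R) of the item-item
# similarity matrix that memory-based CF has to compute.
import numpy as np

# Assumed toy user-item rating matrix: rows = users, columns = items, 0 = unrated.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Cosine similarity between item columns; this items-x-items matrix is what
# becomes sizable as the catalogue grows.
norms = np.linalg.norm(ratings, axis=0)
norms[norms == 0] = 1.0                      # guard against divide-by-zero
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Score an unrated item for a user by weighting the user's known ratings with
# the similarity between each rated item and the candidate item.
user = ratings[1]
candidate = 2                                # item this user has not rated
mask = user > 0
score = (item_sim[candidate, mask] @ user[mask]) / item_sim[candidate, mask].sum()
print(round(score, 2))
```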

Revisit Titanic Data using Apache Spark

5 minute read

This post mainly demonstrates the pyspark API (Spark 1.6.1) using the Titanic dataset, which can be found here (train.csv, test.csv). Another post analysing the same dataset using R can be found here. Content: Data Loading and Parsing, Data Manipulation, Feature Engineering, Apply Spark ml/mllib models. 1. Data loading & parsing: sc is the SparkContext launched together with pyspark. Using sc.textFile, we can read a csv file as text in RDD format, and the data is sep...
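A minimal sketch of that loading step, using the Spark 1.6-era RDD API the excerpt describes: read the csv as text with sc.textFile, drop the header, and split each line. The file path is assumed to be the Kaggle train.csv, and in a pyspark shell sc already exists so the explicit SparkContext is only needed in a standalone script.

```python
# A minimal sketch (Spark 1.6-era RDD API) of the loading step the post
# describes: read the csv as text with sc.textFile and split each line.
from pyspark import SparkContext

sc = SparkContext(appName="titanic")          # pyspark shells provide sc already

raw = sc.textFile("train.csv")                # path assumed: Kaggle Titanic file
header = raw.first()

# Drop the header row and split the remaining comma-separated lines into fields.
# Naive split: quoted fields containing commas would need a proper csv parser.
rows = (raw
        .filter(lambda line: line != header)
        .map(lambda line: line.split(",")))

print(rows.take(2))
```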