Thursday, April 14, 2016

Logistic regression on R - faillll

^Xiaobai, one of the dogs that my family is interested in adopting from ASD! 

Today I thought I'd try logistic regression to find the variables that might predict whether a dog gets adopted. (I used the same dataset as in my previous post, from Kaggle.)


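But it failed terribly. R returned the message "model did not converge". I should have seen it coming when I checked the levels for color and breed and both came back with over 100 unique values. With that many categories, the model has to estimate a separate coefficient for almost every level, and many levels only have a handful of dogs in them.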

This is a screenshot of the results returned. Super fail ):


The next step for me is to create columns that indicate whether each color is present on the dog. For instance, I will have one column each for white, black and brown. If a dog is white/black, it'll have "1" under the white and black columns and "0" under brown.
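
A minimal sketch of that idea, written in plain Python rather than R purely for illustration. The function name `color_indicators`, the sample color strings, and the assumption that multiple colors are separated by "/" are all made up for the example, not taken from the actual dataset.

```python
# Toy example: the color strings are invented, and the "/" separator
# is an assumption about how multi-colored dogs are recorded.
def color_indicators(colors, wanted):
    """Return one dict per dog with a 0/1 flag for each color in `wanted`."""
    rows = []
    for value in colors:
        # Split "White/Black" into {"white", "black"} so order and case don't matter.
        present = {part.strip().lower() for part in value.split("/")}
        rows.append({name: int(name in present) for name in wanted})
    return rows


dogs = ["White/Black", "Brown", "Black/Brown"]
flags = color_indicators(dogs, ["white", "black", "brown"])
for dog, row in zip(dogs, flags):
    print(dog, row)
```

This swaps one column with 100+ levels for a few 0/1 columns, which should give the regression far fewer coefficients to estimate. Whether that is enough to make it converge is something I'd still have to test.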

PS: I'm really taking baby steps with this blog! 

Thursday, April 07, 2016

Plot.ly on R

I just discovered this amazing graph package, Plot.ly. Who needs d3.js when plot.ly is so easy to use?!

Best part is, it works both online and offline. I can create a plot in R, and it renders within RStudio with all the interactive functions. I can also upload my data to the plot.ly website and create a graph from there. Either way, I can save the graph to my plot.ly account.

To give it a test run, I used the shelter animals data from Kaggle. (Reason: I've been trying to persuade my parents to adopt a shelter dog for the longest time.)

So I extracted only the shelter dog data and compared the dogs' ages with their outcomes. I later broke the outcomes down into subtypes, and this is the result. Honestly, I can't gather much insight from it, besides the fact that the probability of a dog getting adopted after 20 weeks/5 months gets really low (below 75%). It also looks like most dogs in the data don't survive beyond 40 weeks/8 months. They either pass on or are euthanised before then. That's a really short time ):
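
For anyone curious how the shares behind a chart like this could be worked out, here is a rough sketch in plain Python (not the R code I actually ran). The function name `outcome_shares`, the outcome labels and the toy records are placeholders I made up; they are not rows from the Kaggle data.

```python
from collections import Counter, defaultdict


def outcome_shares(records):
    """Map each age in weeks to {outcome: share of dogs with that outcome}."""
    counts = defaultdict(Counter)
    for age_weeks, outcome in records:
        counts[age_weeks][outcome] += 1
    shares = {}
    for age, counter in counts.items():
        total = sum(counter.values())
        shares[age] = {name: n / total for name, n in counter.items()}
    return shares


# Invented (age in weeks, outcome) pairs, only to show the shape of the input.
toy = [
    (8, "Adoption"), (8, "Adoption"), (8, "Transfer"),
    (20, "Adoption"), (20, "Euthanasia"),
]
shares = outcome_shares(toy)
for age in sorted(shares):
    print(age, shares[age])
```

Plotting these shares against age gives the kind of curve I was reading the adoption probability from.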

Uploading GitHub code: http://robertgreiner.com/2012/04/using-github-as-a-syntax-highlighter/
Uploading plot.ly graphs from R: https://plot.ly/r/getting-started/#hosting-graphs-in-your-online-plotly-account

Friday, April 01, 2016

Miscarriage rates in US

I applied for the Google Squared Data and Analytics programme, and the test we had to do was an exploratory analysis on the Natality dataset.

The dataset sits on Google's servers and we had to access it using their BigQuery application, which is pretty cool (https://bigquery.cloud.google.com/table/publicdata:samples.natality). But because we only had 3 days, and I was busy studying for my Six Sigma quiz at the same time, I didn't have time to fully explore the data and push the analysis up a notch.

Since we were only given around 5 slides to present the findings, I chose to focus on a specific topic: miscarriage rates and their possible causes. My findings are attached.

The data were retrieved from BigQuery using SQL queries. Surprisingly, I didn't have too many problems with it. The downloaded data were then managed in Excel, and the charts were generated in Excel as well. I would have used R, to feel more "pro", but honestly, I think Excel is the easiest way to customise the charts, and thus I ended up working on the data mostly in Excel too.






Some things I would have done if I'd had the luxury of time:

  • To represent the data in a map format, using QGIS 
  • To build an R Shiny application that allows interactive exploration of the data (this would probably take me a month though ahaha) 
But overall, I think I did okay, and I actually found the process really fun. It has been a while since I had a set of data that I could work on in my own way. I always felt slightly restricted working on data from school projects, because I was confined to the methods my group agreed upon. But group work does have the advantage of providing a wider perspective than the narrower one I would have taken had I delved into the data alone.