Thursday, April 14, 2016

Logistic regression on R - faillll

^Xiaobai, one of the dogs that my family is interested in adopting from ASD!

Today I thought I'd try logistic regression to find the variables that might predict whether a dog gets adopted. (I'm using the same Kaggle dataset as in my previous post.)

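	#stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0), which levels() needs
	dataset = read.csv ("train.csv", header=TRUE, na.strings=c(""), stringsAsFactors=TRUE)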
	dog = subset(dataset, AnimalType == "Dog")
	dog = subset(dog, OutcomeType!="Return_to_owner")

	#add column to check if adopted or not
	dog$adopted = dog$OutcomeType == "Adoption"

	#remove columns that I don't want in the model
	#(column 4, OutcomeType, goes too, since adopted is derived from it)
	dog_filtered = dog[-c(1:6)]

	#logistic regression
	model <- glm(adopted ~.,family=binomial(link='logit'),data=dog_filtered)
	summary(model)

	#model did not converge, too many breeds and color

	levels(dog_filtered$Color)
	#returned 365 results
	levels(dog_filtered$Breed)
	#returned 1380 results


But it failed terribly: R returned the message "model did not converge". I should have realised this would happen when I checked the levels for color and breed: Color has 365 unique values and Breed has 1380.
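A quick way to catch this before fitting is to count the levels of every factor column first. The snippet below is only a minimal sketch on a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not the Kaggle data):

```r
# Toy data frame; the values are made up and far smaller than the real dataset
toy <- data.frame(
  Breed = c("Pit Bull Mix", "Beagle", "Pug", "Beagle"),
  Color = c("White/Black", "Brown", "Black", "Brown"),
  adopted = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = TRUE
)

# Find the factor columns, then count the distinct levels in each one
is_factor <- sapply(toy, is.factor)
n_levels <- sapply(toy[is_factor], nlevels)
print(n_levels)
```

Any column with hundreds of levels is a hint that the model will need that many dummy variables, which is probably too many for a dataset of this size.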

This is a screenshot of the results returned. super fail ):

The next step for me is to create columns that indicate whether each color is present in the dog's coat. For instance, I will have a column each for white, black and brown. If a dog is white/black, it'll have "1" under the white and black columns and "0" under brown.
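Roughly, the idea could look like the sketch below. It runs on a made-up toy data frame, and the short list of colours is an assumption for illustration; the real Color column has many more values.

```r
# Toy data standing in for the Color column (values are made up for illustration)
dog_toy <- data.frame(
  Color = c("White/Black", "Brown", "Black/Brown", "White"),
  stringsAsFactors = FALSE
)

# Colours to flag; a hypothetical short list, not the full set in the dataset
colours <- c("White", "Black", "Brown")

for (col in colours) {
  # 1 if the colour name appears anywhere in the Color string, else 0
  dog_toy[[tolower(col)]] <- as.integer(grepl(col, dog_toy$Color, fixed = TRUE))
}

print(dog_toy)
```

Because this matches on substrings, a value like "Brown Brindle" would also count as brown, which may or may not be what I want.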

PS: I'm really taking baby steps with this blog!

Thursday, April 07, 2016

Plot.ly on R


I just discovered this amazing graph package, Plot.ly. Who needs d3.js when plot.ly is so easy to use?!

Best part is, it works both online and offline. I can create a plot in R, and it renders within RStudio with all the interactive functions. I can also upload my data to the plot.ly website and create a graph from there. Either way, I can save the graph to my plot.ly account.

To give it a test run, I used the shelter animals data from kaggle. (Reason: been trying to persuade my parents to adopt a shelter dog for the longest time)

	library(plotly)

	Sys.setenv("plotly_username"="melisasa")
	Sys.setenv("plotly_api_key"="SECRET")

	dataset = read.csv ("train.csv", header=TRUE)
	dog = subset(dataset, AnimalType == "Dog")

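	for (i in 1:nrow(dog)) {
	age = as.character(dog[i,"AgeuponOutcome"])
	time = sub('.*\\ ', '', age)      #unit, e.g. "years"
	num_str = sub("\\ .*", "", age)   #count, e.g. "2"
	num = strtoi(num_str, base = 10L)

	#convert age to weeks (approximating 1 month as 4 weeks)
	if (time == "year" || time == "years") {
	dog[i,"Age_weeks"] = num*52
	} else if (time == "month" || time == "months") {
	dog[i,"Age_weeks"] = num*4
	} else if (time == "week" || time == "weeks") {
	dog[i,"Age_weeks"] = num
	} else if (time == "day" || time == "days") {
	dog[i,"Age_weeks"] = num/7
	}

	dog[i,"Outcome_combined"] = paste(dog[i,"OutcomeType"], dog[i,"OutcomeSubtype"], sep="_")
	}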

	#plotly 4.0 and later expects column references as formulas (~)
	age_outcome = plot_ly(dog, x=~Age_weeks, color=~OutcomeType, type="box")
	age_outcome

	plotly_POST(age_outcome, filename = "r-docs/Age_and_outcome_for_shelter_dogs")

	age_outcome_deets = plot_ly(dog, x=~Age_weeks, color=~Outcome_combined, type="box")
	age_outcome_deets

	plotly_POST(age_outcome_deets, filename = "r-docs/Age_and_outcome_for_shelter_dogs")


So I extracted only the shelter dog data and compared the dogs' ages to their outcomes. I later broke the outcomes down by subtype, and this is the result. Honestly, I can't gather many insights from this, besides the fact that the probability of a dog getting adopted after 20 weeks/5 months gets really low (below 75%). It also looks like most dogs don't survive beyond 40 weeks/8 months: they either pass on or are euthanised before then. That's a really short time ):
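The age-to-weeks step could probably also be done without a loop. Here is a minimal sketch of a vectorised helper; `age_to_weeks` is a hypothetical name, the sample strings are made up, and I'm assuming ages come in as "<number> <unit>" with units of years, months, weeks or days:

```r
# Hypothetical helper: convert strings like "2 years" or "3 weeks" to weeks
age_to_weeks <- function(age) {
  age <- as.character(age)
  # the number before the space, e.g. "2" from "2 years"
  num <- suppressWarnings(as.numeric(sub(" .*", "", age)))
  # the unit after the space, with a trailing "s" dropped ("years" -> "year")
  unit <- sub("s$", "", sub(".* ", "", age))
  # weeks per unit; 1 month is approximated as 4 weeks
  weeks_per_unit <- c(year = 52, month = 4, week = 1, day = 1/7)
  # unknown or empty units have no match and come out as NA
  unname(num * weeks_per_unit[unit])
}

ages <- c("1 year", "2 years", "3 months", "5 weeks", "7 days", "")
weeks <- age_to_weeks(ages)
print(weeks)
```

With the real data this would be something like `dog$Age_weeks = age_to_weeks(dog$AgeuponOutcome)`, and rows with a missing age would end up as NA.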

Uploading github codes: http://robertgreiner.com/2012/04/using-github-as-a-syntax-highlighter/
Uploading plot.ly graphs from R: https://plot.ly/r/getting-started/#hosting-graphs-in-your-online-plotly-account

Friday, April 01, 2016

Miscarriage rates in US


I applied for the Google Squared Data and Analytics programme, and the test we had to do was an exploratory analysis on the Natality dataset.

It's on Google's servers, and we had to access it using their BigQuery application, which is pretty cool (https://bigquery.cloud.google.com/table/publicdata:samples.natality). But because we only had 3 days, and I was busy studying for my Six Sigma quiz at the same time, I didn't have time to fully explore the data and push the analysis a notch higher.

Since we were only given around 5 slides to present the findings, I chose to focus on a specific topic: miscarriage rates and their possible causes. Attached are my findings.

The data were retrieved through the BigQuery application using SQL queries. Surprisingly, I didn't have too many problems with it. The downloaded data were then managed in Excel, and the charts were generated in Excel as well. I would have used R to feel more "pro", but honestly, I think Excel is the best way to customise the charts, and thus I ended up working on the data mostly in Excel too.

Some things I would have done if I had the luxury of time:

To represent the data in a map format, using QGIS
To build an R Shiny application that allows interactivity with the data (this would probably take me a month though ahaha)

But overall, I think I did okay, and I found the process really fun. It has been a while since I've had a set of data that I can work on in my own way. I always felt slightly restricted working on data from school projects, because I felt confined to the methods my group agreed upon. But group work does have the advantage of providing a wider perspective than the narrower one I would have taken had I delved into the data alone.

Poking at data, one dot at a time
