Question about the secret weapon

By | ai, bigdata, machinelearning

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Micah Wright writes:

I first encountered your explanation of secret weapon plots while I was browsing your blog in grad school, and later in your 2007 book with Jennifer Hill. I found them immediately compelling and intuitive, but I have been met with a lot of confusion and some skepticism when I’ve tried to use them. I’m uncertain as to whether it’s me that’s confused, or whether my audience doesn’t get it. I should note that my formal statistical training is somewhat limited—while I was able to take a couple of stats courses during my masters, I’ve had to learn quite a bit on the side, which makes me skeptical as to whether or not I actually understand what I’m doing.

My main question is this: when using the secret weapon, does it make sense to subset the data across any arbitrary variable of interest, as long as you want to see if the effects of other variables vary across its range? My specific case concerns tree growth (ring widths). I’m interested to see how the effect of competition (crowding and other indices) on growth varies at different temperatures, and if these patterns change in different locations (there are two locations). To do this, I subset the growth data in two steps: first by location, then by each degree of temperature, which I rounded to the nearest integer. I then ran the same linear model on each subset. The model had growth as the response, and competition variables as predictors, which were standardized. I’ve attached the resulting figure [see above], which plots the change in effect for each predictor over the range of temperature.
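
To make the procedure concrete, here is a minimal sketch of this secret-weapon workflow in R, assuming a data frame trees with columns growth, temp, and location plus standardized competition predictors crowding and comp_index (all of these names are hypothetical):

library(dplyr)
library(broom)

# Fit the same linear model on each subset, then collect the coefficients.
fits <- trees %>%
  mutate(temp_bin = round(temp)) %>%                  # nearest-degree bins
  group_by(location, temp_bin) %>%
  do(tidy(lm(growth ~ crowding + comp_index, data = .))) %>%
  ungroup() %>%
  filter(term != "(Intercept)")                       # keep the competition effects

# fits now holds one estimate and standard error per predictor,
# location, and degree of temperature, ready to plot.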

My reply: I like these graphs! In future you might try a 6 x K grid, where K is the number of different things you’re plotting. That is, right now you’re wasting one of your directions because your 2 x 3 grid doesn’t mean anything. These plots are fine, but if you have more information for each of these predictors, you can consider plotting the existing information as six little graphs stacked vertically and then you’ll have room for additional columns. In addition, you should make the tick marks much smaller, put the labels closer to the axes, and reduce the number of axis labels, especially on the vertical axes. For example, (0.0, 0.3, 0.6, 0.9) can be replaced by labels at 0, 0.5, 1.
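
A layout along those lines could be built with ggplot2 facets, continuing from the hypothetical fits data frame sketched above (an illustration, not the original figure):

library(ggplot2)

# Predictors as rows, locations as columns, with sparse axis labels.
ggplot(fits, aes(temp_bin, estimate)) +
  geom_pointrange(aes(ymin = estimate - std.error,
                      ymax = estimate + std.error)) +
  facet_grid(term ~ location) +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  labs(x = "Temperature (nearest degree)", y = "Estimated effect")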

Regarding the larger issue of what the secret weapon is: as always, I see it as an approximation to a full model that bridges the different analyses. It’s a sort of nonparametric analysis. You should be able to get better estimates by using some modeling, but a lot of that smoothing can be done visually anyway, so the secret weapon gets you most of the way there, and in my view it’s much, much better than the usual alternative of fitting a single model to all the data without letting all the coefficients vary.
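
One version of that full model is a multilevel model in which the coefficients vary across subsets; here is a minimal sketch with lme4, again using the hypothetical variable names from above (the specification is illustrative, not from the post):

library(lme4)

# Let the competition effects vary across temperature-by-location
# subsets; partial pooling smooths the per-subset estimates.
trees$temp_bin <- factor(round(trees$temp))
fit <- lmer(growth ~ crowding + comp_index +
              (1 + crowding + comp_index | temp_bin:location),
            data = trees)
coef(fit)   # per-subset coefficients, shrunk toward the overall means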


Science and Technology links (June 23rd, 2017)

By | machinelearning

Elon Musk, Making Humans a Multi-Planetary Species, New Space. June 2017, 5(2): 46-61.

Reportedly, Ikea is working on augmented reality software that would allow you to see the furniture in your home before buying it.

Current virtual reality headsets provide a good experience, but if you have ever tried to read text while wearing one of them, you may have noticed that it is quite hard. It seems that this is because the image resolution is too low. When using prototypes with very high resolution, you can “read text across the room”.

You would think that a tree that is over 200 years old would have accumulated a lot of (random) genetic mutations. However, it seems that this is not the case: as trees grow old, even very old, they keep their genes intact. We do not know why or how.


Machine Learning: Fundamental Algorithms for Supervised and Unsupervised Learning With Real-World Applications (Bayes Theorem, TensorFlow, Python Book 1)

By | iot, machinelearning

Computers can’t LEARN… Right?!

Machine Learning is a branch of computer science that moves away from programming computers with long lists of detailed instructions. Instead, the aim is to implement high-level routines that tell computers how to approach new and unknown problems – these are called algorithms.

In practice, the goal is to give computers the ability to Learn and to Adapt.

We can use these algorithms to obtain insights, recognize patterns, and make predictions from data, images, sounds, or videos we have never seen before – or never even knew existed. Unfortunately, the true power and applications of today’s Machine Learning Algorithms remain deeply misunderstood by most people.

Through this book I want to fix this confusion and shed light on the most relevant Machine Learning Algorithms used in industry. I will show you exactly how each algorithm works, why it works, and when you should use it. A small illustrative example follows the lists below.

Supervised Learning Algorithms:

  • K-Nearest Neighbour

  • Naïve Bayes

  • Regressions

More Supervised Learning Algorithms:

  • Support Vector Machines

  • Neural Networks

  • Decision Trees
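
As a first taste of the first algorithm above, here is a minimal k-nearest-neighbour classifier in R using the built-in iris data (an illustration, not an excerpt from the book):

library(class)

# Classify iris flowers from their four measurements, k = 5 neighbours.
set.seed(1)
idx  <- sample(nrow(iris), 100)              # random training split
pred <- knn(train = iris[idx, 1:4],
            test  = iris[-idx, 1:4],
            cl    = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])             # held-out accuracy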



Two years as a Data Scientist at Stack Overflow

By | ai, bigdata, machinelearning

(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Last Friday marked my two year anniversary working as a data scientist at Stack Overflow. At the end of my first year I wrote a blog post about my experience, both to share some of what I’d learned and as a form of self-reflection.

After another year, I’d like to revisit the topic. While my first post focused mostly on the transition from my PhD to an industry position, here I’ll be sharing what has changed for me in my job in the last year, and what I hope the next year will bring.

Hiring a Second Data Scientist

In last year’s blog post, I noted how difficult it could be to be the only data scientist on a team:

Most of my current statistical education has to be self-driven, and I need to be very cautious about my work: if I use an inappropriate statistical assumption in a report, it’s unlikely anyone else will point it out.

This continued to be a challenge, and fortunately in December we hired our second data scientist, Julia Silge.

We started hiring for the position in September, and I got to meet a lot of terrific candidates during the application and review process. But I was particularly excited to welcome Julia to the team because we’d been working together during the course of the year, ever since we met and created the tidytext package at the 2016 rOpenSci unconference.

Julia, like me, works on analysis and visualization rather than building and productionizing features, and having a second person in that role has made our team much more productive. This is not just because Julia is an exceptional colleague, but because the two of us can now collaborate on statistical analyses or split them up to give each more focus. I did enjoy being the first data scientist at the company, but I’m glad I’m no longer the only one. Julia’s also a skilled writer and communicator, which was essential in achieving the next goal.

Company blog posts

In last year’s post, I shared some of the work that I’d done to explore the landscape of software developers, and set a goal for the following year (emphasis is new):

I’m also just intrinsically pretty interested in learning about and visualizing this kind of information; it’s one of the things that makes this a fun job. One plan for my second year here is to share more of these analyses publicly. In a previous post I looked at which technologies were the most polarizing, and I’m looking forward to sharing more posts like that soon.

I’m happy to say that we’ve made this a priority in the last six months. Since December I’ve gotten the opportunity to write a number of posts for the Stack Overflow company blog.

Other members of the team have written data-driven blog posts as well.

I’ve really enjoyed sharing these snapshots of the software developer world, and I’m looking forward to sharing a lot more on the blog this next year.

Teaching R at Stack Overflow

Last year I mentioned that part of my work has been developing data science architecture, and trying to spread the use of R at the company.

This also has involved building R tutorials and writing “onboarding” materials… My hope is that as the data team grows and as more engineers learn R, this ecosystem of packages and guides can grow into a true internal data science platform.

At the time, R was used mostly by three of us on the data team (Jason Punyon, Nick Larsen, and me). I’m excited to say it’s grown since then, and not just because of my evangelism.

Every Friday since last September, I’ve met with a group of developers to run internal “R sessions”, in which we analyze some of our data to develop insights and models. Together we’ve made discoveries that have led to real projects and features, for both the Data Team and other parts of the engineering department.

There are about half a dozen developers who regularly take part, and they all do great work. But I especially appreciate Ian Allen and Jisoo Shin for coming up with the idea of these sessions back in September, and for following through in the months since. Ian and Jisoo joined the company last summer, and were interested in learning R to complement their development of product features. Their curiosity, and that of others in the team, has helped prove that data analysis can be a part of every engineer’s workflow.

Writing production code

My relationship to production code (the C# that runs the actual Stack Overflow website) has also changed. In my first year I wrote much more R code than C#, but in the second I’ve stopped writing C# entirely. (My last commit to production was more than a year ago, and I often go weeks without touching my Windows partition.) This wasn’t really a conscious decision; it came from a gradual shift in my role on the engineering team. I’d usually rather be analyzing data than shipping features, and focusing entirely on R rather than splitting attention across languages has been helpful for my productivity.

Instead, I work with engineers to implement product changes based on analyses and to push models into production. One skill I’ve had to work on is writing technical specifications, both for data sources that I need to query and for models that I’m proposing for production. One developer I’d like to acknowledge specifically is Nick Larsen, who works with me on the Data Team. Many of the blog posts I mention above answer questions like “What tags are visited in New York vs San Francisco?” or “What tags are visited at what hour of the day?”, and these wouldn’t have been possible without Nick. Until recently, this kind of traffic data was very hard to extract and analyze, but he developed processes that extract and transform it into more readily queryable tables. This has made many important analyses possible besides the blog posts, and I can’t appreciate this work enough.
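
As an illustration of the kind of query those tables enable (the table and column names here are hypothetical, not Stack Overflow’s actual schema):

library(dplyr)
library(lubridate)

# Share of each tag's traffic by hour of day, from a hypothetical
# table of per-visit records.
tag_traffic %>%
  mutate(hour = hour(timestamp)) %>%
  count(tag, hour, name = "visits") %>%
  group_by(tag) %>%
  mutate(share = visits / sum(visits))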

(Nick also recently wrote an awesome post, How to talk about yourself in a developer interview, that’s worth checking out.)

Working with other teams

Last year I mentioned that one of my projects was developing targeting algorithms for Job Ads, which match Stack Overflow visitors with jobs they may be interested in (for example, matching people who visit Python and JavaScript questions with Python web developer jobs). These are an important part of our business and still make up part of my data science work. But in the last year I learned about a lot of other components of the business that data could help with.
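
A toy version of that kind of matching scores each job by the overlap between the tags a visitor reads and the tags on the listing (my illustration; the real targeting algorithms are surely more involved):

# Jaccard overlap between a visitor's tags and a job's tags.
score_job <- function(visitor_tags, job_tags) {
  length(intersect(visitor_tags, job_tags)) /
    length(union(visitor_tags, job_tags))
}

score_job(c("python", "javascript", "pandas"),
          c("python", "django"))   # 0.25: one shared tag out of four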

One team I’ve worked with this year that I hadn’t in my first is Display Ads. Display Ads are separate from Job Ads, and are purchased by companies with developer-focused products and services.

For example, I’ve been excited to work more closely with Steve Feldman on the Display Ad Operations team. If you’re wondering why I’m not ashamed to work on ads, please read Steve’s blog post on how we sell display ads at Stack Overflow; he explains it better than I could. We’ve worked on several new methods for display ad targeting and evaluation, and I think there’s a lot of potential for data to have a positive impact for the company.

Changes in the rest of my career

There’ve been other changes in my second year out of academia. In my first year I attended only one conference (NYR 2016), but I’ve since had more of a chance to travel, including to useR and JSM 2017, PLOTCON, rstudio::conf 2017, and NYR 2017. I spoke at a few of these, about my broom package, about gganimate, and about the history of R as seen by Stack Overflow.

Julia and I wrote and published an O’Reilly book, Text Mining with R (now available on Amazon and free online here). I also self-published an e-book, Introduction to Empirical Bayes: Examples from Baseball Statistics, based on a series of blog posts. I really enjoyed the experience of turning blog posts into a larger narrative, and I’d like to continue doing so this next year.

There are some goals I didn’t achieve. I’ve had a longstanding interest in getting R into production (and we’ve idly investigated some approaches, like Microsoft R Server), but as of now we’re still productionizing models by rewriting them in C#. And there are many teams at Stack Overflow that I’d like to give better support to; prioritizing the Data Team’s time has been a challenge, though having a second data scientist has helped greatly. But I’m still happy with how my work has gone, and excited about the future.

In any case, this made the whole year worthwhile:




Changing the game: Sports Tech with the Toronto Argonauts and the Blue Jays, #BigDataTO #BigData #AI

By | ai, bigdata, machinelearning


Notes from the #BigDataTO conference in Toronto. Panel: Mark Silver, @silveratola, Stadium Digital; Michael Copeland, @Mike_G_copeland, Toronto Argonauts; Jonathan Carrigan, @J_carrigan, MLSE; Andrew Miller, @BlueJays, Toronto Blue Jays. There is a diverse fan base across all Toronto teams, and their preferences and values are diverse in terms of who they are and what drives them to […]

