When does research have active opposition?


(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

A reporter was asking me the other day about the Brian Wansink “pizzagate” scandal. The whole thing is embarrassing for journalists and bloggers who’ve been reporting on this guy’s claims entirely uncritically for years. See here, for example. Or here and here. Or here, here, here, and here. Or here. Or here, here, here, . . .

The journalist on the phone was asking me some specific questions: What did I think of Wansink’s work? (I think it’s incredibly sloppy, at best.) Should Wansink release his raw data? (I don’t really care.) What could Wansink do at this point to restore his reputation? (Nothing’s gonna work at this point.) And so on.

But then I thought of another question: How was Wansink able to get away with it for so long? Remember, he was called out on his research malpractice a full five years ago; he followed up with some polite words and zero action, and his reputation wasn’t dented at all.

The problem, it seems to me, is that Wansink has had virtually no opposition all these years.

It goes like this. If you do work on economics, you’ll get opposition. Write a paper claiming the minimum wage helps people and you’ll get criticism on the right. Write a paper claiming the minimum wage hurts people and you’ll get criticism on the left. Some—maybe most—of this criticism may be empty, but the critics are motivated to use whatever high-quality arguments are at their disposal, so as to improve their case.

Similarly with any policy-related work. Do research on the dangers of cigarette smoking, or global warming, or anything else that threatens a major industry, and you’ll get attacked. This is not to say that these attacks are always (or never) correct, just that you’re not going to get your work accepted for free.

What about biomedical research? Lots of ambitious biologists are running around, all aiming for that elusive Nobel Prize. And, so I’ve heard, many of the guys who got the prize are pushing everyone in their labs to continue publishing purported breakthrough after breakthrough in Cell, Science, Nature, etc. . . . What this means is that, if you publish a breakthrough of your own, you can be sure that the sharks will be circling, and lots of top labs will be out there trying to shoot you down. It’s a competitive environment. You might be able to get a quick headline or two, but shaky lab results won’t be able to sustain a Wansink-like ten-year reign at the top of the charts.

Even food research will get opposition if it offends powerful interests. Claim to have evidence that sugar is bad for you, or milk is bad for you, and yes you might well get favorable media treatment, but the exposure will come with criticism. If you make this sort of inflammatory claim and your research is complete crap, then there’s a good chance someone will call you on it.

Wansink’s story, though, is different. Yes, he’s occasionally poked at the powers that be, but his research papers address major policy debates only obliquely. There’s no particular reason for anyone to oppose a claim that men eat differently when dining with men than with women, or that buffet pricing does or does not affect how much people eat, or whatever.

Wansink’s work flies under the radar. Or, to mix metaphors, he’s in the Goldilocks position: his topics are not important enough for anyone to bother disputing, yet interesting and quirky enough to appeal to the editors at the New York Times, NPR, Freakonomics, Marginal Revolution, etc.

It’s similar with embodied cognition, power pose, himmicanes, ages ending in 9, and other PPNAS-style Gladwell bait. Nobody has much motivation to question these claims, so they can stay afloat indefinitely, generating entire literatures in peer-reviewed journals, only to collapse years or decades later when someone pops the bubble via a preregistered non-replication or a fatal statistical criticism.

We hear a lot about the self-correcting nature of science, but, at least until recently, there seems to have been a lot of published science that’s completely wrong and that nobody bothered to check. Or, when people did check, no one seemed to care.

A couple of weeks ago we had a new example: a paper out of Harvard called “Caught Red-Minded: Evidence-Induced Denial of Mental Transgressions.” My reaction when reading this paper was somewhere between (1) Huh? As recently as 2016, the Journal of Experimental Psychology: General was still publishing this sort of slop? and (2) Hmmm, the authors are pretty well known, so the paper must have some hidden virtues. But now I’m realizing that, yes, the paper may well have hidden virtues (that’s what “hidden” means: maybe these virtues are there but I don’t see them), but serious scholars really can release low-quality research when there’s no feedback mechanism to let them know there are problems.

OK, there are some feedback mechanisms. There are journal referees, there are outside critics like me or Uri Simonsohn who dispute forking path p-value evidence on statistical grounds, and there are endeavors such as the replication project that have revealed systemic problems in social psychology. But referee reports are hidden (you can respond to them by just submitting to a new journal), and the problem with peer review is the peers; and the other feedbacks are relatively new, and some established figures in psychology and other fields have had trouble adjusting.

Everything’s changing—look at Pizzagate, power pose, etc., where the news media are starting to wise up, and pretty soon it’ll just be NPR, PPNAS, and Ted standing in a very tiny circle, tweeting these studies over and over again to each other—but as this is happening, I think it’s useful to look back and consider how it is that certain bubbles have been kept afloat for so many years, how it is that the U.S. government gave millions of dollars in research grants to a guy who seems to have trouble counting pizza slices.


On a First Name Basis with Statistics Sweden


(This article was first published on Theory meets practice…, and kindly contributed to R-bloggers)

Abstract

Judging from recent R-Bloggers posts, it appears that many data scientists are concerned with scraping data from various media sources (Wikipedia, Twitter, etc.). However, one should be aware that well-structured and high-quality datasets are available through state and national bureaus of statistics. Increasingly, these are offered to the public through direct database access, e.g., using a REST API. We illustrate the usefulness of such an approach by accessing data from Statistics Sweden.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under the GNU General Public License (GPL v3) from GitHub.

Introduction

Scandinavian countries are world-class when it comes to public registries. So when in need of reliable population data, this is the place to look. As an example, we access Statistics Sweden data through their API using the pxweb package developed by @MansMeg, @antagomir and @LCHansson. Love was the first speaker at a Stockholm R-Meetup some years ago, where I also gave a talk. Funny how such R-Meetups turn out to be useful many years later!

library(pxweb)    # client for the Statistics Sweden (SCB) API
library(dplyr)    # used below for %>%, rename, mutate, slice, ...
library(ggplot2)  # used below for plotting

By browsing the Statistics Sweden (in Swedish: Statistiska Centralbyrån (SCB)) data using their web interface one sees that they have two relevant first name datasets: one containing the tilltalsnamn of newborns for each year during 1998-2016 and one for the years 2004-2016. Note: A tilltalsnamn in Sweden is the first name (of several possible first names) by which a person is usually addressed. About 2/3 of the persons in the Swedish name registry indicate which of their first names is their tilltalsnamn. For the remaining persons it is automatically implied that their tilltalsnamn is the first of the first names. Also note: For reasons of data protection the 1998-2016 dataset contains only first names used 10 or more times in a given year, the 2004-2016 dataset contains only first names used 2 or more times in a given year.

Downloading such data through the SCB web interface is cumbersome, because downloads are limited to 50,000 data cells per query; hence, one has to issue several manual queries to get hold of the relevant data. This is where their API becomes a real time-saver. Instead of fiddling with the API directly using rjson or RJSONIO, we use the purpose-built pxweb package to fetch the data. One can either use the web interface to determine the name of the desired data matrix to query, or navigate the API directly from within pxweb:

d <- interactive_pxweb(api = "api.scb.se", version = "v1", lang = "en")

and select Population followed by Name statistics and then BE0001T04Ar or BE0001T04BAr, respectively, in order to obtain the relevant data and the API download URL. This leads to the following R code for the download:

names10 <- get_pxweb_data(
  url = "http://api.scb.se/OV0104/v1/doris/en/ssd/BE/BE0001/BE0001T04Ar",
  dims = list(Tilltalsnamn = c('*'),
              ContentsCode = c('BE0001AH'),
              Tid = c('*')),
  clean = TRUE) %>% as.tbl

For better usability we rename the columns a little and replace NA counts with zero. For illustration we then look at five random rows of the dataset.

names10 <- names10 %>% select(-observations) %>%
  rename(firstname=`first name normally used`,counts=values) %>%
  mutate(counts = ifelse(is.na(counts),0,counts))
## Look at 5 random rows of the dataset
names10 %>% slice(sample(seq_len(nrow(names10)),size=5))
## # A tibble: 5 × 3
##   firstname   year counts
##      <fctr> <fctr>  <dbl>
## 1   Leandro   2011     15
## 2    Marlon   2004      0
## 3    Andrej   2009      0
## 4     Ester   2002     63
## 5   Muhamed   1998      0

Note: Each spelling variant of a name in the data is treated as a unique name. In similar fashion we download the BE0001AL dataset as names2.
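
The original post does not show the code for this second download. A minimal sketch, assuming the BE0001T04BAr table sits at the URL analogous to the one above and delivers the same column layout as names10 (both the URL and the clean-up steps below are assumptions, not taken from the original post), might look like this:

names2 <- get_pxweb_data(
  url = "http://api.scb.se/OV0104/v1/doris/en/ssd/BE/BE0001/BE0001T04BAr",  # assumed analogous table URL
  dims = list(Tilltalsnamn = c('*'),
              ContentsCode = c('BE0001AL'),
              Tid = c('*')),
  clean = TRUE) %>% as.tbl

## Mirror the clean-up applied to names10 (assumes identical column names)
names2 <- names2 %>% select(-observations) %>%
  rename(firstname=`first name normally used`, counts=values) %>%
  mutate(counts = ifelse(is.na(counts), 0, counts))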

We now join the two datasets into one large data.frame by

names <- rbind(data.frame(names2,type="min02"), data.frame(names10,type="min10"))

and thus have everything in place to compute the name collision probability over time using the birthdayproblem package (as shown in previous posts).

library(birthdayproblem)
collision <- names %>% group_by(year,type) %>% do({
  ## For each year and dataset type: probability that at least two of 26 kids
  ## share a tilltalsnamn, plus the Gini index of the name distribution
  data.frame(p = pbirthday_up(n=26L, prob = .$counts / sum(.$counts), method="mase1992")$prob,
             gini = ineq::Gini(.$counts))
}) %>% ungroup %>% mutate(year=as.numeric(as.character(year)))

And the resulting probabilities based on the two datasets min02 (at least two instances of the name in a given year) and min10 (at least ten instances of the name in a given year) can easily be visualized over time.

ggplot( collision, aes(x=year, y=p, color=type)) + geom_line(size=1.5) +
  scale_y_continuous(label=scales::percent,limits=c(0,1)) +
  xlab("Year") + ylab("Probability") +
  ggtitle("Probability of a name collision in a class of 26 kids born in year YYYY") +
  scale_colour_discrete(name = "Dataset")

As seen in similar plots for other countries, there is a decline in the collision probability over time. Note also that the two curves are upper limits to the true collision probabilities. The true probabilities, i.e. taking all tilltalsnamn into account, would be based on a hypothetical min1 dataset; they would be slightly, but not substantially, below the min02 line. The same problem occurs, for example, in the corresponding data for England and Wales: there, Table 6 lists all first names with three or more uses, but does not state how many newborns have a name occurring only once or twice. With all due respect for the need to anonymise the name statistics, it is hard to understand why this summary figure is not reported as well, since it would allow one to compute correct totals and collision probabilities.

Summary

Altogether, I was still quite happy to get proper individual name data, so that the collision probabilities are, unlike in some of my previous blog analyses, exact!


Geospatial data and analysis for disaster relief


Tools from maps to drones respond to crises with increasing speed and accuracy.

In order to mount effective responses, emergency managers need accurate maps that show the extent of damage, predictions for its potential spread, and detailed data on the movement of people and resources.

Ten years ago, geospatial data wasn’t rich enough to map granular, real-time movements of people and resources, even in developed countries. Now that smartphones are ubiquitous around the world, disaster relief agencies are able to use geospatial data that goes down to the level of individuals, as well as maps showing key infrastructure and up-to-date damage assessments created on the fly, in order to manage response efforts.

The Humanitarian OpenStreetMap Team (HOT), discussed in Chapter 4, provides fast geospatial intelligence services during humanitarian crises. On October 3, 2016, as Hurricane Matthew bore down on the Caribbean, HOT activated a team to provide accurate, up-to-date maps of coastal areas in Jamaica, Haiti, and the Bahamas. A few days later, HOT announced that “over 1,000 mappers contributed more than 1.2 million edits, adding 180,000 buildings to the map concentrated in the most affected areas.” By the end of October, the effort had added more than 380,000 buildings and 400,000 road segments to the basemap, mostly in Haiti and the Dominican Republic.

HOT contributors respond to a list of projects posted by organizations like the Red Cross. Tasks might include identifying buildings and tracing roads from aerial imagery (Figure 1), and contributors may also gather data firsthand, as in a current HOT project that aims to map water and sanitation resources in Tanzania.

Figure 1. Geospatial data entered by contributors to the Humanitarian OpenStreetMap Team shows that a road north of Port-au-Prince, Haiti, is operational (data visualized in QGIS by Jon Bruner).

HOT, which has been working since 2011, principally uses satellite imagery supplemented with on-the-ground GPS-based data gathering. The future will almost certainly involve drones and sophisticated software that can stitch images together in order to produce up-to-the-minute surveys of conditions in disaster-stricken areas.

Skycatch, a San Francisco startup founded in 2013, produces software that transforms drone videos into 3D models. It normally sells its software to construction companies working on megaprojects, but it found a new application in 2015 when it joined the relief effort following Nepal’s massive earthquake. Data from the drones was used to identify damaged buildings, map paths for heavy equipment, and plan for the restoration of heritage sites.

Satellites and drones can capture outstanding surveys of infrastructure, but for tracking people after a disaster, phone data is becoming indispensable. Flowminder, a Swedish not-for-profit organization, demonstrated that anonymized call records from mobile phone operators could be used to map flows of displaced people following the 2010 earthquake in Haiti, and updated its model immediately before the 2015 Nepal earthquake.

Following that disaster, Flowminder used anonymized data from 12 million phones to estimate population displacement, including gender and age distributions, down to 100 m resolution. Each day, the Nepalese mobile provider Ncell delivered a 12-gigabyte CSV file to Flowminder, which reformatted the data, assigned call locations to administrative regions, and interpreted migration patterns.
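
The paragraph above describes a daily pipeline: ingest a large CSV of call records, assign each record to an administrative region, and aggregate movements. The following is a rough, hypothetical R sketch of that general idea only; the file names, column names (phone_id, date, lon, lat, region_name), and boundary data are illustrative assumptions, not details of Flowminder's actual system.

library(sf)     # spatial joins
library(dplyr)  # aggregation

## Hypothetical inputs: anonymized call records and administrative boundaries
calls   <- read.csv("call_records.csv")     # assumed columns: phone_id, date, lon, lat
regions <- st_read("admin_regions.shp")     # assumed column: region_name

## Turn call records into spatial points and match the regions' projection
calls_sf <- st_as_sf(calls, coords = c("lon", "lat"), crs = 4326)
regions  <- st_transform(regions, st_crs(calls_sf))

## Assign each call to the administrative region containing it
calls_in_region <- st_join(calls_sf, regions, join = st_within)

## Distinct phones observed per region and day, as a crude displacement signal
daily_presence <- calls_in_region %>%
  st_drop_geometry() %>%
  group_by(region_name, date) %>%
  summarise(phones = n_distinct(phone_id), .groups = "drop")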

Stefan Avesand, an engineer at Spotify who previously led efforts at Ericsson related to mobile network data, says Flowminder’s work during the Haiti earthquake response is an example of the transformation “from gut-driven to data-driven decision-making,” by “linking data collection directly with the decision-making process.”


CoreLogic: “Between 2007–2016, Nearly 7.8 Million Homes Lost to Foreclosure”



Here is a report from CoreLogic: US Residential Foreclosure Crisis: 10 Years Later. There are several interesting graphs in the report, including foreclosures completed by year.

This graph from CoreLogic shows the ten states with the highest peak foreclosure rate during the crisis, along with each state's current foreclosure rate.

CoreLogic Foreclosure Report

Some states, like Nevada and Florida, have improved significantly. Other states, like New Jersey and New York, have recovered only slowly.

Here is a table, based on data from the CoreLogic report, showing completed foreclosures per year.

Completed Foreclosures by Year (Source: CoreLogic)

Year    Completed Foreclosures
2000    191,295
2001    183,437
2002    232,330
2003    255,010
2004    275,900
2005    293,541
2006    383,037
2007    592,622
2008    983,881
2009    1,035,033
2010    1,178,234
2011    958,957
2012    853,358
2013    679,923
2014    608,321
2015    506,609
2016    385,748
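
As a quick sanity check on the report's headline (this check is not part of the original post), summing the 2007–2016 rows of the table gives roughly 7.8 million completed foreclosures:

## Completed foreclosures for 2007-2016, copied from the table above
crisis_years <- c(592622, 983881, 1035033, 1178234, 958957,
                  853358, 679923, 608321, 506609, 385748)
sum(crisis_years)
## [1] 7782686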



