Introduction to Amazon Machine Learning – Training DVD

By | iot, machinelearning

Number of Videos: 2 hours – 26 lessons
Ships on: DVD-ROM
User Level: Intermediate

This course shows you how to build a model using Amazon Machine Learning (Amazon ML) and use it to make predictions. AWS expert Dan Moore covers the basic types of machine learning, how to prepare your data, and how to make your data available to the Amazon Machine Learning processes. You’ll also learn about evaluating a model for accuracy, using it both for batch and real-time predictions, and using tags to manage environments. Designed for developers and technical marketers new to machine learning and for data scientists interested in using the AWS Amazon ML platform, the course provides hands-on experience building a working predictive model using real data. Learners should obtain an AWS account (free from Amazon) and a basic understanding of AWS concepts before beginning the course.

  • Understand how to prepare data for use with Amazon ML and how to navigate the console
  • Learn how to make real-time and batch predictions using Python and the Amazon Machine Learning console
  • Become familiar with advanced machine learning system management concepts like tagging and the model life cycle
  • Develop an awareness of model accuracy and learn to use tools for evaluating and comparing accuracy

Dan Moore runs Boulder, Colorado-based Moore Consulting, where he builds Amazon Machine Learning models for purposes such as predicting real estate valuations and estimating equipment utilization. Dan is an Amazon Web Services (AWS) trainer who has worked with AWS since 2008. He holds AWS certifications as a Solutions Architect, Developer, and SysOps Administrator.

  • Learn Introduction to Amazon Machine Learning from your own desk
  • Visual training method, offering users increased retention and accelerated learning
  • Breaks even the most complex applications down into simple steps
  • Easy-to-follow, step-by-step lessons, ideal for all
  • Comes with extensive working files


Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-9)

By | ai, bigdata, machinelearning

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics is often taught in school by and for people who like mathematics. As a consequence, those classes put the emphasis on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how to tackle useful statistical problems by writing simple R code and how to think in probabilistic terms.

In this series of articles I have tried to help you build an intuition for how probabilities work. To do so, we have been using simulations to see how concrete random situations can unfold and to learn simple statistical and probabilistic concepts. In today’s set, I would like to show you some deceptively difficult situations that will challenge the way you understand probability and statistics. By working through them, you will practice the simulation technique we have seen in past sets, refine your intuition and, hopefully, avoid some pitfalls when you do your own statistical analysis.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag Hacking stats

Exercise 1
Suppose that there are exactly 365 days in a year and that the distribution of birthdays in the population is uniform, meaning that the proportion of births on any given day is the same throughout the year. In a group of 25 people, what is the probability that at least two individuals share the same birthday? Use a simulation to answer that question, then repeat this process for groups of 0, 10, 20, …, 90 and 100 people and plot the results.

Of course, when the group size is 366 we know that the probability that two people share the same birthday is equal to 1, since there are more people than days in the year, and for a group of zero people this probability is equal to 0. What is counterintuitive here is the rate at which this probability grows. From the graph we can see that with just about 23 people we have a probability of about 50% of observing two people with the same birthday, and that a group of about 70 people has an almost 100% chance of seeing this happen.
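
The growth described above can be checked with a short simulation. This is a minimal sketch in Python rather than R; the function name, the seed and the list of group sizes are illustrative choices, not part of the original exercise or its official answers.

```python
import random

def shared_birthday_probability(group_size, trials=10_000, seed=1):
    """Estimate P(at least two people share a birthday) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one birthday per person, uniformly over 365 days.
        birthdays = [rng.randrange(365) for _ in range(group_size)]
        # A shared birthday exists when there are fewer distinct days than people.
        if len(set(birthdays)) < group_size:
            hits += 1
    return hits / trials

estimates = {n: shared_birthday_probability(n) for n in (0, 10, 23, 25, 50, 70)}
for n, p in estimates.items():
    print(f"{n:>3} people: {p:.3f}")
```

Plotting the estimates against the group size gives the curve discussed in the text; the estimate for 23 people should land near one half.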

Exercise 2
Here’s a problem that could someday save your life. Imagine you are a prisoner of war in an Asian Communist country and your jailers are getting bored. So, to pass the time, they set up a game of Russian roulette where you and another inmate play against one another. A jailer takes a six-shooter revolver, puts two bullets in two consecutive chambers, spins the cylinder and gives the gun to your opponent, who places the gun to his temple and pulls the trigger. Luckily for him, the chamber was empty, and the gun is passed to you. Now you have a choice to make: you can leave the cylinder as it is and play, or you can spin the cylinder before playing. Use 10000 simulations of both choices to find which choice gives you the highest probability of surviving.

The key detail in this problem is that the bullets are in consecutive chambers. This means that if your opponent pulls the trigger on an empty chamber and you don’t spin the cylinder, it’s impossible for you to pull the trigger on the second bullet. You can only land on an empty chamber or on the first bullet. Of the four empty chambers your opponent could have fired, only one is followed by a bullet, so you have a 1/4 = 25% chance of dying, versus a 2/6 ≈ 33% chance of dying if you spin the cylinder.
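
A simulation of both choices might look like the following. It is a sketch in Python rather than R, and it assumes that the cylinder advances by exactly one chamber after each trigger pull; the function and variable names are made up for the illustration.

```python
import random

def roulette_death_rates(trials=10_000, seed=1):
    """Return (death rate without spinning, death rate after spinning again)."""
    rng = random.Random(seed)
    deaths_no_spin = deaths_spin = games = 0
    while games < trials:
        first = rng.randrange(6)             # chamber holding the first bullet
        bullets = {first, (first + 1) % 6}   # two consecutive chambers
        position = rng.randrange(6)          # chamber fired by your opponent
        if position in bullets:
            continue                         # opponent died: not the scenario we are in
        games += 1
        # Choice 1: do not spin, so the next chamber in line is fired.
        if (position + 1) % 6 in bullets:
            deaths_no_spin += 1
        # Choice 2: spin again, so any of the six chambers is equally likely.
        if rng.randrange(6) in bullets:
            deaths_spin += 1
    return deaths_no_spin / trials, deaths_spin / trials

no_spin, spin = roulette_death_rates()
print(f"no spin: {no_spin:.3f}, spin: {spin:.3f}")
```

Games in which the opponent is shot are thrown away, which is how the simulation conditions on the information you have when the gun reaches you.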

Exercise 3
What is the probability that a mother who is pregnant with nonidentical twins gives birth to two boys, if we know that one of the unborn children is a boy but we cannot identify which one?
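
One way to approach this question is to simulate many pairs of twins and keep only the pairs that match what we know. The sketch below is in Python rather than R and assumes that each child is a boy or a girl with equal probability, independently of the other; the names are hypothetical.

```python
import random

def both_boys_given_at_least_one(trials=100_000, seed=1):
    """Estimate P(two boys | at least one of the twins is a boy)."""
    rng = random.Random(seed)
    at_least_one_boy = both_boys = 0
    for _ in range(trials):
        twins = (rng.choice("BG"), rng.choice("BG"))
        if "B" in twins:                  # keep only pairs consistent with what we know
            at_least_one_boy += 1
            if twins == ("B", "B"):
                both_boys += 1
    return both_boys / at_least_one_boy

estimate = both_boys_given_at_least_one()
print(f"{estimate:.3f}")
```

Three equally likely pairs survive the filter (boy-boy, boy-girl, girl-boy), which is why the estimate settles near 1/3 rather than 1/2.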

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • Work with different binomial and logistic regression techniques
  • Know how to compare regression models and choose the right fit
  • And much more

Exercise 4
Two friends play heads or tails to pass the time. To make the game more fun they decide to gamble pennies: for each coin flip one friend calls heads or tails, and if he calls right he wins a penny from the other, and loses one otherwise. Let’s say that they have 40 and 30 pennies respectively and that they will play until someone has all the pennies.

  1. Create a function that simulates a complete game and returns how many coin flips were made and who won.
  2. On average, how many coin flips are needed before someone has all the pennies?
  3. Plot the histogram of the number of coin flips made during a simulation.
  4. What is the probability that someone wins a coin flip?
  5. What is the probability that each friend wins all the pennies? Why is it different from the probability of winning a single coin flip?

When the number of coin flips gets high enough, the probability of someone winning often enough to get all the pennies rises to 100%. Maybe they will have to play 24 hours a day for weeks, but someday someone will lose often enough to be penniless. In this context, the player who started with the most money has a clear advantage, since they can survive a longer losing streak than their opponent.

In fact, in this context where the probability of winning a single flip is equal for each opponent, the probability of winning all the money is equal to the proportion of the money they start with: here 40/70 ≈ 57% and 30/70 ≈ 43%. That’s in part why the casino always wins: since it has more money than each gambler who plays against it, as long as it gets them to play long enough, it will win. The fact that it offers games where it has a greater chance of winning helps quite a bit too.
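
The claim that the chance of winning everything matches the starting share of the money can be checked by simulation. The following is a sketch in Python rather than R; the labels "A" and "B", the seed and the number of simulated games are assumptions made for the illustration.

```python
import random
import statistics

def play_game(pennies_a=40, pennies_b=30, rng=random):
    """Play until one friend holds every penny; return (flips, winner)."""
    total = pennies_a + pennies_b
    flips = 0
    while 0 < pennies_a < total:
        # A fair flip moves one penny from one friend to the other.
        pennies_a += 1 if rng.random() < 0.5 else -1
        flips += 1
    return flips, ("A" if pennies_a == total else "B")

rng = random.Random(1)
results = [play_game(rng=rng) for _ in range(1_000)]
mean_flips = statistics.fmean(flips for flips, _ in results)
share_a = sum(winner == "A" for _, winner in results) / len(results)
print(f"average flips: {mean_flips:.0f}, A wins: {share_a:.3f}")
```

For a fair coin the expected number of flips is the product of the two starting stakes, 40 × 30 = 1200, and the share of games won by the friend with 40 pennies should be close to 40/70.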

Exercise 5
A classic counterintuitive problem is the Monty Hall problem. Here’s the scenario, if you have never heard of it: you are on a game show where you can choose one of three doors, and if a prize is hidden behind this door, you win the prize. Here’s the twist: after you choose a door, the game show host opens one of the two other doors to show that there’s no prize behind it. At this point, you can choose to look behind the door you chose in the first place to see if there’s a prize, or you can choose to switch doors and look behind the door you left out.

  1. Simulate 10 000 games where you look behind the first door you chose, to estimate the probability of winning if you stay with this door.
  2. Repeat this process, but this time choose to switch doors.
  3. Why are the probabilities different?

When you pick the first door, you have a 1/3 chance of having the right door. When the show host opens one of the doors you didn’t pick, he gives you a huge amount of information on where the prize is, because he knowingly opened a door with no prize behind it. So the second door has a better chance of hiding the prize than the door you took in the first place. Our simulation tells us that this probability is about 2/3. So you should always switch doors, since this gives you a higher probability of winning the prize.

To better understand this, imagine that the Grand Canyon is filled with small capsules, each with a volume of one cubic centimeter. Of all those capsules only one holds a piece of paper, and if you pick this capsule, you win a 50% discount on a tie. You choose a capsule at random and then the trillions of other capsules are all discarded except one, such that the winning capsule is still in play. Assuming you really want this discount, which capsule would you choose?
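
Both strategies can be simulated in a few lines. This is a sketch in Python rather than R; it assumes the host always opens a door that hides no prize and that the player did not choose, picking at random when two such doors exist. The function name is hypothetical.

```python
import random

def monty_hall_win_rate(switch, trials=10_000, seed=1):
    """Estimate the probability of winning when staying or switching."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        choice = rng.randrange(3)
        # The host opens a door that is neither the player's choice nor the prize.
        opened = rng.choice([d for d in range(3) if d != choice and d != prize])
        if switch:
            # Move to the only door that is neither chosen nor opened.
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

stay_rate = monty_hall_win_rate(switch=False)
switch_rate = monty_hall_win_rate(switch=True)
print(f"stay: {stay_rate:.3f}, switch: {switch_rate:.3f}")
```

Switching wins exactly when the first pick was wrong, which happens two times out of three.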

Exercise 6
This problem is an example of a statistical pitfall that can easily be encountered in real life; it was published by Steven A. Julious and Mark A. Mullee. In this dataset, we can see whether a medical treatment for kidney stones has been effective. There are two treatments that can be used: treatment A, which includes all open surgical procedures, and treatment B, which includes small-puncture surgery. The kidney stones are classified in two categories depending on their size: small or large.

  1. Compute the success rate (number of successes/total number of cases) of both treatments.
  2. Which treatment seems more successful?
  3. Create a contingency table of the successes.
  4. Compute the success rate of both treatments when treating small kidney stones.
  5. Compute the success rate of both treatments when treating large kidney stones.
  6. Which treatment is more successful for small kidney stones? For large kidney stones?

This is an example of Simpson’s paradox, a situation where an effect appears to be present for the set of all observations but disappears, or even reverses, when the observations are categorized and the analysis is done on each group. It is important to test for this phenomenon since, in practice, most observations can be classified into subclasses and, as the last example showed, this can drastically change the result of your analysis.
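
The reversal can be reproduced with a few counts. The sketch below is in Python rather than R and uses the success counts commonly quoted for the kidney stone study; that the exercise’s dataset contains exactly these counts is an assumption, and the function name is hypothetical.

```python
# (treatment, stone size) -> (successes, cases); assumed to match the exercise data
cases = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def success_rate(treatment, size=None):
    """Success rate of a treatment, overall or restricted to one stone size."""
    successes = total = 0
    for (t, s), (won, n) in cases.items():
        if t == treatment and (size is None or s == size):
            successes += won
            total += n
    return successes / total

for size in (None, "small", "large"):
    label = size or "all"
    print(f"{label:>5}: A = {success_rate('A', size):.3f}, B = {success_rate('B', size):.3f}")
```

With these counts, treatment B looks better overall (289/350 against 273/350) while treatment A is better for small stones (81/87 against 234/270) and for large stones (192/263 against 55/80). The reversal comes from treatment A being used mostly on large stones, which are harder to treat.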

Exercise 7

  1. Download this dataset and do a linear regression with the variables X and Y. Then, compute the slope of the trend line of the regression.
  2. Make a scatter plot of the variables X and Y and add the trend line to the graph.
  3. Repeat this process for each of the three categories.

We can see that the general trend of the data is different from the trends of each of the categories. In other words, Simpson’s paradox can also be observed in a regression context. The moral of the story is: make sure that all the relevant variables are included in your analysis or you’re gonna have a bad time!
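
The same effect can be shown on made-up data. The sketch below is in Python rather than R and does not use the exercise’s dataset: the three groups are synthetic points invented for the illustration, and the least squares slope is computed by hand to keep the example self-contained.

```python
def slope(points):
    """Least squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Synthetic data: inside each group y falls as x rises,
# but groups with larger x values also sit at a higher level of y.
groups = {
    "a": [(x, 10 - x) for x in range(0, 5)],
    "b": [(x, 20 - x) for x in range(5, 10)],
    "c": [(x, 30 - x) for x in range(10, 15)],
}
pooled = [point for points in groups.values() for point in points]

group_slopes = {name: slope(points) for name, points in groups.items()}
pooled_slope = slope(pooled)
print(group_slopes, round(pooled_slope, 3))
```

Every group has a slope of -1, yet the slope fitted on the pooled points is positive, because the group a point belongs to is left out of the pooled regression.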

Exercise 8
For this problem you must know what true positives, false positives, true negatives and false negatives are in a classification problem. You can look at this page for a quick review of those concepts.

A big data algorithm has been developed to detect potential terrorists by looking at their behavior on the internet, their consumption habits and their travel. To develop this classification algorithm, the computer scientists used data from a population with a lot of known terrorists, since they needed data about the habits of real terrorists to validate their work. In this dataset, you will find observations from this high-risk population and observations taken from a low-risk population.

  1. Compute the true positive rate, the false positive rate, the true negative rate and the false negative rate of this algorithm for the population that has a high risk of terrorism.
  2. Repeat this process for the remaining observations. Is there a difference between those rates?

It is a known fact that the proportion of false positives among positive results is a lot higher in low-incidence populations; this is known as the false positive paradox. Basically, when the incidence of a certain condition in the population is lower than the false positive rate of a test, using that test on this population will result in many more false positive cases than true positive ones. This is in part because the drop in true positive cases makes the proportion of false positives that much higher. As a consequence: don’t trust your classification algorithm too much!
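
A small calculation shows how strongly the incidence drives the result. The sketch below is in Python rather than R, and every number in it (population size, incidences, a 95% true positive rate and a 5% false positive rate) is a hypothetical value chosen for the illustration, not taken from the exercise’s dataset.

```python
def false_share_of_positives(population, incidence, true_positive_rate, false_positive_rate):
    """Share of positive results that are false, for a given incidence."""
    affected = population * incidence
    unaffected = population - affected
    true_positives = affected * true_positive_rate
    false_positives = unaffected * false_positive_rate
    return false_positives / (true_positives + false_positives)

# Same hypothetical classifier applied to two populations of 100 000 people.
high_risk = false_share_of_positives(100_000, 0.20, 0.95, 0.05)
low_risk = false_share_of_positives(100_000, 0.001, 0.95, 0.05)
print(f"high-risk population: {high_risk:.1%} of positives are false")
print(f"low-risk population:  {low_risk:.1%} of positives are false")
```

With these assumed numbers the classifier flags 4 000 false positives against 19 000 true positives in the high-risk population (about 17% false), but 4 995 false positives against only 95 true positives in the low-risk one (about 98% false), although the classifier itself has not changed.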

Exercise 9

  1. Generate a population of 10000 values from a normal distribution with mean 42 and standard deviation 10.
  2. Create a sample of 10 observations and estimate the mean of the population. Repeat this 200 times.
  3. Compute the variance of these estimations.
  4. Create a sample of 50 observations and estimate the mean of the population. Repeat this 200 times and compute the variance of these estimations.
  5. Create a sample of 100 observations and estimate the mean of the population. Repeat this 200 times and compute the variance of these estimations.
  6. Create a sample of 500 observations and estimate the mean of the population. Repeat this 200 times and compute the variance of these estimations.
  7. Plot the variance of the estimations of the mean obtained with the different sample sizes.

As you can see, the variance of the estimation of the mean is inversely proportional to the sample size, so it does not decrease linearly. A small sample can produce an estimate that is a lot farther from the real value than a sample with more observations. Let’s see why this information is relevant to this set.
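
The steps of this exercise can be sketched as follows, in Python rather than R. The seed and the names are illustrative choices, and sampling is done without replacement from the generated population.

```python
import random
import statistics

rng = random.Random(1)
population = [rng.gauss(42, 10) for _ in range(10_000)]

def variance_of_sample_means(sample_size, repeats=200):
    """Variance of the sample mean over repeated samples of a given size."""
    means = [statistics.fmean(rng.sample(population, sample_size))
             for _ in range(repeats)]
    return statistics.variance(means)

variances = {n: variance_of_sample_means(n) for n in (10, 50, 100, 500)}
for n, v in variances.items():
    print(f"sample size {n:>3}: variance of the estimated mean = {v:.3f}")
```

Theory predicts a variance close to 10²/n, that is roughly 10, 2, 1 and 0.2 for the four sample sizes, so most of the improvement comes from the first few dozen observations.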

Exercise 10
A private school advertises that its small size helps its students achieve better grades. In its advertisement, it claims that last year’s students had an average 5 points higher than the state average on the standardized state test, and that since no large school has such a high average, that’s proof that small schools help students achieve better results.

Suppose that there are 200000 students in the state, that their results on the state test are normally distributed with a mean of 76% and a standard deviation of 15, that the school has 100 students and that an average school has 750 students. Can the school’s claim be explained statistically?

A school can be seen as a sample of the population of students. A large school, like a large sample, has a much better chance of being representative of the student population, and its average score will often be near the population average, while a small school can show a much more extreme average just because it has a smaller student body. I’m not saying that no school is better than another, but we must look at a lot of results to be sure we are not merely in the presence of a statistical anomaly.
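
The comparison between school sizes can be put in numbers. The sketch below is in Python rather than R and replaces the simulation with a direct calculation: it assumes that a school’s students are independent random draws from the state distribution, so that the school average is normally distributed with standard deviation 15/√n. The function name is hypothetical.

```python
from statistics import NormalDist

def chance_of_average_above(threshold, school_size, mean=76, sd=15):
    """P(school average > threshold) if students are random draws from N(mean, sd)."""
    standard_error = sd / school_size ** 0.5
    return 1 - NormalDist(mean, standard_error).cdf(threshold)

p_small = chance_of_average_above(81, 100)   # school of 100 students
p_large = chance_of_average_above(81, 750)   # average-sized school
print(f"school of 100: {p_small:.6f}")
print(f"school of 750: {p_large:.6f}")
```

Under these assumptions an average 5 points above the state mean is rare for a school of 100 students but practically impossible for a school of 750, so the absence of large schools with such an average says little about teaching quality.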

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more…

How Big Data Analytics Boosts Cyber Security

By | iot

As more and more parts of our lives become connected to the internet, and more of our daily transactions take place online, cybersecurity is becoming an increasingly important topic. Just as modern technology changes more quickly than ever before, so do cyber criminals create newer and faster ways to target and rip off organizations. New malware is difficult to detect using previous strategies, which means we need new cybersecurity strategies to ensure commercial security.

One such new strategy is to use big data analytics. Big data analytics is an automated process by which a computer system examines large and varied sets of data to find patterns and trends. It is currently used to help companies track customer preferences and therefore better target their products and advertisements to specific users. However, with some reprogramming, those same big data analytics could be used to detect, respond to, and ultimately prevent cybercrime.

Here are some ways that big data analytics could help in the fight against cyber criminals.

1. Identifying Anomalies in Device and Employee Behavior

It is nigh-on impossible for a human user to manually analyze the millions of alerts that internet customers generate each month and pick out the valid ones from the threats. A …

Read More on Datafloq

Professor Roberta Millstein, Distinguished Marjorie Grene speaker September 15

By | ai, bigdata, machinelearning



Virginia Tech Philosophy Department

2017 Distinguished Marjorie Grene Speaker

Professor Roberta L. Millstein

University of California, Davis

“Types of Experiments and Causal Process Tracing: What Happened on the Kaibab Plateau in the 1920s?”

September 15, 2017

320 Lavery Hall: 5:10-6:45pm


ABSTRACT. In a well-cited article, ecologist Jared Diamond characterizes three main types of experiment that are performed in community ecology: the Laboratory Experiment (LE), the Field Experiment (FE), and the Natural Experiment (NE). Diamond argues that each form of experiment has strengths and weaknesses, with respect to, for example, realism or the ability to follow a causal trajectory. But does Diamond’s typology exhaust the available kinds of cause-finding practices? Some social scientists have characterized something they call causal process tracing. Is this a fourth type of experiment or something else? In this talk, I examine Diamond’s typology and causal process tracing in the context of a case study concerning the dynamics of wolf and deer populations on the Kaibab Plateau in the 1920s, a case that has been used as a canonical example of a trophic cascade by ecologists but which has also been subject to a fair bit of controversy. I argue that ecologists have profitably deployed causal process tracing together with other types of experiment to help settle questions of causality in this case. It remains to be seen how widespread the use of causal process tracing outside of the social sciences is (or could be), but there are some potentially promising applications, particularly with respect to questions about specific causal sequences.

There will be an additional, informal discussion of Millstein’s* (2013) Chapter 8: “Natural Selection and Causal Productivity” on Saturday, September 16, at 10:15 a.m. at Thebes (Mayo’s house). For queries:

Sponsored by the Mayo-Chatfield Fund for Experimental Reasoning, Reliability, Objectivity and Rationality (E.R.R.O.R.) and the Philosophy Department


*Roberta L. Millstein is Professor of Philosophy at the University of California, Davis, with affiliations to the Science and Technology Studies Program and the John Muir Institute for the Environment. She specializes in the history and philosophy of biology as well as environmental ethics, with a particular focus on fundamental concepts in evolution and ecology. Her work has appeared in journals such as Philosophy of Science; Journal of the History of Biology; Studies in History and Philosophy of Biological and Biomedical Sciences; and Ethics, Policy & Environment. She is Co-Editor of the online, open-access journal Philosophy, Theory, and Practice in Biology and has served on governing boards for several academic societies. Her current project develops and defends a reinterpretation of Aldo Leopold’s land ethic in light of contemporary ecology.

Filed under: Announcement

RcppClassic 0.9.8

By | ai, bigdata, machinelearning

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

RcppClassic 0.9.8 is a bug-fix release for the very recent 0.9.7 release; it fixes a build issue on macOS introduced in 0.9.7. No other changes.

Courtesy of CRANberries, there are changes relative to the previous release.

Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
