Random Number Generation Part II – Some Data Analysis

Following on from my last post, I’ve been trying to read the voltage values that are generated from the random number generator I put together. I was initially using an arduino unit so I could control the voltage output down to below 1 volt. Ultimately I want to read the values with an ESP8266 unit running micropython so I can upload the values to an internet based data logger and the ESP8266 is only 1 volt tolerant on its analog-to-digital pin. So, the voltage has to be controlled and I didn’t mind blowing up one of my arduinos (which are 5 volt tolerant). The voltage divider is now composed of a 10k resistor from positive rail to analog out and a 1k resistor from analog out to ground.

The ardunio was giving me readings up to about 180 unit out of 1024 which gives about 180/1024 * 5V = 0.8789 volts… nice and safe.

Reading these voltages with the ESP8266 gave some interesting results. The code to do this was very simple:

from machine import ADC

import time
adc = ADC(0)            # create ADC object on ADC pin

for i in range(1000000):
    print(adc.read(), end=',')

This reads a voltage every 10 milliseconds from the analog pin and reports one million of them. It took about 2.8 hours and the output was redirected to a file of type csv with this command on my mac:

ampy --port /dev/tty.SLAB_USBtoUART run readanalog.py > avalanche_noise.csv

‘ampy’ is a nice program from Adafruit, which you can learn more about here.

I then analyzed this data in python using the jupyter web interface. A histogram plot of the values looks like this:

The values range from 135 to 674 with a mean of ~292.6. This distribution is obviously skewed too so it’s not gaussian distributed – maybe Poisson?  But the most striking thing is the lines of empty space (no blue) – is this a plotting problem or are certain values skipped during the reads? Well lets zoom in around the top of the distribution.

So its true, some values are just never recorded. Its hard to believe the voltages from the circuit would do this. Also, its clear from the numbers that every 12th value is skipped. I think this must be an error in the way micropython reads the register so I’ll probably post this somewhere in the micropython forums and see if this is true or if I’m just doing something wrong.

Next I tried to fit this distribution to a gaussian curve. It failed. Here is the python code:

import math
import numpy as np
import matplotlib.pyplot as plt
import scipy as scipy
import scipy.stats as stats
from scipy.optimize import curve_fit
from scipy.misc import factorial
from scipy.stats import norm
%matplotlib inline          # needed for plotting in jupyter

data_p = data # not really needed

fig = plt.figure(figsize=(20, 8))
entries, bin_edges, patches = plt.hist(data_p, bins=(int(np.amax(data_p)-np.amin(data_p))), range=[np.amin(data_p),np.amax(data_p)], normed=True)
bin_middles = 0.5*(bin_edges[1:] + bin_edges[:-1])

# poisson function, parameter lamb is the fit parameter
def gauss(x, *p):
 A, mu, sigma = p
 return A*np.exp(-(x-mu)**2/(2.*sigma**2))

# fit with curve_fit
p0 = [1., 260., 30.] # initial guesses for height, mean and SD

parameters, cov_matrix = curve_fit(gauss, bin_middles, entries, p0 = p0) 
print parameters # print results
print np.sqrt(np.diag(cov_matrix)) # print errors of fit

# plot poisson-deviation with fitted parameter
x_plot = np.linspace(0, np.amax(data_p), 1000)
plt.plot(x_plot, gauss(x_plot, *parameters), 'r-', lw=4)


This doesn’t fit very well. But it does reveal the skew present in the data. It does in some ways look poisson distributed (you can read more about this distribution here) and electrical noise is typically poisson distributed because of the discrete nature of electrical charge (read about this and shot noise here).

Mathematical aside:

One way I like to think of the Poisson distribution and Poisson processes is as follows. They arise from discrete events that must always have a positive value – this is not true for Gaussian distributions. So, in our circuit, current (electrons) flow or do not flow. When they flow, a discrete (integer) number of them flow. There is a lower limit on the number that can jump the gap. That number is zero. The gaussian distribution is the limit of a binomial distribution as the number of events goes to infinity. The binomial distribution is based on the idea that an event either happens or does not. (N.B. this is different from the Poisson case because in that case, events either happen 1 or more times or do not happen). If we accumulate binomial events, there is no limit to how many ‘yes’ or ‘no’ events can happen so some of the distribution must extend as far as the number of events recorded – for gaussian this limit goes to infinity and negative infinity. The Poisson distribution doesn’t act this way. If we try and model the Poisson distribution as an infinite binomial distribution we quickly realize that while we can get an infinite number of ‘zero value’ events as well (with low probability) there is more than one other alternative. So the distribution must take into account these many possible values which stretches the distribution in the positive direction while there is a still a hard limit at zero. We can shift the Poisson distribution so ‘N’ zero-value trials will be plotted at ‘-N’ (like we would for a binomial distribution) but on the positive size, the curve would extend past ‘+N’ because some of those trials can have a value of more than 1.

Enough qualitative description of distributions…

Using a poisson function instead like this:

def poisson(k, lamb):
 return (lamb**k/factorial(k)) * np.exp(-lamb)

and we get this:

It doesn’t converge at all.

Yikes, whats going on?

Well my thought was that the numbers I’m reading don’t match the number of electrons actually flowing. This signal is amplified by the transistors and then quantized, not in nature, but by the ADC converter in the ESP8266. My hope is that this signal is proportional to the number of electrons that flowed during signal acquisition. But this signal is not the same as the number of electrons which will be behaving as a Poisson process. But the recorded numbers should be proportional to the number of electrons. So if we divide these values by some constant, can we get the fit to work, and at what optimal division factor.

Long story short, if I divide the 1000000 million points by 23.9 I get an optimal fit in terms of the error reported for the fitted parameter. That parameter, which is the mean value, is ~6.558336. Does this mean, that on average 6.5 electrons pass through the transistor while the ESP8266 is taking an ADC measurement? I think it might be! Here is the fit:

If I take these same numbers and try and fit a Gaussian curve, it doesn’t do as well.

Conclusions? The electrons that flow across the junction in reverse bias are behaving as a Poisson process as expected. The distribution is not flat. I’ve seen some discussion on the net where people seem to assume this would be the case. It does seem to be random! One of the next steps is to convert these numbers to a flat distribution or at least make it generate a binary sequence. It seems to be that XORing  or Von Neumann filtering will not do a good job of removing the biasing that the Poisson distribution will introduce.

Lotteries and “Getting what you are looking for”

I came across a review/excerpt in Scientific American for a new book called “The Improbability Principle: Why Coincidences, Miracles, and Rare Events Happen Every Day”. This article has an excellent description of how seemingly impossible events can and do take place all the time.

The major point of the excerpt was that some events like finding people with matching birthdays can be a rare event or a common event depending on what you mean by a matching event. In a room of 23 people there is about a 6% chance one of them will share a birthday with you, but a greater than 50% chance that two of the 23 people in the room will share a birthday (see a description of the birthday problem here). An intuitive explanation for this is when you’re looking for someone who shares your birthday you have limited the possibilities – they must match your birthday – but when we allow anyone to match any other person, the possible number of combinations of people matching birthdays goes up a lot. So its a lot more likely. The maths for this is at the above link. I don’t need to repeat it here. The point is, a subtle change in the question can have big consequences for the answer.

The excerpt then goes on to talk about the Bulgarian Lottery coincident that occurred in 2009. In one draw, the winning numbers were 4, 15, 23, 24, 35, 42. “Amazingly” the next draw, the same numbers again. There was outrage with people calling the draw rigged and there was even a full government investigation. As an aside, if you could rig the draw why would you repeat the numbers? Seems silly to do that. Why not just rig the draw with some other numbers? Anyway…. the question is, how surprised should we be that this happens at all?

The Bulgarian lottery is a 6 number draw from 49 numbers. So the probability of getting the right numbers is 1/(49!/(6!*43!)) or one in 13,983,816. Once you have a particular set of numbers the chance that any other specific draw will have the same numbers is one in 13,983,816 – its the same as picking the numbers in the first place. This seems like a pretty unlikely event. But this is like finding someone who has the same birthday as you. You are trying to match one specific draw to another. What happens when you have many draws and you are trying to find a match between any of those two draws? The Bulgarian lottery has been going on for a while, so there have been many draws and so the chances that a match would happen among all these draws is much less amazing. For example, among 50 draws there are 1,225 ways we could match 2 of the draws. Quickly, the number of ways to find a match goes up and as a result the probability of a match happening goes up as well.

Now some skeptics might say, “But this was two draws in a row!!! Not a draw this week matching a draw from 5 years ago. This is very special because of the serial way in which it happened!”. I felt this objection was glossed over in the SciAm excerpt. Thinking that the consecutive nature of the draws is special in this case is thinking about it all wrong. Actually its thinking yourself into the special circumstance. You are saying that the two draws in a row is a special thing. But this is the same as finding that one person who matches your birthday.

Another way to think about this sort of thing is to consider the following. Instead of drawing 4, 15, 23, 24, 35 and 42, imagine the lottery drew 1, 2, 3, 4, 5 and 6. Would you think this was amazing? Rigged? Would it be all over the internet and in all the papers? Yes it would. But is it any more unlikely than 4, 15, 23, 24, 35 and 42? No, its not. They have the same chance of happening… one in 13,983,816. The only reason it would be considered special is because the numbers are consecutive. By why is that any more special than starting at 4, adding 11, adding 8, adding 1, adding 11 again and then adding 7? Its only because of the value placed on consecutive numbers by our minds. We humans would say consecutive draws has more meaning or is more special than one draw matching a draw from 5, 10 or 15 years ago. In terms of random events neither is more special than the other.

Your perspective influences your expectation, but not reality. Uncommon things will still happen all the time, just not the sort of uncommon things you might expect to happen.