Birthday Paradox Monte Carlo Simulation

Many of you are familiar with the Birthday Paradox.  If you want to read more about it you can find a good article here.  Basically, it says that in a room of 23 people, there is a 50% chance that at least two people share a birthday.  And if you increase that number to 75 people, the chances go up to 99.9%.  I wanted to explore this a little more and rather than doing the math (boring!), I decided to do a Monte Carlo simulation, run it a bunch of times, and plot the results.

Like most of my Monte Carlo simulations, this will be in the C programming language.  I like C because it is usually much faster than interpreted languages like Python and Java.  With these simulations, I like to start with just one iteration or version and after that is working, expand it to other versions.  In this case, we will start with a version where there are 23 people in the room.  In that version, we will do 1000 iterations and in each iteration we assign random birthdays to the 23 people and then check to see if there were any matches.  After we have this working, we expand it by wrapping the entire thing in another for loop that iterates through the number of people in the room from 2 to 100.  Why stop at 100? Because it’s a nice round number and at that point, the chance of a birthday match has been at 100% for awhile so it doesn’t really achieve anything to continue.

#include <stdio.h>		// print
#include <stdlib.h>	<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>	// rand
#include <time.h>		// time

// This program will determine the percentage of times that two or more people share a birthday
// It will iterate from 2 people to 'maxNumPeople' and in each version it will do 'numSimulations' simulations
int main()
{
	int numPeople;					// Number of people in this version
	int maxNumPeople = 100;			// Maximum number of people to test with. Must be less than 100
	int simNum;						// Simulation number
	long numSimulations = 1000000;	// Number of simulations to try in this version
	long numMatches;				// Keeps track of number of matches in this version
	int birthdays[100];
	int i, j;						// Incrementers fin for loop
	int isMatch;					// Flag for later

	// Seed RNG with time to get unique results each time
	srand(time(NULL));

	// Iterate over the number of people in this version
	for(numPeople=2; numPeople<maxNumPeople; numPeople++)
	{
		// Reset the number of matchs for this version
		numMatches = 0;

		// Iterate the number of simulations in this version
		for(simNum = 0; simNum<numSimulations; simNum++)
		{
			for(i=0; i<numPeople; i++)
			{
				// Give each bday a value, in this case a day from 1 to 365 inclusive
				// rand()%365 returns from 0 to 364 so we add 1 to it
				// We could just as easily leave the 1 off and it would still work, but doesn't hurt
				birthdays[i] = 1 + rand()%365;
			}

			// Reset the isMatch flag
			isMatch=0; 

			// Determine if there is a match
			// Note that this is a very inefficent way to do compare
			for(i=0; i<numPeople-1; i++)
			{
				for(j=i+1; j<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span><numPeople; j++)
				{
					// Compare i and j birthdays
					if(birthdays[i] == birthdays[j])
					{
						// If they match, then set flag and break
						isMatch=1;
						break;
					}
				}

				// This flag is to break out of the outer for loop if a mathc is already found
				if(isMatch==1)
				{
					break;
				}
			}

			// If the match flag was set, increment the number of matches
			if(isMatch==1)
				numMatches++;

		}

		// Print out the numPeople in this version and the percentage of times there wa a match
		printf("%d, %f \n", numPeople, 100.0*numMatches/numSimulations);
	}

	return 0;
}

After you run the code, you get a text output that can be used to create this graph.  This data was done with 1 million iterations with each number of people.  At 23 people, the chances are just over 50%, at 57 people it goes up to 99%, and at 70 people it hits 99.9%.  This doesn’t match the mathematical model perfectly, but it’s pretty darn close.

birthday_paradox_mc

One thing to note is the method of checking to see if there are any matches.  The method above is not necessarily super efficient but is quick to write.  And since the whole thing runs in just over two minutes on a slow computer, there’s not a ton of value to super optimizing it.  But one optimization I did do is with the break statements.  Whenever a match is found, a flag is set and then that inner for loop is broken out of and then the flag is used to break out of the outer loop.  This is necessary because C does not have the ability to break out of two loops without using a goto statement.  You may think that this doesn’t save very much time, but it actually saves a ton of time.  Below is the data up to one million iterations and you can see that having those breaks reduces the time required by 75%.  These times are from a 2013 (4 years old) laptop running a 2.4GHz Intel Core i5.

Iterations No Break With Break
100 0.3s 0.2s
1000 0.8s 0.3s
10000 6.2s 1.5s
100000 61s 14s
1000000 600s 143s

In the future, I’d like to do the same test in the future but use a bubble sort before finding matches so that I can sort out how often there are two separate matches, etc.

Any questions?  Feel free to comment below!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s