Many of you are familiar with the Birthday Paradox. If you want to read more about it you can find a good article here. Basically, it says that in a room of 23 people, there is a 50% chance that at least two people share a birthday. And if you increase that number to 75 people, the chances go up to 99.9%. I wanted to explore this a little more and rather than doing the math (boring!), I decided to do a Monte Carlo simulation, run it a bunch of times, and plot the results.
Like most of my Monte Carlo simulations, this will be in the C programming language. I like C because it is usually much faster than interpreted languages like Python and Java. With these simulations, I like to start with just one iteration or version and after that is working, expand it to other versions. In this case, we will start with a version where there are 23 people in the room. In that version, we will do 1000 iterations and in each iteration we assign random birthdays to the 23 people and then check to see if there were any matches. After we have this working, we expand it by wrapping the entire thing in another for loop that iterates through the number of people in the room from 2 to 100. Why stop at 100? Because it’s a nice round number and at that point, the chance of a birthday match has been at 100% for awhile so it doesn’t really achieve anything to continue.
#include <stdio.h> // print #include <stdlib.h> <span data-mce-type="bookmark" id="mce_SELREST_start" data-mce-style="overflow:hidden;line-height:0" style="overflow:hidden;line-height:0" ></span> // rand #include <time.h> // time // This program will determine the percentage of times that two or more people share a birthday // It will iterate from 2 people to 'maxNumPeople' and in each version it will do 'numSimulations' simulations int main() { int numPeople; // Number of people in this version int maxNumPeople = 100; // Maximum number of people to test with. Must be less than 100 int simNum; // Simulation number long numSimulations = 1000000; // Number of simulations to try in this version long numMatches; // Keeps track of number of matches in this version int birthdays[100]; int i, j; // Incrementers fin for loop int isMatch; // Flag for later // Seed RNG with time to get unique results each time srand(time(NULL)); // Iterate over the number of people in this version for(numPeople=2; numPeople<maxNumPeople; numPeople++) { // Reset the number of matchs for this version numMatches = 0; // Iterate the number of simulations in this version for(simNum = 0; simNum<numSimulations; simNum++) { for(i=0; i<numPeople; i++) { // Give each bday a value, in this case a day from 1 to 365 inclusive // rand()%365 returns from 0 to 364 so we add 1 to it // We could just as easily leave the 1 off and it would still work, but doesn't hurt birthdays[i] = 1 + rand()%365; } // Reset the isMatch flag isMatch=0; // Determine if there is a match // Note that this is a very inefficent way to do compare for(i=0; i<numPeople-1; i++) { for(j=i+1; j<span data-mce-type="bookmark" id="mce_SELREST_start" data-mce-style="overflow:hidden;line-height:0" style="overflow:hidden;line-height:0" ></span><numPeople; j++) { // Compare i and j birthdays if(birthdays[i] == birthdays[j]) { // If they match, then set flag and break isMatch=1; break; } } // This flag is to break out of the outer for loop if a mathc is already found if(isMatch==1) { break; } } // If the match flag was set, increment the number of matches if(isMatch==1) numMatches++; } // Print out the numPeople in this version and the percentage of times there wa a match printf("%d, %f \n", numPeople, 100.0*numMatches/numSimulations); } return 0; }
After you run the code, you get a text output that can be used to create this graph. This data was done with 1 million iterations with each number of people. At 23 people, the chances are just over 50%, at 57 people it goes up to 99%, and at 70 people it hits 99.9%. This doesn’t match the mathematical model perfectly, but it’s pretty darn close.
One thing to note is the method of checking to see if there are any matches. The method above is not necessarily super efficient but is quick to write. And since the whole thing runs in just over two minutes on a slow computer, there’s not a ton of value to super optimizing it. But one optimization I did do is with the break statements. Whenever a match is found, a flag is set and then that inner for loop is broken out of and then the flag is used to break out of the outer loop. This is necessary because C does not have the ability to break out of two loops without using a goto statement. You may think that this doesn’t save very much time, but it actually saves a ton of time. Below is the data up to one million iterations and you can see that having those breaks reduces the time required by 75%. These times are from a 2013 (4 years old) laptop running a 2.4GHz Intel Core i5.
Iterations | No Break | With Break |
100 | 0.3s | 0.2s |
1000 | 0.8s | 0.3s |
10000 | 6.2s | 1.5s |
100000 | 61s | 14s |
1000000 | 600s | 143s |
In the future, I’d like to do the same test in the future but use a bubble sort before finding matches so that I can sort out how often there are two separate matches, etc.
Any questions? Feel free to comment below!