Spring 2001 CS100M Project 2 Solutions

Note: 1 bonus point for being the first to post a report on the course newsgroup of an error we don't catch.

Project 2: Part 1 "Structure is the Essence of Order"

This problem is a simple exercise in Matlab structure manipulation. Once the structures are created, we can treat them as units, and use the usual swapping algorithm to exchange their values. The name of one property of substances, namely the atomic weight, consists of two words. It turns out that we can't define Matlab structures with field names containing spaces. Try statement

A = struct('atomic weight', 256.74)

to see what happens! What should we do? We list a few solutions that you should employ in your future work:

Use an established shorthand notation: if scientists often abbreviate "atomic weight" by "aw", then "aw" would be a good variable name. Otherwise, we could use "aw" but then must document what it means.
Use the "underscore" character (_) to replace spaces. This is a widely used solution, and you will often see it in other people's code. Be aware that some programmer's strongly dislike it, though.
You can concatenate all the words, and capitalize all letters which start a word, except maybe the first one. The name of the field would become atomicWeight, or AtomicWeight, in this case.


% assign values to A, B
    A = struct('atomic_weight', 256.74, 'state', 'liquid', ...
               'color', 'blue', 'transparency', 100);
    B = struct('atomic_weight', 107.98, 'state', 'solid', ...
               'color', 'silver', 'transparency', 0);
% print A, B
    A
    B
% swap A, B
    C = A;
    A = B;
    B = C;
% print A,b
    A
    B

NOTE: It is legal for the fields to be in any order, in which case the output will show the fields in a different order. The order for A and B may be different.

Here is an alternative to the "assign values to A, B" code shown above, where we create the structure fields one by one:


% assign values to A
    A.atomic_weight = 256.74;
    A.state = 'liquid';
    A.color = 'blue';
    A.transparency = 100;
% assign values to B
    B.atomic_weight = 107.98;
    B.state = 'solid';
    B.color = 'silver';
    B.transparency = 0;

The second solution is less efficient, since Matlab has to repeatedly account for the changing number of fields in the two structures above. Similarly to vectorization, it is in general more efficient to create structures in one step. This is not possible, however, if not all data is available at the time a structure must be created. In this latter case the second approach (maybe without such a tight grouping of field-creating statements) is a natural solution. Some people would also argue that the second solution is more readable than the first one.

Running either solution above we get the following output:


A = 
    atomic_weight: 256.7400
            state: 'liquid'
            color: 'blue'
     transparency: 100


B = 
    atomic_weight: 107.9800
            state: 'solid'
            color: 'silver'
     transparency: 0


A = 
    atomic_weight: 107.9800
            state: 'solid'
            color: 'silver'
     transparency: 0


B = 
    atomic_weight: 256.7400
            state: 'liquid'
            color: 'blue'
     transparency: 100

Project 2: Part 2 "Eeeee! "

The goal of this problem is to illustrate that numerical computations are only approximate, so care must be taken when interpreting the results of such computations. Just because "the computer says" it does not always mean that it is true! (However, it does not mean that all numerical computations are untrustworthy.)

Note: observe that i ranges from 0 to N-1, not to N or N+1.


format long
N = 150; % number of terms
x = 2;   % compute (approximation of) e^x
% compute 0!, ..., (N-1)!
    for i = 0:N-1
      fact(i+1) = factorial(i);
    end
% compute first N terms x^i/i! of infinite series for e^2 in 
% ascending ($series2$) and descending ($series1$) order of terms
    series2 = sort((x .^ (0:N-1)) ./ fact);
    series1 = series2(end:-1:1); % OR: series1 = fliplr(series2);
% compute value of e^x in 3 ways:
    exact = exp(x)         % use Matlab's built-in function $exp$
    approx1 = sum(series1) % add terms in series in descending order
    approx2 = sum(series2) % add terms in series in ascending order
% display pairwise differences exact-approx1, exact-approx2, approx1-approx2
    exact - approx1
    exact - approx2
    approx1 - approx2
format short

To compute the two approximations, we first prepare an array containing the factorials of all numbers from 0 to N-1. We then generate an array containing the first N powers of 2, divide it element by element by the factorial array, and then we sort the result.

Sorting is necessary to order the series' terms in increasing order. We could determine algebraically what this order should be, and then we could compute the elements directly in this order. Any reasoning we would develop would be specific to the series at hand. Sorting an array with 150 elements is quick, and it has the advantage of being a general solution.

After we generate the series in increasing order, we reverse it by either using function fliplr or an array of indices of positions from right (end) to left (1). Summing the two series produces the two approximations. We compare these with the "exact" value by using subtraction. This is the output that we get:


exact =
   7.38905609893065

approx1 =
   7.38905609893065

approx2 =
   7.38905609893065

ans =
     1.776356839400250e-15

ans =
     0

ans =
    -1.776356839400250e-15

Note that the "exact" value is equal to the second approximation, but different from the first one. The difference is small, but the computation we perfomed was very simple. Complicated sequences of computations (e.g. simulating the aerodynamic behavior of an airplane) can lead to computed values that are very far from the "true" ones. The area of Computer Science which studies these problems is called Numerical Analysis.

We could have computed the result in many other ways, e.g. we could replace the bolded lines in the original solution with any of the following pieces of code:

series1 = sort((2 .^ (0:N-1)) ./ cumprod([1 1:N-1]));

fact = [1];
for i = 1:N
  fact(end + 1) = fact(end) * i;
end

series2 = sort((2 .^ (0:N-1)) ./ fact);

series2 = sort(cumprod([1 2*ones(1, N-1)] ./ [1 1:N-1]));

Bonus: For values of x very close to 0, the value of e^x is approximately equal to 1+x. We would thus expect the plot of exp(x)-1 to be almost a straight line. When we generate the plot using Matlab's intrinsic exponential function we can see that this is not the case (take a look at the blue line in the graph below). The jaggedness of the graph is due to the fact that for the interval considered the errors in estimating e^x-1 are quite large compared to the result.

It is in general the case that the relative error of the result a-b from subtraction increases dramatically when the two operands a and b are approximately equal in magnitude and have the same sign. In effect, we subtract away most of the significant digits, and are left with mostly errors.

We can avoid the problem described above by avoiding the subtraction between e^x and 1. One good way to do this is to use the series given in this problem. If we sum the series without its first term, we expect to get a much better approximation of e^x-1. The program below and the graph following it illustrates the dramatic difference in precision between the two approaches.


N = 150;
x = [-10^(-15) : 10^(-17) : 10^(-15)];
plot(x, exp(x) - 1);
hold on;
values = [];
for i = x
  values(end+1) = sum(sort((i .^ (1:N-1)) ./ cumprod(1:N-1)));
end
plot(x, values, 'r');
axis tight;
legend('Matlab''s values', 'Our approximation');

Do you know what statement axis tight does? If not, check it out! You will probably find it useful.

NOTE: For x close to 0, look again at our infinite series: 1 is much larger than x, which is much larger than x²/2, which is much larger than x³/6, etc. It turns out that using values=x; would produce a graph that looked exactly the same.

Project 2: Part 3 "The Double Helix"

The code below includes the strongly recommended bonus in bold.


top = lower('taCaGCgATAcgtg'); % input: top strand
bot = lower('agTActTgATgtgC'); % input: bottom strand

matchCount = 0; % number of matches in top, bot
matchWhere = []; % positions of matches in top, bot

topOut = []; % top with matches in upper-, mismatches in lower-case
botOut = []; % bot with matches in upper-, mismatches in lower-case

% compare each position for matches/mismatches
    for i = 1:length(top)
      if top(i) == 'a'
        if bot(i) == 't'
           matchCount = matchCount + 1;
           matchWhere = [matchWhere i];
           topOut = [topOut 'A'];
           botOut = [botOut 'T'];
        else
           topOut = [topOut top(i)];
           botOut = [botOut bot(i)];
        end
      elseif top(i) == 't'
        if bot(i) == 'a'
           matchCount = matchCount + 1;
           matchWhere = [matchWhere i];
           topOut = [topOut 'T'];
           botOut = [botOut 'A'];
        else
           topOut = [topOut top(i)];
           botOut = [botOut bot(i)];
        end
      elseif top(i) == 'g'
        if bot(i) == 'c'
           matchCount = matchCount + 1;
           matchWhere = [matchWhere i];
           topOut = [topOut 'G'];
           botOut = [botOut 'C'];
        else
           topOut = [topOut top(i)];
           botOut = [botOut bot(i)];
        end
      elseif top(i) == 'c'
        if bot(i) == 'g'
           matchWhere = [matchWhere i];
           matchCount = matchCount + 1;
           topOut = [topOut 'C'];
           botOut = [botOut 'G'];
        else
           topOut = [topOut top(i)];
           botOut = [botOut bot(i)];
        end
      end
    end
% display count and positions of matches and highlighted matches
    matchCount
    matchWhere
    topOut
    botOut

It is often the case that a more specific problem can be solved easier than its more general counterpart. (It is also often the case that a more general problem can be solved easier than its more specific counterpart.) To reduce the number of cases we have to deal with we first convert the strings to lowercase. We then proceed to "step over" the strings letter by letter, and we compare the letters in corresponding positions in the two strings.

Since we have four bases, we consider four main cases. Depending on the current base found in the top sequence, we look for a single complementary base in the bottom sequence. If we find it, the number of matches is incremented by 1, and the correspoding position is stored in the matchWhere array.

The character arrays (i.e. strings) topOut and botOut contain the part of the top and bottom sequences, respectively, that have already been analyzed. Whenever a match is found, the capitalized bases are inserted in the string; when not, the actual characters from the top and bottom strings are transferred (these are lowercase, since the entire input string was converted to lowercase). When we get to the end of the input strings, topOut and topBot will contain upper case letters in the positions where we found complementary bases.

If we examine the code above, we see that there are many similarities between the actions which correspond to various alternatives. We can aggregate the favorable (i.e. "have match") cases, and the unfavorable ones (i.e. "no match"), by using compound logical conditions built from simpler conditions using the operators and (&) and or (|). This is what we get:


top = lower('taCaGCgATAcgtg'); % input: top strand
bot = lower('agTActTgATgtgC'); % input: bottom strand

matchCount = 0; % number of matches in top, bot
matchWhere = []; % positions of matches in top, bot

topOut = []; % top with matches in upper-, mismatches in lower-case
botOut = []; % bot with matches in upper-, mismatches in lower-case

% compare each position for matches/mismatches
    for i = 1:length(top)
        if ((top(i) == 'a') & (bot(i) == 't')) | ...
           ((top(i) == 't') & (bot(i) == 'a')) | ...
           ((top(i) == 'g') & (bot(i) == 'c')) | ...
           ((top(i) == 'c') & (bot(i) == 'g'))

            matchCount = matchCount + 1;
            matchWhere = [matchWhere i];
            topOut = [topOut upper(top(i))];
            botOut = [botOut upper(bot(i))];
        else
            topOut = [topOut top(i)];
            botOut = [botOut bot(i)];
        end
    end
% display count and positions of matches and highlighted matches
    matchCount
    matchWhere
    topOut
    botOut

Note that the code is much shorter now, and that there is no redundancy in the actions. (There is, however, redundancy in the tests for complementarity.) Also, note how we change the matching pairs to upper case, even though we don't know what specific pair of complementary bases was found. The output of both these programs is identical:


matchCount =
     6

matchWhere =
     1     5     9    10    11    14

topOut =
TacaGcgaTACgtG

botOut =
AgtaCttgATGtgC

Project 2: Part 4 "Small Life"

At every step we have to age the already existing bacteria, and to determine the number of newborn bacteria that will appear at the end of the hour. Of course, we have to actually "create" these bacteria as well. We can perform all these steps if we manage an array containing the ages for all bacteria alive at a certain moment of time. The solution below solves the problem by looking at the age of bacteria at the beginning of each hour. In every cycle each bacterium already alive is aged by one hour. Newborn bacteria are generated at the beginning of each hour, and they are first introduced as having age 0. The number of newborn bacteria is determined by the number of living bacteria having reached an age of at least two hours.


N = 9; % length in hours of the experiment
agesNow = [0]; % initial state of the experiment: 1 newborn bacterium
% advance experiment N hours, 1 hour at a time
    for i = 1:N
        agesNow = 1 + agesNow; % age each bacterium
        newBacteria = 0; % number newborn bacteria = number mature bacteria
        % count number of mature bacteria
            for age = agesNow
                if age >= 2
                    newBacteria = newBacteria + 1;
                end
            end
        % replicate mature bacteria if not at end of experiment
            if i < N
                agesNow = [agesNow  zeros(1, newBacteria)];
            end
    end
% display number and ages of bacteria
    length(agesNow)
    agesNow

The output of this program is the following:


ans =
    34

agesNow =

  Columns 1 through 12 
     9     7     6     5     5     4     4     4     3     3     3     3

  Columns 13 through 24 
     3     2     2     2     2     2     2     2     2     1     1     1

  Columns 25 through 34 
     1     1     1     1     1     1     1     1     1     1

When employing this method, we have to suppress the creation of new bacteria the last time the cycle is executed. This is because we must report data about bacteria at the end of the hour, and not at the beginning of the next hour, so the newborn bacteria would have to be removed immediately, if added at all. Note that we introduce the new bacteria by creating an array of zeros (the age of newborn bacteria at the beginning of the hour), and concatening it with the array of already living bacteria. The solution below represents the bacteria population at the end of each hour. Because we don't have to suppress the creation of new bacteria in the last cycle, this solution is simpler. Below, we also streamline the computation of newBacteria:


N = 9;
agesNow = [0];
for i = 1:N
   newBacteria = sum(agesNow >= 2);
   agesNow = 1 + [agesNow  zeros(1, newBacteria)];
   % OR: agesNow = [1+agesNow ones(1,newBacteria)];
end
length(agesNow)
agesNow

Here is a more abstract way of thinking about this problem: We first observe that the number of bacteria at the end of hour N is completely determined if we know how many bacteria were of age 1, and over 1, respectively at the end of hour N-2. (Why is this?) Also, note that there are no bacteria of age 0 at the end of hour N-2. Let x be the number of bacteria of age 1 at the end of hour N-2, and y the number of bacteria of age 2 and above at the same moment of time. The number of bacteria at times N-1 and N will be x+2*y, and 2*x+3*y, respectively. If n_k is the number of bacteria alive at the end of hour k, we can write the following relation: n_N = n_N-1+n_N-2. More, we know that n₁ = n₂ = 1. The recursive relation allows us to compute the number of bacteria at the end of any hour N. This is a famous series in mathematics, named after Leonardo of Pisa (Italy), also known as Fibonacci (1170 - 1250).

Here is the program that computes the number of bacteria alive at the end of hour N:


N = 9;
f1 = 1;
f2 = 1;
for i = 3:N
  f3 = f1 + f2;
  f1 = f2;
  f2 = f3;
end
f3

Spend some time analyzing the age distribution of bacteria at various moments of time. Can you use Fibonacci's series to compute these age distributions?

Project 2: Part 5 "Coincidental Questions: The Birthday Problem revisited"

Here is a solution derived from the the solution seen in class, as the problem text suggests:


W = [1/5 1/2 1 2];
A = [14 5 7 2] + [32 28 19 26]; 
B = [12 4 3 4] + [33 23 13 18];
possibleQ = 50;

% perform both simulations in an outer loop
    for T = [sum(round(A.*W)) sum(round(B.*W))]
        T
        % generate one bucket for each possible question
              buckets = zeros(1,possibleQ);
        % generate random questions ("throw darts")
              for dart = ceil(rand(1,T)*possibleQ)
                buckets(dart) = 1+buckets(dart);
              end
        % count non-shared (i.e. non-repeated) questions
              N = sum(buckets == 1)
        % count questions ignoring duplicates
              Q = sum(buckets > 0)
        % compute the frequency that students have the same question
              100* (Q-N) / Q
    end

Note that we compute the percentage of shared questions simultaneously for class A and B. In effect, we perform two simulations.

We compute the number of expected questions by multiplying the number of students in each category who asked, or wanted to, ask questions, and the corresponding average number of questions they asked, or wanted to ask, during one lecture.

It might be unusual to think of somebody as asking 1/2 of a question in a lecture, for example, but if we remember that we are talking about averages, all concerns should be alleviated. The average of a data series does not have to, and in general it doesn't, match any particular value in the data series.

In each simulation we create a set of "buckets" (i.e. categories), in a number equal to the total number of possible questions that could be asked in a lecture. We simulate the posing of random questions by the process of throwing darts. While the details of these processes might seem very different, they share the characteristic that the outcome of each individual event is random, and the outcome can be one out of an equal number of equally possible cases.

Each random question that is generated is counted in its corresponding bucket. This correspondence is quite straightforward, so we can just use the ordinal number of the question to index into the bucket array.

Once all questions have been generated, we check the bucket array for non-shared questions. These questions are those which have been only asked once, which also means that their corresponding bucket must contain the value 1. The number of different questions that have been asked is determined by counting the buckets that contain non-zero values. The precise value is not important, since we want to asked all occurences of the same question only once.

Here is the output when the program above is run immediately after Matlab starts up:

Here is an alternative solution:


N = 0;
Q = 0;

howMany  = [14 5 7 2] + [32 28 19 26];
howOften = [5 2 1 .5]; 
T = sum(round(howMany ./ howOften));
possibleQ = 50;

% generate one possible series of questions asked
    questions = ceil(rand(1, T) * possibleQ);

    for i = 1 : length(questions)
      % if question has not been already eliminated...
          if questions(i) > 0
            % assume for now that it is not shared
                isShared = 0;
            % this is a question that we did not count yet
                Q = Q + 1;
            % check for duplicates to the right (toward higher indices)
                for j = i + 1 : length(questions)
                  % if a duplicate is found...
                      if questions(i) == questions(j)
                        % eliminate the duplicate by overwriting its number
                            questions(j) = 0;
                        % keep in mind that the question is shared (repeated)
                            isShared = 1;
                      end
                end
            % if we did not detect that the question is shared...
                if isShared == 0
                  % count it a a new non-shared question
                       N = N + 1;
                end
          end
    end

    T
    N
    Q
    100*(1-N/Q)

We determine the total number T of questions asked by dividing the number of students who ask question with a given frequency with the respective frequency, rounding up to the nearest integer, and then summing all these numbers. We then generate an array of T random elements between 1 and 50. These are the numbers of the questions that have been asked.

Note: rounding should happen last (after summing), but the write-up was worded to imply it should happen before. We will accept either approach.

We use two nested for loops to go over the array and identify the number Q of distinct values in the questions array, as well as the number N of non-shared questions. Assume that variable i has value i0.

If the question questions(i0) has not already been eliminated (see below), we assume that it is not shared. We indicate this by setting the "spy variable" isShared to 0. We count this value as a new distict value (we don't know yet whether it is duplicated). We now investigate the entire part of the array corresponding to indices higher that i0, and we check for the existence of questions having the same number as questions(i0). If we find such a value, that particular questions is shared, and we "remember" this in the isShared variable. Simultaneously we "erase" the newfound question by overwriting it with 0. This is done because there no new information that we would gain by looking at that value again (we know it is a duplicated value and we found one of its pairs in position i0).

If variable isShared is still set to 0 after the inner loop ends, we can be sure that there are no values identical to questions(i0), thus this question must be counted as a non-shared one. We know this because only two other cases are possible:

Question questions(i0) is shared, and there is at least one duplicate of it in position i1, i1<i0, in the array. This can't be the case, because in that case questions(i0) would have been set to 0 in a previous loop, and we would have skipped it in this one.
Question questions(i0) is shared, and there is no duplicate of it in any position i1, i1<i0, in the array. If this is the case, we should have found an unerased duplicate of questions(i0) in the inner loop, and this question would have been marked as a duplicate.

When the outer loop runs to completion we have the number of total, distinct, and non-shared questions that have been asked during class. The code presented below corresponds to the first case given in the problem. Note that asking "multiple" questions in a lecture was translated to "asking one question per half lecture," or "asking exactly two questions per lecture."

Here is the output from the second program when it is run immediately after Matlab starts up:


T =
   108

N =
    11

Q =
    46

ans =
   76.0870

To get the answer for class B, replace the second line of the code above with howMany = [12 4 3 4] + [33 23 13 18];. Here is the output from the revised second program when run immediately after the original second program is run immediately after Matlab starts up:


T =
    83

N =
    19

Q =
    43

ans =
   55.8140

The output is the same as that from the first program! How can that be, since we used "random" numbers? The answer is that these are pseudorandom numbers that the computer generates using a deterministic algorithm. Therefore, if started from the same initial condition, e.g. immediately after Matlab starts up, then the generated sequence of (pseudo)random numbers is always the same.

Is this a feature or a bug? It is a feature because it allows us to debug our code by making the results reproducible. Furthermore, if we want a "more random" sequence, we can try to start the random number generator in a "random" starting state. A common way to do this is to use the clock since time marches forward and it is unlikely that two programs start at exactly the same time. In Matlab, this is done using rand('state',sum(100*clock));.