Summer 2016, Final Exam

1. *Solving equations and systems of equations:*

1a. Use Newton's method to design an algorithm for computing acrsine, i.e., for solving the equation sin(x) = a for an unknown x.

1b. Use the algorithm for computing 1/b that we had in class (and that is implemented in the computers) to perform the first two steps of computing the ratio 1 / 1.01.

1c. Use Newton's method to solve the following system of non-linear equations:

2. *Data processing under uncertainty:* Based on two
observations x_{1} and x_{2}, we can compute the
sample variance as V = (x_{1} − μ)^{2} +
(x_{2} − μ)^{2}, which for μ =
(x_{1} + x_{2}) / 2 is equivalent to a simpler
expression V = (x_{1} − x_{2})^{2} /
2. Suppose that we know that x_{1} is in the interval [9,
11], and x_{2} is in the interval [18,22].

2a. Use numerical differentiation to compute the derivatives of
the function V(x_{1}, x_{2}) at the midpoint
x_{1} = 10, x_{2} = 20; compare with the actual
derivative.

2b. Use linearization technique and your estimate for the derivative to estimate the range of the variance.

2c. Use naive interval computations to estimate this range.

2d. Use mean value form to estimate this range.

3. *Constraint optimization:*

3a. Use the Lagrange multiplier technique for find the general
formula for the values Δ_{1} and Δ_{2}
for which the inaccuracy a_{1} * Δ_{1} +
a_{2} * Δ_{2} of estimating a statistical
characteristic C is the smallest under the condition that the cell
contains k records, i.e., that ρ(x)
* 2^{n} * Δ_{1} * Δ_{2}
= k.

3b. Apply your formula to the example when k = 10 and
a_{1} = a_{2} = 1. What will then the resulting
inaccuracy be?

4. *Different concepts of measuring privacy:*

4a. Explain what is k-anonymity, and why it is important. If k increases, will we get more or less privacy protection? Explain your answer.

4b. Explain what is l-diversity, and why it is important. If l increases, will we get more or less privacy protection? Explain your answer.

4c. Explain what is differential privacy.

5. *Detecting outliers:* In class, we learned the following
algorithm for eliminating outliers:

- we compute the sample
mean μ = (x
_{1}+ ... + x_{n}) / n, - we
compute sample variance V = ((x
_{1}− μ)^{2}+ ... + (x_{n}− μ)^{2}) / (n − 1), and the sample standard deviation σ as the square root of V; - then, we dismiss all the records which are not in the 2-sigma interval [μ − 2σ, μ − 2σ] as outliers;
- after that, we repeat the same procedure for the new database, with some outliers deleted: we re-compute μ and σ and, if needed, eliminate values which are not in the new 2-sigma interval, etc.;
- this process continues until, at some stage, no more outliers are eliminated.

5a. Use this algorithm to eliminate outliers from the following database:

- we have 100 records each equal to 0.9;
- we have 100 records each equal to 1.1;
- we have one record with value 20; and
- we have one record with value 1,000.

5b. Explain how outlier elimination is used in computer security.

6. *Optimal selection of privacy-protecting cells:*_{1} is
equal to (x_{2} − μ_{2}) / N and the
derivative with respect to x_{2} is equal to
(x_{1} − μ_{1}) / N, where
μi are the sample means of the corresponding
values.

6a. In class, we derived a general formula for the optimal sizes
Δ_{i} of a privacy-enhancing box for which the
inaccuracy in the resulting estimation of a statistical
characteristic C is the smallest possible under the condition of
k-anonymity: Δ_{i} = c / a_{i}, where
a_{i} is the absolute value of the partial derivative of C
with respect to the i-th variable x_{i}, and c = (1/2)
*
√(k * a_{1}
* a_{2} * ...) / ρ(x) , where
ρ(x) is the data density, i.e., number of record per unit
volume.

Use the general formula to find the sizes of the cell that provides the smallest possible inaccuracy in computing the covariance between age and blood pressure for adults. Assume that:

- we are looking for k-anonymity with k = 300,
- we deal with a population of Texas, N = 30,000,000,
- age is uniformly distributed on the interval [20, 90];
- blood pressure is uniformly distributed on the interval [100, 200]; and
- we are interested in the cell that covers a record with age
x
_{1}= 65 and blood pressure x_{2}= 140.

6b. What if, in addition to k-anonymity, we also require
l-diversity, with l = 10, and the thresholds ε_{1}
= 0.001 and ε_{2} = 0.0001?

Reminder:

- If for the k-anonymous solution, we have
2Δ
_{i}≥ l * ε_{i}, then l-anonymity is also satisfied. - If for some of the variables x
_{i}, this inequality is not satisfied, then for this variable, we select:- for
this variable, we select Δ
_{i}= (1/2) * l * ε_{i}, and - for other variables, we select Δ
_{j}from the condition that the cell contain k records, i.e., that ρ(x) * 2^{n}* Δ_{1}* Δ_{2}* ... = k.

- for
this variable, we select Δ

7. *Different parts of data have different importance*:

7a. What percentage of privacy do we lose if someone detects the second digit x in the blood pressure 1xy? the third digit y? Assume that the blood pressure is between 100 and 200.

7b. Based on the computations from Part a, it may look like the last decimal digit -- the remainder of dividing a number by 10 -- is irrelevant, and should not be protected much. Explain that it is not so: what if someone knows the remainder by 10 and the remainder in the hex system (when divided by 16), and the blood pressure is between 110 and 180 -- can we then uniquely determine the blood pressure? Explain your answer.

8. *RSA algorithm*:

8a. Use the efficient raising-to-the-power algorithm to compute
5^{21} mod 13. Where is this algorithm used in RSA
coding?

8b. Use Euclid's algorithm to compute the greatest common divisor gcd(13, 21). Then find a number d for which d * 13 mod 21 = 1. Where is this used in RSA coding?

9. *Projects:*

9a. Briefly describe your project for this class.

9b. Briefly describe someone else's project for this class.