4365/CS 5354 Data Processing Under Security and Privacy
Summer 2016, Test 2
1-2. In class, we learned the following algorithm for eliminating
- we compute the sample mean μ =
(x1 + ... + xn) / n,
- we compute sample
variance V = ((x1 − μ)2 + ... +
(xn − μ)2) / (n − 1), and the
sample standard deviation σ as the square root of V;
then, we dismiss all the records which are not in the 2-sigma
interval [μ − 2σ, μ − 2σ] as
- after that, we repeat the same procedure for the new
database, with some outliers deleted: we re-compute μ and
σ and, if needed, eliminate values which are not in the new
2-sigma interval, etc.;
- this process continues until, at some
stage, no more outliers are eliminated.
1. Use this algorithm to eliminate outliers from the following
algorithm is straightforward:
- we have 100 records each equal to 1.9;
have 100 records each equal to 2.1;
- we have one record with
value 10; and
- we have one record with value 1,000.
2. Explain how outlier elimination is used in computer
3. In class, we derived a general formula for the optimal sizes
of a privacy-enhancing box for which the
inaccuracy in the resulting estimation of a statistical
characteristic C is the smallest possible under the condition of
= c / ai
is the absolute value of the partial derivative of C
with respect to the i-th variable xi
, and c = (1/2)
√(k * a1
* a2 * ...) / ρ(x)
ρ(x) is the data density, i.e., number of records per unit
Use the general formula to find the sizes of the cell that
provides the smallest possible inaccuracy in computing the
covariance between weight and age for adults. Assume that:
- we are looking for k-anonymity with k = 10,
- we deal with
a population of El Paso, N = 700,000 students,
- age is
uniformly distributed on the interval [20, 90];
- weight is
uniformly distributed on the interval [50, 90]; and
- we are
interested in the cell that covers a record with age x1
= 65 and weight x2 = 80 kg.
For covariance, the
derivative with respect to x1
is equal to
) / N and the derivative
with respect to x2
is equal to (x1
) / N, where μi are the sample
means of the corresponding values.
4. What if in Problem 3, in addition to k-anonymity, we also
require l-diversity, with l = 2, and the thresholds
= 0.2 and ε2
- If for the k-anonymous solution, we have
2Δi ≥ l
* εi, then l-anonymity is also satisfied.
- If for some of the variables xi, this inequality
is not satisfied, then for this variable, we select:
this variable, we select Δi = (1/2) * l
* εi, and
- for other variables, we select Δj from the
condition that the cell contain k records, i.e., that ρ(x)
* 2n * Δ1 * Δ2 * ...
5. What percentage of privacy do we lose if someone detects the
first digit x in the age xy? the second digit y? Assume that the
age is between 20 and 90.
6. Based on the Problem 5, it may look like the last decimal digit
-- the remainder of dividing a number by 10 -- is irrelevant, and
should not be protected much. Explain that it is not so: what if
someone knows the remainder by 10 and the remainder in the hex
system (when divided by 16) -- can we then uniquely determine the
adult age (from 20 to 90)? Explain your answer.
7. Use the efficient raising-to-the-power algorithm to compute
mod 11. Where is this algorithm used in RSA
8-9. Use Euclid's algorithm to compute the greatest common divisor
gcd(11, 21). Then find a number d for which d * 11 mod 21 = 1.
Where is this used in RSA coding?
10. Briefly describe your project for this class, and what you
have done so far for this project.