**MAIN OBJECTIVE:** to teach data processing under security and
privacy.

**NEED FOR PROCESSING LARGE AMOUNTS OF DATA.** Modern computers
allow us to store and process large amounts of data. There is
hope that, e.g., by analyzing these huge amounts of medical data,
we will be able to better understand causes of different diseases,
to better trace which cure is better for each patient -- and as a
result, to arrive at more personalized medical practices. Current
data analytics has indeed led to several interesting medical
breakthroughs. Similar interesting results have been obtained in
agriculture, in biology, in social sciences, etc.

**POTENTIAL PROBLEM WITH PRIVACY.** Since we do not know a
priori what kind of dependence we are looking for, it makes sense
to allow researchers to ask all kinds of statistical questions. This
sounds reasonable, but it is known that such a possibility leads
to a violation of privacy.

At first glance, it looks like if we anonymize the data by deleting the names, and if we allow only statistical questions, then privacy will be preserved. Not so. For example, we may want to analyze how blood pressure depends on the closeness to I-10. So, we take a street -- e.g., Robinson Street -- and measure the blood pressure of all the people who live on that street. We then take the average blood pressure of those who live within a certain distance from I-10. Since we do not know how big the effect is, we compute averages corresponding to different distances. The problem is that if we know the average blood pressure in all the houses up to 501, and then up to 503, then, knowing how many people each average covers, we can reconstruct the blood pressure of a person living at 503 Robinson.
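
The reconstruction in this example is simple arithmetic, and a minimal sketch may make it concrete. The readings below are made-up numbers, the function name `recover_last_value` is hypothetical, and we assume that exactly one person is added between the two averages and that the adversary knows how many people each average covers.

```python
def recover_last_value(avg_without, n_without, avg_with):
    """Recover one person's value from two averages.

    avg_without: average over the first n_without people
    avg_with:    average over the same people plus one more person
    """
    n_with = n_without + 1
    # Each average times its count gives the corresponding sum;
    # the difference of the two sums is the extra person's value.
    return avg_with * n_with - avg_without * n_without


# Made-up blood pressure readings, one person per house (an assumption);
# the last reading plays the role of the person at 503 Robinson.
readings = [120.0, 135.0, 128.0, 142.0]

avg_up_to_501 = sum(readings[:3]) / 3   # a purely "statistical" answer
avg_up_to_503 = sum(readings) / 4       # another "statistical" answer

recovered = recover_last_value(avg_up_to_501, 3, avg_up_to_503)
print(recovered)  # about 142.0, up to floating-point rounding
```

Neither query mentions a name or an individual value, yet together they reveal the last reading.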

**HOW CAN WE PRESERVE PRIVACY AND STILL ALLOW RESEARCHERS TO
ANALYZE DATA?** If we allow researchers to ask all possible
questions based on the exact data, then privacy is not preserved.
We do not want to limit possible queries: we do not know what
the actual dependence is, and by limiting queries, we may
inadvertently prevent the researchers from finding the actual
dependence. So the only remaining choice is to modify either
the query result directly or the data values on which this result
is based.

This modification should be unpredictable -- so that it will not
be possible to reverse it and get the actual value. From the
mathematical viewpoint, unpredictable means random, so we arrive
at the idea of adding random noise either directly to the query
result or, indirectly, to the data values. Thus, we arrive at
*probabilistic* methods of maintaining privacy.

**PRIVACY IS PRESERVED, BUT DATA PROCESSING BECOMES MORE
COMPLEX.** To a large extent, probabilistic methods do preserve
privacy, but now we have an additional problem: the added noise
affects the results of data processing. We therefore need to know
how big this effect is, i.e., how accurate the results are. When
we use this data processing to check a certain hypothesis, we
need to take this noise into account in the criteria for testing
the hypothesis.

**PROBABILISTIC METHODS MAY ALSO LEAD TO LOSS OF PRIVACY.**
Another problem with the probabilistic methods is that they leave
room for privacy loss. For example, if we add random noise to the
query result, then, by asking the same query many times and
averaging the results, an adversary will be able to get a close
approximation to the original exact value -- and thus, to gain
supposedly protected information.
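
The averaging attack can be simulated in a few lines. This is only a sketch under stated assumptions: the noise is zero-mean Gaussian and drawn independently for each answer, and the true value, the noise level, and the function name `noisy_query` are made up for illustration.

```python
import random
import statistics


def noisy_query(true_value, noise_std, rng):
    """Answer a query with fresh zero-mean Gaussian noise added."""
    return true_value + rng.gauss(0.0, noise_std)


rng = random.Random(0)   # fixed seed so that the run is repeatable
true_value = 142.0       # the value the database is trying to protect
noise_std = 10.0         # a single answer is typically off by about 10

# The adversary asks the same query many times and averages the answers.
answers = [noisy_query(true_value, noise_std, rng) for _ in range(10_000)]
estimate = statistics.fmean(answers)

print(abs(answers[0] - true_value))  # error of one answer
print(abs(estimate - true_value))    # error of the average, much smaller
```

With n independent answers, the standard deviation of the average is the noise level divided by the square root of n, so here it is about 0.1 instead of 10.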

If we add noise to the original data, then the adversary can take into account that the same person is usually listed in many different databases. From each database, the adversary can retrieve the stored value. These stored values are obtained by adding different instances of random noise to the actual value. Thus, the adversary can again average these values and reconstruct a close approximation to the actual data.

**STORING RANGES INSTEAD OF ACTUAL VALUES: INTERVAL APPROACH TO
PRESERVING PRIVACY.** These problems can be avoided if, instead
of the actual value (with or without noise), we only store a range
of values. For example, instead of the exact age, we only mark
whether a person is between 20 and 30, between 30 and 40, etc. In
other words, we only keep an interval (such as [20, 30]) that
contains the actual (unknown) value. Since this is the only data
that is stored in the database, this is the only information that
can be reconstructed.
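
As a small illustration, replacing an exact age by its range could look like the sketch below. The function name `age_to_interval` is hypothetical, and assigning a boundary age such as 30 to the upper range is our own assumption; the text above does not say which range a boundary value belongs to.

```python
def age_to_interval(age, width=10):
    """Return the (lower, upper) range of the given width containing age.

    Assumption: a boundary age such as 30 goes to the upper range (30, 40).
    """
    lower = (age // width) * width
    return (lower, lower + width)


print(age_to_interval(27))  # (20, 30)
print(age_to_interval(30))  # (30, 40)
```

Only the returned pair would be stored; the exact age is discarded, so it cannot be recovered from the database alone.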

**PRIVACY IS PRESERVED, BUT DATA PROCESSING AGAIN BECOMES MORE
COMPLEX.** In such situations, data processing also becomes
complex: instead of the original algorithms for processing real
values, we now need to come up with new algorithms for processing
intervals.

**WHAT WILL BE COVERED IN THIS CLASS:**

**General data processing techniques.** We start with a brief
overview of the standard statistical methods, such as methods for
finding the parameters of a linear dependence or methods for
checking whether the dependence is indeed linear. For students who
have already taken a probability and statistics class, this will be
an easy part, but since not everyone has taken this class, we will
go slowly.

Once we learn the algorithms, we will program them -- and this will be a pattern throughout the class: to solidify the knowledge of the algorithms and to practice using them, we will write programs implementing these algorithms.

**How probabilistic approaches to privacy preservation affect
data processing.** Then, we will analyze how the addition of random
noise affects the data processing results. We will learn some
techniques for estimating this effect, and then, after learning
how to simulate random noise, we will experimentally test how well
these methods work.

**How the interval approach to privacy preservation affects data
processing.** Once we learn how to process data under
probabilistic modifications, we will move to a more complex topic
-- how the interval approach to privacy preservation affects data
processing. We will consider two important cases:

- the usual case when an approximate estimation is OK, and
- the case when decisions are extremely important -- and thus, we want guaranteed bounds on the corresponding quantities.
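
For the second case, the simplest example of a guaranteed bound is the mean. Since the mean increases whenever any single value increases, its smallest possible value is attained when every value is at its lower endpoint, and its largest when every value is at its upper endpoint. The sketch below uses a hypothetical function name `mean_bounds` and made-up intervals; less simple statistics generally need more care than this.

```python
def mean_bounds(intervals):
    """Guaranteed range of the mean when each value lies in its interval.

    intervals: non-empty list of (lower, upper) pairs.
    """
    n = len(intervals)
    lower = sum(lo for lo, _ in intervals) / n   # all values at lower ends
    upper = sum(hi for _, hi in intervals) / n   # all values at upper ends
    return lower, upper


# Made-up stored age ranges for four people.
ages = [(20, 30), (30, 40), (40, 50), (30, 40)]
print(mean_bounds(ages))  # (30.0, 40.0)
```

Whatever the actual ages are, their mean is guaranteed to lie between the two returned numbers.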

**How to gauge the corresponding level of privacy.** In all
these methods, some information is still revealed -- e.g., the
range of possible values. So, some privacy is lost. How can we
gauge the resulting level of privacy? We will study the basic ways
of gauging the privacy level: k-anonymity (for each piece of
information that can be extracted, there are at least k different
individuals whose data is consistent with this information),
l-diversity, differential privacy, and utility-based measures of
privacy and of privacy loss.
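
To make the k-anonymity definition concrete, here is a minimal sketch that computes the largest k for which a released table is k-anonymous. It assumes each released record is a tuple of the values an adversary could see; the function name `anonymity_level` and the records (an age range and a ZIP code prefix) are made up for illustration.

```python
from collections import Counter


def anonymity_level(records):
    """Largest k such that every released record is shared by >= k people.

    records: non-empty list of hashable tuples of released values.
    """
    counts = Counter(records)
    return min(counts.values())


# Made-up released records: (age range, ZIP code prefix).
released = [
    ((20, 30), "799"),
    ((20, 30), "799"),
    ((30, 40), "799"),
    ((30, 40), "799"),
    ((30, 40), "799"),
]
print(anonymity_level(released))  # 2
```

Here the rarest combination is shared by two people, so the table is 2-anonymous but not 3-anonymous.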

**What is the optimal way of modifying the data?** Within each
method of preserving privacy, there are many different options. In
the probabilistic approach, we can add different noise. In the
interval approach, we can divide into different ranges: e.g., by
dozens instead of by tens. Once we know what we want to estimate,
a natural question is which option we should select so that --
within the desired level of privacy -- we get the most accurate
estimates of the desired quantities. On simple examples, we will
learn how to solve the corresponding optimization problems.

**Auxiliary questions.** The above are the major topics related
to data processing under security and privacy. In addition to
these major topics, there are many auxiliary topics, and we will
cover some of them.

For example, the best way to gauge the reliability of a program and the uncertainty of the results of data processing is to take into account the structure of the code. But what if the code is proprietary, available only as a black box? In this case, we need to come up with different techniques for estimating the corresponding uncertainty. Some of these techniques will be described in the class.

**THERE IS NO TEXTBOOK:** we will use handouts and links.

**PROJECTS:** An important part of the class is a project.
There are three possible types of projects:

- reviewing and reporting on a related paper, or
- doing some independent research (not research as in high school, but research as in graduate school, i.e., trying to come up with something new), or
- programming something security- and/or privacy-related.

Which project to choose depends on your situation:

- An ideal class project is something related to security and privacy that is useful for your future thesis or dissertation. Please check with your advisor about it: maybe he or she wants you to read and report on some privacy- or security-related paper, or maybe you need to do some privacy- or security-related research. Whatever your advisor recommends will be a very good project for this class; just let the class instructor know what exactly you plan to do.
- If you have not yet selected an advisor, but you already know what research area you want to work in, come talk to the class instructor, and we will try to find an appropriate topic. If you already have proposals, great.
- If you do not have a research topic, or you have one but your advisor cannot find anything security- or privacy-related that will be helpful for your future thesis or dissertation, come talk to the instructor too.
- If you like security and privacy and want to start doing a related thesis or dissertation, come and talk to the instructor, and we will try to find something that will be of interest to you.

**TESTS AND GRADES:** There will be two tests and one final
exam. Each topic comes with home assignments -- both on paper
and in programming. Maximum number of points:

- first test: 10
- second test: 25
- home assignments: 10
- final exam: 35
- project: 20

A good project can help, but it cannot completely cover possible deficiencies of knowledge as shown on the tests and on the home assignments. In general, up to 80 points come from tests and home assignments. So:

- to get an A, you must gain, on all the tests and home assignments, at least 90% of the possible number of points (i.e., at least 72), and also at least 90 points overall;
- to get a B, you must gain, on all the tests and home assignments, at least 80% of the possible number of points (i.e., at least 64), and also at least 80 points overall;
- to get a C, you must gain, on all the tests and home assignments, at least 70% of the possible number of points (i.e., at least 56), and also at least 70 points overall.
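
The two-part rule above can be written out as a short check. This is only an illustrative sketch of the thresholds as listed, not an official grade calculator: the function name `letter_grade` is hypothetical, and returning `None` below the C thresholds is our own choice, since the rules above do not describe that case.

```python
def letter_grade(test1, test2, homework, final, project):
    """Apply the two conditions listed above for each letter grade."""
    core = test1 + test2 + homework + final   # tests and home assignments, max 80
    total = core + project                    # everything, max 100
    # (letter, minimum on tests and home assignments, minimum overall)
    thresholds = (("A", 72, 90), ("B", 64, 80), ("C", 56, 70))
    for letter, min_core, min_total in thresholds:
        if core >= min_core and total >= min_total:
            return letter
    return None  # below the C thresholds; not specified above


print(letter_grade(9, 22, 9, 32, 20))  # core 72, total 92 -> A
print(letter_grade(8, 18, 8, 26, 20))  # core 60, total 80 -> C
```

The second call shows why a good project alone is not enough: 80 points overall would suffice for a B, but only 60 of them come from tests and home assignments, which is below the required 64.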

**STANDARDS OF CONDUCT:** Students are expected to conduct
themselves in a professional and courteous manner, as prescribed
by the Standards of Conduct. Students may discuss programming
exercises in a general way with other students, but the solutions
must be done independently. Similarly, groups may discuss project
assignments with other groups, but the solutions must be done by
the group itself. Graded work should be unmistakably your own. You
may not transcribe or copy a solution taken from another person,
book, or other source (e.g., a web page). Professors are required
to -- and will -- report academic dishonesty and any other
violation of the Standards of Conduct to the Dean of Students.

If you feel you may have a disability that requires accommodation, contact The Center for Accommodations and Support Services (CASS) at 747-5148, go to Room 106 E. Union, or e-mail cass@utep.edu. For additional information, please visit the CASS website.