Remote Access to Research Data
Data protection is important for the sake of the subjects of the statistics, i.e. the observations included in the data. The disclosure of information regarding individuals, households, enterprises and other statistical units to third parties must be prevented.
Data protection includes but is not limited to:
As a researcher, you must do your part to protect the data you use, including unit-level data, against disclosure to third parties, both at the processing stage and when publishing the results of your research. Data protection regulations such as the EU’s General Data Protection Regulation (GDPR) and the Finnish Data Protection Act (1050/2018) safeguard people who are the subject of research. For more information about data protection in scientific research, see the website of the Office of the Data Protection Ombudsman on scientific research and data protection.
Statistics Finland has a statutory right to collect register and survey data. The data we collect about society are very comprehensive and contain some highly sensitive information.
According to the Finnish Statistics Act, Statistics Finland may issue a licence to use the confidential data in its possession, originally collected for statistical purposes, for scientific research and statistical surveys of social conditions.
Section 13 of the Statistics Act specifies how Statistics Finland may disclose data collected for statistical purposes. The grounds in Section 13 state the following[1]:
“When releasing data, the protection of personal data and data regarding business and professional secrets must be ensured on a case-by-case basis with practical measures, such as by requiring sufficient data security measures and by arranging the necessary monitoring and tracking of the data use. [...] Because the end results of scientific research are usually public, it should always be separately made sure in connection with their publication that it would not be possible to identify the individual statistical units on which the research is based from the public end result of the research.”
[1] Government proposal HE 154/2012 for legislation to amend the Statistics Act and the Act for amending Sections 2 and 3 of the Act on rural economic activity statistics.
The unit-level data of Statistics Finland are only accessible with a user licence. These data may only be used by the person named in the user licence, and only for the purposes indicated in the user licence. It is prohibited to attempt to identify the data subjects included in the data.
By signing a user agreement and a pledge of secrecy concerning a research project or the SISU microsimulation model, you agree never to disclose or use for your own benefit any confidential information (that is, unit-level personal and business data included in the research data) that you discover during research.
Statistical disclosure control methods are used to prepare (tabular) data for publication by modifying it to protect the data of individual data suppliers and statistical units (person, household, enterprise, establishment, etc.) from disclosure. Because tables are a common format for presenting research results, this section includes information about the concepts, disclosure risk and protection methods related to the statistical data protection of tabular data.
Research results can also be aggregated data other than a table, for example a graph or a single distribution parameter. You can find more information about the data protection rules applied to these in the Rules and instructions of the Research Services and in section 3.4 Data protection for research results. Please note that data protection must be applied to all results to be published.
It is almost impossible to give detailed universal instructions for protecting tabular data, because there is a wide variety of tables. Tables may differ by content, structure, publishing concept and purpose of use – at worst, every table is a special case when it comes to protection. The better you have accounted for the special features of each table, the better you can protect them while keeping the features that are essential for its intended purpose.
“Tabular data” refers to aggregated data arranged in a table format. Tables can be classified as frequency tables or magnitude tables.
Distribution parameters can also be presented in a single table. To protect such a table, you have to apply the data protection rules for distribution parameters. For more information, see the Rules and instructions of the Research Services.
In the context of tabular data, “disclosure” means the probability of determining the identity or property of a unit more precisely than if the table had not been published. Disclosure may be approximate or exact.
In magnitude tables, “disclosure” usually refers to a situation where the value of a certain statistical unit's tabulated variable can be too accurately estimated based on the table’s figures and structure. Approximate disclosure may be as harmful as exact disclosure, especially in the case of enterprise data.
The first stage includes an assessment of the need to protect the table: identifying the so-called sensitive cells at risk of disclosure based on the chosen sensitivity rule. The second stage is described in the section Tabular data protection methods.
Most commonly used sensitivity rules:
In frequency tables, cells are considered sensitive if they only include a few statistical units, meaning the value combination of the cell’s categorical variables is rare. You can use the threshold value rule to identify these low-frequency cells as sensitive (see Example 1).
We recommend using the threshold value rule if only exact disclosure (e.g. exposing a person’s identity) needs to be protected against.
You prepare a table where you cross-tabulate people’s age groups, domiciles and civil status. For cell sensitivity analysis, you have decided to use a threshold value of 3. After tabulation, you realise that there are only two widows under 20 years of age in municipality X.
Therefore, a combination of the following categorical variables:
is too rare according to the threshold rule, as the cell frequency, 2, is lower than the threshold value, 3. The risk of disclosure is too great for the persons belonging to these cells, so the data in the cells must be protected before the table is published.
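The threshold value rule can be sketched in a few lines of Python. This is a minimal illustration, not Statistics Finland's implementation: the function name `sensitive_cells` and all counts except the first cell are hypothetical, and treating empty cells as non-sensitive is an assumption.

```python
def sensitive_cells(frequencies, threshold=3):
    """Return the cells whose frequency is positive but below the threshold.

    Assumption: empty cells (frequency 0) are not flagged as sensitive.
    """
    return {cell: n for cell, n in frequencies.items() if 0 < n < threshold}


# Hypothetical cross-tabulation; only the first cell mirrors the example above.
counts = {
    ("under 20", "municipality X", "widowed"): 2,
    ("under 20", "municipality X", "unmarried"): 140,
    ("20-29", "municipality X", "married"): 57,
}

flagged = sensitive_cells(counts, threshold=3)
print(flagged)
```

Only the cell with two widowed persons under 20 is flagged, because 2 is lower than the threshold value 3.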
In magnitude tables, cells can also be considered sensitive if one or only a few units dominate the cell’s value (their values are considerably greater than those of the other units). This may allow the values of the dominant units to be determined too precisely (even if not exactly). Protection against unacceptably accurate estimation means protection against approximate disclosure, and it can be implemented by using a dominance rule to determine which cells are sensitive (see Example 2 and Example 3).
You can use multiple sensitivity rules in parallel. In this case, a cell is considered sensitive if it matches any of the sensitivity rules.
A cell in a certain table has the total value X, consisting of three observations with the following values:
The cell’s total value is therefore: X = x1 + x2 + x3 = 59 + 27 + 14 = 100. You need to analyse if one or more observations dominate the value of X. You have chosen to use the dominance rule with n as 1 and k as 75 – in other words, you want to know if one (the largest) observation in a cell contributes more than 75 per cent of the cell’s value. In this cell, the largest observation x1 contributes x1 / X = 59 / 100 = 0.59 = 59% of the cell’s total value. Because 59 < 75, the largest observation (and therefore any other single observation) does not dominate the cell’s value sufficiently, according to your dominance rule, to require protection of the cell.
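The same check can be written as a short Python sketch. The function names are hypothetical, the values are assumed to be non-negative, and the boundary case is an assumption: a cell is treated as dominated only when the share of the n largest observations is strictly greater than k per cent.

```python
def top_n_share(values, n):
    """Share of the cell total contributed by the n largest observations."""
    return sum(sorted(values, reverse=True)[:n]) / sum(values)


def is_dominated(values, n, k):
    """(n, k) dominance rule: True if the n largest observations
    contribute more than k per cent of the cell total.

    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    top = sum(sorted(values, reverse=True)[:n])
    return 100 * top > k * sum(values)


cell = [59, 27, 14]
print(top_n_share(cell, 1))      # share of the largest observation
print(is_dominated(cell, 1, 75))
```

For the cell above, the largest observation contributes 59 per cent, so the rule (1,75) does not flag it.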
A cell in a certain table has the total value X, consisting of twelve observations. The largest observation is x1 = 61 and the second-largest is x2 = 20. The other ten observations, x3–12, have a total value of 19.
Should the cell be protected, if:
a. you use the dominance rule n = 1 and k = 60
b. you use the dominance rule (2,90)
c. you use the threshold value 3 and the dominance rule (1,60)?
Solution: The cell’s total value is X = 61 + 20 + 19 = 100.
a. The largest observation contributes x1 / X = 61 / 100 = 0.61 = 61% of the cell’s total value. Because 61 > 60, you must protect the cell.
b. Your two largest observations contribute (x1 + x2) / X = (61 + 20) / 100 = 0.81 = 81% of the cell’s total value. Because 81 ≤ 90, the cell needs no protection.
c. If two sensitivity rules are applied, a cell must be protected if it meets the criteria of at least one rule. As in “a” above, the cell must be protected according to the dominance rule (1,60). It follows that the cell must be protected when both the threshold value rule with a threshold of 3 and the dominance rule (1,60) are used. On its own, a threshold value of 3 would not require the cell’s protection, because the cell contains twelve observations.
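The three cases can be reproduced with a small sketch that applies the threshold value rule and the dominance rule in parallel. The function name `needs_protection` is hypothetical, and so is the split of the ten smallest observations: the example only states that they sum to 19.

```python
def needs_protection(values, threshold=None, dominance=None):
    """Return the names of the sensitivity rules that flag the cell.

    threshold: minimum number of observations, or None to skip the rule.
    dominance: (n, k) tuple, or None to skip the rule.
    """
    reasons = []
    if threshold is not None and 0 < len(values) < threshold:
        reasons.append("threshold")
    if dominance is not None:
        n, k = dominance
        top = sum(sorted(values, reverse=True)[:n])
        if 100 * top > k * sum(values):
            reasons.append("dominance")
    return reasons


# x1 = 61, x2 = 20, and ten further observations summing to 19
# (hypothetical split: nine 2s and one 1).
cell = [61, 20] + [2] * 9 + [1]

print(needs_protection(cell, dominance=(1, 60)))               # case a
print(needs_protection(cell, dominance=(2, 90)))               # case b
print(needs_protection(cell, threshold=3, dominance=(1, 60)))  # case c
```

An empty list means that no rule flags the cell. A non-empty list means the cell must be protected, and it shows which rule triggered the protection.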
In the second stage of the table protection process, the data in sensitive cells are protected by applying the selected protection method. The first stage, the protection need assessment, has been described above in the Disclosure from tabular data section. The primary criteria for selecting a protection method are achieving the required level of protection and preserving the essential properties of the table. The table must be protected to a sufficient degree but remain useful even in its protected form.
The choice of method is often influenced by the available resources such as time and access to protection software. In addition, the protection method should be transparent: the users of the protected table should understand the protection method in general and therefore be able to account for the changes introduced by the protection method to the table.
Masking involves suppressing the principal cells at risk of disclosure and secondary suppression. Secondary suppression ensures that the table’s row and column totals cannot be used to expose the values of the principal cells at risk. Masking can also be done for a row. If the total of a table’s row only includes a small number of statistical units (fewer than the threshold value), you must mask the row in its entirety without considering the number of statistical units in each cell of the row.
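The following sketch illustrates why secondary suppression is needed, using a single row with a published total. It is a deliberately simplified illustration with hypothetical function names and data: real secondary suppression must consider rows and columns together, and choosing the smallest visible cell as the secondary cell is only one possible heuristic.

```python
def primary_suppress(row, threshold=3):
    """Mask (set to None) positive frequencies below the threshold."""
    return [None if 0 < v < threshold else v for v in row]


def recover_from_total(masked_row, total):
    """If exactly one cell is masked, its value follows from the row total."""
    hidden = [i for i, v in enumerate(masked_row) if v is None]
    if len(hidden) != 1:
        return None
    return total - sum(v for v in masked_row if v is not None)


def secondary_suppress(row, masked_row):
    """If only one cell is masked, also mask the smallest visible positive cell."""
    hidden = [i for i, v in enumerate(masked_row) if v is None]
    candidates = [i for i, v in enumerate(masked_row) if v is not None and v > 0]
    if len(hidden) != 1 or not candidates:
        return masked_row
    extra = min(candidates, key=lambda i: row[i])
    return [None if i == extra else v for i, v in enumerate(masked_row)]


row = [12, 2, 7, 30]  # hypothetical frequencies
total = sum(row)

masked = primary_suppress(row)
leak = recover_from_total(masked, total)      # the masked value is exposed
protected = secondary_suppress(row, masked)
print(masked, leak, protected, recover_from_total(protected, total))
```

After primary suppression alone, the masked value can be calculated from the row total. Once a second cell is masked, the total only reveals the sum of the two masked cells, and the same care is still needed for the column totals.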
Reclassification removes the sensitive cells from a table by combining the categories that include those cells with the table’s other categories. Changing the categories usually means generalising the entire classification.
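As a rough sketch of reclassification, the snippet below merges categories according to a mapping and sums their frequencies. The function name, the categories and the counts are all hypothetical.

```python
from collections import Counter


def reclassify(counts, mapping):
    """Merge categories according to mapping and sum their frequencies.

    Categories that do not appear in mapping are kept unchanged.
    """
    merged = Counter()
    for category, n in counts.items():
        merged[mapping.get(category, category)] += n
    return dict(merged)


age_counts = {"under 20": 2, "20-29": 41, "30-39": 55}  # hypothetical
coarser = reclassify(
    age_counts,
    {"under 20": "under 30", "20-29": "under 30"},
)
print(coarser)
```

The sensitive category with two observations disappears into a broader class, at the cost of a coarser classification for the whole table.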
Another table protection method is to amend the values of the cells at risk of disclosure. They can be amended by rounding or replacing the original cell value with an approximate random value, for example.
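Two simple variants of this approach are sketched below, assuming non-negative integer frequencies and a rounding base of 5. Both the base and the function names are illustrative assumptions, not values prescribed by the guidelines.

```python
import random


def round_to_base(value, base=5):
    """Conventional rounding of a non-negative integer to the nearest multiple of base."""
    return base * ((value + base // 2) // base)


def random_round(value, base=5, rng=random):
    """Round down or up to a multiple of base.

    Rounding up happens with probability remainder / base, so the
    expected value of the result equals the original value.
    """
    remainder = value % base
    if remainder == 0:
        return value
    lower = value - remainder
    return lower + base if rng.random() < remainder / base else lower


cells = [12, 3, 20, 48]  # hypothetical cell frequencies
rounded = [round_to_base(v) for v in cells]
print(rounded)

rng = random.Random(0)  # seeded only to make the sketch repeatable
print([random_round(v, rng=rng) for v in cells])
```

Note that rounding cells and totals independently can make a table non-additive, which users of the protected table should be told about.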
In practice, protecting tables and other types of output in the remote access system means that the disclosure risk is eliminated from the outputs and tables sent for review. Their protection must include sufficiently generalised classifications or other designs that result in acceptable data protection for the output’s contents. The research results and tables sent to the result review must no longer be exposed to the risk of disclosure. The reviewer will not provide additional protection for the results. The protection and review procedure for research results is described further in section 3.4 Data protection for research results.
In Example 4, you can explore the practical issues of protecting tabular data.
Table 1 includes the number of people working in a certain profession by area and income bracket. For protection reasons, any cell with a value other than zero has been masked. Some null cells may also have been masked.
Is this sufficient protection? Is there a way to use the table to discover the values of the masked cells? Would an alternative tabulation method be preferable? If yes, what kind?
Income bracket | Area A | Area B | Area C | Area D | Areas, total |
---|---|---|---|---|---|
1 | 0 | x1 | x2 | 0 | 25 |
2 | x3 | 0 | 0 | x4 | 15 |
3 | x5 | 0 | 0 | x6 | 30 |
4 | 0 | x7 | x8 | x9 | 30 |
Income brackets, total | 35 | 10 | 15 | 40 | 100 |
Example solution:
Because this is a frequency table, the values cannot be negative. It appears null cells have not needed protection, as they remain visible in the table. The sums of the table’s rows and columns allow the following to be deduced:
Now that you know x1 = 10, you can calculate the following: x2 = 15, x7 = 0 and x8 = 0. Because 0 + x7 + x8 + x9 = 30, x9 = 30. The masking of cells x1, x2, x7, x8 and x9 was a wasted effort, because their exact values could be calculated even after masking.
Also note that if we assume the protection is based on a low-value threshold rule (less than 10) to find the principal cells to mask, none of the above cells were principal cells. However, this information can be used to disclose a group. Disclosing a group means that no individual observations in the table can be identified, but the property of an identifiable group is disclosed.
In this table, the disclosed property is that all persons in areas B and C belong to income bracket 1. Group disclosures are not always considered sensitive or needing protection, and taking them into account typically makes it more difficult to apply protection.
The remaining masked cells are presented in Table 2 as a sub-table of the original table.
Income bracket | Area A | Area D | Areas total |
---|---|---|---|
2 | x3 | x4 | 15 |
3 | x5 | x6 | 30 |
Income brackets total | 35 | 10 | 45 |
The sub-table’s row and column totals give the following ranges of variation for cell values x3 and x5:
If we are aware that the need for protection was determined with a threshold value of 5 (or lower), then neither x3 nor x5 would be a principal cell to protect. It follows that cell x4 or x6 is a principal cell, because protection has been applied. The table can be used to infer that the value of either cell is at most 10, but their exact values cannot be determined.
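The reasoning in this example can be checked mechanically. The sketch below encodes the row and column totals of Table 1 as constraints on the masked cells x1 to x9 and tightens their bounds by simple interval propagation. The function name is hypothetical, and the method is an assumption chosen for brevity: in general it yields valid but possibly loose bounds, whereas dedicated software would typically use linear programming.

```python
import math


def tighten_bounds(constraints, variables, max_rounds=100):
    """Narrow [low, high] bounds of non-negative cells from sum constraints.

    constraints: list of (cells, total) pairs meaning sum(cells) == total.
    """
    low = {v: 0 for v in variables}
    high = {v: math.inf for v in variables}
    for _ in range(max_rounds):
        changed = False
        for cells, total in constraints:
            for v in cells:
                others = [c for c in cells if c != v]
                new_high = total - sum(low[c] for c in others)
                new_low = total - sum(high[c] for c in others)
                if new_high < high[v]:
                    high[v], changed = new_high, True
                if new_low > low[v]:
                    low[v], changed = new_low, True
        if not changed:
            break
    return {v: (low[v], high[v]) for v in variables}


variables = [f"x{i}" for i in range(1, 10)]
constraints = [
    (("x1", "x2"), 25),        # income bracket 1
    (("x3", "x4"), 15),        # income bracket 2
    (("x5", "x6"), 30),        # income bracket 3
    (("x7", "x8", "x9"), 30),  # income bracket 4
    (("x3", "x5"), 35),        # area A
    (("x1", "x7"), 10),        # area B
    (("x2", "x8"), 15),        # area C
    (("x4", "x6", "x9"), 40),  # area D
]

bounds = tighten_bounds(constraints, variables)
for name in variables:
    print(name, bounds[name])
```

Cells whose lower and upper bounds coincide are fully disclosed by the totals. The remaining cells are only known to lie within an interval.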
Summary
Based on the above calculations and reasoning, the original questions could be answered thus:
Aside from the answers, this example has taught you the following:
The obligation of secrecy requires you to ensure that your research results contain no unit-level data or the possibility of their disclosure. The outputs you publish must meet the data protection requirements laid out in the Statistics Finland guidelines for protection of tabular data. For more information about the guidelines, see the instruction Data protection and result checking process (pdf).
As a rule, enterprise data must be protected so that each cell or group includes at least three (unweighted) observations. A dominance rule (1,75) must be applied alongside a threshold value rule for recent enterprise data (under 15 months from the reference date). Establishment-level data protection must also ensure enterprise-level protection, meaning each cell must have establishments from at least three different enterprises. Likewise, corporate group-level protection must be considered in all enterprise data that include information about group relationships.
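A cell-level check along these lines could look like the sketch below. It is an illustration under stated assumptions rather than the official procedure: the function name and the enterprise identifiers are hypothetical, establishment values are aggregated to enterprise level before the dominance rule (1,75) is applied, and whether the data count as recent is passed in as a flag instead of being derived from the reference date.

```python
from collections import defaultdict


def enterprise_cell_ok(establishments, is_recent, threshold=3, k=75):
    """Check one cell of establishment-level data.

    establishments: iterable of (enterprise_id, value) pairs.
    is_recent: True if the data are under 15 months from the reference date.
    """
    totals = defaultdict(float)
    for enterprise_id, value in establishments:
        totals[enterprise_id] += value  # aggregate to enterprise level

    # Threshold rule: at least `threshold` different enterprises.
    if len(totals) < threshold:
        return False

    # Dominance rule (1, k), applied to recent data only.
    if is_recent:
        cell_total = sum(totals.values())
        if cell_total > 0 and 100 * max(totals.values()) > k * cell_total:
            return False
    return True


# Three establishments, but only two different enterprises.
print(enterprise_cell_ok([("A", 10), ("A", 20), ("B", 5)], is_recent=False))
# Three enterprises, one of which contributes 80 per cent.
print(enterprise_cell_ok([("A", 80), ("B", 10), ("C", 10)], is_recent=True))
```

Corporate group relationships are not modelled here. Where the data include them, the same aggregation would need to be done at group level.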
Personal data must be protected with a cell threshold value of 3, and special attention must be paid to the sensitivity of the variables being tabulated. Combined employer-employee data must be protected at the personal and enterprise levels, meaning each cell of a table must have employees from at least three different enterprises. The data on self-employed workers included in tabular enterprise statistics are subject to the same data protection practices as other enterprise data.
Maximum and minimum are typically related to one observation. If this observation can be identified, you may not publish the maximum or the minimum.
Distribution points (excluding the maximum and minimum) are a special case where a table’s cell frequencies are equal to the number of observations between the distribution points. You can publish the distribution points if these numbers exceed the threshold value of 3.
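One way to check distribution points is to count the observations that fall between consecutive points, as in the sketch below. The function names are hypothetical, and two details are assumptions: an observation equal to a point is counted in the interval below it, and an interval passes when it contains at least `threshold` observations. If "exceed" is read strictly, replace `>=` with `>`.

```python
from bisect import bisect_right


def counts_between(values, points):
    """Number of observations in each interval defined by the distribution points.

    Includes the interval below the first point and above the last point.
    """
    data = sorted(values)
    edges = [bisect_right(data, p) for p in sorted(points)]
    bounds = [0] + edges + [len(data)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def points_publishable(values, points, threshold=3):
    """True if every interval holds at least `threshold` observations (assumed reading)."""
    return all(c >= threshold for c in counts_between(values, points))


observations = list(range(1, 13))  # hypothetical data: 1, 2, ..., 12
print(counts_between(observations, [3, 6, 9]))
print(points_publishable(observations, [3, 6, 9]))
print(points_publishable(observations, [1, 6]))
```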
Mode. Can be published if (almost) every observation has a different value.
Averages, other ratios and higher-moment summary statistics of a distribution (e.g. variance) can be published if they have been calculated from at least three observations.
If you are publishing proportions, the threshold value of 3 must be met by every group that forms the proportions. In other words, if you want to publish the statement that women account for 58 per cent of the total population, the 58 per cent of women and the 42 per cent of men must each include at least three people. It is not enough that the total population of women and men includes at least three people.
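The rule for proportions can be sketched as a function that returns the proportions only when every group meets the threshold. The function name and the counts are hypothetical.

```python
def proportions_if_safe(group_counts, threshold=3):
    """Return each group's proportion, or None if any group is below the threshold."""
    if any(n < threshold for n in group_counts.values()):
        return None
    total = sum(group_counts.values())
    return {group: n / total for group, n in group_counts.items()}


print(proportions_if_safe({"women": 58, "men": 42}))
print(proportions_if_safe({"women": 4, "men": 2}))
```

In the second call the total population has six people, but the proportions are withheld because the group of men has fewer than three.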
Index point figures, correlation coefficients and test quantities (t, F, χ², etc.) can usually be published if the calculations include a sufficient number of observations (minimum 10).
Regression models can be published in their entirety if the model includes a sufficient number of observations, and it is not a time series of observations about one enterprise or person. The model’s individual factors can usually be published.
Publishing figures and diagrams based on the data is allowed if individual points cannot expose an individual observation used to draw them. Submit your figures for review just like tables – clearly and precisely documented. Suitable image file formats include PNG, BMP, JPEG, TIFF, EPS, PS, PDF, SVG and WMF/EMF.
Bar charts and other diagrams used to present classified data are typically permitted for publication, as long as each of their classes includes a sufficient number of observations.
The information in these diagrams can usually be presented as a table, and hence is subject to the same data protection rules as other tabular data (see the Frequency and Magnitude Tables section above).
Distribution charts may include outliers or extreme values (extrema) that expose observation unit data. Distributions, histograms, and cumulative distribution functions are permitted with sufficient smoothing or sufficiently broad scales.
Scatter plots are typically used to present the values of two continuous variables, which may make them more problematic for data protection than the previous diagrams. If you use scatter plots, pay close attention to the nature of the data: sample size, data sensitivity, outliers, etc.
The exercises in the following galleries will give you practical tips for improving the data protection of your outputs and how to assess it.
Statistics Finland's Research Services use a manual checking procedure and a random checking procedure for research output.
If the contents are too unclear or extensive for the output’s data protection to be reviewed, the reviewer will reject it. For more information about the output data protection requirements and review procedure, read the Research Services rules and instructions. The rules and instructions are binding for all researchers who have signed an agreement for a research project or the use of the SISU microsimulation model. Please note that, despite the review process, you, as the researcher, are responsible for enforcing data protection in the research results you publish.
Different review procedures are applied to projects subject to manual checks, projects subject to random checks and the remote use of the SISU microsimulation model:
Research output produced in the remote access environment is checked before it is released, and you cannot transfer files from the remote access environment to your own local workstation independently. The data transfer is made upon separate request by email (tutkijapalvelut@stat.fi). The check takes place within one to two working days. Please take the Research Services’ resources into account when you submit your service request. By paying attention to the quality of the output files and restricting the number of review requests, you can make the review procedure significantly easier and faster. Research Services personnel resources are assigned to remote access system maintenance and related tasks based on daily demand. Try to anticipate your data needs and send review requests well ahead of time. All review requests and file transfers will be processed on the next working day. Responses to requests for corrections or clarifications regarding file contents will be processed on the next working day.
When you want to export output from the remote access environment, fill in the output export form on FIONA’s desktop. The output is then either checked manually in advance or, if you are subject to random checks, sent directly to your email. All output from new users is checked in advance. The probability of being subject to random checks rises as the user accumulates consecutive approved advance checks. Breaches and errors reset the situation, and the advance checks start again from the beginning. The sanction for serious and repeated breaches is removal from the random checking procedure, along with other measures resulting from data protection breaches. If you are unsure about the data protection of your output, contact the Research Services before the output is exported from the system.
Anticipate your data needs and send review requests well ahead of time. All review requests and file transfers will be processed on the next working day. Replies to requests for corrections or information regarding file contents will be processed on the next working day.
You can transfer research result files from the microsimulation remote access environment to your local workstation. Each user has a personal email folder, “Mail”, in the remote access environment, which you can use to transfer files to your local workstation.
Statistics Finland reviews the files in the microsimulation team’s email inbox afterwards. You must follow the Research Services’ instructions and rules for the remote access environment and its data transfers:
Make especially sure that the files transferred from the microsimulation environment contain absolutely no unit-level data or any possibility that they could be disclosed.
Please avoid sending the following kinds of outputs for review: