3 Statistical data protection of research data

3.1 Legal basis of data protection

Data protection is important for the sake of the subjects of the statistics, i.e. the observations included in the data. The disclosure of information regarding individuals, households, enterprises and other statistical units to third parties must be prevented.

Data protection includes but is not limited to:

all legislation and instructions concerning data protection
careful and planned processing of data at all stages of research
protection implementation methods (statistical disclosure control methods).

Researchers are responsible for data protection

As a researcher, you must do your part to protect the data you use against disclosure to third parties, including unit-level data, at the processing stage and when publishing the results of your research. Data protection regulations such as the EU’s General Data Protection Regulation (GDPR) and the Finnish Data Protection Act (1050/2018) safeguard people who are the subject of research. For more information about data protection in scientific research, see the website of the Office of the Data Protection Ombudsman on scientific research and data protection.

Statistics Finland is bound by the Statistics Act

Statistics Finland has a statutory right to collect register and survey data. The data we collect about society are very comprehensive and contain some highly sensitive information.

According to the Finnish Statistics Act, Statistics Finland may issue a licence to use the confidential data in its possession, originally collected for statistical purposes, for scientific research and statistical surveys of social conditions.

Section 13 of the Statistics Act specifies how Statistics Finland may disclose data collected for statistical purposes. The grounds in Section 13 state the following[1]:

“When releasing data, the protection of personal data and data regarding business and professional secrets must be ensured on a case-by-case basis with practical measures, such as by requiring sufficient data security measures and by arranging the necessary monitoring and tracking of the data use. [...] Because the end results of scientific research are usually public, it should always be separately made sure in connection with their publication that it would not be possible to identify the individual statistical units on which the research is based from the public end result of the research.”

[1] Government proposal HE 154/2012 for legislation to amend the Statistics Act and the Act for amending Sections 2 and 3 of the Act on rural economic activity statistics.

3.2 Obligation of secrecy

The unit-level data of Statistics Finland are only accessible with a user licence. These data may only be used by the person named in the user licence, and only for the purposes indicated in the user licence. It is prohibited to attempt to identify the data subjects included in the data.

By signing a user agreement and a pledge of secrecy concerning a research project or the SISU microsimulation model, you agree never to disclose or use for your own benefit any confidential information (that is, unit-level personal and business data included in the research data) that you discover during research.

3.3 Statistical disclosure control methods for tabular data

Statistical disclosure control methods are used to prepare (tabular) data for publication by modifying it to protect the data of individual data suppliers and statistical units (person, household, enterprise, establishment, etc.) from disclosure. Because tables are a common format for presenting research results, this section includes information about the concepts, disclosure risk and protection methods related to the statistical data protection of tabular data.

Research results can also be aggregated data other than a table, for example a graph or a single distribution parameter. You can find more information about the data protection rules applied to these in Rules and instructions of the Research Services and in section 3.4 Data protection for research results. Please note that data protection must be applied to all results to be published.

It is almost impossible to give detailed universal instructions for protecting tabular data, because there is a wide variety of tables. Tables may differ by content, structure, publishing concept and purpose of use. At worst, every table is a special case when it comes to protection. The better you account for the special features of each table, the better you can protect it while keeping the features that are essential for its intended purpose.

“Tabular data” refers to aggregated data arranged in a table format. Tables can be classified as frequency tables or magnitude tables.

In a frequency table, the value of each cell is the number of statistical units belonging to that cell.
In a magnitude table, the statistical unit values are the values of the tabulated variable, meaning the cell values are aggregates of the values of the statistical units belonging to each cell – typically sums or averages. Magnitude tables may be accompanied by cell frequencies.

Distribution parameters can also be presented in a single table. To protect such a table, you have to apply the data protection rules for distribution parameters. For more information, see Rules and instructions of the Research Services.

Disclosure from tabular data

In the context of tabular data, “disclosure” means the possibility of determining the identity or a property of a unit more precisely than if the table had not been published. Disclosure may be approximate or exact.

In magnitude tables, “disclosure” usually refers to a situation where the value of a certain statistical unit's tabulated variable can be estimated too accurately based on the table’s figures and structure. Approximate disclosure may be as harmful as exact disclosure, especially in the case of enterprise data.

Two-stage table protection process

The first stage includes an assessment of the need to protect the table: identifying the so-called sensitive cells at risk of disclosure based on the chosen sensitivity rule. The second stage is described in the section Tabular data protection methods.

Most commonly used sensitivity rules:

Threshold value – a cell is considered sensitive if it includes fewer statistical units than a predetermined threshold value.
Dominance (n,k rule) – a cell is considered sensitive if the n largest statistical units amount to at least k per cent of the cell’s total value.

In frequency tables, cells are considered sensitive if they only include a few statistical units, meaning the value combination of the cell’s categorical variables is rare. You can use the threshold value rule to identify these low-frequency cells as sensitive (see Example 1).

We recommend using the threshold value rule if only exact disclosure (e.g. exposing a person’s identity) needs to be protected against.

Example 1. Using a threshold value to determine the sensitive cells in a frequency table

You prepare a table where you cross-tabulate people’s age groups, municipalities of domicile and marital status. For cell sensitivity analysis, you have decided to use a threshold value of 3. After tabulation, you realise that there are only two widows under 20 years of age in municipality X.

Therefore, a combination of the following categorical variables:

age group: under 20
marital status: widow
municipality: X

is too rare according to the threshold rule, as the cell frequency, 2, is lower than the threshold value, 3. The risk of disclosure is too great for the persons belonging to this cell, so the data in the cell must be protected before the table is published.
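
The threshold check in Example 1 can be sketched in a few lines of Python. This is only an illustration: the function name find_sensitive_cells and all cell frequencies except the widow cell are invented here, and treating empty cells as non-sensitive is an assumption rather than a stated rule.

```python
# Hypothetical cell frequencies keyed by (age group, marital status, municipality).
# Only the count of 2 for (under 20, widow, X) comes from Example 1; the rest are invented.
cell_counts = {
    ("under 20", "unmarried", "X"): 412,
    ("under 20", "married", "X"): 7,
    ("under 20", "widow", "X"): 2,
    ("20-39", "widow", "X"): 15,
}


def find_sensitive_cells(counts, threshold=3):
    """Return the cells whose frequency is non-zero but below the threshold.

    Assumption: empty cells are not flagged here.
    """
    return [cell for cell, n in counts.items() if 0 < n < threshold]


sensitive = find_sensitive_cells(cell_counts, threshold=3)
print(sensitive)
```

With a threshold value of 3, only the cell with two widows under 20 in municipality X is flagged.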

In magnitude tables, cells can also be considered sensitive if one or only a few units dominate the cell’s value (their values are considerably greater than those of the other units). This may allow the values of the dominant units to be determined too precisely (even if not exactly). Protection against unacceptably accurate estimation means protection against approximate disclosure, and it can be implemented by using a dominance rule to determine which cells are sensitive (see Example 2 and Example 3).

You can use multiple sensitivity rules in parallel. In this case, a cell is considered sensitive if it matches any of the sensitivity rules.

Example 2. Using a dominance rule to determine which cells to protect (part 1)

A cell in a certain table has the total value X, consisting of three observations with the following values:

x1 = 59
x2 = 27
x3 = 14.

The cell’s total value is therefore X = x1 + x2 + x3 = 59 + 27 + 14 = 100. You need to analyse whether one or more observations dominate the value of X. You have chosen to use the dominance rule with n = 1 and k = 75; in other words, you want to know whether the largest observation in the cell contributes at least 75 per cent of the cell’s value. In this cell, the largest observation x1 contributes x1 / X = 59 / 100 = 0.59 = 59% of the cell’s total value. Because 59 < 75, the largest observation does not dominate the cell’s value according to your dominance rule (and neither can any smaller observation), so the cell does not need protection.
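
The dominance check in Example 2 can be written as a small Python function. This is a minimal sketch: the function name dominates is invented for illustration, and it assumes that all observation values are non-negative.

```python
def dominates(values, n, k):
    """Dominance (n,k) rule: True if the n largest values contribute
    at least k per cent of the cell total."""
    total = sum(values)
    if total <= 0:
        return False  # assumption: cells without a positive total are not tested
    top_n = sum(sorted(values, reverse=True)[:n])
    # Compare 100 * top_n with k * total to avoid floating-point division.
    return 100 * top_n >= k * total


cell = [59, 27, 14]                       # observations from Example 2
share = 100 * max(cell) / sum(cell)       # share of the largest observation, in per cent
needs_protection = dominates(cell, n=1, k=75)
print(share, needs_protection)
```

Here share is 59 per cent and needs_protection is False, in line with the conclusion of Example 2.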

Example 3. Using a dominance rule to determine which cells to protect (part 2)

A cell in a certain table has the total value X, consisting of twelve observations. The largest observation is x1 = 61 and the second-largest is x2 = 20. The other ten observations, x3 to x12, have a total value of 19.

Should the cell be protected, if:

a. you use the dominance rule n = 1 and k = 60
b. you use the dominance rule (2,90)
c. you use the threshold value 3 and the dominance rule (1,60)?

Solution: The cell’s total value is X = 61 + 20 + 19 = 100.

a. The largest observation is x1 / X = 61 / 100 = 0.61 = 61% of the cell’s total value. Because 61 > 60, you must protect the cell.

b. Your two largest observations contribute (x1 + x2) / X = (61 + 20) / 100 = 0.81 = 81% of the cell’s total value. Because 81 < 90, the cell needs no protection.

c. If two sensitivity rules are applied, a cell must be protected if it meets the criteria of at least one rule. As in “a” above, the cell must be protected according to the dominance rule (1,60). It follows that the cell must be protected when both the threshold value rule (threshold value 3) and the dominance rule (1,60) are used. On its own, a threshold value of 3 would not require the cell’s protection, because the cell includes twelve observations.
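
The three cases of Example 3 can be checked with a short Python sketch that applies the rules in parallel. The function name is_sensitive is invented for illustration. Example 3 only gives the total of the ten smallest observations (19), so their individual values below are made up; positive observation values are assumed.

```python
def is_sensitive(values, threshold=None, dominance=None):
    """Return the names of the rules that flag the cell (empty list = not sensitive).

    threshold: minimum number of observations, or None to skip the rule.
    dominance: tuple (n, k), or None to skip the rule.
    """
    reasons = []
    if threshold is not None and len(values) < threshold:
        reasons.append("threshold")
    if dominance is not None:
        n, k = dominance
        top_n = sum(sorted(values, reverse=True)[:n])
        if 100 * top_n >= k * sum(values):
            reasons.append("dominance")
    return reasons


# x1 = 61 and x2 = 20 come from Example 3; the split of the remaining 19
# over ten observations is invented.
cell = [61, 20] + [2] * 9 + [1]

case_a = is_sensitive(cell, dominance=(1, 60))
case_b = is_sensitive(cell, dominance=(2, 90))
case_c = is_sensitive(cell, threshold=3, dominance=(1, 60))
print(case_a, case_b, case_c)
```

Returning the list of matching rules, rather than a plain yes or no, makes it easier to document why a cell was protected.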

Tabular data protection methods

In the second stage of the table protection process, the data in sensitive cells are protected by applying the selected protection method. The first stage, the protection need assessment, has been described above in the Disclosure from tabular data section. The primary criteria for selecting a protection method are achieving the required level of protection and preserving the essential properties of the table. The table must be protected to a sufficient degree but remain useful even in its protected form.

The choice of method is often influenced by the available resources such as time and access to protection software. In addition, the protection method should be transparent: the users of the protected table should understand the protection method in general and therefore be able to account for the changes introduced by the protection method to the table.

Common table protection methods include masking and reclassification

Masking involves suppressing the principal cells at risk of disclosure (primary suppression) and applying secondary suppression. Secondary suppression ensures that the table’s row and column totals cannot be used to expose the values of the principal cells at risk. Masking can also be done for a row. If the total of a table’s row only includes a small number of statistical units (fewer than the threshold value), you must mask the row in its entirety without considering the number of statistical units in each cell of the row.

Reclassification removes the sensitive cells from a table by combining the categories that include those cells with the table’s other categories. Changing the categories usually means generalising the entire classification.

Another table protection method is to amend the values of the cells at risk of disclosure. They can be amended by rounding or replacing the original cell value with an approximate random value, for example.
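
As an illustration of amending cell values, the following Python sketch rounds frequencies to the nearest multiple of a rounding base. The base of 5, the function name round_to_base and the example row are assumptions made for this sketch; the source does not prescribe a particular base or rounding scheme.

```python
def round_to_base(value, base=5):
    """Round a non-negative integer to the nearest multiple of base.

    Uses integer arithmetic; with an even base, exact ties round up.
    """
    return (value + base // 2) // base * base


row = [21, 2, 5, 9, 5]                      # invented cell frequencies
rounded = [round_to_base(v) for v in row]
rounded_total = round_to_base(sum(row))
print(rounded, rounded_total)
```

Note that rounded cells do not, in general, add up to the rounded total, so the users of the table should be told that rounding has been applied.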

In practice, protecting tables and other types of output in the remote access system means that the disclosure risk is eliminated from the outputs and tables sent for review. Their protection must include sufficiently generalised classifications or other designs that result in acceptable data protection for the output’s contents. The research results and tables sent to the result review must no longer be exposed to the risk of disclosure. The reviewer will not provide additional protection for the results. The protection and review procedure for research results is described further in section 3.4 Data protection for research results.

In Example 4, you can explore the practical issues of protecting tabular data.

Example 4. Assessing the protection of a table

Table 1 includes the number of people working in a certain profession by area and income bracket. For protection reasons, any cell with a value other than zero has been masked. Some null cells may also have been masked.

Is this sufficient protection? Is there a way to use the table to discover the values of the masked cells? Would an alternative tabulation method be preferable? If yes, what kind?

Table 1. Practising professionals by income bracket and area
Income bracket	Area A	Area B	Area C	Area D	Areas, total
1	0	x1	x2	0	25
2	x3	0	0	x4	15
3	x5	0	0	x6	30
4	0	x7	x8	x9	30
Income brackets, total	35	10	15	40	100

Example solution:
Because this is a frequency table, the values cannot be negative. It appears null cells have not needed protection, as they remain visible in the table. The sums of the table’s rows and columns allow the following to be deduced:

x1 + x7 = 10 therefore 0 ≤ x1 ≤ 10
x2 + x8 = 15 therefore 0 ≤ x2 ≤ 15
Because x1 + x2 = 25 and x2 ≤ 15, x1 = 25 - x2 ≥ 25 - 15 = 10, and therefore x1 ≥ 10
Because x1 ≤ 10 and x1 ≥ 10 are true, x1 = 10 must be true.

Now that you know x1 = 10, you can calculate the following: x2 = 15, x7 = 0 and x8 = 0. Because 0 + x7 + x8 + x9 = 30, x9 = 30. The masking of cells x1, x2, x7, x8 and x9 was a wasted effort, because their exact values could be calculated even after masking.

Also note that if we assume the protection is based on a low-value threshold rule (less than 10) to find the principal cells to mask, none of the above cells were principal cells. However, this information can be used to disclose a group. Disclosing a group means that no individual observations in the table can be identified, but the property of an identifiable group is disclosed.

In this table, the disclosed property is that all persons in areas B and C belong to income bracket 1. Group disclosures are not always considered sensitive or needing protection, and taking them into account typically makes it more difficult to apply protection.

The remaining masked cells are presented in Table 2 as a sub-table of the original table.

Table 2. Practising professionals by income bracket and area, sub-table
Income bracket	Area A	Area D	Areas total
2	x3	x4	15
3	x5	x6	30
Income brackets total	35	10	45

The sub-table’s row and column totals give the following ranges of variation for cell values x3 and x5:

5 ≤ x3 ≤ 15
20 ≤ x5 ≤ 30

If we are aware that the need for protection was determined with a threshold value of 5 (or lower), then neither x3 nor x5 would be a principal cell to protect. It follows that cell x4 or x6 is a principal cell, because protection has been applied. The table can be used to infer that the value of either cell is at most 10, but their exact values cannot be determined.
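
The deductions in Example 4 can be reproduced mechanically. The Python sketch below repeatedly tightens a lower and an upper bound for each masked cell of Table 1 using the row and column totals. The function name tighten_bounds is invented for illustration, and this simple propagation is not guaranteed to find the tightest possible bounds for every table, although it reproduces the ranges derived above.

```python
def tighten_bounds(variables, constraints, max_value):
    """Shrink [low, high] for each masked cell until the totals give nothing more.

    constraints: list of (cells, total) where the cells must sum to total.
    Cell values are assumed to be non-negative frequencies.
    """
    bounds = {v: [0, max_value] for v in variables}
    changed = True
    while changed:
        changed = False
        for cells, total in constraints:
            for cell in cells:
                others = [c for c in cells if c != cell]
                low = total - sum(bounds[c][1] for c in others)
                high = total - sum(bounds[c][0] for c in others)
                if low > bounds[cell][0]:
                    bounds[cell][0] = low
                    changed = True
                if high < bounds[cell][1]:
                    bounds[cell][1] = high
                    changed = True
    return bounds


# Rows and columns of Table 1, restricted to the masked cells
# (the visible zeros drop out of the sums).
constraints = [
    (["x1", "x2"], 25),        # income bracket 1
    (["x3", "x4"], 15),        # income bracket 2
    (["x5", "x6"], 30),        # income bracket 3
    (["x7", "x8", "x9"], 30),  # income bracket 4
    (["x3", "x5"], 35),        # area A
    (["x1", "x7"], 10),        # area B
    (["x2", "x8"], 15),        # area C
    (["x4", "x6", "x9"], 40),  # area D
]
variables = [f"x{i}" for i in range(1, 10)]
bounds = tighten_bounds(variables, constraints, max_value=100)
for name in variables:
    print(name, bounds[name])
```

A cell whose lower and upper bounds coincide is disclosed exactly; for the other cells, the width of the range shows how much uncertainty the masking actually leaves.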

Summary

Based on the above calculations and reasoning, the original questions could be answered thus:

Is this sufficient protection? Yes, if protection against group exposure is not required.
Is there a way to use the table to discover the values of the masked cells? Yes; cell values x1, x2, x7, x8 and x9 can be disclosed exactly. Ranges of variation can be calculated for the rest.
Would an alternative tabulation method be preferable? What kind? One option is to only present the marginal distributions (column totals). This also avoids group exposure. Alternatively, you could use different categories for the areas and income brackets. The usefulness of a table is ultimately about its purpose (which was left unstated in this case).

Aside from the answers, this example has taught you the following:

If you use masking for data protection, choose your secondary cells carefully to avoid overmasking (hiding data unnecessarily).
If the value of your threshold value rule is discovered, it can help undo the protection.
On the other hand, if the protection design is good, knowing the threshold value may not lead to the disclosure of exact cell frequencies. Cell frequencies x3, x4, x5 and x6 in this example had several alternatives, even when the threshold value was known.
At the same time, the locations of the potentially sensitive cells could be (partly) inferred when the threshold value was known.
Group disclosures from statistics can be more difficult to prevent than the disclosure of individual observations. There is a theoretical possibility of group exposure in any table with null cells.

3.4 Data protection for research results

The obligation of secrecy requires you to ensure that your research results contain no unit-level data or the possibility of their disclosure. The outputs you publish must meet the data protection requirements laid out in the Statistics Finland guidelines for protection of tabular data. For more information about the guidelines, see the instruction Data protection and result checking process (pdf).

Frequency and magnitude tables

As a rule, enterprise data must be protected so that each cell or group includes at least three (unweighted) observations. A dominance rule (1,75) must be applied alongside a threshold value rule for recent enterprise data (under 15 months from the reference date). Establishment-level data protection must also ensure enterprise-level protection, meaning each cell must have establishments from at least three different enterprises. Likewise, corporate group-level protection must be considered in all enterprise data that include information about group relationships.

Personal data must be protected with a cell threshold value of 3, and special attention must be paid to the sensitivity of the variables being tabulated. Combined employer-employee data must be protected at the personal and enterprise levels, meaning each cell of a table must have employees from at least three different enterprises. The data on self-employed workers included in tabular enterprise statistics are subject to the same data protection practices as other enterprise data.
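
For combined employer-employee data, the rule above means counting both persons and distinct enterprises in each cell. The following Python sketch shows one way to do this. The records, the cell labels and the function name cells_failing_threshold are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical employer-employee records: (cell label, person id, enterprise id).
records = [
    ("region A / manufacturing", "p1", "e1"),
    ("region A / manufacturing", "p2", "e1"),
    ("region A / manufacturing", "p3", "e2"),
    ("region A / manufacturing", "p4", "e3"),
    ("region B / manufacturing", "p5", "e4"),
    ("region B / manufacturing", "p6", "e4"),
    ("region B / manufacturing", "p7", "e4"),
    ("region B / manufacturing", "p8", "e5"),
]


def cells_failing_threshold(rows, threshold=3):
    """Return cells with fewer than `threshold` persons or distinct enterprises."""
    persons = defaultdict(set)
    enterprises = defaultdict(set)
    for cell, person_id, enterprise_id in rows:
        persons[cell].add(person_id)
        enterprises[cell].add(enterprise_id)
    return sorted(
        cell
        for cell in persons
        if len(persons[cell]) < threshold or len(enterprises[cell]) < threshold
    )


failing = cells_failing_threshold(records)
print(failing)
```

In this invented data, the region B cell includes four persons but only two enterprises, so it fails the enterprise-level check even though it passes the personal-level threshold.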

Distribution key figures

Maximum and minimum are typically related to one observation. If this observation can be identified, you may not publish the maximum or the minimum.

Distribution points (excluding the maximum and minimum) are a special case where a table’s cell frequencies are equal to the number of observations between the distribution points. You can publish the distribution points if each of these numbers is at least the threshold value of 3.

Mode. Can be published if (almost) every observation has a different value.

Average, other ratios and higher moments of distribution key figures (e.g. variance) can be published if they have been calculated from at least three observations.

If you are publishing proportions, the threshold value of 3 must be met by all groups that form the proportions. In other words, if you want to publish the statement that women account for 58 per cent of the total population, the 58 per cent of women and the 42 per cent of men must each include at least three people. It is not enough that the total population of women and men includes at least three people.
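
The rule for proportions can be sketched as follows. The group sizes and the function names are invented for illustration; only the threshold value of 3 and the 58 and 42 per cent shares come from the text above.

```python
def proportions_publishable(group_counts, threshold=3):
    """True only if every group behind the proportions has at least `threshold` units."""
    return all(n >= threshold for n in group_counts.values())


def proportions(group_counts):
    """Percentage share of each group, rounded to whole per cent."""
    total = sum(group_counts.values())
    return {group: round(100 * n / total) for group, n in group_counts.items()}


large = {"women": 29, "men": 21}   # invented counts giving 58 % and 42 %
small = {"women": 4, "men": 2}     # six people in total, but only two men

print(proportions(large), proportions_publishable(large))
print(proportions(small), proportions_publishable(small))
```

The second group fails the check even though the total population has six people, because one of the groups forming the proportions has fewer than three.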

Other numerical outputs

Index point figures, correlation coefficients and test quantities (t, F, χ², etc.) can usually be published if the calculations include a sufficient number of observations (minimum 10).

Regression models can be published in their entirety if the model includes a sufficient number of observations, and it is not a time series of observations about one enterprise or person. The model’s individual factors can usually be published.

Figures and diagrams

Publishing figures and diagrams based on the data is allowed if individual points cannot expose an individual observation used to draw them. Submit your figures for review just like tables – clearly and precisely documented. Suitable image file formats include PNG, BMP, JPEG, TIFF, EPS, PS, PDF, SVG and WMF/EMF.

Bar charts and other diagrams used to present classified data are typically permitted for publication, as long as each of their classes includes a sufficient number of observations.

The information in these diagrams can usually be presented as a table, and hence is subject to the same data protection rules as other tabular data (see the Frequency and magnitude tables section above).

Distribution charts may include outliers or extreme values (extremum) that expose observation unit data. Distributions, histograms, and cumulative distribution functions are permitted with sufficient smoothing or sufficiently broad scales.

Scatter plots are typically used to present the values of two continuous variables, which may make them more problematic for data protection than the previous diagrams. If you use scatter plots, pay close attention to the nature of the data: sample size, data sensitivity, outliers, etc.

The exercises in the following galleries will give you practical tips for improving the data protection of your outputs and how to assess it.

Learn more about data protection and browse the gallery (1)

Practise improving the data protection of outputs: Dentists with criminal records

You have prepared the three tables below, which indicate the number of dentists in areas A and B, classified by gender and whether the person has a criminal record. In total, there are 68 dentists in these areas.

Table 1. Gender and area
Gender	Area A	Area B	Areas total
Female	21	12	33
Male	16	19	35
Genders total	37	31	68

Table 2. Gender and criminal record yes/no
Gender	Criminal record yes	Criminal record no	Total
Female	23	10	33
Male	8	27	35
Genders total	31	37	68

Table 3. Area and criminal record yes/no
Area	Criminal record yes	Criminal record no	Total
Area A	11	26	37
Area B	20	11	31
Total	31	37	68

Now consider the following:

Is the data protection of the above tables good enough to pass a review?
How is data protection affected if you send in the two-dimensional tables or one three-dimensional table that includes area, gender and criminal record?

Answer

This example aims to show how hidden data protection risks may exist in “linked tables” that are produced from one population sample and contain some of the same variables and marginal distributions.

The variables of the above tables could be used to produce the “three-dimensional” table below (gender × criminal record × area). The values of cells that did not appear in the original two-variable cross-tabulations are indicated with an X.

Table 4. Gender, criminal record and area
Gender	Criminal record	Area A	Area B	Areas total
Female	yes	x	x	23
Male	yes	x	x	8
Genders total	yes	11	20	31
Female	no	x	x	10
Male	no	x	x	27
Genders total	no	26	11	37
Female	total	21	12	33
Male	total	16	19	35
Genders total	total	37	31	68

Looking at Table 4 above, you will notice the following:

Area B has a total of 20 dentists with a criminal record.
Area B also has a total of 12 dentists who are women, and therefore even if all women had a criminal record, then a minimum of eight dentists who are men would also have to have a criminal record.
There are only eight dentists who are men who have a criminal record. Based on the above, it can be inferred that all dentists who are men and have a criminal record are located in area B.

Answer (cont.)

After concluding the above, you can find some of the masked (x) values (marked with a “(t)”):

Table 5. Gender, criminal record and area (some masked cell values have been calculated)
Gender	Criminal record	Area A	Area B	Areas total
Female	yes	x	x	23
Male	yes	0 (t)	8 (t)	8
Genders total	yes	11	20	31
Female	no	x	x	10
Male	no	x	x	27
Genders total	no	26	11	37
Female	total	21	12	33
Male	total	16	19	35
Genders total	total	37	31	68

Answer (cont.)

You can now easily calculate the values for the rest of the cells in the three-dimensional table.

Table 6. Gender, criminal record and area (all masked cell values have been calculated)
Gender	Criminal record	Area A	Area B	Areas total
Female	yes	11 (t)	12 (t)	23
Male	yes	0 (t)	8 (t)	8
Genders total	yes	11	20	31
Female	no	10 (t)	0 (t)	10
Male	no	16 (t)	11 (t)	27
Genders total	no	26	11	37
Female	total	21	12	33
Male	total	16	19	35
Genders total	total	37	31	68

The above table discloses the following sensitive data:

all dentists in area B who are women have a criminal record, and
all dentists who are men and have a criminal record are located in area B.

We have learned that it is usually easier to observe data protection risks by looking at one multi-dimensional table that includes all necessary variables, than looking at several smaller linked tables.
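
The inference in this exercise can also be verified by enumeration. The Python sketch below lists every three-dimensional table with non-negative cell frequencies that reproduces Tables 1 to 3. The variable and function names are invented for illustration, and the shortcut of looping over a single free cell only works for a 2 × 2 × 2 table like this one.

```python
# Published two-dimensional tables (Tables 1 to 3), F = female, M = male.
gender_area = {("F", "A"): 21, ("F", "B"): 12, ("M", "A"): 16, ("M", "B"): 19}
gender_record = {("F", "yes"): 23, ("F", "no"): 10, ("M", "yes"): 8, ("M", "no"): 27}
area_record = {("A", "yes"): 11, ("A", "no"): 26, ("B", "yes"): 20, ("B", "no"): 11}


def consistent_tables():
    """Enumerate all non-negative 2x2x2 tables that match the three published tables."""
    solutions = []
    for t in range(gender_record[("F", "yes")] + 1):  # candidate for cell (F, yes, A)
        cells = {("F", "yes", "A"): t}
        cells[("F", "yes", "B")] = gender_record[("F", "yes")] - t
        cells[("M", "yes", "A")] = area_record[("A", "yes")] - t
        cells[("M", "yes", "B")] = gender_record[("M", "yes")] - cells[("M", "yes", "A")]
        for g in ("F", "M"):
            for a in ("A", "B"):
                cells[(g, "no", a)] = gender_area[(g, a)] - cells[(g, "yes", a)]
        if min(cells.values()) < 0:
            continue
        # Check the margins that were not used to derive the cells.
        matches_gender_record = all(
            sum(v for (g, r, a), v in cells.items() if (g, r) == key) == total
            for key, total in gender_record.items()
        )
        matches_area_record = all(
            sum(v for (g, r, a), v in cells.items() if (a, r) == key) == total
            for key, total in area_record.items()
        )
        if matches_gender_record and matches_area_record:
            solutions.append(cells)
    return solutions


solutions = consistent_tables()
print(len(solutions))
for cell, value in solutions[0].items():
    print(cell, value)
```

Only one table is consistent with the three published tables, which is why every masked cell can be calculated exactly.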

Learn more about data protection and browse the gallery (2)

Practise assessing the data protection of outputs: Number of enterprises receiving subsidies by region

Now test if you can assess the data protection of results. Based on your data, you have prepared the tables below, which indicate the number of subsidised enterprises, and you are planning to submit them for review.

Consider the following:

Do these tables meet the data protection requirements for outputs?
What are the data protection risks?
If the tables fail to meet data protection requirements, how should they be modified?

Table 1. Number of enterprises receiving subsidies and amount of subsidies by region (MK = region)
Year	MK A	MK B	MK C	MK D	MK E	MK F–S	Whole country
2015	21	2	5	9	5	396	438
2016	8	1	6	9	3	460	487
2017	18	2	10	10	1	592	633
2018	17	3	6	7	7	559	599
2019	15	1	6	12	9	560	603

Table 2. Benefits (EUR 1,000) by region (MK = region)
Year	MK A	MK B	MK C	MK D	MK E	MK F–S	Whole country
2015	3 552	183	1 317	2 016	355	120 124	127 547
2016	855	580	650	761	307	145 460	148 613
2017	2 623	125	851	1 577	15	146 335	151 526
2018	3 508	153	476	1 315	275	158 581	164 308
2019	1 928	15	653	1 467	1 247	174 478	179 788

Answer

The frequency table includes low cell frequencies – they will not pass a review. For example, region B only had one subsidised enterprise in 2019. The risk exists that another (public) source could be used to identify the enterprise and discover the amount of subsidy received. You may also produce additional tables from the same data, which could be used together to disclose the other enterprises that received subsidies.

Table 1. Number of enterprises receiving subsidies and amount of subsidies by region (MK = region)
Year	MK A	MK B	MK C	MK D	MK E	MK F–S	Whole country
2015	21	2	5	9	5	396	438
2016	8	1	6	9	3	460	487
2017	18	2	10	10	1	592	633
2018	17	3	6	7	7	559	599
2019	15	1	6	12	9	560	603

The tables do not indicate if they use a limited group of enterprises (e.g. only one industry or sector) or every enterprise in Finland. Indicating an enterprise’s industry (with its region) may considerably increase the risk of disclosure, so the reviewer may reject the tables due to insufficient documentation alone.

You can improve the data protection of these tables by combining regions or masking small cell frequencies. If you choose to use masking, you must also consider secondary masking. For example, if you only masked the 2019 figures for region B, they could be recalculated by deducting the figures of the other regions from the nationwide total.
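
The need for secondary masking can be illustrated with the frequencies of Table 1. The Python sketch below masks every cell below a threshold value of 3 and then checks which rows have exactly one masked cell, because such a cell can be recalculated from the row total. The function names mask_row and recover_single_masks are invented for illustration, and the sketch only checks rows, as the table has no column totals.

```python
regions = ["MK A", "MK B", "MK C", "MK D", "MK E", "MK F-S"]
counts = {
    2015: [21, 2, 5, 9, 5, 396],
    2016: [8, 1, 6, 9, 3, 460],
    2017: [18, 2, 10, 10, 1, 592],
    2018: [17, 3, 6, 7, 7, 559],
    2019: [15, 1, 6, 12, 9, 560],
}
totals = {2015: 438, 2016: 487, 2017: 633, 2018: 599, 2019: 603}


def mask_row(values, threshold=3):
    """Primary masking: replace non-zero cells below the threshold with None."""
    return [None if 0 < v < threshold else v for v in values]


def recover_single_masks(masked, total):
    """If exactly one cell in a row is masked, the published total reveals it."""
    hidden = [i for i, v in enumerate(masked) if v is None]
    if len(hidden) != 1:
        return {}
    visible_sum = sum(v for v in masked if v is not None)
    return {regions[hidden[0]]: total - visible_sum}


recovered = {
    year: recover_single_masks(mask_row(values), totals[year])
    for year, values in counts.items()
}
print(recovered)
```

In the rows where only one cell is masked, the masked frequency is recovered exactly, so at least one additional cell in each of those rows would have to be masked as well, or the regions combined.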

3.5 Result review procedure

Statistics Finland's Research Services use a manual checking procedure and a random checking procedure for research output.

You are responsible for the output you send in for review meeting the data protection requirements.
The outputs must be easy to interpret.
The reviewer must be able to understand the main content and variables used in the outputs.
The number of observations per cell must be indicated in tables, as well as the number of observations used for calculating estimates and key figures.

If the contents are too unclear or expansive to review the output’s data protection, the reviewer will reject it. For more information about the output data protection requirements and review procedure, read the Research Services rules and instructions. The rules and instructions are binding for all researchers who have signed an agreement for a research project or the use of the SISU microsimulation model. Please note that, despite the review process, you, as the researcher, are responsible for enforcing data protection in the research results you publish.

Different review procedures are applied to projects subject to manual checks, projects subject to random checks and the remote use of the SISU microsimulation model:

Manual checks: output produced in remote access use is reviewed for data protection before it is released to the researcher. Files cannot be transferred independently from the remote environment to the local workstation; transfers require a separate email request.
SISU microsimulation model and random checks: the user transfers the research output files directly to their local workstation without an advance check.

Manual checking procedure

Research output produced in remote access use is checked before it is released, and files cannot be transferred independently from the remote access environment to your own local workstation. The data transfer is made upon a separate request by email (tutkijapalvelut@stat.fi). The check takes place within one to two working days. All review requests and file transfers will be processed on the next working day. Responses to requests for corrections or clarifications regarding file contents will also be processed on the next working day.

Please take the Research Services’ resources into account when you submit your service request. Research Services personnel resources are assigned to remote access system maintenance and related tasks based on daily demand. By paying attention to the quality of the output files and restricting the number of review requests, you can make the review procedure significantly easier and faster. Try to anticipate your data needs and send review requests well ahead of time.

Keep the number and size of output files reasonable. In practice, a reasonable number means only a few individual files that are meant to be published, not dozens of different versions or long log files intended for comparison by a group of writers, for example.
Only request a review for files that you need for publishing or work outside the remote access system.
Only include essential tables and data in your output files. The files are reviewed as a whole, so extensive log files will complicate and lengthen the review process.
Make sure that the contents of the output files meet all data protection requirements. Be proactive, and remove or edit any data that poses a risk for data protection.
Make sure that the format of the output files (image files especially) is as described in our rules and instructions. If you absolutely must have an output file in a format not allowed in the rules, state your reasoning in the review request and include instructions for opening the file format.
Your review request must include a description of the files to review and indicate what data were used to calculate the results. You may only omit the description from the review request if the data content is self-evident.
Copy the files to your project folder’s review folder (...\out).
Send file and image review requests by email to tutkijapalvelut@stat.fi. Include your project code and location details.
The results of the data protection review will be sent to the email address you have given.
Allow two working days for the review.
The files must be easy to interpret – if additional information is requested, the review will be delayed until the next working day.

Random checking procedure

When you want to export output from the remote access environment, fill in the export form on FIONA’s desktop. The output is then either checked manually in advance or, if you are subject to random checks, sent directly to your email. All output of new users is checked in advance. The probability of being moved to random checks rises as the user accumulates consecutive approved advance checks. Breaches and errors reset the situation, and the advance checks start again from the beginning. Serious and repeated breaches lead to removal from the random checking procedure and to other measures resulting from data protection breaches. If you are unsure about the data protection of your output, contact Research Services before you export the output from the system.

Anticipate your data needs and send review requests well ahead of time. All review requests and file transfers will be processed on the next working day. Replies to requests for corrections or further information regarding file contents will also be processed on the next working day.

Microsimulation review procedure

You can transfer research result files from the microsimulation remote access environment to your local workstation. Each user has a personal email folder, “Mail”, in the remote access environment, which you can use to transfer files to your local workstation.

Copy the files you need (from the User, Forum or Admin folder) to your personal Mail folder.
After about two minutes, the copied file will be automatically sent to your personal email inbox and the email inbox of the Statistics Finland microsimulation team (mikrosimulointi@stat.fi).
For every file copied to the Mail folder, a separate email will be sent with the copied file attached. The attached files may have a maximum file size of 1 MB (megabyte).
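
The size limit above can be checked before any file is copied to the Mail folder. The following is a minimal sketch, not an official tool: it assumes that Python is available in your environment, that "1 MB" is read as 1,000,000 bytes (the stricter reading, since the instructions do not define it), and the function name files_over_limit is hypothetical.

```python
from pathlib import Path

# Assumption: "1 MB" is read as 1,000,000 bytes, the stricter interpretation.
MAX_ATTACHMENT_BYTES = 1_000_000


def files_over_limit(paths, limit=MAX_ATTACHMENT_BYTES):
    """Return the paths whose file size exceeds the limit, in input order."""
    return [p for p in map(Path, paths) if p.stat().st_size > limit]
```

An empty result means that every listed file fits within the assumed limit. The check says nothing about data protection, which you must still verify yourself.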

Statistics Finland reviews the files in the microsimulation team’s email inbox afterwards. You must follow the Research Services’ instructions and rules for the remote access environment and its data transfers:

protection of the transferred research results
size and other restrictions for files
clarity of information
publishing of results.

Make especially sure that the files transferred from the microsimulation environment contain no unit-level data and leave no possibility of unit-level data being disclosed.

3.6 Unreviewable outputs

Please avoid sending the following kinds of outputs for review:

All files sent for review must meet the same criteria as tables and diagrams intended for publication. For example, only essential log files or published parts thereof should be sent for review.
Outputs that include unit-level information or data will be rejected. The system only allows aggregated information output.
If your output is too poorly documented to assess its data protection, the reviewer has no choice but to reject it.
Completely prohibited types of images include diagrams that show outlier unit values and scatter plots that can be used to infer the data of an industry’s largest enterprise. Plotting functions often mark outliers in scatter plots automatically, so check for them and leave them out of the images intended for publication.
Certain image files (e.g. Stata GPH files) will by default save the data used to draw the image, which may make them unsuitable for export.

Concept quiz

Dominance rule: A cell is considered sensitive if the n largest statistical units amount to at least k per cent of the cell’s total value.
Frequency table: The value of each cell in the table is the number of statistical units belonging to that cell.
Threshold value: A cell is considered sensitive if it includes fewer statistical units than a predetermined threshold value.
Magnitude table: The statistical unit values are the values of the tabulated variable, meaning the cell values are aggregates of the values of the statistical units belonging to each cell – typically sums or averages.
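
The threshold value and the dominance rule defined above can be written as two short checks. This is a minimal sketch for illustration only: the function names are hypothetical, and the parameter values used in the example (a threshold of 3, n = 1 or 2, k = 70 or 75) are assumptions, not the values applied by Statistics Finland.

```python
def below_threshold(unit_count, threshold):
    """Threshold value: sensitive if the cell has fewer units than the threshold."""
    return unit_count < threshold


def is_dominated(unit_values, n, k):
    """Dominance rule: sensitive if the n largest units make up at least
    k per cent of the cell's total value."""
    total = sum(unit_values)
    if total <= 0:
        # Assumption: the rule is only evaluated for cells with a positive total.
        return False
    top_n = sum(sorted(unit_values, reverse=True)[:n])
    # Compare 100 * top_n with k * total to avoid division.
    return 100 * top_n >= k * total


# Frequency table cell with 2 units and an illustrative threshold of 3.
print(below_threshold(2, 3))
# Magnitude table cell in which one unit holds 80 of the total 100.
print(is_dominated([80, 10, 5, 5], n=1, k=75))
```

Both calls flag the cell as sensitive. A cell with the values 40, 30, 20 and 10 is not flagged with n = 1 and k = 75, but it is flagged with n = 2 and k = 70, because the two largest units together make up exactly 70 per cent of the total.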
