3  Data protection and privacy

Protecting the privacy and confidentiality of survey respondents is a key concern and has to be considered carefully before preparing and launching a survey. This requires a clear understanding of the concepts involved, a clarification of the legal and organizational requirements for the survey, and adherence to certain standards for data management, storage and data reuse.

3.1 About privacy, confidentiality and anonymity

As mentioned in the previous chapter, anonymity is a key principle of equality data collection. As such it is closely related to privacy and confidentiality. Although all three concepts aim to protect individuals and their information, they do so in different ways (Commission 2010):

Anonymity

makes individuals unidentifiable. It ensures that the data individuals provide cannot be linked to their person. This concerns both the metadata collected during the survey and the sufficient aggregation of responses to prevent involuntary disclosure.

Privacy

gives individuals control over their information. As such it entails a) control of information about oneself, b) control over access to oneself (both physical and mental), and c) control over one’s ability to make important life decisions.

Confidentiality

involves protecting information from unauthorized access. Confidentiality is a more limited concept and concerns first and foremost the protection of personal information. “Confidentiality is a duty that arises when someone has been granted access to information that would otherwise be kept secret” (Commission 2010, 79ff).

Most importantly, privacy considerations limit the ways in which we acquire information, whereas confidentiality considerations deal with the protection of this acquired information.

Note

In concrete terms, the distinction between “privacy” and “confidentiality” means specifying how privacy is protected during data collection, including the decision to (not) participate, and how the collected data will be handled in a confidential manner after it has been recorded.

For the GEAM survey, privacy means giving respondents relative control over data entry (“what” and “when”), including the possibility to opt out and delete their data at any moment during the submission process. Confidentiality means that result data is anonymized and stored in a secure way to prevent unauthorized access.

3.2 Introduction to the General Data Protection Regulation (GDPR)

The EU General Data Protection Regulation (GDPR Regulation 2016/679) is a regulation by which the European Parliament, the Council of the European Union and the European Commission intend to strengthen and unify data protection for individuals within the European Union (EU).

The GDPR enhances the rights of ‘data subjects’ (individuals to whom personal data pertain) and provides clear conditions for collecting, storing, processing and transferring personal data (including ‘special categories’ data) by institutions. It is advisable to become familiar with the principles outlined in Article 5 of the GDPR, which address the following:

  • Accountability
  • Lawfulness, fairness and transparency
  • Purpose limitation
  • Data minimization
  • Accuracy
  • Storage limitation
  • Integrity and confidentiality.

The GDPR is applicable to data processing. This includes data collection, recording, organization, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction.

3.3 Collecting personal (sensitive) data

The GDPR draws a distinction between “personal data” and “sensitive personal data”.

Personal data means “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person” (Article 4(1)).

Sensitive personal data or special categories of personal data refers to information revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data, data concerning health or data concerning a natural person’s sex life or sexual orientation (Article 9).

Guidance on EU legal frameworks for data collection

The European Commission has provided several reports that provide an overview of legal frameworks for equality data collection in general (Commission 2017) and on specific topics such as racial or ethnic origin (Commission 2021) and ethnic origin and disability (Chopin, Farkas, and Germaine 2014), or LGBTI people (Commission 2023).

The GEAM includes items inquiring about sensitive personal data (e.g. sexual orientation, health impairments or reporting discrimination associated with ethnicity) because these variables also constitute the main dimensions of social discrimination (Baumann, Egenberger, and Supik 2018). From our perspective, it is important to address these sensitive issues in the GEAM because otherwise important discriminatory practices will remain invisible and cannot be addressed by adequate gender equality measures.

It is important to note that European and national legislations have put safeguards in place for the collection of sensitive personal data. While they do not categorically prohibit the collection of data on race, religion, ethnic origin among others, they emphasize thoughtful compliance with restrictions and purposeful data handling. This means that sensitive personal data may only be collected under one of ten clearly specified conditions, including where explicit consent has been given or where processing is necessary, for example for scientific or historical research, and is proportionate to the aim pursued and safeguards the fundamental rights and interests of the individuals involved. The GDPR is closely aligned with equality data principles, for example in terms of consent, purpose and anonymity.

Note

Modifications to the survey content and the handling of result data are the sole responsibility of the survey administrator.

Ultimately, it is the survey administrator who must decide on the adequate security settings for a specific survey instance, taking into account the GEAM content, national legislation, and organizational information needs and context(s). In other words, it is up to individual organizations to specify who will have access to the raw data and how it will be stored securely.

Furthermore, it is up to organizations and the survey administrator to ensure that the GEAM survey is accompanied by adequate information for respondents to provide informed consent. Specifically, you must provide:

  • Information about your organization: including name and contact details, details of your representative (if relevant) and contact details of your Data Protection Officer.

  • Information about the type of data you will collect: for example, gender, contract details, salary, etc.

  • The purpose of collecting the data: including what you will use it for and whether it will be used to make an automated decision; the legal basis for using the data, including any ‘legitimate interest’ relied upon.

  • Who will receive or have access to the data: for example, members of an Equality, Diversity and Inclusion committee, Human Resources, or a Gender Equality Planning group.

  • Other information: including whether the data will be transferred, stored, or processed outside the EU and on what basis; how long the data will be stored for; what security arrangements are in place to protect the data; whether provision of the data is required and the consequences of not doing so.

  • Data subjects’ rights: the right to be informed; right of access; right to rectification; right to erasure; right to restrict processing; right to data portability; right to object; and rights in relation to automated decision making and profiling.

  • Contact information: who they can contact in relation to questions or complaints.

It is important to provide a means for individuals to give consent to your processing of their information after they have read the above information. An example statement for the collection of personal and sensitive personal data forms part of the GEAM questionnaire. By default, the GEAM LimeSurvey platform requires respondents to accept the data protection statement before they can proceed to answer the survey. Survey administrators are nevertheless advised to carefully revise this statement together with the legal support of the targeted organization before launching their survey.

Questions and response options may need to be adapted according to an organisation’s or country’s legal requirements where these affect monitoring practices, policies and the terms used to describe populations. Survey administrators, when editing questions and response options on protected characteristics, need to be aware of the rights and permissions in their country. For instance, they need to know whether permissions to collect data on protected characteristics such as sex, race and sexual orientation differ, or whether an organization is required to re-phrase the language in the survey to comply with regulatory standards.

When editing questions or response options for regulatory reasons, it is recommended that you use tried and tested, nationally accepted replacements (for example, the response options presented in a national census). This will increase the likelihood that other organizations in the same country have used similar questions and response options, producing a standardized approach and enabling organizations to benchmark themselves nationally.

In order to make informed decisions regarding privacy protection and confidential treatment of result data, please make sure to read this document carefully, especially the section on Controlling data privacy and response tracking. You will also be required to sign the Annex II - Declaration of Data Protection and Confidentiality Agreement for survey administrators before you can launch a survey (you can still edit and prepare your questionnaire without having signed the agreement with the FUOC. However, before collecting data, you will need to provide a signed copy).

Finally, additional information regarding the collection of sensitive personal data and considerations for primary research are discussed further in Advance HE’s research and data briefing on GDPR compliance (Christoffersen 2018).

3.4 Data storage

Once a survey has finished and the result data matrix has been downloaded from the LimeSurvey platform, we recommend the following steps to protect the confidentiality of the responses and prevent unauthorized access to your data.

3.4.1 Anonymization

Anonymity means that a person participating in the research cannot be identified from the information provided. This is distinct from confidentiality, which is when the researcher or data collector knows the identity of the person, but keeps this information secure, and anonymizes data before being published. For more on this distinction see Advance HE’s briefing on ethics in primary research (Haley 2017).

The GEAM does not contain questions regarding direct personal identifiers such as social security number, names, email addresses or similar. Thus, the collected data is anonymous on a very basic level. LimeSurvey does not store any personal information with the survey results either. However, respondents might provide certain identifiers in open text questions, such as organizational names or names of colleagues, which can unintentionally identify their (and others’) contributions.

Note

Before further processing your results, please check the open text fields in your result data for any plain names and other possible identifiers submitted by respondents. We recommend replacing these instances with custom codes or ‘XXX’-ing them out.
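Once identifiers have been spotted during manual review, the replacement itself can be scripted. The following is a minimal sketch; the names in the list are invented placeholders, not taken from any real dataset:

```python
import re

# Hypothetical identifiers collected while manually reviewing open text fields
identifiers = ["Maria Lopez", "Acme Institute"]

def scrub(text, identifiers):
    """Replace each known identifier with 'XXX' in a free-text response."""
    for ident in identifiers:
        text = re.sub(re.escape(ident), "XXX", text, flags=re.IGNORECASE)
    return text

# scrub("I met Maria Lopez at Acme Institute.", identifiers)
# -> "I met XXX at XXX."
```

Using `re.escape` ensures that names containing special characters are matched literally, and the case-insensitive flag catches variant spellings such as all-lowercase mentions.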

3.4.2 Encrypting result data

A relatively easy but seldom applied measure to protect your data is to save the corresponding file(s) with a password. This works mainly when the results are downloaded as a Microsoft Excel file.

Warning

Before sharing the result data with the inner circle of colleagues, please password protect the file. Do not send the password together with the file (in the same email!).

More sophisticated options are available, such as digital signatures and GPG encryption. GPG encryption is readily available on Linux operating systems; for Windows, see https://www.gpg4win.org/. The advantage of advanced encryption techniques is that you can protect your data with a personal signature/password without limiting your ability to share the data with others. The architecture of public and private encryption keys enables you to encrypt data with a specific addressee in mind – the data can then only be decrypted by that particular person. This enables you to share protected data without having to worry about circulating a single (encryption) password among collaborators.
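For illustration, the following shell commands show the two basic GPG workflows. The file name and recipient address are placeholders; the public-key variant assumes you have already imported the recipient’s public key:

```shell
# Symmetric (password-based) encryption: protects the file with a passphrase
gpg --symmetric --cipher-algo AES256 geam_results.csv

# Public-key encryption: only the named recipient can decrypt the file
# (requires having imported the recipient's public key beforehand)
gpg --encrypt --recipient colleague@example.org geam_results.csv

# Either variant produces geam_results.csv.gpg; decrypt with:
gpg --decrypt geam_results.csv.gpg > geam_results.csv
```

With the public-key variant, no password needs to circulate at all: each collaborator decrypts with their own private key.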

3.4.3 Deletion

Note that when deleting data from a computer, the file is not actually erased but moved to a trash bin, from which it can easily be recovered. This is especially important when working on a shared computer. When you download the result dataset, make sure others (including administrators) cannot access it at a later point in time.

Data should be stored on your personal (desktop) account (protected by your password) or on an encrypted shared drive (e.g. a cloud drive or the organization’s virtual private network). In either case, the folder in which the data are stored should be restricted to the one or two individuals who require access to the raw data. If you store it on an external (flash) drive, the drive should be encrypted or password protected in order to prevent unauthorized access in case of loss.

3.5 Disclosure control

Whereas “anonymization” can remove or replace identifiers from a dataset and thus prevent the identification of respondents, “quasi-identifiers” pose a different challenge. Quasi-identifiers refer to the variables used in the study whose combination can lead to the re-identification of the respondent. For example, Golle (2006) and Sweeney (2000) show that 63% (or 87%, respectively, for older data) of the US population can be unambiguously identified by combining a 5-digit ZIP code, birth date and sex.

What is statistical disclosure control?

Statistical Disclosure Control (SDC) is a critical field in statistics and data analysis that deals with protecting sensitive or confidential data in general, but particularly for small N (Griffiths et al. 2019). Using disclosure control is especially important for intersectional data collection and analysis, when several socio-demographic categories are combined. Having only a few respondents from specific minoritised groups is, however, not a reason to give up, but rather a sign of the particular challenge of overlapping disadvantages and invisibilities that need to be addressed (Armstrong and Jovanovic 2017).

Please consider whether the person to whom the data pertains might still be identified through indirect identifiers (e.g. occupation, salary); by people who know them or the context; or by those who have access to other information which, when combined with the data, might allow them to be identified. In these cases, it is necessary to take further action beyond removing names to anonymize the data. The GDPR contains a strict definition of anonymity: it considers data anonymous only when it cannot be identified by any means “reasonably likely to be used … either by the controller or by another person”. This means that if the data could be re-identified by any person using ‘reasonable effort’, it would not be considered to be anonymized, and respondents would need to be made aware of this during the consent process described above. For example, if a report summarizing the proportions of men and women working in individual roles only included one female respondent from a specific department, it is possible for this single individual to be identified by her colleagues.

In case you plan to distribute the raw data beyond the inner circle of persons involved with the collection and analysis of the GEAM, you need to carefully think about your quasi-identifiers. Quasi-identifiers need to be examined in relation to the number of respondents in your data. For example, a combination of professional category, age and gender might be enough to identify a person in a small research institute but insufficient in a dataset of several hundred or thousands of respondents.

If there is a small number of respondents within a certain category, this identifier may need to be removed before sharing to prevent breaching participant confidentiality and anonymity. Alternatively, it may be better to refrain from sharing the raw data and instead only share the results of the survey in summary tables, with small numbers represented as ‘less than’ values (e.g. groups with fewer than 5 individuals are reported as ‘< 5’ instead of the exact frequency) or by applying a rounding strategy (e.g. all frequencies are rounded to the nearest 5 and any percentages with denominators of fewer than 22.5 individuals are suppressed).
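As a minimal sketch, the ‘less than’ masking rule can be applied to a table of group frequencies as follows (the threshold of 5 is the example value from the text):

```python
def mask_small_counts(counts, threshold=5):
    """Replace any frequency below the threshold with a 'less than' label,
    so that exact small-group sizes are never disclosed."""
    return {group: (n if n >= threshold else f"< {threshold}")
            for group, n in counts.items()}

# mask_small_counts({"Women": 34, "Non-binary": 3})
# -> {"Women": 34, "Non-binary": "< 5"}
```

Note that masking should be applied to every published table consistently; a single unmasked cross-tabulation can otherwise be used to recover the suppressed value by subtraction.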

Higher Education Statistics Agency rounding strategy for disclosure control

The Higher Education Statistics Agency in the UK (Lawson 2016) proposes the following rounding strategy:

  • All numbers are rounded to the nearest multiple of 5

  • Any number lower than 2.5 is rounded to 0

  • Halves are always rounded upwards (e.g. 2.5 is rounded to 5)

  • Percentages based on fewer than 22.5 individuals are suppressed

  • Averages based on 7 or fewer individuals are suppressed
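The rules above can be sketched in code as follows. This is a non-authoritative illustration of the HESA strategy, where a suppressed value is represented as `None`:

```python
import math

def hesa_round(n):
    """Round to the nearest multiple of 5, with halves rounded upwards.
    As a consequence of the same rule, any number below 2.5 rounds to 0."""
    return int(math.floor(n / 5 + 0.5) * 5)

def hesa_percentage(numerator, denominator):
    """Suppress (None) percentages based on fewer than 22.5 individuals."""
    if denominator < 22.5:
        return None
    return round(100 * numerator / denominator)

def hesa_average(values):
    """Suppress (None) averages based on 7 or fewer individuals."""
    if len(values) <= 7:
        return None
    return sum(values) / len(values)

# hesa_round(2.4) -> 0; hesa_round(2.5) -> 5; hesa_round(17) -> 15
```

Implementing the rounding as `floor(n / 5 + 0.5) * 5` rather than with Python’s built-in `round` avoids banker’s rounding, which would round 2.5 down to 0 instead of up to 5.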

Finally, when there are few participants, anonymization can also involve creating groupings from certain variables. For example, the GEAM collects data on respondents’ nationality, which could be transformed into groups by continent, or by EU versus non-EU, both of which would decrease the likelihood of them being identified.
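The EU versus non-EU grouping mentioned above can be expressed as a simple recoding. The membership list below is abbreviated and purely illustrative; a real mapping must cover every nationality present in your data:

```python
# Abbreviated, illustrative list -- extend to cover all EU member states
EU_MEMBERS = {"Austria", "France", "Germany", "Poland", "Spain"}

def group_nationality(nationality):
    """Collapse a detailed nationality value into a coarse EU/non-EU group."""
    return "EU" if nationality in EU_MEMBERS else "non-EU"

# group_nationality("Spain") -> "EU"; group_nationality("Brazil") -> "non-EU"
```

The same pattern applies to grouping by continent: the coarser the grouping, the larger each cell becomes and the lower the re-identification risk.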

3.6 References

Armstrong, Mary A., and Jasna Jovanovic. 2017. “The Intersectional Matrix: Rethinking Institutional Change for URM Women in STEM.” Journal of Diversity in Higher Education 10 (3): 216–31. https://doi.org/10.1037/dhe0000021.
Baumann, Anne-Luise, Vera Egenberger, and Linda Supik. 2018. Erhebung von Antidiskriminierungsdaten in Repräsentativen Wiederholungsbefragungen. Bestandsaufnahme Und Entwicklungsmöglichkeiten. Antidiskriminierungstelle des Bundes. https://www.antidiskriminierungsstelle.de/SharedDocs/Downloads/DE/publikationen/Expertisen/Datenerhebung.html.
Chopin, Isabelle, Lilla Farkas, and Catharina Germaine. 2014. Ethnic Origin and Disability Data Collection in Europe: Measuring Inequality - Combating Discrimination. Equality Data Initiative. Open Society Foundations. https://www.opensocietyfoundations.org/uploads/d28c9226-bed7-4b1b-ac8b-4455f3c3451a/ethnic-origin-and-disability-data-collection-europe-20141126.pdf.
Christoffersen, Ashlee. 2018. Data Protection and Anonymity Considerations for Equality Research and Data. Advance HE. https://www.advance-he.ac.uk/knowledge-hub/data-protection-and-anonymity-considerations-equality-research-and-data.
Commission, European. 2010. European Textbook on Ethics in Research. Publications Office of the European Union.
———. 2017. Analysis and Comparative Review of Equality Data Collection Practices in the European Union: Legal Framework and Practice in the EU Member States. Publications Office. https://data.europa.eu/doi/10.2838/6934.
———. 2021. Guidance Note on the Collection and Use of Equality Data Based on Racial or Ethnic Origin. Publications Office of the European Union. https://data.europa.eu/doi/10.2838/06180.
———. 2023. Guidance Note on the Collection and Use of Data for LGBTIQ Equality. Publications Office. https://data.europa.eu/doi/10.2838/398439.
Golle, Philippe. 2006. “Revisiting the Uniqueness of Simple Demographics in the US Population.” In Proceedings of the 5th ACM Workshop on Privacy in Electronic Society, 77–80. WPES ’06. ACM. https://doi.org/10.1145/1179601.1179615.
Griffiths, Emily, Carlotta Greci, Yannis Kotrotsios, Simon Parker, James Scott, Richard Welpton, Arne Wolters, and Christine Woods. 2019. Handbook on Statistical Disclosure Control for Outputs. Safe Data Access Professionals Working Group. https://securedatagroup.files.wordpress.com/2019/10/sdc-handbook-v1.0.pdf.
Haley, J. 2017. Ethics in Primary Research (Focus Groups, Interviews and Surveys). Advance HE. https://www.advance-he.ac.uk/knowledge-hub/ethics-primary-research-focus-groups-interviews-and-surveys.
Lawson, J. 2016. Working with Data. Guidance and Tools to Help You Gather and Analyse Equality Data, and Deal with Small Numbers. Equality Challenge Unit. https://www.advance-he.ac.uk/guidance/equality-diversity-and-inclusion/using-data-and-evidence/working-data.
Sweeney, Latanya. 2000. “Uniqueness of Simple Demographics in the US Population.” In LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy.