Data Science Dispatch
Legal/Ethical Analysis of Reidentification in Open Datasets: Open Sourcing Mental Health (OSMI)
A legal and ethical analysis of re-identification risk in the open-sourced OSMI Mental Health in Tech survey — the harms, the frameworks (HIPAA, GDPR, Solove, Nissenbaum), and how to mitigate them.
By Ambro Quach, Joseph Chan, Sarah Julius & Serina Li
How “anonymous” data gets re-identified
I. Introduction
The proliferation of open datasets has been an invaluable resource for data scientists, both for performing research and for democratizing access to information and accelerating cross-disciplinary innovation (Hardy & Maurushat, 2017; Huston et al., 2019). Open sourcing data makes it possible to generate new insights that would otherwise remain siloed within organizations. However, these datasets carry risks and raise both ethical and legal concerns about their use. The increasing quantity of open data, its ease of access, and its proliferation all amplify the risk of re-identification (Green et al., 2017). These risks, if not mitigated, have the potential to cause significant harm to individuals and institutions, including privacy violations and erosion of public trust. For this project, we analyze a public dataset to investigate the potential risks of re-identification of individuals within that dataset. We then dissect key issues associated with our analysis, identifying ethical and legal implications of re-identification. Finally, we offer concrete steps and suggest solutions that can mitigate the risks we have identified.
The dataset that we are using is the 'OSMI mental health in tech survey 2016' dataset that is hosted on Kaggle (OSMI Mental Health in Tech Survey 2016, n.d.). The dataset consists of 1,433 survey responses from tech workers around the world gauging the prevalence of mental health disorders. Each survey response consists of 63 items, which cover the demographics of the respondents, questions pertaining to the respondent's mental health, and questions on their employers' attitudes to mental health. The survey and the subsequent dataset were produced by Open Sourcing Mental Health (OSMI), a 501(c)(3) non-profit corporation that describes itself as dedicated to 'raising awareness, educating, and providing resources to support mental wellness in the tech and open source communities' (About OSMI, n.d.). OSMI was founded in 2013 by Ed Finkler, a tech worker who was inspired to enact change after discussing mental health at tech conferences.
The specific dataset we have chosen is the 2016 edition of the survey, the second in a line of ongoing surveys and the largest to date. The 2016 edition has been released with a Creative Commons Attribution-ShareAlike 4.0 license. This license allows anyone to download, analyze, and redistribute the data. The OSMI website and Kaggle offer surprisingly little information about the 2016 survey. Indeed, these primary sources offer negligible information on how the survey was initially conducted, what notice was given and what consent was obtained, what the data was used for, and why the data has now been released. What is known, based on a third-party source, is that the survey was an opt-in survey distributed via Twitter and at talks given at conferences (Mammal, 2017). One blog post from the OSMI website references a dataset challenge which uses the 2016 survey. The authors of the blog hope to 'get some fresh eyes on the results, with new ideas for how to analyze them' (OSMI, n.d.). From these sources, we see that at least part of the motivation for the release is to draw attention to mental health issues and to increase research in the area.
II. Privacy Impact Assessment (PIA)
Data is a very powerful tool that can be used to improve lives and make processes easier, but it is also sensitive. Whenever an individual's data is collected, especially sensitive data like social security numbers or health information, there are risks. Individuals expect that the data will be protected and used properly in order to mitigate these risks. A Privacy Impact Assessment (PIA) is one method to determine possible risks of data collection as well as methods to limit said risks.
A. PIA Applied to the OSMI Dataset:
The OSMI Dataset looks at mental health history as well as employment details and individual demographics, which are all sensitive data on their own, but can be combined to potentially re-identify subjects within the dataset. Re-identification is a major concern when working with human data, as it can not only compromise an individual's privacy but also lead directly to harm for the individual identified. It is the responsibility of OSMI to have considered these harms prior to collecting the data (Mulligan et al., 2016).
One such harm could be employment discrimination. Employers may recognize their employees' survey responses and punish, or potentially even fire, people because of them. While California's Fair Employment and Housing Act protects employees from being fired due to their mental health (Holben, 2024), the majority of states follow at-will employment, meaning the employer doesn't necessarily have to disclose why an employee is let go (At-Will Employment & States with Exceptions in 2024 | Atticus, n.d.). If an employee is re-identified from the OSMI Dataset, there is little they can do to ensure that they will not be terminated from their position.
In addition to potential employment discrimination, individuals re-identified from this dataset may face social stigma because of their mental health history. Many people still believe that individuals who have faced mental health issues are dangerous or incompetent. Individuals with mental health issues can be severely impacted by these views, resulting in reduced hope and low self-esteem. This stigma can also lead to people with mental illnesses having difficulty with personal relationships, being reluctant to seek treatment and assistance, and isolating themselves from others (Singhal, 2024).
While it is unlikely that an individual would be able to be identified with just the OSMI dataset, one could easily use other publicly available information to identify many, if not all, of the individuals within the dataset. This risk leads to issues of trust and transparency for the OSMI dataset. In Solove's Privacy Taxonomy, this could be considered an issue of information processing, as the information could easily be aggregated with other publicly available data to re-identify individuals within the dataset (Solove, 2006).
OSMI informed participants that their responses would be anonymous, and that no personally identifiable information would be collected or stored. In practice this assurance does not hold, because the employment and demographic variables enable indirect re-identification. The release therefore contradicts OSMI's own claims and can cause harm to many of the individuals surveyed.
Nissenbaum would likely argue that this is a violation of contextual informational norms, as participants would expect that their data is protected and that they would be made aware of any potential risks of the data collection, whether directly or indirectly caused by OSMI (Nissenbaum, 2011).
To keep its promise and rectify this situation, OSMI can take several steps. We first recommend that OSMI update its privacy disclaimer to ensure that individuals being surveyed understand the risks of re-identification that come with participating in the study. Next, we recommend that OSMI use stronger de-identification measures, such as differential privacy or k-anonymity, to protect the privacy of participants. Finally, OSMI should consider whether it wants to keep the data released to the public. The data was initially collected for OSMI's own research, and OSMI should re-evaluate whether sharing this data publicly is within the scope of its work (Mulligan et al., 2016). It should consider releasing the data only to vetted researchers to further protect participant privacy. Control over who receives the data will help limit the risk of re-identification (Nissenbaum, 2011). While OSMI may not be able to prevent all risks, allowing participants to make fully informed decisions on whether they want their data included is essential.
III. Legal & Ethical Implications
A. Legal Risks
The open release of the OSMI Mental Health in Tech dataset raises significant legal concerns under HIPAA, GDPR, and CalOPPA, as the dataset includes potentially re-identifiable health-related information. While OSMI is not a healthcare provider, the disclosure of health-related data without adequate privacy safeguards could constitute negligence and non-compliance with key data protection laws.
First, HIPAA governs the protection of Protected Health Information (PHI) and requires entities handling such data to implement appropriate safeguards. Although OSMI is not a "covered entity" under HIPAA, the dataset contains mental health disclosures that could be re-identified, posing risks similar to those addressed under HIPAA's de-identification requirements. For instance, the HIPAA Safe Harbor method (45 CFR § 164.514(b)(2)) states that PHI must be stripped of 18 specific identifiers to be considered de-identified. The OSMI dataset includes demographic details, employment status, and workplace mental health policies, which, when cross-referenced, could still allow re-identification. Furthermore, the HIPAA Expert Determination method (45 CFR § 164.514(b)(1)) requires a qualified expert to certify that the risk of re-identification is "very small." The OSMI dataset does not document any such evaluation, which suggests a lack of due diligence in ensuring de-identification standards.
Under GDPR Article 7(2), a request for consent must be presented "using clear and plain language" so that data subjects understand how their personal data will be used (GDPR Article 7). GDPR defines personal data as "any information relating to an identified or identifiable natural person" (GDPR Article 4). Although the OSMI dataset claims to be "anonymous," its records may still be identifiable, and Article 9(1) of GDPR recognizes health data as a special category requiring explicit consent for processing (GDPR Article 9). The dataset lacks documentation proving participants were informed of potential re-identification risks, which suggests it may be non-compliant. Violations of the consent and data minimization principles can lead to fines of up to €20 million or 4% of an organization's total worldwide annual turnover, whichever is higher (GDPR Article 83).
Additionally, the California Online Privacy Protection Act (CalOPPA) requires businesses collecting personal data from California residents to disclose how the data will be used and with whom it may be shared (California Legislature). The OSMI dataset lacks a publicly available data retention policy, which may violate CalOPPA's transparency requirements. Under Cal. Bus. & Prof. Code § 22576, organizations must honor users' privacy expectations. The failure to notify participants about potential re-identification risks conflicts with the act's principles of informed consent.
Aside from data privacy laws, employment and labor regulations are another point of contention. If employers access identifiable responses in the OSMI dataset, they may discriminate against employees or job applicants based on mental health disclosures. This raises concerns under both U.S. employment law and international labor standards. For instance, the Americans with Disabilities Act (ADA) (42 U.S.C. § 12112) prohibits discrimination based on mental health conditions. If an employer uses the dataset to identify employees with mental health conditions and acts on that knowledge, it may violate ADA protections. Likewise, the Equal Employment Opportunity Commission (EEOC) Guidance affirms that mental health conditions cannot be used as a basis for hiring, firing, or job assignments (EEOC Guidance). Lastly, GDPR Article 9(2)(b) states that processing health data for employment-related decisions must be "necessary", meaning employers should not access or use such data for hiring or workplace decisions (GDPR Article 9). The risk of employer-driven discrimination further underscores the need for stronger de-identification and access control measures in future data releases.
B. Ethical Concerns
The public release of the OSMI's dataset raises serious ethical concerns regarding participant expectations, data confidentiality, and the potential for harm. While OSMI intended to promote mental health awareness through open data, its failure to implement adequate privacy safeguards exposes survey respondents to risks of re-identification, workplace discrimination, and social stigma. This section examines how the dataset violates contextual privacy norms, breaches participant trust, and underscores the ethical responsibilities of data stewards.
Helen Nissenbaum's Contextual Integrity framework asserts that privacy violations occur when information flows in ways that contradict the established social norms of the context in which it was collected (Nissenbaum). The OSMI dataset explicitly states that "all responses are anonymous, and no identifying information will be collected or stored". However, the dataset contains demographic details, job titles, and company size, all of which, when cross-referenced with LinkedIn or company websites, could allow re-identification. This contradicts the expectations of survey participants, who likely assumed their responses would remain confidential and could not be traced back to them personally. As Nissenbaum emphasizes, "Generally, when the flow of information adheres to entrenched norms, all is well; violations of these norms, however, often result in protest and complaint". By violating contextual privacy norms, OSMI undermines participant trust in research processes. This breach of expectation may deter individuals from participating in future mental health surveys, ultimately harming the quality and inclusivity of mental health research.
Lastly, the ethical responsibility of data stewards requires that organizations handling sensitive personal data adhere to best practices in confidentiality, transparency, and participant protection. The OSMI dataset fails to uphold these standards, leading to significant ethical concerns. For example, publicly available mental health disclosures can lead to workplace discrimination, social isolation, and psychological distress if an individual is re-identified. OSMI has an ethical duty to protect participant identities by ensuring proper anonymization, providing transparency about re-identification risks, and implementing safeguards like differential privacy and restricted access policies.
IV. Reidentification Risks & Case Studies
The OSMI dataset contains high-dimensional data consisting of 63 columns. Although direct identifiers, such as names, emails, and IDs, have been removed, the dataset still includes demographic and employment attributes classified as quasi-identifiers. Quasi-identifiers are attributes that, when combined, can lead to the re-identification of individuals, posing substantial privacy risks (Sweeney, 2000). Attackers can exploit quasi-identifiers through re-identification techniques such as linkage attacks, which involve combining anonymized data with external datasets and websites to reconstruct individual identities. Linkage attacks have been demonstrated in scenarios where anonymized healthcare or demographic datasets were cross-referenced with publicly available data sources, effectively revealing individuals' identities (Narayanan & Shmatikov, 2008).
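To make the linkage attack described above concrete, here is a minimal sketch in Python. Every record in it is invented for illustration and none is drawn from the OSMI data; the field names (`age`, `gender`, `state`, `role`, `diagnosis`) and the helpers `key` and `link` are assumptions of this sketch, not part of any real tool. It joins a survey without names to a public profile list on shared quasi-identifiers and reports only the matches that are unambiguous.

```python
# Toy linkage attack: join a "de-identified" survey to a public profile list
# on shared quasi-identifiers. All records below are invented for illustration.

QUASI_IDENTIFIERS = ("age", "gender", "state", "role")

survey = [  # no names, but a sensitive attribute is present
    {"age": 40, "gender": "non-binary", "state": "TN",
     "role": "back-end developer", "diagnosis": "yes"},
    {"age": 29, "gender": "male", "state": "CA",
     "role": "front-end developer", "diagnosis": "no"},
    {"age": 29, "gender": "male", "state": "CA",
     "role": "front-end developer", "diagnosis": "yes"},
]

public_profiles = [  # names, but no health information
    {"name": "Person A", "age": 40, "gender": "non-binary", "state": "TN",
     "role": "back-end developer"},
    {"name": "Person B", "age": 29, "gender": "male", "state": "CA",
     "role": "front-end developer"},
    {"name": "Person C", "age": 29, "gender": "male", "state": "CA",
     "role": "front-end developer"},
]


def key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def group_by_key(records):
    """Group records that share the same quasi-identifier values."""
    groups = {}
    for record in records:
        groups.setdefault(key(record), []).append(record)
    return groups


def link(survey_rows, profile_rows):
    """Return (name, survey_row) pairs where exactly one survey row
    and exactly one public profile share the same quasi-identifiers."""
    survey_groups = group_by_key(survey_rows)
    profile_groups = group_by_key(profile_rows)
    matches = []
    for k, rows in survey_groups.items():
        candidates = profile_groups.get(k, [])
        if len(rows) == 1 and len(candidates) == 1:
            matches.append((candidates[0]["name"], rows[0]))
    return matches


reidentified = link(survey, public_profiles)
for name, row in reidentified:
    print(name, "->", row["diagnosis"])
```

In this toy data only the respondent with a rare combination of attributes is matched; the two respondents who share identical quasi-identifiers cannot be told apart, which is the intuition behind k-anonymity.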
When assessing re-identification risks within the OSMI dataset, it is critical to analyze attributes associated with different risk levels. This involves examining the uniqueness and potential combinations of quasi-identifiers such as age, gender, geographic location, and employment information. Understanding how these attributes interact can reduce the likelihood of successful linkage attacks and further privacy violations. After examining the survey questions in the OSMI dataset, three categories of high-risk attributes were identified that pose significant privacy risks of re-identification:
- Demographic Information (PII): age, gender, country, state/territory (where participants work and reside)
- Employment & Industry Information: company size, tech role, remote status, employer mental health policies
- Protected Health Information (PHI): mental health conditions, diagnosis, treatment status, productivity impact, family history, personal experiences
First, these questions contain personally identifying information (PII) that could narrow down individuals in the dataset. The combination of age, gender, and location can uniquely identify people in specific demographics, especially in small populations like the tech community. Additionally, rare combinations are at higher risk of re-identification (e.g., "40-year-old non-binary back-end developer in Tennessee"). The risk associated with PII is further increased by employment and industry details. For instance, if someone works for a small company (with 5-10 employees) that has a specific focus within the tech industry, it becomes easier to cross-reference this information with external data sources like LinkedIn, Medium, or Twitter, which can lead to de-anonymization.
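One way to quantify how exposed rare combinations are is to count how many respondents share each combination of quasi-identifiers. The sketch below is illustrative only: the rows are made up, and the function names `equivalence_class_sizes`, `k_anonymity`, and `unique_fraction` are hypothetical helpers rather than an established API. A dataset is k-anonymous with respect to a set of quasi-identifiers when the smallest group has at least k members.

```python
from collections import Counter

# Invented toy rows; the column names are assumptions, not OSMI's actual headers.
rows = [
    {"age": 40, "gender": "non-binary", "state": "TN"},
    {"age": 29, "gender": "male", "state": "CA"},
    {"age": 29, "gender": "male", "state": "CA"},
    {"age": 29, "gender": "male", "state": "CA"},
    {"age": 35, "gender": "female", "state": "NY"},
]


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group: the k for which the data is k-anonymous."""
    return min(equivalence_class_sizes(records, quasi_identifiers).values())


def unique_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination is unique."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    unique = sum(1 for n in sizes.values() if n == 1)
    return unique / len(records)


qi = ("age", "gender", "state")
print("k =", k_anonymity(rows, qi))
print("unique fraction =", unique_fraction(rows, qi))
```

A result of k = 1 means at least one respondent is alone in their group and is therefore the easiest target for linkage.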
The risk continues to compound significantly due to the protected health information (PHI) contained in the dataset. The questions related to mental health conditions and diagnosis, treatment status, and personal experiences are highly sensitive. When linked with PII and employment details, this information can have serious negative consequences for the individuals involved. Mental health conditions, combined with employment information, can result in various forms of discrimination and exclusionary practices, such as insurance discrimination, workplace bias, and limited job opportunities. There are still stigmas surrounding mental health issues, and employees often do not want their managers or coworkers to be aware of their conditions. If respondents are de-anonymized, the release would fall short of the standards that privacy law sets in the healthcare setting. Although OSMI is not a HIPAA covered entity, the dataset would not meet HIPAA's de-identification standard, because the combination of PHI and PII could reveal the identity of individuals together with their specific health conditions, which is a serious concern.
However, effective strategies can be implemented to minimize re-identification risks and protect the privacy of survey participants. To ensure responsible data sharing, OSMI must adopt accountable and stringent measures to transform its data, mitigating potential harmful effects before it becomes publicly accessible. While demographic, employment, and protected health information each have distinct privacy risks, several mitigation strategies are applicable across these categories. Generalization is a key technique that can be used across these three categories to prevent re-identification. For instance, in demographics, instead of listing specific values, age should be placed into buckets such as "18–24" or "25–34," and geographic information should be aggregated into "West Coast" or "Midwest." With health data, specific diagnoses (e.g., "Bipolar Disorder," "OCD") should be replaced with only general categories like "Mood Disorders" or "Anxiety Disorders." Suppression of outlier responses should be used when a specific group or individual might be too easily identifiable. If a respondent belongs to a small demographic group (e.g., "42-year-old transitioned team lead in Canada"), their data can be suppressed or combined with other groups. For a company with very few employees, its responses can be removed or replaced with "Other."
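The generalization and suppression steps described above could look roughly like the following. This is a sketch under stated assumptions: the age buckets, the partial `REGIONS` mapping, the column names, and the threshold k = 2 are all chosen for illustration, and the rows are invented. For simplicity it drops small groups rather than merging them into neighbouring groups.

```python
from collections import Counter

# Hypothetical buckets and a deliberately partial state-to-region mapping.
AGE_BUCKETS = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]
REGIONS = {"CA": "West Coast", "WA": "West Coast", "OR": "West Coast",
           "IL": "Midwest", "OH": "Midwest"}


def generalize_age(age):
    """Replace an exact age with its bucket label."""
    for low, high in AGE_BUCKETS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "Other"


def generalize(row):
    """Coarsen age and location; unmapped states fall into 'Other'."""
    return {
        "age": generalize_age(row["age"]),
        "region": REGIONS.get(row["state"], "Other"),
        "gender": row["gender"],
    }


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier group has fewer than k members."""
    def group(r):
        return tuple(r[q] for q in quasi_identifiers)
    sizes = Counter(group(r) for r in records)
    return [r for r in records if sizes[group(r)] >= k]


raw = [  # invented rows
    {"age": 22, "state": "CA", "gender": "male"},
    {"age": 23, "state": "WA", "gender": "male"},
    {"age": 30, "state": "IL", "gender": "female"},
    {"age": 33, "state": "OH", "gender": "female"},
    {"age": 42, "state": "TN", "gender": "non-binary"},
]

generalized = [generalize(r) for r in raw]
released = suppress_small_groups(generalized, ("age", "region", "gender"), k=2)
print(len(raw), "rows in,", len(released), "rows released")
```

The trade-off is visible even here: the respondent who remains unique after generalization is removed from the release entirely, so their perspective is lost to downstream analysis.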
Furthermore, OSMI should implement access restrictions for high-risk data attributes to prevent linkage attacks. Demographic breakdowns and employment data should only be accessible to approved researchers or those with explicit permission. The general public should only have access to aggregated data. Health data should always be protected under strict access controls, ensuring that no individual responses are publicly available without anonymization. Also, noise addition, the basic mechanism behind differential privacy, can obscure exact values while maintaining the overall trends in a dataset. Adding small, random variations to reported company sizes or job roles makes it harder to identify individual records. Additionally, demographic information such as age or location can be slightly modified so that no individual is easily singled out.
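For categorical fields such as company size or job role, one standard way to add "random variation" is randomized response: each answer is reported truthfully with some probability and otherwise swapped for a different category. The sketch below is a minimal illustration, not a description of anything OSMI does. The size bands in `CATEGORIES`, the choice of epsilon, and the function names are assumptions made for the example.

```python
import math
import random

# Hypothetical size bands used only for this example.
CATEGORIES = ["1-5", "6-25", "26-100", "More than 100"]


def truth_probability(epsilon, k):
    """Probability of reporting the true category under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomized_response(true_value, categories, epsilon, rng):
    """Report the true value with probability e^eps / (e^eps + k - 1);
    otherwise report one of the other categories uniformly at random."""
    if rng.random() < truth_probability(epsilon, len(categories)):
        return true_value
    others = [c for c in categories if c != true_value]
    return rng.choice(others)


rng = random.Random(0)  # seeded only so the sketch is repeatable
reported = [randomized_response("1-5", CATEGORIES, 1.0, rng)
            for _ in range(10_000)]
kept = sum(1 for r in reported if r == "1-5") / len(reported)
print("expected share kept:", round(truth_probability(1.0, len(CATEGORIES)), 3))
print("observed share kept:", round(kept, 3))
```

Because the swap probability is known, an analyst can still correct aggregate estimates for the added noise, while any single reported value remains deniable.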
Compared with demographic and employment information, health data is the most sensitive category, requiring the strictest compliance with legal and ethical guidelines. When health data is leaked along with an individual's identity, it can result in discrimination, stigma, and adverse consequences for insurance and employment. Therefore, even though OSMI is a non-profit entity, it should take the necessary precautions to align with HIPAA (Health Insurance Portability and Accountability Act) and GDPR (General Data Protection Regulation). These regulations require strict anonymization of health records, meaning diagnosis and treatment-related information must be removed or de-identified before sharing.
Beyond legal compliance, adhering to HIPAA and GDPR is also a matter of ethical responsibility when providing publicly accessible datasets. OSMI has a duty to protect the privacy of survey participants. It should proactively implement the necessary data safeguards to prevent any potential harm that could arise from the misuse of sensitive information. By leveraging privacy protection strategies such as data generalization, suppression of outlier responses, access restrictions, pseudonymization, and differential privacy, OSMI can mitigate risks while preserving the dataset's value for research and advocacy purposes. Upholding these standards and maintaining a transparent set of principles will not only build trust within the tech and mental health communities but also establish a strong precedent for responsible data stewardship in open public datasets.
The OSMI dataset was designed to be anonymous, yet it contains data points that can be matched to public sources, potentially re-identifying individuals. The following case studies explore re-identification scenarios, illustrating how anonymous survey data can be traced back to individuals. These cases do not attempt to expose real individuals but highlight the methodologies attackers can use to exploit privacy vulnerabilities in open datasets and propose enhanced security measures.
Case Study 1: Self-Employment & LinkedIn Match
This case examines a self-employed individual who works in an 11-50-person company, holds executive leadership and developer roles, and is based in Switzerland. They also work remotely. To re-identify this individual, one could search LinkedIn, business registries, or startup directories such as Crunchbase for self-employed tech professionals in Switzerland. Filtering by job title, company size, and industry could narrow the results. If the individual has shared blog posts or social media discussions on mental health, these could be cross-referenced with their survey responses. If identified, their private mental health disclosures could become publicly linked to them. This could lead to discrimination from clients or employers and potential reputation damage. This case highlights the risk of linking employment data to mental health information, emphasizing the need for better anonymization in open datasets.
Case Study 2: Age, Gender, and Employer Details Cross-Matching
This case examines individuals with rare demographic profiles, making them more identifiable. It focuses on a 21-year-old non-binary individual working at a tech company in New Jersey with 1-5 employees. To re-identify this person, one could search the web for non-binary professionals in New Jersey's tech sector. Filtering by company size could further refine the selection of companies to look at. If the individual has publicly discussed mental health or gender identity through blogs or forums such as Medium and LinkedIn, these posts could be linked to their OSMI responses. If identified, their workplace could become aware of their mental health status. Colleagues might associate survey responses with them, potentially leading to discrimination or bias. Additionally, if their gender identity was not previously known, this unintended exposure could create personal and professional risks.
Case Study 3: Public Business Records Matching
This case examines a self-employed entrepreneur in New York, working in a 1-5-person company that does not provide mental health benefits. They also work remotely sometimes. Self-employed individuals in small businesses are at higher risk of re-identification. One could search New York's public business records for small business owners to re-identify this individual. By cross-matching profiles on LinkedIn, Wellfound, or Crunchbase, the list of potential candidates can be narrowed further. Social media and platforms like Substack can be used to explore discussions about mental health from this individual. A company's "About Us" page could also confirm whether the business has mental health policies. If identified, their mental health struggles could be linked to their business, affecting their professional reputation. Clients, partners, or investors might hesitate to work with them due to stigma. Additionally, this exposure could impact future employment opportunities, making it harder for them to transition into other roles.
V. Addressing Identified Issues
To mitigate the aforementioned risks, technical solutions and policy recommendations must be implemented to strengthen data protection, limit access, and improve informed consent processes.
A. Technical Solutions (De-Identification & Access Controls)
First, one effective method for reducing the risk of re-identification is differential privacy, which introduces statistical noise into the dataset to ensure that individual responses cannot be distinguished. This technique has been widely recognized as a best practice for protecting sensitive data while maintaining its overall utility for research. Implementing differential privacy would help OSMI ensure that the dataset remains useful for analysis while preventing personal information from being reconstructed.
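As a rough illustration of how differential privacy is commonly applied, the sketch below uses the Laplace mechanism to release a noisy count instead of the exact one. Note that this protects a published statistic rather than perturbing the row-level file itself. The rows, the `sought_treatment` field, and the helper names `laplace_noise` and `dp_count` are assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(rows, predicate, epsilon, rng):
    """Release a count with Laplace noise calibrated to epsilon.
    Adding or removing one person changes a count by at most 1,
    so the sensitivity of the query is 1."""
    true_count = sum(1 for row in rows if predicate(row))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Invented data: 30 respondents who sought treatment, 20 who did not.
rows = [{"sought_treatment": True}] * 30 + [{"sought_treatment": False}] * 20

rng = random.Random(42)  # seeded only so the sketch is repeatable
noisy = dp_count(rows, lambda r: r["sought_treatment"], epsilon=1.0, rng=rng)
print("noisy count:", round(noisy, 2))
```

Smaller values of epsilon give stronger privacy but noisier answers, so the parameter has to be chosen with the intended analyses in mind.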
Another key strategy is data minimization, which involves releasing only aggregated insights rather than raw response-level data. Instead of making the full dataset openly available, OSMI could have published summary statistics on mental health trends within the tech industry, reducing the likelihood of individuals being identified. This approach aligns with data protection principles under GDPR and CalOPPA, which emphasize limiting data exposure to what is strictly necessary.
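A release of aggregated insights might be produced along the following lines. This is a sketch only: the `min_cell` threshold of 5, the column name, and the rows are illustrative assumptions, and `summary_table` is a hypothetical helper. Counts below the threshold are masked so that small groups are not exposed through the summary itself.

```python
from collections import Counter


def summary_table(rows, column, min_cell=5):
    """Count responses per category, masking any cell smaller than min_cell."""
    counts = Counter(row[column] for row in rows)
    return {
        value: (n if n >= min_cell else f"<{min_cell}")
        for value, n in counts.items()
    }


# Invented responses to a hypothetical yes/no question.
responses = [{"has_benefits": "yes"}] * 6 + [{"has_benefits": "no"}] * 2

table = summary_table(responses, "has_benefits")
print(table)
```

Publishing a table like this instead of the response-level file removes the row-by-row combinations of attributes that linkage attacks depend on.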
Lastly, access control mechanisms should be implemented to restrict dataset access to approved researchers rather than making the data publicly available. This could be done by requiring formal ethics approval or data-use agreements, ensuring that the dataset is used responsibly by qualified individuals rather than freely accessible for potential misuse. Many privacy-sensitive datasets already follow this model, providing access only under strict confidentiality agreements to prevent unauthorized use or de-anonymization attempts.
B. Regulatory & Policy Recommendations
Beyond technical solutions, regulatory and policy improvements should be adopted to enhance privacy protections. A notice-and-comment process before data release would allow stakeholders, including privacy experts, legal scholars, and mental health advocacy groups, to review dataset privacy risks before publication. This process would help identify vulnerabilities in the dataset and provide recommendations on how to anonymize or restrict certain variables before the data becomes publicly available.
Another essential measure is strengthening informed consent by ensuring that participants clearly understand how their data will be used and the potential risks associated with re-identification. Many respondents likely assumed that their survey answers would remain anonymous, yet the inclusion of employment details, demographic information, and workplace policies makes re-identification feasible. Going forward, OSMI must provide explicit disclosures about privacy risks, allowing participants to make fully informed decisions about whether to share their mental health information.
By adopting these technical and regulatory safeguards, OSMI can align its data-sharing practices with legal and ethical standards, ensuring that future mental health research is conducted responsibly and without compromising participant confidentiality.
VI. Open Data Governance
The governance of open data is a multifactorial problem that is dependent on the type of data, the scope of that data and the jurisdictions which govern the data. Governance of open data ensures ongoing protection of privacy, maintains quality assurance of the data, preserves ethical principles, permits transparency and enables legal compliance (Green et al., 2017). The OSMI dataset provides key insights into shortcomings that can arise if governance is not carefully considered. These flaws present excellent learning opportunities and lead to recommendations on best practices. Here we propose a number of recommendations for governance of open datasets.
A. Recommendation 1: Governance structures
We recommend that organizations define clear internal roles identifying the steward of any open dataset. For small datasets, this may be a single individual who takes responsibility for control and oversight of the dataset. For larger datasets, this will likely be a committee with established roles. The members of this committee would ideally have a wide range of skills that encompass ethical, legal and security concerns. For the OSMI 2016 dataset, no clear steward was identified on the OSMI website or on Kaggle. The lack of a structure prevents responsibility from being assigned and weakens the ability of an organization to provide governance with adequate oversight. Here, this lack of stewardship for the OSMI dataset manifests as a failure to consider legal, ethical and privacy issues.
B. Recommendation 2: Clear documentation
We recommend that organizations maintain clear documentation about each dataset. This should include the provenance of the data, including information on how the data was collected and whether participants consented. There should also be clear information on variables, any processing of data, the current version of the data, and any updates. For the OSMI dataset, there was very limited documentation. This raises many questions about the quality of the data, the limitations of the data, and ethical concerns regarding participant consent. The lack of documentation likely arises from the lack of governance structures addressed in our first recommendation.
C. Recommendation 3: Ethical Review
We strongly recommend that, prior to the release of data, there is a review by an ethics committee. This can be an internal committee that considers key ethical issues that may need resolution prior to data release. If the datasets contain health information and/or could lead to health research, we strongly recommend review by an institutional review board as best practice. This is to ensure that the Belmont principles are adhered to and that any research complies with the Helsinki declaration (WMA - The World Medical Association-WMA Declaration of Helsinki – Ethical Principles for Medical Research Involving Human Participants, n.d.). Failure to consider ethical principles can lead to harms to participants of the research.
D. Recommendation 4: Notice and Consent
Organizations should consider stronger notice and consent processes at the time of data collection. Ideally, organizations should provide adequate notice to individuals and/or entities whose data is being collected. Moreover, consent for open release of data should be obtained where possible. This provides transparency and helps retain trust in the organization collecting the data. It is unclear from the available documentation whether OSMI provided sufficient notice to, and obtained consent from, the individuals who submitted their data.
E. Recommendation 5: Data-use agreements
Whilst the idea behind open datasets is to democratize access to data, harms can arise if no limitations are placed upon them (Green et al., 2017). We propose that, at minimum, a data-use agreement be signed by any user of open datasets released by organizations. For datasets that contain particularly sensitive data like health information, we suggest even stronger protections. These may include restricting access to approved researchers whose proposals are vetted by the governance structure of the organization releasing the dataset. For the OSMI dataset reviewed in this project, use of the data was governed by a Creative Commons Attribution-ShareAlike 4.0 license. This is insufficient to deal with the potential harms associated with open dataset release. Other health-related datasets, like the MIMIC-IV dataset, require users to agree not to make any attempt to re-identify individuals in the dataset (Johnson et al., 2023).
F. Recommendation 6: Differential privacy
Differential privacy is one method that can be considered to enhance the privacy of individuals. It describes a mathematical way to introduce noise to a dataset to reduce the ability to re-identify individuals, whilst maintaining the overall statistical utility of the data (Cummings et al., 2024). There is an inherent tradeoff: the more strongly privacy is protected, the less accurate the results become. Organizations should consider differential privacy to protect open datasets. However, they should also be aware that it may be difficult to apply this framework to datasets where accuracy is essential. It does not appear that differential privacy was applied to the OSMI 2016 dataset, but again this is hard to confirm due to the poor documentation of the dataset.
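To make the privacy and accuracy tradeoff concrete, the following is a minimal sketch of the Laplace mechanism, one common way of achieving differential privacy, applied to a single published count rather than to row-level data. The toy records, the `diagnosed` field, the function names and the epsilon values are all our own illustrative assumptions; nothing here reflects how OSMI processed its data.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one respondent changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: 40 of 100 respondents report a diagnosis.
records = [{"diagnosed": True}] * 40 + [{"diagnosed": False}] * 60
rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Smaller epsilon means stronger privacy and noisier answers.
strict = dp_count(records, lambda r: r["diagnosed"], epsilon=0.1, rng=rng)
loose = dp_count(records, lambda r: r["diagnosed"], epsilon=2.0, rng=rng)
print(f"epsilon=0.1: {strict:.1f}   epsilon=2.0: {loose:.1f}")
```

In this sketch the noisy count at epsilon 0.1 will typically sit much further from the true value of 40 than the count at epsilon 2.0, which is the accuracy cost described above. A real release would also need to track the total privacy budget spent across all published statistics, which this example does not attempt.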
G. Recommendation 7: Data correction and removal
Our final recommendation is that data correction and recourse for individuals be a guiding principle of open datasets. If there are mistakes in the data, there should be mechanisms to correct those mistakes. These mechanisms should be easily accessible and should operate in a timely manner. Importantly, if participants do not wish to be included in the open release of their data, there should be mechanisms to allow removal of their data from the dataset. This protects participants from harm and enhances trust in the organization that has collected the data. Furthermore, it is consistent with the principle of individual participation found in the Fair Information Practice Principles (DHEW, 1973).
VII. Conclusion & Recommendations
The analysis of the OSMI Mental Health in Tech dataset highlights significant re-identification risks due to the inclusion of employment details, demographic information, and health status disclosures. While OSMI aimed to promote mental health awareness, the lack of proper de-identification techniques exposes survey respondents to privacy threats, employment discrimination, and social stigma. To address these issues, we propose the following recommendations:
- Use differential privacy to reduce the risk of re-identification by introducing statistical noise while preserving data utility.
- Restrict dataset access to vetted researchers instead of making the data publicly available.
- Implement a standardized Privacy Impact Assessment (PIA) before releasing sensitive datasets to evaluate privacy risks.
- Develop an Open Data Governance Code of Conduct that enforces better anonymization practices, clearer consent procedures, and data access controls.
OSMI must re-evaluate its data-sharing policies and implement privacy-by-design principles in future dataset releases. By adopting stronger technical safeguards, stricter regulatory processes, and ethical oversight, OSMI can continue its mission of supporting mental health research while ensuring participant confidentiality and trust.
VIII. References
- About OSMI: Open Sourcing Mental Health - Changing how we talk about mental health in the tech community - Stronger Than Fear. (n.d.). Retrieved March 13, 2025, from https://osmihelp.org/about/about-osmi
- At-Will Employment & States With Exceptions in 2024 | Atticus. (n.d.). www.atticus.com. https://www.atticus.com/advice/workers-compensation/what-is-at-will-employment
- California Legislature. California Business and Professions Code Section 22575. FindLaw, codes.findlaw.com/ca/business-and-professions-code/bpc-sect-22575/.
- Cummings, R., Desfontaines, D., Evans, D., Geambasu, R., Huang, Y., Jagielski, M., Kairouz, P., Kamath, G., Oh, S., Ohrimenko, O., Papernot, N., Rogers, R., Shen, M., Song, S., Su, W., Terzis, A., Thakurta, A., Vassilvitskii, S., Wang, Y.-X., … Zhang, W. (2024). Advancing Differential Privacy: Where We Are Now and Future Directions for Real-World Deployment. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.d3197524
- DHEW. (1973). Records, Computers and the Rights of Citizens.
- European Union. General Data Protection Regulation (GDPR) Article 4 - Definitions. GDPR-Info, gdpr-info.eu/art-4-gdpr/.
- European Union. General Data Protection Regulation (GDPR) Article 7 - Conditions for Consent. GDPR-Info, gdpr-info.eu/art-7-gdpr/.
- European Union. General Data Protection Regulation (GDPR) Article 9 - Processing of Special Categories of Personal Data. GDPR-Info, gdpr-info.eu/art-9-gdpr/.
- European Union. General Data Protection Regulation (GDPR) Article 83 - General Conditions for Imposing Administrative Fines. GDPR-Info, gdpr-info.eu/art-83-gdpr/.
- GDPR. (2018). General Data Protection Regulation (GDPR). https://gdpr-info.eu/
- Green, B., Cunningham, G., Ekblaw, A., Kominers, P., Linzer, A., & Crawford, S. P. (2017). Open Data Privacy (SSRN Scholarly Paper No. 2924751). Social Science Research Network. https://doi.org/10.2139/ssrn.2924751
- Hardy, K., & Maurushat, A. (2017). Opening up government data for Big Data analysis and public benefit. Computer Law & Security Review, 33(1), 30–37. https://doi.org/10.1016/j.clsr.2016.11.003
- Holben, D. R. (2024, August 14). Can you sue if you are terminated for mental health issues? Donald R. Holben & Associates, APC. https://www.sandiegotrialattorneys.com/blog/2024/08/can-you-sue-if-you-are-terminated-for-mental-health-issues/
- Huston, P., Edge, V., & Bernier, E. (2019). Reaping the benefits of Open Data in public health. Canada Communicable Disease Report, 45(10), 252–256. https://doi.org/10.14745/ccdr.v45i10a01
- Johnson, A. E. W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Moody, B., Gow, B., Lehman, L. H., Celi, L. A., & Mark, R. G. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1), 1. https://doi.org/10.1038/s41597-022-01899-x
- Mammal, T. F. (2017, June 21). Data and Mental Health: The OSMI Survey 2016. TDS Archive. https://medium.com/towards-data-science/data-and-mental-health-the-osmi-survey-2016-39a3d308ac2f
- Mulligan, D. K., Koopman, C., & Doty, N. (2016). Privacy is an essentially contested concept: a multi-dimensional analytic for mapping privacy. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2083), 20160118. https://doi.org/10.1098/rsta.2016.0118
- Narayanan, A., & Shmatikov, V. (2008). Robust De-anonymization of Large Sparse Datasets. Proceedings of the 2008 IEEE Symposium on Security and Privacy, pp. 111-125. IEEE.
- Nissenbaum, H. (2011). A Contextual Approach to Privacy Online. Daedalus, 140(4), 32–48. https://doi.org/10.1162/daed_a_00113
- OECD Legal Instruments. (n.d.). Legalinstruments.oecd.org. https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0188
- OSMI Mental Health in Tech Survey 2016. (n.d.). Retrieved March 13, 2025, from https://www.kaggle.com/datasets/osmi/mental-health-in-tech-2016
- Singhal, N. (2024). Stigma, Prejudice and Discrimination Against People With Mental Illness. American Psychiatric Association. https://www.psychiatry.org/patients-families/stigma-and-discrimination
- Solove, D. J. (2006). A Taxonomy of Privacy. University of Pennsylvania Law Review, 154(3), 477–560. https://doi.org/10.2307/40041279
- Sweeney, L. (2000). Simple Demographics Often Identify People Uniquely. Carnegie Mellon University, Data Privacy Working Paper.
- The OSMI 2016 survey results are part of Data Society's Dataset Challenge for the month of March: Open Sourcing Mental Health—Changing how we talk about mental health in the tech community—Stronger Than Fear. (n.d.). Retrieved March 13, 2025, from https://osmihelp.org/blog/the-osmi-2016-survey-results-are-part-of-data.html
- U.S. Congress. 42 U.S. Code § 12112 - Discrimination. Legal Information Institute, Cornell Law School, www.law.cornell.edu/uscode/text/42/12112.
- U.S. Department of Health & Human Services. 45 CFR § 164.514 - Other Requirements Relating to Uses and Disclosures of Protected Health Information. Legal Information Institute, Cornell Law School, www.law.cornell.edu/cfr/text/45/164.514.
- U.S. Equal Employment Opportunity Commission. EEOC Guidance on Employment Discrimination Laws. EEOC, www.eeoc.gov/eeoc-guidance.
- WMA - The World Medical Association-WMA Declaration of Helsinki – Ethical Principles for Medical Research Involving Human Participants. (n.d.). Retrieved March 13, 2025, from https://www.wma.net/policies-post/wma-declaration-of-helsinki/
IX. Appendix
Privacy Frameworks Application
Solove's Privacy Taxonomy (Solove, 2006):
- Information Collection – OSMI claimed that the data they collected was fully anonymous. While they did not directly collect identifying information such as name, birthday, or social security number, other information within the dataset could still contribute greatly to the risk of re-identification.
- Information Processing – The real risk of re-identification comes from the potential for this data to be combined with other public sources of information. Information from websites such as LinkedIn and company websites could potentially be used to re-identify many individuals in this dataset. As the data is publicly available, OSMI should also have alerted participants that the data may be used for purposes other than the original study.
- Information Dissemination – The privacy disclosure provided by OSMI stated that all information collected was anonymous, but due to the aforementioned risks of re-identification, this is not a promise they can uphold, potentially constituting a confidentiality breach. Potential re-identification could also be considered an exposure risk, as the data collected gives an overview of participants' mental health, something that they may not want to make available to others.
- Invasion – The re-identification of participants in this mental health survey is a severe invasion into the lives of the impacted individuals. As stated in the PIA, this information is exceptionally sensitive because it can lead to both social and employment discrimination.
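The linkage risk described under Information Processing depends on how many respondents share the same combination of indirectly identifying attributes (quasi-identifiers). One rough way to gauge this before release is a k-anonymity style check, which we name here as an illustration rather than as part of our original analysis. The sketch below counts how many rows share each combination; the column names and toy rows are hypothetical and are not the OSMI schema.

```python
from collections import Counter


def class_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of indistinguishable rows."""
    sizes = class_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def unique_fraction(rows, quasi_identifiers):
    """Return the share of rows whose quasi-identifier combination is unique."""
    if not rows:
        return 0.0
    sizes = class_sizes(rows, quasi_identifiers)
    singletons = sum(1 for count in sizes.values() if count == 1)
    return singletons / len(rows)


# Hypothetical toy rows; column names are illustrative, not the OSMI schema.
rows = [
    {"age": 29, "gender": "F", "country": "US", "role": "Developer"},
    {"age": 29, "gender": "F", "country": "US", "role": "Developer"},
    {"age": 41, "gender": "M", "country": "UK", "role": "DevOps"},
    {"age": 35, "gender": "M", "country": "US", "role": "Developer"},
]
quasi_identifiers = ["age", "gender", "country", "role"]

print(k_anonymity(rows, quasi_identifiers))      # 1
print(unique_fraction(rows, quasi_identifiers))  # 0.5
```

A k of 1 means at least one respondent is unique on these attributes alone, and so could in principle be matched against an external profile that lists the same attributes. A high k does not by itself guarantee safety, since it says nothing about what an attacker already knows from other sources.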
Nissenbaum's Contextual Integrity (Nissenbaum, 2011):
- Appropriate flows of information and contextual information norms – While it seems that OSMI strove for appropriate flows of information, they overlooked how this information could potentially be used. Participants in the study were led to believe that their survey responses were fully anonymous, but OSMI should have warned them that linking the dataset with external data sources could be used to re-identify respondents.
- Data Subject/Sender – The data subjects of this study are the participants in the survey. The sender is OSMI itself.
- Recipient – The recipient of the data is any researcher, student, or other individual who uses this publicly available dataset. Should OSMI want to decrease risk of re-identification and further protect the privacy of their participants, they may consider vetting users before allowing them access to the data.
- Information Type/Transmission Principles – All of the data within the dataset was directly related to the study OSMI was performing; however, once the data was publicly released, the researchers no longer had any control over other uses of the dataset. As this data can be especially sensitive, it is recommended that OSMI review privacy laws and ensure that their standards are met. Laws and standards they may consider include the General Data Protection Regulation (GDPR) and the Recommendation of the Council concerning Guidelines Governing the Protection of Privacy and Transborder Flows of Personal Data (GDPR, 2018; OECD Legal Instruments, n.d.).
- Ethical Issues – As stated within the body of the text, the biggest ethical issue was determined to be the risk of re-identification. Re-identification could lead to embarrassment to the individual participants as well as the potential for employment and social discrimination.
Mulligan/Koopman's Privacy Analytic (Mulligan et al., 2016):
- Theory Dimension – The data released by OSMI is the object of privacy in this situation. OSMI stated that the participants would be fully anonymous in the survey, but this cannot be guaranteed. For both legal and ethical reasons, OSMI should reconsider their privacy disclosure to ensure that participants are aware of the risks associated with being included in a publicly available dataset.
- Protection Dimension – OSMI did strive to protect privacy by avoiding the collection of personal data like name or birthday, but they missed the risk of re-identification. Differential privacy or other effective measures should be used in future research to decrease the risk of re-identification.
- Harm Dimension – Privacy in the context of this application implies protection from de-anonymization in which users' identity and mental health history may be revealed. This could result in loss of employment and reputational damage for participants.
- Provision Dimension – OSMI is directly responsible for protecting the privacy of their participants. Their privacy disclosure stated that all data collected was anonymous, but we have identified risks of re-identification. OSMI should rewrite their privacy disclosure to accurately portray the risk of re-identification. They should also make clear to participants which privacy safeguards (legal, organizational, and technical) they employ.
- Scope Dimension – Individuals expect their data to be protected, especially when it involves as sensitive a topic as mental health. OSMI should consider all aspects of the scope of their data, including its initial use, data retention, and data sharing. It is concerning that such sensitive data was released globally without any safeguards in place to prevent misuse and re-identification.
A data ethics project from UC Berkeley's School of Information, by Ambro Quach, Joseph Chan, Sarah Julius, and Serina Li.