Expanding the involvement of women in Science, Technology, Engineering, and Mathematics (STEM) across Latin America is crucial for economic advancement, social equity, and global competitiveness; however, these efforts have proven challenging. Women in the region are underrepresented in STEM10 and even more so in leadership positions.17,18 Current information is scarce and reliable data is hard to obtain, which makes it difficult to implement policies that reduce the gender gap in STEM. Researchers, organizations, and policymakers working to reduce the gender gap need access to dependable data to understand the root causes of gender disparities, promote evidence-based interventions, and increase accountability and transparency.
In the quest for solutions to these challenges, an international research network between Bolivia, Brazil, and Peru, “Equality in Leadership for Latin America STEM” (ELLAS), emerged in 2022.6 This network, formed by eight Latin American universities and one from the U.S., runs the research project entitled “Latin American Open Data for Gender Equality Policies Focusing on Leadership in STEM”, funded by the International Development Research Centre (Project ID #109798).a
The project’s objective is to generate and promote the use of a cross-country comparable open data platform on gender disparity within STEM in the participating countries,13 with a focus on leadership.14 For this purpose, it is essential to define an architecture that can handle the complete process of data curation.
In this article, we present an innovative architecture that allows for the curation of different data sources, from raw data to data consumption by individual users such as researchers, policymakers, and decision makers working on STEM and gender issues. This architecture makes it easier for users to locate and access trustworthy information concerning gender policies, initiatives, and contextual factors by consolidating them into a single source. This contrasts with the scattered nature of such information across various formats, vocabularies, and sources.
The Open Data ELLAS Platform Architecture is composed of three layers, as presented in the accompanying figure. The data layer (from the bottom up) organizes two different types of data sources: “primary data,” which comprises mostly unstructured data in PDF format (that is, academic papers), data from social media, and data collected via a survey, for which data fields have been identified about contextual factors, initiatives, and policies related to gender representation and leadership; and “secondary data,” which comprises semi-structured data about women in STEM in Latin America from various websites of national and international organizations.3,12,15,16 This layer relies on the collaboration of multidisciplinary teams to curate the data, ensuring its readiness for integration into the subsequent layer.
The processing layer begins with the collection of structured comma-separated values (CSV) files, which feed the ontology modeling process that represents the knowledge around policies, factors, and initiatives in three languages (Portuguese, English, and Spanish). The tool Protégé is used to model the ontology, which is created in Web Ontology Language (OWL). The next process is semantic mapping, which materializes the knowledge graph7: primary and secondary data structured in CSV files are instantiated into the OWL ontology and become Resource Description Framework (RDF) data through mapping technologies like the Ontotext Refine tool. This process generates a mapping file in JavaScript Object Notation (JSON) format that can be reused to update the graph as new data is generated. These three processes form one complex pipeline orchestrated and integrated by Pentaho and Python technologies. This layer depends on the work of platform developers, such as app and ontology developers. The processing layer also includes the knowledge graph integration, which involves triplification: specific knowledge graphs from different data sources come together and are stored in the GraphDB triplestore.
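To make the idea of triplification concrete, the following is a minimal sketch in Python of how rows of a curated CSV file could be turned into RDF statements in N-Triples syntax. It is not the project's pipeline, which relies on Ontotext Refine, Pentaho, and the OWL ontology. The namespace `http://example.org/ellas/`, the `Policy` class, the column names, and the sample row are all assumptions made up for illustration.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace; the real ELLAS ontology IRIs are not given in the text.
BASE = "http://example.org/ellas/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape backslashes, quotes, and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def csv_to_ntriples(csv_text, id_column, class_name, lang=None):
    """Turn each CSV row into one rdf:type triple plus one triple per filled cell."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{class_name.lower()}/{quote(row[id_column])}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{BASE}{class_name}> .")
        for column, value in row.items():
            # Skip the identifier column, overflow cells, and empty cells.
            if column is None or column == id_column or not value:
                continue
            literal = f'"{escape_literal(value)}"'
            if lang:
                literal += f"@{lang}"  # language tag, e.g. "pt", "en", "es"
            triples.append(f"{subject} <{BASE}{quote(column)}> {literal} .")
    return triples


# Made-up sample row, for illustration only.
sample = "id,title,country\np1,Example policy,Brazil\n"
for line in csv_to_ntriples(sample, "id", "Policy", lang="en"):
    print(line)
```

Each row becomes a subject identified by its `id`, and every non-empty cell becomes a property of that subject. A mapping tool plays the same role, except that the column-to-property correspondence is stored in a reusable mapping file rather than derived from column names.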
Finally, the application layer allows users to search, understand, and use data. This layer mediates access to the data through an interface designed for end users who have no technical knowledge but are interested in gender equality in STEM. Technical users can also access the knowledge graph in GraphDB and query the data through an application programming interface (API), using a query language such as SPARQL or a non-specific language. The development of this layer follows human-centered design approaches, such as value-sensitive design8 and feminist theories.1 All processes in the ELLAS platform utilize cloud services.
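As an illustration of the technical access path, the sketch below builds a SPARQL query and the corresponding request URL, following the SPARQL 1.1 Protocol convention of passing the query text in a `query` parameter. The endpoint address, the repository name `ellas`, and the class IRI are assumptions; the platform's actual endpoint and vocabulary are not given in this article.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the host, port, and repository name are assumptions.
ENDPOINT = "http://localhost:7200/repositories/ellas"
# Hypothetical class IRI; the real ontology terms are not listed in the text.
POLICY_CLASS = "http://example.org/ellas/Policy"


def build_query(class_iri, limit=10):
    """Return a SPARQL SELECT listing properties of instances of one class."""
    return (
        "SELECT ?item ?property ?value\n"
        "WHERE {\n"
        f"  ?item a <{class_iri}> ;\n"
        "        ?property ?value .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request URL."""
    return f"{endpoint}?{urlencode({'query': query})}"


query = build_query(POLICY_CLASS, limit=5)
print(query)
print(build_request_url(ENDPOINT, query))
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen`, with an `Accept: application/sparql-results+json` header to request JSON results. The sketch stops short of sending the request because it depends on a running endpoint.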
We actively engage stakeholders such as policymakers and researchers to identify requirements for our platform and participate in potential interaction scenarios via quantitative and qualitative user studies.4
Data Layer Curation
In order to have the right amount of data integrated in the processing layer, we defined a rigorous and replicable methodology for data curation which includes identifying, collecting, and organizing primary and secondary data.2 Here, we present the resulting instantiation of the data layer.
As shown in the accompanying table, for each kind of data, data sources were defined, as well as the appropriate collection techniques. Each collection of data was analyzed to select reliable and relevant data for our context. In addition, the table shows the number of instances in each data source.
All the selected data about policies,11 initiatives,9 and contextual factors5 was transformed into a knowledge graph with more than 295,000 triples by the end of 2023.
Kind of data | Data source | Collection techniques | Analyzed data |
---|---|---|---|
Primary Data | Survey Data | Survey Design | 10,000+ responses |
Primary Data | Academic Papers | Systematic Literature Review | 352 about Latin American policies, 231 about international policies, 259 about contextual factors, 775 about initiatives, 74 about women leadership in STEM |
Primary Data | Social Media | Systematic Gray Literature Review | 300+ profiles |
Primary Data | Gray literature (Governmental websites, official reports, and more) | Systematic Gray Literature Review | 26 |
Secondary Data | Open Data websites | Web scraping | 8 |
For access to the ELLAS platform and to learn more about the project, visit the ELLAS website.6
Final Remarks
In this article, we described the three-layer architecture of the open data platform and the resulting instantiation of the data layer. An open data platform focused on women in STEM and curated from different data sources gives users such as researchers, policymakers, and decision makers access to reliable information. Once the platform is finalized and published on the ELLAS website, a significant challenge will be engaging stakeholders to use it. While scientific contributions from the project have been disseminated in more than 30 academic papers and conference presentations,6 this outreach is insufficient. Hence, we have initiated efforts to secure public endorsements from interested groups such as universities and international organizations. This strategy aims to enhance awareness of the platform and encourage its use. Ultimately, the use of the platform has the potential to promote informed decision-making, transparency, and active public engagement in the development of gender equality policies for leadership in STEM. Although the project began with three countries in Latin America, our aim is to expand to other countries in the region.