GDG Surrey

Creating the Ideal Multilingual and Multicultural Retrieval Dataset

AI Summary: Priti Yadav initiated a discussion about developing multicultural and multilingual retrieval datasets in the tech industry, focusing on handling diverse queries such as those involving African science fiction or Indigenous Brazilian poetry. Eduard Maievskyi raised concerns about the inherent challenges, such as defining factual content, handling sensitive topics, and maintaining a balance between diverse perspectives, and emphasized the need for careful design to avoid unintentional biases. Priti Yadav appreciated these insights, suggesting the use of proxy indicators to help ensure fairness and inclusivity, acknowledging the complexity of fully capturing the world's diversity while underlining the importance of striving for balanced representation.

As our community embraces diversity and inclusivity in the tech industry, one exciting frontier is the development of retrieval datasets that genuinely reflect our global culture. Imagine datasets that can efficiently handle genre-based queries such as science fiction by African authors or poetry by Indigenous Brazilians. In your view, what are the key features and considerations that an ideal multilingual and multicultural dataset should have?

  • What challenges do you foresee in building such a dataset, and how might they be overcome?

  • How can we ensure that this dataset remains fair and unbiased, and represents a wide range of cultural narratives?

  • What role can local communities, like ours, play in contributing to or shaping such datasets?

  • Have you come across any projects or organizations that are already working towards this goal? What can we learn from them?

Share your thoughts, experiences, and any relevant resources or tools that can foster a development environment grounded in a true representation of multicultural and multilingual landscapes.
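
To make the genre-based queries above a bit more concrete, here is a minimal sketch of how such a dataset could tag each document with language, region, and genre metadata and answer a faceted query like "science fiction by African authors". The `Document` fields and the `facet_filter` helper are illustrative assumptions, not an existing schema or API.

```python
# A minimal sketch of one record in such a retrieval dataset, assuming
# (hypothetically) that each document carries explicit language, region,
# and genre metadata alongside its text.
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str
    language: str                  # e.g. an ISO 639-1 code such as "sw" (Swahili)
    region: str                    # e.g. "Africa", "South America"
    genre: str                     # e.g. "science fiction", "poetry"
    author_background: list[str] = field(default_factory=list)

def facet_filter(docs, *, genre=None, region=None, language=None):
    """Return every document that matches all of the facets supplied."""
    results = []
    for doc in docs:
        if genre and doc.genre != genre:
            continue
        if region and doc.region != region:
            continue
        if language and doc.language != language:
            continue
        results.append(doc)
    return results

# The genre-based queries from the post, expressed as metadata facets.
corpus = [
    Document("d1", "sample text", "en", "Africa", "science fiction", ["Nigerian"]),
    Document("d2", "sample text", "pt", "South America", "poetry", ["Indigenous Brazilian"]),
]
african_scifi = facet_filter(corpus, genre="science fiction", region="Africa")
indigenous_poetry = facet_filter(corpus, genre="poetry", language="pt")
print([d.doc_id for d in african_scifi], [d.doc_id for d in indigenous_poetry])
```

Richer facets (dialect, time period, author community) would almost certainly be needed in practice; the point is only that cultural context has to be captured as first-class metadata for queries like these to work at all.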

2 comments
  • What challenges do you foresee in building such a dataset, and how might they be overcome?

    It will be challenging to define what the absolute truth should be for such a dataset, and which facts and information should or should not be included in it. It is a profound, fundamental problem that could be discussed for hours, and at this moment I do not see any agreed-upon rules or concepts that could solve it. Let me provide a few examples:
    1. All religion-related information.
    2. Historical and political information tied to existential confrontations between nations and countries.
    3. Scientific Theories and Debates: topics related to scientific theories, such as evolution, climate change, or the origins of the universe, can be contentious and may have varying perspectives depending on the scientific community, cultural context, or individual beliefs.
    4. Social and Cultural Norms: answers to such questions can be sensitive and influenced by regional, cultural, or personal biases.
    5. Economic and Financial Systems: answers to questions about the best economic systems, taxation policies, or the impact of globalization can be subjective and shaped by individual experiences, cultural values, or socioeconomic backgrounds.
    6. Philosophical and Moral Dilemmas: topics related to morality, ethics, and philosophical concepts, such as the nature of free will, the morality of artificial intelligence, or the ethics of euthanasia, can be deeply divisive and influenced by personal convictions, cultural norms, or philosophical traditions.

  • How can we ensure that this dataset remains fair and unbiased, and represents a wide range of cultural narratives?

    :)

  • What role can local communities, like ours, play in contributing to or shaping such datasets?

    Introduce adequate and effective guardrails so that the system can safely, and without negative consequences, handle questions that might ignite aggression between members of our local community (a rough sketch of one possible guardrail follows below this comment).

  • Have you come across any projects or organizations that are already working towards this goal? What can we learn from them?

    That is one of the biggest and hottest problems right now, and almost everyone is trying to propose a solution to it. The best practices are: "do no harm" & "do not try to fix this world". It is very easy to introduce biases into such systems and lose the equilibrium.
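
To illustrate the guardrail idea from the third answer above, here is a rough, purely hypothetical sketch: before retrieval, a query is checked against a hand-written list of sensitive categories (similar to the ones listed in the first answer) so it can be routed to a multi-perspective presentation instead of a single "authoritative" answer. The category names and keywords are placeholders, not a real taxonomy or library.

```python
# Purely illustrative guardrail: flag queries that touch sensitive categories
# and route them to a multi-perspective presentation rather than asserting
# one answer as ground truth. The keyword lists are hypothetical placeholders.
SENSITIVE_TOPICS = {
    "religion": ["religion", "sacred", "blasphemy"],
    "geopolitics": ["war", "territorial dispute", "annexation"],
    "ethics": ["euthanasia", "free will", "morality"],
}

def flag_sensitive(query: str) -> list[str]:
    """Return the sensitive categories a query appears to touch, if any."""
    q = query.lower()
    return [topic for topic, keywords in SENSITIVE_TOPICS.items()
            if any(kw in q for kw in keywords)]

def handle_query(query: str) -> str:
    flagged = flag_sensitive(query)
    if flagged:
        # Show several sourced perspectives instead of a single "truth".
        return f"multi-perspective view (flagged: {', '.join(flagged)})"
    return "standard retrieval"

print(handle_query("What is the morality of euthanasia?"))
```

A keyword check like this is obviously crude; the design point is simply that the dataset and the system around it should know which questions have no single neutral answer and present them accordingly.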

Hi Eduard, thank you for such a thoughtful response. You’ve really captured how complex this challenge is. One thing I keep coming back to is whether we can use proxy indicators to guide fairness, like checking if a variety of perspectives are present or if voices from often-overlooked regions are intentionally included. Of course, it’s not about choosing sides, but about reflecting the diversity that already exists in the world.

I also connect with your point about not trying to force a solution. We may never capture every nuance, but that’s exactly why these conversations matter. Even just noticing what’s missing can help us design with more care. I’d love to hear how you and others in the community think about this.
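
To make the proxy-indicator idea a little more concrete, here is a small sketch that measures how a dataset's documents are distributed across a metadata field such as language or region and reports a normalized-entropy balance score. The field names and the score are illustrative choices that assume such metadata exists; a high score is only a rough signal and does not by itself guarantee fair representation.

```python
# A sketch of one possible proxy indicator: how evenly are documents spread
# across a metadata field (e.g. language or region)? The balance score is a
# normalized entropy, used here only as an illustrative signal.
import math
from collections import Counter

def coverage(records, field_name: str) -> dict[str, float]:
    """Share of records per value of a metadata field (e.g. 'language')."""
    counts = Counter(r[field_name] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def balance_score(shares: dict[str, float]) -> float:
    """Normalized entropy: 1.0 = perfectly even, 0.0 = a single group."""
    if len(shares) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in shares.values() if p > 0)
    return entropy / math.log(len(shares))

records = [
    {"language": "en", "region": "North America"},
    {"language": "sw", "region": "East Africa"},
    {"language": "pt", "region": "Brazil"},
    {"language": "en", "region": "Europe"},
]
lang_shares = coverage(records, "language")
print(lang_shares, balance_score(lang_shares))
```

Any single number like this flattens a lot of nuance; it is meant only as one signal among many to prompt exactly the kind of "noticing what's missing" described in the reply above.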