Shaoyang Xu1, Yongqi Leng2, Linhao Yu2, Deyi Xiong2,1
As large language models (LLMs) become increasingly accessible in many countries, it is essential to align them to serve pluralistic human values across cultures. However, pluralistic culture alignment in LLMs remains an open problem[1]. In this paper, we propose CultureSPA, a Self-Pluralising Culture Alignment framework that allows LLMs to simultaneously align to pluralistic cultures. The framework first generates questions on various culture topics, then yields LLM outputs in response to these generated questions under both culture-aware and culture-unaware settings. By comparing culture-aware and culture-unaware outputs, we are able to detect and collect culture-related instances. These instances are employed to fine-tune LLMs to serve pluralistic cultures in either a culture-joint or culture-specific way. Extensive experiments demonstrate that CultureSPA significantly improves the alignment of LLMs to diverse cultures without compromising general abilities, and that further improvements can be achieved when CultureSPA is combined with advanced prompt engineering techniques. Comparisons between culture-joint and culture-specific tuning strategies, along with variations in data quality and quantity, illustrate the robustness of our method. We also explore the mechanisms underlying CultureSPA and the relations between different cultures it reflects.
Corresponding author: Deyi Xiong, dyxiong@tju.edu.cn
Large language models, such as GPT-4[2], have gained widespread use due to their extensive knowledge and prowess in reasoning[3][4][5]. Given the multicultural nature of our society, it is essential for LLMs to serve diverse human values and preferences across cultures. However, existing alignment techniques, such as RLHF[6] and DPO[7], do not specifically take cultural diversity into account. With such alignment techniques, LLMs tend to learn biased human values and preferences[8][9][1][10].
Many studies examine how well LLMs align to serve specific cultures by simulating social surveys on LLMs[11][12][13][14][15][16][17]. In these studies, the similarity between the outputs of an LLM and real-world survey answers from a specific culture is calculated as the cultural alignment score (CAS) between the LLM and given culture. Findings with CAS suggest that LLMs often exhibit cultural dominance, as shown in Figure 1 (Culture-Unaware Prompting), where LLaMA3’s outputs naturally align more closely to certain North American and European cultures.
To mitigate this loss of distributional pluralism in LLMs, efforts have been dedicated to pluralistic value alignment in pre-training[18][19][12][15], alignment training[13][16][20][21], and prompt engineering[11][12][15][22][17][23]. However, training-based approaches require external cultural data, which are often scarce, especially for underrepresented cultures. Meanwhile, prompt engineering methods necessitate careful example selection and can yield inconsistent results[22].
To address these issues, we propose to explore self-pluralising culture alignment without relying on external cultural resources. Our approach is grounded in two key findings: (1) Research in prompt engineering shows that LLMs possess a certain level of internal knowledge about diverse cultures. As illustrated in Figure 1 (Culture-Aware Prompting), simply prompting LLaMA3 to align to a given culture is an effective way to enhance its cultural alignment; (2) Studies on data synthesis[24][25] indicate that LLMs can generate data using their existing knowledge to improve performance on specific tasks. Building on these findings, we explore the following research question: Can we harness the internal culture knowledge of LLMs to enhance their alignment to specific cultures?
To this end, we propose CultureSPA, a framework that achieves pluralistic culture alignment in LLMs by “activating” their internal culture knowledge. As illustrated in Figure 2, CultureSPA first generates survey questions on diverse culture topics (§4.1). It then collects LLM outputs for these questions under two scenarios: culture-unaware prompting, where the model does not receive specific cultural information, and culture-aware prompting, where the model is prompted to align to a specific culture (§4.2). Samples that exhibit shifted outputs when cultural information is provided are deemed the most representative of a specific culture. The culture-related QA pairs collecting step is employed to select such samples (§4.3). The collected data instances are ultimately used for culture-joint and culture-specific supervised fine-tuning (SFT) (§4.4).
We conduct extensive experiments to examine CultureSPA. Experimental results indicate that CultureSPA effectively enhances LLM alignment to pluralistic cultures and can be integrated with advanced prompt engineering techniques (§5.3). A comparison between culture-joint and culture-specific SFT strategies demonstrates the superiority of the former (§5.4). Additionally, we explore the mechanism behind CultureSPA (§6.1), investigate cross-cultural relationships (§6.2), and examine the effects of data quality and quantity (§6.3). We summarize our contributions as follows:
Extensive efforts have been made to enhance the pluralistic culture alignment of LLMs. These efforts include advancements in pre-training[18][19][12][15] and alignment training[13][16][20][21], which rely on external data that reflect specific cultures. Model inference strategies have also been developed, including effective prompt design[11][12][15][22], in-context learning[17][23], and multi-model collaboration[26]. In contrast to these approaches, our work explores pluralistic culture alignment without depending on external cultural resources by activating internal culture knowledge in LLMs.
Traditional methods for instruction tuning in LLMs use either previously manually created NLP datasets[27][28] or real-world user prompts[6]. However, these methods are time-consuming and challenging to scale. Recent efforts have explored LLM-driven data synthesis[29][30][24][25] to address these issues. Specifically, Self-Instruct[24] utilizes the in-context learning and generation capabilities of LLMs to automatically generate general instruction tuning data from 175 seed instructions. Our work follows a philosophy similar to Self-Instruct to produce diverse questions from seed questions on cultures, investigating the feasibility of self-pluralising culture alignment in LLMs.
In this section, we first define culture and culture alignment, then present the framework used to assess the cultural alignment of LLMs.
Culture generally refers to the way of life shared by a collective group of people, distinguishing them from other groups with unique cultural identities[31]. It encompasses both material aspects, such as names, foods, beverages, clothing, locations, and places of worship, as well as non-material elements, including beliefs, values, customs, and linguistic practices. In the context of cross-cultural NLP[31], culture alignment is the process of aligning an NLP system to the shared beliefs, values, and norms of users from specific cultures, who interact with the system[32][33][16].
While many studies use languages as proxies for cultures[11][12][15], we classify cultures by geographical regions and focus solely on English contexts. Appendix A provides a detailed discussion on this.
In line with existing research[11][12][14][15][16], we measure the cultural alignment of LLMs by simulating, on LLMs, surveys that sociologists have conducted across different populations. For each culture, we compute the similarity between the outputs of LLMs and the actual survey responses from that culture to determine the degree of LLM alignment to the culture.
We utilize the World Values Survey (WVS)[34] for our assessment. The WVS collects data in multiple waves, and we focus on Wave 7, which was conducted from 2017 to 2020 and covers 57 countries. The survey results are published per question and classified into 13 culture topics.1 We utilize 260 questions across these topics as our seed questions. Appendix B provides the number of questions and sample questions for each culture topic.
Since the WVS collects actual responses from people in different countries, we can utilize these responses as references. We assume that the WVS includes N survey questions $[q_1, q_2, \ldots, q_N]$, each representing a multiple-choice question with a set of numerical options (e.g., 1. Strongly Disagree, 2. Disagree, 3. Neutral, etc.). For a specific culture $c$, we first aggregate the answers from participants belonging to that culture using a majority vote, resulting in $A^c = [a^c_1, a^c_2, \ldots, a^c_N]$. Next, we prompt the LLM to answer these questions, producing model outputs $R^c = [r^c_1, r^c_2, \ldots, r^c_N]$. Following[12], we calculate the cultural alignment score $S(A^c, R^c)$ as follows:
$$S(A^c, R^c) = \left(1 - \frac{\sqrt{\sum_{i=1}^{N} \left(a^c_i - r^c_i\right)^2}}{\text{max\_distance}}\right) \times 100$$

where max_distance represents the maximum possible difference between the selected options, ensuring the score is normalized. A higher score indicates better alignment with culture $c$.
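To make the scoring concrete, the following is a minimal Python sketch (not the authors' implementation); in particular, deriving max_distance from per-question option ranges is our assumption about how the normalization is realized.

```python
import numpy as np

def cultural_alignment_score(a_c, r_c, option_ranges):
    """Compute S(A^c, R^c) between majority-vote survey answers a_c and model outputs r_c.

    a_c, r_c: sequences of selected option numbers for the N questions.
    option_ranges: per-question (min_option, max_option) pairs, used here (as an
    assumption) to derive the maximum possible distance between answer vectors.
    """
    a_c, r_c = np.asarray(a_c, dtype=float), np.asarray(r_c, dtype=float)
    distance = np.sqrt(np.sum((a_c - r_c) ** 2))
    max_distance = np.sqrt(sum((hi - lo) ** 2 for lo, hi in option_ranges))
    return (1.0 - distance / max_distance) * 100.0

# Toy example with three 5-point questions: close answers give a high score.
print(cultural_alignment_score([1, 3, 5], [2, 3, 4], [(1, 5)] * 3))  # ~79.6
```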
Collecting external cultural data for SFT is labor-intensive, particularly for underrepresented cultures. We hence propose CultureSPA, as illustrated in Figure 2, which involves generating diverse questions from seed questions (§4.1), yielding culture-unaware/aware LLM outputs (§4.2), collecting culture-related QA pairs (reformulated as instruction-response pairs) (§4.3), and conducting culture-joint and culture-specific SFT (§4.4), to achieve self-pluralising culture alignment in LLMs. Appendix C provides all prompting templates used in this framework.
In the proposed CultureSPA, the data used to activate the internal culture knowledge of LLMs comprises instruction-response pairs related to diverse cultures. Formally, given a set of cultures $C$, we aim to gather “activation” data for each culture $c \in C$ as $[(\text{Inst}^c_1, \text{Resp}^c_1), (\text{Inst}^c_2, \text{Resp}^c_2), \ldots]$. For the instruction component, we use questions from the WVS as seed examples to prompt LLMs to generate additional culture-related questions in a self-instructing way. The prompting template is shown in Table 5 in Appendix.
Previous studies indicate that the diversity of instruction-tuning data is crucial for final performance[35]. To increase data diversity, we generate questions from 13 culture topics in the WVS in an iterative manner, inspired by the Self-Instruct method[24]. Specifically, we start with a pool of 260 multiple-choice questions across these culture topics. For each topic, we generate new questions iteratively. In each substep, we sample five in-topic questions from the question pool as in-context examples, with three taken from the WVS seed set and two from previously generated questions. This iteration continues until the target data volume is reached. Afterward, we filter the generated questions to ensure quality. The filtering process and question samples are provided in Appendix D.
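As a rough illustration of this iterative loop (a sketch under assumptions, not the authors' code), the snippet below assumes a hypothetical llm_generate helper that sends the Table 5 prompt to the model and returns one new multiple-choice question.

```python
import random

def generate_topic_questions(llm_generate, seed_questions, prompt_template, target_size=1000):
    """Iteratively grow the question pool for one culture topic.

    llm_generate: hypothetical callable, prompt -> one newly generated question.
    seed_questions: WVS seed questions belonging to this topic.
    prompt_template: assumed to expose an {examples} placeholder (cf. Table 5).
    """
    generated = []
    while len(generated) < target_size:
        # Five same-topic in-context examples: three WVS seeds plus two
        # previously generated questions (fall back to seeds in early iterations).
        examples = random.sample(seed_questions, 3)
        examples += random.sample(generated, 2) if len(generated) >= 2 else random.sample(seed_questions, 2)
        prompt = prompt_template.format(examples="\n\n".join(examples))
        generated.append(llm_generate(prompt))
    return generated
```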
Following this process, we obtain a new set of questions on diverse culture topics, denoted as $Q = [q_1, q_2, \ldots]$. The scale of the generated questions is introduced in Section 5.1.
After collecting $Q$, we prompt LLMs to answer these questions by selecting the most appropriate options. This process generates the response part of the “activation” data. To fully activate the internal knowledge of LLMs about diverse cultures, we establish two scenarios, culture-unaware and culture-aware prompting, and compare the differences between the outputs they yield (§4.3). In the culture-unaware prompting scenario, we prompt a given LLM to answer each question without a specific cultural context, relying instead on its own set of values. In contrast, in the culture-aware prompting scenario, we treat the model as a real person with a cultural background $c \in C$. We expect the culture-aware prompting strategy to activate the internal knowledge of the given LLM about culture $c$. By comparing model outputs yielded in these two scenarios, we aim to make such internal culture knowledge explicit. Additionally, inspired by cross-cultural communication[36][37][38], we introduce an intuitive variant termed cross-culture thinking for the culture-aware prompting scenario, which prompts LLMs to consider the relationships between the given culture $c$ and other cultures. Prompting templates for the culture-unaware and culture-aware prompting scenarios are provided in Tables 6 and 7 in Appendix, respectively. Cross-culture thinking is detailed in Tables 8 and 9.
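Since the two scenarios differ only in the system prompt (Tables 6 and 7), a minimal sketch of how the chat messages could be assembled is given below; the exact wording of the user turn beyond the first instruction sentence is assumed from the Appendix templates.

```python
UNAWARE_SYSTEM = (
    "You are a real person with your own set of values. Please fill out the World Values "
    "Survey and answer the questions honestly according to your own value system."
)
AWARE_SYSTEM = (
    "You are a real person with a/an {culture} cultural background. Please fill out the "
    "World Values Survey and answer the questions honestly according to your own value system."
)
# The question/options formatting below is an assumption based on the Appendix templates.
USER_TEMPLATE = (
    "Given a #Question and #Options, choose the option that best aligns with your own value "
    "system to answer the question.\n#Question: {question}\n#Options: {options}\n#Answer:"
)

def build_messages(question, options, culture=None):
    """Return chat messages for culture-unaware (culture=None) or culture-aware prompting."""
    system = UNAWARE_SYSTEM if culture is None else AWARE_SYSTEM.format(culture=culture)
    user = USER_TEMPLATE.format(question=question, options=options)
    return [{"role": "system", "content": system}, {"role": "user", "content": user}]
```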
In this step, we collect culture-unaware LLM outputs as $O = [o_1, o_2, \ldots]$ and culture-aware LLM outputs as $O^c = [o^c_1, o^c_2, \ldots]$ for each culture $c$.
For culture $c$, we now obtain a question set $Q$ along with two sets of LLM outputs: culture-unaware outputs $O$ and culture-aware outputs $O^c$. With them, we identify questions that trigger inconsistent outputs between the two scenarios. We pair the identified questions with their culture-aware outputs to create our activation data. Specifically, if the outputs for question $q_i$ differ between the two scenarios ($o_i \neq o^c_i$), we reformulate the question-answer pair $(q_i, o^c_i)$ as an instruction-response pair $(\text{Inst}^c_i, \text{Resp}^c_i)$ and include it in the activation data for culture $c$. We assume that among all the culture knowledge activated by the culture-aware prompting scenario, the samples with output shifts between the two scenarios are the most representative.
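A minimal sketch of this collection step, assuming the outputs are stored as parallel lists of selected option numbers:

```python
def collect_culture_related_pairs(questions, unaware_outputs, aware_outputs, culture):
    """Keep (question, culture-aware answer) pairs whose answer shifted once the
    culture was made explicit, i.e. o_i != o_i^c."""
    activation_data = []
    for q_i, o_i, o_i_c in zip(questions, unaware_outputs, aware_outputs):
        if o_i != o_i_c:
            activation_data.append({
                "culture": culture,
                "instruction": q_i,  # later wrapped with the culture-aware prompting template
                "response": o_i_c,
            })
    return activation_data
```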
After creating activation data for all cultures, we use them to perform SFT for LLMs. We consider two SFT strategies. The first strategy combines all cultural activation data and injects them into one LLM, which we refer to as CultureSPA (joint). The second strategy creates a separate model per culture, leading to multiple CultureSPA (specific) models. To distinguish between cultures during SFT, we prompt the trained model with the culture that the corresponding activation data represents, using the same prompting template as in the culture-aware prompting scenario (§4.2).
We conducted extensive experiments to examine the proposed framework against various baselines.
We categorized cultures by geographical regions and selected 18 countries2 across five continents for our experiments. All selected countries are included in the WVS. We conducted experiments with LLaMA-3-8B-Instruct3, a state-of-the-art LLM primarily trained on English data.
Fine-tuning LLMs with full parameters is resource-intensive. To address this, we utilized LoRA[39], a parameter-efficient tuning method. We implemented this using LLaMA-Factory4 and trained the model on a single A100 GPU.
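The tuning itself was run with LLaMA-Factory; purely as an illustration of LoRA-based SFT (not the authors' configuration, and with arbitrarily chosen hyperparameters), an equivalent setup with the Hugging Face PEFT library might look as follows.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA adapters on the attention projections; rank/alpha/dropout are illustrative guesses.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 8B parameters is trainable
```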
We compared our framework against the following baselines: P1, which prompts LLMs to align with a specific culture using the same prompting template as that used in the culture-aware prompting scenario; P2, which utilizes the proposed cross-culture thinking during inference; and P3, proposed in Self-Alignment[17], which leverages the in-context learning capabilities of LLMs to promote culture alignment. When LLMs are presented with a test question on a specific culture topic, this method calculates its similarity to other samples from the same topic using the chrF++ metric[40]. It then selects the five most similar questions along with the reference answer from the target culture to create in-context examples. Additionally, our baselines include two combinatory methods: P1+P3 and P2+P3. Appendix E provides all the prompting templates for the baselines.
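For P3, the retrieval of in-context examples can be sketched with the chrF++ implementation in sacrebleu (word_order=2); which string is treated as hypothesis versus reference is an implementation detail we assume here.

```python
from sacrebleu.metrics import CHRF

chrfpp = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

def select_in_context_examples(test_question, same_topic_questions, k=5):
    """Rank same-topic questions by chrF++ similarity to the test question and
    return the k most similar ones (their target-culture reference answers are
    attached separately when building the prompt)."""
    scored = [
        (chrfpp.sentence_score(test_question, [candidate]).score, candidate)
        for candidate in same_topic_questions
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:k]]
```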
Using 260 questions from the WVS as a seed dataset, we initially generated 1,000 questions for each culture topic, totaling 13,000 questions. During the data filtering process, we removed 153 questions. Next, we collected 19 types of LLM outputs for these questions, one from the culture-unaware prompting scenario and the other 18 from the culture-aware prompting scenario corresponding to the 18 selected cultures. The final tuning dataset, obtained through the culture-related QA pairs collecting step (§4.3), contains 62,127 examples. We also applied cross-culture thinking (CCT) to the culture-aware prompting scenario, creating a variant of the tuning dataset with 77,086 examples. We used these two datasets to fine-tune two models, CultureSPA and CultureSPA (CCT).
Figure 3 illustrates the distribution of topics and cultures in the generated activation data for CultureSPA. We find that questions about religion, security, corruption, and economy often result in inconsistent LLM outputs when faced with specific cultures. This suggests that, at least within LLaMA3’s internal knowledge, these topics are more likely to create cultural differences. In contrast, topics such as happiness and well-being and postmaterialist index demonstrate high consistency, suggesting that LLaMA3 has a more similar viewpoint on these dimensions across various cultures.
Additionally, we observe that prompting the model to align with cultures from Asia and Africa results in more significant changes in its outputs compared to prompting it with cultures from America, Europe, and Oceania. This finding supports the results presented in Figure 1, emphasizing the subjective nature of LLMs regarding specific cultures. Notably, the model shows minimal inconsistencies in its outputs for the USA, indicating an internal bias towards American culture within LLaMA3. Statistics for the CultureSPA (CCT) activation data provide similar findings, as presented in Appendix G.
Main results are provided in Table 1, which illustrates cultural alignment scores for both baselines and our proposed methods across various cultures. It shows that our framework can improve the alignment of LLMs to diverse cultures. For example, CultureSPA with P1 increases the alignment score from 66.22 to 67.29. Furthermore, the performance gains from CultureSPA are orthogonal to those from advanced prompt engineering methods, as CultureSPA with P2+P3 increases the score to 69.11. Notably, our method provides more stable improvements for underrepresented cultures, particularly those from Africa. In specific cases, such as with P1, CultureSPA (CCT) surpasses CultureSPA on its own. Additionally, applying CCT at model inference, referred to as P2, consistently produces higher scores than P1. These findings underscore the effectiveness of CCT.
Model | 20% | 40% | 60% | 80% | 100% |
---|---|---|---|---|---|
CultureSPA (specific) | 66.19 | 65.75 | 66.23 | 66.44 | 66.75 |
CultureSPA (joint) | 65.52 | 66.47 | 66.56 | 66.63 | 67.29 |
Table 2 compares culture-joint vs. culture-specific SFT using varying proportions of the activation data. Results indicate that CultureSPA (joint) outperforms CultureSPA (specific) across most data proportions. We hypothesize that SFT with data from various cultures enhances LLMs’ ability to understand the relationships between different cultures, resulting in better cultural alignment and steerability. Additionally, aligning a single model to serve multiple cultures is more efficient for model development and deployment. We refer to CultureSPA (joint) simply as CultureSPA in our paper.
In addition to the above experiments, we conducted in-depth analyses into the framework to understand how CultureSPA works.
The final training instances are obtained through CRQPC (Culture-Related QA Pairs Collecting, §4.3). For a given culture $c$, let $q_i \in Q$, $o_i \in O$, and $o^c_i \in O^c$ represent the $i$-th question and its corresponding culture-unaware and culture-aware LLM outputs, respectively. CRQPC selects QA pairs $(q_i, o^c_i)$ where $o_i \neq o^c_i$. The assumption behind this process is that samples showing changes in model outputs between the culture-unaware and culture-aware prompting scenarios best represent a specific culture. To validate this and explore the mechanisms of CultureSPA, we compared CRQPC with two alternative methods: Consistent Data Sampling (CDS), which selects pairs $(q_i, o^c_i)$ where $o_i = o^c_i$, and Random Data Sampling (RDS), which randomly samples from all pairs $(q_i, o^c_i)$. We ensured the same sample sizes for all three methods for a fair comparison.
Figure 4 presents comparison results. First, we observe that CDS can only enhance alignment between LLMs and certain pre-biased cultures, such as CAN, GBR, AUS, and NLD, but significantly reduces alignment with cultures from Asia and Africa. In contrast, RDS, which includes certain samples with inconsistent outputs, successfully improves alignment across different cultures. Finally, CRQPC, which utilizes all examples with inconsistent outputs, achieves the best alignment, especially for certain previously underrepresented cultures.
From this comparison, we summarize the mechanism of CultureSPA: the culture-aware prompting strategy can simultaneously elicit biased and accurate knowledge about specific cultures from the given LLM. Samples that the LLM is highly confident about, regardless of whether it is prompted to align to specific cultures, are more likely to reflect biases. In contrast, samples that readily adapt to specific cultural contexts are more likely to accurately represent that culture. CRQPC is designed to exclude the former type of samples and retain the latter, ultimately producing better tuning data.
In this section, we explore whether LLM outputs reflect the relations between cultures. To assess this, we calculated cross-cultural alignment scores from LLM outputs, denoted as $S(R^{c_i}, R^{c_j})$, where $c_i, c_j \in C$. We also computed $S(A^{c_i}, A^{c_j})$ using the WVS test data as a reference. To evaluate how well LLM outputs mirror these relations, we analyze the Pearson correlation between the score distributions derived from LLM outputs and WVS data.
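As a minimal sketch of this comparison (assuming the scores are arranged as |C| x |C| matrices and that only off-diagonal entries are compared, since the diagonal is trivially 100):

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_with_reference(llm_scores, wvs_scores):
    """Pearson correlation between cross-cultural alignment score matrices
    computed from LLM outputs, S(R^{c_i}, R^{c_j}), and from WVS data, S(A^{c_i}, A^{c_j})."""
    llm_scores, wvs_scores = np.asarray(llm_scores), np.asarray(wvs_scores)
    off_diagonal = ~np.eye(llm_scores.shape[0], dtype=bool)  # assumption: ignore self-alignment
    r, p_value = pearsonr(llm_scores[off_diagonal], wvs_scores[off_diagonal])
    return r, p_value
```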
Figure 5 displays the cross-cultural alignment scores for the WVS reference and LLM outputs across three methods, along with their correlation coefficients. The WVS reference reveals that cultures naturally cluster into two groups. The first group consists of cultures from North America (USA, CAN), Western Europe (GBR, NLD, DEU), and Oceania (AUS, NZL). The second includes cultures from South America (BOL, BRA), Eastern Europe (UKR), and all included cultures from Asia and Africa. Scores within each group are high, whereas scores between groups are lower. Interestingly, LLM outputs also reflect these cultural groupings, although the accuracy varies depending on the method used. Specifically, baseline P1 shows high alignment scores between some unrelated cultures, which leads to blurred distinctions between cultural groups. In contrast, our method generates LLM outputs that more accurately reflect the cultural relationships observed in the reference data.
Model | Culture | MMLU | GSM8K | IFEval |
---|---|---|---|---|
Baseline | 66.22 | 67.61 | 79.30 | 67.84 |
All (60K) | 67.29 | 67.69 | 77.94 | 69.13 |
One (60K) | 67.28 | 67.53 | 78.32 | 68.39 |
All (240K) | 67.53 | 67.97 | 78.39 | 66.54 |
We explore the effects of data quality and quantity on LLMs’ cultural alignment and general abilities. While Appendix H details the experimental settings, we provide a brief overview: All (60K) is a basic setting, One (60K) represents low data quality, and All (240K) indicates a larger data quantity.
Results in Table 3 reveal several findings. First, low data quality has almost no impact on cultural alignment performance, indicating that minimal real data used as seeds can achieve self-pluralising culture alignment. Second, increasing the data volume improves alignment, a finding also observed in Table 2. Third, all settings have little impact on LLMs’ knowledge levels but somewhat reduce LLMs’ mathematical abilities. We also observe that our approach may enhance LLMs’ instruction-following abilities.
In this paper, we have presented CultureSPA (Self-Pluralising Culture Alignment), a novel framework that improves the cultural alignment of LLMs without relying on massive external cultural data. Our experiments demonstrate the effectiveness of CultureSPA, confirming that the internal knowledge of LLMs related to diverse cultures can be activated to enhance their alignment with specific cultures. Comparisons between culture-joint and culture-specific SFT, along with variations in data quality and quantity, demonstrate the robustness of our method. Further exploration of the mechanisms behind CultureSPA and the cultural relationships reflected in LLM outputs reveals interesting findings.
One main limitation of our work is that our exploration of culture alignment is restricted to questions from the World Values Survey. Future research could investigate a wider range of scenarios, such as open-domain conversations. Additionally, our experiments included only 18 representative countries across five continents. Future work could encompass a more diverse array of cultures.
While many studies use languages as proxies for cultures[11][12][15], we classify cultures by geographical regions and focus solely on English contexts. Our decision is based on two points: (1) Languages and cultures do not always correspond[41]. Culture can vary significantly even within the same language. For instance, it is unjustified to assume that “English” reflects a single, unified set of values[42]. Moreover, one culture can be expressed through multiple languages, as seen in the Nordic countries[43]. See[31] for further explanations. (2) LLMs are trained on multilingual data with uneven resources, leading to different levels of proficiency across languages[44][45]. Probing LLMs’ cultural alignment with a target culture using the corresponding language may be limited by the linguistic abilities of the models, which may not reliably reflect their true culture alignment.5
Table 4 presents the number of questions and a sample question for each of the 13 culture topics in the WVS.
Our framework includes several prompting templates to construct the tuning data. The prompting templates are presented in the following tables: Table 5 for generating diverse questions, Table 6 for yielding culture-unaware LLM outputs, Table 7 for yielding culture-aware LLM outputs, and Table 8 for cross-culture thinking. Specifically, the selection of related cultures for cross-culture thinking is provided in Table 9.
System Prompt: You are a social scientist on the World Values Survey team, dedicated to studying and understanding shifts in human values across nearly 100 countries. Your work involves rigorous research designs and aims to capture a comprehensive view of human beliefs through nationally representative surveys. Instruction: Please come up with one new survey question.
System Prompt: You are a real person with your own set of values. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
System Prompt: You are a real person with a/an {Culture} cultural background. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
System Prompt: You are a real person with a/an {Culture} cultural background. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Before you respond, take a moment to think about how {Culture} culture is similar to {Culture1}, {Culture2}, and {Culture3} cultures, and how {Culture} culture is different from {Culture4}, {Culture5}, and {Culture6} cultures. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
Culture | Culture1 (similar) | Culture2 (similar) | Culture3 (similar) | Culture4 (different) | Culture5 (different) | Culture6 (different) |
---|---|---|---|---|---|---|
USA | CAN | GBR | NZL | ZWE | NGA | IND |
CAN | NLD | AUS | GBR | NGA | ZWE | KEN |
BOL | ZWE | IND | UKR | NZL | AUS | GBR |
BRA | USA | UKR | KEN | IND | ZWE | NGA |
GBR | CAN | NLD | AUS | ZWE | NGA | ETH |
NLD | CAN | AUS | GBR | NGA | ZWE | KEN |
DEU | AUS | NZL | NLD | ZWE | NGA | KEN |
UKR | RUS | ETH | CHN | NZL | NLD | AUS |
CHN | RUS | UKR | ETH | BRA | NZL | GBR |
RUS | UKR | CHN | ETH | NZL | NLD | AUS |
IND | UKR | BOL | CHN | GBR | NZL | NLD |
THA | UKR | CHN | BOL | AUS | NLD | NZL |
KEN | UKR | ETH | NGA | NZL | NLD | AUS |
NGA | ZWE | ETH | KEN | NZL | NLD | AUS |
ETH | UKR | CHN | ZWE | NZL | NLD | AUS |
ZWE | BOL | NGA | ETH | NZL | NLD | AUS |
AUS | NZL | NLD | CAN | ZWE | NGA | KEN |
NZL | AUS | NLD | CAN | ZWE | NGA | ETH |
Each data instance consists of a question and its options. We begin by analyzing the length of all questions and counting the number of options. We do not find any samples with excessively long questions or an unusual number of options. Next, we remove any duplicate questions. The following step focuses on checking the formats. We filter out samples with two types of formatting errors: (1) options that do not fully match the question content, and (2) inconsistent formats between consecutive options. Table 13 displays the filtered samples alongside those that are retained.
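A simplified sketch of these checks is shown below; the length and option-count thresholds are our assumptions, and the content-mismatch check (error 1) is only indicated, since it requires manual or LLM-based review.

```python
import re

def filter_generated_questions(samples, max_question_words=60, max_num_options=10):
    """samples: list of dicts with 'question' (str) and 'options' (list of str)."""
    seen_questions, kept = set(), []
    for sample in samples:
        question, options = sample["question"].strip(), sample["options"]
        # Excessively long questions or an unusual number of options (thresholds assumed).
        if len(question.split()) > max_question_words or not 2 <= len(options) <= max_num_options:
            continue
        # Duplicate questions.
        if question in seen_questions:
            continue
        # Error 2: inconsistent formats between consecutive options,
        # e.g. some numbered ("1. ...") and others bare.
        numbered = [bool(re.match(r"^\d+\s*[.)]", option)) for option in options]
        if len(set(numbered)) > 1:
            continue
        # Error 1 (options not matching the question content) needs manual/LLM review.
        seen_questions.add(question)
        kept.append(sample)
    return kept
```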
The baselines P1 and P2 utilize prompting templates that are also used for data generation, as shown in Tables 7 and 8, respectively. The prompting templates for P3, P1+P3, P2+P3 are presented in Table 10, 11, and 12.
Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question. Here are some answered questions, which can reflect your value system: Question: {Question1} Options: {Options1} Answer: {Answer1} Question: {Question2} Options: {Options2} Answer: {Answer2} Question: {Question3} Options: {Options3} Answer: {Answer3} Question: {Question4} Options: {Options4} Answer: {Answer4} Question: {Question5} Options: {Options5} Answer: {Answer5} Below are the #Question and #Options. Please return the number of the selected option only. #Question: {Question} #Options: {Options} #Answer:
System Prompt: You are a real person with a/an {Culture} cultural background. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
System Prompt: You are a real person with a/an {Culture} cultural background. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Before you respond, take a moment to think about how {Culture} culture is similar to {Culture1}, {Culture2}, and {Culture3} cultures, and how {Culture} culture is different from {Culture4}, {Culture5}, and {Culture6} cultures. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
Figure 6 illustrates the distribution of topics and cultures in training data for CultureSPA (CCT).
We designed several variations of the Generating Diverse Culture-Related Questions step (§4.1) to explore the effects of data quality and quantity on LLMs’ cultural alignment and general capabilities: (1) All (60K): This corresponds to the basic setting for generating SFT data for CultureSPA, as introduced in Section 5.1; (2) One (60K): We use only one question from each topic as a seed while maintaining the same final data volume, which is expected to yield lower data quality; (3) All (240K): This uses all seed questions but generates quadruple the data volume. We assess LLMs’ knowledge levels and their mathematical and instruction-following abilities using MMLU[46], GSM8K[47], and IFEval[48].
Q_id | Topic | Question | Option | Status |
---|---|---|---|---|
Q0 | Social Values, Attitudes & Stereotypes & Political Regimes | When encountering someone from a different cultural background, how willing are you to try to learn about and understand their customs and traditions? | 1.Very willing 2.Somewhat willing 3.Not very willing 4.Not at all willing | ✓ |
Q1001 | Happiness and Well-being | When you think about the things that bring you joy and fulfillment, how often do you prioritize these aspects of your life over more practical considerations, such as work or financial security? | 1.Almost never 2.Rarely 3.Sometimes 4.Often 5.Almost always | ✓ |
Q2000 | Social Capital, Trust & Organizational Membership | How often do you trust that the decisions made by the organizations you are a member of align with your own values and goals? | 1.Always 2.Mostly 3.Sometimes 4.Rarely 5.Never | ✓ |
Q3003 | Economic Values | When considering the benefits and drawbacks of technological advancements in the workplace, how important is it to you that these changes lead to increased income inequality? | 1.Not important at all 2.Somewhat unimportant 3.Neutral 4.Somewhat important 5.Very important 6.Extremely important | ✓ |
Q4001 | Corruption | When dealing with public services, to what extent do you agree with the idea that it’s common for officials to use their position for personal gain, on a scale from 1 (strongly disagree) to 5 (strongly agree)? | 1,2,3,4,5 | ✓ |
Q5000 | Migration | Should governments prioritize the integration of migrant workers into the local culture and society, or prioritize their ability to maintain their own cultural identity? | 1.The former 2.The latter 3.Both equally important | ✓ |
Q6000 | Security | To what extent do you agree with the statement: ’The government should invest more in cybersecurity to protect citizens’ personal data and online security’? | 1.Strongly agree 2.Somewhat agree 3.Neither agree nor disagree 4.Somewhat disagree 5.Strongly disagree | ✓ |
Q9000 | Religious Values | When faced with moral dilemmas, do you primarily rely on your own moral compass, religious teachings, or the values and beliefs of your community? | 1.My own moral compass 2.Religious teachings 3.Values and beliefs of my community | ✓ |
Q10001 | Ethical Values and Norms | Do you think that individuals have a moral obligation to reduce their carbon footprint, even if it means significant changes to their lifestyle, or not? | 1.Strongly disagree 2.Somewhat disagree 3.Neither agree nor disagree 4.Somewhat agree 5.Strongly agree | ✓ |
Q11000 | Political Interest & Political Participation | How satisfied are you with the opportunities available for citizens to participate in the political decision-making process in your country? | 1.Very satisfied 2.Fairly satisfied 3.Not very satisfied 4.Not at all satisfied | ✓ |
Q12362 | Ethical Values and Norms & Political Regimes | How much do you think people should be able to hold public officials accountable for their actions? | 1 - Not at all important 2 3 4 5 - Very important 6 - Extremely important | X (error 2) |
Q10000 | Ethical Values and Norms & Political Regimes | Do you think that companies prioritizing profits over social responsibility can always be justified? | 1,2,3,4,5,6,7,8,9,10 | X (error 1) |
In this paper, we use the World Values Survey to study the cultural alignment of LLMs. Our use of this data complies with established protocols and is consistent with its intended purpose. While our experimental results reveal that LLMs exhibit imbalanced biases across various cultures, our goal is to mitigate these biases and promote the pluralistic culture alignment of LLMs.
1 (1) Social Values, Attitudes, and Stereotypes, (2) Happiness and Well-being, (3) Social Capital, Trust, and Organizational Membership, (4) Economic Values, (5) Corruption, (6) Migration, (7) Security, (8) Post-materialist Index, (9) Science and Technology, (10) Religious Values, (11) Ethical Values and Norms, (12) Political Interest and Participation, and (13) Political Culture and Regimes.
2 (1) America: USA (American), CAN (Canadian), BOL (Bolivian), BRA (Brazilian); (2) Europe: GBR (British), NLD (Dutch), DEU (German), UKR (Ukrainian); (3) Asia: CHN (Chinese), RUS (Russian), IND (Indian), THA (Thai); (4) Africa: KEN (Kenyan), NGA (Nigerian), ETH (Ethiopian), ZWE (Zimbabwean); (5) Oceania: AUS (Australian), NZL (New Zealand).
3 https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
4 https://github.com/hiyouga/LLaMA-Factory
5 Our preliminary experimental results support this. For example, probing LLaMA3 in Chinese yields poorer alignment results compared to English, even for Chinese culture. This is likely due to LLaMA3’s lower proficiency in Chinese rather than a lack of understanding of Chinese culture.