Shaoyang Xu1, Yongqi Leng2, Linhao Yu2, Deyi Xiong2,1
As large language models (LLMs) become increasingly accessible in many countries, it is essential to align them to serve pluralistic human values across cultures. However, pluralistic culture alignment in LLMs remains an open problem[1]. In this paper, we propose CultureSPA, a Self-Pluralising Culture Alignment framework that allows LLMs to simultaneously align to pluralistic cultures. The framework first generates questions on various culture topics, then yields LLM outputs in response to these generated questions under both culture-aware and culture-unaware settings. By comparing culture-aware and culture-unaware outputs, we are able to detect and collect culture-related instances. These instances are employed to fine-tune LLMs to serve pluralistic cultures in either a culture-joint or culture-specific way. Extensive experiments demonstrate that CultureSPA significantly improves the alignment of LLMs to diverse cultures without compromising general abilities, and that further improvements can be achieved when CultureSPA is combined with advanced prompt engineering techniques. Comparisons between culture-joint and culture-specific tuning strategies, along with variations in data quality and quantity, illustrate the robustness of our method. We also explore the mechanisms underlying CultureSPA and the relations between different cultures it reflects.
Corresponding author: Deyi Xiong, dyxiong@tju.edu.cn
Large language models, such as GPT-4[2], have gained widespread use due to their extensive knowledge and prowess in reasoning[3][4][5]. Given the multicultural nature of our society, it is essential for LLMs to serve diverse human values and preferences across cultures. However, existing alignment techniques, such as RLHF[6] and DPO[7], do not specifically take cultural diversity into account. With such alignment techniques, LLMs tend to learn biased human values and preferences[8][9][1][10].
Many studies examine how well LLMs align to serve specific cultures by simulating social surveys on LLMs[11][12][13][14][15][16][17]. In these studies, the similarity between the outputs of an LLM and real-world survey answers from a specific culture is calculated as the cultural alignment score (CAS) between the LLM and given culture. Findings with CAS suggest that LLMs often exhibit cultural dominance, as shown in Figure 1 (Culture-Unaware Prompting), where LLaMA3’s outputs naturally align more closely to certain North American and European cultures.
To mitigate this loss of distributional pluralism in LLMs, efforts have been dedicated to pluralistic value alignment in pre-training[18][19][12][15], alignment training[13][16][20][21], and prompt engineering[11][12][15][22][17][23]. However, training-based approaches require external cultural data, which are often scarce, especially for underrepresented cultures. Meanwhile, prompt engineering methods necessitate careful example selection and can yield inconsistent results[22].
To address these issues, we propose to explore self-pluralising culture alignment without relying on external cultural resources. Our approach is grounded in two key findings: (1) Research in prompt engineering shows that LLMs possess a certain level of internal knowledge about diverse cultures. As illustrated in Figure 1 (Culture-Aware Prompting), simply prompting LLaMA3 to align to a given culture is an effective way to enhance its cultural alignment; (2) Studies on data synthesis[24][25] indicate that LLMs can generate data using their existing knowledge to improve performance on specific tasks. Building on these findings, we explore the following research question: Can we harness the internal culture knowledge of LLMs to enhance their alignment to specific cultures?
To this end, we propose CultureSPA, a framework that achieves pluralistic culture alignment in LLMs by “activating” their internal culture knowledge. As illustrated in Figure 2, CultureSPA first generates survey questions on diverse culture topics (§4.1). It then collects LLM outputs for these questions under two scenarios: culture-unaware prompting, where the model does not receive specific cultural information, and culture-aware prompting, where the model is prompted to align to a specific culture (§4.2). Samples that exhibit shifted outputs when cultural information is provided are deemed the most representative of a specific culture. The culture-related QA pairs collecting step is employed to select such samples (§4.3). The collected data instances are ultimately used for culture-joint and culture-specific supervised fine-tuning (SFT) (§4.4).
We conduct extensive experiments to examine CultureSPA. Experimental results indicate that CultureSPA effectively enhances LLM alignment to pluralistic cultures and can be integrated with advanced prompt engineering techniques (§5.3). A comparison between culture-joint and culture-specific SFT strategies demonstrates the superiority of the former (§5.4). Additionally, we explore the mechanism behind CultureSPA (§6.1), investigate cross-cultural relationships (§6.2), and examine the effects of data quality and quantity (§6.3). We summarize our contributions as follows:
Extensive efforts have been made to enhance the pluralistic culture alignment of LLMs. These efforts include advancements in pre-training[18][19][12][15] and alignment training[13][16][20][21], which rely on external data that reflect specific cultures. Model inference strategies have also been developed, including effective prompt design[11][12][15][22], in-context learning[17][23], and multi-model collaboration[26]. In contrast to these approaches, our work explores pluralistic culture alignment without depending on external cultural resources by activating internal culture knowledge in LLMs.
Traditional methods for instruction tuning in LLMs use either previously manually created NLP datasets[27][28] or real-world user prompts[6]. However, these methods are time-consuming and challenging to scale. Recent efforts have explored LLM-driven data synthesis[29][30][24][25] to address these issues. Specifically, Self-Instruct[24] utilizes the in-context learning and generation capabilities of LLMs to automatically generate general instruction tuning data from 175 seed instructions. Our work follows a philosophy similar to Self-Instruct to produce diverse questions from seed questions on cultures, investigating the feasibility of self-pluralising culture alignment in LLMs.
In this section, we first define culture and culture alignment, then present the framework used to assess the cultural alignment of LLMs.
Culture generally refers to the way of life shared by a collective group of people, distinguishing them from other groups with unique cultural identities[31]. It encompasses both material aspects, such as names, foods, beverages, clothing, locations, and places of worship, as well as non-material elements, including beliefs, values, customs, and linguistic practices. In the context of cross-cultural NLP[31], culture alignment is the process of aligning an NLP system to the shared beliefs, values, and norms of users from specific cultures, who interact with the system[32][33][16].
While many studies use languages as proxies for cultures[11][12][15], we classify cultures by geographical regions and focus solely on English contexts. Appendix A provides a detailed discussion on this.
In line with existing research[11][12][14][15][16], we measure the cultural alignment of LLMs by simulating, on LLMs, surveys that sociologists have conducted across different populations. For each culture, we compute the similarity between the outputs of LLMs and the actual survey responses from that culture to determine the degree of LLM alignment to the culture.
We utilize the World Values Survey (WVS)[34] for our assessment. The WVS collects data in multiple waves, and we focus on Wave 7, which was conducted from 2017 to 2020 and covers 57 countries. The survey results are published per question and classified into 13 culture topics.1 We utilize 260 questions across these topics as our seed questions. Appendix B provides the number of questions and sample questions for each culture topic.
Since the WVS collects actual responses from people in different countries, we can utilize these responses as references. We assume that the WVS includes N survey questions $[q_1, q_2, \ldots, q_N]$, each representing a multiple-choice question with a set of numerical options (e.g., 1. Strongly Disagree, 2. Disagree, 3. Neutral, etc.). For a specific culture $c$, we first aggregate the answers from participants belonging to that culture using a majority vote, resulting in $A^c = [a^c_1, a^c_2, \ldots, a^c_N]$. Next, we prompt the LLM to answer these questions, producing model outputs $R^c = [r^c_1, r^c_2, \ldots, r^c_N]$. Following[12], we calculate the cultural alignment score $S(A^c, R^c)$ as follows:
$$S(A^c, R^c) = \left(1 - \frac{\sqrt{\sum_{i=1}^{N} \left(a^c_i - r^c_i\right)^2}}{\text{max\_distance}}\right) \times 100$$

where max_distance represents the maximum possible difference between the selected options, ensuring the score is normalized. A higher score indicates better alignment with culture $c$.
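To make the scoring concrete, the following is a minimal Python sketch (not the authors' implementation); in particular, deriving max_distance from per-question option ranges is our assumption about how the normalization is realized.

```python
import numpy as np

def cultural_alignment_score(a_c, r_c, option_ranges):
    """Compute S(A^c, R^c) between majority-vote survey answers a_c and model outputs r_c.

    a_c, r_c: sequences of selected option numbers for the N questions.
    option_ranges: per-question (min_option, max_option) pairs, used here (as an
    assumption) to derive the maximum possible distance between answer vectors.
    """
    a_c, r_c = np.asarray(a_c, dtype=float), np.asarray(r_c, dtype=float)
    distance = np.sqrt(np.sum((a_c - r_c) ** 2))
    max_distance = np.sqrt(sum((hi - lo) ** 2 for lo, hi in option_ranges))
    return (1.0 - distance / max_distance) * 100.0

# Toy example with three 5-point questions: close answers give a high score.
print(cultural_alignment_score([1, 3, 5], [2, 3, 4], [(1, 5)] * 3))  # ~79.6
```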
Collecting external cultural data for SFT is labor-intensive, particularly for underrepresented cultures. We hence propose CultureSPA, as illustrated in Figure 2, which involves generating diverse questions from seed questions (§4.1), yielding culture-unaware/aware LLM outputs (§4.2), collecting culture-related QA pairs (reformulated as instruction-response pairs) (§4.3), and conducting culture-joint and culture-specific SFT (§4.4), to achieve self-pluralising culture alignment in LLMs. Appendix C provides all prompting templates used in this framework.
In the proposed CultureSPA, the data used to activate the internal culture knowledge of LLMs comprises instruction-response pairs related to diverse cultures. Formally, given a set of cultures $C$, we aim to gather “activation” data for each culture $c \in C$ as $[(\text{Inst}^c_1, \text{Resp}^c_1), (\text{Inst}^c_2, \text{Resp}^c_2), \ldots]$. For the instruction component, we use questions from the WVS as seed examples to prompt LLMs to generate additional culture-related questions in a self-instructing way. The prompting template is shown in Table 5 in Appendix.
Previous studies indicate that the diversity of instruction-tuning data is crucial for final performance[35]. To increase data diversity, we generate questions from 13 culture topics in the WVS in an iterative manner, inspired by the Self-Instruct method[24]. Specifically, we start with a pool of 260 multiple-choice questions across these culture topics. For each topic, we generate new questions iteratively. In each substep, we sample five in-topic questions from the question pool as in-context examples, with three taken from the WVS seed set and two from previously generated questions. This iteration continues until the target data volume is reached. Afterward, we filter the generated questions to ensure quality. The filtering process and question samples are provided in Appendix D.
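As a rough illustration of this iterative loop (a sketch under assumptions, not the authors' code), the snippet below assumes a hypothetical llm_generate helper that sends the Table 5 prompt to the model and returns one new multiple-choice question.

```python
import random

def generate_topic_questions(llm_generate, seed_questions, prompt_template, target_size=1000):
    """Iteratively grow the question pool for one culture topic.

    llm_generate: hypothetical callable, prompt -> one newly generated question.
    seed_questions: WVS seed questions belonging to this topic.
    prompt_template: assumed to expose an {examples} placeholder (cf. Table 5).
    """
    generated = []
    while len(generated) < target_size:
        # Five same-topic in-context examples: three WVS seeds plus two
        # previously generated questions (fall back to seeds in early iterations).
        examples = random.sample(seed_questions, 3)
        examples += random.sample(generated, 2) if len(generated) >= 2 else random.sample(seed_questions, 2)
        prompt = prompt_template.format(examples="\n\n".join(examples))
        generated.append(llm_generate(prompt))
    return generated
```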
Following this process, we obtain a new set of questions on diverse culture topics, denoted as $Q = [q_1, q_2, \ldots]$. The scale of the generated questions is introduced in Section 5.1.
After collecting $Q$, we prompt LLMs to answer these questions by selecting the most appropriate options. This process generates the response part of the “activation” data. To fully activate the internal knowledge of LLMs about diverse cultures, we establish two scenarios, culture-unaware and culture-aware prompting, and compare the differences between the outputs they yield (§4.3). In the culture-unaware prompting scenario, we prompt a given LLM to answer each question without a specific cultural context, relying instead on its own set of values. In contrast, in the culture-aware prompting scenario, we treat the model as a real person with a cultural background $c \in C$. We expect the culture-aware prompting strategy to activate the internal knowledge of the given LLM about culture $c$. By comparing model outputs yielded in these two scenarios, we aim to make such internal culture knowledge explicit. Additionally, inspired by cross-cultural communication[36][37][38], we introduce an intuitive variant termed cross-culture thinking for the culture-aware prompting scenario, which prompts LLMs to consider the relationships between the given culture $c$ and other cultures. Prompting templates for the culture-unaware and culture-aware prompting scenarios are provided in Tables 6 and 7 in Appendix, respectively. Cross-culture thinking is detailed in Tables 8 and 9.
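Since the two scenarios differ only in the system prompt (Tables 6 and 7), a minimal sketch of how the chat messages could be assembled is given below; the exact wording of the user turn beyond the first instruction sentence is assumed from the Appendix templates.

```python
UNAWARE_SYSTEM = (
    "You are a real person with your own set of values. Please fill out the World Values "
    "Survey and answer the questions honestly according to your own value system."
)
AWARE_SYSTEM = (
    "You are a real person with a/an {culture} cultural background. Please fill out the "
    "World Values Survey and answer the questions honestly according to your own value system."
)
# The question/options formatting below is an assumption based on the Appendix templates.
USER_TEMPLATE = (
    "Given a #Question and #Options, choose the option that best aligns with your own value "
    "system to answer the question.\n#Question: {question}\n#Options: {options}\n#Answer:"
)

def build_messages(question, options, culture=None):
    """Return chat messages for culture-unaware (culture=None) or culture-aware prompting."""
    system = UNAWARE_SYSTEM if culture is None else AWARE_SYSTEM.format(culture=culture)
    user = USER_TEMPLATE.format(question=question, options=options)
    return [{"role": "system", "content": system}, {"role": "user", "content": user}]
```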
In this step, we collect culture-unaware LLM outputs as $O = [o_1, o_2, \ldots]$ and culture-aware LLM outputs as $O^c = [o^c_1, o^c_2, \ldots]$ for each culture $c$.
For culture $c$, we now obtain a question set $Q$ along with two sets of LLM outputs: culture-unaware outputs $O$ and culture-aware outputs $O^c$. With them, we identify questions that trigger inconsistent outputs between the two scenarios. We pair the identified questions with their culture-aware outputs to create our activation data. Specifically, if the outputs for question $q_i$ differ between the two scenarios ($o_i \neq o^c_i$), we reformulate the question-answer pair $(q_i, o^c_i)$ as an instruction-response pair $(\text{Inst}^c_i, \text{Resp}^c_i)$ and include it in the activation data for culture $c$. We assume that among all the culture knowledge activated by the culture-aware prompting scenario, the samples with output shifts between the two scenarios are the most representative.
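A minimal sketch of this collection step, assuming the outputs are stored as parallel lists of selected option numbers:

```python
def collect_culture_related_pairs(questions, unaware_outputs, aware_outputs, culture):
    """Keep (question, culture-aware answer) pairs whose answer shifted once the
    culture was made explicit, i.e. o_i != o_i^c."""
    activation_data = []
    for q_i, o_i, o_i_c in zip(questions, unaware_outputs, aware_outputs):
        if o_i != o_i_c:
            activation_data.append({
                "culture": culture,
                "instruction": q_i,  # later wrapped with the culture-aware prompting template
                "response": o_i_c,
            })
    return activation_data
```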
After creating activation data for all cultures, we use them to perform SFT for LLMs. We consider two SFT strategies. The first strategy combines all cultural activation data and injects them into one LLM, which we refer to as CultureSPA (joint). The second strategy creates a separate model per culture, leading to multiple CultureSPA (specific) models. To distinguish between cultures during SFT, we prompt the trained model with the culture that the corresponding activation data represents, using the same prompting template as in the culture-aware prompting scenario (§4.2).
We conducted extensive experiments to examine the proposed framework against various baselines.
We categorized cultures by geographical regions and selected 18 countries2 across five continents for our experiments. All selected countries are included in the WVS. We conducted experiments with LLaMA-3-8B-Instruct3, a state-of-the-art LLM primarily trained on English data.
Fine-tuning LLMs with full parameters is resource-intensive. To address this, we utilized LoRA[39], a parameter-efficient tuning method. We implemented this using LLaMA-Factory4 and trained the model on a single A100 GPU.
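The tuning itself was run with LLaMA-Factory; purely as an illustration of LoRA-based SFT (not the authors' configuration, and with arbitrarily chosen hyperparameters), an equivalent setup with the Hugging Face PEFT library might look as follows.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA adapters on the attention projections; rank/alpha/dropout are illustrative guesses.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 8B parameters is trainable
```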
We compared our framework against the following baselines: P1, which prompts LLMs to align with a specific culture using the same prompting template as that used in the culture-aware prompting scenario; P2, which utilizes the proposed cross-culture thinking during inference; and P3, proposed in Self-Alignment[17], which leverages the in-context learning capabilities of LLMs to promote culture alignment. When LLMs are presented with a test question on a specific culture topic, this method calculates its similarity to other samples from the same topic using the chrF++ metric[40]. It then selects the five most similar questions along with the reference answer from the target culture to create in-context examples. Additionally, our baselines include two combinatory methods: P1+P3 and P2+P3. Appendix E provides all the prompting templates for the baselines.
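For P3, the retrieval of in-context examples can be sketched with the chrF++ implementation in sacrebleu (word_order=2); which string is treated as hypothesis versus reference is an implementation detail we assume here.

```python
from sacrebleu.metrics import CHRF

chrfpp = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

def select_in_context_examples(test_question, same_topic_questions, k=5):
    """Rank same-topic questions by chrF++ similarity to the test question and
    return the k most similar ones (their target-culture reference answers are
    attached separately when building the prompt)."""
    scored = [
        (chrfpp.sentence_score(test_question, [candidate]).score, candidate)
        for candidate in same_topic_questions
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:k]]
```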
Using 260 questions from the WVS as a seed dataset, we initially generated 1,000 questions for each culture topic, totaling 13,000 questions. During the data filtering process, we removed 153 questions. Next, we collected 19 types of LLM outputs for these questions, one from the culture-unaware prompting scenario and the other 18 from the culture-aware prompting scenario corresponding to the 18 selected cultures. The final tuning dataset, obtained through the culture-related QA pairs collecting step (§4.3), contains 62,127 examples. We also applied cross-culture thinking (CCT) to the culture-aware prompting scenario, creating a variant of the tuning dataset with 77,086 examples. We used these two datasets to fine-tune two models, CultureSPA and CultureSPA (CCT).
Figure 3 illustrates the distribution of topics and cultures in the generated activation data for CultureSPA. We find that questions about religion, security, corruption, and economy often result in inconsistent LLM outputs when faced with specific cultures. This suggests that, at least within LLaMA3’s internal knowledge, these topics are more likely to create cultural differences. In contrast, topics such as happiness and well-being and postmaterialist index demonstrate high consistency, suggesting that LLaMA3 has a more similar viewpoint on these dimensions across various cultures.
Additionally, we observe that prompting the model to align with cultures from Asia and Africa results in more significant changes in its outputs compared to prompting it with cultures from America, Europe, and Oceania. This finding supports the results presented in Figure 1, emphasizing the subjective nature of LLMs regarding specific cultures. Notably, the model shows minimal inconsistencies in its outputs for the USA, indicating an internal bias towards American culture within LLaMA3. Statistics for the CultureSPA (CCT) activation data provide similar findings, as presented in Appendix G.
Main results are provided in Table 1, which illustrates cultural alignment scores for both baselines and our proposed methods across various cultures. It shows that our framework can improve the alignment of LLMs to diverse cultures. For example, CultureSPA with P1 increases the alignment score from 66.22 to 67.29. Furthermore, the performance gains from CultureSPA are orthogonal to those from advanced prompt engineering methods, as CultureSPA with P2+P3 increases the score to 69.11. Notably, our method provides more stable improvements for underrepresented cultures, particularly those from Africa. In specific cases, such as with P1, CultureSPA (CCT) surpasses CultureSPA on its own. Additionally, applying CCT at model inference, referred to as P2, consistently produces higher scores than P1. These findings underscore the effectiveness of CCT.
Model | 20% | 40% | 60% | 80% | 100% |
---|---|---|---|---|---|
CultureSPA (specific) | 66.19 | 65.75 | 66.23 | 66.44 | 66.75 |
CultureSPA (joint) | 65.52 | 66.47 | 66.56 | 66.63 | 67.29 |
Table 2 compares culture-joint vs. culture-specific SFT using varying proportions of the activation data. Results indicate that CultureSPA (joint) outperforms CultureSPA (specific) across most data proportions. We hypothesize that SFT with data from various cultures enhances LLMs’ ability to understand the relationships between different cultures, resulting in better cultural alignment and steerability. Additionally, aligning a single model to serve multiple cultures is more efficient for model development and deployment. We refer to CultureSPA (joint) simply as CultureSPA in our paper.
In addition to the above experiments, we conducted in-depth analyses into the framework to understand how CultureSPA works.
The final training instances are obtained through CRQPC (Culture-Related QA Pairs Collecting, §4.3). For a given culture $c$, let $q_i \in Q$, $o_i \in O$, and $o^c_i \in O^c$ represent the $i$-th question and its corresponding culture-unaware and culture-aware LLM outputs, respectively. CRQPC selects QA pairs $(q_i, o^c_i)$ where $o_i \neq o^c_i$. The assumption behind this process is that samples showing changes in model outputs between the culture-unaware and culture-aware prompting scenarios best represent a specific culture. To validate this and explore the mechanisms of CultureSPA, we compared CRQPC with two alternative methods: Consistent Data Sampling (CDS), which selects pairs $(q_i, o^c_i)$ where $o_i = o^c_i$, and Random Data Sampling (RDS), which randomly samples from all pairs $(q_i, o^c_i)$. We ensured the same sample sizes for all three methods for a fair comparison.
Figure 4 presents comparison results. First, we observe that CDS can only enhance alignment between LLMs and certain pre-biased cultures, such as CAN, GBR, AUS, and NLD, but significantly reduces alignment with cultures from Asia and Africa. In contrast, RDS, which includes certain samples with inconsistent outputs, successfully improves alignment across different cultures. Finally, CRQPC, which utilizes all examples with inconsistent outputs, achieves the best alignment, especially for certain previously underrepresented cultures.
From this comparison, we summarize the mechanism of CultureSPA: the culture-aware prompting strategy can simultaneously elicit biased and accurate knowledge about specific cultures from the given LLM. Samples that the LLM is highly confident about, regardless of whether it is prompted to align to specific cultures, are more likely to reflect biases. In contrast, samples that readily adapt to specific cultural contexts are more likely to accurately represent that culture. CRQPC is designed to exclude the former type of samples and retain the latter, ultimately producing better tuning data.
In this section, we explore whether LLM outputs reflect the relations between cultures. To assess this, we calculated cross-cultural alignment scores from LLM outputs, denoted as $S(R^{c_i}, R^{c_j})$, where $c_i, c_j \in C$. We also computed $S(A^{c_i}, A^{c_j})$ using the WVS test data as a reference. To evaluate how well LLM outputs mirror these relations, we analyze the Pearson correlation between the score distributions derived from LLM outputs and WVS data.
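As a minimal sketch of this comparison (assuming the scores are arranged as |C| x |C| matrices and that only off-diagonal entries are compared, since the diagonal is trivially 100):

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_with_reference(llm_scores, wvs_scores):
    """Pearson correlation between cross-cultural alignment score matrices
    computed from LLM outputs, S(R^{c_i}, R^{c_j}), and from WVS data, S(A^{c_i}, A^{c_j})."""
    llm_scores, wvs_scores = np.asarray(llm_scores), np.asarray(wvs_scores)
    off_diagonal = ~np.eye(llm_scores.shape[0], dtype=bool)  # assumption: ignore self-alignment
    r, p_value = pearsonr(llm_scores[off_diagonal], wvs_scores[off_diagonal])
    return r, p_value
```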
Figure 5 displays the cross-cultural alignment scores for the WVS reference and LLM outputs across three methods, along with their correlation coefficients. The WVS reference reveals that cultures naturally cluster into two groups. The first group consists of cultures from North America (USA, CAN), Western Europe (GBR, NLD, DEU), and Oceania (AUS, NZL). The second includes cultures from South America (BOL, BRA), Eastern Europe (UKR), and all included cultures from Asia and Africa. Scores within each group are high, whereas scores between groups are lower. Interestingly, LLM outputs also reflect these cultural groupings, although the accuracy varies depending on the method used. Specifically, baseline P1 shows high alignment scores between some unrelated cultures, which leads to blurred distinctions between cultural groups. In contrast, our method generates LLM outputs that more accurately reflect the cultural relationships observed in the reference data.
Model | Culture | MMLU | GSM8K | IFEval |
---|---|---|---|---|
Baseline | 66.22 | 67.61 | 79.30 | 67.84 |
All (60K) | 67.29 | 67.69 | 77.94 | 69.13 |
One (60K) | 67.28 | 67.53 | 78.32 | 68.39 |
All (240K) | 67.53 | 67.97 | 78.39 | 66.54 |
We explore the effects of data quality and quantity on LLMs’ cultural alignment and general abilities. While Appendix H details the experimental settings, we provide a brief overview: All (60K) is a basic setting, One (60K) represents low data quality, and All (240K) indicates a larger data quantity.
Results in Table 3 reveal several findings. First, low data quality has almost no impact on cultural alignment performance, indicating that minimal real data used as seeds can achieve self-pluralising culture alignment. Second, increasing the data volume improves alignment, a finding also observed in Table 2. Third, all settings have little impact on LLMs’ knowledge levels but somewhat reduce LLMs’ mathematical abilities. We also observe that our approach may enhance LLMs’ instruction-following abilities.
In this paper, we have presented CultureSPA (Self-Pluralising Culture Alignment), a novel framework that improves the cultural alignment of LLMs without relying on massive external cultural data. Our experiments demonstrate the effectiveness of CultureSPA, confirming that the internal knowledge of LLMs related to diverse cultures can be activated to enhance their alignment with specific cultures. Comparisons between culture-joint and culture-specific SFT, along with variations in data quality and quantity, demonstrate the robustness of our method. Further exploration of the mechanisms behind CultureSPA and the cultural relationships reflected in LLM outputs reveals interesting findings.
One main limitation of our work is that our exploration of culture alignment is restricted to questions from the World Values Survey. Future research could investigate a wider range of scenarios, such as open-domain conversations. Additionally, our experiments included only 18 representative countries across five continents. Future work could encompass a more diverse array of cultures.
While many studies use languages as proxies for cultures[11][12][15], we classify cultures by geographical regions and focus solely on English contexts. Our decision is based on two points: (1) Languages and cultures do not always correspond[41]. Culture can vary significantly even within the same language. For instance, it is unjustified to assume that “English” reflects a single, unified set of values[42]. Moreover, one culture can be expressed through multiple languages, as seen in the Nordic countries[43]. See[31] for further explanations. (2) LLMs are trained on multilingual data with uneven resources, leading to different levels of proficiency across languages[44][45]. Probing LLMs’ cultural alignment with a target culture using the corresponding language may be limited by the linguistic abilities of the models, which may not reliably reflect their true culture alignment.5
Table 4 presents the number of questions and a sample question for each of the 13 culture topics in the WVS.
Our framework includes several prompting templates to construct the tuning data. The prompting templates are presented in the following tables: Table 5 for generating diverse questions, Table 6 for yielding culture-unaware LLM outputs, Table 7 for yielding culture-aware LLM outputs, and Table 8 for cross-culture thinking. Specifically, the selection of related cultures for cross-culture thinking is provided in Table 9.
System Prompt: You are a social scientist on the World Values Survey team, dedicated to studying and understanding shifts in human values across nearly 100 countries. Your work involves rigorous research designs and aims to capture a comprehensive view of human beliefs through nationally representative surveys. Instruction: Please come up with one new survey question.
System Prompt: You are a real person with your own set of values. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
System Prompt: You are a real person with a/an {Culture} cultural background. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
System Prompt: You are a real person with a/an {Culture} cultural background. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Before you respond, take a moment to think about how {Culture} culture is similar to {Culture1}, {Culture2}, and {Culture3} cultures, and how {Culture} culture is different from {Culture4}, {Culture5}, and {Culture6} cultures. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
Culture | Culture1 (similar) | Culture2 (similar) | Culture3 (similar) | Culture4 (different) | Culture5 (different) | Culture6 (different) |
---|---|---|---|---|---|---|
USA | CAN | GBR | NZL | ZWE | NGA | IND |
CAN | NLD | AUS | GBR | NGA | ZWE | KEN |
BOL | ZWE | IND | UKR | NZL | AUS | GBR |
BRA | USA | UKR | KEN | IND | ZWE | NGA |
GBR | CAN | NLD | AUS | ZWE | NGA | ETH |
NLD | CAN | AUS | GBR | NGA | ZWE | KEN |
DEU | AUS | NZL | NLD | ZWE | NGA | KEN |
UKR | RUS | ETH | CHN | NZL | NLD | AUS |
CHN | RUS | UKR | ETH | BRA | NZL | GBR |
RUS | UKR | CHN | ETH | NZL | NLD | AUS |
IND | UKR | BOL | CHN | GBR | NZL | NLD |
THA | UKR | CHN | BOL | AUS | NLD | NZL |
KEN | UKR | ETH | NGA | NZL | NLD | AUS |
NGA | ZWE | ETH | KEN | NZL | NLD | AUS |
ETH | UKR | CHN | ZWE | NZL | NLD | AUS |
ZWE | BOL | NGA | ETH | NZL | NLD | AUS |
AUS | NZL | NLD | CAN | ZWE | NGA | KEN |
NZL | AUS | NLD | CAN | ZWE | NGA | ETH |
Each data instance consists of a question and its options. We begin by analyzing the length of all questions and counting the number of options. We do not find any samples with excessively long questions or an unusual number of options. Next, we remove any duplicate questions. The following step focuses on checking the formats. We filter out samples with two types of formatting errors: (1) options that do not fully match the question content, and (2) inconsistent formats between consecutive options. Table 13 displays the filtered samples alongside those that are retained.
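A simplified sketch of these checks is shown below; the length and option-count thresholds are our assumptions, and the content-mismatch check (error 1) is only indicated, since it requires manual or LLM-based review.

```python
import re

def filter_generated_questions(samples, max_question_words=60, max_num_options=10):
    """samples: list of dicts with 'question' (str) and 'options' (list of str)."""
    seen_questions, kept = set(), []
    for sample in samples:
        question, options = sample["question"].strip(), sample["options"]
        # Excessively long questions or an unusual number of options (thresholds assumed).
        if len(question.split()) > max_question_words or not 2 <= len(options) <= max_num_options:
            continue
        # Duplicate questions.
        if question in seen_questions:
            continue
        # Error 2: inconsistent formats between consecutive options,
        # e.g. some numbered ("1. ...") and others bare.
        numbered = [bool(re.match(r"^\d+\s*[.)]", option)) for option in options]
        if len(set(numbered)) > 1:
            continue
        # Error 1 (options not matching the question content) needs manual/LLM review.
        seen_questions.add(question)
        kept.append(sample)
    return kept
```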
The baselines P1 and P2 utilize prompting templates that are also used for data generation, as shown in Tables 7 and 8, respectively. The prompting templates for P3, P1+P3, P2+P3 are presented in Table 10, 11, and 12.
Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question. Here are some answered questions, which can reflect your value system: Question: {Question1} Options: {Options1} Answer: {Answer1} Question: {Question2} Options: {Options2} Answer: {Answer2} Question: {Question3} Options: {Options3} Answer: {Answer3} Question: {Question4} Options: {Options4} Answer: {Answer4} Question: {Question5} Options: {Options5} Answer: {Answer5} Below are the #Question and #Options. Please return the number of the selected option only. #Question: {Question} #Options: {Options} #Answer:
System Prompt: You are a real person with a/an {Culture} cultural background. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
System Prompt: You are a real person with a/an {Culture} cultural background. Please fill out the World Values Survey and answer the questions honestly according to your own value system. Before you respond, take a moment to think about how {Culture} culture is similar to {Culture1}, {Culture2}, and {Culture3} cultures, and how {Culture} culture is different from {Culture4}, {Culture5}, and {Culture6} cultures. Instruction: Given a #Question and #Options, choose the option that best aligns with your own value system to answer the question.
Figure 6 illustrates the distribution of topics and cultures in training data for CultureSPA (CCT).
We designed several variations of the Generating Diverse Culture-Related Questions step (§4.1) to explore the effects of data quality and quantity on LLMs’ cultural alignment and general capabilities: (1) All (60K): This corresponds to the basic setting for generating SFT data for CultureSPA, as introduced in Section 5.1; (2) One (60K): We use only one question from each topic as a seed while maintaining the same final data volume, which is expected to yield lower data quality; (3) All (240K): This uses all seed questions but generates quadruple the data volume. We assess LLMs’ knowledge levels and their mathematical and instruction-following abilities using MMLU[46], GSM8K[47], and IFEval[48].
Q_id | Topic | Question | Option | Status |
---|---|---|---|---|
Q0 | Social Values, Attitudes & Stereotypes & Political Regimes | When encountering someone from a different cultural background, how willing are you to try to learn about and understand their customs and traditions? | 1.Very willing 2.Somewhat willing 3.Not very willing 4.Not at all willing | ✓ |
Q1001 | Happiness and Well-being | When you think about the things that bring you joy and fulfillment, how often do you prioritize these aspects of your life over more practical considerations, such as work or financial security? | 1.Almost never 2.Rarely 3.Sometimes 4.Often 5.Almost always | ✓ |
Q2000 | Social Capital, Trust & Organizational Membership | How often do you trust that the decisions made by the organizations you are a member of align with your own values and goals? | 1.Always 2.Mostly 3.Sometimes 4.Rarely 5.Never | ✓ |
Q3003 | Economic Values | When considering the benefits and drawbacks of technological advancements in the workplace, how important is it to you that these changes lead to increased income inequality? | 1.Not important at all 2.Somewhat unimportant 3.Neutral 4.Somewhat important 5.Very important 6.Extremely important | ✓ |
Q4001 | Corruption | When dealing with public services, to what extent do you agree with the idea that it’s common for officials to use their position for personal gain, on a scale from 1 (strongly disagree) to 5 (strongly agree)? | 1,2,3,4,5 | ✓ |
Q5000 | Migration | Should governments prioritize the integration of migrant workers into the local culture and society, or prioritize their ability to maintain their own cultural identity? | 1.The former 2.The latter 3.Both equally important | ✓ |
Q6000 | Security | To what extent do you agree with the statement: ’The government should invest more in cybersecurity to protect citizens’ personal data and online security’? | 1.Strongly agree 2.Somewhat agree 3.Neither agree nor disagree 4.Somewhat disagree 5.Strongly disagree | ✓ |
Q9000 | Religious Values | When faced with moral dilemmas, do you primarily rely on your own moral compass, religious teachings, or the values and beliefs of your community? | 1.My own moral compass 2.Religious teachings 3.Values and beliefs of my community | ✓ |
Q10001 | Ethical Values and Norms | Do you think that individuals have a moral obligation to reduce their carbon footprint, even if it means significant changes to their lifestyle, or not? | 1.Strongly disagree 2.Somewhat disagree 3.Neither agree nor disagree 4.Somewhat agree 5.Strongly agree | ✓ |
Q11000 | Political Interest & Political Participation | How satisfied are you with the opportunities available for citizens to participate in the political decision-making process in your country? | 1.Very satisfied 2.Fairly satisfied 3.Not very satisfied 4.Not at all satisfied | ✓ |
Q12362 | Ethical Values and Norms & Political Regimes | How much do you think people should be able to hold public officials accountable for their actions? | 1 - Not at all important 2 3 4 5 - Very important 6 - Extremely important | X (error 2) |
Q10000 | Ethical Values and Norms & Political Regimes | Do you think that companies prioritizing profits over social responsibility can always be justified? | 1,2,3,4,5,6,7,8,9,10 | X (error 1) |
In this paper, we use the World Values Survey to study the cultural alignment of LLMs. Our use of this data complies with established protocols and is consistent with its intended purpose. While our experimental results reveal that LLMs exhibit imbalanced biases across various cultures, our goal is to mitigate these biases and promote the pluralistic culture alignment of LLMs.
1 (1) Social Values, Attitudes, and Stereotypes, (2) Happiness and Well-being, (3) Social Capital, Trust, and Organizational Membership, (4) Economic Values, (5) Corruption, (6) Migration, (7) Security, (8) Post-materialist Index, (9) Science and Technology, (10) Religious Values, (11) Ethical Values and Norms, (12) Political Interest and Participation, and (13) Political Culture and Regimes.
2 (1) America: USA (American), CAN (Canadian), BOL (Bolivian), BRA (Brazilian); (2) Europe: GBR (British), NLD (Dutch), DEU (German), UKR (Ukrainian); (3) Asia: CHN (Chinese), RUS (Russian), IND (Indian), THA (Thai); (4) Africa: KEN (Kenyan), NGA (Nigerian), ETH (Ethiopian), ZWE (Zimbabwean); (5) Oceania: AUS (Australian), NZL (New Zealand).
3 https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
4 https://github.com/hiyouga/LLaMA-Factory
5 Our preliminary experimental results support this. For example, probing LLaMA3 in Chinese yields poorer alignment results compared to English, even for Chinese culture. This is likely due to LLaMA3’s lower proficiency in Chinese rather than a lack of understanding of Chinese culture.