Technology | 24 May 2024

Technology Series: Minghui Hu


Minghui Hu is Associate Professor in the History Department and Affiliated Faculty in the History of Consciousness Department at UC Santa Cruz. Hu was a computer programmer at a child psychiatrist lab in a UCLA hospital while pursuing a Ph.D. in History. Upon completing his dissertation in the History of Science program at UCLA in 2004, he moved to the University of Chicago as an Andrew Mellon postdoctoral fellow. He joined the faculty of History at UCSC in 2005. Data mining and the field of digital humanities are the main focus of his future research. Hu’s first book is China’s Transition to Modernity: The New Classical Vision of Dai Zhen (Washington 2015; the Chinese translation is forthcoming in 2023). He also co-edited Cosmopolitanism in China, 1600-1950 (Cambria 2016) with Johan Elverskog. His forthcoming book is Waiting for the Barbarians: A History of Geopolitics in Early Modern China (Cambria Press, 2025). Hu is also the Principal Investigator for THI’s Humanities in the Age of Artificial Intelligence Research Cluster.

The Evolution and Impact of Machine Translation: A Digital Humanist Perspective

My recent work has looked at the evolution of machine translation: technologies like Google Translate and DeepL that use artificial intelligence to provide high-quality translations in multiple languages. I have paid particular attention to the pivotal role of large language models (LLMs) in shaping the current landscape. As the organizer of the THI research cluster on Humanities in the Age of Artificial Intelligence, I have been examining the field’s historical backdrop, the advancements of LLMs, their application in academic settings, and the ensuing challenges and opportunities.

The journey of machine translation is a fascinating narrative of technological evolution, marked by ambitious beginnings, setbacks, and groundbreaking advancements. This history provides a backdrop against which to appreciate the complexities and achievements of modern translation efforts.

Machine translation emerged in the early 1950s, fueled by the Cold War’s demands for rapid translation of Russian scientific documents. The initial approach was rule-based, focusing on directly substituting words between languages without much contextual understanding. Despite early optimism and funding, these methods were rudimentary at best.

The 1966 report of the Automatic Language Processing Advisory Committee (ALPAC) criticized the effectiveness of machine translation. It gained notoriety for its skepticism toward the research done in machine translation up to that point and for emphasizing the need for basic research in computational linguistics. The result was a dramatic reduction in U.S. government funding for the field and a broader loss of interest in it.

The 1980s saw a resurgence in machine translation research, driven by the advent of statistical methods. This approach used bilingual text corpora to learn translations, marking a shift towards data-driven methods. The mid-2010s witnessed the introduction of neural machine translation (NMT), which employs deep learning and artificial neural networks. This represented a significant leap forward, enabling more fluent and contextually accurate translations. More recently, the development of large language models (LLMs) such as GPT-3 and its successors has transformed the machine translation landscape. These models, trained on vast amounts of text data, have pushed the boundaries of translation, achieving unprecedented levels of fluency and comprehension. We are now at a point where machine translation is close enough to human performance to consider the future of translation.

A 2022 article from The Economist discussing “Cyborg Translation” posits a future where translation is not merely the product of humans or machines but a collaborative fusion of both. This idea resonates with my own recent translation experience. I was part of a team of academics who translated Wang Hui’s The Rise of Modern Chinese Thought, a seminal text exploring the philosophical underpinnings of modern China, aided in part by an initial machine translation draft. The project demanded fluency, an immersive understanding of the historical and philosophical context, and sensitivity to the subtleties embedded in Chinese academic language. The endeavor highlighted the shortcomings of present machine translation tools when they are confronted with texts that require an elevated level of cultural and contextual insight. Even with the advanced capabilities of large language models such as GPT-4 and the open-source models that can be run on Nautilus, these tools play a supporting role, offering preliminary drafts and aiding in the comprehension of complex syntactical structures. This personal translation experience underscores the “Cyborg Translation” philosophy, emphasizing the critical interplay of human intellect with machine proficiency.

My experience translating The Rise of Modern Chinese Thought illustrates the complex interplay between human expertise and machine translation technologies. As LLMs continue to evolve, the future of translation lies in leveraging these tools while recognizing the irreplaceable value of human insight and cultural understanding. This synergistic approach will undoubtedly shape the next frontier in the quest to transcend language barriers, offering new possibilities for global communication and understanding.

Translating academic texts using large language models (LLMs) involves a sophisticated and meticulous process designed to ensure accuracy and maintain the integrity of the original material. This process is particularly crucial when translating complex academic monographs from languages such as Chinese or Italian into English. The translation pipeline includes three key stages: entity mapping, sequential translation in context, and rewriting in academic language.

The first stage, entity mapping, involves identifying and translating named entities and specialized vocabulary, such as the names of people and places and technical terms. This step is essential for ensuring that these critical elements are correctly understood and consistently translated. In academic texts, these entities often carry significant meaning and importance. By carefully mapping these entities, LLMs can preserve the precise references and terminologies vital to the academic work.
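One way to picture entity mapping is as a glossary lookup that runs before any translation request is made. The following is a minimal sketch in Python, not a description of the tooling used in any particular project: the glossary contents, the example sentence, and the function names `find_entities` and `glossary_note` are all assumptions made for illustration.

```python
# Hypothetical glossary: source-language entities mapped to agreed English renderings.
GLOSSARY = {
    "戴震": "Dai Zhen",
    "汪晖": "Wang Hui",
    "考证": "evidential research",
}

def find_entities(text, glossary):
    """Return (source, rendering) pairs for glossary terms found in `text`,
    ordered by where each term first appears."""
    hits = [(text.find(term), term) for term in glossary if term in text]
    return [(term, glossary[term]) for _, term in sorted(hits)]

def glossary_note(text, glossary):
    """Format the matched entities as a note that could accompany a translation request."""
    return "\n".join(f"{src} -> {dst}" for src, dst in find_entities(text, glossary))

# Illustrative sentence, roughly: "Wang Hui discussed Dai Zhen's evidential research methods."
sentence = "汪晖讨论了戴震的考证方法。"
print(glossary_note(sentence, GLOSSARY))
```

The sketch uses plain substring matching, so it does not handle overlapping or ambiguous terms; in practice a human translator would still need to review and maintain the glossary.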

The second stage is sequential translation in context. This means translating the text sentence by sentence while considering the context of the entire document. Academic texts are typically dense and complex, with ideas and arguments developed over multiple sentences and paragraphs. Translating each sentence in isolation can result in a disjointed and incoherent translation. Instead, LLMs analyze the broader context to ensure that each sentence fits well with the sentences before and after it. This approach maintains the overall flow and coherence of the text, ensuring that the translated material accurately reflects the original structure and logic.
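The idea of translating sentence by sentence while carrying context forward can be sketched as a simple loop. This is a simplified illustration under stated assumptions: it passes only a sliding window of preceding sentences rather than the entire document, and `translate_in_context`, `echo_translate`, and the `window` parameter are hypothetical names. The placeholder function stands in for a real model call.

```python
def translate_in_context(sentences, translate, window=2):
    """Translate sentences in order, giving each call the preceding
    (source, translation) pairs as context."""
    translated = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window)
        # Pair up the most recent source sentences with their finished translations.
        context = list(zip(sentences[start:i], translated[start:i]))
        translated.append(translate(sentence, context))
    return translated

def echo_translate(sentence, context):
    """Placeholder for a model call; it only reports how much context it received."""
    return f"[{len(context)} prior] {sentence}"

result = translate_in_context(["S1", "S2", "S3", "S4"], echo_translate, window=2)
print(result)
```

Because earlier translations are fed back in alongside their source sentences, a real model would have the chance to keep terminology and pronoun references consistent from one sentence to the next.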

The final stage is rewriting in academic language. After the initial translation, the text is polished to match the formal and precise language typically used in academic writing. This step is crucial for ensuring the translated text sounds natural and professional in the target language. Academic writing follows specific conventions and styles; a direct translation does not always capture the same tone and clarity. By rewriting the text in an appropriate academic style, LLMs ensure that the translation is accurate and meets the expectations and standards of the target academic audience.

This meticulous approach to translation is essential for preserving the essence and nuance of the original academic texts. Translating complex ideas and arguments requires more than just a word-for-word substitution; it requires an in-depth understanding of the subject matter and the ability to convey it effectively in another language. By following a structured and detailed process, LLMs can produce high-quality translations that maintain the integrity and quality of the original material.

To achieve these objectives, it is essential to develop various translation engines utilizing LLMs and evaluate their strengths and weaknesses to enhance the “cyborg” academic translation process. Since 2023, I have led a research project to critically evaluate translation engines, specifically OpenAI and Nautilus, and examine their advantages and limitations. OpenAI, with its flagship model GPT-4, boasts massive training data and leadership in the AI field yet faces challenges related to cost and security. On the other hand, Nautilus offers cost-effectiveness, interoperability, and flexibility, albeit with potential limitations in development incentives. These contrasting profiles underscore the need for a balanced approach in selecting the appropriate engine for academic translation projects.

Conducting a comparative analysis of translation engines, particularly between the OpenAI platform and the Mistral model running on Nautilus, requires considering various factors that influence these tools’ performance, utility, and suitability for specific translation tasks. Several aspects of this comparison are critical. OpenAI’s platform, representing the forefront of artificial intelligence research, utilizes large language models (LLMs) trained on diverse and extensive datasets. This training enables the models to manage complex and nuanced academic language translations effectively. However, significant challenges are associated with using OpenAI’s advanced models. The cost can be prohibitive for individuals or smaller organizations, and the reliance on cloud-based services raises concerns about data security, especially for sensitive content. Furthermore, the platform may enforce rate limits, impeding the translation process for large-scale projects or high-demand users.
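Rate limits of this kind are commonly handled by retrying a request after a growing pause. The sketch below shows exponential backoff with a small random jitter. It is a generic illustration, not the API of any real service: `RateLimitError`, `call_with_backoff`, and `flaky_request` are hypothetical names, and the demonstration records the waits instead of actually sleeping.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical error a translation client might raise when a rate limit is hit."""

def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `request()`, retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller see the error.
            delay = base_delay * (2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))

# Demonstration with a stand-in request that fails twice before succeeding.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("slow down")
    return "translated text"

waits = []  # Record the waits instead of sleeping, so the demo runs instantly.
outcome = call_with_backoff(flaky_request, sleep=waits.append)
print(outcome, len(waits))
```

For a book-length translation split into thousands of requests, a wrapper like this keeps a long job running through temporary refusals, at the price of a slower overall run.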

Conversely, Nautilus provides an infrastructure that supports open-source models like Mistral or Llama 3, presenting a more cost-effective and flexible option for machine translation tasks. The choice between OpenAI and Nautilus with Mistral depends on specific project requirements and constraints. This analysis highlights a trade-off between the sophisticated capabilities of OpenAI’s offerings and the cost-effectiveness and adaptability of open-source models on Nautilus. Decision-makers should consider their specific translation needs, budget limitations, and data security requirements when choosing between these platforms.

Despite progress in machine translation, significant challenges persist, particularly when translating low-resource languages such as those of Sub-Saharan Africa and Southeast Asia. One of the primary difficulties lies in the high cost of training language models and the limited availability of comprehensive datasets for these languages. A recent Stanford report highlights the astronomical costs associated with training cutting-edge models. Additionally, the diminishing marginal benefits of training LLMs for low-resource languages complicate the cost-benefit analysis.

Several issues exacerbate these challenges. First, orthographic variations, where a single language may have multiple writing systems or inconsistent spelling, add complexity to the translation process. Second, many of these languages are predominantly oral and lack the extensive written records needed to train models. Third, the digital divide limits both the collection of data and the application of advanced technologies in these regions. Finally, colonial legacies play a role, as historical biases and inequalities have left some languages underrepresented and under-resourced.

Efforts to achieve digital justice through projects that leverage AI to preserve endangered languages underscore the broader implications of machine translation (MT) in safeguarding linguistic diversity. These initiatives aim to bridge the gap by developing resources and technologies that support the use and preservation of low-resource languages. Such efforts highlight the importance of inclusive and equitable approaches in advancing MT, ensuring that the benefits of technological progress extend to all linguistic communities.

Banner Image: IBM 701, similar to the one used in the Georgetown-IBM project to translate Russian documents.

The Humanities Institute’s 2024 Technology Series features contributions from a range of faculty and emeriti engaged in humanities scholarship at UC Santa Cruz. The statements, views, and data contained in these pieces belong to the individual contributors and draw on their academic expertise and insight. This series showcases the ways in which scholars from diverse disciplinary perspectives contend with the issues connected with our annual theme.