In an ever-evolving digital landscape, the preservation and relevance of cultural heritage face significant challenges. The MuseIT project aims to tackle these challenges by leveraging advanced technologies to ensure that digital artefacts remain significant and accessible. MuseIT is not just about preserving the past but also about making cultural heritage more inclusive and representative of our dynamic society.
Understanding Semantic Drift in Cultural Heritage
Language is not static; it evolves continuously, influenced by social practices, events, and technological advancements. This constant evolution poses a risk to digital artefacts that encapsulate cultural expressions. Words and phrases that were once neutral or positive can become offensive, and vice versa. This phenomenon, known as semantic drift, adds complexity to the representation of cultural heritage in a digital, multimodal context.
The Role of MuseIT
MuseIT addresses these challenges by developing sophisticated methods to detect, interpret, and measure changes in the semantics of language over time. This is crucial for maintaining the authenticity and relevance of cultural expressions as societal norms and technologies evolve. By understanding how language changes, MuseIT helps ensure that digital artefacts remain true to their original context and meaning.
Technical Innovations and Methodologies
MuseIT applies Natural Language Processing (NLP) techniques to the problem of semantic drift, using both traditional and contemporary embedding methods. The project first used static token/type embedding methods such as Word2Vec (1) and GloVe (2), which assign a single vector to each word. It has since integrated contextual embedding methods such as BERT (3) and ELMo (4), which produce a different vector for each occurrence of a word depending on its surrounding text. Contextual methods therefore capture how a word is used in context, which makes shifts in its meaning easier to observe.
One of the key methodologies involves comparing embeddings from different time periods using cosine similarity measures. This approach allows the detection of changes in word meanings by analyzing the difference between sets of embeddings trained on historical and contemporary corpora. Additionally, unsupervised methods like clustering and topic modeling enhance the sensitivity to word distribution changes, further refining the detection of semantic shifts.
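The comparison described above can be illustrated with a minimal sketch. It is not MuseIT's implementation: the function names and the toy three-dimensional vectors are invented for illustration, and the sketch assumes the two sets of embeddings already live in a shared vector space. Embeddings trained separately on different corpora generally need an alignment step (for example, orthogonal Procrustes) before their cosine similarity is meaningful.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def drift_scores(historical, contemporary):
    """Rank words shared by both periods by cosine distance (1 - similarity).

    Both arguments map a word to its embedding. Words that appear in only
    one period are skipped, because there is nothing to compare them with.
    """
    shared = historical.keys() & contemporary.keys()
    scores = {
        word: 1.0 - cosine_similarity(historical[word], contemporary[word])
        for word in shared
    }
    # Largest distance first: these are the candidates for semantic drift.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors, invented purely for illustration (not real embeddings).
historical = {"lame": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
contemporary = {
    "lame": [0.0, 0.0, 1.0],
    "table": [0.0, 1.0, 0.0],
    "selfie": [1.0, 1.0, 0.0],  # only in the newer corpus, so it is skipped
}

ranked = drift_scores(historical, contemporary)
for word, distance in ranked:
    print(f"{word}: {distance:.2f}")
```

A higher score only flags a word as a candidate for drift. Deciding whether the meaning has really changed still requires looking at how the word is used in each period.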
Key Results
MuseIT's analysis of offensive language within the context of disabilities has yielded significant insights. By examining a substantial anonymized dataset from social media platforms, predominantly Reddit, the project identified key terms that have undergone semantic drift. For instance, the term "lame," which was once associated with physical disabilities (5), is now frequently used in a derogatory manner. Similarly, "retarded," originally a clinical term, has devolved into pejorative slang. To filter relevant texts, MuseIT used a keyword-based approach, with terms sourced from academic research, social organizations, and legal guidelines. This resulted in a focused dataset of 59,639 comments, which were then analyzed for semantic drift and offensive language. The findings highlight the dynamic nature of language and underscore the importance of context-aware models for accurate detection.
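A keyword-based filter of the kind mentioned above could look roughly like the following sketch. The keyword list, the sample comments, and the function names are hypothetical placeholders, not MuseIT's actual term list or data. The sketch assumes case-insensitive, whole-word matching, which the source does not specify.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive, whole-word pattern for all keywords."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Longer keywords first, so multi-word terms win over their prefixes.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_comments(comments, keywords):
    """Keep only the comments that mention at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    return [comment for comment in comments if pattern.search(comment)]


# Hypothetical keywords and comments, for illustration only.
keywords = ["wheelchair", "deaf", "visually impaired"]
comments = [
    "The venue finally added a Wheelchair ramp.",
    "The deafening noise made it hard to talk.",
    "Audio guides help visually impaired visitors.",
    "Great concert last night!",
]

matched = filter_comments(comments, keywords)
for comment in matched:
    print(comment)
```

Whole-word matching keeps "deafening" from matching the keyword "deaf". A filter like this still cannot tell whether a term is used offensively, which is why the filtered comments need further analysis.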
Sentiment and Relevance Analysis
To support the annotation task, MuseIT incorporated sentiment analysis using TimeLMs, language models trained on Twitter data for various NLP tasks, including offensive language detection. These models provided sentiment labels and scores, aiding in the identification of potentially offensive content. Additionally, relevance analysis was conducted using the Llama-2-13b model, which classified posts based on their relevance to disability discourse. Combining the two analyses gave annotators both a sentiment signal and a relevance signal for each post, yielding a more reliable dataset for further analysis.
Promoting Inclusivity and Respect
By identifying and addressing offensive language and its evolution, MuseIT contributes to a more inclusive society. The project emphasizes the importance of mindful language use, particularly in discussions related to disabilities. This approach not only preserves cultural heritage but also promotes social awareness and sensitivity.
Conclusion
The MuseIT project stands at the intersection of technology and cultural heritage, offering innovative solutions to preserve the past while accommodating the present's evolving linguistic landscape. Through its focus on detecting semantic drift and offensive language, MuseIT ensures that digital artefacts remain relevant, authentic, and respectful. This initiative not only safeguards our cultural heritage but also fosters a more inclusive and aware society, bridging the gap between technology and the rich tapestry of human expression.
References:
(1) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS.
(2) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
(3) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
(4) Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of ACL, pages 1756–1765.