Master's thesis, 2017
73 pages, grade: 4.6/5
Chapter 1: Introduction
1.1 Motivation: Why is the Topic so Important?
1.2 Thesis Overview
1.3 The Problem and Contribution
1.4 Outline of The Thesis
Chapter 2: Background
2.1 Dialogue Systems
2.1.1 Introduction
2.1.2 Data-Driven vs Other Design Approaches
2.2 McGill Ubuntu Dialogue Corpus
Chapter 3: Methods and Techniques
3.1 Natural Language Processing (NLP)
3.1.1 Introduction
3.1.2 Wikipedia-Based Explicit Semantic Analysis (ESA)
3.2 Deep Learning
3.2.1 Why Deep Learning?
3.2.2 Deep Neural Networks: Definitions and Basics
3.2.3 RNN and LSTM Networks
Chapter 4: Data Collection: Six IRC Channels
4.1 Ubuntu
4.2 Lisp
4.3 Perl6
4.4 Koha
4.5 ScummVM
4.6 MediaWiki
Chapter 5: General Pipeline for Dialogue Extraction: IRC-VPP
5.1 Pipeline Architecture
5.2 Components and Configurations
5.2.1 IRC Channel Crawler
5.2.2 Raw IRC Cleaner
5.2.3 Dialogue Extraction
5.3 Post-Processing Algorithms
5.3.1 Message Extraction
5.3.2 Recipient Identification
5.3.3 Dialogue Extraction and Hole-Filling
5.3.4 Relevant Messages Concatenation
5.4 Annotating IRC-VPP Dialogues Datasets
Chapter 6: Experiments and Evaluation
6.1 Pre-Training Datasets Statistics
6.2 IRC-VPP Software vs McGill Software
6.3 RNN/LSTM/ESA Results
Chapter 7: Conclusion and Future Work
This thesis aims to maximize the utility of unstructured, domain-specific online conversations for training intelligent dialogue systems. The primary research goal is to design and implement a versatile pipeline, referred to as IRC-VPP, which automates the collection, cleaning, and extraction of one-on-one dialogue data from various Internet Relay Chat (IRC) channels. Furthermore, the work evaluates the integration of this pipeline with deep learning and semantic analysis techniques to improve dialogue response accuracy.
3.1.2 Wikipedia-Based Explicit Semantic Analysis (ESA)
Research on computing the semantics of natural language has shown that humans interpret text and recognize its relatedness automatically, without conscious effort, because they share a common-sense understanding of the world. Machines, by contrast, lack such common sense. It has become clear that in order to process natural language, computers require access to large amounts of common-sense and domain-specific knowledge [6]. However, previous work on semantic relatedness was either based on purely statistical techniques that made no use of background knowledge [7] or relied on lexical resources with limited coverage [3]. ESA was originally introduced by Evgeniy Gabrilovich and Shaul Markovitch [8] to enable machines to semantically analyze and interpret textual data using explicitly defined common-sense knowledge from the Wikipedia knowledge base, where each Wikipedia article describes one concept.
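The core of ESA can be sketched in a few lines: each text is mapped to a weighted vector of Wikipedia concepts via a TF-IDF inverted index, and relatedness is the cosine similarity of those vectors. The "articles" below are tiny hypothetical stand-ins for real Wikipedia concepts; this is an illustration of the technique, not the thesis's implementation.

```python
# Illustrative ESA sketch: texts become weighted vectors of concepts,
# relatedness is cosine similarity. The articles are toy stand-ins
# for Wikipedia concept pages.
import math
from collections import Counter

articles = {
    "Operating system": "kernel process memory scheduler ubuntu linux",
    "Text editor": "vim emacs editor text file buffer",
    "Package manager": "apt package install ubuntu repository dependency",
}

def tf_idf_index(docs):
    """Build an inverted index: word -> {concept: tf-idf weight}."""
    n = len(docs)
    df = Counter(w for text in docs.values() for w in set(text.split()))
    index = {}
    for concept, text in docs.items():
        for w, f in Counter(text.split()).items():
            index.setdefault(w, {})[concept] = f * math.log(n / df[w])
    return index

def esa_vector(text, index):
    """Represent a text as a weighted vector over concepts."""
    vec = Counter()
    for w in text.lower().split():
        for concept, weight in index.get(w, {}).items():
            vec[concept] += weight
    return vec

def relatedness(a, b, index):
    """Cosine similarity between the ESA vectors of two texts."""
    va, vb = esa_vector(a, index), esa_vector(b, index)
    dot = sum(va[c] * vb[c] for c in va)
    norm = (math.sqrt(sum(x * x for x in va.values()))
            * math.sqrt(sum(x * x for x in vb.values())))
    return dot / norm if norm else 0.0

index = tf_idf_index(articles)
print(relatedness("how do I install a package on ubuntu",
                  "apt cannot find the repository", index))
```

Two texts that share no overlapping concept dimensions score zero, which is exactly the failure mode that the Wikipedia-scale concept space is meant to avoid: with millions of articles, almost any pair of topically related texts activates common concepts.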
Chapter 1: Introduction: Discusses the importance of data-driven dialogue systems and introduces the need for automated, domain-specific data collection and post-processing.
Chapter 2: Background: Provides an overview of dialogue systems, their evolution, and the specific state-of-the-art work by McGill University on the Ubuntu IRC channel.
Chapter 3: Methods and Techniques: Details the natural language processing techniques, specifically Wikipedia-based ESA, and the deep learning architectures (RNN/LSTM) used in the thesis.
Chapter 4: Data Collection: Six IRC Channels: Demonstrates the data collection phase across various IRC channels, detailing their unique HTML structures and data formats.
Chapter 5: General Pipeline for Dialogue Extraction: IRC-VPP: Explains the design and functionality of the IRC-VPP software, including its components, configuration parameters, and post-processing algorithms.
Chapter 6: Experiments and Evaluation: Presents and discusses the statistical results of the extracted datasets and evaluates the performance of the deep learning models.
Chapter 7: Conclusion and Future Work: Concludes the thesis by summarizing the success of the generalized pipeline and suggests potential future research directions.
Dialogue Systems, Deep Learning, Natural Language Processing, IRC-VPP, Explicit Semantic Analysis, ESA, Data Collection, Recurrent Neural Networks, RNN, Long Short Term Memory, LSTM, Post-Processing, Ubuntu, IRC, Information Extraction
The thesis focuses on automating the process of collecting and post-processing unstructured human-human conversational data from various Internet Relay Chat (IRC) channels to facilitate the development of domain-specific data-driven dialogue systems.
The work spans several fields including Natural Language Processing (NLP), Deep Learning (specifically RNN and LSTM networks), data pipeline architecture design, and semantic analysis using Wikipedia-based ESA.
The objective is to generalize the pipeline architecture for dialogue extraction so that it is not limited to a single domain (like Ubuntu), but can instead adapt to different IRC channel formats and provide structured data for neural network training.
The research employs automatic crawling techniques, novel post-processing heuristics for log cleaning and dialogue extraction, and evaluates machine learning performance using Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) models.
The main body covers the theoretical background of dialogue systems, detailed methodology for data collection and extraction (the IRC-VPP pipeline), and comprehensive experimental evaluations of deep learning models on multiple datasets.
Key terms include Data-Driven Dialogue Systems, IRC-VPP, Deep Learning, Explicit Semantic Analysis (ESA), and automated data pipeline generation.
The Hole-Filling algorithm is designed to capture conversational segments where users do not explicitly address recipients, which are common in IRC logs, thus significantly increasing the volume of usable training data.
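A minimal sketch of the hole-filling idea: a message without an explicit recipient (a "hole") is attributed to the sender's most recent dialogue partner. This is a simplified illustration of the concept, not the thesis's exact heuristic, and the symmetric partner update is an assumption of this sketch.

```python
# Hole-filling sketch: attribute recipient-less IRC messages to the
# sender's most recent dialogue partner. Illustrative only.

def fill_holes(messages):
    """messages: list of (sender, recipient_or_None, text) tuples.
    Returns the list with None recipients filled where a recent
    partner is known."""
    last_partner = {}  # nick -> most recent dialogue partner
    filled = []
    for sender, recipient, text in messages:
        if recipient is None:
            recipient = last_partner.get(sender)  # fill the hole, if possible
        else:
            last_partner[sender] = recipient
        if recipient is not None:
            # being addressed also establishes a partner (sketch assumption)
            last_partner[recipient] = sender
        filled.append((sender, recipient, text))
    return filled

log = [
    ("alice", "bob", "did the upgrade fix your boot issue?"),
    ("bob", None, "yes, works now"),         # hole: bob is replying to alice
    ("carol", "dan", "which channel logs these?"),
    ("dan", None, "the official archive"),   # hole: dan is replying to carol
]
for msg in fill_holes(log):
    print(msg)
```

The payoff is exactly the one stated above: in real IRC logs most replies omit the addressee, so recovering them turns otherwise unusable log lines into one-on-one training pairs.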
ESA is integrated with neural network models to improve semantic interpretation and relatedness estimation, aiming to boost the accuracy of "best response" selection compared to using deep learning methods in isolation.
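One simple way such an integration can work is to interpolate the neural model's response score with the ESA relatedness between context and candidate. The sketch below, including the interpolation weight `alpha` and the candidate scores, is purely hypothetical and is not taken from the thesis.

```python
# Hedged sketch: rank candidate responses by a weighted sum of a
# neural (e.g. LSTM) score and an ESA relatedness score.
# alpha and the scores below are hypothetical illustration values.

def combined_score(neural_score, esa_relatedness, alpha=0.7):
    """Interpolate neural and ESA scores; alpha weights the neural side."""
    return alpha * neural_score + (1 - alpha) * esa_relatedness

# candidate -> (neural_score, esa_relatedness_to_context)
candidates = {
    "try sudo apt update first": (0.62, 0.80),
    "i like turtles":            (0.55, 0.05),
}
best = max(candidates, key=lambda r: combined_score(*candidates[r]))
print(best)
```

The intuition is that the ESA term penalizes fluent but off-topic candidates that a sequence model alone might rank highly, which matches the stated aim of improving "best response" selection over deep learning in isolation.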

