Bachelor's thesis, 2009
63 pages, grade: 1.0
1 Introduction to CREAM
1.1 The Corpus Research for Exploitation of Annotated Metadata project (CREAM)
1.1.1 The main topic
1.1.2 The Corpus - eNova Database
1.1.2.1 eNova Application and Sample Walk Through
1.1.3 The main goal
2 State of the Art
2.1 Contributions to Word Prediction and Word Probability
2.1.1 Current Contributions to n-Gram Analysis
2.2 Applied n-Gram Analysis
2.2.1 National Security
2.2.2 Spelling Correction
2.2.2.1 Other Areas Related to Spelling Correction
3 n-Grams - Word Count and n-Gram Modeling
3.1 Introduction to Word Count in Corpora
3.2 Tokenization - Word Segmentation in Corpora
3.2.1 Word Types vs. Word Tokens
3.2.2 Stemming and Lemmatization
3.2.3 Non-word Characters
3.3 Parameters of Tokenization
3.3.1 Compounding and Words Separated by Whitespace
3.3.2 Hyphens
3.3.3 Case-(In-)Sensitivity
3.3.4 Other Cases in English
3.3.5 Stop Words
3.3.6 Other Languages
3.4 Introduction to Word Probability and Word Prediction
3.4.1 Markov Assumption and n-Gram Modeling
3.4.2 n-Grams - n-Token Sequences of Words
3.4.3 Simple n-Gram Analysis - Maximum Likelihood Estimation
3.5 n-Gram Analysis over Sample Corpus
4 Waterfall Model
4.1 Introduction
4.2 Introduction to the Waterfall Model of Software Development
4.2.1 Requirement Analysis
4.2.2 Specification Phase
4.2.3 Design
4.2.4 Implementation and Testing - Monogram Code
4.2.5 Implementation and Testing - Bigram Code
4.2.6 Integration
5 Interpretation of n-Gram Retrieval
5.1 Introduction
5.2 Presentation of Monograms
5.2.1 Interpretation of Monograms
5.2.2 Spelling Correction
5.3 Bigrams
5.3.1 Relative Frequency of Sample Bigrams
5.4 Trigrams
5.5 Main Interpretation and Forecast
The primary goal of this thesis is to improve the user experience and guided search functionality of the eNova pharmaceutical database through the application of n-gram modeling. By analyzing historical search queries, the work aims to infer user search strategies, predict intended keyword combinations, and optimize database performance to support both scientific and expert users.
1.1.2 The Corpus - eNova Database
As the main goal of the CREAM project is to exploit large databases, we first take a look at the corpus itself. The corpus we will be working with consists of search queries from the Novartis Corporate Drug Literature Database, also called eNova. With eNova, Novartis created a system that supplies customers with information about its own products (drugs). Novartis thus provides scientists, doctors, internal experts, and others with a thorough and large database of pharmaceutical keywords and expert-written articles about a given drug and related issues, such as contributing authors, the respective journals, drug side effects, and so on. The unique features of the drug literature database include, for instance: "1) a comprehensive coverage of products (drugs) by Novartis, 2) customized Novartis drug specific abstracts, [...], and 4) direct access to the full text of most articles" [Mas, 2008]. eNova is therefore a drug literature database of excellent quality and quantity. Furthermore, every search executed over eNova is saved to disk for later analysis. The analysis of this corpus is part of this thesis and will be dealt with throughout the following chapters. The following sample lines, taken from the corpus of executed searches, give a first impression of what the "raw" corpus looks like:
ea8022569a146c814c33eac56ef767f5,,"xolair.prn and zeig.au"
41a544e889f3807af4bcb5340d5ed3e5,,00290:00290.ANO
41a544e889f3807af4bcb5340d5ed3e5,,"(#3 ) and (#2 )"
41a544e889f3807af4bcb5340d5ed3e5,,00072:00072.ANO
41a544e889f3807af4bcb5340d5ed3e5,,"(#5 ) and ( ’TP FTY720’.PRN )"
As we will focus on the occurring keywords, we will refer to the actual term (such as xolair) as the stem and its specification (such as .prn) as its suffix. A detailed list of what the suffixes stand for is given in appendix A. It is important to note that we will be working with two different corpora. The first consists of internal expert search queries, i.e., searches executed by experts within Novartis's department
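Based on the sample lines above, a raw corpus line can be read as a session hash, a double-comma separator, and the query itself (possibly quoted), with each keyword splitting into stem and suffix at the dot. The following Ruby sketch illustrates this structure; the method names and the naive split on the `and` operator are assumptions for illustration, not the thesis code:

```ruby
# Illustrative parser for a raw corpus line of the form
#   <session-hash>,,<query>
# where the query may be wrapped in double quotes.

RAW_LINE = 'ea8022569a146c814c33eac56ef767f5,,"xolair.prn and zeig.au"'

def parse_line(line)
  session, query = line.split(',,', 2)
  [session, query.delete('"')]        # strip surrounding quotes, if any
end

# Split a keyword such as "xolair.prn" into its stem and suffix.
def stem_and_suffix(keyword)
  stem, suffix = keyword.split('.', 2)
  { stem: stem, suffix: suffix }
end

session, query = parse_line(RAW_LINE)
keywords = query.split(/\s+and\s+/i)  # naive split on the Boolean operator
puts keywords.map { |k| stem_and_suffix(k) }.inspect
```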
1 Introduction to CREAM: Introduces the CREAM project and the eNova pharmaceutical database, setting the stage for the research objective of utilizing n-gram modeling to enhance guided search capabilities.
2 State of the Art: Provides a theoretical overview of word prediction, n-gram modeling, and related computational linguistics concepts, including historical context like the Shannon Game.
3 n-Grams - Word Count and n-Gram Modeling: Discusses the methodology of corpus preparation, focusing on tokenization, lemmatization, and the mathematical foundations of word probability and n-gram sequences.
4 Waterfall Model: Details the software development process used to create the Ruby-based n-gram analysis tools, covering requirement analysis, design, implementation, and integration of the code.
5 Interpretation of n-Gram Retrieval: Presents the empirical results of the n-gram analysis performed on the eNova database, including interpretations of monograms, bigrams, and trigrams, and concludes with a forecast for future work.
n-gram modeling, corpus analysis, pharmaceutical database, eNova, Novartis, tokenization, word probability, Markov assumption, information retrieval, Ruby, software development, waterfall model, spelling correction, search strategy, lexicography
The work focuses on analyzing search query data from the eNova pharmaceutical database to understand user search behavior and improve the efficiency of guided search interfaces through n-gram modeling.
The study integrates computational linguistics and software engineering, applying natural language processing (NLP) techniques to optimize professional, domain-specific search engines.
The main goal is to implement a method for predicting keyword combinations that users are likely to look for, thereby enabling more intuitive and powerful search interfaces using picklists and statistical distribution data.
The research uses n-gram analysis—specifically monograms, bigrams, and trigrams—combined with the Maximum Likelihood Estimation (MLE) method to calculate word probabilities within the provided search corpora.
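The MLE step described above reduces to counting: the probability of a word given its predecessor is the bigram count divided by the predecessor's monogram count. The sketch below shows this on an invented toy token list; the `picklist` helper, which ranks continuations the way a guided-search picklist might, is an illustrative assumption:

```ruby
# Maximum Likelihood Estimation over a toy token list (invented data).
# MLE: P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1}).

tokens = %w[xolair diovan xolair zelnorm xolair diovan]

unigrams = Hash.new(0)
bigrams  = Hash.new(0)
tokens.each { |t| unigrams[t] += 1 }
tokens.each_cons(2) { |a, b| bigrams[[a, b]] += 1 }

def mle(bigrams, unigrams, prev, word)
  bigrams[[prev, word]].to_f / unigrams[prev]
end

# Hypothetical picklist: rank continuations of a keyword by MLE probability.
def picklist(bigrams, unigrams, prev)
  bigrams.keys
         .select  { |a, _| a == prev }
         .map     { |_, b| [b, mle(bigrams, unigrams, prev, b)] }
         .sort_by { |_, p| -p }
end

# P(diovan | xolair) = count(xolair diovan) / count(xolair) = 2/3
puts picklist(bigrams, unigrams, 'xolair').inspect
```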
The main body follows a software development lifecycle approach (Waterfall Model) to conceptualize, design, and implement customized Ruby code that parses search queries and generates statistical models.
Key terms include n-gram modeling, corpus analysis, eNova, Novartis, tokenization, and Markov assumption.
The author notes that while spelling errors exist (approx. 0.7-1.1%), they are deemed minor enough not to significantly skew the statistical results, though future improvements might incorporate automated spelling correction.
Three distinct Ruby scripts are used to handle different levels of n-gram analysis (monograms, bigrams, and trigrams), specifically filtered for pharma-industry search syntax (e.g., ignoring Boolean operators).
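Filtering out Boolean operators before counting, as described above, can be sketched in a few lines of Ruby. The operator list, the rough tokenizer regex, and the handling of numbered back-references (such as `#3` in the sample corpus lines) are assumptions for illustration:

```ruby
# Sketch: strip Boolean operators and search-syntax tokens from a query
# before n-gram counting. Operator list and method name are illustrative.

BOOLEAN_OPERATORS = %w[and or not]

def content_tokens(query)
  query.downcase
       .scan(/[#\w.'-]+/)                      # rough tokenizer
       .reject { |t| BOOLEAN_OPERATORS.include?(t) }
       .reject { |t| t.start_with?('#') }      # drop back-references like #3
end

puts content_tokens('(#5 ) and ( xolair.prn )').inspect
```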
The study contrasts "internal expert" queries with "external expert" queries to demonstrate differences in search strategies and keyword usage between different user groups.
The author suggests that future work should focus on optimizing code for syntactic case-insensitivity and shifting the focus towards analyzing suffixes to better categorize generic search intents.

