Designing an Information Extraction System for Amharic Vacancy Announcement Text

Master's Thesis, 2011
105 Pages, Grade: Very Good

Computer Science - Applied Computer Science

Excerpt

CHAPTER ONE

INTRODUCTION

1.1. GENERAL BACKGROUND

1.2. STATEMENT OF THE PROBLEM

1.3. OBJECTIVE OF THE STUDY

1.3.1. GENERAL OBJECTIVE

1.3.2. SPECIFIC OBJECTIVES

1.4. METHODOLOGY

1.4.1. STUDY DESIGN

1.4.2. LITERATURE REVIEW

1.4.3. DATA SOURCES AND DATA PREPARATION FOR THE EXPERIMENT

UNDERSTANDING OF DOMAIN LANGUAGE

1.4.4. DESIGN AND IMPLEMENTATION OF AVATIES

1.5. APPLICATION OF RESULTS AND BENEFICIARIES

1.6. SCOPE AND LIMITATIONS OF THE STUDY

1.7. ORGANIZATION OF THE STUDY

CHAPTER TWO

LITERATURE REVIEW

2.1. INTRODUCTION

2.2. INFORMATION EXTRACTION (IE)

2.3. BUILDING INFORMATION EXTRACTION SYSTEMS

I. KNOWLEDGE ENGINEERING APPROACH

II. AUTOMATIC TRAINING APPROACH

2.4. ARCHITECTURE OF INFORMATION EXTRACTION SYSTEM

2.5. PREPROCESSING OF INPUT TEXTS

2.6. LEARNING AND APPLICATION OF THE EXTRACTION MODEL

2.7. POST PROCESSING OF OUTPUT

2.8. RELATED NLP FIELDS TO INFORMATION EXTRACTION

2.8.1. INFORMATION RETRIEVAL (IR)

2.8.2. TEXT SUMMARIZATION

2.8.3. QUESTION ANSWERING SYSTEMS

2.8.4. TEXT CATEGORIZATION

2.9. INFORMATION EXTRACTION (IE) AND INFORMATION RETRIEVAL (IR)

2.10. EVALUATION OF INFORMATION EXTRACTION

2.11. RELATED WORKS

INFORMATION EXTRACTION FOR E-JOB MARKETPLACE

INFORMATION EXTRACTION FROM AMHARIC TEXT

INFORMATION EXTRACTION FROM ENGLISH TEXT

INFORMATION EXTRACTION FROM CHINESE TEXT

CHAPTER THREE

THE AMHARIC WRITING SYSTEM

3.1. INTRODUCTION

3.2. AMHARIC CHARACTER REPRESENTATION AND WRITING SYSTEM

3.3. AMHARIC PUNCTUATION MARKS AND NUMERALS

3.4. CHARACTERISTICS OF THE AMHARIC WRITING SYSTEM

3.5. THE MORPHOLOGY OF AMHARIC

3.6. GRAMMATICAL STRUCTURE OF AMHARIC

3.6.1. WORD CATEGORIZATION IN AMHARIC

3.7. SENTENCES IN AMHARIC

CHAPTER FOUR

DESIGN AND IMPLEMENTATION OF AVATIES

4.1. INTRODUCTION

4.2. PROPOSED MODEL

DATA PREPROCESSING

LEARNING AND EXTRACTION COMPONENT

POST PROCESSING

THE PROTOTYPE SYSTEM

CHAPTER FIVE

RESULT AND EVALUATION

5.1. INTRODUCTION

5.2. EVALUATION METRICS

5.3. THE DATASETS

5.4. EXPERIMENTAL RESULT AND EVALUATION OF EACH COMPONENT OF OUR SYSTEM

5.4.1. EXPERIMENTAL RESULT AND EVALUATION OF NORMALIZATION

5.4.2. EXPERIMENTAL RESULT AND EVALUATION OF STOPWORD REMOVAL

5.4.3. EXPERIMENTAL RESULT AND EVALUATION OF TRANSLITERATION

5.4.5. EXPERIMENTAL RESULT AND EVALUATION OF PROTOTYPE SYSTEM FOR CANDIDATE TEXT EXTRACTION

5.4.5.1. EXPERIMENTAL RESULT AND EVALUATION OF ORGANIZATION AND POSITION EXTRACTION

5.4.5.2. EXPERIMENTAL RESULT AND EVALUATION OF OTHER CANDIDATE TEXT EXTRACTION

CHAPTER SIX

CONCLUSION AND RECOMMENDATION

6.1. CONCLUSIONS

6.2. RECOMMENDATION

REFERENCE

Research Objectives and Topics

The primary objective of this research is to design and implement an automated Information Extraction (IE) system for Amharic vacancy announcement texts. The study aims to overcome the challenges of manually processing unstructured job postings by developing a rule-based system capable of accurately identifying and extracting key organizational and job-related data.

Development of an Amharic-specific information extraction architecture.
Implementation of robust linguistic preprocessing techniques for Amharic text (tokenization, normalization, transliteration).
Design and testing of rule-based algorithms to extract specific data attributes like organization, position, qualification, salary, and deadlines.
Evaluation of system performance using precision, recall, and F-measure metrics on real-world datasets.
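The three metrics named in the last objective can be sketched as follows. This is a minimal illustration, assuming that system output and the reference annotation are both represented as sets of (slot, value) pairs and that only exact matches count as correct; the function name, the slot names, and the sample values are hypothetical and are not taken from the thesis.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted (slot, value) pairs against a gold-standard set.

    Assumption: a pair counts as correct only on an exact match.
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F-measure with equal weight on precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sample data: four reference fills, three system fills.
gold = {
    ("organization", "ORG_A"),
    ("position", "accountant"),
    ("salary", "5000"),
    ("deadline", "10 days"),
}
extracted = {
    ("organization", "ORG_A"),
    ("position", "accountant"),
    ("salary", "4000"),
}
p, r, f = precision_recall_f1(extracted, gold)
```

In this sample, two of the three extracted pairs are correct and two of the four reference pairs are found, so precision is 2/3, recall is 1/2, and the F-measure is 4/7.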

Excerpt from the Book

1.1. GENERAL BACKGROUND

Rapid developments in Information and Communication Technology are making huge amounts of data and information available. Much of this data is in electronic form (for example, the more than a billion documents on the World Wide Web). Usually the data is unstructured or semi-structured and can generally be considered a text database. Likewise, recent decades have witnessed a rapid proliferation of Amharic textual information available in digital form in a myriad of repositories on the Internet and intranets. As a result of this growth, a huge amount of valuable information that could be used in education, business, health, and many other areas is hidden in unstructured textual data and is thus hard to search. This has resulted in a growing need for effective and efficient techniques for analyzing free-text data and discovering valuable and relevant knowledge in it in the form of structured information, and it has led to the emergence of Information Extraction technologies.

Information Extraction (IE) is one of the NLP applications that aim to automatically extract structured factual information from unstructured text. As Riloff [2] discusses, the task of automatically extracting information from texts involves identifying a predefined set of concepts and deciding whether a text is relevant to a certain domain and, if so, extracting a set of facts from that text.

An IE system has three components, regardless of the language and domain for which it is developed: linguistic preprocessing, learning and application, and post-processing. Linguistic preprocessing uses different tools to make the natural language texts ready for extraction. The learning and application component learns a model and extracts the required information from the preprocessed text.
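The three-stage structure described above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names, the `token_after` rule helper, and the English sample text are hypothetical, and hand-written rules stand in for the learned or engineered extraction model.

```python
import re


def preprocess(text):
    """Stage 1, linguistic preprocessing: here reduced to tokenization."""
    # Words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def token_after(keyword):
    """Build a hypothetical rule: return the token following a keyword."""
    def rule(tokens):
        for i, tok in enumerate(tokens[:-1]):
            if tok == keyword:
                j = i + 1
                # Skip a colon between the keyword and its value.
                if tokens[j] == ":" and j + 1 < len(tokens):
                    j += 1
                return tokens[j]
        return None
    return rule


def extract(tokens, rules):
    """Stage 2, application: run every rule and collect candidate fills."""
    candidates = []
    for slot, rule in rules.items():
        value = rule(tokens)
        if value is not None:
            candidates.append((slot, value))
    return candidates


def postprocess(candidates):
    """Stage 3, post-processing: build a structured record (first value wins)."""
    record = {}
    for slot, value in candidates:
        record.setdefault(slot, value)
    return record


rules = {"position": token_after("Position"), "salary": token_after("Salary")}
tokens = preprocess("Position: Accountant Salary: 5000")
record = postprocess(extract(tokens, rules))
```

Keeping the stages as separate functions means a preprocessing step such as normalization or stopword removal could be added without touching the extraction rules.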

Chapter Summaries

CHAPTER ONE: Provides an introduction to the research, outlining the background, problem statement, objectives, and the methodology used to develop the IE system.

CHAPTER TWO: Reviews the literature on Information Extraction (IE) techniques, related NLP fields, and existing IE systems, providing a foundation for the proposed approach.

CHAPTER THREE: Discusses the Amharic writing system, morphology, and grammatical structure, highlighting language-specific challenges relevant to the research.

CHAPTER FOUR: Details the design and implementation of the AVATIES prototype, including the proposed model, preprocessing steps, and extraction algorithms.

CHAPTER FIVE: Presents the experimental results and performance evaluation of the system, using precision, recall, and F-measure metrics on the collected test dataset.

CHAPTER SIX: Concludes the thesis by summarizing key findings and providing recommendations for future research and system improvements.

Keywords

Information Extraction, Amharic Language, Vacancy Announcement, Rule-Based System, Natural Language Processing, Tokenization, Normalization, Named Entity Recognition, Gazetteer, Morphology, POS Tagging, Prototype System, Precision, Recall, F-measure.

Frequently Asked Questions

What is the core purpose of this research?

The research focuses on designing an automated Information Extraction system specifically for Amharic vacancy announcements to reduce the manual effort of extracting job-related details from unstructured newspaper text.

What are the central themes of this work?

Key themes include natural language processing for the Amharic language, rule-based IE modeling, linguistic preprocessing, and system performance evaluation within the specific domain of job postings.

What is the primary research question?

The study seeks to determine the most effective approaches, algorithms, and models for designing an Amharic IE system capable of accurately identifying and extracting relevant data from unstructured vacancy announcements.

Which scientific methodology does the author use?

The research employs an experimental methodology, involving the collection of Amharic vacancy texts, development of rule-based algorithms, and testing the system's performance using standard NLP metrics like precision and recall.

What is covered in the main body of the work?

The main body covers the literature review of IE systems, an analysis of the Amharic language structure (writing system and morphology), the technical design of the AVATIES prototype, and a thorough performance evaluation.

Which keywords define this work?

The study is characterized by terms such as Information Extraction, Amharic Language, Vacancy Announcement, Rule-Based System, Natural Language Processing, and Prototype System.

Why is the Amharic writing system challenging for information extraction?

Amharic uses a syllabic script in which several consonant characters share the same sound and are used interchangeably, so the same word can appear in more than one spelling. Effective extraction therefore requires a robust normalization algorithm that maps these variants to a single form.

How does the AVATIES prototype handle different vacancy formats?

The system uses a combination of gazetteers (predefined lists) and feature-based context rules to identify entities like job position and organization name regardless of the specific format of the vacancy text.
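The combination of gazetteer lookup and context rules can be illustrated with the sketch below. All list entries, cue words, and sample texts are hypothetical placeholders written in English for readability; the thesis's actual gazetteers and rules for Amharic are not reproduced here.

```python
# Hypothetical gazetteer and cue words, for illustration only.
ORG_GAZETTEER = {"Example Bank", "Sample University"}
POSITION_CUES = {"position", "job title"}


def find_organization(text, gazetteer):
    """Gazetteer lookup: return the longest listed name found in the text."""
    matches = [name for name in gazetteer if name in text]
    return max(matches, key=len) if matches else None


def find_position(lines, cue_words):
    """Context rule: take the value after a cue word followed by a colon."""
    for line in lines:
        head, sep, tail = line.partition(":")
        if sep and head.strip().lower() in cue_words and tail.strip():
            return tail.strip()
    return None


# Two differently formatted announcements yield the same slots.
text_a = "Example Bank invites applicants.\nJob title: Accountant\nSalary: 5000"
text_b = "Position : Accountant\nEmployer: Example Bank"

result_a = (find_organization(text_a, ORG_GAZETTEER),
            find_position(text_a.splitlines(), POSITION_CUES))
result_b = (find_organization(text_b, ORG_GAZETTEER),
            find_position(text_b.splitlines(), POSITION_CUES))
```

Because the organization is found by list membership and the position by its local context, neither rule depends on where in the announcement the field appears.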

What were the final results of the experiment?

The prototype system achieved an overall F-measure of 71.7%, demonstrating that a rule-based knowledge engineering approach is a promising direction for Amharic information extraction.

End of the excerpt from 105 pages

Details

Title: Designing an Information Extraction System for Amharic Vacancy Announcement Text
University: Addis Ababa University
Course: Natural Language Processing
Grade: Very Good
Author: Sintayehu Hirpassa (Author)
Year of publication: 2011
Pages: 105
Catalog number: V289226
ISBN (eBook): 9783656895565
ISBN (Book): 9783656895572
File size: 1129 KB
Language: English
Notes: This paper offers a very good view of Natural Language Processing, specifically Information Extraction for a local language of Ethiopia. The author is the pioneer on this topic in our school and has produced remarkable results from his study; I believe that this study is a milestone for the integration of local languages into the computerized world.
Keywords: information extraction
Product safety: GRIN Publishing GmbH
Price (eBook): US$ 34.99
Price (Book): US$ 46.99

Cite this work: Sintayehu Hirpassa (Author), 2011, Designing an Information Extraction System for Amharic Vacancy Announcement Text, Munich, GRIN Verlag, https://www.diplomarbeiten24.de/document/289226