Bachelorarbeit, 2020
74 Seiten, Note: 1.1
1 Introduction
2 Background
2.1 The Property Graph Model
2.2 Cluster Analysis
2.2.1 Approaches
2.2.2 Clustering as a Search
2.3 Taxonomy
2.4 Probability Theory
3 Algorithms
3.1 Hierarchical Clustering Algorithms
3.1.1 Hierarchical Agglomerative Clustering
3.1.2 Robust Single Linkage
3.2 Partition-based Clustering
3.2.1 K-Means
3.2.2 TTSAS
3.3 Density-based Clustering
3.3.1 DBSCAN
3.3.2 OPTICS
3.3.3 HDBSCAN
3.4 Model-based Clustering – Conceptual Clustering
3.4.1 Cobweb
3.4.2 Extensions of Cobweb
3.5 Feature Extraction
3.5.1 Characteristic Sets
3.5.2 Recursive Feature Extraction
4 Label Inference
4.1 Problem Statement
4.2 Proposed Solution
4.2.1 Pre-Processing
4.2.1.1 Encoding Sets of Tags as Vectors
4.2.1.2 Feature Vector Extension
4.2.2 Clustering
4.2.3 Post-Processing: Taxonomy Extraction
5 Evaluation
5.1 Tag-based clustering
5.1.1 Setup
5.1.1.1 Data
5.1.1.2 Implementation
5.1.2 Results
5.1.3 Discussion
5.2 Graph-aware clustering of Nodes
5.2.1 Setup
5.2.1.1 Data
5.2.1.2 Pre-Processing
5.2.1.3 Implementation
5.2.2 Results
5.2.3 Discussion
6 Conclusion
6.1 Summary
6.2 Future Work
The thesis aims to automate the extraction of hierarchical type taxonomies from property graph databases, addressing the lack of implicit hierarchy support in common graph data models. By deriving these structures, the study facilitates improved query optimization, data exploration, and the generation of more concise queries.
3.4.1 Cobweb
Fisher describes COBWEB as a conceptual clustering algorithm that is designed to maximize inference ability, is incremental (sometimes also referred to as an online algorithm) and computationally economical [22]. To achieve this, Fisher takes the viewpoint of clustering as a learning task for concepts [50] and applies the search paradigm to the task. He describes incremental learning along two dimension: search direction and search control.
As COBWEB shall produce a hierarchy of clusters along with a concept description there are three search problems present according to Fisher [23]:
1. Searching the space of Characterizations: According to Mitchell [50] the space of characterizations may be ordered by generality. Thus one can either search this space (search direction) bottom-up, starting with a very specific hypothesis and generalize incrementally or search top-down, starting with the most general hypothesis and discriminating incrementally or one can search in both directions, converging to some solution that is in the best case neither too specific nor too general. Mitchell called this the version space strategy [50].
2. Searching the space of Aggregations: This is the search through the space of all possible partitionings in all levels of the hierarchy. One can start with the smallest possible partitions and the largest number of partitions and then proceed by selectively merging two or more partitions or the other way round with only one partition consisting of all elements, splitting in subsequent search steps.
3. Searching the space of Hierarchies: This is the space of orderings of the partitions with the constraint that the partitions must be ordered by the subset relation. Here one may build the hierarchy from the smallest partitions to the largest or the other way round.
1 Introduction: Discusses the ubiquity of hierarchical structures in data and identifies the lack of native support for these hierarchies in current property graph models.
2 Background: Provides foundational knowledge on property graph models, cluster analysis definitions, taxonomies, and probability theory.
3 Algorithms: Describes and analyzes various clustering algorithms, including hierarchical, partition-based, density-based, and conceptual clustering methods.
4 Label Inference: Outlines the problem of extracting explicit hierarchies from implicit data and presents a solution pipeline involving pre-processing, clustering, and taxonomy extraction.
5 Evaluation: Details the experimental setup and results for both tag-based and graph-aware clustering, discussing the performance and applicability of different feature vectors.
6 Conclusion: Summarizes the thesis findings and suggests potential future research directions in conceptual clustering and adaptive feature extraction.
Property Graph Model, Hierarchical Clustering, Conceptual Clustering, Cobweb, Taxonomy Extraction, Label Inference, Graph Databases, Neo4j, Data Mining, Feature Extraction, Cardinality Estimation, Cluster Analysis, Incremental Learning, Synthetic Data, Knowledge Discovery
The research addresses the fact that property graph databases lack internal support for hierarchical structures (type taxonomies), which are essential for tasks like query optimization and effective data exploration.
The thesis focuses on clustering algorithms, property graph data models, taxonomy extraction, and the development of pipelines to derive structural hierarchies automatically.
The primary objective is to develop and evaluate a pipeline that can automatically extract meaningful, representative type hierarchies from data within property graph databases.
The study employs a mix of survey-based research of existing clustering literature and experimental implementation of algorithms, including hierarchical agglomerative, density-based, and conceptual clustering (Cobweb) within a Neo4j environment.
The main part covers the theoretical background of clustering, a detailed algorithmic overview, the design of a graph-aware taxonomy extraction pipeline, and an extensive evaluation of different feature vectors using synthetic and real-world datasets.
Key terms include Property Graph Model, Hierarchical Clustering, Conceptual Clustering, Taxonomy Extraction, and Label Inference.
Cobweb is highlighted because it is parameter-free, incremental, provides probabilistic concept descriptions, and produces hierarchies by design, making it highly suitable for graph-aware clustering.
The pipeline uses a post-processing step to flatten dendrograms into usable taxonomies by aggregating consecutive merges with identical distances.
No, the research demonstrates that no single feature vector is universally optimal; instead, feature vectors must be constructed adaptively based on the specific properties and structures available in the data.
Der GRIN Verlag hat sich seit 1998 auf die Veröffentlichung akademischer eBooks und Bücher spezialisiert. Der GRIN Verlag steht damit als erstes Unternehmen für User Generated Quality Content. Die Verlagsseiten GRIN.com, Hausarbeiten.de und Diplomarbeiten24 bieten für Hochschullehrer, Absolventen und Studenten die ideale Plattform, wissenschaftliche Texte wie Hausarbeiten, Referate, Bachelorarbeiten, Masterarbeiten, Diplomarbeiten, Dissertationen und wissenschaftliche Aufsätze einem breiten Publikum zu präsentieren.
Kostenfreie Veröffentlichung: Hausarbeit, Bachelorarbeit, Diplomarbeit, Dissertation, Masterarbeit, Interpretation oder Referat jetzt veröffentlichen!

