What is KDD in DBMS

Database systems basic topics. Knowledge Discovery in Databases (KDD)

Transcript

1 Database Systems Basic Topics Peter Brezany Institute for Software Sciences University of Vienna Knowledge Discovery in Databases (KDD) It is a strongly interdisciplinary topic at the interface of statistics, machine learning and database systems. First we deal with some basics of the KDD in the area of ​​data backup systems. Then from the field of statistics. Then the relevant machine learning techniques are introduced. 2 Page 1

2 Basic Terms The data base system (DBS) is used for the description, permanent storage and efficient recovery of large amounts of data that are used by various application programs. A DBS consists of 2 components: Database (DB): Collection of all stored data and the associated descriptions. Database management system (DBMS): Program system for managing the DB. All access by application programs to the DB does not take place directly, but centrally via the DBMS. 3 Architecture of a database system application program 1 database system application program 2 database management system application program N database 4 Page 2

3 abstraction levels of a DBS External level It allows different views of different users or user groups on the data status. Conceptual level This is the central level. It specifies the logical overall view (i.e. independent of the actual storage) of all data that are required by any application program. All object and relationship types and their value ranges are described. Internal level It describes the physical data organization, i.e. it defines how the objects and relationships described in the conceptual schema are physically stored and which access options are available. Options for accessing the data records are specified using index structures such as hash tables, inverted lists or B-trees. 5 Data models Data models are formalisms for describing all objects contained in the database and their relationships to one another (in the database schema), as they are required on the conceptual and external level. They differ in the way objects and relationships between objects are represented. In the following we present the relational data model, which has found widespread use. 6 Page 3

4 Relational data model The relational data model is based on the structuring principle of quantities (tables, relations). A range of values ​​(domain) is a (logically related set of values, e.g. INTEGER, STRING, DATE, {1, ..., 10}. A range of values ​​can have finite or infinite cardinality. A relation R is a subset of the Cartesian product of k 1 value ranges D 1, ..., D k. Individual elements of a relation are called tuples. For RD 1 x D 2 x ... x D k, k is the degree or arity of the relation; all tuples in R have k components . Relations can be understood and represented as tables. The rows of a table correspond to the tuples. The columns are called attributes and can have names. 7 Relational data model (2) A relation schema is a k-tuple (D 1, ..., D k The components of a relational schema are called attributes; they can be named: (A 1: D 1, ..., A k: D k). A minimal subset of the attributes of a relational schema, on the basis of which all tuples of a ( possible) relation are distinguishable is called key n schema describes all possible tuples, while a relation contains the actual tuples. 8 Page 4

5 Relational data model (3) Example cities Name Inhabitants Land Munich Bayer Bremen Bremen Scheme: (Name: STRING, Inhabitants: INTEGER, Country: STRING Relation: {(Munich,, Bavaria), (Bremen,, Bremen), ...} Key: {Name} 9 Relational Database Languages ​​SQL (Structured Query Language) was developed at the IBM Almaden Research Laboratory in 1974 as DDL (Data Definition Language) and DML (Data Manipulation Language). Today, SQL is the industry standard for relational database languages. Definition of a relation CREATE TABLE ( {, }) :: = {

6 Database queries The basic form of an SQL query is: SELECT FROM [WHERE ] [GROUP BY ] [HAVING ] [ORDER BY

7 Relational base operations (2) Selection σ B = b (R) SELECT * FROM R WHERE B = `b This operation selects all tuples from R that meet condition B =` b. Projection π A, C (R) SELECT DISTINCT A, C FROM R The projection returns the attributes A and C for all tuples from R Join (join) SELECT * FROM R, S WHERE B θ F The join returns all pairs (r , s) with r R, s S and rb θ sf Here, θ denotes a predicate on the range of values ​​of the attributes B and F, e.g. The predicate = or the predicate

8 Relational basic operations (4) The aggregate functions COUNT, MIN, MAX, SUM, AVG etc. can be applied to a set of values ​​that is given as a column of a relation. How many suppliers are there? SELECT COUNT (DISTINCT LName) FROM Supplier 15 Grouping and sorting The GROUP BY clause combines sets of tuples with the same values ​​of the specified attributes to form groups. The result relation contains a tuple for each group. The result relation contains a tuple for each group. In the SELECT clause, only expressions are permitted that have one value per group. With the help of a having clause, groups are selected based on the specified condition. Only arguments with one value per group may appear in the condition. The ORDER BY operation sorts the result relation according to one or more attributes in ascending or descending order (ASC DESC). Examples on the next slide 16 Page 8

9 Grouping and sorting (2) Outputs the names of all suppliers who deliver more than five parts. SELECT LName FROM Supplier GROUP BY Lname HAVING COUNT (*)> 5 Create an alphabetically sorted list of all goods in which the minimum, maximum and average price is given for each goods. SELECT goods, MIN (price), MAX (price), AVG (price) FROM supplier GROUP BY goods ORDER BY goods 17 Inquiry processing There are usually many different ways of answering a given SQL query, each of which is described by a query plan. A query plan can be represented by a so-called operator tree: The leaves contain the relations that occur. The inner nodes represent the operations used. Example; Cities (Sname, SEinw, Land) Countries (LName, LEinw, Party) The next figure presents 2 possible inquiry plans for the inquiry Find all names of cities in CDU-governed countries 18 Page 9

10 Two inquiry plans π SName π SName σ Country = LName σ Country = LName σ Party = CDU Plan AX Plan BX Cities σ Party = CDU Cities Countries Countries 19 Inquiry processing The inquiry processing takes place in 2 main steps: Generation of inquiry plans With the help of heuristic rules for the arrangement Various alternative query plans are generated for the basic operations. Evaluation of the inquiry plans based on a cost model and statistical information, the expected costs are calculated. Efficient implementation of the relational basic operations is necessary. The selection is implemented with the help of basic queries such as A point query (x 1, ..., xk) that specifies k attributes exactly or a range query ([u 1, o 1], ..., [uk, ok]) that specifies k ranges with uioi, 1 ik . 20 Page 10

11 Physical storage of data It is necessary to ensure persistence permanent storage of the data. Managing large amounts of data in the gigabyte range these requirements can only be met with the help of secondary storage devices, whereby magnetic disks are usually used. A magnetic disk is divided into pages (blocks) as the smallest transfer unit that is transferred between the main and secondary storage. Pages have the following properties: Direct access to a page with a given page number. Fixed size between 128 bytes and 16 Kbytes. 21 Index structures In order to handle the basic queries such as In order to be able to carry out area inquiries efficiently, the internal level of the database system uses suitable data structures and storage procedures (index structures). Index structures consist of file pages, which contain the data records to be saved, and of directory pages, which contain additional information to describe the access to the data. Directory ... file with data records Principle of an index structure 22 Page 11

12 Index structures (2) Tree-based index structures use the concept of search trees. A binary search tree is a binary tree that fulfills the following search tree properties for each node: all keys in the left subtree are smaller, all keys in the right subtree are larger than the key k in the given node. k k left subtree right subtree Principle of a binary search tree 23 Page 12