10160010@unknown@formal@none@1@S@
Cross-platform
@@@@1@1@@danf@17-8-2009 10160020@unknown@formal@none@1@S@'''Cross-platform''' (also known as '''multi-platform''') is a term used in computing to refer to [[computer program]]s, [[operating system]]s, [[computer language]]s, [[programming language]]s, or other [[computer software]] and their implementations which can be made to work on multiple [[computer platform]]s.@@@@1@39@@danf@17-8-2009 10160030@unknown@formal@none@1@S@“Cross-platform” and “multi-platform” both refer to the idea that a given piece of computer software is able to be run on more than one computer platform.@@@@1@26@@danf@17-8-2009 10160040@unknown@formal@none@1@S@There are two major types of cross-platform software; one requires building for each platform that it supports (e.g., is written in a compiled language, such as [[Pascal (programming language)|Pascal]]), and the other one can be directly run on any platform which supports it (e.g., software written in an [[interpreted language]] such as [[Perl]], [[Python (programming language)|Python]], or [[shell script]]) or software written in a language which compiles to [[bytecode]] and the bytecode is redistributed (such as is the case with [[Java (programming language)|Java]] and languages used in the [[.NET Framework]]) such as [[Chrome (programming language)|Chrome]].@@@@1@95@@danf@17-8-2009 10160050@unknown@formal@none@1@S@For example, a cross-platform [[application software|application]] may run on [[Microsoft Windows]] on the [[x86 architecture]], [[Linux]] on the [[x86 architecture]] and [[Mac OS X]] on either the [[PowerPC]] or [[x86]] based [[Apple Macintosh]] systems.@@@@1@34@@danf@17-8-2009 10160060@unknown@formal@none@1@S@A cross-platform [[application software|application]] may run on as many as all existing platforms, or on as few as two platforms.@@@@1@20@@danf@17-8-2009 10160070@unknown@formal@none@1@S@== Platforms ==@@@@1@3@@danf@17-8-2009 10160080@unknown@formal@none@1@S@A platform is a combination of hardware and software used to run software applications.@@@@1@14@@danf@17-8-2009 10160090@unknown@formal@none@1@S@A platform can be described simply as an operating system or computer architecture, or it could be the combination of both.@@@@1@21@@danf@17-8-2009 10160100@unknown@formal@none@1@S@Probably the most familiar platform is [[Microsoft Windows]] running on the [[x86 architecture]].@@@@1@13@@danf@17-8-2009 10160110@unknown@formal@none@1@S@Other well-known desktop computer platforms include [[Linux]] and [[Mac OS X]] (both of which are themselves cross-platform).@@@@1@17@@danf@17-8-2009 10160120@unknown@formal@none@1@S@There are, however, many devices such as [[cellular telephones]] that are also effectively computer platforms but less commonly thought about in that way.@@@@1@23@@danf@17-8-2009 10160130@unknown@formal@none@1@S@[[Application software]] can be written to depend on the features of a particular platform—either the hardware, operating system, or virtual machine it runs on.@@@@1@24@@danf@17-8-2009 10160140@unknown@formal@none@1@S@The [[Java Platform|Java platform]] is a [[virtual machine]] platform which runs on many operating systems and hardware types, and is a common platform for software to be written for.@@@@1@29@@danf@17-8-2009 10160150@unknown@formal@none@1@S@=== Hardware platforms ===@@@@1@4@@danf@17-8-2009 10160160@unknown@formal@none@1@S@A '''hardware platform''' can refer to a computer’s [[computer architecture|architecture]] or [[processor architecture]].@@@@1@13@@danf@17-8-2009 10160170@unknown@formal@none@1@S@For example, the [[x86]] and [[x86-64]] [[CPU]]s make up one of the most common [[computer architecture]]s in use in home machines today.@@@@1@22@@danf@17-8-2009 10160180@unknown@formal@none@1@S@These machines commonly run [[Microsoft Windows]], though they can run other [[operating system]]s as well, including [[Linux]], [[OpenBSD]], [[NetBSD]], [[Mac OS X]] and [[FreeBSD]].@@@@1@24@@danf@17-8-2009 10160190@unknown@formal@none@1@S@=== Software platforms ===@@@@1@4@@danf@17-8-2009 10160200@unknown@formal@none@1@S@Software platforms can either be an [[operating system]] or programming environment, though more commonly it is a combination of both.@@@@1@20@@danf@17-8-2009 10160210@unknown@formal@none@1@S@A notable exception to this is [[Java (programming language)|Java]], which uses an [[operating system]] independent [[virtual machine]] for its [[compiled]] code, known in the world of Java as [[bytecode]].@@@@1@29@@danf@17-8-2009 10160220@unknown@formal@none@1@S@Examples of software platforms include:@@@@1@5@@danf@17-8-2009 10160230@unknown@formal@none@1@S@* [[MS-DOS]] ([[x86]]), [[DR-DOS]] ([[x86]]), [[FreeDOS]] ([[x86]]) etc.@@@@1@8@@danf@17-8-2009 10160240@unknown@formal@none@1@S@* [[Microsoft Windows]] ([[x86]], [[x64]])@@@@1@5@@danf@17-8-2009 10160250@unknown@formal@none@1@S@* [[Linux]] (x86, x64, [[PowerPC]], various other architectures)@@@@1@8@@danf@17-8-2009 10160260@unknown@formal@none@1@S@* [[Mac OS X]] (PowerPC, x86)@@@@1@6@@danf@17-8-2009 10160270@unknown@formal@none@1@S@* [[OS/2]], [[eComStation]]@@@@1@3@@danf@17-8-2009 10160280@unknown@formal@none@1@S@* [[AmigaOS]] ([[m68k]]), [[AROS]] (x86, PowerPC, m68k), [[MorphOS]] (PowerPC)@@@@1@9@@danf@17-8-2009 10160290@unknown@formal@none@1@S@* [[Java (programming language)|Java]]@@@@1@4@@danf@17-8-2009 10160300@unknown@formal@none@1@S@==== Java platform ====@@@@1@4@@danf@17-8-2009 10160310@unknown@formal@none@1@S@As previously noted, the [[Java platform]] is an exception to the general rule that an [[operating system]] is a software platform.@@@@1@21@@danf@17-8-2009 10160320@unknown@formal@none@1@S@The Java language provides a [[virtual machine]], or a “virtual CPU” which runs all of the code that is written for the language.@@@@1@23@@danf@17-8-2009 10160330@unknown@formal@none@1@S@This enables the same [[executable]] [[binary file|binary]] to run on all systems which support the Java software, through the [[Java Virtual Machine]].@@@@1@22@@danf@17-8-2009 10160340@unknown@formal@none@1@S@Java [[executable]]s do not run directly on the [[operating system]]; that is, neither [[Microsoft Windows|Windows]] nor [[Linux]] execute Java programs directly.@@@@1@21@@danf@17-8-2009 10160350@unknown@formal@none@1@S@Because of this, however, Java is limited in that it does not directly support system-specific functionality.@@@@1@16@@danf@17-8-2009 10160360@unknown@formal@none@1@S@[[Java Native Interface|JNI]] can be used to access system specific functions, but then the code is likely no longer portable.@@@@1@20@@danf@17-8-2009 10160370@unknown@formal@none@1@S@Java programs can run on at least the [[Microsoft Windows]], [[Mac OS X]], [[Linux]], and [[Solaris Operating System|Solaris]] operating systems, and so the language is limited to functionality that exists on all these systems.@@@@1@34@@danf@17-8-2009 10160380@unknown@formal@none@1@S@This includes things such as [[computer networking]], [[Internet socket]]s, but not necessarily raw hardware [[input/output]].@@@@1@15@@danf@17-8-2009 10160390@unknown@formal@none@1@S@== Cross-platform software ==@@@@1@4@@danf@17-8-2009 10160400@unknown@formal@none@1@S@In order for software to be considered '''cross-platform''', it must be able to function on more than one [[computer architecture]] or [[operating system]].@@@@1@23@@danf@17-8-2009 10160410@unknown@formal@none@1@S@This can be a time-consuming task given that different [[operating system]]s have different [[application programming interface]]s or [[application programming interface|API]]s (for example, [[Linux]] uses a different [[application programming interface|API]] for [[application software]] than [[Microsoft Windows|Windows]] does).@@@@1@36@@danf@17-8-2009 10160420@unknown@formal@none@1@S@Just because a particular [[operating system]] may run on different [[computer architecture]]s, that does not mean that the software written for that operating system will automatically work on all [[computer architecture|architecture]]s that the operating system supports.@@@@1@36@@danf@17-8-2009 10160430@unknown@formal@none@1@S@One example as of August, 2006 was [[OpenOffice.org]], which did not natively run on the [[AMD64]] or [[EM64T]] lines of processors implementing the [[x86-64]] [[64-bit]] standards for computers; this has since been changed, and the OpenOffice.org suite of software is “mostly” ported to these 64-bit systems[http://wiki.services.openoffice.org/wiki/Porting_to_x86-64_(AMD64,_EM64T)].@@@@1@46@@danf@17-8-2009 10160440@unknown@formal@none@1@S@This also means that just because a program is written in a popular programming language such as [[C (programming language)|C]] or [[C++]], it does not mean it will run on all [[operating systems]] that support that [[programming language]].@@@@1@38@@danf@17-8-2009 10160450@unknown@formal@none@1@S@=== Web applications ===@@@@1@4@@danf@17-8-2009 10160460@unknown@formal@none@1@S@[[Web application]]s are typically described as cross-platform because, ideally, they are accessible from any of various [[web browser]]s within different operating systems.@@@@1@22@@danf@17-8-2009 10160470@unknown@formal@none@1@S@Such applications generally employ a [[client-server]] system architecture, and vary widely in complexity and functionality.@@@@1@15@@danf@17-8-2009 10160480@unknown@formal@none@1@S@This wide variability significantly complicates the goal of cross-platform capability, which is routinely at odds with the goal of advanced functionality.@@@@1@21@@danf@17-8-2009 10160490@unknown@formal@none@1@S@==== Basic applications ====@@@@1@4@@danf@17-8-2009 10160500@unknown@formal@none@1@S@Basic web applications perform all or most processing from a [[Stateless server|stateless]] [[web server]], and pass the result to the client web browser.@@@@1@23@@danf@17-8-2009 10160510@unknown@formal@none@1@S@All user interaction with the application consists of simple exchanges of data requests and server responses.@@@@1@16@@danf@17-8-2009 10160520@unknown@formal@none@1@S@These types of applications were the norm in the early phases of [[World Wide Web]] application development.@@@@1@17@@danf@17-8-2009 10160530@unknown@formal@none@1@S@Such applications follow a simple [[Transaction processing|transaction]] model, identical to that of serving [[static web page]]s.@@@@1@16@@danf@17-8-2009 10160540@unknown@formal@none@1@S@Today, they are still relatively common, especially where cross-platform compatibility and simplicity are deemed more critical than advanced functionality.@@@@1@19@@danf@17-8-2009 10160550@unknown@formal@none@1@S@==== Advanced applications ====@@@@1@4@@danf@17-8-2009 10160560@unknown@formal@none@1@S@Prominent examples of advanced web applications include the Web interface to [[Gmail]], [[A9.com]], and the maps.live.com section of [[Live Search]].@@@@1@20@@danf@17-8-2009 10160570@unknown@formal@none@1@S@Such advanced applications routinely depend on additional features found only in the more recent versions of popular web browsers.@@@@1@19@@danf@17-8-2009 10160580@unknown@formal@none@1@S@These dependencies include [[Ajax (programming)|Ajax]], [[JavaScript]], [[Dynamic HTML|“Dynamic” HTML]], [[SVG]], and other components of [[rich internet application]]s.@@@@1@17@@danf@17-8-2009 10160590@unknown@formal@none@1@S@Older versions of popular browsers tend to lack support for certain features.@@@@1@12@@danf@17-8-2009 10160600@unknown@formal@none@1@S@==== Design strategies ====@@@@1@4@@danf@17-8-2009 10160610@unknown@formal@none@1@S@Because of the competing interests of cross-platform compatibility and advanced functionality, numerous alternative web application design strategies have emerged.@@@@1@19@@danf@17-8-2009 10160620@unknown@formal@none@1@S@Such strategies include:@@@@1@3@@danf@17-8-2009 10160630@unknown@formal@none@1@S@=====Graceful degradation=====@@@@1@2@@danf@17-8-2009 10160640@unknown@formal@none@1@S@Graceful degradation attempts to provide the same or similar functionality to all users and platforms, while diminishing that functionality to a ‘least common denominator’ for more limited client browsers.@@@@1@29@@danf@17-8-2009 10160650@unknown@formal@none@1@S@For example, a user attempting to use a limited-feature browser to access Gmail may notice that Gmail switches to “Basic Mode,” with reduced functionality.@@@@1@24@@danf@17-8-2009 10160660@unknown@formal@none@1@S@Some view this strategy as a lesser form of cross-platform capability.@@@@1@11@@danf@17-8-2009 10160670@unknown@formal@none@1@S@=====Separation of functionality=====@@@@1@3@@danf@17-8-2009 10160680@unknown@formal@none@1@S@Separation of functionality attempts to simply omit those subsets of functionality that are not capable from within certain client browsers or operating systems, while still delivering a ‘complete’ application to the user. (see also [[Separation of concerns]]).@@@@1@37@@danf@17-8-2009 10160690@unknown@formal@none@1@S@=====Multiple codebase=====@@@@1@2@@danf@17-8-2009 10160700@unknown@formal@none@1@S@Multiple codebase applications present different versions of an application depending on the specific client in use.@@@@1@16@@danf@17-8-2009 10160710@unknown@formal@none@1@S@This strategy is arguably the most complicated and expensive way to fulfill cross-platform capability, since even different versions of the same client browser (within the same operating system) can differ dramatically between each other.@@@@1@34@@danf@17-8-2009 10160720@unknown@formal@none@1@S@This is further complicated by the support for “plugins” which may or may not be present for any given installation of a particular browser version.@@@@1@25@@danf@17-8-2009 10160730@unknown@formal@none@1@S@=====Third party libraries=====@@@@1@3@@danf@17-8-2009 10160740@unknown@formal@none@1@S@Third party libraries attempt to simplify cross-platform capability by ‘hiding’ the complexities of client differentiation behind a single, unified API.@@@@1@20@@danf@17-8-2009 10160750@unknown@formal@none@1@S@==== Testing strategies ====@@@@1@4@@danf@17-8-2009 10160760@unknown@formal@none@1@S@One complicated aspect of cross-platform web application design is the need for [[software testing]].@@@@1@14@@danf@17-8-2009 10160770@unknown@formal@none@1@S@In addition to the complications mentioned previously, there is the additional restriction that some browsers prohibit installation of different versions of the same browser on the same operating system.@@@@1@29@@danf@17-8-2009 10160780@unknown@formal@none@1@S@Techniques such as [[full virtualization]] are sometimes used as a workaround for this problem.@@@@1@14@@danf@17-8-2009 10160790@unknown@formal@none@1@S@=== Traditional applications ===@@@@1@4@@danf@17-8-2009 10160800@unknown@formal@none@1@S@Although web applications are becoming increasingly popular, many computer users still use traditional [[application software]] which does not rely on a client/web-server architecture.@@@@1@23@@danf@17-8-2009 10160810@unknown@formal@none@1@S@The distinction between “traditional” and “web” applications is not always unambiguous, however, because applications have many different features, installation methods and architectures; and some of these can overlap and occur in ways that blur the distinction.@@@@1@36@@danf@17-8-2009 10160820@unknown@formal@none@1@S@Nevertheless, this simplifying distinction is a common and useful generalization.@@@@1@10@@danf@17-8-2009 10160830@unknown@formal@none@1@S@==== Binary software ====@@@@1@4@@danf@17-8-2009 10160840@unknown@formal@none@1@S@Traditionally in modern computing, application software has been distributed to end-users as '''binary images''', which are stored in [[executable]]s, a specific type of [[binary file]].@@@@1@25@@danf@17-8-2009 10160850@unknown@formal@none@1@S@Such [[executable]]s only support the [[operating system]] and [[computer architecture]] that they were built for—which means that making a “cross-platform executable” would be something of a massive task, and is generally not done.@@@@1@33@@danf@17-8-2009 10160860@unknown@formal@none@1@S@For software that is distributed as a [[binary file|binary]] [[executable]], such as software written in [[C (programming language)|C]] or [[C++]], the programmer must [[software build|build the software]] for each different [[operating system]] and [[computer architecture]].@@@@1@35@@danf@17-8-2009 10160870@unknown@formal@none@1@S@For example, [[Mozilla]] [[Mozilla Firefox|Firefox]], an open-source web browser, is available on [[Microsoft Windows]], [[Mac OS X]] (both [[PowerPC]] and [[x86]] through something Apple calls a '''[[Universal binary]]'''), and [[Linux]] on multiple computer architectures.@@@@1@34@@danf@17-8-2009 10160880@unknown@formal@none@1@S@The three platforms (in this case, [[Microsoft Windows|Windows]], [[Mac OS X]], and [[Linux]]) are separate [[executable]] distributions, although they come from the same [[source code]].@@@@1@25@@danf@17-8-2009 10160890@unknown@formal@none@1@S@In the context of binary software, cross-platform programs are written in the source code and then “translated” to each system that it runs on through compiling it on different platforms.@@@@1@30@@danf@17-8-2009 10160900@unknown@formal@none@1@S@Also, software can be [[porting|ported]] to a new [[computer architecture]] or [[operating system]] so that the program becomes more cross-platform than it already is.@@@@1@24@@danf@17-8-2009 10160910@unknown@formal@none@1@S@For example, a program such as Firefox, which already runs on Windows on the x86 family, can be modified and re-built to run on Linux on the x86 (and potentially other architectures) as well.@@@@1@34@@danf@17-8-2009 10160920@unknown@formal@none@1@S@As an alternative to porting, cross-platform virtualization allows applications compiled for one CPU and operating system to run on a system with a different CPU and/or operating system, without modification to the source code or binaries.@@@@1@36@@danf@17-8-2009 10160930@unknown@formal@none@1@S@As an example, [[Apple Computer|Apple's]] [[Rosetta (software)|Rosetta]] software, which is built into [[Intel]]-based Apple Macintosh computers, runs applications compiled for the previous generation of Macs that used [[PowerPC]] CPUs.@@@@1@29@@danf@17-8-2009 10160940@unknown@formal@none@1@S@Another example is IBM PowerVM Lx86, which allows Linux/x86 applications to run unmodified on the Linux/Power operating system.@@@@1@18@@danf@17-8-2009 10160950@unknown@formal@none@1@S@==== Scripts and [[interpreted language]]s ====@@@@1@6@@danf@17-8-2009 10160960@unknown@formal@none@1@S@A script can be considered to be cross-platform if the [[scripting language]] is available on multiple platforms and the script only uses the facilities provided by the language.@@@@1@28@@danf@17-8-2009 10160970@unknown@formal@none@1@S@That is, a script written in [[Python (programming language)|Python]] for a [[Unix-like]] system will likely run with little or no modification on [[Microsoft Windows|Windows]], because Python also runs on [[Microsoft Windows|Windows]]; there is also more than one implementation of Python that will run the same scripts (e.g., [[IronPython]] for [[.NET Framework|.NET]]).@@@@1@51@@danf@17-8-2009 10160980@unknown@formal@none@1@S@The same goes for many of the [[open source]] [[programming language]]s that are available and are [[scripting language]]s.@@@@1@18@@danf@17-8-2009 10160990@unknown@formal@none@1@S@Unlike [[binary file|binary]] [[executable]]s, the same script can be used on all computers that have software to interpret the script.@@@@1@20@@danf@17-8-2009 10161000@unknown@formal@none@1@S@This is because the script is generally stored in [[plain text]] in a [[text file]].@@@@1@15@@danf@17-8-2009 10161010@unknown@formal@none@1@S@There may be some issues, however, such as the type of [[newline|new line character]] that sits between the lines.@@@@1@19@@danf@17-8-2009 10161020@unknown@formal@none@1@S@Generally, however, little or no work has to be done to make a script written for one system, run on another.@@@@1@21@@danf@17-8-2009 10161030@unknown@formal@none@1@S@Some quite popular cross-platform scripting or [[interpreted language]]s are:@@@@1@9@@danf@17-8-2009 10161040@unknown@formal@none@1@S@* [[bash]]—A [[Unix shell]] commonly run on [[Linux]] and other modern [[Unix-like]] systems, as well as on [[Microsoft Windows|Windows]] via the [[Cygwin]] [[POSIX]] compatibility layer.@@@@1@25@@danf@17-8-2009 10161050@unknown@formal@none@1@S@* [[Python (programming language)|Python]]—A modern [[scripting language]] where the focus is on [[rapid application development]] and ease-of-writing, instead of program run-time efficiency.@@@@1@22@@danf@17-8-2009 10161060@unknown@formal@none@1@S@* [[Perl]]—A scripting language first released in 1987.@@@@1@8@@danf@17-8-2009 10161070@unknown@formal@none@1@S@Used for [[Common Gateway Interface|CGI]] [[WWW]] programming, small [[system administration]] tasks, and more.@@@@1@13@@danf@17-8-2009 10161080@unknown@formal@none@1@S@* [[PHP]]—A [[scripting language]] most popular in use on the [[WWW]] for [[web application]]s.@@@@1@14@@danf@17-8-2009 10161090@unknown@formal@none@1@S@* [[Ruby (programming language)|Ruby]]—A scripting language who's purpose is to be object-oriented and easy to read.@@@@1@16@@danf@17-8-2009 10161100@unknown@formal@none@1@S@Can also be used on the web through [[Ruby on Rails]].@@@@1@11@@danf@17-8-2009 10161110@unknown@formal@none@1@S@* [[Tcl]] - A dynamic programming language, suitable for a wide range of uses, including web and desktop applications, networking, administration, testing and many more.@@@@1@25@@danf@17-8-2009 10161120@unknown@formal@none@1@S@==== Video games ====@@@@1@4@@danf@17-8-2009 10161130@unknown@formal@none@1@S@Cross-platform is a term that can also apply to [[video game]]s.@@@@1@11@@danf@17-8-2009 10161140@unknown@formal@none@1@S@Such games are released on a range of [[video game console]]s and [[handheld game console]]s, which are specialized [[computer]]s dedicated to the task of playing games (and thus, are a platform as any other computer).@@@@1@35@@danf@17-8-2009 10161150@unknown@formal@none@1@S@Examples of these games include:@@@@1@5@@danf@17-8-2009 10161160@unknown@formal@none@1@S@* [[Miner 2049er]], the first major multiplatform game@@@@1@8@@danf@17-8-2009 10161170@unknown@formal@none@1@S@* [[Phantasy Star Online]]@@@@1@4@@danf@17-8-2009 10161180@unknown@formal@none@1@S@* [[Lara Croft Tomb Raider: Legend]]@@@@1@6@@danf@17-8-2009 10161190@unknown@formal@none@1@S@* [[FIFA Series]]@@@@1@3@@danf@17-8-2009 10161200@unknown@formal@none@1@S@* [[Shadow of Legend]]@@@@1@4@@danf@17-8-2009 10161210@unknown@formal@none@1@S@… which are spread across a variety of platforms, such as the [[Nintendo GameCube]], [[PlayStation 2]], [[Xbox]], [[Personal computer|PC]], and [[mobile devices]].@@@@1@22@@danf@17-8-2009 10161220@unknown@formal@none@1@S@In some cases, depending on the hardware of a particular system it may take longer than expected to create a video game across multiple platforms.@@@@1@25@@danf@17-8-2009 10161230@unknown@formal@none@1@S@So, a video game may only get released on a few platforms and then later released on the remaining platforms.@@@@1@20@@danf@17-8-2009 10161240@unknown@formal@none@1@S@Typically, this is what occurs when a new system is released, because the [[Video game developer|developer]]s of the video game need to become acquainted with the hardware and software associated with the new console.@@@@1@34@@danf@17-8-2009 10161250@unknown@formal@none@1@S@Some games may not become cross-platform because of licensing agreements between the [[Video game developer|developer]]s and the maker of the [[video game console]] which state that the game will only be made for one particular console.@@@@1@36@@danf@17-8-2009 10161260@unknown@formal@none@1@S@As an example, [[Disney]] could create a new game and wish to release it on the latest [[Nintendo]] and [[Sony]] game consoles.@@@@1@22@@danf@17-8-2009 10161270@unknown@formal@none@1@S@If [[Disney]] licenses the game with [[Sony]] first, [[Disney]] may be required to only release the game on [[Sony|Sony’s]] console for a short time, or indefinitely—effectively prohibiting the game from cross-platform at least for a period of time.@@@@1@38@@danf@17-8-2009 10161280@unknown@formal@none@1@S@Several developers have developed ways to play games online while using different platforms.@@@@1@13@@danf@17-8-2009 10161290@unknown@formal@none@1@S@Epic Games, Microsoft and Valve Software all have this technology, that allows Xbox 360 gamers and PS3 gamers to play with PC gamers, allowing gamers to finally decide which platform is the best for a game.@@@@1@36@@danf@17-8-2009 10161300@unknown@formal@none@1@S@The first game released to allow this interactivity between PC and Console games was [[Quake 3]].@@@@1@16@@danf@17-8-2009 10161310@unknown@formal@none@1@S@Games that feature cross-platform online play include:@@@@1@7@@danf@17-8-2009 10161320@unknown@formal@none@1@S@*[[Champions Online]]@@@@1@2@@danf@17-8-2009 10161330@unknown@formal@none@1@S@*[[Lost Planet: Colonies]]@@@@1@3@@danf@17-8-2009 10161340@unknown@formal@none@1@S@*[[Phantasy Star Online]]@@@@1@3@@danf@17-8-2009 10161350@unknown@formal@none@1@S@*[[Shadowrun (2007 video game)|Shadowrun]]@@@@1@4@@danf@17-8-2009 10161360@unknown@formal@none@1@S@*[[UNO (Xbox Live Arcade)|UNO]]@@@@1@4@@danf@17-8-2009 10161370@unknown@formal@none@1@S@*[[Final Fantasy XI Online]]@@@@1@4@@danf@17-8-2009 10161380@unknown@formal@none@1@S@== Platform independent software ==@@@@1@5@@danf@17-8-2009 10161390@unknown@formal@none@1@S@Software that is platform independent does not rely on any special features of any single platform, or, if it does, handles those special features such that it can deal with multiple platforms.@@@@1@32@@danf@17-8-2009 10161400@unknown@formal@none@1@S@All [[algorithm]]s, such as the [[quicksort]] algorithm, are able to be implemented on different platforms.@@@@1@15@@danf@17-8-2009 10161410@unknown@formal@none@1@S@== Cross-platform programming ==@@@@1@4@@danf@17-8-2009 10161420@unknown@formal@none@1@S@Cross-platform programming is the practice of actively writing software that will work on more than one platform.@@@@1@17@@danf@17-8-2009 10161430@unknown@formal@none@1@S@=== Approaches to cross-platform programming ===@@@@1@6@@danf@17-8-2009 10161440@unknown@formal@none@1@S@There are different ways of approaching the problem of writing a cross-platform application program.@@@@1@14@@danf@17-8-2009 10161450@unknown@formal@none@1@S@One such approach is simply to create multiple versions of the same program in different ''source trees''—in other words, the [[Microsoft Windows|Windows]] version of a program might have one set of source code files and the [[Apple Macintosh|Macintosh]] version might have another, while a FOSS *nix system might have another.@@@@1@50@@danf@17-8-2009 10161460@unknown@formal@none@1@S@While this is a straightforward approach to the problem, it has the potential to be considerably more expensive in development cost, development time, or both, especially for the corporate entities.@@@@1@30@@danf@17-8-2009 10161470@unknown@formal@none@1@S@The idea behind this is to create more than two different programs that have the ability to behave similarly to each other.@@@@1@22@@danf@17-8-2009 10161480@unknown@formal@none@1@S@It is also possible that this means of developing a cross-platform application will result in more problems with bug tracking and fixing, because the two different ''source trees'' would have different programmers, and thus different defects in each version.@@@@1@39@@danf@17-8-2009 10161490@unknown@formal@none@1@S@The smaller the programming team, the quicker the bug fixes tend to be.@@@@1@13@@danf@17-8-2009 10161500@unknown@formal@none@1@S@Another approach that is used is to depend on pre-existing software that hides the differences between the platforms—called [[abstraction]] of the platform—such that the program itself is unaware of the platform it is running on.@@@@1@35@@danf@17-8-2009 10161510@unknown@formal@none@1@S@It could be said that such programs are ''platform agnostic''.@@@@1@10@@danf@17-8-2009 10161520@unknown@formal@none@1@S@Programs that run on the [[Java (Sun)|Java]] [[Virtual Machine]] ([[Java Virtual Machine|JVM]]) are built in this fashion.@@@@1@17@@danf@17-8-2009 10161530@unknown@formal@none@1@S@Some applications mix various methods of cross-platform programming to create the final application.@@@@1@13@@danf@17-8-2009 10161540@unknown@formal@none@1@S@An example of this is the [[Firefox]] [[web browser]], which uses [[abstraction]] to build some of the lower-level components, separate source subtrees for implementing platform specific features (like the GUI), and the implementation of more than one [[scripting language]] to help facilitate ease of portability.@@@@1@45@@danf@17-8-2009 10161550@unknown@formal@none@1@S@[[Firefox]] implements [[XUL]], [[Cascading Style Sheets|CSS]] and [[JavaScript]] for extending the browser, in addition to classic [[Netscape]]-style browser plugins.@@@@1@19@@danf@17-8-2009 10161560@unknown@formal@none@1@S@Much of the browser itself is written in XUL, CSS, and JavaScript, as well.@@@@1@14@@danf@17-8-2009 10161570@unknown@formal@none@1@S@=== Cross-platform programming toolkits ===@@@@1@5@@danf@17-8-2009 10161580@unknown@formal@none@1@S@There are a number of tools which are available to help facilitate the process of cross-platform programming:@@@@1@17@@danf@17-8-2009 10161590@unknown@formal@none@1@S@* [[Simple DirectMedia Layer]]—An [[open source]] cross-platform multimedia library written in C that creates an abstraction over various platforms’ graphics, sound, and input [[Application programming interface|API]]s.@@@@1@26@@danf@17-8-2009 10161600@unknown@formal@none@1@S@It runs on many operating systems including Linux, Windows and Mac OS X and is aimed at games and multimedia applications.@@@@1@21@@danf@17-8-2009 10161610@unknown@formal@none@1@S@* [[Cairo (graphics)|Cairo]]−A [[free software]] library used to provide a vector graphics-based, device-independent API.@@@@1@14@@danf@17-8-2009 10161620@unknown@formal@none@1@S@It is designed to provide primitives for 2-dimensional drawing across a number of different backends.@@@@1@15@@danf@17-8-2009 10161630@unknown@formal@none@1@S@Cairo is written in C and has bindings for many programming languages.@@@@1@12@@danf@17-8-2009 10161640@unknown@formal@none@1@S@* ''ParaGUI''—ParaGUI is a cross-platform high-level application framework and GUI library.@@@@1@11@@danf@17-8-2009 10161650@unknown@formal@none@1@S@It can be compiled on various platforms(Linux, Win32, BeOS, Mac OS, ...).@@@@1@12@@danf@17-8-2009 10161660@unknown@formal@none@1@S@ParaGUI is based on the Simple DirectMedia Layer (SDL).@@@@1@9@@danf@17-8-2009 10161670@unknown@formal@none@1@S@ParaGUI is targeted on crossplatform multimedia applications and embedded devices operating on framebuffer displays.@@@@1@14@@danf@17-8-2009 10161680@unknown@formal@none@1@S@* [[wxWidgets]]—An open source widget toolkit that is also an [[application framework]].@@@@1@12@@danf@17-8-2009 10161690@unknown@formal@none@1@S@It runs on [[Unix-like]] systems with [[X11]], Microsoft Windows and Mac OS X. It permits applications written to use it to run on all of the systems that it supports, if the application does not use any [[operating system]]-specific programming in addition to it.@@@@1@44@@danf@17-8-2009 10161700@unknown@formal@none@1@S@* [[Qt (toolkit)|Qt]]—An application framework and [[widget toolkit]] for [[Unix-like]] systems with [[X11]], Microsoft Windows, Mac OS X, and other systems—available under both [[open source]] and commercial licenses.@@@@1@28@@danf@17-8-2009 10161710@unknown@formal@none@1@S@* [[GTK+]]—An open source widget toolkit for Unix-like systems with X11 and Microsoft Windows.@@@@1@14@@danf@17-8-2009 10161720@unknown@formal@none@1@S@* [[FLTK]]—Another open source cross platform toolkit, but more light weight because it restricts itself to the GUI.@@@@1@18@@danf@17-8-2009 10161730@unknown@formal@none@1@S@* [[Mozilla application framework|Mozilla]]—An open source platform for building Mac, Windows and Linux applications.@@@@1@14@@danf@17-8-2009 10161740@unknown@formal@none@1@S@* [[Mono (software)|Mono]] (and more specifically, [[Microsoft .NET]])—A cross-platform framework for applications and programming languages.@@@@1@15@@danf@17-8-2009 10161750@unknown@formal@none@1@S@* ''molib''—A robust commercial application toolkit library that abstracts the system calls through C++ objects (such as the file system, database system and thread implementation.).@@@@1@25@@danf@17-8-2009 10161760@unknown@formal@none@1@S@This allows for the creation of applications that compile and run under Microsoft Windows, Mac OS X, GNU/Linux, and other uses (Sun OS, AIX, HP-UX, 32/64 bit, SMP).@@@@1@28@@danf@17-8-2009 10161770@unknown@formal@none@1@S@Use in concert with ''the sandbox'' to create GUI-based applications.@@@@1@10@@danf@17-8-2009 10161780@unknown@formal@none@1@S@* [[fpGUI]] - An open source widget toolkit that is completely implemented in Object Pascal.@@@@1@15@@danf@17-8-2009 10161790@unknown@formal@none@1@S@It currently supports Linux, Windows and a bit of Windows CE.@@@@1@11@@danf@17-8-2009 10161795@unknown@formal@none@1@S@fpGUI does not rely on any large libraries, instead it talks directly to Xlib (Linux) or GDI (Windows).@@@@1@18@@danf@17-8-2009 10161800@unknown@formal@none@1@S@The framework is compiled with the Free Pascal compiler.@@@@1@9@@danf@17-8-2009 10161810@unknown@formal@none@1@S@Mac OS support is also in the works.@@@@1@8@@danf@17-8-2009 10161820@unknown@formal@none@1@S@* [[Tcl/Tk]] - Tcl (Tool Command Language) is a dynamic programming language, suitable for a wide range of uses, including web and desktop applications, networking, administration, testing and many more.@@@@1@30@@danf@17-8-2009 10161830@unknown@formal@none@1@S@Open source and business-friendly, Tcl is a mature yet evolving language that is truly cross platform, easily deployed and highly extensible.@@@@1@21@@danf@17-8-2009 10161840@unknown@formal@none@1@S@Tk is a graphical user interface toolkit that takes developing desktop applications to a higher level than conventional approaches.@@@@1@19@@danf@17-8-2009 10161850@unknown@formal@none@1@S@Tk is the standard GUI not only for Tcl, but for many other dynamic languages, and can produce rich, native applications that run unchanged across Windows, Mac OS X, Linux and more.@@@@1@32@@danf@17-8-2009 10161860@unknown@formal@none@1@S@The combination of Tcl and the Tk GUI toolkit is referred to as Tcl/Tk.@@@@1@14@@danf@17-8-2009 10161870@unknown@formal@none@1@S@* [[XVT]] is a cross-platform toolkit for creating enterprise and desktop applications in C/C++ on Windows, Linux and Unix (Solaris, HPUX, AIX), and Mac.@@@@1@24@@danf@17-8-2009 10161880@unknown@formal@none@1@S@Most recent release is 5.8, in April 2007@@@@1@8@@danf@17-8-2009 10161890@unknown@formal@none@1@S@=== Cross-platform development environments ===@@@@1@5@@danf@17-8-2009 10161900@unknown@formal@none@1@S@Cross-platform applications can also be built using proprietary [[Integrated development environment|IDE]]s, or so-called [[Rapid Application Development]] tools.@@@@1@17@@danf@17-8-2009 10161910@unknown@formal@none@1@S@There are a number of development environments which allow developers to build and deploy applications across multiple platforms:@@@@1@18@@danf@17-8-2009 10161920@unknown@formal@none@1@S@* [[Eclipse (software)| Eclipse]]—An Open source [[software framework]] and [[Integrated development environment|IDE]] extendable through plug-ins including the C++ Development Toolkit.@@@@1@20@@danf@17-8-2009 10161930@unknown@formal@none@1@S@Eclipse is available on any operating system with a modern Java virtual machine (including Windows, Linux, and Mac OS X, Sun, HP-UX, and other systems).@@@@1@25@@danf@17-8-2009 10161940@unknown@formal@none@1@S@* [[IntelliJ IDEA]]—A proprietary [[Integrated development environment|IDE]]@@@@1@7@@danf@17-8-2009 10161950@unknown@formal@none@1@S@* [[NetBeans]]—An Open source [[software framework]] and [[Integrated development environment|IDE]] extendable through plug-ins.@@@@1@13@@danf@17-8-2009 10161960@unknown@formal@none@1@S@NetBeans is available on any operating system with a modern Java virtual machine (including Windows, Linux, and Mac OS X, Sun, HP-UX, and other systems).@@@@1@25@@danf@17-8-2009 10161970@unknown@formal@none@1@S@Similar to Eclipse in features and functionality.@@@@1@7@@danf@17-8-2009 10161980@unknown@formal@none@1@S@Promoted by [[Sun Microsystems]]@@@@1@4@@danf@17-8-2009 10161990@unknown@formal@none@1@S@* [[Omnis Studio]]—A proprietary [[Integrated development environment|IDE]] or Rapid Application Development tool for creating enterprise and web applications for Windows, Linux, and Mac OS X.@@@@1@25@@danf@17-8-2009 10162000@unknown@formal@none@1@S@* [[Runtime Revolution]]—a proprietary [[Integrated development environment|IDE]], compiler engine and CGI builder that [[cross compile]]s to [[Microsoft Windows|Windows]], [[Mac OS X]] ([[PowerPC|PPC]], [[Intel]]), [[Linux]], [[Solaris Operating System|Solaris]], [[BSD]], and [[Irix]].@@@@1@30@@danf@17-8-2009 10162010@unknown@formal@none@1@S@*[[Code::Blocks]]—A free/open source, cross platform IDE.@@@@1@6@@danf@17-8-2009 10162020@unknown@formal@none@1@S@It is developed in C++ using wxWidgets.@@@@1@7@@danf@17-8-2009 10162030@unknown@formal@none@1@S@Using a plugin architecture, its capabilities and features are defined by the provided plugins.@@@@1@14@@danf@17-8-2009 10162040@unknown@formal@none@1@S@*[[Lazarus (software)]]—Lazarus is a cross platform Visual IDE developed for and supported by the open source Free Pascal compiler.@@@@1@19@@danf@17-8-2009 10162050@unknown@formal@none@1@S@It aims to provide a Rapid Application Development Delphi Clone for Pascal and Object Pascal developers.@@@@1@16@@danf@17-8-2009 10162060@unknown@formal@none@1@S@*[[REALbasic]]—REALbasic (RB) is an object-oriented dialect of the BASIC programming language developed and commercially marketed by REAL Software, Inc in Austin, Texas for Mac OS X, Microsoft Windows, and Linux.@@@@1@30@@danf@17-8-2009 10162070@unknown@formal@none@1@S@== Criticisms of cross-platform development ==@@@@1@6@@danf@17-8-2009 10162080@unknown@formal@none@1@S@There are certain issues associated with cross-platform development.@@@@1@8@@danf@17-8-2009 10162090@unknown@formal@none@1@S@Some of these include:@@@@1@4@@danf@17-8-2009 10162100@unknown@formal@none@1@S@* Testing cross-platform applications may also be considerably more complicated, since different platforms can exhibit slightly different behaviors or subtle bugs.@@@@1@21@@danf@17-8-2009 10162110@unknown@formal@none@1@S@This problem has led some developers to deride cross-platform development as “Write Once, Debug Everywhere”, a take on Sun’s [[Write once, run anywhere|“Write Once, Run Anywhere”]] marketing slogan.@@@@1@28@@danf@17-8-2009 10162120@unknown@formal@none@1@S@* Developers are often restricted to using the [[lowest common denominator]] subset of features which are available on all platforms.@@@@1@20@@danf@17-8-2009 10162130@unknown@formal@none@1@S@This may hinder the application's performance or prohibit developers from using platforms’ most advanced features.@@@@1@15@@danf@17-8-2009 10162140@unknown@formal@none@1@S@* Different platforms often have different user interface conventions, which cross-platform applications do not always accommodate.@@@@1@16@@danf@17-8-2009 10162150@unknown@formal@none@1@S@For example, applications developed for Mac OS X and [[GNOME]] are supposed to place the most important button on the right-hand side of windows and dialogs, whereas Microsoft Windows and [[KDE]] have the opposite convention.@@@@1@35@@danf@17-8-2009 10162160@unknown@formal@none@1@S@Though many of these differences are subtle, a cross-platform application which does not conform appropriately to these conventions may feel clunky or alien to the user.@@@@1@26@@danf@17-8-2009 10162170@unknown@formal@none@1@S@When working quickly, such opposing conventions may even result in [[data loss]], such as in a [[dialog box]] confirming whether the user wants to save or discard changes to a file.@@@@1@31@@danf@17-8-2009 10162180@unknown@formal@none@1@S@* Scripting languages and virtual machines must be translated into native executable code each time the application is executed, imposing a performance penalty.@@@@1@23@@danf@17-8-2009 10162190@unknown@formal@none@1@S@This performance hit can be alleviated using advanced techniques like [[just-in-time compilation]]; but even using such techniques, some performance overhead may be unavoidable.@@@@1@23@@danf@17-8-2009 10170010@unknown@formal@none@1@S@
Data
@@@@1@1@@danf@17-8-2009 10170020@unknown@formal@none@1@S@'''Data''' (singular: '''datum''') are collected of natural phenomena descriptors including the results of [[experience]], [[observation]] or [[experiment]], or a set of [[premise]]s.@@@@1@22@@danf@17-8-2009 10170030@unknown@formal@none@1@S@This may consist of [[number]]s, [[word]]s, or [[image]]s, particularly as [[measurement]]s or observations of a set of [[variable]]s.@@@@1@18@@danf@17-8-2009 10170040@unknown@formal@none@1@S@==Etymology==@@@@1@1@@danf@17-8-2009 10170050@unknown@formal@none@1@S@The word ''data ''is the plural of [[Latin]] ''[[datum]]'', [[Grammatical gender|neuter]] past [[participle]] of ''dare'', "to give", hence "something given".@@@@1@20@@danf@17-8-2009 10170060@unknown@formal@none@1@S@The [[past participle]] of "to give" has been used for millennia, in the sense of a statement accepted at face value; one of the works of [[Euclid]], circa 300 BC, was the ''Dedomena'' (in Latin, ''Data'').@@@@1@36@@danf@17-8-2009 10170070@unknown@formal@none@1@S@In discussions of problems in [[geometry]], [[mathematics]], [[engineering]], and so on, the terms ''givens'' and ''data'' are used interchangeably.@@@@1@19@@danf@17-8-2009 10170080@unknown@formal@none@1@S@Such usage is the origin of ''data'' as a concept in [[computer science]]:'' ''data'' ''are numbers, words, images, etc., accepted as they stand.@@@@1@23@@danf@17-8-2009 10170090@unknown@formal@none@1@S@Pronounced dey-tuh, dat-uh, or dah-tuh.''@@@@1@5@@danf@17-8-2009 10170100@unknown@formal@none@1@S@[[Experimental data]] are data generated within the context of a scientific investigation.@@@@1@12@@danf@17-8-2009 10170110@unknown@formal@none@1@S@Mathematically, data can be grouped in many ways.@@@@1@8@@danf@17-8-2009 10170120@unknown@formal@none@1@S@==Usage in English==@@@@1@3@@danf@17-8-2009 10170130@unknown@formal@none@1@S@In [[English language|English]], the word ''datum'' is still used in the general sense of "something given", and more specifically in [[cartography]], [[geography]], [[geology]], [[NMR]] and [[technical drawing|drafting]] to mean a reference point, reference line, or reference surface.@@@@1@37@@danf@17-8-2009 10170140@unknown@formal@none@1@S@More generally speaking, any measurement or result can be called a (single) ''datum'', but ''data point'' is more common.@@@@1@19@@danf@17-8-2009 10170150@unknown@formal@none@1@S@Both ''datums'' (see usage in [[datum]] article) and the originally Latin plural ''data'' are used as the plural of ''datum'' in English, but ''data'' is more commonly treated as a [[mass noun]] and used in the [[Grammatical number|singular]], especially in day-to-day usage.@@@@1@42@@danf@17-8-2009 10170160@unknown@formal@none@1@S@For example, "This is all the data from the experiment".@@@@1@10@@danf@17-8-2009 10170170@unknown@formal@none@1@S@This usage is inconsistent with the rules of Latin grammar and traditional English, which would instead suggest "These are all the data from the experiment".@@@@1@25@@danf@17-8-2009 10170180@unknown@formal@none@1@S@Some British and UN academic, scientific, and professional [[style guides]] (e.g., see page 43 of the [http://whqlibdoc.who.int/hq/2004/WHO_IMD_PUB_04.1.pdf World Health Organization Style Guide]) request that authors treat ''data'' as a plural noun.@@@@1@31@@danf@17-8-2009 10170190@unknown@formal@none@1@S@Other international organization, such as the IEEE computing society , allow its usage as either a mass noun or plural based on author preference.@@@@1@24@@danf@17-8-2009 10170200@unknown@formal@none@1@S@It is now usually treated as a singular mass noun in informal usage, but usage in scientific publications shows a strong UK/U.S divide.@@@@1@23@@danf@17-8-2009 10170210@unknown@formal@none@1@S@U.S. usage tends to treat ''data'' in the singular, including in serious and academic publishing, although some major newspapers (such as the [[New York Times]]) regularly use it in the plural.@@@@1@31@@danf@17-8-2009 10170220@unknown@formal@none@1@S@"The plural usage is still common, as this headline from the New York Times attests: “Data Are Elusive on the Homeless.”@@@@1@21@@danf@17-8-2009 10170230@unknown@formal@none@1@S@Sometimes scientists think of data as plural, as in ''These data do not support the conclusions.''@@@@1@16@@danf@17-8-2009 10170240@unknown@formal@none@1@S@But more often scientists and researchers think of data as a singular mass entity like information, and most people now follow this in general usage.@@@@1@25@@danf@17-8-2009 10170245@unknown@formal@none@1@S@"[http://www.bartleby.com/61/51/D0035100.html] UK usage now widely accepts treating ''data'' as singular in standard English, including everyday newspaper usage at least in non-scientific use.@@@@1@22@@danf@17-8-2009 10170250@unknown@formal@none@1@S@UK scientific publishing usually still prefers treating it as a plural..@@@@1@11@@danf@17-8-2009 10170260@unknown@formal@none@1@S@Some UK university style guides recommend using ''data'' for both singular and plural use and some recommend treating it only as a singular in connection with computers.@@@@1@27@@danf@17-8-2009 10170270@unknown@formal@none@1@S@==Uses of ''data'' in science and computing==@@@@1@7@@danf@17-8-2009 10170280@unknown@formal@none@1@S@''Raw data'' are [[number]]s, [[character (computing)|characters]], [[image]]s or other outputs from devices to convert physical quantities into symbols, in a very broad sense.@@@@1@23@@danf@17-8-2009 10170290@unknown@formal@none@1@S@Such data are typically further [[data processing|processed]] by a human or [[input]] into a [[computer]], [[Computer storage|stored]] and processed there, or transmitted ([[output]]) to another human or computer.@@@@1@28@@danf@17-8-2009 10170300@unknown@formal@none@1@S@''Raw data'' is a relative term; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next.@@@@1@28@@danf@17-8-2009 10170310@unknown@formal@none@1@S@Mechanical computing devices are classified according to the means by which they represent data.@@@@1@14@@danf@17-8-2009 10170320@unknown@formal@none@1@S@An [[analog computer]] represents a datum as a voltage, distance, position, or other physical quantity.@@@@1@15@@danf@17-8-2009 10170330@unknown@formal@none@1@S@A [[digital computer]] represents a datum as a sequence of symbols drawn from a fixed [[alphabet]].@@@@1@16@@danf@17-8-2009 10170340@unknown@formal@none@1@S@The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1".@@@@1@21@@danf@17-8-2009 10170350@unknown@formal@none@1@S@More familiar representations, such as numbers or letters, are then constructed from the binary alphabet.@@@@1@15@@danf@17-8-2009 10170360@unknown@formal@none@1@S@Some special forms of data are distinguished.@@@@1@7@@danf@17-8-2009 10170370@unknown@formal@none@1@S@A [[computer program]] is a collection of data, which can be interpreted as instructions.@@@@1@14@@danf@17-8-2009 10170380@unknown@formal@none@1@S@Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably [[Lisp programming language|Lisp]] and similar languages, programs are essentially indistinguishable from other data.@@@@1@34@@danf@17-8-2009 10170390@unknown@formal@none@1@S@It is also useful to distinguish [[metadata]], that is, a description of other data.@@@@1@14@@danf@17-8-2009 10170400@unknown@formal@none@1@S@A similar yet earlier term for metadata is "ancillary data."@@@@1@10@@danf@17-8-2009 10170410@unknown@formal@none@1@S@The prototypical example of metadata is the library catalog, which is a description of the contents of books.@@@@1@18@@danf@17-8-2009 10170420@unknown@formal@none@1@S@==Meaning of data, information and knowledge==@@@@1@6@@danf@17-8-2009 10170430@unknown@formal@none@1@S@The terms [[information]] and [[knowledge]] are frequently used for overlapping concepts.@@@@1@11@@danf@17-8-2009 10170440@unknown@formal@none@1@S@The main difference is in the level of [[abstraction]] being considered.@@@@1@11@@danf@17-8-2009 10170450@unknown@formal@none@1@S@Data is the lowest level of abstraction, information is the next level, and finally, knowledge is the highest level among all three.@@@@1@22@@danf@17-8-2009 10170460@unknown@formal@none@1@S@For example, the height of Mt. Everest is generally considered as "data", a book on Mt. Everest geological characteristics may be considered as "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge".@@@@1@44@@danf@17-8-2009 10170470@unknown@formal@none@1@S@Information as a concept bears a diversity of meanings, from everyday usage to technical settings.@@@@1@15@@danf@17-8-2009 10170480@unknown@formal@none@1@S@Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation.@@@@1@26@@danf@17-8-2009 10170490@unknown@formal@none@1@S@Beynon-Davies uses the concept of a [[sign]] to distinguish between [[data]] and [[information]].@@@@1@13@@danf@17-8-2009 10170500@unknown@formal@none@1@S@Data are symbols.@@@@1@3@@danf@17-8-2009 10170510@unknown@formal@none@1@S@Information occurs when symbols are used to refer to something.@@@@1@10@@danf@17-8-2009 10180010@unknown@formal@none@1@S@
Data analysis
@@@@1@2@@danf@17-8-2009 10180020@unknown@formal@none@1@S@'''Data analysis''' is the process of looking at and summarizing '''[[data]]''' with the intent to extract useful [[information]] and develop conclusions.@@@@1@21@@danf@17-8-2009 10180030@unknown@formal@none@1@S@Data analysis is closely related to [[data mining]], but data mining tends to focus on larger data sets, with less emphasis on making [[inference]], and often uses data that was originally collected for a different purpose.@@@@1@36@@danf@17-8-2009 10180040@unknown@formal@none@1@S@In [[statistics|statistical applications]], some people divide data analysis into [[descriptive statistics]], [[exploratory data analysis]] and [[confirmatory data analysis]], where the EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses.@@@@1@37@@danf@17-8-2009 10180050@unknown@formal@none@1@S@Data analysis assumes different aspects, and possibly different names, in different fields.@@@@1@12@@danf@17-8-2009 10180060@unknown@formal@none@1@S@The term ''data analysis'' is also used as a synonym for [[data modeling]], which is unrelated to the subject of this article.@@@@1@22@@danf@17-8-2009 10180070@unknown@formal@none@1@S@==Nuclear and particle physics==@@@@1@4@@danf@17-8-2009 10180080@unknown@formal@none@1@S@In [[nuclear physics|nuclear]] and [[particle physics]] the data usually originate from the [[particle detector|experimental apparatus]] via a [[data acquisition]] system.@@@@1@20@@danf@17-8-2009 10180090@unknown@formal@none@1@S@It is then processed, in a step usually called ''data reduction'', to apply calibrations and to extract physically significant information.@@@@1@20@@danf@17-8-2009 10180100@unknown@formal@none@1@S@Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad-hoc.@@@@1@21@@danf@17-8-2009 10180110@unknown@formal@none@1@S@The resulting data ''n-tuples'' are then scrutinized by the physicists, using specialized software tools like [[ROOT]] or [[Physics Analysis Workstation|PAW]], comparing the results of the experiment with theory.@@@@1@28@@danf@17-8-2009 10180120@unknown@formal@none@1@S@The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for [[Monte Carlo method|Monte Carlo simulation]] software like [[Geant4]] that predict the response of the detector to a given theoretical event, producing '''simulated events''' which are then compared to experimental data.@@@@1@53@@danf@17-8-2009 10180130@unknown@formal@none@1@S@See also: [[Computational physics]].@@@@1@4@@danf@17-8-2009 10180140@unknown@formal@none@1@S@==Social sciences==@@@@1@2@@danf@17-8-2009 10180150@unknown@formal@none@1@S@[[Qualitative data analysis]] (QDA) or [[qualitative research]] is the analysis of non-numerical data, for example words, photographs, observations, etc..@@@@1@19@@danf@17-8-2009 10180160@unknown@formal@none@1@S@==Information technology==@@@@1@2@@danf@17-8-2009 10180170@unknown@formal@none@1@S@A special case is the [[Data analysis (information technology in othm )|data analysis in information technology audits]].@@@@1@17@@danf@17-8-2009 10180180@unknown@formal@none@1@S@==Business==@@@@1@1@@danf@17-8-2009 10180190@unknown@formal@none@1@S@See@@@@1@1@@danf@17-8-2009 10180200@unknown@formal@none@1@S@* [[Analytics]]@@@@1@2@@danf@17-8-2009 10180210@unknown@formal@none@1@S@* [[Business intelligence]]@@@@1@3@@danf@17-8-2009 10180220@unknown@formal@none@1@S@* [[Data mining]]@@@@1@3@@danf@17-8-2009 10190010@unknown@formal@none@1@S@
Database
@@@@1@1@@danf@17-8-2009 10190020@unknown@formal@none@1@S@A '''database''' is a [[structure]]d collection of records or [[data]].@@@@1@10@@danf@17-8-2009 10190030@unknown@formal@none@1@S@A [[computer]] database relies upon [[software]] to organize the storage of data.@@@@1@12@@danf@17-8-2009 10190040@unknown@formal@none@1@S@The software models the database structure in what are known as [[database model]]s.@@@@1@13@@danf@17-8-2009 10190050@unknown@formal@none@1@S@The model in most common use today is the [[relational model]].@@@@1@11@@danf@17-8-2009 10190060@unknown@formal@none@1@S@Other models such as the [[hierarchical model]] and the [[network model]] use a more explicit representation of relationships (see below for explanation of the various database models).@@@@1@27@@danf@17-8-2009 10190070@unknown@formal@none@1@S@Database management systems (DBMS) are the software used to organize and maintain the database.@@@@1@14@@danf@17-8-2009 10190080@unknown@formal@none@1@S@These are categorized according to the [[database model]] that they support.@@@@1@11@@danf@17-8-2009 10190090@unknown@formal@none@1@S@The model tends to determine the query languages that are available to access the database.@@@@1@15@@danf@17-8-2009 10190100@unknown@formal@none@1@S@A great deal of the internal engineering of a DBMS, however, is independent of the data model, and is concerned with managing factors such as performance, concurrency, integrity, and recovery from [[hardware failure]]s.@@@@1@33@@danf@17-8-2009 10190110@unknown@formal@none@1@S@In these areas there are large differences between products.@@@@1@9@@danf@17-8-2009 10190120@unknown@formal@none@1@S@==History==@@@@1@1@@danf@17-8-2009 10190130@unknown@formal@none@1@S@The earliest known use of the term '''''data base''''' was in November 1963, when the [[System Development Corporation]] sponsored a symposium under the title ''Development and Management of a Computer-centered Data Base''.@@@@1@32@@danf@17-8-2009 10190140@unknown@formal@none@1@S@'''Database''' as a single word became common in Europe in the early 1970s and by the end of the decade it was being used in major American newspapers.@@@@1@28@@danf@17-8-2009 10190150@unknown@formal@none@1@S@(The abbreviation DB, however, survives.)@@@@1@5@@danf@17-8-2009 10190160@unknown@formal@none@1@S@The first database management systems were developed in the 1960s.@@@@1@10@@danf@17-8-2009 10190170@unknown@formal@none@1@S@A pioneer in the field was [[Charles Bachman]].@@@@1@8@@danf@17-8-2009 10190180@unknown@formal@none@1@S@Bachman's early papers show that his aim was to make more effective use of the new direct access storage devices becoming available: until then, data processing had been based on [[punch card|punched cards]] and [[magnetic tape]], so that serial processing was the dominant activity.@@@@1@44@@danf@17-8-2009 10190190@unknown@formal@none@1@S@Two key [[data model]]s arose at this time: [[CODASYL]] developed the [[network model]] based on Bachman's ideas, and (apparently independently) the [[hierarchical model]] was used in a system developed by [[North American Rockwell]] later adopted by [[IBM]] as the cornerstone of their [[Information Management System|IMS]] product.@@@@1@46@@danf@17-8-2009 10190200@unknown@formal@none@1@S@While IMS along with the CODASYL [[IDMS]] were the big, high visibility databases developed in the 1960s, several others were also born in that decade, some of which have a significant installed base today.@@@@1@34@@danf@17-8-2009 10190210@unknown@formal@none@1@S@Two worthy of mention are the [[Pick operating system|PICK]] and [[MUMPS]] databases, with the former developed originally as an operating system with an embedded database and the latter as a programming language and database for the development of healthcare systems.@@@@1@40@@danf@17-8-2009 10190220@unknown@formal@none@1@S@The [[relational model]] was proposed by [[Edgar F. Codd|E. F. Codd]] in 1970.@@@@1@13@@danf@17-8-2009 10190230@unknown@formal@none@1@S@He criticized existing models for confusing the abstract description of information structure with descriptions of physical access mechanisms.@@@@1@18@@danf@17-8-2009 10190240@unknown@formal@none@1@S@For a long while, however, the relational model remained of academic interest only.@@@@1@13@@danf@17-8-2009 10190250@unknown@formal@none@1@S@While CODASYL products (IDMS) and network model products (IMS) were conceived as practical engineering solutions taking account of the technology as it existed at the time, the relational model took a much more theoretical perspective, arguing (correctly) that hardware and software technology would catch up in time.@@@@1@47@@danf@17-8-2009 10190260@unknown@formal@none@1@S@Among the first implementations were [[Michael Stonebraker]]'s [[Ingres (database)|Ingres]] at [[University of California, Berkeley|Berkeley]], and the [[System R]] project at IBM.@@@@1@21@@danf@17-8-2009 10190270@unknown@formal@none@1@S@Both of these were research prototypes, announced during 1976.@@@@1@9@@danf@17-8-2009 10190280@unknown@formal@none@1@S@The first commercial products, [[Oracle database|Oracle]] and [[IBM DB2|DB2]], did not appear until around 1980.@@@@1@15@@danf@17-8-2009 10190290@unknown@formal@none@1@S@The first successful database product for microcomputers was [[dBASE]] for the [[CP/M]] and [[PC-DOS]]/[[MS-DOS]] operating systems.@@@@1@16@@danf@17-8-2009 10190300@unknown@formal@none@1@S@During the 1980s, research activity focused on [[distributed database]] systems and [[database machine]]s.@@@@1@13@@danf@17-8-2009 10190310@unknown@formal@none@1@S@Another important theoretical idea was the [[Functional Data Model]], but apart from some specialized applications in genetics, molecular biology, and fraud investigation, the world took little notice.@@@@1@27@@danf@17-8-2009 10190320@unknown@formal@none@1@S@In the 1990s, attention shifted to [[OODB|object-oriented databases]].@@@@1@8@@danf@17-8-2009 10190330@unknown@formal@none@1@S@These had some success in fields where it was necessary to handle more complex data than relational systems could easily cope with, such as [[spatial database]]s, engineering data (including software [[Software repository|repositories]]), and multimedia data.@@@@1@35@@danf@17-8-2009 10190340@unknown@formal@none@1@S@Some of these ideas were adopted by the relational vendors, who integrated new features into their products as a result.@@@@1@20@@danf@17-8-2009 10190350@unknown@formal@none@1@S@The 1990s also saw the spread of [[Open Source]] databases, such as [[PostgreSQL]] and [[MySQL]].@@@@1@15@@danf@17-8-2009 10190360@unknown@formal@none@1@S@In the 2000s, the fashionable area for innovation is the [[XML database]].@@@@1@12@@danf@17-8-2009 10190370@unknown@formal@none@1@S@As with object databases, this has spawned a new collection of start-up companies, but at the same time the key ideas are being integrated into the established relational products.@@@@1@29@@danf@17-8-2009 10190380@unknown@formal@none@1@S@[[XML databases]] aim to remove the traditional divide between documents and data, allowing all of an organization's information resources to be held in one place, whether they are highly structured or not.@@@@1@32@@danf@17-8-2009 10190390@unknown@formal@none@1@S@==Database models==@@@@1@2@@danf@17-8-2009 10190400@unknown@formal@none@1@S@Various techniques are used to model data structure.@@@@1@8@@danf@17-8-2009 10190410@unknown@formal@none@1@S@Most database systems are built around one particular data model, although it is increasingly common for products to offer support for more than one model.@@@@1@25@@danf@17-8-2009 10190420@unknown@formal@none@1@S@For any one [[logical model]] various physical implementations may be possible, and most products will offer the user some level of control in tuning the [[physical implementation]], since the choices that are made have a significant effect on performance.@@@@1@39@@danf@17-8-2009 10190430@unknown@formal@none@1@S@Here are three examples:@@@@1@4@@danf@17-8-2009 10190440@unknown@formal@none@1@S@===Hierarchical model===@@@@1@2@@danf@17-8-2009 10190450@unknown@formal@none@1@S@In a [[hierarchical model]], data is organized into an inverted tree-like structure, implying a multiple downward link in each node to describe the nesting, and a sort field to keep the records in a particular order in each same-level list.@@@@1@40@@danf@17-8-2009 10190460@unknown@formal@none@1@S@This structure arranges the various data elements in a hierarchy and helps to establish logical relationships among data elements of multiple files.@@@@1@22@@danf@17-8-2009 10190470@unknown@formal@none@1@S@Each unit in the model is a record which is also known as a node.@@@@1@15@@danf@17-8-2009 10190480@unknown@formal@none@1@S@In such a model, each record on one level can be related to multiple records on the next lower level.@@@@1@20@@danf@17-8-2009 10190490@unknown@formal@none@1@S@A record that has subsidiary records is called a parent and the subsidiary records are called children.@@@@1@17@@danf@17-8-2009 10190500@unknown@formal@none@1@S@Data elements in this model are well suited for one-to-many relationships with other data elements in the database.@@@@1@18@@danf@17-8-2009 10190510@unknown@formal@none@1@S@This model is advantageous when the data elements are inherently hierarchical.@@@@1@11@@danf@17-8-2009 10190520@unknown@formal@none@1@S@The disadvantage is that in order to prepare the database it becomes necessary to identify the requisite groups of files that are to be logically integrated.@@@@1@26@@danf@17-8-2009 10190530@unknown@formal@none@1@S@Hence, a hierarchical data model may not always be flexible enough to accommodate the dynamic needs of an organization.@@@@1@19@@danf@17-8-2009 10190540@unknown@formal@none@1@S@===Network model===@@@@1@2@@danf@17-8-2009 10190550@unknown@formal@none@1@S@The [[network model]] tends to store records with links to other records.@@@@1@12@@danf@17-8-2009 10190560@unknown@formal@none@1@S@Each record in the database can have multiple parents, i.e., the relationships among data elements can have a many to many relationship.@@@@1@22@@danf@17-8-2009 10190570@unknown@formal@none@1@S@Associations are tracked via "pointers".@@@@1@5@@danf@17-8-2009 10190580@unknown@formal@none@1@S@These pointers can be node numbers or disk addresses.@@@@1@9@@danf@17-8-2009 10190590@unknown@formal@none@1@S@Most network databases tend to also include some form of hierarchical model.@@@@1@12@@danf@17-8-2009 10190600@unknown@formal@none@1@S@Databases can be translated from hierarchical model to network and vice versa.@@@@1@12@@danf@17-8-2009 10190610@unknown@formal@none@1@S@The main difference between the network model and hierarchical model is that in a network model, a child can have a number of parents whereas in a hierarchical model, a child can have only one parent.@@@@1@36@@danf@17-8-2009 10190620@unknown@formal@none@1@S@The network model provides greater advantage than the hierarchical model in that promotes greater flexibility and data accessibility, since records at a lower level can be accessed without accessing the records above them.@@@@1@33@@danf@17-8-2009 10190630@unknown@formal@none@1@S@This model is more efficient than hierarchical model, easier to understand and can be applied to many real world problems that require routine transactions.@@@@1@24@@danf@17-8-2009 10190640@unknown@formal@none@1@S@The disadvantages are that: It is a complex process to design and develop a network database; It has to be refined frequently; It requires that the relationships among all the records be defined before development starts, and changes often demand major programming efforts; Operation and maintenance of the network model is expensive and time consuming.@@@@1@55@@danf@17-8-2009 10190650@unknown@formal@none@1@S@Examples of database engines that have network model capabilities are [[RDM Embedded]] and [[RDM Server]].@@@@1@15@@danf@17-8-2009 10190660@unknown@formal@none@1@S@===Relational model===@@@@1@2@@danf@17-8-2009 10190670@unknown@formal@none@1@S@The basic data structure of the relational model is a table where information about a particular entity (say, an employee) is represented in columns and rows.@@@@1@26@@danf@17-8-2009 10190680@unknown@formal@none@1@S@The columns enumerate the various attributes of an entity (e.g. employee_name, address, phone_number).@@@@1@13@@danf@17-8-2009 10190690@unknown@formal@none@1@S@Rows (also called records) represent instances of an entity (e.g. specific employees).@@@@1@12@@danf@17-8-2009 10190700@unknown@formal@none@1@S@The "relation" in "relational database" comes from the mathematical notion of [[Relation (mathematics)|relations]] from the field of [[set theory]].@@@@1@19@@danf@17-8-2009 10190710@unknown@formal@none@1@S@A relation is a set of [[tuple]]s, so rows are sometimes called tuples.@@@@1@13@@danf@17-8-2009 10190720@unknown@formal@none@1@S@All tables in a relational database adhere to three basic rules.@@@@1@11@@danf@17-8-2009 10190730@unknown@formal@none@1@S@* The ordering of columns is immaterial@@@@1@7@@danf@17-8-2009 10190740@unknown@formal@none@1@S@* Identical rows are not allowed in a table@@@@1@9@@danf@17-8-2009 10190750@unknown@formal@none@1@S@* Each row has a single (separate) value for each of its columns (each tuple has an atomic value).@@@@1@19@@danf@17-8-2009 10190760@unknown@formal@none@1@S@If the same value occurs in two different records (from the same table or different tables) it can imply a relationship between those records.@@@@1@24@@danf@17-8-2009 10190770@unknown@formal@none@1@S@Relationships between records are often categorized by their [[Cardinality (data modeling)|cardinality]] (1:1, (0), 1:M, M:M).@@@@1@15@@danf@17-8-2009 10190780@unknown@formal@none@1@S@Tables can have a designated column or set of columns that act as a "key" to select rows from that table with the same or similar key values.@@@@1@28@@danf@17-8-2009 10190790@unknown@formal@none@1@S@A "primary key" is a key that has a unique value for each row in the table.@@@@1@17@@danf@17-8-2009 10190800@unknown@formal@none@1@S@Keys are commonly used to join or combine data from two or more tables.@@@@1@14@@danf@17-8-2009 10190810@unknown@formal@none@1@S@For example, an ''employee'' table may contain a column named ''address'' which contains a value that matches the key of an ''address'' table.@@@@1@23@@danf@17-8-2009 10190820@unknown@formal@none@1@S@Keys are also critical in the creation of indexes, which facilitate fast retrieval of data from large tables.@@@@1@18@@danf@17-8-2009 10190830@unknown@formal@none@1@S@It is not necessary to define all the keys in advance; a column can be used as a key even if it was not originally intended to be one.@@@@1@29@@danf@17-8-2009 10190840@unknown@formal@none@1@S@====Relational operations====@@@@1@2@@danf@17-8-2009 10190850@unknown@formal@none@1@S@Users (or programs) request data from a relational database by sending it a [[query]] that is written in a special language, usually a dialect of [[SQL]].@@@@1@26@@danf@17-8-2009 10190860@unknown@formal@none@1@S@Although SQL was originally intended for end-users, it is much more common for SQL queries to be embedded into software that provides an easier user interface.@@@@1@26@@danf@17-8-2009 10190870@unknown@formal@none@1@S@Many web applications, such as [[Wikipedia]], perform SQL queries when generating pages.@@@@1@12@@danf@17-8-2009 10190880@unknown@formal@none@1@S@In response to a query, the database returns a result set, which is the list of rows constituting the answer.@@@@1@20@@danf@17-8-2009 10190890@unknown@formal@none@1@S@The simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer wanted.@@@@1@29@@danf@17-8-2009 10190900@unknown@formal@none@1@S@Often, data from multiple tables are combined into one, by doing a [[Join (SQL)|join]].@@@@1@14@@danf@17-8-2009 10190910@unknown@formal@none@1@S@There are a number of relational operations in addition to join.@@@@1@11@@danf@17-8-2009 10190920@unknown@formal@none@1@S@====Normal forms====@@@@1@2@@danf@17-8-2009 10190930@unknown@formal@none@1@S@Relations are classified based upon the types of anomalies to which they're vulnerable.@@@@1@13@@danf@17-8-2009 10190940@unknown@formal@none@1@S@A database that's in the first normal form is vulnerable to all types of anomalies, while a database that's in the domain/key normal form has no modification anomalies.@@@@1@28@@danf@17-8-2009 10190950@unknown@formal@none@1@S@Normal forms are hierarchical in nature.@@@@1@6@@danf@17-8-2009 10190960@unknown@formal@none@1@S@That is, the lowest level is the first normal form, and the database cannot meet the requirements for higher level normal forms without first having met all the requirements of the lesser normal form.@@@@1@34@@danf@17-8-2009 10190970@unknown@formal@none@1@S@==Database Management Systems==@@@@1@3@@danf@17-8-2009 10190980@unknown@formal@none@1@S@===Relational database management systems===@@@@1@4@@danf@17-8-2009 10190990@unknown@formal@none@1@S@An RDBMS implements the features of the relational model outlined above.@@@@1@11@@danf@17-8-2009 10191000@unknown@formal@none@1@S@In this context, [[Christopher J. Date|Date]]'s '''Information Principle''' states:@@@@1@9@@danf@17-8-2009 10191010@unknown@formal@none@1@S@
The entire information content of the database is represented in one and only one way.@@@@1@16@@danf@17-8-2009 10191020@unknown@formal@none@1@S@Namely as explicit values in column positions (attributes) and rows in relations ([[tuple]]s) Therefore, there are no explicit pointers between related tables.
@@@@1@22@@danf@17-8-2009 10191030@unknown@formal@none@1@S@===Post-relational database models===@@@@1@3@@danf@17-8-2009 10191040@unknown@formal@none@1@S@Several products have been identified as [[post-relational]] because the data model incorporates [[relations]] but is not constrained by the Information Principle, requiring that all information is represented by [[data values]] in relations.@@@@1@32@@danf@17-8-2009 10191050@unknown@formal@none@1@S@Products using a post-relational data model typically employ a model that actually pre-dates the [[relational model]].@@@@1@16@@danf@17-8-2009 10191060@unknown@formal@none@1@S@These might be identified as a [[directed graph]] with [[tree data structure|trees]] on the [[data structure|nodes]].@@@@1@16@@danf@17-8-2009 10191070@unknown@formal@none@1@S@Examples of models that could be classified as post-relational are [[Pick operating system|PICK]] aka [[Multidimensional database|MultiValue]], and [[MUMPS]].@@@@1@18@@danf@17-8-2009 10191080@unknown@formal@none@1@S@===Object database models===@@@@1@3@@danf@17-8-2009 10191090@unknown@formal@none@1@S@In recent years, the [[object-oriented]] paradigm has been applied to database technology, creating a new programming model known as [[object database]]s.@@@@1@21@@danf@17-8-2009 10191100@unknown@formal@none@1@S@These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same [[type system]] as the application program.@@@@1@31@@danf@17-8-2009 10191110@unknown@formal@none@1@S@This aims to avoid the overhead (sometimes referred to as the ''[[Object-Relational impedance mismatch|impedance mismatch]]'') of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects).@@@@1@40@@danf@17-8-2009 10191120@unknown@formal@none@1@S@At the same time, object databases attempt to introduce the key ideas of object programming, such as [[encapsulation]] and [[polymorphism (computer science)|polymorphism]], into the world of databases.@@@@1@27@@danf@17-8-2009 10191130@unknown@formal@none@1@S@A variety of these ways have been tried for storing objects in a database.@@@@1@14@@danf@17-8-2009 10191140@unknown@formal@none@1@S@Some products have approached the problem from the application programming end, by making the objects manipulated by the program [[Persistence (computer science)|persistent]].@@@@1@22@@danf@17-8-2009 10191150@unknown@formal@none@1@S@This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content.@@@@1@29@@danf@17-8-2009 10191160@unknown@formal@none@1@S@Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.@@@@1@35@@danf@17-8-2009 10191170@unknown@formal@none@1@S@==DBMS internals==@@@@1@2@@danf@17-8-2009 10191180@unknown@formal@none@1@S@===Storage and physical database design===@@@@1@5@@danf@17-8-2009 10191190@unknown@formal@none@1@S@Database tables/indexes are typically stored in memory or on hard disk in one of many forms, ordered/unordered [[flat file database|flat files]], [[ISAM]], [[heap (data structure)|heaps]], [[hash table|hash buckets]] or [[B+ tree]]s.@@@@1@31@@danf@17-8-2009 10191200@unknown@formal@none@1@S@These have various advantages and disadvantages discussed further in the main article on this topic.@@@@1@15@@danf@17-8-2009 10191210@unknown@formal@none@1@S@The most commonly used are B+ trees and ISAM.@@@@1@9@@danf@17-8-2009 10191220@unknown@formal@none@1@S@Other important design choices relate to the clustering of data by category (such as grouping data by month, or location), creating pre-computed views known as materialized views, partitioning data by range or hash.@@@@1@33@@danf@17-8-2009 10191230@unknown@formal@none@1@S@As well memory management and storage topology can be important design choices for database designers.@@@@1@15@@danf@17-8-2009 10191240@unknown@formal@none@1@S@Just as normalization is used to reduce storage requirements and improve the extensibility of the database, conversely denormalization is often used to reduce join complexity and reduce execution time for queries.@@@@1@31@@danf@17-8-2009 10191250@unknown@formal@none@1@S@====Indexing====@@@@1@1@@danf@17-8-2009 10191260@unknown@formal@none@1@S@All of these databases can take advantage of [[Index (database)|indexing]] to increase their speed.@@@@1@14@@danf@17-8-2009 10191270@unknown@formal@none@1@S@This technology has advanced tremendously since its early uses in the 1960s and 1970s.@@@@1@14@@danf@17-8-2009 10191280@unknown@formal@none@1@S@The most common kind of index is a sorted list of the contents of some particular table column, with pointers to the row associated with the value.@@@@1@27@@danf@17-8-2009 10191290@unknown@formal@none@1@S@An index allows a set of table rows matching some criterion to be located quickly.@@@@1@15@@danf@17-8-2009 10191300@unknown@formal@none@1@S@Typically, indexes are also stored in the various forms of data-structure mentioned above (such as [[B-tree]]s, [[hash table|hash]]es, and [[linked lists]]).@@@@1@21@@danf@17-8-2009 10191310@unknown@formal@none@1@S@Usually, a specific technique is chosen by the database designer to increase efficiency in the particular case of the type of index required.@@@@1@23@@danf@17-8-2009 10191320@unknown@formal@none@1@S@Relational DBMS's have the advantage that indexes can be created or dropped without changing existing applications making use of it.@@@@1@20@@danf@17-8-2009 10191330@unknown@formal@none@1@S@The database chooses between many different strategies based on which one it estimates will run the fastest.@@@@1@17@@danf@17-8-2009 10191340@unknown@formal@none@1@S@In other words, indexes are transparent to the application or end-user querying the database; while they affect performance, any SQL command will run with or without index to compute the result of an [[SQL]] statement.@@@@1@35@@danf@17-8-2009 10191350@unknown@formal@none@1@S@The RDBMS will produce a plan of how to execute the query, which is generated by analyzing the run times of the different algorithms and selecting the quickest.@@@@1@28@@danf@17-8-2009 10191360@unknown@formal@none@1@S@Some of the key algorithms that deal with [[join (SQL)|joins]] are [[nested loop join]], [[sort-merge join]] and [[hash join]].@@@@1@19@@danf@17-8-2009 10191370@unknown@formal@none@1@S@Which of these is chosen depends on whether an index exists, what type it is, and its [[Cardinality (SQL statements)|cardinality]].@@@@1@20@@danf@17-8-2009 10191380@unknown@formal@none@1@S@An index speeds up access to data, but it has disadvantages as well.@@@@1@13@@danf@17-8-2009 10191390@unknown@formal@none@1@S@First, every index increases the amount of storage on the hard drive necessary for the database file, and second, the index must be updated each time the data are altered, and this costs time.@@@@1@34@@danf@17-8-2009 10191400@unknown@formal@none@1@S@(Thus an index saves time in the reading of data, but it costs time in entering and altering data.@@@@1@19@@danf@17-8-2009 10191410@unknown@formal@none@1@S@It thus depends on the use to which the data are to be put whether an index is on the whole a net plus or minus in the quest for efficiency.)@@@@1@31@@danf@17-8-2009 10191420@unknown@formal@none@1@S@A special case of an index is a primary index, or primary key, which is distinguished in that the primary index must ensure a unique reference to a record.@@@@1@29@@danf@17-8-2009 10191430@unknown@formal@none@1@S@Often, for this purpose one simply uses a running index number (ID number).@@@@1@13@@danf@17-8-2009 10191440@unknown@formal@none@1@S@Primary indexes play a significant role in relational databases, and they can speed up access to data considerably.@@@@1@18@@danf@17-8-2009 10191450@unknown@formal@none@1@S@===Transactions and concurrency===@@@@1@3@@danf@17-8-2009 10191460@unknown@formal@none@1@S@In addition to their data model, most practical databases ("transactional databases") attempt to enforce a [[database transaction]] .@@@@1@18@@danf@17-8-2009 10191470@unknown@formal@none@1@S@Ideally, the database software should enforce the [[ACID]] rules, summarized here:@@@@1@11@@danf@17-8-2009 10191480@unknown@formal@none@1@S@* [[Atomicity]]: Either all the tasks in a transaction must be done, or none of them.@@@@1@16@@danf@17-8-2009 10191490@unknown@formal@none@1@S@The transaction must be completed, or else it must be undone (rolled back).@@@@1@13@@danf@17-8-2009 10191500@unknown@formal@none@1@S@* [[Database consistency|Consistency]]: Every transaction must preserve the integrity constraints — the declared consistency rules — of the database.@@@@1@19@@danf@17-8-2009 10191510@unknown@formal@none@1@S@It cannot place the data in a contradictory state.@@@@1@9@@danf@17-8-2009 10191520@unknown@formal@none@1@S@* [[Isolation]]: Two simultaneous transactions cannot interfere with one another.@@@@1@10@@danf@17-8-2009 10191530@unknown@formal@none@1@S@Intermediate results within a transaction are not visible to other transactions.@@@@1@11@@danf@17-8-2009 10191540@unknown@formal@none@1@S@* [[Durability (computer science)|Durability]]: Completed transactions cannot be aborted later or their results discarded.@@@@1@14@@danf@17-8-2009 10191550@unknown@formal@none@1@S@They must persist through (for instance) restarts of the DBMS after crashes@@@@1@12@@danf@17-8-2009 10191560@unknown@formal@none@1@S@In practice, many DBMS's allow most of these rules to be selectively relaxed for better performance.@@@@1@16@@danf@17-8-2009 10191570@unknown@formal@none@1@S@[[Concurrency control]] is a method used to ensure that transactions are executed in a safe manner and follow the ACID rules.@@@@1@21@@danf@17-8-2009 10191580@unknown@formal@none@1@S@The DBMS must be able to ensure that only [[serializability|serializable]], [[serializability#correctness - recoverability|recoverable]] schedules are allowed, and that no actions of committed transactions are lost while undoing aborted transactions .@@@@1@30@@danf@17-8-2009 10191590@unknown@formal@none@1@S@===Replication===@@@@1@1@@danf@17-8-2009 10191600@unknown@formal@none@1@S@Replication of databases is closely related to transactions.@@@@1@8@@danf@17-8-2009 10191610@unknown@formal@none@1@S@If a database can log its individual actions, it is possible to create a duplicate of the data in real time.@@@@1@21@@danf@17-8-2009 10191620@unknown@formal@none@1@S@The duplicate can be used to improve performance or availability of the whole database system.@@@@1@15@@danf@17-8-2009 10191630@unknown@formal@none@1@S@Common replication concepts include:@@@@1@4@@danf@17-8-2009 10191640@unknown@formal@none@1@S@* Master/Slave Replication: All write requests are performed on the master and then replicated to the slaves@@@@1@17@@danf@17-8-2009 10191650@unknown@formal@none@1@S@* Quorum: The result of Read and Write requests are calculated by querying a "majority" of replicas.@@@@1@17@@danf@17-8-2009 10191660@unknown@formal@none@1@S@* Multimaster: Two or more replicas sync each other via a transaction identifier.@@@@1@13@@danf@17-8-2009 10191670@unknown@formal@none@1@S@Parallel synchronous replication of databases enables transactions to be replicated on multiple servers simultaneously, which provides a method for backup and security as well as data availability.@@@@1@27@@danf@17-8-2009 10191680@unknown@formal@none@1@S@===Security===@@@@1@1@@danf@17-8-2009 10191690@unknown@formal@none@1@S@[[Database security]] denotes the system, processes, and procedures that protect a database from unintended activity.@@@@1@15@@danf@17-8-2009 10191700@unknown@formal@none@1@S@Security is usually enforced through '''access control''', '''auditing''', and '''encryption'''.@@@@1@10@@danf@17-8-2009 10191710@unknown@formal@none@1@S@* Access control ensures and restricts who can connect and what can be done to the database.@@@@1@17@@danf@17-8-2009 10191720@unknown@formal@none@1@S@* Auditing logs what action or change has been performed, when and by who.@@@@1@14@@danf@17-8-2009 10191730@unknown@formal@none@1@S@* Encryption: Since security has become a major issue in recent years, many commercial database vendors provide built-in encryption mechanism.@@@@1@20@@danf@17-8-2009 10191740@unknown@formal@none@1@S@Data is encoded natively into the tables and deciphered "on the fly" when a query comes in.@@@@1@17@@danf@17-8-2009 10191745@unknown@formal@none@1@S@Connections can also be secured and encrypted if required using DSA, MD5, SSL or legacy encryption standard.@@@@1@17@@danf@17-8-2009 10191750@unknown@formal@none@1@S@Enforcing security is one of the major tasks of the DBA.@@@@1@11@@danf@17-8-2009 10191760@unknown@formal@none@1@S@In the United Kingdom, legislation protecting the public from unauthorized disclosure of personal information held on databases falls under the Office of the Information Commissioner.@@@@1@25@@danf@17-8-2009 10191770@unknown@formal@none@1@S@United Kingdom based organizations holding personal data in electronic format (databases for example) are required to register with the Data Commissioner.@@@@1@21@@danf@17-8-2009 10191780@unknown@formal@none@1@S@===Locking===@@@@1@1@@danf@17-8-2009 10191790@unknown@formal@none@1@S@[[Lock (computer science)|Locking]] is how the database handle multiple concurent operations.@@@@1@11@@danf@17-8-2009 10191800@unknown@formal@none@1@S@This is the way how concurency and some form of basic intergrity is managed within the database system.@@@@1@18@@danf@17-8-2009 10191810@unknown@formal@none@1@S@Such locks can be applied on a row level, or on other levels like page (a basic data block), extend (multiple array of pages) or even an entire table.@@@@1@29@@danf@17-8-2009 10191820@unknown@formal@none@1@S@This helps maintain the integrity of the data by ensuring that only one process at a time can modify the '''same''' data.@@@@1@22@@danf@17-8-2009 10191830@unknown@formal@none@1@S@Unlike a basic filesystem files or folders, where only one lock at the time can be set, restricting the usage to one process only.@@@@1@24@@danf@17-8-2009 10191840@unknown@formal@none@1@S@A database can set and hold mutiples locks at the same time on the different level of the physical data structure.@@@@1@21@@danf@17-8-2009 10191850@unknown@formal@none@1@S@How locks are set, last is determined by the database engine locking scheme based on the submitted SQL or transactions by the users.@@@@1@23@@danf@17-8-2009 10191860@unknown@formal@none@1@S@Generaly speaking no activity on the database should be translated by no or very light locking.@@@@1@16@@danf@17-8-2009 10191870@unknown@formal@none@1@S@For most DBMS systems existing on the market, locks are generaly '''shared''' or '''exclusive'''.@@@@1@14@@danf@17-8-2009 10191880@unknown@formal@none@1@S@Exclusive locks mean that no other lock can acquire the current data object as long as the exclusive lock lasts.@@@@1@20@@danf@17-8-2009 10191890@unknown@formal@none@1@S@Exclusive locks are usually set while the database needs to change data, like during an UPDATE or DELETE operation.@@@@1@19@@danf@17-8-2009 10191900@unknown@formal@none@1@S@Shared locks can take ownership one from the other of the current data structure.@@@@1@14@@danf@17-8-2009 10191910@unknown@formal@none@1@S@Shared locks are usually used while the database is reading data, during a SELECT operation.@@@@1@15@@danf@17-8-2009 10191920@unknown@formal@none@1@S@The number, nature of locks and time the lock holds a data block can have a huge impact on the database performances.@@@@1@22@@danf@17-8-2009 10191930@unknown@formal@none@1@S@Bad locking can lead to desastrous performance response (usually the result of poor SQL requests, or inadequate database physical structure)@@@@1@20@@danf@17-8-2009 10191940@unknown@formal@none@1@S@Default locking behavior is enforced by the '''isolation level''' of the dataserver.@@@@1@12@@danf@17-8-2009 10191950@unknown@formal@none@1@S@Changing the isolation level will affect how shared or exclusive locks must be set on the data for the entire database system.@@@@1@22@@danf@17-8-2009 10191960@unknown@formal@none@1@S@Default isolation is generaly 1, where data can not be read while it is modfied, forbiding to return "ghost data" to end user.@@@@1@23@@danf@17-8-2009 10191970@unknown@formal@none@1@S@At some point intensive or inappropriate exclusive locking, can lead to the "dead lock" situation between two locks.@@@@1@18@@danf@17-8-2009 10191980@unknown@formal@none@1@S@Where none of the locks can be released because they try to acquire ressources mutually from each other.@@@@1@18@@danf@17-8-2009 10191990@unknown@formal@none@1@S@The Database has a fail safe mecanism and will automaticly "sacrifice" one of the locks releasing the ressource.@@@@1@18@@danf@17-8-2009 10192000@unknown@formal@none@1@S@Doing so processes or transactions involved in the "dead lock" will be rolled back.@@@@1@14@@danf@17-8-2009 10192010@unknown@formal@none@1@S@Databases can also be locked for other reasons, like access restrictions for given levels of user.@@@@1@16@@danf@17-8-2009 10192020@unknown@formal@none@1@S@Databases are also locked for routine database maintenance, which prevents changes being made during the maintenance.@@@@1@16@@danf@17-8-2009 10192030@unknown@formal@none@1@S@See [http://publib.boulder.ibm.com/infocenter/rbhelp/v6r3/index.jsp?topic=/com.ibm.redbrick.doc6.3/wag/wag80.htm IBM] for more detail.)@@@@1@6@@danf@17-8-2009 10192040@unknown@formal@none@1@S@===Architecture===@@@@1@1@@danf@17-8-2009 10192050@unknown@formal@none@1@S@Depending on the intended use, there are a number of database architectures in use.@@@@1@14@@danf@17-8-2009 10192060@unknown@formal@none@1@S@Many databases use a combination of strategies.@@@@1@7@@danf@17-8-2009 10192070@unknown@formal@none@1@S@On-line Transaction Processing systems (OLTP) often use a row-oriented datastore architecture, while data-warehouse and other retrieval-focused applications like [[Google]]'s [[BigTable]], or bibliographic database(library catalogue) systems may use a column-oriented datastore architecture.@@@@1@31@@danf@17-8-2009 10192080@unknown@formal@none@1@S@Document-Oriented, XML, Knowledgebases, as well as frame databases and rdf-stores (aka Triple-Stores), may also use a combination of these architectures in their implementation.@@@@1@23@@danf@17-8-2009 10192090@unknown@formal@none@1@S@Finally it should be noted that not all database have or need a database 'schema' (so called schema-less databases).@@@@1@19@@danf@17-8-2009 10192100@unknown@formal@none@1@S@==Applications of databases==@@@@1@3@@danf@17-8-2009 10192110@unknown@formal@none@1@S@Databases are used in many applications, spanning virtually the entire range of [[computer software]].@@@@1@14@@danf@17-8-2009 10192120@unknown@formal@none@1@S@Databases are the preferred method of storage for large multiuser applications, where coordination between many users is needed.@@@@1@18@@danf@17-8-2009 10192130@unknown@formal@none@1@S@Even individual users find them convenient, and many electronic mail programs and personal organizers are based on standard database technology.@@@@1@20@@danf@17-8-2009 10192140@unknown@formal@none@1@S@Software database drivers are available for most database platforms so that [[application software]] can use a common [[Application Programming Interface]] to retrieve the information stored in a database.@@@@1@28@@danf@17-8-2009 10192150@unknown@formal@none@1@S@Two commonly used database APIs are [[Java Database Connectivity|JDBC]] and [[ODBC]].@@@@1@11@@danf@17-8-2009 10192160@unknown@formal@none@1@S@For example suppliers database contains the data relating to suppliers such as;@@@@1@12@@danf@17-8-2009 10192170@unknown@formal@none@1@S@*supplier name@@@@1@2@@danf@17-8-2009 10192180@unknown@formal@none@1@S@*supplier code@@@@1@2@@danf@17-8-2009 10192190@unknown@formal@none@1@S@*supplier address@@@@1@2@@danf@17-8-2009 10192200@unknown@formal@none@1@S@It is often used by schools to teach students and grade them.@@@@1@12@@danf@17-8-2009 10192210@unknown@formal@none@1@S@==Links to DBMS products==@@@@1@4@@danf@17-8-2009 10192220@unknown@formal@none@1@S@*[[4th Dimension (Software)|4D]]@@@@1@3@@danf@17-8-2009 10192230@unknown@formal@none@1@S@*[[ADABAS]]@@@@1@1@@danf@17-8-2009 10192240@unknown@formal@none@1@S@*[[Alpha Five]]@@@@1@2@@danf@17-8-2009 10192250@unknown@formal@none@1@S@*[[Apache Derby]] (Java, also known as IBM Cloudscape and Sun Java DB)@@@@1@12@@danf@17-8-2009 10192260@unknown@formal@none@1@S@*[[BerkeleyDB]]@@@@1@1@@danf@17-8-2009 10192270@unknown@formal@none@1@S@*[[CouchDB]]@@@@1@1@@danf@17-8-2009 10192280@unknown@formal@none@1@S@*[[CSQL]]@@@@1@1@@danf@17-8-2009 10192290@unknown@formal@none@1@S@*[[Datawasp]]@@@@1@1@@danf@17-8-2009 10192300@unknown@formal@none@1@S@*[[Db4objects]]@@@@1@1@@danf@17-8-2009 10192310@unknown@formal@none@1@S@*[[dBase]]@@@@1@1@@danf@17-8-2009 10192320@unknown@formal@none@1@S@*[[FileMaker]]@@@@1@1@@danf@17-8-2009 10192330@unknown@formal@none@1@S@*[[Firebird (database server)]]@@@@1@3@@danf@17-8-2009 10192340@unknown@formal@none@1@S@*[[H2 (DBMS)|H2]] (Java)@@@@1@3@@danf@17-8-2009 10192350@unknown@formal@none@1@S@*[[Hsqldb]] (Java)@@@@1@2@@danf@17-8-2009 10192360@unknown@formal@none@1@S@*[[IBM DB2]]@@@@1@2@@danf@17-8-2009 10192370@unknown@formal@none@1@S@*[[Information Management System|IBM IMS (Information Management System)]]@@@@1@7@@danf@17-8-2009 10192380@unknown@formal@none@1@S@*[[IBM UniVerse]]@@@@1@2@@danf@17-8-2009 10192390@unknown@formal@none@1@S@*[[Informix]]@@@@1@1@@danf@17-8-2009 10192400@unknown@formal@none@1@S@*[[Ingres (database)|Ingres]]@@@@1@2@@danf@17-8-2009 10192410@unknown@formal@none@1@S@*[[Interbase]]@@@@1@1@@danf@17-8-2009 10192420@unknown@formal@none@1@S@*[[InterSystems Caché]]@@@@1@2@@danf@17-8-2009 10192430@unknown@formal@none@1@S@*[[MaxDB]] (formerly SapDB)@@@@1@3@@danf@17-8-2009 10192440@unknown@formal@none@1@S@*[[Microsoft Access]]@@@@1@2@@danf@17-8-2009 10192450@unknown@formal@none@1@S@*[[Microsoft SQL Server]]@@@@1@3@@danf@17-8-2009 10192460@unknown@formal@none@1@S@*[[Model 204]]@@@@1@2@@danf@17-8-2009 10192470@unknown@formal@none@1@S@*[[MySQL]]@@@@1@1@@danf@17-8-2009 10192480@unknown@formal@none@1@S@*[[Nomad software|Nomad]]@@@@1@2@@danf@17-8-2009 10192490@unknown@formal@none@1@S@*[[Objectivity/DB]]@@@@1@1@@danf@17-8-2009 10192500@unknown@formal@none@1@S@*[[ObjectStore]]@@@@1@1@@danf@17-8-2009 10192510@unknown@formal@none@1@S@*[[Virtuoso Universal Server|OpenLink Virtuoso]]@@@@1@4@@danf@17-8-2009 10192520@unknown@formal@none@1@S@*[[OpenOffice.org Base]]@@@@1@2@@danf@17-8-2009 10192530@unknown@formal@none@1@S@*[[Oracle Database]]@@@@1@2@@danf@17-8-2009 10192540@unknown@formal@none@1@S@*[[Paradox (database)]]@@@@1@2@@danf@17-8-2009 10192550@unknown@formal@none@1@S@*[[Polyhedra DBMS]]@@@@1@2@@danf@17-8-2009 10192560@unknown@formal@none@1@S@*[[PostgreSQL]]@@@@1@1@@danf@17-8-2009 10192570@unknown@formal@none@1@S@*[[Progress 4GL]]@@@@1@2@@danf@17-8-2009 10192580@unknown@formal@none@1@S@*[[RDM Embedded]]@@@@1@2@@danf@17-8-2009 10192590@unknown@formal@none@1@S@*[[ScimoreDB]]@@@@1@1@@danf@17-8-2009 10192600@unknown@formal@none@1@S@*[[Sedna (database)|Sedna]]@@@@1@2@@danf@17-8-2009 10192610@unknown@formal@none@1@S@*[[SQLite]]@@@@1@1@@danf@17-8-2009 10192620@unknown@formal@none@1@S@*[[Superbase database|Superbase]]@@@@1@2@@danf@17-8-2009 10192630@unknown@formal@none@1@S@*[[Sybase]]@@@@1@1@@danf@17-8-2009 10192640@unknown@formal@none@1@S@*[[Teradata]]@@@@1@1@@danf@17-8-2009 10192650@unknown@formal@none@1@S@*[[Vertica]]@@@@1@1@@danf@17-8-2009 10192660@unknown@formal@none@1@S@*[[Visual FoxPro]]@@@@1@2@@danf@17-8-2009 10200010@unknown@formal@none@1@S@
Cluster analysis
@@@@1@2@@danf@17-8-2009 10200020@unknown@formal@none@1@S@'''Clustering''' is the [[Statistical classification|classification]] of objects into different groups, or more precisely, the [[partition of a set|partitioning]] of a [[data set]] into [[subset]]s (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined [[metric (mathematics)|distance measure]].@@@@1@47@@danf@17-8-2009 10200030@unknown@formal@none@1@S@Data clustering is a common technique for [[statistics|statistical]] [[data analysis]], which is used in many fields, including [[machine learning]], [[data mining]], [[pattern recognition]], [[image analysis]] and [[bioinformatics]].@@@@1@27@@danf@17-8-2009 10200040@unknown@formal@none@1@S@The computational task of classifying the data set into ''k'' clusters is often referred to as '''''k''-clustering'''''.@@@@1@17@@danf@17-8-2009 10200050@unknown@formal@none@1@S@Besides the term ''data clustering'' (or just ''clustering''), there are a number of terms with similar meanings, including ''cluster analysis'', ''automatic classification'', ''numerical taxonomy'', ''botryology'' and ''typological analysis''.@@@@1@28@@danf@17-8-2009 10200060@unknown@formal@none@1@S@== Types of clustering ==@@@@1@5@@danf@17-8-2009 10200070@unknown@formal@none@1@S@Data clustering algorithms can be [[hierarchical]].@@@@1@6@@danf@17-8-2009 10200080@unknown@formal@none@1@S@Hierarchical algorithms find successive clusters using previously established clusters.@@@@1@9@@danf@17-8-2009 10200090@unknown@formal@none@1@S@Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down").@@@@1@9@@danf@17-8-2009 10200100@unknown@formal@none@1@S@Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.@@@@1@17@@danf@17-8-2009 10200110@unknown@formal@none@1@S@Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.@@@@1@16@@danf@17-8-2009 10200120@unknown@formal@none@1@S@[[partition of a set|Partitional]] algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the [[hierarchical]] clustering.@@@@1@23@@danf@17-8-2009 10200130@unknown@formal@none@1@S@''Two-way clustering'', ''co-clustering'' or [[biclustering]] are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a [[data matrix (statistics)|data matrix]], the rows and columns are clustered simultaneously.@@@@1@41@@danf@17-8-2009 10200140@unknown@formal@none@1@S@Another important distinction is whether the clustering uses symmetric or asymmetric distances.@@@@1@12@@danf@17-8-2009 10200150@unknown@formal@none@1@S@A property of [[Euclidean space]] is that distances are symmetric (the distance from object'' A'' to ''B'' is the same as the distance from ''B'' to ''A'').@@@@1@27@@danf@17-8-2009 10200160@unknown@formal@none@1@S@In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.@@@@1@18@@danf@17-8-2009 10200170@unknown@formal@none@1@S@== Distance measure ==@@@@1@4@@danf@17-8-2009 10200180@unknown@formal@none@1@S@An important step in any clustering is to select a [[Distance|distance measure]], which will determine how the ''similarity'' of two elements is calculated.@@@@1@23@@danf@17-8-2009 10200190@unknown@formal@none@1@S@This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another.@@@@1@27@@danf@17-8-2009 10200200@unknown@formal@none@1@S@For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2,\\sqrt 2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.@@@@1@53@@danf@17-8-2009 10200210@unknown@formal@none@1@S@Common distance functions:@@@@1@3@@danf@17-8-2009 10200220@unknown@formal@none@1@S@* The [[Euclidean distance]] (also called distance [[as the crow flies]] or 2-norm distance).@@@@1@14@@danf@17-8-2009 10200230@unknown@formal@none@1@S@A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.@@@@1@32@@danf@17-8-2009 10200240@unknown@formal@none@1@S@* The [[Manhattan distance]] (also called taxicab norm or 1-norm)@@@@1@10@@danf@17-8-2009 10200250@unknown@formal@none@1@S@* The [[Maximum_norm|maximum norm]]@@@@1@4@@danf@17-8-2009 10200260@unknown@formal@none@1@S@* The [[Mahalanobis distance]] corrects data for different scales and correlations in the variables@@@@1@14@@danf@17-8-2009 10200270@unknown@formal@none@1@S@* The angle between two vectors can be used as a distance measure when clustering high dimensional data.@@@@1@18@@danf@17-8-2009 10200280@unknown@formal@none@1@S@See [[Inner product space]].@@@@1@4@@danf@17-8-2009 10200290@unknown@formal@none@1@S@* The [[Hamming distance]] (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.@@@@1@20@@danf@17-8-2009 10200300@unknown@formal@none@1@S@==Hierarchical clustering==@@@@1@2@@danf@17-8-2009 10200310@unknown@formal@none@1@S@===Creating clusters===@@@@1@2@@danf@17-8-2009 10200320@unknown@formal@none@1@S@Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters.@@@@1@12@@danf@17-8-2009 10200330@unknown@formal@none@1@S@The traditional representation of this hierarchy is a [[tree data structure|tree]] (called a [[dendrogram]]), with individual elements at one end and a single cluster containing every element at the other.@@@@1@30@@danf@17-8-2009 10200340@unknown@formal@none@1@S@Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the root.@@@@1@16@@danf@17-8-2009 10200350@unknown@formal@none@1@S@(In the figure, the arrows indicate an agglomerative clustering.)@@@@1@9@@danf@17-8-2009 10200360@unknown@formal@none@1@S@Cutting the tree at a given height will give a clustering at a selected precision.@@@@1@15@@danf@17-8-2009 10200370@unknown@formal@none@1@S@In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}.@@@@1@18@@danf@17-8-2009 10200380@unknown@formal@none@1@S@Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.@@@@1@26@@danf@17-8-2009 10200390@unknown@formal@none@1@S@===Agglomerative hierarchical clustering===@@@@1@3@@danf@17-8-2009 10200400@unknown@formal@none@1@S@For example, suppose this data is to be clustered, and the [[euclidean distance]] is the [[Metric (mathematics)|distance metric]].@@@@1@18@@danf@17-8-2009 10200410@unknown@formal@none@1@S@The hierarchical clustering [[dendrogram]] would be as such:@@@@1@8@@danf@17-8-2009 10200420@unknown@formal@none@1@S@This method builds the hierarchy from the individual elements by progressively merging clusters.@@@@1@13@@danf@17-8-2009 10200430@unknown@formal@none@1@S@In our example, we have six elements {a} {b} {c} {d} {e} and {f}.@@@@1@14@@danf@17-8-2009 10200440@unknown@formal@none@1@S@The first step is to determine which elements to merge in a cluster.@@@@1@13@@danf@17-8-2009 10200450@unknown@formal@none@1@S@Usually, we want to take the two closest elements, according to the chosen distance.@@@@1@14@@danf@17-8-2009 10200460@unknown@formal@none@1@S@Optionally, one can also construct a [[distance matrix]] at this stage, where the number in the ''i''-th row ''j''-th column is the distance between the ''i''-th and ''j''-th elements.@@@@1@29@@danf@17-8-2009 10200470@unknown@formal@none@1@S@Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated.@@@@1@18@@danf@17-8-2009 10200480@unknown@formal@none@1@S@This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters.@@@@1@20@@danf@17-8-2009 10200490@unknown@formal@none@1@S@A simple agglomerative clustering algorithm is described in the [[single linkage clustering]] page; it can easily be adapted to different types of linkage (see below).@@@@1@25@@danf@17-8-2009 10200500@unknown@formal@none@1@S@Suppose we have merged the two closest elements ''b'' and ''c'', we now have the following clusters {''a''}, {''b'', ''c''}, {''d''}, {''e''} and {''f''}, and want to merge them further.@@@@1@30@@danf@17-8-2009 10200510@unknown@formal@none@1@S@To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters.@@@@1@22@@danf@17-8-2009 10200520@unknown@formal@none@1@S@Usually the distance between two clusters \\mathcal{A} and \\mathcal{B} is one of the following:@@@@1@14@@danf@17-8-2009 10200530@unknown@formal@none@1@S@* The maximum distance between elements of each cluster (also called complete linkage clustering):@@@@1@14@@danf@17-8-2009 10200540@unknown@formal@none@1@S@:: \\max \\{\\, d(x,y) : x \\in \\mathcal{A},\\, y \\in \\mathcal{B}\\,\\} @@@@1@12@@danf@17-8-2009 10200550@unknown@formal@none@1@S@* The minimum distance between elements of each cluster (also called [[single linkage clustering]]):@@@@1@14@@danf@17-8-2009 10200560@unknown@formal@none@1@S@:: \\min \\{\\, d(x,y) : x \\in \\mathcal{A},\\, y \\in \\mathcal{B} \\,\\} @@@@1@13@@danf@17-8-2009 10200570@unknown@formal@none@1@S@* The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in [[UPGMA]]):@@@@1@18@@danf@17-8-2009 10200580@unknown@formal@none@1@S@:: {1 \\over {|\\mathcal{A}|\\cdot|\\mathcal{B}|}}\\sum_{x \\in \\mathcal{A}}\\sum_{ y \\in \\mathcal{B}} d(x,y) @@@@1@11@@danf@17-8-2009 10200590@unknown@formal@none@1@S@* The sum of all intra-cluster variance@@@@1@7@@danf@17-8-2009 10200600@unknown@formal@none@1@S@* The increase in variance for the cluster being merged ([[Ward's criterion]])@@@@1@12@@danf@17-8-2009 10200610@unknown@formal@none@1@S@* The probability that candidate clusters spawn from the same distribution function (V-linkage)@@@@1@13@@danf@17-8-2009 10200620@unknown@formal@none@1@S@Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).@@@@1@45@@danf@17-8-2009 10200630@unknown@formal@none@1@S@=== Concept clustering ===@@@@1@4@@danf@17-8-2009 10200640@unknown@formal@none@1@S@Another variation of the agglomerative clustering approach is [[conceptual clustering]].@@@@1@10@@danf@17-8-2009 10200650@unknown@formal@none@1@S@==Partitional clustering==@@@@1@2@@danf@17-8-2009 10200660@unknown@formal@none@1@S@===''K''-means and derivatives===@@@@1@3@@danf@17-8-2009 10200670@unknown@formal@none@1@S@====''K''-means clustering====@@@@1@2@@danf@17-8-2009 10200680@unknown@formal@none@1@S@The [[K-means algorithm|''K''-means algorithm]] assigns each point to the cluster whose center (also called centroid) is nearest.@@@@1@17@@danf@17-8-2009 10200690@unknown@formal@none@1@S@The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster...@@@@1@32@@danf@17-8-2009 10200700@unknown@formal@none@1@S@:''Example:'' The data set has three dimensions and the cluster has two points: ''X'' = (''x''1, ''x''2, ''x''3) and ''Y'' = (''y''1, ''y''2, ''y''3).@@@@1@24@@danf@17-8-2009 10200710@unknown@formal@none@1@S@Then the centroid ''Z'' becomes ''Z'' = (''z''1, ''z''2, ''z''3), where ''z''1 = (''x''1 + ''y''1)/2 and ''z''2 = (''x''2 + ''y''2)/2 and ''z''3 = (''x''3 + ''y''3)/2.@@@@1@22@@danf@17-8-2009 10200720@unknown@formal@none@1@S@The algorithm steps are (J. MacQueen, 1967):@@@@1@7@@danf@17-8-2009 10200730@unknown@formal@none@1@S@* Choose the number of clusters, ''k''.@@@@1@7@@danf@17-8-2009 10200740@unknown@formal@none@1@S@* Randomly generate ''k'' clusters and determine the cluster centers, or directly generate ''k'' random points as cluster centers.@@@@1@19@@danf@17-8-2009 10200750@unknown@formal@none@1@S@* Assign each point to the nearest cluster center.@@@@1@9@@danf@17-8-2009 10200760@unknown@formal@none@1@S@* Recompute the new cluster centers.@@@@1@6@@danf@17-8-2009 10200770@unknown@formal@none@1@S@* Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).@@@@1@18@@danf@17-8-2009 10200780@unknown@formal@none@1@S@The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets.@@@@1@19@@danf@17-8-2009 10200790@unknown@formal@none@1@S@Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.@@@@1@24@@danf@17-8-2009 10200800@unknown@formal@none@1@S@It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance.@@@@1@17@@danf@17-8-2009 10200810@unknown@formal@none@1@S@====Fuzzy ''c''-means clustering====@@@@1@3@@danf@17-8-2009 10200820@unknown@formal@none@1@S@In [[fuzzy clustering]], each point has a degree of belonging to clusters, as in [[fuzzy logic]], rather than belonging completely to just one cluster.@@@@1@24@@danf@17-8-2009 10200830@unknown@formal@none@1@S@Thus, points on the edge of a cluster, may be ''in the cluster'' to a lesser degree than points in the center of cluster.@@@@1@24@@danf@17-8-2009 10200840@unknown@formal@none@1@S@For each point ''x'' we have a coefficient giving the degree of being in the ''k''th cluster u_k(x).@@@@1@18@@danf@17-8-2009 10200850@unknown@formal@none@1@S@Usually, the sum of those coefficients is defined to be 1:@@@@1@11@@danf@17-8-2009 10200860@unknown@formal@none@1@S@: \\forall x \\sum_{k=1}^{\\mathrm{num.}\\ \\mathrm{clusters}} u_k(x) \\ =1.@@@@1@8@@danf@17-8-2009 10200870@unknown@formal@none@1@S@With fuzzy ''c''-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:@@@@1@23@@danf@17-8-2009 10200880@unknown@formal@none@1@S@:\\mathrm{center}_k = {{\\sum_x u_k(x)^m x} \\over {\\sum_x u_k(x)^m}}.@@@@1@8@@danf@17-8-2009 10200890@unknown@formal@none@1@S@The degree of belonging is related to the inverse of the distance to the cluster@@@@1@15@@danf@17-8-2009 10200900@unknown@formal@none@1@S@:u_k(x) = {1 \\over d(\\mathrm{center}_k,x)},@@@@1@5@@danf@17-8-2009 10200910@unknown@formal@none@1@S@then the coefficients are normalized and fuzzyfied with a real parameter m>1 so that their sum is 1.@@@@1@18@@danf@17-8-2009 10200920@unknown@formal@none@1@S@So@@@@1@1@@danf@17-8-2009 10200930@unknown@formal@none@1@S@:u_k(x) = \\frac{1}{\\sum_j \\left(\\frac{d(\\mathrm{center}_k,x)}{d(\\mathrm{center}_j,x)}\\right)^{2/(m-1)}}.@@@@1@4@@danf@17-8-2009 10200940@unknown@formal@none@1@S@For ''m'' equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1.@@@@1@18@@danf@17-8-2009 10200950@unknown@formal@none@1@S@When ''m'' is close to 1, then cluster center closest to the point is given much more weight than the others, and the algorithm is similar to ''k''-means.@@@@1@28@@danf@17-8-2009 10200960@unknown@formal@none@1@S@The fuzzy ''c''-means algorithm is very similar to the ''k''-means algorithm:@@@@1@11@@danf@17-8-2009 10200970@unknown@formal@none@1@S@* Choose a number of clusters.@@@@1@6@@danf@17-8-2009 10200980@unknown@formal@none@1@S@* Assign randomly to each point coefficients for being in the clusters.@@@@1@12@@danf@17-8-2009 10200990@unknown@formal@none@1@S@* Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than \\epsilon, the given sensitivity threshold) :@@@@1@25@@danf@17-8-2009 10201000@unknown@formal@none@1@S@** Compute the centroid for each cluster, using the formula above.@@@@1@11@@danf@17-8-2009 10201010@unknown@formal@none@1@S@** For each point, compute its coefficients of being in the clusters, using the formula above.@@@@1@16@@danf@17-8-2009 10201020@unknown@formal@none@1@S@The algorithm minimizes intra-cluster variance as well, but has the same problems as ''k''-means, the minimum is a local minimum, and the results depend on the initial choice of weights.@@@@1@30@@danf@17-8-2009 10201030@unknown@formal@none@1@S@The [[Expectation-maximization algorithm]] is a more statistically formalized method which includes some of these ideas: partial membership in classes.@@@@1@19@@danf@17-8-2009 10201040@unknown@formal@none@1@S@It has better convergence properties and is in general preferred to fuzzy-c-means.@@@@1@12@@danf@17-8-2009 10201050@unknown@formal@none@1@S@====QT clustering algorithm====@@@@1@3@@danf@17-8-2009 10201060@unknown@formal@none@1@S@QT (quality threshold) clustering (Heyer et al, 1999) is an alternative method of partitioning data, invented for gene clustering.@@@@1@19@@danf@17-8-2009 10201070@unknown@formal@none@1@S@It requires more computing power than ''k''-means, but does not require specifying the number of clusters ''a priori'', and always returns the same result when run several times.@@@@1@28@@danf@17-8-2009 10201080@unknown@formal@none@1@S@The algorithm is:@@@@1@3@@danf@17-8-2009 10201090@unknown@formal@none@1@S@* The user chooses a maximum diameter for clusters.@@@@1@9@@danf@17-8-2009 10201100@unknown@formal@none@1@S@* Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold.@@@@1@28@@danf@17-8-2009 10201110@unknown@formal@none@1@S@* Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration.@@@@1@24@@danf@17-8-2009 10201120@unknown@formal@none@1@S@Must clarify what happens if more than 1 cluster has the maximum number of points ?@@@@1@16@@danf@17-8-2009 10201130@unknown@formal@none@1@S@* [[Recursion|Recurse]] with the reduced set of points.@@@@1@8@@danf@17-8-2009 10201140@unknown@formal@none@1@S@The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters).@@@@1@39@@danf@17-8-2009 10201150@unknown@formal@none@1@S@=== Locality-sensitive hashing ===@@@@1@4@@danf@17-8-2009 10201160@unknown@formal@none@1@S@[[Locality-sensitive hashing]] can be used for clustering.@@@@1@7@@danf@17-8-2009 10201170@unknown@formal@none@1@S@Feature space vectors are sets, and the metric used is the [[Jaccard distance]].@@@@1@13@@danf@17-8-2009 10201180@unknown@formal@none@1@S@The feature space can be considered high-dimensional.@@@@1@7@@danf@17-8-2009 10201190@unknown@formal@none@1@S@The ''min-wise independent permutations'' LSH scheme (sometimes MinHash) is then used to put similar items into buckets.@@@@1@17@@danf@17-8-2009 10201200@unknown@formal@none@1@S@With just one set of hashing methods, there are only clusters of very similar elements.@@@@1@15@@danf@17-8-2009 10201210@unknown@formal@none@1@S@By seeding the hash functions several times (eg 20), it is possible to get bigger clusters.@@@@1@16@@danf@17-8-2009 10201220@unknown@formal@none@1@S@=== Graph-theoretic methods ===@@@@1@4@@danf@17-8-2009 10201230@unknown@formal@none@1@S@[[Formal concept analysis]] is a technique for generating clusters of objects and attributes, given a [[bipartite graph]] representing the relations between the objects and attributes.@@@@1@25@@danf@17-8-2009 10201240@unknown@formal@none@1@S@Other methods for generating ''overlapping clusters'' (a [[Cover (topology)|cover]] rather than a [[partition of a set|partition]]) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970).@@@@1@28@@danf@17-8-2009 10201250@unknown@formal@none@1@S@== Elbow criterion ==@@@@1@4@@danf@17-8-2009 10201260@unknown@formal@none@1@S@The elbow criterion is a common [[rule of thumb]] to determine what number of clusters should be chosen, for example for ''k''-means and agglomerative hierarchical clustering.@@@@1@26@@danf@17-8-2009 10201270@unknown@formal@none@1@S@It should also be noted that the initial assignment of cluster seeds has bearing on the final model performance.@@@@1@19@@danf@17-8-2009 10201280@unknown@formal@none@1@S@Thus, it is appropriate to re-run the cluster analysis multiple times.@@@@1@11@@danf@17-8-2009 10201290@unknown@formal@none@1@S@The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information.@@@@1@21@@danf@17-8-2009 10201300@unknown@formal@none@1@S@More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow).@@@@1@47@@danf@17-8-2009 10201310@unknown@formal@none@1@S@This elbow cannot always be unambiguously identified.@@@@1@7@@danf@17-8-2009 10201320@unknown@formal@none@1@S@Percentage of variance explained is the ratio of the between-group variance to the total variance.@@@@1@15@@danf@17-8-2009 10201330@unknown@formal@none@1@S@On the following graph, the elbow is indicated by the red circle.@@@@1@12@@danf@17-8-2009 10201340@unknown@formal@none@1@S@The number of clusters chosen should therefore be 4.@@@@1@9@@danf@17-8-2009 10201350@unknown@formal@none@1@S@== Spectral clustering ==@@@@1@4@@danf@17-8-2009 10201360@unknown@formal@none@1@S@Given a set of data points A, the [[similarity matrix]] may be defined as a matrix S where S_{ij} represents a measure of the similarity between points i, j\\in A.@@@@1@30@@danf@17-8-2009 10201370@unknown@formal@none@1@S@Spectral clustering techniques make use of the [[Spectrum of a matrix|spectrum]] of the similarity matrix of the data to perform [[dimensionality reduction]] for clustering in fewer dimensions.@@@@1@27@@danf@17-8-2009 10201380@unknown@formal@none@1@S@One such technique is the ''[[Shi-Malik algorithm]]'', commonly used for [[segmentation (image processing)|image segmentation]].@@@@1@14@@danf@17-8-2009 10201390@unknown@formal@none@1@S@It partitions points into two sets (S_1,S_2) based on the [[eigenvector]] v corresponding to the second-smallest [[eigenvalue]] of the [[Laplacian matrix]]@@@@1@21@@danf@17-8-2009 10201400@unknown@formal@none@1@S@:L = I - D^{-1/2}SD^{-1/2}@@@@1@5@@danf@17-8-2009 10201410@unknown@formal@none@1@S@of S, where D is the diagonal matrix@@@@1@8@@danf@17-8-2009 10201420@unknown@formal@none@1@S@:D_{ii} = \\sum_{j} S_{ij}.@@@@1@4@@danf@17-8-2009 10201430@unknown@formal@none@1@S@This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S_1, and the rest in S_2.@@@@1@39@@danf@17-8-2009 10201440@unknown@formal@none@1@S@The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.@@@@1@16@@danf@17-8-2009 10201450@unknown@formal@none@1@S@A related algorithm is the ''[[Meila-Shi algorithm]]'', which takes the [[eigenvector]]s corresponding to the ''k'' largest [[eigenvalue]]s of the matrix P = SD^{-1} for some ''k'', and then invokes another (e.g. ''k''-means) to cluster points by their respective ''k'' components in these eigenvectors.@@@@1@43@@danf@17-8-2009 10201460@unknown@formal@none@1@S@==Applications==@@@@1@1@@danf@17-8-2009 10201470@unknown@formal@none@1@S@=== Biology ===@@@@1@3@@danf@17-8-2009 10201480@unknown@formal@none@1@S@In [[biology]] '''clustering''' has many applications@@@@1@6@@danf@17-8-2009 10201490@unknown@formal@none@1@S@*In imaging, data clustering may take different form based on the data dimensionality.@@@@1@13@@danf@17-8-2009 10201500@unknown@formal@none@1@S@For example, the [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture SOCR EM Mixture model segmentation activity and applet] shows how to obtain point, region or volume classification using the online [[SOCR]] computational libraries.@@@@1@27@@danf@17-8-2009 10201510@unknown@formal@none@1@S@*In the fields of [[plant]] and [[animal]] [[ecology]], clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in [[Systematics|plant systematics]] to generate artificial [[Phylogeny|phylogenies]] or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes@@@@1@57@@danf@17-8-2009 10201520@unknown@formal@none@1@S@*In computational biology and [[bioinformatics]]:@@@@1@5@@danf@17-8-2009 10201530@unknown@formal@none@1@S@** In [[transcriptome|transcriptomics]], clustering is used to build groups of [[genes]] with related expression patterns (also known as coexpressed genes).@@@@1@20@@danf@17-8-2009 10201540@unknown@formal@none@1@S@Often such groups contain functionally related proteins, such as [[enzyme]]s for a specific [[metabolic pathway|pathway]], or genes that are co-regulated.@@@@1@20@@danf@17-8-2009 10201550@unknown@formal@none@1@S@High throughput experiments using [[expressed sequence tag]]s (ESTs) or [[DNA microarray]]s can be a powerful tool for [[genome annotation]], a general aspect of [[genomics]].@@@@1@24@@danf@17-8-2009 10201560@unknown@formal@none@1@S@** In [[sequence analysis]], clustering is used to group homologous sequences into [[list of gene families|gene families]].@@@@1@17@@danf@17-8-2009 10201570@unknown@formal@none@1@S@This is a very important concept in bioinformatics, and [[evolutionary biology]] in general.@@@@1@13@@danf@17-8-2009 10201580@unknown@formal@none@1@S@See evolution by [[gene duplication]].@@@@1@5@@danf@17-8-2009 10201590@unknown@formal@none@1@S@** In high-throughput genotyping platforms clustering algorithms are used to automatically assign [[genotypes]].@@@@1@13@@danf@17-8-2009 10201600@unknown@formal@none@1@S@=== Medicine ===@@@@1@3@@danf@17-8-2009 10201610@unknown@formal@none@1@S@In [[medical imaging]], such as [[PET scan|PET scans]], cluster analysis can be used to differentiate between different types of [[tissue (biology)|tissue]] and [[blood]] in a three dimensional image.@@@@1@28@@danf@17-8-2009 10201620@unknown@formal@none@1@S@In this application, actual position does not matter, but the [[voxel]] intensity is considered as a [[coordinate vector|vector]], with a dimension for each image that was taken over time.@@@@1@29@@danf@17-8-2009 10201630@unknown@formal@none@1@S@This technique allows, for example, accurate measurement of the rate a radioactive tracer is delivered to the area of interest, without a separate sampling of [[arterial]] blood, an intrusive technique that is most common today.@@@@1@35@@danf@17-8-2009 10201640@unknown@formal@none@1@S@=== Market research ===@@@@1@4@@danf@17-8-2009 10201650@unknown@formal@none@1@S@Cluster analysis is widely used in [[market research]] when working with multivariate data from [[Statistical survey|surveys]] and test panels.@@@@1@19@@danf@17-8-2009 10201660@unknown@formal@none@1@S@Market researchers use cluster analysis to partition the general [[population]] of [[consumers]] into market segments and to better understand the relationships between different groups of consumers/potential [[customers]].@@@@1@27@@danf@17-8-2009 10201670@unknown@formal@none@1@S@* Segmenting the market and determining [[target market]]s@@@@1@8@@danf@17-8-2009 10201680@unknown@formal@none@1@S@* [[positioning (marketing)|Product positioning]]@@@@1@4@@danf@17-8-2009 10201690@unknown@formal@none@1@S@* [[New product development]]@@@@1@4@@danf@17-8-2009 10201700@unknown@formal@none@1@S@* Selecting test markets (see : [[experimental techniques]])@@@@1@8@@danf@17-8-2009 10201710@unknown@formal@none@1@S@=== Other applications ===@@@@1@4@@danf@17-8-2009 10201720@unknown@formal@none@1@S@'''Social network analysis''': In the study of [[social networks]], clustering may be used to recognize [[communities]] within large groups of people.@@@@1@21@@danf@17-8-2009 10201730@unknown@formal@none@1@S@'''Image segmentation''': Clustering can be used to divide a [[digital]] [[image]] into distinct regions for [[border detection]] or [[object recognition]].@@@@1@20@@danf@17-8-2009 10201740@unknown@formal@none@1@S@'''Data mining''': Many [[data mining]] applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples.@@@@1@21@@danf@17-8-2009 10201750@unknown@formal@none@1@S@Another common application is the division of documents, such as [[World Wide Web]] pages, into genres.@@@@1@16@@danf@17-8-2009 10201760@unknown@formal@none@1@S@'''Search result grouping''': In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like [[Google]].@@@@1@34@@danf@17-8-2009 10201770@unknown@formal@none@1@S@There are currently a number of web based clustering tools such as [[Clusty]].@@@@1@13@@danf@17-8-2009 10201780@unknown@formal@none@1@S@'''Slippy map optimization''': [[Flickr]]'s map of photos and other map sites use clustering to reduce the number of markers on a map.@@@@1@22@@danf@17-8-2009 10201790@unknown@formal@none@1@S@This makes it both faster and reduces the amount of visual clutter.@@@@1@12@@danf@17-8-2009 10201800@unknown@formal@none@1@S@'''IMRT segmentation''': Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.@@@@1@23@@danf@17-8-2009 10201810@unknown@formal@none@1@S@'''Grouping of Shopping Items''': Clustering can be used to group all the shopping items available on the web into a set of unique products.@@@@1@24@@danf@17-8-2009 10201820@unknown@formal@none@1@S@For example, all the items on eBay can be grouped into unique products.@@@@1@13@@danf@17-8-2009 10201825@unknown@formal@none@1@S@(eBay doesn't have the concept of a SKU)@@@@1@8@@danf@17-8-2009 10201830@unknown@formal@none@1@S@'''[[Mathematical chemistry]]''': To find structural similarity, etc., for example, 3000 chemical compounds were clustered in the space of 90 [[topological index|topological indices]].@@@@1@22@@danf@17-8-2009 10201840@unknown@formal@none@1@S@'''Petroleum Geology''': Cluster Analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties.@@@@1@23@@danf@17-8-2009 10201850@unknown@formal@none@1@S@== Comparisons between data clusterings ==@@@@1@6@@danf@17-8-2009 10201860@unknown@formal@none@1@S@There have been several suggestions for a measure of similarity between two clusterings.@@@@1@13@@danf@17-8-2009 10201870@unknown@formal@none@1@S@Such a measure can be used to compare how well different data clustering algorithms perform on a set of data.@@@@1@20@@danf@17-8-2009 10201880@unknown@formal@none@1@S@Many of these measures are derived from the [[matching matrix]] (aka [[confusion matrix]]), e.g., the [[Rand index|Rand measure]] and the Fowlkes-Mallows ''B''''k'' measures.@@@@1@23@@danf@17-8-2009 10201890@unknown@formal@none@1@S@[[Marina Meila]]'s Variation of Information metric is a more recent approach for measuring distance between clusterings.@@@@1@16@@danf@17-8-2009 10201900@unknown@formal@none@1@S@It uses [[Mutual information|mutual information]] and [[entropy]] to approximate the distance between two clusterings across the lattice of possible clusterings.@@@@1@20@@danf@17-8-2009 10201910@unknown@formal@none@1@S@==Algorithms==@@@@1@1@@danf@17-8-2009 10201920@unknown@formal@none@1@S@In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998).@@@@1@15@@danf@17-8-2009 10201930@unknown@formal@none@1@S@Among the most popular are ''CLARANS'' (Ng and Han,1994), ''[[DBSCAN]]'' (Ester et al., 1996) and ''BIRCH'' (Zhang et al., 1996).@@@@1@20@@danf@17-8-2009 10210010@unknown@formal@none@1@S@
Data mining
@@@@1@2@@danf@17-8-2009 10210020@unknown@formal@none@1@S@'''Data mining''' is the process of [[sorting]] through large amounts of data and picking out relevant information.@@@@1@17@@danf@17-8-2009 10210030@unknown@formal@none@1@S@It is usually used by [[business intelligence]] organizations, and [[financial analyst]]s, but is increasingly being used in the sciences to extract information from the enormous [[data set]]s generated by modern experimental and observational methods.@@@@1@34@@danf@17-8-2009 10210040@unknown@formal@none@1@S@It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful [[information]] from [[data]]" and "the science of extracting useful information from large [[data set]]s or [[database]]s.@@@@1@31@@danf@17-8-2009 10210050@unknown@formal@none@1@S@" Data mining in relation to [[enterprise resource planning]] is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.@@@@1@29@@danf@17-8-2009 10210060@unknown@formal@none@1@S@==Background==@@@@1@1@@danf@17-8-2009 10210070@unknown@formal@none@1@S@Traditionally, business analysts have performed the task of extracting useful [[information]] from recorded [[data]], but the increasing volume of data in modern business and science calls for computer-based approaches.@@@@1@29@@danf@17-8-2009 10210080@unknown@formal@none@1@S@As [[data set]]s have grown in size and complexity, there has been a shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools.@@@@1@31@@danf@17-8-2009 10210090@unknown@formal@none@1@S@The modern technologies of [[computers]], [[networks]], and [[sensors]] have made [[data collection]] and organization much easier.@@@@1@16@@danf@17-8-2009 10210100@unknown@formal@none@1@S@However, the captured data needs to be converted into [[information]] and [[knowledge]] to become useful.@@@@1@15@@danf@17-8-2009 10210110@unknown@formal@none@1@S@Data mining is the entire process of applying computer-based [[methodology]], including new techniques for [[knowledge discovery]], to data.@@@@1@18@@danf@17-8-2009 10210120@unknown@formal@none@1@S@Data mining identifies trends within data that go beyond simple analysis.@@@@1@11@@danf@17-8-2009 10210130@unknown@formal@none@1@S@Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of business processes and target opportunities.@@@@1@21@@danf@17-8-2009 10210140@unknown@formal@none@1@S@However, abdicating control of this process from the statistician to the machine may result in false-positives or no useful results at all.@@@@1@22@@danf@17-8-2009 10210150@unknown@formal@none@1@S@Although data mining is a relatively new term, the technology is not.@@@@1@12@@danf@17-8-2009 10210160@unknown@formal@none@1@S@For many years, businesses have used powerful computers to sift through volumes of data such as supermarket scanner data to produce market research reports (although reporting is not considered to be data mining).@@@@1@33@@danf@17-8-2009 10210170@unknown@formal@none@1@S@Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of data analysis.@@@@1@21@@danf@17-8-2009 10210180@unknown@formal@none@1@S@Web 2.0 technologies have generated a colossal amount of user-generated data and media, making it hard to aggregate and consume information in a meaningful way without getting overloaded.@@@@1@28@@danf@17-8-2009 10210190@unknown@formal@none@1@S@Given the size of the data on the Internet, and the difficulty in contextualizing it, it is unclear whether the traditional approach to data mining is computationally viable.@@@@1@28@@danf@17-8-2009 10210200@unknown@formal@none@1@S@The term data mining is often used to apply to the two separate processes of knowledge discovery and [[prediction]].@@@@1@19@@danf@17-8-2009 10210210@unknown@formal@none@1@S@Knowledge discovery provides explicit information that has a readable form and can be understood by a user.@@@@1@17@@danf@17-8-2009 10210220@unknown@formal@none@1@S@[[Forecasting]], or [[predictive modeling]] provides predictions of future events and may be transparent and readable in some approaches (e.g., rule-based systems) and opaque in others such as [[neural network]]s.@@@@1@29@@danf@17-8-2009 10210230@unknown@formal@none@1@S@Moreover, some data-mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery.@@@@1@20@@danf@17-8-2009 10210240@unknown@formal@none@1@S@[[Metadata]], or data about a given data set, are often expressed in a condensed ''data-minable'' format, or one that facilitates the practice of data mining.@@@@1@25@@danf@17-8-2009 10210250@unknown@formal@none@1@S@Common examples include executive summaries and scientific abstracts.@@@@1@8@@danf@17-8-2009 10210260@unknown@formal@none@1@S@Data mining relies on the use of real world data.@@@@1@10@@danf@17-8-2009 10210270@unknown@formal@none@1@S@This data is extremely vulnerable to [[collinearity]] precisely because data from the real world may have unknown interrelations.@@@@1@18@@danf@17-8-2009 10210280@unknown@formal@none@1@S@An unavoidable weakness of data mining is that the critical data that may expose any relationship might have never been observed.@@@@1@21@@danf@17-8-2009 10210290@unknown@formal@none@1@S@Alternative approaches using an experiment-based approach such as [[Choice Modelling]] for human-generated data may be used.@@@@1@16@@danf@17-8-2009 10210300@unknown@formal@none@1@S@Inherent correlations are either controlled for or removed altogether through the construction of an [[experimental design]].@@@@1@16@@danf@17-8-2009 10210310@unknown@formal@none@1@S@Recently, there were some efforts to define a standard for data mining, for example the [[CRISP-DM]] standard for analysis processes or the [[Java Data-Mining]] Standard.@@@@1@25@@danf@17-8-2009 10210320@unknown@formal@none@1@S@Independent of these standardization efforts, freely available open-source software systems like [[RapidMiner]] and [[Weka (machine learning)| Weka]] have become an informal standard for defining data-mining processes.@@@@1@26@@danf@17-8-2009 10210330@unknown@formal@none@1@S@==Privacy concerns==@@@@1@2@@danf@17-8-2009 10210340@unknown@formal@none@1@S@There are also [[privacy]] and [[human rights]] concerns associated with data mining, specifically regarding the source of the data analyzed.@@@@1@20@@danf@17-8-2009 10210350@unknown@formal@none@1@S@Data mining provides information that may be difficult to obtain otherwise.@@@@1@11@@danf@17-8-2009 10210360@unknown@formal@none@1@S@When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.@@@@1@16@@danf@17-8-2009 10210370@unknown@formal@none@1@S@In particular, data mining government or commercial data sets for national security or law enforcement purposes has raised privacy concerns.@@@@1@20@@danf@17-8-2009 10210380@unknown@formal@none@1@S@==Notable uses of data mining==@@@@1@5@@danf@17-8-2009 10210390@unknown@formal@none@1@S@===Combatting Terrorism===@@@@1@2@@danf@17-8-2009 10210400@unknown@formal@none@1@S@Data mining has been cited as the method by which the U.S. Army unit [[Able Danger]] had identified the [[September 11, 2001 attacks]] leader, [[Mohamed Atta]], and three other 9/11 hijackers as possible members of an [[Al Qaeda]] cell operating in the U.S. more than a year before the attack.@@@@1@50@@danf@17-8-2009 10210410@unknown@formal@none@1@S@It has been suggested that both the [[Central Intelligence Agency]] and the [[Canadian Security Intelligence Service]] have employed this method.@@@@1@20@@danf@17-8-2009 10210420@unknown@formal@none@1@S@Previous data mining to stop terrorist programs under the US government include the Terrorism Information Awareness (TIA) program, Computer-Assisted Passenger Prescreening System (CAPPS II), Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement (ADVISE), Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight program [http://www.msnbc.msn.com/id/20604775/ Security-MSNBC].@@@@1@44@@danf@17-8-2009 10210430@unknown@formal@none@1@S@These programs have been discontinued due to controversy over whether they violate the US Constitution's 4th amendment.@@@@1@17@@danf@17-8-2009 10210440@unknown@formal@none@1@S@===Games===@@@@1@1@@danf@17-8-2009 10210450@unknown@formal@none@1@S@Since the early 1960s, with the availability of [[Oracle machine|oracle]]s for certain [[combinatorial game]]s, also called [[tablebase]]s (e.g. for 3x3-chess) with any beginning configuration, small-board [[dots-and-boxes]], small-board-hex, and certain endgames in chess, dots-and-boxes, and hex; a new area for data mining has been opened up.@@@@1@45@@danf@17-8-2009 10210460@unknown@formal@none@1@S@This is the extraction of human-usable strategies from these oracles.@@@@1@10@@danf@17-8-2009 10210470@unknown@formal@none@1@S@Current pattern recognition approaches do not seem to fully have the required high level of abstraction in order to be applied successfully.@@@@1@22@@danf@17-8-2009 10210480@unknown@formal@none@1@S@Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase-answers to well designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns.@@@@1@32@@danf@17-8-2009 10210490@unknown@formal@none@1@S@[[Berlekamp]] in dots-and-boxes etc. and [[John Nunn]] in [[chess]] [[Chess endgame|endgames]] are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.@@@@1@30@@danf@17-8-2009 10210500@unknown@formal@none@1@S@===Business===@@@@1@1@@danf@17-8-2009 10210510@unknown@formal@none@1@S@Data mining in [[customer relationship management]] applications can contribute significantly to the bottom line.@@@@1@14@@danf@17-8-2009 10210520@unknown@formal@none@1@S@Rather than contacting a prospect or customer through a call center or sending mail, only prospects that are predicted to have a high likelihood of responding to an offer are contacted.@@@@1@31@@danf@17-8-2009 10210530@unknown@formal@none@1@S@More sophisticated methods may be used to optimize across campaigns so that we can predict which channel and which offer an individual is most likely to respond to - across all potential offers.@@@@1@33@@danf@17-8-2009 10210540@unknown@formal@none@1@S@Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer.@@@@1@33@@danf@17-8-2009 10210550@unknown@formal@none@1@S@[[Data clustering]] can also be used to automatically discover the segments or groups within a customer data set.@@@@1@18@@danf@17-8-2009 10210560@unknown@formal@none@1@S@Businesses employing data mining quickly see a return on investment, but also they recognize that the number of predictive models can quickly become very large.@@@@1@25@@danf@17-8-2009 10210570@unknown@formal@none@1@S@Rather than one model to predict which customers will [[Churning (stock trade)|churn]], a business could build a separate model for each region and customer type.@@@@1@25@@danf@17-8-2009 10210580@unknown@formal@none@1@S@Then instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers that will likely take to offer.@@@@1@29@@danf@17-8-2009 10210590@unknown@formal@none@1@S@And finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable.@@@@1@33@@danf@17-8-2009 10210600@unknown@formal@none@1@S@In order to maintain this quantity of models, they need to manage model versions and move to ''automated data mining''.@@@@1@20@@danf@17-8-2009 10210610@unknown@formal@none@1@S@Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees.@@@@1@18@@danf@17-8-2009 10210620@unknown@formal@none@1@S@Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly.@@@@1@17@@danf@17-8-2009 10210630@unknown@formal@none@1@S@Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.@@@@1@28@@danf@17-8-2009 10210640@unknown@formal@none@1@S@Another example of data mining, often called the [[market basket analysis]], relates to its use in retail sales.@@@@1@18@@danf@17-8-2009 10210650@unknown@formal@none@1@S@If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones.@@@@1@23@@danf@17-8-2009 10210660@unknown@formal@none@1@S@Although some explanations of relationships may be difficult, taking advantage of it is easier.@@@@1@14@@danf@17-8-2009 10210670@unknown@formal@none@1@S@The example deals with [[association rule]]s within transaction-based data.@@@@1@9@@danf@17-8-2009 10210680@unknown@formal@none@1@S@Not all data are transaction based and logical or inexact [[rule]]s may also be present within a [[database]].@@@@1@18@@danf@17-8-2009 10210690@unknown@formal@none@1@S@In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.@@@@1@30@@danf@17-8-2009 10210700@unknown@formal@none@1@S@Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing."@@@@1@24@@danf@17-8-2009 10210710@unknown@formal@none@1@S@In this paper the application of data mining and decision analysis to the problem of die-level functional test is described.@@@@1@20@@danf@17-8-2009 10210720@unknown@formal@none@1@S@Experiments mentioned in this paper demonstrate the ability of applying a system of mining historical die-test data to create a probabilistic model of patterns of die failure which are then utilized to decide in real time which die to test next and when to stop testing.@@@@1@46@@danf@17-8-2009 10210730@unknown@formal@none@1@S@This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.@@@@1@23@@danf@17-8-2009 10210740@unknown@formal@none@1@S@===Science and engineering===@@@@1@3@@danf@17-8-2009 10210750@unknown@formal@none@1@S@In recent years, data mining has been widely used in area of science and engineering, such as [[bioinformatic]]s, [[genetic]]s, [[medicine]], [[education]], and [[electrical power]] engineering.@@@@1@25@@danf@17-8-2009 10210760@unknown@formal@none@1@S@In the area of study on human genetics, the important goal is to understand the mapping relationship between the inter-individual variation in human [[DNA]] sequences and variability in disease susceptibility.@@@@1@30@@danf@17-8-2009 10210770@unknown@formal@none@1@S@In lay terms, it is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as [[cancer]].@@@@1@26@@danf@17-8-2009 10210780@unknown@formal@none@1@S@This is very important to help improve the diagnosis, prevention and treatment of the diseases.@@@@1@15@@danf@17-8-2009 10210790@unknown@formal@none@1@S@The data mining technique that is used to perform this task is known as [[multifactor dimensionality reduction]].@@@@1@17@@danf@17-8-2009 10210800@unknown@formal@none@1@S@In the area of electrical power engineering, data mining techniques have been widely used for [[condition monitoring]] of high voltage electrical equipment.@@@@1@22@@danf@17-8-2009 10210810@unknown@formal@none@1@S@The purpose of condition monitoring is to obtain valuable information on the [[insulation]]'s health status of the equipment.@@@@1@18@@danf@17-8-2009 10210820@unknown@formal@none@1@S@[[Data clustering]] such as [[self-organizing map]] (SOM) has been applied on the vibration monitoring and analysis of transformer on-load tap-changers(OLTCS).@@@@1@20@@danf@17-8-2009 10210830@unknown@formal@none@1@S@Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms.@@@@1@30@@danf@17-8-2009 10210840@unknown@formal@none@1@S@Obviously, different tap positions will generate different signals.@@@@1@8@@danf@17-8-2009 10210850@unknown@formal@none@1@S@However, there was considerable variability amongst normal condition signals for the exact same tap position.@@@@1@15@@danf@17-8-2009 10210860@unknown@formal@none@1@S@SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.@@@@1@16@@danf@17-8-2009 10210870@unknown@formal@none@1@S@Data mining techniques have also been applied for [[dissolved gas analysis]] (DGA) on [[power transformer]]s.@@@@1@15@@danf@17-8-2009 10210880@unknown@formal@none@1@S@DGA, as a diagnostics for power transformer, has been available for centuries.@@@@1@12@@danf@17-8-2009 10210890@unknown@formal@none@1@S@Data mining techniques such as SOM has been applied to analyse data and to determine trends which are not obvious to the standard DGA ratio techniques such as Duval Triangle.@@@@1@30@@danf@17-8-2009 10210900@unknown@formal@none@1@S@A fourth area of application for data mining in science/engineering is within educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning and to understand the factors influencing university student retention.@@@@1@45@@danf@17-8-2009 10210910@unknown@formal@none@1@S@Other examples of applying data mining technique applications are [[biomedical]] data facilitated by domain ontologies, mining clinical trial data, [[traffic analysis]] using SOM, et cetera.@@@@1@25@@danf@17-8-2009 10220010@unknown@formal@none@1@S@
Data set
@@@@1@2@@danf@17-8-2009 10220020@unknown@formal@none@1@S@A '''data set''' (or '''dataset''') is a collection of [[data]], usually presented in tabular form.@@@@1@15@@danf@17-8-2009 10220030@unknown@formal@none@1@S@Each column represents a particular variable.@@@@1@6@@danf@17-8-2009 10220040@unknown@formal@none@1@S@Each row corresponds to a given member of the data set in question.@@@@1@13@@danf@17-8-2009 10220050@unknown@formal@none@1@S@It lists values for each of the variables, such as height and weight of an object or values of random numbers.@@@@1@21@@danf@17-8-2009 10220060@unknown@formal@none@1@S@Each value is known as a [[datum]].@@@@1@7@@danf@17-8-2009 10220070@unknown@formal@none@1@S@The data set may comprise data for one or more members, corresponding to the number of rows.@@@@1@17@@danf@17-8-2009 10220080@unknown@formal@none@1@S@Historically, the term originated in the [[mainframe computer|mainframe field]], where it had a [[Data set (IBM mainframe)|well-defined meaning]], very close to contemporary ''[[computer file]]''.@@@@1@24@@danf@17-8-2009 10220090@unknown@formal@none@1@S@This topic is not covered here.@@@@1@6@@danf@17-8-2009 10220100@unknown@formal@none@1@S@In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list.@@@@1@26@@danf@17-8-2009 10220110@unknown@formal@none@1@S@The values may be numbers, such as [[real number]]s or [[integer]]s, for example representing a person's height in centimeters, but may also be [[nominal data]] (i.e., not consisting of [[numerical]] values), for example representing a person's ethnicity.@@@@1@37@@danf@17-8-2009 10220120@unknown@formal@none@1@S@More generally, values may be of any of the kinds described as a [[level of measurement]].@@@@1@16@@danf@17-8-2009 10220130@unknown@formal@none@1@S@For each variable, the values will normally all be of the same kind.@@@@1@13@@danf@17-8-2009 10220140@unknown@formal@none@1@S@However, there may also be "[[missing values]]", which need to be indicated in some way.@@@@1@15@@danf@17-8-2009 10220150@unknown@formal@none@1@S@In [[statistics]] data sets usually come from actual observations obtained by [[sampling (statistics)|sampling]] a [[statistical population]], and each row corresponds to the observations on one element of that population.@@@@1@29@@danf@17-8-2009 10220160@unknown@formal@none@1@S@Data sets may further be generated by [[algorithms]] for the purpose of testing certain kinds of [[software]].@@@@1@17@@danf@17-8-2009 10220170@unknown@formal@none@1@S@Some modern statistical analysis software such as [[PSPP]] still present their data in the classical dataset fashion.@@@@1@17@@danf@17-8-2009 10220180@unknown@formal@none@1@S@== Classic data sets ==@@@@1@5@@danf@17-8-2009 10220190@unknown@formal@none@1@S@Several classic [[data set]]s have been used extensively in the [[statistical]] literature:@@@@1@12@@danf@17-8-2009 10220200@unknown@formal@none@1@S@* [[Iris flower data set]] - multivariate data set introduced by [[Ronald Fisher]] (1936).@@@@1@14@@danf@17-8-2009 10220210@unknown@formal@none@1@S@* ''[[Categorical data analysis]]'' - Data sets used in the book, ''An Introduction to Categorical Data Analysis'', by Agresti are [http://lib.stat.cmu.edu/datasets/agresti provided on-line by StatLib.]@@@@1@25@@danf@17-8-2009 10220220@unknown@formal@none@1@S@*''[[Robust statistics]]'' - Data sets used in ''Robust Regression and Outlier Detection'' (Rousseeuw and Leroy, 1986). [http://www.uni-koeln.de/themen/Statistik/data/rousseeuw/ Provided on-line at the University of Cologne.]@@@@1@24@@danf@17-8-2009 10220230@unknown@formal@none@1@S@*''[[Time series]]'' - Data used in Chatfield's book, ''The Analysis of Time Series'', are [http://lib.stat.cmu.edu/modules.php?op=modload&name=PostWrap&file=index&page=datasets/ provided on-line by StatLib.]@@@@1@19@@danf@17-8-2009 10220240@unknown@formal@none@1@S@*''Extreme values'' - Data used in the book, ''An Introduction to the Statistical Modeling of Extreme Values'' are [http://homes.stat.unipd.it/coles/public_html/ismev/ismev.dat provided on-line by Stuart Coles], the book's author.@@@@1@27@@danf@17-8-2009 10220250@unknown@formal@none@1@S@*''Bayesian Data Analysis'' - Data used in the book, ''[[Bayesian]] Data Analysis'', are [http://www.stat.columbia.edu/~gelman/book/data/ provided on-line by Andrew Gelman], one of the book's authors.@@@@1@24@@danf@17-8-2009 10220260@unknown@formal@none@1@S@* The [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders Bupa liver data], used in several papers in the machine learning (data mining) literature.@@@@1@17@@danf@17-8-2009 10230010@unknown@formal@none@1@S@
ELIZA
@@@@1@1@@danf@17-8-2009 10230020@unknown@formal@none@1@S@'''ELIZA''' is a [[computer program]] by [[Joseph Weizenbaum]], designed in [[1966]], which parodied a [[Rogerian psychotherapy|Rogerian therapist]], largely by rephrasing many of the patient's statements as questions and posing them to the patient.@@@@1@33@@danf@17-8-2009 10230030@unknown@formal@none@1@S@Thus, for example, the response to "My head hurts" might be "Why do you say your head hurts?"@@@@1@18@@danf@17-8-2009 10230040@unknown@formal@none@1@S@The response to "My mother hates me" might be "Who else in your family hates you?"@@@@1@16@@danf@17-8-2009 10230050@unknown@formal@none@1@S@ELIZA was named after Eliza Doolittle, a working-class character in [[George Bernard Shaw|George Bernard Shaw's]] play ''[[Pygmalion (play)|Pygmalion]]'', who is taught to speak with an [[upper class]] [[accent (linguistics)|accent]].@@@@1@29@@danf@17-8-2009 10230060@unknown@formal@none@1@S@==Overview==@@@@1@1@@danf@17-8-2009 10230070@unknown@formal@none@1@S@It is sometimes inaccurately said that ELIZA simulates a therapist.@@@@1@10@@danf@17-8-2009 10230080@unknown@formal@none@1@S@Weizenbaum said that ELIZA provided a "[[parody]]" of "the responses of a non-directional psychotherapist in an initial psychiatric interview."@@@@1@19@@danf@17-8-2009 10230090@unknown@formal@none@1@S@He chose the context of psychotherapy to "sidestep the problem of giving the program a data base of real-world knowledge", the therapeutic situation being one of the few real human situations in which a human being can reply to a statement with a question that indicates very little specific knowledge of the topic under discussion.@@@@1@55@@danf@17-8-2009 10230100@unknown@formal@none@1@S@For example, it is a context in which the question "Who is your favorite composer?" can be answered acceptably with responses such as "What about your own favorite composer?" or "Does that question interest you?"@@@@1@35@@danf@17-8-2009 10230110@unknown@formal@none@1@S@First implemented in Weizenbaum's own [[SLIP (programming language)|SLIP]] list-processing language, ELIZA worked by simple [[parsing]] and substitution of key words into canned phrases.@@@@1@23@@danf@17-8-2009 10230120@unknown@formal@none@1@S@Depending upon the initial entries by the user the illusion of a human writer could be instantly dispelled, or could continue through several interchanges.@@@@1@24@@danf@17-8-2009 10230130@unknown@formal@none@1@S@It was sometimes so convincing that there are many anecdotes about people becoming very emotionally caught up in dealing with ELIZA for several minutes until the machine's true lack of understanding became apparent.@@@@1@33@@danf@17-8-2009 10230140@unknown@formal@none@1@S@This was likely due to people's tendency to attach meanings to words which the computer never put there.@@@@1@18@@danf@17-8-2009 10230150@unknown@formal@none@1@S@In 1966, interactive computing (via a teletype) was new.@@@@1@9@@danf@17-8-2009 10230160@unknown@formal@none@1@S@It was 15 years before the personal computer became familiar to the general public, and two decades before most people encountered attempts at [[natural language processing]] in Internet services like [[Ask.com]] or PC help systems such as Microsoft Office [[Office Assistant|Clippy]].@@@@1@41@@danf@17-8-2009 10230170@unknown@formal@none@1@S@Although those programs included years of research and work (while ''[[Ecala]]'' eclipsed the functionality of ''ELIZA'' after less than two weeks of work by a single programmer), ''ELIZA'' remains a milestone simply because it was the first time a programmer had attempted such a human-machine interaction with the goal of creating the illusion (however brief) of human-''human'' interaction.@@@@1@58@@danf@17-8-2009 10230180@unknown@formal@none@1@S@In the article "theNewMediaReader" an excerpt from "From Computer Power and Human Reason" by Joseph Weizenbaum in 1976, edited by Noah Wardrip-Fruin and Nick Montfort he references how quickly and deeply people became emotionally involved with the computer program, taking offence when he asked to view the transcripts, saying it was an invasion of their privacy, even asking him to leave the room while they were working with ELIZA.@@@@1@69@@danf@17-8-2009 10230190@unknown@formal@none@1@S@==Influence on games==@@@@1@3@@danf@17-8-2009 10230200@unknown@formal@none@1@S@ELIZA impacted a number of early [[computer games]] by demonstrating additional kinds of [[interface design]]s.@@@@1@15@@danf@17-8-2009 10230210@unknown@formal@none@1@S@[[Don Daglow]] wrote an enhanced version of the program called ''Ecala'' on a [[PDP-10]] [[mainframe computer]] at [[Pomona College]] in [[1973]] before writing what was possibly the second or third computer [[role-playing game]], ''[[Dungeon (computer game)|Dungeon]]'' ([[1975]]) (The first was probably "[[dnd (computer game)|dnd]]", written on and for the PLATO system in 1974, and the second may have been [[Moria]], written in 1975).@@@@1@63@@danf@17-8-2009 10230220@unknown@formal@none@1@S@It is likely that ''ELIZA'' was also on the system where [[Will Crowther]] created ''[[Colossal Cave Adventure|Adventure]]'', the 1975 game that spawned the [[interactive fiction]] genre.@@@@1@26@@danf@17-8-2009 10230230@unknown@formal@none@1@S@But both these games appeared some nine years after the original ''ELIZA''.@@@@1@12@@danf@17-8-2009 10230240@unknown@formal@none@1@S@==Response and legacy==@@@@1@3@@danf@17-8-2009 10230250@unknown@formal@none@1@S@Lay responses to ELIZA were disturbing to Weizenbaum and motivated him to write his book ''Computer Power and Human Reason: From Judgment to Calculation'', in which he explains the limits of computers, as he wants to make clear in people's minds his opinion that the anthropomorphic views of computers are just a reduction of the human being and any life form for that matter.@@@@1@64@@danf@17-8-2009 10230260@unknown@formal@none@1@S@There are many programs based on ELIZA in different languages in addition to ''Ecala''.@@@@1@14@@danf@17-8-2009 10230270@unknown@formal@none@1@S@For example, in 1980, a company called "Don't Ask Software", founded by Randy Simon, created a version for the Apple II, Atari, and Commodore PCs, which verbally abused the user based on the user's input.@@@@1@35@@danf@17-8-2009 10230280@unknown@formal@none@1@S@In Spain, Jordi Perez developed the famous ZEBAL in 1993, written in [[Clipper programming language|Clipper]] for MS-DOS.@@@@1@17@@danf@17-8-2009 10230290@unknown@formal@none@1@S@Other versions adapted ELIZA around a religious theme, such as ones featuring Jesus (both serious and comedic) and another Apple II variant called ''I Am Buddha''.@@@@1@26@@danf@17-8-2009 10230300@unknown@formal@none@1@S@The 1980 game ''[[The Prisoner (computer game)|The Prisoner]]'' incorporated ELIZA-style interaction within its gameplay.@@@@1@14@@danf@17-8-2009 10230310@unknown@formal@none@1@S@ELIZA has also inspired a [[podcast]] called "The Eliza Podcast", in which the host engages in self-analysis using a computer generated voice prompting with questions in the same style as the ELIZA program.@@@@1@33@@danf@17-8-2009 10230320@unknown@formal@none@1@S@==Implementations==@@@@1@1@@danf@17-8-2009 10230330@unknown@formal@none@1@S@* Using [[JavaScript]]: http://www.manifestation.com/neurotoys/eliza.php3@@@@1@4@@danf@17-8-2009 10230340@unknown@formal@none@1@S@* Source code in [[Java (programming language)|Java]]: http://chayden.net/eliza/Eliza.html@@@@1@8@@danf@17-8-2009 10230350@unknown@formal@none@1@S@* Another [[Java (programming language)|Java]]-implementation of ELIZA: http://www.wedesoft.demon.co.uk/eliza/@@@@1@8@@danf@17-8-2009 10230360@unknown@formal@none@1@S@* Using [[C (programming language)|C]] on the [[TI-89]]: http://kaikostack.com/ti89_en.htm#eliza@@@@1@9@@danf@17-8-2009 10230370@unknown@formal@none@1@S@* Using [[z80#The Z80 assembly language|z80 Assembly]] on the [[TI-83#TI-83 Plus|TI-83 Plus]]: http://www.ticalc.org/archives/files/fileinfo/354/35463.html@@@@1@13@@danf@17-8-2009 10230380@unknown@formal@none@1@S@* A [[perl module]] [http://search.cpan.org/dist/Chatbot-Eliza/ Chatbot::Eliza] — [http://www.terrence.com/perl/eliza/eliza.cgi example implementation]@@@@1@10@@danf@17-8-2009 10230390@unknown@formal@none@1@S@* Trans-Tex Software has released shareware versions for Classic Mac OS and Mac OS X: http://www.tex-edit.com/index.html#Eliza@@@@1@16@@danf@17-8-2009 10230400@unknown@formal@none@1@S@* doctor.el (circa [[1985]]) in [[Emacs]].@@@@1@6@@danf@17-8-2009 10230410@unknown@formal@none@1@S@* Source code in [[Tcl]]: [http://wiki.tcl.tk/9235 http://wiki.tcl.tk/9235]@@@@1@7@@danf@17-8-2009 10230420@unknown@formal@none@1@S@* The [http://www.indyproject.org Indy] [[Delphi]] oriented TCP/IP components suite has an Eliza implementation as demo.@@@@1@15@@danf@17-8-2009 10230430@unknown@formal@none@1@S@*[http://www.cs.bham.ac.uk/research/projects/cogaff/eliza Pop-11 Eliza] in the [[poplog]] system.@@@@1@7@@danf@17-8-2009 10230440@unknown@formal@none@1@S@Goes back to about 1976, when it was used for teaching AI at [[Sussex University]].@@@@1@15@@danf@17-8-2009 10230450@unknown@formal@none@1@S@Now part of the free open source Poplog system.@@@@1@9@@danf@17-8-2009 10230460@unknown@formal@none@1@S@* Source code in [[BASIC]]: http://www.atariarchives.org/bigcomputergames/showpage.php?page=22@@@@1@6@@danf@17-8-2009 10230470@unknown@formal@none@1@S@* ECC-Eliza for Windows (actual program is for DOS, but unpacker is for Windows) (rename .txt to .exe before running): http://www5.domaindlx.com/ecceliza1/ecceliza.txt.@@@@1@21@@danf@17-8-2009 10230480@unknown@formal@none@1@S@More recent version at http://web.archive.org/web/20041117123025/http://www5.domaindlx.com/ecceliza1/ecceliza.txt.@@@@1@5@@danf@17-8-2009