RocketGit

caubert / CSCI_3410 (public) (License: CC BY 4.0) (since 2018-05-16) (hash sha1)

Material for Database Class.

Clone URLs: https://rocketgit.com/user/caubert/CSCI_3410 ssh://rocketgit@ssh.rocketgit.com/user/caubert/CSCI_3410 git://git.rocketgit.com/user/caubert/CSCI_3410

master

List of commits:

Subject	Hash	Author	Date (UTC)
Worked on nosql chapter.	38dfcfffce7c06d0eab7256a7a5be1e2481a8505	aubert@math.cnrs.fr	2020-04-20 17:52:55
Cleaned up bib file.	505b1000bca31e4d833bf7739d02d2b7e727e69b	aubert@math.cnrs.fr	2020-04-20 05:03:18
rapid adjustments in contrib	beb43332953ec4e2a8e4376a90ce58b1234eadba	aubert@math.cnrs.fr	2020-04-20 00:10:44
Updated CONTRIB.md	c29d920e1efbc84766a3caafc0db2fcab4220b32	pveeral@augusta.edu	2020-04-19 19:28:45
testing	4ece7ba3d5c5d99361ef5eac92bb0848f2ea5318	pveeral@augusta.edu	2020-04-19 18:27:50
Small edit, correcting maefile.	7baa188be7d322e5288b498afbb7beaa96a9770b	aubert@math.cnrs.fr	2020-04-19 07:36:08
Cleaned latex files.	79b68f7b709ddeebc8133f7962fe5aabb3376304	aubert@math.cnrs.fr	2020-04-19 06:22:30
Minor corrections in installation manual.	436cee8616c25ccbed8bc406d988c2b4d28420f8	aubert@math.cnrs.fr	2020-04-19 06:19:04
Minor corrections in installation manual.	cb8cdfbd506a1344c81aecda055165cc1ca54ece	aubert@math.cnrs.fr	2020-04-19 06:17:52
Working on install manual.	3702c6437ee163eb4a61b4d69cffee8c8a76dc3d	aubert@math.cnrs.fr	2020-04-19 06:04:22
Worked on makefiles and example file.	4255d5e85bb684349f7f7798455dd8b3a273254b	aubert@math.cnrs.fr	2020-04-19 04:56:53
Re-idented some of the code.	124375e6bed1edb96d1bb4bcec8f111c8a3a1197	aubert@math.cnrs.fr	2020-04-19 03:10:02
Java indentation	2b317a12b7ab52bdca576a1bb46b2a2ce295464f	guest	2020-04-18 22:21:04
test	6fefa044794ff1d74a3d2493556c836b3dd97e74	guest	2020-04-18 22:18:44
Java indentation	5b0e0eb38484a8c67517a36a438f148bd5efa740	guest	2020-04-18 22:14:01
Worked on install notes.	b46b931ef11e3cb7dfe87c7f91ec9d5c558567e6	aubert@math.cnrs.fr	2020-04-17 05:22:20
Started to integrate installation manual to notes.	fd27b7686dd4c9d99163cf7badc720cd4a050221	aubert@math.cnrs.fr	2020-04-17 04:09:11
Replaced picture with text in Naming_Convention.md.	aff8c98c70b0834f9b0f076b881975daf3cdda03	aubert@math.cnrs.fr	2020-04-17 03:48:38
Testing.	b1a0942b15742ce987e4ad63848e9e2afcdde7ae	aubert@math.cnrs.fr	2020-04-17 02:03:52
Added explanation on the importance of alt text in known bugs.	3bddd86e1072b095a9f9d02f8c035f8d8f3e7155	aubert@math.cnrs.fr	2020-04-17 01:44:51

Commit 38dfcfffce7c06d0eab7256a7a5be1e2481a8505 - Worked on nosql chapter.
Author: aubert@math.cnrs.fr
Author date (UTC): 2020-04-20 17:52
Committer name: aubert@math.cnrs.fr
Committer date (UTC): 2020-04-20 17:52
Parent(s): 505b1000bca31e4d833bf7739d02d2b7e727e69b
Signer:
Signing key:
Signing status: N
Tree: 3545d47614b0c3e0039f7f7da48e447cbafbd3cb

File	Lines added	Lines deleted
notes/lectures_notes.md	73	58

File notes/lectures_notes.md changed (mode: 100644) (index 6a34f7d..0a9f0bc)
...	...	To write this chapter, were used
8635	8635
8636	8636	## A Bit of History	## A Bit of History
8637	8637
8638		Inspired from [@NoSQLDistilled, Chap. 1]
	8638		This part is partially inspired from [@NoSQLDistilled, Chap. 1], but it has been further updated.
8639	8639
8640	8640	### Database Applications and Application Databases	### Database Applications and Application Databases
8641	8641
8642		When you write a Database application, you have two options:
	8642		When you write a database application, you have two options:
8643	8643
8644		#. One database for many softwares
8645		#. One database for each softwares
	8644		#. One database for multiple applications,
	8645		#. One database for each application.
8646	8646
8647		The first option can cause severe impacts on the efficiency of your database: since maintening the integrity of the database is a requirement, a lot of synchronization is needed.
8648		With the second option, you develop an "application database", and you have more freedom of choice: since only a program interact with a database, you can chose whatever data management you want.
	8647		The first option can severely impact the efficiency of your database: since maintaining the integrity of the database is a requirement, a lot of synchronization is needed, and your database becomes a bottleneck.
	8648		With the second option, you develop an "application database" (i.e., a database dedicated to a particular application), and you have more freedom in the design, schema, and even DBMS (you can use one particular software solution for one particular database application, and a different one for a different database application).
8649	8649
8650		But people were attached to `SQL` and kept using it.
	8650		### Clusters, Clusters…
8651	8651
8652		### Clusters, clusters…
	8652		The increase in everything (traffic, size of data, number of clients, etc.) meant "up or out", and raised numerous challenges for the "one database for multiple applications" option.
	8653		There were two ways to increase the resources and to scale up:
8653	8654
8654		Increase in everything (traffic, size of data, number of clients, etc.) meant "up or out", and there was two ways to increase the resources:
	8655		#. Bigger machines,
	8656		#. More machines.
8655	8657
8656		#. Bigger machines
8657		#. More machines
	8658		The second option was generally less expensive (compare buying 1,000 Raspberry Pi computers with buying one supercomputer that is not a cluster of more modest computers), but came with two drawbacks w.r.t. databases:
8658	8659
8659		The second option was generally less expensive, but came with two drawbacks w.r.t. databases:
8660
8661		#. Cost of licences,
8662		#. Force to perform "unnatural acts": relational model are really not made to be distributed
	8660		#. The cost of licences was excessive (indeed, you had to buy one licence per computer),
	8661		#. and it forced users to perform "unnatural acts": relational models are really not made to be distributed.
8663	8662
8664	8663	### A First Shift	### A First Shift
8665	8664
	8665		Developing DBMS better suited to distributed architectures became increasingly important, and some companies took a stab at it.
	8666		The most important attempts were:
	8667
8666	8668	- [Google Big Table](https://cloud.google.com/bigtable/), 2004 (made public in … 2015!) [@Chang2006]	- [Google Big Table](https://cloud.google.com/bigtable/), 2004 (made public in … 2015!) [@Chang2006]
8667	8669	- [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), 2004 (used in Simple Storage Service (S3) in 2007)	- [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), 2004 (used in Simple Storage Service (S3) in 2007)
8668	8670	- Facebook's Cassandra is sometimes mentioned, but it came later on, around 2009 [@Lakshman2009].	- Facebook's Cassandra is sometimes mentioned, but it came later on, around 2009 [@Lakshman2009].
8669	8671
8670		Particular, big company, with specific needs, but people interrested in solving some of their problems.
8671		Now, people started to think that there could be other ways.
8672
8673		One goal was to get rid of "impedance mismatch": mapping classes or objects to database tables defined by a relational schema is complex and cumbersome.
8674
8675		Some issues:
	8672		These solutions were suited to the very specific needs of those big companies.
	8673		But it was interesting to see SQL's supremacy being questioned.
8676	8674
8677		- No absolute notion of "private" and "public" in RDBMS (relative to needs)
8678		- Data-type differences (no pointer, weird way of defining string, etc.)
8679		- Value in a relational structure have to be simple (no complex datatype, no structure)
	8675		One of the goals was to get rid of "impedance mismatch": mapping classes or objects to database tables defined by a relational schema is complex and cumbersome.
	8676		However, if you want your database application to go naturally from its data representation to the representation in the DBMS, solving this issue becomes critical.
	8677		Among the issues:
8680	8678
8681		"Impedance mismatch" is that annoying need for a translation.
	8679		- There is no absolute notion of "private" and "public" in RDBMS (relative to needs),
	8680		- There are many differences in the data types (no pointers, weird ways of defining strings, etc.),
	8681		- The values in a relational structure have to be simple (no complex datatype, no structure).
8682	8682
8683		Also, the data is now
	8683		The term "impedance mismatch" describes that annoying need for a translation, and one of the goals of this first shift was to get rid of it.
8684	8684
8685		- moving
8686		- growing
8687		- too diverse
8688
8689		for traditional relational DBMS.
	8685		Also, data was now moving, growing fast, and extremely diverse, and traditional relational DBMS did not seem well suited to handle those changes.
8690	8686
8691	8687	### Gathering Forces	### Gathering Forces
8692	8688
8693		Multiple attempts, going in multiple directions.
8694		A meetup to discuss them coined the term "NoSQL" in an attempt to have a "twittable" hashtag, and it stayed (even it is as specific as describing a dog with "no-cat").
	8689		To renew the world of DBMS, there were multiple attempts, going in multiple directions.
	8690		A meetup to discuss them coined the term "NoSQL" in an attempt to have a "twittable" hashtag, and it stayed (even if it is as specific as describing a dog as "not being a cat").
8695	8691	The original meet-up asked for "open-source, distributed, nonrelational database".	The original meet-up asked for "open-source, distributed, nonrelational database".
8696		Today, no official definition, but NoSQL often implies the followig:
	8692		Today, there is no "official" definition of NoSQL, but NoSQL often implies the following:
8697	8693
8698		- No relational model
8699		- Not using `SQL`. Some still have a query language, and it ressembles `SQL` (to minimize learning cost), for instance Cassandra's CQL.
8700		- Run well on clusters
8701		- Schemaless: you can add records without having to define a change in the structure first.
	8694		- No relational model,
	8695		- Not using `SQL`. Some still have a query language, and it resembles `SQL` (to minimize learning cost), for instance Cassandra's CQL,
	8696		- Run well on clusters,
	8697		- Schemaless: you can add records without having to define a change in the structure first,
8702	8698	- Open source.	- Open source.
8703	8699
8704		Most importantly: polyglot persistence, "using different data storage technologies to handle varying data storage needs."
	8700		Another important notion that emerged was the notion of "polyglot persistence", which is the idea of "using different data storage technologies to handle varying data storage needs."
	8701		In other terms, if you adopt the "application database" approach (i.e., one database dedicated to one particular application), then you can use the DBMS A for your application 1, and the DBMS B for your application 2, or even use A and B for the same application!
8705	8702
8706	8703	### The Future or the Past?	### The Future or the Past?
8707	8704
8708		A lot of enthusiasm, also because it "frees the data" (and, actually, the metadata, cf. application/ld+json, JavaScript Object Notation for Linked Data, schema.org, etc.).
	8705		There was a lot of enthusiasm, also because this approach "frees the data" (and, actually, the metadata, cf. application/ld+json, JavaScript Object Notation for Linked Data, schema.org, etc.): sharing e.g. a `json` file is much easier than sharing a `SQL` view along with its schema (the example in the [Document-Oriented Database](#document-oriented-database) section will make it clearer).
	8706
8709	8707	Some of it will last for sure: polyglot persistency, the possibility of being schema-less, being "distributed first", the possibility of sacrificing consistency for greater good, etc.	Some of it will last for sure: polyglot persistency, the possibility of being schema-less, being "distributed first", the possibility of sacrificing consistency for greater good, etc.
8710		Does not mean `SQL` ("OldSQL") and relational database are over: still useful in many scenario, and the powerfull query language is great (writing your own every time is a nightmare…).
	8708		This does not mean that `SQL` ("OldSQL") and relational databases are over: they are still useful in many scenarios, and the powerful query language is great (writing your own every time is a nightmare…).
8711	8709
8712	8710	Starting ~ 2010, one reaction was to develop "NewSQL", which would combine aspects of both approaches.	Starting ~ 2010, one reaction was to develop "NewSQL", which would combine aspects of both approaches.
8713		Starting ~ 2010, one reaction was to develop "NewSQL", which would combine aspects of both approaches.
8714		MongoDB announced that it would have more and more of the ACID properties! <https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb>
	8711		For example, having to drop the ACID requirements (detailed [in this Section](#sec:AcidVsCAP)) was often seen as a major drawback, but [MongoDB announced](https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb) that it would have more and more of the ACID properties!
8715	8712
8716	8713	Also, a really great use of NoSQL is to adopt it at an early stage of the development, when it is not clear what the schemas should be.	Also, a really great use of NoSQL is to adopt it at an early stage of the development, when it is not clear what the schemas should be.
8717	8714	When the schemas are final, then you can shift to relational DBMS!	When the schemas are final, then you can shift to relational DBMS!

...	...	The retro-acronym "Not Only `SQL`" emphasizes that `SQL` will still be one of th
8720	8717
8721	8718	## Comparison	## Comparison
8722	8719
	8720		`SQL` and the NoSQL approach can be compared in many different ways.
	8721		Note that there is no "best tool": it would be like trying to decide whether a hammer is better than a saw. The answer is "it depends on what you want to do with it!".
	8722		But you can use relational and non-relational DBMS for different purposes, sometimes, again, within the same application ("polyglot persistency").
	8723
8723	8724	### Overview	### Overview
8724	8725
8725	8726	« Comparaison n'est pas raison »^[A French proverb, meaning that "things should be judged on the individual qualities they posses, rather than by comparing one with another." [@FactsOnFile]]	« Comparaison n'est pas raison »^[A French proverb, meaning that "things should be judged on the individual qualities they posses, rather than by comparing one with another." [@FactsOnFile]]
8726	8727
8727		- Semi-structured data (no schema)
8728		- High performance
8729		- Availability
8730		- Data Replication (improves availability and performance)
8731		- Scalability (horizontal scalabality (add nodes) instead of vertical (add memory))
8732		- Eventual Consistency
8733		- Natively versionning
	8728		NoSQL
	8729		~
	8730
	8731		- Semi-structured data (no schema)
	8732		- High performance
	8733		- Availability
	8734		- Data Replication (improves availability and performance)
	8735		- Scalability (horizontal scalability (add nodes) instead of vertical (add memory))
	8736		- Eventual Consistency
	8737		- Native versioning
8734	8738
8735		Vs
	8739		SQL
	8740		~
8736	8741
8737		- Immediate data consistency
8738		- Powerfull query language (for instance, join is often missing in NoSQL, has to be implemented on the application-side)
8739		- Structured data storage (can be too restrictive)
	8742		- Immediate data consistency
	8743		- Powerful query language (for instance, join is often missing in NoSQL, has to be implemented on the application-side)
	8744		- Structured data storage (can be too restrictive)
8740	8745
8741	8746	### ACID vs CAP vs BASE {#sec:AcidVsCAP}	### ACID vs CAP vs BASE {#sec:AcidVsCAP}
8742	8747
	8748		ACID and BASE are two acronyms capturing desirable features of DBMS, while CAP is a theorem stating the impossibility to have some desirable properties at the same time in distributed systems.
	8749
8743	8750	ACID is the guarantee of validity even in the event of errors, power failures, etc.	ACID is the guarantee of validity even in the event of errors, power failures, etc.
8744	8751
8745	8752	- Atomicity → Transactions are all or nothing	- Atomicity → Transactions are all or nothing

...	...	ACID is the guarantee of validity even in the event of errors, power failures, e
8749	8756
8750	8757	CAP (a.k.a. Brewer's theorem): Roughly, "In a distributed system, one has to choose between consistency (every read receives the most recent write or an error) and availability (every request receives a (non-error) response, without guarantee that it contains the most recent write)" (the P. standing for "Partition tolerance", the requirement that the system keeps operating even when the network drops or delays messages between nodes).	CAP (a.k.a. Brewer's theorem): Roughly, "In a distributed system, one has to choose between consistency (every read receives the most recent write or an error) and availability (every request receives a (non-error) response, without guarantee that it contains the most recent write)" (the P. standing for "Partition tolerance", the requirement that the system keeps operating even when the network drops or delays messages between nodes).
8751	8758
8752		BASE is Basic Availability, Soft state, Eventual consistency.
	8759		BASE (also formulated by Brewer) corresponds to Basic Availability, Soft state, Eventual consistency.
	8760		It is a series of properties that can be reached by distributed systems, including NoSQL systems, and is often seen as the "NoSQL's version of ACID".
	8761		This [answer](https://stackoverflow.com/a/3382260) gives some insight into its meaning.
8753	8762
8754	8763	## Categories of NoSQL Systems	## Categories of NoSQL Systems
8755	8764
	8765		There are multiple ways to be "non-relational".
	8766		A rough hierarchy of the different approaches can be sketched as follows.
	8767
8756	8768	Model \| Description \| Examples \|	Model \| Description \| Examples \|
8757	8769	--- \| --- \| ---	--- \| --- \| ---
8758		Document-based \| Data is stored as "documents" (JSON, for instance), accessible via their ID (other indexes available). \| [Apache CouchDB](https://couchdb.apache.org/) (simble for web applications, and reliable), [MongoDB](https://www.mongodb.com/) (easy to operate), [Couchbase](https://www.couchbase.com/) (high concurrency, and high availability).
8759		Key-value stores \| Fast access by the key to the value. Value can be a record, an object, a document, or be even more complex. \| [Redis](https://redis.io/) (in-memory but persistent on disk database, stores everything in the RAM!)
	8770		Document-based \| Data is stored as "documents" (JSON, for instance), accessible via their ID (or via other indexes). \| [Apache CouchDB](https://couchdb.apache.org/) (simple for web applications, and reliable), [MongoDB](https://www.mongodb.com/) (easy to operate), [Couchbase](https://www.couchbase.com/) (high concurrency, and high availability).
	8771		Key-value stores \| Fast access by the key to the value. Value can be a record, an object, a document, or be more complex. \| [Redis](https://redis.io/) (in-memory but persistent on disk database, stores everything in the RAM!)
8760	8772	Column-based (a.k.a. wide column) \| Partition a table by columns into column families, where each column family is stored in its own files. \| [Cassandra](https://cassandra.apache.org/), [HBase](https://hbase.apache.org/) (both for huge amounts of data)	Column-based (a.k.a. wide column) \| Partition a table by columns into column families, where each column family is stored in its own files. \| [Cassandra](https://cassandra.apache.org/), [HBase](https://hbase.apache.org/) (both for huge amounts of data)
8761		Graph-based \| Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions. \| [Neo4J](https://neo4j.com/) (excellent for pattern recognition, and data mining).
	8773		Graph-based \| Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions. \| [Neo4J](https://neo4j.com/) (excellent for pattern recognition, and data mining)
8762	8774	Multi-model \| Support multiple data models \| [Apache Ignite](https://ignite.apache.org/), [ArangoDB](https://www.arangodb.com/), etc.	Multi-model \| Support multiple data models \| [Apache Ignite](https://ignite.apache.org/), [ArangoDB](https://www.arangodb.com/), etc.
8763	8775
	8776		---
	8777
8764	8778	## MongoDB	## MongoDB
8765	8779
8766	8780	### Resources	### Resources

...	...	Multi-model \| Support multiple data models \| [Apache Ignite](https://ignite.apa
8780	8794	- [@NoSQLDistilled, ch. 9]	- [@NoSQLDistilled, ch. 9]
8781	8795	- [@Sullivan2015, ch. 6]	- [@Sullivan2015, ch. 6]
8782	8796
	8797
8783	8798	### Introduction	### Introduction
8784	8799
8785	8800	MongoDB is	MongoDB is

You are allowed to anonymously push to this repository.
This means that your pushed commits will automatically be transformed into a merge request:

... clone the repository ...
... make some changes and some commits ...
git push origin main