List of commits:
Subject Hash Author Date (UTC)
Worked on nosql chapter. 38dfcfffce7c06d0eab7256a7a5be1e2481a8505 aubert@math.cnrs.fr 2020-04-20 17:52:55
Cleaned up bib file. 505b1000bca31e4d833bf7739d02d2b7e727e69b aubert@math.cnrs.fr 2020-04-20 05:03:18
rapid adjustments in contrib beb43332953ec4e2a8e4376a90ce58b1234eadba aubert@math.cnrs.fr 2020-04-20 00:10:44
Updated CONTRIB.md c29d920e1efbc84766a3caafc0db2fcab4220b32 pveeral@augusta.edu 2020-04-19 19:28:45
testing 4ece7ba3d5c5d99361ef5eac92bb0848f2ea5318 pveeral@augusta.edu 2020-04-19 18:27:50
Small edit, correcting maefile. 7baa188be7d322e5288b498afbb7beaa96a9770b aubert@math.cnrs.fr 2020-04-19 07:36:08
Cleaned latex files. 79b68f7b709ddeebc8133f7962fe5aabb3376304 aubert@math.cnrs.fr 2020-04-19 06:22:30
Minor corrections in installation manual. 436cee8616c25ccbed8bc406d988c2b4d28420f8 aubert@math.cnrs.fr 2020-04-19 06:19:04
Minor corrections in installation manual. cb8cdfbd506a1344c81aecda055165cc1ca54ece aubert@math.cnrs.fr 2020-04-19 06:17:52
Working on install manual. 3702c6437ee163eb4a61b4d69cffee8c8a76dc3d aubert@math.cnrs.fr 2020-04-19 06:04:22
Worked on makefiles and example file. 4255d5e85bb684349f7f7798455dd8b3a273254b aubert@math.cnrs.fr 2020-04-19 04:56:53
Re-idented some of the code. 124375e6bed1edb96d1bb4bcec8f111c8a3a1197 aubert@math.cnrs.fr 2020-04-19 03:10:02
Java indentation 2b317a12b7ab52bdca576a1bb46b2a2ce295464f guest 2020-04-18 22:21:04
test 6fefa044794ff1d74a3d2493556c836b3dd97e74 guest 2020-04-18 22:18:44
Java indentation 5b0e0eb38484a8c67517a36a438f148bd5efa740 guest 2020-04-18 22:14:01
Worked on install notes. b46b931ef11e3cb7dfe87c7f91ec9d5c558567e6 aubert@math.cnrs.fr 2020-04-17 05:22:20
Started to integrate installation manual to notes. fd27b7686dd4c9d99163cf7badc720cd4a050221 aubert@math.cnrs.fr 2020-04-17 04:09:11
Replaced picture with text in Naming_Convention.md. aff8c98c70b0834f9b0f076b881975daf3cdda03 aubert@math.cnrs.fr 2020-04-17 03:48:38
Testing. b1a0942b15742ce987e4ad63848e9e2afcdde7ae aubert@math.cnrs.fr 2020-04-17 02:03:52
Added explanation on the importance of alt text in known bugs. 3bddd86e1072b095a9f9d02f8c035f8d8f3e7155 aubert@math.cnrs.fr 2020-04-17 01:44:51
Commit 38dfcfffce7c06d0eab7256a7a5be1e2481a8505 - Worked on nosql chapter.
Author: aubert@math.cnrs.fr
Author date (UTC): 2020-04-20 17:52
Committer name: aubert@math.cnrs.fr
Committer date (UTC): 2020-04-20 17:52
Parent(s): 505b1000bca31e4d833bf7739d02d2b7e727e69b
Signer:
Signing key:
Signing status: N
Tree: 3545d47614b0c3e0039f7f7da48e447cbafbd3cb
File Lines added Lines deleted
notes/lectures_notes.md 73 58
File notes/lectures_notes.md changed (mode: 100644) (index 6a34f7d..0a9f0bc)
... ... To write this chapter, were used
8635 8635
8636 8636 ## A Bit of History
8637 8637
8638 Inspired from [@NoSQLDistilled, Chap. 1]
8638 This part is partially inspired by [@NoSQLDistilled, Chap. 1], but it has been further updated.
8639 8639
8640 8640 ### Database Applications and Application Databases
8641 8641
8642 When you write a Database application, you have two options:
8642 When you write a database application, you have two options:
8643 8643
8644 #. One database for many softwares
8645 #. One database for each softwares
8644 #. One database for multiple applications,
8645 #. One database for each application.
8646 8646
8647 The first option can cause severe impacts on the efficiency of your database: since maintening the integrity of the database is a requirement, a lot of synchronization is needed.
8648 With the second option, you develop an "application database", and you have more freedom of choice: since only a program interact with a database, you can chose whatever data management you want.
8647 The first option can severely impact the efficiency of your database: since maintaining the integrity of the database is a requirement, a lot of synchronization is needed, and your database becomes a bottleneck.
8648 With the second option, you develop an "application database" (i.e., a database dedicated to a particular application), and you have more freedom in the design, schema, and even DBMS (you can use one particular software solution for one particular database application, and a different one for a different database application).
8649 8649
8650 But people were attached to `SQL` and kept using it.
8650 ### Clusters, Clusters…
8651 8651
8652 ### Clusters, clusters…
8652 The increase in everything (traffic, size of data, number of clients, etc.) meant "up or out", and raised numerous challenges for the "one database for multiple applications" option.
8653 There were two ways to increase the resources and to scale up:
8653 8654
8654 Increase in everything (traffic, size of data, number of clients, etc.) meant "up or out", and there was two ways to increase the resources:
8655 #. Bigger machines,
8656 #. More machines.
8655 8657
8656 #. Bigger machines
8657 #. More machines
8658 The second option was generally less expensive (compare buying 1,000 Raspberry Pi computers with buying one supercomputer that is not a cluster of more modest computers), but came with two drawbacks w.r.t. databases:
8658 8659
8659 The second option was generally less expensive, but came with two drawbacks w.r.t. databases:
8660
8661 #. Cost of licences,
8662 #. Force to perform "unnatural acts": relational model are really not made to be distributed
8660 #. The cost of licences was excessive (indeed, you had to buy one licence per computer),
8661 #. and it forced users to perform "unnatural acts": relational models are really not made to be distributed.
8663 8662
8664 8663 ### A First Shift
8665 8664
8665 Developing DBMS better suited to distributed architectures became increasingly important, and some companies took a stab at it.
8666 The most important attempts were
8667
8666 8668 - [Google Big Table](https://cloud.google.com/bigtable/), 2004 (made public in … 2015!) [@Chang2006]
8667 8669 - [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), 2004 (used in Simple Storage Service (S3) in 2007)
8668 8670 - Facebook's Cassandra is sometimes mentioned, but it came later on, around 2009 [@Lakshman2009].
8669 8671
8670 Particular, big company, with specific needs, but people interrested in solving some of their problems.
8671 Now, people started to think that there could be other ways.
8672
8673 One goal was to get rid of "impedance mismatch": mapping classes or objects to database tables defined by a relational schema is complex and cumbersome.
8674
8675 Some issues:
8672 These were solutions suited to the very specific needs of those big companies.
8673 But it was interesting to see SQL's supremacy being questioned.
8676 8674
8677 - No absolute notion of "private" and "public" in RDBMS (relative to needs)
8678 - Data-type differences (no pointer, weird way of defining string, etc.)
8679 - Value in a relational structure have to be simple (no complex datatype, no structure)
8675 One of the goals was to get rid of "impedance mismatch": mapping classes or objects to database tables defined by a relational schema is complex and cumbersome.
8676 However, if you want your database application to go naturally from its data representation to the representation in the DBMS, solving this issue becomes critical.
8677 Among the issues,
8680 8678
8681 "Impedance mismatch" is that annoying need for a translation.
8679 - There is no absolute notion of "private" and "public" in RDBMS (it is relative to needs),
8680 - There are many differences in the data types (no pointers, weird ways of defining strings, etc.),
8681 - The values in a relational structure have to be simple (no complex datatype, no structure).
8682 8682
8683 Also, the data is now
8683 The term "impedance mismatch" describes that annoying need for a translation, and one of the goals of this first shift was to get rid of it.
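
As a hedged illustration of that translation step, the following Python sketch flattens one nested object into rows for two relational tables, and contrasts this with serializing the same object as a single JSON document. The `Order` class, its fields, and the two-table layout are hypothetical and only serve the example; they do not come from the notes.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Order:
    # Hypothetical application-side object holding a nested collection.
    order_id: int
    customer: str
    items: list = field(default_factory=list)  # list of (product, quantity)


def to_relational_rows(order):
    """Split the object into flat rows, one list per (hypothetical) table."""
    order_rows = [(order.order_id, order.customer)]
    item_rows = [(order.order_id, product, qty) for product, qty in order.items]
    return order_rows, item_rows


def to_document(order):
    """Serialize the whole object as one JSON document, nesting preserved."""
    return json.dumps(asdict(order))


order = Order(1, "Ada", [("pen", 2), ("notebook", 1)])
order_rows, item_rows = to_relational_rows(order)
document = to_document(order)
```

The relational side needs two row sets linked by `order_id` (and a join to rebuild the object), whereas the document keeps the nesting in one value.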
8684 8684
8685 - moving
8686 - growing
8687 - too diverse
8688
8689 for traditional relational DBMS.
8685 Also, the data is now moving, growing fast, and extremely diverse, and traditional relational DBMS seemed not necessarily well-suited to handle those changes.
8690 8686
8691 8687 ### Gathering Forces
8692 8688
8693 Multiple attempts, going in multiple directions.
8694 A meetup to discuss them coined the term "NoSQL" in an attempt to have a "twittable" hashtag, and it stayed (even it is as specific as describing a dog with "no-cat").
8689 To renew the world of DBMS, there were multiple attempts, going in multiple directions.
8690 A meetup to discuss them coined the term "NoSQL" in an attempt to have a "twittable" hashtag, and it stayed (even if it is as specific as describing a dog as "not being a cat").
8695 8691 The original meet-up asked for "open-source, distributed, nonrelational database".
8696 Today, no official definition, but NoSQL often implies the followig:
8692 Today, there is no "official" definition of NoSQL, but NoSQL often implies the following:
8697 8693
8698 - No relational model
8699 - Not using `SQL`. Some still have a query language, and it ressembles `SQL` (to minimize learning cost), for instance Cassandra's CQL.
8700 - Run well on clusters
8701 - Schemaless: you can add records without having to define a change in the structure first.
8694 - No relational model,
8695 - Not using `SQL`. Some still have a query language, and it resembles `SQL` (to minimize learning cost), for instance Cassandra's CQL,
8696 - Run well on clusters,
8697 - Schemaless: you can add records without having to define a change in the structure first,
8702 8698 - Open source.
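
To make the "schemaless" item of the list above more concrete, here is a minimal Python sketch in which a plain dictionary stands in for a document collection. This is an assumption made for illustration only: the `collection` and `insert` names are hypothetical, and a real document store offers much more than this.

```python
# A plain dict stands in for a document collection (an assumption for
# illustration; a real document store offers much more than this).
collection = {}


def insert(doc_id, document):
    """Store a document as-is: no schema is checked or declared."""
    collection[doc_id] = document


insert(1, {"name": "Ada", "email": "ada@example.org"})
# A second record carries a field the first one lacks; no structural
# change (such as ALTER TABLE in SQL) had to be declared first.
insert(2, {"name": "Bob", "phone": "555-0100"})

# The price of that freedom: readers must cope with missing fields.
emails = [doc.get("email") for doc in collection.values()]
```

The last line hints at the trade-off: without a schema, the application code becomes responsible for handling records that do not share the same fields.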
8703 8699
8704 Most importantly: polyglot persistence, "using different data storage technologies to handle varying data storage needs."
8700 Another important notion that emerged was the notion of "polyglot persistence", which is the idea of "using different data storage technologies to handle varying data storage needs."
8701 In other terms, if you adopt the "application database" approach (i.e., one database dedicated to one particular application), then you can use the DBMS A for your application 1, and the DBMS B for your application 2, or even use A and B for the same application!
8705 8702
8706 8703 ### The Future or the Past?
8707 8704
8708 A lot of enthusiasm, also because it "frees the data" (and, actually, the metadata, cf. application/ld+json, JavaScript Object Notation for Linked Data, schema.org, etc.).
8705 There was a lot of enthusiasm, also because this approach "frees the data" (and, actually, the metadata, cf. application/ld+json, JavaScript Object Notation for Linked Data, schema.org, etc.): sharing e.g. a `json` file is much easier than sharing an `SQL` view along with its schema (the example in the [Document-Oriented Database](#document-oriented-database) will make it clearer).
8706
8709 8707 Some of it will last for sure: polyglot persistency, the possibility of being schema-less, being "distributed first", the possibility of sacrificing consistency for greater good, etc.
8710 Does not mean `SQL` ("OldSQL") and relational database are over: still useful in many scenario, and the powerfull query language is great (writing your own every time is a nightmare…).
8708 This does not mean that `SQL` ("OldSQL") and relational databases are over: they are still useful in many scenarios, and the powerful query language is great (writing your own every time is a nightmare…).
8711 8709
8712 8710 Starting ~ 2010, one reaction was to develop "NewSQL", which would combine aspects of both approaches.
8713 Starting ~ 2010, one reaction was to develop "NewSQL", which would combine aspects of both approaches.
8714 MongoDB announced that it would have more and more of the ACID properties! <https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb>
8711 Having to drop the ACID requirements (detailed [in this Section](#sec:AcidVsCAP)) was often seen as a major drawback, but, for instance, [MongoDB announced](https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb) that it would have more and more of the ACID properties!
8715 8712
8716 8713 Also, a really great use of NoSQL is to adopt it at an early stage of the development, when it is not clear what the schemas should be.
8717 8714 When the schemas are final, then you can shift to relational DBMS!
 
... ... The retro-acronym "Not Only `SQL`" emphasizes that `SQL` will still be one of th
8720 8717
8721 8718 ## Comparison
8722 8719
8720 `SQL` and the NoSQL approach can be compared in many different ways.
8721 Note that there is no "best tool": it would be like trying to decide if a hammer is better than a saw, the answer is "it depends on what you want to do with it!".
8722 But you can use a relational or a non-relational DBMS for different purposes, sometimes, again, within the same application ("polyglot persistency").
8723
8723 8724 ### Overview
8724 8725
8725 8726 *« Comparaison n'est pas raison »*^[A French proverb, meaning that "things should be judged on the individual qualities they possess, rather than by comparing one with another." [@FactsOnFile]]
8726 8727
8727 - Semi-structured data (no schema)
8728 - High performance
8729 - Availability
8730 - Data Replication (improves availability and performance)
8731 - Scalability (horizontal scalabality (add nodes) instead of vertical (add memory))
8732 - Eventual Consistency
8733 - Natively versionning
8728 NoSQL
8729 ~
8730
8731 - Semi-structured data (no schema)
8732 - High performance
8733 - Availability
8734 - Data Replication (improves availability and performance)
8735 - Scalability (horizontal scalability (add nodes) instead of vertical (add memory))
8736 - Eventual Consistency
8737 - Native versioning
8734 8738
8735 Vs
8739 SQL
8740 ~
8736 8741
8737 - Immediate data consistency
8738 - Powerfull query language (for instance, join is often missing in NoSQL, has to be implemented on the application-side)
8739 - Structured data storage (can be too restrictive)
8742 - Immediate data consistency
8743 - Powerful query language (for instance, join is often missing in NoSQL, and has to be implemented on the application-side)
8744 - Structured data storage (can be too restrictive)
8740 8745
8741 8746 ### ACID vs CAP vs BASE {#sec:AcidVsCAP}
8742 8747
8748 ACID and BASE are two acronyms capturing desirable features of DBMS, while CAP is a theorem stating the impossibility of having some desirable properties at the same time in distributed systems.
8749
8743 8750 ACID is the guarantee of validity even in the event of errors, power failures, etc.
8744 8751
8745 8752 - Atomicity → Transactions are all or nothing
 
... ... ACID is the guarantee of validity even in the event of errors, power failures, e
8749 8756
8750 8757 CAP (a.k.a. Brewer's theorem): Roughly, "In a distributed system, one has to choose between consistency (every read receives the most recent write or an error) and availability (every request receives a (non-error) response, without guarantee that it contains the most recent write)" (the P standing for "Partition tolerance": the system keeps operating even when messages between nodes are lost or delayed).
8751 8758
8752 BASE is Basic Availability, Soft state, Eventual consistency.
8759 BASE (also formulated by Brewer) corresponds to Basic Availability, Soft state, Eventual consistency.
8760 It is a series of properties that can be reached by distributed systems, including NoSQL systems, and is often seen as the "NoSQL version of ACID".
8761 This [answer](https://stackoverflow.com/a/3382260) gives some insight into its meaning.
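
The "eventual consistency" part of BASE can be sketched with a toy model, given here in Python under strong simplifying assumptions: the `Replica` class and the `synchronize` function are hypothetical, there are only two replicas, and propagation is triggered by hand instead of happening in the background as it would in an actual distributed DBMS.

```python
class Replica:
    """A toy replica holding its own copy of the data (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.pending = []  # writes received but not yet propagated

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def read(self, key):
        # Always answers (availability), possibly with a stale value.
        return self.data.get(key)


def synchronize(source, target):
    """Propagate pending writes from one replica to another."""
    for key, value in source.pending:
        target.data[key] = value
    source.pending.clear()


a, b = Replica(), Replica()
a.write("x", 1)
stale = b.read("x")   # b has not seen the write yet
synchronize(a, b)
fresh = b.read("x")   # after propagation, both replicas agree
```

Between the write and the synchronization, the second replica still answers, but with an outdated value: this is the window that ACID's immediate consistency forbids and that BASE tolerates.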
8753 8762
8754 8763 ## Categories of NoSQL Systems
8755 8764
8765 There are multiple ways to be "non-relational".
8766 A rough hierarchy of the different approaches can be sketched as follows.
8767
8756 8768 Model | Description | Examples |
8757 8769 --- | --- | ---
8758 Document-based | Data is stored as "documents" (JSON, for instance), accessible via their ID (other indexes available). | [Apache CouchDB](https://couchdb.apache.org/) (simble for web applications, and reliable), [MongoDB](https://www.mongodb.com/) (easy to operate), [Couchbase](https://www.couchbase.com/) (high concurrency, and high availability).
8759 Key-value stores | Fast access by the key to the value. Value can be a record, an object, a document, or be even more complex. | [Redis](https://redis.io/) (in-memory but persistent on disk database, stores everything in the RAM!)
8770 Document-based | Data is stored as "documents" (JSON, for instance), accessible via their ID (other indexes). | [Apache CouchDB](https://couchdb.apache.org/) (simple for web applications, and reliable), [MongoDB](https://www.mongodb.com/) (easy to operate), [Couchbase](https://www.couchbase.com/) (high concurrency, and high availability).
8771 Key-value stores | Fast access by the key to the value. Value can be a record, an object, a document, or be more complex. | [Redis](https://redis.io/) (in-memory but persistent on disk database, stores everything in the RAM!)
8760 8772 Column-based (a.k.a. wide column) | Partition a table by columns into column families, where each column family is stored in its own files. | [Cassandra](https://cassandra.apache.org/), [HBase](https://hbase.apache.org/) (both for huge amounts of data)
8761 Graph-based | Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions. | [Neo4J](https://neo4j.com/) (excellent for pattern recognition, and data mining).
8773 Graph-based | Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions. | [Neo4J](https://neo4j.com/) (excellent for pattern recognition, and data mining)
8762 8774 Multi-model | Support multiple data models | [Apache Ignite](https://ignite.apache.org/), [ArangoDB](https://www.arangodb.com/), etc.
8763 8775
8776 ---
8777
8764 8778 ## MongoDB
8765 8779
8766 8780 ### Resources
 
... ... Multi-model | Support multiple data models | [Apache Ignite](https://ignite.apa
8780 8794 - [@NoSQLDistilled, ch. 9]
8781 8795 - [@Sullivan2015, ch. 6]
8782 8796
8797
8783 8798 ### Introduction
8784 8799
8785 8800 MongoDB is
Hints:
Before first commit, do not forget to setup your git environment:
git config --global user.name "your_name_here"
git config --global user.email "your@email_here"

Clone this repository using HTTP(S):
git clone https://rocketgit.com/user/caubert/CSCI_3410

Clone this repository using ssh (do not forget to upload a key first):
git clone ssh://rocketgit@ssh.rocketgit.com/user/caubert/CSCI_3410

Clone this repository using git:
git clone git://git.rocketgit.com/user/caubert/CSCI_3410

You are allowed to anonymously push to this repository.
This means that your pushed commits will automatically be transformed into a merge request:
... clone the repository ...
... make some changes and some commits ...
git push origin main