File notes/lectures_notes.md changed (mode: 100644) (index 6a34f7d..0a9f0bc) |
... |
... |
To write this chapter, were used |
8635 |
8635 |
|
|
8636 |
8636 |
## A Bit of History |
## A Bit of History |
8637 |
8637 |
|
|
8638 |
|
Inspired from [@NoSQLDistilled, Chap. 1] |
|
|
8638 |
|
This part is partially inspired by [@NoSQLDistilled, Chap. 1], but it has been further updated. |
8639 |
8639 |
|
|
8640 |
8640 |
### Database Applications and Application Databases |
### Database Applications and Application Databases |
8641 |
8641 |
|
|
8642 |
|
When you write a Database application, you have two options: |
|
|
8642 |
|
When you write a database application, you have two options: |
8643 |
8643 |
|
|
8644 |
|
#. One database for many softwares |
|
8645 |
|
#. One database for each softwares |
|
|
8644 |
|
#. One database for multiple applications, |
|
8645 |
|
#. One database for each application. |
8646 |
8646 |
|
|
8647 |
|
The first option can cause severe impacts on the efficiency of your database: since maintening the integrity of the database is a requirement, a lot of synchronization is needed. |
|
8648 |
|
With the second option, you develop an "application database", and you have more freedom of choice: since only a program interact with a database, you can chose whatever data management you want. |
|
|
8647 |
|
The first option can severely impact the efficiency of your database: since maintaining the integrity of the database is a requirement, a lot of synchronization is needed, and your database becomes a bottleneck. |
|
8648 |
|
With the second option, you develop an "application database" (i.e., a database dedicated to a particular application), and you have more freedom in the design, schema, and even DBMS (you can use one particular software solution for one particular database application, and a different one for a different database application). |
8649 |
8649 |
|
|
8650 |
|
But people were attached to `SQL` and kept using it. |
|
|
8650 |
|
### Clusters, Clusters… |
8651 |
8651 |
|
|
8652 |
|
### Clusters, clusters… |
|
|
8652 |
|
The increase in everything (traffic, size of data, number of clients, etc.) meant "up or out", and raised numerous challenges for the "one database for multiple applications" option. |
|
8653 |
|
There were two ways to increase the resources and to scale up: |
8653 |
8654 |
|
|
8654 |
|
Increase in everything (traffic, size of data, number of clients, etc.) meant "up or out", and there was two ways to increase the resources: |
|
|
8655 |
|
#. Bigger machines, |
|
8656 |
|
#. More machines. |
8655 |
8657 |
|
|
8656 |
|
#. Bigger machines |
|
8657 |
|
#. More machines |
|
|
8658 |
|
The second option was generally less expensive (compare buying 1,000 Raspberry Pis with buying one supercomputer that is not itself a cluster of more modest computers), but came with two drawbacks w.r.t. databases: |
8658 |
8659 |
|
|
8659 |
|
The second option was generally less expensive, but came with two drawbacks w.r.t. databases: |
|
8660 |
|
|
|
8661 |
|
#. Cost of licences, |
|
8662 |
|
#. Force to perform "unnatural acts": relational model are really not made to be distributed |
|
|
8660 |
|
#. The cost of licences was excessive (indeed, you had to buy one licence per computer), |
|
8661 |
|
#. and it forced you to perform "unnatural acts": relational models are really not made to be distributed. |
8663 |
8662 |
|
|
8664 |
8663 |
### A First Shift |
### A First Shift |
8665 |
8664 |
|
|
|
8665 |
|
Developing DBMS more suited for distributed architectures became increasingly important, and some companies took a stab at it. |
|
8666 |
|
The most important attempts were |
|
8667 |
|
|
8666 |
8668 |
- [Google Big Table](https://cloud.google.com/bigtable/), 2004 (made public in … 2015!) [@Chang2006] |
- [Google Big Table](https://cloud.google.com/bigtable/), 2004 (made public in … 2015!) [@Chang2006] |
8667 |
8669 |
- [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), 2004 (used in Simple Storage Service (S3) in 2007) |
- [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), 2004 (used in Simple Storage Service (S3) in 2007) |
8668 |
8670 |
- Facebook's Cassandra is sometimes mentioned, but it came later on, around 2009 [@Lakshman2009]. |
- Facebook's Cassandra is sometimes mentioned, but it came later on, around 2009 [@Lakshman2009]. |
8669 |
8671 |
|
|
8670 |
|
Particular, big company, with specific needs, but people interrested in solving some of their problems. |
|
8671 |
|
Now, people started to think that there could be other ways. |
|
8672 |
|
|
|
8673 |
|
One goal was to get rid of "impedance mismatch": mapping classes or objects to database tables defined by a relational schema is complex and cumbersome. |
|
8674 |
|
|
|
8675 |
|
Some issues: |
|
|
8672 |
|
These were solutions suited to the very specific needs of those big companies. |
|
8673 |
|
But it was interesting to see SQL's supremacy being questioned. |
8676 |
8674 |
|
|
8677 |
|
- No absolute notion of "private" and "public" in RDBMS (relative to needs) |
|
8678 |
|
- Data-type differences (no pointer, weird way of defining string, etc.) |
|
8679 |
|
- Value in a relational structure have to be simple (no complex datatype, no structure) |
|
|
8675 |
|
One of the goals was to get rid of "impedance mismatch": mapping classes or objects to database tables defined by a relational schema is complex and cumbersome. |
|
8676 |
|
However, if you want your database application to go naturally from its own data representation to the representation in the DBMS, solving this issue becomes critical. |
|
8677 |
|
Among the issues, |
8680 |
8678 |
|
|
8681 |
|
"Impedance mismatch" is that annoying need for a translation. |
|
|
8679 |
|
- There is no absolute notion of "private" and "public" in RDBMS (relative to needs), |
|
8680 |
|
- There are many differences in the data types (no pointers, unusual ways of defining strings, etc.), |
|
8681 |
|
- The values in a relational structure have to be simple (no complex datatype, no structure). |
8682 |
8682 |
|
|
8683 |
|
Also, the data is now |
|
|
8683 |
|
The term "impedance mismatch" describes that annoying need for a translation, and one of the goals of this first shift was to get rid of it. |
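
The translation step described above can be sketched as follows. This is only an illustration under stated assumptions: the `Order` class, the table names and the data are hypothetical, and SQLite (through Python's standard `sqlite3` module) merely stands in for a relational DBMS.

```python
import json
import sqlite3
from dataclasses import dataclass, field, asdict


@dataclass
class Order:
    # Hypothetical application object with a nested, variable-length structure.
    order_id: int
    customer: str
    items: list = field(default_factory=list)  # list of (product, quantity)


order = Order(1, "Alice", [("pen", 2), ("notebook", 1)])

# Relational side: values must be simple, so the nested list is
# flattened into a second table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE order_items ("
    "order_id INTEGER REFERENCES orders(order_id), product TEXT, quantity INTEGER)"
)
conn.execute("INSERT INTO orders VALUES (?, ?)", (order.order_id, order.customer))
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(order.order_id, product, quantity) for product, quantity in order.items],
)

# Reading the object back requires a join and a manual reassembly.
rows = conn.execute(
    "SELECT o.customer, i.product, i.quantity "
    "FROM orders o JOIN order_items i ON o.order_id = i.order_id "
    "WHERE o.order_id = ? ORDER BY i.rowid",
    (1,),
).fetchall()
rebuilt = Order(1, rows[0][0], [(product, quantity) for _, product, quantity in rows])

# Document side: the object is serialized as is, nesting included.
document = json.dumps(asdict(order))
```

The relational version needs two tables, a join and some reassembly code, whereas the document version is a single serialization call. The price is that the structure of the document is no longer enforced by the DBMS.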
8684 |
8684 |
|
|
8685 |
|
- moving |
|
8686 |
|
- growing |
|
8687 |
|
- too diverse |
|
8688 |
|
|
|
8689 |
|
for traditional relational DBMS. |
|
|
8685 |
|
Also, the data is now moving, growing fast, and extremely diverse, and traditional relational DBMS did not seem well-suited to handle those changes. |
8690 |
8686 |
|
|
8691 |
8687 |
### Gathering Forces |
### Gathering Forces |
8692 |
8688 |
|
|
8693 |
|
Multiple attempts, going in multiple directions. |
|
8694 |
|
A meetup to discuss them coined the term "NoSQL" in an attempt to have a "twittable" hashtag, and it stayed (even it is as specific as describing a dog with "no-cat"). |
|
|
8689 |
|
To renew the world of DBMS, there were multiple attempts, going in multiple directions. |
|
8690 |
|
A meetup to discuss them coined the term "NoSQL" in an attempt to have a "twittable" hashtag, and it stayed (even if it is as specific as describing a dog as "not being a cat"). |
8695 |
8691 |
The original meet-up asked for "open-source, distributed, nonrelational database". |
The original meet-up asked for "open-source, distributed, nonrelational database". |
8696 |
|
Today, no official definition, but NoSQL often implies the followig: |
|
|
8692 |
|
Today, there is no "official" definition of NoSQL, but NoSQL often implies the following: |
8697 |
8693 |
|
|
8698 |
|
- No relational model |
|
8699 |
|
- Not using `SQL`. Some still have a query language, and it ressembles `SQL` (to minimize learning cost), for instance Cassandra's CQL. |
|
8700 |
|
- Run well on clusters |
|
8701 |
|
- Schemaless: you can add records without having to define a change in the structure first. |
|
|
8694 |
|
- No relational model, |
|
8695 |
|
- Not using `SQL`. Some still have a query language, and it resembles `SQL` (to minimize learning cost), for instance Cassandra's CQL, |
|
8696 |
|
- Run well on clusters, |
|
8697 |
|
- Schemaless: you can add records without having to define a change in the structure first, |
8702 |
8698 |
- Open source. |
- Open source. |
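
The "schemaless" item in the list above can be illustrated with a minimal sketch. The table, the field names and the use of a plain Python list as a "collection" are assumptions made for the example; no actual NoSQL system is involved.

```python
import sqlite3

# Relational side: the structure must be changed before the new record fits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('Alice')")
try:
    conn.execute("INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True  # the table has no "email" column yet
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
conn.execute("INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')")

# Schemaless side: a "collection" is simply a list of documents (dicts) here,
# and each document may carry its own set of fields.
collection = []
collection.append({"name": "Alice"})
collection.append({"name": "Bob", "email": "bob@example.com"})

# The application has to cope with missing fields itself.
with_email = [doc["name"] for doc in collection if "email" in doc]
```

The flexibility comes at a cost: the check that the relational DBMS performs when it rejects the first insertion has to be done by the application in the schemaless case.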
8703 |
8699 |
|
|
8704 |
|
Most importantly: polyglot persistence, "using different data storage technologies to handle varying data storage needs." |
|
|
8700 |
|
Another important notion that emerged was the notion of "polyglot persistence", which is the idea of "using different data storage technologies to handle varying data storage needs." |
|
8701 |
|
In other terms, if you adopt the "application database" approach (i.e., one database dedicated to one particular application), then you can use the DBMS A for your application 1, and the DBMS B for your application 2, or even use A and B for the same application! |
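
As a rough sketch of what "using A and B for the same application" can look like, the example below combines a relational store and a key-value store. Both classes and their names are hypothetical: `SessionStore` is a plain dictionary standing in for a key-value system, and `OrderStore` uses SQLite as the relational DBMS.

```python
import sqlite3


class SessionStore:
    """Hypothetical key-value store (a dictionary stands in for the real system)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class OrderStore:
    """Hypothetical relational store, backed by an in-memory SQLite database."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")

    def add(self, customer, total):
        self._conn.execute("INSERT INTO orders VALUES (?, ?)", (customer, total))

    def total_for(self, customer):
        row = self._conn.execute(
            "SELECT SUM(total) FROM orders WHERE customer = ?", (customer,)
        ).fetchone()
        return row[0] or 0.0


# The same application uses each store for what it does best:
# fast lookups by key for volatile session data, queries for durable orders.
sessions = SessionStore()
orders = OrderStore()

sessions.put("session:42", {"customer": "Alice", "cart": ["pen"]})
orders.add("Alice", 3.5)
orders.add("Alice", 1.5)

customer = sessions.get("session:42")["customer"]
total = orders.total_for(customer)
```

Hiding each store behind a small class keeps the rest of the application independent of the storage technology, which makes it easier to swap one DBMS for another later on.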
8705 |
8702 |
|
|
8706 |
8703 |
### The Future or the Past? |
### The Future or the Past? |
8707 |
8704 |
|
|
8708 |
|
A lot of enthusiasm, also because it "frees the data" (and, actually, the metadata, cf. application/ld+json, JavaScript Object Notation for Linked Data, schema.org, etc.). |
|
|
8705 |
|
There was a lot of enthusiasm, also because this approach "frees the data" (and, actually, the metadata, cf. application/ld+json, JavaScript Object Notation for Linked Data, schema.org, etc.): sharing e.g. a `json` file is much easier than sharing a `SQL` view along with its schema (the example in the [Document-Oriented Database](#document-oriented-database) will make it clearer). |
|
8706 |
|
|
8709 |
8707 |
Some of it will last for sure: polyglot persistency, the possibility of being schema-less, being "distributed first", the possibility of sacrificing consistency for greater good, etc. |
Some of it will last for sure: polyglot persistency, the possibility of being schema-less, being "distributed first", the possibility of sacrificing consistency for greater good, etc. |
8710 |
|
Does not mean `SQL` ("OldSQL") and relational database are over: still useful in many scenario, and the powerfull query language is great (writing your own every time is a nightmare…). |
|
|
8708 |
|
This does not mean that `SQL` ("OldSQL") and relational databases are over: they are still useful in many scenarios, and the powerful query language is great (writing your own every time is a nightmare…). |
8711 |
8709 |
|
|
8712 |
8710 |
Starting ~ 2010, one reaction was to develop "NewSQL", which would combine aspects of both approaches. |
Starting ~ 2010, one reaction was to develop "NewSQL", which would combine aspects of both approaches. |
8713 |
|
Starting ~ 2010, one reaction was to develop "NewSQL", which would combine aspects of both approaches. |
|
8714 |
|
MongoDB announced that it would have more and more of the ACID properties! <https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb> |
|
|
8711 |
|
For instance, having to drop the ACID requirements (detailed [in this Section](#sec:AcidVsCAP)) was often seen as a major drawback, but [MongoDB announced](https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb) that it would have more and more of the ACID properties! |
8715 |
8712 |
|
|
8716 |
8713 |
Also, a really great use of NoSQL is to adopt it at an early stage of the development, when it is not clear what the schemas should be. |
Also, a really great use of NoSQL is to adopt it at an early stage of the development, when it is not clear what the schemas should be. |
8717 |
8714 |
When the schemas are final, then you can shift to relational DBMS! |
When the schemas are final, then you can shift to relational DBMS! |
|
... |
... |
The retro-acronym "Not Only `SQL`" emphasizes that `SQL` will still be one of th |
8720 |
8717 |
|
|
8721 |
8718 |
## Comparison |
## Comparison |
8722 |
8719 |
|
|
|
8720 |
|
`SQL` and the NoSQL approach can be compared in many different ways. |
|
8721 |
|
Note that there is no "best tool": it would be like trying to decide if a hammer is better than a saw, the answer is "it depends on what you want to do with it!". |
|
8722 |
|
But you can use one relational or non-relational DBMS for different purposes, sometimes, again, within the same application ("polyglot persistency"). |
|
8723 |
|
|
8723 |
8724 |
### Overview |
### Overview |
8724 |
8725 |
|
|
8725 |
8726 |
*« Comparaison n'est pas raison »*^[A French proverb, meaning that "things should be judged on the individual qualities they possess, rather than by comparing one with another." [@FactsOnFile]] |
*« Comparaison n'est pas raison »*^[A French proverb, meaning that "things should be judged on the individual qualities they possess, rather than by comparing one with another." [@FactsOnFile]] |
8726 |
8727 |
|
|
8727 |
|
- Semi-structured data (no schema) |
|
8728 |
|
- High performance |
|
8729 |
|
- Availability |
|
8730 |
|
- Data Replication (improves availability and performance) |
|
8731 |
|
- Scalability (horizontal scalabality (add nodes) instead of vertical (add memory)) |
|
8732 |
|
- Eventual Consistency |
|
8733 |
|
- Natively versionning |
|
|
8728 |
|
NoSQL |
|
8729 |
|
~ |
|
8730 |
|
|
|
8731 |
|
- Semi-structured data (no schema) |
|
8732 |
|
- High performance |
|
8733 |
|
- Availability |
|
8734 |
|
- Data Replication (improves availability and performance) |
|
8735 |
|
- Scalability (horizontal scalability (add nodes) instead of vertical (add memory)) |
|
8736 |
|
- Eventual Consistency |
|
8737 |
|
- Native versioning |
8734 |
8738 |
|
|
8735 |
|
Vs |
|
|
8739 |
|
SQL |
|
8740 |
|
~ |
8736 |
8741 |
|
|
8737 |
|
- Immediate data consistency |
|
8738 |
|
- Powerfull query language (for instance, join is often missing in NoSQL, has to be implemented on the application-side) |
|
8739 |
|
- Structured data storage (can be too restrictive) |
|
|
8742 |
|
- Immediate data consistency |
|
8743 |
|
- Powerful query language (for instance, join is often missing in NoSQL, and has to be implemented on the application side) |
|
8744 |
|
- Structured data storage (can be too restrictive) |
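
To give an idea of what "implementing the join on the application side" means, here is a minimal sketch. The two collections, their field names and their content are made up for the example, and plain Python lists of dictionaries stand in for what a document store would return.

```python
from collections import defaultdict

# Two hypothetical "collections", as returned by two separate queries.
authors = [{"_id": 1, "name": "Alice"}, {"_id": 2, "name": "Bob"}]
books = [
    {"title": "Book A", "author_id": 1},
    {"title": "Book B", "author_id": 2},
    {"title": "Book C", "author_id": 1},
]

# Hash join done by the application: index one collection by the join key...
titles_by_author = defaultdict(list)
for book in books:
    titles_by_author[book["author_id"]].append(book["title"])

# ... then probe the index while iterating over the other collection.
joined = [
    {"name": author["name"], "titles": titles_by_author.get(author["_id"], [])}
    for author in authors
]
```

In `SQL`, the same result is a single `JOIN` that the DBMS can optimize. Here, the application fetches both collections and does the work itself.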
8740 |
8745 |
|
|
8741 |
8746 |
### ACID vs CAP vs BASE {#sec:AcidVsCAP} |
### ACID vs CAP vs BASE {#sec:AcidVsCAP} |
8742 |
8747 |
|
|
|
8748 |
|
ACID and BASE are two acronyms capturing desirable features of DBMS, while CAP is a theorem stating the impossibility to have some desirable properties at the same time in distributed systems. |
|
8749 |
|
|
8743 |
8750 |
ACID is the guarantee of validity even in the event of errors, power failures, etc. |
ACID is the guarantee of validity even in the event of errors, power failures, etc. |
8744 |
8751 |
|
|
8745 |
8752 |
- Atomicity → Transactions are all or nothing |
- Atomicity → Transactions are all or nothing |
|
... |
... |
ACID is the guarantee of validity even in the event of errors, power failures, e |
8749 |
8756 |
|
|
8750 |
8757 |
CAP (a.k.a. Brewer's theorem): Roughly, "In a distributed system, one has to choose between consistency (every read receives the most recent write or an error) and availability (every request receives a (non-error) response, without guarantee that it contains the most recent write)" (the P. standing for "Partition tolerance", the guarantee that the system keeps operating even if the network drops or delays messages between nodes). |
CAP (a.k.a. Brewer's theorem): Roughly, "In a distributed system, one has to choose between consistency (every read receives the most recent write or an error) and availability (every request receives a (non-error) response, without guarantee that it contains the most recent write)" (the P. standing for "Partition tolerance", the guarantee that the system keeps operating even if the network drops or delays messages between nodes). |
8751 |
8758 |
|
|
8752 |
|
BASE is Basic Availability, Soft state, Eventual consistency. |
|
|
8759 |
|
BASE (also formulated by Brewer) stands for Basically Available, Soft state, Eventual consistency. |
|
8760 |
|
It is a series of properties that can be reached by distributed systems, including NoSQL systems, and is often seen as the "NoSQL's version of ACID". |
|
8761 |
|
This [answer](https://stackoverflow.com/a/3382260) gives some insight into its meaning. |
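
The trade-off between availability and consistency, and the notion of eventual consistency, can be observed on a toy model. The classes below are purely illustrative assumptions: there is no network, and an explicit call to `sync()` plays the role of the replication that a real system would perform in the background.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy model: a write reaches one replica, the others only after sync()."""

    def __init__(self, n=2):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []  # writes not yet propagated to the other replicas

    def write(self, key, value):
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # Always answers (availability), possibly with stale data.
        return self.replicas[replica].data.get(key)

    def sync(self):
        # Propagate the pending writes, in order, to every other replica.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read("x", replica=1)  # the second replica has not seen the write
store.sync()
fresh = store.read("x", replica=1)  # after propagation, the replicas agree
```

A system favoring consistency would instead refuse to answer the first `read` on the second replica (or block until the write is propagated), which is exactly the loss of availability that the CAP theorem describes.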
8753 |
8762 |
|
|
8754 |
8763 |
## Categories of NoSQL Systems |
## Categories of NoSQL Systems |
8755 |
8764 |
|
|
|
8765 |
|
There are multiple ways to be "non-relational". |
|
8766 |
|
A rough hierarchy of the different approaches can be sketched as follows. |
|
8767 |
|
|
8756 |
8768 |
Model | Description | Examples | |
Model | Description | Examples | |
8757 |
8769 |
--- | --- | --- |
--- | --- | --- |
8758 |
|
Document-based | Data is stored as "documents" (JSON, for instance), accessible via their ID (other indexes available). | [Apache CouchDB](https://couchdb.apache.org/) (simble for web applications, and reliable), [MongoDB](https://www.mongodb.com/) (easy to operate), [Couchbase](https://www.couchbase.com/) (high concurrency, and high availability). |
|
8759 |
|
Key-value stores | Fast access by the key to the value. Value can be a record, an object, a document, or be even more complex. | [Redis](https://redis.io/) (in-memory but persistent on disk database, stores everything in the RAM!) |
|
|
8770 |
|
Document-based | Data is stored as "documents" (JSON, for instance), accessible via their ID (other indexes can be defined). | [Apache CouchDB](https://couchdb.apache.org/) (simple for web applications, and reliable), [MongoDB](https://www.mongodb.com/) (easy to operate), [Couchbase](https://www.couchbase.com/) (high concurrency, and high availability). |
|
8771 |
|
Key-value stores | Fast access by the key to the value. Value can be a record, an object, a document, or be more complex. | [Redis](https://redis.io/) (in-memory but persistent on disk database, stores everything in the RAM!) |
8760 |
8772 |
Column-based (a.k.a. wide column) | Partition a table by columns into column families, where each column family is stored in its own files. | [Cassandra](https://cassandra.apache.org/), [HBase](https://hbase.apache.org/) (both for huge amounts of data) |
Column-based (a.k.a. wide column) | Partition a table by columns into column families, where each column family is stored in its own files. | [Cassandra](https://cassandra.apache.org/), [HBase](https://hbase.apache.org/) (both for huge amounts of data) |
8761 |
|
Graph-based | Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions. | [Neo4J](https://neo4j.com/) (excellent for pattern recognition, and data mining). |
|
|
8773 |
|
Graph-based | Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions. | [Neo4J](https://neo4j.com/) (excellent for pattern recognition, and data mining) |
8762 |
8774 |
Multi-model | Support multiple data models | [Apache Ignite](https://ignite.apache.org/), [ArangoDB](https://www.arangodb.com/), etc. |
Multi-model | Support multiple data models | [Apache Ignite](https://ignite.apache.org/), [ArangoDB](https://www.arangodb.com/), etc. |
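
To make the first four rows of the table more concrete, the sketch below represents the same small piece of information ("Alice follows Bob") in each model. The identifiers, keys and field names are hypothetical, and plain Python structures stand in for what the actual systems store.

```python
import json

# Document-based: a self-contained, nested document, accessible via its ID.
document = {"_id": "u1", "name": "Alice", "follows": ["u2"]}

# Key-value: the value is opaque to the store, only the key is meaningful to it.
key_value = {"user:u1": json.dumps({"name": "Alice", "follows": ["u2"]})}

# Column-based: row key -> column family -> columns.
wide_column = {
    "u1": {
        "profile": {"name": "Alice"},
        "social": {"follows:u2": True},
    }
}

# Graph-based: nodes, and labeled edges that can be traversed.
nodes = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
edges = [("u1", "FOLLOWS", "u2")]

# "Who does u1 follow?" answered by traversing the edges.
followed = [
    nodes[target]["name"]
    for source, label, target in edges
    if source == "u1" and label == "FOLLOWS"
]
```

Each representation makes one kind of access cheap: fetching the whole user for the document, a lookup by key for the key-value store, reading a single column family for the wide-column model, and following relationships for the graph.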
8763 |
8775 |
|
|
|
8776 |
|
--- |
|
8777 |
|
|
8764 |
8778 |
## MongoDB |
## MongoDB |
8765 |
8779 |
|
|
8766 |
8780 |
### Resources |
### Resources |
|
... |
... |
Multi-model | Support multiple data models | [Apache Ignite](https://ignite.apa |
8780 |
8794 |
- [@NoSQLDistilled, ch. 9] |
- [@NoSQLDistilled, ch. 9] |
8781 |
8795 |
- [@Sullivan2015, ch. 6] |
- [@Sullivan2015, ch. 6] |
8782 |
8796 |
|
|
|
8797 |
|
|
8783 |
8798 |
### Introduction |
### Introduction |
8784 |
8799 |
|
|
8785 |
8800 |
MongoDB is |
MongoDB is |