Single MySQL First Generation instances are limited to storing 500 GB of data. Second Generation instances of MySQL, PostgreSQL, and SQL Server can store up to 30 TB per instance. In general, Cloud SQL is a good choice for applications that need a relational database and that serve requests in a single region.
The limits specified here are the limits that Google has in place as of this writing. They may have changed by the time you read this. Always use Google Cloud documentation for the definitive limits of any GCP service.
Velocity of data is the rate at which it is sent to and processed by an application. Web applications and mobile apps that collect and store human-entered data are typically low velocity, at least when measured by individual user. Machine-generated data, such IoT and time-series data, can be high velocity, especially when many different devices are generating data at short intervals of time. Here are some examples of various rates for low to high velocity:
Nightly uploads of data to a data
Hourly summaries of the number of orders taken in the last hour
Analysis of the last three minutes of telemetry data
Alerting based on a log message as soon as it is received is an example of real-time processing
If data is ingested and written to storage, it is important to match the velocity of incoming data with the rate at which the data store can write data. For example, Bigtable is designed for high-velocity data and can write up to 10,000 rows per second using a 10-node cluster with SSDs. When high-velocity data is processed as it is ingested, it is a good practice to write the data to a Cloud Pub/Sub topic. The processing application can then use a pull subscription to read the data at a rate that it can sustain. Cloud Pub/Sub is a scalable, managed messaging service that scales automatically. Users do not have to provision resources or configure scaling parameters.
At the other end of the velocity spectrum are low-velocity migrations or archiving operations. For example, an organization that uses the Transfer Appliance for large-scale migration may wait days before the data is available in Cloud Storage.
Another key attribute to consider when choosing a storage technology is the amount of variation that you expect in the data structure. Some data structures have low variance. For example, a weather sensor that sends temperature, humidity, and pressure readings at regular time intervals has virtually no variation in the data structure. All data sent to the storage system will have those three measures unless there is an error, such as a lost network packet or corrupted data.
Many business applications that use relational databases also have limited variation in data structure. For example, all customers have most attributes in common, such as name and address, but other business applications may have name suffixes, such as M.D. and Ph.D., stored in an additional field. In those cases, it is common to allow NULL values for attributes that may not be needed.
Not all business applications fit well into the rigid structure of strictly relational databases. NoSQL databases, such as MongoDB, CouchDB, and OrientDB, are examples of document databases. These databases use sets of key-value pairs to represent varying attributes. For example, instead of having a fixed set of attributes, like a relational database table, they include the attribute name along with the attribute value in the database (see Table 1.1).
Table 1.1 Example of structured, relational data
First_name |
Last_name |
Street_Address |
City |
Postal_Code |
Michael |
Johnson |
334 Bay Rd |
Santa Fe |
87501 |
Wang |
Li |
74 Alder St |
Boise |
83701 |
Sandra |
Connor |
123 Main St |
Los Angeles |
90014 |
The data in the first row would be represented in a document database using a structure something like the following:
{ ’first_name’: ’Michael’, ’last_name’: ’Johnson’. ’street’_address’: ’334 Bay Rd’, ’city’: ’Santa Fe’, ’postal_code’: ’87501’ }
Since most rows in a table of names and addresses will have the same attributes, it is not necessary to use a data structure like a document structure. Consider the case of a product catalog that lists both appliances and furniture. Here is an example of how a dishwasher and a chair might be represented:
{ {’id’: ’123456’, ’product_type’: ’dishwasher’, ’length’: ’24 in’, ’width’: ’34 in’, ’weight’: ’175 lbs’, ’power’: ’1800 watts’ } {’id’:’987654’, ’product_type’: ’chair’, ’weight’: ’15 kg’, ’style’: ’modern’, ’color’: ’brown’ } }
In addition to document databases, wide-column databases, such as Bigtable and Cassandra, are also used with datasets with varying attributes.
Data is accessed in different ways for different use cases. Some time-series data points may be read immediately after they are written, but they are not likely to be read once they are more than a day old. Customer order data may be read repeatedly as an order is processed. Archived data may be accessed less than once a year. Four metrics to consider about data access are as follows:
How much data is retrieved in a read operation?
How much data is written in an insert operation?
How often is data written?
How often is data read?
Some read and write operations apply to small amounts of data. Reading or writing a single piece of telemetry data is an example. Writing an e-commerce transaction may also entail a small amount of data. A database storing telemetry data from thousands of sensors that push data every five seconds will be writing large volumes, whereas an online transaction processing database for a small online retailer will also write small individual units of data but at a much smaller rate. These will require different kinds of databases. The telemetry data, for example, is better suited to Bigtable, with its low-latency writes, and the retailer transaction data is a good use case for Cloud SQL, with support for sufficient I/O operations to handle relational database loads.
Cloud Storage supports ingesting large volumes of data in bulk using tools such as the Cloud Transfer Service and Transfer Appliance. (Cloud Storage also supports streaming transfers, but bulk reads and writes are more common.) Data in Cloud Storage is read at the object or the file level. You typically don’t, for example, seek a particular block within a file as you can when storing a file on a filesystem.
It is common to read large volumes of data in BigQuery as well; however, in that case we often read a small number of columns across a large number of rows. BigQuery optimizes for these kinds of reads by using a columnar storage format known as Capacitor. Capacitor is designed to store semi-structured data with nested and repeated fields.
Data access patterns can help identify the best storage technology for a use case by highlighting key features needed to support those access patterns.
Different storage systems will have different levels of access controls. Cloud Storage, for example, can have access controls at the bucket and the object level. If someone has access to a file in Cloud Storage, they will have access to all the data in that file. If some users have access only to a subset of a dataset, then the data could be stored in a relational database and a view could be created that includes only the data that the user is allowed to access.
Читать дальше