
Give Push to your Success with Google Cloud Certified Professional-Data-Engineer Exam Questions
Professional-Data-Engineer 100% Guarantee Download Professional-Data-Engineer Exam PDF Q&A
NEW QUESTION # 15
Your infrastructure includes a set of YouTube channels. You have been tasked with creating a process for sending the YouTube channel data to Google Cloud for analysis. You want to design a solution that allows your world-wide marketing teams to perform ANSI SQL and other types of analysis on up-to-date YouTube channels log dat a. How should you set up the log data transfer into Google Cloud?
- A. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
- B. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Regional bucket as a final destination.
- C. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Regional
- D. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
Answer: B
Explanation:
storage bucket as a final destination.
NEW QUESTION # 16
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?
- A. Cloud Scheduler
- B. Cloud Composer
- C. Cloud Dataflow
- D. Cloud Functions
Answer: B
NEW QUESTION # 17
You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL.
What should you do?
- A. Use Cloud Dataprep with recipes to detect errors and perform transformations.
- B. Use federated tables in BigQuery with queries to detect errors and perform transformations.
- C. Use Cloud Dataflow with Beam to detect errors and perform transformations.
- D. Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations.
Answer: A
NEW QUESTION # 18
Case Study: 2 - MJTelco
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost. Their management and operations teams are situated all around the globe creating many-to- many relationship between data consumers and provides in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments ?development/test, staging, and production ?
to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community. Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
Provide reliable and timely access to data for analysis from distributed research workers Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
Ensure secure and efficient transport and storage of telemetry data Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately
100m records/day
Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis.
Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You need to compose visualizations for operations teams with the following requirements:
Which approach meets the requirements?
- A. Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a table.
- B. Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show only suboptimal links in a table.
- C. Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries all rows, applies a function to derive the metric, and then renders results in a table using the Google charts and visualization API.
- D. Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates the metric, and shows only suboptimal rows in a table in Google Sheets.
Answer: C
NEW QUESTION # 19
Your team is building a data lake platform on Google Cloud. As a part of the data foundation design, you are planning to store all the raw data in Cloud Storage You are expecting to ingest approximately 25 GB of data a day and your billing department is worried about the increasing cost of storing old dat a. The current business requirements are:
* The old data can be deleted anytime
* You plan to use the visualization layer for current and historical reporting
* The old data should be available instantly when accessed
* There should not be any charges for data retrieval.
What should you do to optimize for cost?
- A. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 90 days to coldline. and 365 days to archive storage class. Delete old data as needed.
- B. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to coldline, 90 days to nearline. and 365 days to archive storage class Delete old data as needed.
- C. Create the bucket with the Autoclass storage class feature.
- D. Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearlme. 45 days to coldline. and 60 days to archive storage class Delete old data as needed.
Answer: A
Explanation:
- Autoclass automatically moves objects between storage classes without impacting performance or availability, nor incurring retrieval costs. - It continuously optimizes storage costs based on access patterns without the need to set specific lifecycle management policies.
NEW QUESTION # 20
Which SQL keyword can be used to reduce the number of columns processed by BigQuery?
- A. BETWEEN
- B. LIMIT
- C. WHERE
- D. SELECT
Answer: D
Explanation:
SELECT allows you to query specific columns rather than the whole table. LIMIT, BETWEEN, and WHERE clauses will not reduce the number of columns processed by BigQuery.
Reference: https://cloud.google.com/bigquery/launch-
checklist#architecture_design_and_development_checklist
NEW QUESTION # 21
You are designing a cloud-native historical data processing system to meet the following conditions:
* The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
* A streaming data pipeline stores new data daily.
* Peformance is not a factor in the solution.
* The solution design should maximize availability.
How should you design data storage for this solution?
- A. Store the data in BigQuery. Access the data using the BigQuery Connector or Cloud Dataproc and Compute Engine.
- B. Store the data in a regional Cloud Storage bucket. Aceess the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
- C. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
- D. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and peform analysis as needed.
Answer: C
NEW QUESTION # 22
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe creating many-to-many relationship between data consumers and provides in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
Ensure secure and efficient transport and storage of telemetry data
Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure.
We also need environments in which our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations.
You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?
- A. The maximum number of workers
- B. The disk size per worker
- C. The number of workers
- D. The zone
Answer: D
NEW QUESTION # 23
Your company is in the process of migrating its on-premises data warehousing solutions to BigQuery. The existing data warehouse uses trigger-based change data capture (CDC) to apply updates from multiple transactional database sources on a daily basis. With BigQuery, your company hopes to improve its handling of CDC so that changes to the source systems are available to query in BigQuery in near-real time using log- based CDC streams, while also optimizing for the performance of applying changes to the data warehouse.
Which two steps should they take to ensure that changes are available in the BigQuery reporting table with minimal latency while reducing compute overhead? (Choose two.)
- A. Periodically DELETE outdated records from the reporting table.
- B. Perform a DML INSERT, UPDATE, or DELETE to replicate each individual CDC record in real time directly on the reporting table.
- C. Periodically use a DML MERGE to perform several DML INSERT, UPDATE, and DELETE operations at the same time on the reporting table.
- D. Insert each new CDC record and corresponding operation type in real time to the reporting table, and use a materialized view to expose only the newest version of each unique record.
- E. Insert each new CDC record and corresponding operation type to a staging table in real time.
Answer: B,E
NEW QUESTION # 24
Which of these statements about BigQuery caching is true?
- A. There is no charge for a query that retrieves its results from cache.
- B. Query results are cached even if you specify a destination table.
- C. By default, a query's results are not cached.
- D. BigQuery caches query results for 48 hours.
Answer: A
Explanation:
When query results are retrieved from a cached results table, you are not charged for the query.
BigQuery caches query results for 24 hours, not 48 hours.
Query results are not cached if you specify a destination table.
A query's results are always cached except under certain conditions, such as if you specify a destination table.
NEW QUESTION # 25
You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)
- A. Decrease the regularization parameters
- B. Use a larger set of features
- C. Increase the regularization parameters
- D. Reduce the number of training examples
- E. Use a smaller set of features
- F. Get more training examples
Answer: A,B,F
NEW QUESTION # 26
You used Cloud Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes. What should you do?
- A. Create an App Engine cron job to schedule the execution of the Cloud Dataprep job.
- B. Export the recipe as a Cloud Dataprep template, and create a job in Cloud Scheduler.
- C. Export the Cloud Dataprep job as a Cloud Dataflow template, and incorporate it into a Cloud Composer job.
- D. Create a cron schedule in Cloud Dataprep.
Answer: B
NEW QUESTION # 27
You have several different unstructured data sources, within your on-premises data center as well as in the cloud. The data is in various formats, such as Apache Parquet and CSV. You want to centralize this data in Cloud Storage. You need to set up an object sink for your data that allows you to use your own encryption keys. You want to use a GUI-based solution. What should you do?
- A. Use Storage Transfer Service to move files into Cloud Storage.
- B. Use Cloud Data Fusion to move files into Cloud Storage.
- C. Use BigQuery Data Transfer Service to move files into BigQuery.
- D. Use Dataflow to move files into Cloud Storage.
Answer: B
Explanation:
To centralize unstructured data from various sources into Cloud Storage using a GUI-based solution while allowing the use of your own encryption keys, Cloud Data Fusion is the most suitable option. Here's why:
Cloud Data Fusion:
Cloud Data Fusion is a fully managed, cloud-native data integration service that helps in building and managing ETL pipelines with a visual interface.
It supports a wide range of data sources and formats, including Apache Parquet and CSV, and provides a user-friendly GUI for pipeline creation and management.
Custom Encryption Keys:
Cloud Data Fusion allows the use of customer-managed encryption keys (CMEK) for data encryption, ensuring that your data is securely stored according to your encryption policies.
Centralizing Data:
Cloud Data Fusion simplifies the process of moving data from on-premises and cloud sources into Cloud Storage, providing a centralized repository for your unstructured data.
Steps to Implement:
Set Up Cloud Data Fusion:
Deploy a Cloud Data Fusion instance and configure it to connect to your various data sources.
Create ETL Pipelines:
Use the GUI to create data pipelines that extract data from your sources and load it into Cloud Storage. Configure the pipelines to use your custom encryption keys.
Run and Monitor Pipelines:
Execute the pipelines and monitor their performance and data movement through the Cloud Data Fusion dashboard.
Reference:
Cloud Data Fusion Documentation
Using Customer-Managed Encryption Keys (CMEK)
NEW QUESTION # 28
The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster
____.
- A. worker node
- B. application node
- C. conditional node
- D. master node
Answer: D
Explanation:
The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster master node. The cluster master-host-name is the name of your Cloud Dataproc cluster followed by an -m suffix-for example, if your cluster is named "my-cluster", the master-host-name would be "my-cluster-m".
Reference: https://cloud.google.com/dataproc/docs/concepts/cluster-web-interfaces#interfaces
NEW QUESTION # 29
You operate a database that stores stock trades and an application that retrieves average stock price for a given company over an adjustable window of time. The data is stored in Cloud Bigtable where the datetime of the stock trade is the beginning of the row key. Your application has thousands of concurrent users, and you notice that performance is starting to degrade as more stocks are added. What should you do to improve the performance of your application?
- A. Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.
- B. Use Cloud Dataflow to write summary of each day's stock trades to an Avro file on Cloud Storage. Update your application to read from Cloud Storage and Cloud Bigtable to compute the responses.
- C. Change the data pipeline to use BigQuery for storing stock trades, and update your application.
- D. Change the row key syntax in your Cloud Bigtable table to begin with a random number per second.
Answer: A
NEW QUESTION # 30
You are developing an application on Google Cloud that will automatically generate subject labels for users' blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?
- A. Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
- B. Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
- C. Build and train a text classification model using TensorFlow. Deploy the model using Cloud Machine Learning Engine. Call the model from your application and process the results as labels.
- D. Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.
Answer: A
NEW QUESTION # 31
You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?
- A. Both batch and streaming
- B. Only batch
- C. BigQuery cannot be used as a sink
- D. Only streaming
Answer: A
Explanation:
When you apply a BigQueryIO.Write transform in batch mode to write to a single table, Dataflow invokes a BigQuery load job. When you apply a BigQueryIO.Write transform in streaming mode or in batch mode using a function to specify the destination table, Dataflow uses BigQuery's streaming inserts
NEW QUESTION # 32
You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component.
Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days. What should you do?
- A. Use Cloud Vision AutoML with the existing dataset.
- B. Use Cloud Vision AutoML, but reduce your dataset twice.
- C. Train your own image recognition model leveraging transfer learning techniques.
- D. Use Cloud Vision API by providing custom labels as recognition hints.
Answer: A
NEW QUESTION # 33
Your company is loading comma-separated values (CSV) files into Google BigQuery. The data is fully imported successfully; however, the imported data is not matching byte-to-byte to the source file. What is the most likely cause of this problem?
- A. The CSV data loaded in BigQuery is not using BigQuery's default encoding.
- B. The CSV data loaded in BigQuery is not flagged as CSV.
- C. The CSV data has not gone through an ETL phase before loading into BigQuery.
- D. The CSV data has invalid rows that were skipped on import.
Answer: D
NEW QUESTION # 34
The _________ for Cloud Bigtable makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline.
- A. DataFlow SDK
- B. BiqQuery API
- C. Cloud Dataflow connector
- D. BigQuery Data Transfer Service
Answer: C
Explanation:
Explanation
The Cloud Dataflow connector for Cloud Bigtable makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline. You can use the connector for both batch and streaming operations.
Reference: https://cloud.google.com/bigtable/docs/dataflow-hbase
NEW QUESTION # 35
You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?
- A. Instead of feeding in each feature individually, average their values in batches of 3.
- B. Remove the features that have null values for more than 50% of the training records.
- C. Eliminate features that are highly correlated to the output labels.
- D. Combine highly co-dependent features into one representative feature.
Answer: D
NEW QUESTION # 36
You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:
The user profile: What the user likes and doesn't like to eat
The user account information: Name, address, preferred meal times
The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data schem
a. Which Google Cloud Platform product should you use?
- A. Cloud Datastore
- B. Cloud SQL
- C. BigQuery
- D. Cloud Bigtable
Answer: C
NEW QUESTION # 37
Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the cost of queries, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will be connecting to BigQuery to read the eventsdata via an ODBC connection. You need to ensure the applications can connect. Which two actions should you take? (Choose two.)
- A. Create a new view over events_partitioned using standard SQL
- B. Create a new view over events using standard SQL
- C. Create a service account for the ODBC connection to use for authentication
- D. Create a new partitioned table using a standard SQL query
- E. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared "events"
Answer: B,E
NEW QUESTION # 38
......
Get Professional-Data-Engineer Actual Free Exam Q&As to Prepare Certification: https://examtorrent.vce4dumps.com/Professional-Data-Engineer-latest-dumps.html