
2022 Latest Databricks-Certified-Professional-Data-Engineer dumps - Instant Download PDF
Updated Verified Databricks-Certified-Professional-Data-Engineer Downloadable Printable Exam Dumps
NEW QUESTION 25
Two junior data engineers are authoring separate parts of a single data pipeline notebook. They are working on
separate Git branches so they can pair program on the same notebook simultaneously. A senior data engineer
experienced in Databricks suggests there is a better alternative for this type of collaboration.
Which of the following supports the senior data engineer's claim?
- A. Databricks Notebooks support the use of multiple languages in the same notebook
- B. Databricks Notebooks support real-time co-authoring on a single notebook
- C. Databricks Notebooks support commenting and notification comments
- D. Databricks Notebooks support the creation of interactive data visualizations
- E. Databricks Notebooks support automatic change-tracking and versioning
Answer: B
NEW QUESTION 26
A data engineering team needs to query a Delta table to extract rows that all meet the same condition.
However, the team has noticed that the query is running slowly. The team has already tuned the size of the
data files. Upon investigating, the team has concluded that the rows meeting the condition are sparsely located
throughout each of the data files.
Based on the scenario, which of the following optimization techniques could speed up the query?
- A. Z-Ordering
- B. Write as a Parquet file
- C. Data skipping
- D. Tuning the file size
- E. Bin-packing
Answer: A
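To see why co-locating matching rows helps, the following minimal sketch simulates file-level min/max statistics in plain Python. It is not Delta Lake's implementation: the helper names, the file size, and the data are hypothetical, and a single-column sort stands in for Z-Ordering, which clusters on several columns at once.

```python
def split_into_files(values, rows_per_file):
    """Chunk a list of column values into fixed-size 'files'."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]

def files_to_scan(files, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for f in files if min(f) <= target <= max(f))

# Hypothetical column: 1,000 rows with values 0..99 spread evenly through the data.
scattered = [i % 100 for i in range(1000)]
clustered = sorted(scattered)  # stand-in for co-locating similar values

scattered_files = split_into_files(scattered, 100)
clustered_files = split_into_files(clustered, 100)

# With scattered rows every file's range covers the target, so nothing can be skipped.
scattered_scans = files_to_scan(scattered_files, 42)
# With clustered rows only the file holding that value range must be read.
clustered_scans = files_to_scan(clustered_files, 42)
print(scattered_scans, clustered_scans)
```

Data skipping alone does not help in the scattered case because every file's statistics match the filter; clustering is what makes the statistics selective.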
NEW QUESTION 27
A data engineer is overwriting data in a table by deleting the table and recreating the table. Another data
engineer suggests that this is inefficient and the table should simply be overwritten instead.
Which of the following reasons to overwrite the table instead of deleting and recreating the table is incorrect?
- A. Overwriting a table results in a clean table history for logging and audit purposes
- B. Overwriting a table allows for concurrent queries to be completed while in progress
- C. Overwriting a table is efficient because no files need to be deleted
- D. Overwriting a table is an atomic operation and will not leave the table in an unfinished state
- E. Overwriting a table maintains the old version of the table for Time Travel
Answer: A
NEW QUESTION 28
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then
perform a streaming write into a new table. The code block used by the data engineer is below:
    (spark.table("sales")
        .withColumn("avg_price", col("sales") / col("units"))
        .writeStream
        .option("checkpointLocation", checkpointPath)
        .outputMode("complete")
        ._____
        .table("new_sales")
    )
If the data engineer only wants the query to execute a single micro-batch to process all of the available data,
which of the following lines of code should the data engineer use to fill in the blank?
- A. .processingTime("once")
- B. .trigger(processingTime="once")
- C. .trigger(continuous="once")
- D. .processingTime(1)
- E. .trigger(once=True)
Answer: E
NEW QUESTION 29
Which of the following commands will return records from an existing Delta table my_table where duplicates
have been removed?
- A. MERGE INTO my_table a
     USING new_records b;
- B. SELECT DISTINCT *
     FROM my_table;
- C. SELECT *
     FROM my_table
     WHERE duplicate = False;
- D. MERGE INTO my_table a
     USING new_records b ON a.id = b.id
     WHEN NOT MATCHED
     THEN INSERT *;
- E. DROP DUPLICATES
     FROM my_table;
Answer: B
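The behaviour of SELECT DISTINCT can be checked with a small sketch. It assumes an in-memory SQLite database as a stand-in for the Delta table, and the columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (2, "b"), (3, "c")],
)

# SELECT DISTINCT returns each unique row once; the stored table is unchanged.
distinct_rows = conn.execute("SELECT DISTINCT * FROM my_table ORDER BY id").fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
print(distinct_rows, total_rows)
```

The query only de-duplicates the returned records; the duplicates remain in the table itself.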
NEW QUESTION 30
A data engineering team has been using a Databricks SQL query to monitor the performance of an ELT job.
The ELT job is triggered by a specific number of input records being ready to process. The Databricks SQL
query returns the number of minutes since the job's most recent runtime.
Which of the following approaches can enable the data engineering team to be notified if the ELT job has not
been run in an hour?
- A. They can set up an Alert for the accompanying dashboard to notify them when it has not refreshed in 60 minutes
- B. They can set up an Alert for the accompanying dashboard to notify them if the returned value is greater than 60
- C. They can set up an Alert for the query to notify them when the ELT job fails
- D. They can set up an Alert for the query to notify them if the returned value is greater than 60
- E. This type of alerting is not possible in Databricks
Answer: D
NEW QUESTION 31
Consider flipping a coin for which the probability of heads is p, where p is unknown, and our goal is to
estimate p. The obvious approach is to count how many times the coin came up heads and divide by the total
number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very reasonable to
estimate p as approximately 0.367. However, suppose we flip the coin only twice and we get heads both times.
Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it seems a bit
rash to conclude that the coin will always come up heads, and ____________ is a way of avoiding such rash
conclusions.
- A. Logistic Regression
- B. Linear Regression
- C. Naive Bayes
- D. Laplace Smoothing
Answer: D
Explanation:
Smooth the estimates: consider flipping a coin for which the probability of heads is p, where p is unknown, and
our goal is to estimate p. The obvious approach is to count how many times the coin came up heads and divide
by the total number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very
reasonable to estimate p as approximately 0.367. However, suppose we flip the coin only twice and we get
heads both times. Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it
seems a bit rash to conclude that the coin will always come up heads, and smoothing is a way of avoiding such
rash conclusions. A simple smoothing method, called Laplace smoothing (or Laplace's law of succession or
add-one smoothing in R&N), is to estimate p by (one plus the number of heads) / (two plus the total number of
flips). Said differently, if we are keeping count of the number of heads and the number of tails, this rule is
equivalent to starting each of our counts at one, rather than zero. Another advantage of Laplace smoothing is
that it avoids estimating any probabilities to be zero, even for events never observed in the data. A known
drawback is that add-one smoothing can assign too much probability to unseen events, such as unseen words.
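The rule described above, (one plus the number of heads) / (two plus the total number of flips), can be sketched in a few lines. The function names are illustrative only.

```python
from fractions import Fraction

def raw_estimate(heads, flips):
    """Plain frequency estimate: heads / flips."""
    return Fraction(heads, flips)

def laplace_estimate(heads, flips):
    """Laplace (add-one) estimate: (heads + 1) / (flips + 2)."""
    return Fraction(heads + 1, flips + 2)

# Two flips, two heads: the raw estimate jumps to 1, the smoothed one does not.
print(raw_estimate(2, 2), laplace_estimate(2, 2))

# With 1000 flips the two estimates are almost identical.
print(float(raw_estimate(367, 1000)), float(laplace_estimate(367, 1000)))

# An outcome that was never observed still gets a non-zero probability.
print(laplace_estimate(0, 5))
```

With little data the smoothed estimate stays away from 0 and 1; with a lot of data the added counts barely matter.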
NEW QUESTION 32
A data engineer has written the following query:
    SELECT *
    FROM json.`/path/to/json/file.json`;
The data engineer asks a colleague for help to convert this query for use in a Delta Live Tables (DLT)
pipeline. The query should create the first table in the DLT pipeline.
Which of the following describes the change the colleague needs to make to the query?
- A. They need to add a CREATE LIVE TABLE table_name AS line at the beginning of the query
- B. They need to add a CREATE DELTA LIVE TABLE table_name AS line at the beginning of the query
- C. They need to add the cloud_files(...) wrapper to the JSON file path
- D. They need to add a COMMENT line at the beginning of the query
- E. They need to add a live. prefix prior to json. in the FROM line
Answer: A
NEW QUESTION 33
In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel
trick), is a fast and space-efficient way of vectorizing features (such as the words in a language), i.e., turning
arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and
using their hash values modulo the number of features as indices directly, rather than looking the indices up in
an associative array. What is the primary benefit of using the hashing trick when building classifiers?
- A. It creates smaller models
- B. It requires less memory to store the coefficients for the model
- C. It removes non-significant features, e.g. punctuation
- D. It removes noisy features
Answer: B
Explanation:
This hashed feature approach has the distinct advantage of requiring less memory and one less pass through
the training data, but it can make it much harder to reverse engineer vectors to determine which original
feature mapped to a vector location. This is because multiple features may hash to the same location. With
large vectors or with multiple locations per feature, this isn't a problem for accuracy but it can make it hard to
understand what a classifier is doing.
Models always have a coefficient per feature, which are stored in memory during model building. The hashing
trick collapses a high number of features to a small number which reduces the number of coefficients and thus
memory requirements. Noisy features are not removed; they are combined with other features and so still have
an impact.
The validity of this approach depends a lot on the nature of the features and problem domain; knowledge of
the domain is important to understand whether it is applicable or will likely produce poor results. While
hashing features may produce a smaller model, it will be one built from odd combinations of real-world
features, and so will be harder to interpret.
An additional benefit of feature hashing is that the unknown and unbounded vocabularies typical of word-like
variables aren't a problem.
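A minimal sketch of the hashing trick follows. It assumes MD5 as the hash function and eight buckets purely for illustration; real libraries typically use faster hashes and far larger vectors.

```python
import hashlib

def hash_features(tokens, n_buckets):
    """Map tokens to a fixed-length count vector using hash(token) mod n_buckets."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_buckets  # no vocabulary lookup table needed
        vector[index] += 1
    return vector

tokens = "the quick brown fox jumps over the lazy dog".split()
vec = hash_features(tokens, n_buckets=8)
print(vec)

# A word never seen before still maps to one of the same eight positions.
print(hash_features(["previously_unseen_word"], n_buckets=8))
```

The vector length, and therefore the number of coefficients, is fixed by `n_buckets` regardless of vocabulary size. Distinct tokens can collide in one bucket, which is why hashed models are harder to interpret.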
NEW QUESTION 34
A data engineer has created a Delta table as part of a data pipeline. Downstream data analysts now need
SELECT permission on the Delta table.
Assuming the data engineer is the Delta table owner, which part of the Databricks Lakehouse Platform can
the data engineer use to grant the data analysts the appropriate access?
- A. Jobs
- B. Dashboards
- C. Repos
- D. Databricks Filesystem
- E. Data Explorer
Answer: E
NEW QUESTION 35
A data engineer has set up a notebook to automatically process using a Job. The data engineer's manager wants
to version control the schedule due to its complexity.
Which of the following approaches can the data engineer use to obtain a version-controllable configuration of
the Job's schedule?
- A. They can link the Job to notebooks that are a part of a Databricks Repo
- B. They can submit the Job once on an all-purpose cluster
- C. They can download the JSON description of the Job from the Job's page
- D. They can download the XML description of the Job from the Job's page
- E. They can submit the Job once on a Job cluster
Answer: C
NEW QUESTION 36
A data engineer wants to horizontally combine two tables as a part of a query. They want to use a shared
column as a key column, and they only want the query result to contain rows whose value in the key column is
present in both tables.
Which of the following SQL commands can they use to accomplish this task?
- A. MERGE
- B. OUTER JOIN
- C. INNER JOIN
- D. LEFT JOIN
- E. UNION
Answer: C
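The difference between the join types can be sketched with an in-memory SQLite database standing in for the two tables. The table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (4, 40.0)])
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Bo"), (3, "Cy")])

# INNER JOIN keeps only keys present in both tables (1 and 2).
inner = conn.execute(
    "SELECT o.customer_id, c.name, o.amount "
    "FROM orders o INNER JOIN customers c ON o.customer_id = c.customer_id "
    "ORDER BY o.customer_id"
).fetchall()

# LEFT JOIN also keeps key 4 from orders, with NULL for the missing customer.
left = conn.execute(
    "SELECT o.customer_id, c.name "
    "FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id "
    "ORDER BY o.customer_id"
).fetchall()
print(inner)
print(left)
```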
NEW QUESTION 37
A data engineer has ingested a JSON file into a table raw_table with the following schema:
    transaction_id STRING,
    payload STRUCT<customer_id:STRING, date:TIMESTAMP, store_id:STRING>
The data engineer wants to efficiently extract the date of each transaction into a table with the following
schema:
    transaction_id STRING,
    date TIMESTAMP
Which of the following commands should the data engineer run to complete this task?
- A. SELECT transaction_id, date from payload
     FROM raw_table;
- B. SELECT transaction_id, payload[date]
     FROM raw_table;
- C. SELECT transaction_id, payload.date
     FROM raw_table;
- D. SELECT transaction_id, explode(payload)
     FROM raw_table;
- E. SELECT transaction_id, date
     FROM raw_table;
Answer: C
NEW QUESTION 38
A new data engineer [email protected] has been assigned to an ELT project. The new data
engineer will need full privileges on the table sales to fully manage the project.
Which of the following commands can be used to grant full permissions on the table to the new data engineer?
- A. GRANT SELECT ON TABLE sales TO [email protected];
- B. GRANT SELECT CREATE MODIFY ON TABLE sales TO [email protected];
- C. GRANT ALL PRIVILEGES ON TABLE [email protected] TO sales;
- D. GRANT USAGE ON TABLE sales TO [email protected];
- E. GRANT ALL PRIVILEGES ON TABLE sales TO [email protected];
Answer: E
NEW QUESTION 39
A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE.
Three datasets are defined against Delta Lake table sources using LIVE TABLE. The pipeline is configured to
run in Development mode using the Triggered Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after
clicking Start to update the pipeline?
- A. All datasets will be updated once and the pipeline will shut down. The compute resources will be
terminated
- B. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to
allow for additional testing
- C. All datasets will be updated continuously and the pipeline will not shut down. The compute resources
will persist with the pipeline
- D. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will
be deployed for the update and terminated when the pipeline is stopped
- E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will
persist after the pipeline is stopped to allow for additional testing
Answer: B
NEW QUESTION 40
If E1 and E2 are two events, how do you represent the conditional probability that E2 occurs given that
E1 has occurred?
- A. P(E1+E2)/P(E1)
- B. P(E1)/P(E2)
- C. P(E2)/P(E1+E2)
- D. P(E2)/P(E1)
Answer: D
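As a reference point, the standard definition is P(E2 | E1) = P(E1 and E2) / P(E1), for P(E1) > 0. The sketch below checks that definition by enumeration; the two-dice sample space and the events E1 and E2 are hypothetical examples, not part of the question.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a predicate over outcomes) under a uniform distribution."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def e1(o):
    """E1: the first die is even."""
    return o[0] % 2 == 0

def e2(o):
    """E2: the dice sum to 8."""
    return o[0] + o[1] == 8

p_e1 = prob(e1)
p_both = prob(lambda o: e1(o) and e2(o))
p_e2_given_e1 = p_both / p_e1

# Cross-check: restrict the sample space to E1 and count E2 inside it.
within_e1 = [o for o in outcomes if e1(o)]
direct = Fraction(sum(1 for o in within_e1 if e2(o)), len(within_e1))
print(p_e2_given_e1, direct)
```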
NEW QUESTION 41
Which of the following data workloads will utilize a Bronze table as its source?
- A. A job that develops a feature set for a machine learning application
- B. A job that ingests raw data from a streaming source into the Lakehouse
- C. A job that enriches data by parsing its timestamps into a human-readable format
- D. A job that queries aggregated data to publish key insights into a dashboard
- E. A job that aggregates cleaned data to create standard summary statistics
Answer: C
NEW QUESTION 42
A dataset has been defined using Delta Live Tables and includes an expectations clause:
    CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')
What is the expected behaviour when a batch of data containing data that violates these constraints is
processed?
- A. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log
- B. Records that violate the expectation cause the job to fail
- C. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset
- D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log
- E. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table
Answer: D
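The three expectation behaviours can be compared with a small simulation. This is a plain-Python sketch, not the Delta Live Tables API: `apply_expectation` and its `on_violation` values are hypothetical stand-ins for a plain EXPECT, EXPECT ... ON VIOLATION DROP ROW, and EXPECT ... ON VIOLATION FAIL UPDATE, and the returned metrics dictionary stands in for the event log.

```python
from datetime import datetime

def apply_expectation(rows, predicate, on_violation="warn"):
    """Simulate expectation handling on a batch of rows.

    'warn' keeps every row, 'drop' removes violating rows, and 'fail' raises.
    Pass/fail counts are returned alongside the rows that were kept.
    """
    failed = [row for row in rows if not predicate(row)]
    if on_violation == "fail" and failed:
        raise ValueError(f"{len(failed)} record(s) violated the expectation")
    kept = list(rows) if on_violation == "warn" else [row for row in rows if predicate(row)]
    metrics = {"passed": len(rows) - len(failed), "failed": len(failed)}
    return kept, metrics

batch = [
    {"id": 1, "timestamp": datetime(2021, 6, 1)},
    {"id": 2, "timestamp": datetime(2019, 3, 15)},  # violates the constraint
]

def valid_timestamp(row):
    return row["timestamp"] > datetime(2020, 1, 1)

kept, metrics = apply_expectation(batch, valid_timestamp)       # like a plain EXPECT
dropped, _ = apply_expectation(batch, valid_timestamp, "drop")  # like ON VIOLATION DROP ROW
print(len(kept), metrics, len(dropped))
```

With no ON VIOLATION clause the violating record stays in the target and only the counts change, which matches the answer above.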
NEW QUESTION 43
A data architect is designing a data model that works for both video-based machine learning workloads and
highly audited batch ETL/ELT workloads.
Which of the following describes how using a data lakehouse can help the data architect meet the needs of
both workloads?
- A. A data lakehouse combines compute and storage for simple governance
- B. A data lakehouse fully exists in the cloud
- C. A data lakehouse stores unstructured data and is ACID-compliant
- D. A data lakehouse requires very little data modeling
- E. A data lakehouse provides autoscaling for compute clusters
Answer: C
NEW QUESTION 44
There are 5000 balls of different colors, of which 1200 are pink. What is the maximum
likelihood estimate for the proportion of "pink" items in the test set of color balls?
- A. 4.8
- B. .24
- C. 24.0
- D. 2.4
- E. .48
Answer: B
Explanation:
Given no additional information, the MLE for the probability of an item in the test set is exactly its frequency
in the training set. The method of maximum likelihood corresponds to many well-known estimation methods
in statistics. For example, one may be interested in the heights of adult female penguins, but be unable to
measure the height of every single penguin in a population due to cost or time constraints. Assuming that the
heights are normally (Gaussian) distributed with some unknown mean and variance, the mean and variance
can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE
would accomplish this by taking the mean and variance as parameters and finding particular parametric values
that make the observed results the most probable (given the model).
In general, for a fixed set of data and underlying statistical model the method of maximum likelihood selects
the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes
the "agreement" of the selected model with the observed data, and for discrete random variables it indeed
maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood
estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution
and many other problems. However, in some complicated problems, difficulties do occur: in such problems,
maximum-likelihood estimators are unsuitable or do not exist.
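For the pink-ball question, the estimate is the observed frequency, 1200 / 5000. The sketch below assumes a Bernoulli model for each ball and uses a coarse grid search as a sanity check that the likelihood peaks at that frequency. The grid resolution is an arbitrary choice.

```python
import math

def log_likelihood(p, successes, trials):
    """Bernoulli log-likelihood of `successes` out of `trials` at proportion p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)

successes, trials = 1200, 5000
closed_form = successes / trials  # the sample frequency

# Candidate proportions 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: log_likelihood(p, successes, trials))
print(closed_form, best)
```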
NEW QUESTION 45
A data engineer has ingested data from an external source into a PySpark DataFrame raw_df. They need to
briefly make this data available in SQL for a data analyst to perform a quality assurance check on the data.
Which of the following commands should the data engineer run to make this data available in SQL for only
the remainder of the Spark session?
- A. raw_df.createOrReplaceTempView("raw_df")
- B. raw_df.write.save("raw_df")
- C. There is no way to share data between PySpark and SQL
- D. raw_df.createTable("raw_df")
- E. raw_df.saveAsTable("raw_df")
Answer: A
NEW QUESTION 46
A data engineer has three notebooks in an ELT pipeline. The notebooks need to be executed in a specific order
for the pipeline to complete successfully. The data engineer would like to use Delta Live Tables to manage this
process.
Which of the following steps must the data engineer take as part of implementing this pipeline using Delta
Live Tables?
- A. They need to refactor their notebook to use Python and the dlt library
- B. They need to refactor their notebook to use SQL and the CREATE LIVE TABLE keyword
- C. They need to create a Delta Live Tables pipeline from the Compute page
- D. They need to create a Delta Live Tables pipeline from the Data page
- E. They need to create a Delta Live Tables pipeline from the Jobs page
Answer: E
NEW QUESTION 47
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also
used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The
data engineer needs to identify which files are new since the previous run in the pipeline, and set up the
pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?
- A. Delta Lake
- B. Auto Loader
- C. Data Explorer
- D. Databricks SQL
- E. Unity Catalog
Answer: B
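The underlying idea, tracking which files have already been ingested so that each run picks up only new arrivals without moving or deleting anything, can be sketched in plain Python. This is a conceptual illustration, not how Auto Loader is implemented: `discover_new_files` is a hypothetical helper, and the in-memory set stands in for a durable checkpoint.

```python
import os
import tempfile

def discover_new_files(directory, processed):
    """Return files in `directory` not yet recorded in `processed`, then record them."""
    current = sorted(os.listdir(directory))
    new_files = [name for name in current if name not in processed]
    processed.update(new_files)
    return new_files

with tempfile.TemporaryDirectory() as landing_dir:
    processed = set()  # stand-in for a durable checkpoint

    # Two files arrive before the first run.
    for name in ("a.json", "b.json"):
        open(os.path.join(landing_dir, name), "w").close()
    first_run = discover_new_files(landing_dir, processed)

    # One more file arrives; the earlier files stay in place.
    open(os.path.join(landing_dir, "c.json"), "w").close()
    second_run = discover_new_files(landing_dir, processed)

print(first_run, second_run)
```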
NEW QUESTION 48
Which of the following is a continuous probability distribution?
- A. Negative binomial distribution
- B. Binomial probability distribution
- C. Poisson probability distribution
- D. Normal probability distribution
Answer: D
