List comprehension vs Generator expression
When should you choose a list comprehension, and when a generator expression?
Syntax
List comprehension
[expression for item in iterable if condition]
Generator expression
(expression for item in iterable if condition)
The only syntactic difference is the brackets: [] builds a list immediately, while () creates a generator object.
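A quick way to see the difference is to check the type each form returns. This is a minimal sketch; the variable names are illustrative only.

```python
from types import GeneratorType

numbers = [1, 2, 3]

as_list = [x * 2 for x in numbers]   # square brackets: values are computed now
as_gen = (x * 2 for x in numbers)    # parentheses: values are computed on demand

print(type(as_list).__name__)  # list
print(type(as_gen).__name__)   # generator
print(as_list)                 # [2, 4, 6]
print(list(as_gen))            # [2, 4, 6]
```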
How Generator Expression Works:
Initialization:
- When you create a generator expression, it doesn't compute any values right away. Instead, it returns a generator object that can produce values one at a time.
Iteration:
- As you iterate over the generator (e.g., when the sum function requests the next value), the generator expression computes the next value and yields it to the caller. The generator keeps track of its state internally, so it knows where it left off each time it is called to produce a new value.
On-the-Fly Computation:
- The computation of each value happens only when it is needed, which means memory usage is minimized since it doesn't store all values at once.
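The three points above can be observed directly by stepping through a generator with next(). This is a minimal sketch; the helper traced_square and the calls list are hypothetical names introduced only to record when each value is actually computed.

```python
calls = []  # records the order in which values are actually computed

def traced_square(x):
    """Hypothetical helper: squares x and logs that it was called."""
    calls.append(x)
    return x ** 2

gen = (traced_square(x) for x in [1, 2, 3])

# Initialization: no value has been computed yet.
print(calls)       # []

# Iteration: each next() computes exactly one more value.
print(next(gen))   # 1
print(calls)       # [1]

# The generator remembers where it left off.
print(next(gen))   # 4
print(calls)       # [1, 2]

# Consuming the rest exhausts the generator.
print(list(gen))   # [9]
print(list(gen))   # []
```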
Example with the sum function:
When you use a generator expression with the sum function, sum requests values one at a time and accumulates them. Here's the process step by step:
large_range = range(1, 1000000)
total_sum = sum(x ** 2 for x in large_range)
print(total_sum) # Output: Sum of squares of numbers from 1 to 999999
Generator Creation: The generator expression (x ** 2 for x in large_range) creates a generator object.
Summation: The sum function starts iterating over the generator:
- It requests the first value: 1 ** 2 = 1
- It requests the second value: 2 ** 2 = 4
- It continues requesting and summing values until the end of the range.
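The steps above can be sketched as a hand-written loop. This is not how the built-in sum is implemented; it is only a rough pure-Python equivalent, shown on a smaller range so the result is easy to check by hand. The name manual_sum is hypothetical.

```python
def manual_sum(iterable):
    """Rough pure-Python equivalent of sum(), for illustration only."""
    iterator = iter(iterable)
    total = 0
    while True:
        try:
            value = next(iterator)   # request the next value
        except StopIteration:
            break                    # the generator is exhausted
        total += value               # accumulate it
    return total

small_range = range(1, 5)
squares = (x ** 2 for x in small_range)
print(manual_sum(squares))  # 30  (1 + 4 + 9 + 16)
```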
When to use which
Here are some use case examples for both list comprehensions and generator expressions to illustrate when and why you might use each.
Use Case Examples for List Comprehensions:
Transforming Data:
- Example: Squaring numbers in a list.
numbers = [1, 2, 3, 4, 5]
squares = [x ** 2 for x in numbers]
print(squares) # Output: [1, 4, 9, 16, 25]
Filtering Data:
- Example: Filtering even numbers from a list.
numbers = [1, 2, 3, 4, 5, 6]
evens = [x for x in numbers if x % 2 == 0]
print(evens) # Output: [2, 4, 6]
Flattening a List of Lists:
- Example: Flattening a 2D list.
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat_list = [num for row in matrix for num in row]
print(flat_list) # Output: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Creating Dictionaries or Sets:
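- Example: Creating a dictionary that maps each key to its square. Strictly speaking this is a dict comprehension: the same syntax with curly braces and a key: value pair (curly braces with a single expression build a set).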
keys = [1, 2, 3]
squares_dict = {x: x ** 2 for x in keys}
print(squares_dict) # Output: {1: 1, 2: 4, 3: 9}
Conditional Logic:
- Example: Replacing negative numbers with 0.
numbers = [-1, 2, -3, 4, -5]
non_negatives = [x if x >= 0 else 0 for x in numbers]
print(non_negatives) # Output: [0, 2, 0, 4, 0]
Use Case Examples for Generator Expressions:
Processing Large Datasets:
- Example: Summing squares of a large range without storing the list.
large_range = range(1, 1000000)
total = sum(x ** 2 for x in large_range)
print(total) # Output: A large number (sum of squares)
Memory-Efficient Data Processing:
- Example: Filtering and transforming a large sequence of numbers without building an intermediate list.
large_list = range(1, 1000000)
evens_squared = (x ** 2 for x in large_list if x % 2 == 0)
total = sum(evens_squared)
print(total) # Output: Sum of squares of even numbers
Counting Files in a Databricks Directory:
- Example: Counting the files in a directory while excluding subdirectories.
# Replace '/path/to/folder' with the actual path to your folder
folder_path = "dbfs:/FileStore/PySpark_demo/"
# List the contents of the folder
folder_contents = dbutils.fs.ls(folder_path)
# Filter to count only files (exclude directories)
file_count = sum(1 for item in folder_contents if not item.isDir())
# Display the file count
print(f"The number of files (excluding directories) in the folder '{folder_path}' is: {file_count}")
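The same counting pattern also works outside Databricks. The following is a sketch for a local filesystem that uses the standard library's os.scandir in place of dbutils.fs.ls; the function name count_files and the temporary folder layout are assumptions made purely for this illustration.

```python
import os
import tempfile

def count_files(folder_path):
    """Count entries in folder_path that are files, not directories."""
    with os.scandir(folder_path) as entries:
        return sum(1 for entry in entries if entry.is_file())

# Build a throwaway folder with two files and one subdirectory.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(folder, name), "w") as f:
            f.write("demo")
    os.mkdir(os.path.join(folder, "subdir"))

    print(count_files(folder))  # 2
```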
Summary:
List Comprehensions: Ideal for creating lists when you need the entire dataset at once or want to reuse it later. They are concise and readable for transforming and filtering data, and the resulting list supports access by index, which a generator expression does not.
Generator Expressions: Useful for handling large datasets or streams where you want to process items one by one without storing them all in memory. They are more memory-efficient.
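The trade-offs in this summary can be checked with a short sketch. The exact byte counts reported by sys.getsizeof vary by Python version and platform, so the comments below do not state specific sizes.

```python
import sys

squares_list = [x ** 2 for x in range(100_000)]
squares_gen = (x ** 2 for x in range(100_000))

# The list stores a reference to every element; the generator stores only its state.
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # True

# A list supports indexing and can be iterated more than once.
print(squares_list[3])                          # 9
print(sum(squares_list) == sum(squares_list))   # True

# A generator cannot be indexed...
try:
    squares_gen[3]
except TypeError as exc:
    print(f"TypeError: {exc}")

# ...and is exhausted after a single pass.
first_pass = sum(squares_gen)
second_pass = sum(squares_gen)
print(first_pass == sum(squares_list))  # True
print(second_pass)                      # 0
```

If the values are needed more than once, or by position, a list is the safer choice; if they are consumed once in a single pass, the generator avoids holding them all in memory.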