1.2 Programming

WiSe 2023/2024
Julian Huber

Bio Data-Science

Bio Data-Science

2.1 Variables and Data Types

Bio Data-Science

2.1.1 Python with Google Colab

Bio Data-Science

🎯 Learning objectives

You will be able to

  • write Python code and Markdown Text in Juypter Notebooks / Google Colab
  • execute Notebooks on their on PC or using Google Colabs
Bio Data-Science

Hello World!

  1. Python
    print("Hello, World!")
    
Bio Data-Science

Python

"Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."

  • high-level - easy to understand for humans
  • interpreted - we do not have to compile whole programs to run it
  • general-purpose - we will use it for bio informatics, but You could also build web services, games, etc.
  • indentation - white space has meaning
https://en.wikipedia.org/wiki/Python_(programming_language)
Bio Data-Science
Executing Python Code
  • There are three options to follow this course
    • If you have a Google-Account: use Google Colab in Your browser
    • Running a online-notebook with binder
    • Running Python Jupyter Notebooks on Your PC. Install the free version of Anaconda (follow this installation guide)
Bio Data-Science

How we will work together

  • Before You start, put the the red card on top, this will indicate that You are still working on the challenge
  • ✍️ are simple practical task You should try on Your own
  • 🏆 are more challenging practical task, where You can work in a group
  • 🤓 are optional task, if You want to learn more. Only start them if You finished all non-optional tasks until the next 🏁
  • 🏁 Once, You reach the recap mark, switch the cards. A green card indicates that everything is clear, a yellow card that we should discuss the solution together
  • At any time, if You have a question: Raise Your hand and work with Your class mates
Bio Data-Science

✍️ Copying a Google Colab Notebook

  • You also find the link to the notebooks in Sakai
  • read and try out 1.2.1.1 Python with google Colab
  • The solution to all tasks can be found here

⌛ 15 minutes

Bio Data-Science
🤓 Without a Google Account
  • Click this button
  • Make sure to download script on the end of the lecture or Your changes will be lost
Bio Data-Science
Downloading a Google Colab Notebook
Bio Data-Science
Working locally
  • open Jupyter Lab or VS Code and navigate to the .ipynb file to open it
  • on the local PCs You can start both programs using the Anaconda Navigator program
Bio Data-Science

2.1.2 Basic elements of Python code

Bio Data-Science

🎯 Learning objectives

You will be able to

  • store numbers and text in Python variables
  • use proper naming conventions for naming variables
  • to interpret a Python error message and use Google to solve it
Bio Data-Science

Sequential Execution

  • Each line is evaluated after the other
  • lines starting with # are comments that are not executed
# This is a comment
1+2
# > 3

"""
There are also
multi-line comments
"""
Bio Data-Science
Output in Jupyter Notebooks
  • Cells in a Jupyter Notebook are executed individually
  • If the last line of a cell has a return value, it will output this value below
    1+2
    
  • This will output 3
Bio Data-Science

Commands

  • special keywords with instructions
    print(1)
    
Bio Data-Science

Variables

  • store data in the memory and make is accessible by a name
        message = "Hello Python world!"
    
  • variables are stored as long as the runtime is not restarted and can be used in all cells
  • the runtime defines the current working memory, we use
Bio Data-Science
  • A variable is a pair consisting of a name (indicates space in memory),
    to which a value is assigned.
  • The value can be changed by further assignments (is variable).
  • The type of the value (content) of a variable is defined by the data type.
    • e.g., digits, characters, letters, words, lists,...
    • according to this, different amounts of space are reserved in the memory
Bio Data-Science

✍️ Task: Basic elements of Python code

⌛ 15 minutes

Bio Data-Science

2.1.3 Data Types

Bio Data-Science

🎯 Learning objectives

You will be able to

  • identify the data types of variables
  • apply methods to strings and variables containing strings
Bio Data-Science

Types

  • Each variable has a type, that defines, how the information is coded in the memory

    an_int = 42
    type(an_int)
    # > <class 'int'>
    
    a_string = 'This is a string.'
    type(a_string)
    # > <class 'str'>
    
Strings in Python 3 are Unicode
Bio Data-Science

Bio Data-Science

✍️ Basic elements of Python code

  • read and try out 2.1.3 Data Types

  • ✍️ There are two tasks You should try on Your own

  • 🏆 For the other task You can form a group of three

⌛ 25 minutes

Bio Data-Science

2.1.4 Calculating with Python

Bio Data-Science

🎯 Learning Objectives

You will be able to

  • calculate mathematical formulas in Python using the math package
  • use the print function to show Your results
Bio Data-Science

Importing Packages

  • Packages are are unit of code that are provided in a central package index
  • Some packages (like the math package) are part of the Python programming language and can be loaded with the import statement
    import math
    math.pi
    print(math.sin(math.radians(30)))
    
Bio Data-Science

Goal of this lecture: Using Python instead of Your Calculator

  • Python can do everything your calculator does
  • Jupyter Notebooks can store results
  • Code makes it easy to reuse your calculations
https://www.amazon.de/Casio-FX-991-Wissenschaftlicher-Taschenrechner-Bedienungsanleitung/dp/B0034BAQS8
Bio Data-Science

🏆Case Study

P(t)=K1+(KP0P0)ert.{\displaystyle P(t)={\frac {K}{1+\left({\frac {K-P_{0}}{P_{0}}}\right)e^{-rt}}}.}

Bio Data-Science

Logistic Growth

P(t)=K1+(KP0P0)ert,{\displaystyle P(t)={\frac {K}{1+\left({\frac {K-P_{0}}{P_{0}}}\right)e^{-rt}}},}

  • tt ... time
  • P(t)P(t) ... population size
  • P0P_0 ... initial population
  • KK ... carrying capacity
  • rr ... growth rate
Bio Data-Science
  • Growth start exponentially
  • slows down as it reaches the carrying capacity
  • can we use Python the replicate this formula?
https://opentextbc.ca/conceptsofbiologyopenstax/chapter/population-growth-and-regulation/
Bio Data-Science
  • Write a flexible simulation for logistic growth, so can simulate:
    • Seal populations
    • Bacterial growth
    • etc.

1 Variables and Data Types

⌛ 25 minutes

Bio Data-Science

Hints

# importing the math library
import math

# here the user can define the input
...


# here the calculation is made
...

# here, we print out the results
...
Bio Data-Science

2.2 Lists and loops

Bio Data-Science

2.2.1 Special Data Types

🎯 Learning Objectives

You will be able to

  • use Python to formulate and evaluate Boolean expressions
  • use None type variables
Bio Data-Science

Difference between Assignment (=) and Comparison (==)

eve_is_here = True                  # Assignment: eve_is_here is set to True
adam_is_here = False                # Assignment
print(adam_is_here == eve_is_here)  # Comparison of the values
#> False
Bio Data-Science

NoneType

reading_frame_start_position = None
type(reading_frame_start_position)
#> NoneType
Bio Data-Science

✍️ Special Data Types

⌛ 10 minutes

Bio Data-Science

2.2.2 Defining Lists

🎯 Learning Objectives

You will be able to

  • define and fill lists manually
  • define list containing integer sequences
  • manipulate lists
Bio Data-Science

Lists

nucleotides = ["Guanine","Adenine","Cytosine","Thymine"]

print(nucleotides[0])

#> "Guanine"
Bio Data-Science

DNA & RNA from the perspective of a computer scientist

  • Nature does not use binary (0/1)
  • Nature stores data in long polymers of four nucleotides, usually symbolized by a single letter:
    either A, T, C, or G (for DNA)
  • For the first part of the lecture, we only consider DNA sequences and we can treat them as a list
Bio Data-Science
  • Nature always has a backup:
    • double-stranded DNA stores the complement on the other stand
    • ATGG
    • TACC
https://www.genome.gov/genetics-glossary/Base-Pair
Bio Data-Science
  • It's not the most efficient way, but
    we can consider the nucleotide sequence
    a string: sting = "ATGG"
  • a string is a list of single characters ["A","T","G","G"]
  • Why is this not efficient?
    • A character in a string requires
      at least 8 Bit (ASCII-Encoding)
    • As we know we will only have
      to deal with four different states (A,T,C, or G)
    • 2 Bit should be enough
Bio Data-Science
  • like with other encodings the data is not the information of interest
  • pairs of three nucleotides are called a codon
  • the translate to an amino acid or a stopping point
https://www.genome.gov/genetics-glossary/Codon
Bio Data-Science
  • How many possible codons are there?
  • nkn^k with n=4n=4 possible bases in a combination of k=3k=3
  • 43=644^3=64
  • This information is stored in Codon Tables
https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables
Bio Data-Science

✍️ Defining Lists

Bio Data-Science

2.2.3 Working with Lists

🎯 Learning Objectives

You will be able to

  • write scripts that loops through a list to automate calculations
  • copy parts of lists
  • decide, when to use tuples instead of lists
Bio Data-Science
nucleotides = ["Guanine","Adenine","Cytosine","Thymine"]

for nucleotide in nucleotides:
    print(nucleotide)
#> "Guanine"
#> "Adenine"
#> "Cytosine"
#> "Thymine"
Bio Data-Science

✍️ Working with Lists

Bio Data-Science

2.2.4 Applications for Lists

🎯 Learning Objectives

You will be able to

  • use list operations to clean up DNA-sequence data
Bio Data-Science

🏆 Case Study: Use Python to break up a RNA sequence into codons

  • You are given data of the following RNA-Sequence.
    dna_sequence = ["A", "U", "C", "C", "G","A", "G", "C", "U", "E", "G","A", "G", "C", "U", "G", "Z", "G","A", "G", "C", "U","U"]
  • What sequence of amino acids does this RNA sequence encode?
  • Bonus: Find an open reading frame fist
Bio Data-Science

Task

  • As You see, the data might have some errors, it should contain only the nucleotides A, U, T, and G. First clean the data by removing all corrupted items from the list.
  • Next create a loop, that prints out all three letter codons of the remaining sequence and store it in a list:
    ['A', 'U', 'C']
    ['C', 'G', 'A']
    ['G', 'C', 'U']
    ['G', 'A', 'G']
    ['C', 'U', 'G']
    ['G', 'A', 'G']
    ['C', 'U', 'U']
    
Bio Data-Science

Link

2 Lists and loops

⌛ 10 minutes

Bio Data-Science

2.3 What if?

Bio Data-Science

2.3.1 Boolean Expressions

🎯 Learning Objectives

You will be able to

  • write conditional expressions
Bio Data-Science

Boolean Expressions

  • use Boolean algebra and result in a Boolean value
    a = True
    b = True
    c = False
    result = (a and b) or (b or c)
    
  • comparisons of other (non Boolean) data types can also result in a Boolean
    a = ("abc" == "abc")
    b = (5<0)
    
  • note the difference between =and ==
Bio Data-Science

✍️ Boolean Expressions

⌛ 10 minutes

Bio Data-Science

2.3.2 Conditionals

🎯 Learning Objectives

You will be able to

  • write conditional expressions to control the execution of Python code
  • decide when to use an if, elif or else statement
Bio Data-Science

Conditionals

  • A key concept of higher programming languages are conditional statements
  • this means that some operations B and C are only made, if the condition A is either True or False
https://upload.wikimedia.org/wikipedia/commons/c/c5/If-Then-Else-diagram.svg
Bio Data-Science

if-clause

  • most languages use some form of if-clause
    if <logical expression>:
        <statement>
    
    •  if True:
           print("This is printed!")
      
    •   if False:
            print("This is not printed!")
      
Bio Data-Science

🧠 if-clause - one-sided

if <logical expression>:
    <statement>
  • logical expression is evaluated
    • TRUE: statement is executed
    • FALSE: statement is ignored
pin_correct = True

if pin_correct:
     print('Pin is correct!') 
Bio Data-Science

🧠 if-else-clause - two-sided

if <logical expression>:
    <statement a>
else <logical expression>:
    <statement b>
  • logical expression is evaluated
    • TRUE: statement a is executed
    • FALSE: statement b is ignored
# Beispiel printed Lösung
a = 4
b = 3
if a>b:
     print("a is larger than b")
else:
     print("b is larger than a")    
Bio Data-Science

✍️ Case Study: Write a program that manages lab access

Write a program the checks the lab clearance of anyone wanting to enter the lab:

  • Make a list of five or more usernames called users_with_clearance.
  • Make another list of five usernames called users_wanting_to_enter. Make sure one or two of the new usernames are also in the users_with_clearance list.
  • Loop through the users_wanting_to_enter list to see if each of them has a lab clearance. Print a message that to greet all the persons. If the person hasn't sent them to a supervisor.
Bio Data-Science

⌛ 45 minutes

3 What if?

Bio Data-Science

2.4 Storing Information and while-loops

Bio Data-Science

2.4.1 Dictionaries

🎯 Learning Objectives

You will be able to

  • define dictionaries
  • decide, when to use dictionaries instead of lists
  • work with nested structures of dictionaries and lists
Bio Data-Science

Sometimes a list is not enough

  • Imaging we want to store the information what codons encodes which amino acid
  • we could use lists to do so:
  • list_of_codons = ["UUU","UCU", "UUC" <...>]
  • list_of_amino_acids = ["phenilalanyne","serine", "phenilalanyne" <...>]
  • we could remember the index of the codon we are looking for
  • Drawbacks
    • What if we loose a single item in one list?
Bio Data-Science

Enter Dictionaries

  • A dictionary stores key-value pairs

    dict = {<key> : <value>}
    
    dict = {"UUU" : "phenilalanyne",
            "UCU" : "serine",
            "UUC" : "phenilalanyne",
            [...]  }
    
  • We can get the data using the keys (index)

    print(dict["UUU"])
    >>> "phenilalanyne"
    
Bio Data-Science

Data structures

  • List: each node except the last has exactly one reference to its successor.
  • Tree: each node can have multiple pointers to successors, but each node is only referenced by at most one predecessor node.
  • Graph: Nodes can also be referenced by multiple predecessor nodes.

Bio Data-Science
Hierarchical Dictionaries (Trees)
  • If we would like to store data of different populations, wouldn't it be nice to have some structure?

    dict = {"E. Coli" : 
                {   "Initial population" : 1000,
                    "Carrying capacity" : 100000,
                    "Growth rate" : 0.1
                    "Data" : [{
                        "time" : 0,
                        "population" : 1000
                    },
                    {    "time" : 1,
                        "population" : 1010
                    }]
    
                },
            "Seals" : [...]  
            }
    
Bio Data-Science

✍️ Case Study: Write a program to analyze logistic growth

  • Write a simulation for logistic growth that evaluates the population size for all time steps starting from zero and only ends if the population surpasses a closeness of ϵ=1\epsilon = 1 to the carrying capacity.
Bio Data-Science

Sometimes it is unclear when to terminate the algorithm

  • Solution with for loop

    for t in range(1,100): 
        current_population = <...>
    
  • Solution with while-loop

    t = 0
    
    while epsilon > 1:
        epsilon = carrying_capacity - current_population
        t = t+1 
        current_population = <...>
    
Bio Data-Science
  • Store the results for each time step in a list. The list should contain dictionaries that have the time step, current population size and the population growth since the last time step.

  • Find the time step with the maximum growth in population.

Bio Data-Science

Link

4 Storing and Entering Information

  • read and try out 2.4.1 Dictionaries
Bio Data-Science

2.4.2 While-Loops

🎯 Learning Objectives

You will be able to

  • use while loops and prevent endless loops by using break statements
Bio Data-Science

Sometimes it is unclear when to terminate the algorithm

distance_to_wall = 15

while distance_to_wall > 1:
    do_step_forward()
Bio Data-Science

✍️ While-Loops

Bio Data-Science

2.5 Functions

Bio Data-Science

🎯 Learning Objectives

You will be able to

  • create functions with (optional) parameters and return values
  • Students use docstings to comment on a function
  • Students can restructure code with functions
Bio Data-Science

You probably created a lot of Code during the last lectures

# importing the math library
import math

t = 0
epsilon = 1

# here the user can define the input
population_size = 1
carrying_capacity = 200
growth_rate = 0.1

results = []

print("Started simulation with the following parameters: \n initial population size: {} \n carrying capacity: {} \n growth rate: {}".format(initial_population_size, carrying_capacity, growth_rate))

while population_size + epsilon  < carrying_capacity:

  last_population_size = population_size

  population_size = carrying_capacity / (1+((carrying_capacity - initial_population_size)/initial_population_size) * math.exp(- growth_rate*t))

  growth = population_size - last_population_size

  print("The population after {} time steps is {}".format(t, population_size))

  t = t + 1  

  new_result = {"Time step" : t, "Population size" : population_size, "Growth" : growth}
  results.append(new_result)

  last_growth = 0

for data_point in results:
  growth = data_point["Growth"] 
  is_current_growth_lower_than_before = (last_growth > growth)

  if is_current_growth_lower_than_before:
    print("Found maximum growth of {} at {}!".format(last_data_point["Growth"],last_data_point["Time step"] ))
    break
  last_growth = data_point["Growth"] 
  last_data_point = data_point
Bio Data-Science

What if You

  • have do some changes (e.g., integrate a new growth function)?
  • want to reuse code to evaluate another population?
Bio Data-Science

Structure code with functions to make simulations reusable

Scenario 1 Scenario 2 Scenario ...
ϵ\epsilon 1 10 ...
Initial population 1 200 ...
Carrying capacity 200 1000 ...
Growth rate 0.1 0.01 ...
Bio Data-Science

Functional programming

  • What we did so far is called imperative programming: We told Python what to do.

  • functional programming focuses on functions

    • f(x)=2x=yf(x) = 2\cdot x = y
      def f(x):
          """this function's name is f. It takes a value x and returns a value y"""
          [...]
          y=x*2
          return y
      
      f(2) # function call
      
  • Functional programming makes code easier to understand and maintain

  • Functions should

    • be shorter than 20 lines of code
    • do one thing
Bio Data-Science
def add_two_numbers(first_number, second_number):
  """Returns the sum of two numbers."""
  new_sum = first_number + second_number
  return new_sum

new_sum = add_two_numbers(first_number = 1,second_number = 2)
print(new_sum)
Bio Data-Science

Anatomy of a Function

def <function_name>(<parameter>):
  """<docstring>"""
  ...
  return <return_value>

<function_name>(<parameter>=<argument>)

Bio Data-Science

✍️ Case Study: Structure code with functions to make simulations reusable

Scenario 1 Scenario 2 Scenario ...
ϵ\epsilon 1 10 ...
Initial population 1 200 ...
Carrying capacity 200 1000 ...
Growth rate 0.1 0.01 ...
  • refactor Your code into three functions and use them
    to make simulations for other populations
Bio Data-Science

Link

5 Functions

  • there are some 🤓 optional tasks, try them anyway but only after You solved anything else!
  • there is a group work at the end 🏆
Bio Data-Science

2.6 Algorithms for Bio Informatics

Bio Data-Science

2.6.1 Simple Algorithms for Bio Informatics

🎯 Learning Objectives

You will be able to

  • work with biological sequence data in the form of strings
  • design and implement simple algorithms to solve problems in DNA sequence analysis
  • differentiate between global and local approaches to compare sequence data
Bio Data-Science

Genome - Code of Life

  • Similarity / Evolution
  • Malfunction / Function
https://www.genome.gov/genetics-glossary/Codon
Bio Data-Science

Simple Questions

"TAGCTAGCTAGCTTTTAGTTAGCAGCC"
  • How common are the four nucleotides?
  • Finding the reverse complement (other string of the double-helix)
  • Do we find a certain sequence in the genome and where?
Bio Data-Science

✍️ 2.6.1 Simple Algorithms for Bio Informatics

⌛ 25 minutes

Bio Data-Science

2.6.2 Sequence alignment

🎯 Learning Objectives

You will be able to

  • implement global and local approaches to compare sequence data
Bio Data-Science

Comparing sequence data

  • so far we only looked for perfect matches
  • we cannot expect perfect matches (mutation, reading errors, etc.)
  • we need other ways to compare sequence data
Bio Data-Science

Sequences are similar, if their nucleotide / amino acid patterns match well

  • each row is one sample sequence
  • amino acids are coded in different colors
https://bioinf.comav.upv.es/courses/biotech3/static/multiple/alig_mul_prot.png
Bio Data-Science

conservative means that the resulting protein is still functional
Bio Data-Science

Sequence similarity helps us to find evolutionary relationship

https://www.yourgenome.org/facts/how-do-you-find-out-the-significance-of-a-genome-after-sequencing
Bio Data-Science

Sequence similarity helps us to find stuff in databases

CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA
AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA
https://de.wikipedia.org/wiki/BLAST-Algorithmus
Bio Data-Science

Measuring Similarity between Strings

  • Hamming distance HH
    • How many letters to we have to change?
      TIME \rightarrow MINE
    • H=2H=2
  • Levenshtein [sic!] Distance LL
    • How many edits required to transform one string to another?
      • Substitution of a character
      • Deletion of a character
      • Insertion of a character
    • WOMAN \rightarrow MAN
    • L=2L=2
    • H=2H=2
Bio Data-Science

Global vs Local Alignment

  • How similar are these sequences globally?
    "TAGCTAGCTAGCTTTTAGTTAGCAGCC"
    "AGCTAGCTAGCTTTTAGTTAGCAGCCT"
    • Hamming H=22H=22: not similar at all
    • Levenshtein L=2L = 2: very similar
  • Local alignment
    compare how well a subset aligns at different positions
    "AGCTAGCT"
Bio Data-Science

Could these sequences have the same origin?

seq_1 = "ABCDEF"
seq_2 = "ABDEFG"
  • H=3H=3, probably not
  • L=2L=2, probably
  • kk-mers Approach
Bio Data-Science
kk-mers Approach
  • the sequences are split into kk fragments of length kk, which are compared
  • kk corresponds to the word length in BLAST
seq_1 = "ABC"
seq_2 = "ABD"
seq_1 = "ABC"

# 2-mers of seq_1
mer_1_seq_1 = "AB"
mer_2_seq_1 = "BC"

seq_2 = "ABD"
# 2-mers of seq_2
mer_1_seq_1 = "AB"
mer_2_seq_1 = "BD"
Bio Data-Science
counter = 0
for k_mer_1 in seq_1:               # Create a possible k-mers from seq_1
    for k_mer_2 in seq_2:           # Create a possible k-mers from seq_2
        if k_mer_1 == k_mer_2:      # Compare them
            counter = counter + 1   # Count the identical k-mers

seq_1 = "ABC"
seq_2 = "ABD"

"AB" == "AB" # 1/4 of k-mers matches
"AB" == "BD"
"BC" == "AB"
"BC" == "BD"

# H = 1
# L = 1
Bio Data-Science

For two identical sequences

seq_1 = "ABC"
seq_2 = "ABC"

"AB" == "AB" # 1/4 matches
"AB" == "BC"
"BC" == "AB"   
"BC" == "BC" # 2/4 matches

# H = 0
# L = 0
Bio Data-Science

For longers sequences

seq_1 = "ABCD"
seq_2 = "ABAB"

"AB" == "AB" # 1/9 matches
"AB" == "BA" 
"AB" == "AB" # 2/9 matches
"BC" == "AB"   
"BC" == "BA" 
"BC" == "AB" 
"CD" == "AB"
"CD" == "BA"
"CD" == "AB"

# H = 2
# L = 2
Bio Data-Science

2.6.3 Analyzing Complexity

  • When several algorithms solve a problem, how do you know which one
    is best?
    • Is it the simplest?
    • The fastest?
    • The smallest?
    • Or something else?
  • One way to judge an algorithm is by its run time. An algorithm’s run time is the amount of time
    import time
    start = time.perf_counter()
    for i in range(1, 6):
        print(i)
    end = time.perf_counter()
    print(end – start)
    
Bio Data-Science

Linear time

  • The same algorithm would take (about) twice as long if we double number of elements
  • e.g., looping through a range
    import time
    start = time.perf_counter()
    for i in range(1, 12):
        print(i)
    end = time.perf_counter()
    print(end-start)
    
Bio Data-Science

Constant Time

  • Finding the value to a key in a dictionary hash a constant time independent of the length of the dictionary. The computer only has to compute the hash-function of the key to find the memory address:
    data =    {"M25925" : "AGCAAAAGCAGGCAAACCATTT..." ,
            "M25926" :  "AGCAAAAGCAGGCAAACCATTTG....",
            "MT058709" :  "CAAACCATTTGAATGGATGTCAA...."}
    
    data[M25926]
    
  • Fining data in a linked list is more expensive, as the computer has to traverse the list to find the right position in the memory
Bio Data-Science

Quadratic Time

  • Nesting two for-loops has quadratic time complexity
  • if the sequences are twice as long the algorithm will take four times as long
    counter = 0
    for k_mer_1 in seq_1:               # Create a possible k-mers from seq_1
        for k_mer_2 in seq_2:           # Create a possible k-mers from seq_1
            if k_mer_1 == k_mer_2:      # Compare them
                counter = counter + 1   # Count the identical k-mers
    
Bio Data-Science

O-Notation

  • gives us the order of magnitude of the time-complexity of a algorithm
  • with larger data set (larger nn) this becomes very important

Bio Data-Science

🏆 Case Study: Develop an Algorithm for finding Similarity

  • We have three Sequences from three different Influenza viruses
  • The sequences encode for the same protein but have have some variations
  • Your task is to find the two samples, that are more closely related to each other
Bio Data-Science
M25925 = {"Sample Name" : "M25925", "Sequence" : "AGCAAAAGCAGGCAAACCATTT..." }

M25926 = {"Sample Name" : "M25926", "Sequence" :  "AGCAAAAGCAGGCAAACCATTTG...."}

MT058709 = {"Sample Name" : "MT058709", "Sequence" :  "CAAACCATTTGAATGGATGTCAA...."}
Bio Data-Science

Link

6 Algorithms for Bio Informatics

  • this might be the most challenging task so far, as it requires some creativity
  • pen and paper are useful tool to structure Your ideas
  • this is a good preparation for Your final project

⌛ 60 minutes

Bio Data-Science

2.7 Testing and Final Challenge

Bio Data-Science

2.7.1 Testing

🎯 Learning Objectives

You will be able to

  • use unit tests, to test the functionality of their functions
  • develop and test their own algorithms and code to find practical solutions for sequence analysis
Bio Data-Science

Functions

def get_formatted_name(first, last):
  """Generate a neatly formatted full name."""
  full_name = first + ' ' + last
  return full_name.title()
  • must fulfill certain requirements
  • grow and change over time
Bio Data-Science

Unittests

import unittest

# We define a class the holds different test for this specific function
class NamesTestCase(unittest.TestCase):
  """Tests for 'name_function.py'."""

# We define on of several test function for the function
  def test_first_last_name(self):
    """Do names like 'Janis Joplin' work?"""
    # we all the function
    formatted_name = get_formatted_name('janis', 'joplin')
    # we check whether we get the expected result
    self.assertEqual(formatted_name, 'Janis Joplin')

unittest.main(argv=[''], verbosity=2, exit=False)
  • provide a automatic test cast for a unit of code
  • you can run them after each change in the function to check whether they still work as expected
Bio Data-Science

✍️ 2.7.1 Testing Your Code

7 Reconstruction in Shotgun-Sequencing

⌛ 30 minutes

Bio Data-Science

2.7.2 🏆 Final Project: Reconstruction in Shotgun-Sequencing

So far, we assumed to have full length DNA-Sequences available from a database or any other source. In fact, we are still developing the technology yet to read long strands of DNA and directly storing them in a database. Instead, the DNA strand is broken down in shorter sequences for reading the results. However, we have to bring the short sequences in the right order again.

https://www.yourgenome.org/facts/how-do-you-find-out-the-significance-of-a-genome-after-sequencing
Bio Data-Science

Shotgun Sequencing / Assembly

https://www.genome.gov/sites/default/files/media/images/tg/Shotgun-sequencing.jpg
Bio Data-Science

"Shotgun sequencing is a laboratory technique for determining the DNA sequence of an organism’s genome. The method involves randomly breaking up the genome into small DNA fragments that are sequenced individually. A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to reconstitute the genome." genome.gov

You have a list of fragmented DNA-snippets that have to be ordered in the right order. Your task is to implement an algorithm, that brings the DNA-sequence into thr right order.

Bio Data-Science

Verbal Description of the Algorithm

  • Randomly pick a sequence of five to eight nucleotides form the following original sequence and write them on a piece of paper
    "TAGCTAGCTAGCTTTTAGTTAGCAGCC"
  • write Your name on the other side of the paper
Bio Data-Science

Algorithm assemble_sequence()

  • Inputs: List with segment fragment
  • Output: Assembled sequence
1) draw a random seed sequence
2) while sequences left in the list
    2.1) draw next sequence
    2.2) compare_beginning(overlap)
        if exact match 
            glue_fragments_end()
    2.3) compare_end(overlap)
        if exact match 
            glue_fragments_end()
Bio Data-Science
  • does the algorithm terminate?
    • yes, at one point we cannot find any new matches
    • but, we cannot be sure, that we found the whole original sequence
  • is the algorithm deterministic?
    • it depends on how we select the sequences
    • we can get stuck, if we do it in the wrong order
    • hence, we should draw by random and to it multiple times
Bio Data-Science

Your task

  • You will implement this algorithm in Python
  • The starting point is given in the notebook
  • Your submission is a *.ipynb file showing all the results in the sakai-task
  • the naming convention must be
    • <lastname-member1>.pdf
    • e.g. huber.pdf
  • Deadline: 27.11.2023 (24:00)
Bio Data-Science

What is given to You

  • Jupyter Notebook with the outline
  • the original DNA-sequence and the list of fragments
  • four functions with name, parameters and docstring
  • four three cases, which must be passed to get the points
Bio Data-Science

Grading

This is the basis for the final grade.

  • Passing the unit tests for the following functions:
    • CompareBeginningsTestCase - 20 %
    • CompareEndingsTestCase- 20%
    • GlueFragmentsTestCase - 20 %
  • Implement the following function without unit test:
    • assemble_sequence() - 10 %
  • Coding style and comments - 20 %
  • Length of final sequence and speed of execution of assemble_sequence() - 10 %
Bio Data-Science

Link

7 Reconstruction in Shotgun-Sequencing

  • make yourself familiar with the task

⌛ 30 minutes

Bio Data-Science