Visual Basic for Applications/Pseudo Random Repeated Substrings

From Wikibooks, open books for an open world
Jump to navigation Jump to search

Summary[edit]

This page describes some matters that apply to the Rnd() function of VBA. In particular it illustrates that repeated substrings can result when the Randomize() function is wrongly positioned inside the same loop as Rnd(), instead of before it.

The VBA Rnd() Function[edit]

  • The Rnd() function is pseudo random, as opposed to true random . True randomness is found rarely, one notable example being the sequence of data that can be derived from white noise. White noise, like the radio noise from the sun or perhaps the unintentional noise in a radio or other electronic device, has a fairly uniform distribution of frequencies, and can be exploited to produce random distributions of data; also known as linear probability distributions because their frequency distributions are straight lines parallel to the horizontal axis.
  • Pseudo randomness can be obtained with a feedback algorithm, where a sequence of output values of a function is fed back and assists in the making of the next part of an output stream. These are referred to as pseudo random number generators (PRNG). Such a process, although complex, is nonetheless determinate, being based entirely on its starting value. Such generators, depending on their design can produce long sequences of values, all unique, before the entire stream eventually repeats itself.
  • A PRNG output stream will always repeat itself if a long enough set of values is generated. The Rnd function in VBA can generate a sequence of up to 16,777,216 numbers before any one number is repeated, at which time the entire sequence itself is repeated. This is adequate in most cases. The Rnd() function has been described by Microsoft as belonging to the set of PRNGs called Linear Congruential Generators (LCG), though it is unclear as to whether or not the algorithm has since been modified.
  • The Rnd function is not suitable for large tables or for cryptographic use, and VBA itself is quite insecure in its own right. For given starting values the generator always will produce the same sequence. Clearly, if any part of the stream is known, this allows other values in the sequence to be predicted, and this state of affairs is insecure for cryptographic use. Perhaps surprisingly, modeling methods that make much use of random values need even longer unique sequences than that produced by Rnd().
  • The exact coding of the Microsoft Rnd() function is not available, and their descriptive material for it is quite sketchy. A recent attempt by me to implement the assumed algorithm in VBA code failed because of overflow, so those who intend to study such generators in VBA need to use another algorithm. Perhaps study of the Wichmann-Hill (1982) CLCG algorithm, that can be implemented in VBA would be a better choice. A VBA implementation, (by others), of the Wichmann-Hill (1982) algorithm is provided in another page of this series, along with some simpler generator examples.

Worst Case for Rnd() Substrings?[edit]

  • A well designed PRNG stream consists of unique numbers, but this applies only to the designer's unfiltered set of numbers in the range from zero to unity, [0,1]. As soon as we start to take some values from the stream, and ignore others, say to make custom outputs, the new steams will take on different characteristics. The combination of cherry-picking elements from the natural sequence and the mapping of a large set to a very small set takes its toll. When observing the new set, the count of characters to the cycle repeat point shortens, and the number of repeated substrings increases throughout the set.
  • The code listing below allows checking of a Rnd() stream for substrings, using preset filter settings, eg; capitals, lower case, integers, etc., and in addition, includes a similar generator based on a hash for those who would like to compare it.
  • The repeat substring procedure is quite slow, depending as it does on the location of repeats. The worst case is for no repeats found where the number of cycles becomes maximum at (0.5*n)^2, the square of half the number of characters in the test string. Of course the smallest number of cycles is just one when a simple string is repeated, eg; abcabc. Clearly, an increase of string length by a factor of ten could increase the run time by a factor of one hundred. (1000 characters in about 2 seconds, 2000 in 4secs, 10000 in 200secs, is, so far, the best timing!).
  • Coding layout can affect the length of repeated substrings too. The reader might compare the effect of placing the Randomize function outside the random number loop, then inside the loop, while making an output of only capitals. (See code in MakeLongRndStr). In the past the repeat strings have worsened considerably when placed within. The code as listed here to test Rnd() with 1000 samples of capitals, no use of DoEvents, and Randomize wrongly placed inside the loop, will return a repeat substring of up to four hundred characters for this author. Increasing the code lines in the loop, affecting the time (?) for each iteration of the loop to run, also affects the length of any substrings.
Option Explicit

Sub TestRndForRepeats()
    'run this to make a pseudo random string
    'and to test it for longest repeated substring
    
    Dim strRnd As String, sOut As String
    Dim nOut As Long, nLen As Long
    
    strRnd = MakeLongRndStr(1000)
    MsgBox strRnd,, "Long string..."
    
    sOut = LongestRepeatSubstring(strRnd, nOut)
    
    MsgBox "Repeated substring found is : " & _
       vbCrLf & sOut & vbCrLf & _
       "Number of these found : " & nOut & vbCrLf & _
       "Length of each : " & Len(sOut),, "Repeat substring..."

End Sub

Sub TestHashForRepeats()
    'run this to make a long hash-based output
    'and to test it for longest repeated substring
    
    Dim sOut As String, sHash As String, nOut As Long
    
    sHash = LongHash("String to hash", 1000)
    
    MsgBox "The following sha256-based hash has " & _
           Len(sHash) & " characters." & _
           vbCrLf & vbCrLf & sHash,, "Long hash..."

    sOut = LongestRepeatSubstring(sHash, nOut)
    
    MsgBox "Repeated substring found is : " & _
       vbCrLf & sOut & vbCrLf & _
       "Number of these found : " & nOut & vbCrLf & _
       "Length of each : " & Len(sOut),, "Repeat substring..."

End Sub

Function MakeLongRndStr(nNumChr As Long) As String
    'Makes a long capital letters string using rnd VBA function
    
    Dim n As Long, sChr As String, nAsc As Long
    Dim nSamp As Long, sRec As String
    
    '========================================================================
    ' Notes and Conclusions:
    ' The VBA function rnd is UNSUITED to generation of long random strings.
    ' Both length and number of repeats increases rapidly near 256 charas.
    ' Reasonable results can be obtained by keeping below 128 characters.
    ' For longer strings, consider hash-based methods of generation.
    '========================================================================
    'Randomize 'right place
    Do Until n >= nNumChr
        'DoEvents
        Randomize 'wrong place
        nSamp = Int((122 - 48 + 1) * Rnd + 48) 'range includes all charas
        sChr = Chr(nSamp)
        
        'cherry-picks 10, 26, 36, 52, or 62 from a set of 75
        Select Case nSamp 'chara filter
                Case 65 To 90  'upper case letters
                    sRec = sRec & sChr
                Case 48 To 57  'integers
                    'sRec = sRec & sChr
                Case 97 To 122 'lower case letters
                    'sRec = sRec & sChr
                Case Else
                    'disregard
        End Select
        n = Len(sRec)
    Loop
    
    'MsgBox sAccum
    
    MakeLongRndStr = Left$(sRec, nNumChr)

End Function

Function LongHash(sIn As String, nReq As Long, Optional sSeed As String = "") As String
    'makes a long sha256 hash - length specified by user
    'Parameters: sIn;   the string to hash
                'nReq;  the length of output needed
                'sSeed; optional added string modifier
    
    Dim n As Long, m As Long, c As Long, nAsc As Integer, sChr As String
    Dim sF As String, sHash As String, sRec As String, sAccum As String
    
    Do Until m >= nReq
        DoEvents
        n = n + 1 'increment
        'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        'you set your own cycle increment here
        sF = sIn & sSeed & sAccum & (7 * n * m / 3)
        'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        'get a single hash of sF
        sHash = HashSHA256(sF)
        'filter output for chara type
        For c = 1 To Len(sHash)
            sChr = Mid$(sHash, c, 1)
            nAsc = Asc(sChr)
            'cherry-picks 10, 26, 36 ,52, or 62 from a set of 64
            Select Case nAsc 'chara filter
                Case 65 To 90  'upper case letters
                    sRec = sRec & sChr
                Case 48 To 57  'integers
                    'sRec = sRec & sChr
                Case 97 To 122 'lower case letters
                    'sRec = sRec & sChr
                Case Else
                    'disregard
            End Select
        Next c
        'accumulate
        sAccum = sAccum & sRec
        m = Len(sAccum)
        sRec = "" 'delete line at your peril!
    Loop
    
    LongHash = Left$(sAccum, nReq)

End Function

Function HashSHA256(sIn As String) As String
    'Set a reference to mscorlib 4.0 64-bit
    'HASHES sIn string using SHA2-256 algorithm
    
    'NOTE
    'total 88 output text charas of base 64
    'Standard empty string input gives : 47DEQpj8HBSa+/...etc,
    
    Dim oT As Object, oSHA256 As Object
    Dim TextToHash() As Byte, bytes() As Byte
    
    Set oT = CreateObject("System.Text.UTF8Encoding")
    Set oSHA256 = CreateObject("System.Security.Cryptography.SHA256Managed")
    
    TextToHash = oT.GetBytes_4(sIn)
    bytes = oSHA256.ComputeHash_2((TextToHash))
    
    HashSHA256 = ConvB64(bytes)
    
    Set oT = Nothing
    Set oSHA256 = Nothing
   
End Function

Function ConvB64(vIn As Variant) As Variant
    'used to produce a base-64 output
    'Set a reference to mscorlib 4.0 64-bit
    
    Dim oD As Object
    
    Set oD = CreateObject("MSXML2.DOMDocument")
    With oD
        .LoadXML "<root />"
        .DocumentElement.DataType = "bin.base64"
        .DocumentElement.nodeTypedValue = vIn
    End With
    ConvB64 = Replace(oD.DocumentElement.Text, vbLf, "")
    
    Set oD = Nothing
    
End Function

Function LongestRepeatSubstring(sIn As String, Optional nSS As Long) As String
    'finds longest repeated non-overlapping substring and number of repeats
    'greatest number cycles = (0.5*n)^2 for when "none found", eg; abcdef (9)
    'shortest number cycles = 1 for one simple duplicated string; eg abcabc
    
    Dim s1 As String, s2 As String, X As Long
    Dim sPrev As String, nPrev As Long, nLPrev As Long
    Dim nL As Long, nTrial As Long, nPos As Long, vAr As Variant
        
    nL = Len(sIn)
    For nTrial = Int(nL / 2) To 1 Step -1
        DoEvents
        For nPos = 1 To (nL - (2 * nTrial) + 1)
            X = 0
            s1 = Mid(sIn, nPos, nTrial)
            s2 = Right(sIn, (nL - nPos - nTrial + 1))
            vAr = Split(s2, s1)
            X = UBound(vAr) - LBound(vAr)
            If X > 0 Then
                If nPrev < X Then
                    sPrev = s1
                    nPrev = X
                End If
            End If
        Next nPos
        If nPrev <> 0 Then
            LongestRepeatSubstring = sPrev
            nSS = nPrev
            Exit Function
        End If
    Next nTrial
End Function