Duplicate code

Last updated

In computer programming, duplicate code is a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. Duplicate code is generally considered undesirable for a number of reasons. [1] A minimum requirement is usually applied to the quantity of code that must appear in a sequence for it to be considered duplicate rather than coincidentally similar. Sequences of duplicate code are sometimes known as code clones or just clones, the automated process of finding duplications in source code is called clone detection.

Contents

Two code sequences may be duplicates of each other without being character-for-character identical, for example by being character-for-character identical only when white space characters and comments are ignored, or by being token-for-token identical, or token-for-token identical with occasional variation. Even code sequences that are only functionally identical may be considered duplicate code.

Emergence

Some of the ways in which duplicate code may be created are:

It may also happen that functionality is required that is very similar to that in another part of a program, and a developer independently writes code that is very similar to what exists elsewhere. Studies suggest that such independently rewritten code is typically not syntactically similar. [2]

Automatically generated code, where having duplicate code may be desired to increase speed or ease of development, is another reason for duplication. Note that the actual generator will not contain duplicates in its source code, only the output it produces.

Fixing

Duplicate code is most commonly fixed by moving the code into its own unit (function or module) and calling that unit from all of the places where it was originally used. Using a more open-source style of development, in which components are in centralized locations, may also help with duplication.

Costs and benefits

Code which includes duplicate functionality is more difficult to support because,

On the other hand, if one copy of the code is being used for different purposes, and it is not properly documented, there is a danger that it will be updated for one purpose, but this update will not be required or appropriate to its other purposes.

These considerations are not relevant for automatically generated code, if there is just one copy of the functionality in the source code.

In the past, when memory space was more limited, duplicate code had the additional disadvantage of taking up more space, but nowadays this is unlikely to be an issue.

When code with a software vulnerability is copied, the vulnerability may continue to exist in the copied code if the developer is not aware of such copies. [3] Refactoring duplicate code can improve many software metrics, such as lines of code, cyclomatic complexity, and coupling. This may lead to shorter compilation times, lower cognitive load, less human error, and fewer forgotten or overlooked pieces of code. However, not all code duplication can be refactored. [4] Clones may be the most effective solution if the programming language provides inadequate or overly complex abstractions, particularly if supported with user interface techniques such as simultaneous editing. Furthermore, the risks of breaking code when refactoring may outweigh any maintenance benefits. [5] A study by Wagner, Abdulkhaleq, and Kaya concluded that while additional work must be done to keep duplicates in sync, if the programmers involved are aware of the duplicate code there weren't significantly more faults caused than in unduplicated code. [6] [ disputed discuss ]

Detecting duplicate code

A number of different algorithms have been proposed to detect duplicate code. For example:

Example of functionally duplicate code

Consider the following code snippet for calculating the average of an array of integers

externintarray_a[];externintarray_b[];intsum_a=0;for(inti=0;i<4;i++)sum_a+=array_a[i];intaverage_a=sum_a/4;intsum_b=0;for(inti=0;i<4;i++)sum_b+=array_b[i];intaverage_b=sum_b/4;

The two loops can be rewritten as the single function:

intcalc_average_of_four(int*array){intsum=0;for(inti=0;i<4;i++)sum+=array[i];returnsum/4;}

or, usually preferably, by parameterising the number of elements in the array.

Using the above function will give source code that has no loop duplication:

externintarray1[];externintarray2[];intaverage1=calc_average_of_four(array1);intaverage2=calc_average_of_four(array2);

Note that in this trivial case, the compiler may choose to inline both calls to the function, such that the resulting machine code is identical for both the duplicated and non-duplicated examples above. If the function is not inlined, then the additional overhead of the function calls will probably take longer to run (on the order of 10 processor instructions for most high-performance languages). Theoretically, this additional time to run could matter.

Example of duplicate code fix via code replaced by the method Duplicate code.gif
Example of duplicate code fix via code replaced by the method

See also

Related Research Articles

In computer programming and software design, code refactoring is the process of restructuring existing source code—changing the factoring—without changing its external behavior. Refactoring is intended to improve the design, structure, and/or implementation of the software, while preserving its functionality. Potential advantages of refactoring may include improved code readability and reduced complexity; these can improve the source code's maintainability and create a simpler, cleaner, or more expressive internal architecture or object model to improve extensibility. Another potential goal for refactoring is improved performance; software engineers face an ongoing challenge to write programs that perform faster or use less memory.

Programming style, also known as coding style, refers to the conventions and patterns used in writing source code, resulting in a consistent and readable codebase. These conventions often encompass aspects such as indentation, naming conventions, capitalization, and comments. Consistent programming style is generally considered beneficial for code readability and maintainability, particularly in collaborative environments.

In computing, inline expansion, or inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function. Inline expansion is similar to macro expansion, but occurs during compilation, without changing the source code, while macro expansion occurs prior to compilation, and results in different text that is then processed by the compiler.

Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence and ability to create abstractions. It may also be the result of technology limitations as subroutines or libraries would normally be used instead. However, there are occasions when copy-and-paste programming is considered acceptable or necessary, such as for boilerplate, loop unrolling, languages with limited metaprogramming facilities, or certain programming idioms, and it is supported by some source code editors in the form of snippets.

<span class="mw-page-title-main">F Sharp (programming language)</span> Microsoft programming language

F# is a general-purpose, high-level, strongly typed, multi-paradigm programming language that encompasses functional, imperative, and object-oriented programming methods. It is most often used as a cross-platform Common Language Infrastructure (CLI) language on .NET, but can also generate JavaScript and graphics processing unit (GPU) code.

<span class="mw-page-title-main">D (programming language)</span> Multi-paradigm system programming language

D, also known as dlang, is a multi-paradigm system programming language created by Walter Bright at Digital Mars and released in 2001. Andrei Alexandrescu joined the design and development effort in 2007. Though it originated as a re-engineering of C++, D is now a very different language. As it has developed, it has drawn inspiration from other high-level programming languages. Notably, it has been influenced by Java, Python, Ruby, C#, and Eiffel.

<span class="mw-page-title-main">C syntax</span> Set of rules defining correctly structured programs

The syntax of the C programming language is the set of rules governing writing of software in C. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

MBASIC is the Microsoft BASIC implementation of BASIC for the CP/M operating system. MBASIC is a descendant of the original Altair BASIC interpreters that were among Microsoft's first products. MBASIC was one of the two versions of BASIC bundled with the Osborne 1 computer. The name "MBASIC" is derived from the disk file name MBASIC.COM of the BASIC interpreter.

In mathematics and in computer programming, a variadic function is a function of indefinite arity, i.e., one which accepts a variable number of arguments. Support for variadic functions differs widely among programming languages.

<span class="mw-page-title-main">AutoIt</span> Scripting language to automate tasks in Microsoft Windows

AutoIt is a freeware programming language for Microsoft Windows. In its earliest release, it was primarily intended to create automation scripts for Microsoft Windows programs but has since grown to include enhancements in both programming language design and overall functionality.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

The C and C++ programming languages are closely related but have many significant differences. C++ began as a fork of an early, pre-standardized C, and was designed to be mostly source-and-link compatible with C compilers of the time. Due to this, development tools for the two languages are often integrated into a single product, with the programmer able to specify C or C++ as their source language.

Tacit programming, also called point-free style, is a programming paradigm in which function definitions do not identify the arguments on which they operate. Instead the definitions merely compose other functions, among which are combinators that manipulate the arguments. Tacit programming is of theoretical interest, because the strict use of composition results in programs that are well adapted for equational reasoning. It is also the natural style of certain programming languages, including APL and its derivatives, and concatenative languages such as Forth. The lack of argument naming gives point-free style a reputation of being unnecessarily obscure, hence the epithet "pointless style".

Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.

The syntax and semantics of PHP, a programming language, form a set of rules that define how a PHP program can be written and interpreted.

<span class="mw-page-title-main">Control table</span> Data structures that control the execution order of computer commands

Control tables are tables that control the control flow or play a major part in program control. There are no rigid rules about the structure or content of a control table—its qualifying attribute is its ability to direct control flow in some way through "execution" by a processor or interpreter. The design of such tables is sometimes referred to as table-driven design. In some cases, control tables can be specific implementations of finite-state-machine-based automata-based programming. If there are several hierarchical levels of control table they may behave in a manner equivalent to UML state machines

This article describes the features in the programming language Haskell.

Nemerle is a general-purpose, high-level, statically typed programming language designed for platforms using the Common Language Infrastructure (.NET/Mono). It offers functional, object-oriented, aspect-oriented, reflective and imperative features. It has a simple C#-like syntax and a powerful metaprogramming system.

<span class="mw-page-title-main">Dafny</span> Programming language

Dafny is an imperative and functional compiled language that compiles to other programming languages, such as C#, Java, JavaScript, Go and Python. It supports formal specification through preconditions, postconditions, loop invariants, loop variants, termination specifications and read/write framing specifications. The language combines ideas from the functional and imperative paradigms; it includes support for object-oriented programming. Features include generic classes, dynamic allocation, inductive datatypes and a variation of separation logic known as implicit dynamic frames for reasoning about side effects. Dafny was created by Rustan Leino at Microsoft Research after his previous work on developing ESC/Modula-3, ESC/Java, and Spec#.

This glossary of computer science is a list of definitions of terms and concepts used in computer science, its sub-disciplines, and related fields, including terms relevant to software, data science, and computer programming.

References

  1. Spinellis, Diomidis. "The Bad Code Spotter's Guide". InformIT.com. Retrieved 2008-06-06.
  2. Code similarities beyond copy & paste by Elmar Juergens, Florian Deissenboeck, Benjamin Hummel.
  3. Li, Hongzhe; Kwon, Hyuckmin; Kwon, Jonghoon; Lee, Heejo (25 April 2016). "CLORIFI: software vulnerability discovery using code clone verification". Concurrency and Computation: Practice and Experience. 28 (6): 1900–1917. doi:10.1002/cpe.3532. S2CID   17363758.
  4. Arcelli Fontana, Francesca; Zanoni, Marco; Ranchetti, Andrea; Ranchetti, Davide (2013). "Software Clone Detection and Refactoring" (PDF). ISRN Software Engineering. 2013: 1–8. doi: 10.1155/2013/129437 .
  5. Kapser, C.; Godfrey, M.W., ""Cloning Considered Harmful" Considered Harmful," 13th Working Conference on Reverse Engineering (WCRE), pp. 19-28, Oct. 2006
  6. Wagner, Stefan; Abdulkhaleq, Asim; Kaya, Kamer; Paar, Alexander (2016). "On the Relationship of Inconsistent Software Clones and Faults: An Empirical Study". 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER). pp. 79–89. arXiv: 1611.08005 . doi:10.1109/SANER.2016.94. ISBN   978-1-5090-1855-0. S2CID   3154845.
  7. Brenda S. Baker. A Program for Identifying Duplicated Code. Computing Science and Statistics, 24:49–57, 1992.
  8. Ira D. Baxter, et al. Clone Detection Using Abstract Syntax Trees
  9. Visual Detection of Duplicated Code Archived 2006-06-29 at the Wayback Machine by Matthias Rieger, Stephane Ducasse.
  10. Yuan, Y. and Guo, Y. CMCD: Count Matrix Based Code Clone Detection, in 2011 18th Asia-Pacific Software Engineering Conference. IEEE, Dec. 2011, pp. 250–257.
  11. Chen, X., Wang, A. Y., & Tempero, E. D. (2014). A Replication and Reproduction of Code Clone Detection Studies. In ACSC (pp. 105-114).
  12. Bulychev, Peter, and Marius Minea. "Duplicate code detection using anti-unification." Proceedings of the Spring/Summer Young Researchers’ Colloquium on Software Engineering. No. 2. Федеральное государственное бюджетное учреждение науки Институт системного программирования Российской академии наук, 2008.