Get More Juice out of Your Enterprise Code Base with Code Search

When most people think about a company's reusable assets, source code doesn't usually show up on the list, even though millions of dollars are spent every year on creating and maintaining code. Most large companies are managing hundreds of millions of lines of code—the majority of which was purpose-built to solve a specific application problem. Most of that code is locked up in source control management systems (SCMs) specific to an application or a siloed organization.

Add to this the world of open-source software development, where billions more lines of code exist and where source code is shared publicly and regularly reused, both wholesale and through forking. Here too, a large number of developers spend considerable effort and resources writing, maintaining and extending source code. And, like enterprise code, open-source code is stored in various source code repositories.

Collectively, the code that lives in internal SCMs across large organizations, together with the billions of lines of code that exist in the open-source world, reflects the implementation artifacts of literally millions of developers. These artifacts can be used as powerful resources to assist with the design, development, analysis and problem-solving of future applications.

But, how can we leverage this massive resource?

Code Search Engine

A code search engine is a tool that can help developers unlock the wealth of diverse implementation knowledge buried inside large repositories. It supports search operations that are specific to source code and applies code-specific analysis and heuristics while processing, indexing and retrieving that code. Unlike a general text-based search engine, a code search engine is designed and implemented specifically to cater to developers' information needs related to source code.

Code Search

Source code search (or simply, code search) is a technique to find relevant source code in multiple source code repositories. Code search can help fulfill commonly occurring search needs during development tasks, such as finding the usage of APIs across different projects or finding how a known information structure (such as Base64 encoding) is implemented in code. What a developer finds useful in code search results depends on the search need at hand. An effective code search engine fulfills such needs by delivering relevant results and by providing the means to explore and narrow down those results when the need is vague. Given that alternative choices are available in the results, a code search engine can act as a choice engine by offering code-specific faceting and filtering mechanisms.
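To make faceting and filtering concrete, here is a minimal sketch in Python. It assumes search results are plain dictionaries, and the field names (`language`, `project`) are hypothetical, chosen only for illustration; a real engine would compute facets inside its index rather than over a list in memory.

```python
from collections import Counter


def facet_counts(results, facet):
    """Count how many results fall under each value of one facet field."""
    return Counter(result[facet] for result in results)


def filter_results(results, **selected):
    """Keep only the results whose fields match every selected facet value."""
    return [
        result
        for result in results
        if all(result.get(field) == value for field, value in selected.items())
    ]


# Hypothetical results for a query such as "base64 encode".
results = [
    {"file": "Codec.java", "language": "Java", "project": "payments"},
    {"file": "codec.py", "language": "Python", "project": "tools"},
    {"file": "Base64Util.java", "language": "Java", "project": "tools"},
]

java_counts = facet_counts(results, "language")
java_in_tools = filter_results(results, language="Java", project="tools")
```

The facet counts tell the developer which choices exist before narrowing, and the filter applies the choice once it is made.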

Enterprise Code Search

Enterprise code search is code search as applied inside a company's firewall, searching corporate source code repositories. Enterprise code search must adhere to additional enterprise requirements, such as authorization and access policies on source code visibility. This poses additional requirements and challenges when considering a code search engine for an enterprise, since the search tool has to meet the company's standards and needs to fit with existing IT, enterprise tools and deployment procedures in place.

Use Cases for Code Search

Developers frequently use code search tools for copy-paste programming. Developers who follow best practices frequently seek to reuse existing solutions, and once implemented, a common solution to a problem (such as a well-known algorithm) can be used again and again. Copying and pasting code from an existing solution, when legally permissible, often can be the most efficient approach, freeing developers' time and resources to focus on more challenging tasks. A code search engine can be an ideal tool to find such solutions. There certainly are reservations about practices like copy-paste programming, and some are reasonable (for example, one might not be able to trust someone else's code blindly). Code search engines deployed inside enterprises can help alleviate such concerns by narrowing results to internal projects whose code was written by experts, while still permitting the much-practiced copy-paste programming.

Developers are not always looking for exact lines of code to copy and reuse. More often they seek useful patterns they can add to their repertoire of knowledge to solve recurring tasks. For example, while using APIs, developers need to learn the patterns of API usage. Today's applications frequently leverage API calls to other internal or external components. The typical API has little documentation and few good examples, so it can be frustrating and time-consuming for developers to figure out how to use it successfully. Two straightforward answers to this problem are to let developers see examples of how other developers have used an API or to provide visibility into the code behind the API. Either way, developers need an easy way to search for and view an API's implementation or other code that calls the API. A code search engine allows developers to accomplish this task easily. Code samples and examples are vital learning tools for developers, who often will copy and modify existing examples to fit their purposes. A code search engine lets developers use existing code repositories as sources of examples; in the above case, sources of API usage examples.

A code search engine can be helpful in various other scenarios. When starting a new project with new languages and frameworks, developers would benefit by researching and studying the code bases of mature projects using the same languages and frameworks. Open-source implementations can be a great way for developers to learn solutions to complex computing problems, such as implementing distributed systems, search engines, network servers and so on. Code search engines in the enterprise also can be extremely helpful during normal development activities, such as maintenance, porting and working with legacy code. Code search engines can be used to index and cross-link files spanning multiple types and languages, thus supporting traceability in the search results. Developers can use code search during maintenance to find source files, unit tests and configuration files related to a particular feature.

Challenges in Code Search

The use cases presented above demonstrate the potential benefits of a code search engine, but these benefits cannot be realized unless the engine is both effective and efficient. To be effective, it must return results that are relevant and comprehensive and that meet the users' information needs, and it must offer the features and capabilities needed by a wide range of developers who are under constant pressure to work more efficiently and cost effectively. To be efficient, it must deliver those results within acceptable response times and scale to very large repositories.

Source code, unlike plain or natural language text, tends to have a very sparse vocabulary. This poses a serious challenge in building effective code search engines if one resorts only to techniques that work for natural text. The lack of rich vocabulary has to be compensated for with additional attributes that exist only in source code. One such attribute is rich structural information. Unlike natural text, source code is highly structured, with definitions of various nested elements and relations between those elements. For example, in a typical object-oriented program, one would find classes and methods, where classes extend other classes and methods call other methods. A code search engine needs to parse source code to extract such elements so it can provide search operators that specifically retrieve them. For example, when a developer needs to find a certain method by name, an operator (such as mdef in ohloh.code) can deliver effective search results for such a query.
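As an illustration of parsing code to support a structure-aware operator, the sketch below uses Python's standard `ast` module to index function and method definitions by name. It is not how ohloh.code implements mdef; the function `search_mdef` and the assumption that such an operator looks up definitions by name are hypothetical, and the sketch handles Python source only.

```python
import ast
from collections import defaultdict


class DefinitionCollector(ast.NodeVisitor):
    """Record every function or method definition found in one parsed file."""

    def __init__(self, path, index):
        self.path = path
        self.index = index
        self.class_stack = []  # names of the classes currently being visited

    def visit_ClassDef(self, node):
        self.class_stack.append(node.name)
        self.generic_visit(node)
        self.class_stack.pop()

    def visit_FunctionDef(self, node):
        # The owner is the nearest enclosing class, or None at module level.
        owner = self.class_stack[-1] if self.class_stack else None
        self.index[node.name].append((self.path, node.lineno, owner))
        self.generic_visit(node)

    visit_AsyncFunctionDef = visit_FunctionDef


def build_definition_index(sources):
    """Map each defined name to a list of (path, line, owning class) entries."""
    index = defaultdict(list)
    for path, text in sources.items():
        DefinitionCollector(path, index).visit(ast.parse(text, filename=path))
    return index


def search_mdef(index, name):
    """Hypothetical mdef-style lookup: return definitions of `name` only."""
    return sorted(index.get(name, []), key=lambda entry: entry[:2])


sources = {
    "codec.py": "class Codec:\n    def encode(self, data):\n        return data\n\ndef encode(data):\n    return data\n",
    "util.py": "def decode(data):\n    return encode(data)\n",
}
index = build_definition_index(sources)
hits = search_mdef(index, "encode")
```

Note that `util.py` calls `encode` but does not define it, so it does not appear in the hits. A plain text search for "encode" could not make that distinction.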

This rich interlinked structure relates elements to one another and can be the basis for accumulating similar terms when vocabulary is sparse. Similar to the Web, the link structure in code can be used to build new metrics of popularity and ranking, if used properly. Source code also follows several conventions (such as naming conventions) that are uncommon in natural text, which makes special tokenization and processing suitable for it. (To learn more about these topics, refer to the author's doctoral dissertation: Facilitating Internet-Scale Code Retrieval at http://dl.acm.org/citation.cfm?id=2019966.)
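One example of code-specific tokenization is splitting identifiers along common naming conventions so that a query for "response" can match `parseHTTPResponse`. The following is a minimal sketch, assuming camelCase, PascalCase, snake_case and upper-case constant conventions; the function names are illustrative and not taken from any particular engine.

```python
import re

# Alternatives, in order: an acronym followed by a capitalized word ("HTTP" in
# "HTTPResponse"), a capitalized or lowercase word, a run of capitals, digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier):
    """Split one identifier into lowercase word parts."""
    return [match.group(0).lower() for match in _WORD.finditer(identifier)]


def index_terms(identifier):
    """Terms to index: the whole identifier plus each of its parts."""
    parts = split_identifier(identifier)
    return [identifier.lower()] + [p for p in parts if p != identifier.lower()]


camel = split_identifier("parseHTTPResponse")
snake = split_identifier("base64_encode")
constant = split_identifier("MAX_RETRY_COUNT")
```

Indexing both the whole identifier and its parts keeps exact-name queries precise while letting partial-word queries still find the identifier.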

For proper extraction of elements in source code and relations among such elements, a code search engine first needs to be able to detect the implementation language and perform detailed parsing of the code, which can be nontrivial for complex languages and for repositories where erroneous or incomplete code exists.
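A very simple form of language detection can be sketched as follows, assuming the file extension is checked first and an interpreter line (shebang) is used as a fallback. The two lookup tables are small illustrative samples, not a complete mapping, and production detectors typically also inspect file contents.

```python
import os

# Illustrative samples only; a real engine would cover many more languages.
EXTENSION_LANGUAGES = {".py": "Python", ".java": "Java", ".c": "C", ".rb": "Ruby", ".go": "Go"}
SHEBANG_LANGUAGES = {"python": "Python", "ruby": "Ruby", "perl": "Perl", "bash": "Shell", "sh": "Shell"}


def detect_language(path, first_line=""):
    """Guess a file's language from its extension, then from its shebang line."""
    extension = os.path.splitext(path)[1].lower()
    if extension in EXTENSION_LANGUAGES:
        return EXTENSION_LANGUAGES[extension]
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if parts:
            interpreter = os.path.basename(parts[0])
            # "#!/usr/bin/env python3" names the interpreter as an argument.
            if interpreter == "env" and len(parts) > 1:
                interpreter = parts[1]
            # Drop a trailing version such as the "3" in "python3".
            interpreter = interpreter.rstrip("0123456789.")
            return SHEBANG_LANGUAGES.get(interpreter)
    return None  # unknown: skip detailed parsing, fall back to plain text


java = detect_language("Main.java")
script = detect_language("deploy", "#!/usr/bin/env python3")
```

Returning `None` for unknown files lets the indexer fall back to plain text handling instead of feeding the wrong parser.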

Beyond lexical and structural properties, source code is an executable artifact whose runtime behavior changes as the code evolves. Understanding such behavior is vital to activities like fixing bugs or improving performance. A code search engine can leverage stored representations of runtime behavior, as captured in test coverage reports, call traces, profiling outputs and logs, and relate them to the appropriate elements defined in source code to help answer questions about unexpected behavior in code.

Finally, because it is produced and maintained by developers who work collaboratively, source code also has human-centric attributes. Since most activities on source code are logged in source repositories, a code search engine can tap into that activity information to answer questions about developers and their work. For example, it can help find an expert on a certain feature, or it can notify a developer tasked with managing a specific project when a certain portion of the code changes.
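The expert-finding idea can be sketched with nothing more than commit metadata. The example below assumes commit records have already been exported from the repository as dictionaries with hypothetical `author` and `paths` fields, and it uses commit count as a rough stand-in for expertise; a real engine would likely weigh recency and the size of each change as well.

```python
from collections import Counter


def find_experts(commits, path_prefix, top_n=3):
    """Rank authors by how many commits touched files under `path_prefix`."""
    counts = Counter()
    for commit in commits:
        if any(path.startswith(path_prefix) for path in commit["paths"]):
            counts[commit["author"]] += 1
    return counts.most_common(top_n)


# Hypothetical commit history.
commits = [
    {"author": "asha", "paths": ["search/index.py"]},
    {"author": "ben", "paths": ["search/query.py", "docs/usage.md"]},
    {"author": "asha", "paths": ["search/rank.py"]},
    {"author": "ben", "paths": ["ui/app.js"]},
]

experts = find_experts(commits, "search/")
```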

An effective code search engine allows developers to extract, represent, store, mine and use these source code-specific attributes, no matter how large those attributes grow when applied to enterprise or Internet-scale source code repositories.

What's Different in Enterprise Code Search

There are some important differences between enterprise and open-source code search. Open-source code search is done over code repositories found on the Internet and can be seen as an instance of Internet code search—developers searching for code on the Internet. Results can vary widely when searching one's own enterprise code base compared to searching open-source repositories. Inside an enterprise, it's likely there are more stringent code quality checks, better practices for using APIs and stricter code authorship attribution. These are just a few of the factors that can influence the examples developers can find when searching their enterprise code bases.

From a tool-builder's perspective, additional benefits of enterprise code search include tighter integration with application lifecycle management (ALM) tools. Tool builders also can conduct more accurate analyses during indexing, because enterprise source code repositories can apply quality controls or automation that prevent erroneous and incomplete check-ins. In short, there are even more opportunities for us to explore in leveraging the unique aspects of enterprise code bases.

Measuring the Benefits of Enterprise Code Search

The usage of enterprise code search engines is still in the early adoption phase, so measuring the benefits can be a challenge. Without hard empirical data, these benefits are difficult, but not impossible, to quantify. Following are examples of how enterprises can assess the benefits:

  • As a productivity tool for developers: how much time and effort is spent on questions about code every day? How long does a developer have to wait to get an answer? How much time and effort could a developer save with code search tools, not only for herself, but also for other members of the development team who collaborate with her and each other on a daily basis? With a code search tool, many such delays could be avoided, saving the valuable time of not only one but many developers.

  • As a knowledge-enhancing tool: enhancing one's own knowledge is certainly invaluable, and if a code search engine works as a knowledge-building tool for developers, its value is already justified. To developers, source code is their literature, and a code search engine can act as a tool to navigate and master such literature.

  • More quantitative measures: there can be more quantitative and long-term means of measuring the benefits of a code search engine. Detailed tracking and logging of activities in the code search engine can lead to quantifiable discoveries of code reuse. Looking at activities over time (as permitted by honoring privacy concerns), such as searches, downloads and copy-paste events, enterprises can gain invaluable insights into their code base that can be applied to improving developer efficiency and software performance proactively.
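As a small illustration of the quantitative approach, the sketch below summarizes an activity log. It assumes the engine writes events as dictionaries with hypothetical `type` and `file` fields and no user identifiers, in keeping with the privacy caveat above; treating copy and download events as a sign of reuse is also an assumption.

```python
from collections import Counter


def summarize_events(events):
    """Return (events per type, reuse count per file) for one log."""
    by_type = Counter(event["type"] for event in events)
    # Assumption: a copy or download of a file signals that it was reused.
    reused = Counter(
        event["file"] for event in events if event["type"] in ("copy", "download")
    )
    return by_type, reused


# Hypothetical log entries.
events = [
    {"type": "search", "file": None},
    {"type": "copy", "file": "codec.py"},
    {"type": "download", "file": "codec.py"},
    {"type": "copy", "file": "rank.py"},
    {"type": "search", "file": None},
]

by_type, reused = summarize_events(events)
most_reused = reused.most_common(1)
```

Tracked over weeks or months, counts like these show which parts of the code base are reused most often.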

Overall, as a team or a company, one can devise a strategy to measure the benefits of a code search tool by looking at things one can quantify, such as logs, and by understanding benefits that could be qualitative—by asking the end users, developers, managers and other collaborators to share how these tools benefit them individually and as a group.

Conclusion

Leveraging code artifacts from other developers can open up new opportunities for learning, code reuse and lowering the time and cost of software development and maintenance. The ability to search collections of large code repositories rapidly is fundamental to realizing these benefits. By putting more focus on leveraging code as a valuable learning asset, we can build upon the collective experiences within our industry to work more efficiently as innovators in developing new code.

______________________

Sushil Krishna Bajracharya is passionate about building tools that make software developers more effective and efficient.

Comments

examples of code search engines (programs)

Sushil K Bajracharya's picture

Hi Colin.

Thanks for liking the article.

An example of a real world implementation is Ohloh Code (http://code.ohloh.net). It lets you search across more than 20 billion lines of open source code.

Ohloh Code is powered by an enterprise code search engine called CodeSight. If you want to index and search your own code, you can download a free edition here http://www.blackducksoftware.com/code-sight/develope

I recommend you follow these tips to get good results in Ohloh Code (and also with CodeSight): http://meta.ohloh.net/2013/05/usage-tips-for-searching-code-effectively-...

Hope that helps.

- Sushil

corrected CodeSight link

Sushil K Bajracharya's picture

Correct link to CodeSight: http://www.blackducksoftware.com/code-sight/developer

Sorry, missed 'r' at the end in the original link.

Examples of code search programs

Colin J McDermott's picture

Good article.

Are there any examples of a Code search program or Open source project?

Is this something that is a feature/addon to github or subversion?

You have nailed the theory, just wondering what programs have been implemented in the real world...
