count source lines of code (SLOC) using pygments
Pygount is a command line tool to scan folders for source code files and count the number of source code lines in them. It is similar to tools like sloccount and cloc but uses the pygments package to analyze the source code and consequently can analyze any programming language supported by pygments.
The name is a combination of pygments and count.
Pygount is available from https://pypi.python.org/pypi/pygount and can be installed by running:
$ pip install pygount
Simply run pygount and specify the folder to analyze recursively, for example:
$ pygount ~/development/sometool
If you omit the folder, the current folder of your shell is used as starting point. Apart from folders you can also specify single files and shell patterns (using ?, * and ranges like [a-z]).
Certain files and folders are automatically excluded from the analysis:
To specify alternative patterns, use --folders-to-skip and --names-to-skip. Both take a comma separated list of patterns; see below for the pattern syntax. For example, to also prevent folders starting with two underscores (_) from being analyzed, specify --folders-to-skip=[...],__*.
To limit the analysis to certain file types, you can specify a comma separated list of suffixes to take into account, for example --suffix=py,sql,xml.
By default the results of the analysis are written to the standard output in a format similar to sloccount. To redirect the output to a file, use e.g. --out=counts.txt. To change the format to an XML file similar to cloc, use --format=cloc-xml.
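The combined effect of the folder, name and suffix filters can be pictured with a small sketch. This is not pygount's implementation; the function name `source_paths` and its parameters are hypothetical and only mirror the options described above, using plain shell patterns from the standard library.

```python
import fnmatch
import os


def source_paths(folder, suffixes, folders_to_skip=(), names_to_skip=()):
    """Yield paths of files below ``folder`` that pass all filters (hypothetical helper)."""
    for root, dirs, files in os.walk(folder):
        # Prune skipped folders in place so os.walk does not descend into them.
        dirs[:] = [
            name for name in dirs
            if not any(fnmatch.fnmatch(name, pattern) for pattern in folders_to_skip)
        ]
        for name in sorted(files):
            if any(fnmatch.fnmatch(name, pattern) for pattern in names_to_skip):
                continue
            # Compare the suffix without its leading dot, as in --suffix=py,sql,xml.
            suffix = os.path.splitext(name)[1].lstrip(".")
            if suffix in suffixes:
                yield os.path.join(root, name)
```

For instance, `source_paths(".", {"py", "sql"}, folders_to_skip=["__*"])` would walk the current folder, ignore any folder starting with two underscores and yield only files ending in `.py` or `.sql`.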
Some command line arguments take patterns as values.
By default, patterns are shell patterns using *, ? and ranges like [a-z] as placeholders. Depending on your platform, they are case sensitive (Unix) or not (Mac OS, Windows).
If a pattern starts with [regex] you can specify a comma separated list of regular expressions instead, using all the constructs supported by the Python regular expression syntax. Regular expressions are case sensitive unless they include a (?i) flag.
If the first actual pattern is [...], the default patterns are included. Without it, the defaults are ignored and only the patterns explicitly stated are taken into account.
So, for example, to specify that generated code can also contain the German word “Generiert” in a case insensitive way, use --generated=[regex][...](?i).*generiert.
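The rules above can be illustrated with a minimal sketch of how such a pattern value might be interpreted. It is an assumption about the behavior described here, not pygount's actual code: the names `compile_patterns` and `matches_any` are hypothetical, regular expressions containing commas are not handled, and the platform dependent case sensitivity of shell patterns is left out.

```python
import fnmatch
import re

REGEX_PREFIX = "[regex]"
DEFAULTS_MARKER = "[...]"


def compile_patterns(text, default_patterns):
    """Turn a pattern option value into a list of compiled regular expressions."""
    is_regex = text.startswith(REGEX_PREFIX)
    if is_regex:
        text = text[len(REGEX_PREFIX):]
    include_defaults = text.startswith(DEFAULTS_MARKER)
    if include_defaults:
        # Drop the marker and the comma that may separate it from the next pattern.
        text = text[len(DEFAULTS_MARKER):].lstrip(",")
    items = [item.strip() for item in text.split(",") if item.strip()]
    if is_regex:
        compiled = [re.compile(item) for item in items]
    else:
        # fnmatch.translate converts a shell pattern into an equivalent regex.
        compiled = [re.compile(fnmatch.translate(item)) for item in items]
    return (list(default_patterns) if include_defaults else []) + compiled


def matches_any(name, patterns):
    """Return True if any of the compiled patterns matches the start of ``name``."""
    return any(pattern.match(name) for pattern in patterns)


# Hypothetical default; the real defaults are whatever pygount ships with.
defaults = [re.compile(fnmatch.translate(".git"))]
folders_to_skip = compile_patterns("[...],__*", defaults)
generated = compile_patterns("[regex](?i).*generiert", [])
```

With these values, `matches_any("__pycache__", folders_to_skip)` and `matches_any(".git", folders_to_skip)` are both true, while a value of just `__*` without the leading `[...]` would no longer match `.git`.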
When reading source code, pygount automatically detects the encoding. It uses a simple algorithm that recognizes a BOM, XML declarations such as:
and “magic” comments such as:
# -*- coding: cp1252 -*-
If the file does not have an appropriate heading, pygount attempts to read it using UTF-8. If this fails, it reads the file using a fallback encoding (by default CP1252) and ignores any encoding errors.
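The detection order described above can be sketched roughly as follows. This is a simplified illustration rather than pygount's implementation: the function names, the regular expressions and the choice to inspect only the first two lines are assumptions.

```python
import codecs
import re

# BOMs to check; UTF-32 comes before UTF-16 because their little endian BOMs share a prefix.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
_XML_DECLARATION = re.compile(rb"""<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']""")
_MAGIC_COMMENT = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)", re.MULTILINE)


def detect_encoding(data, fallback="cp1252"):
    """Guess the encoding of the bytes in ``data`` (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    # Assumption: an encoding heading only counts within the first two lines.
    head = b"\n".join(data.splitlines()[:2])
    for regex in (_XML_DECLARATION, _MAGIC_COMMENT):
        match = regex.search(head)
        if match:
            try:
                return codecs.lookup(match.group(1).decode("ascii")).name
            except LookupError:
                pass  # Unknown encoding name; keep looking.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


def read_text(data, fallback="cp1252"):
    """Decode ``data`` with the detected encoding, ignoring encoding errors."""
    return data.decode(detect_encoding(data, fallback), errors="ignore")
```

A file without any heading that contains the byte `0xFC` is not valid UTF-8, so this sketch would fall back to CP1252 and read that byte as “ü”.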
You can change this behavior using the --encoding option:
If a source file is not counted, the number of lines is 0 and the language shown is a pseudo language indicating the reason:
To get a description of all the available command line options, run:
$ pygount --help
To get the version number, run:
$ pygount --version
It’s recommended to run pygount as one of the first steps in your build process, before any undesired files such as compiler targets or generated source code are built.
An example “Execute shell” build step for Jenkins is:
pygount --format=cloc-xml --out cloc.xml --suffix=py --verbose
Then add a post-build action “Publish SLOCCount analysis results” and set “SLOCCount report” to “cloc.xml”.
Pygount basically counts physical lines of source code.
First, it lexes the code using the lexer pygments assigns to it. If pygments cannot find an appropriate lexer, pygount has a few additional internal lexers that can at least distinguish between code and comments:
Furthermore plain text has a separate lexer that counts all lines as comments.
Lines that only contain comment tokens and white space count as comments. Lines that only contain white space are not taken into account. Everything else counts as code.
If a line contains only “white characters”, it is not taken into account, presumably because the code is only formatted that way to make it easier to read. Currently white characters are:
Because of that, pygount reports about 10 to 20 percent fewer SLOC for C-like languages than other similar tools.
For some languages “no operations” are detected and treated as white space, for example Python’s pass or Transact-SQL’s begin and end.
As an example, consider this Python code:
class SomeError(Exception):
    """
    Some error caused by some issue.
    """
    pass
This counts as 1 line of code and 3 lines of comments. The line with pass is considered a “no operation” and thus not taken into account.
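The counting rules can be reproduced with a short sketch. Note that it swaps in the tokenizer from Python's standard library instead of a pygments lexer, so it only works for Python source. The function name `count_lines`, the rule that a string starting a statement is documentation, and the set of no operations are simplifying assumptions, not pygount's implementation.

```python
import io
import tokenize

NO_OPERATIONS = {"pass"}  # assumption: names treated like white space
STATEMENT_BREAKS = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}
IGNORED = STATEMENT_BREAKS | {tokenize.NL, tokenize.ENDMARKER}


def count_lines(source):
    """Classify each physical line as code, documentation or empty."""
    marks = [set() for _ in source.splitlines()]
    at_statement_start = True
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type in IGNORED:
            if token.type in STATEMENT_BREAKS:
                at_statement_start = True
            continue
        first_row, last_row = token.start[0], token.end[0]
        if token.type == tokenize.COMMENT:
            marks[first_row - 1].add("documentation")
            continue  # a comment does not begin a statement
        if token.type == tokenize.STRING and at_statement_start:
            # Simplification: a string that starts a statement is a docstring.
            kind = "documentation"
        elif token.type == tokenize.NAME and token.string in NO_OPERATIONS:
            kind = None  # treated like white space
        else:
            kind = "code"
        at_statement_start = False
        if kind is not None:
            for row in range(first_row, last_row + 1):
                marks[row - 1].add(kind)
    counts = {"code": 0, "documentation": 0, "empty": 0}
    for line_marks in marks:
        if "code" in line_marks:
            counts["code"] += 1
        elif "documentation" in line_marks:
            counts["documentation"] += 1
        else:
            counts["empty"] += 1
    return counts


EXAMPLE = (
    "class SomeError(Exception):\n"
    '    """\n'
    "    Some error caused by some issue.\n"
    '    """\n'
    "    pass\n"
)
print(count_lines(EXAMPLE))  # {'code': 1, 'documentation': 3, 'empty': 1}
```

A line that mixes code and a trailing comment counts as code here, matching the rule that only lines consisting solely of comment tokens and white space count as comments.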
Pygount can analyze more languages than other common tools such as sloccount or cloc because it builds on pygments, which provides lexers for hundreds of languages. This also makes it easy to support another language: simply write your own lexer.
For certain corner cases pygount gives more accurate results because it actually lexes the code, unlike other tools that mostly look for comment markers and can get confused when they show up inside strings. In practice though this should not make much of a difference.
Pygount is slower than most other tools. Partially this is due to actually lexing instead of just scanning the code. Partially it is because other tools can use statically compiled languages such as Java or C, which are generally faster than dynamic languages. For many applications though pygount should be “fast enough”, especially when called during a nightly build.
Pygount provides a simple API to integrate it in other tools. This however is currently still a work in progress and subject to change.
Here’s an example of how to analyze one of pygount’s own source files:
>>> import pygount
>>> analysis = pygount.source_analysis('pygount/analysis.py', 'pygount')
>>> analysis
SourceAnalysis(path='pygount/analysis.py', language='Python', group='pygount', code=302, documentation=66, empty=62, string=23, state='analyzed', state_info=None)
Version 0.9, 2017-05-04
Version 0.8, 2016-10-07
Version 0.7, 2016-09-28
Version 0.6, 2016-09-26
Version 0.5, 2016-09-22
Version 0.4, 2016-09-11
Version 0.3, 2016-08-20
Version 0.2, 2016-07-10
Version 0.1, 2016-07-05